Technical Reports



HP Labs


Recovery of Memory and Process in DSM Systems: HA Issue #1

Zhang, Zheng

HPL-2001-76

Keyword(s): multiprocessor; shared memory; high availability

Abstract: In this report, we discuss the recovery of memory and processes on a shared-memory DSM system. We divide the problem into recovery of unaffected memory (RUM) and recovery of affected processes (RAP). We point out that specially designed fault-tolerant, non-volatile memory is neither sufficient nor necessary to solve the problem of RUM. It is not sufficient because the system can go down when one node goes away, which can result from many types of faults; power failure is but one of them. It is not necessary either, because the system is distributed in nature: information redundancy across fault units can therefore be realized without using special memory. We discuss several ways of implementing a fault-tolerant memory system using plain memory by modifying the write-back protocols in DSM systems. The proposed techniques include mirroring and RAIM, which stands for Redundant Array of Independent Memory. In addition to attacking the problem of RUM, the fault-tolerant memory system lays the foundation for other HA solutions. We use a novel approach to survey the space of transparent rollback recovery alternatives as our means to target RAP. Two axes constitute this space. The first axis is the fraction of the fault-tolerant memory system that is part of the reliable storage. This, in many ways, determines the cost of the system as well as the checkpoint bandwidth. The second axis is how and when the checkpoint image is established and committed. The three options, built-on-the-fly, stop-and-forward and copy-on-write, have different system complexity and performance implications.
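
The abstract does not say how RAIM encodes its redundancy. As an illustration only, the sketch below assumes a single XOR parity block kept on a separate node, by analogy with RAID, and a parity update applied on every write-back. All names (`xor_blocks`, `write_back`, `recover`) are hypothetical and are not taken from the report.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of equally sized memory blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def write_back(nodes, parity, node_id, new_block):
    """Store new_block on one node and return the patched parity.

    Assumed update rule: parity' = parity XOR old_block XOR new_block,
    so only the written node and the parity node are touched.
    """
    old_block = nodes[node_id]
    nodes[node_id] = new_block
    return xor_blocks([parity, old_block, new_block])


def recover(nodes, parity, failed_id):
    """Rebuild the block of a failed node from the survivors and the parity."""
    survivors = [blk for i, blk in enumerate(nodes) if i != failed_id]
    return xor_blocks(survivors + [parity])


# Three nodes, each holding one 8-byte block of plain memory.
nodes = [b"node0mem", b"node1mem", b"node2mem"]
parity = xor_blocks(nodes)

# A write-back to node 1 keeps the parity consistent.
parity = write_back(nodes, parity, 1, b"NEWDATA!")

# Pretend node 2 is lost and rebuild its contents.
rebuilt = recover(nodes, parity, 2)
```

Under these assumptions, the contents of any one failed node can be rebuilt from the remaining nodes without special memory hardware. Mirroring is the degenerate case in which every block has a full copy on a second node instead of a shared parity block.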

16 Pages

© 2009 Hewlett-Packard Development Company, L.P.