Recovery of Memory and Process in DSM Systems: HA Issue #1
Zhang, Zheng
HPL-2001-76
Keyword(s): multiprocessor; shared memory; high availability
Abstract: In this report, we discuss the recovery of memory and processes on a shared-memory DSM system. We divide the problem into recovery of unaffected memory (RUM) and recovery of affected processes (RAP). We point out that specially designed fault-tolerant, non-volatile memory is neither sufficient nor necessary to solve the problem of RUM. It is not sufficient because the system can go down when one node goes away, which can result from many types of faults; power failure is but one of them. It is not necessary either, because the system is distributed in nature: information redundancy across fault units can therefore be realized without special memory. We discuss several ways of implementing a fault-tolerant memory system using plain memory by modifying the write-back protocols in DSM systems. The proposed techniques include mirroring and RAIM, which stands for Redundant Array of Independent Memory. In addition to attacking the problem of RUM, the fault-tolerant memory system lays the foundation for other HA solutions. To target RAP, we use a novel approach to survey the space of transparent rollback recovery alternatives. Two axes constitute this space. The first axis is the fraction of the fault-tolerant memory system that is part of the reliable storage. This, in many ways, determines the cost of the system as well as the checkpoint bandwidth. The second axis is how and when the checkpoint image is established and committed. The three options, built-on-the-fly, stop-and-forward and copy-on-write, have different system complexity and performance implications.
16 Pages
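The abstract names RAIM but does not say how redundancy is encoded. As a rough illustration only, the sketch below assumes a RAID-like single XOR parity block kept on a separate node; the node contents, block size, and the helper `xor_blocks` are hypothetical and are not taken from the report.

```python
from functools import reduce

def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Assumption: three data nodes each hold one memory block,
# and a fourth node holds the XOR parity of those blocks.
data_blocks = [b"node0mem", b"node1mem", b"node2mem"]
parity = xor_blocks(data_blocks)

# Simulate the loss of node 1 and rebuild its block
# from the surviving data blocks plus the parity block.
lost = 1
survivors = [blk for i, blk in enumerate(data_blocks) if i != lost]
rebuilt = xor_blocks(survivors + [parity])

assert rebuilt == data_blocks[lost]
```

Under this assumption, mirroring is the degenerate case with one data block, where the parity block is simply a copy; parity over several blocks trades extra work on each write-back for lower memory overhead.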