High Availability Issues in DSM Systems: Research Opportunities
Zhang, Zheng
HPL-2001-78
Keyword(s): shared memory multiprocessors; high availability
Abstract: This report documents a first-cut understanding of the high-availability (HA) issues in DSM systems. We discuss the general HA strategy and advocate minimizing fault propagation, system reconfiguration time, and performance degradation as the distinct goals of the three stages that the system goes through between the occurrence of a fault and full recovery. We show that it is possible to estimate the impact of a fault through hierarchical component dependency analysis. We point out that coherence protocols should be extended and transactions made closed in order to detect faults and maintain data integrity. In particular, we propose source-buffering to augment the dirty-data transfer protocol in preparation for possible data loss and corruption. An N+1 stand-by system is suggested as the ultimate HA solution. Further research opportunities are discussed. This report skims through a broad range of issues but does not attempt to treat any of them in depth.
19 Pages