FY2010 - Oak Ridge National Laboratory
Director’s R&D Fund—<br />
Ultrascale Computing and Data Science<br />
capabilities, and demonstrating the utility of this approach in a small number of scientific simulation<br />
codes, we will provide a foundation for developing fault-tolerant simulation codes that will benefit from<br />
the enormous potential these emerging platforms provide.<br />
Results and Accomplishments<br />
We proposed extensions and modifications to the MPI standard: the ability for MPI<br />
communicators to shrink when processor recovery takes place (users define the recovery policies), the<br />
specification of the behavior of several MPI functions, and modifications to support recovery from<br />
process failure (including functions to set the communicator recovery policy and to help restarted processes<br />
rejoin existing communicators). Backward compatibility with the existing MPI standard is maintained.<br />
In addition, users have some control over the performance penalties paid for the fault-tolerance support,<br />
which primarily affects the use of collective communications, and the communicator recovery mode<br />
(local or global).<br />
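
The shrink and rejoin behaviors described above can be illustrated with a small simulation. This is a minimal sketch only, not the proposed MPI interface: `ToyCommunicator`, `RecoveryPolicy`, and the `SHRINK`/`RESTORE` policy names are hypothetical, and no real MPI calls are involved.

```python
from dataclasses import dataclass, field
from enum import Enum


class RecoveryPolicy(Enum):
    """Hypothetical user-selected policies; the names are illustrative only."""
    SHRINK = "shrink"    # drop failed processes and renumber the survivors
    RESTORE = "restore"  # reserve the rank so a restarted process can rejoin


@dataclass
class ToyCommunicator:
    """A toy stand-in for a communicator; this is not an MPI object."""
    members: list                                # process ids, indexed by rank
    policy: RecoveryPolicy = RecoveryPolicy.SHRINK
    failed: set = field(default_factory=set)

    def mark_failed(self, pid):
        if pid in self.members:
            self.failed.add(pid)

    def recover(self):
        """Apply the recovery policy and return the live rank -> pid map."""
        if self.policy is RecoveryPolicy.SHRINK:
            # Survivors are renumbered densely from rank 0.
            self.members = [p for p in self.members if p not in self.failed]
            self.failed.clear()
        # Under RESTORE the rank stays reserved until rejoin() is called.
        return {rank: pid for rank, pid in enumerate(self.members)
                if pid not in self.failed}

    def rejoin(self, old_pid, new_pid):
        """Let a restarted process take over the rank of the failed one."""
        if self.policy is not RecoveryPolicy.RESTORE:
            raise RuntimeError("rejoin requires the RESTORE policy")
        if old_pid not in self.failed:
            raise ValueError("process %r did not fail" % (old_pid,))
        rank = self.members.index(old_pid)
        self.members[rank] = new_pid
        self.failed.discard(old_pid)
        return rank
```

With the shrink policy, a four-member communicator that loses one process recovers as a three-member communicator with ranks 0 to 2. With the restore policy, the failed rank stays vacant until a restarted process rejoins under the same rank.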
We also focused on failure detection, separating monitoring from failure determination: detectors<br />
perform the monitoring, and a consensus mechanism decides whether a failure has occurred. These mechanisms are organized using a<br />
graph-based topology, each node having the capability of hosting detectors and/or the consensus<br />
mechanism. The consensus mechanism implements a methodology for failure determination,<br />
interacts with detectors to get monitoring information, and prevents recovery actions during normal<br />
termination. Detectors are independent; they can run simultaneously to enable composition; they can<br />
perform local or remote monitoring; and they provide monitoring information using an abstract<br />
communication mechanism.<br />
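
The division of labor between detectors and the consensus mechanism might be sketched as follows. All names here (`Detector`, `ConsensusStub`, the `(detector, process)` report tuple, and the callable used as the "abstract communication mechanism") are assumptions made for illustration; the project's actual interfaces are not described in this report.

```python
from typing import Callable, Iterable, List, Set, Tuple

# A report is (detector_name, suspected_process). The channel is any callable
# that accepts a report. Both are illustrative assumptions.
Report = Tuple[str, str]
Channel = Callable[[Report], None]


class Detector:
    """Hypothetical detector: monitors processes and publishes suspicions."""

    def __init__(self, name: str, probe: Callable[[str], bool]):
        self.name = name
        self.probe = probe  # returns True when the process looks alive

    def poll(self, processes: Iterable[str], channel: Channel) -> None:
        # Detectors only report what they observe; they never decide failure.
        for proc in processes:
            if not self.probe(proc):
                channel((self.name, proc))


class ConsensusStub:
    """Collects reports but ignores processes that terminated normally."""

    def __init__(self) -> None:
        self.finalized: Set[str] = set()
        self.suspicions: List[Report] = []

    def note_normal_termination(self, proc: str) -> None:
        self.finalized.add(proc)

    def receive(self, report: Report) -> None:
        _, proc = report
        if proc not in self.finalized:  # no recovery action for clean exits
            self.suspicions.append(report)
```

Because each detector depends only on the channel, several detectors can poll the same processes at once and feed one consensus object, for example by passing `consensus.receive` as the channel to every `poll` call.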
Our prototype includes a central, threshold-based consensus mechanism driven by suspicion data and using a<br />
mesh topology. We implemented two remote detectors: a TCP-based keep-alive detector and a probe-acknowledge<br />
detector. We used Open MPI’s Modular Component Architecture (MCA) framework,<br />
whereby users can reuse existing capabilities and easily extend the system with new algorithms.<br />
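
One plausible reading of a threshold-based decision over suspicion data is sketched below. The rule used here (declare a process failed once a fixed number of distinct monitors suspect it) and the full-mesh monitoring layout are assumptions, as are the names `ThresholdConsensus` and `mesh_monitors`; the prototype's actual rule and topology details are not given in this report.

```python
from collections import defaultdict


class ThresholdConsensus:
    """Central decision point: a process is declared failed once at least
    `threshold` distinct monitors suspect it (an assumed rule)."""

    def __init__(self, threshold):
        if threshold < 1:
            raise ValueError("threshold must be at least 1")
        self.threshold = threshold
        self._suspected_by = defaultdict(set)  # process -> set of monitors

    def suspect(self, monitor, process):
        # A set keeps repeated reports from one monitor from counting twice.
        self._suspected_by[process].add(monitor)

    def clear(self, monitor, process):
        """Withdraw a suspicion, e.g. after a late acknowledgement."""
        self._suspected_by[process].discard(monitor)

    def failed(self):
        """Return the sorted list of processes currently judged failed."""
        return sorted(p for p, monitors in self._suspected_by.items()
                      if len(monitors) >= self.threshold)


def mesh_monitors(nodes):
    """Assumed full mesh: every node monitors every other node."""
    return {n: [m for m in nodes if m != n] for n in nodes}
```

Requiring agreement from several monitors trades detection latency for fewer false positives: a single slow link makes one monitor suspicious, but it does not by itself trigger recovery.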
As a result of the work on this project, we have been invited by Professor Yutaka Ishikawa of the<br />
University of Tokyo to submit a joint proposal to the Japan Science and Technology Agency for<br />
research in support of a fault-tolerant communication library (probably not full MPI) for the Japanese<br />
10 PF system and its follow-on. Professor Ishikawa leads computer system research for<br />
those platforms.<br />
Information Shared<br />
Graham, R. L. 2010. “Towards Support for Fault Tolerance in the MPI Standard.” SIAM Conference on<br />
Parallel Processing for Scientific Computing, Seattle.<br />