09.05.2014 Views

FY2010 - Oak Ridge National Laboratory

Director’s R&D Fund: Ultrascale Computing and Data Science

capabilities, and demonstrating the utility of this approach in a small number of scientific simulation codes, we will provide a foundation for developing fault-tolerant simulation codes that will benefit from the enormous potential these emerging platforms provide.

Results and Accomplishments

We proposed extensions and modifications to the MPI standard. These include the ability for MPI communicators to shrink when processor recovery takes place (users define the recovery policies), a specification of the behavior of several MPI functions, and modifications to support recovery from process failure, including functions to set the communicator recovery policy and to help restarted processes rejoin existing communicators. Backward compatibility with the existing MPI standard is maintained. In addition, users have some control over the performance penalties paid for fault-tolerance support, which primarily affect collective communications, and over the communicator recovery mode (local or global).
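The idea of a shrinking communicator can be illustrated with a toy model. The sketch below is not MPI code and does not reproduce the proposed interface; `ToyCommunicator`, `shrink`, and `rank_of` are hypothetical names chosen for illustration. It only assumes that a shrunken communicator keeps the surviving processes in their original order, so that ranks remain dense.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple


@dataclass(frozen=True)
class ToyCommunicator:
    """Ordered group of process ids; a process's rank is its index (hypothetical model)."""
    members: Tuple[str, ...]

    @property
    def size(self) -> int:
        return len(self.members)

    def rank_of(self, process_id: str) -> int:
        return self.members.index(process_id)

    def shrink(self, failed: FrozenSet[str]) -> "ToyCommunicator":
        """Return a new communicator without the failed processes.

        Survivors keep their relative order, so ranks stay dense (0..size-1).
        The original communicator is left unchanged.
        """
        survivors = tuple(p for p in self.members if p not in failed)
        return ToyCommunicator(survivors)


comm = ToyCommunicator(("p0", "p1", "p2", "p3"))
shrunk = comm.shrink(frozenset({"p2"}))
print(shrunk.size, shrunk.rank_of("p3"))  # prints "3 2"
```

In this model the process that held rank 3 becomes rank 2 after the failure of `p2`, which is why application code that caches ranks would need to refresh them after a shrink.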

We also focused on failure detection, separating the monitoring from the failure determination itself: detectors perform the monitoring, and a consensus mechanism decides whether a failure has occurred. These mechanisms are organized in a graph-based topology in which each node can host detectors, the consensus mechanism, or both. The consensus mechanism implements a methodology for failure determination, interacts with detectors to obtain monitoring information, and prevents recovery actions during normal termination. Detectors are independent; they can run simultaneously to enable composition; they can perform local or remote monitoring; and they provide monitoring information through an abstract communication mechanism.
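The split between detectors and a consensus mechanism might look like the following minimal sketch. It is an illustration under stated assumptions, not the project's implementation: the `Consensus` class, its method names, and the vote-counting rule are hypothetical, and detectors are modeled simply as callables that return the process ids they currently suspect.

```python
from typing import Callable, Dict, Iterable, Set

# Assumption: a detector is any callable returning the ids it currently suspects.
Detector = Callable[[], Iterable[str]]


class Consensus:
    """Decides which processes have failed from independent detector reports."""

    def __init__(self, detectors: Dict[str, Detector], threshold: int = 1) -> None:
        self.detectors = detectors
        self.threshold = threshold       # detectors that must agree (illustrative rule)
        self.terminated: Set[str] = set()

    def note_normal_termination(self, process_id: str) -> None:
        """Record a clean exit so that it never triggers recovery."""
        self.terminated.add(process_id)

    def failed_processes(self) -> Set[str]:
        votes: Dict[str, int] = {}
        for detect in self.detectors.values():
            # Each detector contributes at most one vote per process.
            for process_id in set(detect()):
                votes[process_id] = votes.get(process_id, 0) + 1
        return {
            p for p, n in votes.items()
            if n >= self.threshold and p not in self.terminated
        }


consensus = Consensus(
    detectors={
        "keepalive": lambda: ["rank3", "rank7"],
        "probe": lambda: ["rank3", "rank5", "rank7"],
    },
    threshold=2,
)
consensus.note_normal_termination("rank7")
print(sorted(consensus.failed_processes()))  # prints "['rank3']"
```

Here `rank5` is suspected by only one detector and `rank7` exited normally, so only `rank3` is reported as failed. Because detectors are plain callables, new ones can be added to the dictionary without changing the decision logic.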

Our prototype includes a central, threshold-based consensus mechanism that operates on suspicion data and uses a mesh topology. We implemented two remote detectors: a TCP-based keep-alive detector and a probe-acknowledge detector. We used Open MPI’s Modular Component Architecture framework, which lets users reuse existing capabilities and easily extend the system with new algorithms.
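A probe-acknowledge detector can be sketched with plain TCP sockets. The example below is a simplified illustration on the loopback interface, not the detector built in the prototype: the `PROBE`/`ACK` message format, the function names, and the timeout values are all assumptions made for the sketch.

```python
import socket
import threading

PROBE = b"PROBE"  # hypothetical wire format for this sketch
ACK = b"ACK"


def serve_acks(listener: socket.socket, stop: threading.Event) -> None:
    """Monitored side: answer each probe with an acknowledgement until stopped."""
    listener.settimeout(0.05)  # wake up regularly to check the stop flag
    with listener:             # closes the listening socket on exit
        while not stop.is_set():
            try:
                conn, _ = listener.accept()
            except socket.timeout:
                continue
            with conn:
                if conn.recv(len(PROBE)) == PROBE:
                    conn.sendall(ACK)


def probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Detector side: True if the peer acknowledges the probe within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(PROBE)
            return sock.recv(len(ACK)) == ACK
    except OSError:
        return False  # refused, reset, or timed out: report a suspicion


stop = threading.Event()
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen()
port = listener.getsockname()[1]
server = threading.Thread(target=serve_acks, args=(listener, stop))
server.start()

alive_before = probe("127.0.0.1", port)  # peer is running
stop.set()
server.join()                            # peer has shut down its listener
alive_after = probe("127.0.0.1", port)
print(alive_before, alive_after)
```

A missing acknowledgement is only a suspicion, since a slow or overloaded peer looks the same as a dead one. That is one reason to feed such reports into a separate consensus step rather than act on them directly.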

As a result of the work on this project, Professor Yutaka Ishikawa of the University of Tokyo invited us to submit a joint proposal to the Japan Science and Technology Agency. The proposal covers research in support of a fault-tolerant communication library (probably not full MPI) for the Japanese 10 PF system and its follow-on. Professor Ishikawa leads computer system research for those platforms.

Information Shared<br />

Graham, R. L. 2010. “Towards Support for Fault Tolerance in the MPI Standard.” SIAM Conference on Parallel Processing for Scientific Computing, Seattle.
