National Energy Research Scientific Computing Center
Innovations
Operational Excellence
Thruk: A New Fault-Monitoring Web Interface
Cori Phase I installation in the new Wang Hall computer room. Image: Kelly Owen, Lawrence Berkeley National Laboratory
The move to Wang Hall prompted NERSC’s Operational Technology Group (OTG)—the team that<br />
ensures site reliability of NERSC systems, storage, facility environment and ESnet—to implement<br />
several new fault-monitoring efficiencies. Foremost among them is Thruk, an enhancement to<br />
OTG’s Nagios fault-monitoring software that uses the Livestatus API to combine all instances of
systems and storage into a single view.<br />
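As a rough illustration of how a Livestatus-based front end can fold several monitoring instances into one view, the sketch below builds a Livestatus Query Language (LQL) request, sends it to a backend over TCP, and merges the rows from several backends into one sorted list. It is a minimal sketch under stated assumptions, not OTG's or Thruk's actual implementation: the backend names, hosts, and port are hypothetical, and it assumes each Nagios instance exposes Livestatus on a TCP socket.

```python
import json
import socket


def build_lql_query(table, columns, filters=()):
    """Build a Livestatus Query Language (LQL) request as text."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {f}" for f in filters]
    lines.append("OutputFormat: json")
    # A blank line terminates the request.
    return "\n".join(lines) + "\n\n"


def query_backend(host, port, query, timeout=5.0):
    """Send one query to a Livestatus TCP endpoint and decode the JSON rows.

    Assumes the endpoint is reachable over TCP (hypothetical setup).
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # signal the end of the request
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def merge_rows(rows_by_backend, columns):
    """Tag each row with its backend name and sort highest state code first."""
    merged = []
    for backend, rows in rows_by_backend.items():
        for row in rows:
            record = dict(zip(columns, row))
            record["backend"] = backend
            merged.append(record)
    return sorted(
        merged, key=lambda r: (-r["state"], r["backend"], r["host_name"])
    )
```

In use, one would call `query_backend` once per instance with the same query (for example, services whose `state` is not 0) and pass the collected results to `merge_rows`, so that alerts from systems and storage appear in a single list rather than in separate interfaces.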
Implementing Thruk provided OTG with an efficient method for correlating various alerts and<br />
determining if an issue affects systems throughout Wang Hall or just in certain parts of the facility.<br />
The most important advantage of this new tool is that it allows the OTG to monitor multiple aspects<br />
of systems in a single view instead of multiple tabs or multiple URL interfaces. This has decreased the
possibility of missing critical alerts. Other advantages include:<br />
• A single location to view alerts for multiple systems
• A configuration interface that is flexible, allowing the OTG to easily add and remove systems<br />
• The ability to send multiple commands at once without waiting for a response<br />
• Multiple filtering schemes