06.07.2016 Views

National Energy Research Scientific Computing Center



Innovations

Operational Excellence

Thruk: A New Fault-Monitoring Web Interface

Cori Phase I installation in the new Wang Hall computer room. Image: Kelly Owen, Lawrence Berkeley National Laboratory

The move to Wang Hall prompted NERSC’s Operational Technology Group (OTG), the team that ensures the site reliability of NERSC systems, storage, the facility environment and ESnet, to implement several fault-monitoring improvements. Foremost among them is Thruk, a web interface that enhances OTG’s Nagios fault-monitoring software by using the Livestatus API to combine all monitored system and storage instances into a single view.
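To make the idea of a combined view concrete, here is a minimal sketch of how several Livestatus backends could be queried and merged. It is not Thruk's implementation (Thruk does this work itself) and not OTG's configuration. The backend names, the socket paths in the usage note, and the helper names `build_query`, `query_backend` and `merge_views` are all hypothetical. The sketch assumes each Nagios instance exposes a Livestatus Unix socket and that the `services` table is queried with the standard `host_name`, `description` and `state` columns.

```python
import json
import socket
from typing import Dict, Iterable, List


def build_query(table: str, columns: Iterable[str], filters: Iterable[str] = ()) -> str:
    """Build a Livestatus Query Language (LQL) request as plain text."""
    lines = [f"GET {table}", "Columns: " + " ".join(columns)]
    lines += [f"Filter: {f}" for f in filters]
    lines.append("OutputFormat: json")
    # A blank line terminates the request.
    return "\n".join(lines) + "\n\n"


def query_backend(socket_path: str, query: str) -> List[list]:
    """Send one query to a Livestatus Unix socket and decode the JSON reply."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(query.encode("utf-8"))
        sock.shutdown(socket.SHUT_WR)  # tell the server the request is complete
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return json.loads(b"".join(chunks).decode("utf-8"))


def merge_views(results: Dict[str, List[list]]) -> List[list]:
    """Prefix each row with its backend name and sort by state, highest code first.

    Assumes the state column is the last column of every row.
    """
    merged = [[backend, *row] for backend, rows in results.items() for row in rows]
    return sorted(merged, key=lambda row: row[-1], reverse=True)
```

With hypothetical backends such as `{"systems": "/var/run/systems.sock", "storage": "/var/run/storage.sock"}`, one would call `query_backend` on each path with `build_query("services", ["host_name", "description", "state"], ["state != 0"])` and pass the collected rows to `merge_views` to get a single list of non-OK services across all instances.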

Implementing Thruk gave OTG an efficient way to correlate alerts and determine whether an issue affects systems throughout Wang Hall or only certain parts of the facility. The most important advantage of the new tool is that it lets OTG monitor multiple aspects of its systems in a single view rather than across multiple tabs or URL interfaces, which has reduced the chance of missing critical alerts. Other advantages include:

• A single location to view alerts for multiple systems
• A flexible configuration interface that allows OTG to easily add and remove systems
• The ability to send multiple commands at once without waiting for a response
• Multiple filtering schemes
