06.07.2016 Views

National Energy Research Scientific Computing Center

BcOJ301XnTK

BcOJ301XnTK

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

26 NERSC ANNUAL REPORT 2015<br />

Screen shot of the new<br />

Thruk interface.<br />

Thruk has also been instrumental in supporting the OTG’s efforts to improve workflow automation.<br />

In 2015 OTG staff implemented an automated workflow for timely reporting of user issues that uses<br />

program event handlers to manage multiple instances of issues and multiple action items. For<br />

example, alerts that called for a staff person to open a new ticket and assign it to a group can now<br />

be automated based on a filtering mechanism. The new interface saves the group’s workflow<br />

approximately 17 hours per week, in addition to tickets now being opened accurately on a more<br />

consistent basis. In the past, the language and the information included on the ticket depended on<br />

which staff member opened the ticket.<br />

Another advantage of the automation process involves opening tickets with vendors as a result of an<br />

alert. In the past, a staff person would either e-mail or call the vendor call center to report a problem.<br />

The phone call could take anywhere from 15 minutes to 45 minutes, depending on the vendor, and<br />

the time it took for the vendor tech to respond could be anywhere from one to four hours. In<br />

contrast, the automated process sends an e-mail to the vendor call center, which usually puts a<br />

priority on clearing out their inbox of requests. Thus a ticket sent to the vendor can be opened within<br />

15 minutes, providing detailed information such as diagnostic logs as an attachment to the e-mail.<br />

This has resulted in improving vendor tech response times to under one hour. In addition, with the<br />

appropriate information provided with each alert, vendors spend less time on diagnosis and more<br />

time on problem resolution.<br />

The event handlers can also create and assign work tickets to staff. For example, an alert that says a<br />

drive has failed and needs replacement can activate an event handler that opens a ticket and assigns it<br />

to a staff person who can replace that drive, work with vendors for parts under warranty and allow the<br />

staff to bring the node back into production sooner. One of the areas that allowed this process to be<br />

automated involves centralizing the location of the warrantied parts with their purchase date, serial<br />

number, vendor information and warranty expiration date. With the verification process handled<br />

automatically, it now takes only seconds to decide whether to exchange or acquire a part. This has<br />

resulted in compute nodes coming back to production in a shorter amount of time—in some instances<br />

as little as four to eight hours, where previously it could sometimes take twice that long.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!