06.07.2016 Views

National Energy Research Scientific Computing Center


Innovations

Environmental Data Collection Projects

With the move to Wang Hall, NERSC is also taking the opportunity to collect environmental data from areas of the machine room where we had not previously done so at the Oakland Scientific Facility. We’ve installed sensors in different areas of the computational floor, such as underneath the racks and even on the ceiling. We felt that collecting a wide range of environmental data would help us correlate events that could affect the center with jobs running on the computing resources. For example, we are collecting data from sensors at the substation, through the different levels of PDUs available in the building, down to the node level where possible, including the UPS/generator setup. The data is collected from the subpanels and passed to the central data collection system via a Power over Ethernet (PoE) network, since many of these switches and panels are located far from the traditional network infrastructure. In one case, the substation might not include normal 110-volt power for the sensors and hardware, which makes PoE an easy way to get both networking and power to these remote locations. We plan to eventually install more than 3,000 sensors, with each sensor collecting 10 data points every second.
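To give a sense of scale, the planning figures above can be turned into an aggregate data rate. The following is a minimal sketch: the sensor count and per-sensor rate come from the text (with 3,000 treated as a lower bound), while the per-point storage size is purely an assumption made for illustration.

```python
# Planning figures quoted in the text.
SENSORS = 3_000            # "more than 3,000 sensors", treated here as a lower bound
POINTS_PER_SECOND = 10     # data points collected by each sensor every second

# Assumption, not from the source: storage needed for one data point.
ASSUMED_BYTES_PER_POINT = 100


def points_per_second(sensors=SENSORS, rate=POINTS_PER_SECOND):
    """Aggregate data points arriving per second across all sensors."""
    return sensors * rate


def points_per_day(sensors=SENSORS, rate=POINTS_PER_SECOND):
    """Aggregate data points arriving per day (86,400 seconds)."""
    return points_per_second(sensors, rate) * 24 * 60 * 60


def bytes_per_day(bytes_per_point=ASSUMED_BYTES_PER_POINT):
    """Rough raw volume per day under the assumed per-point size."""
    return points_per_day() * bytes_per_point


if __name__ == "__main__":
    print(points_per_second())      # 30000
    print(points_per_day())         # 2592000000
    print(bytes_per_day() / 1e9)    # 259.2 (GB/day, under the assumed point size)
```

At the stated minimum this is 30,000 data points per second, or roughly 2.6 billion per day, before any indexing overhead.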

Because Wang Hall is so dependent on the environment, we can leverage all the environmental data to determine optimal times to run computation-intensive jobs or small jobs or potentially predict how the environment affects computation on a hot summer day.

Centralizing Sensor Data Collection

In addition to environmental sensors, NERSC's move to Wang Hall also prompted opportunities for the OTG to install other types of sensors in areas where this had not been done previously. These include temperature sensors from all parts of the computer floor and from all levels (floor to ceiling), power sensors from the substations to the PDUs in the rack, water flow and temperature sensors for the building and large systems, and host and application measurements from all running systems. The data collected from these sensors gives us more ways to correlate causes with issues than was previously possible. We believe that this data should be able to answer questions such as the power consumption of a job, the power efficiency of systems for processing jobs, or how we can schedule jobs based on job size, time of day or even the current weather.
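As an illustration of the first question, the energy used by a job can be estimated by integrating power readings over the job's run time. The sketch below assumes power samples are available as `(timestamp, watts)` pairs for the nodes a job ran on; the function name, data layout and sample values are hypothetical and are not taken from NERSC's system.

```python
from datetime import datetime, timedelta


def job_energy_kwh(samples, start, end):
    """Estimate energy in kWh from (timestamp, watts) samples.

    Uses the trapezoid rule over samples that fall inside [start, end].
    Samples outside the window are ignored; the window edges are not
    interpolated, so sparse sampling underestimates the total.
    """
    window = sorted((t, w) for t, w in samples if start <= t <= end)
    joules = 0.0
    for (t0, w0), (t1, w1) in zip(window, window[1:]):
        joules += 0.5 * (w0 + w1) * (t1 - t0).total_seconds()
    return joules / 3.6e6  # 1 kWh = 3.6e6 J


# Hypothetical data: a constant 300 W draw sampled once a minute for 10 minutes.
JOB_START = datetime(2015, 7, 1, 12, 0, 0)
JOB_END = JOB_START + timedelta(minutes=10)
EXAMPLE_SAMPLES = [(JOB_START + timedelta(seconds=60 * k), 300.0) for k in range(11)]

if __name__ == "__main__":
    print(job_energy_kwh(EXAMPLE_SAMPLES, JOB_START, JOB_END))
```

With the example data, 300 W sustained for 600 seconds is 180,000 J, or 0.05 kWh. Attributing power to a specific job also requires knowing which nodes the job occupied, which this sketch takes as given.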

However, the additional sensors generate a large amount of data that needs to be stored, queried, analyzed and displayed, and our existing infrastructure needed to be updated to meet this challenge. Thus, during 2015 the OTG created a novel approach to collecting system and environmental data and implemented a centralized infrastructure based on the open source ELK (Elasticsearch, Logstash and Kibana) stack. Elasticsearch is a scalable framework that allows for fast search queries of data; Logstash is a tool for managing events and parsing, storing and collecting logs; and Kibana is a data visualization plugin that we use on top of the indexed content collected by Elasticsearch.
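To make the indexing step concrete, the sketch below shows how sensor readings might be serialized for Elasticsearch's `_bulk` API, which accepts newline-delimited JSON made of an action line followed by a document line. The text does not describe the actual schema or ingestion path (in an ELK deployment, Logstash would normally do this work), so the index name and every field name here are assumptions chosen for illustration.

```python
import json

# Hypothetical readings; the field names are illustrative, not NERSC's schema.
EXAMPLE_READINGS = [
    {"@timestamp": "2015-07-01T12:00:00Z", "sensor_id": "pdu-r12-a",
     "metric": "power_watts", "value": 4120.5},
    {"@timestamp": "2015-07-01T12:00:00Z", "sensor_id": "temp-ceiling-03",
     "metric": "temperature_c", "value": 24.1},
]


def to_bulk_ndjson(readings, index_name="sensor-readings"):
    """Serialize readings as an Elasticsearch _bulk request body.

    Each document is preceded by an {"index": ...} action line, and the
    body must end with a newline. The index name is an assumption.
    """
    lines = []
    for reading in readings:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(reading))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_bulk_ndjson(EXAMPLE_READINGS), end="")
```

The resulting body would be sent to a cluster's `_bulk` endpoint with the `application/x-ndjson` content type; batching readings this way keeps the number of requests manageable at high sample rates.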

Using the ELK stack allows us to store, analyze and query the sensor data, going beyond system administration and adding the ability to extend the analysis to improve the operational efficiency of an
