12.03.2016

Anomaly Detection for Monitoring


• It can reduce the surface area or search space when trying to diagnose a problem that has been detected. In a world of millions of metrics, being able to find metrics that are behaving unusually at the moment of a problem is a valuable way to narrow the search.

• It can reduce the need to calibrate or recalibrate thresholds across a variety of different machines or services.

• It can augment human intuition and judgment, a little bit like Iron Man’s suit augments his strength.
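As a rough illustration of the first point, narrowing the search space can be sketched as ranking metrics by how far each one currently sits from its own recent history. This is a minimal sketch, not a method prescribed by the text: the function name `rank_unusual_metrics`, the metric names, and the sample values are hypothetical, and the z-score is just one simple choice of "unusualness" score.

```python
from statistics import mean, pstdev


def rank_unusual_metrics(history, current):
    """Rank metrics by how far the current value sits from recent history.

    history: dict mapping metric name -> list of recent values (hypothetical data)
    current: dict mapping metric name -> value at the moment of the problem
    Returns (name, score) pairs, most unusual first.
    """
    scores = {}
    for name, past in history.items():
        mu = mean(past)
        sigma = pstdev(past)
        if sigma == 0:
            # Flat history: any change at all is unusual, no change is not.
            scores[name] = float("inf") if current[name] != mu else 0.0
        else:
            # Absolute z-score: distance from the mean in standard deviations.
            scores[name] = abs(current[name] - mu) / sigma
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical metrics, purely for illustration.
history = {
    "cpu_util": [40, 42, 41, 39, 43],
    "disk_io_wait": [5, 6, 5, 4, 5],
    "queue_depth": [10, 11, 9, 10, 10],
}
current = {"cpu_util": 44, "disk_io_wait": 30, "queue_depth": 10}

ranked = rank_unusual_metrics(history, current)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

The ranking only suggests where to look first. Deciding whether the top-ranked metric has anything to do with the problem is still a matter of human judgment.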

Anomaly detection cannot do many of the things people sometimes think it can. For example:

• It cannot provide a root cause analysis or diagnosis, although it can certainly assist in that.

• It cannot provide hard yes or no answers about whether there is an anomaly, because at best it can estimate the probability that there is one. (Even humans are often unable to determine conclusively that a value is anomalous.)

• It cannot prove that there is an anomaly in the system, only that there is something unusual about the metric that you are observing. Remember, the metric isn’t the system itself.

• It cannot detect actual system faults (failures), because a fault is different from an anomaly. (See the previous point again.)

• It cannot replace human judgment and experience.

• It cannot understand the meaning of metrics.

• And in general, it cannot work generically across all systems, all metrics, all time ranges, and all frequency scales.
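The point about hard yes or no answers can be made concrete with a small sketch. Here a detector reports a tail probability rather than a verdict. It assumes the metric's recent values are roughly normally distributed, which is an assumption of this example that real metrics may well violate. The function name `tail_probability` and the sample values are hypothetical.

```python
import math
from statistics import mean, stdev


def tail_probability(past, value):
    """Two-sided probability of seeing a value at least this far from the mean.

    ASSUMPTION: values in `past` are roughly normally distributed.
    A small result means "unusual under that assumption", not "anomaly: yes".
    """
    mu = mean(past)
    sigma = stdev(past)  # sample standard deviation
    if sigma == 0:
        raise ValueError("history has no variation; cannot compute a z-score")
    z = (value - mu) / sigma
    # For a standard normal variable Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical recent observations of a single metric.
past = [100, 102, 98, 101, 99]

p_near = tail_probability(past, 101)  # close to the recent mean
p_far = tail_probability(past, 110)   # far from the recent mean

print(f"p(101) = {p_near:.3g}")
print(f"p(110) = {p_far:.3g}")
```

Any threshold applied to such a probability is a policy choice made by a person. The score itself only describes the metric, not the health of the system behind it.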

The last limitation, that anomaly detection cannot work generically across all systems and metrics, is quite important to understand. There are pathological cases where every known method of anomaly detection, every statistical technique, every test, and every false positive filter will break down and fail. And on large data sets, such as those you get when monitoring lots of metrics from lots of machines at high resolution in a modern application, you are guaranteed to find these pathological cases.

In particular, at a high resolution such as one-second resolution, most machine-generated metrics are extremely noisy, and will

12 | Chapter 2: A Crash Course in Anomaly Detection
