Anomaly Detection for Monitoring
• It can reduce the surface area or search space when trying to<br />
diagnose a problem that has been detected. In a world of millions<br />
of metrics, being able to find metrics that are behaving<br />
unusually at the moment of a problem is a valuable way to narrow<br />
the search.<br />
• It can reduce the need to calibrate or recalibrate thresholds<br />
across a variety of different machines or services.<br />
• It can augment human intuition and judgment, a little bit like<br />
the way Iron Man’s suit augments his strength.
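The first benefit above, narrowing the search to metrics that are behaving unusually at the moment of a problem, can be sketched roughly as follows. This is only an illustration under simple assumptions: the metric names, the `history` and `current` dictionaries, and the function `rank_unusual_metrics` are all hypothetical, and a plain z-score (distance from the recent mean in standard deviations) stands in for whatever scoring method a real system would use.

```python
import statistics


def rank_unusual_metrics(history, current, top_n=3):
    """Rank metrics by how far their current value sits from recent history.

    history: dict of metric name -> list of recent values (hypothetical layout)
    current: dict of metric name -> value observed at the moment of the problem
    Returns up to top_n (name, score) pairs, most unusual first.
    """
    scores = {}
    for name, values in history.items():
        mean = statistics.fmean(values)
        stdev = statistics.pstdev(values)
        if stdev == 0:
            # Flat history: any change at all is notable, no change is not.
            scores[name] = float("inf") if current[name] != mean else 0.0
        else:
            # Plain z-score: distance from the mean in standard deviations.
            scores[name] = abs(current[name] - mean) / stdev
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up sample data, purely for illustration.
history = {
    "cpu_user": [40, 42, 41, 39, 43, 40],
    "disk_io_wait": [2, 3, 2, 2, 3, 2],
    "requests_per_sec": [500, 510, 495, 505, 498, 502],
}
current = {"cpu_user": 41, "disk_io_wait": 30, "requests_per_sec": 504}

ranked = rank_unusual_metrics(history, current)
for name, score in ranked:
    print(f"{name}: {score:.1f} standard deviations from recent mean")
```

With this sample data, `disk_io_wait` lands at the top of the list. The ranking only says which metrics look unusual; deciding whether that matters is still up to the person investigating.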
Anomaly detection cannot do a lot of things people sometimes think<br />
it can. For example:<br />
• It cannot provide a root cause analysis or diagnosis, although it<br />
can certainly assist in that.<br />
• It cannot provide hard yes or no answers about whether there is<br />
an anomaly, because at best it can only estimate the probability<br />
that an anomaly is present. (Even humans are often unable to<br />
determine conclusively that a value is anomalous.)<br />
• It cannot prove that there is an anomaly in the system, only that<br />
there is something unusual about the metric that you are<br />
observing. Remember, the metric isn’t the system itself.<br />
• It cannot detect actual system faults (failures), because a fault is<br />
different from an anomaly. (See the previous point again.)<br />
• It cannot replace human judgment and experience.<br />
• It cannot understand the meaning of metrics.<br />
• And in general, it cannot work generically across all systems, all<br />
metrics, all time ranges, and all frequency scales.<br />
This last item is quite important to understand. There are pathological<br />
cases where every known method of anomaly detection, every<br />
statistical technique, every test, every false positive filter, everything,<br />
will break down and fail. And on large data sets, such as those you<br />
get when monitoring lots of metrics from lots of machines at high<br />
resolution in a modern application, you will find these pathological<br />
cases, guaranteed.<br />
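A rough back-of-the-envelope calculation shows why scale alone makes surprises unavoidable, and why a detector's output is better read as a probability than as a yes or no. The sketch below assumes, purely for illustration, that each metric's noise is independent and normally distributed, which real metrics seldom satisfy. The function names and the fleet size are hypothetical.

```python
import math


def two_sided_tail_probability(z):
    """P(|Z| >= z) for a standard normal Z (assumes Gaussian noise)."""
    return math.erfc(abs(z) / math.sqrt(2))


def expected_false_alarms(num_metrics, samples_per_metric, z_threshold):
    """Expected number of threshold crossings caused by noise alone."""
    return num_metrics * samples_per_metric * two_sided_tail_probability(z_threshold)


# Chance that one perfectly healthy sample lies beyond 3 standard deviations.
p3 = two_sided_tail_probability(3.0)

# Hypothetical fleet: one million metrics sampled once per second for a day.
per_second = expected_false_alarms(1_000_000, 1, 3.0)
per_day = expected_false_alarms(1_000_000, 86_400, 3.0)

print(f"P(|z| >= 3) = {p3:.4f}")
print(f"Expected noise-only alarms per second: {per_second:.0f}")
print(f"Expected noise-only alarms per day: {per_day:.3g}")
```

Under these assumptions, roughly 0.27 percent of healthy samples cross a three-sigma threshold, so a million one-second metrics would produce thousands of noise-only alarms every second. Raising the threshold lowers that number but never brings it to zero.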
In particular, at high resolution, such as one-second metrics,<br />
most machine-generated metrics are extremely noisy, and will<br />
12 | Chapter 2: A Crash Course in Anomaly Detection