Anomaly Detection for Monitoring

• It can reduce the surface area or search space when trying to diagnose a problem that has been detected. In a world of millions of metrics, being able to find metrics that are behaving unusually at the moment of a problem is a valuable way to narrow the search.
• It can reduce the need to calibrate or recalibrate thresholds across a variety of different machines or services.
• It can augment human intuition and judgment, a little bit like Iron Man’s suit augments his strength.

Anomaly detection cannot do a lot of things people sometimes think it can. For example:

• It cannot provide a root cause analysis or diagnosis, although it can certainly assist in that.
• It cannot provide hard yes or no answers about whether there is an anomaly, because at best it is limited to the probability of whether there might be an anomaly or not. (Even humans are often unable to determine conclusively that a value is anomalous.)
• It cannot prove that there is an anomaly in the system, only that there is something unusual about the metric that you are observing. Remember, the metric isn’t the system itself.
• It cannot detect actual system faults (failures), because a fault is different from an anomaly. (See the previous point again.)
• It cannot replace human judgment and experience.
• It cannot understand the meaning of metrics.
• And in general, it cannot work generically across all systems, all metrics, all time ranges, and all frequency scales.

This last item is quite important to understand. There are pathological cases where every known method of anomaly detection, every statistical technique, every test, every false positive filter, everything, will break down and fail. And on large data sets, such as those you get when monitoring lots of metrics from lots of machines at high resolution in a modern application, you will find these pathological cases, guaranteed.
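As a rough illustration of the first point in the list above, narrowing the search once a problem has already been detected, the following Python sketch ranks metrics by how far their latest value sits from their own recent history. The metric names, the sample values, and the helper functions `unusualness` and `rank_metrics` are hypothetical, and the plain z-score is only one crude choice of score; as the text notes, no single method works for every metric.

```python
from statistics import mean, stdev


def unusualness(history, current):
    """Absolute z-score of `current` against `history`.

    Returns 0.0 when the history is too short or has no spread,
    because a z-score is undefined in those cases.
    """
    if len(history) < 2:
        return 0.0
    sigma = stdev(history)
    if sigma == 0:
        return 0.0
    return abs(current - mean(history)) / sigma


def rank_metrics(histories, current_values, top_n=3):
    """Return the `top_n` (name, score) pairs, most unusual first."""
    scores = {
        name: unusualness(history, current_values[name])
        for name, history in histories.items()
        if name in current_values
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Made-up recent history and the values observed when the problem was noticed.
histories = {
    "cpu_util": [40, 42, 41, 39, 40, 43],
    "disk_latency_ms": [5, 6, 5, 5, 6, 5],
    "queue_depth": [10, 11, 9, 10, 10, 11],
}
current = {"cpu_util": 41, "disk_latency_ms": 30, "queue_depth": 12}

ranked = rank_metrics(histories, current)
for name, score in ranked:
    print(f"{name}: {score:.1f} sigmas from its recent mean")
```

The ranking only suggests where to look first. It does not diagnose anything, and it says nothing about whether the system itself is at fault.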
In particular, at a high resolution such as one-second metrics resolution, most machine-generated metrics are extremely noisy, and will cause most anomaly detection techniques to throw off lots and lots of false positives.

Are Anomalies Rare?

Depending on how you look at it, anomalies are either rare or common. The usual definition of an anomaly uses probabilities as a proxy for unusualness. A rule of thumb that shows up often is three standard deviations away from the mean. This is a technique that we will discuss in depth later, but for now it suffices to say that if we assume the data behaves exactly as expected, 99.73% of observations will fall within three sigmas. In other words, slightly less than three observations per thousand will be considered anomalous. That sounds pretty rare, but given that there are 1,440 minutes per day, you’ll still be flagging about 4 observations as anomalous every single day, even at one-minute granularity. If you use one-second granularity, you can multiply that number by 60. Suddenly these rare events seem incredibly common. One might even call them noisy, no? Is this what you want on every metric on every server that you manage? You make up your own mind how you feel about that. The point is that many people probably assume that anomaly detection finds rare events, but in reality that assumption doesn’t always hold.

How Can You Use Anomaly Detection?

To apply anomaly detection in practice, you generally have two options, at least within the scope of things considered in this book. Option one is to generate alerts, and option two is to record events for later analysis without alerting on them. Generating alerts from anomalies in metrics is a bit dangerous. Part of this is because the assumption that anomalies are rare isn’t as true as you may think. See the sidebar. A naive approach to alerting on anomalies is almost certain to cause a lot of noise. Our suggestion is not to alert on most anomalies. This follows directly from the fact that anomalies do not imply that a system is in a bad state.
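To get a feel for how much noise a naive three-sigma alert would produce, the sidebar's arithmetic can be reproduced in a few lines of Python. This is a sketch that assumes perfectly normal, independent observations, which real metrics rarely are; the function names are made up for illustration.

```python
import math


def flag_rate(k=3.0):
    """Probability that a normally distributed value lies more than
    k standard deviations from the mean (both tails combined)."""
    return math.erfc(k / math.sqrt(2))


def expected_flags_per_day(samples_per_day, k=3.0):
    """Expected number of observations flagged per day for one metric."""
    return samples_per_day * flag_rate(k)


per_minute = expected_flags_per_day(24 * 60)       # one-minute granularity
per_second = expected_flags_per_day(24 * 60 * 60)  # one-second granularity

print(f"flag rate at 3 sigma: {flag_rate():.4%}")         # about 0.27%
print(f"one-minute data: {per_minute:.1f} flags per day")  # about 3.9
print(f"one-second data: {per_second:.1f} flags per day")  # about 233
```

These figures are per metric. Multiplying by the number of metrics and servers being watched shows why alerting on every three-sigma excursion quickly becomes unmanageable.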
In other words, there is a big difference between an anomalous observation in a metric, and an actual system fault. If you can guarantee that an anomaly reliably detects a serious prob‐