Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data

An important part of site reliability is identifying and eliminating the causes of outages. Good problem management requires good problem definition and theme identification. Historically, this has been a largely manual and inefficient process. Problem management should never be driven solely by manual review of individual postmortems or by a limited study of top-level metrics. If we want to scale, we must be systematic.

Machine learning is a key component in this process. However, fitting models is only a small piece of the pie. Without good datasets you will learn precious little. We’ll talk through some of the challenges we’ve identified when collecting and cleaning useful datasets for problem identification. How do you categorize outages? What is an outage theme? Which problems are at risk of repeating, and which have already been firmly left in the past?

On top of it all is the issue of measuring success. When we make reliability investments, how do we know that our actions are making a positive difference? We’ll address some of the challenges we’ve encountered in measuring success (and reliability) in an ever-evolving environment. Join us as we discuss our vision for the future and share our journey so far.
