Talks and presentations
See a map of all the places I've given a talk!
June 15, 2023
Tutorial, SREcon23 APAC, Singapore
Nobody’s system works exactly the way they think it does. On top of that, systems of people and software are constantly changing, resulting in a regular need to update our limited understanding of how things actually work - where the sources of our success are, where our risks are, and how things behave.
June 14, 2023
Talk, SREcon23 APAC, Singapore
Outage pattern analysis is hard! There have been many attempts to learn across multiple incidents. Folks look for categories, tags, causes, etc. to identify what’s brittle or risky in their system, sometimes even using statistical models to help make sense of the data. However, the results often prove unsatisfying or non-actionable, or they don’t tell you anything you didn’t already know from other sources.
February 15, 2023
Tutorial, LFIConf23, Denver, Colorado
The Functional Resonance Analysis Method (FRAM) is a method for studying complex systems, including sociotechnical systems. Outcome agnostic, it models these systems in terms of their functions, dependencies, and interactions - identifying variance in function outputs (which can be good too!) instead of a “success/failure” paradigm. This approach allows for a better understanding of how systems work and - importantly - how they interact.
November 15, 2022
Talk, FRAMily2022, Kyoto, JP
Complex software systems grow ever more integrated with our work and lives. Large, multi-component, dynamical software systems and the teams responsible for them form an ever-evolving, compelling object of study. Studies of incident command and facilitation in similar contexts have proven fruitful for understanding broader patterns and principles. We now turn to functional analysis of the systems themselves, building models of them from interviews, systems of record, transcripts of incident response, and other artifacts. Findings illuminate the dynamics of such systems and inform operational strengths and weaknesses.
June 14, 2019
Talk, SREcon19 APAC, Singapore
As much as we often wish we could eliminate the “squishy humans” from the loop in order to maximize our system reliability, automation usually has unintended consequences. “The Ironies of Automation,” a seminal paper on the problems of automation, spelled these out quite clearly and still stands the test of time, over 30 years later.
June 12, 2019
Talk, SREcon19 APAC, Singapore
Many companies become frustrated with their postmortem and incident review process, feeling that it is a burden, that it does not provide meaningful insights, or that the repairs and learnings generated do not help prevent repeats or other incidents. Fortunately, there is a better way to do things, backed by decades of scientific rigor and proven in industries where outages can mean far worse than lost revenue.
June 07, 2018
Talk, Open West 2018, South Jordan, Utah
The DAO hack of 2016 shook the cryptocurrency world, lost many people a lot of money, and resulted in a major schism in the second most popular blockchain in history (Ethereum). The code, however, was Open Source.
March 28, 2018
Talk, SREcon18 Americas, Santa Clara, CA
An important part of site reliability is identifying and eliminating the causes of outages. Good problem management requires good problem definition and theme identification. Historically, this has been a largely inefficient human process, but problem management should never be driven solely by manual review of individual postmortems or a limited study of top-level metrics. If we want to scale, we must be systematic.
May 01, 2015
Talk, Openwest 2015, Provo, UT
Modern networks are both complex and important, requiring excellent and vigilant system administration. By implementing a practical data mining infrastructure, administrators gain much more knowledge about and power over their systems, saving them resources and time in the long run.