Talks and presentations

See a map of all the places I've given a talk!

Functional Resonance Analysis: Diagramming Your System

June 15, 2023

Tutorial, SREcon23 APAC, Singapore

Nobody’s system works exactly the way they think it does. On top of that, systems of people and software are constantly changing, resulting in a regular need to update our limited understanding of how things actually work - where the sources of our success are, where our risks are, and how things behave.

Patterns, Not Categories: Learning Across Incidents

June 14, 2023

Talk, SREcon23 APAC, Singapore

Outage pattern analysis is hard! There have been many attempts to learn across multiple incidents. Folks look for categories, tags, causes, etc. to identify what’s brittle or risky in their system, sometimes even using statistical models to help make sense of the data. However, the results often prove unsatisfying or non-actionable, or tell you nothing you didn’t already know from other sources.

Functional Resonance Analysis in Sociotechnical Systems

February 15, 2023

Tutorial, LFIConf23, Denver, Colorado

The Functional Resonance Analysis Method (FRAM) is a method for studying complex systems, including sociotechnical systems. Outcome agnostic, it models these systems in terms of their functions, dependencies, and interactions - identifying variance in function outputs (which can be good too!) instead of a “success/failure” paradigm. This approach allows for a better understanding of how systems work and - importantly - how they interact.

Functional Dynamics of Sociotechnical Software Systems

November 15, 2022

Talk, FRAMily2022, Kyoto, JP

Complex software systems are growing ever more integrated with our work and lives. Large, multi-component, dynamical software systems and their responsible teams form an ever-evolving, compelling object of study. Studies of incident command and facilitation in similar contexts have proven fruitful for understanding broader patterns and principles. We now turn to functional analysis of the systems themselves, building models of them from interviews, systems of record, transcripts of incident response, and other artifacts. Findings illuminate the dynamics of such systems and inform operational strengths and weaknesses.

Ironies of Automation: A Comedy in Three Parts

June 14, 2019

Talk, SREcon19 APAC, Singapore

As much as we often wish we could eliminate those “squishy humans” from the loop in order to maximize our system reliability, automation usually has unintended consequences. “The Ironies of Automation,” a seminal paper on the problems of automation, spelled these out quite clearly and still stands the test of time, over 30 years later.

A Tale of Two Postmortems: A Human Factors View

June 12, 2019

Talk, SREcon19 APAC, Singapore

Many companies become frustrated with their postmortem and incident review process, feeling that it is a burden, that it does not provide meaningful insights, or that the repairs and learnings generated do not help prevent repeats or other incidents. Fortunately, there is a better way to do things, backed by decades of scientific rigor and proven in industries where outages can mean far worse than lost revenue.

When Many Eyes Fail: Open Source Security and The Fall of The DAO

June 07, 2018

Talk, Open West 2018, South Jordan, Utah

The DAO hack of 2016 shook the cryptocurrency world, lost many people a lot of money, and resulted in a major schism in the second most popular blockchain in history (Ethereum). The code, however, was Open Source.

Learning at Scale Is Hard! Outage Pattern Analysis and Dirty Data

March 28, 2018

Talk, SREcon18 Americas, Santa Clara, CA

An important part of site reliability is identifying and eliminating the causes of outages. Good problem management requires good problem definition and theme identification. Historically, this has been a largely inefficient human process, but problem management should never be driven solely by manual review of individual postmortems or a limited study of top-level metrics. If we want to scale, we must be systematic.

ELK Stack

May 01, 2015

Talk, Openwest 2015, Provo, UT

Modern networks are both complex and important, requiring excellent and vigilant system administration. By implementing a practical data mining infrastructure, administrators gain much more knowledge about and power over their systems, saving them resources and time in the long run.