Incidents

No one likes causing incidents. It’s one of those very sobering experiences that probably every software developer encounters at some point during their journey.

For me, dealing with an incident means going through several mental and emotional stages. The following is a purely personal account of the mental state of the “Developer A” who appears in postmortems, with the stages in the order they show up.

  • Sheer panic and terror. I admit, it is still my first feeling when I suspect that an ongoing incident is caused by my changes. This is really not great as it practically blocks my mental capacities for a minute or so. My pulse skyrockets and I start to sweat. It is soon overtaken by the next step.
  • Anger at my past self. How could I miss that trivial function call, or that other DB column?
  • Shame as I realise that I made my team less efficient, I made them work more, I made them less successful.
  • Anxiety while frantically working with the team and looking for potential causes, trying to pinpoint the problem at 130 heartbeats per minute, looking at metrics, logs, Sentry errors, git commits.
  • Relief when the error is rectified: the change is rolled back (or a fix is deployed). At this point I know that at least the system will work going forward, and that I will have time to investigate the ramifications of the incident when things calm down.
  • Acceptance as I admit to myself that I can’t undo the problem I caused.
  • Reflection as I try to formulate the key learnings from the event and make sure I really learn from them. This is a key point, as I often feel like an incident is caused by “bad code” and that it is hard to tell exactly what the root cause is. Is it insufficient test coverage? What is sufficient, then? Is it lack of QA? Is it lack of thorough code reviews? Is it lack of conventions? Most of the time, it is hard to say. But when you have a clear message (for instance, “don’t modify the default behaviour of a public function without checking with all other callers that it’s OK” is quite a specific one), it’s valuable and actionable. It is at this phase that I might try to rationalise the mistake I’ve made - yes, maybe even try to convince myself that it’s not all that bad. After all, you can only have so much cognitive dissonance before you go totally crazy from guilt. I think at this stage, it’s important not to downplay the mistake, but to accept it fully and really try to remember it.
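
To make that example lesson about default behaviour concrete, here is a minimal sketch in Python. Everything in it (format_price, invoice_line, csv_cell, the EUR suffix) is hypothetical and not taken from any real incident; it only shows one possible way to add new behaviour to a public function without changing what existing callers get by default.

```python
def format_price(amount_cents: int, *, include_currency: bool = True) -> str:
    """Format a price given in cents (hypothetical public helper).

    The default, include_currency=True, reproduces the output that existing
    callers already rely on. The new variant is strictly opt-in.
    """
    text = f"{amount_cents / 100:.2f}"
    return f"{text} EUR" if include_currency else text


def invoice_line(name: str, amount_cents: int) -> str:
    # Existing caller, written before the new parameter existed.
    # It does not pass include_currency, so its output is unchanged.
    return f"{name}: {format_price(amount_cents)}"


def csv_cell(amount_cents: int) -> str:
    # New caller: the only place that asks for the new behaviour.
    return format_price(amount_cents, include_currency=False)
```

Making the new parameter keyword-only and keeping the old behaviour as its default means nobody else's call site changes meaning silently. If the default itself really has to change, that is the moment to go through every caller first.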

I don’t think these feelings will ever really change. I just hope that if I stress the last point enough - reflection and learning - then I might be able to minimise the occurrence of incidents my code changes cause.

Further reading: Postmortem Culture: Learning from Failure

Written on August 17, 2019

If you notice anything wrong with this post (factual error, rude tone, bad grammar, typo, etc.), and you feel like giving feedback, please do so by contacting me at samubalogh@gmail.com. Thank you!