A few months ago I was on a call that led to me quitting my job.
I wasn’t angry; the call went great. I actually loved my job, had great colleagues, and was having a ball solving complex problems.
Instead, the call made me obsess about the new world that was coming and how we all needed to prepare for it. If engineering systems is your jam, read on.
It was a regular call with engineering leadership to review production issues from the past month. On this particular call, one leader dominated the discussion. An issue from his team had taken over 8 hours to figure out.
Here’s what happened: Team A shipped a change that subtly altered an API’s behaviour. The deployment went through, and all checks passed. Team B saw service degradation and was at a loss to understand the cause. It took many hours to find the cause, and mere minutes to resolve it.
We ended the call with the usual strategies: improving comms on deploys, adding more observability, more alerts, and so on.
But two things kept bugging me:
We need a better way to reason about complex production systems. The complexity isn’t going away and I didn’t see how existing tools cut it.
There’s no happy ending to give you, but the trajectory is already positive. Simplification is cool again in the engineering world. It’s also the best strategy for improving reliability, BUT the most difficult one.
I quit my job thinking that the latter felt like a problem worth solving.