Dissecting an Interesting Failure and Complete Multiday Outage of derekriemer.com: A Postmortem Investigation

What happened?

On January 3rd, 2023, I made a routine change to derekriemer.com, updating the draft of a new article I had written. I then tried to deploy the changes to staging.derekriemer.com, but unknown to me, several days earlier I had accidentally destroyed the entire site with a big oopsie. A combination of code as config, shell variable expansion acting in strange ways, and Python f-strings ended up being the cause of the outage. The outage would have been isolated to my staging environment, but my push process does not properly isolate the staging environment of derekriemer.com from the production environment, so the blast radius included the entire site, production domain and all, instead of just my staging subdomain. The outage went undetected for days because I had left a migration, intended to provide separate draft deployment configs for staging and production, in a borked state without realizing what I had done. Because derekriemer.com is not particularly important, I had left myself with no monitoring of the site's actual status, and nobody emailed me wondering where my site went. In effect, I let my site cease to exist for over a week. (I don't have records of the exact date of the failure, because I'm lazy with my version control for derekriemer.com, because it's simply not that important.)
