16 April 2016
rm -rf /
Okay, this story has been doing the rounds over the past few days, and I think it's time to talk about it:
Man accidentally 'deletes his entire company' with one line of bad code.
The gist of the story is that he had a script which did something along the lines of:
#!/bin/sh
foo=$(something)
bar=$(something)
rm -rf ${foo}/${bar}
Unfortunately, the $(something) evaluated to nothing, so what he ended up running was just plain "rm -rf /". This script was automated, and ran on all of his servers. Oh, and his backups were NFS-mounted, read-write (of course), on the machines. So all of the backups got deleted, too.
Now, this story has turned out to be a hoax, and has been deleted from serverfault.com, but it still raises some interesting points. One is that I did something remarkably similar myself, a few years ago.
This was in an environment where there was no budget for any kind of Linux/Unix infrastructure (Puppet master, systems monitoring, etc etc). What we did have was backups (phew), a rapidly-growing server estate of around 100 servers at that point, and just me to manage them all. I had a script which would gather various vital statistics on all the servers - CPU, memory, free disk space, and so on. It saved all the information into a temporary directory (created by mktemp), tarred and zipped up the results, and copied them back to the server I ran the script from. There was no admin server, as I mentioned, so even that script was run from one of the less busy of the active servers. The script was nothing special, but it ran fine on the various Solaris (versions 8, 9 and 10), Red Hat Enterprise Linux (versions 4 and 5), and HP-UX servers that we had. I was actually quite pleased with how OS-agnostic it was, and how adept it was at dealing with the differences. The example below does not do it justice, honestly!
Then we got a bunch of AIX servers. It turns out that AIX does not have the "mktemp" command. It is a great command: it creates a uniquely named temporary file (or directory, with the "-d" switch) and echoes the name of the newly created file (or directory), something along the lines of "/tmp/tmp.SWYw9ZejQk". The idea is that you can run one command to create a unique temporary directory:
TEMPDIR=$(mktemp -d)
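As an aside, on a box which genuinely has no mktemp, one rough workaround (only a sketch of what I could have done, not what the script actually did) is to fall back to building a name by hand and failing loudly if the directory cannot be created:

#!/bin/sh
# Try the real mktemp first; any "not found" noise is thrown away.
if TEMPDIR=$(mktemp -d 2>/dev/null) && [ -n "${TEMPDIR}" ]; then
    :   # mktemp exists and worked
else
    # Crude fallback: PID plus timestamp. Predictable, so nowhere near as
    # safe as mktemp, but at least TEMPDIR is never left empty.
    TEMPDIR="/tmp/gather.$$.$(date +%Y%m%d%H%M%S)"
    mkdir "${TEMPDIR}" || { echo "Cannot create ${TEMPDIR}" >&2; exit 1; }
    chmod 700 "${TEMPDIR}"
fi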
Once that one command has run, the directory has been created, and ${TEMPDIR} contains its name. So the script could save the results there, and of course, safely tidy up afterwards with rm -rf "${TEMPDIR}", a bit like this:
#!/bin/sh

gather_data() {
  # Gather data
  df -lP && df -lP > df-lP.out
  df -lh && df -lh > df-lh.out
  [ -f /proc/swaps ] && cat /proc/swaps > proc_swaps.out
  [ -f /proc/cpuinfo ] && cat /proc/cpuinfo > cpuinfo.out
  which free && free -m > free-m.out
  which prtdiag && prtdiag > prtdiag.out
  [ -f /etc/redhat-release ] && cp /etc/redhat-release .
  [ -f /etc/issue ] && cp /etc/issue .
  which swap && swap -l > swap-l.out
  # etc etc
}

TEMPDIR=$(mktemp -d)    # no check that this actually worked
cd "${TEMPDIR}"
gather_data
tar cf /tmp/logs.tar "${TEMPDIR}"
gzip -9 /tmp/logs.tar
cd /
rm -rf "${TEMPDIR}"/    # if mktemp failed, TEMPDIR is empty and this becomes "rm -rf /"
Now, I don't remember the exact details, but the actual result was only that the application directory got wiped; I think maybe I'd created the temp directory in there. But still, there were people actively working on configuring those applications, and they were not best pleased that I had deleted their work. I was made to feel like the worst sysadmin in the world, of course. And I felt it. I'm a contractor, so I move around quite a bit for work, and although I left that project about a year later, they have asked me back a few times since, so I think that all is now probably just about forgiven.
I have got a few take-aways from my nasty experience, including:
- If I, having written a book on shell scripting, can make such a silly mistake, then I'm sure that other people can do it, too. It was a silly mistake, and it should never have happened; I'm not denying either of those things. However, anybody who claims never to have made a mistake is either a liar, or has never, ever, tried to do anything new.
- Backups are essential; they saved us from the worst of this disaster, although:
- Even when you have nightly backups, you can still stand to lose the day's work that was done since the most recent backup was taken.
- Expecting flawless results without paying for suitable infrastructure is not realistic management. If we had had proper tools for monitoring the state of the servers, this quickly thrown-together script would not have been used. Similarly, if there had been time to test it on the AIX servers, this bug would have been caught before it did any damage at all.
From a technical perspective, firstly, of course it is vital to know the technology of the systems that you are working on (like, does AIX include the mktemp command?), but secondly, it is essential to test everything. What particularly irritates me about this mistake is that the main part of the script was incredibly pedantic about checking that each command would work on a given OS, given that its job was to gather data about many different types of *nix system. On a Linux box it would gather /proc/cpuinfo to count the CPUs, on a Solaris box it would run prtdiag, and so on. It was careful to run on any Bourne-like shell (the "function fname() {...}" stuff can be problematic). It didn't assume that "tar xzf" would work. But the trivial admin stuff which surrounded the main task, the silly little bit of making a temporary directory, didn't get any real consideration at all. There was no checking that the mktemp command had succeeded; I have never come across a situation where it would fail. Unless, of course, the binary does not exist on the server. Yeah, that would fail. I know that now.
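With hindsight, the missing check is only a couple of lines. A rough sketch (the /tmp/* guard assumes mktemp's default location, which is configurable via $TMPDIR):

#!/bin/sh
# Stop immediately if mktemp is missing or fails for any other reason.
TEMPDIR=$(mktemp -d) || { echo "mktemp failed" >&2; exit 1; }
[ -d "${TEMPDIR}" ] || { echo "No temporary directory" >&2; exit 1; }

# ... cd there and gather the data as before ...

# Only ever delete something that looks like the directory we created.
case "${TEMPDIR}" in
    /tmp/*) rm -rf "${TEMPDIR}" ;;
    *)      echo "Refusing to delete '${TEMPDIR}'" >&2 ;;
esac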
PS:
I have written about this before; it is entirely avoidable. GNU's rm (i.e., the one included in virtually every Linux distro) now has the "--preserve-root" option set by default. Older distros will not have this protection, so do be careful. No such mechanism is ever fool-proof, though. And if you don't think that you are a fool, well... nor did I.