I’ve worked at a few places that had a large number of Linux boxes. I’m
talking about well over a million. When you have that many cats which
need herding, sometimes you have to do things to big groups of them at
once. Once in a while, you even have to touch all of them at once.
It’s been my experience that companies which possess such massive fleets
tend to create tooling which will let them do exactly that. These tools
have different names, but the gist of it is about the same: ssh in as
root, run some command, and maybe return the exit code and/or output.
For certain situations, this is exactly what is needed to put out a
fire, and that’s when you’re thankful it exists.
This post, however, is not about that. This post is about the other
side of the coin, where someone uses one of these tools and
creates a problem. Maybe they decide to roll out a “flag flip”
this way instead of using best practices (tests, canaries, percentage
rollouts, that kind of thing). Perhaps they decide to push a new binary
to every machine at once and so they all drop out at the same time,
leaving no capacity to run the actual site.
There’s something I’ve asked people to put into their tools to prevent
certain kinds of disasters. It’s intended to address the specific
situation where someone runs the command and targets far too many
machines. Maybe they wanted to touch a rack of test hosts (40 or so),
but accidentally selected all of them.
Once you have tooling like this, errors like this will happen too.
My request is simple enough: if you’re going to generate a confirmation
prompt as a sanity check, *don’t* make it a Y/N type of thing.
Instead, ask them to read a number and plug it back in.
It’ll look like this:
    Blah blah blah
    123456 machines will be affected by this. Proceed?
    Enter number of machines to confirm:
Your options are then to type in exactly “123456” to let it go, or
anything else to abort.
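
Here's a rough sketch of what that check might look like in Python.
The function name and the `read_line`/`write` parameters are made up
for illustration (they just make it easy to test without a terminal),
and forgiving stray whitespace around the number is my own assumption,
not a rule from any particular tool.

```python
def confirm_machine_count(expected, read_line=input, write=print):
    """Return True only if the operator types the exact machine count.

    Hypothetical helper: `read_line` and `write` default to the
    built-in input() and print() but can be swapped out in tests.
    """
    write(f"{expected} machines will be affected by this. Proceed?")
    typed = read_line("Enter number of machines to confirm: ")
    # Whitespace around the answer is forgiven (an assumption of this
    # sketch). Anything else that is not the plain decimal count,
    # including "y" or "yes", counts as an abort.
    return typed.strip() == str(expected)
```

The caller would only go ahead with the run when this returns True,
and treat every other answer as "no".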
The idea is to force you to take in that number through your usual input
devices (I’d say eyes, but some people are using text-to-speech stuff
or similar, and they count too), chew on it with your wetware, and then
feed it back into the computer somehow. Adding a few extra steps like
this will hopefully activate enough of your brain to make you stop short
before blowing off your entire leg with a giant foot-gun.
Of course, if you run into this a lot and you actually intend to hit
that many machines, someone might start cutting and pasting the number.
In that case, I would say that you’re using that tool far too often, and
should take a look at changing the way things are done to avoid having
to rely on it this much.
Now, reality being what it is, “stop using it” might not be easily done
in a given company. If that’s what’s going on, then it might be
interesting to split up the number a bit so it can’t just be pasted in
and has to make a round-trip through the human doing the work.
This might be as simple as printing the number with your locale’s
version of numerical separators, like “123,456” or “123.456” or “123
456” or whatever else you might use where you are. The trick is then to
NOT accept that as input, but instead demand that they remove the
separator and jam it in as just digits.
    Blah blah blah
    123,456 machines will be affected. Proceed?
    Enter number of machines to confirm: 123456
    OK! Continuing.
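
A sketch of the separator variant, again in Python and again with
made-up names. It always groups with a comma via the `,` format
specifier rather than consulting the locale, and the "Aborting."
message is my own placeholder. The point is only that the number is
shown one way and must be typed another.

```python
def format_with_separators(count):
    """Render the count with comma grouping, e.g. 123456 -> '123,456'."""
    return f"{count:,}"


def confirm_with_separators(expected, read_line=input, write=print):
    """Show the grouped number, but accept only the bare digits.

    Hypothetical helper: `read_line` and `write` are injectable so the
    prompt can be exercised without a real terminal.
    """
    write(f"{format_with_separators(expected)} machines will be affected. Proceed?")
    typed = read_line("Enter number of machines to confirm: ").strip()
    # Pasting "123,456" straight from the prompt fails this comparison,
    # so the separator has to be removed by the person typing.
    if typed == str(expected):
        write("OK! Continuing.")
        return True
    write("Aborting.")  # placeholder wording, not from any real tool
    return False
```

One caveat with this sketch: counts below 1000 have no separator to
remove, so the displayed number can still be pasted in as-is. A
locale-aware tool could group with
`locale.format_string("%d", n, grouping=True)` instead of the fixed
comma.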
I’ve seen this technique save people on multiple occasions, and I’m sharing
it here in the hope that it helps others. If you’re designing something rather
powerful, consider making it safe this way.
Just think: the numbers come off the screen, leap into the person,
bounce around the inside of their head, and go back into the computer.
Nothing but net.