Saturday, July 20, 2013

Promoting an environment of clean rooms

In a previous post, I talked about how to take something that's a mess and keep it from getting worse. You may not be able to fix it, but you can at least not contribute to the mess. Today is about what a clean room gets you.

The concept of "cleanroom" work (notice the pun?) appears in a lot of disciplines. Nearly all computing hardware, especially CPUs and disks, is built in cleanrooms. The tolerances are so small that any dust or electrostatic charge would render the final product useless. (Even with this, Intel is rumored to achieve at best a 95% yield, and much worse on newer processes.)

Like so many other things, the concept originates with medicine. When germ theory was just starting to gain acceptance, early disease control distinguished between the "cleanroom" and the "sickroom". The protocol (official process) was that anything in the cleanroom could be taken into the sickroom, but nothing from the sickroom could be taken into the cleanroom without being sterilized. This isolated the germs in the sick areas and prevented contamination. The protocol is still used today, both by parents dealing with sick children and as the basis for how epidemiologists work with highly infectious diseases like Ebola and meningitis.

The best way to think about this is with bubble children. Because they have no functioning immune system, they have to live in a completely isolated and sterile environment - a bubble. Everything (diapers, food, water, books) has to be sterilized and kept sterile before it can be introduced into the bubble. The moment anything non-sterile (a dirty fork) touches something meant for the bubble (a book), that book is back at square one and has to go through the whole sterilization procedure again.

In IT, we promote our applications from environment to environment (notice the second pun?). Developers do their work in a "development" or "dev" environment (ideally on their local machines). When they have finished, the work is promoted to a "test" or "beta" environment to be evaluated by the QA staff. When they have finished, the work is promoted (possibly through a "staging" environment for regression testing) to "production". Production is the bubble.

This idea of successive steps of verification is exactly the same as disinfecting anything that leaves a sickroom. Instead of dealing with infections, we deal with bugs. When someone writes a line of code, that line is maximally buggy. The more tests we apply to it (code review, unit tests, integration tests, regression tests, etc.), the more confidently we can say "This line of code doesn't have that bug." We achieve a relatively high degree of confidence that the new line of code is sufficiently bug-free that it can move into a cleaner environment. Eventually, the new line of code goes through enough cleanings that it can be introduced into the production bubble.
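The dev-to-production pipeline above can be sketched as a shell script. This is a minimal sketch, not a real deployment tool: the environment directories, artifact name, and gate functions are hypothetical stand-ins for whatever build system and test suites you actually run. The point is the shape - each promotion to a cleaner environment only happens after the previous gate passes.

```shell
#!/bin/sh
# Sketch of gated promotion: an artifact only moves to a cleaner
# environment after passing the tests that guard it. All names here
# (env/, myapp-1.2.3.tar.gz, the *_tests functions) are hypothetical.
set -e  # stop at the first failed gate; a dirty build never advances

ARTIFACT=myapp-1.2.3.tar.gz
mkdir -p env/dev env/test env/staging env/production
: > "env/dev/$ARTIFACT"   # pretend the dev build produced an artifact

unit_tests()        { true; }   # stand-in for the unit-test suite
integration_tests() { true; }   # stand-in for QA's integration suite
regression_tests()  { true; }   # stand-in for staging regression tests

promote() {
    # Copy the artifact from a dirtier environment to a cleaner one.
    cp "env/$1/$ARTIFACT" "env/$2/$ARTIFACT"
    echo "promoted $ARTIFACT: $1 -> $2"
}

unit_tests        && promote dev test
integration_tests && promote test staging
regression_tests  && promote staging production
```

Because of `set -e`, a failing gate aborts the whole run: nothing downstream of the failure ever sees the artifact, which is exactly the sterilization property we want.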

Environments, like cleanrooms, must be isolated from anything that could affect them. In an odd twist, cleanrooms have their own infectious nature: for a cleanroom to remain clean, anything that can feed into or affect it has to be at the same level of cleanliness. If you touch something in the bubble, you're part of the bubble. If something touches you, then that something is now also part of the bubble. Being part of the bubble means meeting the exact same stringent requirements for cleanliness, or bug-free-ness.

So, that one cronjob server that is used, among other things, to control when the production backups occur? It's officially part of the prod bubble and has to be treated as such.

Finally, connection points between environments must be dead-drops. This is a concept from spycraft: one person (usually the traitor) puts the stolen files somewhere, and the other person (usually the handler) picks them up later. The two people are never in contact with each other, reducing the chance that a counter-espionage team will be able to figure out the hole in their security. In IT terms, updates destined for a cleaner environment (such as production) are pushed to a central location (such as an internal package server for rpm or apt). Agents within the production environment (such as Puppet or Chef) then retrieve and deploy the updates on their own schedule, possibly after running checks to make sure the updates are safe to deploy.
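The dead-drop can be sketched in a few lines of shell. This is an illustration under assumptions, not a real package server: a shared directory stands in for the internal rpm/apt repository, and the file names and checksum step are hypothetical. What matters is that the two halves never talk to each other - the publisher only writes to the drop, and the agent only reads from it, on its own schedule.

```shell
#!/bin/sh
# Sketch of a dead-drop deploy. The "drop" directory stands in for an
# internal package server; "prod_releases" stands in for production.
# Names and paths are hypothetical.
set -e
DROP=drop
PROD=prod_releases
mkdir -p "$DROP" "$PROD"

# --- Publisher side (e.g., CI, after the staging gates pass) ---
# Push the package and a checksum to the drop, then walk away.
echo "app v2 payload" > "$DROP/app.pkg"
sha256sum "$DROP/app.pkg" | awk '{print $1}' > "$DROP/app.pkg.sha256"

# --- Agent side (e.g., a Puppet/Chef run, on its own schedule) ---
# The agent verifies the drop before letting anything into the bubble.
want=$(cat "$DROP/app.pkg.sha256")
have=$(sha256sum "$DROP/app.pkg" | awk '{print $1}')
if [ "$want" = "$have" ]; then
    cp "$DROP/app.pkg" "$PROD/app.pkg"
    echo "deployed app.pkg to production"
else
    echo "checksum mismatch; refusing to deploy" >&2
    exit 1
fi
```

The checksum check is the agent's "is this safe to deploy?" step: production pulls on its own terms rather than letting a dirtier environment push into the bubble.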