Friday, August 30, 2013

Provisions to last the journey

In the last post, I talked about Vagrant as half of the most important tool IT organizations have gained in the past decade. This post talks about the other half - the provisioner.

The problem we need to solve is replicating the build of a server exactly over and over. In the past (some 10 years ago), I would use Norton Ghost (a backup utility) to clone a server once I had it set up perfectly, then restore that clone to the other servers. And that worked great, so long as I never needed to change what was on that server. For example, a web-server might have had Apache, various modules (mod_perl, mod_proxy, mod_rewrite, etc), and the MySQL client software. Then, we would install the language dependencies (at the time, I was writing Perl apps) from CPAN. We would take a Ghost at that point, replicate that out, then deploy the application using SVN. If we needed a new module or a new version of a module, that required a new Ghost. If we needed a new Apache module or an upgrade, that required a new Ghost. It only took an hour or two, but it was very manual.

This worked great, for production. All of our production servers would be exactly the same, because they were clones of the same Ghost. But, since the production configuration would be on the Ghost, we couldn't use that in QA or in development.

The other problem was that we had no record of what we were doing. Nothing was in source control, largely because there was nothing to put in source control. SVN (and now Git) are only really useful with text files. (Yes, they take binary files, but only as undifferentiated blobs. Not useful.) This meant no code reviews, no history, and no controls. Everyone had to be a sysadmin.

I've heard of other places using a master package (rpm or deb) that does nothing but require all the other packages necessary for the server to be set up properly. And, this works great . . . until it doesn't. The syntax for building packages can be inscrutable. And, while you can do anything in a package (because packages are just tarballs of scripts with metadata), it's very dangerous to allow anyone the ability to do anything. Even if there are no bad actors, everyone is still a human. Humans make mistakes and making mistakes as root is a good way to lose weekends rebuilding from tape.

Luckily, there is a better way.

Unlike the virtualization manager (Vagrant), there are several good choices for a provisioner. Puppet and Chef are the two big ones right now, but several others are nipping at their heels. They differ in various ways, but all of them provide the same primary function - describing how a server should be set up in a parseable format. If you are underwhelmed, just wait a few minutes. (I'll use Puppet in my examples because it's the one I'm using right now. All these examples could be written just as easily in Chef, SaltStack, or Ansible. Juju is a little different.)

The basic unit of work is the manifest (in Puppet) or cookbook (in Chef). This is what contains the parseable description of what needs to be accomplished. In both, you describe what you want to exist, after execution is complete. (Unlike a script, you do not describe how to do it or in what order - it's the provisioner's job to figure that out.) So, you might have something like:

$name = "apache"

# install the web server package; Puppet will order this after the user exists
package { "apache2":
  ensure  => "installed",
  require => User[$name],
}

# the group has to exist before the user that belongs to it
group { $name:
  ensure => "present",
}

# the service account the package resource depends on
user { $name:
  ensure  => "present",
  gid     => $name,
  require => Group[$name],
}

This would install the apache2 package (found in Ubuntu), create an 'apache' group, and create an 'apache' user. You'll notice that the apache2 package requires the apache user, so creating the user runs before installing the package, even though it's defined afterwards. Define things in the order that makes sense and the provisioner will figure out the execution order. This means, however, that when you watch it run, things won't always run in the same order from run to run, and that's okay.

Provisioners are designed to run again and again. They are idempotent, meaning that running them twice has the same effect as running them once - they only do work that hasn't already been done. This property is extremely powerful because we can make a change to a manifest (or cookbook) and, when we run it, only that change (and anything dependent on it) will execute. This solves the upgrade problem we had with Ghost.
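
For example, suppose we later edit the package resource in the manifest above to pin a specific version. (This is a minimal sketch - the version string is invented for illustration.) On the next run, only this resource, and anything that depends on it, does any work; the user and group already exist, so the provisioner leaves them alone:

package { "apache2":
  ensure  => "2.2.22-1ubuntu1",  # illustrative version - only this resource changes on the next run
  require => User["apache"],
}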

Now, we have an executable description of what a given server should look like. The best part? It's in plaintext. We're going to check this description into our source control so that we can track the changes necessary for each request. We can now treat this as any other code - with changesets, pair programming, and code reviews. Changes to servers can be deployed like every other piece of code in our application. Best of all, they can be tied to the application changes that spawned the need for them (if appropriate). So, our structural changes go through the exact same QA process as the application changes, increasing our confidence in them.

These days, it's really hard to argue against using a provisioner. We can argue about which provisioner to use, but it's kinda like source control. We can argue Git vs. Subversion vs. Mercurial vs. Darcs vs. Bazaar. But, no-one argues for going without source control entirely. The same should go for provisioners.

Tuesday, August 27, 2013

Use Vagrant for a Great Good

Vagrant is one half of the best tool IT organizations have gained in the past decade. Hands down. And I'm going to tell you exactly why you are going to believe me.

No-one focuses on it and no-one cares about it, but environment mismatches are one of the biggest problems IT organizations face. It's a silent threat that doesn't take down whole sites. It's more insidious, only biting you every few months. Things that pass QA sometimes only mostly work in production. It's really hard to replicate that bug in production. So, you write it off as a heisenbug. Or maybe the test suite passes on the developer's machine and the QA engineer's machine, but sometimes fails in Jenkins. So, you disable that test from running in Jenkins because you've already wasted three days trying to figure it out.

Everyone kinda knows what the root problem is - you bitch about it over lunch every so often. But, it seems like such a high-class problem, so you don't fix it. Yeah, sure, Github and Etsy do it, but those are huge teams with tons of operations resources to put towards making everything perfect, right?

Not really. Both of them are actually small teams, relatively speaking. And, they don't devote huge amounts of time to it. They just do things right from the get-go. There's a bunch of tools these and similar teams use. The first and most foundational tool is Vagrant.

Vagrant is virtualization made easy. Vagrant creates and manages a semi-anonymous virtual machine (VM) using a simple configuration file (called a Vagrantfile). There are three basic commands:

  • vagrant up
  • vagrant halt
  • vagrant ssh
(There's more to it - a total of 15 commands as of this writing, but those are the three big ones.) And they do exactly what they say on the tin - bring the VM up, bring it down, and log in to it. It works with VirtualBox, VMware, and several other virtualization providers.
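
To give you a feel for it, here's a minimal Vagrantfile sketch (the box name and IP address are just illustrative assumptions, not anything you have to use):

# Vagrantfile - a minimal sketch
Vagrant.configure("2") do |config|
  config.vm.box = "precise64"                             # base image the VM is built from
  config.vm.network :private_network, ip: "33.33.33.10"   # private network address for the VM
end

Check that file into the repository, run vagrant up, and the VM exists.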

That's secret sauce #1 - Vagrant is just sugar around virtualization providers. It does all the heavy lifting of setting up the VM, managing it, and making sure it doesn't conflict with other VMs. (Yes, we're going to talk about multi-VM setups!)

So, now you have created a VM. So what? Because the setup of the VM is automated and everything is checked into your source control, every user of this repository has the exact same VM setup on their machine. As the setup of the server changes, a quick vagrant reload and everyone is in sync again.

Setting up multiple VMs is also very simple - there's a sketch after the list below. You might want to do this for all kinds of reasons.
  1. An application server and its database.
    1. If they're both in the same repository, the same Vagrantfile can define both VMs.
    2. If they're not, each repository has its own Vagrantfile. In this case, defining your own subnet works wonders. (I like 33.33.33.xx - it's assigned to the DoD and isn't routed on the public internet.)
    3. Remember - coworkers shouldn't share cups, plates, or databases. It's just not sanitary.
  2. Front-end developers working with services.
    1. The services can run on their own VMs and be deployed to as if they were in the QA environment. Your designers can now work on their code without having to know how the services are managed AND not have conflicts.
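
Here's what that first case might look like - a minimal multi-VM sketch, with the machine names, box, and IPs invented for illustration:

# Vagrantfile - an app server and its database in one repository
Vagrant.configure("2") do |config|
  config.vm.box = "precise64"

  config.vm.define "app" do |app|
    app.vm.network :private_network, ip: "33.33.33.10"
  end

  config.vm.define "db" do |db|
    db.vm.network :private_network, ip: "33.33.33.11"
  end
end

With this, vagrant up brings up both machines, and vagrant ssh app (or db) logs into just one of them.
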
So, when do you want to set up a VM? I strongly believe that every source code repository should have its own VM. This includes backend code, like Python or Ruby applications as well as front-end code, like Backbone or Ember applications.

"Rob, really?! Front-end code? Doesn't it run in the browser already? Why go through all the hassle of setting up a VM?"

Yes, really, for several reasons:
  1. Front-end applications may run in the browser, but they aren't built in the browser. Compass/SASS, Less - these are all tools that are versioned and depend on a toolchain of specific versions.
  2. No-one ever works on a single project these days. Each project has its own toolchain, but many of these tools expect to be installed globally.
  3. Most front-end applications depend on some REST API to be available. If it's not, you may choose to build a stub application instead of hard-coding the responses in text files. Now you have a back-end application that needs to be managed.
  4. Test tools often want to run in a server. This is especially true for PhantomJS and ZombieJS. It really sucks when your testing frameworks aren't in sync between developers.
And, finally, Vagrant provides the foundation for the other half of the most important tool of the past decade - the provisioner.

Tuesday, August 6, 2013

Designing for testability

I'm going to assume you agree that writing tests is good and that 100% code coverage (or as close to it as possible) is a great ideal to strive for.

Testing stuff is hard. Any stuff. By anyone. (QA teams don't have it any easier.) This is true if you don't have tests and if you have tests. And, sometimes, the tests you have make it harder to write more tests.

The root problem is testability. I define testability as "the ease with which a system can be verified." (This is different from "How well can someone describe a testcase." The latter is a skill of the person, the former an attribute of the system.) The easier a system is to test, the greater its testability.

Testability affects and is affected by everything. Every decision made by anyone on the project can reduce the project's testability. Often in ways that aren't obvious until months later. For example, the ops team adds a new service and it needs a configuration file. The person in charge of doing it is focused on getting this service up and running, so they hard-code the file's path into a module that's included in the application. They didn't know the dev team's process for adding a new configuration file - they're ops, not dev. But, that's now a block to testability. Instead of creating a new configuration file with appropriate values for testing and pointing the code at it, the tester has to put the file in that spot. The spot might be in a directory that requires privileges to write in, meaning tests now have to run with elevated privileges. It's also a spot which might change later, intermittently breaking the test suite in hard-to-diagnose ways.

A lot of ink (digital and not) has been spent on discussing ways of improving the application code within a system to make it easier to write unit-tests. An incomplete list would include:
  • Decoupling
  • Interfaces
  • Mock objects
A nearly equal amount has described how to write integration tests, though with less prescription for making a system more testable (we'll see why in a later post). And, still further, people have talked about other ways of distinguishing this test from that test.

At the heart, testing any system is just this:

  1. Hook up an input stream with testing data
  2. Hook up monitors on an output stream
  3. Run the test
This process works for everything, so we'll look at it in the light of a car. When I take my car into the local oil change place, they test a whole bunch of components in my car, not just the oil. For example, to test the transmission fluid, they:
  1. (input) Extract a small amount of fluid from my transmission and put it on a card.
  2. (output) The card has a reference color on it.
  3. (run test) Compare the fluid color against the reference color using a Mark-1 eyeball.
That's a highly repeatable and very strong test. It's cheap to execute (for time, materials, and training) and it works. (Happily for me, they are able to do this - the transmission fluid in one of my older cars was filthy and would have caused the transmission to fail if it hadn't been changed. I wouldn't have known to do it otherwise.) They test the air filter, the transmission fluid, the lights, the wipers - pretty much every component in my car. 

Well, not quite. They test every highly-testable component in my car. They don't test the integrity of the engine mounts, the safety of the seat-belts, or if the airbags are charged. Why not? What's different about those components that makes tests for them much harder?

Unlike the various fluids and filters, the airbags (for example) aren't designed to be tested. There may be very good reasons for that, but that's not the question. If there were a car whose airbags were designed in such a way that my oil changing place could cheaply test their charge, they would jump all over it. Running several dozen cheap tests makes clueless drivers (like me!) want to use them, and the more they can test, the more they will find that (legitimately) needs replaced. (Likely, by them, because why go somewhere else?)

The oil change experience also gives us another crucial point - unit tests and integration tests are the same thing. The mechanics use different inputs, outputs, and tests when examining different components. But, the point of input, the point of output, and the expectation are all well-defined. There's no distinction between someone who is capable of judging the transmission fluid vs. the performance of the car as a whole. Nor is there a distinction between the types of tests (or inspections, as they call them).

More on this in part 2.

Wednesday, July 31, 2013

Deployment is not source control (pt 3)

(This is the third part of a series on deployment. See part 1 and part 2.)

The deployment process I've outlined in the previous two posts works really well in a continuous deployment environment, where changesets that are merged to master go up quickly to production. It also works really well for mainline development, where changes track in one direction only, all changesets start from master, and there are no bugs in production that have to be fixed right away, before what's in master can be promoted to production.

You and I don't work on projects or teams that run continuous deployment. (Very few teams do, for good reasons.) There will come a time, possibly often, where you will need to fix a bug in production and cannot run the change through the master branch first. You need a hotfix.

The hotfix process is very similar to the mainline process outlined in part 1. The primary difference is the branch and merge points. Mainline development always branches from and merges to master. (You never develop directly in master.) Hotfixes, on the other hand, branch from and merge to the version of master which was used to build what is currently in production. They go through the same process of building a package, promoting to a hotfix-test environment (separate from the mainline test environment), then promoting to production. (This requires a separate hotfix-test package repository.)

At this point, we have successfully promoted our hotfix to production. One last item remains - we need to merge our hotfix into the current mainline development. If you've done everything right, this is one of only two places where merge conflicts can occur. (The other is when pulling master into your development branch.) Both Git and Mercurial will apply the new diff (of the hotfix) to the sequence of changes right after the production diff, then apply the subsequent changes made to master on top of it. If any of the diffs conflict, the merging developer will need to fix the conflicts.
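
In Git terms, that flow might look something like this (just a sketch - the tag, branch, and issue names are invented):

# branch from whatever built what's currently in production (tagged at promotion time)
git checkout -b hotfix/issue-451 prod-2013-08-20

# ...fix, commit, build the hotfix package, promote it through hotfix-test...

# once the hotfix is live, fold it back into mainline
git checkout master
git merge hotfix/issue-451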

After the conflicts (if any) have been fixed, all that's left is to pull the hotfix changes into any existing development branches. And, we're done!

Monday, July 29, 2013

Deployment is not source control (pt. 2)

(This is the second post in a series on deployment. See part 1 and part 3.)

+Roman Daszczyszak had a question on Google+ in response to Deployment is not source control. He asked:
While I agree with your points, how do you apply this to developing a web application? My team has run into problems trying to properly package a Django-based app with a Mongo backend. Thoughts?
Whenever anything is installed (or deployed - it's the same thing), there is a set of steps that must be completed. For a standard web application (Django, Rails, etc), that could be something like:

  1. Login to the webserver.
  2. Copy the source code (via git, scp, rsync, etc) into the installation directory (/var/www, etc).
  3. Install any necessary prerequisites (frameworks, libraries, language modules, etc).
  4. Run a script to set things up (compiling / uglifying, configuration, softlinks, etc).
  5. Restart the service (Apache, FastCGI, etc).
  6. Repeat this process for each webserver in the group.
OS packages (rpm or deb) are designed to handle steps 2-5. While each packaging format has its stronger and weaker points, all of them can do the following (there's a short sketch of what this looks like in practice after the list):
  • Bundle files into a logical hierarchy
  • Execute scripts (in any language) at different points in the installation process
  • Specify prerequisites (including specific versions to require or exclude)
  • Execute tests to ensure a good installation
  • Allow for arbitrary metadata to be stored for later queries
  • Roll back to a previously installed version (the most important function)
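
On the server, that boils down to commands like these (a sketch using rpm and yum - the package name and versions are invented):

# install a specific build of the application
yum install myapp-1.20130830103000

# ask the server what is installed and where it came from
rpm -qi myapp

# roll back to the previously released build
rpm -Uvh --oldpackage myapp-1.20130823091500-1.noarch.rpm
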
One important point to remember is that the files in source control are often not the files that belong on the production server. While this is true for compiled applications (such as Apache and MySQL), it has become true for web applications as well. Javascript and CSS assets are often uglified and compressed. You may not even be writing in CSS - Sass/Compass and Less are becoming excellent frameworks to use. Your Javascript assets may have been written in Coffeescript, your HTML in Jade or HAML, and images may be sprited.

This leads us to an important rule of thumb:
Each server should have exactly what it needs to perform its tasks and nothing more.
Applying that to our packaging means the package should only install the compiled, compressed, and otherwise-mangled files that will actually be served from the webserver. If you're putting gcc, git, or make on your production servers, you're doing it wrong. The package should have the compiled versions, not the source versions. It may have templated configuration files ("Insert hostname here"), but the template isn't installed - only the result of filling in the template.

Frameworks, such as Django, and datastores, such as MongoDB, have already been built into packages by their maintainers. Specifying them as dependencies allows the package to be self-describing.

The metadata associated with the package is important to the success of the process. The package version is required. I've found that using "1.[timestamp]" is a good monotonically-increasing version number. As this is only released internally, a nonsensical version number is good enough.

All the packaging formats allow setting arbitrary metadata on a package. A good set of metadata includes:
  • The timestamp this package was built.
  • The SCM identifier of the commit used to build the package (git SHA1, SVN version, etc).
  • The issue number for the changeset that was merged to master that triggered this package build.
With that metadata, any person in the company can hit an internal website and see exactly what the last build to each environment is and what issues are in test that aren't in production. Your issue tracker should be able to provide this, but your servers should also be able to tell you this.

So far, we have discussed putting together the application and its on-server dependencies. Roman's question asked about MongoDB. I'll expand it to datastores in general. It's good practice to keep application servers and datastores on separate horizontal groups. This allows operations to balance the needs of one vs. the other. It's extremely rare for both application and datastore to grow at exactly the same pace. So, we have to figure out a way of managing cross-server dependencies. (This problem also arises when dealing with multiple applications supporting the same product. The solution is the same.)

Datastore change management can and should also be managed with packages. Packages aren't just a set of files to be applied. A package is a set of actions that need to be taken in order to upgrade an installation from version X to version Y. The most common thing to do is provide a new set of files, but a set of actions (such as "ALTER TABLE" statements) is also appropriate. By applying datastore changes with packages, you are now able to ask your datastore "What version are you?" and make decisions based on that. One decision could be "Version X of the application cannot be installed because the datastore is not at version Y."
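
Asking a datastore host which schema version it is at then becomes a one-liner (the package name here is hypothetical):

# what version of the schema package is installed on this host?
dpkg-query -W -f='${Version}\n' myapp-schema    # deb-based systems
rpm -q --qf '%{VERSION}\n' myapp-schema         # rpm-based systems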

Roman - I hope this helps!

Thursday, July 25, 2013

Deployment is not source control

(This is the first post in a series on deployment. See part 2 and part 3.)

A source control manager (or SCM) is the second most-important tool an application team can use, right after a good editor. It preserves history, maintains context, and makes perfect julienne fries, every time. Everything should go into source control - source code, tests, requirements, configuration, build scripts, deployment tools. Everything. Building an application that isn't managed in source control is like trying to cross the Grand Canyon on a high wire - without the high wire.

But, as much as source control is a phenomenal tool, it is not the right tool for every purpose. No-one would replace vim or Sublime with Git or Mercurial. That just makes no sense. Which is why I'm always baffled when I come into a team and see deployments managed with git branches.

Deployment is the process of taking a product from environment A to environment B, usually from test (or beta) to staging (or user-acceptance), then to production. An environment isn't just the application code that lives in a single server. It's the entire stack of processes, such as the database, application(s), third-party libraries, configuration, background jobs, and services that go into providing the features of your product. Ensuring that all the different pieces of that stack are in sync at all times is the major function of deployment.

In order to do this, the deployment tool must understand dependencies. Dependencies between application code and third-party libraries on the same server are just the start of this. Dependency-tracking across server groups, between the application code and the database version, and even configuration changes are all components of this. And everything has to move in lockstep.

There is no single tool that, to my knowledge, manages the entire stack in this holistic fashion. But, an application team can make life a lot simpler for themselves by doing one simple thing - deploy with OS packages and not source control.

OS packaging tools (such as RPM and APT) have been around for decades. They are the way to deploy libraries and applications to Linux (and Windows, thanks to Chocolatey). They manage dependencies, put everything in the right place, update configuration, verify compatibility, and do everything else necessary to make sure that, when they're done, the requested thing works. Often, this means setting specific compilation switches (or even pre-compiling for specific architectures). They encode knowledge that is often hard-won and difficult to rediscover. And, finally, they let a user ask the server what is installed, revert to a previous version, or even uninstall the package (and all downstream dependencies) altogether.

Source control does not do any of those things. Source control is designed to do one and only one thing - track and manage changes between versions of groups of text files. Modern SCMs (such as Git and Mercurial) do this very very well.

Managing a deployment requires a package. When QA approves a specific deployment within their test environment, operations needs to "make prod look like test". The way ops can ensure that production will look exactly like test is to build production exactly as test was built. Server build tools (like Puppet and Chef) help ensure that the servers (or VMs) are built exactly the same every time. The application (and its configuration) needs to have the same treatment.

So, I recommend the following process (a sketch of the build step follows the list):

  1. Do your development as you normally do right now. (I will have thoughts on the rest of this later, but those are another set of posts.)
  2. Once a changeset is merged into the primary branch (master for Git, default for Mercurial):
    1. It is tagged with the name of the changeset.
    2. An OS package is built and uploaded to the test package repository.
  3. The OS package is deployed to the test environment.
    1. This can happen either automatically or as a result of a user action.
  4. QA verifies the build.
    1. If it fails, issues are opened and the development process begins anew.
    2. If it fails catastrophically, the environment is reverted.
  5. When QA passes the build, the package is copied into the production package repository.
    1. The commit that was used to build this package is tagged with the date it was promoted to production.
  6. The package is applied to the production environment at the appropriate time.
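
Step 2 is the only part that usually needs new tooling, and it's small. Here's one way it might look, using fpm to build the package - this is just a sketch, not something the process mandates, and every name and path is invented:

# run by the build server after a changeset lands on master
ts=$(date +%Y%m%d%H%M%S)
git tag "build-$ts"

# package the built artifacts (not the source tree) into a deb
fpm -s dir -t deb -n myapp -v "1.$ts" -a all -C build/ .

# drop the result into the test package repository
scp myapp_1.${ts}*.deb repo.internal:/srv/repo/test/
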
At the point of merging into the primary branch, the SCM has finished its job. It's now the job of the package manager to replicate that build out to the various environments in the correct order with the correct dependencies.

Tuesday, July 23, 2013

The canonical source

No, I'm not talking about Mark Shuttleworth's attempt to define Linux for the rest of the world. (Though, come to think of it, that's probably the source of the name. He wants to be the canonical source of all things open-source. Hence, the creation of Juju when Puppet and Chef would seem to be perfectly good solutions to the devops problem.)

To be the canonical source for something means to be the ultimate authority for how that thing is structured. Whenever a new copy of this thing is created, the canonical source is consulted to create it. Whenever a change is made, the change is first made in the canonical source and those changes propagate outwards from it. If this was a religious blog, we would discuss the Bible and its roots in the Alpha and the Omega. If this was a legal blog, we'd be talking about constitutions and common law.

In IT, there are dozens of things that are copied and slung around. Database schemas, server configurations, applications, third-party tool configurations, and the like. And each one of them has a canonical source.

The only issue most organizations have is that they haven't clearly defined exactly what the canonical source is for each component in their applications. (Frankly, most organizations don't have a complete list of the components!) Why is this an issue?

Let's digress for a minute and peek over at the DRY principle. It normally is discussed in terms of code and is the explanation given for why refactoring is a good idea. Instead of having the same validation code at the beginning of four subroutines, you pull out the validation into its own subroutine and call that instead. This way, if that validation would ever change (and it will change), it is changed in one place and everywhere that needs it automagically (an awesome word!) receives the update. Without you having to do anything. Without you even having to know everywhere that needed the change.

Most people involved in the creation of software instinctively understand that this is a good idea in software. There should be a single place where the Luhn Algorithm or the email verification algorithm is defined. Why would someone want to have it in two places?

Can the same be said about your database schema? Or the structure for your production servers? What about your testing infrastructure? Do you have canonical sources for each one? And if you do, do you have processes in place that ensure the canonical source is modified first, then the changes flow from there? Is your documentation built from the same source?

If your team cannot point to the canonical source of something, that means one of three things:

  1. There isn't a canonical source.
  2. There's more than one canonical source.
  3. The canonical source is in production.
If there wasn't a canonical source, or a Single Source Of Truth, your application would be in a disastrous mess. Regressions would be occurring on a regular basis and testing would be ineffective (at best). Having more than one canonical source is the exact same thing. (Having two ultimate sources of truth is exactly the same as having none. Which is exactly what the Roman Catholic Church realized about popes.)

So, this means your canonical source is whatever is currently working in production. This would seem to have a nice poetic ring to it - whatever your users see is the canonical source for what your team works on. Many teams operate exactly like this.

Well, they operate poorly like this. Two problems rear their ugly heads very quickly. The first is usually simple-ish to fix. How should someone build a new instance of the canonical item (such as a database or server)? Cloning production would require (in many cases) taking something offline. Taking any part of production offline is usually a "Bad Idea"(tm), so it's only done very rarely.

The second problem is much more insidious. If production is canonical and it is the cleanroom, how do you safely push changes up to it?

Saturday, July 20, 2013

Promoting an environment of clean rooms

In a previous post, I talked about how to take something that's a mess and keep it from getting worse. You may not be able to fix it, but you can at least not contribute to the mess. Today is about what a clean room gets you.

The concept of "cleanroom" work (notice the pun?) appears in a lot of disciplines. All computing hardware, especially CPUs and disks, is built in cleanrooms. The tolerances for these (and most other computing hardware) are so small that any dust or electrostatic charge would make the final product useless. (Even with this, Intel is rumored to have at best a 95% yield, much worse on newer processes.)

Like so many other things, the concept originates with medicine. When the germ theory was just starting to gain acceptance, early disease control would distinguish between "cleanroom" and "sickroom". The protocol (official process) was that anything in the cleanroom could be taken into the sickroom, but nothing from the sickroom could be taken into the cleanroom without being sterilized.  This isolates the germs in the sick areas and prevents contamination. This protocol is still used today, both by parents dealing with sick children and as the basis for how epidemiologists work with highly infectious diseases like Ebola and meningitis.

The best way to think about this is with bubble children. Because they have no immune system, they have to live in a completely isolated and sterile environment - a bubble. Everything (diapers, food, water, books) has to be sterilized and kept sterile before it can be introduced into the bubble. The moment anything non-sterile (a dirty fork) touches something meant for the bubble (a book), the book is back at square one, having to go through the whole sterilization procedure again.

In IT, we promote our applications from environment to environment (notice the second pun?). Developers do their work in a "development" or "dev" environment (ideally on their local machines). When they have finished, the work is promoted to a "test" or "beta" environment to be evaluated by the QA staff. When they have finished, the work is promoted (possibly through a "staging" environment for regression testing) to "production". Production is the bubble.

This idea of successive steps of verification is exactly the same as disinfecting anything that leaves a sickroom. Instead of dealing with infections, we deal with bugs. When someone writes a line of code, that line is maximally buggy. The more tests we apply to it (code review, unit-tests, integration tests, regression tests, etc), the more we can say "This line of code doesn't have that bug." We achieve a relatively high degree of confidence that the new line of code is sufficiently bug-free that it can move into a "cleaner environment". Eventually, the new line of code goes through enough cleanings that it can be introduced into the production bubble.

Environments, like cleanrooms, must be isolated from anything that could impact them. In an odd twist, cleanrooms have their own infectious nature. In order for a cleanroom to remain clean, anything that can feed into or affect the cleanroom has to be at the same level of cleanliness. If you touch something in the bubble, you're part of the bubble. If something touches you, then that something is now also part of the bubble. Being part of the bubble means you have to qualify for the exact same stringent requirements for cleanliness, or bug-free-ness.

So, that one cronjob server that is used, among other things, to control when the production backups occur? It's officially part of the prod bubble and has to be treated as such.

Finally, connection points between environments must be dead-drops. This is a concept from spycraft where one person (usually the traitor) puts the stolen files somewhere. The other person (usually the handler) picks up the files later. The two people are never in contact with each other, reducing the chance that a counter-espionage team will be able to figure out the hole in their security. In IT terms, updates to a cleaner environment (such as production) are pushed to a central location (such as an internal package server for rpm or apt). Agents within the production environment (such as Puppet or Chef) are then able to retrieve and deploy the updates on their schedule. Possibly after doing checks to make sure the updates are safe to deploy.
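
In Puppet terms, the production side of that dead-drop might look like this minimal sketch (the repository URL and package name are invented):

# production nodes only know about the internal repository - never about the build machines
yumrepo { "internal-prod":
  baseurl  => "http://packages.internal/prod",
  enabled  => 1,
  gpgcheck => 0,
}

package { "myapp":
  ensure  => "latest",
  require => Yumrepo["internal-prod"],
}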

Wednesday, July 17, 2013

What is good documentation?

Imagine this - you are assigned to a new project. An admin gives you permissions on the repository and you grab a copy of the code (via git, svn, or whatever). As you look in the directory, you see a mishmash of files and directories. You recognize some of them (lib/, src/, test/, and so on), but some of them are weird (server/, conf/ vs. config/ vs. cnf/). And, sure enough, the project lead is on vacation this week and the other developer is in meetings all morning. Slashdot, here we come, right?

This sorry situation appears to be the norm in most development teams I've worked with. Most information about how to set up an environment, how to function within the team, and what the expectations are tends to be transmitted by word of mouth. What little is written down is always out of date or just blatantly wrong.

Everyone deplores the state of affairs, but very few do anything about it. It seems like a Sisyphean task, constantly pushing that boulder uphill without any support from anyone else. Management certainly doesn't agree that documentation is as important as shipping code. Developers are notorious for not wanting to write anything except code. Testers are supposedly clueless (more on this in another post). And, worst of all, ops thinks everyone else is an idiot out to make their life harder than it already is. No-one trusts anyone, so no-one is going to take the risk of actually writing something down.

Most application teams (dev, ops, analysts - everyone!) operate like guilds from the Middle Ages. Not with a deliberate intent to hide knowledge for the purposes of maintaining a monopoly to collect rents (though the effect ends up being the same). Instead, there's no provision made to actually manage the generation, storage, and transmission of information. So the lowly apprentice has to petition the journeymen and masters of the team to provide them with whatever scraps of information they can get. This becomes the root cause for two unfortunately common phenomena:
  1. Many companies expect a new employee to take up to 3 months to become useful.
    • And consume a "useful resource" at the same time.
  2. Senior developers are never allowed to move off a project.
Good documentation fixes both of these problems. 

So, what does good documentation even look like? The short answer is "good documentation is sufficient unto itself." The ideal is that most answers to most questions are not only within the documentation, but can be found by the average reader. It is clear, concise, complete, and comprehensible. It has enough information for the new reader, yet can be read as a reference by the old hand. And it is both current and versioned. Whenever a change happens in the code, it is reflected in the documentation.

Before you say "Impossible!", look at your favorite open-source products. The really good ones - you know which ones I'm talking about. These are the tools, frameworks, and modules where you can understand exactly what they will (and won't!) do within 15 minutes of easy reading. They have examples. Tutorials. References. The number of FAQs is very small.

Most importantly, the number of times you have to google for something or ask in an IRC channel or mailing list? Zero. Zilch. None.

Good documentation is also not always written. The very best form of documentation is executable. This guarantees that the documentation will always be current. It has to be, otherwise the project doesn't work. It's also documentation that everyone likes - it's code and it saves everyone time. Executable documentation includes:

  • Tests (unit-tests, integration tests, customer tests, etc)
    • Cucumber is ideal as executable documentation.
  • Deployment and environment management
  • Build tools (make, Ant, Maven, Grunt, Vagrant)

I'm going to be writing posts for each of these in the future.

Sunday, July 14, 2013

When is something done?

I have five kids. They're good kids, but they're kids. So, when they do something, some part always gets forgotten (usually cleaning up). A perfect example is my oldest, just turning 18. He loves steak. He loves eating steak and he loves cooking steak. He's quite good at both skills, too. If you ask him about steak, he can go on and on about the different cuts, different techniques, and even different seasonings. (Apparently, a rub and a marinade are never to be done together. I'll never make that mistake again!)

And, like many older teenagers, his sleep habits on weekends are not quite . . . habitual. A Saturday 3am steak craving has been known to happen. To his credit, I've never woken up to a fire alarm or any other emergency. But, I always know when he's had a craving. The kitchen is not quite as my wife and I left it the night before. The steak has been put away (in his stomach), but nothing else has.


As far as he is concerned, the process of cooking a steak is finished when the goal of eating a steak has been achieved. Every step to the right of the goal is unnecessary. Which is obviously not true. Processes have a lifecycle of their own, regardless of the goal(s) they are supporting. If he was cooking a steak for his significant other (which he has done), the process remains the same, even if the goal is different.

Every time I get on him about cleaning up after himself, he gives me this look. Every parent knows exactly what look I'm talking about. It's the "I'll do it because you'll punish me if I don't, but you haven't convinced me of why I should care" look. He's a hobbyist.

Compare this to what the chefs do on Hell's Kitchen. These cooks are not just professionals, they're consummate professionals. After a long day of challenges and a night of creating 5-star food while being yelled at in public, they are still cleaning their stations and washing the pots and pans. It doesn't matter how rough it got between them - at the end of the day (literally!), they work together to finish the process of preparing 5-star cuisine. The end of that process is to prepare for the next time the process is executed.

This separation of process from goal is equally true in every facet of our lives, and especially true in the development of software. The goal is obvious - a functioning application that our customers can use. The shortest path to that goal is:

  • Make production server (if necessary, otherwise skip to next step)
  • Edit code on production server
  • service apache restart (or however you source in the edited code)
  • Send email announcing new feature
This is called DIP, or "Developing In Production". It's what 99% of the world population thinks we do, including many stakeholders, most business analysts, and all users. Oh, and that one developer in the back who only works on ancient ASP4 or PHP3 apps. And, to a (very small) degree, this process works. The application does tend to work, to a large degree, most of the time. And, for hobbyists or businesses which can tolerate large amounts of downtime, this minimalist process could serve quite nicely.

For the rest of us . . . not so much.

DIP, as a process, leaves much (well, everything) to be desired. It is this-goal focused. We need to get this feature out. We need to fix this bug. The implicit understanding behind it is "And nothing else matters." There is no preparation made for the next execution of the process. The plate and utensils are just left on the table with the grease congealing. The pan is left on the cooling burner, a hard crust forming. Nobody puts away the seasonings and the sauce curdles overnight.

Tests aren't written, so nobody knows (or remembers) exactly how something is supposed to work. The environment setup isn't documented (or automated), so every server build is a one-off that takes days and is still never the same. There is no source control, so nobody knows why something was done. The ticketing system (if there is one) loses information, so nobody even knows what was done or when or for what purpose.

If the thing you did was the last time anyone would ever have to deal with that application, then none of this matters. But, what's the chance of that? In my 20 years of working in IT, that has never happened to me or anyone I have ever met or even in any of the stories they have told.

In short, every change you ever make to a system will have another change after it. The process of making that change isn't complete until you have cleaned up after yourself.

Wednesday, July 10, 2013

Keeping new things clean in a dirty world

Most software projects are unhygienic. Organization is poor, some scripts don't work, and it's likely that much or all of the test suite doesn't work, either. Very little is automated and nothing is documented. It's the software equivalent of living in a fraternity house. You have to carefully walk through the mess in the living room to get from the filthy kitchen to your disaster of a room.

No-one sets out to live in a dump, just like no-one sets out to work in a disaster of a software project. Little things add up, like that pizza sauce stain or where someone fell drunk against the wall. People tried to clean them up, but it was never the same. And they add up, day after day, until you can't see the walls for the empty beer bottles. And you're not sure what you can do to make a difference, short of condemning the whole thing and starting from scratch.

We can see exactly what is wrong with a fraternity. No-one in the fraternity is focused on cleanliness - on hygiene. If a dish falls on the floor, no-one makes it a priority to pick it up. If it is picked up, then it's not put away in the cupboard. If it's actually put away, then it's put away haphazardly, without being stacked nicely. And if, by some miracle, the dishes actually get stacked, it only happens the once. The next time, the dishes will be left to mold in a corner on the table. No-one cares.

There are fraternities where everything is kept neat and tidy. I lived in one (well, mostly neat and tidy). But, it requires twenty-somethings to do something they tend not to do - focus on their environment over themselves. It requires someone who takes charge and chivies everyone else to work at it. To make the big push to clean everything up. To spend their Sunday cleaning instead of sleeping off last night's party and cramming for tomorrow's exam. So that the house can start the week off neat and tidy.

Even that doesn't work. I can see some of you rolling your eyes in memory. Everything does get clean, but does it stay clean? Of course not. Because there's nothing in place to make sure that the work done remains done. There is no ongoing maintenance.

The root is that the house wasn't what needed cleaned. Yes, the beer bottles and pizza boxes needed to be thrown out. Yes, the dirty dishes needed washed and put away in nice stacks. But, the real problem was the mindset that let everything slide. The real fix is in retraining everyone in the house to feel uncomfortable when something is out of place, to itch under the skin when a dirty dish just lays on the end table. If nothing else is accomplished that Sunday afternoon except to fix that mindset, then the house will clean itself, as a matter of course.

So, when you add that configuration file to support the new module, don't just throw it anywhere like everyone else seems to have done. Put it where it should go. If you have to write a script to make sure it's shoehorned into the dirty way of doing things, then so be it. At least you have kept this small part of the house from being dirty. And, you have paved the way for someone else to clean up a piece that was dirty. You have started to create the itch for hygiene.

Sunday, July 7, 2013

Signs your software process isn't working

The business has made it. The product has been well-received. New customers are coming in the door and partners are signing up. You're even profitable! But, something is rotten in Denmark.

Software releases feel slower. You haven't measured it, but you're pretty sure that the number of regressions is up. Emergency releases seem to be happening more often. New features aren't happening as quickly as they used to. Some features are failing now, which never happened before. And when new features do get out, it seems like everyone is exhausted. Some of the early employees are moving on. Some of the first customers are complaining. It's just not as sparkly.

The good news is that it's not in your mind. There are real issues that are causing all the pain. They can be identified, measured, and surfaced to the rest of the organization.

The other good news is that you can fix it. There are real and concrete things your groups can do to reduce the turmoil. Some of the problems can even turn into assets after some grooming.

But, that's about the extent of the good news.

The bad news is that there is no quick fix. Switching your technology stack isn't going to fix it. Hiring a rockstar developer or a great project manager isn't going to fix it. Adding "Agile" (or kanban or whatever) is only going to muddy the waters. And throwing more people at it is only going to exacerbate the problems.

Yes, problems. Because there isn't just one problem. There are multiple problems that contribute to this sense of unease and dread. Not everyone has the same set of problems, and some groups are unique snowflakes and have their own special brand of crazy. But most groups tend to run into some combination of the same sets of issues.
  • Measurement is haphazard.
  • Cross-checking is haphazard.
  • Responsibilities aren't clearly defined.
  • There is no assembly line.
  • All knowledge is internalized.
These issues tend to be carryovers from the attributes that made the company successful in the first place. One or two highly motivated people start a project. They know everything and communicate with each other constantly. Quality is high because it's small. Process is whatever gets something out the door. And, amazingly, everything scales. Going from two or three people to ten means 5x as much work is getting done. So, why did everything just fall over when going from ten people to fifty?

Fred Brooks, in The Mythical Man-Month, touches on one of the root issues. With n people, there are n×(n−1)/2 one-to-one lines of communication. When dealing with three or four people, that number is manageable (three and six lines, respectively). Everyone can get into one room and share all the knowledge about everything. The business is small enough that a single person can keep the entire thing in their head.

Going up to ten people increases the number of lines to 45. This is much larger, but still manageable because we have specialized. Everyone isn't expected to be able to do everything anymore, so some information can be limited to sharing with just some people. And one person can still keep everything in their head, even if they do less and less of the daily work.

The more astute reader is starting to see where the problem starts to form. Everyone has worked at a place where it's nearly impossible to find out what you need to know in order to do your job. Information is siloed. It is rare that someone is actively hiding the information. (If that does happen, the solution is very simple, if emotionally and politically difficult to do.) More often, the person who has the information you need doesn't know you need it. Sometimes, you don't know you need it, until you can't move forward without it. And one person can no longer keep the entire business in their head.

At the root, there is a limit to the amount of information and cross-references a single person can handle. There is also the limit of how much work a single person can accomplish. We organize groups and companies to exceed those limits. One person doesn't have to communicate with anyone. A few people, usually less than a dozen, can communicate together clearly. One person can manage what a dozen or so people do. Beyond that, we need systems and processes in place to formalize how information and knowledge are organized, prioritized, and transferred between groups and people.

Alongside the problems of communicating information about today's tasks, we have to communicate yesterday's accomplishments and tomorrow's plans. New coworkers have to be trained and environments set up. Those who are leaving need to have their information retrieved. Everyone needs to know what will be coming. The number of information streams rapidly becomes unwieldy without explicit boundaries and organization around them.

In short, the corporate organism needs to learn how to think.