Friday, August 30, 2013

Provisions to last the journey

In the last post, I talked about Vagrant as half of the most important tool IT organizations have gained in the past decade. This post talks about the other half - the provisioner.

The problem we need to solve is replicating the build of a server exactly over and over. In the past (some 10 years ago), I would use Norton Ghost (a backup utility) to clone a server once I had it set up perfectly, then restore that clone to the other servers. And that worked great, so long as I never needed to change what was on that server. For example, a web-server might have had Apache, various modules (mod_perl, mod_proxy, mod_rewrite, etc), and the MySQL client software. Then, we would install the language dependencies (at the time, I was writing Perl apps) from CPAN. We would take a Ghost at that point, replicate that out, then deploy the application using SVN. If we needed a new module or a new version of a module, that required a new Ghost. If we needed a new Apache module or an upgrade, that required a new Ghost. It only took an hour or two, but it was very manual.

This worked great, for production. All of our production servers would be exactly the same, because they were clones of the same Ghost. But, since the production configuration would be on the Ghost, we couldn't use that in QA or in development.

The other problem was that we had no record of what we were doing. Nothing was in source control, largely because there was nothing to put in source control. SVN (and now Git) are only really useful with text files. (Yes, they take binary files, but only as opaque blobs you can't diff or merge. Not useful.) This meant no code reviews, no history, and no controls. Everyone had to be a sysadmin.

I've heard of other places using a master package (rpm or deb) that does nothing but require all the other packages necessary for the server to be setup properly. And, this works great . . . until it doesn't. The syntax for building packages can be inscrutable. And, while you can do anything in a package (because packages are just tarballs of scripts with metadata), it's very dangerous to allow anyone the ability to do anything. Even if there are no bad actors, everyone is still a human. Humans make mistakes and making mistakes as root is a good way to lose weekends rebuilding from tape.

Luckily, there is a better way.

Unlike the virtualization manager (Vagrant), there are several good choices for a provisioner. Puppet and Chef are the two big ones right now, but several others are nipping at their heels. They differ in various ways, but all of them provide the same primary function - describing how a server should be set up in a parseable format. If you are underwhelmed, just wait a few minutes. (I'll use Puppet in my examples because it's the one I'm using right now. All these examples could be written just as easily in Chef, SaltStack, or Ansible. Juju is a little different.)

The basic unit of work is the manifest (in Puppet) or cookbook (in Chef). This is what contains the parseable description of what needs to be accomplished. In both, you describe what you want to exist, after execution is complete. (Unlike a script, you do not describe how to do it or in what order - it's the provisioner's job to figure that out.) So, you might have something like:

$name = "apache"
package { "apache2":
  require => User[$name],
}
group { $name:
  ensure => "present",
}
user { $name:
  ensure => "present",
  gid => $name,
  require => Group[$name],
}

This would install the apache2 package (found in Ubuntu), create an 'apache' group and an 'apache' user. You'll notice that the apache2 package requires the apache user. So, creating the user would run before installing the package, even though it's defined afterwards. So, define things in the order that makes sense and the provisioner will figure things out. This means, however, that when you watch it run, things won't run in the same order from time to time, and that's okay.

Provisioners are designed to run again and again. They are idempotent, meaning that they will only do something if it hasn't been done already. This property is extremely powerful because we can make a change to a manifest (or cookbook) and, when we run it, only the change (and anything dependent on that change) will execute. This solves the upgrade problem we had with Ghost.
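
For built-in resources like package, user, and group, the provisioner checks the current state before doing anything. For arbitrary commands, you tell it how to check. Here's a minimal Puppet sketch of that idea - the command and paths are made up for illustration, but exec and its creates parameter are standard Puppet:

file { "/opt/myapp":
  ensure => "directory",
}
exec { "extract-app":
  command => "/bin/tar -xzf /tmp/myapp.tar.gz -C /opt/myapp",
  creates => "/opt/myapp/index.html",
  require => File["/opt/myapp"],
}

The first run extracts the tarball. On every run after that, Puppet sees that /opt/myapp/index.html already exists and skips the command entirely.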

Now, we have an executable description of what a given server should look like. The best part? It's in plaintext. We're going to check this description into our source control so that we can track the changes necessary for each request. We can now treat this as any other code - with changesets, pair programming, and code reviews. Changes to servers can be deployed like every other piece of code in our application. Best of all, they can be tied to the application changes that spawned the need for them (if appropriate). So, our structural changes go through the exact same QA process as the application changes, increasing our confidence in them.

These days, it's really hard to argue against using a provisioner. We can argue which provisioner to use, but it's kinda like using source control. We can argue Git vs. Subversion vs. Mercurial vs. Darcs vs. Bazaar. But, no-one is arguing for the position of "Don't want it." The same should go for provisioners.

Tuesday, August 27, 2013

Use Vagrant for a Great Good

Vagrant is one half of the best tool for IT organizations in the past decade. Hands down. And I'm going to tell you exactly why you are going to believe me.

No-one focuses on it and no-one cares about it, but environment mismatches are one of the biggest problems IT organizations face. It's a silent threat that doesn't take down whole sites. It's more insidious, only biting you every few months. Things that pass QA sometimes mostly work in production. It's really hard to replicate that production bug outside of production, so you write it off as a heisenbug. Or maybe the test suite passes on the developer's machine and the QA's machine, but sometimes fails in Jenkins. So, you disable that test from running in Jenkins because you've already wasted three days trying to figure it out.

Everyone kinda knows what the root problem is - you bitch about it over lunch every so often. But, it seems like such a high-class problem, so you don't fix it. Yeah, sure, Github and Etsy do it, but those are huge teams with tons of operations resources to put towards making everything perfect, right?

Not really. Both of them are actually small teams, relatively speaking. And, they don't devote huge amounts of time to it. They just do things right from the get-go. There's a bunch of tools these and similar teams use. The first and most foundational tool is Vagrant.

Vagrant is virtualization made easy. Vagrant creates and manages a semi-anonymous virtual machine (VM) using a simple configuration file (called a Vagrantfile). There are three basic commands:

  • vagrant up
  • vagrant halt
  • vagrant ssh
(There's more to it - a total of 15 commands as of this writing, but those are the three big ones.) And they do exactly what they say on the tin - bring the VM up, bring it down, and log in to it. It works with VirtualBox, VMware, and several other virtualization providers.

That's secret sauce #1 - Vagrant is just sugar around virtualization providers. It does all the heavy lifting of setting up the VM, managing it, and making sure it doesn't conflict with other VMs. (Yes, we're going to talk about multi-VM setups!)

So, now you've created a VM. So what? Because the setup of the VM is automated and everything is checked into your source control, every user of this repository has the exact same VM setup on their machine. As the setup of the server changes, a quick vagrant reload and everyone is in sync again.
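
Here's a minimal Vagrantfile sketch. The box name, box URL, and Puppet paths are just placeholders for whatever your team standardizes on:

# Vagrantfile, checked into the root of the repository
Vagrant.configure("2") do |config|
  config.vm.box     = "precise64"
  config.vm.box_url = "http://files.vagrantup.com/precise64.box"

  # Hook in a provisioner - the other half of the story
  config.vm.provision "puppet" do |puppet|
    puppet.manifests_path = "puppet/manifests"
    puppet.manifest_file  = "site.pp"
  end
end

With that file in the repository, vagrant up on any machine produces the same VM.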

Setting up multiple VMs is also very simple. You might want to do this for all kinds of reasons.
  1. An application server and its database.
    1. If they're both in the same repository, the same Vagrantfile can define both VMs (there's a sketch of this after the list).
    2. If they're not, each repository has its own Vagrantfile. In this case, defining your own subnet works wonders. (I like 33.33.33.xx - it's a block assigned to the US DoD that you'll never see routed on your network, so it won't collide with anything.)
    3. Remember - coworkers shouldn't share cups, plates, or databases. It's just not sanitary.
  2. Front-end developers working with services.
    1. The services can run on their own VMs and be deployed to as if they were in the QA environment. Your designers can now work on their code without having to know how the services are managed AND not have conflicts.
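
For case 1 above, the multi-VM Vagrantfile is just two define blocks. The machine names, box, and IPs here are arbitrary placeholders:

Vagrant.configure("2") do |config|
  config.vm.box = "precise64"

  config.vm.define "app" do |app|
    app.vm.network "private_network", ip: "33.33.33.10"
  end

  config.vm.define "db" do |db|
    db.vm.network "private_network", ip: "33.33.33.20"
  end
end

vagrant up brings up both; vagrant up db brings up just the database VM, and the app VM can reach it at 33.33.33.20.
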
So, when do you want to set up a VM? I strongly believe that every source code repository should have its own VM. This includes backend code, like Python or Ruby applications as well as front-end code, like Backbone or Ember applications.

"Rob, really?! Front-end code? Doesn't it run in the browser already? Why go through all the hassle of setting up a VM?"

Yes, really, for several reasons:
  1. Front-end applications may run in the browser, but they aren't built in the browser. Sass/Compass, Less - these are all versioned tools that depend on a toolchain of specific versions.
  2. No-one ever works on a single project these days. Each project has its own toolchain, but many of these tools expect to be installed globally.
  3. Most front-end applications depend on some REST API to be available. If it's not, you may choose to build a stub application instead of hard-coding the responses in text files (there's a stub sketch after this list). Now you have a back-end application that needs to be managed.
  4. Test tools often want to run in a server. This is especially true for PhantomJS and ZombieJS. It really sucks when your testing frameworks aren't in sync between developers.
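
To illustrate point 3, a stub API can be tiny. Here's a sketch using Python and Flask - the route and the canned payload are invented, and any micro-framework would do:

# stub_api.py - a throwaway stand-in for the real user service
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/users/<int:user_id>")
def get_user(user_id):
    # Canned response with just enough shape for the front-end to build against
    return jsonify({"id": user_id, "name": "Test User", "role": "admin"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)

Run that inside the project's VM and every developer's front-end talks to the same fake service.
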
And, finally, Vagrant provides the foundation for the other half of the most important tool of the past decade - the provisioner.

Tuesday, August 6, 2013

Designing for testability

I'm going to assume you agree that writing tests is good and that 100% code coverage (or as close to it as possible) is a great ideal to strive for.

Testing stuff is hard. Any stuff. By anyone. (QA teams don't have it any easier.) This is true if you don't have tests and if you have tests. And, sometimes, the tests you have make it harder to write more tests.

The root problem is testability. I define testability as "the ease by which a system is verifiable." (This is different from "How well can someone describe a testcase." The latter is a skill of the person, the former an attribute of the system.) The easier a system is to test, the greater its testability.

Testability affects and is affected by everything. Every decision made by anyone on the project can reduce the project's testability. Often in ways that aren't obvious until months later. For example, the ops team adds a new service and it needs a configuration file. The person in charge of doing it is focused on getting this service up and running, so they hard-code the file's path into a module that's included in the application. They didn't know the dev team's process for adding a new configuration file - they're ops, not dev. But, that's now a block to testability. Instead of creating a new configuration file with appropriate values for testing and pointing the code at it, the tester has to put the file in that spot. The spot might be in a directory that requires privileges to write in, meaning tests now have to run with elevated privileges. It's also a spot which might change later, intermittently breaking the test suite in hard-to-diagnose ways.
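
A sketch of the difference in Python - the module, path, and file format are invented for illustration:

import json

# Hard to test: the path is baked in, and /etc usually needs root to write to.
CONFIG_PATH = "/etc/newservice/settings.json"

def load_settings():
    with open(CONFIG_PATH) as fh:
        return json.load(fh)

# Easier to test: the path is a parameter with a sensible default, so a test
# can point it at a fixture checked into the repository.
def load_settings_testable(path="/etc/newservice/settings.json"):
    with open(path) as fh:
        return json.load(fh)

# In a test: load_settings_testable("tests/fixtures/settings.json")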

A lot of ink (digital and not) has been spent on discussing ways of improving the application code within a system to make it easier to write unit-tests. An incomplete list would include:
  • Decoupling
  • Interfaces
  • Mock objects
A nearly equal amount has described how to write integration tests, though with less prescription for making a system more testable (we'll see why in a later post). And, still further, people have talked about other ways of distinguishing this test from that test.

At the heart, testing any system is just this:

  1. Hook up an input stream with testing data
  2. Hook up monitors on an output stream
  3. Run the test
This process works for everything, so we'll look at it in the light of a car. When I take my car into the local oil change place, they test a whole bunch of components in my car, not just the oil. For example, to test the transmission fluid, they:
  1. (input) Extract a small amount of fluid from my transmission and put it on a card.
  2. (output) The card has a reference color on it.
  3. (run test) Compare the fluid color against the reference color using a Mark-1 eyeball.
That's a highly repeatable and very strong test. It's cheap to execute (for time, materials, and training) and it works. (Happily for me, they are able to do this - the transmission fluid in one of my older cars was filthy and would have caused the transmission to fail if it hadn't been changed. I wouldn't have known to do it otherwise.) They test the air filter, the transmission fluid, the lights, the wipers - pretty much every component in my car. 

Well, not quite. They test every highly-testable component in my car. They don't test the integrity of the engine mounts, the safety of the seat-belts, or if the airbags are charged. Why not? What's different about those components that makes tests for them much harder?

Unlike the various fluids and filters, the airbags (for example) aren't designed to be tested. There may be very good reasons for that, but that's not the question. If there were a car whose airbags were designed in such a way that my oil-change place could cheaply test their charge, they would jump all over it. Running several dozen cheap tests makes clueless drivers (like me!) want to use them, and the more they can test, the more they will find that (legitimately) needs to be replaced. (Likely by them, because why go somewhere else?)

The oil change experience also gives us another crucial point - unit tests and integration tests are the same thing. The mechanics use different inputs, outputs, and tests when examining different components. But, the point of input, the point of output, and the expectation are all well-defined. There's no distinction between someone who is capable of judging the transmission fluid vs. the performance of the car as a whole. Nor is there a distinction between the types of tests (or inspections, as they call them).

More on this in part 2.

Wednesday, July 31, 2013

Deployment is not source control (pt 3)

(This is the third part of a series on deployment. See part 1 and part 2.)

The deployment process I've outlined in the previous two posts works really well in a continuous deployment environment, where changesets that are merged to master go up quickly to production. It also works really well for mainline development, where changes track in one direction only and all changesets start from master - that is, when there are no bugs in production that have to be fixed right away, before what's in master can be promoted to production.

You and I don't work on projects or teams that run continuous deployment. (Very few teams do, for good reasons.) There will come a time, possibly often, where you will need to fix a bug in production and cannot run the change through the master branch first. You need a hotfix.

The hotfix process is very similar to the mainline process outlined in part 1. The primary difference are the branch and merge points. Mainline development always branches from and merges to master. (You never develop directly in master.) Hotfixes, on the other hand, branch from and merge to the version of master which was used to build what is currently in production. They go through the same process of building a package, promoting to a hotfix-test environment (separate from the mainline test environment), then promoting to production. (This requires a separate hotfix-test package repository.)

At this point, we have successfully promoted our hotfix to production. One last item remains - we need to merge our hotfix into the current mainline development. If you've done everything right, this is one of the only two places where merge conflicts can occur. (The other is when pulling master into your development branch.) Both git and mercurial will apply the new diff (of the hotfix) to the sequence of changes right after the production diff, then apply the subsequent changes made to master on top of it. If any of the diffs conflict, the merging developer will need to fix the conflicts.
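
Sketched as Git commands - the tag and branch names are placeholders for whatever convention your team uses:

# Production was built from the commit tagged prod-2013-08-20
git checkout -b hotfix/issue-1234 prod-2013-08-20

# ...fix the bug, commit, build the package, and promote it through
# hotfix-test to production, exactly as in the mainline process...

# Then fold the hotfix back into mainline development
git checkout master
git merge hotfix/issue-1234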

After the conflicts (if any) have been fixed, all that's left is to pull the hotfix changes into any existing development branches. And, we're done!

Monday, July 29, 2013

Deployment is not source control (pt. 2)

(This is the second post in a series on deployment. See part 1 and part 3.)

+Roman Daszczyszak had a question on Google+ in response to Deployment is not source control. He asked:
While I agree with your points, how do you apply this to developing a web application? My team has run into problems trying to properly package a Django-based app with a Mongo backend. Thoughts?
Whenever anything is installed (or deployed - it's the same thing), there are a set of steps that must be completed. For a standard web application (Django, Rails, etc), that could be something like:

  1. Login to the webserver.
  2. Copy the source code (via git, scp, rsync, etc) into the installation directory (/var/www, etc).
  3. Install any necessary prerequisites (frameworks, libraries, language modules, etc).
  4. Run a script to set things up (compiling / uglifying, configuration, softlinks, etc).
  5. Restart the service (Apache, FastCGI, etc).
  6. Repeat this process for each webserver in the group.
OS packages (rpm or deb) are designed to handle steps 2-5. While each packaging format has its stronger and weaker points, all of them can do the following (there's a build sketch after this list):
  • Bundle files into a logical hierarchy
  • Execute scripts (in any language) at different points in the installation process
  • Specify prerequisites (including specific versions to require or exclude)
  • Execute tests to ensure a good installation
  • Allow for arbitrary metadata to be stored for later queries
  • Roll back to a previously installed version (the most important function)
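
As a concrete sketch, here is how such a package might be built with fpm, a popular packaging helper. The name, version, dependencies, and paths are all placeholders:

fpm -s dir -t deb \
    -n myapp \
    -v 1.20130729120000 \
    --depends apache2 --depends libapache2-mod-wsgi \
    --prefix /var/www/myapp \
    --after-install postinst.sh \
    build/

The result is a single .deb that carries the built assets, declares its dependencies, and runs postinst.sh at install time.
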
One important point to remember is that the files in source control are often not the files that belong on the production server. While this is true for compiled applications (such as Apache and MySQL), it has become true for web applications as well. Javascript and CSS assets are often uglified and compressed. You may not even be writing in CSS - Sass/Compass and Less are becoming excellent frameworks to use. Your Javascript assets may have been written in Coffeescript, your HTML in Jade or HAML, and images may be sprited.

This leads us to an important rule of thumb:
Each server should have exactly what it needs to perform its tasks and nothing more.
Applying that to our packaging means the package should only install the compiled, compressed, and otherwise-mangled files that will actually be served from the webserver. If you're putting gcc, git, or make on your production servers, you're doing it wrong. The package should have the compiled versions, not the source versions. It may have templated configuration files ("Insert hostname here"), but the template isn't installed - only the result of filling in the template.

Frameworks, such as Django, and datastores, such as MongoDB, have already been built into packages by their maintainers. Specifying them as dependencies allows the package to be self-describing.

The metadata associated with the package is important to the success of the process. The package version is required. I've found that using "1.[timestamp]" is a good monotonically-increasing version number. As this is only released internally, a nonsensical version number is good enough.

All the packaging formats allow setting arbitrary metadata on a package. A good set of metadata includes:
  • The timestamp this package was built.
  • The SCM identifier of the commit used to build the package (git SHA1, SVN version, etc).
  • The issue number for the changeset that was merged to master that triggered this package build.
With that metadata, any person in the company can hit an internal website and see exactly what the last build to each environment is and what issues are in test that aren't in production. Your issue tracker should be able to provide this, but your servers should also be able to tell you this.
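
A sketch of generating that version at build time and querying it later on an installed server - the package name myapp is a placeholder:

# At build time
VERSION="1.$(date +%Y%m%d%H%M%S)"

# On a Debian/Ubuntu server
dpkg-query -W -f='${Version}\n' myapp

# On a RedHat/CentOS server
rpm -q --queryformat '%{VERSION}\n' myapp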

So far, we have discussed putting together the application and its on-server dependencies. Roman's question asked about MongoDB. I'll expand it to datastores in general. It's good practice to keep application servers and datastores on separate horizontal groups. This allows operations to balance the needs of one vs. the other. It's extremely rare for both application and datastore to grow at exactly the same pace. So, we have to figure out a way of managing cross-server dependencies. (This problem also arises when dealing with multiple applications supporting the same product. The solution is the same.)

Datastore change management can and should also be managed with packages. Packages aren't just a set of files to be applied. A package is a set of actions that need to be taken in order to upgrade an installation from version X to version Y. The most common thing to do is provide a new set of files, but a set of actions (such as "ALTER TABLE" statements) is also appropriate. By applying datastore changes with packages, you are now able to ask your datastore "What version are you?" and make decisions based on that. One decision could be "Version X of the application cannot be installed because the datastore is not at version Y."
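
As a sketch, a Debian post-install script for a schema package might look like this - the database and migration file names are invented, and rpm's %post scriptlet plays the same role:

#!/bin/sh
# postinst for a hypothetical myapp-schema package
set -e

if [ "$1" = "configure" ]; then
    # Apply the change that takes the schema from version X to version Y
    mysql myapp_db < /usr/share/myapp-schema/migrations/0007-add-orders-index.sql
fi

The package's own version then answers the "What version are you?" question through the package manager, as above.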

Roman - I hope this helps!

Thursday, July 25, 2013

Deployment is not source control

(This is the first post in a series on deployment. See part 2 and part 3.)

A source control manager (or SCM) is the second most-important tool an application team can use, right after a good editor. It preserves history, maintains context, and makes perfect julienne fries, every time. Everything should go into source control - source code, tests, requirements, configuration, build scripts, deployment tools. Everything. Building an application that isn't managed in source control is like trying to cross the Grand Canyon on a high wire - without the high wire.

But, as much as source control is a phenomenal tool, it is not the right tool for every purpose. No-one would replace vim or Sublime with Git or Mercurial. That just makes no sense. Which is why I'm always baffled when I come into a team and see deployments managed with git branches.

Deployment is the process of taking a product from environment A to environment B, usually from test (or beta) to staging (or user-acceptance), then to production. An environment isn't just the application code that lives on a single server. It's the entire stack of processes, such as the database, application(s), third-party libraries, configuration, background jobs, and services that go into providing the features of your product. Ensuring that all the different pieces of that stack are in sync at all times is the major function of deployment.

In order to do this, the deployment tool must understand dependencies. Dependencies between application code and third-party libraries on the same server are just the start of this. Dependency-tracking across server groups, between the application code and the database version, and even configuration changes are all components of this. And everything has to move in lockstep.

There is no single tool that, to my knowledge, manages the entire stack in this holistic fashion. But, an application team can make life a lot simpler for themselves by doing one simple thing - deploy with OS packages and not source control.

OS packaging tools (such as RPM and APT) have been around for decades. They are the way to deploy libraries and applications to Linux (and Windows, thanks to Chocolatey). They manage dependencies, put everything in the right place, update configuration, verify compatibility, and do everything else necessary to make sure that, when they're done, the requested thing works. Often, this means setting specific compilation switches (or even pre-compiling for specific architectures). They encode knowledge that is often hard-won and difficult to rediscover. And, finally, they let a user ask the server what is installed, revert to a previous version, or even uninstall the package (and all downstream dependencies) altogether.

Source control does not do any of those things. Source control is designed to do one and only one thing - track and manage changes between versions of groups of text files. Modern SCMs (such as Git and Mercurial) do this very very well.

Managing a deployment requires a package. When QA approves a specific deployment within their test environment, operations needs to "make prod look like test". The way ops can ensure that production will look exactly like test is to build production exactly as test was built. Server build tools (like Puppet and Chef) help ensure that the servers (or VMs) are built exactly the same every time. The application (and its configuration) needs to have the same treatment.

So, I recommend the following process (a sketch of the build-and-promote steps follows the list):

  1. Do your development as you normally do right now. (I will have thoughts on the rest of this later, but those are another set of posts.)
  2. Once a changeset is merged into the primary branch (master for Git, default for Mercurial):
    1. It is tagged with the name of the changeset.
    2. An OS package is built and uploaded to the test package repository.
  3. The OS package is deployed to the test environment.
    1. This can happen either automatically or as a result of a user action.
  4. QA verifies the build.
    1. If it fails, issues are opened and the development process begins anew.
    2. If it fails catastrophically, the environment is reverted.
  5. When QA passes the build, the package is copied into the production package repository.
    1. The commit that was used to build this package is tagged with the date it was promoted to production.
  6. The package is applied to the production environment at the appropriate time.
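
Here's a sketch of what steps 2 and 5 might look like as shell run by the CI server. The package name, repository hosts, and the use of fpm and scp are all placeholders - a real apt or yum repository would use its own import tooling (reprepro, createrepo, etc):

# Step 2: a changeset has just been merged to master
VERSION="1.$(date +%Y%m%d%H%M%S)"
git tag "build-${VERSION}"
fpm -s dir -t deb -n myapp -v "${VERSION}" build/
scp "myapp_${VERSION}_amd64.deb" repo@test-repo:/srv/apt/incoming/

# Step 5: QA has passed the build - promote the same artifact, never rebuild it
scp "myapp_${VERSION}_amd64.deb" repo@prod-repo:/srv/apt/incoming/
git tag "prod-$(date +%Y-%m-%d)" "build-${VERSION}"
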
At the point of merging into the primary branch, the SCM has finished its job. It's now the job of the package manager to replicate that branch out to the various environments in the correct order with the correct dependencies.

Tuesday, July 23, 2013

The canonical source

No, I'm not talking about Mark Shuttleworth's attempt to define Linux for the rest of the world. (Though, come to think of it, that's probably the source of the name. He wants to be the canonical source of all things open-source. Hence, the creation of Juju when Puppet and Chef would seem to be perfectly good solutions to the devops problem.)

To be the canonical source for something means to be the ultimate authority for how that thing is structured. Whenever a new copy of this thing is created, the canonical source is consulted to create it. Whenever a change is made, the change is first made in the canonical source and those changes propagate outwards from it. If this were a religious blog, we would discuss the Bible and its roots in the Alpha and the Omega. If this were a legal blog, we'd be talking about constitutions and common law.

In IT, there are dozens of things that are copied and slung around. Database schemas, server configurations, applications, third-party tool configurations, and the like. And each one of them has a canonical source.

The only issue most organizations have is that they haven't clearly defined exactly what the canonical source is for each component in their applications. (Frankly, most organizations don't have a complete list of the components!) Why is this an issue?

Let's digress for a minute and peek over at the DRY principle. It normally is discussed in terms of code and is the explanation given for why refactoring is a good idea. Instead of having the same validation code at the beginning of four subroutines, you pull out the validation into its own subroutine and call that instead. This way, if that validation would ever change (and it will change), it is changed in one place and everywhere that needs it automagically (an awesome word!) receives the update. Without you having to do anything. Without you even having to know everywhere that needed the change.
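
The classic before-and-after, sketched in Python - the validation rule itself is just a stand-in:

# Before: the same check copy-pasted into four subroutines
def create_account(email):
    if "@" not in email or email.startswith("@"):
        raise ValueError("invalid email")
    ...

# After: one canonical definition, called from everywhere that needs it
def validate_email(email):
    if "@" not in email or email.startswith("@"):
        raise ValueError("invalid email")

def create_account(email):
    validate_email(email)
    ...

When the rule changes, it changes in validate_email and nowhere else.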

Most people involved in the creation of software instinctively understand that this is a good idea in software. There should be a single place where the Luhn Algorithm or the email verification algorithm is defined. Why would someone want to have it in two places?

Can the same be said about your database schema? Or the structure for your production servers? What about your testing infrastructure? Do you have canonical sources for each one? And if you do, do you have processes in place that ensure the canonical source is modified first, then the changes flow from there? Is your documentation built from the same source?

If your team cannot point to the canonical source of something, that means one of three things:

  1. There isn't a canonical source.
  2. There's more than one canonical source.
  3. The canonical source is in production.
If there wasn't a canonical source, or a Single Source Of Truth, your application would be in a disastrous mess. Regressions would be occurring on a regular basis and testing would be ineffective (at best). Having more than one canonical source is the exact same thing. (Having two ultimate sources of truth is exactly the same as having none. Which is exactly what the Roman Catholic Church realized about popes.)

So, this means your canonical source is whatever is currently working in production. This would seem to have a nice poetic ring to it - whatever your users see is the canonical source for what your team works on. Many teams operate exactly like this.

Well, they operate poorly like this. Two problems rear their ugly heads very quickly. The first is usually simple-ish to fix. How should someone build a new instance of the canonical item (such as a database or server)? Cloning production would require (in many cases) taking something offline. Taking any part of production offline is usually a "Bad Idea"(tm), so it's only done very rarely.

The second problem is much more insidious. If production is canonical and it is the cleanroom, how do you safely push changes up to it?