Tuesday, July 23, 2013

The canonical source

No, I'm not talking about Mark Shuttleworth's attempt to define Linux for the rest of the world. (Though, come to think of it, that's probably the source of the name. He wants to be the canonical source of all things open-source. Hence, the creation of Juju when Puppet and Chef would seem to be perfectly good solutions to the devops problem.)

To be the canonical source for something means to be the ultimate authority for how that thing is structured. Whenever a new copy of this thing is created, the canonical source is consulted to create it. Whenever a change is made, it is first made in the canonical source and then propagates outward from there. If this were a religious blog, we would discuss the Bible and its roots in the Alpha and the Omega. If this were a legal blog, we'd be talking about constitutions and common law.

In IT, there are dozens of things that are copied and slung around. Database schemas, server configurations, applications, third-party tool configurations, and the like. And each one of them has a canonical source.

The only issue most organizations have is that they haven't clearly defined exactly what the canonical source is for each component in their applications. (Frankly, most organizations don't have a complete list of the components!) Why is this an issue?

Let's digress for a minute and peek over at the DRY (Don't Repeat Yourself) principle. It is normally discussed in terms of code and is the explanation given for why refactoring is a good idea. Instead of having the same validation code at the beginning of four subroutines, you pull out the validation into its own subroutine and call that instead. This way, if that validation ever changes (and it will change), it is changed in one place and everywhere that needs it automagically (an awesome word!) receives the update. Without you having to do anything. Without you even having to know everywhere that needed the change.

Most people involved in the creation of software instinctively understand that this is a good idea in software. There should be a single place where the Luhn Algorithm or the email verification algorithm is defined. Why would someone want to have it in two places?
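To make that concrete, here is a minimal sketch in Python of what "a single place" could look like for the Luhn check. The function names `luhn_is_valid`, `charge_card`, and `save_card` are made up for illustration; the point is only that both callers lean on one definition instead of carrying their own copy.

```python
def luhn_is_valid(number: str) -> bool:
    """The one and only definition of the Luhn check in this sketch."""
    # Allow the spaces and hyphens people commonly type into card numbers.
    cleaned = "".join(ch for ch in number if ch not in " -")
    if not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, ch in enumerate(reversed(cleaned)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def charge_card(card_number: str) -> str:
    # Hypothetical caller: it asks the canonical check, it does not copy it.
    if not luhn_is_valid(card_number):
        raise ValueError("invalid card number")
    return "charged"


def save_card(card_number: str) -> str:
    # A second hypothetical caller sharing the very same definition.
    if not luhn_is_valid(card_number):
        raise ValueError("invalid card number")
    return "saved"
```

If the rules for what counts as a valid number ever change, `luhn_is_valid` is the only thing that gets edited, and both callers pick up the change without being touched.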

Can the same be said about your database schema? Or the structure for your production servers? What about your testing infrastructure? Do you have canonical sources for each one? And if you do, do you have processes in place that ensure the canonical source is modified first, then the changes flow from there? Is your documentation built from the same source?
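As one possible illustration of the schema question, here is a small Python sketch in which a table is described exactly once and both the SQL and the documentation are generated from that description. Everything in it is an assumption made up for the example (the `users` table, its columns, and the helpers `render_ddl` and `render_docs`); real teams often use migration tools for this, and this is not meant as a substitute for one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str
    description: str
    nullable: bool = False


# The canonical source in this sketch: a hypothetical "users" table.
TABLE_NAME = "users"
COLUMNS = [
    Column("id", "INTEGER", "Surrogate key."),
    Column("email", "TEXT", "Login address."),
    Column("nickname", "TEXT", "Optional display name.", nullable=True),
]


def render_ddl(table_name: str, columns: list[Column]) -> str:
    """Build a CREATE TABLE statement from the canonical description."""
    lines = []
    for col in columns:
        null_clause = "" if col.nullable else " NOT NULL"
        lines.append(f"    {col.name} {col.sql_type}{null_clause}")
    return f"CREATE TABLE {table_name} (\n" + ",\n".join(lines) + "\n);"


def render_docs(table_name: str, columns: list[Column]) -> str:
    """Build plain-text documentation from the same description."""
    lines = [f"Table: {table_name}"]
    for col in columns:
        required = "optional" if col.nullable else "required"
        lines.append(f"  {col.name} ({col.sql_type}, {required}): {col.description}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_ddl(TABLE_NAME, COLUMNS))
    print(render_docs(TABLE_NAME, COLUMNS))
```

Adding a column means editing `COLUMNS` and nothing else. The schema and its documentation cannot drift apart, because neither one is ever written by hand.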

If your team cannot point to the canonical source of something, that means one of three things:

  1. There isn't a canonical source.
  2. There's more than one canonical source.
  3. The canonical source is in production.

If there weren't a canonical source, or a Single Source Of Truth, your application would be a disastrous mess. Regressions would occur on a regular basis and testing would be ineffective (at best). Having more than one canonical source amounts to the same thing. (Having two ultimate sources of truth is exactly the same as having none. Which is exactly what the Roman Catholic Church realized about popes.)

So, if your application works at all, the first two possibilities are ruled out, which means your canonical source is whatever is currently working in production. This would seem to have a nice poetic ring to it - whatever your users see is the canonical source for what your team works on. Many teams operate exactly like this.

Well, they operate poorly like this. Two problems rear their ugly heads very quickly. The first is usually simple-ish to fix. How should someone build a new instance of the canonical item (such as a database or server)? Cloning production would require (in many cases) taking something offline. Taking any part of production offline is usually a "Bad Idea"(tm), so it's only done very rarely.

The second problem is much more insidious. If production is canonical and it is the cleanroom, how do you safely push changes up to it?


  1. I think that there is one other possibility. The canonical source could be in the test suite (as it can be with Test-Driven Development). I think that meets the definition of a canonical source (as you have presented it). I have worked (and continue to work) in TDD environments where everything is subject to testing (yes, that includes the database structures and initialization data). I also like the built-in side effect that a needed feature (a feature for which a test has been defined, but which is not implemented) is swiftly identifiable in TDD and the implementation team is notified of the needed feature on a regular basis (until, of course, it is successfully implemented).

  2. I agree - a test suite can be a great canonical source for requirements, such as Cucumber specifications being "executable specifications". It does require that new features be written as tests first, at both the integration (user-level) and the unit (package/module-level) layers. Using the Red-Green-Refactor process makes this a lot easier.

    This is a great point. Thanks for commenting!