First, what are we looking for in this process? It's really hard to know if you've achieved something if you don't know what you're trying to achieve. Sounds pretty obvious, I know, but think back - how many projects have you worked on where that wasn't done? How successful were those projects?
What I want out of my operations process is this:
- A guarantee that every piece of infrastructure was:
  - created solely from things in source control.
  - changed solely from things in source control.
- A guarantee that I can find out:
  - what changed
  - why it changed
  - when it changed
  - who reviewed/approved the change
  - when it was applied to each instance in each environment
  - and which instances in which environments it hasn't been applied to
  - what the dependency graph is between this and other changes
Oh, and it has to stay out of the way and be easy to understand. Not so hard.
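To make the second guarantee concrete, here is a minimal sketch of the record you would need to keep per change. Everything in it is an assumption for illustration: the `Change` class, its field names, the `OPS-1` identifier, and the sample inventory are all made up, and in practice most of this data would live in your source control and issue tracker rather than in a Python object.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Tuple

Target = Tuple[str, str]  # (environment, instance)


@dataclass
class Change:
    """One infrastructure change and everything we want to be able to ask about it."""
    change_id: str                 # hypothetical id, e.g. an issue or commit id
    what: str                      # what changed
    why: str                       # why it changed
    committed_at: datetime         # when it changed
    approved_by: List[str]         # who reviewed/approved the change
    depends_on: List[str] = field(default_factory=list)   # ids of prerequisite changes
    applied: Dict[Target, datetime] = field(default_factory=dict)  # where and when applied

    def mark_applied(self, environment: str, instance: str, when: datetime) -> None:
        self.applied[(environment, instance)] = when

    def pending(self, inventory: List[Target]) -> List[Target]:
        """Return the targets in `inventory` that have not received this change."""
        return [target for target in inventory if target not in self.applied]


# Made-up inventory and change, purely for illustration.
inventory = [("staging", "web-1"), ("prod", "web-1"), ("prod", "web-2")]
change = Change(
    change_id="OPS-1",
    what="Raise the web server connection limit",
    why="Connection limit reached under load",
    committed_at=datetime(2024, 1, 10, 9, 0),
    approved_by=["reviewer-a"],
)
change.mark_applied("staging", "web-1", datetime(2024, 1, 10, 10, 0))
print(change.pending(inventory))  # [('prod', 'web-1'), ('prod', 'web-2')]
```

The point is not the class itself but that every question in the list maps to a field or a one-line query. If you can't fill in one of these fields for a given change, that is the gap in your process.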
These guarantees look very similar to the guarantees that the standard Agile development processes provide. The first is obviously implemented - devs aren't allowed to touch deployed servers. So, any change they want to see in the application must come from things in source control. (This is different from most non-Agile processes where code changes to deployed servers are sometimes emailed (or even IM'ed!) from dev to ops and implemented by hand because changes are so infrequent.)
The second is implemented through discipline. There is nothing in Agile that says all changes must be associated with an issue, confined within a branch, and go through a multi-person development/review process. But, every Agile team I have ever seen or heard of does it that way because doing it any other way has led to uncomfortable situations and unanswerable questions. It's so ubiquitous that tools now treat this as "Agile mode". GitHub's pull request process even creates an issue number just to ensure that there is a record in the issue tracker.
In order to apply this process to operations, we will need discipline of our own. Any operations team, to do its job, will be applying changes manually. This gets the job done now and is easy to reason about. Not a bad thing. Except, how do you know that what was done is what was defined in source control? This is where the discipline comes in.
Ideally, all changes to deployed servers happen via script. If those scripts are executed from a place separate from the deployed servers, that's even better. (For example, only using the AWS SDK to touch your AWS infrastructure.) The script can be a Puppet/Chef/Salt thing or Ruby scripts or even Bash. It doesn't matter, so long as the computer is the one actually doing the changes. The scripts are checked in and you treat them like application code, deployment included.
In short, you treat deployments of your changes exactly as you treat deployments of the applications under your management. Which makes sense because an application is more than just code - it's also the infrastructure.
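As a rough sketch of what "the computer does the change" can look like, the wrapper below runs a change script and appends a record of which revision was applied to which instance. The function name `apply_change`, the JSON-lines log format, and the `abc123` revision are assumptions for illustration; the revision is passed in by the caller rather than detected, and the "change script" here is a stand-in one-liner.

```python
import json
import os
import subprocess
import sys
import tempfile
from datetime import datetime, timezone


def apply_change(command, revision, instance, log_path):
    """Run `command` and append a record of the attempt to a JSON-lines log.

    `revision` identifies the checked-in version of the script being run
    (for example a commit hash supplied by the caller).
    """
    result = subprocess.run(command, capture_output=True, text=True)
    record = {
        "revision": revision,
        "instance": instance,
        "command": command,
        "applied_at": datetime.now(timezone.utc).isoformat(),
        "returncode": result.returncode,
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return record


# Stand-in for a real change script: a Python one-liner that simply succeeds.
demo_command = [sys.executable, "-c", "print('change applied')"]
log_file = os.path.join(tempfile.mkdtemp(), "applied.jsonl")
record = apply_change(demo_command, "abc123", "staging/web-1", log_file)
print(record["returncode"])  # 0
```

Because the record is written by the same code that runs the change, "when was it applied, and where?" stops depending on anyone's memory.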
"That's great in an ideal world, Rob, but nothing I have is scripted. It's all checklists. Now what?"
Checklists are scripts that run against a human virtual machine. If you look at the checklist, you should be able to replace many of the steps with "execute this script". Maybe even condense 2-3 steps into one script. Over time (and this is definitely a journey, not a destination), each checklist will condense into "invoke these N scripts". Which, itself, is just one script.
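Here is one way the end state might look, as a minimal sketch: the checklist becomes an ordered list of steps, each step is a command, and a small runner stops at the first failure. The step descriptions and the `run_checklist` name are hypothetical, and the commands are placeholder Python one-liners (the third one fails on purpose) standing in for your real scripts.

```python
import subprocess
import sys


def run_checklist(steps):
    """Run each (description, command) step in order; stop at the first failure.

    Returns the descriptions of the steps that completed successfully.
    """
    completed = []
    for description, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"FAILED: {description}")
            break
        completed.append(description)
    return completed


py = sys.executable
# Hypothetical checklist; each command is a placeholder for a real script.
steps = [
    ("Take the instance out of the load balancer", [py, "-c", "pass"]),
    ("Install the new package", [py, "-c", "pass"]),
    ("Run the smoke test", [py, "-c", "raise SystemExit(1)"]),  # fails on purpose
    ("Put the instance back in the load balancer", [py, "-c", "pass"]),
]
done = run_checklist(steps)
print(done)
```

A step you haven't scripted yet can stay in the list as a command that prints the instruction and waits for a human, which lets you convert the checklist one step at a time.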
"That great, but I don't even have checklists. My people just know what to do."
If they "just know what to do," then you do not. You do not know what they do, you do not know that they have done it, and you do not have control. Yet you're responsible for making sure they do it. And, most importantly, you're responsible for making sure new people can learn it. If it isn't written down, how are new people learning their job?