Tuesday, September 8, 2015

Devops - Where do I start?

In my last post, I laid out a series of questions every operations team should be able to answer. Everyone may agree that this list of operational capabilities is good, but getting from here to there is far more complicated. What should we do first?

The absolute first thing every operations team must do is get everything into source control. If it's not in source control, you cannot audit it, review it, or manage it. If a change happens, you don't know how, when, what, or why and there's no chain of custody showing what the approvals were. In short, you do not control it. Not controlling the stuff that makes your stuff isn't sane devops.

This implies there needs to be solid source control. The first choice is which tool to use. Always use a distributed version control system (DVCS) - Git or Mercurial - if at all possible. A DVCS has two massive advantages over a centralized version control system like Subversion, Perforce, or CVS (no links to bad choices!). First, DVCSs are far better tools for managing development changesets - there's lots of discussion about that around the web. The better reason, for operations, is that every clone carries the full history and can be used as the new master if everything else goes pear-shaped. Remember - the first question asks whether you can recover everything else from your production backups and the latest checkout of source control. A Subversion or CVS checkout holds only the working files, not the history, so you cannot recover source control itself from it. With a Git or Mercurial clone, you can.
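To make the "every clone can be the new master" point concrete, here is a minimal sketch in Python that wraps git's own `clone --bare` command. The function names and the example paths are mine, purely for illustration. One caveat: this copies the branches and tags that exist in the surviving clone, so anything that was never fetched into that clone is not recovered.

```python
import subprocess
from typing import List


def bare_clone_command(surviving_clone: str, new_master: str) -> List[str]:
    """Build the git command that copies a clone's history into a bare repo."""
    # `git clone --bare` creates a repository with no working tree, which is
    # the usual layout for a shared repository that people push to.
    return ["git", "clone", "--bare", surviving_clone, new_master]


def promote_clone(surviving_clone: str, new_master: str) -> None:
    """Run the clone; raises CalledProcessError if git reports a failure."""
    subprocess.run(bare_clone_command(surviving_clone, new_master), check=True)
```

Calling `promote_clone("/home/me/infra", "/srv/git/infra.git")` (both paths hypothetical) would leave a bare repository that everyone else can point their remotes at.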

If at all possible, use a service. (This, btw, is going to be a theme I'll expand on more later.) GitHub, AWS's CodeCommit, Atlassian's Bitbucket, or Google's Cloud Source Repositories are all excellent choices, along with many others. They all provide private repositories and are extremely scalable and secure. For nearly every organization, these services are capable enough. If you're already using Google's App Engine or the AWS suite of services, the choice is pretty simple. If you're using the hosted Atlassian suite, Bitbucket again seems to be an easy choice. GitHub is an excellent choice in most other scenarios. Sometimes, for corporate reasons, you have to host internally. In that case, you should strongly consider using GitLab or Atlassian Stash. In all cases, you should be using the same tools as your developers.

Once you've picked a tool and a method for hosting, the next step is to get everything into it. Literally and truly everything. If it's a script, check it in. If it's a configuration file, check it in. If it's a Chef recipe, Puppet manifest, Salt pillar, or any other file for a similar tool - yup, check it in. All the secrets (GPG-encrypted first, of course). All the scripts. All the configuration. Everything.

If you don't have an automated way of building it, then write down how to build it and check that in. Unless you have a really good reason not to, use a text-based markup language. It's important to use text because text is diff-able by Git/Mercurial. If it's not diff-able, then it becomes very difficult to see what changed when someone wants to change something. I prefer GitHub-flavored Markdown, but there are plenty of other good choices. Use the one that makes the most sense with the rest of your tooling landscape.

Note: I recommend both text (for diff-ability) and GPG-encryption (for secrets). GPG encryption is inherently not diff-able. For secrets, this is good. For instructions and scripts, you want diff-ability.
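One way to keep the text-versus-encrypted split honest is a small check that runs before a commit. The sketch below is only an illustration and rests on two conventions I am assuming, not ones stated above: secrets live under a directory named `secrets`, and encrypted files end in `.gpg` or `.asc`. Adjust both to match your own layout.

```python
from pathlib import Path
from typing import List

# Assumed conventions (hypothetical): change these to match your repository.
ENCRYPTED_SUFFIXES = {".gpg", ".asc"}
SECRETS_DIR = "secrets"


def find_unencrypted_secrets(repo_root: str) -> List[str]:
    """Return repo-relative paths under any secrets directory that do not
    carry an encrypted-file suffix."""
    root = Path(repo_root)
    offenders = []
    for path in root.rglob("*"):
        if not path.is_file():
            continue
        relative = path.relative_to(root)
        # Skip git's own metadata directory.
        if ".git" in relative.parts:
            continue
        # A file counts as a secret if any parent directory is named "secrets".
        in_secrets = SECRETS_DIR in relative.parts[:-1]
        if in_secrets and path.suffix not in ENCRYPTED_SUFFIXES:
            offenders.append(relative.as_posix())
    return sorted(offenders)
```

If the returned list is non-empty, something that should have been encrypted is sitting in the tree as plain text. This only checks file names, so it is a tripwire, not a guarantee.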

What's the minimal list of servers/services/activities that you need to check into source control? You guessed it - everything.
  • DNS / Network definitions
  • LDAP / IAM / User authentication lists
  • Mail servers (if you manage mail internally, otherwise treat it as an external service)
  • Monitoring and alerting definitions
    • Especially if you use something like PagerDuty for alerting
  • GPG-encrypted master passwords for external services
  • GPG-encrypted keys (and other authentication methods) for external services
  • GPG-encrypted SSL certificates
  • Server construction methods (for your application)
    • Including where and how to get the base images
  • Application deployment methods
  • Service construction methods (for your internally-hosted supporting services)
    • For example, CI or source hosting (Jenkins, Stash, etc.)
  • Any and all desktop support (including VPN clients, etc)
  • Anything else you are responsible for
(Note: It may not be obvious from the documentation, but you can GPG-encrypt a file for multiple recipients, and any one of them can decrypt it. This is a good thing. When someone leaves, re-encrypting is not enough: they may already have the plaintext, so you need to re-key (change) every secret they could read. You needed to do this anyway. When someone new comes in, you only need to re-encrypt the existing secrets so their key is included.)
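Encrypting for several recipients comes down to repeating gpg's `--recipient` option once per person. Here is a minimal sketch that builds that command line in Python. The function name, the `.gpg` output naming, and the example addresses are my own assumptions; `--output`, `--recipient`, and `--encrypt` are standard gpg options.

```python
import subprocess
from typing import List


def gpg_encrypt_command(plaintext: str, recipients: List[str]) -> List[str]:
    """Build a gpg command that encrypts one file for every listed recipient."""
    if not recipients:
        raise ValueError("at least one recipient is required")
    # Assumed naming convention: write the ciphertext next to the plaintext.
    command = ["gpg", "--output", plaintext + ".gpg"]
    for recipient in recipients:
        # One --recipient per person; any one of them can decrypt the result.
        command += ["--recipient", recipient]
    command += ["--encrypt", plaintext]
    return command


def encrypt_for_team(plaintext: str, recipients: List[str]) -> None:
    """Run gpg; raises CalledProcessError if encryption fails."""
    subprocess.run(gpg_encrypt_command(plaintext, recipients), check=True)
```

When someone joins, re-run this with the longer recipient list. When someone leaves, change the secret itself first, then encrypt the new value for whoever remains.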

These things may not all live in the same repository. But don't hesitate to put something into some repository just because you don't know the perfect place for it. You can always move it later. And the move itself will document your growing understanding of how to manage the infrastructure.

If you don't know how to rebuild a server, take your best guess and stick that in the repository. As you learn more, you will update what's there. The repository logs will be a trail of exactly what you had to do in order to get all the information you currently have.

The next post will talk about what to do with this repository once you've built it.