Thursday, September 3, 2015

Questions to ask Operations

(. . . or, if you're the devops team, questions you should be asking yourself)

Devops teams exist to make the answer to each of the following questions a resounding "Absolutely!" If there is any question in this list where you're not 110% confident in that "Absolutely!", then that's the next thing to work on. And, yes, this list is ordered from most important to least important.
  1. Can we confidently rebuild our production environment from source control and backups of data?
    1. In under an hour?
    2. Including all monitoring, alerting, and metrics gathering?
  2. Can we confidently terminate one person's access?
    1. In under 10 minutes?
    2. With one command?
  3. Can we confidently create an instance of the application?
    1. That is a structural clone of production?
    2. With reasonable fake data?
    3. In one command?
    4. On a laptop?
  4. Can we confidently turn off any one server in production at any time?
    1. With zero impact or visibility to users?
    2. Including your:
      1. database master?
      2. session store?
  5. Can we confidently tell anyone to take 3 months leave to care for a sick family member?
    1. Without ever calling them once?
  6. Can we confidently hire into any spot and have that person fully authenticated and authorized?
    1. With nothing missing?
    2. In their first hour?
    3. Before they even show up?
  7. Can we confidently hire someone into IT and have them make a change to production?
    1. In their first week?
    2. In their first day?
  8. Can we confidently say that what is reviewed in QA is EXACTLY what can go to production?
  9. Can we confidently let anyone promote from one environment to the next?
    1. With a button?
    2. Showing them exactly what will be promoted?
      1. As issue numbers linked from your issue tracker?
    3. With rollbacks?
  10. Do you have tests of your infrastructure?
    1. Including monitoring, alerting, and metrics gathering?
    2. Including external interfaces?
    3. Run as part of a CI service?
    4. With automated coverage statistics?
      1. Over 90%?
Implicit in every question is the follow-up "How do you know?" If you ask yourself these questions and cannot point to where you did that yesterday (or the last time, in the case of authn/z changes), then you're treating your infrastructure as magic.
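One way to make "How do you know?" concrete is to record when each question was last exercised and flag the ones with no recent evidence. The following is only a minimal sketch under stated assumptions: the `Check` record, the `event_driven` flag, the `stale_checks` helper, and the sample dates are all hypothetical names and values invented for illustration, not part of any real tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Check:
    """One question from the list, plus the evidence that it was exercised."""
    question: str
    last_verified: Optional[date]  # None means nobody has ever done it
    event_driven: bool = False     # e.g. authn/z changes: "the last time" is enough


def stale_checks(checks: List[Check], today: date,
                 max_age: timedelta = timedelta(days=1)) -> List[Check]:
    """Return, in list order, the checks that lack recent evidence."""
    stale = []
    for check in checks:
        if check.last_verified is None:
            stale.append(check)  # never exercised at all
        elif not check.event_driven and today - check.last_verified > max_age:
            stale.append(check)  # exercised, but longer ago than max_age
    return stale


# Hypothetical records; the dates are illustrative, not real data.
today = date(2015, 9, 3)
checks = [
    Check("Rebuild production from source control and backups", date(2015, 9, 2)),
    Check("Terminate one person's access", date(2015, 6, 1), event_driven=True),
    Check("Turn off any one production server", None),
]

# The list is ordered by importance, so the first stale entry is the next task.
for check in stale_checks(checks, today):
    print("Next thing to work on:", check.question)
    break
```

Because the input keeps the ordering of the list above, the first stale entry is the most important gap. The one-day default mirrors "yesterday"; whether that threshold suits a given team is a judgment call.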

Next post discusses where to start.