I just finished reading a great book - Web Operations by John Allspaw and Jesse Robbins. This isn't a book of code samples; it's a book to make you think more about system administration in the large, infrastructure as code, and other big picture items for the web systems that we deal with today. Some notes:
- Chapters 6 and 11 talk about the difference between monitoring and metrics and the need for both. I've always done the monitoring part - mostly on the system level, though, gathering load average and disk usage and such. What I haven't done enough of is monitoring on the app level - e.g., if 100 users have signed up in the last day, alert me. These sorts of things are very doable, but they require effort. For metrics, doing those on the application level is very informative - how many users signed up today? How many tickets were opened in the past week? These chapters talk about using Ganglia for that... sounds like a great tool, and would be an improvement over the low-ceremony daily summary emails. At least those are a start, though; better than nothing. Gathering those metrics can lead to interesting discussions around trends and forecasting - if we did get 100 users in a day, what would that mean for our system? How would that change our hardware breakdown? And there are business-level questions too - are those users coming back, or are they creating accounts and never logging in again?
- Chapter 7 is a reprint of the excellent "How Complex Systems Fail" essay by Dr Richard Cook. Well worth reading and re-reading.
- Chapter 10 is a good discussion of dev/ops collaboration - or, sadly and more often, the conflict between the two. It's a tragedy and a waste when dev folks and ops folks don't get along. The ops guys have so much to teach the devs - they spend all day working with, and are experts in, things that devs usually touch once every 3 months (network configs, DNS, SMTP, backup/restore). And vice versa, of course - my impression is that there are a lot of ops shops that aren't using SCM for their scripts and don't have server builds automated. Tools like Puppet can help to bridge this gap... generally, both dev and ops need to be considerate of each other's responsibilities and needs. Lots to learn here.
- Chapter 12 had a fun discussion of the lure of DB clustering. I won't spoil it for you, but it really rang true for me.
- One thing I liked about the whole book was the general assumption that you do want failover and redundancy and are willing to work for that. That is to say, that you actually care about the app you are working on, the people you are working with, and the customers you are serving! I don't quite know how to put my finger on it... but there's sort of an underlying thoughtfulness about it all. There's a feeling that I don't want to blame others for the system going down; rather, I want to build in sufficient checks and balances so that when a server goes down the system continues to tick along without anyone having to make the 2 AM drive to the colo. It reminds me of the Cassandra project fellows saying that if a DB node goes offline in the middle of the night they can say "meh" and let it go until the next day. Good stuff.
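To make the app-level monitoring idea from the chapters 6 and 11 notes a bit more concrete, here's a rough Python sketch of the "if 100 users have signed up in the last day, alert me" check. This isn't from the book. The function names, the `SIGNUP_ALERT_THRESHOLD` constant, and the idea of handing in a plain list of signup timestamps are all my own assumptions for illustration; a real version would pull the timestamps from the app's database and send the message somewhere useful.

```python
from datetime import datetime, timedelta

# Hypothetical threshold, taken from the "100 users in a day" example.
SIGNUP_ALERT_THRESHOLD = 100


def signups_in_window(signup_times, now, window=timedelta(days=1)):
    """Count signups whose timestamp falls within `window` before `now`."""
    cutoff = now - window
    return sum(1 for t in signup_times if cutoff < t <= now)


def check_signups(signup_times, now, threshold=SIGNUP_ALERT_THRESHOLD):
    """Return an alert message if the threshold was reached, else None."""
    count = signups_in_window(signup_times, now)
    if count >= threshold:
        return f"ALERT: {count} signups in the last 24 hours"
    return None
```

The same `signups_in_window` count doubles as a metric: record it once a day and you have the raw numbers for the trend and forecasting discussions, with the alert being just a threshold on top of it.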
Obviously, I heartily recommend this book. If you've done much devops at all you'll find yourself enjoying (and sympathizing with the folks suffering through) the anecdotes, and you'll come away from this book with a list of things to do to lower the stress level around running your web app. Enjoy!