Whether it's in classrooms or on Zoom calls, the kids have headed back to school! Bright-eyed students are gearing up to study new subjects and test their brains. Hopefully, failure isn’t inevitable on their report cards. Before the first day, parents load up their kids’ backpacks with everything they’ll need. Being well equipped with good supplies is the best way to stay focused and learn “reliably”.
Likewise, SREs need the right tools and practices for the job. If you want to reduce incidents, respond to them faster, and continuously learn from them, you’ll want everything on our SRE back-to-school checklist!
Sometimes it’s hard to know what to focus on. When you look at a wall of text, your eyes might glaze over. Without something to make the essential data stand out, you might not learn anything. The same thing can happen when you’re looking at all the data your system produces.
Like a highlighter, service level indicators (SLIs) help you connect the dots on what’s important. What’s on the test? Customer happiness. Highlight everything that shows the story of a customer using your service. Identify different SLIs for different use cases, just like you’d use different colours.
Once you’ve built your SLIs, monitor them as they change. Then put a BIG highlighter mark where you DON’T want an SLI to go: the point where customers start to feel the pain of unreliability. This is your service level objective (SLO). When you can tell an SLI is heading toward a breach of its SLO, it’s time to slow down and start using policies like code freezes. But when it’s safely far away, it’s full steam ahead on development! SLOs keep you focused on the critical line, so you don’t overreact to every change.
To build your SLIs and SLOs, you need something to measure the health of your service. Monitoring tools are like rulers and protractors: they let you look at something and get a meaningful number from it. Make sure to measure the most important things, such as your network traffic, resources like disk space, how quickly your service is responding, and how your third-party components are working.
Once you’re measuring these things, keep watching them as events unfold. Line up your protractor to get the slope of how fast metrics change. Your monitoring tools will help you find trends and patterns in the oscillations, and you won’t even need trigonometry! These observations will help you make the right decisions for service health.
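To make the highlighter mark concrete, here is a minimal Python sketch of an SLO check. It assumes an availability-style SLI (good requests divided by total requests). The function names and the 10% freeze threshold are our own illustrative choices, not part of any particular tool.

```python
def error_budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Return the unspent fraction of the error budget.

    1.0 means no budget spent, 0.0 means the SLO is exactly met,
    and a negative value means the SLO has been breached.
    """
    if total_events == 0:
        return 1.0  # no traffic, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        # A 100% target leaves no budget at all.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


def release_policy(remaining: float, freeze_threshold: float = 0.1) -> str:
    """Hypothetical policy: freeze when little budget is left."""
    if remaining <= freeze_threshold:
        return "code freeze"
    return "full steam ahead"


# 50 failed requests out of 100,000 against a 99.9% target
remaining = error_budget_remaining(99_950, 100_000, 0.999)
print(release_policy(remaining))
```

With a 99.9% target, 100,000 requests allow roughly 100 failures, so 50 failures leave about half the budget and development carries on.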
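If you want to see what "getting the slope" could look like without a protractor, here is a small sketch. It fits a straight line through recent samples with ordinary least squares and projects when a resource would run out. The disk-usage numbers and function names are made up for illustration, and real monitoring tools typically offer their own trend functions.

```python
from typing import Optional, Sequence


def slope(times: Sequence[float], values: Sequence[float]) -> float:
    """Least-squares slope of values over times (units of value per unit of time)."""
    n = len(times)
    if n < 2 or n != len(values):
        raise ValueError("need at least two samples, with matching lengths")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    numerator = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    denominator = sum((t - mean_t) ** 2 for t in times)
    if denominator == 0:
        raise ValueError("all samples share the same timestamp")
    return numerator / denominator


def hours_until_full(times_h: Sequence[float], used_pct: Sequence[float],
                     limit: float = 100.0) -> Optional[float]:
    """Project hours until usage hits the limit, or None if usage is not growing."""
    rate = slope(times_h, used_pct)
    if rate <= 0:
        return None
    return (limit - used_pct[-1]) / rate


# Hypothetical disk usage sampled once an hour
print(hours_until_full([0, 1, 2, 3], [70, 72, 74, 76]))
```

A steady climb of two percentage points per hour from 76% gives about twelve hours of headroom, which is the kind of number that turns a wiggly graph into a decision.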
No matter how much you prepare, things are bound to go wrong. That’s okay! It’s why pencils have erasers. In a complex system, failures can sometimes be like knocking over dominoes. That’s why you’ll need a whole lot of erasers for all sorts of errors.
Set up a classification system so you know which eraser you’ll need. Then make sure the right people are using it, and that your on-call schedules are fair and effective. Erasers wear down eventually, so check in regularly to see whether your systems and responders are holding up.
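As a rough illustration of a classification system, the sketch below sorts incidents into severity levels and looks up who should respond. The thresholds, severity names, and team names are all hypothetical. Your own scheme should reflect what actually hurts your customers.

```python
from enum import Enum
from typing import Dict, List


class Severity(Enum):
    SEV1 = 1  # major customer impact, all hands
    SEV2 = 2  # partial impact, primary on-call
    SEV3 = 3  # minor impact, handle in working hours


def classify(customers_affected_pct: float, core_feature_down: bool) -> Severity:
    """Pick a severity from two illustrative signals (thresholds are assumptions)."""
    if core_feature_down or customers_affected_pct >= 50:
        return Severity.SEV1
    if customers_affected_pct >= 5:
        return Severity.SEV2
    return Severity.SEV3


# Hypothetical routing table: which eraser, and who holds it
RESPONDERS: Dict[Severity, List[str]] = {
    Severity.SEV1: ["primary on-call", "secondary on-call", "incident commander"],
    Severity.SEV2: ["primary on-call"],
    Severity.SEV3: ["owning team backlog"],
}

severity = classify(customers_affected_pct=12.0, core_feature_down=False)
print(severity.name, RESPONDERS[severity])
```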
When responding to incidents, keep track of what works. Document each thing you check, and what decisions you’d make for each result. When dealing with a new type of problem, cutting and pasting these proven steps often gives you a great head start. (Well, really, it’s more “copy and paste”, but Xerox and glue doesn’t sound quite right…)
The guides you build from these modular steps are runbooks. They walk responders through each stage of solving a problem. By investing in them, you’ll reduce the time future incidents will take to resolve. You can go even further by automating your runbooks. To get these payoffs, you’ll need to build a good collection of steps, so start cutting up and pasting your processes!
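Here is one way the "modular steps" idea might look in code, as a first step toward automation. Each step pairs a check with the decision to take for each result, and a runbook is just a list of steps. The step names and checks are placeholders we invented. A real runbook would call your own health checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    check: Callable[[], bool]  # returns True when the check passes
    on_pass: str               # decision to record if it passes
    on_fail: str               # decision to record if it fails


def run_runbook(steps: List[Step]) -> List[str]:
    """Run every step in order and return a log of results and decisions."""
    log = []
    for step in steps:
        passed = step.check()
        decision = step.on_pass if passed else step.on_fail
        status = "ok" if passed else "FAILED"
        log.append(f"{step.name}: {status} -> {decision}")
    return log


# Reusable steps, pasted together into a runbook for a hypothetical latency incident
check_disk = Step("disk space", lambda: True, "move on", "clear old logs")
check_upstream = Step("upstream API", lambda: False, "move on", "fail over to backup")

for line in run_runbook([check_disk, check_upstream]):
    print(line)
```

Because each step is self-contained, the same `check_disk` step can be pasted into the next runbook you write, and the returned log doubles as a record of what was checked.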
We all loved taking notes in class, didn’t we? Well, maybe not, but if we did take notes, we were sure grateful we did later. Your incidents are the same way. In a crisis situation, you might not want to spend the time recording what’s happening, but this will become your most valuable asset for dealing with future incidents.
Your incident retrospective should contain contextual monitoring data, a log of communication, a timeline of events, and recommended follow-up actions. This sounds like a lot, but tools can help. Blameless allows you to automatically construct a retrospective as you resolve an incident.
But taking notes is only half the story — you have to study, too. Schedule meetings to review your retrospectives and make sure action items are still in progress. Use them to build new runbooks and other practices. There’s no better investment you can make in incident response than taking good notes and reading them!
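If you are taking notes by hand, a simple structure helps. The sketch below models a retrospective with the four parts described above, plus a helper for review meetings that lists action items still open. It is a generic illustration with made-up field names and sample data, and it does not represent how Blameless or any other tool stores retrospectives.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ActionItem:
    description: str
    owner: str
    done: bool = False


@dataclass
class Retrospective:
    title: str
    monitoring_data: List[str] = field(default_factory=list)    # graphs, alerts, metrics
    communication_log: List[str] = field(default_factory=list)  # chat and call notes
    timeline: List[str] = field(default_factory=list)           # what happened, in order
    follow_up_actions: List[ActionItem] = field(default_factory=list)

    def open_actions(self) -> List[ActionItem]:
        """Action items to chase in the next review meeting."""
        return [item for item in self.follow_up_actions if not item.done]


retro = Retrospective(title="Checkout latency spike")
retro.timeline.append("14:02 alert fired for p99 latency")
retro.follow_up_actions.append(ActionItem("Add a cache warm-up step to the runbook", "dana"))
retro.follow_up_actions.append(ActionItem("Raise the connection pool size", "lee", done=True))

for item in retro.open_actions():
    print(f"{item.owner}: {item.description}")
```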
Getting into SRE can be intimidating, and the abundance of resources can be overwhelming too. It might feel like walking up to a huge wall of textbooks, not knowing which to choose. However, once you identify which “class” you’re in (organizational size, maturity, and resources), you’ll be able to pick the right books.
Our Essential Guide to SRE meets you where you are, offering practical advice for any organization. As you mature, expand your practices further with the Google SRE book. This is a comprehensive guide written from the perspective of an extremely well-equipped organization, so it definitely requires a high reading level. If you’re finding its solutions unfeasible, look for different ideas for each category in our big list of SRE resources. Scour the shelves to find the perfect book for you.
Even with all the best supplies, you can still run into trouble if you don’t know how to use them. What if there were a brilliant tutor who guided you through setting SLOs, building and documenting runbooks, creating retrospectives, and more? That’s us! To find out how we can help get your SRE school year started off right, check out a demo!