The Blameless Blog
In this blog post, we’ll break down what software reliability means. We’ll look at how the reliability of your software is perceived, how teams operate to improve reliability, and how to contextualize reliability with customer happiness and cultural lessons.
In this blog post, we’ll walk you through holistic measures and best practices that you can employ starting today. These will include challenges and pain points in gaining insight as well as key metrics and how they evolve as organizations mature.
In our third episode, Amy chats with Tammy Bryant, Principal SRE at Gremlin, skateboarder, and horror movie lover, and Eric Roberts, Sr. Manager SRE at Under Armour, performer/writer/recorder of music, and coffee aficionado.
Establishing equitable on-call rotations, putting the right guardrails and automation in place, and regular incident practice are key to minimizing the stress of on-call. In this blog, we’ll share key tools and practices to ensure your on-call engineers are set up for success.
Jacob Scott is an experienced engineer and enthusiastic participant in the resilience engineering community, having spent time caring for the technology systems powering high-growth startups as well as unicorns like Lyft and Stripe. See our interview with him here.
In our fourth episode, Amy chats with Craig Sebenik, SRE at Aurora and co-author of “What is SRE?” and “Salt Essentials.” He has a degree from Le Cordon Bleu (Sydney, Australia), a Master's in Italian Cuisine (Apicius in Florence, Italy), and a Master's in Gastronomy (University of Rheims, France). His greatest passion is teaching what he has learned from adventures in SRE and cooking.
It’s important to minimize alert or pager fatigue as much as possible, for the health and well-being of your team members. After all, the health of your systems is dependent on the health of your people. Here are 5 tips on how to cut down on alert fatigue and improve your signal-to-noise ratio.
With the difficulties we’re facing during this time, it can be hard to keep up with the increasingly vast demand for our services. You need to make use of all the tools in your toolbelt in order to conserve your team’s cognitive resources. Two ways you can do this are through automating toil from your processes and prioritizing with SLOs.
Between COVID-19 and the typical summer slowdown, offices are emptier than they’ve ever been. With team members taking some much-needed time off, it’s important to know how your team will be affected. Here are some tips to help your teams function during this time of flux.
CEO Ashar Rizqi had the pleasure of being a guest on Google Cloud OnAir, a Google Cloud Customer Interview Series. Ashar and interviewer Jimmy Sopko discussed how Blameless has extended our runway using Google Cloud and Google Kubernetes Engine and how the team cultivates a culture of site reliability in a changing world.
Like many organizations, our SRE journey didn't follow a linear path. We had to learn along the way. As a software reliability platform purpose-built for SREs, Blameless strives to practice what we preach and utilizes SRE best practices daily to cultivate a culture of resilience. Here's how it all began.
The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-oriented decisions around reliability. (Read here for a definition of SLOs and how they transformed Evernote.) Today, the Twitter team has invested in centralized tooling to measure, track, and visualize SLOs and their corresponding error budgets.
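To make the link between an SLO and its error budget concrete, here is a minimal sketch in Python. It assumes a request-based SLO measured over a fixed window; the function names and the sample numbers are illustrative only and do not describe the tooling used by any team mentioned above.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests that may fail in the window without breaching the SLO.

    slo_target is a fraction, e.g. 0.999 for a 99.9% success objective.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests < 0:
        raise ValueError("total_requests must be non-negative")
    return round((1.0 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # No failures are affordable at this volume: any failure overspends.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / budget


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests against a 99.9% objective.
    print(error_budget(0.999, 1_000_000))
    print(budget_remaining(0.999, 1_000_000, failed_requests=250))
```

A remaining budget near zero is the usual signal to slow feature work and spend time on reliability; a healthy budget leaves room to ship riskier changes.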
To be effective, service level indicators must be relevant to the users’ needs and experience. By consolidating a number of internal metrics into one indicator that reflects the typical use of the service, we can ensure that meeting our SLO means keeping users happy. A good way to think about this is by looking at the user’s experience or journey.
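One way to picture a journey-based indicator is the sketch below. It assumes each recorded user journey lists whether every step succeeded plus the end-to-end latency, and it counts a journey as "good" only when all steps succeed within a latency threshold. The `JourneyEvent` type, the field names, and the 2000 ms threshold are hypothetical choices for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class JourneyEvent:
    """One user's pass through a journey such as search, add to cart, pay."""
    steps_ok: List[bool]       # success flag for each step, in order
    total_latency_ms: float    # end-to-end time the user waited


def journey_sli(events: Sequence[JourneyEvent],
                latency_threshold_ms: float = 2000.0) -> float:
    """Share of journeys that were fully successful and fast enough."""
    if not events:
        raise ValueError("need at least one journey to compute an SLI")
    good = sum(
        1 for e in events
        if all(e.steps_ok) and e.total_latency_ms <= latency_threshold_ms
    )
    return good / len(events)


if __name__ == "__main__":
    sample = [
        JourneyEvent([True, True, True], 800.0),    # good
        JourneyEvent([True, False, True], 500.0),   # a step failed
        JourneyEvent([True, True, True], 3500.0),   # too slow
        JourneyEvent([True, True, True], 1200.0),   # good
    ]
    print(journey_sli(sample))
```

Folding per-step success and latency into a single ratio keeps the indicator tied to what the user experienced rather than to any one internal metric.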
Runbooks, also known as playbooks, are documents that walk you through a certain task with specific steps. Automated runbooks can be a powerful tool for time-saving and consistency. We’ll look at five best practices for getting the most out of runbook automation and some tools on the market that can help you implement them, and discuss how to integrate runbook automation into a complete SRE solution.
As organizations are made of people, any organization can foster continuous learning, blameless culture, and psychological safety so long as its people are committed to a growth mindset. Once these cultural factors are in place, it becomes much easier to implement the practices, processes, and tools that scale that culture of excellence.
I was asked to talk about why reliability is important to me personally. I was up at 3:00 AM this morning, thinking through this question. So my sleep is obviously pretty unreliable and those kinds of questions will always get me going. And I thought, let me kind of walk folks through how reliability is personal to me.
We're proud to announce that we were selected by CIOReview as one of the Top 20 DevOps Solution Providers of 2020 alongside other innovators in the space such as Chef, JFrog, Splunk, and XebiaLabs. This recognition validates our vision to help teams achieve production excellence by facilitating resilience and learning.
In addition to Zoom, Slack and Google Hangouts, Blameless has released a new integration with GoToMeeting to further extend our collaboration capabilities. With this integration, customers can automatically spin up a GoToMeeting link within the Blameless Slack incident channel.
Blameless recently had the privilege of hosting some fantastic leaders in the SRE and resilience community for a panel discussion. Our panelists discussed the effects of imposter syndrome, especially during high-tempo situations, how to use it to our advantage and overcome doubt, and how culture directly affects the availability of our systems.
Over a year ago, Blameless launched the industry’s first end-to-end SRE platform to help software teams innovate without sacrificing reliability. As Service Level Objectives (SLOs) provide an anchor for reliability targets and corresponding decisions, they are the foundational step toward helping teams truly adopt SRE best practices. Today, we are very excited to announce our new SLO platform, giving teams a shared language on how to focus their engineering efforts.
Blameless is so excited to sponsor INS1GHTS2020. This one-day digital gathering of industry leaders in NetOps, DevOps, and application delivery provides the (virtual) space for candid conversations and presentations on navigating the present and building the infrastructure that will power the future.
Psychologically safe organizations are free to create, discuss, disagree, take risks, and make mistakes. These organizations are often the ones we see as key innovators in their unique industries. In other words, cultivating a culture of psychological safety is paramount in order to succeed. So what can we do to make sure our teammates feel secure even while socially distanced?
In our second episode, Amy chats with Tim Banks, a technical account manager at Mission who has held the title of database engineer, DevOps engineer, SRE, American National and Pan American Brazilian Jiu-Jitsu champion, and professional chef during his career.
During this crisis, managing burnout has become more difficult with people unable to separate home from work, the increased burden of keeping everything on and heightened on-call loads, and the strain on communication. Here are tips to help combat burnout in your teams.
CEO and Co-founder of Blameless Ashar Rizqi had the privilege of interviewing Melody Hildebrandt on her fascinating personal story, as well as her thoughts on security and resilience in today’s constantly evolving world of technology.
April 22, 2020 at 11:20 AM PST, Amy Tobey began her talk “The Future of DevOps is Resilience Engineering” at Gremlin’s Failover Conf. During her talk, attendees registered additional questions. Requests and responses noted in timeline below.
With dozens of cancelled events, social distancing policies, and heightened stress due to the current crisis, it was more necessary than ever to take a moment to learn, share, and talk to one another about something we are all passionate about. We loved Failover Conf, and want to share our favorite parts with you.
SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen discuss how SREs can approach work as done vs work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.
We are very excited to partner with Lightstep to share practical steps on gaining deep observability into distributed systems, and automating toil from incident response and learning to improve production readiness in our live webinar, Incident Readiness, Observability & Learning for Production Teams.
In Digital Enterprise Journal’s March Edition of its Technology Innovation Snapshot, Blameless was listed alongside 11 other companies as a promising vendor. Blameless is honored to be alongside companies such as Gremlin, Catchpoint, and Moogsoft, and excited about the future DEJ sees for the SRE space.
We’re so proud to be co-sponsoring Gremlin’s new virtual conference, Failover Conf. As Gremlin states, “We expected to gather together in person to share our knowledge and experiences when the unexpected happened. But we’re resilient. When one opportunity goes down, we create another."
Blameless is excited to announce its sponsorship of Survivor Season 41: Silicon Valley. In this season of Survivor, players will hold nothing back! It’s a season that will shock everyone, even Jeff Probst himself. Survivor season 41 will feature an age-old conflict: Developers vs Operations!
Blameless Staff SRE Amy Tobey is lending her time to provide SRE office hours to help anyone in need get their head above water. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others.
No, it won’t be possible to continue operating business-as-usual. For the foreseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.
On-call: you may see it as a necessary evil. It isn’t a surprise that many engineers have horror stories about the difficulty of carrying a pager around the clock. But does on-call have to be so dreadful? We think not. Here are five best practices that can help your team respond quicker and build more resilient systems that minimize repetitive interruptions.
In response to recent events, many organizations are implementing social distancing programs such as remote work. Successfully transitioning to remote work does come with challenges, but the right practices and attitudes can make it much less painful (and safer for you than heading into the office).
The trick is to ensure that regardless of your organizations’ different operating models or toolchains, there is shared visibility, communication, and collaboration across teams. This will allow your disparate teams to stay aligned while using the best practices from ITIL, DevOps, and SRE.
Today, the number of possible failure modes in cloud and microservices applications is exploding, making it increasingly difficult to gain true observability and take the right action across IT environments. Register for this webinar with Blameless and Zebrium to find out how AI can help with incident auto-detection.
For many SREs, networking prompts much the same response as going to the dentist. You know you should do it, but you don’t really want to. But networking is much less like a root canal and more like a regular teeth cleaning; you may not want to go, but once you’re there, it’s not so bad.
The new comment sidebar helps drive postmortem workflows by enabling collaborators to comment on postmortems, reply to comments, and resolve comments. We’ve also updated the look and feel of postmortems so that postmortem authors can gain as well as provide important post-incident context in a simple way.
We’re probably all familiar with Dickens’ story of Scrooge and the Three Ghosts of Christmas, written all the way back in 1843. What we may not know is that ghosts providing visions and teaching lessons is still common practice today! Let’s look into the carol of an ambitious, but unreliable, tech CEO.
My name is Simone Salman, and I’ve been working as a software engineer at Blameless since May 2019. In the spirit of thanks as we’re approaching the holidays, I wanted to reflect on my time at Blameless thus far, and share a few things about the culture that I’m especially grateful for.
It’s astonishing that despite the tremendous time we spend working on our systems, we seem to have very little control over them. If we can’t predict where the next incidents will come from, then we will be forever stuck in a reactive cycle of repair. An analogous example is the famous fable of the Three Little Pigs.
For many SREs, Google’s 99.999% availability seems like an untouchable dream. If anything, getting out of pager hell is already worth celebrating with all your coworkers, friends, and family. How can you get to a stage where you have time to proactively prevent incidents, and enter a mental state of calm and control?
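To see why five nines feels out of reach, it helps to turn the percentage into allowed downtime. The sketch below is plain arithmetic under the assumption of a time-based availability target; the function name is illustrative.

```python
from datetime import timedelta


def allowed_downtime(availability_pct: float, window: timedelta) -> timedelta:
    """Downtime permitted in `window` while still meeting the availability target."""
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return window * (1.0 - availability_pct / 100.0)


if __name__ == "__main__":
    year = timedelta(days=365)
    for target in (99.0, 99.9, 99.99, 99.999):
        print(f"{target}% -> {allowed_downtime(target, year)} per year")
```

At 99.999%, the yearly allowance works out to a little over five minutes, which leaves almost no room for a human to be paged, wake up, and respond.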
Thanks to companies like Amazon, Google, Facebook, Netflix, etc., software delivery is transitioning from a novelty to a utility... When feature requests for reliability exceed 50% of all feature requests, it’s time to focus on reliability first and foremost.
Having talked with 300 companies from industries like retail, finance, healthcare, and SaaS, we see that SRE as a discipline is shifting left in the software development life cycle... However, this does not take away job opportunities from SREs. Shifting left allows SREs to become partners in the development process.
When companies blame, fearful employees are not incentivized to surface issues early or ship risky changes... The fact is, complex systems fail. Rather than blaming individuals for these failures, the only way to navigate this complexity is to empower people to have the adaptive capacity that machines do not.