The Blameless Blog
The concepts of service level objectives (SLOs) and error budgets have been key to this transformation, as SLOs shape an organization’s ability to make data-oriented decisions around reliability. (Read here for a definition of SLOs and how they transformed Evernote.) Today, the Twitter team has invested in centralized tooling to measure, track, and visualize SLOs and their corresponding error budgets.
To be effective, service level indicators must be relevant to the users’ needs and experience. By consolidating a number of internal metrics into one indicator that reflects the typical use of the service, we can ensure that meeting our SLO means keeping users happy. A good way to think about this is by looking at the user’s experience or journey.
Runbooks, also known as playbooks, are documents that walk you through a certain task with specific steps. Automated runbooks can be a powerful tool for time-saving and consistency. We’ll look at five best practices for getting the most out of runbook automation and some tools on the market that can help you implement them, then discuss how to integrate runbook automation into a complete SRE solution.
As organizations are made of people, any organization can foster continuous learning, blameless culture, and psychological safety so long as its people are committed to a growth mindset. Once these cultural factors are in place, it becomes much easier to implement the practices, processes, and tools that scale that culture of excellence.
I was asked to talk about why reliability is important to me personally. I was up at 3:00 AM this morning, thinking through this question. So my sleep is obviously pretty unreliable and those kinds of questions will always get me going. And I thought, let me kind of walk folks through how reliability is personal to me.
We're proud to announce that we were selected by CIOReview as one of the Top 20 DevOps Solution Providers of 2020 alongside other innovators in the space such as Chef, JFrog, Splunk, and XebiaLabs. This recognition validates our vision to help teams achieve production excellence by facilitating resilience and learning.
In addition to Zoom, Slack and Google Hangouts, Blameless has released a new integration with GoToMeeting to further extend our collaboration capabilities. With this integration, customers can automatically spin up a GoToMeeting link within the Blameless Slack incident channel.
Blameless recently had the privilege of hosting some fantastic leaders in the SRE and resilience community for a panel discussion. Our panelists discussed the effects of imposter syndrome, especially during high-tempo situations, how to use it to our advantage and overcome doubt, and how culture directly affects the availability of our systems.
Over a year ago, Blameless launched the industry’s first end-to-end SRE platform to help software teams innovate without sacrificing reliability. As Service Level Objectives (SLOs) provide an anchor for reliability targets and corresponding decisions, they are the foundational step toward helping teams truly adopt SRE best practices. Today, we are very excited to announce our new SLO platform, giving teams a shared language on how to focus their engineering efforts.
Blameless is so excited to sponsor INS1GHTS2020. This one-day digital gathering of industry leaders in NetOps, DevOps, and application delivery provides the (virtual) space for candid conversations and presentations on navigating the present and building the infrastructure that will power the future.
Psychologically safe organizations are free to create, discuss, disagree, take risks, and make mistakes. These organizations are often the ones we see as key innovators in their unique industries. In other words, cultivating a culture of psychological safety is paramount in order to succeed. So what can we do to make sure our teammates feel secure even while socially distanced?
In our second episode, Amy chats with Tim Banks, a technical account manager at Mission who has held the title of database engineer, DevOps engineer, SRE, American National and Pan American Brazilian Jiu-Jitsu champion, and professional chef during his career.
During this crisis, managing burnout has become more difficult with people unable to separate home from work, the increased burden of keeping everything on and heightened on-call loads, and the strain on communication. Here are tips to help combat burnout in your teams.
CEO and Co-founder of Blameless Ashar Rizqi had the privilege of interviewing Melody Hildebrandt on her fascinating personal story, as well as her thoughts on security and resilience in today’s constantly evolving world of technology.
April 22, 2020 at 11:20 AM PST, Amy Tobey began her talk “The Future of DevOps is Resilience Engineering” at Gremlin’s Failover Conf. During her talk, attendees registered additional questions. Requests and responses noted in timeline below.
With dozens of cancelled events, social distancing policies, and heightened stress due to the current crisis, it was more necessary than ever to take a moment to learn, share, and talk to one another about something we are all passionate about. We loved Failover Conf, and want to share our favorite parts with you.
SRE leaders Craig Sebenik, David Blank-Edelman, and Kurt Andersen discuss how SREs can approach work as done vs work as imagined, how to define SRE and DevOps and the complementary nature of the two, the ethics of purchasing packaged versions of open source software, and more.
We are very excited to partner with Lightstep to share practical steps on gaining deep observability into distributed systems, and automating toil from incident response and learning to improve production readiness in our live webinar, Incident Readiness, Observability & Learning for Production Teams.
After getting managerial approval for incident management, your SRE buy-in program is well underway. In part 2 of this blog series, we're going to share how to convince a VP or director to invest in additional SRE practices to strategically improve business results: automated metrics and continuous learning.
In this blog series, we will walk you through how to come up with a winning pitch for each level of leadership to ensure that SRE buy-in will succeed in your organization. Let’s start at the beginning with your team lead or manager.
In the March edition of Digital Enterprise Journal’s Technology Innovation Snapshot, Blameless was listed alongside 11 other companies as a promising vendor. Blameless is honored to be alongside companies such as Gremlin, Catchpoint, and Moogsoft, and excited about the future DEJ sees for the SRE space.
We’re so proud to be co-sponsoring Gremlin’s new virtual conference, Failover Conf. As Gremlin states, “We expected to gather together in person to share our knowledge and experiences when the unexpected happened. But we’re resilient. When one opportunity goes down, we create another.”
Blameless is excited to announce its sponsorship of Survivor Season 41: Silicon Valley. In this season of Survivor, players will hold nothing back! It’s a season that will shock everyone, even Jeff Probst himself. Survivor season 41 will feature an age-old conflict: Developers vs Operations!
Blameless Staff SRE Amy Tobey is lending her time to provide SRE office hours to help anyone in need get their head above water. She cares deeply about her community of SREs and wants to take what she’s learned over the 20+ years of her career to help others.
No, it won’t be possible to continue operating business-as-usual. For the foreseeable future, teams across the world will be dealing with cutbacks, infrastructure instability, and more. However, with SRE best practices, your team can embrace resilience and adapt through this difficult time.
On-call: you may see it as a necessary evil. It isn’t a surprise that many engineers have horror stories about the difficulty of carrying a pager around the clock. But does on-call have to be so dreadful? We think not. Here are five best practices that can help your team respond quicker and build more resilient systems that minimize repetitive interruptions.
In response to recent events, many organizations are implementing social distancing programs such as remote work. Successfully transitioning to remote work does come with challenges, but the right practices and attitudes can make it much less painful (and safer for you than heading into the office).
The trick is to ensure that regardless of your organizations’ different operating models or toolchains, there is shared visibility, communication, and collaboration across teams. This will allow your disparate teams to stay aligned while using the best practices from ITIL, DevOps, and SRE.
Today, the number of possible failure modes in cloud and microservices applications is exploding, making it increasingly difficult to gain true observability and take the right action across IT environments. Register for this webinar with Blameless and Zebrium to find out how AI can help with incident auto-detection.
For many SREs, networking prompts the same response as going to the dentist. You know you should do it, but you don’t really want to. But networking is much less like a root canal and more like a regular teeth cleaning; you may not want to go, but once you’re there, it’s not so bad.
The new comment sidebar helps drive postmortem workflows by enabling collaborators to comment on postmortems, reply to comments, and resolve comments. We’ve also updated the look and feel of postmortems so that postmortem authors can gain as well as provide important post-incident context in a simple way.
We’re probably all familiar with Dickens’ story of Scrooge and the Three Ghosts of Christmas, written all the way back in 1843. What we may not know is that ghosts providing visions and teaching lessons is still common practice today! Let’s look into the carol of an ambitious, but unreliable, tech CEO.
My name is Simone Salman, and I’ve been working as a software engineer at Blameless since May 2019. In the spirit of thanks as we’re approaching the holidays, I wanted to reflect on my time at Blameless thus far, and share a few things about the culture that I’m especially grateful for.
It’s astonishing that despite the tremendous time we spend working on our systems, we seem to have very little control over them. If we can’t predict where the next incidents will come from, then we will be forever stuck in a reactive cycle of repair. An analogous example is the famous fable of the Three Little Pigs.
For many SREs, Google’s 99.999% availability seems like an untouchable dream. If anything, getting out of pager hell is already worth celebrating with all your coworkers, friends, and family. How can you get to a stage where you have time to proactively prevent incidents, and enter a mental state of calm and control?
Thanks to companies like Amazon, Google, Facebook, Netflix, etc., software delivery is transitioning from a novelty to a utility... When feature requests for reliability exceed 50% of all feature requests, it’s time to focus on reliability first and foremost.
Having talked with 300 companies from industries like retail, finance, healthcare, and SaaS, we see that SRE as a discipline is shifting left in the software development life cycle... However, this does not take away job opportunities from SREs. Shifting left allows SREs to become partners in the development process.
When companies blame, fearful employees are not incentivized to surface issues early or ship risky changes... The fact is, complex systems fail. Rather than blaming individuals for these failures, the only way to navigate this complexity is to empower people to have the adaptive capacity that machines do not.
Whether you are implementing SRE or DevOps, your best intentions are likely going to disappoint you on the first try. Much like with DevOps, the path to successful SRE implementation is not as elusive as it may seem. Costly mistakes and team dissatisfaction can be avoided — if you take the right steps.
If you'd like to get to know us better, then join the team for drinks and appetizers at the unique and delicious restaurant Hodge's on May 4th, or if you'd like to apply for a position to work with Blameless and change the world through SRE, visit us at booth 23 at the PyCon job fair!
Walk the halls of any SaaS startup and you’ll hear the same thing over and over again: “Before we can really scale, we have to be enterprise-ready.” While “enterprise-ready” has many definitions, it’s generally defined as being secure and compliant, having the baseline features the market expects, and being consistently reliable.
It's that time of the year again: SaaStr time! Blameless is heading to SaaStr in San Jose and hopes you'll be able to join us. You can stop by Booth 721 to chat with us about Site Reliability Engineering (SRE) and why Blameless might be the best solution for you. Come grab some swag, chat with us, and RSVP for our VIP dinner!
December is typically considered a month of reflection and anticipation for what’s to come -- it's kind of like a postmortem for your year :wink:. 2018 was a great year for Blameless, and we have something even more exciting coming in 2019. So join us at KubeCon 2018 as we get ready to make an announcement that will be guaranteed to brighten your new year.
A remarkable milestone for any company’s site reliability engineering (SRE) is five 9s availability. That’s less than 30 seconds of service unavailability per month! Exactly what Twilio has accomplished. Tyler Wells, the Director of Engineering at Twilio, shares the key building blocks of getting to five 9s.
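To see where the "less than 30 seconds" figure comes from, here is a minimal sketch of the arithmetic. It assumes a 30-day month, and the function name `allowed_downtime_seconds` is hypothetical, used only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def allowed_downtime_seconds(availability_percent: float, days: int = 30) -> float:
    """Return the downtime budget in seconds for a window of `days` days.

    Assumption: the window length (30 days by default) is our choice for
    illustration; the original post does not say how a month is measured.
    """
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days * SECONDS_PER_DAY


five_nines = allowed_downtime_seconds(99.999)
print(f"{five_nines:.2f} seconds per 30-day month")  # prints "25.92 seconds per 30-day month"
```

Under this assumption, five 9s leaves roughly 26 seconds of unavailability per month, and a 31-day month still stays under 30 seconds.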
Most feature developers don’t plan for retirement of features. Microservices give you the illusion that you can yank and replace, but that’s not really the case. It’s tough to turn off a microservice without losing an arm or leg. That’s why it’s important for SRE to have a full life cycle engagement.