When Blameless started in 2018, the team set out on a mission to help all engineers achieve reliability with less toil and risk. Three years in, that mission has become more important than ever. What has changed is the rate of SRE adoption: SRE is now the fastest-growing team and practice inside engineering. This represents a clear recognition of the many upsides that an SRE practice brings with its combination of continuous learning, velocity, and resilience. SRE is a major force driving change in software engineering today.
The Blameless team is delighted to share that we’ve just closed a $30M Series B funding round from existing and new investors. This funding will allow us to continue providing the best in SRE as we expand our product, engineering, and go-to-market (GTM) teams. Check out the news release!
We want to take this opportunity to thank all our customers, partners, and employees for their continued commitment as we look forward to more success ahead. To celebrate this moment, we’re sharing an open conversation between our co-founder and CEO, Lyon Wong, and our SRE Architect, Kurt Andersen.
As software becomes more integral to how we live every day, Kurt’s insight makes a useful backdrop to our discussion:
“Reliability matters. Without it, all the features in the world are worthless.”
“Lyon, what surprises you the most about the SRE market since founding the company?”
Every engineering team wants reliable services. Surprisingly, there’s quite a spread of maturity in the market on how to achieve it. When teams adopt SRE, both tooling and process, it’s rarely a linear, steady progression. The reality is you’re never really “done” because it’s a journey of continuous change and learning. Building SRE as a key part of modern software engineering is a multi-step journey with lots of milestones, so it’s important to celebrate wins, both large and small.
I’m surprised when teams declare “done” too soon. Pausing or stalling along the path to achieving SRE means many of its benefits arrive late or are never reached at all. Improving how you manage incidents is an important step, but that’s just the tip of the spear.
Blameless helps teams better manage failures and then learn from those failures in order to continuously improve as they grow and scale. It’s a journey, and SRE success involves a few key ingredients: a playbook to fit your business, tech tools to streamline the entire process, and the right cultural mindset. The path to important milestones is a winding one, and it’s important to stay the course.
“Kurt, what is the most exciting thing that’s happening in the market today with SRE adoption?”
I’m happy to say that some of the basic mechanics of practicing SRE are now commonly understood. This includes incident response, retrospectives, and even SLOs — though that is lagging a bit. I’m personally excited that there's a visible, growing awareness about the role of people in providing resilience and that just adopting the mechanical process steps is not enough.
SRE is a whole mindset shift to holistically embrace reliability and systems-wide engineering.
The other exciting shift is a better understanding of online service delivery as a source of business value. It’s no longer seen as a cost center or a place to cut corners and save money. In fact, quite the opposite: it’s an area to invest in and continue to improve, scale, and innovate.
Outside engineering, other stakeholders are much more aware of the investment required to deliver reliability excellence. As online services continue to deliver customer value, business leaders want to speak the same language as engineering. SLOs are a great way to bring all functions together, get everyone on the same page, and really understand what’s involved in achieving the right level of uptime.
“Kurt, what are some of the challenges that engineering teams face when adopting SRE?”
Continuous improvement doesn’t have an end. Frankly, the challenges are more often a management mental block. As Lyon said, the “are we there yet?” mentality presents a real headwind. We recognize that mindset and culture shifts are difficult to make, and when teams are not incentivized to manage reliability, it becomes really hard. You need a distinct component that addresses reliability in order to sustain the effort.
If the team is mired in tons of bugs and a backlog of tech debt, they really aren’t even thinking about reliability. They are just trying to keep their heads above water. Unfortunately, this translates to toil and burnout, and if the broken process continues, engineers leave and that directly impacts the business. Team resilience is a foundation for SRE.
“Lyon, what market trends do you foresee in the coming years as teams adopt a blameless culture?”
Adopting a blameless culture is really an initial step or phase in practicing SRE. Now we are seeing a transition to establishing a culture and practice of reliability. Teams will start to move away from treating reliability as a “fix-it issue” and toward treating it as an asset that needs investment and continuous improvement.
“Kurt, what’s your advice to engineering teams facing tool bloat today?”
Look for a platform that gives you 80% of what you need, and remember that perfect is the enemy of good. Larger orgs now recognize that investing in a hodge-podge of point solutions actually costs more, not only in tool investment but also in people effort, and that it is not the best approach over the long haul.
Find a platform that will integrate with your core tools and that doesn’t require you to hire an entire team to maintain it. The platform should be doing the work for you and freeing up precious resources to focus on higher-value engineering work.
There’s often a debate in engineering on build vs. buy. In this case, I recommend buying. Good enough beats perfect, so use your engineering resources to focus on driving continuous improvements and please don’t build your own database to do the work.
“Kurt, Blameless recently delivered SLO features which are valuable in getting teams and other stakeholders aligned on what’s important. What advice would you give to anyone looking to adopt SLOs?”
The biggest problem with SLO adoption is that teams try to fit it into their pre-existing mental paradigm, and often they think too narrowly. It’s the streetlight effect: focusing on what’s easy.
The real value comes when you take a step back and look at what you deliver from a user journey perspective. This is where SLOs take it to the next level.
That’s not the same as saying “I’ve got 50 services running and I need to think about what is important for each distinct service.” SLOs are your scouting party that detects impending problems before customers cross an unhappy threshold. They allow the team to respond before a problem reaches a critical level. Focus on what matters to users, rather than how much your CPU is spinning.
“Lyon, what Blameless features are you most excited about?”
Insights from aggregated cross-team learning are really exciting. Retrospectives give teams useful information per incident, but it’s hard to discern a pattern emerging from, say, the last 2000 incidents. You really want to know what actions and course corrections a team takes over a specific timeframe. Being able to drill down by team will help you move forward and improve in the areas that need the most attention. Just because one team has more incidents doesn’t mean it’s less reliable. Showing how a team gets better over time, regardless of the incident volume, is much more interesting.
Blameless allows teams to look at the right data and paint a picture of what’s happening with service reliability and learning.
“Kurt, what features excite you?”
I’m excited about the possibilities around how teams can ‘shift left’ for reliability.
It’s not always about when things blow up. We want to catch things while the product is in the planning, ideation, and build phases, and consider reliability as early as possible in the cycle.
Thinking about customer happiness starts with product and development teams, and that focus should flow all the way through the life of the product. Being able to surface possible problems as early as possible has tremendous value and will certainly reduce toil and drive reliability.
Blameless is growing!
Our journey to bring reliability to all is gaining momentum, but we’re just getting started! If you’re looking to join a great team on a mission to solve engineering toil, check out our open positions. To see the platform in action, request a demo. Follow us on Twitter and LinkedIn.