Jim, tell us about yourself and your background.
I fell in love with computer programming when I was 12 and spent the first half of my career as a software engineer. I love solving deeply technical problems, and I also love well-designed products. My first job out of college was at Apple, where I was able to pursue both my technical and non-technical interests. That’s where I learned how to build really great products.
In 2008, I had the good fortune of joining a five-person startup called New Relic. I ended up staying with the company for 11 years, seeing it through many phases of growth. I learned how to deliver enterprise software at scale, assess new market opportunities and empathize with customers. The challenges I faced at New Relic form the basis of what Blameless is on a mission to achieve. While customers love new features, they expect your service to be reliable - especially in their time of need.
During your time at New Relic, the company grew to $500m, went public and consistently led in the APM/O11y market. Can you share some milestones and highlights you’re proud of?
We started out in just one market, with a product for Ruby on Rails. Over time, we added support for other languages and delivered additional product offerings such as real user monitoring, synthetics and mobile support. Each of these allowed us to expand our TAM, and each is a memorable milestone for me.
During that time, we also moved up-market to cater to the needs of larger enterprises. In 2013 we were named a Leader in Gartner’s APM Magic Quadrant, and in 2014, when the company went public, we had roughly 400,000 unique users.
In 2015, the engineering team went through a major restructuring in order to keep up with the demands of the business. At that time, with 150+ engineers, we decided to form an SRE team to adopt the best practices described in Google’s landmark SRE book, which was fairly new then. Looking back, there was a ton of learning and we definitely benefited from doing it. However, if we had had a product like Blameless, it would have saved us countless hours of precious engineering time.
You not only oversaw engineering but also product management and design at New Relic. What are some lessons learned and how does that inform your thinking at Blameless?
Peers often ask me questions such as: how do product management and engineering stay aligned? How do you consistently deliver a roadmap that is meaningful to customers? How do you use your best engineers most efficiently?
That’s one of the reasons I was so attracted to Blameless. Many of the things an SRE would do are built right into the platform. This not only saves time but also unburdens the team from having to recall what steps need to be taken. When everyone knows what they are supposed to do - how to triage, communicate, and complete tasks - the team gets its work done much more efficiently. The opposite is chaos, where no one knows what they’re supposed to do or what outcome they’re driving towards, and at that point it becomes impossible to scale.
Over the years, I’ve taken on-call rotations. It’s eye-opening to experience an incident for yourself versus hearing how your team experienced it. It changes your views and opinions, and it definitely changes how you think about reliability. The residual impact an incident can cause is often far greater than the 10- or 20-minute downtime window. It can be quite damaging to team productivity and morale if not proactively managed.
You clearly understand the function of software engineering and specifically DevOps - what trends have you seen in that function that excite you?
First, reliability is no longer a “nice to have”. It’s critical to the success of any digital business, and it’s no longer confined to the engineering team. How you define reliability depends on your business, but it always starts with how you manage incidents - incidents are the lowest common denominator of every reliability program. Teams going through rapid growth realize they need help, and nobody wants to build something in-house when good tooling is available.
Culturally, organizations have started to accept that incidents are inevitable and not something to avoid. The next natural step is to invest time, resources and tooling to drive change.
For management, the conversation tends to center on what teams are getting pulled away from and why. It becomes more about what you are not doing to expand the core product and drive revenue. With incident data at your fingertips, you may learn that the slowest-velocity dev teams are experiencing the most incidents. With insights such as this, you can make informed decisions to reallocate resources or reprioritize projects. It’s great to see management ask all the right questions; it demonstrates that they have witnessed the impact when incidents don’t run smoothly.
The more visibility functional leaders and management have, the more attention and investment reliability receives - which improves morale and, certainly, customer satisfaction.
Many DevOps teams are now moving to SRE practices. Can you explain why that’s happening?
In my mind, DevOps was originally focused on deploying code to production and ensuring basic availability. Honestly, it was hard enough just to get code deployed, especially in a cloud-native, infrastructure-as-code environment. The industry spent a lot of time streamlining and improving that. Now that the industry has matured, the focus has shifted from “get it running” to “make sure it’s running reliably.” This is where SRE comes in.
By adopting SRE, you build in reliability from the beginning. Once you’ve achieved critical mass and a regular release cycle, you’re likely using a deployment pipeline or a CI/CD platform. Now you need to think about how that pipeline helps to build a more reliable system.
Reliability crosses over beyond DevOps and into pure dev. Often when failures occur, it’s because traffic has increased or perhaps the database you deployed fails because it wasn’t adequately configured. This is what an SRE practice cares about. Are you thinking about reliability each time you push fixes or new features? Are you being proactive versus reactive? SRE is driving this, and that’s how we’ve naturally arrived where we are today.
Let’s take an example from another industry. Car manufacturers can put defective cars on the market and later fix them at dealerships, or they can slow the assembly line and fix issues there. Why wouldn’t they elect to do the latter? It avoids customer complaints and warranty service. There are clear parallels to SRE.
Every dev team should be thinking about service ownership and reliability. They should be the ones taking on-call, since they are in the best position to react to a problem with their service. In less modern times, the NOC (Network Operations Center) would be on-call, but in a cloud-native world, software moves faster and requires the team that builds the software to keep it running.
What problems is Blameless focused on solving for engineering / DevOps teams today?
We are helping drive the industry forward by enabling teams to be proactive with reliability. The specific problems we solve are:
- Faster incident resolution through automated incident management workflows. Reducing toil and manual work empowers engineers and frees up time to focus.
- Improved learning through retrospective reports, which uplevels an organization and reduces repeat incidents.
- Reliability improvements driven by data-driven insights that give management critical information, from MTTx metrics to incident causes and potential team burnout.
- Reduced customer churn as services become more reliable.
Which part of the Blameless product is the most valuable for individual engineers?
Users should get started by running better incident playbooks and learning how to communicate, coordinate, and conduct follow-up actions with retrospective learning. This will bring down your MTTR (mean time to recovery). Incidents that previously took 20 minutes can be reduced to 10, and that adds up over the course of a week, month or quarter.
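To make that claim concrete, here is a hypothetical back-of-the-envelope calculation (the incident rate of five per week is an assumption for illustration, not a figure from the interview):

```python
# Hypothetical savings when MTTR drops from 20 to 10 minutes per incident.
incidents_per_week = 5            # assumed incident rate (illustrative)
mttr_before_min = 20              # minutes per incident before
mttr_after_min = 10               # minutes per incident after

saved_per_week = incidents_per_week * (mttr_before_min - mttr_after_min)
saved_per_quarter = saved_per_week * 13   # ~13 weeks in a quarter

print(f"Saved per week: {saved_per_week} minutes")        # 50 minutes
print(f"Saved per quarter: {saved_per_quarter} minutes")  # 650 minutes (~11 hours)
```

Even under these modest assumptions, halving MTTR returns roughly a full working day of engineering time per quarter.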
Which part of the product is the most valuable for Managers and leaders?
Reliability Insights is the most valuable part of the product for anyone in a leadership position. This module allows for pattern matching and a close look at the data insights gleaned over time. It answers key questions such as: are similar incidents occurring across different teams? Can we detect similarities across incidents to learn about causal effects? With rich data, you find out which team is facing the most incidents, or further slice the data by incident type and severity. Knowing the on-call frequency for specific team members helps prevent burnout. Management can’t always be on the front lines of on-call, but data reports can paint a nuanced picture that brings the entire team together.
How does Blameless drive culture change with customers?
We help customers operate in a professional, streamlined way. At Blameless we are on a mission to help the industry, and that means helping software engineers operate smoothly so they can focus on their craft.
When teams buy into a culture of reliability, it drives change. When companies invest in Blameless, it shows they care about their customers’ experience and the employee experience of their engineers. We’ve codified SRE practices into the Blameless product, and it naturally drives behavioral change. Take one customer as an example: BetterCloud knows that improvements in customer happiness are related to improved product reliability and to its new approach and culture mindset. Its engineering teams no longer shy away from declaring incidents. Rather, “they enjoy interacting with the process”.*
*Click here to read the BetterCloud case study.