In the opening moments of an engineering incident, the most important aspect of a response plan is speed. Getting out of the gate quickly by leveraging automation to assemble the team can save precious time and make the difference between happy and unhappy customers downstream. This is why we’re excited to announce the integration of Blameless with OpsGenie. Integrating with Opsgenie allows Blameless users to quickly and intelligently assemble the right team members at the outset of an incident by accessing data from Opsgenie’s service catalog.
Check out a demo of the integration in action!
Automation in the early moments of your response process
The opening moments of an incident are often the most chaotic and unstructured of the entire incident response process. Critical questions like “Who owns this?”, “Who needs to be notified?”, and “How are we bringing the response team together?” all need to be answered before the work of diagnosing the problem and restoring service can begin. Products like OpsGenie help to resolve some of this ambiguity through the development of a service catalog and escalation protocols.
The OpsGenie service catalog allows ownership over different microservices to be easily defined and recorded. This simple step of documenting ownership becomes incredibly useful when something breaks. It provides ops personnel or on-call engineers with a roadmap for “Who” might be able to help restore service.
This serves as the backbone of an organization’s escalation protocol. OpsGenie also provides a flexible interface for defining an escalation protocol by team or by service. This can determine who receives the first page, based on the on-call calendar. It can also determine the conditions under which leadership is notified.
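To make the idea of an escalation protocol concrete, here is a minimal sketch in Python. It is not OpsGenie's data model or API: the `EscalationStep` class, the rotation names, the people, and the timings are all hypothetical, chosen only to show how a policy plus an on-call calendar can decide who gets the first page and when leadership is brought in.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class EscalationStep:
    rotation: str      # hypothetical rotation or group name
    after: timedelta   # how long the alert has gone unacknowledged


# Hypothetical escalation policy for a single service.
CHECKOUT_POLICY = [
    EscalationStep("primary-on-call", timedelta(minutes=0)),
    EscalationStep("secondary-on-call", timedelta(minutes=10)),
    EscalationStep("engineering-leadership", timedelta(minutes=30)),
]

# Hypothetical snapshot of the on-call calendar: rotation -> current person.
ON_CALL_NOW = {
    "primary-on-call": "alice",
    "secondary-on-call": "bob",
    "engineering-leadership": "carol",
}


def people_to_notify(policy, on_call_now, unacknowledged_for):
    """Return everyone who should have been notified so far."""
    return [
        on_call_now[step.rotation]
        for step in policy
        if step.after <= unacknowledged_for
    ]
```

With these made-up values, an alert that has gone unacknowledged for 12 minutes has reached the primary and secondary on-call engineers, and leadership is only notified once the 30-minute threshold passes.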
By tapping into these two aspects of OpsGenie, Blameless allows users to harness the power of the service catalog and escalation protocols to automate and accelerate the assembly of an incident response team. When an incident is acknowledged, OpsGenie automatically alerts the right service owners and Blameless recruits the full roster of relevant people to the incident channel. This gets triage underway quickly.
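As a rough illustration of automated team assembly, the sketch below builds a responder roster from a service catalog. It does not call any Blameless or OpsGenie API. The `Service` class, the catalog entries, the team names, and the choice to include the owners of direct dependencies are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    owner_team: str
    depends_on: tuple[str, ...] = ()  # assumption: catalog records dependencies


# Hypothetical service catalog and on-call assignments.
CATALOG = {
    "checkout": Service("checkout", "payments", ("api-gateway",)),
    "api-gateway": Service("api-gateway", "platform"),
}
TEAM_ON_CALL = {"payments": "alice", "platform": "bob"}


def assemble_roster(service_name, catalog, team_on_call):
    """Collect on-call owners of the affected service and its direct dependencies."""
    service = catalog[service_name]
    roster = []
    for name in (service.name, *service.depends_on):
        responder = team_on_call[catalog[name].owner_team]
        if responder not in roster:  # avoid inviting the same person twice
            roster.append(responder)
    return roster
```

Here an incident on the hypothetical `checkout` service would bring in the on-call engineers for both the payments and platform teams, with no one having to look up ownership by hand.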
OpsGenie vs. Blameless - What’s the Difference?
Automating the assembly of the incident response team is just one step along what we call The Golden Thread of Incident Response. This Golden Thread is a workflow that begins when an error is detected and continues downstream through alerting, assembly, incident resolution, retrospective and even follow-up remediation.
OpsGenie has gained great popularity as a tool for alerting and on-call management in particular.
[OpsGenie] Alerting: Opsgenie ensures their customers will never miss a critical alert. By combining the service catalog and escalation protocol capabilities discussed earlier with integrations into monitoring, ticketing, and incident response tools like Blameless, Opsgenie helps users group alerts, filter out noise, and get the information teams need to begin resolution.
[OpsGenie] On-call Management: OpsGenie also provides industry-leading on-call management tools. Users can build and modify schedules and define escalation rules within one interface. This ensures your team always knows who is on-call and accountable during incidents, and gives you confidence that critical alerts will always be acknowledged.
[Blameless] Incident response: Once an alert is acknowledged, Blameless takes over. The chatbot interface guides members of the incident response team through a codified playbook, from identification through diagnosis and remediation, all while keeping stakeholders informed with flexible, automated communications.
[Blameless] Incident learning: Once the incident has been resolved, Blameless gives engineering teams the tools to analyze the incident itself. Across root cause, team performance, impacted downstream services, and on-call load, Blameless allows leaders to develop actionable insights related to both their team and their infrastructure.
How does the integration of Blameless and OpsGenie amplify the value of each?
“The speed with which ops teams respond to an initial alert has a significant impact on the overall time to resolve the incident,” stated Paul Nashawaty, Principal Analyst for Application Modernization, Enterprise Strategy Group. “Being able to move quickly from acknowledging an alert to working on a resolution is made more difficult when services ownership is unclear or the assembly process is manual. Blameless integrating with OpsGenie can eliminate lag time between acknowledging and addressing the incident by automating team assembly based on the user’s escalation protocol,” added Nashawaty.
The integration of Blameless and OpsGenie can help organizations in a number of ways beyond simply speeding up assembly of the incident response team.
Playbook for your on-call rotation
Every organization needs a plan for both “who” will respond in a crisis and “how” that team will respond. Without understanding the “who”, your on-call team wastes time searching for domain experts to help restore service. Without a defined “how”, the incident team is stuck creating a response process as they go. In each scenario, precious time is lost, increasing the likelihood that customers are impacted or revenue is lost. Combining Blameless and OpsGenie allows users to maximize the benefits of both by assembling the team fast, then guiding them through a clearly defined, repeatable response process.
Centralized data across the workflow
The engineering tool ecosystem is expanding rapidly, presenting some significant challenges to data analysis. Data on alerts, impacted services and on-call rotation is most valuable when it can be paired with incident performance data. Doing so requires that data from multiple tools be collected and organized in one place. The unified data model that Blameless is built on makes the capture and ingestion of data from monitoring and alerting tools quick and easy, which in turn makes it possible to complete a wealth of incident analytics within Blameless using Reliability Insights.
All triggered alerts can now be tracked as an event in the Blameless incident timeline, with the name of the user who started the alert, the name and link to the service, and the team which has been notified. This is ultimately included in the retrospective report and also downstream analytics so you know exactly what happened and when. This is really the downstream benefit of centralized data. Beyond the data, however, Blameless also provides a structure for retrospectives that encourages the development of a healthy learning culture. Custom questions can help teams explore certain aspects of the incident more deeply. Ultimately, all the learnings developed can also be passed to engineering project management tools through integrations.
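The alert events described above carry a small, fixed set of fields, which is what makes them easy to reuse in a retrospective. The following sketch shows one way such an event could be modeled and rendered in time order. The `AlertEvent` class, its field names, the sample values, and the example.com link are hypothetical and do not reflect the actual Blameless data model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AlertEvent:
    occurred_at: datetime
    started_by: str      # user who started the alert
    service_name: str
    service_url: str     # link to the service (placeholder URL below)
    notified_team: str


def render_timeline(events):
    """Return one human-readable line per event, oldest first."""
    lines = []
    for event in sorted(events, key=lambda e: e.occurred_at):
        lines.append(
            f"{event.occurred_at:%Y-%m-%d %H:%M} UTC | "
            f"alert started by {event.started_by} | "
            f"service {event.service_name} ({event.service_url}) | "
            f"notified {event.notified_team}"
        )
    return lines


# Hypothetical events, deliberately listed out of order.
EVENTS = [
    AlertEvent(datetime(2024, 1, 1, 9, 20, tzinfo=timezone.utc), "bob",
               "api-gateway", "https://example.com/services/api-gateway", "platform"),
    AlertEvent(datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc), "alice",
               "checkout", "https://example.com/services/checkout", "payments"),
]
```

Because every event records who, which service, and which team, the same records can feed both the retrospective report and later analytics without re-entering data from the alerting tool.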
The importance of connecting your incident response tools
Things break; it’s inevitable. Whether that happens during the regular course of business or during your release process, the responsibility of the operations team is the same: restore service with minimal impact on the business and on customers. By connecting all the tools used to support operations, your team can more effectively respond to incidents. Integrating your incident response tooling creates a centralized location for the collection and analysis of data, enabling faster decision making and reducing the potential for errors, while improving collaboration and communication among incident response teams.
"I have less anxiety being on-call now. It’s great knowing comms, tasks, etc. are pre-configured in Blameless. Just the fact that I know there’s an automated process, roles are clear, I just need to follow the instructions and I’m covered. That’s very helpful."
"I love the Blameless product name. When you have an incident, "Blameless" serves as a great reminder to not blame anything or anyone (not even yourself) and just focus on the incident resolving itself."