This is your Guide for Implementing SRE in NOCs

Emily Arnott | 10.1.2020

Network Operations Centers, or NOCs, serve as hubs for monitoring and incident response. A NOC is usually a physical location within an organization, where operators sit at a central desk with screens showing current service data. But the functionality of a NOC can also be distributed. Some organizations build virtual NOCs that are staffed fully remotely, which allows for distributed teams and follow-the-sun rotations. NOC as a service is another structure gaining in popularity: the NOC is outsourced to a third party that offers it as a service, similar to other infrastructure tools.

As IT services become more fragmented, shifting to virtual NOCs becomes more popular. These structures are far removed from the traditional big desk model, but their functions are the same. Any system where operators are able to monitor for incidents and respond to them can serve as a NOC.

The goals of NOC operators and SREs are aligned. Both try to improve the reliability of the system. In fact, SRE best practices applied to the NOC structure can take reliability to a new level. In this blog post, we’ll look at how SRE can improve NOC functions such as system monitoring, triage and escalation, incident response procedure, and ticketing.

Monitor smarter by focusing on complex metrics

The traditional image of a NOC is a huge grid of monitors showing every detail of the service’s data. A team of operators watches like hawks, catching any warning signs of incidents and responding. This system has several advantages. The completeness of the data displayed ensures nothing is missed. Also, having eyes on glass at all times promotes timely responses.

The SRE perspective on monitoring is different. Monitoring and alerting focus on metrics that reflect customer impact, known as Service Level Indicators, or SLIs. Instead of relying on human observers, monitoring tools send alerts when these metrics cross set thresholds. After iteration, these systems can be more reliable than a human observer. Yet this doesn't mean incidents won't slip through the cracks: SRE teaches us that failure in any system is inevitable. Human observers in a NOC can remain an essential extra layer of monitoring, especially for organizations with multiple operating models, a mix of legacy and modern technologies, and strict governance and control requirements.

To achieve the best of both worlds of your NOC and SRE practices, you’ll need to understand what response each of your metrics requires. For simple metrics that you can pull directly from system data, automated responses can save toil for your NOC operators. More nuanced metrics where an expert’s judgment may be necessary can be discussed in the NOC. This allows operators to focus on where their expertise is necessary. Monitoring tools handle the rest.
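As a rough illustration of that split, the sketch below tags each metric with the kind of response it needs and routes any breached threshold accordingly. The metric names, thresholds, and the `MetricRule` and `route_alerts` helpers are hypothetical and chosen only for this example; a real monitoring tool would supply its own equivalents.

```python
from dataclasses import dataclass
from enum import Enum


class Response(Enum):
    AUTOMATED = "automated"    # a tool can act without a human
    NOC_REVIEW = "noc_review"  # an operator's judgment is needed


@dataclass
class MetricRule:
    name: str
    threshold: float   # alert when the reading exceeds this value
    response: Response


def route_alerts(rules, readings):
    """Return (metric name, response) for every reading above its threshold."""
    alerts = []
    for rule in rules:
        value = readings.get(rule.name)
        if value is not None and value > rule.threshold:
            alerts.append((rule.name, rule.response))
    return alerts


# Hypothetical rules: a simple system metric and a more nuanced one.
rules = [
    MetricRule("disk_usage_percent", 90.0, Response.AUTOMATED),
    MetricRule("checkout_error_rate", 0.01, Response.NOC_REVIEW),
]
readings = {"disk_usage_percent": 95.0, "checkout_error_rate": 0.002}

for name, response in route_alerts(rules, readings):
    print(f"{name}: {response.value}")
```

Keeping the response type next to the threshold means the decision about who handles a metric is made once, when the rule is written, rather than during an incident.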

Escalate and triage with classification and on-call

When a NOC operator notices an incident, their typical mode of operation is to first triage and try to remediate the issue via runbooks and existing documentation. They determine the severity and service area of the incident. Based on this, they escalate and engage the correct people for the incident response. In a traditional NOC structure, there’s a dedicated on-call team for incident response.

In the SRE world, things become less siloed. Incident classification applies across the organization. The developers most closely involved with each service area are also responsible for on-call shifts, rather than laying that responsibility squarely on dedicated on-call teams. NOC operators can collaborate with engineers on developing fair and effective on-call schedules. Yet NOC procedures for alerting don’t need to change. All of the infrastructure set up to alert and escalate will still apply. SRE only increases the range and effectiveness of these alerts by involving more experts. As service complexity grows, ensuring that a wide variety of experts can respond to incidents is essential.
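One way to picture this is a small routing function that uses the incident's severity and service area to decide whom to page. This is only a sketch: the rota contents, the 1-to-3 severity scale, and the rule that the most severe incidents also page a NOC duty lead are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    service_area: str
    severity: int  # hypothetical scale: 1 is the most severe


# Hypothetical rota: service area -> developer currently on call for it.
ON_CALL = {
    "payments": "dev-payments-oncall",
    "search": "dev-search-oncall",
}
FALLBACK = "noc-duty-lead"  # paged when no service owner is known


def responders_for(incident, on_call=ON_CALL):
    """Pick who to page based on the incident's service area and severity."""
    owner = on_call.get(incident.service_area, FALLBACK)
    responders = [owner]
    # Assumption: the most severe incidents also page the NOC duty lead.
    if incident.severity == 1 and FALLBACK not in responders:
        responders.append(FALLBACK)
    return responders


print(responders_for(Incident("payments", 1)))
print(responders_for(Incident("search", 3)))
```

The existing alerting path stays the same; only the lookup table behind it grows as more service owners join the rotation.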

Get the most from incidents with meaningful response

Many steps in a NOC’s incident response procedure overlap with SRE best practices. Both use runbooks (and ideally some automation) to accelerate responses with set checks and steps. Both log incident data and track incident patterns. Both observe monitoring data and triage as soon as problems emerge. Implementing SRE best practices into the NOC structure isn’t about changing procedures as much as changing perspective.

SRE teaches us to view incidents as unplanned investments in reliability. They aren’t signs of failure, but something to celebrate as an opportunity to learn and grow. In the NOC structure, incidents may end up siloed within the NOC team. SRE aims to break these silos by sharing lessons between NOC operators and development teams. NOC operators’ experiences dealing with incidents can inform future development.

SRE encourages recording every aspect of an incident into a comprehensive incident retrospective (also known as a postmortem, post-incident response, etc.). Often, these documents are created after the fact. A lot of effort is spent tracking down the relevant data and communication. Instead, the SRE perspective suggests that the NOC operator should kick off this document as soon as the incident is detected. All alerted responders would then contribute their insights, communication, and relevant information as the incident response proceeds. 
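To make the idea concrete, here is a minimal sketch of a retrospective that is opened at detection time and filled in as responders work. The `Retrospective` and `TimelineEntry` classes and the sample notes are hypothetical; an incident management tool would normally provide this structure.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TimelineEntry:
    at: datetime
    author: str
    note: str


@dataclass
class Retrospective:
    title: str
    opened_by: str
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    timeline: list = field(default_factory=list)

    def add_note(self, author, note, at=None):
        """Append an entry; responders call this while the incident is live."""
        entry = TimelineEntry(at or datetime.now(timezone.utc), author, note)
        self.timeline.append(entry)
        return entry

    def contributors(self):
        """Everyone who has written to the document, in order of first appearance."""
        seen = [self.opened_by]
        for entry in self.timeline:
            if entry.author not in seen:
                seen.append(entry.author)
        return seen


# The NOC operator opens the document as soon as the incident is detected.
retro = Retrospective("Checkout latency spike", opened_by="noc-operator")
retro.add_note("noc-operator", "Alert fired; first runbook step had no effect")
retro.add_note("payments-oncall", "Rolled back the latest deploy")
print(retro.contributors())
```

Because every note is timestamped as it is written, nobody has to reconstruct the timeline from chat logs afterwards.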

After the incident, the document would be shared across the organization, along with follow-up action items whose progress is reported on. Review meetings would then focus on learning from the response. By sharing NOC perspectives, engineers can better account for reliability concerns within future development sprints. And, by hearing the perspectives of on-call engineers, NOC operators can improve their incident response procedures.

Manage ticketing from a customer-focused perspective

A common function of NOCs is the management of ticketing systems. NOC operators may serve as the stewards of ticketing systems such as ServiceNow, BMC Remedy, JIRA, or other tools. They examine the monitoring data of the service in the context of the progress and effects of these tickets. By understanding where development is on a given project or fix, they can adjust their expectations for what data they’ll observe from that service area. NOC operators can also create new tickets addressing issues they observe. SREs do not typically share these responsibilities, but SRE best practices can still enhance these processes.

One of the major functions of a ticketing system is helping engineers prioritize development goals. Issues are assessed on severity, scope, and time required to complete. When NOC operators create a ticket relating to service reliability, they need to contextualize it along these metrics. But how do you prioritize a concern about the availability of a service, a report of a bug in the codebase, and a proposed new feature? It comes back to the quintessential question: how do you balance speed, scope, and quality?

SRE suggests working backwards from the customer’s perspective. Think about what your customers care most about and how each ticket impacts that. To measure this impact, use best practices like SLOs to set a threshold of when customer experience is at risk of reaching unacceptable levels. Estimated timelines to completion can be evaluated based on the SLO. By orienting ticketing systems around this perspective, NOCs can ensure that development goals align with customer success.
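One common way to turn an SLO into a prioritization signal is an error budget: the amount of unreliability the SLO still permits. The sketch below assumes that approach, which the post itself does not spell out, and uses made-up ticket IDs, service names, and event counts. Tickets for the service with the least budget left are ranked first.

```python
def error_budget_remaining(slo_target, good_events, total_events):
    """Fraction of the error budget still unspent (negative once the SLO is missed)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_events == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


def rank_tickets(tickets, budgets):
    """Order tickets so services with the least budget remaining come first."""
    return sorted(tickets, key=lambda ticket: budgets[ticket["service"]])


# Hypothetical services, SLO targets, and request counts.
budgets = {
    "search": error_budget_remaining(0.99, good_events=9_950, total_events=10_000),
    "checkout": error_budget_remaining(0.999, good_events=99_910, total_events=100_000),
}
tickets = [
    {"id": "T-1", "service": "search"},
    {"id": "T-2", "service": "checkout"},
]

for ticket in rank_tickets(tickets, budgets):
    print(ticket["id"], ticket["service"], round(budgets[ticket["service"]], 2))
```

In this example checkout has spent most of its budget, so its ticket moves ahead of the search ticket even though both services are still within their SLOs.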

Blameless can help NOC teams implement SRE best practices with tools for SLOs, incident retrospectives, and more. To see how it all works, schedule a demo.
