Incident Response Team | Roles & Responsibilities Defined

Myra Nizami | 3.29.2024

When your organization faces outages, errors, security breaches, and other incidents, you need to have a plan in place to take appropriate actions as needed. However, you also need a capable team of experts filling critical roles and responsibilities to execute those actions and effectively collaborate to resolve issues quickly.

An incident response team should therefore be built to avoid gaps in skills and expertise. The right roles ensure the team functions efficiently to recover from outages and reduce downtime that costs your organization money. But who are those key players, and what roles and responsibilities are essential to that team?

Here we define the incident response team’s roles and responsibilities to ensure you can prepare a response plan, assess situations and deploy protective measures using a system-wide standardized response.

What is an incident response team?

An incident response team is a group of IT professionals who are responsible for preparing for, responding to, and handling any sort of system outage or downtime. The incident response team is also responsible for leading post-incident analysis and creating plans to avoid similar situations.

The goal of the incident response team is to create a centralized approach to incident response that covers the recovery of different business functions after an incident. Doing so ensures a comprehensive response to outages, errors, security breaches, and other incidents, with appropriate actions taken as needed.

The incident response team is responsible for collecting and analyzing information about incidents affecting the product and creating a plan of action for responding to them. In addition, the team discusses the incident, shares important information, and carries out other activities depending on the nature and severity of the incident.

During an active incident, the response team comes together to decide how to fix the issue. They will also determine what needs to be communicated to internal stakeholders and customers. The response team model can also include meetings at regular intervals to discuss developments, progress, and any actions needed. 

The incident response team model

Your team model should include members across different functions and business areas for comprehensive coverage, including:

  • Security and threat monitoring members
  • Management
  • Legal teams
  • Audit and risk management
  • PR and marketing
  • Development teams
  • Operation teams

Having members from each business area ensures that the team has the information it needs to cover every incident. This leads to efficient responses and minimizes damage.

Most teams are formed with existing employees who have the necessary expertise and experience. However, if the team identifies the need for other individuals with specific types of expertise, teams can bring in new hires. 

Teams need to consider hiring decisions in the context of their history with incidents. Some considerations include how many incidents are occurring, how severe they are, how they are being handled, and what kind of coverage is needed to ensure as little stress as possible even when an incident occurs. 

How should I structure my incident response team?

Companies handle incident response differently depending on the team and resources available. Remember, the point of the team is to streamline the process, avoid duplicate work and ensure you don’t overlook important tasks. By designating clear roles and responsibilities, everyone knows what their job entails yet works collaboratively to find quick resolutions. Leadership provides oversight, experts contribute to the troubleshooting process, and a single point of contact manages communication, so everyone is up to date on the status of the response.

Although diversity is key, the most critical roles would include:

  • Team leader/Incident Manager: The team leader’s primary responsibility is to bring together and coordinate incident response to ensure the focus remains on solving the problem at hand. The incident manager holds the authority to coordinate and direct team members to execute the incident response effort. They also have the power to designate responsibilities and choose specific skilled members to perform ad hoc roles to respond to unexpected issues. When something requires attention and no one else is available, they perform the task themselves.
  • Investigative lead: Responsible for evidence collection and analysis and for directing the response, they also identify the root cause and implement changes to avoid future issues. Investigative leads often act as the root cause analyst, coordinating the postmortem and logging and tracking remediation tickets.
  • Communications specialist: The specialist keeps internal stakeholders and teams updated on progress throughout incident response. They also manage public communications as required, such as writing and sending external communications about the incident and updates for the status page. When applicable, they might manage expectations by collecting customer responses while keeping high-level stakeholders informed.
  • Analysts: Analysts document and analyze team activities, monitor the networks, create timelines, and perform an initial analysis of the evidence and threats, all of which is critical during a response. Also known as tech leads, they develop theories about what has happened and why, provide input to the technical team, and work closely with the team leader/incident manager. Their theories are usually documented, as are any actions taken, as contributions to the incident postmortem. They can also consult with other responders and recommend appropriate subject matter experts to bring into the response.

The bottom line is that the more technically diverse your response team is, the better able they are to handle a broader range of situations and quickly identify threats. A diverse team also contributes to innovative problem-solving, which minimizes damage while reducing the risks of attacks.

What skills should an incident response team have?

The technical ability to investigate incidents is, of course, the most crucial skill for response team members. Although not all team members will have this, anyone directly investigating the incident should be able to understand what’s going on and spot anomalies and issues. That knowledge should include relevant tools and architectures, knowledge of your organization’s codebase, and malicious code analysis. Intrusion detection and vulnerability management are also crucial in this context.

Besides technical expertise, there are other skills needed for response teams to be successful, such as investigative and analytical ability. Any incidents occurring need to be investigated and analyzed thoroughly to understand why they occurred, who or what system was impacted, and which team members are needed.

After the incident, actions need to be analyzed through incident retrospectives, also known as postmortems, and other tools to understand how to improve moving forward. Alongside investigative skills, understanding and analyzing necessary computer forensics evidence is incredibly important too. 

Another key element is communication skills. During and after a response, there are many key players who need to understand the progress being made and the steps being taken. Being able to determine what information to share and to communicate it effectively to internal leadership, stakeholders, and customers is imperative.

What are the typical processes for an incident response team?

Daily tasks will vary depending on whether there is an active incident or not. Along with security tools, incident response teams are there to monitor and detect security breaches. 

They’ll need to look at anomalies across different areas such as traffic, account access, excessive usage of resources, and any suspicious requests that might come through. If there is any deviation from standard patterns, incident response teams can raise the alarm to bring in other team members as needed. 

When threats are detected, a centralized approach helps keep everything streamlined. Teams will create incident timelines and begin investigating the anomalies they’ve detected. Teams can set up automation to create preliminary responses to anomalies until the incident response team can solve the issue. 

After the incident, the team carries out post-incident measures. These include isolating the issues and problems encountered with the incident response plan and tracking relevant metrics. Some of the metrics that your team can use to measure itself after an incident are:

  • Mean Time to Detect (MTTD): This measures how long it takes to detect an incident. It can also be broken down by whether detection was internal (i.e., a team member flagging an issue) or external (i.e., reported by users or administrators).
  • Detection accuracy/false-positive rates: This rate shows teams what percentage of alerts are valid threats versus false alarms. Too many false alarms can lead to efficiency issues and distract teams from real incidents, so it’s important to keep this rate down.
  • Mean Time to Respond/Repair (MTTR): Once an incident is identified, how long does it take to respond and repair the issue? This metric is used to understand the impact of the incident and how long it takes to come up with a solution and implement it. It provides insight into how well the response is going and can help with finding opportunities for improvements and automation that could help. 
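
The three metrics above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed implementation: the `Incident` record, its field names, and the function names are all hypothetical, and it assumes you already record when each incident started, was detected, and was resolved. The false-alarm figure is computed here as the share of all alerts that turned out not to be real incidents, which is one reasonable reading of the definition above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime    # when the problem actually began
    detected: datetime   # when a person or tool flagged it
    resolved: datetime   # when the fix was confirmed


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between an incident starting and being detected."""
    seconds = mean((i.detected - i.started).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average gap between detection and resolution."""
    seconds = mean((i.resolved - i.detected).total_seconds() for i in incidents)
    return timedelta(seconds=seconds)


def false_alarm_rate(total_alerts: int, valid_alerts: int) -> float:
    """Share of alerts that were false alarms (0.0 when there were no alerts)."""
    if total_alerts == 0:
        return 0.0
    return (total_alerts - valid_alerts) / total_alerts
```

Tracking these values per quarter or per service, rather than as a single global number, usually makes it easier to see where detection or repair is slowing down.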

SRE teams play an integral role in both incident management and incident response. SREs are responsible for designing and activating response protocols when a threat is detected to handle the situation. SREs can also implement automation and run retrospectives (postmortems) after the incident is dealt with to understand how to improve moving forward.

Tips for Incident Response Team Members

As the saying goes, the best-laid plans of mice and men often go astray. So, while you’ve got your team and plan in place, the first alarm can set things in motion in a far different manner than what was imagined or planned for. As a result, you can use these tips to keep you focused and agile when an event occurs.

Merge Human Experience with Chosen Tools

Although your team painstakingly sought out technology to detect incidents, they can’t just sit back and wait. Instead, they should actively be watching for suspicious activity and investigate before an event even occurs, such as:

  • Traffic anomalies: Anomalies indicate issues related to connectivity, reconnaissance, or credential abuse, including:
    • Increases or decreases in traffic
    • Traffic from inconsistent addresses
    • Unexpected traffic
  • Suspicious access: In this case, bad actors could be trying to access restricted files or system areas. Although this seems obvious, it’s important to look for specific access attempts. A good example would be a superuser who rarely or never tries to access components despite having permission to do so and is suddenly very active.
  • Excessive consumption: This could mean the system is being used for crypto mining or other abuses of your resources, or it could point to possible data exfiltration or malware infections. Examples of this might include:
    • Sudden drops in performance
    • Increases in resource demand
    • Large exports of data

Set Acceptable Behavior Metrics

The best way to assess issues when investigating suspicious behavior is to set baselines for what you consider “acceptable.” This provides something to measure against, so you know when action is required. A useful tool is a user and entity behavior analytics (UEBA) solution, which can create your baselines so you can watch for behavior deviations. You can also feed your own baselines into the system when you’ve identified a new line of acceptable behavior, to constantly improve your evaluation abilities.
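
To make the idea of a baseline concrete, here is a minimal sketch of one simple way to flag deviations. It is not how any particular UEBA product works. It assumes you have a list of historical measurements for one metric (for example, requests per minute) and treats any new value more than a chosen number of standard deviations from the mean as a deviation. The function names and the default threshold of 3 are illustrative assumptions.

```python
from statistics import mean, stdev


def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Summarize historical samples as (mean, standard deviation).

    Needs at least two samples, because stdev() is undefined for fewer.
    """
    return mean(samples), stdev(samples)


def deviates(value: float, baseline: tuple[float, float],
             threshold: float = 3.0) -> bool:
    """Return True if value is more than `threshold` standard deviations
    away from the baseline mean."""
    mu, sigma = baseline
    if sigma == 0:
        # History was perfectly flat, so any change counts as a deviation.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

For example, with a history of `[100, 102, 98, 101, 99]` requests per minute, a new reading of 101 stays within the baseline, while a reading of 130 is flagged. Real traffic often has daily and weekly cycles, so a production baseline would typically be kept per time window rather than as one global average.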

Centralize Monitoring and Logging Information  

A centralized monitoring and logging database provides ongoing information, putting events into context during the evaluation process. Finding an effective security information and event management (SIEM) solution can make life easier by collecting data from all systems and providing it to you in a central location.
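
As a small illustration of why a central view helps, the sketch below merges events from several sources into one chronological timeline. It is a toy stand-in for what a SIEM does at far larger scale, not a description of any product. The `Event` type and `merge_timelines` function are hypothetical names, and the sketch assumes each source already delivers its events in time order.

```python
import heapq
from datetime import datetime
from typing import Iterable, NamedTuple


class Event(NamedTuple):
    """Hypothetical log event; field names are assumptions."""
    timestamp: datetime
    source: str     # e.g. "app", "db", "firewall"
    message: str


def merge_timelines(*streams: Iterable[Event]) -> list[Event]:
    """Merge already time-sorted event streams into one chronological list.

    heapq.merge only compares the head of each stream, so it relies on
    every input stream being sorted by timestamp beforehand.
    """
    return list(heapq.merge(*streams, key=lambda e: e.timestamp))
```

Seeing a database warning sit between two application errors in a single timeline is the kind of context that is hard to spot when each log lives in its own system.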

Create a Response “Toolbox”

Centralized maintenance tools allow your team to become more responsive and efficient. This “toolbox” empowers teams to manage configurations and maintain systems, swiftly moving from task to task. One way of achieving this is to use a security orchestration, automation, and response (SOAR) solution to collect information, automatically deploy protective measures, and provide standardized responses across all systems.

Act Using Data-Based Decisions

Gut reactions can be effective, but they can also cause responders to jump to conclusions. Your team needs to determine the seriousness of alerts to guide their decisions accurately. This calls for data as an investigative tool that allows them to quickly assess an alert rather than dismiss it out of hand. Dismissing a real event is a costly oversight that paves the way for more serious attacks. Using data as part of the investigation process ensures you either confirm your suspicions or discover that something far more sinister is at play. This reduces the risk of adding to workloads unnecessarily and helps avoid downtime that puts systems at risk.

How can Blameless help? 

A robust incident response protocol needs strong tools to implement the plan developed. Using the right tools helps provide additional insight and data required to manage incident response effectively. 

Blameless helps teams streamline incident management, from incident detection and role assignments to runbook checklists, retrospectives, reliability insights, and SLO management.

To learn more about how Blameless can benefit incident response teams, schedule a demo today or subscribe to our newsletter below. 
