Parenting Incident Retrospectives with Morgan Kelly

9.3.2020

We’re all familiar with workplace incidents, but have you considered all the incidents that happen at home? Parents also deal with incidents many times a day, ranging from Sev 3 all the way to the dreaded Sev 0. Below are three example retrospectives of incidents that occurred on a single day in the life of Blameless Account Executive Morgan Kelly, who also happens to be dad to Carmen and Gianna. Who knew that parents were SREs, too?

Incident 031920.1 Sev 2 “Morning Spills”

An incident involving the Frigidaire 14.8 cu. ft. refrigerator, the garage sink, and a cup of coffee resulted in a temporary household outage from 7:15 AM to 7:22 AM PST. Estimated cost: 50% of the morning error budget, 2 rolls of paper towels.

Timeline

  • At approximately 7:15 AM PST on-call engineer Morgan Kelly discovered that the top shelf of the cold goods storage unit had been affected by a cup of spilled coffee.
  • 7:16 AM PST Morgan realized that the spillage affected more than the top shelf of the unit, reaching all the way down to the crisper bin. Morgan called in the first shift lead Ina to help remediate.
  • 7:20 AM PST Initial spillage was remediated, but a secondary spillage of milk was discovered on the countertop due to excessive dancing by daughters Carmen and Gianna.
  • 7:22 AM PST secondary spillage remediated.

SLO and SLA standing

  • SLO intact with 50% of error budget remaining
  • SLA with elementary school intact with 23 minutes before breach

Root cause analysis

  1. Milk spilled on the counter due to dancing. Girls were dancing because pop music was playing. Pop music was playing because girls were in a silly mood. Girls were in a silly mood because they were tired from staying up late last night. Bedtime was delayed by 1.5 hours last night. Bedtime was delayed by beginning a movie at 8 PM PST.

Action items

After our root cause analysis, it has been determined that the team experienced an outage due to a spillage caused by delayed bedtime. Moving forward, no movies will begin after 7 PM PST. Morgan and Ina will also be taking the additional precaution of queuing a more relaxed playlist in order to avoid further breakfast dance parties.

Incident 031920.2 Sev 1 “Common Cold”

At 12:34 PM PST Morgan was notified by the school that daughter Carmen was ill. The incident could potentially last for days, but it posed no risk of an SLA breach. Morgan assigned Ina as communications lead for the incident, and she contacted the on-call, grandma Oma, to arrange pickup from the on-prem outage location.

Timeline

  • 12:34 PM PST Morgan was notified about an outage at elementary school. Daughter Carmen was experiencing cold-like symptoms.
  • 12:35 PM PST Morgan began triaging the incident. The phone call confirmed that Carmen was non-functioning and needed to be taken home.
  • 12:38 PM PST Morgan added Ina to the incident Slack channel and assigned her the role of communications lead. Morgan filled Ina in on Carmen’s status and proposed looping in on-call engineer Oma for backup.
  • Key communication from Slack: Morgan “Ina, Carmen is experiencing a potential Sev 1 incident. Can you focus on mitigation rather than project work for the time being?”
  • Ina “Looping in Oma who has previous experience with similar outages, including the winter flu outage of 2018 which she was incident commander for.”
  • 12:40 PM PST Ina added Oma to the incident Slack channel and assigned her the task of picking Carmen up from elementary school.
  • 1:02 PM PST Oma checked off the task “pick Carmen up from school.” Grandma closed the incident, but suggested that Morgan and Ina attend to Carmen as soon as possible.

SLA and SLO standing

Not applicable for this outage, no affected customers. This is an internal outage only.

Root cause analysis

  1. Carmen feeling ill at school. Carmen had a sleepover last weekend with a friend who was sniffling. Carmen likely contracted a cold from her friend.

Action items

To lower the chances of Carmen getting sick again, it’s advised that Morgan and Ina take additional precautions when planning sleepovers such as watching for symptoms of illness in other attendees.

Incident 031920.3 Sev 0 “Sudden Tumble”

At 5:19 PM PST Gianna had a critical outage lasting 12 minutes while on break at her gymnastics class, due to a failed tumble. Morgan and Ina were the responding engineers. Estimated cost: 150% of breaktime error budget, five tissues, a granola bar.

Timeline

  • 5:19 PM PST Gianna fell when attempting a tumble at gymnastics. This resulted in a scraped knee. Gianna began to cry.
  • 5:20 PM PST The gymnastics coach was the first responder to this incident, but as a junior engineer she was unable to remediate. She called senior engineers Morgan and Ina to assist and assigned Morgan as incident commander. Ina was assigned as communications lead. The coach called a ten-minute break, establishing the SLA for this rolling window.
  • 5:21 PM PST Morgan and Ina took Gianna offline for 3 minutes. Outside the gym, they investigated the cause of the outage.
  • 5:25 PM PST Morgan and Ina determined that the incident could be mitigated with a hug and a granola bar.
  • 5:27 PM PST The hug and granola bar fix was deployed. Gianna reloading.
  • 5:31 PM PST Gianna returned to gymnastics practice and all services returned to a normal state.

SLO and SLA standing

  • SLO breached when the 8-minute error budget was exceeded by 4 minutes.
  • SLA was breached when outage surpassed the 10-minute mark.
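The arithmetic behind this standing can be checked with a short sketch. The function and field names below are hypothetical and chosen only for illustration. The inputs come from the retrospective itself: a 12-minute outage, an 8-minute error budget, and a 10-minute SLA window set by the coach's break.

```python
from dataclasses import dataclass


@dataclass
class BudgetStanding:
    """Hypothetical summary of an incident's SLO/SLA standing."""
    consumed_pct: float     # share of the error budget used, in percent
    overrun_minutes: float  # minutes beyond the budget (0 if within budget)
    slo_breached: bool      # True if the outage exceeded the error budget
    sla_breached: bool      # True if the outage exceeded the SLA window


def budget_standing(outage_minutes, budget_minutes, sla_minutes):
    """Compare an outage duration against an error budget and an SLA window."""
    return BudgetStanding(
        consumed_pct=100.0 * outage_minutes / budget_minutes,
        overrun_minutes=max(0.0, outage_minutes - budget_minutes),
        slo_breached=outage_minutes > budget_minutes,
        sla_breached=outage_minutes > sla_minutes,
    )


# Figures from the "Sudden Tumble" retrospective: 5:19 PM to 5:31 PM is 12 minutes.
standing = budget_standing(outage_minutes=12, budget_minutes=8, sla_minutes=10)
print(standing)
```

With these inputs, the consumed share works out to 150% of the break-time error budget, matching the estimated cost stated in the incident summary, and both the SLO and the SLA register as breached.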

Root cause analysis

  1. Gianna faced a critical outage when she missed a tumble at gymnastics, hurting her knee. Gianna was less focused than normal and missed her tumble. Gianna was less focused because she had a difficult day with homework and forgot to eat her after-school snack. Gianna had a difficult time with homework because she doesn’t like history. Gianna finds history boring, so she doesn’t pay as much attention in class as she should.

Action items

Gianna’s outage was caused by multiple factors. The first factor was lack of food. In the future, Morgan and Ina will conduct preventative maintenance by providing a granola bar on the car ride to gymnastics. Frustration with homework was another contributing factor in this incident. Morgan and Ina will work on an action plan to interest Gianna in history, making her homework less difficult and more fun. Possible solutions include children’s books, shows, and movies pertaining to her current subject matter.

Automation and playbooks will also be employed to mitigate the risk of future errors. Gianna will practice her tumbling for 10 minutes each night, even when she does not have gymnastics class, in order to make this movement an automatic, toil-free action, possibly mitigating future tumbling-related incidents.

Morgan and Ina had one busy day! As Morgan says, “Parenting, there are different severity incidents every day. And it's funny because working at Blameless, I think about the world in the context of SRE. How can we resolve incidents better?” By using incident retrospective best practices, Morgan and Ina work together to create a blameless and happy home for Carmen and Gianna and continuously improve at the craft of parenting.
