
The Incident Review Marathon: 89 RCAs in Six Hours

What?

Whenever I join a new place, I like to do an Engineering audit. Leadership loves it because it provides an unbiased reality check, and for me, it tells me what I need to focus on in the first 90 days.

Reliability stood out as one of the most important areas to focus on. There were around two SEV-0/1 production incidents every week. My philosophy was simple to begin with: it's alright to fail the first time because of something new, BUT it would be an unforgivable sin to fail the second time because of the same cause.

A decent RCA process had been in place previously, but it was abandoned in 2025. Good RCAs generate the right action items that prevent an incident from recurring. So I was determined to go down the path of reviewing these past incidents. There were 89 unreviewed incidents from the last 6 months.

Part 1 - Creating the plan

Before I ran a single agent, I spent time designing the plan. The quality of the output would be entirely determined by the quality of the instructions going in.

In my head it was Find or Create the RCAs -> Pattern analysis across all RCAs -> Create a Reliability Plan

The first decision was where to look for data. Three systems held relevant information. Jira was the primary source: every incident had a ticket, and some engineers had written detailed notes in the comments summarizing what happened and how it was resolved. I specified key fields including custom ones to pull data from.

Slack was the second source, but inconsistent - very few incidents had a dedicated incident channel. I knew which channels were Engineering-related and which were customer-facing, so I built that context into the plan rather than leaving agents to figure it out. Notion was the third source: the team had a COE (Correction of Error) hub, though the coverage was thin - of the 43 SEV-0/1 incidents, only 3 had formal COEs in the hub.

The second decision was what the RCA output should look like. The team had an existing template that was already well structured, so I just used that.

The instructions for the pattern analysis were straightforward: once all RCAs were done, run a synthesis across root cause categories, client concentration, detection source, version patterns, and repeat incident clusters. The goal was to find the signal in 89 separate incident reports that no individual report could show.

The instructions for the reliability plan were also clear: take what the pattern analysis found and turn it into four concrete workstreams - SLI/SLO definitions, smoke test expansion, vendor fallback priorities, and SPOF elimination.

What I did not specify was the intermediate step. The LLM added a “produce output” phase between pattern analysis and reliability plan: a prioritized systemic issues list, ranked by frequency times severity times client impact, with proposed fix, owner, and estimated effort per issue. I did not ask for this. With it, every recommendation in the plan traces back to a specific cluster of incidents with a concrete impact. The reliability plan became defensible rather than just directional.

Part 2 - Iteratively improving the prompt and RCA template

I ran this in multiple phases. SEV-0/1 incidents first - the 43 highest severity tickets where getting the RCA quality right mattered most. SEV-2/3 came next, 46 incidents, with a lighter output format since the goal there was pattern coverage rather than deep investigation. The not-an-incident tickets were last, 13 tickets with a different objective entirely.

The iterations happened primarily in Phase 1. By the time I moved to Phase 2, most of the template and prompt problems were already solved.

Each agent ran as a background subagent via Claude Code’s Agent tool. One agent, one incident, fully independent. No shared state between agents in a batch. Each one pulled its own data from Jira, Slack, and Notion, wrote the RCA file, and exited.

Batch sizes were 5 for Phase 1. My role during a batch was nothing - kick it off and wait. The active work happened between batches: read a sample of outputs, diagnose what broke, fix the template or prompt, start the next batch.
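The batch mechanics are simple enough to sketch in Python. This is an illustrative outline only, not Claude Code's actual Agent tool API - `run_rca_agent` is a hypothetical stand-in for spawning one fully independent background subagent per incident:

```python
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 5  # Phase 1 batch size

def run_rca_agent(incident_key: str) -> str:
    """Placeholder for one background subagent.

    In the real run this was Claude Code's Agent tool: each agent
    pulled its own Jira/Slack/Notion context, wrote the RCA file,
    and exited, with no shared state between agents in a batch.
    """
    return f"incident-review-data/rcas/{incident_key}.md"

def run_batches(incident_keys: list[str]) -> list[str]:
    """Run incidents in independent batches; return written RCA paths."""
    written = []
    for i in range(0, len(incident_keys), BATCH_SIZE):
        batch = incident_keys[i : i + BATCH_SIZE]
        # Kick off the whole batch and wait - no work during a batch.
        with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
            written.extend(pool.map(run_rca_agent, batch))
        # Between batches: sample outputs, fix template/prompt, continue.
    return written
```

The point of the structure is the pause between batches - that is where the template and prompt iteration described below happened.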

The first batch came back usable but inconsistent. Not wrong RCAs - just structurally all over the place. Some agents returned the 5 Whys as a prose paragraph. Others invented a three-column table. Root cause labels were freeform: one agent wrote “database connection exhaustion,” another wrote “infra failure,” a third wrote “pool sizing issue” - all describing the same class of problem. Cross-incident pattern analysis on that output would have been meaningless.

The work between batches was diagnosing what kind of problem each failure was. The 5 Whys problem was template ambiguity - the format was not specified tightly enough. I rewrote it as a strict 2-column table with a concrete example in the template. The root cause label problem was prompt ambiguity - agents had too much freedom. I added a fixed taxonomy: vendor-outage, change-failure, infrastructure-spof, configuration-error, dependency-failure, data-issue, unknown. Pick one. Same for detection source.

Early batches returned action items without any meaningful prioritization, or with inconsistent priority criteria. I added explicit definitions: P0 means fix within 2 weeks and prevents recurrence of a SEV-0/1. P1 means fix within 4 weeks. P2 is observability and tech debt. P3 is documentation and backlog. Once those definitions were in the prompt, the prioritization became consistent enough to trust.

By the time I moved to the SEV-2/3 run, the template and prompt were in good shape. The second phase also had a different output format - not full RCAs but lighter summary rows, since the goal for lower-severity incidents was pattern coverage rather than deep investigation. The not-an-incident investigation was different again: agents were not asked to write an RCA at all. The task was narrower - trace who applied the NAI label, when, and whether the evidence justified it. Same infrastructure, completely different framing.

Incident Review Plan
# Incident Review Plan - SEV-0 and SEV-1

## Overview

Systematic review of SEV-0/SEV-1 incidents: triage, write RCAs, identify patterns, and produce a prioritized list of systemic issues to fix.

**Jira base filter (adjust date range per run):**
project = INCIDENT AND cf[10385] in ("SEV-0", "SEV-1") AND created >= "<START_DATE>" AND (labels is EMPTY OR labels not in ("not-an-incident")) ORDER BY created DESC

## Step 1: Triage and write RCAs

Triage and writing are combined - the agent researches each incident, triages it (checks for existing RCAs, classifies), and drafts the RCA in a single pass.

### Buckets

| Bucket | Action |
|--------|--------|
| Needs RCA | Real incident, no RCA documented. Draft RCA, then route to service owner. |
| Has RCA | RCA exists (in Notion, Google Docs, or Jira comments). Extract and standardize into template. |
| Cekura detection - merge | Cekura autocut that correlated with a real production incident. Merge with the parent incident's RCA. Downgrade severity. |
| Cekura detection - downgrade | Cekura autocut with no correlated production impact. Transient test-only failure. Downgrade to SEV-2 or label not-an-incident. |
| Vendor issue | Root cause is external vendor. Document vendor, impact, and whether fallback exists. |

### Where RCAs live
- Local RCA files: incident-review-data/rcas/INCIDENT-XXXX.md
- Notion COE hub: https://www.notion.so/skit-ai/COE-Correction-of-Error-Docs-b7afa4ec809c4ea1b184d25bcf946070
- COE DB data source: b289402a-e8bd-4d27-be7e-37b15494dff3
- COE Template (Notion original): https://www.notion.so/59d83d2322b84ec8aa7712b86e31e6b2
- RCA Template (local, canonical): incident-review-data/rca-template.md
- Exemplar RCAs (for agent prompt): rcas/INCIDENT-1568.md (standard resolved incident), rcas/INCIDENT-1590.md (vendor/infra SPOF, simpler), rcas/INCIDENT-1582.md (Cekura autocut - merge), rcas/INCIDENT-1595.md (Cekura autocut - standalone)

### Jira fields to pull per incident

Standard fields: summary, description, status, assignee, reporter, labels, comments

Custom fields:

| Field | Custom Field ID |
|-------|----------------|
| Severity | customfield_10385 |
| Component Impacted | customfield_10386 |
| Region Impacted | customfield_10387 |
| Client Impacted | customfield_10388 |
| Source | customfield_10389 |
| Project Impacted | customfield_10393 |
| Software Version Impacted | customfield_11221 |
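A minimal sketch of the data pull, assuming Jira Cloud's REST `search` endpoint and the custom field IDs from the table above. Authentication and error handling are elided, and `requests` is imported lazily so the pure helpers run without it:

```python
JIRA_BASE = "https://vernacular-ai.atlassian.net"

# Opaque Jira custom field IDs mapped to readable names (from the table above).
CUSTOM_FIELDS = {
    "customfield_10385": "severity",
    "customfield_10386": "component_impacted",
    "customfield_10387": "region_impacted",
    "customfield_10388": "client_impacted",
    "customfield_10389": "source",
    "customfield_10393": "project_impacted",
    "customfield_11221": "software_version",
}

def search_params(jql: str) -> dict:
    """Build query params for GET /rest/api/2/search."""
    fields = ["summary", "description", "status", "assignee",
              "reporter", "labels", "comment"] + list(CUSTOM_FIELDS)
    return {"jql": jql, "fields": ",".join(fields), "maxResults": 100}

def extract_custom_fields(issue: dict) -> dict:
    """Map customfield_* IDs in a Jira issue payload to readable names."""
    f = issue.get("fields", {})
    return {name: f.get(cf_id) for cf_id, name in CUSTOM_FIELDS.items()}

def fetch_incidents(jql: str, auth: tuple) -> list:
    """Fetch matching issues; `auth` is (email, api_token)."""
    import requests  # lazy: only needed for the live call
    resp = requests.get(f"{JIRA_BASE}/rest/api/2/search",
                        params=search_params(jql), auth=auth, timeout=30)
    resp.raise_for_status()
    return resp.json()["issues"]
```

In the actual run the agents did this pull themselves via the Atlassian MCP tool; this sketch just shows what "pull these fields per incident" amounts to.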

### Context sources for drafting each RCA
1. Jira - issue details, description, all fields above, and comments. Also check comments for links to Notion RCAs, GitLab MRs, or COE tickets (e.g., COE-40).
2. Slack - search for incident key (e.g., "INCIDENT-1586") to find discussion threads. Also search related terms (client name, error description) if the incident key search returns limited results.
3. Notion - check for existing RCA pages. Note: some RCAs live outside the COE hub as standalone pages (not registered in the COE DB). Check Jira comments for direct Notion links.

### RCA format (local template: incident-review-data/rca-template.md)
Each RCA follows the local template (derived from the Notion COE Template with additions):
1. Incident Metadata - Jira fields (including Reporter, Labels), links, related incidents, fix MRs, vendor ticket IDs
2. Incident Metrics - Duration, Time to Detect (TTD), Time to Resolve (TTR). For unresolved incidents: use "NOT RESOLVED" for TTR and note duration as "X+ days and counting"
3. Classification - Root Cause Category and Detection Source (structured fields for pattern analysis)
4. Triage Classification - (Cekura autocut incidents only) Bucket, correlation, recommendation
5. Summary - Brief summary of incident, duration, and impact
6. Impact - Quantitative (calls affected, duration) and qualitative (user experience)
7. Timeline - Key events during the incident
8. How did we notice it? - Detection source and time to detect
9. Root Cause - Background and the actual issue
10. 5 Whys Analysis - Problem statement broken down in 2-column table format (| | |)
11. How did we mitigate it? - Temporary and permanent fix with timelines. For unresolved: describe workarounds and current status.
12. Learnings - Could detection, root cause finding, or resolution time be improved?
13. Action Items - Bullet list with - [ ] checkboxes, bold priority tag, and Owner. Not a table. Format: - [ ] **P0** <action> - Owner: <name>

Root Cause Categories: vendor-outage, change-failure, infrastructure-spof, configuration-error, dependency-failure, data-issue, unknown

Detection Sources: Cekura-autocut, Opsgenie-alert, CS-Team, Delivery-Team, Tech-Team, Client-reported, Campaign-Ops
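Because the taxonomies are fixed, drift is mechanically checkable. A sketch of a validator, assuming the `**Root Cause Category:**` and `**Detection Source:**` field format from the template:

```python
import re

ROOT_CAUSE_CATEGORIES = {
    "vendor-outage", "change-failure", "infrastructure-spof",
    "configuration-error", "dependency-failure", "data-issue", "unknown",
}
DETECTION_SOURCES = {
    "Cekura-autocut", "Opsgenie-alert", "CS-Team", "Delivery-Team",
    "Tech-Team", "Client-reported", "Campaign-Ops",
}

def check_classification(rca_text: str) -> list:
    """Return taxonomy violations for one RCA markdown file."""
    errors = []
    cat = re.search(r"\*\*Root Cause Category:\*\*\s*(\S+)", rca_text)
    src = re.search(r"\*\*Detection Source:\*\*\s*(\S+)", rca_text)
    if not cat or cat.group(1) not in ROOT_CAUSE_CATEGORIES:
        errors.append("Root Cause Category missing or outside taxonomy")
    if not src or src.group(1) not in DETECTION_SOURCES:
        errors.append("Detection Source missing or outside taxonomy")
    return errors
```

A freeform label like "infra failure" fails the check, which is exactly the class of drift that made cross-incident analysis meaningless in the first batch.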

### RCA drafting tips
- For Cekura autocut incidents, always check incidents created around the same time - they often correlate with a real production incident.
- For Cekura incidents, include a Triage Classification section: bucket, correlated with production impact (yes/no), related incidents, and recommendation (keep/downgrade/merge).
- Jira comments are often the richest source of root cause detail - engineers post debugging findings, MR links, and resolution notes there.
- Where information is missing, write [NEEDS INPUT FROM SERVICE OWNER] so the reviewer knows what to fill in.
- For related incidents, mention the relationship in the agent prompt so it can focus on what's unique to this incident rather than re-investigating the same Slack threads.
- 5 Whys must use the 2-column table format from the template. Never use numbered lists or 3-column tables.
- Action Items must use - [ ] **P0** <action> - Owner: <name> bullet format with a bold priority tag (P0/P1/P2/P3). Never use tables for action items.
- Priority guidelines for action items:
  - P0 (fix within 2 weeks): Prevents recurrence of a SEV-0/1 incident, or fixes a systemic issue affecting multiple clients
  - P1 (fix within 4 weeks): Prevents recurrence of a SEV-2/3 incident, or adds critical monitoring/alerting
  - P2 (fix within 6 weeks): Improves observability, adds non-critical monitoring, addresses tech debt
  - P3 (backlog): Documentation, runbooks, process improvements, vendor follow-ups
- For vendor-outage incidents, include Vendor Ticket IDs in metadata.
- For unresolved incidents, use "NOT RESOLVED" for TTR, note current status in metadata, and describe partial workarounds in mitigation section.
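The action-item format rules above are also lintable. A sketch of a checker for the lines of an RCA's Action Items section, assuming the `- [ ] **P0** <action> - Owner: <name>` bullet format:

```python
import re

# - [ ] **P0** <action> - Owner: <name>
ACTION_ITEM = re.compile(r"^- \[ \] \*\*(P[0-3])\*\* .+ - Owner: .+$")

def lint_action_items(lines: list) -> list:
    """Flag Action Items section lines that break the required format."""
    problems = []
    for n, line in enumerate(lines, 1):
        stripped = line.lstrip()
        if stripped.startswith("- [ ]") and not ACTION_ITEM.match(stripped):
            problems.append(f"line {n}: bad action item format: {line!r}")
        if stripped.startswith("|"):  # tables are banned for action items
            problems.append(f"line {n}: action items must not be tables")
    return problems
```

Running this over each draft between batches is a cheap way to catch formatting regressions without re-reading every RCA.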

### Agent prompt template

When spawning a background agent to draft an RCA, use this prompt structure:

You are drafting an RCA (Root Cause Analysis) for a production incident at Skit AI, a voice AI company.

**Incident:** INCIDENT-XXXX - <summary>
**Jira:** https://vernacular-ai.atlassian.net/browse/INCIDENT-XXXX
**Related context:** <any known related incidents or context to mention>

**Your task:**
1. Fetch the Jira issue details and ALL comments using mcp__atlassian__getJiraIssue
2. Search Slack for "INCIDENT-XXXX" using mcp__claude_ai_Slack__slack_search_public_and_private to find discussion threads
3. If the incident key search returns limited Slack results, also search for related terms (client name, error description)
4. Check Notion for existing RCA: search for "INCIDENT-XXXX" or the incident summary using mcp__claude_ai_Notion__notion-search
5. Compile a complete RCA following the EXACT template format below

**IMPORTANT formatting rules:**
- 5 Whys MUST use the 2-column table format shown in the template (| | | with |---|---|)
- Action Items MUST use bullet list format with priority: - [ ] **P0** <action> - Owner: <name>
- Never use tables for Action Items
- Each action item MUST have a bold priority tag (P0/P1/P2/P3) based on severity and impact
- Use [NEEDS INPUT FROM SERVICE OWNER] for any missing information
- For vendor-outage incidents, include Vendor Ticket IDs in metadata
- If the incident is unresolved, use "NOT RESOLVED" for TTR
- Never use em dashes; use regular hyphens

**Template (follow this EXACTLY):**
<paste contents of rca-template.md here>

**Exemplar RCA (follow this style):**
<paste contents of an appropriate exemplar RCA here>

Write the RCA file to: incident-review-data/rcas/INCIDENT-XXXX.md

### Process
For each incident:
1. Draft the RCA using AI agent - pulling context from Jira, Slack, and Notion
2. Write the RCA file locally
3. Route the draft to the relevant service owner for review and correction
4. Service owner validates/corrects the RCA and adds any missing details
5. Create the COE page in Notion under the COE hub using the template
6. Register the COE in the COE DB with appropriate tags
7. Link the Notion COE page back to the Jira INCIDENT ticket

### RCA routing by component owner
- Bot Runtime and Experience - Deepankar
- Telephony and OCM - Vipul
- Infrastructure - Aditya
- Others / PG Integrations / Data - Respective owners

## Step 2: Pattern analysis

Output: incident-review-data/pattern-analysis.md

Once RCAs are collected, analyze across all incidents for:

2a. Failure mode patterns
Group incidents by root cause category: vendor-outage, change-failure, infrastructure-spof, configuration-error, dependency-failure, data-issue

2b. Client concentration
Which clients are disproportionately affected?

2c. Detection source analysis
Who is finding incidents first? What % are detected by non-engineering teams?

2d. Version/architecture patterns
Which software versions are most fragile?

2e. Repeat incidents
Identify incidents with the same root cause recurring.
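Step 2a reduces to parsing the Classification line out of each RCA and tallying. A sketch, assuming the template's `**Root Cause Category:**` field format:

```python
import re
from collections import Counter

ROOT_CAUSE_RE = re.compile(r"\*\*Root Cause Category:\*\*\s*(\S+)")

def root_cause_counts(rca_texts) -> Counter:
    """Tally Root Cause Category across a collection of RCA markdown texts."""
    counts = Counter()
    for text in rca_texts:
        m = ROOT_CAUSE_RE.search(text)
        counts[m.group(1) if m else "unparseable"] += 1
    return counts
```

Feeding it the contents of `Path("incident-review-data/rcas").glob("INCIDENT-*.md")` gives the 2a breakdown directly; the same pattern works for detection source, client, and version.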

## Step 3: Produce output

3a. Prioritized systemic issues list
Rank by: (frequency x severity x client impact). For each:
- Description of the systemic issue
- Incidents it caused
- Proposed fix or investigation
- Owner
- Estimated effort
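The 3a ranking can be sketched as a scoring function. The severity weights below are hypothetical, chosen for illustration - the plan specifies only frequency x severity x client impact:

```python
# Assumed severity weights; the plan does not fix these numbers.
SEV_WEIGHT = {"SEV-0": 4, "SEV-1": 3, "SEV-2": 2, "SEV-3": 1}

def issue_score(incidents: list) -> int:
    """Score one systemic issue cluster: frequency x severity x client impact.

    Each incident dict carries the severity and affected-client fields
    parsed from its RCA.
    """
    frequency = len(incidents)
    severity = max(SEV_WEIGHT.get(i["severity"], 1) for i in incidents)
    clients = len({c for i in incidents for c in i.get("clients", [])})
    return frequency * severity * max(clients, 1)

def rank_issues(clusters: dict) -> list:
    """Rank named issue clusters highest-score first."""
    return sorted(((name, issue_score(inc)) for name, inc in clusters.items()),
                  key=lambda t: t[1], reverse=True)
```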

3b. Severity reclassification recommendations
Which incidents were mis-classified, and how should the severity framework change to prevent this going forward?

3c. Detection gap report
For each incident, was it detected by engineering or reported by delivery/CS/client? Use this to prioritize where smoke tests and alerting need to be added.

## Step 4: Feed into reliability plan

Output: incident-review-data/reliability-plan.md

The output of this review directly feeds into:
- SLI/SLO definitions - failure modes tell us what to measure
- Smoke test expansion - detection gaps tell us what to test
- Vendor fallback priorities - vendor incidents tell us where we need redundancy
- SPOF elimination - infrastructure patterns tell us where to invest

## Verification
- [ ] All incidents in scope have been triaged into a bucket
- [ ] All "Needs RCA" incidents have a written RCA with action items
- [ ] Pattern analysis document is complete
- [ ] Prioritized systemic issues list is shared with CEO and team

## Run history
- Run: Aug 2025 - Feb 2026 - 43 incidents, all complete

RCA Template
# RCA: INCIDENT-XXXX - <Jira summary>

## Incident Metadata
- **Severity:** SEV-0 / SEV-1
- **Component Impacted:** <from Jira customfield_10386>
- **Region Impacted:** <from Jira customfield_10387>
- **Client Impacted:** <from Jira customfield_10388>
- **Source:** <from Jira customfield_10389>
- **Project Impacted:** <from Jira customfield_10393>
- **Software Version:** <from Jira customfield_11221>
- **Status:** <from Jira>
- **Assignee:** <from Jira>
- **Reporter:** <from Jira>
- **Created:** <from Jira>
- **Resolved:** <date/time of resolution, or "NOT RESOLVED" if still active>
- **Jira Link:** https://vernacular-ai.atlassian.net/browse/INCIDENT-XXXX
- **Labels:** <from Jira, e.g. vendor-outage, rca-pending>
- **Related Incidents:** <if any - INCIDENT-YYYY (summary), INCIDENT-ZZZZ (summary)>
- **Fix MRs:**
  - <GitLab MR link (branch name - description)>
- **COE Ticket:** <if any - COE-XX link>
- **Existing Notion RCA:** <if any - link to existing Notion RCA by service owner>
- **Vendor Ticket IDs:** <if vendor-outage - vendor reference numbers>

### Incident Metrics
- **Duration:** <incident start to resolution, e.g. "~5 hours (10:00 - 15:20 IST)". For unresolved: "22+ days and counting (2026-02-04 to present, unresolved)">
- **Time to Detect (TTD):** <incident start to first human awareness, e.g. "~20 minutes". Include a brief note on how it was detected.>
- **Time to Resolve (TTR):** <first human awareness to resolution, e.g. "~4 hours 40 minutes". For unresolved: "NOT RESOLVED">

### Classification
- **Root Cause Category:** <one of: vendor-outage, change-failure, infrastructure-spof, configuration-error, dependency-failure, data-issue, unknown>
- **Detection Source:** <one of: Cekura-autocut, Opsgenie-alert, CS-Team, Delivery-Team, Tech-Team, Client-reported, Campaign-Ops>

### Triage Classification
- **Correlated with real production impact?** Yes / No
- **Severity assessment:** <Is the Jira severity correct? If not, what should it be and why?>

## Summary

<Brief summary of the incident: what happened, how long it lasted, and what the impact was. 2-4 sentences.>

## Impact

### Quantitative impact
- <Number of calls/clients affected, duration of outage, etc.>
- <Use [NEEDS INPUT FROM SERVICE OWNER] if data is unavailable>

### Qualitative impact
- <What was the user/client experience during the incident?>

## Timeline

| Date/Time (IST) | Event |
|------------------|-------|
| YYYY-MM-DD HH:MM | <event> |

## How did we notice it?

<Who noticed the incident first and how? Include time from incident start to detection. Mention whether it was detected by engineering (monitoring/alerting) or reported by non-engineering teams (CS, delivery, client).>

## Root Cause

### Background
<Context about the system/service involved. Enough for a reader unfamiliar with the system to understand the issue.>

### The Issue
<What specifically went wrong and why. Include technical details - code paths, error messages, configuration details. Link to MRs and commits where relevant.>

## 5 Whys Analysis

| | |
|---|---|
| Problem Statement | <one-line problem statement> |
| Why? | <first why> |
| Why? | <second why> |
| Why? | <third why> |
| Why? | <fourth why> |
| Why? | <fifth why - should reach the systemic/organizational root cause> |

## How did we mitigate it?

<What was done to stop the bleeding (immediate mitigation) and what was done as a permanent fix? Include timelines for each. For unresolved incidents, describe partial workarounds and current status.>

## Learnings

1. **Do we think the time it took to know that an incident has occurred needs to be reduced?**
   <Answer with Yes/No and explanation>

2. **Do we think the time it took to find the root cause needs to be reduced?**
   <Answer with Yes/No and explanation>

3. **Do we think the time it took to resolve the issue needs to be reduced?**
   <Answer with Yes/No and explanation>

## Action Items

- [ ] **P0/P1/P2/P3** <Specific action item> - Owner: <name>
- [ ] **P0/P1/P2/P3** <Use [NEEDS INPUT FROM SERVICE OWNER] for items requiring service owner input>

Pattern Analysis & Reliability Plan

The outputs of the pattern analysis and the reliability plan were entirely Claude-generated. I specified what dimensions to analyze and what the plan should cover. The agents did the rest.

The pattern analysis ran across 20 sections - root cause breakdown, client concentration, detection source, region, repeat incident clusters, duration, TTD, severity accuracy, Cekura false positive rate, and more. I did not write any of it. What I did was define the questions upfront in the plan: which clients are disproportionately affected, who is finding incidents first, which failure modes are repeating. The agents read all 89 RCAs and answered those questions.

The reliability plan was the same. I gave it four workstreams: SLI/SLO definitions, smoke test expansion, vendor fallback priorities, SPOF elimination. I told it to map each recommendation back to the specific incidents that justified it. What came back was a 12-week sprint-level roadmap with owners, a 6-month target of 62% reduction in SEV-0/1 incidents, and every recommendation tied to a cluster of real incidents with frequency and impact scores attached.

My contribution was knowing what questions to ask and what the output needed to look like to be useful. Generating the answers at that scale and that level of detail was not something I could have done.


The next piece of real work begins now with reviewing the P0 and P1 items with service owners.

What’s next

The template and prompt are reusable. I have used them on a handful of recent incidents and the output quality is consistent. The interesting problem I ran into: I could not compare my RCA against the incident owner’s RCA because they had also used Claude to write theirs. My Claude versus their Claude. The finding quality was similar enough that the comparison was not useful. The process is becoming table stakes faster than I expected.

The bigger idea is further out. Agentic incident management - not post-mortem analysis but live incident response. Observability systems feed into agents that surface a ranked list of possible root causes based on recent changes, error patterns, and historical incidents. The incident commander reviews the list and picks the cause. Separate agents draft the code fix, open the PR, run the checks, and flag it for human review before merge. The incident commander stays in the loop at each decision point but is not doing the legwork.

Leadership Takeaway

Three years ago, when I ran an RCA process inside a 400-person engineering team, the hard problems were human ones.

The writing itself is hard. The engineer who just spent 12 hours firefighting is also the one who has to sit down and write a structured document. They are burned out, they want to move on, and they are staring at a blank template. Even when they try, the details are already fading. The exact sequence of events, why a specific call was made, what the error message said - memory decays fast. RCAs written a week after the incident are shallow by default. That is why the 72-hour window matters, and that is exactly when people are least able to write.

Before anyone writes a word, they are manually pulling context from three or four systems. Jira for the timeline, Slack for the real-time debug thread, logs for the error details, dashboards for the impact data. That is 45 minutes of grunt work before the actual thinking starts.

And then there is the blame problem. People water down root causes to avoid pointing fingers. The 5 Whys stops at “human error” or “process not followed” instead of getting to the systemic cause. The RCA looks complete but the insight is not there. The same incident happens again six months later.

Today I don’t have to solve for any of those problems. The context gets pulled automatically. The RCA gets written while the incident is still fresh. The 5 Whys follows the template. The action items get tagged.

If you have a reliability problem and no process around it, this is where I would start. Start by understanding what actually happened in the last six months. The rest follows from that.

