How RCA Works
1. Investigation Triggered
RCA begins when an incident is created (automatically or manually) or when an investigation is started manually from the incident detail page. The system creates a dedicated AI conversation and queues an RCA task in the background. A dedicated investigation run is created to track all findings, hypotheses, and evidence.
2. Agent Activation
Based on your connected infrastructure (AWS, Kubernetes, databases, etc.), relevant specialized agents are activated. Agent Anna coordinates the investigation, while specialists (Alex, Tony, Kai, Oliver) focus on their domains. Agents work in parallel, sharing findings through a centralized incident tool.
3. Context Gathering Phase
Agents explore infrastructure topology, collect baseline metrics, identify affected services, analyze deployment history, and examine recent configuration changes. The investigation automatically advances to the next phase when findings are complete.
4. Analysis Phase
AI forms 2-5 competing hypotheses about potential causes. Each hypothesis is tested by examining logs, traces, and dependencies. Evidence is systematically collected to confirm or rule out each theory. As evidence contradicts hypotheses, they are eliminated, narrowing the investigation.
5. Resolution Phase
The remaining confirmed hypothesis becomes the identified root cause. Targeted evidence is added to the chain (3-6 curated items), remediation suggestions are generated (1-3 focused actions), and the incident disposition is set with confidence scoring.
Investigation Phases
RCA follows a structured three-phase investigation workflow. Each phase has a specific purpose, and agents track phase progress in the timeline for real-time visibility.
Phase Status Tracking
Each phase progresses through states: pending → in_progress → completed (or skipped). A minimal sketch of this tracking follows the list below.
The system automatically:
- Records phase start and completion timestamps
- Calculates phase duration (seconds)
- Requires findings summary for completed/skipped phases
- Auto-completes previous phase when new phase starts
- Auto-advances to next phase when current phase completes
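The bookkeeping above could be modeled roughly as follows (a minimal Python sketch; the Phase class, field names, and advance helper are illustrative assumptions, not CloudThinker's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Phase:
    # Illustrative model of a single investigation phase.
    name: str                               # "context" | "analysis" | "resolution"
    status: str = "pending"                 # pending -> in_progress -> completed (or skipped)
    started_at: Optional[datetime] = None
    completed_at: Optional[datetime] = None
    findings_summary: Optional[str] = None

    def start(self) -> None:
        self.status = "in_progress"
        self.started_at = datetime.now(timezone.utc)

    def complete(self, findings_summary: str) -> None:
        # A findings summary is required when a phase is completed or skipped.
        self.findings_summary = findings_summary
        self.status = "completed"
        self.completed_at = datetime.now(timezone.utc)

    @property
    def duration_seconds(self) -> Optional[float]:
        if self.started_at and self.completed_at:
            return (self.completed_at - self.started_at).total_seconds()
        return None

def advance(phases: list, current: int, findings: str) -> int:
    # Auto-completes the current phase and auto-starts the next one.
    phases[current].complete(findings)
    nxt = current + 1
    if nxt < len(phases):
        phases[nxt].start()
    return nxt
```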
Phase 1: Context Gathering
Duration: Typically 2-5 minutes
The agents collect infrastructure context to establish baseline conditions.
Activities:
- Topology Exploration: Maps affected services and their dependencies (if topology available)
- Service Identification: Determines which services are impacted and severity level
- Metric Collection: Gathers CloudWatch, Prometheus, Datadog, and custom metrics
- Baseline Analysis: Compares incident-time metrics to historical baselines
- Deployment History: Identifies recent deployments within last 24-48 hours
- Configuration Review: Checks for recent configuration or infrastructure changes
Completion criteria:
- All relevant services identified
- Baseline metrics collected
- Recent changes documented
- Initial anomalies noted
Phase 2: Analysis & Hypothesis Testing
Duration: Typically 5-15 minutes
Deep investigation phase where agents narrow down root cause through hypothesis testing.
Activities:
- Hypothesis Generation: Creates 2-5 competing theories (one hypothesis per timeline entry)
- Theory 1: “Database connection pool exhaustion”
- Theory 2: “Lambda cold start latency”
- Theory 3: “Cascading failure from shared dependency”
- etc.
- Evidence Collection: For each hypothesis, agents gather:
- Application logs (error patterns, stack traces)
- Distributed traces (latency breakdowns, service hops)
- Dependency analysis (downstream service health)
- Resource constraints (CPU, memory, connections)
- Hypothesis Testing: Evidence is systematically evaluated:
- Evidence supports hypothesis → Confidence increases
- Evidence contradicts hypothesis → Hypothesis ruled out (with reason)
- Inconclusive evidence → Hypothesis remains under investigation
- Ruling Out: Hypotheses eliminated when evidence proves them wrong
- Each ruled-out hypothesis logged with specific reason
- Reduces scope to most likely causes
Completion criteria:
- 2-5 hypotheses created
- Evidence gathered for each hypothesis
- At least 1 hypothesis ruled out
- 1 hypothesis remains as leading theory
Phase 3: Resolution
Duration: Typically 2-5 minutes
Confirmation phase where the root cause is finalized with evidence and remediation suggestions.
Activities:
- Rule Out Remaining Hypotheses: All but one hypothesis are eliminated with evidence
- Confirm Root Cause: Remaining hypothesis marked as confirmed (highest confidence)
- Curated Evidence Chain: Select 3-6 strongest evidence items supporting root cause
- Impact Summary: Update affected services and blast radius
- Remediation Actions: Generate 1-3 focused, actionable remediation steps
- Severity Assessment: Optionally suggest severity adjustment if investigation changes risk assessment
- Set Disposition: Mark incident as IDENTIFIED with confidence score
Completion criteria:
- 1 hypothesis confirmed as root cause
- Root cause summary and confidence score set
- 1-3 specific remediation suggestions generated
- Disposition set to IDENTIFIED (or NOT_FOUND if investigation inconclusive)
- Investigation marked complete
Evidence Chain
RCA builds a structured evidence chain to support root cause identification.
Evidence Types & Auto-Calculations
CloudThinker automatically calculates derived fields to enable fast, accurate root cause analysis.
Metrics
Required Fields: name, incident_value, incident_aggregation (max|avg|min|sum|p99), source
Optional Fields: baseline_value, baseline_period (7d|24h|1h), threshold, unit, namespace, dimensions
AUTO-CALCULATED: deviation_percentage = (incident_value - baseline_value) / baseline_value × 100
Shows before/after comparison with automatic deviation calculation. Example: “CPU 95% vs 25% baseline = 280% deviation”
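As a quick check of the formula, plain Python (not product code) reproduces the worked example above:

```python
def deviation_percentage(incident_value: float, baseline_value: float) -> float:
    # (incident_value - baseline_value) / baseline_value * 100
    return (incident_value - baseline_value) / baseline_value * 100

print(deviation_percentage(95, 25))  # 280.0 -> "CPU 95% vs 25% baseline = 280% deviation"
```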
Deployments & Changes
Required Fields: type (deployment|config_change|scaling|feature_flag|database_migration|infrastructure|rollback), description, timestamp, correlation (probable_cause|contributing|unrelated)
Optional Fields: source, service
AUTO-CALCULATED: time_delta_minutes = minutes between change and incident start (sketched below)
- Positive = change before incident (likely causative)
- Negative = change after incident (response action)
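A small sketch of the timing calculation (the function name, signature, and sample timestamps are illustrative):

```python
from datetime import datetime

def time_delta_minutes(change_time: datetime, incident_start: datetime) -> float:
    # Positive when the change happened before the incident (likely causative),
    # negative when it happened after (response action).
    return (incident_start - change_time).total_seconds() / 60

deploy = datetime(2024, 5, 1, 13, 40)        # hypothetical deployment time
incident = datetime(2024, 5, 1, 14, 0)       # hypothetical incident start
print(time_delta_minutes(deploy, incident))  # 20.0 -> deployed 20 minutes before the incident
```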
Logs
Fields: source, description, deep_link, timestamp, severity, raw_data
Relevant log entries with direct deep links to log consoles (CloudWatch, Splunk, Datadog, etc.) for manual verification.
Traces
Fields: source, description, raw_data
Distributed trace data from X-Ray or other tracing systems showing request flow and latency breakdowns.
Configuration
Fields: source, description, change details, timestamp
Configuration changes tracked with exact parameter modifications (e.g., “memory: 512MB → 256MB”).
Alerts
Fields: source, severity (critical|high|medium|low), description
Related alerts from monitoring systems showing correlated signals during the incident window.
Evidence Ranking & Quality
Evidence is ranked by severity and automatically ordered in the UI:
- Critical - Direct cause indicators (error spike, deployment failure, config mismatch)
- High - Strong supporting evidence (metric anomaly, failed health check)
- Medium - Context evidence (related alert, deployment timing)
- Low - Background information (unrelated event, infrastructure baseline change)
Confidence Scoring
The AI provides a confidence score (0.0-1.0) for the identified root cause. This score reflects the strength of the evidence chain and the degree of certainty in the causation.
Confidence Score Interpretation
| Score Range | Category | Meaning | Action |
|---|---|---|---|
| 0.9 - 1.0 | Very High | Root cause identified with overwhelming evidence. Multiple independent evidence items corroborate the cause. Strong temporal and logical correlation. | Implement remediation immediately |
| 0.7 - 0.9 | High | Root cause identified with strong evidence. Clear causal relationship with supporting metrics and logs. Other hypotheses ruled out. | Implement remediation with normal priority |
| 0.5 - 0.7 | Medium | Probable root cause with supporting evidence, but gaps remain. Alternative hypotheses less likely but not ruled out. | Implement remediation; monitor for alternative causes |
| 0.3 - 0.5 | Low | Possible root cause but evidence is circumstantial. Multiple explanations remain plausible. Investigate further. | Validate findings manually before action |
| 0.0 - 0.3 | Uncertain | Insufficient evidence to establish root cause with confidence. Investigation inconclusive. | Cannot determine root cause; consider NOT_FOUND disposition |
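For illustration, the bands in the table map to categories roughly like this (the function is a sketch; boundary handling at the exact cut-offs is an assumption):

```python
def interpret_confidence(score: float) -> str:
    # Thresholds mirror the table above.
    if score >= 0.9:
        return "Very High"
    if score >= 0.7:
        return "High"
    if score >= 0.5:
        return "Medium"
    if score >= 0.3:
        return "Low"
    return "Uncertain"

print(interpret_confidence(0.92))  # Very High -> implement remediation immediately
```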
How Confidence Scores Are Calculated
Confidence starts at the hypothesis level (0.0-1.0) and evolves as evidence is tested.
Confidence Factors (Positive):
- Temporal correlation: Change/deployment occurred within minutes of incident
- Metric anomalies: Clear deviation from baseline (>50% change indicates high relevance)
- Error patterns: Specific error logs/traces directly correlating to hypothesis
- Other hypotheses ruled out: Elimination of competing theories increases confidence
- Multiple data sources: Same finding corroborated by different systems (metrics + logs + traces)
- Severity match: Evidence severity matches incident impact
Confidence Factors (Negative):
- Alternative explanations: Multiple hypotheses still plausible
- Weak temporal correlation: Change occurred hours before incident
- Missing verification: Hypothesis not directly tested
- Conflicting evidence: Some evidence supports, some contradicts hypothesis
- Unrelated metrics: Anomalies found but not causally linked to incident
Final Confidence Score
The final confidence score is set when update_root_cause is called. A hypothetical example of such a call is sketched below.
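Only the command name update_root_cause comes from this document; the payload shape and field names below are assumptions for illustration:

```python
# Hypothetical payload; field names are illustrative, not the real tool schema.
update_root_cause_call = {
    "command": "update_root_cause",
    "summary": "Config change reduced the DB connection pool size, causing pool exhaustion",
    "confidence": 0.92,            # final 0.0-1.0 confidence score
    "confirmed_hypothesis_id": 1,  # a confirmed hypothesis is required (see Root Cause Enforcement below)
}
```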
Hypothesis Tracking
RCA implements a structured hypothesis-driven investigation model inspired by the “5 Whys” and Fishbone Diagram methodologies. Each hypothesis is tracked independently, tested against evidence, and either confirmed or ruled out with full reasoning.
Hypothesis Workflow
Hypothesis Lifecycle
- Created: Initial theory formed based on symptoms with 0.0-1.0 confidence estimate
- Investigating: Agents gather specific evidence to test the hypothesis
- Confirmed: Sufficient evidence supports this as the root cause
- Ruled Out: Evidence contradicts or disproves this hypothesis
Hypothesis Metadata
Each hypothesis in the timeline is stored with the following metadata; a minimal record sketch follows the list.
- Hypothesis ID: Sequential reference (1, 2, 3…)
- Statement: The theory being tested (20-300 characters)
- Category: Type of failure (infrastructure, code, config, external, capacity, deployment, data)
- Confidence: Initial confidence 0.0-1.0, updated as evidence arrives
- Status: investigating | confirmed | ruled_out
- Timestamps: When created, when confirmed/ruled out
- Reason (if ruled out): Why evidence contradicted this theory
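A minimal record capturing that metadata might look like this (a Python sketch with illustrative field names, not the actual schema):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Hypothesis:
    id: int                                  # sequential reference (1, 2, 3, ...)
    statement: str                           # the theory being tested (20-300 characters)
    category: str                            # infrastructure | code | config | external | capacity | deployment | data
    confidence: float                        # 0.0-1.0, updated as evidence arrives
    status: str = "investigating"            # investigating | confirmed | ruled_out
    created_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None   # when confirmed or ruled out
    ruled_out_reason: Optional[str] = None   # why evidence contradicted this theory
```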
Investigation Timeline: Hypothesis Entries
Each hypothesis action is logged in real-time in the timeline.
Root Cause Enforcement
The system enforces a critical constraint: update_root_cause is blocked until at least one hypothesis is confirmed (a guard sketch follows the list below). This ensures:
- AI documents reasoning before conclusions
- Full transparency into decision process
- Users can review hypothesis testing before root cause is set
- No “black box” root cause assignment
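A sketch of that guard, reusing the illustrative Hypothesis record above (the enforcement logic shown is an assumption about the behavior, not the actual implementation):

```python
def can_update_root_cause(hypotheses: list) -> bool:
    # update_root_cause is blocked until at least one hypothesis is confirmed.
    return any(h.status == "confirmed" for h in hypotheses)

def update_root_cause(hypotheses: list, summary: str, confidence: float) -> dict:
    if not can_update_root_cause(hypotheses):
        raise ValueError("update_root_cause blocked: no confirmed hypothesis yet")
    return {"summary": summary, "confidence": confidence}
```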
AI Timeline & Real-Time Investigation Progress
RCA generates a comprehensive, real-time investigation timeline showing every step of the AI’s reasoning. The timeline enables users to follow the investigation in real-time and understand the exact logic that led to the root cause determination.
Timeline Entry Types
The timeline supports 8 entry types for different investigation activities (an enumeration sketch follows the list):
- info - General investigation note or observation
- finding - Specific discovery that impacts analysis (metric anomaly, error pattern)
- warning - Potential issue requiring verification
- error - Failed investigation attempt or contradictory evidence
- success - Confirmed finding or validated hypothesis
- hypothesis_created - New theory proposed with initial confidence
- hypothesis_ruled_out - Theory disproven with specific reasoning
- hypothesis_confirmed - Hypothesis validated as root cause explanation
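These eight types could be represented as a simple enumeration (illustrative sketch only):

```python
from enum import Enum

class TimelineEntryType(str, Enum):
    INFO = "info"
    FINDING = "finding"
    WARNING = "warning"
    ERROR = "error"
    SUCCESS = "success"
    HYPOTHESIS_CREATED = "hypothesis_created"
    HYPOTHESIS_RULED_OUT = "hypothesis_ruled_out"
    HYPOTHESIS_CONFIRMED = "hypothesis_confirmed"
```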
Phase Transitions
Timeline entries automatically document phase transitions.
Real-Time Investigation Visibility
- Live updates - Timeline entries appear instantly as agents discover findings
- Phase progress - Visual indicator shows investigation phase completion
- Hypothesis status - Users see hypothesis creation, testing, and resolution in real-time
- Evidence collection - New evidence items appear with source attribution as they’re added
- Timestamp correlation - Each entry auto-timestamped for precise investigation sequence tracking
Phase Auto-Advancement
The system automatically advances phases:
- When a new phase starts, the previous phase auto-closes with its recorded findings
- Next phase auto-starts after findings are summarized
- Agents can manually transition if investigation leads elsewhere
- Phase duration (in seconds) is auto-calculated and displayed
Disposition Status & Investigation Conclusion
After investigation, the AI sets an incident disposition which determines the final investigation status. Each disposition is recorded with:
- New status (transitioned from previous status)
- Reason (explanation of decision)
- Timestamp (when determination was made)
- Attribution (which agent made the decision)
- Confidence score (for IDENTIFIED only)
Disposition Types
| Status | Meaning | Prerequisites | Investigation Continues? |
|---|---|---|---|
| IDENTIFIED | Root cause found with supporting evidence | Requires: confirmed hypothesis, root cause summary, confidence 0.7+ | ❌ NO (Terminal) |
| NOT_FOUND | Investigation exhausted, no clear root cause | Optional: tried all hypotheses, insufficient evidence | ❌ NO (Terminal) |
| FALSE_ALARM | Issue was not a real incident | Optional: evidence shows issue was not incident | ❌ NO (Terminal) |
| ON_HOLD | Awaiting external input or additional data | Optional: investigation blocked pending info | ✅ YES (Resumable) |
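A hedged sketch of the IDENTIFIED prerequisites from the table (the validation function itself is illustrative):

```python
TERMINAL_DISPOSITIONS = {"IDENTIFIED", "NOT_FOUND", "FALSE_ALARM"}  # ON_HOLD is resumable

def validate_identified(has_confirmed_hypothesis: bool,
                        root_cause_summary: str,
                        confidence: float) -> None:
    # IDENTIFIED requires a confirmed hypothesis, a root cause summary, and confidence >= 0.7.
    if not has_confirmed_hypothesis:
        raise ValueError("IDENTIFIED requires a confirmed hypothesis")
    if not root_cause_summary:
        raise ValueError("IDENTIFIED requires a root cause summary")
    if confidence < 0.7:
        raise ValueError("IDENTIFIED requires a confidence score of at least 0.7")
```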
Terminal vs. Resumable Dispositions
Terminal Dispositions (IDENTIFIED, NOT_FOUND, FALSE_ALARM):
- After setting, agents cannot:
- Call update_root_cause
- Add evidence
- Add timeline entries
- Update affected services
- Suggest remediation changes
- Investigation is considered complete and cannot be resumed
- Users can still manually add context or reopen as new RCA run
Resumable Disposition (ON_HOLD):
- Investigation can resume when additional information is available
- Agents can continue investigation after external input
- Phase tracking resumes from last state
Disposition History & Audit Trail
Every disposition change is recorded chronologically.
How Agents Report Findings (The Incident Tool)
During investigation, agents use the incident tool to report all findings directly to the RCA record. This ensures a unified, tamper-proof evidence chain and allows cross-agent correlation.
Agent Workflow with Incident Tool
Agents follow a structured workflow built on the commands below; an example payload is sketched after the table.
Incident Tool Commands
| Command | Purpose | Example |
|---|---|---|
| update_investigation_phase | Track phase progression | Phase context→analysis→resolution transitions |
| add_timeline_entry | Log findings in real-time | Discovery of metrics, hypothesis creation/testing |
| update_root_cause | Set final root cause with confidence | "Config change caused pool exhaustion, conf=0.92" |
| add_evidence | Add supporting proof (metric/log/trace/deployment) | CloudWatch metric showing 95% CPU vs 25% baseline |
| add_remediation | Suggest 1-3 actionable fixes | "Revert config to previous values (critical)" |
| update_affected_services | Track impacted services and auto-derive blast radius | ["api-service", "database", "cache-layer"] |
| suggest_severity_change | Recommend severity adjustment | Upgrade to critical if blast radius large |
| update_disposition_status | Set final investigation status | IDENTIFIED, NOT_FOUND, FALSE_ALARM, or ON_HOLD |
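For illustration, an add_evidence call for the metric row above might carry fields like these (the payload shape is an assumption; the command and field names are taken from the evidence type descriptions earlier, and the metric name is hypothetical):

```python
# Hypothetical add_evidence payload; shape is illustrative, not the real tool schema.
evidence = {
    "command": "add_evidence",
    "type": "metric",
    "name": "CPUUtilization",        # hypothetical metric name
    "source": "CloudWatch",
    "incident_value": 95,
    "incident_aggregation": "max",
    "baseline_value": 25,
    "baseline_period": "24h",
    "unit": "percent",
    # deviation_percentage (280%) is auto-calculated by the tool, not supplied by the agent.
}
```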
Evidence Auto-Calculations
The incident tool automatically calculates derived fields to save agent effort:
- Metric deviation: Automatically calculates (incident_value - baseline) / baseline × 100
- Change timing: Automatically calculates minutes between change and incident start (positive = before, negative = after)
- Service-to-node mapping: Automatically resolves services to infrastructure topology nodes
- Blast radius: Automatically derives affected node set from service list
Severity Suggestions
The AI may suggest severity changes based on investigation findings:
- Upgrade: Initial severity was too low given the impact
- Downgrade: Issue is less severe than initially reported
Triggering RCA
RCA can be triggered automatically through webhooks or manually from the incident detail page.
Automatic Trigger
Configure webhook integrations to auto-trigger RCA for incidents meeting severity thresholds (a small trigger-check sketch follows the list below).
Webhook Configuration:
- When the webhook creates an incident with auto_trigger_rca: true
- If incident severity ≥ auto_trigger_rca_min_severity
- RCA investigation starts automatically in the background
- User notified of RCA progress via timeline updates
Example webhook sources:
- PagerDuty: Webhook triggers on incident.triggered events that meet the severity threshold
- Opsgenie: Webhook triggers on alert creation with priority ≥ P2
- CloudWatch: Alarm state change triggers webhook
- Datadog/Grafana: Monitor alerts trigger webhook
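A small sketch of the trigger check described above (only auto_trigger_rca and auto_trigger_rca_min_severity come from this document; the severity ordering and the function itself are illustrative):

```python
SEVERITY_ORDER = ["low", "medium", "high", "critical"]  # assumed ordering

def should_auto_trigger(incident: dict, webhook_config: dict) -> bool:
    # RCA auto-triggers when the webhook sets auto_trigger_rca and the incident
    # severity meets the configured auto_trigger_rca_min_severity threshold.
    if not webhook_config.get("auto_trigger_rca"):
        return False
    minimum = webhook_config.get("auto_trigger_rca_min_severity", "medium")
    return SEVERITY_ORDER.index(incident["severity"]) >= SEVERITY_ORDER.index(minimum)

print(should_auto_trigger({"severity": "high"},
                          {"auto_trigger_rca": True, "auto_trigger_rca_min_severity": "medium"}))  # True
```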
Manual Trigger
Start RCA investigation from the incident detail page:
- Open incident detail
- Click Start RCA Analysis button
- System validates:
- No RCA is already running for this incident (prevents duplicate investigations)
- Incident is not a child incident (parent incidents only)
- RCA run is created and queued
- Background task begins investigation within 1-3 seconds
- UI shows “Investigation in progress” status
- Real-time timeline appears as findings are discovered
Automatic Promotion Trigger
When a child incident is promoted to parent, RCA auto-triggers on the parent if configured.
RCA Configuration
Workspace Settings
Configure default RCA behavior in Incidents > Settings:
- Default Topology View: Auto-associate topology for blast radius analysis
- Default Severity: Starting severity for new incidents
- Auto-generate Reports: Create PDF reports on resolution
Per-Incident Settings
Override settings when creating incidents:
- Topology View: Select specific topology for this incident
- Affected Services: Pre-populate known affected services
Viewing RCA Results
The RCA analysis view displays comprehensive investigation results in multiple sections.
Root Cause Summary
Clear explanation of the identified root cause (30-2000 characters) with confidence score (0.0-1.0). Shows when root cause was identified and any updated confidence through multiple RCA runs.
Hypothesis Tracking
All hypotheses created during investigation with their lifecycle: creation confidence → testing status → confirmation/ruling out. Shows hypothesis ID, statement, category, and final status with reasoning.
Evidence Chain
All collected evidence organized by type (metric, deployment, log, trace, config, alert) with severity ranking. Each evidence item shows: source, description, timestamps, deep links, raw data, and severity classification.
Investigation Timeline
Real-time chronological log of AI investigation steps showing: findings discovered, hypotheses created/tested, phase transitions, timestamps, and investigation message for each step.
Remediation Actions
AI-suggested fixes (1-3 items) with title, detailed description, and priority level (critical|high|medium|low). Each remediation includes the investigation values that led to the suggestion.
Affected Services
Complete list of services determined to be impacted during investigation, with optional external IDs (ARNs, resource names) and blast radius visualization if topology is connected.
RCA Results Metadata
Each RCA run shows:
- Investigation Duration: Total time from start to completion
- Phase Breakdown: Time spent in each phase (context, analysis, resolution)
- Hypotheses Count: Number of hypotheses tested and outcome distribution
- Evidence Count: Total evidence items collected by type
- Agent Contributions: Which agents participated in investigation
- Investigation Timestamp: When investigation was initiated
- Last Updated: When results were last modified
Multiple RCA Runs
You can run multiple RCA investigations on the same incident:
- Version tracking: Each run is numbered (v1, v2, v3…)
- History dropdown: Switch between historical runs
- Compare results: Review how findings evolved
Rerun RCA when:
- Additional information becomes available
- Initial investigation was inconclusive
- Incident scope expanded
Best Practices
Investigation Setup
- Connect topology - Infrastructure topology mapping enables blast radius analysis, service correlation, and better agent decision-making
- Configure webhooks - Auto-trigger RCA for medium+ severity incidents to start investigation immediately
- Provide incident context - Include symptoms, timeline, and user impact in incident description to guide investigation
- Link related incidents - Use parent/child relationships to group related incidents for correlation
During Investigation
- Monitor timeline in real-time - Follow hypothesis creation, testing, and confirmation as it happens
- Review hypothesis chain - Understand which theories were tested and why others were ruled out
- Check evidence severity - Verify that high-severity evidence items are directly related to root cause
- Verify timestamps - Ensure deployment/change timestamps are correct relative to the incident start
- Review agent contributions - See which specialized agents (Alex, Tony, Kai, Oliver) contributed findings
After Investigation
- Validate root cause confidence - If confidence < 0.7, manually verify findings before implementing remediation
- Implement prioritized remediation - Focus on critical-priority suggestions first
- Accept/dismiss severity suggestions - Review AI severity recommendations and confirm or dismiss
- Track remediation completion - Mark remediation actions as complete in incident workflow
- Enable severity updates - Let AI auto-update severity if investigation reveals higher/lower impact than initially reported
Multi-Run Investigations
- Use version history - Run RCA multiple times if new information emerges
- Compare runs - Switch between historical runs (v1, v2, v3…) to see how findings evolved
- Add context between runs - When rerunning, add new data points to guide agents
- Review run notes - Each run shows which investigations were tried and what was learned
RCA Configuration
- Set default severity - Configure workspace default severity for new incidents
- Enable topology - Turn on topology context for better service mapping
- Configure agent team - Activate relevant agents based on your infrastructure (AWS→Alex, K8s→Kai, etc.)
- Set up integrations - Connect cloud providers, databases, and monitoring tools for agent access