Incident Response
This page summarizes the key concepts from the official incident response docs. Read those for the complete reference.
The problem
When an alert fires at 3 AM, the on-call engineer has to:
- Open the incident platform to see what's wrong
- Switch to metrics dashboards
- Open Log Analytics for errors
- Check deployment history
- Search Slack/Teams for context
- Find a runbook that may be outdated
Azure SRE Agent performs all of these steps automatically, in seconds.
How it works
Alert fires
→ Agent acknowledges
→ Queries all connected data sources
→ Correlates logs + metrics + traces + deployments
→ Checks memory for similar past incidents
→ Forms hypotheses, validates with evidence
→ Proposes fix (Review) or executes immediately (Autonomous)
The agent doesn't follow a static script — it reasons about your specific situation, adapting its investigation based on what it finds.
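The last two steps of the flow (validate hypotheses against evidence, then propose or execute) can be sketched as below. This is an illustrative toy and not the SRE Agent's implementation or API: `Mode`, `Hypothesis`, `validate`, and `respond` are hypothetical names, and a simple substring match stands in for the agent's reasoning.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    REVIEW = "review"          # the agent proposes a fix and a human approves it
    AUTONOMOUS = "autonomous"  # the agent applies the fix itself


@dataclass
class Hypothesis:
    cause: str   # hypothetical: short phrase describing the suspected cause
    fix: str     # hypothetical: remediation to apply if the cause is confirmed
    evidence: list[str] = field(default_factory=list)


def validate(hypotheses: list[Hypothesis], signals: dict[str, list[str]]) -> list[Hypothesis]:
    """Keep hypotheses mentioned by at least one signal, most evidence first."""
    supported = []
    for h in hypotheses:
        # Toy evidence check: a signal line supports a hypothesis if it mentions the cause.
        h.evidence = [
            f"{source}: {line}"
            for source, lines in signals.items()
            for line in lines
            if h.cause.lower() in line.lower()
        ]
        if h.evidence:
            supported.append(h)
    return sorted(supported, key=lambda h: len(h.evidence), reverse=True)


def respond(hypotheses: list[Hypothesis], signals: dict[str, list[str]], mode: Mode) -> str:
    """Propose the best-supported fix in Review mode, or apply it in Autonomous mode."""
    ranked = validate(hypotheses, signals)
    if not ranked:
        return "escalate: no hypothesis supported by evidence"
    best = ranked[0]
    if mode is Mode.AUTONOMOUS:
        return f"executed: {best.fix}"
    return f"proposed: {best.fix}"
```

The point of the sketch is the gate at the end: the investigation is identical in both modes, and only the final action differs.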
What makes it different
| | Runbooks | Dashboards | Scripts | SRE Agent |
|---|---|---|---|---|
| Adapts? | ❌ Static steps | ❌ Static views | ❌ Same steps every time | ✅ Reasons about context |
| Learns? | ❌ Goes stale | ❌ N/A | ❌ N/A | ✅ Persistent memory |
| Acts? | ❌ Manual steps | ❌ Just displays data | ✅ But blindly | ✅ With reasoning |
| Correlates? | ❌ Single source | ❌ Single source | ❌ Single source | ✅ All sources |
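To make the "Correlates?" row concrete, here is a minimal sketch of cross-source correlation: it gathers events from every source that occurred shortly before the alert and groups them by source. The event shape (`source`, `time`, `message`) and the 10-minute default window are assumptions for illustration, not the agent's actual data model.

```python
from datetime import datetime, timedelta


def correlate(alert_time: datetime, events: list[dict], window: timedelta = timedelta(minutes=10)) -> dict:
    """Group events from any source that fall within `window` before the alert."""
    start = alert_time - window
    nearby = [e for e in events if start <= e["time"] <= alert_time]
    nearby.sort(key=lambda e: e["time"])  # oldest first, so likely causes precede symptoms

    by_source: dict[str, list[str]] = {}
    for e in nearby:
        by_source.setdefault(e["source"], []).append(e["message"])
    return by_source
```

A configuration change from the deployment history that lands a few minutes before a burst of errors in the logs would then appear side by side in the result, which is the kind of link a single-source view cannot show.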
In this demo
The SRE Agent detects the checkout failure via:
- Application Insights: spike in 503 responses on `POST /api/orders`
- Span attributes: error details, the `DEMO_BROKEN_CHECKOUT` flag value
- Source code: traces to `src/app/api/orders/route.ts` via Deep Context
The agent then proposes resetting the environment variable (Review mode) or resets it automatically (Autonomous mode).
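The first two detection signals can be approximated with a few lines of analysis over exported telemetry. The sketch below is illustrative only: the record shapes, the spike thresholds, and the assumption that the span attribute key is literally `DEMO_BROKEN_CHECKOUT` are hypothetical, and none of this calls the Application Insights API.

```python
def error_rate(requests: list[dict], route: str, status: int = 503) -> float:
    """Fraction of requests to `route` that returned `status`."""
    matching = [r for r in requests if r["route"] == route]
    if not matching:
        return 0.0
    return sum(r["status"] == status for r in matching) / len(matching)


def is_spike(baseline_rate: float, current_rate: float, factor: float = 5.0, floor: float = 0.05) -> bool:
    """Spike = current rate is above an absolute floor and well above the baseline."""
    return current_rate >= floor and current_rate > baseline_rate * factor


def flag_values(spans: list[dict], flag: str = "DEMO_BROKEN_CHECKOUT") -> set:
    """Distinct values of the flag recorded on failing spans (attribute key is assumed)."""
    return {
        s["attributes"][flag]
        for s in spans
        if s.get("error") and flag in s["attributes"]
    }
```

If `is_spike` is true for `POST /api/orders` and every failing span carries the same flag value, the flag becomes the leading hypothesis, which matches the fix the demo arrives at.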