Incident Response

This page summarizes the key concepts from the official incident response docs. Read those for the complete reference.

The problem

When an alert fires at 3 AM, the on-call engineer has to:

Open the incident platform to see what's wrong
Switch to metrics dashboards
Open Log Analytics for errors
Check deployment history
Search Slack/Teams for context
Find a runbook that may be outdated

Azure SRE Agent does all of this in seconds.

How it works

Alert fires
  → Agent acknowledges
  → Queries all connected data sources
  → Correlates logs + metrics + traces + deployments
  → Checks memory for similar past incidents
  → Forms hypotheses, validates with evidence
  → Proposes fix (Review) or executes immediately (Autonomous)

The agent doesn't follow a static script — it reasons about your specific situation, adapting its investigation based on what it finds.
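The flow above can be sketched in a few lines of TypeScript. This is only an illustration of the gather, correlate, validate, act sequence, not the agent's actual implementation: every type and function name here (Evidence, Hypothesis, investigate) is hypothetical, the candidate hypotheses stand in for what the agent would draw from memory and its own reasoning, and "validation" is reduced to checking that each required signal appears in the pooled evidence.

```typescript
// Hypothetical types; none of these names come from Azure SRE Agent itself.
type Mode = "review" | "autonomous";

interface Evidence {
  source: string; // e.g. "metrics", "logs", "deployments"
  signal: string; // short description of what was observed
}

interface Hypothesis {
  cause: string;
  fix: string;
  // Signals that must all be present in the evidence for the hypothesis to hold.
  requiredSignals: string[];
}

interface Outcome {
  action: "proposed" | "executed" | "escalated";
  fix?: string;
  evidence: Evidence[];
}

function investigate(
  sources: Array<() => Evidence[]>,
  candidates: Hypothesis[],
  mode: Mode,
): Outcome {
  // Query every connected source and pool the results so they can be correlated.
  const evidence = sources.flatMap((query) => query());
  const seen = new Set(evidence.map((e) => e.signal));

  // Keep the first hypothesis whose required signals are all backed by evidence.
  const validated = candidates.find((h) =>
    h.requiredSignals.every((s) => seen.has(s)),
  );

  if (!validated) {
    // Nothing could be confirmed, so hand the incident to a human.
    return { action: "escalated", evidence };
  }
  return {
    action: mode === "autonomous" ? "executed" : "proposed",
    fix: validated.fix,
    evidence,
  };
}

// Example run with made-up signals.
const demoSources: Array<() => Evidence[]> = [
  () => [{ source: "metrics", signal: "503 spike" }],
  () => [{ source: "deployments", signal: "flag changed" }],
];
const demoCandidates: Hypothesis[] = [
  {
    cause: "feature flag broke checkout",
    fix: "reset flag",
    requiredSignals: ["503 spike", "flag changed"],
  },
];
const reviewOutcome = investigate(demoSources, demoCandidates, "review");
const autonomousOutcome = investigate(demoSources, demoCandidates, "autonomous");
const noMatchOutcome = investigate([], demoCandidates, "review");
```

The only difference between the two modes in this sketch is the final step: the same validated fix is either returned as a proposal or marked as executed. With no evidence at all, nothing is validated and the incident is escalated.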

What makes it different

	Runbooks	Dashboards	Scripts	SRE Agent
Adapts?	❌ Static steps	❌ Static views	❌ Same steps every time	✅ Reasons about context
Learns?	❌ Goes stale	❌ N/A	❌ N/A	✅ Persistent memory
Acts?	❌ Manual steps	❌ Just displays data	✅ But blindly	✅ With reasoning
Correlates?	❌ Single source	❌ Single source	❌ Single source	✅ All sources

In this demo

The SRE Agent detects the checkout failure and traces it to its cause using three signals:

Application Insights — spike in 503 responses on POST /api/orders
Span attributes — error details, the DEMO_BROKEN_CHECKOUT flag value
Source code — traces to src/app/api/orders/route.ts via Deep Context

The agent then proposes resetting the DEMO_BROKEN_CHECKOUT environment variable in Review mode, or applies the reset automatically in Autonomous mode.
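To make the failure mode concrete, here is a minimal sketch of how a flag like DEMO_BROKEN_CHECKOUT could turn an order request into a 503 while recording the flag value for tracing. It is not the demo's real src/app/api/orders/route.ts: the function name handleCheckout, the attribute keys, the "true" string convention, and the 201 success status are all assumptions made for illustration.

```typescript
interface CheckoutResult {
  status: number;
  // Stand-in for attributes a real handler would attach to its trace span.
  spanAttributes: Record<string, string>;
}

// Hypothetical handler logic: treats the string "true" as "flag on".
// The demo's actual parsing and attribute names may differ.
function handleCheckout(
  env: Record<string, string | undefined>,
): CheckoutResult {
  const flag = env.DEMO_BROKEN_CHECKOUT ?? "false";
  const spanAttributes: Record<string, string> = {
    "demo.broken_checkout": flag,
  };

  if (flag === "true") {
    // Flag on: fail the request and record why, so the trace explains the 503.
    return {
      status: 503,
      spanAttributes: {
        ...spanAttributes,
        "error.message": "checkout disabled by DEMO_BROKEN_CHECKOUT",
      },
    };
  }
  // Flag off or unset: the order is accepted.
  return { status: 201, spanAttributes };
}

const brokenCheckout = handleCheckout({ DEMO_BROKEN_CHECKOUT: "true" });
const healthyCheckout = handleCheckout({});
```

In this sketch, resetting the variable is enough to restore the healthy path, which is why a flag reset is a plausible fix for the agent to propose.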
