
Demo Script

A step-by-step presenter guide for running the Azure SRE Agent demo live. Designed for a 15–20 minute slot. To fit a 10-minute slot, trim the setup narration and prompt the agent directly instead of waiting for it to start on its own.


Before the demo

Run through this checklist the morning of your presentation.

Infrastructure health

# PostgreSQL is running
az postgres flexible-server show \
  --name <YOUR_POSTGRES_SERVER_NAME> \
  --resource-group <YOUR_DEMO_RESOURCE_GROUP> \
  --query "state" -o tsv
# Expected: Ready (tsv output prints the value without quotes)

# Container App is responding and database is connected
curl -s https://<YOUR_APP_FQDN>/api/health | jq .
# Expected: { "status": "ok", "database": { "status": "ok", ... } }
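The two checks above can be folded into a single go/no-go line, which is easier to read when you are short on time. The sketch below is one possible shape and is not part of the demo repo: `preflight_verdict` is a hypothetical function name, and it assumes the three values have already been fetched with the commands shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the demo repo): turn the three health
# signals into one PASS/FAIL verdict so a failed check is hard to miss.
#   $1 = PostgreSQL server state   (from `az postgres flexible-server show`)
#   $2 = .status from /api/health
#   $3 = .database.status from /api/health
preflight_verdict() {
  local pg_state="$1" app_status="$2" db_status="$3"
  local failures=()

  [ "$pg_state" = "Ready" ] || failures+=("postgres state is '${pg_state}'")
  [ "$app_status" = "ok" ]  || failures+=("app status is '${app_status}'")
  [ "$db_status" = "ok" ]   || failures+=("database status is '${db_status}'")

  if [ "${#failures[@]}" -eq 0 ]; then
    echo "PASS"
    return 0
  fi

  # Print every failing check, not just the first one.
  printf 'FAIL: %s\n' "${failures[@]}"
  return 1
}

# Example wiring (fill in the placeholders before running):
#   pg_state=$(az postgres flexible-server show ... --query "state" -o tsv)
#   health=$(curl -s "https://<YOUR_APP_FQDN>/api/health")
#   preflight_verdict "$pg_state" \
#     "$(jq -r '.status' <<<"$health")" \
#     "$(jq -r '.database.status' <<<"$health")"
```

Keeping the verdict logic separate from the `az` and `curl` calls means it can be exercised without touching Azure.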

Verify checkout works

For a quick presenter sanity check, use the browser. The GitHub Actions workflows now use a configured seeded product CUID (TEST_PRODUCT_ID) for their own verification, so they no longer rely on the broken hardcoded productId: "1" payload.

  1. Open https://<YOUR_APP_FQDN> in a browser

  2. Add any product to the cart

  3. Complete a checkout — confirm you see the order confirmation page

  4. If checkout returns an error, the DEMO_BROKEN_CHECKOUT env var may still be set to true from a previous demo run. Reset it:

    gh workflow run "Demo: Reset Checkout" -f environment=<YOUR_GITHUB_ENVIRONMENT>
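To confirm whether the flag is actually set before triggering the reset, the Container App's environment can be inspected directly. This is a sketch under several assumptions: `<YOUR_CONTAINER_APP_NAME>` is a placeholder that does not appear elsewhere in this guide, the flag is assumed to sit on the first container in the template as a plain value (not a secret reference), and `checkout_flag_is_set` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) only when the value passed in means
# "checkout is deliberately broken". The comparison is case-insensitive.
checkout_flag_is_set() {
  local value
  value=$(printf '%s' "${1:-}" | tr '[:upper:]' '[:lower:]')
  [ "$value" = "true" ]
}

# Example wiring (fill in the placeholders before running):
#   flag=$(az containerapp show \
#     --name "<YOUR_CONTAINER_APP_NAME>" \
#     --resource-group "<YOUR_DEMO_RESOURCE_GROUP>" \
#     --query "properties.template.containers[0].env[?name=='DEMO_BROKEN_CHECKOUT'].value | [0]" \
#     -o tsv)
#   if checkout_flag_is_set "$flag"; then
#     gh workflow run "Demo: Reset Checkout" -f environment="<YOUR_GITHUB_ENVIRONMENT>"
#   fi
```

An unset or missing variable is treated the same as `false`, so the reset only runs when the flag is explicitly on.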

SRE Agent

  1. Open sre.azure.com and navigate to your agent
  2. Confirm the agent is Active (not "Building Knowledge Graph")
  3. Confirm the webstore GitHub repo is connected under data sources
  4. Confirm <YOUR_DEMO_RESOURCE_GROUP> is listed under Azure resources

Browser tabs (pre-open)

| Tab | What to open | Purpose |
| --- | --- | --- |
| 1 | Webstore storefront | Show the live app |
| 2 | sre.azure.com: Agent chat | Watch the agent investigate |
| 3 | Azure Portal: App Insights (Live Metrics or Failures) | Real-time telemetry |
| 4 | GitHub Actions: this repo | Trigger workflows |

Part 1: Set the scene

⏱️ ~3 minutes

Show the app

  1. Switch to the webstore tab
  2. Browse a few products, add one to the cart
  3. Complete a checkout — show the success confirmation
Speaker notes

"This is a real e-commerce app running on Azure Container Apps. It's a Next.js storefront backed by PostgreSQL, fully instrumented with OpenTelemetry. Every request, every dependency call, every exception flows to Application Insights."

Show the telemetry

  1. Switch to Application Insights → Live Metrics (or the Failures blade)
  2. Point out healthy request rates, zero failures
Speaker notes

"Right now everything is green. The SRE Agent is quietly monitoring this environment — it has access to these metrics, the logs, and the source code repo on GitHub."

Show the SRE Agent

  1. Switch to sre.azure.com
  2. Show the agent dashboard — connected resources, connected code repo
  3. Optionally ask: "What Azure resources can you see in <YOUR_DEMO_RESOURCE_GROUP>?"
Speaker notes

"This is Azure SRE Agent. It's connected to our Application Insights, our resource group, and the GitHub repo with the webstore source code. It's running in Review mode — it'll investigate autonomously but ask before taking action."


Part 2: Break it

⏱️ ~2 minutes

Trigger the failure

  1. Switch to GitHub Actions
  2. Navigate to "Demo: Break Checkout"
  3. Click Run workflow → select your configured GitHub environment → click Run workflow
Speaker notes

"I'm simulating a bad deployment. All this does is flip one environment variable — DEMO_BROKEN_CHECKOUT=true. The checkout API will start returning 503 with a simulated timeout, while the rest of the site stays perfectly healthy."

Confirm the break

  1. Wait for the workflow to complete (~1 min)
  2. Switch to the storefront — try to check out, show the error
  3. Switch to App Insights — watch the 503s appear
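If you would rather wait for the workflow from a terminal than refresh the Actions tab, the GitHub CLI can follow the run. The sketch below is an optional convenience, not part of the repo: `watch_latest_run` is a hypothetical helper name, and it assumes `gh` is authenticated and run from inside a clone of the repo.

```shell
#!/usr/bin/env bash
# Hypothetical helper: block until the most recent run of a workflow finishes.
# Returns non-zero if no run is found or if the run itself fails.
watch_latest_run() {
  local workflow="$1" run_id

  # Runs are listed newest first; --jq extracts just the numeric run ID.
  run_id=$(gh run list --workflow "$workflow" --limit 1 \
    --json databaseId --jq '.[0].databaseId') || return 1

  if [ -z "$run_id" ] || [ "$run_id" = "null" ]; then
    echo "no runs found for workflow '${workflow}'" >&2
    return 1
  fi

  # --exit-status makes gh return non-zero when the run fails.
  gh run watch "$run_id" --exit-status
}

# Example: watch_latest_run "Demo: Break Checkout"
```

A freshly dispatched run can take a few seconds to show up in the list, so give it a moment after clicking Run workflow before calling the helper.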
Speaker notes

"Checkout is down. Products still load, the cart works, but placing an order gives a 503. If this were a simple health check, it would still say 'healthy.' But the SRE Agent is about to tell us what is wrong and how to fix it."


Part 3: Watch the agent

⏱️ ~5–10 minutes

The investigation

  1. Switch to the SRE Agent tab
  2. The agent may start investigating automatically (if you've set up an incident response plan), or prompt it:

"I'm seeing checkout failures on the webstore Container App in <YOUR_DEMO_RESOURCE_GROUP>. Can you investigate?"

  3. Watch the reasoning chain:
    • Queries App Insights for recent exceptions and failed requests
    • Notices the spike in 503s on POST /api/orders
    • Examines span attributes and exception details
    • Correlates with recent environment variable changes
    • Traces to source code — identifies the DEMO_BROKEN_CHECKOUT flag
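The first two steps of that chain correspond roughly to a log query you could run yourself if the audience asks what the agent is looking at. This is an illustrative sketch, not the agent's actual query. It assumes requests are recorded under the name `POST /api/orders`. `failed_orders_kql` is a hypothetical helper name and `<YOUR_APP_INSIGHTS_NAME>` is a placeholder that does not appear elsewhere in this guide.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a KQL query that counts checkout requests by
# HTTP result code over a recent window (default: the last 15 minutes).
# Assumption: requests are recorded under the name "POST /api/orders".
failed_orders_kql() {
  local window="${1:-15m}"
  cat <<EOF
requests
| where timestamp > ago(${window})
| where name == "POST /api/orders"
| summarize hits = count() by resultCode
| order by hits desc
EOF
}

# Example wiring (needs the application-insights CLI extension; fill in the
# placeholders before running):
#   az monitor app-insights query \
#     --app "<YOUR_APP_INSIGHTS_NAME>" \
#     --resource-group "<YOUR_DEMO_RESOURCE_GROUP>" \
#     --analytics-query "$(failed_orders_kql 15m)"
```

During the break, a result dominated by `503` rows matches what the Failures blade shows.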
Speaker notes

"Watch what's happening. The agent is querying App Insights, looking at traces, and forming hypotheses. It's not following a runbook — it's reasoning about what's different. And because it has access to the GitHub repo, it can trace the telemetry all the way back to the line of code."

The recommendation

  1. The agent proposes a remediation — resetting DEMO_BROKEN_CHECKOUT to false
  2. In Review mode, it shows Approve / Deny
Speaker notes

"The agent found the root cause and is proposing a specific fix. In Review mode, I approve or deny. In Autonomous mode, it would have already done this — checkout restored before anyone noticed."

Approve (or manual reset)

  1. Approve the agent's proposal if it is actionable. Otherwise, trigger the reset manually:

    gh workflow run "Demo: Reset Checkout" -f environment=<YOUR_GITHUB_ENVIRONMENT>

Part 4: Recovery

⏱️ ~2 minutes

  1. Wait ~30 s after the env var flips
  2. Switch to storefront — complete a checkout, show success
  3. Switch to App Insights — 503s stop, 201s resume
Speaker notes

"We're back. Zero downtime for browsing, and checkout was restored in under two minutes. The agent captured the full investigation in its memory — next time it sees this pattern, it'll resolve it even faster."


Key talking points

Use these throughout the demo or in Q&A:

"How is this different from just an alert?"

An alert tells you something is wrong. The SRE Agent tells you what is wrong, why it happened, and how to fix it. It correlates across metrics, logs, traces, deployments, and source code, so the same investigation that takes an on-call engineer 15–30 minutes happens in minutes.

"What if I don't trust it to act?"

Start in Review mode. The agent investigates autonomously but proposes actions for your approval. Once you see patterns you're always approving, switch those to Autonomous.

"Does it only work with Container Apps?"

No — it works with any Azure resource accessible via ARM. App Services, AKS, VMs, Functions, databases, networking. Plus external tools via connectors and MCP.

"Does it remember past incidents?"

Yes. Every investigation gets captured in persistent memory. Institutional knowledge compounds instead of walking out the door.

"How do I get started?"

Go to sre.azure.com, sign in, and the wizard walks you through creating an agent in about 5 minutes. Connect your Azure resources and a code repo, and you're up and running.


Timing guide

| Section | Duration | Notes |
| --- | --- | --- |
| Set the scene | 3 min | Show app, telemetry, agent |
| Break it | 2 min | Trigger workflow, confirm 503 |
| Agent investigates | 5–10 min | Varies based on agent response time |
| Recovery | 2 min | Show restoration |
| Q&A | 5 min | Use the talking points above |
| Total | 15–20 min | Compress to 10 min by prompting the agent directly |

Cleanup

After the demo, make sure checkout is restored:

gh workflow run "Demo: Reset Checkout" -f environment=<YOUR_GITHUB_ENVIRONMENT>