Architecture
System diagram
┌─────────────────────────────────────────────────────────────────┐
│  Azure SRE Agent (eastus2)                                      │
│  rg-webstore-sre-agent                                          │
│                                                                 │
│  ┌───────────────────────┐   ┌─────────────────────────────┐    │
│  │ SRE Agent             │   │ Application Insights        │    │
│  │ (Microsoft.App/       │   │ + Log Analytics Workspace   │    │
│  │  agents)              │   │                             │    │
│  └───────────┬───────────┘   └──────────────┬──────────────┘    │
│              │ monitors & investigates      │                   │
└──────────────┼──────────────────────────────┼───────────────────┘
               │                              │
               ▼                              │ telemetry
┌─────────────────────────────────────────────┼───────────────────┐
│  Webstore Application (centralus)           │                   │
│  rg-webstore-demo                           │                   │
│                                             │                   │
│ ┌───────────────┐  ┌────────────┐  ┌────────┴──┐  ┌──────────┐  │
│ │ Container App │  │ PostgreSQL │  │ App       │  │ Key Vault│  │
│ │ (Next.js +    │  │ Flexible   │  │ Insights  │  │          │  │
│ │ OpenTelemetry ├──┤ Server     │  │ (telemetry│  │          │  │
│ │ SDK)          │  │            │  │  sink)    │  │          │  │
│ └───────────────┘  └────────────┘  └───────────┘  └──────────┘  │
│                                                                 │
│ ┌────────────────────────────────────────────────────────────┐  │
│ │               Azure Container Registry (ACR)               │  │
│ └────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
Telemetry flow
The webstore uses the Azure Monitor OpenTelemetry SDK to instrument every request:
- Container App runs Next.js with a preload script that initializes OpenTelemetry before the app starts
- Every HTTP request, database query, and exception becomes a span in Application Insights
- The checkout API (POST /api/orders) adds custom span attributes: order details, error codes, and the DEMO_BROKEN_CHECKOUT flag
- Azure SRE Agent queries Application Insights to detect anomalies and correlate failures
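The custom attributes on the checkout span can be sketched as a small helper. This is an illustration only: the function name `checkoutSpanAttributes`, the shape of `order`, and every attribute key (`order.id`, `order.item_count`, `order.total`, `error.code`, `demo.broken_checkout`) are assumptions, since the page only states that order details, error codes, and the flag are recorded.

```javascript
"use strict";

// Hypothetical helper: builds the custom attributes for a checkout span.
// All attribute keys are illustrative assumptions, not the webstore's real names.
function checkoutSpanAttributes(order, env = process.env, errorCode = null) {
  const attributes = {
    "order.id": order.id,
    "order.item_count": order.items.length,
    "order.total": order.total,
    // Recording the flag makes broken requests easy to filter in queries.
    "demo.broken_checkout": env.DEMO_BROKEN_CHECKOUT === "true",
  };
  if (errorCode !== null) {
    attributes["error.code"] = errorCode;
  }
  return attributes;
}

module.exports = { checkoutSpanAttributes };
```

In a route handler, the returned object could then be passed to the active span's `setAttributes` method from `@opentelemetry/api`.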
Next.js standalone mode's file tracing can't follow dynamic imports in instrumentation.ts. The preload script (node --require ./telemetry-preload.js server.js) bypasses the Next.js instrumentation hook entirely and ensures OpenTelemetry is initialized before any application code runs.
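A minimal sketch of what such a preload script might look like follows. It assumes the `@azure/monitor-opentelemetry` package and its `useAzureMonitor` function, and that the connection string arrives in the `APPLICATIONINSIGHTS_CONNECTION_STRING` environment variable. The `initTelemetry` helper and its `loadSdk` parameter are hypothetical, added so the logic can be exercised without the SDK installed; the webstore's actual script may differ.

```javascript
"use strict";

// telemetry-preload.js (sketch), loaded with:
//   node --require ./telemetry-preload.js server.js
// Assumption: the connection string is supplied via
// APPLICATIONINSIGHTS_CONNECTION_STRING.
function initTelemetry(
  env = process.env,
  loadSdk = () => require("@azure/monitor-opentelemetry")
) {
  const connectionString = env.APPLICATIONINSIGHTS_CONNECTION_STRING;
  if (!connectionString) {
    // No telemetry sink configured: start the app without instrumentation.
    return false;
  }
  const { useAzureMonitor } = loadSdk();
  useAzureMonitor({ azureMonitorExporterOptions: { connectionString } });
  return true;
}

// Runs at require time, before server.js loads any application code.
initTelemetry();

module.exports = { initTelemetry };
```

Because the SDK is only required when a connection string is present, the same script can run locally without Azure credentials.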
The failure mechanism
When DEMO_BROKEN_CHECKOUT=true is set on the Container App:
| Aspect | Behavior |
|---|---|
| Browsing | ✅ Works normally: product listing, detail pages, cart |
| Health check | ✅ Always returns 200 (decoupled from business logic) |
| Checkout | ❌ Returns 503 after a 1.5 s simulated delay |
| Telemetry | Records error attributes, exception events, and 503 status on the span |
This creates a realistic "partial outage" scenario: exactly the kind of subtle failure that traditional health checks miss but an SRE Agent can detect through telemetry correlation.
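The behavior in the table can be sketched as two handlers, one that ignores the flag and one that honors it. The handler names, response bodies, the 201 success status, and the injectable `wait` parameter are assumptions made for illustration; only the 200 health response, the 503 status, and the 1.5 s delay come from the description above.

```javascript
"use strict";

const CHECKOUT_DELAY_MS = 1500; // the 1.5 s simulated delay

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Health check: decoupled from business logic, so it never reads the flag.
function handleHealth() {
  return { status: 200, body: { status: "ok" } };
}

// Checkout: returns 503 after the delay when DEMO_BROKEN_CHECKOUT is "true".
// The 201 success response is an assumed placeholder for the real order logic.
async function handleCheckout(order, env = process.env, wait = sleep) {
  if (env.DEMO_BROKEN_CHECKOUT === "true") {
    await wait(CHECKOUT_DELAY_MS);
    return { status: 503, body: { error: "checkout unavailable" } };
  }
  return { status: 201, body: { orderId: order.id } };
}

module.exports = { handleHealth, handleCheckout, CHECKOUT_DELAY_MS };
```

Since the health endpoint stays green while checkout fails, a probe-based monitor sees nothing wrong; the failure is visible only in the request telemetry.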
Resource groups
| Resource Group | Region | Contains |
|---|---|---|
| rg-webstore-demo | centralus | Container App, PostgreSQL, ACR, Key Vault, App Insights (telemetry) |
| rg-webstore-sre-agent | eastus2 | SRE Agent, App Insights (agent diagnostics), Log Analytics, Managed Identity |
Azure SRE Agent is available in specific regions (eastus2, swedencentral, australiaeast, uksouth). The agent can monitor resources in any region; it only needs to be deployed in a supported one.
CI/CD
| Workflow | Repo | Purpose |
|---|---|---|
| webstore-demo-break.yml | azure-sre-agent-demo | Break checkout (workflow dispatch) |
| webstore-demo-reset.yml | azure-sre-agent-demo | Restore checkout (workflow dispatch) |
| ci.yml | webstore | Lint, typecheck, test on every push/PR |
| deploy.yml | webstore | Build Docker image, push to ACR, update Container App |
| pr-staging.yml | webstore | Ephemeral staging environments for PRs |