Architecture

System diagram

┌─────────────────────────────────────────────────────────────────┐
│  Azure SRE Agent (eastus2)                                      │
│  rg-webstore-sre-agent                                          │
│                                                                 │
│  ┌───────────────────────┐    ┌─────────────────────────────┐   │
│  │  SRE Agent            │    │  Application Insights       │   │
│  │  (Microsoft.App/      │    │  + Log Analytics Workspace  │   │
│  │   agents)             │    │                             │   │
│  └───────────┬───────────┘    └──────────────┬──────────────┘   │
│              │ monitors & investigates       │                  │
└──────────────┼───────────────────────────────┼──────────────────┘
               │                               │
               ▼                               │ telemetry
┌──────────────────────────────────────────────┼──────────────────┐
│  Webstore Application (centralus)            │                  │
│  rg-webstore-demo                            │                  │
│                                              │                  │
│  ┌───────────────┐  ┌────────────┐  ┌────────┴──┐  ┌──────────┐ │
│  │ Container App │  │ PostgreSQL │  │ App       │  │ Key Vault│ │
│  │ (Next.js +    │  │ Flexible   │  │ Insights  │  │          │ │
│  │ OpenTelemetry │──│ Server     │  │ (telemetry│  │          │ │
│  │ SDK)          │  │            │  │ sink)     │  │          │ │
│  └───────────────┘  └────────────┘  └───────────┘  └──────────┘ │
│                                                                 │
│  ┌───────────────────────────────────────────────────────────┐  │
│  │              Azure Container Registry (ACR)               │  │
│  └───────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘

Telemetry flow

The webstore uses the Azure Monitor OpenTelemetry SDK to instrument every request:

  1. Container App runs Next.js with a preload script that initializes OpenTelemetry before the app starts
  2. Every HTTP request, database query, and exception becomes a span in Application Insights
  3. The checkout API (POST /api/orders) adds custom span attributes: order details, error codes, and the DEMO_BROKEN_CHECKOUT flag
  4. Azure SRE Agent queries Application Insights to detect anomalies and correlate failures

Why a preload script?

Next.js standalone mode's file tracing can't follow dynamic imports in instrumentation.ts. The preload script (node --require ./telemetry-preload.js server.js) bypasses the Next.js instrumentation hook entirely and ensures OpenTelemetry is initialized before any application code runs.
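
The start-up ordering described above can be sketched as follows. This is a minimal illustration written in TypeScript, not the webstore's actual telemetry-preload.js. The function name `preloadTelemetry`, the injected `startTelemetry` callback, and the decision to skip initialization when no connection string is present are all assumptions. In a real preload file the callback would wrap the SDK's start-up call (`useAzureMonitor` from `@azure/monitor-opentelemetry`).

```typescript
// Hypothetical shape of the preload logic; names are illustrative only.
type Env = Record<string, string | undefined>;
type StartTelemetry = (connectionString: string) => void;

// Intended to run once, before server.js is loaded, so that instrumentation
// can patch HTTP and database modules before the app imports them.
function preloadTelemetry(env: Env, startTelemetry: StartTelemetry): boolean {
  const connectionString = env["APPLICATIONINSIGHTS_CONNECTION_STRING"];
  if (!connectionString) {
    // Assumption: without a connection string the app starts uninstrumented.
    return false;
  }
  startTelemetry(connectionString);
  return true;
}

// Stand-in for the SDK call. A real preload would do something like:
//   useAzureMonitor({ azureMonitorExporterOptions: { connectionString } })
const startedWith: string[] = [];
const started = preloadTelemetry(
  { APPLICATIONINSIGHTS_CONNECTION_STRING: "InstrumentationKey=00000000-0000-0000-0000-000000000000" },
  (connectionString) => {
    startedWith.push(connectionString);
  },
);
```

Passing the environment and the start-up call in as parameters keeps the sketch free of package dependencies. An actual preload would read `process.env` directly.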

The failure mechanism

When DEMO_BROKEN_CHECKOUT=true is set on the Container App:

| Aspect | Behavior |
| --- | --- |
| Browsing | ✅ Works normally: product listing, detail pages, cart |
| Health check | ✅ Always returns 200 (decoupled from business logic) |
| Checkout | ❌ Returns 503 after a 1.5 s simulated delay |
| Telemetry | Records error attributes, exception events, and 503 status on the span |
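
As a rough sketch of the behavior listed above, the flag check might reduce to something like the following. Only the flag name, the 503 status, and the 1.5 s delay come from this page. The function names, the span attribute keys, the error code value, and the 201 success status are assumptions made for illustration.

```typescript
// Hypothetical model of the checkout handler's decision; not the real route code.
type Env = Record<string, string | undefined>;

interface CheckoutOutcome {
  status: number;
  delayMs: number;
  spanAttributes: Record<string, string | boolean>;
}

function checkoutOutcome(env: Env): CheckoutOutcome {
  if (env["DEMO_BROKEN_CHECKOUT"] !== "true") {
    // Assumption: a successfully created order answers 201 with no delay.
    return { status: 201, delayMs: 0, spanAttributes: { "demo.broken_checkout": false } };
  }
  return {
    status: 503,
    delayMs: 1500, // the 1.5 s simulated delay
    spanAttributes: {
      "demo.broken_checkout": true, // assumed attribute key
      "error.code": "CHECKOUT_UNAVAILABLE", // assumed attribute key and value
    },
  };
}

// The health check never consults the flag, so it keeps reporting 200.
function healthStatus(): number {
  return 200;
}

const brokenCheckout = checkoutOutcome({ DEMO_BROKEN_CHECKOUT: "true" });
const normalCheckout = checkoutOutcome({});
```

The caller would wait `delayMs` before responding and copy `spanAttributes` onto the active span. Both steps are left out here so the decision logic stays synchronous and easy to test.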

This creates a realistic "partial outage" scenario: exactly the kind of subtle failure that traditional health checks miss but an SRE Agent can detect through telemetry correlation.
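
To make the idea of a partial outage concrete, here is a small sketch that flags the pattern from a list of request records: the health route stays clean while another route fails most of the time. It does not describe how Azure SRE Agent works internally. The record shape, the 0.5 threshold, and the `/api/health` and `/products` routes are hypothetical; only `POST /api/orders` comes from this page.

```typescript
// Hypothetical request record, loosely modeled on a row of request telemetry.
interface RequestRecord {
  route: string;
  status: number;
}

// Fraction of requests per route that ended in a 5xx status.
function failureRates(records: RequestRecord[]): Record<string, number> {
  const totals: Record<string, { failed: number; total: number }> = {};
  for (const record of records) {
    if (!(record.route in totals)) {
      totals[record.route] = { failed: 0, total: 0 };
    }
    totals[record.route].total += 1;
    if (record.status >= 500) {
      totals[record.route].failed += 1;
    }
  }
  const rates: Record<string, number> = {};
  for (const route of Object.keys(totals)) {
    rates[route] = totals[route].failed / totals[route].total;
  }
  return rates;
}

// Partial outage: the health route shows no failures (or no traffic at all),
// yet at least one other route fails at or above the threshold.
function isPartialOutage(records: RequestRecord[], healthRoute: string, threshold = 0.5): boolean {
  const rates = failureRates(records);
  const healthOk = (rates[healthRoute] || 0) === 0;
  const someRouteFailing = Object.keys(rates).some(
    (route) => route !== healthRoute && rates[route] >= threshold,
  );
  return healthOk && someRouteFailing;
}

const sample: RequestRecord[] = [
  { route: "GET /api/health", status: 200 },
  { route: "GET /api/health", status: 200 },
  { route: "GET /products", status: 200 },
  { route: "POST /api/orders", status: 503 },
  { route: "POST /api/orders", status: 503 },
];
const outageDetected = isPartialOutage(sample, "GET /api/health");
```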

Resource groups

| Resource Group | Region | Contains |
| --- | --- | --- |
| rg-webstore-demo | centralus | Container App, PostgreSQL, ACR, Key Vault, App Insights (telemetry) |
| rg-webstore-sre-agent | eastus2 | SRE Agent, App Insights (agent diagnostics), Log Analytics, Managed Identity |

Why different regions?

Azure SRE Agent is available in specific regions (eastus2, swedencentral, australiaeast, uksouth). The agent can monitor resources in any region β€” it just needs to be deployed in a supported one.
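
A deployment script could guard against an unsupported agent region with a check like the one below. This is a sketch under the assumption that the region list above is current; the list may change over time, and the constant and function names are hypothetical.

```typescript
// Regions taken from the list above; treat this as a snapshot, not a source of truth.
const SRE_AGENT_REGIONS: string[] = ["eastus2", "swedencentral", "australiaeast", "uksouth"];

// Only the agent's own region is constrained. The monitored app can live anywhere.
function isSupportedAgentRegion(region: string): boolean {
  return SRE_AGENT_REGIONS.indexOf(region.toLowerCase()) !== -1;
}

const agentRegionOk = isSupportedAgentRegion("eastus2");
const appRegionWouldWorkForAgent = isSupportedAgentRegion("centralus");
```

With the layout on this page, the agent region (eastus2) passes the check. The application region (centralus) would not, which is why the two resource groups sit in different regions.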

CI/CD

| Workflow | Repo | Purpose |
| --- | --- | --- |
| webstore-demo-break.yml | azure-sre-agent-demo | Break checkout (workflow dispatch) |
| webstore-demo-reset.yml | azure-sre-agent-demo | Restore checkout (workflow dispatch) |
| ci.yml | webstore | Lint, typecheck, test on every push/PR |
| deploy.yml | webstore | Build Docker image, push to ACR, update Container App |
| pr-staging.yml | webstore | Ephemeral staging environments for PRs |
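
The break and reset workflows run on workflow dispatch, so they can be started from a script as well as from the GitHub UI. The sketch below builds, without sending, the request that GitHub's REST API expects for a workflow dispatch. The owner `example-org`, the `main` ref, and the placeholder token are assumptions; substitute the real values for your fork.

```typescript
// Plain description of an HTTP request; pass it to fetch or any HTTP client to send it.
interface DispatchRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

function buildWorkflowDispatch(
  owner: string,
  repo: string,
  workflowFile: string,
  ref: string,
  token: string,
): DispatchRequest {
  return {
    method: "POST",
    // The workflow can be addressed by its file name instead of a numeric ID.
    url: `https://api.github.com/repos/${owner}/${repo}/actions/workflows/${workflowFile}/dispatches`,
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
    },
    body: JSON.stringify({ ref }),
  };
}

// "example-org" and "<token>" are placeholders, not values from this project.
const breakCheckoutRequest = buildWorkflowDispatch(
  "example-org",
  "azure-sre-agent-demo",
  "webstore-demo-break.yml",
  "main",
  "<token>",
);
```

Swapping in `webstore-demo-reset.yml` gives the matching request to restore checkout.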