
# Incident response

How Plaza responds to incidents. Framed around customer impact. Written for the on-call engineer at 03:00.

The runbook at runbook.md is the per-incident playbook. This document is the framework — severity ladder, response shape, drill cadence — that the runbook fills in.

## Severity ladder

| Level | Definition | Response |
| --- | --- | --- |
| SEV-0 | Customer money at risk; or all production down | Page everyone. Public status page within 15 minutes. CEO on call. |
| SEV-1 | Subset of customers affected; core flow degraded | Page on-call. Public status page within 30 minutes. CEO informed. |
| SEV-2 | SLO breach without customer-visible failure | On-call investigates next business hour. Internal status note. |
| SEV-3 | Cosmetic; minor | Ticket. No page. |

When in doubt, escalate one level. Plaza holds money; the cost of over-paging is one tired engineer, while the cost of under-paging is a stuck customer.
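
The when-in-doubt rule can be read as a one-step promotion on the ladder. A minimal sketch (the helper name is illustrative, not part of any Plaza tooling):

```shell
# Illustrative helper: promote a proposed severity one level, per the
# "when in doubt, escalate one level" rule. SEV-0 has nowhere higher to go.
escalate() {
  case "$1" in
    SEV-0) echo "SEV-0" ;;
    SEV-1) echo "SEV-0" ;;
    SEV-2) echo "SEV-1" ;;
    SEV-3) echo "SEV-2" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

So an engineer unsure whether a degraded core flow is SEV-2 or SEV-1 treats it as SEV-1.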

## Response shape

**SEV-0**

- On-call engineer acknowledges the page within 5 minutes.
- Engineer creates an incident channel (#inc-YYYYMMDD-keyword) and starts a timeline.
- CEO is paged. Communications lead is paged.
- Status page entry posted within 15 minutes (template: status-templates/investigating.md).
- All other engineering work pauses. The incident channel is the single source of truth.
- Status page updates every 30 minutes until resolution, or sooner on state changes.
- Postmortem is mandatory and public within 14 days.

**SEV-1**

- On-call acknowledges within 10 minutes.
- Incident channel opened. Timeline started.
- Status page entry within 30 minutes.
- Status page updates every 60 minutes until resolution.
- Postmortem mandatory; internal-only or public depending on customer impact.

**SEV-2**

- On-call investigates within the next business hour.
- Internal status note in #plaza-status. No public status page entry unless escalated.
- Postmortem optional; required if recurrence is likely.

**SEV-3**

- Filed as a ticket. No real-time response.
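
The channel naming convention above can be scripted so nobody has to invent a format at 03:00. A sketch (the helper name is hypothetical):

```shell
# Hypothetical helper: build the incident channel name following the
# #inc-YYYYMMDD-keyword convention, using today's UTC date.
inc_channel() {
  printf '#inc-%s-%s\n' "$(date -u +%Y%m%d)" "$1"
}
```

For example, `inc_channel hot-wallet` on 2025-01-05 yields `#inc-20250105-hot-wallet`.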

## Drills

Plaza runs each of the scenarios below as a tabletop drill at least quarterly. After mainnet launch, two per quarter run as live drills in staging.

### Hot wallet drain

**Scenario.** The custodied hot wallet shows a large unauthorized outbound transfer.

**Alert.** HotWalletAboveCap fires inverted (balance dropped to zero); ReconciliationDriftHigh follows. LedgerSumZeroViolation may also fire.

**First moves.**

  1. Pause custodied mode: PLAZA_CUSTODIED_MODE_DISABLED=1, reload plaza-api.
  2. Page CEO. SEV-0.
  3. Rotate MPC signer keys.
  4. Pull all signer audit logs for the last 7 days.
  5. Notify pilot orgs via direct channel. Public status page entry.
  6. Sweep the cold wallet to a fresh address in the next signing ceremony.
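
Step 1 can be sketched as an env-file edit plus reload. The env-file mechanism is an assumption about the deployment, not a fact from this document; only the flag name comes from the runbook:

```shell
# Sketch of step 1. Assumes plaza-api reads flags from an env file whose
# path the operator passes in (the exact location is deployment-specific).
disable_custodied_mode() {
  local env_file="$1"
  # Drop any existing setting, then append the disabled flag.
  grep -v '^PLAZA_CUSTODIED_MODE_DISABLED=' "$env_file" > "$env_file.tmp" || true
  echo 'PLAZA_CUSTODIED_MODE_DISABLED=1' >> "$env_file.tmp"
  mv "$env_file.tmp" "$env_file"
  echo "custodied mode disabled in $env_file; now reload plaza-api"
}
```

The delete-then-append shape keeps the edit idempotent: running it twice leaves exactly one setting.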

**Recovery posture.** New orders run in contract mode only until the post-incident audit is complete and the new signer is in place.

### MPC provider outage

**Scenario.** Turnkey or Privy is unavailable. New escrow funding cannot be signed.

**Alert.** The custom mpc_signer_errors_total metric exceeds its threshold; payout worker latency climbs.

**First moves.**

  1. Confirm the outage upstream — check the provider’s status page.
  2. SEV-1; escalate to SEV-0 if the outage exceeds 60 minutes during business hours.
  3. Fail over to the secondary MPC provider (the architecture supports two providers; the runtime flag is PLAZA_MPC_PROVIDER).
  4. Public status page entry — pending orders in custodied mode are queued, not lost.

**Recovery posture.** Resume primary when the upstream confirms recovery and a 10-minute health probe is green.
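
The 10-minute green probe can be a simple loop. The health URL, the 30-second interval, and the curl-based check are all assumptions, not details from this document:

```shell
# Sketch of the pre-resume health probe: poll a health endpoint for the
# full window; any single failure aborts. URL and cadence are assumptions.
probe_green() {
  local url="$1" duration="${2:-600}" interval="${3:-30}" elapsed=0
  while [ "$elapsed" -lt "$duration" ]; do
    curl -fsS "$url" > /dev/null || return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "green"
}
```

The window is strict by design: one failed probe restarts the clock from zero when the operator reruns it.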

### RPC outage

**Scenario.** The Base RPC provider is unavailable. Payouts stall, on-chain confirmations stall, reconciliation cannot complete.

**Alert.** RpcErrorRateHigh, PayoutFailureRateHigh, ReconciliationDriftHigh (drift becomes unverifiable, not necessarily real).

**First moves.**

  1. Fail over to the secondary RPC. The runtime supports two providers; the secondary is set via PLAZA_RPC_URL_2.
  2. SEV-1.
  3. Pause sweep cadence; in-flight sweeps will retry.
  4. Customer impact is delayed payout. Public status page entry advises pilots that payouts are queued.

### NATS down

**Scenario.** The message broker is unavailable. Webhook delivery, realtime events, and inter-service messaging all stall.

**Alert.** NatsConsumerLag fires; health checks on the broker fail.

**First moves.**

  1. Restart the broker. NATS is single-node at launch (this is a known sequencing tradeoff in PLAN.md).
  2. SEV-1. Webhook deliveries are buffered in the outbox; nothing is lost, only delayed.
  3. Once recovered, the outbox drainer catches up; watch OutboxDepthHigh clear.

**Recovery posture.** Multi-node NATS clustering is on the post-launch backlog. Today, the outbox guarantees no lost events.
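
Watching OutboxDepthHigh clear amounts to polling one gauge. A sketch that reads a Prometheus-style metrics dump; the `outbox_depth` metric name is an assumption (the real name lives in infra/observability/alerts.yml):

```shell
# Extract the outbox depth gauge from a scraped metrics file. Handles both
# bare and labeled forms ("outbox_depth 12", 'outbox_depth{...} 12').
outbox_depth() {
  awk '/^outbox_depth[ {]/ { print $NF; found = 1 } END { exit !found }' "$1"
}
```

When the printed value reaches 0 and stays there, the drainer has caught up.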

### DB corruption

**Scenario.** Postgres reports an inconsistency, a bad write, or a failed integrity check.

**Alert.** LedgerSumZeroViolation on every write; BackupStale may also fire.

**First moves.**

  1. SEV-0. Halt all writes — set PLAZA_READ_ONLY=1, reload plaza-api.
  2. Page CEO and database lead.
  3. Verify the integrity of the most recent backup (infra/backup/verify.sh).
  4. Decide: in-place repair vs. restore from backup. Restore is the safer default.
  5. Restore loses minutes of data; the outbox preserves webhook deliveries; the ledger preserves money state.

**Recovery posture.** PITR (point-in-time recovery) is configured against R2-backed WAL archives. Maximum data loss objective is 5 minutes.
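
The 5-minute objective can be spot-checked from the age of the newest WAL archive. The flat archive directory layout is an assumption; the stat invocation uses the GNU form with a BSD fallback:

```shell
# Check that the newest file in a WAL archive directory is no older than
# max_age seconds (default 300, matching the 5-minute objective).
wal_age_ok() {
  local dir="$1" max_age="${2:-300}" newest now mtime
  newest=$(ls -t "$dir" | head -n 1)
  [ -n "$newest" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$dir/$newest" 2>/dev/null || stat -f %m "$dir/$newest")
  [ $((now - mtime)) -le "$max_age" ]
}
```

A failing check during the restore decision is a signal to expect more than 5 minutes of loss and to escalate accordingly.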

### Dispute flood

**Scenario.** A spike in disputes overwhelms the arbitrator pipeline. Could be coordinated abuse, could be a regression.

**Alert.** ArbitrationP50Breached; the dispute open-rate metric exceeds its rolling baseline.

**First moves.**

  1. SEV-1.
  2. Confirm the source. Single seller? Single buyer? Single category? Operator console has the breakdown.
  3. If coordinated abuse: rate-limit the disputing accounts, hold their disputes for human review.
  4. If regression: roll back the most recent arbitrator deploy.
  5. Public status page entry only if customer-visible. Most dispute floods are internal-only.
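
The "exceeds rolling baseline" trigger is a ratio test at heart. A sketch; the 3x multiplier is an assumption for illustration, not Plaza's actual threshold:

```shell
# Return success when the current dispute open rate exceeds three times
# the rolling baseline; both arguments are opens per hour.
dispute_spike() {
  awk -v cur="$1" -v base="$2" 'BEGIN { exit !(cur > 3 * base) }'
}
```

A multiplicative threshold tolerates normal growth; an absolute one would need constant retuning as order volume climbs.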

### Arbitrator prompt injection

**Scenario.** An adversarial buyer or seller exploits the arbitrator’s prompt to flip a verdict.

**Alert.** Anomaly detection on verdict distributions; a manual report from a counterparty.

**First moves.**

  1. SEV-1.
  2. Pause auto-resolution on the affected category. New disputes go to human review.
  3. Pull the prompt + completions for the suspect verdicts. Confirm the injection.
  4. Patch the arbitrator prompt; deploy with a verdict-replay test that includes the injection vector.
  5. Reverse affected verdicts via the appeal pipeline; refund manually if needed.

**Recovery posture.** A verdict-replay corpus is maintained in crates/plaza-arbitrator/tests/corpus/. New injection vectors are added to the corpus and to the regression suite.

## Alert map

The full alert set lives in infra/observability/alerts.yml. The map below ties severity to alert name; both are stable identifiers.

| Alert | Default severity | Drill |
| --- | --- | --- |
| LedgerSumZeroViolation | SEV-0 | Hot wallet drain, DB corruption |
| ReconciliationDriftHigh | SEV-0 | Hot wallet drain, RPC outage |
| HotWalletAboveCap | SEV-1 | Sweep cadence audit |
| PayoutFailureRateHigh | SEV-1 | RPC outage |
| RpcErrorRateHigh | SEV-1 | RPC outage |
| SearchP99Breached | SEV-2 | Capacity drill |
| OrderPlacementP99Breached | SEV-2 | Capacity drill |
| MessageDeliveryP99Breached | SEV-2 | NATS down |
| ReputationLookupP99Breached | SEV-2 | Capacity drill |
| ArbitrationP50Breached | SEV-1 | Dispute floods |
| PayoutP50Breached | SEV-1 | RPC outage |
| PayoutP99Breached | SEV-1 | RPC outage |
| ApiErrorRateHigh | SEV-1 | General reliability |
| WebhookDeadLetterRateHigh | SEV-2 | NATS down |
| NatsConsumerLag | SEV-2 | NATS down |
| OutboxDepthHigh | SEV-2 | NATS down |
| PostgresReplicationLagHigh | SEV-2 | DB capacity |
| DiskFreeLow | SEV-2 | Capacity |
| DiskFreeCritical | SEV-1 | Capacity |
| CertificateExpiring | SEV-2 | Cert renewal |
| CertificateExpiringCritical | SEV-1 | Cert renewal |
| BackupStale | SEV-1 | Backup verify |
## Postmortem template

```markdown
# Incident YYYY-MM-DD: <one-line summary>

## Impact
Who was affected, for how long, what they saw.

## Timeline
HH:MM UTC — event.
HH:MM UTC — event.

## Root cause
What broke and why.

## Detection
How we found out. Time-to-detect.

## Mitigation
What we did. Time-to-mitigate.

## Resolution
What put it back to fully healthy. Time-to-resolve.

## What went well

## What went badly

## Action items
- [ ] Owner — Action — Due date
```

## Communications

The status page is the source of truth for customers. Templates per state are in docs/operations/status-templates/. Use them; do not write fresh copy under pressure.

For pilot orgs we also send a direct message via the agreed Slack Connect channel for SEV-0 and SEV-1. The status page entry takes precedence on detail; the direct message is a heads-up.

For SEV-0 specifically, the public postmortem is published within 14 days. We name what broke. We do not blame individuals.