
# Incident response

How Plaza responds to incidents. Framed around customer impact. Written for the on-call engineer at 03:00.

The runbook at runbook.md is the per-incident playbook. This document is the framework — severity ladder, response shape, drill cadence — that the runbook fills in.

## Severity ladder

| Level | Definition | Response |
| --- | --- | --- |
| SEV-0 | Customer money at risk; or all production down | Page everyone. Public status page within 15 minutes. CEO on call. |
| SEV-1 | Subset of customers affected; core flow degraded | Page on-call. Public status page within 30 minutes. CEO informed. |
| SEV-2 | SLO breach without customer-visible failure | On-call investigates next business hour. Internal status note. |
| SEV-3 | Cosmetic; minor | Ticket. No page. |

When in doubt, escalate one level. Plaza holds money; the cost of over-paging is one tired engineer, while the cost of under-paging is a stuck customer.
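
The when-in-doubt rule can be read as a one-step promotion on the ladder. A minimal sketch (the helper name is illustrative, not part of any Plaza tooling):

```shell
# Illustrative helper: promote a proposed severity one level, per the
# "when in doubt, escalate one level" rule. SEV-0 has nowhere higher to go.
escalate() {
  case "$1" in
    SEV-0) echo "SEV-0" ;;
    SEV-1) echo "SEV-0" ;;
    SEV-2) echo "SEV-1" ;;
    SEV-3) echo "SEV-2" ;;
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

So an engineer unsure whether a degraded core flow is SEV-2 or SEV-1 treats it as SEV-1.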

## Response shape

**SEV-0**

- On-call engineer acknowledges the page within 5 minutes.
- Engineer creates an incident channel (#inc-YYYYMMDD-keyword) and starts a timeline.
- CEO is paged. Communications lead is paged.
- Status page entry posted within 15 minutes (template: status-templates/investigating.md).
- All other engineering work pauses. The incident channel is the single source of truth.
- Status page updates every 30 minutes until resolution, or sooner on state changes.
- Postmortem is mandatory and public within 14 days.

**SEV-1**

- On-call acknowledges within 10 minutes.
- Incident channel opened. Timeline started.
- Status page entry within 30 minutes.
- Status page updates every 60 minutes until resolution.
- Postmortem mandatory; internal-only or public depending on customer impact.

**SEV-2**

- On-call investigates within the next business hour.
- Internal status note in #plaza-status. No public status page entry unless escalated.
- Postmortem optional; required if recurrence is likely.

**SEV-3**

- Filed as a ticket. No real-time response.
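
The channel naming convention above can be scripted so nobody has to invent a format at 03:00. A sketch (the helper name is hypothetical):

```shell
# Hypothetical helper: build the incident channel name following the
# #inc-YYYYMMDD-keyword convention, using today's UTC date.
inc_channel() {
  printf '#inc-%s-%s\n' "$(date -u +%Y%m%d)" "$1"
}
```

For example, `inc_channel hot-wallet` on 2025-01-05 yields `#inc-20250105-hot-wallet`.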

## Drills

Plaza runs each of the scenarios below as a tabletop drill at least quarterly. After mainnet launch, two per quarter run as live drills in staging.

### Hot wallet drain

**Scenario.** The custodied hot wallet shows a large unauthorized outbound transfer.

**Alert.** HotWalletAboveCap fires inverted (balance dropped to zero); ReconciliationDriftHigh follows. LedgerSumZeroViolation may also fire.

**First moves.**

  1. Pause custodied mode: PLAZA_CUSTODIED_MODE_DISABLED=1, reload plaza-api.
  2. Page CEO. SEV-0.
  3. Rotate MPC signer keys.
  4. Pull all signer audit logs for the last 7 days.
  5. Notify pilot orgs via direct channel. Public status page entry.
  6. Sweep the cold wallet to a fresh address in the next signing ceremony.
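
Step 1 can be sketched as an env-file edit plus reload. The env-file mechanism is an assumption about the deployment, not a fact from this document; only the flag name comes from the runbook:

```shell
# Sketch of step 1. Assumes plaza-api reads flags from an env file whose
# path the operator passes in (the exact location is deployment-specific).
disable_custodied_mode() {
  local env_file="$1"
  # Drop any existing setting, then append the disabled flag.
  grep -v '^PLAZA_CUSTODIED_MODE_DISABLED=' "$env_file" > "$env_file.tmp" || true
  echo 'PLAZA_CUSTODIED_MODE_DISABLED=1' >> "$env_file.tmp"
  mv "$env_file.tmp" "$env_file"
  echo "custodied mode disabled in $env_file; now reload plaza-api"
}
```

The delete-then-append shape keeps the edit idempotent: running it twice leaves exactly one setting.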

**Recovery posture.** New orders run in contract mode only until the post-incident audit is complete and the new signer is in place.

### MPC provider outage

**Scenario.** Turnkey or Privy is unavailable. New escrow funding cannot be signed.

**Alert.** The custom mpc_signer_errors_total metric exceeds its threshold; payout worker latency climbs.

**First moves.**

  1. Confirm the outage upstream — check the provider’s status page.
  2. SEV-1; escalate to SEV-0 if the outage exceeds 60 minutes during business hours.
  3. Fail over to the secondary MPC provider (the architecture supports two providers; the runtime flag is PLAZA_MPC_PROVIDER).
  4. Public status page entry — pending orders in custodied mode are queued, not lost.

**Recovery posture.** Resume primary when the upstream confirms recovery and a 10-minute health probe is green.
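
The 10-minute green probe can be a simple loop. The health URL, the 30-second interval, and the curl-based check are all assumptions, not details from this document:

```shell
# Sketch of the pre-resume health probe: poll a health endpoint for the
# full window; any single failure aborts. URL and cadence are assumptions.
probe_green() {
  local url="$1" duration="${2:-600}" interval="${3:-30}" elapsed=0
  while [ "$elapsed" -lt "$duration" ]; do
    curl -fsS "$url" > /dev/null || return 1
    sleep "$interval"
    elapsed=$((elapsed + interval))
  done
  echo "green"
}
```

The window is strict by design: one failed probe restarts the clock from zero when the operator reruns it.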

### RPC outage

**Scenario.** The Base RPC provider is unavailable. Payouts stall, on-chain confirmations stall, reconciliation cannot complete.

**Alert.** RpcErrorRateHigh, PayoutFailureRateHigh, ReconciliationDriftHigh (drift becomes unverifiable, not necessarily real).

**First moves.**

  1. Fail over to the secondary RPC. The runtime supports two providers; the secondary is set via PLAZA_RPC_URL_2.
  2. SEV-1.
  3. Pause sweep cadence; in-flight sweeps will retry.
  4. Customer impact is delayed payout. Public status page entry advises pilots that payouts are queued.

### NATS down

**Scenario.** The message broker is unavailable. Webhook delivery, realtime events, and inter-service messaging all stall.

**Alert.** NatsConsumerLag fires; health checks on the broker fail.

**First moves.**

  1. Restart the broker. NATS is single-node at launch (this is a known sequencing tradeoff in PLAN.md).
  2. SEV-1. Webhook deliveries are buffered in the outbox; nothing is lost, only delayed.
  3. Once recovered, the outbox drainer catches up; watch OutboxDepthHigh clear.

**Recovery posture.** Multi-node NATS clustering is on the post-launch backlog. Today, the outbox guarantees no lost events.
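
Watching OutboxDepthHigh clear amounts to polling one gauge. A sketch that reads a Prometheus-style metrics dump; the `outbox_depth` metric name is an assumption (the real name lives in infra/observability/alerts.yml):

```shell
# Extract the outbox depth gauge from a scraped metrics file. Handles both
# bare and labeled forms ("outbox_depth 12", 'outbox_depth{...} 12').
outbox_depth() {
  awk '/^outbox_depth[ {]/ { print $NF; found = 1 } END { exit !found }' "$1"
}
```

When the printed value reaches 0 and stays there, the drainer has caught up.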

### DB corruption

**Scenario.** Postgres reports an inconsistency, a bad write, or a failed integrity check.

**Alert.** LedgerSumZeroViolation on every write; BackupStale may also fire.

**First moves.**

  1. SEV-0. Halt all writes — set PLAZA_READ_ONLY=1, reload plaza-api.
  2. Page CEO and database lead.
  3. Verify the integrity of the most recent backup (infra/backup/verify.sh).
  4. Decide: in-place repair vs. restore from backup. Restore is the safer default.
  5. Restore loses minutes of data; the outbox preserves webhook deliveries; the ledger preserves money state.

**Recovery posture.** PITR (point-in-time recovery) is configured against R2-backed WAL archives. Maximum data loss objective is 5 minutes.
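
The 5-minute objective can be spot-checked from the age of the newest WAL archive. The flat archive directory layout is an assumption; the stat invocation uses the GNU form with a BSD fallback:

```shell
# Check that the newest file in a WAL archive directory is no older than
# max_age seconds (default 300, matching the 5-minute objective).
wal_age_ok() {
  local dir="$1" max_age="${2:-300}" newest now mtime
  newest=$(ls -t "$dir" | head -n 1)
  [ -n "$newest" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$dir/$newest" 2>/dev/null || stat -f %m "$dir/$newest")
  [ $((now - mtime)) -le "$max_age" ]
}
```

A failing check during the restore decision is a signal to expect more than 5 minutes of loss and to escalate accordingly.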

### Dispute flood

**Scenario.** A spike in disputes overwhelms the arbitrator pipeline. Could be coordinated abuse, could be a regression.

**Alert.** ArbitrationP50Breached; the dispute open-rate metric exceeds its rolling baseline.

**First moves.**

  1. SEV-1.
  2. Confirm the source. Single seller? Single buyer? Single category? Operator console has the breakdown.
  3. If coordinated abuse: rate-limit the disputing accounts, hold their disputes for human review.
  4. If regression: roll back the most recent arbitrator deploy.
  5. Public status page entry only if customer-visible. Most dispute floods are internal-only.
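
The "exceeds rolling baseline" trigger is a ratio test at heart. A sketch; the 3x multiplier is an assumption for illustration, not Plaza's actual threshold:

```shell
# Return success when the current dispute open rate exceeds three times
# the rolling baseline; both arguments are opens per hour.
dispute_spike() {
  awk -v cur="$1" -v base="$2" 'BEGIN { exit !(cur > 3 * base) }'
}
```

A multiplicative threshold tolerates normal growth; an absolute one would need constant retuning as order volume climbs.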

### Arbitrator prompt injection

**Scenario.** An adversarial buyer or seller exploits the arbitrator’s prompt to flip a verdict.

**Alert.** Anomaly detection on verdict distributions; a manual report from a counterparty.

**First moves.**

  1. SEV-1.
  2. Pause auto-resolution on the affected category. New disputes go to human review.
  3. Pull the prompt + completions for the suspect verdicts. Confirm the injection.
  4. Patch the arbitrator prompt; deploy with a verdict-replay test that includes the injection vector.
  5. Reverse affected verdicts via the appeal pipeline; refund manually if needed.

**Recovery posture.** A verdict-replay corpus is maintained in crates/plaza-arbitrator/tests/corpus/. New injection vectors are added to the corpus and to the regression suite.

## Alert map

The full alert set lives in infra/observability/alerts.yml. The map below ties severity to alert name; both are stable identifiers.

| Alert | Default severity | Drill |
| --- | --- | --- |
| LedgerSumZeroViolation | SEV-0 | Hot wallet drain, DB corruption |
| ReconciliationDriftHigh | SEV-0 | Hot wallet drain, RPC outage |
| HotWalletAboveCap | SEV-1 | Sweep cadence audit |
| PayoutFailureRateHigh | SEV-1 | RPC outage |
| RpcErrorRateHigh | SEV-1 | RPC outage |
| SearchP99Breached | SEV-2 | Capacity drill |
| OrderPlacementP99Breached | SEV-2 | Capacity drill |
| MessageDeliveryP99Breached | SEV-2 | NATS down |
| ReputationLookupP99Breached | SEV-2 | Capacity drill |
| ArbitrationP50Breached | SEV-1 | Dispute floods |
| PayoutP50Breached | SEV-1 | RPC outage |
| PayoutP99Breached | SEV-1 | RPC outage |
| ApiErrorRateHigh | SEV-1 | General reliability |
| WebhookDeadLetterRateHigh | SEV-2 | NATS down |
| NatsConsumerLag | SEV-2 | NATS down |
| OutboxDepthHigh | SEV-2 | NATS down |
| PostgresReplicationLagHigh | SEV-2 | DB capacity |
| DiskFreeLow | SEV-2 | Capacity |
| DiskFreeCritical | SEV-1 | Capacity |
| CertificateExpiring | SEV-2 | Cert renewal |
| CertificateExpiringCritical | SEV-1 | Cert renewal |
| BackupStale | SEV-1 | Backup verify |
## Postmortem template

```markdown
# Incident YYYY-MM-DD: <one-line summary>

## Impact
Who was affected, for how long, what they saw.

## Timeline
HH:MM UTC — event.
HH:MM UTC — event.

## Root cause
What broke and why.

## Detection
How we found out. Time-to-detect.

## Mitigation
What we did. Time-to-mitigate.

## Resolution
What put it back to fully healthy. Time-to-resolve.

## What went well

## What went badly

## Action items
- [ ] Owner — Action — Due date
```

## Communications

The status page is the source of truth for customers. Templates per state are in docs/operations/status-templates/. Use them; do not write fresh copy under pressure.

For pilot orgs we also send a direct message via the agreed Slack Connect channel for SEV-0 and SEV-1. The status page entry takes precedence on detail; the direct message is a heads-up.

For SEV-0 specifically, the public postmortem is published within 14 days. We name what broke. We do not blame individuals.