Operations runbook

Internal runbook for Plaza incidents. Audience: Plaza on-call. Public visibility: this document is not customer-facing; it lives in the docs site so staff can reach it from any console.

This is a living document. Each new incident type adds a section.

Level | Definition | Page
----- | ---------- | ----
SEV-0 | Customer money at risk; or production down for everyone | All hands. CEO + on-call.
SEV-1 | Subset of customers affected; or core flow degraded | On-call. CEO informed.
SEV-2 | Background workload affected; SLO breach without customer-visible failure | On-call investigates next business day at the latest.
SEV-3 | Cosmetic; logged for the next sprint | Ticket only.

Default to escalating one level if unsure.
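
The escalation rule can be sketched in a few lines. This is a minimal illustration, not part of any Plaza tooling: the `Severity` enum and `escalate_if_unsure` are hypothetical names, and the only behavior taken from the runbook is "move one level up when unsure".

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means more severe, matching the SEV-N labels above.
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def escalate_if_unsure(estimate: Severity, unsure: bool) -> Severity:
    """Return the severity to declare, bumping one level when unsure."""
    if not unsure:
        return estimate
    # SEV-0 is already the top level, so it cannot escalate further.
    return Severity(max(estimate - 1, Severity.SEV0))
```

For example, an on-call who hesitates between SEV-2 and SEV-1 declares SEV-1.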

Reconciliation drift

Symptom. Page from the reconciliation worker. The ledger sum of escrow_holds in held state does not match the on-chain hot-wallet balance within tolerance.

Severity. SEV-0 if the drift is greater than the operational buffer or unexplained. SEV-1 if drift is bounded and explained by an in-flight transaction.

Triage.

  1. Read the most recent reconciliation log entries: journalctl -u plaza-api -g reconciliation | tail -n 200.
  2. Confirm the magnitude. The page includes both numbers; verify against the live RPC.
  3. List in-flight transactions on the hot wallet on Basescan. Pending sweeps and pending payouts account for some drift; outright unaccounted drift is the alarm.
  4. Confirm the ledger has not been corrupted: SELECT account_id, SUM(amount * direction) FROM ledger_entries GROUP BY account_id HAVING SUM(amount * direction) <> 0; An empty result means the ledger sums clean.
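
The severity rule and triage steps 2 and 3 can be combined into one check. The sketch below is illustrative only: `classify_drift`, its parameters, and the sign convention for `net_in_flight` are assumptions, and the operational buffer is passed in because this runbook does not state its value.

```python
from decimal import Decimal
from typing import Optional


def classify_drift(
    ledger_held: Decimal,
    onchain_balance: Decimal,
    net_in_flight: Decimal,
    tolerance: Decimal,
    operational_buffer: Decimal,
) -> Optional[str]:
    """Return "SEV-0", "SEV-1", or None when the balances agree.

    net_in_flight is the signed amount by which pending transactions are
    expected to move the on-chain balance relative to the ledger (assumed
    convention: positive when the chain should currently hold more).
    """
    drift = onchain_balance - ledger_held
    if abs(drift) <= tolerance:
        return None  # within tolerance, nothing to page on
    if abs(drift) > operational_buffer:
        return "SEV-0"  # larger than the buffer, regardless of cause
    residual = drift - net_in_flight
    if abs(residual) > tolerance:
        return "SEV-0"  # in-flight transactions do not explain the drift
    return "SEV-1"  # bounded and explained by in-flight transactions
```

With a tolerance of 1 USDC and a hypothetical buffer of 500 USDC, a 100 USDC shortfall matched by a 100 USDC pending payout classifies as SEV-1. The same shortfall with nothing in flight classifies as SEV-0.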

Mitigation.

  • If drift is explained by a pending transaction: wait one block, re-check.
  • If drift is unexplained:
    • Pause new orders in custodied mode. Set PLAZA_CUSTODIED_MODE_DISABLED=1 and reload plaza-api. New orders fall through to contract mode.
    • Page the CEO, who holds decision authority on customer communications.
    • Pull the audit log for the last 24 hours. Look for unauthorized signer use.
    • Rotate the MPC signer: generate a new signing key, sweep the entire hot-wallet balance to cold, and resume only after the discrepancy is reconciled.

Postmortem. Required for SEV-0. Public postmortem if the drift was real.

RPC outage

Symptom. Payout worker errors. Funding submissions fail. Reconciliation cannot complete.

Severity. SEV-1.

Triage.

  1. Confirm the outage is upstream and not local: curl https://api.developer.coinbase.com/rpc/v1/base/... from the box.
  2. Check Base status page.
  3. Check the secondary RPC provider configured in infra/rpc.toml.

Mitigation.

  • The configuration supports a list of RPC endpoints. The worker fails over automatically. If failover is not happening, restart the worker.
  • If both providers are out: pause auto-acceptance jobs by setting PLAZA_AUTO_ACCEPT_DISABLED=1. Funding remains accepted at the API layer; the funding submission queue absorbs the outage.
  • Customer communication: post to status.plaza.aegent.dev if the outage exceeds 5 minutes.
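
The automatic failover described in the first bullet can be pictured as an ordered walk over the configured endpoints. This is a minimal sketch, assuming the worker tries endpoints in list order; `call_with_failover` and the `request` callable are hypothetical and do not reflect the worker's actual code or the layout of infra/rpc.toml.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result."""
    if not endpoints:
        raise ValueError("no RPC endpoints configured")
    errors = []
    for url in endpoints:
        try:
            return request(url)
        except Exception as exc:  # a real worker would catch narrower errors
            errors.append(f"{url}: {exc}")
    # Every provider failed: this is the "both providers are out" case.
    raise RuntimeError("all RPC endpoints failed: " + "; ".join(errors))
```

The final `RuntimeError` corresponds to the point at which the second bullet applies and auto-acceptance should be paused.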

Recovery. When RPC returns, the queue drains. Validate that funding submissions and payouts process before clearing PLAZA_AUTO_ACCEPT_DISABLED.

Signer compromise

Symptom. An unauthorized signing event appears in the MPC audit log, or there is an unexpected outflow from the hot wallet.

Severity. SEV-0.

Triage.

  1. Pause the API: systemctl stop plaza-api. New orders cannot be placed; in-flight webhook deliveries fail and fall back to their retry logic.
  2. Pause the contract if contract mode is in use: call PlazaEscrow.pause() from the cold multisig.
  3. Get a balance snapshot: cast balance <hot_wallet_address> --rpc-url <url> for the native balance and cast call <usdc> "balanceOf(address)(uint256)" <hot_wallet_address> --rpc-url <url> for the USDC balance.

Mitigation.

  • Sweep remaining hot-wallet balance to cold via a fresh signer. The cold multisig is offline; coordinate with key holders.
  • Rotate the MPC signing key. The new key is set in infra/secrets/mpc.env; restart the API.
  • Notify customers per the incident-communication plan in docs/legal/.
  • File a report under whatever jurisdictional reporting framework applies. Counsel directs.

Recovery. Resume operations only after a full ledger reconciliation against on-chain state and a documented forensic analysis of the compromise vector.

Resolver compromise

Symptom. Unauthorized release calls on the escrow contract.

Severity. SEV-0.

Triage.

  1. Pause the contract: PlazaEscrow.pause() from the cold multisig.
  2. Inventory the affected orders: query the contract for recent release events; cross-reference against Plaza’s expected releases.
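
The cross-reference in step 2 amounts to a set difference between what the contract emitted and what Plaza expected. A minimal sketch, assuming release events can be keyed by an order identifier; `ReleaseEvent`, its fields, and `unexpected_releases` are hypothetical names, not the contract's actual event schema.

```python
from typing import Iterable, List, NamedTuple, Set


class ReleaseEvent(NamedTuple):
    order_id: str  # hypothetical identifier carried by the release event
    tx_hash: str


def unexpected_releases(
    onchain: Iterable[ReleaseEvent], expected_order_ids: Set[str]
) -> List[ReleaseEvent]:
    """Return on-chain release events that Plaza did not expect."""
    return [event for event in onchain if event.order_id not in expected_order_ids]
```

The returned list is the inventory of affected orders that the mitigation steps act on.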

Mitigation.

  • Rotate the resolver: PlazaEscrow.setResolver(new_address) from the cold multisig.
  • Resume by unpausing only after the new resolver is in place and verified.
  • Customer impact is bounded to in-flight contract escrow at the moment of breach. Make whole from Plaza reserves where applicable.

Message broker saturation

Symptom. The message broker logs that the notification queue is full, or messages stop reaching recipients.

Severity. SEV-1.

Triage.

  1. SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_start LIMIT 20; to find a stuck transaction.
  2. Check Redis for stuck cache locks.
  3. Confirm the broker process is running: systemctl status plaza-api.

Mitigation.

  • Restart plaza-api. The broker drops live connections; clients reconnect over WebSocket with exponential backoff.
  • If saturation persists, schedule the migration to a dedicated broker process. The architecture allows splitting the broker out without code changes: deploy plaza-api --role broker-only on a separate host.
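
To estimate how long clients take to return after a restart, it helps to see what an exponential-backoff schedule looks like. The sketch below is generic: the base delay, growth factor, cap, and jitter are assumptions, not the values Plaza clients actually use.

```python
import random
from typing import Iterator, Optional


def backoff_delays(
    base: float = 0.5,
    factor: float = 2.0,
    cap: float = 30.0,
    attempts: int = 8,
    rng: Optional[random.Random] = None,
) -> Iterator[float]:
    """Yield reconnect delays in seconds: base, base*factor, ... up to cap.

    When rng is given, each delay is drawn uniformly from [0, delay]
    ("full jitter") so that clients do not reconnect in lockstep.
    """
    delay = base
    for _ in range(attempts):
        bounded = min(delay, cap)
        yield rng.uniform(0.0, bounded) if rng is not None else bounded
        delay *= factor
```

Without jitter, every client dropped by the restart retries at the same instants, which can re-saturate the broker. Passing an `rng` spreads the retries out.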

Arbitration stall

Symptom. Dispute open longer than 5 minutes, no verdict written.

Severity. SEV-2, unless the arbitration p99 latency budget is breached, in which case SEV-1.

Triage.

  1. Check the screener log: was the dispute flagged for human pre-review? If so, acknowledge in the operator console.
  2. Check the LLM provider status.
  3. Check pgmq for stuck arbitration.run jobs.

Mitigation.

  • Manual run via plaza-tools arbitrate <dispute_urn>.
  • If the LLM provider is degraded, fail over to the secondary provider. The arbitrator config supports primary/secondary endpoints.

Escrow contract vulnerability

Symptom. A bug bounty submission, audit finding, or unexpected behavior from the escrow contract.

Severity. SEV-0 if exploitable; SEV-1 if theoretical.

Mitigation.

  • Pause the contract immediately if exploitable.
  • Fund the bounty per the program rules.
  • Plan v2 deployment per the migration pattern: deploy v2, route new orders to v2, drain v1 by letting v1 orders resolve. Do not migrate state from v1 to v2; funds in v1 stay in v1 until released.
  • Communicate to customers: the existing v1 orders will complete; new orders use v2.
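
The drain pattern above reduces to two rules: an order never changes contract once escrowed, and v1 is finished when nothing unresolved remains on it. A minimal sketch under those assumptions; the `Order` record and both function names are hypothetical and not taken from the Plaza codebase.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Order:
    order_id: str
    contract_version: Optional[str]  # None until the order is first escrowed
    resolved: bool = False


def contract_for(order: Order) -> str:
    """New orders go to v2; orders already escrowed stay where they are."""
    return order.contract_version or "v2"


def v1_drained(orders: Iterable[Order]) -> bool:
    """True once no unresolved order still holds funds in v1."""
    return not any(o.contract_version == "v1" and not o.resolved for o in orders)
```

`v1_drained` is the condition to check before considering v1 retired; until it holds, v1 must stay callable so existing orders can release.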

Disaster-recovery rehearsal

Cadence. Quarterly disaster-recovery rehearsal.

Procedure.

  1. Spin up a fresh Hetzner CCX23 from the standard image.
  2. Pull the latest pg_dump from R2: aws s3 cp s3://plaza-backups/pg/<date>.sql.age .
  3. Decrypt with the operations age key.
  4. Restore: psql plaza < <date>.sql.
  5. Migrate forward: the API binary applies pending migrations on startup if PLAZA_AUTO_MIGRATE=1.
  6. Smoke-test: place a sandbox order against the restored DB; verify the read path, the funding path, the message delivery path, and the reputation lookup.
  7. Tear down the test box.

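Document the wall-clock time. RTO target: 4 hours. RPO target: 15 minutes. Snapshots are currently nightly, so a restore from snapshot alone can lose up to a day of writes; the achieved RPO approaches the target only as intra-day snapshot frequency increases, once volume justifies it.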
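
The gap between the RPO target and a nightly schedule is easy to check with arithmetic. The sketch below assumes snapshots are the only recovery source (no WAL archiving or replica promotion); `worst_case_rpo` and `rehearsal_report` are hypothetical helpers for recording rehearsal results.

```python
from datetime import timedelta

RTO_TARGET = timedelta(hours=4)
RPO_TARGET = timedelta(minutes=15)


def worst_case_rpo(snapshot_interval: timedelta) -> timedelta:
    """With periodic snapshots alone, a failure just before the next
    snapshot loses up to one full interval of writes."""
    return snapshot_interval


def rehearsal_report(restore_wall_clock: timedelta, snapshot_interval: timedelta) -> dict:
    """Compare one rehearsal's measurements against both targets."""
    return {
        "rto_met": restore_wall_clock <= RTO_TARGET,
        "rpo_met": worst_case_rpo(snapshot_interval) <= RPO_TARGET,
    }
```

A three-hour restore from a nightly snapshot meets the RTO but not the RPO. Under this assumption, the RPO is met only when the snapshot interval drops to 15 minutes or less.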

When you get paged:

  1. Acknowledge the page within 5 minutes.
  2. Confirm severity based on the symptom.
  3. Open an incident channel.
  4. Run triage from the relevant section above.
  5. Mitigate.
  6. Post status updates to the incident channel every 15 minutes until resolved.
  7. After the incident: write a postmortem within 5 business days. SEV-0 postmortems are public. SEV-1 postmortems are public if the incident was customer-visible and internal otherwise.
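
The postmortem deadline in step 7 counts business days, which is easy to miscount across a weekend. A minimal sketch, assuming Monday to Friday are business days and ignoring public holidays; `postmortem_due` is a hypothetical helper.

```python
from datetime import date, timedelta


def postmortem_due(incident_closed: date, business_days: int = 5) -> date:
    """Return the date that falls business_days working days after closure.

    Weekends are skipped; public holidays are not modeled.
    """
    current = incident_closed
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

An incident closed on a Friday is therefore due the following Friday, not the following Wednesday.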

The Prometheus rules at infra/observability/alerts.yml define every alert that pages on-call. Each alert below names the symptom, the first thing to check, and who is paged. Routing follows the Alertmanager configuration: critical pages PagerDuty (on-call), warning posts to #plaza-alerts (on-call follows up), info posts to #plaza-status (no page).
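
The routing rule can be read as a three-entry lookup. The sketch below restates it in code for clarity only; the real routing lives in the Alertmanager configuration, and `Route`, `ROUTES`, and `route_for` are hypothetical names. Rejecting unknown severities is an assumption, not documented behavior.

```python
from typing import NamedTuple


class Route(NamedTuple):
    destination: str
    pages_on_call: bool


ROUTES = {
    "critical": Route("PagerDuty", True),
    "warning": Route("#plaza-alerts", False),
    "info": Route("#plaza-status", False),
}


def route_for(severity: str) -> Route:
    """Look up where an alert of the given severity is delivered."""
    try:
        return ROUTES[severity]
    except KeyError:
        raise ValueError(f"unknown alert severity: {severity!r}") from None
```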

Alert | Severity | Means | Check first | Pages
----- | -------- | ----- | ----------- | -----
SearchP99Breached | warning | /v1/search p99 > 200 ms for 10 min | Postgres FTS query plans; index bloat; load on the box | On-call
OrderPlacementP99Breached | warning | POST /v1/orders p99 > 500 ms for 10 min | DB locks; outbox depth; Redis latency on idempotency cache | On-call
MessageDeliveryP99Breached | warning | message-broker p99 > 100 ms for 10 min | NOTIFY queue saturation; broker process health | On-call
ReputationLookupP99Breached | warning | reputation lookup p99 > 50 ms for 10 min | reputation_index table cardinality; index health | On-call
ArbitrationP50Breached | warning | arbitration p50 > 90 s for 30 min | LLM provider status; pgmq arbitration.run queue depth; screener flagging rate | On-call
PayoutP50Breached | warning | on-chain payout p50 > 30 s for 15 min | Base RPC latency; gas-price oracle; signer queue | On-call
PayoutP99Breached | warning | on-chain payout p99 > 5 min for 15 min | Same; plus stuck transactions on the wallet | On-call

Alert | Severity | Means | Check first | Pages
----- | -------- | ----- | ----------- | -----
LedgerSumZeroViolation | critical | A ledger transaction failed sum-zero | Halt releases; run the §“Reconciliation drift” runbook | On-call + CEO
ReconciliationDriftHigh | critical | Sum of escrow_holds in held differs from on-chain hot-wallet balance by > 1 USDC | Run the §“Reconciliation drift” runbook | On-call + CEO
HotWalletAboveCap | warning | Hot wallet balance > 50,000 USDC | Sweep cadence; mode-default threshold (PLAZA_MODE_THRESHOLD_USD) | On-call
PayoutFailureRateHigh | critical | > 5% payout failure rate over 15 min | Base RPC errors; signer cap saturation; nonce conflicts | On-call + CEO
RpcErrorRateHigh | warning | > 1 Base RPC error/sec for 10 min | Run the §“RPC outage” runbook | On-call

Alert | Severity | Means | Check first | Pages
----- | -------- | ----- | ----------- | -----
ApiErrorRateHigh | warning | > 1% 5xx rate per route over 10 min | Per-route logs in journalctl -u plaza-api; recent deploys | On-call
WebhookDeadLetterRateHigh | warning | > 0.5/sec dead-lettered deliveries over 15 min | Subscription URL health; delivery worker logs; affected subscribers | On-call
NatsConsumerLag | warning | NATS consumer pending > 10,000 for 10 min | Consumer health; downstream processor (search, webhook fan-out) | On-call
OutboxDepthHigh | warning | > 5,000 unshipped outbox rows for 10 min | Drainer process health; NATS connectivity | On-call
PostgresReplicationLagHigh | warning | Replication lag > 30 s for 10 min | Replica disk IO; network between primary and replica | On-call
DiskFreeLow | warning | < 10% disk free for 10 min | WAL bloat; old logs; backup tarballs | On-call
DiskFreeCritical | critical | < 5% disk free for 5 min | Same; plus emergency cleanup of /var/lib/postgresql/wal; ensure backup is current before truncation | On-call + CEO
CertificateExpiring | warning | TLS cert expires in < 14 days | ACME renewal logs in Caddy | On-call
CertificateExpiringCritical | critical | TLS cert expires in < 3 days | Same; force renewal | On-call + CEO
BackupStale | critical | Nightly Postgres backup last success > 36 h | Backup script logs; R2 credentials; recipient age key | On-call + CEO
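
Most alerts above follow the shape "metric over threshold for N minutes". The sketch below approximates that hold condition, in the spirit of a Prometheus rule's `for` clause: the alert fires only if the breach is continuous for the whole window. `fires` and its sample format are hypothetical, and this is not how Prometheus itself evaluates rules internally.

```python
from typing import Iterable, Optional, Tuple


def fires(
    samples: Iterable[Tuple[float, float]], threshold: float, hold_seconds: float
) -> bool:
    """samples are (timestamp_seconds, value) pairs in time order.

    Returns True if value > threshold held without interruption for at
    least hold_seconds; any sample at or below the threshold resets the timer.
    """
    breach_start: Optional[float] = None
    for ts, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = ts
            if ts - breach_start >= hold_seconds:
                return True
        else:
            breach_start = None
    return False
```

For SearchP99Breached, the threshold is 200 ms and the hold is 600 seconds, so a single dip below 200 ms midway through the window restarts the clock.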

Internal. Names and contacts kept in infra/oncall/contacts.yml, not in the docs site.