Operations runbook
Internal runbook for Plaza incidents. Audience: Plaza on-call. Public visibility: this document is not customer-facing; it lives in the docs site so staff can reach it from any console.
This is a living document. Each new incident type adds a section.
Severity levels
| Level | Definition | Page |
|---|---|---|
| SEV-0 | Customer money at risk; or production down for everyone | All hands. CEO + on-call. |
| SEV-1 | Subset of customers affected; or core flow degraded | On-call. CEO informed. |
| SEV-2 | Background workload affected; SLO breach without customer-visible failure | On-call investigates next business day at the latest. |
| SEV-3 | Cosmetic; logged for the next sprint | Ticket only. |
Default to escalating one level if unsure.
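The "escalate one level if unsure" rule can be written down mechanically. The following is a minimal sketch, assuming severities are ordered from SEV-0 (most severe) to SEV-3; the `Severity` enum and the `effective_severity` helper are hypothetical names for illustration, not part of Plaza's tooling.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Lower number means more severe, matching the table above."""
    SEV0 = 0
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3


def effective_severity(assessed: Severity, unsure: bool) -> Severity:
    """Return the severity to act on; escalate one level when unsure."""
    if unsure and assessed is not Severity.SEV0:
        return Severity(assessed - 1)
    return assessed
```

For example, an incident assessed as SEV-2 by an on-call who is unsure would be handled as SEV-1; SEV-0 has nowhere further to escalate.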
Reconciliation drift (custodied mode)
Symptom. Page from the reconciliation worker. The ledger sum of `escrow_holds` in `held` state does not match the on-chain hot-wallet balance within tolerance.
Severity. SEV-0 if the drift is greater than the operational buffer or unexplained. SEV-1 if drift is bounded and explained by an in-flight transaction.
Triage.
- Read the most recent reconciliation log entries: `journalctl -u plaza-api -g reconciliation | tail -200`.
- Confirm the magnitude. The page includes both numbers; verify against the live RPC.
- List in-flight transactions on the hot wallet on Basescan. Pending sweeps and pending payouts account for some drift; outright unaccounted drift is the alarm.
- Confirm the ledger has not been corrupted: `SELECT account_id, SUM(amount * direction) FROM ledger_entries GROUP BY account_id HAVING SUM(amount * direction) <> 0;`. An empty result means the ledger sums clean.
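The severity rule for drift can be sketched as a small classifier. This is an illustration under stated assumptions, not the reconciliation worker's logic: every parameter name is hypothetical, and "explained" is assumed to mean that the in-flight amount covers the drift to within tolerance.

```python
from decimal import Decimal


def classify_drift(ledger_held: Decimal, onchain_balance: Decimal,
                   in_flight: Decimal, tolerance: Decimal,
                   operational_buffer: Decimal) -> str:
    """Classify reconciliation drift per the Severity rule above.

    `in_flight` is the net amount of pending sweeps and payouts seen on
    the hot wallet (an illustrative input, gathered during triage).
    """
    drift = abs(ledger_held - onchain_balance)
    if drift <= tolerance:
        return "ok"
    # Assumption: drift is "explained" when in-flight transactions
    # account for it to within tolerance.
    explained = abs(drift - abs(in_flight)) <= tolerance
    if drift > operational_buffer or not explained:
        return "SEV-0"
    return "SEV-1"
```

A drift of 100 USDC with a matching 100 USDC pending payout and a 500 USDC buffer classifies as SEV-1; the same drift with nothing in flight classifies as SEV-0.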
Mitigation.
- If drift is explained by a pending transaction: wait one block, re-check.
- If drift is unexplained:
  - Pause new orders in custodied mode. Set `PLAZA_CUSTODIED_MODE_DISABLED=1` and reload `plaza-api`. New orders fall through to contract mode.
  - Page the CEO. Decision authority on customer communications.
  - Pull the audit log for the last 24 hours. Look for unauthorized signer use.
  - Rotate the MPC signer. New signing key, sweep all hot-wallet balance to cold, resume only after the discrepancy is reconciled.
Postmortem. Required for SEV-0. Public postmortem if the drift was real.
RPC outage
Symptom. Payout worker errors. Funding submissions fail. Reconciliation cannot complete.
Severity. SEV-1.
Triage.
- Confirm the outage is upstream and not local: `curl https://api.developer.coinbase.com/rpc/v1/base/...` from the box.
- Check the Base status page.
- Check the secondary RPC provider configured in `infra/rpc.toml`.
Mitigation.
- The configuration supports a list of RPC endpoints. The worker fails over automatically. If failover is not happening, restart the worker.
- If both providers are out: pause auto-acceptance jobs by setting `PLAZA_AUTO_ACCEPT_DISABLED=1`. Funding remains accepted at the API layer; the funding submission queue absorbs the outage.
- Customer communication: post to `status.plaza.aegent.dev` if the outage exceeds 5 minutes.
Recovery. When RPC returns, the queue drains. Validate that funding submissions and payouts process before clearing PLAZA_AUTO_ACCEPT_DISABLED.
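Failover across a list of RPC endpoints can be sketched as follows. This shows the general pattern only and makes no claim about how the Plaza worker implements it; `call_with_failover` and its `request` callback are hypothetical names.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Sequence[str],
                       request: Callable[[str], T]) -> T:
    """Try each endpoint in order; return the first successful result."""
    errors = []
    for url in endpoints:
        try:
            return request(url)
        except Exception as exc:  # sketch only: real code should catch narrower errors
            errors.append(f"{url}: {exc}")
    # Every endpoint failed: surface all errors so triage can see each one.
    raise RuntimeError("all RPC endpoints failed: " + "; ".join(errors))
```

The final `RuntimeError` corresponds to the "both providers are out" branch above, where the mitigation is to pause auto-acceptance rather than keep retrying.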
Hot-wallet compromise (suspected)
Symptom. Unauthorized signing event in the MPC audit log, or an unexpected outflow from the hot wallet.
Severity. SEV-0.
Triage.
- Pause the API: `systemctl stop plaza-api`. New orders cannot be placed; in-flight webhook deliveries fail and fall back on their retry logic.
- Pause the contract if contract mode is in use: call `PlazaEscrow.pause()` from the cold multisig.
- Get a balance snapshot: `cast balance <hot_wallet_address> --rpc-url <url>` and `cast call <usdc> "balanceOf(address)" <hot_wallet> --rpc-url <url>`.
Mitigation.
- Sweep remaining hot-wallet balance to cold via a fresh signer. The cold multisig is offline; coordinate with key holders.
- Rotate the MPC signing key. The new key is set in `infra/secrets/mpc.env`; restart the API.
- Notify customers per the incident-communication plan in `docs/legal/`.
- File a report under whatever jurisdictional reporting framework applies. Counsel directs.
Recovery. Resume operations only after a full ledger reconciliation against on-chain state and a documented forensic analysis of the compromise vector.
Resolver-key compromise (contract mode)
Symptom. Unauthorized release calls on the escrow contract.
Severity. SEV-0.
Triage.
- Pause the contract: `PlazaEscrow.pause()` from the cold multisig.
- Inventory the affected orders: query the contract for recent `release` events; cross-reference against Plaza’s expected releases.
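The cross-reference step amounts to a set difference. This is a minimal sketch, assuming both sides can be reduced to comparable order identifiers; the function name and inputs are hypothetical, and fetching the on-chain events is out of scope here.

```python
def unexpected_releases(onchain_release_ids, expected_release_ids):
    """Return order IDs released on-chain that Plaza did not expect, sorted."""
    return sorted(set(onchain_release_ids) - set(expected_release_ids))
```

Any ID this returns is a candidate unauthorized release and belongs in the affected-orders inventory.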
Mitigation.
- Rotate the resolver: `PlazaEscrow.setResolver(new_address)` from the cold multisig.
- Resume by unpausing only after the new resolver is in place and verified.
- Customer impact is bounded to in-flight contract escrow at the moment of breach. Make whole from Plaza reserves where applicable.
Postgres LISTEN/NOTIFY saturation
Symptom. The message broker logs that the notification queue is full, or messages stop reaching recipients.
Severity. SEV-1.
Triage.
- Run `SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_start LIMIT 20;` to find a stuck transaction.
- Check Redis for stuck cache locks.
- Confirm the broker process is running: `systemctl status plaza-api`.
Mitigation.
- Restart `plaza-api`. The broker drops live connections; clients reconnect over WebSocket with exponential backoff.
- If saturation persists, schedule the migration to a dedicated broker process. The architecture allows splitting the broker out without code changes: deploy `plaza-api --role broker-only` on a separate host.
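The reconnect behaviour after a broker restart follows the usual exponential-backoff shape. The sketch below illustrates the schedule only; the base delay, the cap, and the function name are assumptions rather than the clients' actual settings. Jitter, which is commonly added so that clients do not reconnect in lockstep, is left out to keep the example deterministic.

```python
def backoff_delays(base: float, cap: float, attempts: int) -> list[float]:
    """Delays in seconds before each reconnect attempt: base * 2**n, capped."""
    return [min(cap, base * (2 ** n)) for n in range(attempts)]
```

With a 0.5 s base and an 8 s cap, six attempts wait 0.5, 1, 2, 4, 8, and 8 seconds.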
Disputed receipt without a verdict
Symptom. Dispute open longer than 5 minutes, no verdict written.
Severity. SEV-2; SEV-1 if the arbitration latency p99 budget is breached.
Triage.
- Check the screener log: was the dispute flagged for human pre-review? If so, acknowledge in the operator console.
- Check the LLM provider status.
- Check `pgmq` for stuck `arbitration.run` jobs.
Mitigation.
- Manual run via `plaza-tools arbitrate <dispute_urn>`.
- If the LLM provider is degraded, fail over to the secondary provider. The arbitrator config supports primary/secondary endpoints.
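The symptom condition (open longer than 5 minutes, no verdict) can be expressed as a filter. This is a sketch with a hypothetical `Dispute` record; the field names and the URN strings in the usage note are illustrative and do not reflect Plaza's schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Dispute:
    urn: str
    opened_at: datetime
    verdict_at: Optional[datetime] = None  # None until a verdict is written


def stuck_disputes(disputes, now, max_open=timedelta(minutes=5)):
    """Return URNs of disputes open longer than max_open with no verdict."""
    return [d.urn for d in disputes
            if d.verdict_at is None and now - d.opened_at > max_open]
```

Each URN returned is a candidate for the manual `plaza-tools arbitrate` run listed under Mitigation.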
Escrow contract finding
Symptom. A bug bounty submission, audit finding, or unexpected behavior from the escrow contract.
Severity. SEV-0 if exploitable; SEV-1 if theoretical.
Mitigation.
- Pause the contract immediately if exploitable.
- Fund the bounty per the program rules.
- Plan v2 deployment per the migration pattern: deploy v2, route new orders to v2, drain v1 by letting v1 orders resolve. Do not migrate state from v1 to v2 — funds in v1 stay in v1 until released.
- Communicate to customers: the existing v1 orders will complete; new orders use v2.
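The drain-based migration pattern reduces to two rules: an order never changes contract, and v1 is finished once it holds no open orders. The sketch below uses hypothetical helper names and version labels to illustrate that routing; it is not the order-placement code.

```python
from typing import Optional


def escrow_contract_for(existing_version: Optional[str]) -> str:
    """Existing orders stay on their contract; new orders (None) go to v2."""
    return existing_version if existing_version is not None else "v2"


def v1_drained(open_v1_orders: int) -> bool:
    """v1 can be retired only once every v1 order has resolved."""
    return open_v1_orders == 0
```

No state moves between contracts in this model, which matches the rule that funds in v1 stay in v1 until released.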
Backup restore
Cadence. Quarterly disaster-recovery rehearsal.
Procedure.
- Spin up a fresh Hetzner CCX23 from the standard image.
- Pull the latest `pg_dump` from R2: `aws s3 cp s3://plaza-backups/pg/<date>.sql.age .`
- Decrypt with the operations age key.
- Restore: `psql plaza < <date>.sql`.
- Migrate forward: the API binary applies pending migrations on startup if `PLAZA_AUTO_MIGRATE=1`.
- Smoke-test: place a sandbox order against the restored DB; verify the read path, the funding path, the message delivery path, and the reputation lookup.
- Tear down the test box.
Document the wall-clock time. RTO target: 4 hours. RPO target: 15 minutes. Snapshots are currently nightly, so snapshots alone do not yet meet that target; RPO improves as intra-day snapshot frequency increases once volume justifies it.
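The pull, decrypt, and restore steps of the procedure can be laid out as an ordered command plan, which is convenient for a dry run before the rehearsal. This is a sketch: the `restore_commands` helper and the key path argument are hypothetical, the `age` invocation assumes the standard `--decrypt`, `-i`, and `-o` flags, and `psql -f` stands in for the shell redirect shown in the procedure.

```python
def restore_commands(date: str, age_key_path: str, database: str = "plaza"):
    """Return the argv lists for pull, decrypt, and restore, in order."""
    encrypted = f"{date}.sql.age"
    dump = f"{date}.sql"
    return [
        # Pull the encrypted dump from the backup bucket.
        ["aws", "s3", "cp", f"s3://plaza-backups/pg/{encrypted}", "."],
        # Decrypt with the operations age key (path is an assumption).
        ["age", "--decrypt", "-i", age_key_path, "-o", dump, encrypted],
        # Restore into the target database.
        ["psql", "-d", database, "-f", dump],
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` in sequence, timing the whole run against the 4-hour RTO target.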
Page-on-call playbook
When you get paged:
- Acknowledge the page within 5 minutes.
- Confirm severity based on the symptom.
- Open an incident channel.
- Run triage from the relevant section above.
- Mitigate.
- Post status updates to the incident channel every 15 minutes until resolved.
- After the incident: write a postmortem within 5 business days. SEV-0 postmortems are public. SEV-1 are public if customer-visible; internal otherwise.
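The postmortem deadline and visibility rules can be sketched as two helpers. Both names are hypothetical. The deadline calculation skips weekends only and does not model public holidays, and treating SEV-2 and SEV-3 as internal is an assumption, since the playbook states a rule only for SEV-0 and SEV-1.

```python
from datetime import date, timedelta


def postmortem_due(incident_end: date, business_days: int = 5) -> date:
    """Return the date `business_days` working days after incident_end."""
    day = incident_end
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            remaining -= 1
    return day


def postmortem_is_public(severity: str, customer_visible: bool) -> bool:
    """SEV-0 is always public; SEV-1 only when customer-visible."""
    if severity == "SEV-0":
        return True
    if severity == "SEV-1":
        return customer_visible
    return False  # assumption: lower severities stay internal
```

An incident that ends on a Friday is therefore due the following Friday.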
Alert reference
The Prometheus rules at `infra/observability/alerts.yml` define every alert that pages on-call. Each alert below names the symptom, the first thing to check, and who is paged. Routing follows the Alertmanager configuration: critical pages PagerDuty (on-call), warning posts to #plaza-alerts (on-call follows up), info posts to #plaza-status (no page).
SLO alerts
| Alert | Severity | Means | Check first | Pages |
|---|---|---|---|---|
| `SearchP99Breached` | warning | `/v1/search` p99 > 200 ms for 10 min | Postgres FTS query plans; index bloat; load on the box | On-call |
| `OrderPlacementP99Breached` | warning | `POST /v1/orders` p99 > 500 ms for 10 min | DB locks; outbox depth; Redis latency on idempotency cache | On-call |
| `MessageDeliveryP99Breached` | warning | message-broker p99 > 100 ms for 10 min | NOTIFY queue saturation; broker process health | On-call |
| `ReputationLookupP99Breached` | warning | reputation lookup p99 > 50 ms for 10 min | `reputation_index` table cardinality; index health | On-call |
| `ArbitrationP50Breached` | warning | arbitration p50 > 90 s for 30 min | LLM provider status; `pgmq` `arbitration.run` queue depth; screener flagging rate | On-call |
| `PayoutP50Breached` | warning | on-chain payout p50 > 30 s for 15 min | Base RPC latency; gas-price oracle; signer queue | On-call |
| `PayoutP99Breached` | warning | on-chain payout p99 > 5 min for 15 min | Same; plus stuck transactions on the wallet | On-call |
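Each SLO alert pairs a threshold with a duration ("p99 > 200 ms for 10 min"), so a single slow sample does not fire it. The sketch below illustrates that sustained-breach condition, similar in spirit to the `for:` clause of a Prometheus alerting rule. The function name and the sample format are assumptions; the real evaluation happens inside Prometheus.

```python
def breached_for(samples, threshold, duration_s):
    """samples: list of (timestamp_s, value) pairs sorted by time.

    True when the most recent unbroken run of samples above `threshold`
    spans at least `duration_s` seconds.
    """
    if not samples:
        return False
    latest_t = samples[-1][0]
    breach_start = None
    # Walk backwards until the first sample at or below the threshold.
    for t, value in reversed(samples):
        if value <= threshold:
            break
        breach_start = t
    return breach_start is not None and latest_t - breach_start >= duration_s
```

A dip below the threshold resets the clock: the breach must then persist for the full duration again before the condition holds.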
Money-integrity alerts
| Alert | Severity | Means | Check first | Pages |
|---|---|---|---|---|
| `LedgerSumZeroViolation` | critical | A ledger transaction failed sum-zero | Halt releases; run the §“Reconciliation drift” runbook | On-call + CEO |
| `ReconciliationDriftHigh` | critical | Sum of `escrow_holds` in `held` differs from on-chain hot-wallet balance by > 1 USDC | Run the §“Reconciliation drift” runbook | On-call + CEO |
| `HotWalletAboveCap` | warning | Hot wallet balance > 50,000 USDC | Sweep cadence; mode-default threshold (`PLAZA_MODE_THRESHOLD_USD`) | On-call |
| `PayoutFailureRateHigh` | critical | > 5% payout failure rate over 15 min | Base RPC errors; signer cap saturation; nonce conflicts | On-call + CEO |
| `RpcErrorRateHigh` | warning | > 1 Base RPC error/sec for 10 min | Run the §“RPC outage” runbook | On-call |
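The sum-zero invariant behind `LedgerSumZeroViolation` says that the signed entries of each ledger transaction must cancel out. The sketch below checks this in memory, assuming each entry carries a transaction identifier. The `tx_id` field is hypothetical: the runbook itself only shows `account_id`, `amount`, and `direction`.

```python
from collections import defaultdict
from decimal import Decimal


def sum_zero_violations(entries):
    """entries: iterable of (tx_id, amount, direction), direction is +1 or -1.

    Returns the tx_ids whose signed amounts do not sum to zero, sorted.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at 0
    for tx_id, amount, direction in entries:
        totals[tx_id] += Decimal(amount) * direction
    return sorted(tx for tx, total in totals.items() if total != 0)
```

Amounts are parsed as `Decimal` rather than `float` so that cent-level imbalances are not hidden by rounding.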
Reliability alerts
| Alert | Severity | Means | Check first | Pages |
|---|---|---|---|---|
| `ApiErrorRateHigh` | warning | > 1% 5xx rate per route over 10 min | Per-route logs in `journalctl -u plaza-api`; recent deploys | On-call |
| `WebhookDeadLetterRateHigh` | warning | > 0.5/sec dead-lettered deliveries over 15 min | Subscription URL health; delivery worker logs; affected subscribers | On-call |
| `NatsConsumerLag` | warning | NATS consumer pending > 10,000 for 10 min | Consumer health; downstream processor (search, webhook fan-out) | On-call |
| `OutboxDepthHigh` | warning | > 5,000 unshipped outbox rows for 10 min | Drainer process health; NATS connectivity | On-call |
| `PostgresReplicationLagHigh` | warning | Replication lag > 30 s for 10 min | Replica disk IO; network between primary and replica | On-call |
| `DiskFreeLow` | warning | < 10% disk free for 10 min | WAL bloat; old logs; backup tarballs | On-call |
| `DiskFreeCritical` | critical | < 5% disk free for 5 min | Same; plus emergency cleanup of `/var/lib/postgresql/wal`; ensure backup is current before truncation | On-call + CEO |
| `CertificateExpiring` | warning | TLS cert expires in < 14 days | ACME renewal logs in Caddy | On-call |
| `CertificateExpiringCritical` | critical | TLS cert expires in < 3 days | Same; force renewal | On-call + CEO |
| `BackupStale` | critical | Nightly Postgres backup last success > 36 h | Backup script logs; R2 credentials; recipient age key | On-call + CEO |
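Several reliability alerts come in warning/critical pairs with tiered thresholds. The sketch below illustrates that tiering for the certificate and disk alerts, using the thresholds from the table. The function names are hypothetical, and the "for N min" durations are left out for brevity.

```python
from typing import Optional


def cert_alert(days_until_expiry: float) -> Optional[str]:
    """Map remaining certificate lifetime to an alert severity, if any."""
    if days_until_expiry < 3:
        return "critical"
    if days_until_expiry < 14:
        return "warning"
    return None


def disk_alert(free_fraction: float) -> Optional[str]:
    """Map the free-disk fraction (0.0 to 1.0) to an alert severity, if any."""
    if free_fraction < 0.05:
        return "critical"
    if free_fraction < 0.10:
        return "warning"
    return None
```

The critical tier is checked first so that a value below both thresholds reports the more severe level.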
Contacts
Internal. Names and contacts are kept in `infra/oncall/contacts.yml`, not in the docs site.