Operations runbook

Internal runbook for Plaza incidents. Audience: Plaza on-call. Public visibility: this document is not customer-facing; it lives in the docs site so staff can reach it from any console.

This is a living document. Each new incident type adds a section.

Severity levels

Level	Definition	Page
SEV-0	Customer money at risk; or production down for everyone	All hands. CEO + on-call.
SEV-1	Subset of customers affected; or core flow degraded	On-call. CEO informed.
SEV-2	Background workload affected; SLO breach without customer-visible failure	On-call investigates next business day at the latest.
SEV-3	Cosmetic; logged for the next sprint	Ticket only.

Default to escalating one level if unsure.

Reconciliation drift (custodied mode)

Symptom. Page from the reconciliation worker. The ledger sum of escrow_holds in held state does not match the on-chain hot-wallet balance within tolerance.

Severity. SEV-0 if the drift is greater than the operational buffer or unexplained. SEV-1 if drift is bounded and explained by an in-flight transaction.

Triage.

Read the most recent reconciliation log entries: journalctl -u plaza-api -g reconciliation | tail -200.
Confirm the magnitude. The page includes both numbers; verify against the live RPC.
List in-flight transactions on the hot wallet on Basescan. Pending sweeps and pending payouts account for some drift; drift that no pending transaction accounts for is the alarm.
Confirm the ledger has not been corrupted: SELECT account_id, SUM(amount * direction) FROM ledger_entries GROUP BY account_id HAVING SUM(amount * direction) <> 0;. Empty means the ledger sums clean.
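
The severity rule and the magnitude check above can be sketched as a small function. This is only an illustration: the function name, the `in_flight` total, and the `buffer` value are assumptions, and the 1 USDC tolerance is borrowed from the `ReconciliationDriftHigh` alert rather than from the worker's actual code.

```python
from decimal import Decimal


def classify_drift(ledger_held, onchain_balance, in_flight, tolerance, buffer):
    """Return None when within tolerance, otherwise "SEV-0" or "SEV-1".

    All amounts are Decimal USDC. `in_flight` is the total value of pending
    sweeps and payouts (hypothetical input; the real worker may derive it
    differently). `buffer` is the operational buffer.
    """
    drift = abs(ledger_held - onchain_balance)
    if drift <= tolerance:
        return None
    # Whatever the pending transactions do not account for is unexplained.
    unexplained = abs(drift - in_flight)
    if unexplained > tolerance or drift > buffer:
        return "SEV-0"
    return "SEV-1"
```

Using `Decimal` instead of floats keeps the comparison exact at USDC precision.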

Mitigation.

If drift is explained by a pending transaction: wait one block, re-check.
If drift is unexplained:
- Pause new orders in custodied mode. Set PLAZA_CUSTODIED_MODE_DISABLED=1 and reload plaza-api. New orders fall through to contract mode.
- Page the CEO, who has decision authority on customer communications.
- Pull the audit log for the last 24 hours. Look for unauthorized signer use.
- Rotate the MPC signer: generate a new signing key, sweep the full hot-wallet balance to cold, and resume only after the discrepancy is reconciled.

Postmortem. Required for SEV-0. Public postmortem if the drift was real.

RPC outage

Symptom. Payout worker errors. Funding submissions fail. Reconciliation cannot complete.

Severity. SEV-1.

Triage.

Confirm the outage is upstream and not local: curl https://api.developer.coinbase.com/rpc/v1/base/... from the box.
Check Base status page.
Check the secondary RPC provider configured in infra/rpc.toml.

Mitigation.

The configuration supports a list of RPC endpoints. The worker fails over automatically. If failover is not happening, restart the worker.
If both providers are out: pause auto-acceptance jobs by setting PLAZA_AUTO_ACCEPT_DISABLED=1. Funding remains accepted at the API layer; the funding submission queue absorbs the outage.
Customer communication: post to status.plaza.aegent.dev if the outage exceeds 5 minutes.
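
For orientation, ordered failover across a list of endpoints might look like the sketch below. It is not the worker's actual code: `call_with_failover` and the `request` callable are hypothetical names, and the real endpoint list lives in infra/rpc.toml.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order and return (endpoint, result) on first success.

    `request` is any callable that takes an endpoint and raises on failure.
    """
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, request(endpoint)
        except Exception as exc:  # a real worker would catch narrower errors
            errors.append((endpoint, exc))
    # Every provider failed: this is the "both providers are out" case.
    raise RuntimeError(f"all RPC endpoints failed: {errors}")
```

If the worker is not failing over, the usual suspects in a loop like this are a cached connection to the dead endpoint or an error type that is never treated as retryable, which is why a restart is the first mitigation.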

Recovery. When RPC returns, the queue drains. Validate that funding submissions and payouts process before clearing PLAZA_AUTO_ACCEPT_DISABLED.

Hot-wallet compromise (suspected)

Symptom. Unauthorized signing event in the MPC audit log. Or an unexpected outflow from the hot wallet.

Severity. SEV-0.

Triage.

Pause the API: systemctl stop plaza-api. New orders cannot be placed; in-flight webhook deliveries fail and fall back on their retry logic.
Pause the contract if contract mode is in use: call PlazaEscrow.pause() from the cold multisig.
Get a balance snapshot: cast balance <hot_wallet_address> --rpc-url <url> for the native balance (in wei) and cast call <usdc> "balanceOf(address)" <hot_wallet_address> --rpc-url <url> for the USDC balance.
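
When the function signature carries no return type, cast call prints the raw ABI-encoded word rather than a decoded number. A minimal sketch for turning that hex word into a USDC amount, assuming the token uses 6 decimals (as USDC does); the helper name is hypothetical:

```python
from decimal import Decimal

USDC_DECIMALS = 6  # USDC uses 6 decimal places


def usdc_from_raw(raw_hex):
    """Convert a hex-encoded uint256 balance (as printed by cast) to USDC."""
    return Decimal(int(raw_hex, 16)) / (Decimal(10) ** USDC_DECIMALS)
```

Record both the raw value and the converted amount in the incident channel so the snapshot can be re-checked later.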

Mitigation.

Sweep remaining hot-wallet balance to cold via a fresh signer. The cold multisig is offline; coordinate with key holders.
Rotate the MPC signing key. The new key is set in infra/secrets/mpc.env; restart the API.
Notify customers per the incident-communication plan in docs/legal/.
File a report under whatever jurisdictional reporting framework applies. Counsel directs.

Recovery. Resume operations only after a full ledger reconciliation against on-chain state and a documented forensic analysis of the compromise vector.

Resolver-key compromise (contract mode)

Symptom. Unauthorized release calls on the escrow contract.

Severity. SEV-0.

Triage.

Pause the contract: PlazaEscrow.pause() from the cold multisig.
Inventory the affected orders: query the contract for recent release events; cross-reference against Plaza’s expected releases.
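
The cross-reference step amounts to a set difference between what the contract released and what Plaza expected to release. A minimal sketch, assuming both sides have already been reduced to order identifiers (the function name and inputs are hypothetical):

```python
def unexpected_releases(onchain_order_ids, expected_order_ids):
    """Return order ids released on-chain that Plaza did not expect, sorted."""
    return sorted(set(onchain_order_ids) - set(expected_order_ids))
```

Anything this returns is a candidate for the make-whole list; an empty result suggests the release calls matched Plaza's own records.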

Mitigation.

Rotate the resolver: PlazaEscrow.setResolver(new_address) from the cold multisig.
Resume by unpausing only after the new resolver is in place and verified.
Customer impact is bounded to in-flight contract escrow at the moment of breach. Make whole from Plaza reserves where applicable.

Postgres LISTEN/NOTIFY saturation

Symptom. The message broker logs that the notification queue is full, or messages stop reaching recipients.

Severity. SEV-1.

Triage.

SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction' ORDER BY xact_start LIMIT 20; to find a stuck transaction.
Check Redis for stuck cache locks.
Confirm the broker process is running: systemctl status plaza-api.

Mitigation.

Restart plaza-api. The broker drops live connections; clients reconnect over WebSocket with exponential backoff.
If saturation persists, schedule the migration to a dedicated broker process. The architecture allows splitting the broker out without code changes: deploy plaza-api --role broker-only on a separate host.
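
Because a restart drops every connection at once, the reconnect schedule decides how hard clients hit the broker when it comes back. The sketch below shows capped exponential backoff with full jitter; the base, cap, and function name are assumptions for illustration, not the client's actual parameters.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=random.random):
    """Yield one delay in seconds per reconnect attempt.

    The ceiling doubles each attempt up to `cap`; `rng` scales it by a
    random factor in [0, 1) so clients do not reconnect in lockstep.
    """
    for n in range(attempts):
        yield rng() * min(cap, base * (2 ** n))
```

With jitter disabled (`rng=lambda: 1.0`) the ceilings are 0.5, 1, 2, 4, 8, 16, 30, 30 seconds.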

Disputed receipt without a verdict

Symptom. Dispute open longer than 5 minutes, no verdict written.

Severity. SEV-2, or SEV-1 if the arbitration latency p99 budget is breached.

Triage.

Check the screener log: was the dispute flagged for human pre-review? If so, acknowledge in the operator console.
Check the LLM provider status.
Check pgmq for stuck arbitration.run jobs.
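
The symptom can be expressed as a filter over open disputes. This is a sketch only: the record shape (`urn`, `opened_at`, `verdict`) is assumed for illustration and may not match the real schema.

```python
from datetime import timedelta


def overdue_disputes(disputes, now, limit=timedelta(minutes=5)):
    """Return URNs of disputes open longer than `limit` with no verdict."""
    return [
        d["urn"]
        for d in disputes
        if d["verdict"] is None and now - d["opened_at"] > limit
    ]
```

Each URN it returns is a candidate for the manual run described under Mitigation.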

Mitigation.

Run arbitration manually: plaza-tools arbitrate <dispute_urn>.
If the LLM provider is degraded, fail over to the secondary provider. The arbitrator config supports primary/secondary endpoints.

Escrow contract finding

Symptom. A bug bounty submission, audit finding, or unexpected behavior from the escrow contract.

Severity. SEV-0 if exploitable; SEV-1 if theoretical.

Mitigation.

Pause the contract immediately if exploitable.
Fund the bounty per the program rules.
Plan v2 deployment per the migration pattern: deploy v2, route new orders to v2, drain v1 by letting v1 orders resolve. Do not migrate state from v1 to v2; funds in v1 stay in v1 until released.
Communicate to customers: the existing v1 orders will complete; new orders use v2.
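
The drain pattern boils down to two rules: new orders always go to the current contract version, and v1 is finished only when none of its orders remain unresolved. A minimal sketch, with hypothetical field names (`contract`, `resolved`):

```python
CURRENT_VERSION = "v2"  # assumed label for the newly deployed contract


def contract_for_new_order():
    """New orders are always routed to the current contract version."""
    return CURRENT_VERSION


def v1_drained(orders):
    """True once no v1 order is still unresolved; v1 state is never migrated."""
    return not any(o["contract"] == "v1" and not o["resolved"] for o in orders)
```

Only when `v1_drained` holds is it safe to treat v1 as retired; until then its funds remain releasable from v1 alone.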

Backup restore

Cadence. Quarterly disaster-recovery rehearsal.

Procedure.

Spin up a fresh Hetzner CCX23 from the standard image.
Pull the latest pg_dump from R2: aws s3 cp s3://plaza-backups/pg/<date>.sql.age .
Decrypt with the operations age key: age -d -i <operations_key_file> -o <date>.sql <date>.sql.age.
Restore: psql plaza < <date>.sql.
Migrate forward: the API binary applies pending migrations on startup if PLAZA_AUTO_MIGRATE=1.
Smoke-test: place a sandbox order against the restored DB; verify the read path, the funding path, the message delivery path, and the reputation lookup.
Tear down the test box.

Document the wall-clock time. RTO target: 4 hours. RPO target: 15 minutes. Snapshots are currently nightly, so the RPO achievable from snapshots alone is up to 24 hours; it moves toward the target only as intra-day snapshot frequency increases, once volume justifies it.
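
A rehearsal can be scored against both targets with a few lines. The sketch below assumes the worst-case RPO equals the interval between snapshots, which ignores any other recovery source; the function and field names are hypothetical.

```python
from datetime import timedelta

RTO_TARGET = timedelta(hours=4)
RPO_TARGET = timedelta(minutes=15)


def rehearsal_report(restore_started, smoke_test_passed, snapshot_interval):
    """Compare one rehearsal against the RTO and RPO targets."""
    elapsed = smoke_test_passed - restore_started
    return {
        "elapsed": elapsed,
        "rto_met": elapsed <= RTO_TARGET,
        # With periodic snapshots, up to one full interval of data can be lost.
        "worst_case_rpo": snapshot_interval,
        "rpo_met": snapshot_interval <= RPO_TARGET,
    }
```

With nightly snapshots the report shows the RPO target as unmet, which is the expected result until snapshots run more often.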

Page-on-call playbook

When you get paged:

Acknowledge the page within 5 minutes.
Confirm severity based on the symptom.
Open an incident channel.
Run triage from the relevant section above.
Mitigate.
Post status updates to the incident channel every 15 minutes until resolved.
After the incident: write a postmortem within 5 business days. SEV-0 postmortems are public. SEV-1 are public if customer-visible; internal otherwise.
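
The postmortem deadline is counted in business days, which is easy to miscount across a weekend. A minimal sketch that skips Saturdays and Sundays; it does not know about public holidays, and the function name is hypothetical.

```python
from datetime import timedelta


def postmortem_due(resolved_on, business_days=5):
    """Return the date `business_days` working days after `resolved_on`."""
    day = resolved_on
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return day
```

An incident resolved on a Friday is therefore due the following Friday, not the Wednesday.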

Alert reference

The Prometheus rules at infra/observability/alerts.yml define every alert that pages on-call. Each alert below names the symptom, the first thing to check, and who is paged. Routing follows the alertmanager configuration: critical pages PagerDuty (on-call), warning posts to #plaza-alerts (on-call follows up), info posts to #plaza-status (no page).

SLO alerts

Alert	Severity	Means	Check first	Pages
`SearchP99Breached`	warning	`/v1/search` p99 > 200 ms for 10 min	Postgres FTS query plans; index bloat; load on the box	On-call
`OrderPlacementP99Breached`	warning	`POST /v1/orders` p99 > 500 ms for 10 min	DB locks; outbox depth; Redis latency on idempotency cache	On-call
`MessageDeliveryP99Breached`	warning	message-broker p99 > 100 ms for 10 min	NOTIFY queue saturation; broker process health	On-call
`ReputationLookupP99Breached`	warning	reputation lookup p99 > 50 ms for 10 min	`reputation_index` table cardinality; index health	On-call
`ArbitrationP50Breached`	warning	arbitration p50 > 90 s for 30 min	LLM provider status; pgmq `arbitration.run` queue depth; screener flagging rate	On-call
`PayoutP50Breached`	warning	on-chain payout p50 > 30 s for 15 min	Base RPC latency; gas-price oracle; signer queue	On-call
`PayoutP99Breached`	warning	on-chain payout p99 > 5 min for 15 min	Same; plus stuck transactions on the wallet	On-call
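
Each row above pairs a threshold with a hold time ("for 10 min"): the alert fires only if the condition stays true for the whole window, so a single slow scrape does not page. The sketch below approximates that behaviour over a list of samples; it is an illustration, not how Prometheus evaluates the rules internally.

```python
def breached_for(samples, threshold, hold_seconds):
    """True if the value has exceeded `threshold` continuously for at least
    `hold_seconds`, measured up to the most recent sample.

    `samples` is a time-ordered list of (unix_timestamp, value) pairs.
    """
    start = None
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts  # breach begins here
        else:
            start = None  # any recovery resets the clock
    return start is not None and samples[-1][0] - start >= hold_seconds
```

For example, search p99 samples that dip below 200 ms once in the middle of the window restart the 10-minute clock.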

Money-integrity alerts

Alert	Severity	Means	Check first	Pages
`LedgerSumZeroViolation`	critical	A ledger transaction failed sum-zero	Halt releases; run the §“Reconciliation drift” runbook	On-call + CEO
`ReconciliationDriftHigh`	critical	Sum of `escrow_holds` in `held` differs from on-chain hot-wallet balance by > 1 USDC	Run the §“Reconciliation drift” runbook	On-call + CEO
`HotWalletAboveCap`	warning	Hot wallet balance > 50,000 USDC	Sweep cadence; mode-default threshold (`PLAZA_MODE_THRESHOLD_USD`)	On-call
`PayoutFailureRateHigh`	critical	> 5% payout failure rate over 15 min	Base RPC errors; signer cap saturation; nonce conflicts	On-call + CEO
`RpcErrorRateHigh`	warning	> 1 Base RPC error/sec for 10 min	Run the §“RPC outage” runbook	On-call

Reliability alerts

Alert	Severity	Means	Check first	Pages
`ApiErrorRateHigh`	warning	> 1% 5xx rate per route over 10 min	Per-route logs in `journalctl -u plaza-api`; recent deploys	On-call
`WebhookDeadLetterRateHigh`	warning	> 0.5/sec dead-lettered deliveries over 15 min	Subscription URL health; delivery worker logs; affected subscribers	On-call
`NatsConsumerLag`	warning	NATS consumer pending > 10,000 for 10 min	Consumer health; downstream processor (search, webhook fan-out)	On-call
`OutboxDepthHigh`	warning	> 5,000 unshipped outbox rows for 10 min	Drainer process health; NATS connectivity	On-call
`PostgresReplicationLagHigh`	warning	Replication lag > 30 s for 10 min	Replica disk IO; network between primary and replica	On-call
`DiskFreeLow`	warning	< 10% disk free for 10 min	WAL bloat; old logs; backup tarballs	On-call
`DiskFreeCritical`	critical	< 5% disk free for 5 min	Same; plus emergency cleanup of `/var/lib/postgresql/wal`; ensure backup is current before truncation	On-call + CEO
`CertificateExpiring`	warning	TLS cert expires in < 14 days	ACME renewal logs in Caddy	On-call
`CertificateExpiringCritical`	critical	TLS cert expires in < 3 days	Same; force renewal	On-call + CEO
`BackupStale`	critical	Nightly Postgres backup last success > 36 h	Backup script logs; R2 credentials; recipient age key	On-call + CEO
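
The 36-hour threshold on `BackupStale` leaves roughly 12 hours of slack after a nightly run before anyone is paged. The check itself is a single comparison; the sketch below uses a hypothetical function name and assumes both timestamps share a timezone.

```python
from datetime import timedelta


def backup_is_stale(last_success, now, max_age=timedelta(hours=36)):
    """True if the last successful backup is older than `max_age`."""
    return now - last_success > max_age
```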

Contacts

Internal. Names and contacts kept in infra/oncall/contacts.yml, not in the docs site.