On-call

For Plaza on-call engineers. Read once before your first shift; read again on the first day of each shift.

Scope

The on-call engineer is responsible for:

Acknowledging pages within the per-severity SLA in incident-response.md.
Triaging incidents, declaring severity, and either resolving or escalating.
Updating the public status page during an incident.
Writing the postmortem (or assigning it before signing off).
Watching the dashboards listed below at least twice during the shift.

The on-call engineer is not responsible for:

Routine feature work. Defer it to non-on-call days.
Customer support tickets that are not incidents. Those route to pilot-support@plaza.aegent.dev and are handled by the on-duty support engineer.
Approving deploys outside the on-call shift’s deploy window.
Decisions that require the CEO. Page the CEO on SEV-0 or any decision touching customer money beyond an SLO credit.

Rotation

Two-engineer rotation at launch. Primary and secondary. Weekly handoff at 10:00 UTC every Monday. Handoff is a 15-minute synchronous call; the rotation calendar is in PagerDuty.
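
As a rough illustration of the schedule above, this Python sketch computes the next Monday 10:00 UTC handoff from a given moment. The function name `next_handoff` and the constants are hypothetical and not part of any Plaza tooling; the PagerDuty rotation calendar remains the source of truth.

```python
from datetime import datetime, timedelta, timezone

HANDOFF_WEEKDAY = 0    # Monday, as returned by datetime.weekday()
HANDOFF_HOUR_UTC = 10  # 10:00 UTC, per the rotation rules


def next_handoff(now: datetime) -> datetime:
    """Return the first Monday 10:00 UTC strictly after `now`.

    `now` should be timezone-aware; a naive value is interpreted as
    local time by astimezone().
    """
    now = now.astimezone(timezone.utc)
    # Start from 10:00 UTC today, then move forward to the coming Monday.
    candidate = now.replace(hour=HANDOFF_HOUR_UTC, minute=0, second=0, microsecond=0)
    candidate += timedelta(days=(HANDOFF_WEEKDAY - now.weekday()) % 7)
    # If today is Monday and 10:00 has already passed, use next week.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate
```

Calling `next_handoff(datetime.now(timezone.utc))` at the start of a shift gives the time the shift ends.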

Compensation: a per-shift on-call stipend; full pay for any work performed during a page; pay-it-back time off after a SEV-0.

If the primary does not acknowledge a page within 5 minutes, the secondary is paged. If the secondary does not acknowledge within 5 minutes, the CEO is paged. This is the escalation path.
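
The escalation path above can be sketched as a small function. This is an illustration only: the real paging is done by PagerDuty, and the names `ESCALATION_CHAIN` and `who_is_paged` are hypothetical. It assumes the secondary's 5-minute window starts when the secondary is paged, so the CEO is paged 10 minutes after the original page.

```python
from datetime import timedelta

ACK_TIMEOUT = timedelta(minutes=5)                  # per-level window from this guide
ESCALATION_CHAIN = ["primary", "secondary", "ceo"]  # hypothetical role labels


def who_is_paged(elapsed_unacked: timedelta) -> list[str]:
    """Return everyone paged so far for a page that has gone
    unacknowledged for `elapsed_unacked`."""
    # Number of full 5-minute windows that have expired without an ack.
    expired_windows = max(int(elapsed_unacked / ACK_TIMEOUT), 0)
    # The chain ends at the CEO; nobody is paged beyond that level.
    last_level = min(expired_windows, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[: last_level + 1]
```

For example, `who_is_paged(timedelta(minutes=7))` covers the primary and the secondary, because one window has expired and the second is still open.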

Handoff template

Used at the Monday 10:00 UTC handoff call. The outgoing engineer fills it in; the incoming engineer reads it back.

# Plaza on-call handoff — week of YYYY-MM-DD

## Open incidents
- None / list them with severity, status, and owner

## Recently resolved (last 7 days)
- INC-... — one line summary — postmortem status

## Active changes
- Deploys planned this week and their owners.
- Any flag changes still in progress.

## Watchpoints
- Anything that has been quiet but you want the next person to keep an eye on.
- Recent alert tuning that has not yet had a full week's worth of signal.

## Vacations / known-out
- Anyone the on-call cannot escalate to this week.

## Anything else

Previous handoffs are logged in infra/oncall/handoffs/ (private repo).

First day on call

Run through this list at the start of every shift. Most of it takes minutes; do not skip it.

Dashboards to watch

Production. Bookmarked in the team Grafana folder Plaza / On-call.

Money integrity. Reconciliation drift, ledger sum-zero violations, hot-wallet balance, payout success rate, payout latency. The single most important panel; check at least once per shift.
API health. Request rate, error rate, p50/p99 latency by route. SLO compliance bands rendered.
Webhook delivery. Delivery success rate, dead-letter rate, outbox depth, retry distribution.
Infrastructure. Postgres replication lag, NATS consumer lag, disk free, certificate expiry, backup recency.

A fifth panel — Pilot org watch — exists during the pilot period only. It surfaces the pilot orgs’ transaction success and any error spikes correlated to their account URNs.

Tools you will need

PagerDuty. For ack and escalation.
Grafana. For dashboards.
Slack. For incident channels and status updates.
Status page admin. Login via the shared 1Password vault.
Production SSH access. Via the bastion. Your key is provisioned at hire.
plaza CLI. With an on-call-scoped token that can read but not write production.
The runbook. docs/operations/runbook.md, indexed by alert name.
The incident response framework. docs/operations/incident-response.md.

When you are not sure

Three rules.

If money is at risk, escalate. Page the CEO. The cost of escalating is one short call. The cost of not escalating is an unrecoverable customer.
If you are tired, hand off. A second engineer at hour two is more useful than the first engineer at hour eight. Plaza’s on-call is paid, so use the secondary to share the load.
Communicate often. A short status page update every 30 minutes, even one that only says “still investigating”, is what customers expect. Silence is worse than reporting no progress.
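
The 30-minute update cadence in the third rule can be tracked with a check like the one below. This is a minimal sketch, not an existing Plaza tool; the names `next_update_due` and `is_update_overdue` are hypothetical, and it assumes you record the time of the last status page update yourself, as a timezone-aware value.

```python
from datetime import datetime, timedelta, timezone

UPDATE_INTERVAL = timedelta(minutes=30)  # cadence from this guide


def next_update_due(last_update: datetime) -> datetime:
    """Return the latest time the next status page update should go out."""
    return last_update + UPDATE_INTERVAL


def is_update_overdue(last_update: datetime, now: datetime) -> bool:
    """True once 30 minutes or more have passed since the last update."""
    return now >= next_update_due(last_update)
```

During an incident, `is_update_overdue(last_update, datetime.now(timezone.utc))` turning true is the cue to post, even if the update only says the investigation is ongoing.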

What good looks like

A SEV-1 ack within 10 minutes, a status page entry within 30 minutes, a clear timeline in the incident channel, a resolution within an hour, and a postmortem in the repo within two weeks. Nobody was woken up unnecessarily. The customer-facing comms named what happened and what we did.
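
The time targets above can be checked against an incident's timeline when writing the postmortem. The sketch below is illustrative only. It assumes every window is measured from the moment of the page, which this guide does not state explicitly, and the milestone names and the `missed_targets` function are hypothetical.

```python
from datetime import datetime, timedelta

# SEV-1 targets from this guide. Assumption: each is measured from the page time.
TARGETS = {
    "ack": timedelta(minutes=10),
    "status_page": timedelta(minutes=30),
    "resolved": timedelta(hours=1),
    "postmortem": timedelta(weeks=2),
}


def missed_targets(paged_at: datetime, milestones: dict[str, datetime]) -> list[str]:
    """Return the names of targets that were missed or not yet reached.

    `milestones` maps a target name to the time it was reached;
    a missing key means the milestone has not happened.
    """
    missed = []
    for name, limit in TARGETS.items():
        reached_at = milestones.get(name)
        if reached_at is None or reached_at - paged_at > limit:
            missed.append(name)
    return missed
```

An empty list means the incident met every time target; it says nothing about the qualitative goals, such as a clear timeline or blameless comms.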

What bad looks like

A SEV-0 with a delayed CEO page. An incident channel with no timeline. A status page that says “all systems operational” while customers are stuck. A postmortem that blames an individual. Avoid all of these.