On-call
For Plaza on-call engineers. Read once before your first shift; read again on the first day of each shift.
The on-call engineer is responsible for:
- Acknowledging pages within the per-severity SLA in incident-response.md.
- Triaging incidents, declaring severity, and either resolving or escalating.
- Updating the public status page during an incident.
- Writing the postmortem (or assigning it before signing off).
- Watching the dashboards listed below at least twice during the shift.
The on-call engineer is not responsible for:
- Routine feature work. Defer it to non-on-call days.
- Customer support tickets that are not incidents. Those route to pilot-support@plaza.aegent.dev and are handled by the on-duty support engineer.
- Approving deploys outside the on-call shift’s deploy window.
- Decisions that require the CEO. Page the CEO on SEV-0 or any decision touching customer money beyond an SLO credit.
Rotation
Two-engineer rotation at launch. Primary and secondary. Weekly handoff at 10:00 UTC every Monday. Handoff is a 15-minute synchronous call; the rotation calendar is in PagerDuty.
Compensation: a per-shift on-call stipend; full pay for any work performed during a page; pay-it-back time off after a SEV-0.
If the primary does not acknowledge a page within 5 minutes, the secondary is paged. If the secondary does not acknowledge within 5 minutes, the CEO is paged. This is the escalation path.
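The escalation path above can be sketched as a small lookup. This is an illustration only, not the PagerDuty configuration: the names `ESCALATION_CHAIN`, `STEP_TIMEOUT`, and `paged_so_far` are hypothetical, and treating the 5-minute boundary as inclusive is an assumption.

```python
from datetime import timedelta

# Escalation order and per-step timeout, taken from the paragraph above.
ESCALATION_CHAIN = ["primary", "secondary", "ceo"]
STEP_TIMEOUT = timedelta(minutes=5)


def paged_so_far(unacked_for: timedelta) -> list[str]:
    """Return everyone who has been paged after `unacked_for` without an ack."""
    if unacked_for < timedelta(0):
        raise ValueError("unacked_for must not be negative")
    # Each full timeout that passes without an ack moves one step down the chain.
    steps_elapsed = unacked_for // STEP_TIMEOUT
    last_index = min(steps_elapsed, len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[: last_index + 1]
```

For example, `paged_so_far(timedelta(minutes=7))` returns `["primary", "secondary"]`, which is a quick way to sanity-check who should already have been reached during an incident review.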
Handoff template
Used at the Monday 10:00 UTC handoff call. The outgoing engineer fills it in; the incoming engineer reads it back.
# Plaza on-call handoff — week of YYYY-MM-DD

## Open incidents
- None / list them with severity, status, and owner

## Recently resolved (last 7 days)
- INC-... — one line summary — postmortem status

## Active changes
- Deploys planned this week and their owners.
- Any flag changes still in progress.

## Watchpoints
- Anything that has been quiet but you want the next person to keep an eye on.
- Recent alert tuning that has not yet had a full week's worth of signal.

## Vacations / known-out
- Anyone the on-call cannot escalate to this week.

## Anything else

The previous handoff is logged in infra/oncall/handoffs/ (private repo).
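Filling in the date in the template title can be scripted. The sketch below assumes that "week of" means the Monday on which the handoff call takes place. The helper names `handoff_monday` and `fill_week` are hypothetical and are not part of any Plaza tooling.

```python
from datetime import date, timedelta

PLACEHOLDER = "YYYY-MM-DD"  # the date placeholder used in the template title


def handoff_monday(today: date) -> date:
    """Return the Monday that starts the rotation week containing `today`."""
    return today - timedelta(days=today.weekday())  # Monday is weekday 0


def fill_week(template: str, today: date) -> str:
    """Replace the date placeholder with the ISO date of this week's Monday."""
    return template.replace(PLACEHOLDER, handoff_monday(today).isoformat())
```

Calling `fill_week("week of YYYY-MM-DD", date(2024, 1, 3))` gives `"week of 2024-01-01"`. The sketch ignores the hours before 10:00 UTC on Monday, when the previous rotation is strictly still active.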
First day on call
Run through this list at the start of every shift. Most of it takes minutes; do not skip it.
- Acknowledge the rotation in PagerDuty. Confirm your phone number and Slack are correct.
- Read the previous handoff document.
- Open the four dashboards listed below in pinned tabs.
- Confirm kubectl (or ssh plaza-prod) credentials still work.
- Confirm plaza CLI works against production with your on-call token.
- Confirm you can post to the status page (test in staging first).
- Confirm you can page the CEO via the documented number, not Slack.
- Verify the secondary on-call knows they are secondary this week.
- Read the most recent SEV-0 or SEV-1 postmortem if you were not on the response. Twenty minutes is enough.
- Run one drill from incident-response.md mentally; pick the one you remember least about.
Dashboards to watch
Production. Bookmarked in the team Grafana folder Plaza / On-call.
- Money integrity. Reconciliation drift, ledger sum-zero violations, hot-wallet balance, payout success rate, payout latency. The single most important panel; check at least once per shift.
- API health. Request rate, error rate, p50/p99 latency by route. SLO compliance bands rendered.
- Webhook delivery. Delivery success rate, dead-letter rate, outbox depth, retry distribution.
- Infrastructure. Postgres replication lag, NATS consumer lag, disk free, certificate expiry, backup recency.
A fifth panel — Pilot org watch — exists during the pilot period only. It surfaces the pilot orgs’ transaction success and any error spikes correlated to their account URNs.
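To make "ledger sum-zero violations" on the Money integrity dashboard concrete: in a double-entry ledger, the entries belonging to one transaction are expected to net to zero. The sketch below shows that check under stated assumptions. The `(transaction_id, amount)` entry shape and the function name `sum_zero_violations` are hypothetical; Plaza's actual ledger schema and the query behind the panel are not described here.

```python
from collections import defaultdict
from decimal import Decimal


def sum_zero_violations(entries):
    """Return {transaction_id: net_amount} for transactions that do not net to zero.

    `entries` is an iterable of (transaction_id, amount) pairs in which debits
    and credits carry opposite signs. This shape is an assumption.
    """
    totals = defaultdict(Decimal)  # Decimal() starts each total at exactly 0
    for transaction_id, amount in entries:
        totals[transaction_id] += Decimal(amount)
    return {tx: net for tx, net in totals.items() if net != 0}
```

Amounts are handled as `Decimal` rather than `float` so that a one-cent imbalance is reported exactly instead of being lost to binary rounding.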
Tools you will need
- PagerDuty. For ack and escalation.
- Grafana. For dashboards.
- Slack. For incident channels and status updates.
- Status page admin. Login via the shared 1Password vault.
- Production SSH access. Via the bastion. Your key is provisioned at hire.
- plaza CLI. With an on-call-scoped token that can read but not write production.
- The runbook. docs/operations/runbook.md, indexed by alert name.
- The incident response framework. docs/operations/incident-response.md.
When you are not sure
Three rules.
- If money is at risk, escalate. Page the CEO. The cost of escalating is one short call. The cost of not escalating is an unrecoverable customer.
- If you are tired, hand off. A second engineer at hour two is more useful than the first engineer at hour eight. Plaza’s on-call is paid; use it to share load.
- Communicate often. A short status page update every 30 minutes — even saying “still investigating” — is what customers expect. Silence is worse than no progress.
What good looks like
A SEV-1 ack within 10 minutes, a status page entry within 30, a clear timeline in the incident channel, a resolution within an hour, and a postmortem in the repo within two weeks. Nobody got woken up unnecessarily. The customer-facing comms named what happened and what we did.
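The numeric targets in this description can be checked against an incident timeline during the postmortem. The following is a minimal sketch. It assumes every duration is measured from the moment the page fired, which the text does not state, and the milestone names and the function `missed_targets` are hypothetical.

```python
from datetime import timedelta

# Targets quoted in the paragraph above. Measuring each one from the time the
# page fired is an assumption.
SEV1_TARGETS = {
    "acknowledged": timedelta(minutes=10),
    "status_page_posted": timedelta(minutes=30),
    "resolved": timedelta(hours=1),
    "postmortem_merged": timedelta(weeks=2),
}


def missed_targets(elapsed: dict[str, timedelta]) -> list[str]:
    """Return the milestones that are missing or later than their target."""
    return [
        name
        for name, limit in SEV1_TARGETS.items()
        if name not in elapsed or elapsed[name] > limit
    ]
```

A milestone that has not happened yet is reported as missed, so the same function can be run mid-incident to see what is still outstanding.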
What bad looks like
A SEV-0 with a delayed CEO page. An incident channel with no timeline. A status page that says “all systems operational” while customers are stuck. A postmortem that blames an individual. Avoid all of these.