GitHub Merge Queue Escalation ACK Timeout Remediation Runbook: On-Call Handoffs, Breach Timers, and Recovery Templates (2026)
Most merge queue incident playbooks define who to page. Fewer define what to do when nobody acknowledges the page on time. ACK timeout breaches are where ownership fails silently and rollback windows remain exposed.
This guide gives a practical ACK-timeout remediation runbook for GitHub merge queue governance: timer model, escalation tiers, handoff expectations, and copy-paste communication templates.
1. Why ACK timeout breaches are high-risk
A missed acknowledgement is not just a paging issue. It means incident ownership is undefined while branch-protection exceptions are still active, required checks are still unstable, or recovery decisions are still pending.
| Failure pattern | What usually happens next | Control you need |
|---|---|---|
| Primary owner silent past ACK window | Informal chat escalation with no timestamp trail | Automatic tier handoff at timeout boundary |
| Multiple responders answer late | Duplicate actions and conflicting decisions | Single acting owner declaration macro |
| No leadership fallback trigger | Incident drifts beyond rollback-safe window | Parallel leadership timer with hard cutover |
2. Three-timer model for escalation governance
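Run three explicit timers once the first escalation alert is sent: Timer A (ACK) and Timer C (leadership cutover) start at the initial alert, and Timer B (action) starts at acknowledgement. This prevents ownership ambiguity and avoids silent waiting loops.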
3. Severity thresholds and ACK SLA targets
ACK windows should scale by impact. Use severity mapping that is short enough to stop drift but realistic for your on-call coverage model.
| Severity | ACK SLA | Action SLA | Leadership cutover |
|---|---|---|---|
| SEV-1 production rollback blocked | 5 minutes | 10 minutes from ACK | 15 minutes from initial alert |
| SEV-2 merge queue unstable, safe rollback path exists | 10 minutes | 20 minutes from ACK | 30 minutes from initial alert |
| SEV-3 policy drift risk, no active customer impact | 20 minutes | 40 minutes from ACK | 60 minutes from initial alert |
These values are starting points. Calibrate weekly with your closure-quality dashboard and adjust only with explicit owner sign-off.
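To make the table concrete, the sketch below computes the three deadlines for one incident. It is a minimal illustration, not a reference implementation: the names `SLA_MINUTES`, `Deadlines`, `compute_deadlines`, and `breached_timers` are hypothetical, and the rule for when Timer C counts as breached (no first action by the cutover time) is an assumption, since the table only gives the cutover offset.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# SLA values in minutes, copied from the severity table above.
SLA_MINUTES = {
    "SEV-1": {"ack": 5, "action": 10, "leadership": 15},
    "SEV-2": {"ack": 10, "action": 20, "leadership": 30},
    "SEV-3": {"ack": 20, "action": 40, "leadership": 60},
}


@dataclass(frozen=True)
class Deadlines:
    ack: datetime                # Timer A: measured from the initial alert
    action: Optional[datetime]   # Timer B: measured from ACK, None until acknowledged
    leadership: datetime         # Timer C: measured from the initial alert


def compute_deadlines(severity: str, alert_at: datetime,
                      acked_at: Optional[datetime] = None) -> Deadlines:
    """Derive all three deadlines from the alert time and optional ACK time."""
    sla = SLA_MINUTES[severity]
    action = None
    if acked_at is not None:
        action = acked_at + timedelta(minutes=sla["action"])
    return Deadlines(
        ack=alert_at + timedelta(minutes=sla["ack"]),
        action=action,
        leadership=alert_at + timedelta(minutes=sla["leadership"]),
    )


def breached_timers(deadlines: Deadlines, now: datetime,
                    acked: bool = False,
                    first_action_done: bool = False) -> List[str]:
    """Return the timers ("A", "B", "C") that are breached at `now`."""
    breached = []
    if not acked and now > deadlines.ack:
        breached.append("A")
    if (deadlines.action is not None and not first_action_done
            and now > deadlines.action):
        breached.append("B")
    # Assumption: Timer C is breached when no first action exists by cutover.
    if not first_action_done and now > deadlines.leadership:
        breached.append("C")
    return breached


alert = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
d = compute_deadlines("SEV-1", alert)
print(d.ack.isoformat(), d.leadership.isoformat())
```

Keeping every timestamp timezone-aware in UTC matches the templates later in this guide, which record all times in UTC.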
4. Escalation path and owner handoff matrix
The path must be deterministic: each timeout has one destination owner and one expected action. No "whoever is online" routing.
| Timeout event | Escalates to | Expected action | Evidence required |
|---|---|---|---|
| ACK timer breached (Timer A) | Secondary on-call owner | Post owner-claim comment and begin action plan | UTC breach timestamp + pager receipt |
| Action timer breached (Timer B) | Incident commander | Take execution ownership and publish decision path | Missing-action proof + command takeover note |
| Leadership timer breached (Timer C) | Governance lead / manager on duty | Authorize bounded exception or enforce baseline restore | Risk justification + expiry + restoration plan |
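One way to keep routing deterministic is to encode the matrix as a lookup table, so a timeout can only resolve to one destination. The sketch below is illustrative: `Timer`, `Handoff`, `HANDOFF_MATRIX`, `route`, and `missing_evidence` are hypothetical names, and the role strings are labels taken from the table rather than real pager targets.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List, Tuple


class Timer(Enum):
    ACK = "A"
    ACTION = "B"
    LEADERSHIP = "C"


@dataclass(frozen=True)
class Handoff:
    escalates_to: str
    expected_action: str
    evidence_required: Tuple[str, ...]


# One row per timeout event, mirroring the handoff matrix above.
HANDOFF_MATRIX = {
    Timer.ACK: Handoff(
        "secondary on-call owner",
        "Post owner-claim comment and begin action plan",
        ("UTC breach timestamp", "pager receipt"),
    ),
    Timer.ACTION: Handoff(
        "incident commander",
        "Take execution ownership and publish decision path",
        ("missing-action proof", "command takeover note"),
    ),
    Timer.LEADERSHIP: Handoff(
        "governance lead / manager on duty",
        "Authorize bounded exception or enforce baseline restore",
        ("risk justification", "expiry", "restoration plan"),
    ),
}


def route(timer: Timer) -> Handoff:
    """Return the single destination for a breached timer."""
    return HANDOFF_MATRIX[timer]


def missing_evidence(timer: Timer, provided: Iterable[str]) -> List[str]:
    """List the evidence items from the matrix that have not been supplied."""
    have = set(provided)
    return [e for e in HANDOFF_MATRIX[timer].evidence_required if e not in have]


print(route(Timer.ACK).escalates_to)
print(missing_evidence(Timer.ACK, ["pager receipt"]))
```

Because the table is the only source of routing decisions, a change to the escalation path becomes a reviewable diff instead of an in-incident judgment call.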
5. 15-minute remediation workflow
When the ACK SLA is missed, run this sequence as a fixed protocol. Avoid ad-hoc negotiation until after ownership is restored.
- Declare breach: post ACK-timeout breach comment with exact UTC timestamp.
- Trigger handoff: page tier-2 owner and assign acting owner immediately.
- Freeze ambiguity: record exactly one acting owner in the PR to avoid parallel decision makers.
- Reconfirm risk: summarize current merge queue risk and rollback status in two to three lines.
- Start action timer: set new action SLA checkpoint from handoff acceptance time.
- Escalate if needed: if the first action is missed, route to the incident commander without debate.
- Close with proof: finalize timeline with all breach and handoff timestamps.
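The sequence above can be sketched as a small timeline recorder that refuses out-of-order steps and a second acting owner. This is a minimal sketch under assumptions: the step identifiers, the `BreachTimeline` class, and the choice to treat only the escalation step as optional are all hypothetical, not part of any GitHub or paging API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Step identifiers follow the order of the workflow list above.
STEPS = (
    "declare_breach", "trigger_handoff", "freeze_ambiguity", "reconfirm_risk",
    "start_action_timer", "escalate_if_needed", "close_with_proof",
)
# Assumption: escalation is the only step that may be skipped.
OPTIONAL_STEPS = {"escalate_if_needed"}


@dataclass
class BreachTimeline:
    incident_id: str
    acting_owner: Optional[str] = None
    entries: List[dict] = field(default_factory=list)

    def claim_ownership(self, handle: str) -> None:
        """Allow exactly one acting owner at a time."""
        if self.acting_owner is not None and self.acting_owner != handle:
            raise ValueError(f"{self.acting_owner} is already the acting owner")
        self.acting_owner = handle

    def record(self, step: str, note: str, at: Optional[datetime] = None) -> None:
        """Append a timestamped entry, rejecting steps taken out of order."""
        if step not in STEPS:
            raise ValueError(f"unknown step: {step}")
        done = {e["step"] for e in self.entries}
        earlier = STEPS[: STEPS.index(step)]
        missing = [s for s in earlier if s not in OPTIONAL_STEPS and s not in done]
        if missing:
            raise ValueError(f"cannot record {step}; missing: {missing}")
        at = at or datetime.now(timezone.utc)
        stamp = at.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")
        self.entries.append({"step": step, "note": note, "at_utc": stamp})


timeline = BreachTimeline("INC-EXAMPLE")
timeline.record("declare_breach", "Primary owner missed ACK window")
timeline.claim_ownership("tier2-handle")
print(timeline.entries[0]["step"], timeline.acting_owner)
```

The recorded entries double as the closing timeline, so the final step does not require reconstructing timestamps from chat history.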
6. Copy-paste communication templates
Use these templates directly in PR comments or incident channels to keep language consistent.
Template A: ACK timeout breach notice
[ACK TIMEOUT BREACH]
Incident: <id>
Severity: <SEV-1/2/3>
Primary owner: @<handle>
ACK deadline (UTC): <yyyy-mm-dd hh:mm>
Breach detected (UTC): <yyyy-mm-dd hh:mm>
Next owner (tier-2): @<handle>
Requested action: claim ownership and post first action within <N> min.
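If the notice is posted by automation rather than by hand, Template A can be filled from incident data. The sketch below is one possible approach using plain string formatting; `render_breach_notice`, `fmt_utc`, and the parameter names are hypothetical, and posting the result to a PR or channel is left out.

```python
from datetime import datetime, timedelta, timezone

TEMPLATE_A = """[ACK TIMEOUT BREACH]
Incident: {incident}
Severity: {severity}
Primary owner: @{primary_owner}
ACK deadline (UTC): {ack_deadline}
Breach detected (UTC): {breach_detected}
Next owner (tier-2): @{next_owner}
Requested action: claim ownership and post first action within {minutes} min."""


def fmt_utc(ts: datetime) -> str:
    """Format an aware timestamp as 'yyyy-mm-dd hh:mm' in UTC."""
    if ts.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M")


def render_breach_notice(incident: str, severity: str, primary_owner: str,
                         next_owner: str, ack_deadline: datetime,
                         breach_detected: datetime, minutes: int) -> str:
    """Fill Template A; all timestamps are converted to UTC first."""
    return TEMPLATE_A.format(
        incident=incident,
        severity=severity,
        primary_owner=primary_owner,
        next_owner=next_owner,
        ack_deadline=fmt_utc(ack_deadline),
        breach_detected=fmt_utc(breach_detected),
        minutes=minutes,
    )


deadline = datetime(2026, 1, 1, 12, 5, tzinfo=timezone.utc)
print(render_breach_notice("INC-EXAMPLE", "SEV-1", "primary", "secondary",
                           deadline, deadline + timedelta(minutes=1), 10))
```

Rejecting naive timestamps is a deliberate choice here: a notice with an ambiguous time zone would weaken the audit trail the template exists to create.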
Template B: Tier handoff confirmation
[HANDOFF ACCEPTED]
Acting owner: @<handle>
Accepted at (UTC): <yyyy-mm-dd hh:mm>
Action deadline (UTC): <yyyy-mm-dd hh:mm>
Current risk summary:
- <required checks state>
- <rollback status>
- <branch protection exception state>
Template C: Leadership cutover
[LEADERSHIP CUTOVER]
Cutover trigger: <Timer B breach / Timer C breach>
Cutover owner: @<handle>
Cutover time (UTC): <yyyy-mm-dd hh:mm>
Decision needed: <bounded extension / enforce baseline restore / rollback path>
Decision deadline (UTC): <yyyy-mm-dd hh:mm>
Evidence links: <links>
Template D: Breach closure record
[ACK BREACH CLOSURE]
Incident: <id>
Initial breach at (UTC): <yyyy-mm-dd hh:mm>
Final owner: @<handle>
Remediation completed at (UTC): <yyyy-mm-dd hh:mm>
Restoration proof: <required checks + protection baseline links>
Follow-up owner: @<handle>
Follow-up due (UTC): <yyyy-mm-dd hh:mm>
7. Anti-patterns and guardrails
| Anti-pattern | Why it fails | Guardrail |
|---|---|---|
| Waiting "a bit longer" after timeout | Extends unowned risk window | Auto-handoff trigger with no manual approval |
| Escalation in private chat only | No audit trail and unclear accountability | Mandatory PR timeline update before action |
| Multiple responders acting simultaneously | Conflicting rollback decisions | Single acting-owner declaration per phase |
| No post-breach review | Same team keeps missing ACK SLA | Weekly threshold review with ownership correction |
FAQ
Should an ACK timeout always trigger leadership escalation immediately?
Not always. Route first to tier-2 ownership if your timer model includes it, but leadership cutover should still be pre-timed and automatic if tier-2 fails.
Do we need separate timers for weekdays and weekends?
You can vary SLA values by staffing model, but keep the same timer structure and escalation path so responders do not relearn the process during incidents.
How do we measure if this runbook works?
Track ACK timeout frequency, median handoff completion time, repeated timeout rate by team, and closure completeness for breached incidents.
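As a rough illustration, these four measures can be computed from a list of incident records. The record keys (`team`, `ack_breached`, `handoff_minutes`, `closure_complete`) and the function name `runbook_metrics` are assumptions made for this sketch; adapt them to whatever your incident tracker actually exports.

```python
from collections import Counter
from statistics import median
from typing import Dict, List


def runbook_metrics(incidents: List[dict]) -> Dict[str, object]:
    """Summarize ACK-timeout behaviour over a set of incident records."""
    total = len(incidents)
    breached = [i for i in incidents if i["ack_breached"]]
    handoffs = [i["handoff_minutes"] for i in breached
                if i.get("handoff_minutes") is not None]
    by_team = Counter(i["team"] for i in breached)
    closed = sum(1 for i in breached if i["closure_complete"])
    return {
        # Share of all incidents where the ACK window was missed.
        "ack_timeout_frequency": len(breached) / total if total else 0.0,
        # Median minutes from breach to accepted handoff.
        "median_handoff_minutes": median(handoffs) if handoffs else None,
        # Breach counts per team, plus teams with more than one breach.
        "timeouts_by_team": dict(by_team),
        "repeat_teams": sorted(t for t, n in by_team.items() if n > 1),
        # Share of breached incidents with a complete closure record.
        "closure_completeness": closed / len(breached) if breached else None,
    }


sample = [
    {"team": "a", "ack_breached": True, "handoff_minutes": 4, "closure_complete": True},
    {"team": "a", "ack_breached": True, "handoff_minutes": 8, "closure_complete": False},
    {"team": "b", "ack_breached": False, "handoff_minutes": None, "closure_complete": True},
    {"team": "b", "ack_breached": True, "handoff_minutes": 6, "closure_complete": True},
]
print(runbook_metrics(sample))
```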
Can this runbook be used outside merge queue incidents?
Yes. The same timer and handoff model can be used for any incident class where delayed acknowledgement creates operational risk.
What is the minimum governance record for compliance reviews?
At minimum: initial alert timestamp, ACK deadline, breach timestamp, owner handoffs, final decision, and restoration proof links.
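A simple completeness check can flag records that would fail such a review before the incident is closed. The field names below are hypothetical keys chosen to mirror the list above; they are not a standard schema.

```python
from typing import List

# Hypothetical keys, one per item in the minimum governance record.
REQUIRED_FIELDS = (
    "initial_alert_at",
    "ack_deadline",
    "breach_at",
    "owner_handoffs",
    "final_decision",
    "restoration_proof_links",
)


def missing_record_fields(record: dict) -> List[str]:
    """Return required fields that are absent or empty in the record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


record = {
    "initial_alert_at": "2026-01-01 12:00",
    "ack_deadline": "2026-01-01 12:05",
    "breach_at": "2026-01-01 12:06",
    "owner_handoffs": ["primary -> tier-2"],
    "final_decision": "",
}
print(missing_record_fields(record))
```

Treating empty values as missing matters in practice, because a blank "final decision" field is as unhelpful to a reviewer as an absent one.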
ACK timeout breaches are predictable. Treat them as first-class governance events, automate the handoff path, and keep timeline evidence strict. This is how you stop silent ownership failures from becoming recurring incidents.