GitHub Merge Queue Runner Saturation vs Queue Starvation: Incident Decision Guide (2026)
The rollback PR is approved, but merge queue latency explodes. The hard part is not deciding whether to roll back; it is working out why the queue is stalling: a runner capacity bottleneck or queue starvation dynamics.
This guide gives you a practical decision model with explicit SLO thresholds so incident teams can unblock safely without breaking branch protection controls.
1. Quick decision matrix
| Observed signal | Likely class | Primary action |
|---|---|---|
| Many PRs wait 5-15+ min before checks start | Runner saturation | Increase runner concurrency, pause non-urgent workflows, prioritize incident queue labels. |
| Rollback PR keeps getting requeued after the branch tip moves | Queue starvation | Freeze noisy merges briefly, rebase rollback branch once, requeue on stable snapshot. |
| Checks run, then invalidate repeatedly without completion | Queue starvation | Reduce queue churn and required-check volatility; lock workflow config during incident. |
| Checks never start and logs are empty | Trigger miswire (often mistaken for saturation) | Verify merge_group triggers and job-level guards. |
| Runner queue depth climbs while merge queue length climbs | Runner saturation | Scale runners and lower parallel non-incident CI load. |
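The matrix can be read as a small decision function. The sketch below is illustrative only: the function name, the three inputs, and the "2 or more requeues" cutoff are assumptions, while the 5-minute start delay comes from the first row. Starvation is checked before saturation on the assumption that added capacity alone will not clear it.

```shell
# Sketch: map three observed signals to a likely incident class.
# Names and the requeue cutoff are illustrative assumptions, not GitHub terms.
#
# Args:
#   $1 runs_exist       "yes" if any merge_group check run was created, else "no"
#   $2 start_delay_min  typical wait (whole minutes) before checks start
#   $3 requeues         times the rollback entry was invalidated or requeued
classify_bottleneck() {
  local runs_exist="$1" start_delay_min="$2" requeues="$3"
  if [ "$runs_exist" = "no" ]; then
    echo "trigger-miswire"
  elif [ "$requeues" -ge 2 ]; then
    echo "queue-starvation"
  elif [ "$start_delay_min" -ge 5 ]; then
    echo "runner-saturation"
  else
    echo "unclassified"
  fi
}

classify_bottleneck yes 12 0   # prints: runner-saturation
```

Treat the result as a starting hypothesis for the triage steps, not a verdict; mixed incidents are common.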
2. Runner saturation vs queue starvation model
Runner saturation means jobs cannot start promptly because compute slots are full. You see long check start latency across many repositories or branches. Adding capacity or deprioritizing non-incident jobs helps quickly.
Queue starvation means queue entries are repeatedly invalidated or displaced before they can complete. Common triggers are rapid protected-branch movement, policy toggles, unstable required checks, and constant rebases. More runners alone may not solve this.
3. Incident SLO thresholds
Use explicit thresholds during incidents. Without thresholds, teams debate symptoms instead of shipping the rollback.
| Metric | Target | Escalation trigger |
|---|---|---|
| Check start latency (rollback PR) | < 3 minutes | > 5 minutes for two consecutive queue attempts |
| Queue-to-merge latency (rollback PR) | < 15 minutes | > 25 minutes with no policy failure |
| Queue invalidation rate | < 10% | > 25% during incident window |
| Runner utilization | < 80% sustained | > 90% for 10+ minutes |
These are practical starting points, not universal constants. Tune them per repo criticality and CI runtime distribution.
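One way to keep the escalation triggers from turning into a debate is to encode them. The following is a minimal sketch, assuming you already collect the metrics yourself and pass them as whole numbers; the function name and argument order are hypothetical, and the "no policy failure" qualifier on queue-to-merge latency is left for a human to judge.

```shell
# Sketch: print one line per breached escalation trigger from the table above.
# Prints nothing when every metric is within bounds.
#
# Args (whole numbers):
#   $1 $2  check start latency in minutes for the last two queue attempts
#   $3     queue-to-merge latency in minutes
#   $4 $5  invalidated queue attempts, total queue attempts
#   $6 $7  runner utilization percent, minutes it has been sustained
slo_breaches() {
  local start1="$1" start2="$2" q2m="$3"
  local invalidated="$4" attempts="$5" util="$6" util_min="$7"
  if [ "$start1" -gt 5 ] && [ "$start2" -gt 5 ]; then
    echo "check-start-latency"
  fi
  if [ "$q2m" -gt 25 ]; then
    echo "queue-to-merge-latency"
  fi
  # Integer form of invalidated/attempts > 25%.
  if [ "$attempts" -gt 0 ] && [ $((invalidated * 100)) -gt $((attempts * 25)) ]; then
    echo "queue-invalidation-rate"
  fi
  if [ "$util" -gt 90 ] && [ "$util_min" -ge 10 ]; then
    echo "runner-utilization"
  fi
}

slo_breaches 6 7 12 1 8 85 0   # prints: check-start-latency
```

If you tune the thresholds per repository, keeping them as named variables at the top of the script makes the change reviewable.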
4. 5-minute triage workflow
- Confirm required checks mapped in branch protection are the exact checks emitted in merge_group runs.
- Inspect runner queue depth and median job start delay for the last 10 minutes.
- Count queue invalidations/requeues on the rollback PR entry.
- Check whether protected branch tip moved repeatedly during rollback window.
- Classify as primary bottleneck: saturation, starvation, or trigger/policy miswire.
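For the median job start delay step, a small helper is enough. This sketch assumes you can produce one line per job containing two epoch-second timestamps (created, started); the helper name is hypothetical. The commented `gh api` pipeline shows one possible way to feed it from the workflow jobs endpoint, and it assumes the job's `created_at` to `started_at` gap is a fair proxy for runner pickup delay, which you should verify on your own runs.

```shell
# Sketch: median start delay in seconds.
# Input on stdin: one job per line, "<created_epoch> <started_epoch>".
median_start_delay() {
  awk '{ print $2 - $1 }' | sort -n | awk '
    { d[NR] = $1 }
    END {
      if (NR == 0) { print "n/a"; exit }
      if (NR % 2) print d[(NR + 1) / 2]
      else        print (d[NR / 2] + d[NR / 2 + 1]) / 2
    }'
}

# Possible data source (assumption; replace OWNER/REPO and RUN_ID):
# gh api "repos/OWNER/REPO/actions/runs/RUN_ID/jobs" \
#   --jq '.jobs[] | select(.started_at != null)
#         | "\(.created_at | fromdateiso8601) \(.started_at | fromdateiso8601)"' \
#   | median_start_delay

printf '0 60\n0 120\n0 300\n' | median_start_delay   # prints: 120
```

The median is used rather than the mean so that one stuck job does not dominate the signal.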
5. Queue-safe remediation playbook
A) If runner saturation is primary
- Temporarily increase self-hosted runner pool or GitHub-hosted parallelism budget.
- Pause non-incident pipelines (nightly/security backfills) for a short maintenance window.
- Prioritize rollback PR queue entry with incident labels and reviewer fast-path.
- Re-measure check start latency after 1 queue cycle.
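Pausing non-incident pipelines can be scripted so that it is quick to apply and easy to undo. The sketch below uses `gh workflow disable`; the helper name, the `DRY_RUN` switch, and the workflow file names are assumptions for illustration. It defaults to a dry run that only prints the commands. Disabling a workflow stops new runs from being triggered; it does not cancel runs already in progress.

```shell
# Sketch: disable a list of non-incident workflows for a maintenance window.
# DRY_RUN defaults to 1 (print only); set DRY_RUN=0 to actually call gh.
#
# Args: $1 repo as owner/repo, remaining args are workflow file names.
pause_workflows() {
  local repo="$1" wf
  shift
  for wf in "$@"; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "gh workflow disable $wf --repo $repo"
    else
      gh workflow disable "$wf" --repo "$repo"
    fi
  done
}

# Hypothetical workflow names; replace with your own.
pause_workflows owner/repo nightly.yml security-backfill.yml
```

Record which workflows you paused so that `gh workflow enable` can restore each one after the rollback lands.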
B) If queue starvation is primary
- Announce a short merge freeze for non-incident PRs on affected protected branch.
- Avoid repeated rebases; create one fresh rollback branch from latest tip, then queue once.
- Stabilize required checks list and workflow definitions until rollback lands.
- Remove optional gates that create churn but do not reduce immediate incident risk.
C) If trigger/policy miswire is primary
- Add missing merge_group triggers on required workflows.
- Remove job-level if: conditions that block merge-group contexts unintentionally.
- Re-run queue entry and verify required check names exactly match branch policy.
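A quick first pass at finding missing triggers is a text search over the workflow directory. The helper below is a rough heuristic under stated assumptions: it only checks whether the string `merge_group` appears anywhere in each file, so a commented-out trigger would pass, and it cannot tell which workflows are actually required. The helper name is hypothetical.

```shell
# Sketch: list workflow files that never mention merge_group.
# Arg: $1 workflow directory (defaults to .github/workflows).
workflows_missing_merge_group() {
  local dir="${1:-.github/workflows}" f
  for f in "$dir"/*.yml "$dir"/*.yaml; do
    [ -f "$f" ] || continue          # skip unmatched glob patterns
    grep -q 'merge_group' "$f" || echo "$f"
  done
}

workflows_missing_merge_group .github/workflows
```

Cross-check each file it prints against the required checks in branch protection before editing; workflows that are not required do not need the trigger.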
6. CLI and operations recipes
Example commands you can adapt in incident playbooks:
# List recent workflow runs to inspect start delays
# (replace owner/repo and workflow name)
gh run list --repo owner/repo --workflow ci.yml --limit 20
# View detailed run timings for a suspected slow run
gh run view RUN_ID --repo owner/repo
# Requeue by updating rollback branch from latest protected tip
git fetch origin
git switch rollback-incident
git rebase origin/main
# resolve conflicts if any
git push --force-with-lease origin rollback-incident
# Revert safely for shared/protected branches
git revert <bad_commit_sha>
git push origin HEAD
For related incident patterns, see Pending Checks Guide, merge_group Trigger Guide, Flaky Required Checks Guide, and Emergency Bypass Governance Guide.
7. FAQ
How do I quickly tell runner saturation from queue starvation in merge queue incidents?
If checks start late across many PRs, runner saturation is likely. If checks keep restarting or never stabilize on one rollback PR despite available runners, queue starvation is more likely.
What SLO thresholds are useful during rollback incidents?
A practical baseline is check-start latency under 3 minutes, queue-to-merge under 15 minutes for rollback PRs, and queue invalidation rate under 10 percent during incident windows.
What is the first safe action when rollback PR checks are pending forever?
Confirm required workflows run on merge_group, then inspect runner backlog and queue invalidations before retrying. This keeps branch protection intact while isolating root cause.
Can increasing runners alone fix queue starvation?
Not always. Runner scale helps capacity bottlenecks, but starvation caused by queue churn, policy gates, or unstable required checks needs queue hygiene and policy stabilization.
When should emergency bypass be considered?
Only after queue-safe remediation cannot meet incident recovery deadlines and designated approvers explicitly accept risk. Record decisions for audit and postmortem learning.
Related rollback guides
Diagnose non-starting or slow-starting checks in queue entries.
merge_group Trigger Guide: Fix checks that never start because workflows ignore merge-group events.
Timeout/Cancelled Checks Guide: Classify required-check timeout vs cancellation loops and stabilize rollback queue behavior.
Rollback Stuck Guide: End-to-end triage for rollback PRs blocked in merge queue.
Flaky Required Checks Guide: Stabilize intermittent CI failures during rollback incidents without bypassing policy.