GitHub Merge Queue Runner Saturation vs Queue Starvation: Incident Decision Guide (2026)

Published February 16, 2026 · 11 min read

The rollback PR is approved, but merge queue latency explodes. The hard part is not deciding whether to roll back; it is deciding why the queue is stalling: a runner capacity bottleneck or queue starvation.

This guide gives you a practical decision model with explicit SLO thresholds so incident teams can unblock safely without breaking branch protection controls.

Quick decision matrix
Runner saturation vs queue starvation model
Incident SLO thresholds
5-minute triage workflow
Queue-safe remediation playbook
CLI and operations recipes
FAQ

1. Quick decision matrix

Observed signal	Likely class	Primary action
Many PRs wait 5-15+ min before checks start	Runner saturation	Increase runner concurrency, pause non-urgent workflows, prioritize incident queue labels.
Rollback PR keeps getting requeued after branch tip moves	Queue starvation	Freeze noisy merges briefly, rebase rollback branch once, requeue on stable snapshot.
Checks run, then invalidate repeatedly without completion	Queue starvation	Reduce queue churn and required-check volatility; lock workflow config during incident.
Checks never start and logs are empty	Trigger miswire (often mistaken for saturation)	Verify `merge_group` triggers and job-level guards.
Runner queue depth climbs while merge queue length climbs	Runner saturation	Scale runners and lower parallel non-incident CI load.

Operational shortcut: If queue entries are invalidated faster than checks can finish, treat it as starvation first, even when runners are busy.
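
One way to make this shortcut concrete is to compare the average gap between invalidations with a typical required-check runtime. The sketch below is illustrative only: the function name `starvation_first` and its three inputs are hypothetical, and the numbers have to come from your own queue and CI observations.

```shell
# Hypothetical helper: prints "starvation-first" when the average gap between
# queue invalidations is shorter than a typical required-check run.
#   $1 = incident window length in seconds
#   $2 = number of invalidations/requeues observed in that window
#   $3 = typical required-check runtime in seconds
starvation_first() {
  window_s=$1; invalidations=$2; check_runtime_s=$3
  if [ "$invalidations" -eq 0 ]; then
    echo "no-invalidations"
    return 0
  fi
  # Integer division is precise enough for a triage decision.
  mean_gap_s=$(( window_s / invalidations ))
  if [ "$mean_gap_s" -lt "$check_runtime_s" ]; then
    echo "starvation-first"
  else
    echo "checks-can-finish"
  fi
}

# 6 invalidations in 30 minutes: 300 s average gap vs. an 8-minute check run
starvation_first 1800 6 480
```

With these sample values the average gap (300 s) is shorter than the check runtime (480 s), so no entry can finish before it is invalidated, regardless of how many runners are free.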

2. Runner saturation vs queue starvation model

Runner saturation means jobs cannot start promptly because compute slots are full. You see long check start latency across many repositories or branches. Adding capacity or deprioritizing non-incident jobs helps quickly.

Queue starvation means queue entries are repeatedly invalidated or displaced before they can complete. Common triggers are rapid protected-branch movement, policy toggles, unstable required checks, and constant rebases. More runners alone may not solve this.

Common mistake: teams add runners while the branch tip keeps moving every few minutes. Throughput improves slightly, but the rollback PR still never exits the queue because queue snapshots keep resetting.

3. Incident SLO thresholds

Use explicit thresholds during incidents. Without thresholds, teams debate symptoms instead of shipping the rollback.

Metric	Target	Escalation trigger
Check start latency (rollback PR)	< 3 minutes	> 5 minutes for two consecutive queue attempts
Queue-to-merge latency (rollback PR)	< 15 minutes	> 25 minutes with no policy failure
Queue invalidation rate	< 10%	> 25% during incident window
Runner utilization	< 80% sustained	> 90% for 10+ minutes

These are practical starting points, not universal constants. Tune them per repo criticality and CI runtime distribution.
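
The escalation triggers in the table can be encoded as a small check so that nobody has to recompute them under pressure. This is a minimal sketch with assumptions: the function names are hypothetical, the invalidation rate is taken to mean invalidated entries divided by total queue entries in the window, and the "no policy failure" condition on queue-to-merge latency is left to the operator.

```shell
# Invalidation rate as an integer percentage.
# Assumption: rate = invalidated entries / total queue entries in the window.
invalidation_rate_pct() {
  invalidated=$1; total=$2
  [ "$total" -gt 0 ] || { echo 0; return 0; }
  echo $(( invalidated * 100 / total ))
}

# Prints one line per breached escalation trigger from the table above.
#   $1, $2 = check start latency (seconds) for the last two queue attempts
#   $3     = queue-to-merge latency in minutes
#   $4     = invalidation rate in percent
#   $5     = minutes that runner utilization has stayed above 90%
slo_escalations() {
  if [ "$1" -gt 300 ] && [ "$2" -gt 300 ]; then echo "check-start-latency"; fi
  if [ "$3" -gt 25 ]; then echo "queue-to-merge-latency"; fi
  if [ "$4" -gt 25 ]; then echo "invalidation-rate"; fi
  if [ "$5" -ge 10 ]; then echo "runner-utilization"; fi
  return 0
}

# Example: 3 of 10 entries invalidated, two slow check starts in a row
slo_escalations 320 360 12 "$(invalidation_rate_pct 3 10)" 0
```

If you tune the thresholds per repository, change the literals in one place and keep the function in the incident runbook next to the table.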

4. 5-minute triage workflow

Confirm required checks mapped in branch protection are the exact checks emitted in merge_group runs.
Inspect runner queue depth and median job start delay for the last 10 minutes.
Count queue invalidations/requeues on the rollback PR entry.
Check whether protected branch tip moved repeatedly during rollback window.
Classify the primary bottleneck as saturation, starvation, or trigger/policy miswire.

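Classification rule: Saturation = jobs are queued but do not start because runners are full. Starvation = jobs start, but the queue entry lifecycle never stabilizes. Miswire = no run is created for the queue entry at all.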
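
The classification rule can be written down as a three-way decision. The sketch below is an illustration, not a definitive implementation: `classify_bottleneck` and its inputs are hypothetical, and the cutoff of two invalidations is an assumed starting value.

```shell
# Hypothetical classifier for the primary bottleneck.
#   $1 = merge_group workflow runs created for the queue entry
#   $2 = jobs from those runs that have actually started executing
#   $3 = times the queue entry was invalidated or requeued
classify_bottleneck() {
  runs_triggered=$1; jobs_started=$2; invalidations=$3
  if [ "$runs_triggered" -eq 0 ]; then
    echo "trigger-miswire"   # nothing was ever scheduled for the entry
  elif [ "$jobs_started" -eq 0 ]; then
    echo "saturation"        # runs exist, but no runner has picked them up
  elif [ "$invalidations" -ge 2 ]; then
    echo "starvation"        # jobs run, but the entry keeps being reset (assumed cutoff: 2)
  else
    echo "unclassified"
  fi
}

classify_bottleneck 3 3 4
```

The order of the branches matters: a miswire looks like saturation from the outside, so it is ruled out first.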

5. Queue-safe remediation playbook

A) If runner saturation is primary

Temporarily increase self-hosted runner pool or GitHub-hosted parallelism budget.
Pause non-incident pipelines (nightly/security backfills) for a short maintenance window.
Prioritize rollback PR queue entry with incident labels and reviewer fast-path.
Re-measure check start latency after 1 queue cycle.

B) If queue starvation is primary

Announce a short merge freeze for non-incident PRs on affected protected branch.
Avoid repeated rebases; create one fresh rollback branch from latest tip, then queue once.
Stabilize required checks list and workflow definitions until rollback lands.
Remove optional gates that create churn but do not reduce immediate incident risk.

C) If trigger/policy miswire is primary

Add missing merge_group triggers on required workflows.
Remove job-level if: conditions that block merge-group contexts unintentionally.
Re-run queue entry and verify required check names exactly match branch policy.
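
A quick way to spot candidate workflows for the first step is to list workflow files that never mention `merge_group`. This is a rough text-match heuristic under stated assumptions: the helper name is hypothetical, it only looks at `.yml` and `.yaml` files in one directory, and a match inside a comment would hide a missing trigger, so confirm each hit by reading the `on:` block.

```shell
# Hypothetical helper: prints workflow files that contain no "merge_group" text.
#   $1 = workflow directory (defaults to .github/workflows)
workflows_missing_merge_group() {
  wf_dir=${1:-.github/workflows}
  for wf_file in "$wf_dir"/*.yml "$wf_dir"/*.yaml; do
    # Skip unmatched glob patterns and anything that is not a regular file.
    [ -f "$wf_file" ] || continue
    grep -q 'merge_group' "$wf_file" || printf '%s\n' "$wf_file"
  done
}

# Run from the repository root:
#   workflows_missing_merge_group
```

Only workflows that produce required checks need the trigger, so compare the output against the required checks list before editing anything.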

Bypass policy: emergency bypass is the last resort after documented triage and explicit risk acceptance. Keep a written timeline for audit.

6. CLI and operations recipes

Example commands you can adapt in incident playbooks:

# List recent workflow runs to inspect start delays
# (replace owner/repo and workflow name)
gh run list --repo owner/repo --workflow ci.yml --limit 20

# View detailed run timings for a suspected slow run
gh run view RUN_ID --repo owner/repo

# Requeue by updating rollback branch from latest protected tip
git fetch origin
git switch rollback-incident
git rebase origin/main
# resolve conflicts if any
git push --force-with-lease

# Revert safely for shared/protected branches
git revert <bad_commit_sha>
git push origin HEAD
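
To turn run timestamps into a check start latency you can compare against the 3-minute and 5-minute thresholds, a small helper is enough. This sketch assumes GNU `date` (the `-d` option is not available in BSD/macOS `date`), and the function name is hypothetical.

```shell
# Seconds between two ISO 8601 UTC timestamps (requires GNU date for -d).
#   $1 = time the run or job was created
#   $2 = time it actually started executing
start_delay_seconds() {
  created_epoch=$(date -u -d "$1" +%s) || return 1
  started_epoch=$(date -u -d "$2" +%s) || return 1
  echo $(( started_epoch - created_epoch ))
}

start_delay_seconds 2026-02-16T10:00:00Z 2026-02-16T10:04:30Z   # 270
```

The timestamps themselves can likely be pulled from `gh run view RUN_ID --repo owner/repo --json createdAt,jobs`, assuming your `gh` version exposes those JSON fields; check `gh run view --help` for the list it supports.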

For related incident patterns, see Pending Checks Guide, merge_group Trigger Guide, Flaky Required Checks Guide, and Emergency Bypass Governance Guide.

7. FAQ

How do I quickly tell runner saturation from queue starvation in merge queue incidents?

If checks start late across many PRs, runner saturation is likely. If checks keep restarting or never stabilize on one rollback PR despite available runners, queue starvation is more likely.

What SLO thresholds are useful during rollback incidents?

A practical baseline is check-start latency under 3 minutes, queue-to-merge under 15 minutes for rollback PRs, and queue invalidation rate under 10 percent during incident windows.

What is the first safe action when rollback PR checks are pending forever?

Confirm required workflows run on merge_group, then inspect runner backlog and queue invalidations before retrying. This keeps branch protection intact while isolating root cause.

Can increasing runners alone fix queue starvation?

Not always. Runner scale helps capacity bottlenecks, but starvation caused by queue churn, policy gates, or unstable required checks needs queue hygiene and policy stabilization.

When should emergency bypass be considered?

Only after queue-safe remediation cannot meet incident recovery deadlines and designated approvers explicitly accept risk. Record decisions for audit and postmortem learning.