GitHub Merge Queue Runner Saturation vs Queue Starvation: Incident Decision Guide (2026)

Published February 16, 2026 · 11 min read

The rollback PR is approved, but merge queue latency explodes. The hard part is not deciding whether to roll back. It is deciding why the queue is stalling: a runner capacity bottleneck or queue starvation dynamics.

This guide gives you a practical decision model with explicit SLO thresholds so incident teams can unblock safely without breaking branch protection controls.


Table of contents

  1. Quick decision matrix
  2. Runner saturation vs queue starvation model
  3. Incident SLO thresholds
  4. 5-minute triage workflow
  5. Queue-safe remediation playbook
  6. CLI and operations recipes
  7. FAQ

1. Quick decision matrix

| Observed signal | Likely class | Primary action |
| --- | --- | --- |
| Many PRs wait 5-15+ min before checks start | Runner saturation | Increase runner concurrency, pause non-urgent workflows, prioritize incident queue labels. |
| Rollback PR keeps getting requeued after the branch tip moves | Queue starvation | Freeze noisy merges briefly, rebase the rollback branch once, requeue on a stable snapshot. |
| Checks run, then are invalidated repeatedly without completing | Queue starvation | Reduce queue churn and required-check volatility; lock workflow config during the incident. |
| Checks never start and logs are empty | Trigger miswire (often mistaken for saturation) | Verify merge_group triggers and job-level guards. |
| Runner queue depth climbs while merge queue length climbs | Runner saturation | Scale runners and lower parallel non-incident CI load. |

Operational shortcut: If queue entries are invalidated faster than checks can finish, treat it as starvation first, even when runners are busy.

2. Runner saturation vs queue starvation model

Runner saturation means jobs cannot start promptly because compute slots are full. You see long check start latency across many repositories or branches. Adding capacity or deprioritizing non-incident jobs helps quickly.

Queue starvation means queue entries are repeatedly invalidated or displaced before they can complete. Common triggers are rapid protected-branch movement, policy toggles, unstable required checks, and constant rebases. More runners alone may not solve this.

Common mistake: teams add runners while the branch tip keeps moving every few minutes. Throughput improves slightly, but the rollback PR still never exits the queue because queue snapshots keep resetting.
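The common mistake above comes down to simple arithmetic: extra runners shorten the wait before checks start, but they do not shorten the checks themselves. A minimal sketch of that comparison follows. The function name `entry_can_finish` and both inputs are hypothetical; you would supply the numbers from your own CI dashboards.

```shell
# Hypothetical helper: does a queue entry get enough quiet time to finish?
# Both arguments are whole minutes supplied by the caller.
#   $1 = typical runtime of the required checks
#   $2 = typical gap between events that invalidate the entry
#        (for example, protected branch tip moves)
entry_can_finish() {
  check_runtime_min="$1"
  churn_interval_min="$2"
  if [ "$check_runtime_min" -lt "$churn_interval_min" ]; then
    echo "can-finish"
  else
    echo "starved"
  fi
}

entry_can_finish 12 4    # prints: starved
entry_can_finish 12 30   # prints: can-finish
```

With 12-minute checks and the tip moving every 4 minutes, the entry is starved no matter how many runners are idle, which is why freezing noisy merges comes before scaling.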

3. Incident SLO thresholds

Use explicit thresholds during incidents. Without thresholds, teams debate symptoms instead of shipping the rollback.

| Metric | Target | Escalation trigger |
| --- | --- | --- |
| Check start latency (rollback PR) | < 3 minutes | > 5 minutes for two consecutive queue attempts |
| Queue-to-merge latency (rollback PR) | < 15 minutes | > 25 minutes with no policy failure |
| Queue invalidation rate | < 10% | > 25% during incident window |
| Runner utilization | < 80% sustained | > 90% for 10+ minutes |

These are practical starting points, not universal constants. Tune them per repo criticality and CI runtime distribution.
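The escalation triggers in the table can be encoded as a small check so that nobody has to recall them under pressure. This is only a sketch: the function name `slo_breaches`, the breach labels, and the argument order are assumptions, and the caller is expected to collect the numbers (as whole minutes or percentages) from whatever monitoring is available. The "no policy failure" condition on queue-to-merge latency is not modeled and still needs a human check.

```shell
# Hypothetical helper: prints one line per breached escalation trigger.
# All inputs are whole numbers supplied by the caller; nothing is fetched.
#   $1, $2 = check start latency (min) for the last two queue attempts
#   $3     = queue-to-merge latency (min) for the rollback PR
#   $4     = queue invalidation rate (%) during the incident window
#   $5, $6 = runner utilization (%) and how long it has held (min)
slo_breaches() {
  start1="$1"; start2="$2"
  queue_to_merge="$3"
  invalid_pct="$4"
  util_pct="$5"; util_min="$6"

  if [ "$start1" -gt 5 ] && [ "$start2" -gt 5 ]; then
    echo "check_start_latency"
  fi
  if [ "$queue_to_merge" -gt 25 ]; then
    echo "queue_to_merge_latency"
  fi
  if [ "$invalid_pct" -gt 25 ]; then
    echo "queue_invalidation_rate"
  fi
  if [ "$util_pct" -gt 90 ] && [ "$util_min" -ge 10 ]; then
    echo "runner_utilization"
  fi
  return 0
}
```

For example, `slo_breaches 6 7 10 30 95 12` would report the start latency, invalidation rate, and runner utilization triggers, while a single slow attempt (`slo_breaches 2 6 10 5 85 20`) reports nothing.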

4. 5-minute triage workflow

  1. Confirm that the required checks mapped in branch protection are the exact checks emitted in merge_group runs.
  2. Inspect runner queue depth and median job start delay for the last 10 minutes.
  3. Count queue invalidations/requeues on the rollback PR entry.
  4. Check whether the protected branch tip moved repeatedly during the rollback window.
  5. Classify the primary bottleneck as saturation, starvation, or trigger/policy miswire.

Classification rule: Saturation = jobs do not start. Starvation = jobs start but the queue entry lifecycle never stabilizes.
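The workflow and classification rule above can be sketched as one decision function. Everything here is illustrative: `classify_bottleneck` is a hypothetical name, the inputs are counts and minutes you gather in steps 1 to 4, and the default of two invalidations as the "unstable" cutoff is an assumption to tune. The 5-minute start latency cutoff is borrowed from the escalation trigger in section 3.

```shell
# Hypothetical classifier for the triage result.
#   $1 = number of recent runs triggered by the merge_group event
#   $2 = check start latency in minutes for the rollback PR
#   $3 = invalidations/requeues of the rollback entry in the incident window
#   $4 = invalidation count treated as "unstable" (assumed default: 2)
classify_bottleneck() {
  merge_group_runs="$1"
  start_latency_min="$2"
  invalidations="$3"
  unstable_at="${4:-2}"

  if [ "$merge_group_runs" -eq 0 ]; then
    echo "miswire"      # checks never start: verify merge_group triggers first
  elif [ "$invalidations" -ge "$unstable_at" ]; then
    echo "starvation"   # treated as starvation first, even when runners are busy
  elif [ "$start_latency_min" -gt 5 ]; then
    echo "saturation"   # jobs are not starting promptly
  else
    echo "unclear"
  fi
}
```

One way to obtain the first input, assuming a GitHub CLI version that supports the `--event` filter on `gh run list`, is `gh run list --repo owner/repo --event merge_group --limit 20 --json databaseId --jq 'length'`. The starvation branch is checked before saturation on purpose, so that churn is addressed before capacity.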

5. Queue-safe remediation playbook

A) If runner saturation is primary

B) If queue starvation is primary

C) If trigger/policy miswire is primary

Bypass policy: emergency bypass is the last resort after documented triage and explicit risk acceptance. Keep a written timeline for audit.

6. CLI and operations recipes

Example commands you can adapt in incident playbooks:

# List recent workflow runs to inspect start delays
# (replace owner/repo and workflow name)
gh run list --repo owner/repo --workflow ci.yml --limit 20

# View detailed run timings for a suspected slow run
gh run view RUN_ID --repo owner/repo

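# Requeue by updating rollback branch from latest protected tip
git fetch origin
git switch rollback-incident
git rebase origin/main
# resolve conflicts if any, then continue with: git rebase --continue
# name the remote and branch explicitly so the lease check targets the right ref
git push --force-with-lease origin rollback-incident

# Revert safely for shared/protected branches:
# git revert adds a new commit instead of rewriting history
git revert <bad_commit_sha>
git push origin HEAD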
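For the trigger miswire case, a quick local scan can flag workflow files that never mention merge_group. The sketch below is a first pass only: `workflows_missing_merge_group` is a hypothetical helper, the default path assumes the conventional `.github/workflows` directory, and a plain text search does not parse YAML, so a match inside a comment still counts as present.

```shell
# Hypothetical helper: list workflow files that never mention merge_group.
#   $1 = workflow directory (assumed default: .github/workflows)
workflows_missing_merge_group() {
  dir="${1:-.github/workflows}"
  for f in "$dir"/*.yml "$dir"/*.yaml; do
    # Skip unmatched glob patterns and anything that is not a regular file.
    [ -f "$f" ] || continue
    # Print the file name only when the event name is absent.
    grep -q 'merge_group' "$f" || echo "$f"
  done
  return 0
}
```

Run it from the repository root with no arguments. Any file it prints is worth opening to confirm whether that workflow backs a required check, since only required workflows need to run on merge_group.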

For related incident patterns, see Pending Checks Guide, merge_group Trigger Guide, Flaky Required Checks Guide, and Emergency Bypass Governance Guide.

7. FAQ

How do I quickly tell runner saturation from queue starvation in merge queue incidents?

If checks start late across many PRs, runner saturation is likely. If checks keep restarting or never stabilize on one rollback PR despite available runners, queue starvation is more likely.

What SLO thresholds are useful during rollback incidents?

A practical baseline is check-start latency under 3 minutes, queue-to-merge under 15 minutes for rollback PRs, and queue invalidation rate under 10 percent during incident windows.

What is the first safe action when rollback PR checks are pending forever?

Confirm required workflows run on merge_group, then inspect runner backlog and queue invalidations before retrying. This keeps branch protection intact while isolating root cause.

Can increasing runners alone fix queue starvation?

Not always. Runner scale helps capacity bottlenecks, but starvation caused by queue churn, policy gates, or unstable required checks needs queue hygiene and policy stabilization.

When should emergency bypass be considered?

Only after queue-safe remediation cannot meet incident recovery deadlines and designated approvers explicitly accept risk. Record decisions for audit and postmortem learning.

Related rollback guides

Pending Checks Rollback Guide

Diagnose non-starting or slow-starting checks in queue entries.

merge_group Trigger Guide

Fix checks that never start because workflows ignore merge-group events.

Timeout/Cancelled Checks Guide

Classify required-check timeout vs cancellation loops and stabilize rollback queue behavior.

Rollback Stuck Guide

End-to-end triage for rollback PRs blocked in merge queue.

Flaky Required Checks Guide

Stabilize intermittent CI failures during rollback incidents without bypassing policy.