GitHub Merge Queue Required Checks Timed Out or Cancelled: Rollback Incident Runbook (2026)
Your rollback PR enters the merge queue, required checks start, then one job times out. You requeue, and the next run gets cancelled before completion. During incidents, this timeout/cancel loop can block recovery even when the rollback itself is correct.
This guide gives a practical workflow for required checks that time out or get cancelled in GitHub merge queue: classify the failure mode fast, remove sources of instability, and land the rollback safely without abandoning branch protection.
1. Timeout vs cancelled: fast classification
| Observed signal | Likely class | Immediate action |
|---|---|---|
| Job reaches max runtime and exits with timeout | Execution timeout | Inspect slow step, dependency latency, and runner contention. |
| Job is cancelled shortly after queue entry updates | Queue invalidation cancellation | Reduce branch churn and requeue once on fresh snapshot. |
| Job cancelled by concurrency rule on same ref | Workflow concurrency conflict | Adjust concurrency group to avoid canceling incident rollback runs. |
| Timeouts spike only during incidents | Capacity collapse + queue pressure | Prioritize incident runners and defer non-critical workflows. |
| Timeout followed by random pass/fail outcomes | Mixed timeout + flake | Apply deterministic test profile and controlled retry policy. |
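The per-run rows of the table above can be sketched as a small classifier. This is a minimal illustration, not a GitHub API: the `RunSignal` fields and the returned class labels are hypothetical names, and the capacity-collapse row is left out because it describes an aggregate trend rather than a single run.

```python
from dataclasses import dataclass


@dataclass
class RunSignal:
    """Observed facts about one failed required-check run (hypothetical fields)."""
    conclusion: str                    # e.g. "timed_out" or "cancelled"
    base_moved: bool = False           # protected branch updated while the run was in flight
    superseded_in_group: bool = False  # a newer run in the same concurrency group started
    flaky_history: bool = False        # recent reruns alternated between pass and fail


def classify(signal: RunSignal) -> str:
    """Map one failed run to a failure class from the classification table."""
    if signal.conclusion == "timed_out":
        return "mixed-timeout-flake" if signal.flaky_history else "execution-timeout"
    if signal.conclusion == "cancelled":
        # Check the concurrency rule first: it is the cheapest cause to confirm.
        if signal.superseded_in_group:
            return "concurrency-conflict"
        if signal.base_moved:
            return "queue-invalidation"
        return "cancelled-unknown"
    return "unclassified"
```

A run that fits none of the rows comes back as `cancelled-unknown` or `unclassified`, which is a prompt to inspect it by hand rather than guess.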
2. Root causes behind timeout/cancel loops
- Unbounded job runtime: long integration tests, hung network calls, or missing step-level timeout guards.
- Runner pressure: rollback jobs wait too long, then hit overall timeout windows.
- Concurrency misconfiguration: `cancel-in-progress: true` cancels active rollback checks on new queue attempts.
- Snapshot churn: protected branch moves repeatedly, invalidating in-flight queue entries.
- Policy churn: required check names or workflow mappings changed mid-incident.
- Dependency instability: package index latency, API rate limits, or transient DNS issues inflate runtimes.
3. 8-minute incident triage workflow
- Label each failed run as `timeout` or `cancelled` (never mix labels).
- Measure check start delay vs execution duration for rollback runs.
- Verify recent protected-branch updates during each cancellation timestamp.
- Inspect workflow `concurrency` groups for cancel collisions on rollback refs.
- Confirm required workflows include `merge_group` event and correct job guards.
- Identify top 1-2 slowest steps in timed-out jobs.
- Apply incident runner priority and pause non-urgent merges briefly.
- Requeue rollback once after controls are in place; avoid unlimited reruns.
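The start-delay measurement in the list above can be sketched as follows. It assumes you have three ISO 8601 timestamps per run (queued, started, completed); the function and parameter names are illustrative, not part of any GitHub client.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse a UTC timestamp such as 2026-02-16T18:08:00Z."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def split_wait_and_run(created_at: str, started_at: str, completed_at: str):
    """Return (start_delay_seconds, execution_seconds) for one run."""
    created = parse_ts(created_at)
    started = parse_ts(started_at)
    completed = parse_ts(completed_at)
    return (started - created).total_seconds(), (completed - started).total_seconds()


def dominant_cost(created_at: str, started_at: str, completed_at: str) -> str:
    """Say whether waiting for a runner or running the job consumed more time."""
    wait, run = split_wait_and_run(created_at, started_at, completed_at)
    return "start-delay" if wait > run else "execution"
```

A run dominated by start delay points at runner pressure; one dominated by execution points at slow steps inside the job.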
Keep an incident ledger with one reason code per rerun (for example: timeout-network, cancel-base-churn, cancel-concurrency). This is critical for post-incident fixes.
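One way to enforce "one reason code per rerun" and a bounded number of reruns is a small in-memory ledger. This is a sketch under assumptions: the `RetryLedger` class, its default cap of 3, and the rule that any outcome other than `pass` counts as a failed attempt are all illustrative choices.

```python
from dataclasses import dataclass, field


@dataclass
class RetryLedger:
    max_reruns: int = 3  # assumed cap; tune to your incident policy
    entries: list = field(default_factory=list)

    def record(self, timestamp: str, check: str, outcome: str, reason: str) -> None:
        """Append one entry; reject empty or combined reason codes."""
        if not reason or any(ch in reason for ch in ", \t"):
            raise ValueError("exactly one reason code per rerun")
        self.entries.append((timestamp, check, outcome, reason))

    def failed_attempts(self) -> int:
        """Count entries whose outcome is anything other than 'pass'."""
        return sum(1 for _, _, outcome, _ in self.entries if outcome != "pass")

    def may_rerun(self) -> bool:
        """True while the failed-attempt count is still under the cap."""
        return self.failed_attempts() < self.max_reruns
```

Checking `may_rerun()` before each requeue turns "avoid unlimited reruns" into a hard stop that forces a decision once the budget is spent.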
4. Stabilization playbook for rollback PRs
A) Timeout containment
- Add explicit step-level timeouts to long-running integration tasks.
- Cache dependencies and pin mirrors to reduce network variance.
- Move high-variance checks out of the required rollback gate when policy permits an incident profile.
B) Cancellation containment
- Freeze non-incident merges for a short window (10-15 minutes).
- Rebase rollback branch once, then avoid additional churn until merge.
- Adjust concurrency keys so rollback queue jobs are not cancelled by unrelated PR pushes.
C) Governance containment
- Use a preapproved incident-required-check profile with automatic expiry.
- Track who changed queue policy and when during the incident.
- Open postmortem tasks before closing incident to restore strict baseline checks.
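The "incident profile with automatic expiry" idea can be sketched as data plus one lookup. Everything here is hypothetical: the `IncidentCheckProfile` class, the `integration-slow` check name, and the two-hour lifetime are illustrations, and applying the result to branch protection is out of scope.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


@dataclass(frozen=True)
class IncidentCheckProfile:
    required_checks: Tuple[str, ...]
    activated_at: datetime
    ttl: timedelta
    activated_by: str  # recorded so policy changes stay attributable

    def is_active(self, now: datetime) -> bool:
        """The profile applies only until activated_at + ttl."""
        return now < self.activated_at + self.ttl


def effective_required_checks(
    baseline: Tuple[str, ...],
    profile: Optional[IncidentCheckProfile],
    now: datetime,
) -> Tuple[str, ...]:
    """Use the incident profile while it is active, otherwise the strict baseline."""
    if profile is not None and profile.is_active(now):
        return profile.required_checks
    return baseline


baseline = ("required-ci", "integration-slow")
start = datetime(2026, 2, 16, 18, 0, tzinfo=timezone.utc)
profile = IncidentCheckProfile(("required-ci",), start, timedelta(hours=2), "oncall")
```

Because expiry is evaluated on every lookup, the baseline comes back on its own if nobody remembers to revert the profile.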
5. Workflow + CLI recipes
Queue-safe rollback refresh:
git checkout rollback/incident-2026-02-16
git fetch origin
git rebase origin/main
git push --force-with-lease
Merge queue workflow with explicit timeout and safer concurrency:
name: required-ci
on:
  pull_request:
  merge_group:
concurrency:
  # Keyed by event and ref, so a run is only cancelled by a newer run
  # for the same event and the same ref.
  group: required-ci-${{ github.event_name }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  verify:
    if: ${{ github.event_name == 'pull_request' || github.event_name == 'merge_group' }}
    runs-on: ubuntu-latest
    timeout-minutes: 20  # job-level budget
    steps:
      - uses: actions/checkout@v4
      - name: Install deps
        timeout-minutes: 6  # step-level guard against slow package indexes
        run: pip install -r requirements.txt --require-hashes
      - name: Run tests
        timeout-minutes: 12  # step-level guard against hung tests
        run: pytest -q --maxfail=1
Incident retry ledger example:
# timestamp, check, outcome, reason
2026-02-16T18:08:00Z required-ci timeout timeout-dependency-latency
2026-02-16T18:14:00Z required-ci cancelled cancel-base-churn
2026-02-16T18:21:00Z required-ci pass retry-after-stabilization
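A ledger in the whitespace-separated layout above is easy to summarize after the incident. The sketch below assumes exactly four fields per line and `#` for comment lines; `parse_ledger` and `tally_outcomes` are illustrative helper names.

```python
from collections import Counter

LEDGER = """\
# timestamp, check, outcome, reason
2026-02-16T18:08:00Z required-ci timeout timeout-dependency-latency
2026-02-16T18:14:00Z required-ci cancelled cancel-base-churn
2026-02-16T18:21:00Z required-ci pass retry-after-stabilization
"""


def parse_ledger(text: str) -> list:
    """Turn ledger lines into dicts, skipping blank lines and comments."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        timestamp, check, outcome, reason = line.split()
        rows.append({"timestamp": timestamp, "check": check,
                     "outcome": outcome, "reason": reason})
    return rows


def tally_outcomes(rows: list) -> Counter:
    """Count how many attempts ended in each outcome."""
    return Counter(row["outcome"] for row in rows)
```

The same tally grouped by `reason` instead of `outcome` gives the input for post-incident fixes.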
6. Guardrails and SLO thresholds
| Signal | Healthy target | Incident threshold |
|---|---|---|
| Rollback required-check timeout rate | < 3% | > 10% in 30 minutes |
| Rollback required-check cancellation rate | < 5% | > 15% in 30 minutes |
| Queue invalidation rate (rollback entries) | < 8% | > 20% |
| Rollback queue-to-merge time | < 15 minutes | > 30 minutes |
Alert on timeout rate and cancellation rate as two separate triggers. A timeout spike without a cancellation spike usually indicates runtime bottlenecks. A cancellation spike without a timeout spike usually indicates queue churn or policy/concurrency issues.
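The two incident thresholds from the table can be combined into a rough diagnosis. This is a sketch, assuming rates are computed over the same 30-minute window; the function names and returned labels are illustrative.

```python
TIMEOUT_INCIDENT_RATE = 0.10  # "> 10% in 30 minutes" from the table
CANCEL_INCIDENT_RATE = 0.15   # "> 15% in 30 minutes" from the table


def rate(count: int, total: int) -> float:
    """Fraction of runs with a given outcome; 0.0 when there were no runs."""
    return count / total if total else 0.0


def diagnose(timeout_rate: float, cancellation_rate: float) -> str:
    """Interpret the two rates against their incident thresholds."""
    timeout_spike = timeout_rate > TIMEOUT_INCIDENT_RATE
    cancel_spike = cancellation_rate > CANCEL_INCIDENT_RATE
    if timeout_spike and cancel_spike:
        return "combined"
    if timeout_spike:
        return "runtime-bottleneck"
    if cancel_spike:
        return "queue-churn"
    return "below-incident-threshold"
```

For example, 3 timeouts and 1 cancellation across 20 rollback runs gives rates of 0.15 and 0.05, which this sketch reports as `runtime-bottleneck`.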
7. FAQ
What is the difference between timed-out and cancelled required checks in merge queue?
Timed-out checks exceed allowed runtime. Cancelled checks are interrupted by queue invalidation, concurrency cancellation, manual actions, or policy changes before completion.
Should we remove required checks to land rollback faster during an incident?
Avoid permanent removal. Use a temporary incident required-check profile with explicit expiry and restore the baseline after stabilization.
Why do cancellations repeat even when rollback code is unchanged?
Merge queue evaluates integration snapshots. Base updates, queue reordering, or concurrency settings can invalidate the same rollback entry repeatedly.
What is the safest first action when rollback checks keep timing out?
Classify whether the issue is start-delay saturation, execution bottleneck, or queue cancellation. Stabilize runner priority and queue inputs before another rerun.
How do we prevent timeout and cancellation loops after incident closure?
Track timeout/cancellation metrics, enforce timeout budgets per workflow step, refine concurrency keys, and define a merge freeze protocol for severe rollback incidents.