Change failure rate: What it is and how to improve it in 2026
Change failure rate (CFR) tracks how often deployments cause production failures. This guide covers the DORA formula, 2026 benchmarks, and practices that actually move the number down.

Every engineering team ships broken code. The question isn't whether it happens. It's how often, how quickly you catch it, and what it costs you. Change failure rate is the metric that turns that gut feeling into a number you can actually manage. If you're tracking DORA metrics, CFR tells you whether your speed is sustainable. If you're not tracking it yet, it's the fastest way to find out whether you're just shipping bugs faster. This guide walks through what CFR is, how to calculate it correctly, the benchmarks that matter in 2026, and the levers that actually move the number, including a few that most teams overlook.
In software development and DevOps, change failure rate (CFR) is the percentage of production deployments that result in a failure. A "failure" typically means degraded service or another production issue that requires remediation, such as a rollback, hotfix, patch, or incident response. If your team deploys 100 times in a month and 8 of those deployments require a rollback or hotfix, your CFR is 8 percent.
CFR sits in tension with deployment frequency on purpose. It's easy to ship more often. It's harder to ship more often without breaking things. CFR keeps teams honest about that tradeoff.
CFR is one of the four key metrics defined by Google's DORA research program, alongside deployment frequency, lead time for changes, and mean time to recovery (MTTR). Together, these four numbers describe the full throughput-and-stability picture of a software delivery team. Deployment frequency and lead time measure speed. Change failure rate and MTTR measure stability. High-performing teams move both pairs in the same direction. They ship more often and they break things less often, because the practices that make deployments safe are the same ones that make them fast.
In DORA's framing, CFR is the quality counterbalance. A team that ships 50 times a week with a 40 percent failure rate isn't high-performing. It's chaotic. Deployments that become more frequent and more reliable at the same time are what a successful DevOps transformation actually looks like in the metrics.
This is where teams get tripped up. CFR looks simple on paper, but the definition of "failure" varies by organization, and inconsistent definitions make the number useless.
A practical working definition: a deployed change is a failure if it caused service degradation or required unplanned work to fix. That includes rollbacks, hotfixes deployed within a short window after the original, forward-fix commits that address regressions from the deployment, and incidents traced back to the change. It generally does not include normal follow-up work, planned iterations, or bugs discovered weeks later that weren't directly caused by a specific deployment.
The key is consistency. Pick a definition, write it down, apply it across all teams in your organization, and review it quarterly. A CFR of 12 percent means nothing if half your teams count only rollbacks and the other half count every user-reported bug.
To calculate change failure rate, divide the number of failed production deployments by the total number of deployments over a defined window:
CFR = (Failed Deployments / Total Deployments) × 100
If you deployed 200 times last quarter and 14 of those deployments required a hotfix or rollback, your CFR is (14 / 200) × 100 = 7 percent.
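As a minimal sketch, the formula translates directly into a small function. The function name and the input validation are illustrative choices, not part of the DORA definition.

```python
def change_failure_rate(failed_deployments: int, total_deployments: int) -> float:
    """Return CFR as a percentage of production deployments."""
    if total_deployments <= 0:
        raise ValueError("total_deployments must be positive")
    if not 0 <= failed_deployments <= total_deployments:
        raise ValueError("failed_deployments must be between 0 and total_deployments")
    # Multiply before dividing so whole-number results stay exact in floating point.
    return 100 * failed_deployments / total_deployments


print(change_failure_rate(14, 200))  # 7.0
print(change_failure_rate(8, 100))   # 8.0
```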
Most teams calculate CFR on a rolling 30-day or 90-day window, the same cadence commonly used for tracking cycle time and other flow metrics. Shorter windows surface trends faster, but can be noisy for development teams with low deployment frequency. A team that deploys twice a week will want a longer window than a team that deploys 20 times a day.
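To see why window length matters for a low-frequency team, here is a rough sketch of a rolling-window calculation. The `Deployment` record and the sample history are made up for illustration, and the sketch assumes each deployment has already been labeled as failed or not.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    deployed_on: date
    failed: bool  # labeled using your team's written definition of "failure"


def rolling_cfr(deployments: list[Deployment], as_of: date, window_days: int = 30) -> Optional[float]:
    """CFR (%) for deployments in the window_days ending on as_of, inclusive."""
    start = as_of - timedelta(days=window_days - 1)
    in_window = [d for d in deployments if start <= d.deployed_on <= as_of]
    if not in_window:
        return None  # no deployments in the window: CFR is undefined, not zero
    failures = sum(d.failed for d in in_window)
    return 100 * failures / len(in_window)


# Made-up history for a team that deploys rarely.
history = [
    Deployment(date(2026, 1, 5), failed=False),
    Deployment(date(2026, 2, 10), failed=True),
    Deployment(date(2026, 2, 20), failed=False),
    Deployment(date(2026, 3, 1), failed=False),
    Deployment(date(2026, 3, 2), failed=False),
]
as_of = date(2026, 3, 3)
print(rolling_cfr(history, as_of, window_days=30))  # 25.0
print(rolling_cfr(history, as_of, window_days=90))  # 20.0
```

With only a handful of deployments, a single failure moves the 30-day number by 25 points, which is the noise a longer window smooths out.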
Before you measure anything, answer three questions and write the answers down.
What counts as a deployment? A production release? A code commit that auto-deploys? A feature flag flip? Most software teams count production deployments, but if you use trunk-based development with continuous deployment, every merge to main may effectively become a deployment event. Be explicit about what your number of production deployments actually represents.
What counts as a failure? The minimum viable definition includes rollbacks and incidents. Teams with mature observability expand this to include any change that triggers a forward-fix commit within 24 hours, any change linked to an SLO breach, or any change that triggers a customer-reported bug within a defined window.
Who labels the failures? Your CFR calculation is only as reliable as the labeling. Some development teams automate this by tagging pull requests that revert or hotfix another PR. Others do it manually during incident retrospectives. Manual labeling is more accurate but harder to sustain.
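One way to keep the labeling consistent is to encode the written definition as a function that every team runs against the same deployment records. The sketch below assumes a hypothetical `DeploymentRecord` shape; the field names and the way hotfixes are linked back to a deployment are illustrative and will depend on your own tooling.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Assumption: the 24-hour forward-fix window from the definition above.
HOTFIX_WINDOW = timedelta(hours=24)


@dataclass
class DeploymentRecord:
    deploy_id: str
    deployed_at: datetime
    rolled_back: bool = False
    linked_incident_ids: list[str] = field(default_factory=list)
    # Times of hotfix deployments that reference this deployment (hypothetical linkage).
    hotfix_deployed_at: list[datetime] = field(default_factory=list)


def is_failed_change(record: DeploymentRecord) -> bool:
    """Apply one written failure definition to a deployment record."""
    if record.rolled_back or record.linked_incident_ids:
        return True
    # A hotfix only counts if it shipped within the window after the original deployment.
    return any(
        timedelta(0) <= fix_time - record.deployed_at <= HOTFIX_WINDOW
        for fix_time in record.hotfix_deployed_at
    )


shipped = datetime(2026, 3, 2, 10, 0)
clean = DeploymentRecord("deploy-101", shipped)
hotfixed = DeploymentRecord("deploy-102", shipped, hotfix_deployed_at=[shipped + timedelta(hours=3)])
late_fix = DeploymentRecord("deploy-103", shipped, hotfix_deployed_at=[shipped + timedelta(days=3)])
print([is_failed_change(r) for r in (clean, hotfixed, late_fix)])  # [False, True, False]
```

Keeping the rule in one place makes the quarterly definition review a code change rather than a memo.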
Three mistakes show up constantly:
Change failure rate benchmarks give you a reference point, but context matters more than any industry average. A fintech team with strict regulatory requirements will target different numbers than a consumer app team iterating on a new feature.
The DORA research program groups software delivery teams into four performance clusters based on the four key metrics. Note that DORA has moved away from rigid percentile buckets in its more recent reports, emphasizing context over fixed tiers. That said, the commonly referenced practitioner benchmarks remain useful as directional guideposts:
| Performance Level | Change Failure Rate | Characteristics |
|---|---|---|
| Elite | 0 to 5% | Deploy on-demand, recover from incidents in under an hour, extensive test automation |
| High | 5 to 10% | Deploy weekly to daily, recover within a day, strong CI/CD practices |
| Medium | 10 to 15% | Deploy weekly to monthly, recover within a week, inconsistent automation |
| Low | Above 15% | Deploy monthly or slower, recover in weeks, manual processes dominate |
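
If you want to place a CFR value against the table above in a dashboard or report, a lookup like the following is enough. Because the published ranges share their boundaries, the sketch assumes each upper bound is inclusive and treats anything above 15 percent as Low; that boundary handling is an assumption, not a DORA rule.

```python
def cfr_tier(cfr_percent: float) -> str:
    """Map a CFR percentage to the practitioner benchmark tiers in the table above."""
    if not 0 <= cfr_percent <= 100:
        raise ValueError("cfr_percent must be between 0 and 100")
    # Assumption: upper bounds are inclusive, so exactly 5% is Elite and exactly 10% is High.
    if cfr_percent <= 5:
        return "Elite"
    if cfr_percent <= 10:
        return "High"
    if cfr_percent <= 15:
        return "Medium"
    return "Low"


print(cfr_tier(7))   # High
print(cfr_tier(18))  # Low
```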
The 2024 DORA State of DevOps Report showed an unusual pattern: the medium performance cluster posted a lower change failure rate than the high cluster, which broke the historical pattern where all four DORA metrics moved together. That's a reminder that change failure rate doesn't exist in isolation, and that some teams accept slightly higher CFR in exchange for much higher throughput.
The shape of the industry has shifted with AI-assisted development. Teams shipping more code per developer are seeing CFR pressure in both directions. Some teams may see lower CFR when AI helps eliminate simple implementation errors earlier, while others may see higher CFR when review and testing practices fail to keep pace with increased throughput.
If you're benchmarking your team today, compare to similar organizations rather than to the DORA averages. A 50-person platform team at a SaaS company shouldn't benchmark against a 5,000-person engineering org at a bank. The internal comparison that matters most is your own trend line. Is your CFR going down quarter over quarter while deployment frequency stays flat or climbs? That's the signal.
Change failure rate isn't a vanity metric. It maps directly to cost, customer trust, and developer time.
Every failed deployment has three costs. The direct cost is the engineering time spent on remediation efforts: the rollback, the hotfix, the incident response, and the post-mortem. A single moderate-severity incident can easily consume 20 to 40 engineer-hours across detection, response, remediation, and retrospective.
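For a back-of-the-envelope view of the direct cost, you can multiply expected failures by hours per failure. The sketch below assumes every failed deployment costs as much as a moderate-severity incident, which likely overstates the total for teams whose failures are mostly quick rollbacks, so treat the result as a rough upper range.

```python
def monthly_remediation_hours(deployments_per_month: int, cfr_percent: float, hours_per_failure: float) -> float:
    """Estimate engineer-hours spent on remediation in a month."""
    expected_failures = deployments_per_month * cfr_percent / 100
    return expected_failures * hours_per_failure


# 100 deployments a month at 8 percent CFR, using the 20-to-40-hour range above.
low = monthly_remediation_hours(100, 8, 20)
high = monthly_remediation_hours(100, 8, 40)
print(low, high)  # 160.0 320.0
```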
The second cost is the opportunity cost. Engineers fixing last week's deployment failures aren't shipping next week's features. Software teams with high change failure rate trap themselves in a reactive cycle where firefighting crowds out forward motion.
The third cost is trust: internal trust from product and design teams, who stop believing the engineering team can ship safely, and external trust from customers, who notice the outages. Trust takes months to build and weeks to lose.
One of the biggest mistakes development teams make is treating change failure rate as a target to optimize directly. If you set "reduce CFR to 5 percent" as a quarterly goal, teams will hit it by shipping less, bundling changes, or reclassifying failures. You'll have a better number and a worse system.
CFR works best as a signal. When it rises, it's telling you something about your delivery processes: the tests aren't catching what they should, the code review process is missing context, the changes are too large, or the team doesn't understand the code they're changing. When those last two show up together, elevated cyclomatic complexity is often the underlying signal — hard-to-reason-about code is harder to change safely. The fix is upstream, not in the metric itself.
Change failure rate doesn't stand alone. It's most useful when read alongside the other DORA metrics. A team with high deployment frequency and low CFR is genuinely fast. A team with high deployment frequency and high CFR is shipping chaos. A team with low deployment frequency and low change failure rate might look stable, but it's often just shipping rarely enough to hide problems.
Read change failure rate with lead time for changes and mean time to recovery. If CFR is high but MTTR is low, your team catches and fixes problems fast, which is a recoverable position. If both are high, you're in a genuine reliability crisis that affects overall organizational performance. For the broader picture of how these metrics interact, see our write-up on engineering metrics that actually matter, and for how CFR fits alongside sprint-level signals, see our guide to agile metrics.
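Reading CFR and MTTR together can be sketched as a simple two-by-two. The thresholds below (15 percent CFR, 24 hours MTTR) are assumptions chosen for illustration, and the label for the low-CFR, slow-recovery quadrant is a placeholder, since this guide only names the brittle, resilient, and stable combinations.

```python
def stability_profile(
    cfr_percent: float,
    mttr_hours: float,
    cfr_threshold: float = 15.0,         # assumption: what counts as "high" CFR
    mttr_threshold_hours: float = 24.0,  # assumption: what counts as "high" MTTR
) -> str:
    """Classify a team's stability from CFR and MTTR read together."""
    high_cfr = cfr_percent > cfr_threshold
    high_mttr = mttr_hours > mttr_threshold_hours
    if high_cfr and high_mttr:
        return "brittle"
    if high_cfr:
        return "resilient"
    if high_mttr:
        return "low CFR, slow recovery"  # placeholder label, not from DORA
    return "stable"


print(stability_profile(20, 2))   # resilient
print(stability_profile(20, 72))  # brittle
print(stability_profile(4, 1))    # stable
```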
Most CFR-reduction advice focuses on automated testing and feature flags. Those matter, but they're downstream fixes. The upstream lever is making sure developers understand the code changes they're making before they make them. Here's the full stack of practices that improve your deployment processes, ordered from most commonly discussed to most overlooked.
Code review is your last line of defense before a change hits main. Software teams with low CFR tend to have reviewers who actually understand the code being changed, not just reviewers who approve to clear the queue. That means assigning reviewers based on code ownership and context, keeping pull requests small enough to review meaningfully (ideally under 400 lines), and blocking merges on at least one substantive review rather than a rubber-stamp approval. Strong code review catches issues early and improves code quality across the development process.
Automated testing catches the deployment failures you can predict. Unit tests catch logic errors. Integration tests catch contract violations between services. End-to-end tests catch regressions in user-facing flows. Change failure rate drops when tests cover the paths that actually fail in production, not when the test count goes up. Inadequate testing is one of the most common root causes of failed changes reaching the production environment.
The most common testing gap isn't coverage percentage. It's missing tests for the failure modes a team has already seen. After every incident, add a test that would have caught it. Change failure rate moves down when the test suite reflects the real failure history of the codebase.
Feature flags and progressive delivery decouple deployment from release. You can ship code to production behind a flag, roll it out to 1 percent of traffic, watch for errors, and fix forward or roll back without a full redeployment. This doesn't eliminate deployment failures, but it dramatically reduces the blast radius of each failure, which lowers change failure rate if you're measuring customer-impacting incidents rather than raw deployment issues.
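The rollout logic described above can be sketched as a gate that decides the next traffic percentage. Everything here is hypothetical: the step ladder, the error-rate tolerance, and the convention that returning 0 means "turn the flag off" are illustrative choices, not the API of any particular feature-flag product.

```python
def next_rollout_percent(
    current_percent: int,
    canary_error_rate: float,
    baseline_error_rate: float,
    tolerance: float = 0.005,                  # assumption: allowed error-rate increase
    steps: tuple[int, ...] = (1, 10, 50, 100),  # assumption: rollout ladder
) -> int:
    """Return the next traffic percentage, or 0 to signal disabling the flag."""
    if canary_error_rate > baseline_error_rate + tolerance:
        return 0  # roll back by turning the flag off; no redeployment needed
    for step in steps:
        if step > current_percent:
            return step
    return current_percent  # already fully rolled out


print(next_rollout_percent(1, canary_error_rate=0.010, baseline_error_rate=0.009))   # 10
print(next_rollout_percent(10, canary_error_rate=0.030, baseline_error_rate=0.009))  # 0
```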
Small deployed changes fail less often, and when they do fail, they're easier to diagnose and reverse. Software teams that ship many small changes per day tend to have lower CFR than teams that ship one large change per week, because the failure surface of each deployment is smaller. Consider a platform team shipping weekly releases that sees CFR spike from 8 percent to 18 percent after kicking off a cross-service API migration. The migration itself isn't the problem. The problem is that a single large change touches dozens of call sites across multiple services, and any one missed update becomes a production failure. Breaking that migration into smaller, service-by-service changes brings CFR back down because each deployment is independently reversible.
When a change needs to be applied consistently across many repositories, like a dependency bump, a config migration, or an API version upgrade, the risk multiplies. One team updates correctly, another misses a repo, a third applies the change with a subtle variation, and the inconsistency surfaces later as a production incident. Batch Changes addresses this directly by applying the same change across every repository that needs it, eliminating the "forgot to update repo X" class of failures that show up in CFR data as mysterious cross-service regressions.
This is an underinvested lever, especially in large, multi-repo systems. Most advice on reducing CFR assumes the developer understands the code they're changing. That assumption breaks down at scale. At a company with hundreds of services and millions of lines of code across dozens of languages, engineers regularly change code they've never seen before. They guess at the blast radius. They miss the caller they didn't know existed. The change passes tests, passes review, ships to production, and breaks a dependency nobody traced. These unintended consequences of failed changes are among the hardest deployment failures to track and prevent.
Code intelligence closes that gap. Before changing a function, a developer can use Code Search to find every usage across the entire codebase, including repos they don't own, to understand what calls it, how it's called, and what happens if the signature changes. Code navigation (go-to-definition, find-references) across repositories turns a black-box change into a traced, understood change. The failures this prevents are the ones that don't show up in local tests because the calling code lives in a different service.
For teams tracking CFR over time, Code Insights lets you track code changes and patterns across repositories, which, when combined with incident data, helps identify failure-prone areas and where to invest remediation effort. This is more useful than a single CFR number because it tells you where the risk is concentrated.
AI code generation changes the CFR equation in ways most teams aren't measuring yet. The volume of code entering repositories has gone up sharply. DX research found that daily AI users ship roughly 60 percent more pull requests than non-users. That shift pressures change failure rate in both directions and surfaces new failure modes that traditional testing doesn't catch. Measuring CFR gets harder as both code volume and deployment velocity increase.
Three dynamics matter:
Development teams using AI coding assistants should track change failure rate separately for AI-augmented changes versus human-authored changes. Tracking CFR by source shows whether AI-assisted changes are performing on par with human-authored ones. If AI-augmented CFR is higher, the gap tells you where to invest in guardrails.
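Splitting CFR by source is a small grouping exercise once each change carries a source label. The sketch below assumes you already tag changes as AI-assisted or human-authored (how you do that is up to your tooling), and the sample numbers are made up purely to show the output shape.

```python
from collections import defaultdict


def cfr_by_source(changes: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute CFR (%) per source label from (source, failed) pairs."""
    totals: dict[str, int] = defaultdict(int)
    failures: dict[str, int] = defaultdict(int)
    for source, failed in changes:
        totals[source] += 1
        failures[source] += int(failed)
    return {source: 100 * failures[source] / totals[source] for source in totals}


# Made-up sample: 20 AI-assisted changes with 3 failures, 40 human-authored with 4.
sample = (
    [("ai-assisted", True)] * 3 + [("ai-assisted", False)] * 17
    + [("human", True)] * 4 + [("human", False)] * 36
)
print(cfr_by_source(sample))  # {'ai-assisted': 15.0, 'human': 10.0}
```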
The most direct CFR reduction for AI-assisted teams is making sure the AI generates code with the right context. An AI assistant that can see only the open file is far more likely to hallucinate. An assistant with access to the full codebase, cross-repo dependencies, and team conventions produces code that matches how the codebase actually works. This is where the code intelligence layer matters: the same cross-repo understanding that helps human developers trace dependencies before making changes also provides AI assistants with the context they need to generate reliable code. When developers use AI to explain unfamiliar code before modifying it, the risk of introducing failures drops because the change is informed rather than guessed.
Change failure rate is one lens on this. Pair it with rework rate (pull requests that modify recently-merged code), and you get a fuller picture of whether AI is accelerating your team or just accelerating your cleanup work. For more on measuring the full picture of engineering health, see our guide to developer productivity metrics.
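Rework rate can be approximated from merge history. The sketch below makes two assumptions you should adjust: "recently merged" means within 21 days, and overlap is detected at the file level rather than the line level. The `MergedPR` record is a hypothetical shape, not a specific Git host's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MergedPR:
    number: int
    merged_at: datetime
    files: frozenset[str]


def rework_rate(prs: list[MergedPR], window: timedelta = timedelta(days=21)) -> float:
    """Percentage of PRs that touch a file another PR changed within `window` before them."""
    ordered = sorted(prs, key=lambda pr: pr.merged_at)
    reworked = 0
    for i, pr in enumerate(ordered):
        recently_changed: set[str] = set()
        for earlier in ordered[:i]:
            if pr.merged_at - earlier.merged_at <= window:
                recently_changed |= earlier.files
        if pr.files & recently_changed:
            reworked += 1
    return 100 * reworked / len(ordered) if ordered else 0.0


# Made-up merge history.
prs = [
    MergedPR(1, datetime(2026, 1, 1), frozenset({"a.py"})),
    MergedPR(2, datetime(2026, 1, 5), frozenset({"a.py", "b.py"})),  # reworks a.py
    MergedPR(3, datetime(2026, 1, 10), frozenset({"c.py"})),
    MergedPR(4, datetime(2026, 3, 1), frozenset({"b.py"})),          # outside the window
]
print(rework_rate(prs))  # 25.0
```

File-level overlap will overcount in large shared files, so a line-level diff comparison is a reasonable refinement if the number looks inflated.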
Change failure rate is the quality counterbalance to deployment frequency. Track it alongside the other DORA metrics, including lead time for changes and mean time to recovery. Define "failure" consistently, and treat it as a signal rather than a target. The durable ways to reduce change failure rate are the upstream ones: smaller changes, better review, stronger tests, progressive delivery, and code intelligence that helps developers understand what they're changing before they ship it.
If your change failure rate is climbing, the first question isn't "what test should I add?" It's "do the engineers changing this code understand it?" The answer to that question drives everything downstream. Start with visibility. Explore how code search and code intelligence help developers understand the full impact of a change before it reaches production, and you'll find the leverage point for CFR that no amount of additional testing can reach.
Commonly cited benchmarks based on DORA research put elite teams at 0 to 5 percent, high performers at 5 to 10 percent, and medium performers at 10 to 15 percent. Anything above 15 percent typically signals systemic delivery issues. Context matters more than the absolute number: a regulated fintech team should target a lower CFR than a consumer app team iterating on experimental features. The most meaningful comparison is your own trend line over time, not the industry average.
Divide the number of failed production deployments by the total number of production deployments over a rolling window, usually 30 or 90 days, then multiply by 100. The hard part isn't the formula — it's defining what counts as a deployment and what counts as a failure, then applying those definitions consistently across every team. Most organizations label failures through a combination of automated tagging (hotfix or revert pull requests) and manual review during incident retrospectives.
Change failure rate measures how often deployments cause problems. Mean time to recovery (MTTR) measures how fast you fix them. Read together, the two tell you whether your system is brittle (high CFR, high MTTR), resilient (high CFR, low MTTR), or stable (low CFR, low MTTR). DORA treats both as stability metrics, balancing the speed metrics of deployment frequency and lead time for changes.
CFR is one of the four DORA metrics defined by Google's DORA research program, alongside deployment frequency, lead time for changes, and mean time to recovery. DORA groups software delivery teams into elite, high, medium, and low performance clusters based on how they score across all four metrics together. CFR serves as the quality counterbalance to deployment frequency, keeping teams honest about whether their velocity is sustainable.
It can move the number in either direction. AI helps eliminate simple implementation errors, which tends to lower CFR. But daily AI users ship roughly 60 percent more pull requests, which pressures review capacity and can raise CFR if the review system doesn't scale with throughput. The most reliable mitigation is giving AI assistants full codebase and cross-repo context rather than single-file context, so generated code matches how the calling code actually works. Teams should also track CFR separately for AI-augmented versus human-authored changes to surface the gap when it exists.
