The $76 Million Problem
The New Relic 2025 Observability Forecast puts the median annual cost of business-impacting outages at $76 million per organization. Not the worst case. Not the Fortune 100 average. The median.
That figure covers direct revenue losses, remediation costs, SLA penalties, regulatory fines, and incident response overhead. It does not include eroded customer trust, engineering burnout from on-call rotations, or the strategic work that never ships because the team was firefighting.
Here is what makes the number hard to explain: the ITSM market hit $13.5 billion in 2024 (Grand View Research) and is projected to reach $29.9 billion by 2030. Organizations are spending aggressively on tooling, and outage costs keep rising anyway. The tools aren’t broken. They’re solving the wrong problem.
They were designed for an era of monthly deployments and manual infrastructure. Today, a mid-market company pushes dozens to hundreds of changes per week across CI/CD pipelines, infrastructure-as-code, and feature flags. The volume of change has outpaced the processes meant to govern it. The cost data reflects that gap.
Cost Per Minute: What Downtime Actually Costs
The annual figure is staggering. The per-minute cost is what drives urgency.
Per-Minute Cost by Organization Size
Granular per-minute data from the EMA/BigPanda 2024 research:
| Organization Size | Average Cost Per Minute | Annual Frequency |
|---|---|---|
| Enterprise (5,000+ employees) | $23,750 | 15 – 20 major outages/year |
| Upper Mid-Market (1,000 – 5,000) | $14,056 | 12 – 18 major outages/year |
| Mid-Market (500 – 1,000) | $8,000 – $12,000 | 10 – 15 major outages/year |
| SMB (100 – 500) | $2,000 – $5,000 | 8 – 12 major outages/year |
The blended average of $14,056 per minute is the most widely cited figure. For enterprises, it climbs to $23,750 per minute, or $1.425 million per hour. A four-hour enterprise outage: $5.7 million before you count customer attrition or regulatory fallout.
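To make the arithmetic explicit, here is a minimal Python sketch using the enterprise rate from the table (the function name is ours, for illustration only):

```python
# Sanity check on the per-minute arithmetic above.
# The rate comes from the EMA/BigPanda table; everything else is illustrative.

ENTERPRISE_COST_PER_MINUTE = 23_750  # USD, enterprise tier from the table

def outage_cost(cost_per_minute: float, duration_minutes: float) -> float:
    """Direct cost of a single outage, before indirect losses."""
    return cost_per_minute * duration_minutes

per_hour = outage_cost(ENTERPRISE_COST_PER_MINUTE, 60)
four_hour = outage_cost(ENTERPRISE_COST_PER_MINUTE, 4 * 60)

print(f"Enterprise cost per hour:    ${per_hour:,.0f}")    # $1,425,000
print(f"Four-hour enterprise outage: ${four_hour:,.0f}")   # $5,700,000
```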
The Observability Multiplier
The New Relic 2024 Observability Forecast found that organizations without full-stack observability pay $2 million per hour of downtime, roughly 40% above the enterprise average. Without observability, detection takes longer. Diagnosis takes longer. Every extra minute at $23,750 compounds the total.
How These Costs Accumulate
A typical mid-market organization’s math:
| Metric | Value |
|---|---|
| Major outages per year | 14 |
| Average duration per outage | 97 minutes |
| Average cost per minute | $14,056 |
| Total annual direct cost | $19.1 million |
| Indirect costs (reputation, overtime, opportunity) | 3 – 4x direct cost |
| Total annual loaded cost | $57M – $76M |
The gap between $19.1M in direct costs and the $76M median is entirely indirect: reputation damage, engineer overtime, and opportunity cost that never shows up on an invoice.
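The same math as a short Python sketch; every input is a table value above:

```python
# Reproducing the mid-market math from the table. The 3-4x indirect
# multiplier range is applied to the direct annual total.

outages_per_year = 14
avg_duration_minutes = 97
cost_per_minute = 14_056  # USD

direct_annual = outages_per_year * avg_duration_minutes * cost_per_minute
print(f"Direct annual cost: ${direct_annual:,.0f}")  # ~ $19.1M

for multiplier in (3, 4):
    print(f"Loaded at {multiplier}x: ${direct_annual * multiplier:,.0f}")
# 3x ~ $57.3M, 4x ~ $76.4M -- the $57M-$76M range in the table
```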
Where Do Outages Come From?
Understanding cost is necessary. The actionable question: where do outages originate?
The Root Cause Breakdown
Root cause data from the Uptime Institute 2025 Global Data Center Survey:
| Root Cause Category | Percentage of Major Outages |
|---|---|
| Change and configuration issues | 62% |
| Hardware failure | 18% |
| External factors (power, network, natural disaster) | 12% |
| Software bugs (non-change-related) | 5% |
| Capacity and demand issues | 3% |
62% trace back to changes. Not hardware. Not power. Not acts of nature. Changes made by people to production systems.
Within the Change Category
Of change-related outages, the Uptime Institute found 85% resulted from procedure failure or inadequacy. The changes themselves were not dangerous. The processes surrounding them were insufficient, too complex, or bypassed because they were too slow.
The New Relic 2024 Observability Forecast breaks it down further:
| Change Type | Percentage of Outages |
|---|---|
| Deploying software changes (code, releases) | 27% |
| Environment and infrastructure changes | 28% |
| Configuration changes | ~7% (within change/config total) |
Software deployments and environment changes together account for 55% of outages. Add configuration changes, and the total aligns with the Uptime Institute’s 62%. Two independent studies, different methods, same conclusion.
The Procedure Problem
That 85% procedure failure rate reframes everything. The conventional story is that engineers make mistakes. The data says something different: the procedures designed to prevent mistakes are either inadequate or impossible to follow at the speed modern operations demand.
A 47-field change request form in ServiceNow does not prevent outages. It incentivizes engineers to route around the process. A weekly CAB meeting does not reduce risk. It batches changes into larger, riskier deployments. A mandatory peer review that takes three days does not improve quality. It pushes config changes directly to production.
This is not a people problem. It is a tooling problem. The tools governing change management were built for monthly deployments. They have not adapted to a world where the average organization pushes dozens of changes daily.
Real-World Examples
These are not small companies with underfunded IT. These are organizations with world-class engineering, unlimited budgets, and massive infrastructure investments. If it happened to them, it can happen to anyone.
CrowdStrike (July 2024): $5.4 Billion
A routine content update to CrowdStrike’s Falcon platform triggered a logic error that sent 8.5 million Windows devices into unrecoverable boot loops. Airlines grounded flights. Hospitals went to paper. Banks stopped processing. Parametrix estimated Fortune 500 losses at $5.4 billion.
Root cause: a configuration update that bypassed staged rollout. The engineering talent and monitoring existed to prevent this. The change governance did not. See our full breakdown in The $5.4B Wake-Up Call.
AT&T (February 2024): 92 Million Blocked Calls
A network configuration change cascaded across AT&T’s signaling infrastructure, blocking 92 million calls over 12+ hours. The FCC investigation found the change was routine. The testing process was not adequate for the blast radius.
Meta (March 2024): $28 – $40 Million
Facebook, Instagram, WhatsApp, and Messenger went down simultaneously for two hours. Based on quarterly earnings, $28 – $40 million in advertising revenue disappeared. An infrastructure-level change took down four products at once because governance did not account for blast radius across the service portfolio.
The Common Thread
Different industries, different technologies, different failure modes. Same root cause: a change deployed without adequate risk assessment. Each organization had the technical capability to prevent it. What was missing was the intelligence layer connecting change deployment to risk to stakeholder awareness.
Cost by Industry
Per-minute costs vary dramatically by sector. Data synthesized from EMA, Gartner, and industry-specific research:
| Industry | Estimated Cost Per Minute | Key Cost Drivers |
|---|---|---|
| Financial Services | $25,000 – $50,000 | Lost transactions, regulatory fines, trading window exposure, customer attrition |
| Healthcare | $15,000 – $30,000 | Patient safety risk, HIPAA exposure, delayed care delivery, malpractice liability |
| Retail / E-commerce | $10,000 – $25,000 | Lost sales, abandoned carts, promotional window losses, competitor switching |
| Technology / SaaS | $8,000 – $20,000 | SLA penalties, customer churn, reputation damage, trial conversion loss |
| Telecommunications | $15,000 – $35,000 | Subscriber churn, regulatory penalties, interconnect SLA breaches, public safety |
| Manufacturing | $10,000 – $20,000 | Production line stoppage, supply chain disruption, spoilage, contract penalties |
| Government / Public Sector | $5,000 – $15,000 | Service delivery disruption, citizen trust, compliance obligations, public safety |
Financial services tops the list because a trading platform outage doesn’t just lose transactions. It exposes the firm to position risk, regulatory scrutiny, and institutional client attrition. Retail outage costs are seasonal: a one-hour outage in February might cost $600K, while the same outage on Black Friday costs $10 million or more.
Calculate Your Risk
Step 1: Establish Your Per-Minute Cost
Calculate revenue per minute. A $500M-revenue company generates roughly $950 per minute on a 24/7 basis ($500M ÷ 525,600 minutes). Multiply by 3 – 5x to capture SLA penalties and response labor, for a total cost of $2,850 – $4,750 per minute.
Step 2: Estimate Annual Outage Minutes
Review 12 months of incident data. If you don’t have clean records (which is itself a data point), use these benchmarks:
| Maturity Level | Annual Outage Minutes (Estimate) |
|---|---|
| Low maturity (reactive, manual processes) | 2,000 – 5,000 minutes |
| Medium maturity (some automation, basic monitoring) | 800 – 2,000 minutes |
| High maturity (full observability, automated response) | 200 – 800 minutes |
| Elite (change intelligence, proactive prevention) | Less than 200 minutes |
Step 3: Apply the 62% Change Attribution
If your total annual outage cost is $30 million, your change-related exposure is ~$18.6 million.
Step 4: Estimate Reduction Potential
Annual reduction potential = total outage cost × 62% change attribution × 50% assumed reduction rate.
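The four steps combined into a short Python sketch. The function names are illustrative, and the example inputs ($500M revenue, a 4x loading factor, medium-maturity outage minutes) are assumptions to replace with your own data:

```python
# A sketch of the four-step risk calculation. Names and example inputs
# are illustrative; swap in your own incident data where available.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def per_minute_cost(annual_revenue: float, loading: float = 4.0) -> float:
    """Step 1: revenue per minute, loaded 3-5x for SLA penalties and labor."""
    return annual_revenue / MINUTES_PER_YEAR * loading

def reduction_potential(annual_outage_minutes: float,
                        cost_per_minute: float,
                        change_attribution: float = 0.62,
                        reduction_rate: float = 0.50) -> float:
    """Steps 2-4: annual outage cost x 62% change attribution x reduction."""
    total_cost = annual_outage_minutes * cost_per_minute
    return total_cost * change_attribution * reduction_rate

# Example: $500M revenue, medium maturity (~1,400 outage minutes/year)
cpm = per_minute_cost(500_000_000)           # ~ $3,805/minute at 4x loading
potential = reduction_potential(1_400, cpm)  # ~ $1.65M/year

print(f"Loaded cost per minute:     ${cpm:,.0f}")
print(f"Annual reduction potential: ${potential:,.0f}")
```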
For a tailored estimate, our ROI calculator factors in your outage frequency, MTTR, revenue profile, and team size.
Prevention vs. Recovery
The IT industry has historically invested more in recovery than prevention. Incident management, war rooms, status pages, on-call scheduling. All designed to minimize impact after the outage happens. Important capabilities. But they address the symptom, not the cause.
Recovery tools have measurably improved MTTR over the past decade, but they have not reduced outage frequency. Organizations recover faster; they do not prevent more. Total cost keeps rising because incident volume grows faster than recovery time shrinks.
The Prevention Gap
The tools that govern changes (ITSM platforms) and the tools that detect failures (observability platforms) operate in separate systems. No shared intelligence. The change is recorded in ServiceNow. The alert fires in PagerDuty. The correlation between them happens manually, during the post-mortem, after the damage is done.
Closing this gap requires capabilities traditional tools don’t provide: pre-deployment risk scoring, intelligent awareness routing, and automatic change-to-incident correlation. We wrote about how this applies to reducing change failure rates specifically.
The Economics
| Approach | Investment | Impact | Estimated Annual Savings |
|---|---|---|---|
| Recovery optimization (faster MTTR) | $200K – $2M/year | 10 – 20% reduction in outage duration | $3M – $8M |
| Prevention (change intelligence) | $50K – $250K/year | 30 – 50% reduction in change-related outages | $14M – $24M |
| Combined | $250K – $2.25M/year | Fewer outages + faster recovery when they occur | $17M – $32M |
Prevention delivers 3 – 5x the savings at a fraction of the cost. Not because recovery tools are ineffective, but because prevention operates on a larger cost base. Preventing an outage eliminates 100% of its cost. Reducing its duration eliminates only a fraction.
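One way to see why the prevention row dominates: a minimal sketch assuming the table’s prevention savings are derived from the $76M median annual cost and the 62% change attribution. That derivation is our assumption, consistent with the $14M – $24M row rather than a published formula:

```python
# Why prevention operates on the larger base: a sketch assuming the
# prevention savings derive from the $76M median annual outage cost.

median_annual_cost = 76_000_000
change_attribution = 0.62  # Uptime Institute: share of outages caused by changes

change_related_cost = median_annual_cost * change_attribution  # ~ $47.1M

for reduction in (0.30, 0.50):  # the 30-50% range from the table
    savings = change_related_cost * reduction
    print(f"Prevent {reduction:.0%} of change-related outages: "
          f"${savings:,.0f}/year saved")
# 30% -> ~ $14.1M, 50% -> ~ $23.6M -- the $14M-$24M row in the table
```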
$76 million per year. 62% caused by changes. 85% of those from procedure failures that better tooling can address. The data is not ambiguous.