IT Ops Best Practices

Shadow Changes: The Hidden Cause of IT Outages

The changes that cause outages are the ones you don't know about. How shadow changes happen and how to detect them.

February 28, 2026 · 7 min read

The 80% Problem

Gartner, IDC, and the IT Process Institute all arrived at the same number independently: 80 percent of unplanned downtime traces back to people and process issues, not hardware failure, not software bugs in the traditional sense. People changing things. The infrastructure does not spontaneously break. Somebody changes it, and when that change is invisible to every system of record you own, the resulting failure looks inexplicable.

On October 4, 2021, every Facebook-owned service disappeared from the internet for six hours. The root cause was a configuration change to backbone routers that withdrew BGP routes to Facebook’s own DNS servers. A config change. Not a hack, not a hardware failure. An engineer issued a command that effectively disconnected Facebook from the global routing table. The cascading effect locked engineers out of their own remote management tools, which meant someone had to physically walk into a data center to undo it.

That incident was visible in hindsight because of its sheer scale. Most shadow changes are not. They are the feature flag toggled at 6 PM that nobody logs. The connection pool setting adjusted through a cloud console because the change request process would have taken three days. The Terraform state that drifted from what’s actually running in production. These changes exist in the gap between what your systems of record believe and what production actually looks like.

We call them shadow changes. And they are the single biggest diagnostic blind spot in modern IT operations.

What Shadow Changes Actually Look Like

A shadow change is any modification to a production system that occurs outside the official change management process. Not malicious. Almost never intentional circumvention. Just an engineer facing a choice between a process that takes hours and a fix that takes thirty seconds.

The most common variant is the feature flag toggle. When an engineer flips a flag, traffic shifts from one code path to another. Resource consumption patterns change. Downstream API call patterns change. For every practical purpose it is a deployment. But it does not trigger a pipeline, does not create a commit, does not appear in CI/CD history. CloudBees research calls feature flags “blind spots in the audit trail,” and that is accurate. A flag that enables a new database query pattern can saturate a connection pool in minutes. A flag that activates a new caching strategy can serve stale data to every user. None of it leaves a trace in the systems your incident responders check first.

Then there is infrastructure drift. IaC was supposed to solve the visibility problem: if all infrastructure is defined in version-controlled code, every change is tracked. In practice, engineers make manual changes through cloud consoles during incidents, intending to backport into Terraform “later.” Later rarely arrives. The IaC repo says one thing. Production says another. The next terraform plan surfaces the drift as an unexpected diff, or worse, silently reverts the manual fix. Environment variables, secrets, runtime settings often live entirely outside the IaC boundary. A change to a database connection pooling parameter is a production change that Terraform never sees.

And then the quick config tweaks. An engineer SSHs into a server to adjust a thread pool size. A DBA modifies a query timeout through a database console. An ops engineer adjusts a load balancer weight through a cloud provider UI. Each change is minor. Most of the time, it works exactly as intended. That success rate is what makes it dangerous: 99 successful tweaks build a habit. The one tweak that triggers a cascading failure reveals the accumulated risk.

Knight Capital learned this in August 2012. A technician deploying new trading code failed to copy it to one of eight servers. The deployment tooling had a bug: when it couldn’t open an SSH connection to a machine, it failed silently and reported success. The old code on that eighth server began sending orders in an infinite loop. In 45 minutes, Knight Capital lost $440 million. No written deployment procedures existed. No peer review was required. The company nearly ceased to exist because of a single unverified deployment step.

Why Engineers Bypass the Process

Here’s the thing. Shadow changes are not a discipline problem. They are a friction problem. Every shadow change is evidence that your change management workflow has failed to keep pace with how your team actually works.

The most common driver is a change process that treats every modification identically regardless of risk. When a one-line config tweak requires the same approval workflow as a major infrastructure migration, engineers will find a way around it for the small stuff. A Keepnet Labs study found that 61 percent of employees are dissatisfied with existing processes, and 38 percent are driven toward shadow IT as a direct result. They are not objecting to the principle of change management. They are objecting to a three-day overhead on a thirty-second fix.

Change Advisory Boards amplify the problem. When the only path to production runs through a weekly meeting, every change that misses the agenda gets delayed by a week. DORA research has consistently shown that CABs do not reduce change failure rates. They just force larger batch deployments, which are inherently riskier. But the less-discussed consequence is what happens to the changes that teams refuse to batch. Those are the modifications that get pushed through unofficial channels because waiting for the next CAB slot is not a viable option. The CAB does not prevent these changes. It just makes them invisible.

Tool sprawl creates its own category of accidental shadow changes. A single production change might originate from a CI/CD pipeline, a feature flag platform, a cloud provider console, an IaC repo, or a direct SSH session. When your change tracking system only sees the CI/CD pipeline, every modification made through another channel is a shadow change by default. Nobody bypassed the process. The process simply has no visibility into the tool they used.

AI Makes This Worse

We need to talk about AI-generated changes, because they are creating an entirely new class of shadow changes that traditional tracking was never designed to see.

Engineers using Copilot, Cursor, and Claude Code are generating infrastructure-as-code, Kubernetes manifests, and configuration files at a pace that manual review cannot match. GitClear’s analysis of over 153 million lines of code found that AI-assisted coding is linked to four times more code duplication than before. Code is being pasted into production faster than any change management process can absorb it.

But the code generation is only half the problem. AI agents are now toggling feature flags as part of automated experiments. They are modifying configuration parameters in response to performance signals. They are creating and applying Terraform plans. Each of these is a production change. Each one happens in a channel that your change calendar, your CMDB, and your weekly CAB meeting know nothing about.

The 2024 DORA State of DevOps report found that a 25 percent increase in AI adoption was associated with an estimated 7.2 percent decrease in delivery stability. That number should not be surprising. When you accelerate the rate of change without proportionally expanding the aperture of change visibility, you get more invisible modifications. More invisible modifications mean longer incident diagnosis times. The math is straightforward.

This is our opinionated take: within two years, AI-generated changes will be the primary source of shadow changes in most engineering organizations. The tooling to track them does not exist yet in most shops. If you are not thinking about this now, you are building a visibility gap that will bite you during your next major incident.

Detection Requires Observation, Not Process

Detecting shadow changes means shifting from tracking what goes through your workflow to observing what actually changes in production. That is a different mental model entirely.

The first layer is treating your CI/CD pipeline as a change registration system, not just a deployment mechanism. Every pipeline execution should emit a structured change event that feeds into your change intelligence platform. This creates a baseline of known changes. When an incident occurs and the change correlation engine shows no pipeline ran, that absence becomes a strong signal that a shadow change is the cause.
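As a sketch of what this looks like in practice, the final step of a deploy job can emit a normalized change record. Everything here (the `ChangeEvent` field names, the idea of POSTing the JSON to a change-intelligence API) is illustrative, not a specific product's schema:

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class ChangeEvent:
    """Normalized change record, shared by every change source."""
    source: str      # e.g. "ci-pipeline", "feature-flag", "cloud-audit"
    actor: str       # who (or what) made the change
    target: str      # service or resource affected
    action: str      # deploy, toggle, config-update, ...
    timestamp: float
    metadata: dict

def pipeline_change_event(actor: str, service: str, commit_sha: str) -> str:
    """Build the JSON payload a deploy job would send to a change API."""
    event = ChangeEvent(
        source="ci-pipeline",
        actor=actor,
        target=service,
        action="deploy",
        timestamp=time.time(),
        metadata={"commit": commit_sha},
    )
    return json.dumps(asdict(event))

# Last step of a deploy job: register the deployment as a change event.
payload = pipeline_change_event("jane@example.com", "checkout-api", "a1b2c3d")
```

The payoff of a shared schema is the absence signal described above: if an incident's time window contains no `ci-pipeline` event, responders know to look at the other channels.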

Feature flag platforms need the same treatment. Every flag toggle should generate a change event with the same metadata as a deployment: who, when, expected impact, affected systems. Most platforms (LaunchDarkly, Split, Unleash) provide webhook integrations that can feed flag changes into your change timeline. The integration is straightforward. It is just rarely prioritized because teams do not think of flag toggles as changes. That perception gap is exactly what makes them shadows.
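The webhook side is a small translation layer: take whatever payload the flag platform sends and normalize it into the same change-event shape as a deployment. The payload fields below are invented for illustration, not any vendor's exact webhook schema:

```python
def normalize_flag_webhook(payload: dict) -> dict:
    """Translate a flag-platform webhook payload (illustrative shape)
    into a generic change event for the shared timeline."""
    return {
        "source": "feature-flag",
        "actor": payload.get("member", "unknown"),
        "target": payload["flag_key"],
        "action": "toggle",
        "metadata": {
            "environment": payload.get("environment", "production"),
            "new_state": payload.get("on"),
        },
    }

# A hypothetical toggle event arriving from the flag platform's webhook.
hook = {"flag_key": "new-cache-strategy", "member": "ops-bot", "on": True}
event = normalize_flag_webhook(hook)
```

Once flag toggles carry the same who/when/what metadata as deployments, they stop being a separate category of change and simply appear on the timeline.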

Configuration drift detection compares actual system state against the expected state in your IaC or config management system. Any divergence is, by definition, a shadow change. Something changed production without updating the source of truth. Effective drift detection runs continuously, not as a quarterly audit. When it detects a divergence, it should fire an alert that includes what changed, on which resource, and which principal made the modification according to cloud provider audit logs.
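At its core, drift detection is a diff between declared and observed state. A minimal sketch, assuming both states can be flattened into key-value maps (the parameter names are made up):

```python
def detect_drift(expected: dict, actual: dict) -> list[dict]:
    """Return one drift record per key whose live value diverges from
    the declared value, or that exists on only one side."""
    drift = []
    for key in expected.keys() | actual.keys():
        want, have = expected.get(key), actual.get(key)
        if want != have:
            drift.append({"resource": key, "expected": want, "actual": have})
    return drift

# Declared state from IaC/config management vs. state observed in production.
declared = {"db.pool_size": 20, "lb.weight": 50}
live     = {"db.pool_size": 50, "lb.weight": 50, "debug.enabled": True}
findings = detect_drift(declared, live)
```

A real implementation would run this continuously, enrich each finding with the responsible principal from cloud audit logs, and emit the result as another change event rather than a standalone alert.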

Cloud audit logs themselves are the ground-truth layer. AWS CloudTrail, GCP Audit Logs, Azure Activity Log. These capture changes that bypass every other detection mechanism: console clicks, CLI commands, API calls from scripts on someone’s laptop. Cross-referencing audit logs against your official change records reveals the delta. The changes that happened but were never tracked.
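Cross-referencing reduces to a set difference: which audit-log write events have no corresponding official change record? The sketch below matches on a request ID for simplicity; a production system would correlate on resource, principal, and time window instead:

```python
def untracked_changes(audit_events: list[dict],
                      registered_ids: set[str]) -> list[dict]:
    """Return audit-log write events with no matching change record.
    Matching by request ID is a simplification for illustration."""
    return [e for e in audit_events if e["request_id"] not in registered_ids]

# Hypothetical write events pulled from a cloud provider's audit log.
audit = [
    {"request_id": "r-1", "principal": "jane", "action": "ModifyDBParameterGroup"},
    {"request_id": "r-2", "principal": "deploy-role", "action": "UpdateService"},
]
registered = {"r-2"}  # only the pipeline deploy was officially tracked
shadow = untracked_changes(audit, registered)
```

The remainder (here, Jane's console change to a database parameter group) is exactly the delta the article describes: changes that happened but were never tracked.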

The key insight is that all of these sources need to feed into a single timeline. When your incident responders can see official deployments, feature flag toggles, detected drift, and cloud audit events on one screen, the diagnostic power multiplies. Shadow changes stop being shadows. They become visible, correlated, and actionable.
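Mechanically, the single timeline is just a time-ordered merge of every source's events, which is trivial once they share a schema. A minimal sketch with invented event shapes:

```python
def merged_timeline(*sources: list[dict]) -> list[dict]:
    """Merge change events from every source into one time-ordered view."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda e: e["ts"])

# One event from each detection layer, all landing on the same timeline.
deploys = [{"ts": 100, "source": "ci",    "what": "deploy checkout-api"}]
flags   = [{"ts": 130, "source": "flags", "what": "toggle new-cache-strategy"}]
drift   = [{"ts": 115, "source": "drift", "what": "db.pool_size 20 -> 50"}]
timeline = merged_timeline(deploys, flags, drift)
```

The hard part is not the merge; it is getting every tool that can modify production to emit into it in the first place.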

Make the Right Path the Fast Path

Detection is necessary but not sufficient. As long as the official change process is slower than the workaround, engineers will keep bypassing it. So stop making it slow.

Risk-tiered approvals are the foundation. Low-risk changes, which make up the bulk of shadow change volume, should be auto-approved based on predefined criteria. When a config tweak that would take three days through the CAB can instead be logged, risk-scored, and approved in seconds, the incentive to skip the process evaporates. Automated CAB workflows that tier changes by risk level make this concrete. Standard changes flow through automatically. Normal changes route to the right approver asynchronously. Only genuinely high-risk changes require synchronous review.
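A risk tier can be as simple as a score over a few attributes of the change. The criteria and thresholds below are illustrative, a sketch of the routing logic rather than a recommended policy:

```python
def route_change(change: dict) -> str:
    """Route a change by risk tier: auto-approve standard changes,
    review normal ones asynchronously, and require synchronous
    review only for genuinely high-risk changes."""
    score = 0
    if change.get("touches_prod_data"):
        score += 3
    if change.get("blast_radius", 1) > 1:   # services affected
        score += 2
    if not change.get("has_rollback"):
        score += 2
    if score <= 1:
        return "auto-approved"
    if score <= 4:
        return "async-review"
    return "cab-review"

# A one-line config tweak vs. a multi-service data migration.
tweak = {"touches_prod_data": False, "blast_radius": 1, "has_rollback": True}
migration = {"touches_prod_data": True, "blast_radius": 5, "has_rollback": False}
```

The tweak auto-approves in seconds; only the migration ever reaches a synchronous review. That asymmetry is what removes the incentive to bypass the process.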

Change registration itself needs to be embedded in the tools engineers already use. A Slack command. A CLI flag. A git hook. An API call from the feature flag platform. The change record should be created as a side effect of making the change, not as a separate step in a separate tool. The effort to document a change should be proportional to its risk. For low-risk config updates, the effort should be zero beyond making the change itself.
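"Side effect of making the change" can be as thin as a wrapper that runs the operational command and records it in one motion. A minimal sketch, where appending to a local list stands in for a POST to a change API:

```python
import subprocess
import time

def run_and_register(cmd: list[str], actor: str, registry: list[dict]) -> int:
    """Run an operational command and record a change event as a
    side effect. In practice the append would be an API call."""
    started = time.time()
    result = subprocess.run(cmd, capture_output=True, text=True)
    registry.append({
        "actor": actor,
        "command": " ".join(cmd),
        "exit_code": result.returncode,
        "ts": started,
    })
    return result.returncode

# Wrapping a (stand-in) ops command: the change record costs zero extra effort.
log: list[dict] = []
code = run_and_register(["echo", "scale replicas=4"], "jane", log)
```

The same pattern works as a git hook, a Slack command handler, or a shell alias; the point is that the engineer never performs "register the change" as a separate step.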

Then close the feedback loop. When a change from any source correlates with an incident, that correlation should feed back into the risk scoring model. A feature flag toggle that caused an outage raises the risk score for similar toggles going forward. A config change that caused drift triggers a review of which parameters need stricter controls. Over time, the system learns which changes are most likely to cause problems, which teams generate the most shadow changes, and which tools are the most common sources of untracked modifications.
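The feedback loop can be sketched as a weight per change type that incidents push upward over time. The multiplicative penalty here is an arbitrary illustration of the mechanism, not a tuned model:

```python
from collections import defaultdict

class RiskModel:
    """Toy feedback loop: each incident linked to a change type
    raises the prior risk weight for future changes of that type."""
    def __init__(self):
        self.weights = defaultdict(lambda: 1.0)

    def record_incident(self, change_type: str) -> None:
        self.weights[change_type] *= 1.5   # illustrative penalty factor

    def score(self, change_type: str) -> float:
        return self.weights[change_type]

# Two flag-toggle incidents raise the risk of similar toggles;
# deploys keep their baseline score.
model = RiskModel()
model.record_incident("feature-flag-toggle")
model.record_incident("feature-flag-toggle")
```

Feeding these scores back into the routing logic means a change type that has burned you starts landing in a stricter approval tier automatically.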

This is the cultural shift that actually works. Not “stop making undocumented changes” (nobody listens to that). Instead: make the compliant path faster than the non-compliant one. When documenting a change takes less effort than explaining an undocumented change during a 2 AM incident call, shadow changes stop being rational. They stop being tempting.

The Invisible Input

Cloudflare went dark on July 2, 2019, because an engineer deployed a single WAF rule containing a poorly written regular expression. The rule went out globally rather than through a staged rollout. CPU usage across Cloudflare’s edge network spiked to 100 percent. Every site behind Cloudflare returned errors. Twenty-seven minutes of global impact from one config push that skipped the staging process.

These stories repeat because the underlying dynamic never changes. The changes that break production are the changes you do not know about. The ones that bypassed the approval queue. The ones that happened in a tool your change system does not watch. The ones that an AI agent executed at 3 AM in response to a metric threshold.

The solution is not more process. More process is what created the incentive to bypass it in the first place. The solution is better observation. Treat every tool that can modify production as a change source. Feed every change into a single timeline. Make the official path fast enough that nobody has a reason to go around it. The organizations that get this right do not just reduce their shadow change volume. They cut their mean time to resolution because when something breaks, the answer to “what changed?” is never “we don’t know.”

Ready to modernize your change management?

Get started for free or book a personalized demo.