On July 14, 2025, something strange happened on the Internet.
Cloudflare’s ultra-popular DNS service — 1.1.1.1 — suddenly stopped working.
For over an hour, anyone relying on 1.1.1.1 as their resolver couldn’t look up domain names.
To many users, it felt like the entire Internet had gone dark.
And naturally, confusion followed:
“Is 1.1.1.1 under attack?”
“Was this a BGP routing issue?”
“What just happened to Cloudflare DNS?”
Let’s break it down — clearly, step by step.
What Actually Happened
Cloudflare accidentally withdrew its 1.1.1.1 IP addresses from the global Internet.
The cause?
A misconfiguration in the way Cloudflare manages its internal service topologies.
This wasn’t an attack. It wasn’t a hijack. It was an internal mistake — and a serious one.
What Went Wrong
Here’s the simple version (a toy sketch follows the list):
- Cloudflare maintains complex routing rules called service topologies.
- These topologies determine which IPs are active where.
- In June, someone made a quiet mistake — they linked 1.1.1.1’s IPs to a non-production service.
- That misconfiguration sat quietly for over a month.
- On July 14, a routine update to that non-production system accidentally pulled 1.1.1.1 offline worldwide.
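To make that failure mode concrete, here is a minimal toy sketch. It is not Cloudflare’s actual system; the service names, data model, and refresh() function are assumptions for illustration only. The point is that linking production prefixes to a non-production service does nothing visible until a later global refresh withdraws the routes everywhere at once.

```python
# Toy model of a "service topology": which service owns which prefixes, and
# which locations advertise them. All names and structure here are hypothetical.

PRODUCTION_DNS = "public-dns"        # the real resolver service
NON_PROD = "staging-experiment"      # a non-production service that runs nowhere

# Current global routing state: prefix -> locations currently advertising it.
advertised = {
    "1.1.1.0/24": {"ams", "sfo", "sin"},
    "1.0.0.0/24": {"ams", "sfo", "sin"},
}

# June: the latent mistake. The resolver prefixes get attached to the
# non-production service. Nothing visible changes, because nothing is refreshed yet.
topology = {
    PRODUCTION_DNS: [],
    NON_PROD: ["1.1.1.0/24", "1.0.0.0/24"],
}
locations_for = {
    PRODUCTION_DNS: {"ams", "sfo", "sin"},
    NON_PROD: set(),                 # a non-production service is deployed to zero locations
}

def refresh(service: str) -> None:
    """Re-derive the advertised state for every prefix owned by `service`.

    The refresh is global: it rewrites all locations at once, which is what
    turns a quiet misconfiguration into a worldwide withdrawal.
    """
    for prefix in topology[service]:
        advertised[prefix] = set(locations_for[service])   # empty set == withdrawn

# July 14: a routine update to the non-production service triggers the refresh.
refresh(NON_PROD)
print(advertised)   # {'1.1.1.0/24': set(), '1.0.0.0/24': set()}  -> withdrawn everywhere
```

The dangerous part isn’t the June edit itself; it’s that the system happily carries an inconsistent state until something asks it to reconcile, and then reconciles everywhere at once.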
So… Was This a BGP Hijack?
No — but it briefly looked like one.
When Cloudflare accidentally stopped advertising 1.1.1.0/24, that prefix was suddenly seen as originated by Tata Communications (AS4755).
Another network originating a prefix it doesn’t own is the textbook signature of a BGP origin hijack, but in this case it wasn’t the root cause: the stray announcement only mattered because Cloudflare had withdrawn its own routes.
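To see why it briefly looked like a hijack, here is a toy, single-viewpoint sketch. The ASNs are the real ones from the incident; everything else is deliberately simplified and hypothetical. Once the legitimate origin withdraws, whatever announcement remains is the only route left, which from the outside is exactly what an origin hijack looks like.

```python
# Toy, single-viewpoint model of which origin AS appears to "own" a prefix.
# ASNs match the incident; the data model is simplified for illustration.

PREFIX = "1.1.1.0/24"

# Routes an observer might see for the prefix: origin ASN -> description.
routes = {
    13335: "Cloudflare (legitimate anycast origin, preferred almost everywhere)",
    4755:  "Tata Communications (stray announcement, normally irrelevant)",
}

def apparent_origins(routes: dict) -> list:
    """Origins still announcing PREFIX. With the legitimate route withdrawn,
    the stray announcement is the only path left, so it *looks* like a hijack."""
    return sorted(routes)

print("normal day:       ", apparent_origins(routes))   # [4755, 13335] -> traffic follows 13335

# July 14: Cloudflare withdraws its advertisement; only the stray route remains.
del routes[13335]
print("during the outage:", apparent_origins(routes))   # [4755] -> classic origin-hijack signature
```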
Why Did This Happen?
Cloudflare’s routing configuration is mid-migration: a newer, safer system exists, but a legacy one is still in play, and that is where this change ran.
- The June config wasn’t caught — no alerts fired.
- The July refresh triggered a global topology rewrite.
- There was no staged deployment — the change hit all data centers at once.
Even though the change was peer-reviewed, it lacked safeguards like canary testing or progressive rollout.
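One safeguard that would have helped is a check on the change itself before it is applied anywhere. The sketch below is hypothetical (the data model, prefix list, and validate_change() function are made up for illustration): it blocks a non-production change that would take ownership of production prefixes instead of letting it sit silently for a month.

```python
# Hypothetical pre-apply guard for topology changes; not Cloudflare's real pipeline.

PRODUCTION_PREFIXES = {"1.1.1.0/24", "1.0.0.0/24"}   # prefixes users depend on

def validate_change(service: str, is_production: bool, prefixes: set) -> list:
    """Return a list of problems that should block the change (and page someone)."""
    problems = []
    touched = prefixes & PRODUCTION_PREFIXES
    if touched and not is_production:
        problems.append(
            f"non-production service {service!r} would take ownership of "
            f"production prefixes {sorted(touched)}"
        )
    return problems

# The June change, replayed through the guard:
issues = validate_change("staging-experiment", is_production=False,
                         prefixes={"1.1.1.0/24", "1.0.0.0/24"})
if issues:
    # In a real pipeline this would fail the review and fire an alert, not print.
    print("BLOCKED:", *issues, sep="\n  ")
```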
Some Advice, Earned the Hard Way
If you want to avoid ending up in a Cloudflare-like situation — where a safer system exists but a legacy one still has sharp edges — here are a few suggestions:
1. Make migrations boring
The most successful migrations are the ones that feel like nothing happened. That means:
- Staged rollouts
- Canary testing
- Rehearsals in non-prod
- Alerts that actually fire before users notice anything
Migrations go sideways when they depend on manual oversight, tribal knowledge, or lucky timing. Automate the rollout. Pre-bake the rollback. Assume failure, plan for fast recovery.
Boring migrations are safe migrations.
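As a concrete, if toy, version of that advice, here is a sketch of a staged rollout with a canary stage, a health gate between stages, and a rollback prepared up front. The stage names, the health check, and the apply/rollback hooks are hypothetical stand-ins for whatever your deploy system actually does.

```python
import time

# Hypothetical rollout plan: one canary location, then a small slice, then everyone.
STAGES = [
    ["canary-dc"],
    ["dc-eu-1", "dc-us-1"],
    ["dc-eu-2", "dc-us-2", "dc-apac-1"],
]

def apply_change(dc: str) -> None:
    print(f"applying change in {dc}")       # stand-in for the real deploy step

def rollback(dcs: list) -> None:
    print(f"rolling back {dcs}")            # pre-baked: written and tested before rollout starts

def healthy(dc: str) -> bool:
    return True                             # stand-in: e.g. "does the resolver still answer here?"

def rollout() -> bool:
    done = []
    for stage in STAGES:
        for dc in stage:
            apply_change(dc)
            done.append(dc)
        time.sleep(1)                       # bake time (minutes or hours in real life) so alerts can fire
        if not all(healthy(dc) for dc in stage):
            rollback(done)                  # assume failure; recover fast
            return False
    return True

if __name__ == "__main__":
    rollout()
```

The structure matters more than the details: each stage is small enough to fail safely, and the rollback path exists before the first data center is touched.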
2. Track migration debt like tech debt
If half your systems are on “the new thing” and half are still running “the old thing,” you’re carrying migration debt.
It adds cognitive load, slows you down, and increases risk.
You wouldn’t ship code with broken tests or missing logs — don’t ship a platform that’s half-migrated.
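A low-tech way to do that is to make the debt a number someone owns. The inventory, states, and threshold below are made up; the point is that a half-finished migration shows up in a report instead of living only in people’s heads.

```python
# Toy "migration debt" report: make the half-migrated state visible, like tech debt.

inventory = {
    "dns-topology":   "legacy",     # still on the old system
    "edge-configs":   "migrated",
    "cert-rotation":  "migrated",
    "routing-policy": "legacy",
}

legacy = [name for name, state in inventory.items() if state == "legacy"]
debt = len(legacy) / len(inventory)

print(f"migration debt: {debt:.0%} of systems still on the old path -> {legacy}")
if debt > 0.25:   # arbitrary threshold; what matters is that someone watches the number
    print("flag it in planning, just like failing tests or missing logs")
```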
3. Reward migration work
This one’s cultural.
Migrations are often invisible labor — not flashy, not résumé-building. But they’re essential for reliability and operational maturity.
So treat them that way. Give engineers time. Give them support. Make migrations part of your planning cycles, not afterthoughts.
Better yet: start treating migration engineering as a career track.
Final Thoughts
Cloudflare’s incident wasn’t about malicious actors or exotic bugs.
It was a config mistake — made worse by an unfinished migration and a legacy system still in play.
Most companies aren’t Cloudflare. They don’t have the global scale — but they do have fragile systems, deferred migrations, and mounting complexity.
The lesson here isn’t just “use staged rollouts.” It’s to finish the migrations you start.
Because if you don’t build that migration muscle, the system will fail when you least expect it, and then, like Cloudflare, the whole Internet might notice.