On July 14, 2025, something strange happened on the Internet.
Cloudflare’s ultra-popular DNS service — 1.1.1.1 — suddenly stopped working.
For over an hour, anyone relying on 1.1.1.1 as their resolver couldn’t look up domain names.
To many users, it felt like the entire Internet had gone dark.
And naturally, confusion followed:
“Is 1.1.1.1 under attack?”
“Was this a BGP routing issue?”
“What just happened to Cloudflare DNS?”
Let’s break it down — clearly, step by step.
What Actually Happened
Cloudflare accidentally withdrew its 1.1.1.1 IP addresses from the global Internet.
The cause?
A misconfiguration in the way Cloudflare manages its internal service topologies.
This wasn’t an attack. It wasn’t a hijack. It was an internal mistake — and a serious one.
What Went Wrong
Here’s the simple version (a toy sketch follows the list):
- Cloudflare maintains complex routing rules called service topologies.
- These topologies determine which IPs are active where.
- In June, someone made a quiet mistake — they linked 1.1.1.1’s IPs to a non-production service.
- That misconfiguration sat quietly for over a month.
- On July 14, a routine update to that non-production system accidentally pulled 1.1.1.1 offline worldwide.
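To make that failure mode concrete, here is a minimal toy sketch. It is not Cloudflare’s actual system; the service names, data model, and refresh() function are assumptions for illustration only. The point is that linking production prefixes to a non-production service does nothing visible until a later global refresh withdraws the routes everywhere at once.

```python
# Toy model of a "service topology": which service owns which prefixes, and
# which locations advertise them. All names and structure here are hypothetical.

PRODUCTION_DNS = "public-dns"        # the real resolver service
NON_PROD = "staging-experiment"      # a non-production service that runs nowhere

# Current global routing state: prefix -> locations currently advertising it.
advertised = {
    "1.1.1.0/24": {"ams", "sfo", "sin"},
    "1.0.0.0/24": {"ams", "sfo", "sin"},
}

# June: the latent mistake. The resolver prefixes get attached to the
# non-production service. Nothing visible changes, because nothing is refreshed yet.
topology = {
    PRODUCTION_DNS: [],
    NON_PROD: ["1.1.1.0/24", "1.0.0.0/24"],
}
locations_for = {
    PRODUCTION_DNS: {"ams", "sfo", "sin"},
    NON_PROD: set(),                 # a non-production service is deployed to zero locations
}

def refresh(service: str) -> None:
    """Re-derive the advertised state for every prefix owned by `service`.

    The refresh is global: it rewrites all locations at once, which is what
    turns a quiet misconfiguration into a worldwide withdrawal.
    """
    for prefix in topology[service]:
        advertised[prefix] = set(locations_for[service])   # empty set == withdrawn

# July 14: a routine update to the non-production service triggers the refresh.
refresh(NON_PROD)
print(advertised)   # {'1.1.1.0/24': set(), '1.0.0.0/24': set()}  -> withdrawn everywhere
```

The dangerous part isn’t the June edit itself; it’s that the system happily carries an inconsistent state until something asks it to reconcile, and then reconciles everywhere at once.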
So… Was This a BGP Hijack?
No — but it briefly looked like one.
When Cloudflare accidentally stopped advertising 1.1.1.0/24, that prefix was suddenly seen as originated by Tata Communications (AS4755).
Another network originating a prefix it doesn’t own is the textbook signature of a BGP origin hijack, but in this case it wasn’t the root cause: the stray announcement only mattered because Cloudflare had withdrawn its own routes.
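To see why it briefly looked like a hijack, here is a toy, single-viewpoint sketch. The ASNs are the real ones from the incident; everything else is deliberately simplified and hypothetical. Once the legitimate origin withdraws, whatever announcement remains is the only route left, which from the outside is exactly what an origin hijack looks like.

```python
# Toy, single-viewpoint model of which origin AS appears to "own" a prefix.
# ASNs match the incident; the data model is simplified for illustration.

PREFIX = "1.1.1.0/24"

# Routes an observer might see for the prefix: origin ASN -> description.
routes = {
    13335: "Cloudflare (legitimate anycast origin, preferred almost everywhere)",
    4755:  "Tata Communications (stray announcement, normally irrelevant)",
}

def apparent_origins(routes: dict) -> list:
    """Origins still announcing PREFIX. With the legitimate route withdrawn,
    the stray announcement is the only path left, so it *looks* like a hijack."""
    return sorted(routes)

print("normal day:       ", apparent_origins(routes))   # [4755, 13335] -> traffic follows 13335

# July 14: Cloudflare withdraws its advertisement; only the stray route remains.
del routes[13335]
print("during the outage:", apparent_origins(routes))   # [4755] -> classic origin-hijack signature
```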
Why Did This Happen?
Cloudflare’s routing configuration is mid-migration: a newer, safer system exists, but a legacy one is still in play, and that is where this change ran.
- The June config wasn’t caught — no alerts fired.
- The July refresh triggered a global topology rewrite.
- There was no staged deployment — the change hit all data centers at once.
Even though the change was peer-reviewed, it lacked safeguards like canary testing or progressive rollout.
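One safeguard that would have helped is a check on the change itself before it is applied anywhere. The sketch below is hypothetical (the data model, prefix list, and validate_change() function are made up for illustration): it blocks a non-production change that would take ownership of production prefixes instead of letting it sit silently for a month.

```python
# Hypothetical pre-apply guard for topology changes; not Cloudflare's real pipeline.

PRODUCTION_PREFIXES = {"1.1.1.0/24", "1.0.0.0/24"}   # prefixes users depend on

def validate_change(service: str, is_production: bool, prefixes: set) -> list:
    """Return a list of problems that should block the change (and page someone)."""
    problems = []
    touched = prefixes & PRODUCTION_PREFIXES
    if touched and not is_production:
        problems.append(
            f"non-production service {service!r} would take ownership of "
            f"production prefixes {sorted(touched)}"
        )
    return problems

# The June change, replayed through the guard:
issues = validate_change("staging-experiment", is_production=False,
                         prefixes={"1.1.1.0/24", "1.0.0.0/24"})
if issues:
    # In a real pipeline this would fail the review and fire an alert, not print.
    print("BLOCKED:", *issues, sep="\n  ")
```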
Some Advice, Earned the Hard Way
If you want to avoid ending up in a Cloudflare-like situation — where a safer system exists but a legacy one still has sharp edges — here are a few suggestions:
1. Make migrations boring
The most successful migrations are the ones that feel like nothing happened. That means:
- Staged rollouts
- Canary testing
- Rehearsals in non-prod
- Alerts that actually fire before users notice anything
Migrations go sideways when they depend on manual oversight, tribal knowledge, or lucky timing. Automate the rollout. Pre-bake the rollback. Assume failure, plan for fast recovery.
Boring migrations are safe migrations.
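As a concrete, if toy, version of that advice, here is a sketch of a staged rollout with a canary stage, a health gate between stages, and a rollback prepared up front. The stage names, the health check, and the apply/rollback hooks are hypothetical stand-ins for whatever your deploy system actually does.

```python
import time

# Hypothetical rollout plan: one canary location, then a small slice, then everyone.
STAGES = [
    ["canary-dc"],
    ["dc-eu-1", "dc-us-1"],
    ["dc-eu-2", "dc-us-2", "dc-apac-1"],
]

def apply_change(dc: str) -> None:
    print(f"applying change in {dc}")       # stand-in for the real deploy step

def rollback(dcs: list) -> None:
    print(f"rolling back {dcs}")            # pre-baked: written and tested before rollout starts

def healthy(dc: str) -> bool:
    return True                             # stand-in: e.g. "does the resolver still answer here?"

def rollout() -> bool:
    done = []
    for stage in STAGES:
        for dc in stage:
            apply_change(dc)
            done.append(dc)
        time.sleep(1)                       # bake time (minutes or hours in real life) so alerts can fire
        if not all(healthy(dc) for dc in stage):
            rollback(done)                  # assume failure; recover fast
            return False
    return True

if __name__ == "__main__":
    rollout()
```

The structure matters more than the details: each stage is small enough to fail safely, and the rollback path exists before the first data center is touched.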
2. Track migration debt like tech debt
If half your systems are on “the new thing” and half are still running “the old thing,” you’re carrying migration debt.
It adds cognitive load, slows you down, and increases risk.
You wouldn’t ship code with broken tests or missing logs — don’t ship a platform that’s half-migrated.
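A low-tech way to do that is to make the debt a number someone owns. The inventory, states, and threshold below are made up; the point is that a half-finished migration shows up in a report instead of living only in people’s heads.

```python
# Toy "migration debt" report: make the half-migrated state visible, like tech debt.

inventory = {
    "dns-topology":   "legacy",     # still on the old system
    "edge-configs":   "migrated",
    "cert-rotation":  "migrated",
    "routing-policy": "legacy",
}

legacy = [name for name, state in inventory.items() if state == "legacy"]
debt = len(legacy) / len(inventory)

print(f"migration debt: {debt:.0%} of systems still on the old path -> {legacy}")
if debt > 0.25:   # arbitrary threshold; what matters is that someone watches the number
    print("flag it in planning, just like failing tests or missing logs")
```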
3. Reward migration work
This one’s cultural.
Migrations are often invisible labor — not flashy, not résumé-building. But they’re essential for reliability and operational maturity.
So treat them that way. Give engineers time. Give them support. Make migrations part of your planning cycles, not afterthoughts.
Better yet: start treating migration engineering as a career track.
Final Thoughts
Cloudflare’s incident wasn’t about malicious actors or exotic bugs.
It was a config mistake — made worse by an unfinished migration and a legacy system still in play.
Most companies aren’t Cloudflare. They don’t have the global scale — but they do have fragile systems, deferred migrations, and mounting complexity.
The lesson here isn’t just “use staged rollouts.” It’s to finish the migrations you start.
Because if you don’t build that migration muscle, the system will fail when you least expect it, and then, like Cloudflare, the whole Internet might notice.