Cloudflare’s Worst Outage Since 2019 Disrupts Global Internet

A corrupted internal file triggered cascading failures that took down some of the world’s most-used platforms within minutes.

    Cloudflare, the internet infrastructure firm that routes and secures traffic for millions of sites, suffered its most serious global outage since 2019 on Tuesday, November 18, knocking out major services including ChatGPT, GitHub and X.

    The disruption began at 11:20 UTC and spread within minutes, triggering waves of 5xx errors across websites, apps and APIs.

    Initial fears of a cyberattack faded as Cloudflare said the breakdown stemmed from an internal change that spiraled into a network-wide failure. 

    Cloudflare CEO Matthew Prince said the collapse began with a permissions update in one of its ClickHouse database clusters. The update inadvertently allowed the system to produce duplicate entries in a crucial “feature file” used by Cloudflare’s Bot Management module, a system that analyzes and scores whether each request across the network is from a human or a bot.

    This feature file, refreshed every few minutes and distributed rapidly worldwide, suddenly doubled in size. That pushed it past the memory limits of the software running on Cloudflare’s global network of machines, causing those systems to fail as they attempted to read it.
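    The failure mode can be illustrated with a minimal sketch (all names and limits here are invented; Cloudflare’s actual module is not public): a loader that enforces a fixed capacity on the feature file, which a file doubled by duplicate entries will exceed.

```rust
// Hypothetical sketch of a feature-file loader with a hard capacity,
// loosely modeled on the failure described above. Names are invented.

const MAX_FEATURES: usize = 200; // assumed preallocated limit

fn load_features(lines: &[&str]) -> Result<Vec<String>, String> {
    if lines.len() > MAX_FEATURES {
        // A file bloated by duplicate entries trips this limit; code that
        // assumes the limit can never be hit will crash instead of erroring.
        return Err(format!(
            "feature file has {} entries, exceeds limit of {}",
            lines.len(),
            MAX_FEATURES
        ));
    }
    Ok(lines.iter().map(|s| s.to_string()).collect())
}

fn main() {
    let good: Vec<&str> = (0..150).map(|_| "feat").collect();
    let doubled: Vec<&str> = (0..300).map(|_| "feat").collect();
    assert!(load_features(&good).is_ok());
    assert!(load_features(&doubled).is_err());
    println!("doubled file rejected instead of crashing the proxy");
}
```

    The difference between returning an error here and assuming the limit can never be hit is exactly the difference between a degraded response and a crashed process.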

    What followed was a chaotic, fluctuating cycle: every five minutes, as the feature file regenerated, some machines received a valid version and recovered, while others received the corrupted file and crashed again. This rise-and-fall pattern made the outage appear at first like a hyperscale DDoS attack.
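    The flapping can be sketched as a toy model (every detail below is invented for illustration): each regeneration cycle, the file is built by whichever database node happens to serve the query, and only some of those nodes produce the corrupted output, so the fleet-wide error rate oscillates instead of staying flat.

```rust
// Toy model of the flapping pattern. Which "replica" builds each cycle's
// file is faked deterministically; real behavior was effectively random.

fn file_is_valid(cycle: u32) -> bool {
    // Stand-in for "which database node built this cycle's file".
    cycle % 3 != 0
}

fn fleet_error_rate(cycle: u32) -> f64 {
    if file_is_valid(cycle) { 0.0 } else { 1.0 }
}

fn main() {
    let rates: Vec<f64> = (0..6).map(fleet_error_rate).collect();
    // The rate alternates between failure and recovery, which is why the
    // outage initially resembled a wave-style DDoS rather than a bad config.
    println!("{rates:?}");
}
```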

    Adding to the confusion, Cloudflare’s own status page, hosted entirely off its infrastructure, also went down at the same time due to an unrelated issue, intensifying internal suspicion of a coordinated attack. 

    The company has since issued a detailed postmortem, acknowledging both the scale of the failure and the gravity of its consequences.

    A Global Choke Point

    Cloudflare serves millions of websites and apps, acting as the traffic cop between users and the internet’s most popular destinations. When its core proxy system began returning 5xx errors, everything from websites to mobile apps to APIs stalled worldwide.

    Cloudflare runs two generations of that proxy. The new FL2 engine, to which the company is migrating customers, returned the 5xx errors. Traffic on the older FL engine didn’t throw errors, but it received incorrect bot scores, labeling every request with a score of zero, a result that would have triggered mass false positives for customers who use bot detection rules.
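    Why a score of zero is so damaging is easy to see in a sketch of a typical customer rule (the threshold and rule shape below are assumptions, not Cloudflare’s actual defaults): any rule that blocks requests scoring below a bot-likelihood threshold blocks all traffic once every request scores zero.

```rust
// Hypothetical bot-management rule: lower scores mean "more likely a bot",
// and customers block requests below a threshold of their choosing.
// A bug that forces every score to 0 therefore blocks every request.

fn should_block(bot_score: u8, threshold: u8) -> bool {
    bot_score < threshold
}

fn main() {
    let threshold = 30;   // invented customer setting
    let human_score = 85; // a typical human-looking request
    let buggy_score = 0;  // every request during the incident
    assert!(!should_block(human_score, threshold));
    assert!(should_block(buggy_score, threshold));
    println!("with all scores forced to 0, all traffic is blocked");
}
```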

    Beyond failed requests, Cloudflare’s internal debugging systems consumed excessive CPU cycles during the crisis, inflating latency across its content delivery network.

    By 14:30 UTC, engineers identified the root cause and stopped the propagation of the faulty file. They manually replaced it with a known-good configuration and restarted the core proxy infrastructure. Systems gradually stabilized, fully returning to normal by 17:06 UTC.

    Cloudflare’s Apology

    In its account of the outage, Cloudflare’s CEO issued an apology:

    “We let the Internet down today. On behalf of the entire Cloudflare team, I’m sorry.”

    The company described this as its most severe outage since 2019, one that resulted not just in dashboard downtime or peripheral service disruptions, but a stoppage of core global traffic.

    Cloudflare’s engineers emphasized that the failure exposed gaps in internal safeguards, particularly around the ingestion of automated configuration files and error-handling mechanisms.
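    One common safeguard for this class of failure, in the spirit of what the postmortem points toward, is to validate a freshly generated configuration before deploying it and to keep serving the last known-good copy when validation fails. A minimal sketch, with all names and limits hypothetical:

```rust
// Hypothetical "last known-good" pattern for ingesting generated config:
// reject an invalid candidate and keep the previous copy rather than
// propagating it. Struct, fields, and limits are invented for illustration.

struct ConfigStore {
    current: Vec<String>, // last known-good feature list
}

impl ConfigStore {
    fn try_update(&mut self, candidate: Vec<String>, max_len: usize) -> bool {
        if candidate.is_empty() || candidate.len() > max_len {
            // Oversized or empty file: refuse it, keep serving the old one.
            return false;
        }
        self.current = candidate;
        true
    }
}

fn main() {
    let mut store = ConfigStore { current: vec!["feat_a".into()] };
    let doubled: Vec<String> = vec!["feat".into(); 500];
    let applied = store.try_update(doubled, 200);
    assert!(!applied);
    assert_eq!(store.current.len(), 1); // old config still in place
    println!("corrupted update rejected; last known-good retained");
}
```

    The same idea applies at the fleet level: a file that fails validation at generation time never needs to be caught by every machine that downloads it.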
