Cloudflare, a leading provider of internet infrastructure and security services, has released a comprehensive post-mortem report regarding a significant service outage experienced yesterday. The company has unequivocally confirmed that the widespread disruption was not caused by a security incident, and, crucially, no customer data was lost during the event.
While the issue has been largely mitigated, the outage, which began at 17:52 UTC yesterday, had a substantial impact across a wide array of Cloudflare’s edge computing and AI services. This transparency from Cloudflare is essential for maintaining trust and providing clarity to the millions of users and businesses that rely on their services daily. The incident highlights the intricate dependencies within modern cloud ecosystems and the challenges of maintaining uninterrupted service in a complex global network.
Understanding the Root Cause: Workers KV Failure
The core of yesterday’s widespread outage was traced to a critical component within Cloudflare’s architecture.
The Role of Workers KV
The primary catalyst for the service disruption was the Workers KV (key-value) system going completely offline. Workers KV is a globally distributed, low-latency, eventually consistent key-value store. It serves as a fundamental building block for Cloudflare Workers, the company’s serverless computing platform.
This means that Workers KV is deeply embedded in the operational fabric of numerous Cloudflare services. Its role is critical for storing and retrieving essential data such as configuration settings, authentication tokens, and static assets. Because of this foundational role, a failure in Workers KV inevitably triggers cascading issues across the many dependent components in Cloudflare’s vast network. The system’s global distribution is designed for high availability and low latency, which makes its complete failure a particularly impactful event.
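To make the dependency concrete, here is a minimal sketch of a Worker that reads configuration from a KV binding. The binding name CONFIG_KV and the key are hypothetical, and the types assume @cloudflare/workers-types; every uncached read of this kind fails when the storage behind Workers KV is unreachable:

```typescript
// Minimal sketch of a Worker reading configuration from Workers KV.
// The CONFIG_KV binding name and "site-config" key are illustrative only.
export interface Env {
  CONFIG_KV: KVNamespace; // type provided by @cloudflare/workers-types
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    // Uncached reads go to KV's central storage; during the outage these failed.
    const config = await env.CONFIG_KV.get("site-config", "json");
    if (config === null) {
      return new Response("Configuration unavailable", { status: 503 });
    }
    return new Response(JSON.stringify(config), {
      headers: { "content-type": "application/json" },
    });
  },
};
```

Many Cloudflare products perform reads like this on every request, which is why the storage failure surfaced so broadly.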
Third-Party Cloud Provider Outage Identified
In its detailed post-mortem analysis, Cloudflare pinpointed the root cause of the outage. The company stated that the disruption, which lasted for almost 2.5 hours, stemmed from a failure in the underlying storage infrastructure used by their Workers KV service. A critical revelation was that “part of this infrastructure is backed by a third-party cloud provider.” This external provider experienced its own outage, which directly impacted the availability and functionality of Cloudflare’s Workers KV service.
Cloudflare explicitly stated: “The cause of this outage was due to a failure in the underlying storage infrastructure used by our Workers KV service, which is a critical dependency for many Cloudflare products and relied upon for configuration, authentication, and asset delivery across the affected services.”
This dependency on an external vendor for a core service highlights a common vulnerability in interconnected cloud environments, where even major infrastructure providers can be affected by issues beyond their direct control. The incident underscores the importance of diversifying dependencies to enhance overall system resilience.
Widespread Impact Across Cloudflare Services
The failure of Workers KV had a ripple effect, causing significant disruptions across numerous Cloudflare products, some of which also impacted other widely used services.
Critical Service Failures
Cloudflare meticulously detailed the specific impact on each of its affected services:
- Workers KV: As the origin of the problem, Workers KV experienced a 90.22% failure rate. This high error rate was due to the unavailability of its backend storage, directly affecting all uncached reads and writes to the system. This meant that any service attempting to retrieve fresh data or store new information in Workers KV faced overwhelming failures.
- Access, WARP, Gateway: These services, integral to identity-based authentication, secure session handling, and policy enforcement, suffered critical failures. Their reliance on Workers KV for these functions meant that users experienced significant issues. Specifically, WARP was unable to register new devices, preventing new users from connecting securely, and Gateway proxying and DoH (DNS over HTTPS) queries were severely disrupted, impacting secure internet browsing and access controls.
- Dashboard, Turnstile, Challenges: The Cloudflare Dashboard, which customers use to manage their services, experienced widespread login failures. CAPTCHA-style verification via Challenges and Turnstile, Cloudflare’s bot detection and security solution, suffered significant disruptions. Activating Turnstile’s kill switch, a measure taken to mitigate the ongoing issues, also introduced a token reuse risk (see the verification sketch after this list).
- Browser Isolation & Browser Rendering: These security and performance services, which rely on secure link-based sessions and remote browser rendering, failed to initiate or maintain operations. This was a direct consequence of the cascading failures in the underlying Access and Gateway services, demonstrating the interconnectedness of Cloudflare’s offerings.
- Stream, Images, Pages: Services focused on content delivery and web development were severely impacted. Stream playback and live streaming functionalities failed entirely, leading to interruptions for video content. Image uploads dropped to a 0% success rate, preventing new media from being processed. Furthermore, Pages builds and serving peaked at approximately 100% failure, meaning websites hosted on Cloudflare Pages were largely inaccessible or unable to update.
- Workers AI & AutoRAG: Cloudflare’s burgeoning AI services, Workers AI and AutoRAG, were rendered completely unavailable. Their dependence on Workers KV for critical functions like model configuration, routing, and indexing meant that these cutting-edge AI capabilities could not function, halting various AI applications and agents.
- Durable Objects, D1, Queues: These developer-focused services, built on a similar storage layer to Workers KV, also faced severe degradation. They experienced up to 22% error rates or, in some cases, complete unavailability for message queuing and data operations. This impacted applications relying on stateful serverless functions, database access, and inter-service communication.
- Realtime & AI Gateway: These services faced near-total service disruption due to their inability to retrieve essential configuration data from Workers KV. Realtime TURN/SFU (Traversal Using Relays around NAT/Selective Forwarding Unit) services for real-time communication and AI Gateway requests were heavily impacted, leading to failures in interactive applications and AI model interactions.
- Zaraz & Workers Assets: Services like Zaraz (a tag manager) and Workers Assets (for static file delivery) saw full or partial failure in loading or updating configurations and static assets. While the impact on end-users for these specific services was relatively limited in scope, it still highlighted the pervasive nature of the underlying KV dependency.
- CDN, Workers for Platforms, Workers Builds: Even Cloudflare’s core CDN (Content Delivery Network), while generally resilient, experienced increased latency and regional errors in some locations. New Workers builds failed at a 100% rate during the incident, preventing developers from deploying new serverless functions. Workers for Platforms, which allows other companies to build on Cloudflare’s Workers platform, also saw disruptions.
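To see why bypassing Turnstile verification creates a token reuse risk, consider the standard server-side check against the publicly documented siteverify endpoint. The sketch below uses a hypothetical helper and secret handling; under normal operation each token is validated, and consumed, exactly once, so skipping this step removes that replay protection:

```typescript
// Hedged sketch of standard server-side Turnstile validation.
// The helper name and secret handling are illustrative.
async function verifyTurnstileToken(token: string, secret: string): Promise<boolean> {
  const body = new URLSearchParams({ secret, response: token });
  const resp = await fetch(
    "https://challenges.cloudflare.com/turnstile/v0/siteverify",
    { method: "POST", body },
  );
  const outcome = (await resp.json()) as { success: boolean };
  // A token normally validates only once; presenting it a second time fails.
  // When verification is bypassed (as under the kill switch), that replay
  // protection is lost, which is the reuse risk described above.
  return outcome.success;
}
```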
The incident coincided with a broader outage affecting Google Cloud Platform, which disrupted other widely used services at the same time, underscoring the interconnectedness of cloud infrastructure and the potential for failures to ripple across the internet.
Cloudflare’s Response and Future Resilience Plans
In the wake of this significant outage, Cloudflare has outlined immediate and long-term strategies to bolster its infrastructure and prevent similar incidents.
Eliminating Single-Provider Reliance
Cloudflare has announced that it will be accelerating several resilience-focused changes in response to this outage. A primary and crucial initiative is the elimination of reliance on a single third-party cloud provider for Workers KV backend storage. This strategic shift acknowledges the vulnerability exposed by the recent incident.
By diversifying their storage infrastructure, Cloudflare aims to prevent a single point of failure from causing such widespread disruption again. This move is a testament to learning from incidents and proactively building more robust systems.
Migrating to R2 Object Storage
A key component of this diversification strategy is the gradual migration of Workers KV’s central store to Cloudflare’s own R2 object storage. R2 is Cloudflare’s S3-compatible object storage service with zero egress fees, giving the company both cost-effectiveness and direct control over the underlying infrastructure.
By moving a critical dependency like Workers KV to their internally managed and optimized R2 storage, Cloudflare significantly reduces its external dependency on third-party cloud providers. This migration is a complex undertaking but is crucial for enhancing the overall stability and reliability of their entire platform, particularly for services that are heavily reliant on Workers KV. This internal migration will grant Cloudflare more direct control over the performance, availability, and resilience of the data store.
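As an illustration of the general idea (not Cloudflare’s internal design), the sketch below shows a read path that prefers a KV namespace but can fall back to a copy of the same object in an R2 bucket if the KV read fails. The binding names CONFIG_KV and CONFIG_BUCKET are hypothetical, and the types again assume @cloudflare/workers-types:

```typescript
// Illustrative "don't depend on a single store" read path.
// CONFIG_KV and CONFIG_BUCKET are hypothetical bindings declared in wrangler.toml.
export interface Env {
  CONFIG_KV: KVNamespace;
  CONFIG_BUCKET: R2Bucket;
}

async function readConfig(env: Env, key: string): Promise<string | null> {
  try {
    const fromKV = await env.CONFIG_KV.get(key);
    if (fromKV !== null) {
      return fromKV;
    }
  } catch {
    // KV backend unavailable; fall through to the R2 replica.
  }
  const fromR2 = await env.CONFIG_BUCKET.get(key);
  return fromR2 ? await fromR2.text() : null;
}
```

Keeping a second, independently operated copy of critical data is the essence of removing the single point of failure described above.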
Implementing Cross-Service Safeguards and New Tooling
Beyond infrastructure changes, Cloudflare is also focusing on operational improvements. The company plans to implement cross-service safeguards. These safeguards will be designed to isolate failures and prevent a problem in one service from cascading and overwhelming others. This involves building more robust error handling and fault tolerance mechanisms between different Cloudflare products.
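Cloudflare has not published the details of these safeguards, but a common pattern for isolating a failing dependency is a circuit breaker: after repeated failures, callers stop hammering the dependency and serve a degraded fallback instead. A minimal sketch, with arbitrary thresholds and purely illustrative naming:

```typescript
// Minimal circuit-breaker sketch illustrating fault isolation between services.
// Thresholds and timings are arbitrary; this is not Cloudflare's internal mechanism.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private maxFailures = 5, private cooldownMs = 30_000) {}

  async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    const open =
      this.failures >= this.maxFailures &&
      Date.now() - this.openedAt < this.cooldownMs;
    if (open) return fallback(); // short-circuit: stop hitting a failing dependency
    try {
      const result = await fn();
      this.failures = 0; // success closes the breaker
      return result;
    } catch {
      this.failures++;
      if (this.failures >= this.maxFailures) this.openedAt = Date.now();
      return fallback(); // degrade gracefully instead of cascading the failure
    }
  }
}
```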
Furthermore, Cloudflare intends to develop new tooling to gradually restore services during storage outages. This new tooling will be crucial for managing the recovery process more effectively, preventing scenarios where a sudden surge of traffic to recovering systems could overwhelm them and cause secondary failures.
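Again, the specific tooling has not been described publicly, but one simple way to restore traffic gradually is to admit only a growing fraction of requests to the recovering backend. A sketch with an illustrative ten-minute ramp:

```typescript
// Sketch of gradual admission during recovery: only a growing fraction of
// requests reaches the recovering backend, so a thundering herd cannot
// knock it over again. The ramp window is arbitrary.
function admitDuringRecovery(recoveryStartMs: number, rampMs = 10 * 60_000): boolean {
  const elapsed = Date.now() - recoveryStartMs;
  const admitFraction = Math.min(1, Math.max(0, elapsed / rampMs)); // 0 -> 1 over the ramp
  return Math.random() < admitFraction;
}

// Requests that are not admitted can be served from cache, given a stale
// cached response, or asked to retry later.
```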
Such intelligent recovery mechanisms are vital for minimizing downtime and ensuring a smoother restoration process in the event of future disruptions. This proactive approach to resilience, driven by the lessons learned from this outage, is a commitment to ensuring Cloudflare’s services remain highly available and reliable for its global user base.