Analyzing a Bottleneck in ClickHouse: Cloudflare's Experience
Cloudflare encountered a significant issue with their billing pipeline when a hidden bottleneck in their ClickHouse online analytical processing (OLAP) database caused delays. This problem, which followed a migration, had severe implications for invoice reconciliation and revenue streams. In this analysis, we explore the technical challenges, the root cause, and the solutions that were implemented to resolve the issue.
Overview of ClickHouse Usage at Cloudflare
ClickHouse is a core component of Cloudflare's operations, serving as a high-performance OLAP database for very large volumes of data. The platform processes millions of queries daily to calculate billing for usage of Cloudflare's products. With over a hundred petabytes of data spread across dozens of clusters, it supports revenue management, fraud detection systems, and more.
To streamline data ingestion for internal teams, Cloudflare introduced the ReadyAnalytics system in 2022. This centralized solution lets teams feed data into a single, massive table, with each dataset distinguished by a namespace. Every record uses a standardized schema made up of multiple float fields, string fields, a timestamp, and an indexID. Because all datasets share this schema, they can be sorted and queried in the same way.
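To make the shared schema concrete, here is a minimal sketch of what one record might look like. The class name, field names, field counts, and the type of indexID are all assumptions made for illustration; only the kinds of fields come from the description above. The sort order uses the namespace, indexID, and timestamp ordering that this article later describes as the table's primary key.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Record:
    """One row in the shared table. Field names and counts are assumptions."""
    namespace: str                    # identifies the owning team or dataset
    index_id: str                     # the indexID from the schema; its type is assumed
    timestamp: datetime
    floats: tuple[float, ...] = ()    # stand-in for the float fields
    strings: tuple[str, ...] = ()     # stand-in for the string fields

    def sort_key(self) -> tuple[str, str, datetime]:
        # Order by namespace, then indexID, then timestamp.
        return (self.namespace, self.index_id, self.timestamp)


def utc(day: int) -> datetime:
    """Helper for the example data: a UTC date in December 2024."""
    return datetime(2024, 12, day, tzinfo=timezone.utc)


records = [
    Record("dns", "zone-2", utc(2), floats=(12.0,)),
    Record("billing", "acct-9", utc(3), strings=("invoice",)),
    Record("dns", "zone-1", utc(5)),
    Record("dns", "zone-2", utc(1)),
]

# Rows from the same namespace end up next to each other once sorted.
ordered = sorted(records, key=Record.sort_key)
```

Sorting this way groups each namespace's rows together, which is what lets a query for one team's data skip everyone else's.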
Impact of the Bottleneck on Billing Operations
The bottleneck was especially damaging because it prevented the daily aggregation jobs in ClickHouse from finishing on time. These jobs are essential for generating accurate invoices for Cloudflare's customers. Processing delays made invoices harder to reconcile, which in turn affected hundreds of millions of dollars in revenue and the systems that depend on it.
Initially, the usual performance metrics (I/O, memory usage, rows scanned, and parts read) all appeared normal, which made the issue hard to diagnose. As the slowdown persisted, it became clear that the root cause was buried deeper within the system.
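As a rough illustration of what a daily aggregation job computes, the sketch below sums usage per customer per UTC day. The row layout (customer id, timestamp, usage amount) and the function name are hypothetical; the article does not describe the actual jobs, which run as queries inside ClickHouse rather than in Python.

```python
from collections import defaultdict
from datetime import date, datetime, timezone


def aggregate_daily(rows):
    """Sum usage per (customer, UTC day). The row layout is hypothetical."""
    totals = defaultdict(float)
    for customer_id, timestamp, usage in rows:
        day = timestamp.astimezone(timezone.utc).date()
        totals[(customer_id, day)] += usage
    return dict(totals)


rows = [
    ("cust-1", datetime(2024, 12, 1, 9, 0, tzinfo=timezone.utc), 1.5),
    ("cust-1", datetime(2024, 12, 1, 21, 0, tzinfo=timezone.utc), 2.5),
    ("cust-1", datetime(2024, 12, 2, 0, 5, tzinfo=timezone.utc), 4.0),
    ("cust-2", datetime(2024, 12, 1, 12, 0, tzinfo=timezone.utc), 8.0),
]

daily = aggregate_daily(rows)
```

Each day's invoice figures depend on that day's totals being complete, so a job that finishes late delays everything downstream of it.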
Discovery of the Hidden Bottleneck
After a detailed investigation, Cloudflare engineers identified a hidden bottleneck in the internal workings of ClickHouse. It was related to how data was stored and processed, and in particular to the structure of the primary key. That key, made up of namespace, indexID, and timestamp, was not optimized for the growing scale and query demands of the database.
The issue was compounded by the system's retention policy. A single retention policy applied to every dataset, so data with very different access patterns and lifecycles was treated uniformly. This one-size-fits-all approach led to suboptimal resource utilization and hindered performance.
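To see why the primary key structure matters, it helps to recall how a sorted table with a sparse index narrows down what a query reads. The sketch below is a simplified model of that general idea, not Cloudflare's code and not the bottleneck itself: rows are sorted by (namespace, indexID, timestamp), the index keeps only the first key of each fixed-size block of rows (a "granule"), and a binary search over those keys selects the granules that may hold a namespace. The function names and the tiny granule size are assumptions for illustration; ClickHouse's default granule size is 8192 rows.

```python
from bisect import bisect_left, bisect_right

GRANULE_SIZE = 4  # tiny value for illustration only


def build_sparse_index(sorted_keys, granule_size=GRANULE_SIZE):
    """Keep only the key of the first row in each granule."""
    return [sorted_keys[i] for i in range(0, len(sorted_keys), granule_size)]


def granules_for_namespace(marks, namespace):
    """Return the granule numbers that may contain rows for `namespace`."""
    first_columns = [mark[0] for mark in marks]
    lo = bisect_left(first_columns, namespace)
    hi = bisect_right(first_columns, namespace)
    # A mark records only a granule's first row, so the granule just before
    # `lo` may still hold the first rows of this namespace at its tail.
    start = max(lo - 1, 0)
    return range(start, hi)


# Keys are (namespace, indexID, timestamp), already in sorted order.
keys = [
    ("billing", "a", 1), ("billing", "a", 2), ("billing", "b", 1), ("dns", "x", 1),
    ("dns", "x", 2), ("dns", "y", 1), ("dns", "y", 2), ("dns", "z", 1),
    ("waf", "k", 1), ("waf", "k", 2),
]
marks = build_sparse_index(keys)
dns_granules = list(granules_for_namespace(marks, "dns"))
```

In this toy data the "dns" rows fall in the first two of three granules, so the third is never read. The selection is deliberately conservative: it can include a granule that turns out to hold no matching rows, but it never skips one that does.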
The Role of the Retention Policy
The retention policy in question was designed to manage the vast amount of data stored in ClickHouse. However, the policy did not account for the unique requirements of different data namespaces. As a result, frequently queried data and rarely accessed data were retained in the same way, leading to unnecessary resource consumption and slower query performance.
This limitation became more pronounced as ReadyAnalytics grew, reaching over 2 petabytes of data and an ingestion rate of millions of rows per second by December 2024. The lack of a flexible retention policy became a critical bottleneck that required immediate attention.
Solutions Implemented to Resolve the Issue
Cloudflare's engineers addressed the bottleneck with three patches to ClickHouse. The patches optimized internal processes and addressed the inefficiencies in the retention policy, allowing more granular management of data retention and improving the performance of the primary key structure.
By tailoring the retention policy to the specific needs of different datasets, the engineers significantly reduced resource consumption and enhanced query performance. These changes restored the efficiency of the billing pipeline, ensuring timely invoice generation and reconciliation.
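The difference between a uniform policy and a tailored one can be sketched in a few lines. Everything specific here is an assumption for illustration: the namespace names, the retention periods, and the function names are hypothetical, and the article does not describe how the actual patches implement retention.

```python
from datetime import datetime, timedelta, timezone

UNIFORM_RETENTION = timedelta(days=30)  # assumed value, not from the article

# Hypothetical per-namespace overrides.
RETENTION_BY_NAMESPACE = {
    "billing": timedelta(days=90),
    "debug": timedelta(days=7),
}


def retention_for(namespace, per_namespace=True):
    """Pick the retention period, falling back to the uniform default."""
    if not per_namespace:
        return UNIFORM_RETENTION
    return RETENTION_BY_NAMESPACE.get(namespace, UNIFORM_RETENTION)


def is_expired(namespace, row_time, now, per_namespace=True):
    """True if a row is older than the retention period that applies to it."""
    return now - row_time > retention_for(namespace, per_namespace)


now = datetime(2024, 12, 31, tzinfo=timezone.utc)
ten_days_old = now - timedelta(days=10)
sixty_days_old = now - timedelta(days=60)

# Under the uniform policy, short-lived debug data lingers and
# long-lived billing data is dropped; per-namespace rules fix both.
debug_kept_uniform = not is_expired("debug", ten_days_old, now, per_namespace=False)
debug_kept_tailored = not is_expired("debug", ten_days_old, now)
billing_kept_uniform = not is_expired("billing", sixty_days_old, now, per_namespace=False)
billing_kept_tailored = not is_expired("billing", sixty_days_old, now)
```

With these example values, ten-day-old debug rows are kept under the uniform policy but dropped under the tailored one, while sixty-day-old billing rows go the other way.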
Lessons Learned and Broader Implications
This experience underscores the importance of continuous monitoring and optimization in large-scale analytics platforms. Even seemingly minor inefficiencies can have cascading effects, particularly in systems that handle high volumes of data and transactions. Cloudflare's proactive approach to diagnosing and resolving the issue highlights the value of thorough system analysis and targeted optimization.
As organizations increasingly rely on data-driven operations, the need for robust and scalable database solutions like ClickHouse becomes more critical. However, this case study also serves as a reminder that such systems require careful configuration and ongoing maintenance to perform effectively at scale.