Cloudflare Gen13 Edge Compute Overview
The Gen13 platform replaces the previous Gen12 fleet by shifting emphasis from large cache structures to a dense core count. This change targets higher throughput while keeping latency within the limits required by our SLAs. The Rust-based FL2 layer acts as the integration point that extracts the full performance potential of the new silicon.
Why Cache-Dominated Designs Limited Scaling
Large cache sizes historically provided low latency for memory‑intensive workloads, but they also imposed a ceiling on core density due to die area constraints. As request rates grew, the marginal benefit of additional cache diminished while the need for parallel execution rose. The architecture began to exhibit a mismatch between throughput demand and available compute resources.
When the Gen12 servers reached their maximum core count, any further performance gain required either larger cache banks or a redesign of the software stack. The existing FL1 layer was tightly coupled to the presence of abundant cache, causing a bottleneck when core counts increased without proportional cache growth. This situation forced engineers to evaluate whether the cache advantage outweighed the lost parallelism.
Empirical data showed that beyond a certain threshold, adding more cache produced diminishing returns on overall response time, while the lack of extra cores limited the ability to handle bursty traffic spikes. The SLAs demanded consistent latency, but the system struggled to maintain it under high concurrency without additional core resources. Consequently, a strategic pivot toward a core-centric design became necessary.
Transitioning away from a cache‑first mindset required a thorough audit of every processing stage to identify hidden cache dependencies. The audit revealed that the FL1 code path performed multiple cache-sensitive operations that inflated latency when cache was reduced. By isolating these hotspots, the team prepared a roadmap for a cache‑agnostic implementation.
Turin Processor Core Expansion and Power Profile
The Turin silicon introduces up to 192 cores per socket, effectively doubling the compute density compared to Gen12. At that core count, each core uses the dense Zen 5c variant of the Zen 5 micro‑architecture, which improves instruction flow without relying on massive cache. Power consumption per core drops by roughly 30 percent, allowing higher density without exceeding thermal limits.
Memory bandwidth on Turin scales to accommodate the larger core pool, delivering a sustained throughput increase that matches the expanded compute capacity. The DDR5‑6400 interface supplies sufficient data rates to keep the cores fed, reducing stalls that previously manifested as latency spikes. This balance between bandwidth and core count is essential for edge workloads.
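The relationship between memory speed and core count can be checked with simple arithmetic. The sketch below computes theoretical peak bandwidth only, not the sustained figure discussed above. The function names are illustrative, and the channel count is an input the text does not state; 12 channels per socket is an assumption here.

```rust
/// Theoretical peak bandwidth of one memory channel, in GB/s.
/// `transfers_mt_s` is the transfer rate in MT/s (6400 for DDR5-6400) and
/// `bus_bytes` is the channel data width in bytes (8 for a 64-bit channel).
fn channel_bandwidth_gbs(transfers_mt_s: u32, bus_bytes: u32) -> f64 {
    f64::from(transfers_mt_s) * f64::from(bus_bytes) / 1000.0
}

/// Returns (peak GB/s per socket, peak GB/s per core).
/// `channels` and `cores` are caller-supplied assumptions, not measurements.
fn bandwidth_per_core_gbs(
    transfers_mt_s: u32,
    bus_bytes: u32,
    channels: u32,
    cores: u32,
) -> (f64, f64) {
    let socket = channel_bandwidth_gbs(transfers_mt_s, bus_bytes) * f64::from(channels);
    (socket, socket / f64::from(cores))
}
```

With 6400 MT/s, 8-byte channels, an assumed 12 channels, and 192 cores, this works out to 51.2 GB/s per channel, 614.4 GB/s per socket, and 3.2 GB/s per core at peak. Real sustained bandwidth will be lower.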
Thermal design power (TDP) figures reveal that a fully populated Turin socket draws less than the combined power of two Gen12 sockets at peak load. The efficiency gains stem from architectural refinements that lower voltage swing and improve clock gating. As a result, data centers can host more servers within the same power envelope.
From a provisioning perspective, the higher core count simplifies capacity planning because a single Turin node can replace multiple older nodes while preserving the same SLAs. Operators benefit from reduced hardware footprint and fewer network hops, which in turn improves overall system reliability. The shift also opens opportunities for more granular workload isolation.
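A rough way to reason about node consolidation is to divide demand by usable per-node capacity. The function below is a hypothetical planning helper, not part of any provisioning tool described here; the capacity and headroom values passed to it are placeholders the operator would supply.

```rust
/// Number of nodes needed to serve `demand_rps` when each node can handle
/// `node_capacity_rps` and `headroom` (a fraction in 0.0..1.0) of that
/// capacity is held in reserve for bursts.
fn nodes_needed(demand_rps: f64, node_capacity_rps: f64, headroom: f64) -> u64 {
    assert!((0.0..1.0).contains(&headroom), "headroom must be in [0, 1)");
    assert!(node_capacity_rps > 0.0, "capacity must be positive");
    let usable = node_capacity_rps * (1.0 - headroom);
    // Round up: a fractional node still requires a whole machine.
    (demand_rps / usable).ceil() as u64
}
```

Calling this once with the older node's measured capacity and once with the newer node's gives the replacement ratio for a given traffic level.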
FL2 Rewrite: Removing Cache Dependency
The new FL2 layer was written in Rust to exploit safe concurrency while discarding assumptions about abundant cache. By redesigning data structures to be cache‑friendly rather than cache‑dependent, the code reduces memory pressure on the reduced cache hierarchy. This approach also improves predictability of latency across varied traffic patterns.
Key refactors include replacing monolithic buffers with segmented queues that align with core boundaries, thereby minimizing cross‑core cache contention. The scheduler now distributes work based on core availability rather than cache locality, which aligns with the hardware's strengths. These changes collectively raise the effective throughput per watt.
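The idea of per-core queue segments with availability-based dispatch can be sketched as follows. This is a minimal single-threaded illustration, not FL2's actual data structure: the type `SegmentedQueues` and its methods are hypothetical, and "availability" is approximated here by queue length.

```rust
use std::collections::VecDeque;

/// One queue segment per core, so each core mostly touches its own memory.
struct SegmentedQueues<T> {
    segments: Vec<VecDeque<T>>,
}

impl<T> SegmentedQueues<T> {
    fn new(cores: usize) -> Self {
        assert!(cores > 0, "need at least one core");
        Self {
            segments: (0..cores).map(|_| VecDeque::new()).collect(),
        }
    }

    /// Place work on the least-loaded segment (availability, not locality).
    /// Ties go to the lowest core index. Returns the chosen core.
    fn dispatch(&mut self, item: T) -> usize {
        let core = self
            .segments
            .iter()
            .enumerate()
            .min_by_key(|(_, queue)| queue.len())
            .map(|(index, _)| index)
            .expect("at least one segment");
        self.segments[core].push_back(item);
        core
    }

    /// A core takes the next item from its own segment only.
    fn take(&mut self, core: usize) -> Option<T> {
        self.segments[core].pop_front()
    }
}
```

A production version would need concurrent access, for example one lock-free queue per core, which this sketch deliberately leaves out.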
Testing showed that the FL2 implementation sustains up to 1.9× higher request rates on a fully populated Turin node compared to the legacy FL1 on Gen12. The observed latency distribution narrows, indicating more consistent response times under load. Importantly, the new stack respects existing SLAs without requiring configuration changes.
Migration tools were added to transition live traffic from FL1 to FL2 with zero downtime. The tools perform health checks, validate performance baselines, and roll back automatically if thresholds are breached. This safety net ensures that the upgrade path remains reliable for production environments.
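The rollback rule can be pictured as a comparison of observed metrics against a baseline plus allowed margins. The sketch below is an assumption about how such a check might look; the metric set, the type names, and the threshold fields are all hypothetical and are not taken from the migration tools themselves.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Proceed,
    Rollback,
}

/// Metrics sampled from one stack over a fixed window.
struct Sample {
    p99_ms: f64,
    error_rate: f64,
}

/// Limits that trigger an automatic rollback when breached.
struct Thresholds {
    /// Allowed relative p99 regression, e.g. 0.10 for 10 percent.
    max_p99_regression: f64,
    /// Absolute ceiling on the error rate.
    max_error_rate: f64,
}

/// Compare the new stack's sample with the baseline and decide.
fn evaluate(baseline: &Sample, observed: &Sample, limits: &Thresholds) -> Decision {
    let p99_limit = baseline.p99_ms * (1.0 + limits.max_p99_regression);
    if observed.p99_ms > p99_limit || observed.error_rate > limits.max_error_rate {
        Decision::Rollback
    } else {
        Decision::Proceed
    }
}
```

Running `evaluate` after each traffic-shift step keeps the decision local to that step, so a breach stops the rollout before more traffic moves.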
Performance Measurement Methodology
Benchmarks were conducted using a synthetic workload that mirrors typical edge request patterns, focusing on throughput, latency, and CPU utilization. Each test ran for a minimum of ten minutes to capture steady‑state behavior and avoid transient effects. Metrics were collected from both the hardware counters and the application's internal telemetry.
To isolate the impact of the FL2 rewrite, the same workload was also executed on identical hardware running the legacy FL1 code path. For the cross-generation comparison between Gen12 and Turin, differences in cache size, core count, and power settings were accounted for by normalizing the results against a baseline. Together, these two comparisons provide a clear view of software versus silicon contributions.
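One plausible form of the normalization is to express each run as a ratio to the baseline in raw, per-core, and per-watt terms. The text does not specify the exact method, so the sketch below is an assumption, and `RunResult` and `normalize` are illustrative names.

```rust
/// Summary of one benchmark run.
struct RunResult {
    requests_per_sec: f64,
    cores: u32,
    watts: f64,
}

impl RunResult {
    fn per_core(&self) -> f64 {
        self.requests_per_sec / f64::from(self.cores)
    }

    fn per_watt(&self) -> f64 {
        self.requests_per_sec / self.watts
    }
}

/// Ratios of candidate to baseline: (raw, per-core, per-watt).
/// A value above 1.0 means the candidate is ahead on that measure.
fn normalize(candidate: &RunResult, baseline: &RunResult) -> (f64, f64, f64) {
    (
        candidate.requests_per_sec / baseline.requests_per_sec,
        candidate.per_core() / baseline.per_core(),
        candidate.per_watt() / baseline.per_watt(),
    )
}
```

Looking at the three ratios side by side separates gains that come from simply having more cores from gains in per-core or per-watt efficiency.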
Results indicated a 45 percent increase in sustained throughput when moving from FL1 on Gen12 to FL2 on Turin. The 99th‑percentile latency dropped by roughly 30 percent, confirming that the removal of cache reliance did not harm tail performance. CPU efficiency improved as well, with a lower utilization percentage at peak load.
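For readers reproducing figures like these, the two calculations involved are a tail percentile and a relative change. The sketch below uses the nearest-rank percentile definition, which is an assumption; the benchmark tooling may use a different interpolation.

```rust
/// Nearest-rank percentile: the smallest sample such that at least `pct`
/// percent of all samples are less than or equal to it.
/// Returns `None` for empty input.
fn percentile(samples: &[f64], pct: f64) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Rank is 1-based; clamp keeps pct = 0 and pct = 100 in range.
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Relative change from `before` to `after`, in percent.
/// Negative values mean a decrease.
fn percent_change(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

Applying `percentile(&latencies, 99.0)` to each stack's latency samples and passing the two results to `percent_change` gives the tail-latency comparison in the same form as reported above.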
Additional validation involved real‑world traffic captures replayed through both stacks. The replayed traffic preserved request size distribution, protocol mix, and burst characteristics, ensuring that the observed gains translate to production scenarios. All observed metrics remained within the predefined SLAs.
Real‑World Edge Workloads on Gen13
Customer sites that migrated to the Gen13 platform reported noticeable reductions in response time for static asset delivery, a workload heavily reliant on fast cache access. The increased core count allowed parallel handling of TLS handshakes, which traditionally taxed the CPU. As a result, overall user‑perceived latency improved across multiple geographic regions.
Dynamic content generation, such as API gateways, benefited from the FL2 scheduler's ability to spread work across many cores without being constrained by cache size. The Rust implementation's low‑overhead concurrency primitives kept latency low even under heavy load spikes. Operators observed a smoother scaling curve as traffic grew.
Security services like DDoS mitigation leveraged the higher throughput capacity to process larger packet volumes while maintaining detection accuracy. The reduced reliance on large cache meant that packet inspection pipelines could run on any available core, improving resilience. This flexibility contributed to higher uptime during attack periods.
Analytics collection at the edge saw faster aggregation because the increased core pool could execute parallel reduction operations. The FL2 code path's memory layout minimized cross‑core interference, allowing the analytics engine to maintain a steady processing rate. Consequently, customers received near‑real‑time insights without sacrificing other services.
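A parallel reduction splits the input into independent chunks, reduces each chunk on its own thread, and then combines the partial results. The sketch below shows the pattern with the standard library's scoped threads; it is an illustration of the technique, not the analytics engine's code, and `parallel_sum` is a hypothetical name.

```rust
use std::thread;

/// Sum `values` using up to `workers` threads.
/// Each thread reduces a disjoint chunk, so no data is shared for writing.
fn parallel_sum(values: &[u64], workers: usize) -> u64 {
    if values.is_empty() {
        return 0;
    }
    let workers = workers.max(1);
    // Ceiling division; at least 1 because `values` is non-empty.
    let chunk_len = (values.len() + workers - 1) / workers;

    thread::scope(|scope| {
        // Spawn every worker first so the chunks are reduced concurrently.
        let handles: Vec<_> = values
            .chunks(chunk_len)
            .map(|chunk| scope.spawn(move || chunk.iter().sum::<u64>()))
            .collect();

        // Combine the partial sums on the calling thread.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker panicked"))
            .sum::<u64>()
    })
}
```

Because each worker reads only its own contiguous chunk, the threads do not contend for the same cache lines, which is the property the paragraph above attributes to the memory layout.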
Future Directions for Edge Compute Architecture
Looking ahead, the team plans to explore heterogeneous acceleration by pairing Turin cores with specialized processors for cryptographic workloads. This approach would keep the main core count high while offloading compute‑intensive tasks to dedicated silicon. Early prototypes suggest potential gains in both throughput and energy use.
Another avenue involves refining the FL2 runtime to support adaptive load shedding based on real‑time CPU pressure. By monitoring utilization and latency metrics, the system could dynamically adjust request admission rates. Such feedback loops would help preserve SLAs during unexpected traffic surges.
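Such a feedback loop could take many forms. The sketch below assumes an additive-increase, multiplicative-decrease rule, which is a swapped-in technique chosen for illustration and not something the planned runtime is known to use. The type `LoadShedder`, its targets, and its step sizes are all hypothetical.

```rust
/// Admission controller that lowers the admitted share of requests when CPU
/// utilization or p99 latency exceeds its target, and raises it otherwise.
struct LoadShedder {
    admit_fraction: f64, // share of requests admitted, 0.1..=1.0
    cpu_target: f64,     // utilization target, e.g. 0.8 for 80 percent
    p99_target_ms: f64,  // tail latency target in milliseconds
    credit: f64,         // accumulator for deterministic admission
}

impl LoadShedder {
    fn new(cpu_target: f64, p99_target_ms: f64) -> Self {
        Self { admit_fraction: 1.0, cpu_target, p99_target_ms, credit: 0.0 }
    }

    /// Feed one measurement window into the controller.
    fn observe(&mut self, cpu_util: f64, p99_ms: f64) {
        let overloaded = cpu_util > self.cpu_target || p99_ms > self.p99_target_ms;
        self.admit_fraction = if overloaded {
            (self.admit_fraction * 0.5).max(0.1) // back off quickly
        } else {
            (self.admit_fraction + 0.05).min(1.0) // recover slowly
        };
    }

    /// Decide whether to admit the next request. Admissions are spread
    /// evenly: each call earns `admit_fraction` credit, and a request is
    /// admitted once a full credit has accumulated.
    fn admit(&mut self) -> bool {
        self.credit += self.admit_fraction;
        if self.credit >= 1.0 {
            self.credit -= 1.0;
            true
        } else {
            false
        }
    }
}
```

The asymmetric steps are a design choice: halving on overload sheds load fast enough to protect latency, while the small additive recovery avoids oscillating straight back into overload.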
Further research is being conducted on memory hierarchy tuning, specifically experimenting with smaller, faster cache slices that sit closer to individual cores. The goal is to retain some of the latency benefits of large cache while preserving the high core density of the current design. Simulations indicate a sweet spot where both dimensions improve.
Finally, the organization intends to open a public benchmark suite that captures a wide range of edge scenarios, enabling the community to compare future hardware generations against the baseline established by Gen13. Transparency in measurement will drive more informed decisions about hardware procurement and software evolution. The suite will include metrics for throughput, latency, and power efficiency.