In Part I of this series (see Part I), we outlined how the bandwidth requirement for high-performance buffers was outstripping the abilities of traditional memories while, at the same time, the capacity required of such buffers was declining. In Part II, we consider the memory solution requirements for packet header processing.
For buffer applications, the primary performance metric is throughput; for packet header processing, the critical performance metric is the ability to quickly access tables and decision trees. As was the case for buffer memories, commodity DRAM has been the solution of choice simply by nature of its market-driven pricing, multi-sourced nature and sheer economies of scale. Around the year 2000, when packet data rates transitioned through 10G aggregate bandwidth, an important threshold was crossed: the packet-processing rate for the first time exceeded the random access rate of generic DRAM. From that point onward, as illustrated in Figure 1, increasingly complex memory control solutions have been used to allow DRAM to scale up for high-performance networking applications.
Over the last decade, the solutions used to address this performance gap have fallen into four broad categories, some of which overlap.
High-performance specialty memory solutions such as Low Latency DRAM or QDR SRAM have stepped in to address external memory performance, and memory subsystem architectural improvements, in the form of either caching or load balancing, have been deployed.
These solutions have been scaled to 100G performance levels, albeit not without considerable effort and performance compromises.
In order to handle line rate requirements, packet processors utilize long, deterministic pipelines, which mitigate the need for ultra-low latency memories but do not relieve the access rate required to process packet header data in real time. Specialty memory devices, such as low latency DRAM, are derived from their commodity heritage and pushed, through brute force design techniques, to the limits of the underlying array technology. These specialty memory products, which emerged in the late 1990s, have been a good stopgap measure but are only evolving at the same rate as their core technology. Now, as we approach processing rates 15x faster than the DRAM cycle time, the clever system tricks and specialty memory products in use are reaching the point of diminishing returns, and going forward the performance gap will continue to grow.
Access Rate and Latency Diverge
Prior to CPU architectures transitioning from single core to multi-core, clock rate and performance were closely linked.
Similarly, for traditional memory devices there has always been a relationship between access rate and latency, as shown in Figure 2.
Now, with CPUs/NPUs going to many-core, the throughput of these processors, which aggregates the performance of the available parallel resources, continues to grow to meet market demands.
Many applications are architected for high throughput and are based on efficient pipelines that require unprecedented memory access rates.
Similarly, the Bandwidth Engine device architecture is optimized for memory access performance and utilizes a 90% efficient transport protocol running on industry-standard serial CEI-11 and XFI compatible interfaces.
Instead of focusing solely on low latency, as traditional memory devices do, the Bandwidth Engine architecture is highly parallel, allowing pipelined, deterministic, concurrent accesses that complement the pipelines of multi-core network processors.
Let’s consider competing solutions to address specific networking applications.
The devices we will be considering are the highest performance commodity DRAM available, Quad-Data Rate (QDR) or Sigma Quad SRAM, Low Latency (LL) or Reduced Latency (RL) DRAM, and the MoSys Bandwidth Engine IC.
APPLICATION: Lookup Table with infrequent updates
Some applications simply need a high random access read rate. At 200G the packet arrival rate is 300Mpps. Table 4 illustrates the performance differences for the read dominated access patterns.
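The 300Mpps figure can be checked with a short calculation. The sketch below assumes worst-case traffic of back-to-back minimum-size Ethernet frames (64 bytes) plus 20 bytes of preamble, start-of-frame delimiter and inter-frame gap per frame; the article does not state which packet size it used, so those constants and the function names are assumptions for illustration.

```python
# Assumed framing (not stated in the article): 64-byte minimum Ethernet
# frame plus 20 bytes of preamble, start-of-frame delimiter and inter-frame gap.
MIN_FRAME_BYTES = 64
WIRE_OVERHEAD_BYTES = 20


def packet_rate_pps(line_rate_bps, frame_bytes=MIN_FRAME_BYTES,
                    overhead_bytes=WIRE_OVERHEAD_BYTES):
    """Worst-case packets per second for back-to-back frames of one size."""
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_per_packet


def arrival_interval_ns(line_rate_bps, **framing):
    """Time between packet arrivals, in nanoseconds."""
    return 1e9 / packet_rate_pps(line_rate_bps, **framing)


if __name__ == "__main__":
    rate = packet_rate_pps(200e9)
    print(f"{rate / 1e6:.1f} Mpps, one packet every "
          f"{arrival_interval_ns(200e9):.2f} ns")
```

Under these assumptions the result is about 297.6 Mpps, in line with the roughly 300Mpps quoted above. A lookup table that is read once per packet therefore needs a sustained random read rate of the same order.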
The Bandwidth Engine with two internal read ports delivers the lowest power, highest efficiency solution for the generic high read rate application running in native access mode.
QDR SRAM is also dual ported but has one dedicated write port and one dedicated read port.
For the purpose of comparison a DDR SRAM performs as well as a QDR SRAM but uses fewer pins to interface to the host.
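For comparison with the SRAM options, a DRAM's random read rate is bounded by its bank timing rather than by its interface speed. The sketch below estimates that ceiling from three standard DRAM timing parameters: tRRD (minimum spacing between row activations to different banks), tFAW (the window in which at most four activations may be issued) and tRC (minimum time between activations of the same bank). It assumes every random read opens a new row and that accesses spread evenly over the banks. The timing values in the example are illustrative placeholders, not taken from the article or from any datasheet.

```python
def max_random_activate_rate(t_rrd_ns, t_faw_ns, t_rc_ns, banks):
    """Return (limiting_parameter, activations_per_second).

    Assumes each random access needs its own row activation and that
    accesses are spread evenly across all banks (a best case for DRAM).
    """
    limits = {
        "tRRD": 1e9 / t_rrd_ns,          # one activate per tRRD
        "tFAW": 4 * 1e9 / t_faw_ns,      # at most four activates per tFAW
        "tRC": banks * 1e9 / t_rc_ns,    # one activate per bank per tRC
    }
    name = min(limits, key=limits.get)
    return name, limits[name]


if __name__ == "__main__":
    # Placeholder timings for illustration only (hypothetical device).
    name, rate = max_random_activate_rate(
        t_rrd_ns=6.0, t_faw_ns=30.0, t_rc_ns=48.0, banks=8)
    print(f"limited by {name}: {rate / 1e6:.0f} M random accesses/s")
```

With these placeholder numbers the four-activate window is the binding limit, at roughly 133 million activations per second per device. Real traffic that lands repeatedly on the same bank would lower the figure further, since tRC then applies to consecutive accesses.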
Although the application can tolerate DRAM-like latency, it is the random read rate, constrained by bank-to-bank timing limitations (tFAW, tRRD), that makes DRAM an inefficient choice for the application. The fast cycle time of RL/LL DRAM is a significant improvement over commodity DRAM, but in this example it would still require 3 devices, with the corresponding power and pin count, to meet the performance requirements.
APPLICATION: State Memory
Various networking applications maintain state, such as Network Policing, Network Address Translation, Stateful Firewalling, TCP Intercept, Network Based Application Recognition, Server Load Balancing and URL Switching, to name a few. Maintaining state is a memory-intensive read-modify-write (RMW) operation that requires low, deterministic latency and completely random access.
The entire RMW operation must complete within the packet arrival time (3.2ns @ 100GE), and both the read and the write target the same memory location. Traditionally, QDR SRAM, with its independent write and read ability, has been used to meet this requirement. The packet processing elements typically have long pipelines and cannot tolerate a stall from a non-deterministic response.
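To put the RMW budget in terms of memory access rate, the sketch below counts one read and one write per packet. The 3.2ns interval corresponds to a 40-byte packet at 100Gb/s with no framing overhead counted; that packet size is an inference from the number, not something the article states, and the function names are hypothetical.

```python
def packet_interval_ns(line_rate_bps, packet_bytes):
    """Arrival interval in ns for back-to-back packets of a fixed size."""
    return packet_bytes * 8 / line_rate_bps * 1e9


def rmw_access_rate(interval_ns, accesses_per_packet=2):
    """Memory accesses per second when each packet triggers a
    read-modify-write (one read plus one write by default)."""
    return accesses_per_packet * 1e9 / interval_ns


if __name__ == "__main__":
    # Assumed packet size: 40 bytes, chosen to reproduce the 3.2 ns figure.
    interval = packet_interval_ns(100e9, packet_bytes=40)
    rate = rmw_access_rate(interval)
    print(f"interval {interval:.1f} ns, "
          f"{rate / 1e6:.0f} M accesses/s for read + write")
```

Under that assumption, the memory has to sustain on the order of 625 million accesses per second to one random location per packet, split evenly between reads and writes. This is the pattern that a device with separate read and write ports, such as QDR SRAM, serves naturally.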