Design Article
Networking memories--High access rates for packet-header processing—Part II
Michael Sporer, MoSys
9/6/2012 12:23 PM EDT
In Part I of this series (see Part I), we outlined how the bandwidth requirement for high-performance buffers was outstripping the abilities of traditional memories while, at the same time, the capacity required of such buffers was declining. In Part II we consider the memory solution requirements for packet header processing.
In buffer applications, the primary performance metric is throughput. In packet header processing, the critical performance metric is the ability to access tables and decision trees quickly. As was the case for buffer memories, commodity DRAM has been the solution of choice by virtue of its market-driven pricing, multi-sourced nature and sheer economies of scale. Around the year 2000, when packet data rates passed 10G aggregate bandwidth, an important threshold was crossed: for the first time, the packet-processing rate exceeded the random access rate of generic DRAM. From that point onward, as illustrated in Figure 1, increasingly complex memory control solutions have been used to allow DRAM to scale up for high-performance networking applications.
These solutions have been scaled to 100G performance levels, albeit not without considerable effort and performance compromises. To handle line-rate requirements, packet processors use long, deterministic pipelines, which mitigate the need for ultra-low-latency memories but do not relieve the access rate required to process packet header data in real time. Specialty memory devices, such as low-latency DRAM, are derived from their commodity heritage and, through brute-force design techniques, pushed to the limits of the underlying array technology. These specialty memory products, which emerged in the late 1990s, have been a good stopgap measure but are evolving only at the same rate as their core technology. Now, as we approach processing rates 15x faster than the DRAM cycle time, the clever system tricks and specialty memory products in use are reaching the point of diminishing returns, and going forward the performance gap will continue to grow.
Access Rate and Latency Diverge
Before CPU architectures transitioned from single core to multi-core, clock rate and performance were closely linked. Similarly, for traditional memory devices there has always been a relationship between access rate and latency, shown in Figure 2.
Instead of focusing solely on low latency, as traditional memory devices do, the Bandwidth Engine architecture is highly parallel, allowing pipelined, deterministic, concurrent accesses that complement the pipelines of multi-core network processors. Let’s consider competing solutions for specific networking applications. The devices we will consider are the highest-performance commodity DRAM available, Quad-Data Rate (QDR) or Sigma Quad SRAM, Low Latency (LL) or Reduced Latency (RL) DRAM, and the MoSys Bandwidth Engine IC.
APPLICATION: Lookup Table with Infrequent Updates
Some applications simply need a high random-access read rate. At 200G the packet arrival rate is 300Mpps. Table 4 illustrates the performance differences for read-dominated access patterns.
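The 300Mpps figure can be checked with a short calculation. The sketch below is illustrative only: it assumes minimum-size 64-byte Ethernet frames plus 20 bytes of preamble and inter-frame gap per frame, which the article does not state, and the function and constant names are made up for this example.

```python
# Illustrative arithmetic only. The frame size and per-frame overhead
# below are assumptions, not figures taken from the article.
MIN_FRAME_BYTES = 64        # assumed minimum Ethernet frame size
WIRE_OVERHEAD_BYTES = 20    # assumed preamble (8 bytes) + inter-frame gap (12 bytes)


def packets_per_second(line_rate_bps,
                       frame_bytes=MIN_FRAME_BYTES,
                       overhead_bytes=WIRE_OVERHEAD_BYTES):
    """Worst-case arrival rate for back-to-back frames of a single size."""
    bits_on_wire = (frame_bytes + overhead_bytes) * 8
    return line_rate_bps / bits_on_wire


rate_200g = packets_per_second(200e9)
print(f"{rate_200g / 1e6:.1f} Mpps")   # prints "297.6 Mpps"
```

Under these assumptions the result rounds to the 300Mpps quoted above. If each packet needs one table lookup, this is also the random read rate the memory has to sustain.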
Although these applications can tolerate DRAM-like latency, it is the random read rate, constrained by bank-to-bank timing limitations (tFAW, tRRD), that makes DRAM an inefficient choice. The fast cycle time of RL/LL DRAM is a significant improvement over commodity DRAM, but in this example it would still require 3 devices, with the corresponding power and pin count, to meet the performance requirements.
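To see how bank-to-bank timing caps the random read rate, and how that cap turns into a device count, consider the rough sketch below. It assumes every random read needs its own row activation and one lookup per packet. The timing values are hypothetical placeholders, not numbers from Table 4 or from any datasheet, so the output should not be read as a reproduction of the article's comparison.

```python
import math


def max_activate_rate(t_rrd_ns, t_faw_ns):
    """Upper bound on row activations per second for one DRAM device.

    tRRD is the minimum spacing between activates to different banks;
    tFAW is the window in which at most four activates may be issued.
    The tighter of the two limits sets the bound.
    """
    limit_from_rrd = 1e9 / t_rrd_ns
    limit_from_faw = 4 * 1e9 / t_faw_ns
    return min(limit_from_rrd, limit_from_faw)


def devices_needed(required_reads_per_s, reads_per_s_per_device):
    """Number of devices needed in parallel to reach the required read rate."""
    return math.ceil(required_reads_per_s / reads_per_s_per_device)


# Hypothetical timing values, chosen only to illustrate the calculation.
EXAMPLE_T_RRD_NS = 6.0
EXAMPLE_T_FAW_NS = 30.0

per_device = max_activate_rate(EXAMPLE_T_RRD_NS, EXAMPLE_T_FAW_NS)
count = devices_needed(300e6, per_device)
print(f"{per_device / 1e6:.0f} M activates/s per device, {count} devices")
```

With these placeholder values tFAW is the binding limit. Substituting real datasheet timings for a given part would give the corresponding bound for that device.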
APPLICATION: State Memory
Various networking applications maintain state, including Network Policing, Network Address Translation, Stateful Firewalling, TCP Intercept, Network Based Application Recognition, Server Load Balancing and URL Switching, to name a few. Maintaining state is a memory-intensive read-modify-write (RMW) operation that requires low, deterministic latency and completely random access.
The entire RMW operation must complete within the packet arrival time (3.2ns @ 100GE), and both the read and the write go to the same memory location. Traditionally, QDR SRAM, with its independent read and write ability, has been used to meet this requirement. The packet processing elements typically have long pipelines and cannot tolerate a stall from a non-deterministic response.
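The RMW time budget can be sketched as follows. The 3.2ns figure is consistent with a 40-byte packet at 100 Gb/s with no framing overhead counted. That packet size is an inference from the arithmetic, not something the article states, and the helper names are hypothetical.

```python
def arrival_time_ns(line_rate_bps, packet_bytes, overhead_bytes=0):
    """Time between back-to-back packets of one size, in nanoseconds."""
    bits = (packet_bytes + overhead_bytes) * 8
    return bits / line_rate_bps * 1e9


def rmw_access_rate(packets_per_s, accesses_per_packet=2):
    """Memory accesses per second when each packet needs a read and a write."""
    return packets_per_s * accesses_per_packet


# Assumed 40-byte packet with no preamble or inter-frame gap counted.
budget_ns = arrival_time_ns(100e9, packet_bytes=40)
packets_per_s = 1e9 / budget_ns
accesses_per_s = rmw_access_rate(packets_per_s)

print(f"{budget_ns:.1f} ns per packet")            # prints "3.2 ns per packet"
print(f"{accesses_per_s / 1e6:.0f} M accesses/s")  # prints "625 M accesses/s"
```

Because the read and the write target the same location, both have to fit inside that one window. Under the same assumption, this doubles the access rate the memory sees compared with a read-only lookup.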