Networking Memories: Intelligence for 400G app acceleration and host offload—Part III
Michael Sporer, MoSys
11/1/2012 12:07 PM EDT
Traditional counting implemented directly in high-speed SRAM becomes economically infeasible as line rates rise and the number of flows to be tracked grows. Multiple proposals have been published for implementing counters in commodity DRAM, but some of these solutions may not meet the following performance requirements:
- Counting every flow rather than sampling
- Continuous counting which requires sufficient bandwidth to read while counting
- Counting both packet count and length:
  - Packet count increments by unit value for each packet
  - Packet length requires adding or subtracting an immediate value
The performance of a hybrid counter implementation depends on the ratio of SRAM cycle time to DRAM cycle time, the number of DRAM banks, the cache size and the queue length. The baseline implementation references a 100GE flow with one million 64-bit counters, built with 9Mb of on-chip SRAM and 1.25Gb/s of off-chip DRAM bandwidth, assuming typical access rates for DRAM. Scaling up to a practical implementation is linear; i.e., a typical 4x100GE design [ 4 ports * (Packet + Byte) * (Ingress + Egress) @100GE ] would require 144Mb of SRAM and 20Gb/s of DRAM bandwidth. Counters require a high access rate on small records. The burst-oriented DRAM architecture is not well suited to this access pattern and requires a significant amount of SRAM capacity when scaling up to higher performance points. This is why the fast cycle time of RLDRAM makes it better suited to counter applications, and why legacy QDR SRAM, with its high access rate and low latency, continues to be used for them.
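The linear scale-up quoted above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `scale_hybrid_counter` and its parameters are made up for this example, and the only inputs taken from the text are the 9Mb and 1.25Gb/s baseline figures.

```python
# Baseline figures quoted in the text for one 100GE counter stream.
BASE_SRAM_MBIT = 9       # on-chip SRAM, in Mb
BASE_DRAM_GBPS = 1.25    # off-chip DRAM bandwidth, in Gb/s


def scale_hybrid_counter(ports, counter_types=2, directions=2):
    """Linear scale-up: one baseline instance per port, counter type and direction.

    counter_types=2 stands for (Packet + Byte), directions=2 for (Ingress + Egress).
    Returns (instances, SRAM in Mb, DRAM bandwidth in Gb/s).
    """
    instances = ports * counter_types * directions
    return instances, instances * BASE_SRAM_MBIT, instances * BASE_DRAM_GBPS


instances, sram_mbit, dram_gbps = scale_hybrid_counter(ports=4)
print(instances, sram_mbit, dram_gbps)  # 16 144 20.0
```

Sixteen baseline instances reproduce the 144Mb and 20Gb/s figures given for the 4x100GE case.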

The MoSys Bandwidth Engine Macro device, shown in Figure 4, has onboard accelerators that perform atomic fire-forward counter updates, incrementing records entirely inside the device. This reduces the number of memory bus transactions from six to one and relieves the host of the computation required for the update. The device can retire the entire operation in under 20ns, faster than solutions built with external memories, which require multiple memory bus traversals and host processing. The advanced counter macros on the Bandwidth Engine can be saturated using only 8 SerDes lanes, further reducing power, pin count and host resources. A comparison of different implementations of a 4x100GE counter servicing 1M flows can be seen in Table 1.
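The difference between a host-managed update and an offloaded one can be illustrated with a toy model. This is a sketch under stated assumptions, not the Bandwidth Engine command set: `CounterMemory`, `host_update` and the `update` method are hypothetical names. The model counts only plain reads and writes, so the host path shows four bus crossings rather than the six cited in the text, which does not itemize them.

```python
class CounterMemory:
    """Toy external counter memory that tallies bus transactions (hypothetical)."""

    def __init__(self, flows):
        self.packets = [0] * flows
        self.bytes = [0] * flows
        self.bus_transactions = 0

    # Host-managed path: every read and every write crosses the memory bus.
    def read(self, table, flow):
        self.bus_transactions += 1
        return table[flow]

    def write(self, table, flow, value):
        self.bus_transactions += 1
        table[flow] = value

    # Offloaded path: one command crosses the bus; both adds happen in the device.
    def update(self, flow, length):
        self.bus_transactions += 1
        self.packets[flow] += 1        # packet count increments by one
        self.bytes[flow] += length     # byte count adds an immediate value


def host_update(mem, flow, length):
    """Read-modify-write performed by the host for both counters."""
    mem.write(mem.packets, flow, mem.read(mem.packets, flow) + 1)
    mem.write(mem.bytes, flow, mem.read(mem.bytes, flow) + length)


host_mem = CounterMemory(flows=16)
host_update(host_mem, flow=7, length=64)

offload_mem = CounterMemory(flows=16)
offload_mem.update(flow=7, length=64)

print(host_mem.bus_transactions, offload_mem.bus_transactions)  # 4 1
```

Both paths leave the counters in the same state; only the number of bus crossings and the location of the arithmetic differ.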

Not captured in Table 1 is the simplification of the host controller when using the Bandwidth Engine Macro device, which relieves the host of counter and ECC operations as well as the coherency maintenance of transactions in flight in the counter pipeline. By off-loading the entire operation onto the Bandwidth Engine device, the latency becomes immaterial, and the counter rate depends only on the internal access rate of the parallel memory array.
What was once easily accomplished with a buffer in the host and an inexpensive DRAM has become challenging as line rates and DRAM performance diverge. As the line rate increases, the buffer size needed grows proportionally, making the DRAM/SRAM hybrid solution less attractive for active counter implementations. In Table 1, the large amount of on-die SRAM needed to allow the use of an inexpensive DRAM is a poor use of limited resources and would drive considerable expense on the host device. Even an implementation with the fastest available low-latency DRAM requires a buffer on the host, using up resources that could be put to better use. Using the MoSys Bandwidth Engine for the statistics application minimizes total system cost and complexity compared to alternative solutions.
APPLICATION: Two-Rate/Three-Color Token Bucket
To satisfy QoS and SLA contractual obligations, the network must support multi-priority, multi-flow traffic and ensure latency, jitter and packet delivery performance for each flow. This requires equipment with the capabilities shown in Figure 5, such as metering used for policing and a two-stage queuing mechanism used for shaping, which ensures predictable performance and creates scheduling “fairness” with better load distribution in the network.

When a packet arrives at the operator network access point, it is classified according to its traffic type, and that classification is then used to meter and police on a flow-by-flow basis. Traffic shaping can also be implemented to smooth out bursts and avoid buffer overruns in downstream equipment. Shaping can be implemented using a single-rate bandwidth profile token bucket algorithm, and it accounts for data packet overhead. Packets exceeding the ingress buffer of the downstream equipment are delayed in the upstream equipment until the receiving buffer has the necessary capacity.
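A single-rate token bucket shaper of the kind described above can be sketched as follows. This is a minimal illustration, not the article's implementation: the class name `TokenBucketShaper`, its parameters, and the choice to model delay as a computed release time are all assumptions, and packets are assumed to arrive in time order with lengths no larger than the burst size.

```python
class TokenBucketShaper:
    """Single-rate token bucket that delays packets instead of dropping them."""

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0          # fill rate in bytes per second
        self.burst = burst_bytes            # bucket depth in bytes
        self.tokens = float(burst_bytes)    # bucket starts full
        self.clock = 0.0                    # time at which self.tokens is valid

    def release_time(self, arrival, length):
        """Return the earliest send time for a packet, keeping FIFO order.

        `length` should include per-packet overhead, since the shaper
        accounts for it.
        """
        # A packet cannot leave before the one queued ahead of it.
        now = max(arrival, self.clock)
        # Refill for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.burst, self.tokens + (now - self.clock) * self.rate)
        if self.tokens < length:
            # Hold the packet until enough tokens have accumulated.
            now += (length - self.tokens) / self.rate
            self.tokens = float(length)
        self.tokens -= length
        self.clock = now
        return now


shaper = TokenBucketShaper(rate_bps=8000, burst_bytes=1500)   # 1000 bytes/s
print(shaper.release_time(0.0, 1500))  # 0.0: the initial burst passes at once
print(shaper.release_time(0.0, 1000))  # 1.0: waits one second for 1000 tokens
print(shaper.release_time(3.0, 500))   # 3.0: bucket has refilled by then
```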
Metering at the access point alone is insufficient to avoid congestion through the network. As the packet traverses the network, it will be aggregated onto higher-speed links with other data flows. It is essential for policing to be implemented on these higher-speed links to ensure end-to-end QoS and avoid draconian congestion control mechanisms. Since routing occurs on a per-hop basis, scheduling, metering and policing can occur at multiple nodes along the way to ensure that real-time and priority traffic is delivered according to contractual obligations. The operator can meet the bandwidth guarantees by reserving appropriate network resources and employing a two-rate/three-color rate-limitation methodology as part of its traffic engineering policy.
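The text does not say which two-rate/three-color algorithm is meant. As an illustration, the sketch below assumes the color-blind mode of the two rate three color marker defined in RFC 2698: one bucket filled at the committed rate (CIR, depth CBS) and one at the peak rate (PIR, depth PBS). The class and method names are hypothetical, and timestamps are assumed to be non-decreasing.

```python
class TwoRateThreeColorMeter:
    """Color-blind two-rate/three-color meter, after RFC 2698 (assumed here)."""

    def __init__(self, cir, cbs, pir, pbs):
        # cir and pir are in bytes per second; cbs and pbs are in bytes.
        self.cir, self.cbs, self.pir, self.pbs = cir, cbs, pir, pbs
        self.tc = float(cbs)    # committed bucket starts full
        self.tp = float(pbs)    # peak bucket starts full
        self.last = 0.0         # time of the previous update

    def color(self, now, length):
        """Classify a packet of `length` bytes arriving at time `now`."""
        elapsed = now - self.last
        self.last = now
        # Refill both buckets, each capped at its burst size.
        self.tc = min(self.cbs, self.tc + elapsed * self.cir)
        self.tp = min(self.pbs, self.tp + elapsed * self.pir)
        if self.tp < length:
            return "red"                 # exceeds the peak rate: no tokens taken
        if self.tc < length:
            self.tp -= length
            return "yellow"              # within peak, above committed
        self.tp -= length
        self.tc -= length
        return "green"                   # within the committed rate


meter = TwoRateThreeColorMeter(cir=1000, cbs=1000, pir=2000, pbs=2000)
print(meter.color(0.0, 1000))  # green
print(meter.color(0.0, 1000))  # yellow
print(meter.color(0.0, 1000))  # red
print(meter.color(1.0, 1000))  # green
```

Each metered packet needs both bucket states read, refilled, tested and written back, which is the same per-flow read-modify-write pattern as the statistics counters discussed earlier.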

