To keep up with the growth in the number of Internet users, usage, and data rates coming from both wireline and wireless networks, networking equipment must scale in performance. At the same time, the network must deliver a robust user experience free from delay and downtime, and support new requirements such as service level agreements, IPv6, and intrusion detection. Not only is the fundamental packet rate increasing dramatically, to hundreds of gigabits per second in the core and aggregation layers, but the number of lookups per packet needed to support the added services is also increasing.
Next generation deployments will begin with ASIC- and FPGA-based line cards in metro Ethernet, carrier, and core router applications. We will examine the discrete applications and solutions necessary to achieve the performance requirements while also delivering the quality of service needed for network operations, administration, and management.
The memory problem statement
Networking line rates have traditionally increased by more than 2x every 18 months, outstripping Moore's Law, in an attempt to keep pace with the ever-increasing demand for high-bandwidth content delivery and fast, reliable connectivity. Add to that the flexibility needed to handle new intelligence, standards, and features, and the result is a significant challenge to memory throughput and access rate. FPGA-based line cards paired with high-performance memory architectures provide the combination of bandwidth, intelligence, and flexibility needed to process packets in real time and deliver a robust user experience free from delay and jitter. Studies have shown the tangible value of a quick, robust user experience, which requires nearly continuous bidirectional traffic rather than the streaming bandwidth of high-definition video. The real-time processes behind a responsive user interface are sensitive to latency and jitter, whereas traditional networking traffic has either been insensitive or has used arbitrary buffering techniques as a stopgap. The days of 'best-effort' service are over; dropping a legitimate packet due to congestion degrades the customer experience, while indiscriminate buffering can be even worse. Fundamentally, packet processing must occur deterministically and in real time as network performance continues to scale.
Memory is used for three fundamental operations in networking equipment: buffering packets, packet header processing for switching and routing (decision process), and monitoring those decisions for network management or accounting purposes. This series of articles will consider these three as broad applications, recognizing that within each is a finer delineation. We’ll be considering memory solutions for buffering, statistics, metering, state memory and general table lookup applications.
As a benchmark for examination we consider processing aggregate data rates of 200 Gigabit Ethernet (GE), which challenges the ability of memory subsystems to keep up. At 200GE, minimum-size packets can arrive every 3.3 nanoseconds; coupled with the requirement to process packets in real time, this leads us to a new era in networking memory requirements. The general requirements to implement in FPGA using traditional memories are outlined as follows:
For each of these applications we will review current best practices for implementation and compare memory solutions.
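To illustrate where the 3.3 nanosecond figure comes from, the sketch below works out the minimum-size packet arrival interval at several line rates. It assumes a 64-byte minimum Ethernet frame plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap (84 bytes on the wire); these framing assumptions are ours, chosen to reproduce the figure above, not values taken from the text.

/* Minimum-size packet arrival interval vs. line rate (assumed framing). */
#include <stdio.h>

int main(void) {
    /* Wire footprint of a minimum-size Ethernet packet:
     * 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap. */
    const double wire_bits = (64.0 + 8.0 + 12.0) * 8.0;   /* 672 bits */
    const double rates_gbps[] = { 10.0, 40.0, 100.0, 200.0 };
    const int n = sizeof(rates_gbps) / sizeof(rates_gbps[0]);

    for (int i = 0; i < n; i++) {
        double interval_ns = wire_bits / rates_gbps[i];    /* bits / (Gbit/s) = ns */
        printf("%6.0f GE: one min-size packet every %5.2f ns (%.0f Mpps)\n",
               rates_gbps[i], interval_ns, 1e3 / interval_ns);
    }
    return 0;
}

At 200GE this works out to one packet roughly every 3.36 nanoseconds, or close to 300 million packets per second.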
Buffering within networking applications can take place in up to three locations within a switch or router, depending on the architecture and performance requirements: the ingress ports, the egress ports, and the packet processor. For example, in an oversubscribed system an oversubscription buffer on the ingress ports can absorb bursts of traffic that exceed the sustained capability of the packet processor. A buffer on the egress port may be used to absorb a high flow rate coming out of the switch and temporarily store it while it is sent out over a lower rate port. The most common buffer is the one tied to the packet processor, usually referred to as the 'packet buffer'. Its throughput is matched to the packet processor, and unlike the ingress and egress buffers, which handle transient conditions, its size and performance are tightly linked to the packet processor itself.
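As a rough illustration of how an ingress oversubscription buffer might be sized, the sketch below computes the backlog that accumulates while a burst exceeds the packet processor's sustained rate. The burst rate, processor rate, and burst duration are hypothetical values chosen only to show the relationship.

/* Illustrative oversubscription-buffer sizing: buffer the excess between
 * the burst arrival rate and the sustained processor rate for the burst
 * duration. All numbers are hypothetical. */
#include <stdio.h>

int main(void) {
    const double ingress_burst_gbps  = 200.0;   /* aggregate burst into the line card */
    const double processor_rate_gbps = 150.0;   /* sustained packet-processor throughput */
    const double burst_duration_us   = 100.0;   /* how long the burst lasts */

    /* Bits that arrive faster than they can be drained during the burst. */
    double excess_bits = (ingress_burst_gbps - processor_rate_gbps) * 1e9
                         * (burst_duration_us * 1e-6);

    printf("Oversubscription buffer must hold at least %.2f Mbit (%.2f MB)\n",
           excess_bits / 1e6, excess_bits / 8.0 / 1e6);
    return 0;
}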
The packet buffer has typically been implemented using wide arrays of inexpensive DRAM, which until recently was well aligned with system performance requirements (see sidebar). Given the economics and availability of commodity DRAM, conventional wisdom is to use this technology wherever possible. This holds true if you need the capacity that DRAM has to offer, but as we will show, the demand for buffer performance has outstripped the demand for additional buffer capacity. When system-level economics (power, thermal, board space) and performance are considered, in almost all cases it makes sense to use a purpose-built solution, despite the higher component cost. This is illustrated by oversubscription and egress buffers, which are typically implemented using high-performance, lower-density devices.
Packet Buffer, Oversubscription Buffer, Egress Congestion Buffer
DRAM, which began as a general-purpose, high-capacity random access device, has evolved and been optimized for the main memory of general-purpose computing. The interface and typical bus widths are ideal for the 32-byte and 64-byte cache line transfers of x86 architecture systems, which, despite the emerging smartphone and mobile markets, remain the single largest market for commodity DRAM devices. As long as the requirements for packet buffering fall within the performance envelope of commodity DRAM, it will remain the preferred solution. To continue to use DRAM for buffering, it has been necessary to implement a memory subsystem with caching and queuing in order to keep up with networking performance requirements. Using caching techniques, DRAM can serve primary packet buffering applications, and an intelligent caching manager can also alleviate the '65B problem' (where packets just one byte larger than the memory burst size waste nearly half the available bandwidth), limiting the required bandwidth into the off-chip buffer to twice the line rate. Of course, memory devices with a common IO interface must also factor in bus turn-around overhead. In the case of commodity DRAM we assume the bandwidth is 3x the line rate, taking both factors into account.
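The 3x figure can be reproduced with a back-of-the-envelope calculation: each buffered byte is written once and read once (2x the line rate), and the shared-IO bus turnaround and refresh overhead are derated by an assumed factor of 1.5. The specific derating factor is our assumption, chosen to match the 3x total stated above.

/* Back-of-the-envelope off-chip packet-buffer bandwidth.
 * Assumptions: one write + one read per buffered byte (2x line rate),
 * plus an assumed 1.5x derating for bus turnaround and refresh. */
#include <stdio.h>

int main(void) {
    const double line_rate_gbps    = 200.0;   /* aggregate line rate under study */
    const double write_plus_read   = 2.0;     /* one write + one read per byte   */
    const double turnaround_factor = 1.5;     /* assumed derating for turnaround */

    double dram_bw_gbps = line_rate_gbps * write_plus_read * turnaround_factor;
    printf("Required raw DRAM bandwidth: %.0f Gb/s (%.0fx line rate)\n",
           dram_bw_gbps, dram_bw_gbps / line_rate_gbps);
    return 0;
}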
A caching solution using a DRAM buffer for 10GE traffic solves the problem with two DDR2 DRAM devices and 1 Mb of on-die SRAM. This solution scales linearly with bandwidth, so, taking into account the improvement of DDR3 over DDR2, a 100GE buffer grows to 10 DRAM devices and 10 Mb of on-die SRAM. Further scaling to higher performance levels is indicated in Table 2.
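That scaling can be sketched as follows, starting from the 10GE baseline of two DDR2 devices and 1 Mb of on-die SRAM and assuming DDR3 offers roughly twice the per-device bandwidth of DDR2. The per-device bandwidth ratio is our assumption, used only to reproduce the device counts quoted above, not a vendor specification.

/* Rough scaling of the DRAM-based caching buffer described above.
 * Assumed DDR3-vs-DDR2 bandwidth ratio is used only to reproduce the
 * 10GE -> 100GE scaling in the text (2 DDR2 devices -> 10 DDR3 devices). */
#include <stdio.h>

int main(void) {
    /* Baseline from the text: 10GE handled by 2 DDR2 devices + 1 Mb on-die SRAM. */
    const double base_rate_ge = 10.0;
    const double base_devices = 2.0;    /* DDR2 devices */
    const double base_sram_mb = 1.0;
    const double ddr3_vs_ddr2 = 2.0;    /* assumed per-device bandwidth gain */

    double target_rate_ge = 100.0;
    double scale = target_rate_ge / base_rate_ge;

    double ddr3_devices = base_devices * scale / ddr3_vs_ddr2;   /* -> 10 devices */
    double sram_mb      = base_sram_mb * scale;                  /* -> 10 Mb      */

    printf("%.0fGE buffer: ~%.0f DDR3 devices, ~%.0f Mb on-die SRAM\n",
           target_rate_ge, ddr3_devices, sram_mb);
    return 0;
}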