Next-generation network processors (NPUs) will be called on to perform gargantuan tasks. They will be asked to process network traffic at speeds an order of magnitude higher than current rates, to monitor traffic flow and regulate traffic patterns, and to maintain statistics on the volume of traffic of each user. The question is, Will they be up to the task?
Yes, provided they meet some very demanding requirements.
The key to next-generation network processors is sustained performance, or the ability to guarantee line rate performance regardless of traffic patterns and regardless of the network features being executed. That implies that next-generation NPUs will be able to process well over 10 billion bits of data every second--an extremely challenging task.
A network processor fits between a framer and the switch fabric on a network line card to route data through a network. Current-generation network processors use RISC-based execution engines, but such configurations run into serious throughput problems at OC-48c and above.
Network speeds continue to rise. While OC-48c (the 2.5-Gbit/second rate) is barely in the deployment stage, the industry is already making plans for OC-192c (10 Gbits/s) and OC-768c (40 Gbits/s). Along with added speed come added capabilities and features. Just moving packets and cells from origin to destination isn't enough. Everything that network processors do at OC-48c and below (plus some new features that aren't currently implemented) must be done at OC-192c and above.
For example, networks will begin to keep better track of the traffic that they handle. Just like telephone users, network users will pay for the volume of traffic that they move, instead of paying a flat rate, regardless of volume.
A next-generation network processor will handle many more tasks than the current generation. It will switch protocols from one network to another. It will implement any network feature handling that may be required. It must be programmable and configurable to meet changes in standards, protocols or features.
To appreciate the sheer complexity of the tasks that next-generation network processors will be asked to handle, one need only follow the life of a packet or cell as it flows through an intelligent high-performance line card. The processor performs a sequence of operations on packets or cells as they enter and exit the line cards. Initially, the processor will identify the packet protocol, verify integrity, classify content/destination, and police and meter the flow of data by permitting, denying and assigning priorities. Modification of the data may occur at this point. That would include encapsulation techniques, to transform one protocol to another or for forwarding purposes. Bit fields such as the IP time-to-live (TTL) or the ATM cell loss priority (CLP) may need to be manipulated. Further, type-of-service (TOS) or DSCP bit marking may be required.
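As a rough illustration of the bit-field manipulation described above, the following Python sketch decrements the TTL of an IPv4 header, writes a DSCP value (the six-bit differentiated-services code point carried in the TOS byte) and refreshes the header checksum. The function names and the sample header are assumptions made for this example; a real NPU would do this in hardware at line rate.

```python
import struct

def ipv4_checksum(header: bytes) -> int:
    """One's-complement checksum over the header's 16-bit words."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:                      # fold carries back into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def rewrite_header(header: bytes, dscp: int) -> bytes:
    """Decrement TTL, set the six DSCP bits, and recompute the checksum."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP is a six-bit value")
    h = bytearray(header)
    if h[8] <= 1:
        raise ValueError("TTL expired; the packet should be dropped")
    h[8] -= 1                               # TTL is byte 8 of the IPv4 header
    h[1] = (dscp << 2) | (h[1] & 0x03)      # DSCP is the top six bits of byte 1
    h[10:12] = b"\x00\x00"                  # checksum field is zero while summing
    h[10:12] = struct.pack("!H", ipv4_checksum(bytes(h)))
    return bytes(h)

# Hypothetical 20-byte header: TTL 64, protocol 6, 10.0.0.1 -> 10.0.0.2.
hdr = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 64, 6, 0,
                  bytes([10, 0, 0, 1]), bytes([10, 0, 0, 2]))
out = rewrite_header(hdr, dscp=46)
```

A header carrying a valid checksum sums to zero under `ipv4_checksum`, which gives a quick way to check the rewritten header.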
Once classified, policed and modified, the data is temporarily stored within a buffer subsystem. That facilitates traffic engineering by allowing data to be stored and prioritized into multiple queues for indefinite periods of time. It also enables time-sensitive data to flow through the processing device more freely and delays best-effort, low-priority traffic in the buffer structures.
Once the data is available for forwarding, the traffic management function is used to shape and prioritize the data according to quality-of-service requirements. All data flows must be monitored, both for management and billing and to guarantee that each flow meets its service-level agreement. After shaping, the traffic must be forwarded and possibly segmented into the switch fabric or back into the line. For multicast traffic, the NPU may be required to replicate the traffic on the fabric egress side before forwarding it to multiple downstream ports.
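The article does not name a particular shaping or policing algorithm. One common choice is the token bucket, sketched below in Python under that assumption. The class name, parameters and explicit `now` timestamp are all illustrative; a packet conforms only if enough byte credits have accumulated since the last packet.

```python
class TokenBucket:
    """Minimal token-bucket meter: rate in bytes per second, burst in bytes."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes   # start with a full bucket
        self.last = 0.0             # time of the previous update, in seconds

    def conforms(self, packet_bytes: int, now: float) -> bool:
        # Refill credits for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True             # in profile: forward now
        return False                # out of profile: delay, re-mark or drop

tb = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
first = tb.conforms(1500, now=0.0)   # uses up the whole burst allowance
second = tb.conforms(100, now=0.0)   # no credits left at the same instant
third = tb.conforms(100, now=1.0)    # one second later, 1,000 bytes of credit
```

A shaper would queue the nonconforming packet until credits accumulate, whereas a policer would drop or re-mark it immediately.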
Port-based flow control is also necessary in both directions--when forwarding into the switch fabric because of egress port congestion and when forwarding out to the line because of congested downstream ports. All the while, intelligent management of buffer queues must be maintained and monitored, since high bursts of traffic or downstream congestion may cause average buffer queue depths to rise beyond reasonable limits, forcing the NPU to apply intelligent dropping techniques.
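The article does not say which dropping technique it has in mind. Random early detection (RED) is one widely used example, and the Python sketch below is a simplified version of it, offered only as an illustration. The thresholds, weight and class name are assumptions. The drop probability rises linearly as the averaged queue depth moves from a minimum to a maximum threshold.

```python
import random

class RedQueue:
    """Simplified RED: drop decisions based on a smoothed queue depth."""

    def __init__(self, min_th, max_th, max_p=0.1, weight=0.002, rng=None):
        self.min_th, self.max_th = min_th, max_th
        self.max_p, self.weight = max_p, weight
        self.avg = 0.0                      # smoothed (average) queue depth
        self.rng = rng or random.Random()

    def drop_probability(self) -> float:
        if self.avg < self.min_th:
            return 0.0                      # queue is short: never drop
        if self.avg >= self.max_th:
            return 1.0                      # queue is too long: always drop
        span = self.max_th - self.min_th
        return self.max_p * (self.avg - self.min_th) / span

    def should_drop(self, current_depth: int) -> bool:
        # The moving average lets short bursts through without drops.
        self.avg += self.weight * (current_depth - self.avg)
        return self.rng.random() < self.drop_probability()
```

Because the decision uses the average rather than the instantaneous depth, a brief burst passes untouched while sustained congestion triggers progressively more drops.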
Supporting all of the above and more makes wire-speed packet processing a formidable challenge. At 10 Gbits/s, minimum-size (40-byte) packets can arrive every 32 nanoseconds.
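The 32-nanosecond figure follows from simple arithmetic, assuming a 40-byte minimum packet (the size cited later in this article), nominal line rates and no framing overhead. A minimal Python sketch, with an illustrative function name:

```python
def packet_interval_ns(packet_bytes: int, line_rate_bps: float) -> float:
    """Time between back-to-back packets of the given size, in nanoseconds."""
    bits = packet_bytes * 8
    return bits / line_rate_bps * 1e9

# Nominal rates quoted in the article; framing overhead is ignored.
for name, rate in [("OC-48c", 2.5e9), ("OC-192c", 10e9), ("OC-768c", 40e9)]:
    print(f"{name}: {packet_interval_ns(40, rate):.0f} ns per 40-byte packet")
```

Under the same assumptions the budget shrinks to a quarter of that at OC-768c, which is why per-packet work that is tolerable at lower rates becomes a bottleneck.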
Current-generation network processors typically use a modified RISC embedded processor, which is a single-instruction single-data (SISD) engine, to handle such functions as protocol conversions and packet/cell routing. The approach worked reasonably well at OC-12c and below, where the RISC can usually deliver sustained performance, even when it is reprogrammed to implement evolving networking standards and protocols.
But RISC-based network processors run into trouble at 1 Gbit/s and above. The root of the problem is that the embedded RISC core is in the data path and that data can move only as fast as the RISC core can process it. A RISC-based device can barely handle the data rate at 1 Gbit/s and begins to degrade as features are added.
Data rate problems occur because the RISC's fixed 64-bit data width limits the amount of data that can be handled in each processor cycle. In addition, RISC software development tools are the same ones used to develop more mundane applications, where ease of programming often takes precedence over efficient performance.
Every added level of abstraction in the software tools reduces the performance of the end system. Such efficiency losses may be acceptable in a videogame, but not in a high-performance network.
RISC problems will be compounded as networks move to 10 Gbits/s and especially to intelligent OC-192c, where new network features must be executed in addition to handling raw data throughput.
When multiple RISC engines are used in a network processor, they are typically arranged in a multithread configuration where each RISC core is in the data path of a separate data stream. Such a configuration requires a complex scheduler to try to balance the load among RISCs. The arrangement also limits performance, both because of the fixed 64-bit data width of the RISC and because the RISC must interrupt data handling to process any network features that may be implemented.
At first glance, it may appear that increasing the clock frequency or adding more RISC cores to the network processor may solve the problem. Such an approach may indeed help somewhat, but it is not the answer. Increasing the number of RISC cores will improve data throughput to some degree, but using four RISC cores in multithreading modes will not provide 4x throughput.
In multithreaded configurations, multiple RISC engines process data streams in parallel. A complex scheduler sitting between the data inputs and the RISC engines tries to balance the loads among the RISCs. Each RISC engine is in the data path of one data stream, and the RISC processes all data and executes all functions for that data stream sequentially.
Clearly, the addition of features to any data stream, as will be the case with intelligent OC-192c, slows the processing of that stream. In addition, context switching, which occurs when a new packet requires different processing from a preceding packet, can affect performance.
Also, when processing a stream of packets with multiple engines, there is a high likelihood that the packets will be processed out of order. Mechanisms are therefore required to distribute packets into and out of the parallel-processing engines in the proper order. Another pitfall of the architecture is how to map applications to the machine in an optimal manner. The process of applying, verifying and debugging application functionality when using multiple processors has become a time-consuming source of frustration.
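One way to restore packet order, sketched here in Python as an assumption rather than a description of any particular product, is to tag each packet with a sequence number on entry and hold completed packets in a reorder buffer until all earlier ones have finished. The class and method names are illustrative.

```python
class ReorderBuffer:
    """Releases processed packets strictly in arrival (sequence) order."""

    def __init__(self):
        self.next_seq = 0      # sequence number that must be released next
        self.pending = {}      # finished packets waiting for earlier ones

    def complete(self, seq: int, packet) -> list:
        """Record a finished packet; return those that can now be sent."""
        self.pending[seq] = packet
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released

rb = ReorderBuffer()
early = rb.complete(1, "B")    # packet 1 finished first, so it must wait
caught_up = rb.complete(0, "A")  # packet 0 arrives; both can now leave in order
```

The cost of this approach is extra buffering and latency: one slow packet holds back every packet behind it.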
Finally, adding more RISC cores to a network processor is a practice that runs into trouble using today's process technologies. A network processor should be--must be--a highly integrated device. Adding cores increases chip size and eventually poses yield problems, thus driving up the cost of each device. It also significantly increases power consumption, a major concern in networking applications.
Next-generation network processors will differ radically in architectural approach from current-generation processors. The key is sustained performance: The NPU must handle network traffic at OC-192c and must be scalable to OC-768c. Line-rate performance must be maintained regardless of network traffic patterns and regardless of the network features that may be implemented.
Sustained performance, even in worst-case scenarios, implies parallel rather than sequential handling of network functions. For example, all network features that are implemented must be executed simultaneously (in parallel), rather than sequentially, so that adding features does not degrade performance.
A network processor must maintain line-rate performance whether it is implementing one network feature or a dozen features. There should be no trade-offs of performance for features.
Next-generation network processors may, and probably will, have an embedded RISC processor, but it will not be in the data path. Instead, it will be a supervisory, or housekeeping, processor that will not have to operate at line rates. An embedded RISC processor may, for example, set up other parts of the network processor to handle specific line-rate network functions, but then it would get out of the way.
The network processor functions that must operate at line rates will, in our opinion, embody sustained memory bandwidth capacity, deterministic execution engines and high-performance instruction sets that execute simultaneously.
Sustained memory bandwidth capacity simply means that the network processor must be able to store any packet size at the network line rate, and performance must not fall below the line rate for any reasonable combination of packet sizes. That capability can be realized by careful selection of buffer sizes and memory bus width.
Typical buffer size can be between 64 bytes and 2 kbytes. But a small buffer can run into throughput problems. The worst-case scenario for any small buffer is a packet that is 1 byte larger than the buffer size. For example, when using a 64-byte buffer, it might take twice as long to store a 65-byte packet as it does to store a 64-byte packet. On the other hand, a large buffer can run into efficiency problems since most packets are small.
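The trade-off can be made concrete with a little arithmetic. The Python sketch below, with illustrative function names, counts how many fixed-size buffers a packet occupies and what fraction of the allocated space carries data.

```python
def buffers_needed(packet_bytes: int, buffer_bytes: int) -> int:
    """Number of fixed-size buffers a packet occupies (rounded up)."""
    return (packet_bytes + buffer_bytes - 1) // buffer_bytes

def storage_efficiency(packet_bytes: int, buffer_bytes: int) -> float:
    """Fraction of the allocated buffer space that holds packet data."""
    allocated = buffers_needed(packet_bytes, buffer_bytes) * buffer_bytes
    return packet_bytes / allocated

for size in (64, 65):
    print(size, "bytes ->", buffers_needed(size, 64), "buffer(s) of 64 bytes")
print(f"40-byte packet in a 2-kbyte buffer: {storage_efficiency(40, 2048):.1%} used")
```

A 65-byte packet needs two 64-byte buffers, hence twice the write operations, while a 40-byte packet leaves about 98 percent of a 2-kbyte buffer empty.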
For higher line rates, more buffering is required. Economics dictate that large buffers be implemented in DRAM rather than SRAM. But DRAM has high fixed latency when switching from one page to another, so that the usable bandwidth is not proportional to the bus width. For a wider bus, the number of data transfer cycles is reduced, but not the latency, resulting in even worse efficiency. Therefore, designers must strike a balance between latency overhead and granularity overhead to reach optimal usable memory bandwidth.
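A simple model shows why widening the bus gives diminishing returns. The Python sketch below assumes that every access pays a fixed page-switch latency in cycles before the data transfer starts. The 64-byte access size and six-cycle latency are illustrative numbers, not figures from any specific DRAM.

```python
def dram_access(bytes_per_access: int, bus_bytes: int, latency_cycles: int):
    """Return (bus efficiency, usable bytes per cycle) for one access."""
    data_cycles = (bytes_per_access + bus_bytes - 1) // bus_bytes
    total_cycles = data_cycles + latency_cycles   # latency does not shrink
    efficiency = data_cycles / total_cycles
    usable_bytes_per_cycle = bytes_per_access / total_cycles
    return efficiency, usable_bytes_per_cycle

# Assumed values: 64-byte accesses, 6 cycles of fixed page-switch latency.
for bus_bytes in (8, 16, 32):
    eff, usable = dram_access(64, bus_bytes, 6)
    print(f"{bus_bytes * 8:>3}-bit bus: {eff:.0%} efficient, "
          f"{usable:.2f} usable bytes/cycle")
```

With these assumed numbers, quadrupling the bus width from 64 to 256 bits raises usable bandwidth by well under a factor of two, while efficiency falls from about 57 percent to 25 percent.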
Clearly, large packets are handled more easily than small packets. A next-generation network processor should be able to handle any combination of packet sizes of 40 bytes or more while sustaining line-rate performance.
Having a deterministic execution engine is critical to maintaining the line-rate performance in the network processor. The deterministic execution engine is in the data path and is separate from the supervisory RISC processor. It maintains line-rate performance regardless of the number of features being implemented by using a pipelined architecture that executes multiple network processor functions in parallel rather than sequentially.
A network processor based on deterministic execution engines works in a multiple-instruction single-data-stream (MISD) configuration. A packet is processed by multiple fixed-cycle pipes of multiple engines executing simultaneously, thus eliminating the nondeterministic characteristics of sequentially programmed RISC engines.
A deterministic engine will support a MISD configuration. In this configuration, a single data stream is processed by multiple high-performance, fixed-cycle pipes, with each pipe consisting of multiple (non-RISC) execution engines. Each pipe performs a set of functions on the data stream and passes the data stream on to the next pipe. Each engine within a pipe executes a particular network feature. A feature is turned on or off simply by enabling or disabling the associated engine.
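The structure can be modeled in software, although only loosely, since real engines within a pipe would operate side by side in hardware. In the Python sketch below every name, the four-cycle pipe latency and the sample features are assumptions. It shows that the cycle count depends on the number of pipes, not on how many engines are enabled.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Engine:
    name: str
    action: Callable[[dict], None]   # the network feature this engine applies
    enabled: bool = True

@dataclass
class Pipe:
    engines: List[Engine]
    cycles: int = 4                  # fixed latency per pipe (assumed value)

    def process(self, packet: dict) -> int:
        # Hardware engines would run concurrently; this model visits them in turn.
        for engine in self.engines:
            if engine.enabled:
                engine.action(packet)
        return self.cycles           # cost is the same however many are enabled

def run(pipes: List[Pipe], packet: dict) -> int:
    """Pass one packet through every pipe and return the total cycles."""
    return sum(pipe.process(packet) for pipe in pipes)

def classify(pkt): pkt["queue"] = "high" if pkt["dscp"] == 46 else "low"
def count(pkt): pkt["counted"] = True
def dec_ttl(pkt): pkt["ttl"] -= 1

pipes = [Pipe([Engine("classify", classify), Engine("count", count)]),
         Pipe([Engine("ttl", dec_ttl)])]
pkt_a = {"dscp": 46, "ttl": 64}
cycles_all = run(pipes, pkt_a)

pipes[0].engines[1].enabled = False  # turn the counting feature off
pkt_b = {"dscp": 0, "ttl": 64}
cycles_fewer = run(pipes, pkt_b)
```

Both runs take the same number of cycles, which is the deterministic property the article describes: a feature costs silicon, not time.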
The deterministic engine is also scalable in both the vertical and horizontal axes; that is, more engines can be added to a pipe to increase a pipe's capabilities, and more pipes can be added to increase overall performance.
High-performance instruction sets executing simultaneously mean instruction sets tailored to support a deterministic engine. The instruction set will be wide and segmented to perform multiple functions in parallel. For example, different segments of a single instruction may implement different network processor features. Simply enabling or disabling a particular engine will enable or disable a particular feature. With such an architecture, all network processor features execute simultaneously, thus maintaining sustained performance regardless of the number of features being executed.
Another benefit of this type of instruction set is that the instructions map directly to the function being performed, allowing for the creation of highly optimized application programming interfaces. The resulting programming model is dramatically simplified. That greatly reduces the system time to deployment and eliminates the need for complex performance-debilitating abstraction-layer software tools such as compilers.
How wide will this high-performance instruction set be? That will depend on the number of features that the network processor can support and the width of the instruction segment required by each feature.
This article will be presented in full at this week's Communications Design Conference.