Design Article
Packet processing needs balance between architecture, network
Michael Selissen, Technical Marketing Engineer, Network Processor Division, Intel Corp., Tempe, Ariz.
8/5/2002 7:33 AM EDT
Writing software for a network processor involves implementing inter-process communication and resource management functions at a level of complexity found in distributed computing applications. So, in designing their network processors, vendors have had to devise new solutions for managing shared resources and lowering internal system delays. A measure of a network processor's value is how well it addresses these issues for low-latency, high-bandwidth applications, while still allowing for a variety of programming methodologies.
Packets move through a network processor along a software pipeline from one processing element (PE) to another. Each instance of the pipeline passes packets from one stage to the next, across PE boundaries. So at any one time, a network processor may have dozens of packets in various stages of processing.
In a network processor application, a software pipeline spans several PEs. Consider a simplified example: a line card that might be found in an edge router, where the processor aggregates network traffic from multiple gigabit Ethernet ports onto a high-bandwidth PPP-over-SONET link.
Each Receive PE reads a data stream from multiple ports, assembles packets, and places them into memory. The Receive PEs allocate buffers from a shared pool, while the Traffic Manager PE returns the buffers to the pool once the packet is transmitted. After assembling a packet, each Receive PE places a packet descriptor onto a ring, which serves as a holding point for the remainder of the packet processing.
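The buffer pool and descriptor ring described above can be sketched in a few lines of Python. This is an illustrative model only: the names BufferPool and DescriptorRing, the integer buffer handles, and the dictionary-style descriptor are assumptions made for the example, not any vendor's API. A real network processor would typically implement both structures in hardware-assisted memory.

```python
from collections import deque


class BufferPool:
    """Shared pool of packet buffers, identified here by integer handles."""

    def __init__(self, count):
        self.free = deque(range(count))

    def allocate(self):
        # Returns None when the pool is exhausted; the caller must drop or retry.
        return self.free.popleft() if self.free else None

    def release(self, handle):
        # Called by the transmit side once the packet has left the box.
        self.free.append(handle)


class DescriptorRing:
    """Fixed-size ring of packet descriptors between two pipeline stages."""

    def __init__(self, size):
        self.slots = [None] * size
        self.head = 0    # next slot to read
        self.tail = 0    # next slot to write
        self.count = 0   # number of occupied slots

    def put(self, descriptor):
        if self.count == len(self.slots):
            return False  # ring full: the producer must back off
        self.slots[self.tail] = descriptor
        self.tail = (self.tail + 1) % len(self.slots)
        self.count += 1
        return True

    def get(self):
        if self.count == 0:
            return None  # ring empty
        descriptor = self.slots[self.head]
        self.head = (self.head + 1) % len(self.slots)
        self.count -= 1
        return descriptor
```

In this sketch a receive stage would call allocate(), fill the buffer, and put() a descriptor such as {"buffer": handle, "length": 64}; the transmit stage would call release() on the handle after sending.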
The second group of PEs removes packets from the ring and begins processing the packet content. In the case of a Layer 3 edge router, this processing takes place on the Internet Protocol (IP) portion of the packet. The software verifies the protocol header and then classifies the packet by inspecting the source and destination addresses, port numbers, and type-of-service field. Taken together, these fields also form a search key into tables containing Access Control Lists (ACL), IP forwarding tables, and MPLS label-switching tables.
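As a rough illustration of the classification step, the following Python sketch builds a search key from the header fields and uses it against two tables. The FlowKey tuple, the exact-match ACL dictionary, and the linear longest-prefix scan are simplifying assumptions chosen for readability; production forwarding code would normally use tries or hardware lookup engines instead.

```python
import ipaddress
from collections import namedtuple

# Hypothetical search key built from the fields the classifier inspects.
FlowKey = namedtuple("FlowKey", "src dst sport dport tos")


def longest_prefix_match(table, address):
    """Return the next hop of the most specific prefix containing address.

    table maps prefix strings such as "10.0.0.0/8" to next-hop names.
    A linear scan keeps the example short; it is not how a fast path works.
    """
    addr = ipaddress.ip_address(address)
    best = None
    for prefix, next_hop in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0]):
            best = (net.prefixlen, next_hop)
    return best[1] if best else None


def classify(key, acl, forwarding):
    """Apply the ACL, then pick an outbound port; None means drop."""
    if acl.get(key, "permit") == "deny":
        return None
    return longest_prefix_match(forwarding, key.dst)
```

For example, with forwarding entries for 10.0.0.0/8 and 10.1.0.0/16, a packet destined for 10.1.2.3 resolves to the /16 entry because it is the more specific match.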
The next step is to calculate and update the flow rate. As a part of this metering and policing process, the software marks the packet with a discard priority based on whether it meets or exceeds the traffic constraints for the particular flow. The packet then enters the congestion avoidance stage where the software checks the fullness of the system's output queues and determines whether the packet should be queued or dropped.
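The article does not name a particular metering algorithm. One common choice is a token bucket, sketched below together with a simple threshold check standing in for congestion avoidance. The class name, the "low" and "high" priority labels, and the rule that high-discard-priority packets are admitted only while the queue is under half full are all assumptions made for illustration.

```python
class TokenBucket:
    """Single-rate meter: marks packets by whether they fit the flow's profile."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes   # bucket starts full
        self.last = 0.0             # time of the previous update, in seconds

    def mark(self, now, length):
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if length <= self.tokens:
            self.tokens -= length
            return "low"    # conforming: low discard priority
        return "high"       # exceeding: high discard priority


def admit(discard_priority, queue_depth, queue_limit):
    """Queue or drop, based on output queue fullness (assumed thresholds)."""
    if discard_priority == "low":
        threshold = queue_limit
    else:
        threshold = queue_limit // 2
    return queue_depth < threshold
```

Note that mark() is a read-modify-write on per-flow state, which is why the statistics update needs the critical-section protection discussed later in the article.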
The last PE in the IP processing pipeline inserts the packet into an outbound transmission queue based on the packet's destination port, flow identifier, and class of service. The number of transmit queues for a given line card may range from a few dozen to tens of thousands.
Finally, the Traffic Manager PE schedules transmission to each outbound port and removes packets from the queues. Traffic management strategies and algorithms vary from one telecommunications manufacturer to another, though most implementations use hierarchical scheduling. This technique applies different scheduling algorithms based on the classes of service and numbers of queues. Some examples of the scheduling algorithms are Fixed Priority, Weighted Fair Queuing, and Deficit Round Robin.
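Of the scheduling algorithms listed, Deficit Round Robin is compact enough to sketch. The version below is a minimal single-level model, assuming each queue holds only packet lengths in bytes and that the function signature and return format are free to choose; a hierarchical scheduler would nest several such schedulers.

```python
from collections import deque


def deficit_round_robin(queues, quanta, rounds):
    """Serve queues in turn, letting each send up to its accumulated deficit.

    queues: list of deques of packet lengths (bytes)
    quanta: bytes of credit added to each queue per round
    Returns the transmit order as (queue_index, length) pairs.
    """
    deficits = [0] * len(queues)
    sent = []
    for _ in range(rounds):
        for i, queue in enumerate(queues):
            if not queue:
                deficits[i] = 0   # idle queues do not bank credit
                continue
            deficits[i] += quanta[i]
            # Send head-of-line packets while they fit within the deficit.
            while queue and queue[0] <= deficits[i]:
                length = queue.popleft()
                deficits[i] -= length
                sent.append((i, length))
            if not queue:
                deficits[i] = 0
    return sent
```

Giving one queue a larger quantum than another is the usual way to weight bandwidth between classes of service in this scheme.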
Processing packets in parallel, inside multiple concurrent pipelines, requires strict coordination among PEs to control both resource contention and packet latency. PEs on most network processors do not host operating systems, so the availability of hardware-based mechanisms to help programmers perform these tasks is a key differentiator among network processors.
Certain features are critical to creating efficient software pipelines. Atomic operations are useful in coordinating access to shared resources, while low-latency inter-process communication mechanisms move packet information between pipeline stages and help to control synchronization among PEs. Caching and data coherency features result in lower internal I/O latency, which is necessary to meet the strict processing time constraints of each arriving packet.
Key programming issues include managing shared resources, minimizing packet latency, maximizing processor utilization, and application scaling.
In the edge router example above, there are certain points where access to shared resources occurs. To ensure, for example, that only one PE modifies statistics for a given flow, the code waits to enter a critical section. Once active, the critical section reads the statistics from memory, calculates new values, and then writes the updated values back to memory.
After determining the outbound destination port and class, the code places the packet onto a transmit queue. If multiple Receive PEs place packets on the same queue, the queuing process must ensure that packets are inserted atomically and in the order they were received.
Semaphore mechanisms help programmers manage critical sections and synchronization. Event signaling is one tool that is used to implement a semaphore. Running threads signal each other in a prescribed order to indicate the availability of a resource. For example, a thread receives a specific event signal, which tells it that it has access to the metering statistics. Once the statistics are updated, it signals the next thread in the sequence.
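The signal-passing pattern can be modeled with ordinary operating-system threads. In the sketch below, Python's threading.Event stands in for a hardware event signal, and the function name and the shape of the shared statistics are assumptions for the example. Each thread waits for its own signal, updates the statistics, and then signals its successor, so updates happen in a fixed order regardless of when the threads start.

```python
import threading


def run_ordered_updates(num_threads, packet_lengths):
    """Update shared statistics in strict thread order using event signals."""
    stats = {"bytes": 0, "order": []}
    signals = [threading.Event() for _ in range(num_threads)]

    def worker(index):
        signals[index].wait()                 # block until it is our turn
        # Critical section: only the signaled thread touches the statistics.
        stats["bytes"] += packet_lengths[index]
        stats["order"].append(index)
        if index + 1 < num_threads:
            signals[index + 1].set()          # hand access to the next thread

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    # Start in reverse to show that start order does not affect update order.
    for thread in reversed(threads):
        thread.start()
    signals[0].set()                          # kick off the sequence
    for thread in threads:
        thread.join()
    return stats
```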
When a semaphore is shared among a small number of threads, or when order is not important, atomic test-and-set operations provide an alternative to signals. To ensure that atomic operations are equally accessible to all PEs, the network processor must manage them within a single control point. A full-featured lock manager provides another option for synchronizing access to resources. A lock manager controls semaphore operations with implied ordering. Advanced lock manager features include deadlock avoidance and query capability to identify holders of a particular lock.
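A test-and-set spinlock can be sketched as follows. Python has no hardware test-and-set instruction, so the AtomicFlag class below is a stand-in: its internal lock plays the role of the single control point that makes the operation atomic for every caller. All names here are hypothetical.

```python
import threading
import time


class AtomicFlag:
    """Models one test-and-set location managed by a single control point."""

    def __init__(self):
        self._guard = threading.Lock()   # stands in for the hardware arbiter
        self._value = 0

    def test_and_set(self):
        # Atomically return the old value and set the flag to 1.
        with self._guard:
            old = self._value
            self._value = 1
            return old

    def clear(self):
        with self._guard:
            self._value = 0


def increment_shared(counter, flag, times):
    """Increment a shared counter, using the flag as an unordered lock."""
    for _ in range(times):
        while flag.test_and_set() == 1:
            time.sleep(0)        # lock held elsewhere: yield and retry
        counter["value"] += 1    # critical section
        flag.clear()
```

Unlike the signal-passing scheme, nothing here determines which waiting thread acquires the flag next, which is why the approach suits cases where order is not important.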
Placing a packet on a queue requires three I/O operations: one to read the queue tail, one to write the link to the new element, and one to write the updated queue tail. In the router example, the Packet Processing PEs insert packets onto the queues while the Traffic Manager PE removes them prior to transmission. Network processors that offer atomic queuing operations let the software treat queue insertion and removal as single operations. This feature reduces queuing time and relieves the software from managing the three dependent I/O operations.
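To make the three dependent operations concrete, the sketch below models memory as a dictionary that counts every access. The Memory class, the (object, field) addressing scheme, and the handling of an empty queue (where the "link" write goes to the queue head instead of a previous element) are assumptions made for the example.

```python
class Memory:
    """Toy memory that counts I/O operations."""

    def __init__(self):
        self.cells = {}
        self.io_count = 0

    def read(self, address):
        self.io_count += 1
        return self.cells.get(address)

    def write(self, address, value):
        self.io_count += 1
        self.cells[address] = value


def enqueue(mem, queue, element):
    """Append element to a linked-list queue using three I/O operations."""
    tail = mem.read((queue, "tail"))          # 1: read the queue tail
    if tail is None:
        mem.write((queue, "head"), element)   # 2: empty queue, set the head
    else:
        mem.write((tail, "next"), element)    # 2: link old tail to new element
    mem.write((queue, "tail"), element)       # 3: write the updated tail
```

If two PEs ran enqueue() on the same queue at once, both could read the same tail in step 1 and one insertion would be lost. A hardware queuing primitive that performs all three steps as one indivisible operation removes that hazard.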
Minimizing packet latency
Many network processor applications, particularly real-time voice and video, require consistently low packet latency. Limited memory bandwidth, as compared to processor speed, is one contributor to packet latency. A network processor that supports internal as well as external memory gives programmers a choice of where to place data structures based on their frequency of reference.
SRAM, with its low access time, is useful for storing regularly referenced structures. In the router example, internal SRAM contains statistics counters and buffers for transferring messages between pipeline stages. Lookup tables, packet descriptors, and queues are located in external SRAM. DRAM provides a less expensive, albeit higher latency, memory for storing packet data.
Another way to reduce memory latency is for the network processor to guarantee a point of data coherency within its memory controllers. For example, if one task writes data to SRAM, another can read the same data before the controller moves it to memory. This feature lets tasks quickly access updated data while reducing both the write and read times from the PE's perspective.
Delays caused by inter-process communication between PEs also have an effect on packet latency. Further lowering packet latency therefore requires mechanisms to pass state information and event indications between PEs. High-speed buses linking PEs are a way to pass data while avoiding memory operations.
Packet-processing software often makes repeated accesses to certain data. As it reassembles a packet, for example, a PE will update a byte count several times. Additionally, PEs responsible for queue management may retain queue pointers locally to avoid repeated memory accesses. In both of these cases the PE software reads a data structure and updates it multiple times before writing the result back to memory.
These operations require that data be cached locally within the PE. To assist software in managing a cache, a low-latency lookup mechanism is effective in minimizing search times. A Content Addressable Memory (CAM) searches multiple entries simultaneously. In a single instruction cycle the software determines, for instance, whether the checksum for a given packet is cached within the PE or resides in shared memory.
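A CAM-backed local cache can be approximated in software as a small associative table. In the sketch below, a Python dictionary lookup stands in for the CAM's parallel compare, and the least-recently-used eviction with write-back is an assumed policy, not something the article specifies. The class and counter names are hypothetical.

```python
from collections import OrderedDict


class CamCache:
    """Small PE-local cache whose tag lookup models a CAM search."""

    def __init__(self, entries, backing):
        self.entries = entries       # number of CAM entries
        self.tags = OrderedDict()    # tag -> cached value, oldest first
        self.backing = backing       # dict standing in for shared memory
        self.memory_reads = 0

    def lookup(self, tag):
        if tag in self.tags:                     # CAM hit: no memory access
            self.tags.move_to_end(tag)
            return self.tags[tag]
        if len(self.tags) == self.entries:       # CAM full: evict the oldest
            old_tag, old_value = self.tags.popitem(last=False)
            self.backing[old_tag] = old_value    # write the result back
        self.memory_reads += 1                   # CAM miss: one memory read
        value = self.backing.get(tag, 0)
        self.tags[tag] = value
        return value

    def update(self, tag, value):
        self.lookup(tag)             # ensure the entry is resident
        self.tags[tag] = value       # modify the local copy only
```

With this model, a byte count updated several times during reassembly costs one memory read up front and one write when the entry is eventually evicted.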
As a network packet makes its way through the pipeline, it is processed by a series of computations and memory accesses. Some network processors offset, or hide, latencies associated with memory accesses by offering multi-threaded PEs. While one thread waits for an I/O operation to complete, another continues processing its packet. This approach ensures that packets are always advancing through the software pipeline.
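The swap-on-I/O behavior can be mimicked with Python generators, where each yield marks the point at which a thread issues a memory request and gives up the PE. This is a conceptual model only: the function names are hypothetical, and real PE threads are switched by hardware, not by a software loop.

```python
def packet_thread(name, steps, log):
    """One PE thread: compute, then yield while its I/O is outstanding."""
    for step in range(steps):
        log.append((name, step))   # the compute portion of this step
        yield                      # issue I/O and voluntarily swap out


def run_pe(threads):
    """Round-robin among threads, switching only at their yield points."""
    ready = list(threads)
    while ready:
        thread = ready.pop(0)
        try:
            next(thread)           # run until the thread's next I/O wait
            ready.append(thread)
        except StopIteration:
            pass                   # thread finished its packet
```

Running two threads of two steps each interleaves their work as A, B, A, B: while one waits on I/O, the other computes.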
Ideally, if each PE thread spends 50 percent of its allotted time performing computations and 50 percent waiting for I/O operations, the PE will achieve maximum utilization. It is, however, usually all but impossible for a programmer to reach this balance, especially across all applications. But to create efficient code, programmers need the ability to exercise fine-grained thread management. This means the PEs should be non-preempting so the code can be tuned to make the most efficient use of the PE while I/O operations are in progress.
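The balance between compute time and I/O wait can be estimated with a simple idealized model. The function below assumes non-preempting threads with identical timing, zero-cost context switches, and I/O latency that does not grow under load; none of these assumptions comes from the article, and real workloads will fall short of the figure it produces.

```python
def pe_utilization(threads, compute_cycles, io_wait_cycles):
    """Fraction of time the PE spends computing, under idealized assumptions.

    Each thread alternates compute_cycles of work with io_wait_cycles of
    waiting. Other threads fill the wait, so utilization scales with the
    thread count until the PE is saturated at 1.0.
    """
    busy_fraction = compute_cycles / (compute_cycles + io_wait_cycles)
    return min(1.0, threads * busy_fraction)
```

Under this model, two threads that each split their time evenly between computing and waiting keep the PE fully busy, while a single such thread leaves it idle half the time. Threads that wait 80 percent of the time would need five peers to reach the same point.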
A comprehensive software development environment, one that includes tools specifically designed to take advantage of a network processor's hardware features, is essential for creating highly tuned PE software. An integrated development environment, offering C compilers, as well as intelligent assemblers, linkers, debuggers, and code analysis tools, yields the best possible performance from a network processor, while letting the programmer hand-tune code to meet specific performance requirements.

