As with any processor architecture, designers are always looking for better ways to compare one company's processor against another's. In the microprocessor and signal-processing world, organizations like EEMBC have been formed to fill that need. Until recently, the same couldn't be said of the networking arena. Designers had to rely on vendor-supplied data, often produced with differing test techniques, to evaluate emerging network processor (NPU) architectures.
Fortunately, the Network Processing Forum (NPF; www.npforum.org) has stepped to the plate on this issue, developing two benchmarking specifications, one for IPv4 and one for MPLS, that OEMs can use to fairly evaluate NPU performance. In this article we'll look at the motivation for developing the IPv4 benchmark, discuss its contents, and highlight key differences between it and preexisting benchmark standards. (Note: For a detailed look at the MPLS benchmark, see "MPLS Benchmarks Define Net Processor Performance.")
Early in the benchmarking process, the NPF Benchmarking Work Group (WG) agreed to split the benchmarking universe into three levels [1]:
- System level: Benchmarks at this level target the performance of complete systems such as routers and include both control-plane and data-plane functionality. A set of system-level benchmarks has already been defined by the IETF and is aimed at end users (such as ISPs) who will deploy the systems in their networks. Examples include benchmarks for IP routers, firewalls, and web switches.
- Application level: These benchmarks measure the performance of NPU application functions. Most system-level benchmarks encompass multiple application functions; a firewall, for example, includes IP forwarding, filtering, and network address translation (NAT). Application-level benchmarks are useful for evaluating the performance of an NPU for a single application such as IP forwarding, and are targeted at NPU customers.
- Task level: These benchmarks target significant, fundamental operations that are commonly combined to make up an application. The operations must be separable from other operations so they can be measured independently, and a given task-level benchmark is generally found in more than one application-level benchmark. Examples include longest prefix match (LPM) table lookups, five-tuple table lookups, string searches, and CRC calculations. Task-level benchmarks are targeted at NPU developers who implement value-added data-plane functionality and therefore need to measure and compare NPU performance for a particular operation (a minimal LPM sketch follows this list).
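To make the task-level idea concrete, here is a minimal longest-prefix-match sketch in Python. It is purely illustrative: the table entries are made up, and real NPUs implement LPM with hardware tries or TCAMs rather than a linear scan over the table.

```python
import ipaddress

# Toy forwarding table: prefix -> output port (entries are made up).
FORWARDING_TABLE = {
    ipaddress.ip_network("0.0.0.0/0"): "port0",       # default route
    ipaddress.ip_network("192.168.0.0/16"): "port1",
    ipaddress.ip_network("192.168.10.0/24"): "port2",
}

def lpm_lookup(dst: str) -> str:
    """Return the output port for the longest prefix covering dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, port in FORWARDING_TABLE.items():
        if addr in prefix and (best is None or prefix.prefixlen > best[0].prefixlen):
            best = (prefix, port)
    return best[1]

print(lpm_lookup("192.168.10.7"))   # -> port2 (the more specific /24 wins)
print(lpm_lookup("192.168.99.1"))   # -> port1 (/16 match)
print(lpm_lookup("10.1.2.3"))       # -> port0 (default route)
```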
The WG focused its initial efforts on application-level benchmarks. The group believed these benchmarks delivered the greatest value and had the clearest path to standardization. IPv4 forwarding was selected by the NPF as its first benchmark because it:
- Represents a pervasive Internet application with strong performance impact. Benchmark results would be of interest to many.
- Represents very basic Internet functionality. We could build future benchmarks on top of IPv4 forwarding.
- Is well understood. Significant challenges loomed as we created our first benchmark (see the next section), so the WG decided to choose an application with well-agreed-upon functionality.
- Includes important operations, like LPM lookups, that we planned to specify as task-level benchmarks.
- Represents functionality that all NPUs need to implement.
We verified our approach with a survey of the entire NPF membership. IPv4 forwarding was selected most often from a number of possible first benchmark candidates.
To specify the IPv4 forwarding benchmark, the WG had to overcome a number of issues generic to all NPU benchmarks. The NPU market is architecturally rich, with many different approaches represented, and that wide variation increases the challenge inherent in benchmarking NPUs. Let's look at these challenges in more detail.
1. Text Not Code: The WG quickly determined that NPF benchmarks would be specified as text descriptions, not code. NPUs span a wide variety of programming models: some have their behavior specified by a program, while others feature fixed-function units that are configured for a specific application. Even among NPUs that are programmed, there is no common programming language.
To avoid these problems, the NPF defined its benchmarks in English text, like the IETF benchmarking RFCs [3]. NPF benchmarks also specify the needed functionality in a "black box," or generic, manner. While precise in terms of the input packets supplied and the output packets required, the benchmarks do not specify the algorithms the NPU must employ.
2. Bounding the Black Box: The benchmark should clearly show the performance of the NPU being measured. NPU architectures differ widely, and measuring the NPU in isolation generally isn't possible since there is no standard NPU "socket." NPU manufacturers select different interfaces for connecting to framers and physical-layer devices (PHYs), switch fabrics, memory, coprocessors, and control-plane processors; in fact, two NPU vendors are very unlikely to select the same set of interfaces. Moreover, platform-level innovation makes a future standard NPU "socket" unlikely, and even undesirable.
Better standardization exists at the network port level, where Ethernet, Packet over SONET (POS), and ATM dominate. This is a natural boundary for network products. Moreover, measurements at this level include the impact of platform-level choices made by NPU designers; for example, too little memory bandwidth will result in a lower benchmark score.
NPF benchmarks specify input packets to the device under test (DUT) at the network level and measure packets from the DUT at the network level to gauge performance. Network interfaces have been selected as the boundary of the NPF benchmark black box.
3. Making Scores Comparable: Enabling benchmark scores to be compared is clearly a must. Still, we must preserve vendors' ability to field and benchmark platforms that they believe meet market needs. Vendors field systems with different sets of interfaces to target specific markets: one vendor may target ATM while another targets Ethernet, and a vendor with an NPU targeting 10 Gbit/s throughput will clearly include different network interfaces than one targeting 1 Gbit/s. Nor is comparability just an interface issue; those comparing systems may wish to compare systems of similar size, integration level, or power.
The NPF cannot predict all the comparisons that will be needed, and there is no desire to require vendors to build benchmarking systems to NPF specifications (such as mandated interface sets or power dissipation). Instead of specifying the DUT, NPF benchmarks specify the DUT information that must be disclosed. For example, vendors must disclose the number and type of ports; the quantities and part numbers of NPU ICs, coprocessors, and memory chips used; and even the amount of memory used to store benchmark structures. In this way, a knowledgeable engineer reading a benchmark report can judge how well that specific performance measure matches product needs and can meaningfully compare two different reports.
4. Headroom: Headroom can be loosely defined as the resources available to a program after base-level functionality has been accomplished for a given workload. Within the NPF there is a strong desire to include headroom metrics in benchmarks. Measuring the networking capability left over while running the benchmark would give system vendors a valuable measure of the resources remaining for additional functionality.
For a given NPU, headroom can be specified relatively easily: a set of low-level metrics like processor utilization, memory utilization, remaining memory bandwidth, or remaining instruction store would likely capture it nicely. Unfortunately, the set of metrics required is very different for each NPU, and the NPF could find no common ground here.
A second way to measure NPU headroom was also proposed: task-level benchmarks would be run at the same time as an application-level benchmark, and both the application- and task-level performance would be measured. Unfortunately, the set of tasks needed to completely characterize the leftover networking capability of the NPU system is just as architecture-dependent as the set of low-level metrics.
In the end, the WG gave up trying to specify an industry-standard measure of headroom. Disclosure of some application-specific metrics, like the memory footprint of routing tables, is required; it is left to individual companies to specify and communicate headroom for their architectures as they see fit.
5. Accurate Usage: Ensuring compliance with the benchmark specification is required to maintain its integrity. Abuse of the benchmark puts all published numbers in doubt and severely degrades the benchmark's usefulness.
A number of possible policing methods were considered to ensure accurate usage, ranging from self-certification of results to certification by other NPU companies. In the end, the most workable approach was one similar to that employed by the TPC [4]: independent, third-party certification. Before vendors can advertise numbers using the Network Processing Forum name or logo, the benchmark report must be certified by an NPF-approved certifying agent; currently The Tolly Group [5] fills that role. Once certified, vendors may advertise the benchmark numbers, and the benchmark reports are posted on the NPF web site [9]. For complete rules, designers will need to consult the NPF.
The IPv4 benchmark consists of the following components: the IPv4 Benchmark Implementation Agreement [10], the Mae-West route table snapshot [11], a script for generating route tables [12], an IPv4 benchmark reporting template [13], and an NPF IPv4 benchmark implementation kit [14].
The implementation agreement document provides a text description of the benchmark. Its structure is based on IETF RFCs 1242 and 2544 [6, 7], and it includes terminology, test configuration, test description, test parameter, and measurement criteria sections.
The Mae-West route table snapshot is loaded on the device under test (DUT) and used to select the proper output port for each incoming packet. The table, captured on Oct. 26, 2001, has 28,895 entries distributed across various prefix lengths and IPv4 subnets.
The raw Mae-West route table's next hops must be matched to the DUT and the routes that will be exercised from each port must be selected. This is the job of the mandatory Tcl script provided with the benchmark.
The Tcl script takes as input the Mae-West route table snapshot, the configuration of the DUT, and the capabilities of the traffic tester. Its output is the route table to be loaded on the DUT and the traffic pattern to be loaded on the traffic tester. The script customizes the route table to distribute the routes evenly across all the interfaces on the DUT. The generated traffic pattern ensures that the route prefixes and subnets exercised are representative of the prefix and subnet distribution of the entire table, and that the same amount of route lookup and processing is needed on each interface. This symmetry of input and output ensures the results reflect the true maximum performance obtainable on the DUT.
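The NPF script itself is written in Tcl; as a rough illustration of the two jobs it performs, the Python sketch below spreads a toy route list evenly across DUT ports and then samples destination prefixes per port in proportion to each prefix length's share of that port's table. The function names and the miniature route list are invented for this example.

```python
import random
from collections import defaultdict

def distribute_routes(routes, num_ports):
    """Assign routes round-robin so each DUT port owns an equal share."""
    per_port = defaultdict(list)
    for i, route in enumerate(routes):
        per_port[i % num_ports].append(route)
    return per_port

def sample_traffic(per_port, flows_per_port, seed=0):
    """Pick destination prefixes per port while roughly preserving the
    prefix-length mix of that port's table."""
    random.seed(seed)
    pattern = {}
    for port, routes in per_port.items():
        by_len = defaultdict(list)
        for prefix, plen in routes:
            by_len[plen].append(prefix)
        picks = []
        for plen, members in by_len.items():
            share = max(1, round(flows_per_port * len(members) / len(routes)))
            picks += random.sample(members, min(share, len(members)))
        pattern[port] = picks[:flows_per_port]
    return pattern

# Made-up miniature table of (prefix, prefix_length) tuples.
routes = [("192.0.%d.0" % i, 24) for i in range(6)] + [("10.%d.0.0" % i, 16) for i in range(2)]
ports = distribute_routes(routes, num_ports=4)
print(sample_traffic(ports, flows_per_port=2))
```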
The reporting template specifies all the data points and graphs that must be reported, enabling easy comparison of one report with another.
The implementation kit is a complete Tcl-based example implementation of the functionality needed on a traffic tester (an IXIA tester in this case) to generate the traffic sent into the DUT and to analyze the output. It is designed to be portable across different traffic testers with minimal effort. The kit takes as input the Mae-West route table and the DUT configuration and generates traffic accordingly. The kit is an extension of the route-table script described above and is provided as a reference; its use is not required for running the benchmark.
Diving into the Benchmark Details
Now that we've laid out the rationale for the benchmark and its key components, let's dive into the specifics of how the benchmark works. To do this, we'll look at the metrics measured and illustrate each with results obtained using a commercially available NPU [8]. Note: The results presented here are based on measurements involving the Intel IXP2400 NPU. A complete report on the benchmark results for this NPU can be found at http://www.npforum.org/benchmarking/Intel_IPv4_Disclosure.pdf [15].
The system setup used to obtain the results is also described in the benchmark report; Figure 1 provides an abbreviated description.
Figure 1: Diagram illustrating an abbreviated version of the IPv4 benchmark setup.
The IPv4 forwarding system used in this setup consists of a platform with:
- Two NPUs
- 512 Mbyte of 150 MHz DDR DRAM per NPU
- Two Channels of 4 Mbyte 200 MHz QDR II SRAM per NPU
- Two media cards that each contain a four-port Gigabit Ethernet MAC IC.
The IPv4 implementation kit assigns 3,611 routes to each of the eight interfaces and sends traffic that exercises 1,000 of those routes per interface. Hence the traffic exercises 8,000 of the 28,895 routes in the Mae-West table.
The IPv4 benchmarking spec calls for evaluating forwarding rate, throughput, latency, loss rate, overload forwarding rate, forwarding table update latency, and forwarding table update rate. Let's look at all seven in more detail.
1. Forwarding Rate
The forwarding rate is the maximum rate at which received frames are forwarded by the IPv4 forwarding function, measured in output frames per second at a frame size of N bytes.
The forwarding rate is measured with three different traffic patterns:
- Base case: 100% data traffic.
- Control case: 95% data traffic and 5% control traffic destined to the local or remote control processor.
- Option case: 99.9% data traffic and 0.1% IPv4 Record Route Option packets, which should be processed and forwarded.
These traffic patterns determine the performance of the DUT under different conditions. The results also indicate the impact of control-plane-destined traffic on pure data traffic.
As shown in Figure 2, the results are plotted against the theoretical maximum for the given set of interfaces at frame sizes of 64, 128, 256, 512, 1024, 1280, and 1518 bytes, and for the Internet mix consisting of 56% 64-byte, 20% 594-byte, and 24% 1518-byte packets.
Figure 2: Forwarding rate in packets per second.
Figure 2 presents the data obtained on the NPU discussed in this article. The forwarding rates achieved equal the theoretical maximum rate for all tests, indicating that the system can process data, control, and options traffic without any packet loss.
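As a rough check on those "theoretical maximum" curves, the short sketch below computes line-rate packets per second for one Gigabit Ethernet port, assuming the usual 8-byte preamble and 12-byte inter-frame gap per frame; the precise overhead accounting used in published reports is defined by the implementation agreement.

```python
PREAMBLE = 8      # bytes of preamble + start-of-frame delimiter
IFG = 12          # bytes of inter-frame gap
LINE_RATE = 1e9   # Gigabit Ethernet, bits per second

def max_pps(frame_bytes, line_rate=LINE_RATE):
    """Theoretical maximum frames/second for one port at a given frame size."""
    wire_bytes = frame_bytes + PREAMBLE + IFG
    return line_rate / (wire_bytes * 8)

for size in (64, 128, 256, 512, 1024, 1280, 1518):
    print(f"{size:5d}-byte frames: {max_pps(size):12,.0f} pps per GigE port")

# Internet mix from the benchmark: 56% 64-byte, 20% 594-byte, 24% 1518-byte.
mix = [(0.56, 64), (0.20, 594), (0.24, 1518)]
avg_wire_bytes = sum(w * (s + PREAMBLE + IFG) for w, s in mix)
print(f"Internet mix: {LINE_RATE / (avg_wire_bytes * 8):12,.0f} pps per GigE port")
```

For 64-byte frames this works out to roughly 1.49 million packets per second per Gigabit Ethernet port, which is the ceiling the measured forwarding rates are compared against.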
2. Throughput
Throughput is defined as the maximum rate at which none of the valid received frames are dropped by the IPv4 forwarding function. To measure throughput, the benchmark calls for designers to evaluate input frames per second at a frame size of N bytes.
In cases where the forwarding rate achieved on a system falls short of the theoretical maximum, this test helps determine the upper bound of forwarding performance obtainable on the device without packet loss. On the NPU used in our example there was no packet loss, so the throughput was the same as the forwarding rate.
3. Latency
Latency is a critical parameter that designers must evaluate when looking at NPUs. The IPv4 benchmark defines two forms of latency: one for store-and-forward devices and one for bit-forwarding devices. For store-and-forward devices, latency is the time interval starting when the last bit of the input frame reaches the input port of the DUT and ending when the first bit of the output frame is seen on the output port. For bit-forwarding devices, latency is the time interval starting when the end of the first bit of the input frame reaches the input port and ending when the start of the first bit of the output frame is seen on the output port. In both cases, the benchmark measures latency in seconds.
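The two definitions differ only in which input-bit event starts the clock. The small Python sketch below, using made-up timestamps, shows how the store-and-forward figure can be derived from a first-bit timestamp plus the frame's time on the wire.

```python
def frame_duration(frame_bytes, line_rate_bps=1e9):
    """Time one frame spends on the wire at a given line rate (seconds)."""
    return frame_bytes * 8 / line_rate_bps

def store_and_forward_latency(t_first_bit_in, t_first_bit_out, frame_bytes, line_rate_bps=1e9):
    """Last bit in to first bit out: the last bit arrives one frame-duration
    after the first bit."""
    t_last_bit_in = t_first_bit_in + frame_duration(frame_bytes, line_rate_bps)
    return t_first_bit_out - t_last_bit_in

def bit_forwarding_latency(t_first_bit_in, t_first_bit_out):
    """First bit in to first bit out (cut-through style measurement)."""
    return t_first_bit_out - t_first_bit_in

# 64-byte frame at 1 Gbit/s: first bit in at t=0, first bit out at t=2.0 us (made-up numbers).
print(store_and_forward_latency(0.0, 2.0e-6, 64))   # ~1.488e-6 s
print(bit_forwarding_latency(0.0, 2.0e-6))          # 2.0e-6 s
```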
The average, minimum, and maximum per-packet latencies measured on the NPU highlighted in this article are plotted for the same packet sizes and traffic patterns described above. The values are obtained at 100%, 75%, 50%, and 25% of the throughput rate and provide an additional measure of the performance degradation seen as the offered load increases. An ideal device would show little or no variation in latency as the load increases.
Results for the NPU under test at 100% of the throughput rate are shown in Figure 3. Lower-rate tests do not differ significantly, implying that the system is capable of sustaining the 100% rate indefinitely.
Figure 3: Latency at 100% of throughput rate.
4. Loss Rate
Under the IPv4 benchmarking spec, loss rate is defined as the percentage of frames that should have been forwarded by the IPv4 forwarding function but were dropped instead. It is measured as the percentage of N-byte input frames that are dropped.
The loss rate is determined by sending a specific number of frames at a specific rate through the DUT and counting the number of frames received by the data plane tester. The frame loss rate is calculated using the following equation:
loss rate (%) = ((input_count - output_count) * 100) / input_count
The first trial is run by transmitting frames at line rate on all the media interfaces. Subsequent trials reduce the frame rate in 10% increments of line rate until there are two successive trials in which no frames are lost. Each run of the test should last 120 seconds or more so that a large number of frames reach the tester. In the case of the NPU under test in this article, no loss was observed for any packet size at any rate.
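A minimal sketch of that trial loop, assuming a hypothetical run_trial() hook into the traffic tester that returns frames sent and frames received, might look like this (the real benchmark drives an actual tester, such as the IXIA-based kit described earlier):

```python
def loss_rate(input_count, output_count):
    """Percentage of offered frames that the DUT failed to forward."""
    return (input_count - output_count) * 100 / input_count

def measure_loss_rates(run_trial, step_pct=10, duration_s=120):
    """Offer traffic at line rate, then at 10% decrements, until two successive
    trials lose no frames. Returns {offered_rate_pct: loss_rate_pct}.
    `run_trial(rate_pct, duration_s)` is a hypothetical tester hook."""
    results = {}
    clean_in_a_row = 0
    rate = 100
    while rate > 0 and clean_in_a_row < 2:
        sent, received = run_trial(rate, duration_s)
        results[rate] = loss_rate(sent, received)
        clean_in_a_row = clean_in_a_row + 1 if results[rate] == 0 else 0
        rate -= step_pct
    return results
```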
5. Overload Forwarding Rate
The overload forwarding rate is defined by the IPv4 benchmark spec as the maximum rate at which received frames are forwarded over an output port when the sustained (constant) aggregate rate of frames destined to that port exceeds its theoretical line rate. This is measured in output frames per second at a frame size of N bytes.
The overload forwarding rate test is carried out by setting up the route table to forward traffic received on all the ports to a single output port. This test determines the behavior of the DUT when it must drop packets.
On the NPU under test in this article, the results were obtained with a 4:1 overload, forwarding traffic from four input ports to a single output port. The results show that the port can still output traffic at its theoretical maximum rate of 1 Gbit/s for all packet sizes and traffic patterns; dropping packets has no impact on this NPU's ability to forward other packets to the port.
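The arithmetic behind the expected behavior is simple: under an N:1 overload the output port can emit at most its own line rate, and everything beyond that must be dropped. A tiny sketch, with the 4:1 Gigabit Ethernet case from this test as the example:

```python
def overload_expectation(num_input_ports, port_rate_bps=1e9):
    """Best-case output and implied drop percentage under an N:1 overload."""
    offered = num_input_ports * port_rate_bps
    forwarded = min(offered, port_rate_bps)      # output port cannot exceed line rate
    dropped_pct = (offered - forwarded) * 100 / offered
    return forwarded, dropped_pct

print(overload_expectation(4))   # (1e9, 75.0): 4:1 overload forwards 1 Gbit/s, drops ~75% of offered bits
```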
6. Forwarding Table Update Latency
Forwarding table update latency is the time interval starting when a request for a forwarding table update is issued and ending when notification that the request has completed is received. Under the IPv4 benchmarking spec, this latency is measured in seconds.
The forwarding table update latency test is used to determine the number of route table updates that can be performed on the DUT per second. Unlike the previous tests, this is purely a control-plane test. The DUT in this article performed 96,293 route updates per second.
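A minimal sketch of how update latency can be timed and converted into an implied updates-per-second figure follows, assuming a hypothetical issue_update() control-plane hook that blocks until the DUT acknowledges the change:

```python
import time

def measure_update_latency(issue_update, updates):
    """Time each forwarding-table update from request to completion notification.
    `issue_update(route)` is a hypothetical control-plane hook that returns
    once the DUT has applied the change."""
    latencies = []
    for route in updates:
        start = time.perf_counter()
        issue_update(route)
        latencies.append(time.perf_counter() - start)
    avg = sum(latencies) / len(latencies)
    return avg, 1.0 / avg    # average latency (s), implied updates per second
```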
7. Forwarding Table Update Rate
The forwarding table update rate is the maximum rate at which forwarding table updates can be issued while keeping the average forwarding table update latency below a threshold. This is measured as the number of route entries updated per second.
This test determines the impact of route table updates on forwarding performance. Results are obtained by determining the forwarding rate at 100%, 75%, 50%, and 25% of the maximum route update rate found in the previous test.
Using this metric, NPU users can determine how much control-plane processing the NPU can perform without impacting data-plane forwarding requirements. The NPU evaluated in this article recorded results at 100% and 25% of its maximum route update rate of 96,293 updates per second; these results are shown in Figures 4 and 5.
Figure 4: Forwarding rate at 100% of max route update rate.
Figure 5: Forwarding rate at 25% of max route update rate.
The results in Figures 4 and 5 indicate a measurable impact on forwarding performance for smaller packets at 100% of the route update rate. The impact disappears as the packet size increases or the route update rate decreases.
The NPF IPv4 benchmark provides an "apples-to-apples" comparison of network processors in real-world environments. It treats the NPU system as a black box, specifying the stimulus and measuring results in terms of network packets. The benchmark measures IPv4 forwarding performance metrics such as forwarding rate, throughput, latency, loss rate, overload forwarding rate, and table update rate, and it requires vendors to describe the device tested in detail. The benchmark has been available since October 2002, and multiple vendors have published results, which are available from the NPF.
This benchmark is the first in a series. The MPLS benchmark was released in March 2003, and an IP forwarding benchmark including larger and smaller IPv4 route tables as well as IPv6 routing will be published soon. Other benchmarks are also being worked on within the NPF. Future benchmarks and results, along with the data obtained using the currently ratified IPv4 and MPLS benchmarks, will provide NPU users with invaluable data.
References
1. P. Chandra, F. Hady, R. Yavatkar, T. Bock, M. Cabot, and P. Mathew, "Benchmarking Network Processors," HPCA-8, 2002.
2. Standard Performance Evaluation Corporation, SPEC CPU2000 V1.2, http://www.spec.org/cpu2000/
3. IETF Benchmarking Working Group, http://www.ietf.org/html.charters/bmwg-charter.html
4. The Transaction Processing Performance Council, http://www.tpc.org
5. The Tolly Group, http://www.tolly.com
6. "Benchmarking Terminology for Network Interconnect Devices," IETF RFC 1242.
7. "Benchmarking Methodology for Network Interconnect Devices," IETF RFC 2544.
8. Intel IXP2400 Network Processor, http://www.intel.com/design/network/products/npfamily/ixp2400.htm
9. NPF Benchmarking Results, http://www.npforum.org/benchmarking/benchmarking.shtml
10. IPv4 Benchmark Implementation Agreement, http://www.npforum.org/techinfo/IPv4IARev.pdf
11. Mae-West Route Table Snapshot, http://www.npforum.org/techinfo/maewest.txt
12. Script for Generating Route Tables, http://www.npforum.org/techinfo/parseMaeWest.zip
13. IPv4 Benchmark Reporting Template, http://www.npforum.org/techinfo/ipv4bm-template.pdf
14. NPF IPv4 Benchmark Implementation Kit, http://www.npforum.org/techinfo/IPv4ImplToolkit.zip
15. D. Meng, E. Eduri, and M. Castelino, "IXP2400 Intel Network Processor IPv4 Forwarding Benchmark Full Disclosure Report for Gigabit Ethernet," Revision 1.0, March 5, 2003, http://www.npforum.org/benchmarking/Intel_IPv4_Disclosure.pdf
Author's Note: Benchmarking results presented in this article are based on measurements taken on the IXP2400 NPU.
About the Authors
Manohar Ruben Castelino is a senior network software engineer in Intel's Network Processor Group. He has worked on projects primarily in the networking and network management areas. Manohar has a B.E. degree from KREC India and can be reached at firstname.lastname@example.org.
Frank Hady is a principal engineer at Intel, where he leads a small research group focused on providing high-performance platforms for compute-intensive networking applications. Frank serves as the NPF Benchmarking Task Group chair in the Network Processing Forum. He holds a PhD in electrical engineering from the University of Maryland, as well as MS and BS degrees in EE from the University of Virginia. Frank can be reached at email@example.com.