As the demands on carrier and enterprise networks have grown along several dimensions at once (performance, functionality, and extensibility), network processors are increasingly replacing alternatives such as ASICs and general-purpose processors. However, the very strength of network processors, being a "soft" solution implemented in software, is also the key challenge in deploying them.
A typical network processor (NPU) has many parallel low-level RISC-type processors that need to be programmed by the system builder. Typically these processors have one or more types of low-level, software-managed interconnect among themselves, have a new instruction-set architecture, address different types of on-chip and off-chip memories, have a very limited instruction space, and don't offer the comfort of an operating system. Complicating the environment further, the processors may have hardware-controlled context threading, and the software may have to deal with on-chip and off-chip specialized modules such as TCAMs, CRC and hash units, cipher engines, and classification hardware.
The result is that the complexity facing the NPU software developer can be several orders of magnitude greater than that of writing software for a typical general-purpose processor. Although NPU manufacturers provide a subset of the C language in addition to assembly language, NPU software written in C tends to look much like assembly code, because the program must deal with the specifics of the hardware and because there are no libraries or operating system beneath it.
It occurred to a group of us, veterans of the first generation of NPUs, that there is a radically different alternative to the usual approach of writing low-level, machine-dependent code for NPUs, be it in assembler or C. Instead, one could devise an extremely high-level functional (as opposed to procedural) language for expressing a wide variety of packet-processing applications, a language whose primitives are such things as tracking a connection or session, removing an outer header, translating IP addresses, encrypting a packet, scanning the payload for a regular expression, and so on.
Because the language is functional or declarative, it is intrinsically parallel, a good match for the NPU underneath. One can then implement the language by developing a virtual machine on the NPU's parallel processors, along with a compiler that translates programs written in the language into the virtual-machine representation. Provided that the language is broad and powerful enough, very specific applications such as content switches, session border controllers, GGSNs, VoIP peephole firewalls, security gateways, intrusion detectors, SIP proxies, IPv4/IPv6 NAT, and others can be expressed entirely in the language and run on the virtual machine.
The virtual-machine approach almost completely abstracts the network processor, letting the application developer focus all of his or her attention on packet processing. These strengths are not completely free, and the obvious tradeoff would appear to be performance. However, the performance penalty is surprisingly small, so much so that we claim performance as a benefit as well. For starters, unlike, say, a Java virtual machine running on a Pentium, the NPU-based virtual machine can be designed as both a pipelined engine and one with N-way parallelism.
To illustrate, the accompanying figure shows an implementation of such a virtual machine upon the 16 microengines of an Intel IXP2800 NPU. The virtual machine language consists of rules containing expressions and actions, where actions are performed for all rules evaluating true. Expressions can evaluate fields in the current packet, internal state, and also explicit connection state, such as tracking TCP connections or SIP sessions. So the virtual machine can simultaneously be receiving new packets from network ports or a switch fabric, doing state lookup on one or more earlier packets, evaluating rules for an additional set of packets, processing true rules on up to 12 packets, doing next-hop lookups on another set of packets, and transmitting yet another set.
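To make the rule model concrete, here is a minimal sketch in Python of "evaluate every rule, then perform the actions of all rules that evaluated true." It is not the virtual-machine language itself, which is not shown in this article; the names `Rule`, `process`, `track_connection`, and `mark_drop`, the dictionary packet format, and the two sample rules are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical packet representation: a dict of field name -> value.
Packet = dict


@dataclass
class Rule:
    name: str
    expression: Callable[[Packet, dict], bool]  # may read packet fields and state
    action: Callable[[Packet, dict], None]      # may modify packet and state


def process(packet: Packet, rules: list, state: dict) -> list:
    """Evaluate all expressions first, then run the action of every true rule."""
    fired = [rule for rule in rules if rule.expression(packet, state)]
    for rule in fired:
        rule.action(packet, state)
    return [rule.name for rule in fired]


def track_connection(packet: Packet, state: dict) -> None:
    # Record explicit connection state, keyed here by (src, dst, dport).
    key = (packet["src"], packet["dst"], packet["dport"])
    state.setdefault("connections", set()).add(key)


def mark_drop(packet: Packet, state: dict) -> None:
    packet["drop"] = True


# Two illustrative rules (assumptions, not taken from any real configuration).
rules = [
    Rule("track-syn",
         lambda p, s: p.get("proto") == "tcp" and p.get("syn", False),
         track_connection),
    Rule("block-telnet",
         lambda p, s: p.get("dport") == 23,
         mark_drop),
]

state = {}
packet = {"proto": "tcp", "syn": True,
          "src": "10.0.0.1", "dst": "10.0.0.2", "dport": 23}
fired = process(packet, rules, state)
```

Because the expressions are evaluated before any action runs, they do not depend on one another, which is the property that lets a real implementation spread expression evaluation and action processing across separate pipeline stages.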
Depending on the nature of the application, the system designer can divert microengines away from true-rule action processing to provide more computational resources to state lookup, expression evaluation, and/or IPv4/IPv6 forwarding. The expression evaluator pipeline stage can also make use of external classification hardware, and in fact we have developed an ASIC to do complex classification and layer-7 payload scanning specific to our virtual-machine architecture.
The IXP2800 provides many opportunities to create an innovative virtual machine that can deliver remarkably high performance: 16 microengines, each executing instructions at a 1.4 GHz rate; very large register sets (over a thousand registers per microengine); eight hardware threads per microengine; asynchronous memory operations; and seven independent memory controllers. Several gigabits per second of packet traffic can be processed by the virtual machine when doing fairly complex operations, with higher rates achievable for simpler applications.
There are several other aspects of this approach that have a significant impact on performance. First, one often associates added overhead with any type of virtual machine approach, but overhead as a percentage of useful cycles can be minimized by making the primitives (basic operations) complex yet general purpose. Basic actions can be such functions as: encrypt this packet and encapsulate it as a tunnel-mode packet, forward the packet, pick an IP destination address having the fewest TCP connections currently established, scan the payload for this fixed string or this regular expression, add an entry to this connection table.
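The amortization argument can be put as simple arithmetic. The sketch below assumes a fixed per-primitive dispatch cost for the virtual machine; the function name `overhead_fraction` and all of the cycle counts are hypothetical and are not measurements of any real implementation.

```python
def overhead_fraction(dispatch_cycles: int, work_cycles: int) -> float:
    """Share of total cycles spent on virtual-machine dispatch for one primitive."""
    return dispatch_cycles / (dispatch_cycles + work_cycles)


# Hypothetical numbers, for illustration only.
DISPATCH = 20            # assumed cycles to decode and dispatch one primitive

fine_grained = overhead_fraction(DISPATCH, 10)     # e.g. a single field compare
coarse_grained = overhead_fraction(DISPATCH, 2000) # e.g. encrypt-and-encapsulate
```

Under these assumed numbers, a fine-grained primitive spends about two thirds of its cycles on dispatch, while a coarse-grained one spends about one percent, which is why complex primitives keep the virtual-machine overhead small.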
We also know from painful experience that perhaps the most significant design consideration in NPU software is the differential between memory speed and processing speed. Again using the IXP2800 as an example, the average read time from memory is 150 to 300 cycles, depending on memory type, congestion, and so on, but the instruction execution time is one cycle. So one can execute, on just one of the processors, a few hundred instructions in the time it takes to do one memory read. One consequence is that performance becomes largely a function of how much data is moved between processor and memory per packet. Another is the importance of optimizing memory accesses to the nth degree, which has been done extensively in the virtual-machine implementation.
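A rough back-of-the-envelope model shows what this differential means for the eight hardware threads on each microengine. The sketch assumes a simple round-robin model in which a thread swaps out on every memory read and context switches are free; the function names are hypothetical, and only the latency range, the one-cycle instruction time, and the thread count come from the figures quoted in this article.

```python
def instructions_per_read(latency_cycles: int, cycles_per_instruction: int = 1) -> int:
    """Instructions one processor could execute while a single read is outstanding."""
    return latency_cycles // cycles_per_instruction


def compute_cycles_to_hide(latency_cycles: int, threads: int) -> int:
    """Cycles each thread must compute between reads so the microengine never idles.

    While one thread waits, the other (threads - 1) threads each run one compute
    slice; together those slices must cover the read latency.
    """
    if threads < 2:
        raise ValueError("latency hiding needs at least two threads")
    return -(-latency_cycles // (threads - 1))  # ceiling division


THREADS_PER_MICROENGINE = 8  # from the article

best_case = compute_cycles_to_hide(150, THREADS_PER_MICROENGINE)
worst_case = compute_cycles_to_hide(300, THREADS_PER_MICROENGINE)
```

Under this simplified model, each thread needs somewhere between roughly 20 and 45 cycles of useful work per memory read to keep a microengine busy. Code that reads memory more often than that stalls regardless of how fast the instructions execute, which is why the number of memory accesses per packet dominates.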
One more consideration is that no matter how well optimized the virtual machine is, many applications will still contain a small number of critical functions that benefit from bypassing the virtual machine and being implemented directly on the hardware. A good virtual machine will provide a clean way to interface to these user-written, directly coded algorithms.
Network processors represent a powerful new technology capable of serving as the core of next-generation networking equipment, bringing such equipment both high wire-speed performance and the benefits of a software-centric implementation. New approaches to the difficult task of producing NPU software, such as the virtual machine approach, can make significant improvements in development cost, time to market, extensibility, and scalability and thus will make harnessing the power of NPUs a more achievable objective.
Glen Myers is chief executive officer at IP Fabrics Inc. (Beaverton, Ore.).
Figure: Acting upon 16 microengines, the virtual machine can simultaneously receive new packets from network ports or switch fabrics while executing state lookups, checking rules, performing next-hop lookups on other packets, and transmitting data.
Source: IP Fabrics, Inc.