United Business Media EE Times



 


How to scale network processors








EE Times


As network processors begin to ship in quantity and appear in production-quality equipment, market demand for higher data rate support and software compatibility seems inevitable. However, it is no easy task to scale up gigabit Ethernet and OC-48 rate architectures to support 10-gigabit Ethernet and OC-192c rates and still meet carrier-class requirements.

Network equipment vendors typically opt for a network processor instead of a hardwired ASIC to improve time-to-market, reduce development costs and increase time-in-market. Although the industry generally agrees on the desired benefits of using a network processor approach, network processor suppliers disagree on several fronts, including: What is the best way to program network processors? How much of the required functionality should they supply vs. how much should the equipment vendor supply? How should a network processor solution be architected?

The tactic a network processor supplier chooses in each of these areas can significantly affect its ability to scale a network processor product to OC-192c performance (and beyond) and simultaneously provide adequate processing power per packet. This power is required to support the desired set of applications and traffic mixes.

Many network processors require low-level microcode/picocode/assembly coding and tedious optimization of data-path functionality in order to deliver adequate performance. Although the software community has long understood the disadvantages of low-level programming, low-level network processor programming is an order of magnitude more complex than programming a general-purpose processor, due to the use of parallel and/or pipelined engines in many architectures. Some solutions require application software to manage and schedule parallel execution of tasks split across multiple processing resources and to manage the associated sharing of state information.

A more insidious problem with network processors that require low-level programming is discovering how best to scale up performance while maintaining compatibility with previously written software. Low-level software typically must know the microarchitecture (the number of processing engines, the pipeline structure and the like) of the network processor for which it is written. Scaling up performance by methods other than simple improvements to clock frequency invariably requires changes to the underlying microarchitecture. These changes, in turn, mean that software written for the previous microarchitecture probably will not work effectively on the new one. Even data-path application software written in C often depends on a network processor’s underlying organization of support engines, which usually changes from generation to generation.

A preferable approach is to provide high-level programmability using application-oriented languages or models that do not require application software to explicitly manage parallel execution or sharing of state information across multiple processor resources. This strategy optimizes lifetime software development costs and intervals. It can, however, also lead to poor performance, because it is difficult to map the abstractions provided by the software model to the facilities provided by the underlying hardware. The challenge for the network processor supplier who follows this strategy is to create a high-performance, economical architecture that efficiently supports this mapping.

Most equipment vendors who use a network processor solution expect to obtain compatible traffic management functionality. Scaling a network processor product that lacks traffic management is easier than scaling one that offers it, but such an approach merely pushes the problem back to the equipment vendor, who is then forced either to develop a homegrown solution or to integrate functionality from another supplier. Either of these approaches increases the equipment vendor’s cost and risk, especially if the integrated configuration has not already undergone extensive interoperability testing. The same is true for other functions that are missing from some offerings, such as policing, statistics, and segmentation and reassembly.

Ideally, the network processor is part of a full, preintegrated fiber-to-fabric offering. This allows the equipment vendor to focus development and integration efforts on only those areas that add unique differentiating value.

At OC-192c, back-to-back 40-byte TCP/IP packets can arrive approximately every 39 ns. At OC-768c, packets arrive four times faster than this. At these rates, the external memory system becomes a significant bottleneck. Network traffic has little temporal or spatial locality (because packet arrivals are random), so caching is not nearly as effective as it is in traditional computing applications.
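The ~39 ns figure can be reproduced with a rough calculation. The sketch below assumes an OC-192c payload rate of 9,584.64 Mbit/s and 7 bytes of per-packet link-layer framing (roughly what PPP/HDLC packet-over-SONET framing with a 16-bit FCS adds). Neither number is given in the article, and the function and constant names are illustrative only.

```python
# Back-of-envelope arrival interval for minimum-size packets.
# Assumptions (not from the article): payload rate and framing overhead.

OC192C_PAYLOAD_BPS = 9_584.64e6   # assumed OC-192c payload capacity, bits/s
FRAMING_OVERHEAD_BYTES = 7        # assumed per-packet link-layer framing


def packet_interval_ns(payload_bps, packet_bytes,
                       overhead_bytes=FRAMING_OVERHEAD_BYTES):
    """Time between back-to-back packets, in nanoseconds."""
    bits_on_wire = (packet_bytes + overhead_bytes) * 8
    return bits_on_wire / payload_bps * 1e9


oc192_ns = packet_interval_ns(OC192C_PAYLOAD_BPS, 40)
# OC-768c is modeled simply as four times the OC-192c payload rate.
oc768_ns = packet_interval_ns(4 * OC192C_PAYLOAD_BPS, 40)

print(f"OC-192c: {oc192_ns:.1f} ns per 40-byte packet")
print(f"OC-768c: {oc768_ns:.1f} ns per 40-byte packet")
```

Under these assumptions the OC-192c interval comes out at a little over 39 ns and the OC-768c interval at a quarter of that, so any per-packet memory operation has to fit within a budget of roughly 10 ns at the higher rate.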

Support for full-sized routing tables and full-sized packet buffers almost invariably requires external memory storage. Given the large amount of external storage required, it is preferable to use DRAM as much as possible due to its huge advantages in cost, power and space compared to CAM and SRAM types of memory. Consider, for example, the cost/power/space implications of storing 1M+ IPv4 routes in CAMs or using 64+ MB of SRAM-based packet buffer memory in each direction! Typically, line cards need to be in a power envelope of 150 W or less in order to be economically deployed.

Memory bandwidth is often cited as the most serious constraint on network processor scalability. While bandwidth is certainly an issue (especially at OC-768c rates), memory bandwidth can nonetheless be scaled by:

  • Increasing the number/width of external memories (thereby increasing the number of I/O pins devoted to external memory interfaces). At OC-192c line rates and above, this means it is almost impossible to squeeze all required functionality into a single chip unless significant compromises are made on the functionality to be supported.
  • Increasing the amount of data that can be transferred per memory I/O pin in any given period of time. This is the approach taken by Rambus and DDR SDRAM memories.
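The two scaling options in the list above can be compared with a simple peak-bandwidth formula: data pins times clock rate times transfers per clock. The sketch below is illustrative only; the 64-bit width and 133 MHz clock are hypothetical values chosen for the example, not figures from the article.

```python
def peak_bandwidth_gbps(data_pins, clock_mhz, transfers_per_clock):
    """Peak bandwidth of one memory interface, in Gbit/s."""
    return data_pins * clock_mhz * 1e6 * transfers_per_clock / 1e9


# Hypothetical baseline: 64 data pins, 133 MHz, one transfer per clock.
baseline = peak_bandwidth_gbps(64, 133, 1)

# Option 1: double the number of data pins (more I/O on the package).
wider = peak_bandwidth_gbps(128, 133, 1)

# Option 2: keep the pins, move two transfers per clock (DDR-style).
ddr = peak_bandwidth_gbps(64, 133, 2)

print(f"baseline {baseline:.2f}, wider {wider:.2f}, DDR-style {ddr:.2f} Gbit/s")
```

Both options double the peak figure, but only the first costs additional I/O pins, which is why pin count becomes the limiting factor for single-chip designs. Peak bandwidth says nothing about random cycle time, which is a separate constraint.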

Memory latency is sometimes considered a constraint, although suitable pipelining can effectively hide latency. However, the random read/write cycle time of DRAM (typically ~65-75 ns for most DRAM types) is potentially significant.

For example, in the case of the buffer management function at OC-192c speeds, arriving 40-byte packets must be deposited into buffer memory every ~39 ns, and departing packets must be retrieved from buffer memory every ~39 ns. Thus, the buffer memory subsystem must support a write and a read every ~39 ns when processing a stream of back-to-back 40-byte packets. A simple buffer memory implementation that uses a DDR SDRAM interface can support only a single write or read every ~65-75 ns, which falls well short of the required level of performance.
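One way to quantify this gap is to ask how many random accesses would have to overlap in time to sustain one write and one read per packet interval. The sketch below uses only the article's figures (~39 ns interval, ~65-75 ns random cycle time). The article does not say how the overlap is achieved; interleaving across independent banks or devices is one common technique, mentioned here only as an assumption.

```python
import math


def accesses_in_flight(interval_ns, cycle_ns, accesses_per_packet=2):
    """Minimum number of overlapping random accesses needed to keep up.

    Each access occupies the memory for cycle_ns, and accesses_per_packet
    of them (one write, one read) must start every interval_ns.
    """
    return math.ceil(accesses_per_packet * cycle_ns / interval_ns)


for cycle_ns in (65, 75):
    shortfall = 2 * cycle_ns / 39
    needed = accesses_in_flight(39, cycle_ns)
    print(f"t_cycle={cycle_ns} ns: {shortfall:.1f}x short, "
          f"needs {needed} accesses in flight")
```

At both ends of the cycle-time range the answer is four: a single simple DRAM interface is more than three times too slow, so the buffer manager needs at least four independent access streams or an equivalent technique.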

This scenario also illustrates that, for a given line speed, it is easier to support channelized configurations, such as 4 x OC-48c, than concatenated configurations, such as 1 x OC-192c. For example, in a 4 x OC-48c configuration, while packets might arrive every ~39 ns in aggregate, packets from any single one of those OC-48c interfaces will arrive at only one-fourth that rate. Packets arriving from separate interfaces can be assigned to separate processing and buffer memory resources without packet reordering becoming a problem. Likewise, these separate processing resources will probably not need to extensively share state information when processing separate streams of traffic.
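The channelized case can be sketched as a dispatcher that gives each ingress interface its own queue. This is a minimal illustration, not a description of any particular product: the `(port, seq)` packet representation and the function name are hypothetical.

```python
from collections import defaultdict


def dispatch_by_port(packets):
    """Group (port, seq) packets into one FIFO per ingress port.

    Packets from one port never share a queue with another port's,
    so per-port order is preserved without any resequencing step.
    """
    queues = defaultdict(list)
    for port, seq in packets:
        queues[port].append(seq)
    return dict(queues)


# Interleaved arrivals from four OC-48c ports: (port, per-port sequence).
arrivals = [(port, seq) for seq in range(3) for port in range(4)]
queues = dispatch_by_port(arrivals)

# Each port's resources see one packet per four aggregate arrivals.
per_port_interval_ns = 4 * 39

print(queues)
print(f"per-port interval: ~{per_port_interval_ns} ns")
```

Each per-port resource has roughly 156 ns per packet instead of 39 ns. A concatenated OC-192c stream offers no such natural partition, so spreading it across parallel resources requires explicit order preservation and state sharing.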

How multicast packet processing is implemented can also significantly affect functionality and performance. Because of the large number of memory I/O pins needed in OC-192c and above configurations, most network processor suppliers have placed classification and traffic management in separate chips. However, opinions differ as to where the packet modification function should be placed. Some network processor architectures place it with the classifier chip; others with the traffic-management chip. Placing the packet modification function in the traffic manager offers some important scalability and functionality advantages.

These include the following:

  • To properly support multicast, each copy of a multicast packet must be individually modifiable and schedulable. Performing packet modification in the traffic manager allows the original packet to be buffered only once and dynamically modified as it is sent out. If the classifier is responsible for modification, replication and modification are done before the buffering point in the traffic manager. This requires the buffer management function to provide higher performance since each copy of a multicast packet occupies additional buffer space and consumes additional memory cycles.
  • Network-bound multicast packets need to pass through the classifier only on ingress (unless a configuration specific to an application or system is needed to support egress classification). Egress packets need to pass through only the traffic manager.
  • Performing packet modification after traffic management is complete allows the use of traffic management results to help determine packet modification; e.g., supporting forward explicit congestion notification (FECN) in frame relay or explicit congestion notification (ECN) in IP requires marking the packets that encountered congestion.
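The buffering cost of the two placements can be compared with a simple count of memory operations per multicast packet. The sketch below is an illustrative model under the assumption that every copy is written and read exactly once; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BufferCost:
    writes: int        # buffer memory write operations
    reads: int         # buffer memory read operations
    bytes_stored: int  # buffer space occupied


def replicate_before_buffering(packet_bytes, fanout):
    """Classifier replicates and modifies; every copy is buffered."""
    return BufferCost(writes=fanout, reads=fanout,
                      bytes_stored=packet_bytes * fanout)


def modify_on_egress(packet_bytes, fanout):
    """Traffic manager buffers once and modifies each copy as it is sent."""
    return BufferCost(writes=1, reads=fanout, bytes_stored=packet_bytes)


before = replicate_before_buffering(packet_bytes=64, fanout=8)
after = modify_on_egress(packet_bytes=64, fanout=8)

print(before)
print(after)
```

In this model the read count is the same either way, but replicating before the buffering point multiplies both the writes and the occupied buffer space by the fan-out.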

Given these issues, even if a particular generation of a network processor can handle wire-speed traffic, it is important to ask:

  • What kinds of traffic workloads are supported? Can the network processor support full line rate with any packet size with concatenated interfaces (including back-to-back 40-byte packets)? How is the random cycle time problem solved (for 10-Gbit and higher line speeds) to allow this in the face of random packet arrivals?

  • What level of processing per packet is supported under each of these workloads?
  • Can the typical user write full performance software without heroic efforts?
  • What gate counts, clock speeds and fabrication technology are required to achieve this level of performance? High gate counts, high clock rates or the need for specialized fabrication technology make scalability difficult.
  • If multiple hardware-processing contexts are used, how independent are they from each other under various workloads? Independent contexts can be more readily spread across multiple execution engines to scale performance.
  • How likely is it that adequate resources will be invested in scaling the network processor to higher line speeds and performance levels going forward? Many of the 30+ companies currently developing network processor solutions are unlikely to be in business in a year. Even established companies may not be willing to adequately invest in the evolution of their current network processor solutions.

An example of these scalability principles at work is Agere Systems’ PayloadPlus family of software-compatible network processor solutions (2.5-Gbps and 10-Gbps versions have been announced, with higher-performance versions in development). PayloadPlus network processors are programmed in high-level application-oriented languages. The resulting software is focused almost entirely on performing the application and does not encode details of the underlying microarchitecture.

These languages are:

  • Functional Programming Language (FPL) is used for packet classification. It requires an order of magnitude fewer lines of code than C and offers other benefits such as reduced development cost and interval, lower defect rates and improved maintainability. FPL also maps efficiently onto the underlying pattern-processing engines that embody Agere’s patented classification technology.
  • Agere Scripting Language (ASL) is used for packet modification, policing and statistics. Effectively a subset of C (with typical scripts requiring less than 50 lines of executable code), ASL can be efficiently mapped into the underlying VLIW engines that provide the associated functionality on the PayloadPlus devices.

PayloadPlus network processors provide full carrier-class packet processing functionality, including classification, policing, statistics, queuing, scheduling/shaping, buffer management, data modification and fabric/framer interfacing.











All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.