The advent of intelligent switching and routing in enterprise, access, and metro box designs is forcing test equipment to dramatically evolve. Test vendors can no longer build boxes that perform evaluations only up to Layer 3 of the OSI stack. Instead, they need to build boxes that test at the link layer while simultaneously letting engineers evaluate the performance of applications at Layer 7.
To meet the demands of intelligent switching/routing, designers must develop multi-port test systems that allow their customers to simulate high volumes of traffic from many different sources and measure tens, and perhaps hundreds, of metrics within a single port. Therefore, test designers must employ a microprocessor at each port in the test system.
But what processor is best to choose? That's the big question for today's Layer 7 test equipment designers. This article provides a real-life look at the choices and tradeoffs one design team made when choosing a processor for a multi-port Layer 7 test system. Processors evaluated for the application included PowerPC processors from IBM and Motorola and Pentium-class processors from Intel.
Prior to creating the processor-based multi-port Layer 7 test system, the design team highlighted here had developed many data and networking test sets using high-speed FPGA devices, which allowed test and analysis at wire speed. For the most part, these designs did not feature a processor per port; the only exception involved the use of a relatively weak processor for running PPP over SONET.
To leverage the existing customer base and its installed equipment, the engineering team decided to build the multi-port test system using the same form factor as previously developed products. Further, the engineers wanted to maintain backward compatibility with the existing design base, so certain FPGA logic had to be included. Finally, customers had come to rely on highly compact designs, so the engineers tried to squeeze as many ports as possible into a confined space.
A typical 16-blade chassis with several line cards appears in Figure 1. Each line card in the chassis shown conforms to a historical form factor. Each slot also has a limited power budget and cooling capacity. The challenge was to create a design such that up to 128 10/100/1000 copper ports, each with a dedicated processor, could fit in a single chassis. To maintain scalability, each port would act independently.
Figure 1: Sixteen-blade chassis with line cards installed.
Stacking the Application Deck
From a Layer 4 perspective, the embedded processors had to be capable of running many simultaneous TCP and UDP stacks. Further, each stack needed to run over a different network address on a different subnet. In fact, this spirit of independence ran as low as the Layer 2 level, where each stack could be associated with a different Ethernet MAC address.
The design also had to allow for each processor to step aside in situations where the FPGAs would transmit a raw packet without processor intervention. All this independence at the lower layers was designed to support a high amount of activity at Layer 7, where numerous applications had to run independently as the overall test simulated real-world traffic conditions.
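This per-host independence at the network layer can be sketched in a few lines of socket code: each emulated host sources its traffic from its own local address. The function name and loopback address below are illustrative only, not the product's actual code.

```c
#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Open a UDP socket bound to one specific local address, so that
 * traffic from each emulated host carries its own source identity.
 * Returns the socket descriptor, or -1 on failure. */
int open_host_socket(const char *local_ip)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in src;
    memset(&src, 0, sizeof(src));
    src.sin_family = AF_INET;
    src.sin_port = 0;                       /* let the OS pick a port */
    inet_pton(AF_INET, local_ip, &src.sin_addr);

    if (bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Repeating this with a different address per emulated host (and, one layer down, a different MAC per stack) is what gives each simulated endpoint its independent identity.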
The objective was to create an environment where the processors, coordinated through a sophisticated software interface, would create traffic patterns that emulated the real world. With this "real-world" traffic, realistic tests could be performed on networks and network elements, where realistic performance measurements could be ascertained.
So at Layer 7, it was necessary to run third-party software like NetIQ's Chariot or Radview's WebLOAD, as these test applications have been designed and deployed to simulate real world traffic conditions on a wide number of processors. To support this effort, the engineers realized that their selection of an OS was perhaps the most critical decision in their design. The OS became central to the selection of a processor, and it influenced the range of processors under consideration.
The OS had to support the following criteria:
- Multi-tasking: Each application might require the processor to "fork" a new process to support it.
- Multi-threading: Application test software like Radview's WebLOAD typically creates multiple threads within a single forked process.
- Popular Support: Support for popular third party applications like FTP, Telnet and HTTP.
- Wide Array of Supporting Processors: Must be supported on many processors so as to not limit the choice of processors for this design.
- Access to Source: Allows modification to optimize the Layer 4 stacks (TCP/UDP) for the specific hardware configuration.
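The fork-per-application, threads-per-process pattern described by the first two criteria can be sketched as follows. The worker body and the process/thread counts are placeholders for the actual test applications, not code from the design.

```c
#include <pthread.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for one emulated client session. */
static void *worker(void *arg)
{
    (void)arg;
    return NULL;
}

/* Fork one process per simulated application, then spin up several
 * threads inside each process, mirroring the multi-tasking and
 * multi-threading the OS had to support. Assumes nthreads <= 16.
 * Returns the number of child processes reaped. */
int spawn_workers(int nproc, int nthreads)
{
    for (int p = 0; p < nproc; p++) {
        pid_t pid = fork();
        if (pid == 0) {                     /* child: run its threads */
            pthread_t tid[16];
            for (int t = 0; t < nthreads; t++)
                pthread_create(&tid[t], NULL, worker, NULL);
            for (int t = 0; t < nthreads; t++)
                pthread_join(tid[t], NULL);
            _exit(0);
        }
    }
    int reaped = 0;
    while (wait(NULL) > 0)                  /* parent: reap children */
        reaped++;
    return reaped;
}
```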
Based on the above criteria, the engineers chose to go with an open-source implementation, of which Linux or FreeBSD were the leading contenders. In the end, Linux was chosen as the OS for this project. Linux features a fairly efficient implementation of various protocol standards, most notably TCP/IP. It also has a wealth of Layer 7 applications that have stood the test of time, including FTP, Telnet, SSH, VPN, HTTP and so on.
Note that the requirement for a real-time OS was eliminated by the creative use of FPGA logic. All latency-sensitive signals were queued in the FPGA logic, thus unshackling the OS from the burden of maintaining low-latency interrupt response times. Without the FPGA logic, the currently available Linux OS would not have been capable of handling all the requirements of this design.
Having chosen the Linux OS, the design team had an opportunity to study a wide range of possible processors and determine which one had the most efficient implementation. They were also very interested in any on-chip functions that would accelerate the processor's network performance. On-chip functions, such as DMA controllers or an MMU, would reduce the amount of peripheral hardware required.
The engineers started out by sketching a diagram of the embedded processor environment, as shown in Figure 2. They then proceeded to evaluate processors based on the design.
Figure 2: Processor-board block diagram.
Having a white-board design in hand, the engineers combed through a number of benchmarks and hardware specifications looking for the ultimate implementation. Published benchmarks were used to rank each processor, but not all benchmarks were given equal weight.
For instance, some processors scored high in arithmetic benchmarks because they contained one or more internal double-precision floating-point units. It was anticipated that double-precision floating-point arithmetic was not necessary for the proposed embedded design, so this particular benchmark was not weighted very heavily. The results of well-established benchmarks, such as Dhrystone 2.1, were given a lot of weight in the selection process, so long as the objectives of the benchmark were well aligned with the design team's requirements.
Related to benchmarks was a processor's ability to efficiently execute compiled code, especially those processors that featured simultaneous execution units. Efficiency of compiled code, as well as the availability of good cross-compilers, was considered, though there wasn't a lot of good objective material to choose from. Engineers had to draw on their own experiences in this area. As it turned out, all the engineers that worked on this project already had personal experience with other types of processors, and this pool of experience contributed significantly to the selection decision.
Code Execution Criteria
The criteria of "efficiently executed code" include a processor's ability to execute those functions that are central to handling IP packets. At first glance, this may seem a bit amorphous. However, things such as internal cache size, task-switching speed, pipelined architectures and simultaneous execution units can contribute significantly to a processor's ability to handle IP packets efficiently. With this in mind, the engineers closely analyzed the internal architecture of each processor, often collaborating in groups to discuss and weigh the benefits and drawbacks of the various architectures under consideration.
Since the design was to be incorporated into a very compact form factor, the design team scoped out the minimum required external components. The goal was to provide eight processors per slot, so every square millimeter represented precious real estate. Since everything, including the OS itself, was "soft", there was no need to include external non-volatile memory, such as flash. The chassis OS, on start-up, would keep each embedded processor in a hold state while uploading its OS. This eliminated the need for external non-volatile memory, essentially by replacing it with RAM.
Power dissipation became a highly limiting factor. A major requirement was to support eight processors on each line card, and each slot had a power budget of 100 W. Related to the power budget was the thermal cooling ability of the fans coupled with temperature limitations of many components, including the processors themselves as well as the surrounding FPGAs.
A fully loaded chassis would host 16 line cards, each of which would use eight processors. Thus to get a "ballpark" figure of the contribution to the overall power dissipation from the selected processors, the individual power dissipation had to be multiplied by 128. As it turned out, this simple formula immediately eliminated many processors.
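That ballpark budgeting reduces to a one-line calculation: subtract the per-slot overhead for FPGAs and support logic from the slot budget, then divide by the number of processors. The 40 W overhead figure below is an assumed value for illustration, not a number from the actual design.

```c
/* Rough per-processor power budget for one line card.
 * slot_budget_w  - total power available to the slot (100 W here)
 * overhead_w     - assumed FPGA/support-logic share (illustrative)
 * cpus_per_slot  - processors per line card (8 here) */
double max_watts_per_cpu(double slot_budget_w, double overhead_w,
                         int cpus_per_slot)
{
    return (slot_budget_w - overhead_w) / cpus_per_slot;
}
```

With a 100 W slot budget, an assumed 40 W of non-processor overhead, and eight processors per card, each processor would have to stay under roughly 7.5 W, a ceiling that many candidate parts could not meet.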
Finally, the engineers were very sensitive to cost. It was important to maintain a final pricing structure that would be competitive with racks of individual 1U-high servers.
Choosing the Processor
Several processors were considered during the development of the multi-port Layer 7 test box. These included Intel's Pentium architectures, the Motorola MPC74XX series, and the PowerPC from IBM. After weighing all factors, the engineers decided that the PPC750 from IBM was the clear winner. Its efficiency in CPU processing power per watt and per unit of board space put it into the winning category. The Linux implementation for the PowerPC is particularly efficient, which further supported the selection of the IBM processor.
The design team initially chose the PPC750CXe for an earlier design, then migrated to the PPC750FX, an upgraded part, as it became available. The main reasons cited for this choice were: 1) a high MIPS-per-watt ratio, 2) an integrated 256K L2 cache, and 3) an instruction execution rate of 1392 Dhrystone MIPS, which exceeded the anticipated 1200 Dhrystone MIPS requirement. The FX increased the cache to 512K, allowed higher clock rates at lower power, and doubled the usable bus speed.
Having selected the PPC750 processor, the engineering team faced the next challenge of implementing the hardware into a board design. Several reference designs were studied along with a number of design application notes. Also, application engineers from IBM were available to give advice on the layout and help understand some overall design concepts. Further, IBM maintained a "PPC Support Line", which is a dedicated phone number that acted as a "hotline" for immediate support.
The end result is that the design went from "kick-off" to a functional card in less than six months. For the engineering team, this was a relatively long development cycle. However, considering the compactness of the design and its resulting layout challenges, this was an impressive development cycle.
Fortunately, the firmware and software engineers didn't have to wait for the hardware design to be completed. Existing processor boards were available for writing and debugging code while the new hardware design coalesced. Because the specific addresses and resources of the existing test boards differed from those of the newly designed boards, the firmware development engineers simply "stubbed out" specific code where necessary.
In addition, they used compiler directives to control the compilation of code and determine its target. By the time the new hardware was ready for power-up, the software team was ready with the code. As a result, the design was up and running on the first hardware production cycle.
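Conditional compilation of this sort might look like the following. The macro name and register addresses are hypothetical stand-ins, not the product's real memory map.

```c
/* Select a board-specific register address at compile time, so the
 * same source tree targets both the legacy test boards and the new
 * PPC750FX hardware. Addresses below are illustrative only. */
#ifdef TARGET_PPC750FX
#define PORT_STATUS_REG 0xF0001000UL   /* hypothetical new-board address */
#else
#define PORT_STATUS_REG 0xE0000400UL   /* hypothetical legacy-board address */
#endif

/* Return the status-register address compiled in for this target. */
unsigned long port_status_addr(void)
{
    return PORT_STATUS_REG;
}
```

Building with `-DTARGET_PPC750FX` (or without it) selects the target, letting one codebase serve both boards until the new hardware arrived.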
About the Author
Dan Schaefer is systems engineering manager at Ixia Communications. He holds a BSEE degree from the University of Missouri, Columbia and has over 15 years' experience in hardware, software and FPGA logic design. Dan can be reached at firstname.lastname@example.org.