No one will tell you developing a network-processing element is an easy task. These chips must parse data, classify data streams, switch, and effectively manage traffic flows at line rates pushing into the 10-Gbit range.
And if you thought that was hard enough, think again. As engineers begin designing the next wave of network processing elements, additional features will be added, such as enhanced security, while line rates push to even higher levels.
As designers can quickly see, theory, estimates, and luck are no longer enough to successfully design a network processing unit (NPU), network search engine, or other network processing elements. To fill the gap, simulation provides a vehicle by which architectural decisions can be validated prior to product fabrication. Simulation enables basic code and packet-processing functions to become operational and facilitates the evaluation of system performance. Simulation and its associated tools allow developers to intelligently analyze design decisions, resulting in better product solutions.
In this article, we'll look at the impact that simulation has on the development of network processing elements. We'll also examine some advanced techniques needed to ease the simulation process in the future.
The traditional design flow can be reduced to a few key steps (Figure 1). The process begins with the hardware and software teams deciding on a system architecture. Once the architecture is defined, the hardware team designs and fabricates the system hardware while the software team develops the system software. Each team attempts to hypothesize the best possible solutions as its designs progress. When the designs are ready, the integration phase commences. The hardware and software are integrated, debugged, and tested, and the system is optimized for maximum performance.
Figure 1: Diagram illustrating a traditional design flow.
If a major performance problem is discovered during the integration phase, it is highly probable that the architecture will have to be modified. Typically, this requires both system hardware revisions and software rewrites. The traditional design cycle can therefore lead to schedule slips and delayed product deployment in an unforgiving marketplace where time to market is critical to the success of a product.
Making the Traditional Approach Even Harder
The challenges associated with the traditional design methodology are compounded by several factors: the increased complexity of individual component devices, the growing number of design decisions required by the interaction between two or more individual devices, and the need to re-evaluate these decisions with respect to the system as a whole. To illustrate these difficulties, let's examine a system that includes an NPU.
An NPU typically has multiple processors or microengines, each running multiple threads simultaneously (Figure 2). Multiple instructions are being executed on each individual clock cycle instead of the single instruction that would be executed in a traditional single-processor system such as the mainstream personal computer.
Figure 2: Diagram of a typical NPU.
Design complexity in an NPU increases because of the multiple internal and external memory buses housed in these designs. Developers must properly balance memory bus bandwidth and bus contention in order to avoid significant performance or throughput problems.
For example, the receive packet interface can be storing data to the DRAM while one or more microengines simultaneously attempt to access the DRAM to obtain packet header information. The DRAM design must provide sufficient bandwidth to service all requestors and the bus access architecture must efficiently support and resolve requests from all parties attempting access during the same clock cycle. In addition, the attached system memory must be carefully partitioned to facilitate sufficient bandwidth for today's applications and the design must contain enough headroom for future applications.
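The contention scenario just described can be sketched with a toy cycle-based model. All names and probabilities here are illustrative assumptions, not a description of any real NPU's arbitration scheme:

```python
# Toy cycle-based model of DRAM bus contention: a receive interface and
# several microengines compete for a single-grant-per-cycle bus. The
# request probabilities are made-up illustrative values.
import random

random.seed(0)  # fixed seed so the run is reproducible

CYCLES = 1000
N_MICROENGINES = 4
stalls = 0  # requests that had to wait because the bus was already granted

for cycle in range(CYCLES):
    # Each requestor asks for the bus with some probability this cycle.
    rx_request = random.random() < 0.5   # receive interface storing packet data
    me_requests = sum(random.random() < 0.2 for _ in range(N_MICROENGINES))
    requests = int(rx_request) + me_requests
    # Only one requestor is granted per cycle; the rest stall.
    if requests > 1:
        stalls += requests - 1

print(f"stalled requests over {CYCLES} cycles: {stalls}")
```

Even this crude sketch shows how quickly stall cycles accumulate when multiple requestors share one bus, which is exactly the kind of headroom question a real system-level simulation quantifies.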
SRAM, TCAM Contention Problems
The same DRAM bus bandwidth and contention problems apply to the SRAM and network search engine (NSE), also known as ternary content addressable memory (TCAM), that can be connected to the NPU's SRAM controllers. QDR SRAMs and Network Processing Forum (NPF) Look-Aside 1 (LA-1) compliant NSEs can be placed on separate buses or they can simultaneously co-exist on the same QDR memory bus. Again, both bus bandwidth and contention must be considered because the "write-side" heavy NSE device accesses should be balanced with the typically "read-side" heavy SRAM memory devices. However, the combination of these two devices introduces a new set of criteria for design decisions.
One important design consideration is deciding which tables should be put in SRAM and which should be put in the NSE. IPv4 addresses can be stored in SRAM data structures called "tries." These SRAM "tries" sometimes require megabytes of storage and multiple accesses for searching, whereas NSEs often use one-tenth the storage and perform single access searches.
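The access-count difference between the two approaches can be illustrated with a minimal bit-at-a-time trie, counting one memory access per level traversed versus the single access an NSE needs. This is a deliberately simplified sketch, not any vendor's actual data structure:

```python
# Minimal illustrative trie: one dict lookup stands in for one SRAM access.
# A real longest-prefix-match trie is multibit and far more compact.

def insert(trie, prefix_bits):
    """Store a prefix (given as a bit string) in the trie."""
    node = trie
    for b in prefix_bits:
        node = node.setdefault(b, {})
    node["match"] = prefix_bits

def lookup(trie, addr_bits):
    """Walk the trie bit by bit, counting simulated memory accesses."""
    node, accesses, best = trie, 0, None
    for b in addr_bits:
        if b not in node:
            break
        node = node[b]
        accesses += 1                      # one memory access per trie level
        best = node.get("match", best)     # remember longest prefix seen
    return best, accesses

trie = {}
insert(trie, "1100")                            # a 4-bit prefix
match, sram_accesses = lookup(trie, "11001010") # 8-bit "address"
print(match, sram_accesses, "accesses vs 1 NSE access")  # 1100 4 accesses ...
```

Multiplying that per-search access count by the line rate is what makes the SRAM-versus-NSE bandwidth trade-off concrete.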
For shorter keys, like IPv4, the cost difference between the SRAM and NSE approaches might be negligible at lower line rates. But for larger or more complex search keys, such as IPv6 and access control lists, the benefit of the NSE approach becomes clearer.
Another important related design consideration is the update rate and overhead for managing the SRAM and NSE solution approaches. Using more sophisticated data structures in the SRAM design can decrease storage requirements. However, the more sophisticated the data structure, the more overhead is required to maintain it, which leads to slower updates.
A good designer will certainly weigh all the complex factors just described against the single-cycle search and update capabilities of an NSE solution. A designer who uses simulation to test the efficacy of these competing choices will more easily recognize the trade-offs between SRAM and NSE solutions.
System-Level Architectural Modeling
As designers can clearly see, the complexity introduced by both the NPU and its associated devices can raise significant architectural design issues in today's networking systems. The big question designers must ask, however, is how to deal with these complexities.
One solution is to turn to the traditional design methodology. The problem with that approach, however, is that when an error is discovered, it cannot be corrected without returning to the initial development stage.
A better option is to use a system-design approach that adds simulation capabilities to the development process. Let's see how this option works.
Figure 3 shows a development flow more suited to NPU-based designs.
Figure 3: Development flow for an NPU-based design.
As with the traditional system design approach, system-level architectural modeling begins with determining the architecture and writing the software code. However, the next step is to simulate the entire system so that the code and system can be tested, debugged, and optimized prior to the creation of the hardware solution.
If an architectural problem arises during the simulation, the architecture can be adjusted on paper and in the simulation. Then designers can revalidate the design. This flexibility significantly speeds up the flow and eliminates the costly and time-consuming process of redesigning the hardware.
The system-level architectural model provides a working prototype of the entire system infrastructure in order to facilitate the system designer's decision-making process. This working model comes in two flavors: data-accurate and cycle-accurate.
The data-accurate version processes and responds to commands just as an actual device does. The difference is that the data-accurate model responds instantaneously and without respect for a device's timing.
The cycle-accurate version is a more refined architectural model that incorporates all the system's timing requirements. It includes time constraints for both the device and the system modeling. Cycle-accuracy is, therefore, a superset of the data-accurate model with the difference being that one or more clocks are incorporated into the model.
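The relationship between the two flavors can be sketched in a few lines. The class names, table contents, and 8-cycle latency below are illustrative assumptions, not real device parameters:

```python
# Sketch of the two model flavors. Names and latency are hypothetical.

class DataAccurateNSE:
    """Responds with the correct result immediately, ignoring device timing."""
    def __init__(self, table):
        self.table = table
    def search(self, key):
        return self.table.get(key), 0      # result ready at once, cycle 0

class CycleAccurateNSE(DataAccurateNSE):
    """Superset of the data-accurate model: adds a clock and a fixed latency."""
    LATENCY = 8                            # illustrative search latency in cycles
    def __init__(self, table):
        super().__init__(table)
        self.clock = 0                     # current simulation cycle
    def search(self, key):
        result, _ = super().search(key)
        ready_at = self.clock + self.LATENCY
        return result, ready_at            # same data, plus when it is valid

table = {"10.0.0.1": "port3"}
print(DataAccurateNSE(table).search("10.0.0.1"))   # ('port3', 0)
print(CycleAccurateNSE(table).search("10.0.0.1"))  # ('port3', 8)
```

Note that the cycle-accurate model returns exactly the same data; only the timing annotation is added, which mirrors the "superset" relationship described above.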
The use of simulation in the development flow provides a number of advantages. First, the system-level architecture model initially provides a development platform for making the basic code and packet processing operational. Simulations can tell developers what they did, what they did wrong, and when they did something in the wrong order.
Take for example the previous NPU-to-NSE database search example. When a developer is first debugging NPU code, the information output by device simulation is invaluable. A device simulation message describing the NSE search command type, the database to search, and the search key can simplify command correctness validation and command data integrity. This simulation feedback is especially effective when the search keys are being assembled from multiple packet data offsets.
The simulation model can then predict and display the future search result outcome and the system time at which the result is to be ready. The result and time information are useful in multiple ways. First, the developer knows what behavior to expect and can now easily identify the code path to exercise. If the expected result is not the desired outcome, then the developer can immediately inspect and modify the simulation to obtain the desired outcome. Knowing the output beforehand also enables the developer to predict future system behavior and to set breakpoints or other debug features for successfully monitoring the code path and the system before a command's actual simulated completion.
The simulation time also allows designers to better tune their microcode. For example, the LA-1 interface requires NPUs to poll a device for completion. To reduce the QDR read bus bandwidth wasted by premature result polling, the simulation can output both the predicted completion time and the time at which the NPU code actually read the result. This output tells the developer whether the code's read delay should be lengthened, shortened, or left unchanged.
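That tuning check reduces to a simple comparison. The cycle numbers below are made-up illustrative values, not figures from any real LA-1 device:

```python
# Hypothetical poll-delay tuning check; all cycle counts are illustrative.

SEARCH_ISSUED_AT = 100     # cycle the NSE search command was issued
SIM_COMPLETION_AT = 124    # completion cycle reported by the simulator
CODE_READ_DELAY = 16       # cycles the microcode currently waits before polling

first_poll = SEARCH_ISSUED_AT + CODE_READ_DELAY
if first_poll < SIM_COMPLETION_AT:
    # Polling early burns QDR read bandwidth on polls that cannot succeed.
    advice = f"lengthen delay by {SIM_COMPLETION_AT - first_poll} cycles"
elif first_poll > SIM_COMPLETION_AT:
    # Polling late leaves a finished result sitting idle.
    advice = f"shorten delay by {first_poll - SIM_COMPLETION_AT} cycles"
else:
    advice = "delay is tuned"
print(advice)  # lengthen delay by 8 cycles
```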
Another advantage can be seen through the NPU-to-NSE example described above. If a search command has an illegal parameter or fails to provide the proper amount of search key data (either too little or too much), then each condition can cause a warning to be output. Simulation can identify accessing resources before they have been initialized. It can also identify those resources attempting to execute commands out of order, such as those attempting to issue an NSE learn command without having previously issued an NSE learn initialization command.
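The kinds of checks described above amount to per-command validation inside the device model. The following sketch is hypothetical (the class, command names, and 8-byte key length are assumptions for illustration):

```python
# Hypothetical per-command checker, the sort a device model might run:
# key-length validation and command ordering (learn before its init).

class NSEModelChecker:
    def __init__(self, key_len):
        self.key_len = key_len             # expected search key size in bytes
        self.learn_initialized = False
    def check(self, cmd, key=b""):
        warnings = []
        if cmd == "search" and len(key) != self.key_len:
            warnings.append(
                f"search key is {len(key)} bytes, expected {self.key_len}")
        if cmd == "learn" and not self.learn_initialized:
            warnings.append("learn issued before learn-initialization")
        if cmd == "learn_init":
            self.learn_initialized = True  # ordering requirement now satisfied
        return warnings

chk = NSEModelChecker(key_len=8)
print(chk.check("search", b"\x0a\x00"))    # warns: too little key data
print(chk.check("learn"))                  # warns: command out of order
```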
The multitude of statistics that a simulation can acquire and the vast variety of possible graphic display and results processing mechanisms is just the beginning. The ability to extend beyond just mimicking the device, subsystems, and systems cannot arrive too soon. Simulation within system-level architectural modeling must continue to evolve and improve.
One area of improvement surrounds the multiple peering point capability provided by simulation systems. Simulation has the capability of enabling multiple peering points normally unavailable with real silicon. These peering points become increasingly important as the integration of standalone devices into a single monolithic entity continues.
In the past, developers have been able to connect logic analyzers and other monitoring and debugging devices to the outputs and inputs. Now, these buses are no longer visible to the developer.
Simulations need to expose both the internal and external buses so that they can be monitored, examined, and graphically displayed via methods and manners appropriate to the bus type. Simulations must also enable peering into internal data structures. For example, the ability to examine the contents of a FIFO, free list, or ring fullness can be extremely advantageous to a developer at multiple times during the development process.
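A peering point into a FIFO can be as simple as a queue that tracks and exposes its own fill level. This sketch is purely illustrative; real silicon would not expose this structure at all:

```python
# Illustrative peering point: a FIFO that exposes fullness and high-water
# mark, and flags an overrun immediately instead of silently dropping data.
from collections import deque

class MonitoredFifo:
    def __init__(self, depth):
        self.depth = depth
        self.q = deque()
        self.high_water = 0                # deepest fill level observed
    def push(self, item):
        if len(self.q) >= self.depth:
            raise OverflowError("FIFO overrun")  # reported at the cycle it occurs
        self.q.append(item)
        self.high_water = max(self.high_water, len(self.q))
    def pop(self):
        return self.q.popleft()
    def fullness(self):
        return len(self.q) / self.depth

f = MonitoredFifo(depth=4)
for pkt in range(3):
    f.push(pkt)
print(f.fullness(), f.high_water)          # 0.75 3
```

Graphing `fullness()` over simulated time is exactly the kind of display that tells a developer whether a FIFO depth is sized with headroom or sitting on the edge of overrun.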
In addition to supporting multiple peering points, simulation environments must become smarter and must identify errors and bottlenecks not easily observed by the developer. Smarter simulations generally mean an increase in both positive and negative feedback as well as the ability for the developer to specify the warning or error notification levels.
Errors such as internal device FIFO overruns, resource contention, or writes into non-existent memory spaces are examples of events that must be immediately identified to the developer. Other errors, such as exceeding the system power budget, need to be monitored and evaluated by the system simulation on every cycle; if the error condition is met, it too must be reported to the developer. Execution stall conditions, such as those caused by resource depletion or by a full command FIFO halting execution until the FIFO has drained enough to accept a new command, should be monitored, graphically reported, and, as the developer chooses, flagged on each occurrence or every nth occurrence.
Another advancement challenge for the simulation model is getting to the "ready-to-run" or "running" state as quickly and easily as possible, especially when not all components are available. A good example of this today is the lack of control plane processors that typically perform most of the initialization and configuration processing.
Again, using the NSE as an example, the NSE's database is generally initialized and managed via the control plane processor. The lack of a control plane processor poses the problem of both initially configuring the database characteristics as well as adding search entries to the database.
Simulation models must facilitate the quick and easy supplying of the initial device configuration state along with the device's data contents. The models must support the ability to consistently load or reload the data so that the same simulation configuration may be run multiple times as required to generate accurate and reproducible results.
The value of being able to quickly reload state outside of simulation execution time cannot be overstated. The faster the simulation reaches the ready-to-run state, the more simulation sequences are possible.
It is also important for the simulations to be able to save the current state. These saved profiles can then be used to reload the system, provide for post-mortem analysis, and perhaps most importantly, become incorporated into an automated regression system. Enabling the system designer to configure the system quickly and to easily insert system-modeling data for a complete simulation run is a necessity.
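A minimal save-and-reload round trip might look like the following. The file name and state layout are invented for illustration; a real simulator's profile format would be far richer:

```python
# Hypothetical sketch: persist simulation state to a profile file, then
# reload it so the identical configuration can be rerun (or regressed).
import json, os, tempfile

state = {
    "nse_tables": {"ipv4": ["10.0.0.0/8", "192.168.0.0/16"]},  # invented contents
    "registers": {"cfg": 1},
}

path = os.path.join(tempfile.mkdtemp(), "profile.json")
with open(path, "w") as fh:
    json.dump(state, fh)                   # save the current simulation state

with open(path) as fh:
    reloaded = json.load(fh)               # reload for the next run

assert reloaded == state                   # identical state => reproducible run
print("profile reloaded identically")
```

Because the reload is byte-for-byte identical, the same profile can seed every run in an automated regression system, which is the reproducibility requirement described above.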
Most of today's leading NPU vendors provide simulation models of their devices and the tools to support and facilitate the simulation environment. In the hands of experienced system architects and designers, these tools intelligently optimize the design process, and ultimately, the product solutions themselves. And in the race to market, better, faster, and more reliable can mean the difference between success and failure.
About the Author
Ben Chang is a product manager for the IDT network search engines. He earned a bachelor's degree in electrical engineering from the University of California, Davis and received his master's degree in business administration from Santa Clara University. Ben can be reached at email@example.com.