Architectural planning allows fast evaluation

By Don Roberts

When planning a complex electronic system, designers must look at both the viability of the underlying algorithms and the system architecture that will implement those algorithms. Architectural planning lets designers quickly construct high-level models, which can be simulated for performance analysis and system refinement.

Lockheed Martin recently applied such an approach to a 100-processor embedded computing system for image processing. The main benefit of the high-level architectural analysis was to allow us to evaluate many design alternatives in a short period of time, and assure our customer of success. We were able to simulate the processing of dozens of image frames in about 20 minutes, in contrast to the weeks of simulation that would be required for a detailed HDL system model. This approach is equally applicable to designing a complex system-on-chip (SoC) containing a single processor.

Lockheed Martin's image processing system is part of a large project for a U.S. military customer. The application requires real-time processing at high throughput, with frame rates significantly higher than commercial TV's 30 frames per second, as well as extremely low latencies. To meet those requirements, the Lockheed Martin design team had to devise a way to parallelize the processing and data-distribution tasks.

Although this sort of application is what many people in the industry would call naively parallelizable, it was not completely clear at the beginning of the project that the concept would work at all. Through component benchmarking and architectural modeling, we proved that commercial, off-the-shelf hardware could meet the system requirements.
The system's 100 PowerPC processors deliver compute power well into the tens of gigaflops. The design includes more than a dozen Fibre Channels, each of which peaks at close to 100 Mbytes/second, so the system is highly I/O-intensive. In addition to the high-speed external I/O, we took advantage of the switched-packet-type efficiencies of Mercury Raceway processor interconnects.

To meet the low-latency requirement, we pipelined the data input and pixel processing. As soon as a little of the image is in, we start processing. By the time the image is completely in, the system has almost completed the image processing. This approach requires sophisticated data distribution and, in fact, about half of the processors in the system manage data I/O.

Thus, at any given time, the system will be dealing with 100 messages that all have to arrive at the right time to maintain real-time, very-low-latency processing with lots of data interchange among the processors. Clearly, we had to face the potential for communication conflicts, delays and problems. For that reason, we knew that neither simple simulations nor manual calculations could give us enough confidence to say that this design was going to meet the requirements.

Architectural decisions

In addition to our own need for architectural analysis, we had a directive from the customer to validate the system's performance before building any significant hardware. This directive was aimed at reducing the project's risk and cost. Our architectural analysis thus had to answer questions about the system's throughput, latency and device use, as well as overall feasibility. We also had to choose specific hardware components that provided the necessary performance.

The Lockheed Martin design team used two methods to make those architectural decisions. First, we asked processor vendors to run benchmarks for us.
To our surprise, the benchmarks showed that the PowerPC delivered the highest performance for our system, which is essentially a DSP application. The PowerPC's advantage probably lies in its high clock speed, which enables it to perform adds and multiplies faster than many DSP chips.

To answer other hardware questions, we knew we needed some type of architectural analysis tool. We evaluated several possibilities, including two in-house tools from Lockheed Martin. One of the latter would have been fine for the early design phases but did not have the capability to get us through the detailed design. The other in-house tool is oriented toward profiling existing hardware to automatically generate code, but our latencies were too low to accommodate the typical overhead of that code. We also evaluated a tool that handles generic system analysis, including electrical/mechanical components, data, interfaces, and so on. This tool did not map specifically to the requirements of parallel processing systems. In the end, we chose eArchitect from Innoveda Inc. (Marlboro, Mass.).

Building a systems model

A simulation using eArchitect works at the transaction level. Thus, system components pass messages representing the transactions back and forth rather than dealing with actual application data. Aside from counting the number of data packets or checking a few other data-related parameters, verification is not data-dependent. From eArchitect's point of view, simulation consists of a certain number of bytes coming in, a certain amount of processing time and a certain number of bytes going out.

Each eArchitect component model has specifications for parameters such as throughput, latency and the time required to perform specific tasks. Processor models might contain information about the number of clock cycles needed to execute an algorithm, for example. In fact, we used the information from the processor benchmarks as the basis for many of the model specifications.
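The transaction-level idea described above (bytes in, processing time, bytes out) can be illustrated with a small Python sketch. This is not eArchitect's model format or API, which the article does not show. The class names `Link` and `Processor`, the function `transaction_latency`, and every numeric value except the roughly 100 Mbytes/second channel rate are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical channel described only by timing parameters, not by data."""
    bytes_per_second: float       # sustained throughput
    per_packet_overhead_s: float  # fixed cost paid for each new packet

    def transfer_time(self, num_bytes: int) -> float:
        # Time to move one packet of num_bytes across the link.
        return self.per_packet_overhead_s + num_bytes / self.bytes_per_second


@dataclass
class Processor:
    """Hypothetical processor model: clock rate plus a benchmark-derived cost."""
    clock_hz: float
    cycles_per_byte: float  # assumed figure; a real one would come from benchmarks

    def compute_time(self, num_bytes: int) -> float:
        # Processing time depends on the byte count, never on the byte values.
        return num_bytes * self.cycles_per_byte / self.clock_hz


def transaction_latency(link_in: Link, cpu: Processor, link_out: Link,
                        bytes_in: int, bytes_out: int) -> float:
    """Bytes come in, some processing time elapses, bytes go out."""
    return (link_in.transfer_time(bytes_in)
            + cpu.compute_time(bytes_in)
            + link_out.transfer_time(bytes_out))


# Illustrative values: 100 Mbytes/s echoes the channel rate quoted in the
# article; the overhead, clock rate and cycles per byte are assumptions.
channel = Link(bytes_per_second=100e6, per_packet_overhead_s=10e-6)
cpu = Processor(clock_hz=400e6, cycles_per_byte=4.0)
latency = transaction_latency(channel, cpu, channel,
                              bytes_in=1_000_000, bytes_out=500_000)
print(f"estimated latency: {latency * 1e3:.2f} ms")
```

Because the model carries only counts and times, it runs far faster than a data-accurate simulation, which is the trade-off the article describes. Note that this sketch ignores contention between messages, which a real architectural simulator has to account for.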
For further refinements, you can create an instruction profile listing the number of integer adds, floating-point multiplies, etc. You can also use a flowchart-like format to profile software operations such as "look for a message directed to this process with a certain message name" or conditionals such as branches, loops and counts. You may know that a group of operations takes 800 processor cycles, but whether that "routine" executes once or eight times depends on what "data" comes in. As operations execute, the models send out messages that indicate the results. Working with this process proved to be an excellent hands-on way for our junior engineers to learn about parallel processing.

Representing the entire image-processing system in eArchitect was a matter of choosing components from the built-in model library, creating models for components that were not in the library and connecting up the models with a graphical editor. We were surprised to find that most of the components we wanted were in the library, including the PowerPC and the Mercury Raceway. The library did not contain a Fibre Channel model, so we used a generic communication model and specified parameters such as the channel's throughput and the amount of overhead for each new data packet.

We built up our system representation hierarchically. We took the primitives such as processors, memories and interfaces that were built into the library and created our own libraries that corresponded with vendor hardware products. Specifically, we created models for a Mercury motherboard and for a daughter card containing two PowerPCs, some memory and a Raceway interface. Then we plugged four of the daughter cards into the motherboard and copied that motherboard assembly a dozen times, interconnecting the motherboards via the Mercury backplane. Our final system representation is a mixed-level model in which two of the motherboards are modeled in full detail.
We simulate those boards in detail, get the timing, and put that data into the other board models that work at higher levels of abstraction. Thus, we have detailed board models that tell us everything we need to know about the communications on the motherboard, and we have higher-level models that tell us everything we need to know about communications with the backplane. Using our hierarchical system representation, we can run through the processing sequence for a few dozen image frames in 15 to 20 minutes.

Bear in mind that this system model is strictly for performance evaluations. It answers questions about latencies, message conflicts and utilization, but will not reveal whether the algorithms are correct or software is coded correctly.

Activities displayed

A look at eArchitect's analysis screens indicates the range of answers that the tool provides. The most useful display for us is the activity time line. It tells us what is executing when, shows the delays from various communications events, provides an overall time line and shows whether we are getting good use out of the processors. A histogram also shows utilization and latency for selected hardware components.

A tool that turned out to be surprisingly useful is eArchitect's hot-spot analysis tool. It shows the design at a user-chosen level in a range of colors that indicate how fully each block is used. The hot-spot display was useful in tracking down instances in which messages were not getting through or where a processor was getting hung. When those events occurred in a simulation, we could see the utilization of the relevant blocks suddenly fall to nothing. And in general, when you see that most of your system is lightly used, but one block is red most of the time, you instantly gain an important insight into the system's performance. With capabilities such as the hot-spot analysis and the time line, we were able to refine our system architecture.
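As a rough illustration of what a hot-spot view computes, the Python sketch below derives per-block utilization from activity intervals and labels each block. It is not eArchitect's algorithm or output format, which the article does not describe. The function names, the block names in `timeline`, and the 85 percent and 5 percent thresholds are all assumptions made for this example.

```python
def utilization(busy_intervals, window_start, window_end):
    """Fraction of the window during which a block was busy.

    busy_intervals is a list of (start, end) pairs, assumed non-overlapping.
    """
    busy = 0.0
    for start, end in busy_intervals:
        # Clip each interval to the observation window before summing.
        lo = max(start, window_start)
        hi = min(end, window_end)
        if hi > lo:
            busy += hi - lo
    return busy / (window_end - window_start)


def classify(util, hot=0.85, idle=0.05):
    """Label a block; both thresholds are arbitrary illustrative choices."""
    if util >= hot:
        return "hot"    # a candidate bottleneck
    if util <= idle:
        return "idle"   # possibly starved of messages, or hung
    return "normal"


def hot_spot_report(timeline, window_start, window_end):
    """Map each block name to its label over the given time window."""
    return {
        name: classify(utilization(intervals, window_start, window_end))
        for name, intervals in timeline.items()
    }


# Hypothetical activity time line: block name -> busy intervals, in ms.
timeline = {
    "cpu0": [(0.0, 2.0), (3.0, 5.0)],
    "raceway": [(0.0, 9.5)],
    "cpu1": [],
}
report = hot_spot_report(timeline, 0.0, 10.0)
print(report)
```

A block whose utilization drops to zero partway through a run corresponds to the symptom described above, where a message fails to arrive or a processor hangs. A block that stays near full utilization while the rest sit lightly loaded points to a bottleneck.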
While our original architecture had 150 processors, simulation showed that 100 processors would suffice. Even with fewer processors, we know from the simulations that we still have plenty of headroom for safety margins and "requirement creep." Reducing the number of processors cuts the cost of the system hardware. The simulations further decreased costs by eliminating one or two prototyping turns. But the most significant cost savings came from the ability to delay hardware decisions until they were firm. When Lockheed Martin did invest in hardware, we had confidence that it was the right hardware.

Room to grow

Still, there's room for improvement. Since message routing was a major challenge in our system, it would be useful to have a method for tracking messages when they do not behave as expected. The ideal approach would be to have eArchitect animate message activity on the hardware display, along with links in the time line display for tracking messages. We would also like to see eArchitect provide more robust cache modeling (always a modeling challenge), because this issue can prove crucial to overall system performance.

As for our own design process, we believe we could improve it with closer integration of hardware simulation and software development. Additionally, using some target hardware earlier in our detailed design phase would give us useful hands-on experience. Still, we are quite happy with the results of our existing flow.

The eArchitect simulations also gave our customer a great deal of confidence in the system design. With the ability to give such a compelling view of the system's performance, we passed handily through design reviews. Even when new requirements came in, we were able to show that the system could absorb them without increasing the number of processors. As a result, this project is on time and on budget.

Don Roberts is EE Specialist at Lockheed Martin, Missiles and Space Division (Sunnyvale, Calif.).