In contemporary engineering, there is only one question: How to design a ridiculously complex product in a ridiculously short amount of time and guarantee success the first time. But when you stop laughing, you understand it can be done - but not in any traditional way.
We at Ario Data Networks faced that question in designing and developing a scalable storage service router, used by major storage centers and storage-appliance manufacturers. One solution we found was architectural and functional modeling of our system on three different levels, using tools from Summit Design.
By modeling at an abstract level, we found we could get a handle on the properties of the key scaling algorithm and explore what would or would not work. We could hone the solution and comply with tight system constraints, in terms of physical size, heat capacity, and the like.
We learned this the hard way in other projects -- in which we built models from the ground up, without tools, and constructed our own analytical methodologies. In one instance, going that route took nine engineers over two years. In an age of complexity, conducting fast "what-if" exercises is a must, and using straight C or spreadsheets or bolting incompatible tools together is a formula for failure.
By stark contrast, the right tools allowed our designers to rapidly prototype, generate complex traffic scenarios, build traffic generators and develop analysis tools. In that way, not only was development time slashed, but risk as well. For example, by using building-block increments, we were able to validate previous steps before moving on to the next ones.
A complex switch
One of the key elements of the Ario system is a storage area network switch with a twist: The storage services are embedded within the fabric itself. Embedding reduces complexity and assures scalability. As a bonus, this intelligent storage-centric fabric switch (which targets mid-range applications) runs at hardware speeds because there are no software processes involved.
Figure 1 - Ario storage-centric fabric switch
In reality, the switch is composed of two chips -- one for processing, the other for switching (a crossbar). The latter is fairly simple. The processor, on the other hand, is a complex system-on-chip (SoC), containing at least ten million equivalent gates, that must handle all of the in-bound and out-bound line traffic.
A number of those SoCs, say, 24 of them, can be hooked together through the crossbar, forming a self-contained island. Combining a number of islands results in a redundant, scalable topology; scalability is essential because processing is distributed over the fabric. The islands can be stacked together or arranged in a backplane configuration, in which each island is a card that plugs into the backplane.
Figure 2 - Backplane configuration
When connecting the islands, some fundamental questions arise. What scaling algorithm is needed to schedule traffic coming into all the ports at line speed? How should the schedule through the fabric be arranged to achieve system scalability? There are also the questions of how to handle or manage the voluminous amounts of data on disks sitting on the SAN network, how to handle the server traffic, how to manage the storage pool correctly, account for redundancy and performance, watch over device capacity, and so on.
Those questions and others can be answered only by deploying modeling of a very different color. Essentially, whatever design methodology we ended up with had to accommodate three tasks: high-level abstract modeling in C language, lower-level C-language-based modeling at the ASIC level, and combined C-language and RTL-based modeling. In addition, the simulator (which had to be mixed-mode) had to run in an extremely optimized mode so simulation runs would take many seconds or many hours, but not days or weeks.
We had to keep in mind that the model we wanted for the SoC had to be fine-grained and account for all internal blocks, all processor arrays, and all microcode. Mixed-mode simulation was essential because we were working in various C flavors, as well as at the RT level; cycle-level accuracy was required, as well as tracking of all line-speed and queuing activities. On top of that, the model had to account for the dynamics of the memory subsystem.
We needed answers: How fast will we run this chip? How many processors are enough? Five? Eleven? What are the internal bottlenecks? What should be the FIFO depths? The average depth of a queue? How do traffic bursts affect the queues? Models could answer all of those questions.
In the end, we turned to Summit Design's Visual Elite and System Architect tools. With them, the modeling team could rapidly prototype and validate algorithms early in the project, experiment with algorithms, and be assured that the eventual product would meet both engineering and market requirements. Just as important, the team could focus on the end product and the issues at hand, not the tools or simulation.
The tools have already saved us at least 24 man-months, and we were able to design two giant models with only two people in less than a year.
In practice, we modeled a slew of system attributes, including the ability of the system to scale, the ability to maintain consistent latency under any load, the flow-control processes, and the scheduler. The results of the latter modeling assured us that the bandwidth allocation was maintained within the system. We also verified the high-level system timing and rates, and the fabric utilization. It is instructive to review those activities, one by one.
The modeling process
The visualization techniques supplied by the tool were adequate for building and connecting blocks to ports, and ports to channels. The blocks contain the actual logic of a certain segment of the design, and as the blocks, ports, and channels intertwine, they become components. As our design grew, these components, in turn, connected to other components via additional ports and channels. Our final ASIC grew out of connecting a number of smaller components and, in turn, we looked at the ASIC as one large component.
In this way, by connecting many larger components and using visual techniques within the tool, we were able to scale the design to the size needed to evaluate the overall concepts.
More specifically, we could accurately program the throughput of each sub-block within each component. We could easily calculate latencies by accounting for the size of the frame moving through a block and combining that with the throughput of the block. Accumulating the latencies of all blocks of all smaller components yielded the overall latency of the larger components and, in turn, the overall system latency (See Figure 3).
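As a rough illustration of that latency arithmetic, the Python sketch below computes each block's latency as frame size divided by block throughput and sums the results up through components to the system level. The block names, component grouping, and throughput figures are hypothetical; they are not values from the Ario model or from System Architect.

```python
# Minimal sketch of latency accumulation. All names and numbers are
# illustrative assumptions, not figures from the actual design.

def block_latency_us(frame_bytes, throughput_bytes_per_us):
    """Latency of one block: frame size divided by block throughput."""
    return frame_bytes / throughput_bytes_per_us


def component_latency_us(frame_bytes, block_throughputs):
    """Latency of a component: sum of the latencies of its blocks."""
    return sum(
        block_latency_us(frame_bytes, t) for t in block_throughputs.values()
    )


# Hypothetical components, each a set of blocks with a throughput
# expressed in bytes per microsecond.
ingress = {"parser": 250.0, "classifier": 500.0}
fabric = {"crossbar": 1000.0}
egress = {"queue_manager": 500.0, "framer": 250.0}

frame_bytes = 2000
system_latency_us = sum(
    component_latency_us(frame_bytes, component)
    for component in (ingress, fabric, egress)
)
print(f"system latency for a {frame_bytes}-byte frame: {system_latency_us} us")
```

This simple sum ignores queuing delay, so it gives only the unloaded latency of the path; the loaded behavior is what the traffic mixes and statistics were used to explore.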
Figure 3 - Modeling system latency
The tool came with a number of different distributions (uniform, Poisson, Gaussian, exponential, to name a few). At a high level, we combined a number of these distributions in different ways to create traffic mixes that were transmitted into the model. At the lower modeling level, we were able to design an actual host and disk that interact with each other through the protocol (CMD, XFER_RDY, STATUS, and so on), which allowed us to generate traffic for the model more accurately.
By combining the accuracy of the throughput within the components with different traffic mixes, we were able to gather statistical data -- such as min, max, average, standard deviation -- that was built into System Architect via the log-state function. Or we could add additional code into the model to accumulate specific data, such as utilization and queue depths. Such data allowed us to make decisions about fine-tuning the scheduling and crossbar algorithms.
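To make the kind of statistics mentioned above concrete, here is a minimal Python sketch of an accumulator that tracks min, max, average, and standard deviation one sample at a time. It is not the log-state function of System Architect, whose interface is not described here; the class name and the sample values are assumptions for illustration only.

```python
import math


class RunningStats:
    """Accumulates min, max, mean and standard deviation incrementally
    (Welford's method), so samples need not be stored."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean
        self.min = math.inf
        self.max = -math.inf

    def add(self, x):
        self.n += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def stddev(self):
        """Population standard deviation of the samples seen so far."""
        return math.sqrt(self._m2 / self.n) if self.n else 0.0


# Hypothetical queue-depth samples taken at regular intervals.
queue_depth = RunningStats()
for depth in [2, 4, 4, 4, 5, 5, 7, 9]:
    queue_depth.add(depth)

print(queue_depth.min, queue_depth.max, queue_depth.mean, queue_depth.stddev)
```

An accumulator of this kind could be attached to any quantity of interest, such as a queue depth or a per-port byte count, without keeping the full sample history in memory.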
Verifying high-level system timing and rates
By following the same rules on maintaining latency under any load, the modeling team was able to monitor high-level timing and rates as they modified algorithms to fine-tune the system. The team was able to monitor the scheduler and crossbar closely, along with memory bottlenecks, queue depths, and flow control; they could explore how various combinations of algorithms affected established rates of traffic flow. Special attention was paid to possible high-level timing idiosyncrasies as the various combinations were tried.
The modeling team also monitored the utilization of each of the ports within the crossbar by sampling the number of bytes through the port at any given time and then comparing that sample to the maximum number of bytes that could possibly go through that port. Over a period of time, the statistical data gathered yielded a picture of the utilization of each of the ports under different loads.
For instance, such a picture might show that, 50 percent of the time, a port was 70 percent utilized; 15 percent of the time, it was 80 percent utilized; 20 percent of the time, it was 90 percent utilized; 15 percent of the time, it was 100 percent utilized; and so on. This became a very good indicator of the influence of all of the combined algorithms from all of the different components, and it ensured the crossbar could handle the amount of traffic transmitted into it. Similar sampling also occurred at other components' output ports.
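The utilization picture described above can be sketched in a few lines of Python. The function below compares each byte-count sample with the maximum number of bytes the port could carry in one sampling interval and reports the share of time spent in each utilization bucket. The function name, the bucket width, and the sample values are hypothetical, chosen so that the result reproduces the example percentages given in the text.

```python
from collections import Counter


def utilization_histogram(samples_bytes, max_bytes_per_sample, bucket_pct=10):
    """Return {utilization bucket (%): share of samples (%)} for one port."""
    counts = Counter()
    for sample in samples_bytes:
        pct = 100.0 * sample / max_bytes_per_sample
        # Round down to the start of the bucket; cap at 100 percent.
        bucket = min(100, int(pct // bucket_pct) * bucket_pct)
        counts[bucket] += 1
    total = len(samples_bytes)
    return {bucket: 100.0 * n / total for bucket, n in sorted(counts.items())}


# Hypothetical samples: bytes seen on one crossbar port per interval,
# where the port can carry at most 1000 bytes per interval.
samples = [700] * 10 + [800] * 3 + [900] * 4 + [1000] * 3
histogram = utilization_histogram(samples, max_bytes_per_sample=1000)

for bucket, share in histogram.items():
    print(f"{share:.0f} percent of the time, the port was {bucket} percent utilized")
```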
Next, the modeling team turned to designing an effective flow-control mechanism, one that allowed traffic bursts to be absorbed within the system. As flow control was invoked, measurements of queue depths were taken that allowed the team to calculate the worst-case scenario for the design, that is, the point of catastrophic failure. By finding the limitations, correct "back pressure" could be designed into the system and applied to the source of the traffic, thus minimizing the effects of high traffic bursts.
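A toy version of that experiment, written in Python, is shown below. It is a sketch under simplifying assumptions (one queue, fixed service rate per tick, a single high-water mark) and does not represent the actual flow-control design. The function name, the arrival pattern, and the threshold are hypothetical.

```python
def simulate_queue(arrivals, service_per_tick, high_water=None):
    """Run a single queue over a list of per-tick arrivals.

    Without a high-water mark, every offered frame is queued.
    With one, back pressure limits admission so the queue never exceeds
    the mark, and the excess waits at the traffic source.
    Returns (maximum queue depth seen, frames still waiting at the source).
    """
    depth = 0
    max_depth = 0
    source_backlog = 0
    for new_frames in arrivals:
        offered = source_backlog + new_frames
        if high_water is None:
            admitted = offered
        else:
            admitted = min(offered, max(0, high_water - depth))
        source_backlog = offered - admitted
        depth += admitted
        max_depth = max(max_depth, depth)
        depth = max(0, depth - service_per_tick)
    return max_depth, source_backlog


# Hypothetical burst: steady trickle, then two ticks of heavy traffic.
burst = [2, 2, 10, 10, 2, 0, 0, 0]

worst_case_depth, _ = simulate_queue(burst, service_per_tick=4)
bounded_depth, left_at_source = simulate_queue(
    burst, service_per_tick=4, high_water=8
)
print("worst-case depth without back pressure:", worst_case_depth)
print("depth with back pressure:", bounded_depth, "backlog:", left_at_source)
```

The first run measures how deep the queue gets when nothing restrains the source; the second shows the same burst held to a bounded depth, with the excess pushed back to the source and drained later.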
Then, the team turned its attention to the scheduler algorithms, looking for a design that would assure the necessary bandwidth under different traffic load types. Although the specific IP of the scheduler is proprietary, we can say the tool allowed us to evaluate the communication necessary between the ingress and egress segments. We were also able to sample data at the scheduler tables to be sure that cells were scheduled out in the correct slot times.
We demanded a number of capabilities from the testbench within System Architect, including the ability to configure the architectural model, generate frames internally or read frames from an external source, and control the architectural model using the state/statistics maintained by the model. Also, the testbench had to output and verify the state/statistics and graphically display results so that we could draw conclusions about the model's performance. We used Visual Elite and System Architect to fulfill the testbench requirements as follows:
First, we generated an input file that we could read upon initialization so we could configure the model to act in a certain manner once stimulated with frame traffic. Among configuration inputs we found necessary were a debug mask, that is, flags allowing additional output for debug; the clock period, used in the NOW clock for scheduling slot time; throughput numbers for differing components; min/max variables such as frame size; and the generation time, or time to transmit frames.
We also needed the sink time, the time to allow purging of frames through the system; the queue type for buffering components (simple, priority, and so on); the distribution seed, used to instantiate distributions; and trace -- attributes that allow or disallow specific logging of built-in tool functions. Those are only a few of the configuration parameters we used. Note that any parameter requiring programming can be read in from a parameter file.
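For readers who want a feel for such a parameter file, the Python sketch below reads a small set of the parameters listed above from "key = value" lines. The file syntax, the parameter names, the units, and the default values are all assumptions made for illustration; the actual format used with System Architect is not shown in this article.

```python
import io

# Hypothetical defaults; names and units are illustrative only.
DEFAULTS = {
    "debug_mask": 0,            # flags enabling additional debug output
    "clock_period_ns": 10.0,    # clock period used for scheduling slot time
    "min_frame_bytes": 64,
    "max_frame_bytes": 2048,
    "generation_time_us": 1000.0,  # time during which frames are transmitted
    "sink_time_us": 200.0,         # time allowed to purge frames afterward
    "queue_type": "simple",        # for example: simple or priority
    "distribution_seed": 1,
}


def load_params(stream, defaults=DEFAULTS):
    """Parse 'key = value' lines; '#' starts a comment. Unknown keys fail."""
    params = dict(defaults)
    for raw in stream:
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        key, _, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if key not in params:
            raise KeyError(f"unknown parameter: {key}")
        default = defaults[key]
        if isinstance(default, int):
            params[key] = int(value, 0)  # base 0 accepts 0x.. hex masks
        elif isinstance(default, float):
            params[key] = float(value)
        else:
            params[key] = value
    return params


example_file = io.StringIO(
    "debug_mask = 0x5        # extra logging flags\n"
    "clock_period_ns = 5.0\n"
    "queue_type = priority\n"
    "distribution_seed = 42\n"
)
params = load_params(example_file)
print(params)
```

Rejecting unknown keys is a deliberate choice in this sketch: a misspelled parameter name fails at initialization rather than silently leaving a default in place for a long simulation run.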
To generate frames internally, we combined, at a high level, a number of statistical distributions (uniform, Poisson, Gaussian, exponential, to name a few) in different ways to create traffic mixes that we could transmit into the model. Included were distributions for destination, read/write, priority, frame size and inter-packet time.
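A minimal Python sketch of such a traffic mix follows, using only the standard random module. It draws the destination from a uniform distribution, the frame size from a clipped Gaussian, and the inter-packet time from an exponential distribution, which yields Poisson arrivals. Every parameter value here (port count, read fraction, mean gap, size limits, priority levels) is a hypothetical placeholder, and the function is not the generator built into the tool.

```python
import random


def generate_frames(n, seed, ports=24, read_fraction=0.7,
                    mean_gap_us=2.0, min_size=64, max_size=2048):
    """Generate n frame descriptors from a mix of distributions."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    now_us = 0.0
    frames = []
    for _ in range(n):
        # Exponential inter-packet time (Poisson arrival process).
        now_us += rng.expovariate(1.0 / mean_gap_us)
        # Gaussian frame size, clipped to the configured min/max.
        size = int(rng.gauss(1024, 400))
        size = max(min_size, min(max_size, size))
        frames.append({
            "time_us": now_us,
            "dest": rng.randrange(ports),       # uniform destination
            "op": "read" if rng.random() < read_fraction else "write",
            "priority": rng.choice([0, 1, 2, 3]),
            "size": size,
        })
    return frames


frames = generate_frames(1000, seed=42)
reads = sum(1 for f in frames if f["op"] == "read")
print(f"{len(frames)} frames, {reads} reads, last at {frames[-1]['time_us']:.1f} us")
```

Seeding the generator mirrors the role of the distribution seed among the configuration parameters: the same seed reproduces the same traffic, which makes it possible to compare two algorithm variants under identical load.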
At the lower modeling level, we were able to design an actual host and disk that interact with each other through the protocol (CMD, XFER_RDY, STATUS, and so on), which allowed us to generate traffic for the model more accurately. Additional features allow the model to read frame header data from a real traffic snoop and transmit the recreated frame into the model.
The state/statistical data maintained by the model can be used to place bounds on components in order to restrict algorithms from giving unexpected results. The data also can be used to fine-tune a set of algorithms from different components in order to balance the system.
By combining the accuracy of the throughput within the components with different traffic mixes, we were able to gather statistical data in two ways. Some statistics (such as min, max, average, standard deviation) were built into System Architect via the log-state function; for others, we could add code to the model to accumulate specific data, such as utilization and queue depths. This data allowed us to make decisions about fine-tuning the scheduling and crossbar algorithms.
For performance tuning, the output log-state file can be read directly into the plot command, providing us with graphical data displays. Alternatively, we could import the data into a spreadsheet program for further analysis.
The System Architect features of Visual Elite allowed us to lay out the high-level algorithms, consisting of the traffic, scheduler and crossbar models. By utilizing the testbench features of System Architect and combining traffic model scenarios with different scheduler and crossbar algorithms, the modeling team was able to fine-tune the algorithms and gather statistical data for overall system evaluation. All of that work contributed to the development of the final scalable architecture.
The next step was to take the existing high-level model and use the traffic, scheduler, and crossbar models as the foundation for the lower-level model. Additional information gathered through this process helped refine the requirements and design a number of proprietary IP components.
By using the co-simulation interface of Visual Elite with the subsequent low-level model, we were able to replace existing C language blocks with RTL blocks. So doing gave us an additional layer of verification, separating design intent from actual results. We also were able to validate the actual RTL used to build the hardware.
Consequently, we validated and verified the higher level modeling before moving on to the next level of modeling. The low-level modeling, combined with the co-simulation of RTL blocks, validated the design specification, as compared with actual hardware.
Although the Summit Design tools provided the high- and low-level modeling and simulation capabilities we needed, we haven't reached Nirvana yet. Our goal is to produce an ASIC with many more ports, perhaps starting with FPGAs and then melding those into one ASIC. Backplane-based systems are on the horizon, too.
We've gone through an iterative process. But it would be more than nice to come up with a methodology in which a single detailed behavioral model contains the entire specification, all source code, the entire ASIC flow, and serves as the basis for a one-shot synthesis of silicon. That requirement would thrust the tools to the next level, admittedly a far higher plane.
Paresh Borkar is senior ASIC manager at Ario Data Networks.