Constructing the perfect chip
Ron Wilson
8/15/2005 10:00 AM EDT
This week, on the Stanford University campus, a loosely knit organization of industry and university silicon gurus will conduct the 17th annual conference on the leading edge in performance silicon, aka Hot Chips. The giants will be there: Intel and IBM, Toshiba and Cisco, and for the first time, an IC paper from Microsoft. There will also be startup and university research projects. The underlying theme, as always, will be how far the industry can push the semiconductors at its disposal.
Over the years, Hot Chips has been a running index on the leading-edge ideas in the industry. RISC first battled CISC here. Superscalar processors stepped into the glare at Hot Chips, as did the first powerful system-level ICs. Multithreading, multicore chips, and many ideas that never reached the big time all appeared, in their turns, at Hot Chips.
So what does the conference have to say this year?
Arguably, we are witnessing another inflection point in chip architecture. Transistor budgets keep going up, but the substantial increases in performance we are accustomed to seeing with each new process generation are waning. And the trend toward continually increasing energy efficiency appears to have reversed. Now both efficiency and performance come as a reward for architectural and implementation genius, not as a gift from process engineers.
That, in turn, is leading to a new flowering of chip-level architecture. This year's papers do not describe evolutionary improvements to existing chip structures; rather, they hint at a broad search for architectures that can effectively interpolate between the growing demands of applications, especially in consumer media, and the increasingly fragile capabilities of advanced processes.
The clear victim of this search has been the canonical CPU-centric style of SoC design. Systems-on-chip in the past have, for the most part, reflected the architectures of the board-level systems they were replacing: a central CPU to handle both control and data functions; a bus, or hierarchy of buses, closely based on the CPU core's external interface pins; and caches, DMA and interrupt and peripheral controllers to suit the application.
"For some designers, architecture has been an almost simple exercise," observed Steve Roddy, vice president of marketing at Tensilica Corp. "They write out the entire application in C or SystemC and get to the point where it is functionally correct. If the whole thing runs on one CPU and meets its deadlines, they're through; if not, they partition off some of the tasks and repeat the exercise."
This partitioning process tends to be based entirely on functional lines. If an inner loop takes too much time on a realistic CPU core, try creating a custom instruction or two; most CPU IP is design-time-configurable these days. If that doesn't work, create a hardware accelerator: a custom data path that runs as a loosely coupled slave of the CPU, pipelines the loop and eliminates the instruction fetches and decodes. Model, evaluate and repeat.
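The model-evaluate-repeat loop can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the function name `partition`, the per-task cycle counts, and the simplification that an offloaded task drops off the CPU's load entirely. A real flow would re-run a system model after each change rather than subtract numbers.

```python
from typing import Dict, List, Tuple


def partition(cpu_cycles: Dict[str, int], budget: int) -> Tuple[List[str], int]:
    """Offload the hottest task, re-evaluate, and repeat until the CPU fits.

    cpu_cycles maps a (hypothetical) task name to its cycle cost per frame.
    budget is the number of cycles the CPU has available per frame.
    Returns the tasks moved off the CPU and the CPU load that remains.
    """
    remaining = dict(cpu_cycles)
    offloaded: List[str] = []
    # Evaluate: does everything still on the CPU meet its deadline?
    while remaining and sum(remaining.values()) > budget:
        # Partition: pull out the single most expensive task.
        hottest = max(remaining, key=remaining.get)
        offloaded.append(hottest)
        del remaining[hottest]
    return offloaded, sum(remaining.values())


moved, load = partition({"fft": 600, "ui": 100, "codec": 400}, budget=500)
```

Whether the offloaded task becomes a custom instruction, an accelerator or a second CPU instance is the design decision the sketch leaves out.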
In some cases, the code hot spot may be too poorly localized to fit easily in dedicated hardware, or it may be too subject to change during the product life, Roddy observed. In those cases, the hot spot may be lifted entirely out of the main code stream and dropped into its own instance of the CPU, or its own separate thread, if the main code path has enough dead time to make this realistic.
"As a practical matter, starting with a clean sheet of paper happens infrequently," said Bob Pleva, director of semiconductor product marketing at Sigma Designs Inc. "You have to leverage things that already exist and are known to work."
"And you find in design teams an almost religious conviction about what approach to try first when confronted with a new problem," added Sigma vice president of strategic marketing Ken Lowe. "Even the choice to pull out a clean sheet of paper comes from the architectural review process."
In any case, architectures that start out as CPU-centric SoCs tend to remain CPU-centric as they evolve. In particular, early assumptions about bus structures and memory architectures become embedded in a team's thinking.
That's not always so bad. Some applications are inherently compute-bound, so their legacy arrangement of buses and memory instances is just fine, if supported with sufficient computing power. But in a world of highly internetworked rich media, computing isn't the only issue.
"Media processors are huge data-moving engines," observed Lowe. And that's not unique to architects struggling with high-definition video. "Network processing is all about moving data around," said David Sonnier, division chief technical officer at Agere Systems Inc.
There is an architectural approach suited to applications that are primarily data pumps. Long ago, when ALUs came in 4-bit slices, signal-processing engineers evolved data flow architectures that directed the incoming data into a hardware pipeline that could process it at the incoming rate. Such architectures, of course, can become quite elaborate, with forks, loops and memory pools in the pipeline. But their distinguishing characteristic is that they are organized around the flow of data through the system, rather than the flow of data and instructions into a CPU.
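As a rough software analogy, a data flow organization can be sketched as a chain of stages that each consume and produce a stream. The stage names `scale` and `clip` and the sample values are invented for illustration; in hardware, each stage would be a dedicated block running at the incoming data rate.

```python
from typing import Callable, Iterable, Iterator, List

Stage = Callable[[Iterator[int]], Iterator[int]]


def scale(samples: Iterator[int], gain: int = 2) -> Iterator[int]:
    """Hypothetical stage: multiply every sample by a fixed gain."""
    for s in samples:
        yield s * gain


def clip(samples: Iterator[int], limit: int = 100) -> Iterator[int]:
    """Hypothetical stage: saturate samples to the range [-limit, limit]."""
    for s in samples:
        yield max(-limit, min(limit, s))


def run_pipeline(source: Iterable[int], stages: List[Stage]) -> List[int]:
    """Thread the data through each stage in order.

    The structure follows the data, not an instruction stream: each sample
    passes through stage 1, then stage 2, and so on.
    """
    stream: Iterator[int] = iter(source)
    for stage in stages:
        stream = stage(stream)
    return list(stream)


out = run_pipeline([10, 60, -80], [scale, clip])
```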
Data flow a no-go
On the surface, data flow concepts would appear perfect for today's applications. Networking, media processing and many other pressing needs today are characterized by rigid demands on I/O bandwidth and quality-of-service but are not as sensitive to transfer latencies.
So why aren't most of the papers at Hot Chips about data flow architectures?
The simple answer is flexibility. Agere's access network processor, which will not be presented at Hot Chips, provides a case in point. "Traffic management and scheduling are key tasks," Sonnier said. "On one hand, at incoming data rates of 1 to 2 Gbits/second, moving each packet of data with a programmable processor becomes a bottleneck; you can't keep up. You could try an array of processors, but there is enough sequential dependency between packets that this doesn't tend to work well. On the other hand, every customer has their own ideas about how to implement policies, so you can't entirely hard-wire the data-moving process, or you will alienate your customers."
This dilemma is even becoming an issue in supposedly standard environments. "There is a growing accretion of nonstandard stuff even in DSL," Sonnier said. "The providers have simply added into the standard the nonstandard things they had implemented before DSL was codified. So now every provider's DSL is a little different. It's a real problem for pure hardware, even for prepackaged software."
Agere's approach was finesse: Hardwire the primitives that actually move the data around to implement traffic management, and then put the hardwired data pump under the control of software running on a CPU. The solution provides programmers with an API that gives the flexibility they need for implementing policies their way, but it hides the actual shuffling of data and permits it to happen at full internal bus bandwidth.
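The split between fixed data-moving primitives and programmable policy might look something like the following sketch. It is not Agere's API: the class names `DataPump` and `TrafficManager`, the queue model, and the strict-priority drain are all assumptions chosen to show the division of labor.

```python
from collections import deque
from typing import Callable, Deque, List, Optional

Packet = dict  # hypothetical packet representation


class DataPump:
    """Stands in for the hardwired primitives: fixed enqueue/dequeue moves."""

    def __init__(self, num_queues: int) -> None:
        self.num_queues = num_queues
        self._queues: List[Deque[Packet]] = [deque() for _ in range(num_queues)]

    def enqueue(self, queue_id: int, packet: Packet) -> None:
        self._queues[queue_id].append(packet)

    def dequeue(self, queue_id: int) -> Optional[Packet]:
        queue = self._queues[queue_id]
        return queue.popleft() if queue else None


class TrafficManager:
    """Software policy layer: decides where packets go, never copies them."""

    def __init__(self, pump: DataPump, classify: Callable[[Packet], int]) -> None:
        self.pump = pump
        self.classify = classify  # customer-supplied policy

    def receive(self, packet: Packet) -> None:
        self.pump.enqueue(self.classify(packet), packet)

    def drain(self) -> List[Packet]:
        """Strict priority (an assumed policy): empty queue 0 first, then 1..."""
        sent: List[Packet] = []
        for queue_id in range(self.pump.num_queues):
            while (packet := self.pump.dequeue(queue_id)) is not None:
                sent.append(packet)
        return sent


tm = TrafficManager(DataPump(2), classify=lambda p: 0 if p["voice"] else 1)
for pkt in ({"id": "a", "voice": False}, {"id": "b", "voice": True}):
    tm.receive(pkt)
order = [p["id"] for p in tm.drain()]
```

Swapping in a different `classify` function changes the policy without touching the code that moves the data, which is the property the hardwired pump is meant to preserve.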
For architects trying to serve the emerging media-processing markets, the problem is even greater. Algorithms and feature sets are still very much in flux for such applications as H.264 codecs, any kind of multimedia to mobile devices, and even digital audio. Murphy's Law appears to have created a variety of audio codecs that have little in the way of common inner loops, short of primitives at the level of the multiply-accumulate function.
"A flexible audio engine is a known fit for a single fast CPU," Tensilica's Roddy observed. "But when you get into high-definition video, you are way beyond what a CPU-centric architecture can handle. It raises all the big questions."
So architects approaching these markets have to take a more risk-fraught approach. They begin with an estimate of the aggregate computing power needed. Then they must organize that computing power in a way that exploits whatever pipelining opportunities the algorithms might present and whatever data parallelism the content may offer.

Telairity Semiconductor Inc.'s Hot Chips paper illustrates the point. Telairity started out not with a set of algorithms, or even a specific market, but with a signal-processing engine. Rich in single-instruction, multiple-data (SIMD) execution units and local memory structures, Telairity's DSP core was intended from the outset for applications that exhibited a high degree of data parallelism: applications that could be described well in terms of vector algebra.
From that point of view, video encoding looked like a target of opportunity, and the many uncertainties encouraged a programmable solution. "The first question was the encoding standard," said Telairity president and CEO Howard Sachs. "We initially were working with VC-1, but given the bandwidth limits providers are facing, we moved to focus on H.264."
That raised lots of other questions. H.264 encoding is replete with operating modes in motion estimation, encoding and other areas. In each area, picking the best mode will improve the compression ratio. But an exhaustive search of all the modes is beyond any compact hardware solution. So is implementing all of them.
Telairity's solution was to cluster five of its autonomous vector/scalar processors around a single DRAM controller and video controller. Large multiport memory blocks in each processor serve the function of shared buses, giving nearly anything-to-anything connectivity through the vector SRAMs. A fast hardware mailbox RAM provides for message synchronization.
With so much vector processing power and connectivity, an effective approach to H.264 encoding can be mapped onto the chip. And with silicon in hand now, Telairity is increasingly confident that it will work. "We believe we can do high-definition H.264 encoding with four chips, and with eight chips we can offer the user considerable computing headroom during the encode process," said vice president of marketing and sales Shubha Tuljapurkar.
The approach is neither CPU-centric nor a pure data flow architecture. But it may be the future.
That description also suits perhaps the most discussed new architecture of the year: the IBM Cell, to which Hot Chips will devote an entire section.
The Cell architecture is a heterogeneous cluster of processors based on a single theme, said Michael Gschwind, IBM master inventor and architect of the Cell synergistic processing unit (SPU). The chip comprises an IBM Power CPU core, in all its superscalar splendor, and a cluster of eight SPUs.
"The SPU applies the basic concepts of the Power architecture, such as the instruction set and especially the SIMD operations," Gschwind said. "But it represents a resurgence of RISC fundamentals and pervasive parallelism not found in CPUs." The SPU might be thought of as a stripped-down Power core (much simpler and smaller, with vastly simplified control logic) but with a SIMD vector unit on steroids. The overall result is a unit that's more specialized for vector operations but that allows eight instances to fit on one die. As in the Telairity chip, each SPU has its own local memory.
The control/data divide
The concept was to create a powerful architecture onto which data flows could be mapped without creating bottlenecks at internal buses or at the memory interface. That required introducing a couple of vital concepts.
The first was a separation of control from data flows. "High-performance computing has used this concept for a long time," Gschwind observed. So, in fact, have high-end networking chips. But extracting control flow from the data paths is an essential step in making the traffic pattern uniform enough to map onto limited, shared-interconnect resources.
The next step was to separate data movement from data processing. IBM accomplished this with a memory flow controller, a highly intelligent, block-oriented DMA processor under control of software on the Power core. The controller not only massages the memory traffic necessary to get maximum bandwidth out of the external Rambus DRAMs; it also allocates the huge but finite on-chip bus bandwidth and schedules block data transfers, making sure that blocks of data are moved in and out of the SPU local memories so that the SPUs are not caught waiting for data.
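One common way to keep a processing unit from waiting on data is double buffering, in which the next block is fetched while the current one is processed. The sketch below shows only the ordering of transfers and compute; it is not IBM's controller or its programming interface. The function name `process_blocks` and the two-buffer scheme are assumptions, and the "transfer" is an ordinary copy that runs sequentially here, whereas in hardware it would overlap with compute.

```python
from typing import Callable, List, Optional, Sequence


def process_blocks(
    blocks: Sequence[Sequence[int]],
    compute: Callable[[List[int]], int],
) -> List[int]:
    """Process blocks through two local buffers, prefetching one block ahead."""
    results: List[int] = []
    if not blocks:
        return results

    local: List[Optional[List[int]]] = [None, None]
    local[0] = list(blocks[0])  # prime the pipeline with the first transfer

    for i in range(len(blocks)):
        current, nxt = i % 2, (i + 1) % 2
        if i + 1 < len(blocks):
            # Start the transfer of block i+1 into the idle buffer.
            # A DMA engine would do this concurrently with the compute below.
            local[nxt] = list(blocks[i + 1])
        buffer = local[current]
        assert buffer is not None  # the prefetch guarantees data is present
        results.append(compute(buffer))
    return results


sums = process_blocks([[1, 2], [3, 4], [5]], compute=sum)
```

When the time to move a block is no longer than the time to process one, this ordering lets the compute unit run without stalls after the first transfer.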
Partitioning the data into large blocks that exploit the inherent parallelism is fundamental to the Cell architecture. "The internal bus structure does not impose a processing model on the algorithms," Gschwind said. "Bulk data transfer is fundamental; modern media processing has to think in this way, seeing the data as structures and frames, not as streams. The system must orchestrate data transfers to optimize the use of bus, local RAM and SPU processing. Exception traffic must be kept separate, on the Power core. I believe you will see this way of thinking spread throughout the applications space."
It is a long journey: first, from CPU-and-bus architectures to data flow machines, then to algorithms and data structures that separate the control and data flows and organize the data, and ultimately to an underlying bus-oriented architecture for presenting a virtual data flow machine. SoCs aren't board-level computers on a chip anymore.



