Consumers may not know much about the
technology that they buy, but they know about clocks. The difference between a Pentium II and a Pentium III is probably only understood by a small number of processor architects around the world, and the most enthusiastic nerds, but the difference between a 300 MHz Pentium II and a 500 MHz Pentium III is obvious: the latter is faster! It may not be exactly 66 percent faster, but it's definitely faster. How do we know this? We know that the 300 MHz and 500 MHz refer to some sort of "clock" signal that defines
how fast things happen inside the processor chip and since this is where the real action is in the system, more must be better.
Digital designers may have a rather more refined view of the role of the clock in their work. To many designers, it's the clock that makes design possible. Digital designs are constructed by placing clocked registers down and putting combinatorial logic between them, then reworking the logic (by hand or, more often, by synthesis) until the static timing analyzer says it fits
within the target clock cycle time. Then you're done, and the chip can go to the fab.
Simple is as simple does
It's certainly true that designing with a clock keeps life simple and has enabled great strides to be made in the complexity of designs. Great strides have been necessary to keep up with the pace of IC technology development. However, clocks aren't essential to digital circuits. Since the earliest days of computers, we've known that using a clock is only one way to design; there are
other ways to get the job done that have advantages and disadvantages as well. Some early computers used clocks, some didn't. Most designers in the industry today, however, have forgotten that alternatives to the clock exist.
Why should we worry about the use of the clock? After all, clocked design has taken us this far and we understand it pretty well. Why throw all of that experience out the window and learn a new way to do design? Most of our design training attempted to teach us how to minimize
complex asynchronous clock behavior, so why should we reconsider now?
The answer: while clocked circuits have progressed a long way, they are beginning to show signs of strain. Some of the cracks in the strategy include the following.
- Noise - A clock synchronizes activity across the chip and locks it to a very precise frequency. From the perspective of electrical noise, this is absolutely the worst possible start. All the current spikes add up and the AC component is maximized. All the energy
is concentrated in narrow spectral bands at harmonics of the clock frequency.
Figure 1 - Riding the rails
The AND circuit propagates a dual-rail data value (but doesn't propagate a spacer correctly). The function of the circuit is Y=A AND B, where Y, A, and B are encoded in a dual-rail notation. If both A and B have valid data, so will Y. The problem with spacer propagation is that Y will go to 00 as soon as either input becomes 00, whereas it should wait for both A and B to go to 00.
- Power - The clock net is the biggest net on the chip, and low clock skew requires big drive transistors. On some chips, the clock net alone is responsible for over 30 percent of the total power consumption. But that's not the end of it.
The clock connects everywhere and causes all sorts of unnecessary activity and unnecessary power dissipation. The problem is getting worse, too. Although new process technologies decrease the power-per-function, the functionality-per-chip is increasing faster and the power-per-chip keeps rising.
- Design reuse - The way to get a bit more performance in the next design is to up the clock. This requires every part of the chip to be redesigned, even those parts that needn't go any faster. Moving the design
to a new process technology requires all the timing verification to be rerun, and so on.
Throw out the clock
The obvious way to avoid these problems is to throw out the clock. Self-timed systems offer the following immediate advantages.
- Noise - Activity in different parts of the chip is uncorrelated, so there is some cancellation in the AC components of current transients. Activity isn't locked to a highly tuned oscillator, so energy isn't concentrated into narrow spectral
peaks. In practice, the advantage is a reduction of at least 10 dB in the peak noise power levels for a large chip and the advantages increase as chips get more complex.
- Power - Self-timed designs are data-driven and only use power when there is useful work to do.
- Design reuse - A self-timed component can be reused without reference to any global clock, and the only timing verification necessary helps to ensure that minimum performance criteria are met. Increasing the performance of a
self-timed system only requires the redesign of the critical function; the rest of the design needn't be touched.
The upshot is that self-timed logic has a lot to offer the designer of the system-on-a-chip (SOC) of the future. Unfortunately, there are drawbacks to self-timed design, as well. CAD tools offer limited support, and most designers don't know how to think about self-timed design and consider it "alien." Neither problem is insuperable, but both will take some time to address.
Paradigm's progress
As with any technology paradigm shift, change will start slowly and offer high-risk/high-gain benefits to early adopters. This is where we stand at this moment. Academic and industrial research groups have shown what is possible; a few established companies have dipped a toe into the self-timed water, while a couple of start-up companies have jumped headfirst into the deep end of the pool.
We present here only the briefest of introductions to self-timed design along with a couple of
examples of industrial-scale applications. There are some pointers for further reading at the end of the article, including some web site addresses where a great deal more information is available.
So, how do we design complex digital systems without clocks? The clock signal in a conventional digital circuit fulfills at least three roles: to define when data is transferred from one place to another; to define when data processing should commence; and to define when the processed results should be
available.
These functions can be closely related. In general, it makes sense to start processing data as soon as it has been transferred to the inputs of a function and to transfer results out as soon as they are available. In fact, a clocked circuit performs poorly on the last of these actions, as the clock frequency must be slow enough to ensure that the slowest unit can complete within a clock period under worst-case conditions. Most of the time, most units have their outputs ready well before the outputs
are transferred, and performance may be sacrificed as a result.
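The cost of clocking to the worst case can be put into rough numbers. The sketch below is only an illustration: the delays and the 10 ns worst-case figure are hypothetical, the function names are invented for this example, and the self-timed total ignores handshake overhead, which a real design would have to pay.

```python
# Hypothetical figures, chosen only for illustration.
WORST_CASE_NS = 10                  # slowest unit under worst-case conditions
ACTUAL_DELAYS_NS = [4, 5, 4, 9, 5]  # what five operations really needed


def clocked_total_ns(n_ops, worst_case_ns):
    """Every operation is charged a full clock period."""
    return n_ops * worst_case_ns


def self_timed_total_ns(delays_ns):
    """Each result moves on as soon as it is ready (handshake cost ignored)."""
    return sum(delays_ns)


print(clocked_total_ns(len(ACTUAL_DELAYS_NS), WORST_CASE_NS))  # 50
print(self_timed_total_ns(ACTUAL_DELAYS_NS))                   # 27
```

With these made-up numbers the clocked unit spends 50 ns on work that needed only 27 ns; the gap is the time spent waiting for the clock edge.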
A self-timed design still needs to know when data is available for processing, a situation indicated by a signal that accompanies the data, or even by way of information embedded in the data itself. To show how this can be done, let's look at dual-rail encoding as used in many self-timed design styles, including null convention logic (NCL) from Theseus Logic, Inc. (Orlando, FL).
Working on the railroad
In dual-rail logic each
signal value is represented by two wires, each carrying a logic level: 00 indicates that the data hasn't arrived yet; 01 indicates that a logic zero has arrived; 10 indicates that a logic one has arrived. The fourth possible value on the two wires, 11, isn't used and never occurs in a correct design.
If a function unit is awaiting the arrival of a 32-bit input value encoded in this way, it will have 32 wire-pair inputs. It knows the value has arrived when one wire in each pair has gone high. The wires
could be of arbitrary length and have gone through any number of buffers. However, none of this matters. Provided that the signals are clean, the receiver will just wait until they are all ready, however long it takes.
So, communicating with dual-rail codes is straightforward and very robust. In fact, the communication is delay-insensitive in the sense that its functionality is unaffected by any wire or buffer delays that are inserted. In the context of SOC design, this means that place-and-route tools
can't produce a circuit that doesn't work at the logic level.
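A behavioral sketch may make the encoding and the receiver's "wait until every pair is ready" rule concrete. It is a software model, not a circuit: the function names are invented for this example, and each wire pair is written as a tuple in the same order as the codes above (so 10 is a logic one).

```python
SPACER = (0, 0)  # 00: data has not arrived yet
ZERO = (0, 1)    # 01: a logic zero has arrived
ONE = (1, 0)     # 10: a logic one has arrived


def encode_word(value, width):
    """Encode an integer as dual-rail wire pairs, least significant bit first."""
    return [ONE if (value >> bit) & 1 else ZERO for bit in range(width)]


def is_complete(pairs):
    """True once exactly one wire in every pair has gone high."""
    return all(pair in (ZERO, ONE) for pair in pairs)


def decode_word(pairs):
    """Recover the integer; refuse to decode a word that is still arriving."""
    if not is_complete(pairs):
        raise ValueError("word has not fully arrived")
    return sum(1 << bit for bit, pair in enumerate(pairs) if pair == ONE)


word = encode_word(0xCAFE, 32)
late = [SPACER] + word[1:]   # bit 0 is still in flight
print(is_complete(late))     # False
print(is_complete(word))     # True
print(hex(decode_word(word)))  # 0xcafe
```

The receiver never needs to know how long any wire took; it only checks that every pair has left the 00 state.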
What about processing dual-rail signals? It's entirely possible to build any logic function in dual-rail logic. A general way to produce a 2-input logic function is to decode all four possible input values and then to merge the results to give the appropriate outputs (see Figure 1). Although this circuit will compute a dual-rail AND function in a delay-insensitive way, it will only do it once. To ensure that several consecutive values can be
sent through a logic circuit without any danger of symbol interference, they must be separated by spacers of some sort. The null in NCL is a spacer. The 00 value can be used as a spacer and the logic circuit must propagate these in a delay-insensitive way, just as it does the zero and one logic values.
A circuit that does this correctly is shown in the second circuit (see Figure 2). The only change from the first circuit is that the familiar AND gates have been replaced by perhaps less familiar C gates.
The functionality of the C-gate (often called the Muller C gate, taking the name from its originator 40 years ago) derives from the fact that the output rises when both of the inputs are 1 (just like the AND gate) and falls when both of its inputs are 0. When the inputs differ, it retains the previous state. The falling output condition enables the circuit in Figure 2 to be delay-insensitive with respect to spacers and data values.
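The difference between the two circuits can be sketched in software. This is a behavioral model under assumptions, since the figures are not reproduced here: it takes the structure to be four two-input gates, one per input combination, merged by an OR onto the "zero" rail, and the class and function names are invented for illustration. Pairs are written in the same order as the codes above (10 is a logic one).

```python
SPACER, ZERO, ONE = (0, 0), (0, 1), (1, 0)


class CGate:
    """Muller C-gate: rises when both inputs are 1, falls when both are 0."""

    def __init__(self):
        self.out = 0

    def update(self, a, b):
        if a == 1 and b == 1:
            self.out = 1
        elif a == 0 and b == 0:
            self.out = 0
        return self.out  # inputs differ: hold the previous state


def plain_and(a, b):
    """Dual-rail AND built from ordinary AND gates (the Figure 1 style)."""
    (a1, a0), (b1, b0) = a, b
    return (a1 & b1, (a1 & b0) | (a0 & b1) | (a0 & b0))


class CGateAnd:
    """Dual-rail AND with each AND gate replaced by a C-gate (Figure 2 style)."""

    def __init__(self):
        self.g11, self.g10, self.g01, self.g00 = (CGate() for _ in range(4))

    def update(self, a, b):
        (a1, a0), (b1, b0) = a, b
        one_rail = self.g11.update(a1, b1)        # A is one and B is one
        zero_rail = (self.g10.update(a1, b0)      # A is one, B is zero
                     | self.g01.update(a0, b1)    # A is zero, B is one
                     | self.g00.update(a0, b0))   # A is zero, B is zero
        return (one_rail, zero_rail)


gate = CGateAnd()
print(gate.update(ONE, ONE))        # (1, 0): Y is one
print(plain_and(SPACER, ONE))       # (0, 0): spacer appears too early
print(gate.update(SPACER, ONE))     # (1, 0): Y holds until B also resets
print(gate.update(SPACER, SPACER))  # (0, 0): now the spacer propagates
```

The only stateful element is the C-gate itself; its hold behavior is what keeps Y valid until both inputs have returned to 00.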
Alternative lifestyles
Clearly, these example circuits are large compared with standard Boolean logic, in both logic and wiring cost. In addition to dual-rail logic, there are other ways to build delay-insensitive self-timed circuits. For example, one-hot codes extend the dual-rail 1-of-2 encoding to a 1-of-M encoding. This may be useful for decoder-style logic functions.
Figure 2 - In a proper manner
This AND circuit propagates both the data value and the 00 spacer in a delay-insensitive manner. The C-gate waits until both inputs are low before propagating a low, so Y only becomes 00 after both A and B are 00.
Alternatively, N-of-M codes generalize the above systems to allow a fixed number of transitions in a group of wires. For example, a 3-of-6 code can carry 20 symbols (equivalent to four bits of data and some control symbols) on six wires. This reduces the wiring cost at the expense of increased encoding complexity and may be useful for inter-chip communication. Other delay-insensitive codes are possible too; the only requirement is that the receiver must be able to unambiguously recognize when a data value has arrived.
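The 3-of-6 figure is easy to check by enumeration. The sketch below lists every six-wire pattern with exactly three wires high and counts how many whole data bits fit; the function names are invented for this example.

```python
from itertools import combinations
from math import comb


def n_of_m_codewords(n, m):
    """List every m-wire pattern with exactly n wires high."""
    return [
        tuple(1 if wire in high else 0 for wire in range(m))
        for high in combinations(range(m), n)
    ]


def whole_data_bits(symbol_count):
    """Largest number of binary data bits that fit in symbol_count symbols."""
    return symbol_count.bit_length() - 1


codes = n_of_m_codewords(3, 6)
bits = whole_data_bits(len(codes))
print(len(codes), comb(6, 3))        # 20 20
print(bits, len(codes) - 2 ** bits)  # 4 4
```

Four data bits use 16 of the 20 symbols, which leaves four for control. Dual-rail is the 1-of-2 case of the same construction: two symbols, one bit, two wires.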
Finally, transition encoding sends a data value as a change of logic level on a wire, and it's a non-return-to-zero
code. It can be applied to any N-of-M wire-level encoding. Spacers are unnecessary, but the processing logic is significantly more complex.
All of these delay-insensitive codes incur costs in terms of additional wiring and logical complexity compared with standard Boolean logic. This complexity may represent a small price to pay for a design that is guaranteed to function on any process technology using any place-and-route tool. However, when the overhead becomes too great (in building large memories, for
example), it may prove advantageous to weaken the requirement for delay insensitivity.
Bundled-data self-timed logic uses conventional binary encoding, but accompanies each n-bit data value with a separate data validity signal, often called a request wire. Requests may be level or transition encoded.
The data can be processed by conventional Boolean logic, and the output request must be delayed from the input request by an amount no less than the logic delay. This delay matching is clearly not
insensitive to delays, but in most cases the single-sided timing constraint it imposes can be satisfied locally, and at the next level of hierarchy, the assembly of bundled-data components can be constructed in a delay-insensitive manner. Bundled-data circuits are generally smaller than delay-insensitive circuits but they are more difficult to reuse, as the matched delay timing assumptions must be revalidated for each use. Cogency Technology, Inc. (Toronto, Canada) has developed tools that help produce
bundled-data designs.
However data validity is communicated, there remains a need for a flow-control mechanism to indicate when the next spacer or data value can be issued. In order to regulate the flow, all these systems need an acknowledge signal passing in the opposite direction to the flow of data. With this signal in place, it's easy to build processing pipelines (see Figure 3). The pipeline will proceed at a rate determined by input and output data rates and by the speed of the internal processing
stages, all of which may vary from one cycle to the next.
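The request/acknowledge exchange can be written out as a sequence of events. The sketch below models one common scheme, a level-encoded return-to-zero handshake on a bundled-data channel; the function name and trace format are invented for illustration, and real hardware would order these events with gate delays rather than program statements.

```python
def four_phase_transfer(items):
    """Move items across a bundled-data channel with a return-to-zero handshake.

    Returns the values the receiver latched and the ordered signal events.
    """
    received = []
    trace = []
    for item in items:
        bus = item                 # sender drives the data wires first
        trace.append(("req", 1))   # Req rises only after the data is stable
        received.append(bus)       # receiver latches the bundled data
        trace.append(("ack", 1))   # Ack rises: the data has been taken
        trace.append(("req", 0))   # sender returns Req to zero
        trace.append(("ack", 0))   # Ack falls: the next value may be sent
    return received, trace


received, trace = four_phase_transfer([7, 3])
print(received)   # [7, 3]
print(trace[:4])  # [('req', 1), ('ack', 1), ('req', 0), ('ack', 0)]
```

Nothing in the loop refers to time: each event simply waits for the one before it, which is why the transfer rate can vary from one cycle to the next.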
As with clocked technology, there is more to system design than a simple pipeline. The principles outlined here, however, may be extended to support complex system functions as demonstrated by the large-scale industrial and academic self-timed designs produced over the last decade.
Large-scale logistics
There have been several large-scale demonstrators of self-timed design produced by university groups and industry over the last decade.
Here we will look at just two of these, selected because they were designed for commercial production.
Figure 3 - Already in the pipeline
Here a "bundled-data" self-timing scheme is used, where conventional data-processing logic is used along with a separate request (Req) line to indicate data validity. Requests must be delayed by at least the logic delay to ensure that they still indicate data validity at the receiving register. An acknowledge (Ack) signal provides flow control, so the receiving register can tell the transmitting register when to begin sending the next data.
The Philips pager chip was developed to replace previous clocked pager devices. The motivation for using a self-timed design was noise. The previous
clocked designs, based around an 80C51 microcontroller, could not receive radio packets while the microcontroller was running due to the radio noise produced by the digital circuits. The data reception protocol was therefore implemented in hardware and a different chip was required for each protocol used. (Different geographical regions operate different pager protocols.)
The self-timed chip was based around a self-timed 80C51, which generates only slight interference and can be left running during
packet reception. The reception protocol can therefore be implemented in software and the same chip can serve all the world markets. The pager chip used a bundled-data design style with return-to-zero request and acknowledge signaling.
The Philips chip was developed using Tangram, a synthesis tool for self-timed logic that has been developed at Philips Research Laboratories since 1985. It's the most complete tool-set for self-timed circuits available today, but at present is only available if you work for
Philips.
Hybrid and happy
The Draco (DECT Radio Communications Controller) chip demonstrates the potential of hybrid design: part clocked, part self-timed. Again, the motivation for the use of self-timed design was electrical noise; it was believed that a self-timed processing subsystem would offer better performance on the DECT radio channels. Again, a bundled-data design style was adopted, with return-to-zero request and acknowledge signaling.
The self-timed processing subsystem
comprises a 100 MIPS ARM-compatible Amulet processor core, 8 Kbytes of local RAM, a self-timed on-chip bus, a DMA controller, 16 Kbytes of ROM, and an external memory interface with production test support. The clocked telecommunications peripheral subsystem comprises an extensive set of interfaces, including DECT, ISDN, IrDA, I2C, parallel I/O, and a further 8 Kbytes of buffer RAM, a fully integrated DECT and ISDN controller system with an additional DECT frame buffer, an 8 Kbyte RAM, and analog interfaces to
a radio module and to ISDN line transformers. There are standard serial interfaces, a dedicated high-speed bus that supports the addition of special-purpose off-chip telecommunications processors, a PCM (pulse-code modulation) serial interface bus controller, a 4-channel full-duplex ADPCM (adaptive differential pulse-code modulation) transcoder DSP, and a telecommunications codec. The system includes a central clock module that generates all of the clocks required by the system and incorporates two
phase-locked loops and a power management system (see Figure 4).
It could be argued that the clocked subsystem would annul most of the advantages of the self-timed subsystem, but this isn't the case. The self-timed system includes all of the highest speed circuitry and nearly all of the heavily loaded output drivers (the external memory interface). The clocked system includes only lower-speed circuits and limited output drivers. Hence, the system components that are critical to the noise performance are all
within the self-timed domain, and most of the benefits will accrue.
Figure 4 - The Draco die
The top half of the chip contains the synchronous telecommunications peripherals. The bottom half contains the self-timed processing subsystem including the Amulet3 processor core, its local memory, a DMA controller, and a 32-bit self-timed on-chip macrocell bus.
A stitch in time
Clocks have served the electronics design industry very well for a long time, but there are significant difficulties looming for clocked design in the future. These difficulties are most obvious in complex SOC development, where electrical noise, power, and design costs threaten to render the potential of future process technologies inaccessible to
clocked design.
Self-timed design offers an alternative paradigm that addresses these problem areas, but until now VLSI designers have largely ignored it. Things are beginning to change, however, and self-timed design is poised to emerge as a viable alternative to clocked design. The drawbacks, which are the lack of design tools and designers capable of handling self-timed design, are beginning to be addressed, and a few companies (including a couple of start-ups, Theseus Logic, Inc., and Cogency
Technology, Inc.) have made significant commitments to the technology.
Although full-scale commercial demonstrations of the value of self-timed design are still few in number, the examples available demonstrate that there are no "show stoppers" to threaten the ultimate viability of this strategy. Self-timed technology is poised to make an impact, and there are significant rewards on offer to those brave enough to take the lead in its exploitation.
Further information
The asynchronous logic
home pages, with tutorial material and links to many of the world's major self-timed design and tools groups, are at: www.cs.man.ac.uk/async/. The Amulet group web pages contain details on the Amulet series of processors, and many technical papers are downloadable from: www.cs.man.ac.uk/amulet/. For an overview of the state-of-the-art in self-timed design, the February 1999 issue of Proceedings of the IEEE contains papers by many of the leading research groups in the area.
Steve Furber is ICL professor of computer engineering in the department of computer science at the University of Manchester, England. While at Acorn Computers Limited during the 1980s, he was a principal designer of the ARM 32-bit RISC microprocessor. At Manchester he heads up the Amulet group, who are researching self-timed ARM cores. He is a Fellow of the Royal Academy of Engineering and the British Computer Society.
Copyright © 2000 Integrated System Design Magazine