As we move into the deep-submicron CMOS and system-on-a-chip (SOC) era, the distribution of a high-speed clock becomes increasingly difficult. Large high-performance chips with several internal clock domains are already in the marketplace. The demand for such devices will undoubtedly continue, requiring smaller and smaller clock domains where typical wire delays exceed gate delays. Systems-on-a-chip exacerbate the situation as we integrate a collection of IP blocks, each of which may require a different clock rate and supply voltage to meet the necessary performance per watt.
Self-timed logic offers an alternative to this clocked nightmare. Such designs appear asynchronous to the outside observer, but are typically locally synchronous. For instance, a self-timed pipeline takes clock domains to an extreme, with each stage in that pipeline sitting in its own synchronous zone. Unlike clock domains, however, self-timed circuits don't need a global clock.
Referring to self-timed circuits as "asynchronous" is, however, less than ideal, in part because there is local synchrony, even if there isn't a global clock, and in part because many electrical engineers treat asynchronous circuits with suspicion. To dispel these fears, consider some of the many successes of self-timed circuits. A number of research projects have successfully produced self-timed processors (for example, Amulet in the UK, Titac in Japan, and a MIPS design at Caltech). Furthermore, Philips offers a self-timed 80C51 design with low-power and low-EMC properties that has been incorporated into two products: a multi-standard pager and a contactless smartcard.
To further dispel fears, we will illustrate here how self-timed circuits can be prototyped. We will be using programmable logic devices (PLDs) as a test vehicle, partly for their convenience, but also because current PLDs present a hostile environment for the self-timed circuit designer. If it works on a PLD, then there shouldn't be a problem moving to a complementary metal-oxide semiconductor (CMOS) implementation.
Data events and plug-and-play
When designing clocked circuits, it's often useful to think about the system moving between states with military precision. Data moves in a regimented fashion to the beat of the clock, like soldiers marching on parade. In a self-timed system, data movement is more fluid when viewed as a whole. At a localized level, data transfer is cooperative between producer and consumer, thereby forming a communications channel.
Producing new data can be seen as an event. At a circuit level, we need to be able to indicate this with one of two approaches: either have a separate control wire to signal new data, or encode this signal in the data. The latter can, for example, be achieved using dual-rail encoded data, where two wires are used for every logical bit: 00 indicates no data, 01 indicates logical 0 and 10 indicates logical 1. A data event is the transition from no data (00) to some data (01 or 10). To complete the sequence, data needs to be de-asserted (back to 00). Simply OR-ing these two wires together reveals the data event as a positive edge.
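The dual-rail scheme can be sketched in a few lines of Python. This is only a behavioural illustration of the code words described above; the function names (`encode_bit`, `decode_bit`, `data_valid`) are our own, and treating the 11 pattern as an error is an assumption rather than something the encoding itself mandates.

```python
# Behavioural sketch of dual-rail encoding (names are illustrative only).
# Each logical bit uses two wires: 00 = no data, 01 = logical 0, 10 = logical 1.
NO_DATA = (0, 0)

def encode_bit(bit):
    """Return the two-wire code word for a logical bit."""
    return (1, 0) if bit else (0, 1)

def decode_bit(pair):
    """Return 0 or 1 for valid data, or None while the wires show no data."""
    if pair == NO_DATA:
        return None
    if pair == (0, 1):
        return 0
    if pair == (1, 0):
        return 1
    # Assumption: 11 is unused in this scheme, so we flag it as an error.
    raise ValueError("11 is not a legal dual-rail code word")

def data_valid(pair):
    """OR of the two wires: rises on a data event, falls on return to 00."""
    return pair[0] | pair[1]

# One complete sequence per bit: no data -> data -> no data.
sequence = [NO_DATA, encode_bit(1), NO_DATA, encode_bit(0), NO_DATA]
valid_trace = [data_valid(p) for p in sequence]
```

Here `valid_trace` shows one positive edge per data event, which is the signal a consumer would use to latch the data.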
Dual-rail-encoded data produces somewhat large circuits, so simple binary encoded data is often used together with one event wire to indicate when the data has changed. If the signaling event is independent of the binary encoded data, then a bounded time constraint needs to be placed on data propagation so that the signaling event can be timed (via a delay element) to appear after new data has stabilized. This is often referred to as a bundling constraint and, therefore, this method of data propagation is called bundled data.
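The bundling constraint amounts to a simple inequality: the signaling event must arrive no earlier than the slowest data bit. The sketch below checks that inequality and sizes the delay element. The function names, the nanosecond units and the optional safety margin are assumptions made for illustration, not part of any particular tool flow.

```python
# Sketch of a bundling-constraint check (names and units are hypothetical).

def bundling_constraint_met(data_delays_ns, request_delay_ns, margin_ns=0.0):
    """True if the signaling event arrives after the slowest data bit settles."""
    return request_delay_ns >= max(data_delays_ns) + margin_ns

def required_delay_element(data_delays_ns, request_wire_ns, margin_ns=0.0):
    """Extra delay to insert on the event wire so the constraint holds."""
    return max(0.0, max(data_delays_ns) + margin_ns - request_wire_ns)

# Example: three data bits with different propagation delays.
delays = [1.0, 2.5, 1.8]
extra = required_delay_element(delays, request_wire_ns=1.0, margin_ns=0.5)
```

With these example figures the event wire needs an additional 2.0 ns of delay before the bundle is safe.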
We now know how to encode a data event so that it can be propagated forward. Now all we need to do is signal the producer when the consumer has latched the data. This can easily be achieved by sending an event back on one wire as an acknowledgement. In the case of dual-rail data, a forward data event (positive event) can be acknowledged by a positive edge and the removal of data (negative event) by a negative edge. This results in a 4-phase communications protocol.
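The four phases can be traced with a small behavioural model. This is a sketch only: `four_phase_trace` is a hypothetical name, the data event is abstracted to a single request level, and the model steps through the wire states in sequence rather than modelling real delays.

```python
# Behavioural sketch of a 4-phase handshake (function name is illustrative).

def four_phase_trace(items):
    """Return the (req, ack) states visited while sending items, plus the
    list of items the consumer latched."""
    req, ack = 0, 0
    trace = [(req, ack)]
    received = []
    for item in items:
        req = 1                  # phase 1: producer asserts data (positive event)
        trace.append((req, ack))
        received.append(item)    # consumer latches the data ...
        ack = 1                  # phase 2: ... and acknowledges with a positive edge
        trace.append((req, ack))
        req = 0                  # phase 3: producer removes data (negative event)
        trace.append((req, ack))
        ack = 0                  # phase 4: consumer acknowledges with a negative edge
        trace.append((req, ack))
    return trace, received

trace, received = four_phase_trace(["a"])
```

Each item costs four transitions, and both wires return to zero before the next transfer begins.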
Bundled data can also be signaled using a 4-phase protocol. However, a 2-phase protocol is sufficient: an edge (positive or negative) is used to indicate new data and is acknowledged by a single edge. In many respects the 2-phase protocol appears neater than the 4-phase one, but in practice it tends to be slower, particularly when using conventional data latches, which are sensitive to one edge or to a level.
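For comparison, a 2-phase handshake can be sketched in the same behavioural style. Again the function name `two_phase_trace` is our own, and the model only records the order of wire transitions, not their timing.

```python
# Behavioural sketch of a 2-phase handshake (function name is illustrative).

def two_phase_trace(items):
    """Return the (req, ack) states visited while sending items, plus the
    list of items the consumer latched."""
    req, ack = 0, 0
    trace = [(req, ack)]
    received = []
    for item in items:
        req ^= 1                 # any edge on req, rising or falling, means new data
        trace.append((req, ack))
        received.append(item)    # consumer latches the bundled data
        ack ^= 1                 # a single edge on ack completes the transfer
        trace.append((req, ack))
    return trace, received

trace, received = two_phase_trace(["a", "b"])
```

Two items are transferred in four transitions, so the wires carry half as many events per item as in the 4-phase case. The cost is that the receiving logic must respond to both rising and falling edges.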
A rigorously laid down self-timed communication protocol has many benefits. Modules conforming to the interface standard can plug together to build larger systems. Performance variation between modules is taken care of by the interface: the producer and consumer simply wait for each other.
Forks and joins
A single self-timed communication channel combines data with request and acknowledge signals. From a control point of view, a data event is simply a forward-going request signal. Using this abstraction helps to keep things simple.
Any practical system will have fan-out and fan-in of data and control signals. From a control point of view, fan-out is a mirror of fan-in. For example, a fanned-out request signal going to two places will receive two acknowledge signals that must be joined to fan them in (see Figure 1).
The standard component for such a join is the Muller C-element, whose output goes high only when all of its inputs are high, goes low only when all of its inputs are low, and otherwise holds its previous value. The C-element was designed by D.E. Muller in the 1950s at the University of Illinois. Despite its usefulness, textbooks rarely refer to the concept, and it's virtually never found in a standard-cell library. Fortunately, implementing a C-element as a combinational circuit with feedback is straightforward, though a transistor-level implementation is more efficient. The gate-level version can easily be implemented on a PLD. Note, however, that a prudent design check verifies that the feedback path is short.
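The combinational-circuit-with-feedback form of a two-input C-element is the majority function of the two inputs and the fed-back output. The Python sketch below models that equation and uses it to join two acknowledge signals. The function names are illustrative, and the model ignores gate and feedback delays.

```python
# Behavioural sketch of a two-input Muller C-element (names are illustrative).

def c_element(a, b, prev):
    """Next output: majority of inputs a, b and the fed-back output prev.
    The output follows the inputs when they agree and holds otherwise."""
    return (a & b) | (prev & (a | b))

def run(inputs, state=0):
    """Apply a sequence of (a, b) input pairs and record the output."""
    outputs = []
    for a, b in inputs:
        state = c_element(a, b, state)
        outputs.append(state)
    return outputs

# Joining two acknowledges that rise and fall at different times.
acks = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
joined = run(acks)
```

The joined acknowledge rises only after both inputs have risen and falls only after both have fallen, which is the behaviour a fan-in of acknowledge signals needs.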
Another form of fan-in is when a resource (for example, a multiplier) is shared between concurrent circuits. In a clocked environment concurrent circuits are still synchronized to the clock, so it's trivial to choose who has access to the resource without going metastable. In an asynchronous environment, however, two or more resource requests might arrive at any time, so an arbiter is needed to choose between them. One candidate is a D-latch based arbiter (see Figure 2a).
In practice, if the rise time of the request signals is reasonably fast and the local-clock frequency isn't too high, then the chance of metastability propagating to the grant signals is slim. A much better alternative to the D-latch arbiter is the Seitz arbiter (see Figure 2b). This transistor-level design is based on an RS flip-flop with a filter on the outputs, which prevents the grant signals from going metastable. Like the C-element, the Seitz arbiter is a very useful circuit that has been with us for many years, but it doesn't appear in most standard-cell libraries or as an element on PLDs.
An arbitrating 4-phase call module acts as a hardware-subroutine call, allowing access to a shared resource from concurrent circuits. An arbiter is required so that the call module can cope with simultaneous requests. A request event is sent from a client to the subroutine, and after the subroutine acknowledges, the acknowledge is routed back to the appropriate client.
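The routing behaviour of the call module can be illustrated with a purely behavioural Python model. This is a sketch under strong assumptions: the class name `ArbitratingCall` and its methods are hypothetical, the hardware arbiter is replaced by a first-come queue, and nothing here models metastability or the individual handshake phases.

```python
# Behavioural sketch of an arbitrating call module (all names hypothetical).
from collections import deque

class ArbitratingCall:
    """Serializes client requests to one shared resource and routes each
    acknowledge back to the client that made the request."""

    def __init__(self, subroutine):
        self.subroutine = subroutine   # the shared resource, e.g. a multiplier
        self.pending = deque()         # stands in for the hardware arbiter
        self.log = []

    def request(self, client, arg):
        """A client raises a request; it waits until the arbiter grants it."""
        self.pending.append((client, arg))

    def run(self):
        """Grant pending requests one at a time until none remain."""
        results = {}
        while self.pending:
            client, arg = self.pending.popleft()   # arbiter picks one client
            self.log.append(("grant", client))
            results[client] = self.subroutine(arg) # handshake with the resource
            self.log.append(("ack", client))       # acknowledge routed back
        return results

call = ArbitratingCall(lambda x: x * x)
call.request("A", 3)
call.request("B", 4)
results = call.run()
```

Even when both requests are outstanding together, the resource sees only one at a time, and each client receives its own acknowledge.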
A case of simple FIFO
We can make a simple FIFO from transparent D-flip-flops with C-elements in the control path (see Figure 3; this is a corruption of the Micropipelines structure described in Ivan Sutherland's Turing Award paper "Micropipelines", Communications of the ACM, 32(6), 1989).
The enable signal (E: high=transparent, low=latched) is routed to each of the bank of D-flip-flops, and the delayed version (Ed) is output for use as a control signal. This is a common trick in self-timed circuits, but it's often difficult to force a place-and-route tool to control wiring order. This simple self-timed FIFO can store at most one word per two FIFO stages. More sophisticated control can obviously improve upon this.
The FIFO control structure can be wrapped into a ring in order to analyze the dynamic behavior. An event injected into such a ring will rotate around the ring indefinitely. Delays on wires between C-elements won't affect functional correctness. We constructed a 64-element event ring made from variants of the C-elements with resets or presets.
The C-elements were arranged so that, at reset, two C-elements in opposite positions in the ring were reset, while the others were set. After reset, these two events spin around the ring opposite each other. Since the events are traveling along the same path, one might expect them to remain opposite each other indefinitely. However, this is an unstable equilibrium and, after approximately 340 cycles of the ring (for one particular implementation), one pulse had caught up with the other. This appears to be due to charge and discharge times: when one event nears another, it has a slightly faster path, since the logic and wires ahead of it will still be charging or discharging due to the earlier event flipping state.
The effect is particularly weak when the events are far apart, but becomes stronger as they draw closer. Eventually the digital-handshake logic kicks in to prevent the events from merging. This appears as a very strong force when the pulses are adjacent, but appears to have little effect once the pulses are slightly separated.
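The digital side of this behaviour can be reproduced with an idealized model of the ring. The sketch below assumes a Micropipeline-style control ring in which each C-element takes the output of its predecessor and the inverted output of its successor; whether this matches the exact ring wiring used in our experiment is an assumption. All elements are updated in lock step, so the model contains no analogue effects and the pulses keep their spacing rather than drifting together.

```python
# Idealized digital model of a C-element event ring (names are illustrative).

def c_element(a, b, prev):
    """Majority of the two inputs and the fed-back output."""
    return (a & b) | (prev & (a | b))

def make_ring(n, reset_positions):
    """All elements set (1) except those at reset_positions (0)."""
    return [0 if i in reset_positions else 1 for i in range(n)]

def step(ring):
    """Update every element once from the current state (lock-step model).
    Element i sees its predecessor and the inverse of its successor."""
    n = len(ring)
    return [c_element(ring[i - 1], 1 - ring[(i + 1) % n], ring[i])
            for i in range(n)]

def count_pulses(ring):
    """Count runs of 0s, i.e. places where a 1 is followed by a 0."""
    return sum(1 for i in range(len(ring)) if ring[i - 1] == 1 and ring[i] == 0)

ring = make_ring(8, {0, 4})        # two events on opposite sides of the ring
history = [ring]
for _ in range(100):
    history.append(step(history[-1]))
pulse_counts = {count_pulses(state) for state in history}
```

In this model the handshake alone is enough to keep the two pulses distinct: they rotate indefinitely and never merge. The same functions work for a 64-element ring, for example `make_ring(64, {0, 32})`.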
Testing arbiters to ensure that they resolve metastable conditions cleanly is problematic because the outputs can't be easily viewed without passing the signal through an output pad, which is likely to modify any metastable response. Instead, we check whether erroneous signals from the arbiters produce side effects in the receiving control circuits. Two arbitrating call modules are coupled to an event ring so that events can be inserted or extracted (see Figure 4).
Thus, sequences of inserts followed by extracts can be performed to test the circuit. Because the circuit is self-timed, the tester doesn't need to be fast. In fact, it's advantageous to leave periods between sequences of inserts and extracts so that events can spin freely around the ring, thereby ensuring that the self-timed circuit doesn't become phase-locked with the tester, which might otherwise reduce the chances of provoking metastability. Changes in environment temperature will also add randomness to the timing properties, which helps to stress the circuit.
We have performed this test procedure on a number of arbiter designs on 4000 series Xilinx chips. A Seitz arbiter without the filter (i.e. just an RS flip-flop) fails frequently due to poor metastability characteristics. The only design that didn't produce an error after 4 billion insert-and-extract sequences was the D-latch arbiter design presented earlier (see Figure 2a). This circuit does fail, however, if the rise time of the request signals is long.
Three wishes
Self-timed circuit modules offer reliable plug-and-play interfaces because those interfaces are delay-insensitive. Verification of timing at the module level is sufficient to ensure that system timing is correct, which isn't something that can be said of a large clocked system. Verification of delay-insensitive communication protocols just requires that signals are sent in the correct order. This is much easier than verifying bounded-timing properties.
Delay-insensitive interfaces mean that power consumption becomes data driven: no data, no state changes, just static power consumption. There is no clock, so no distribution net and no clock gating are required. Thus, power management comes for free. Companies like Philips are already exploiting these properties in products.
Designing self-timed circuits is still hard work. Tools are either not available (for instance, Philips' Tangram tools are currently only used in-house) or are research systems that need extension and refinement. And even at a low level, basic components like Seitz arbiters and Muller C-elements aren't provided in standard-cell libraries or on PLDs.
If I had three wishes to make self-timed circuit research easier, they would be: 1) make Muller C-elements and Seitz arbiters as common as the D-latch in standard-cell libraries; 2) encourage ECAD tool vendors to support self-timed requirements, such as control over the order in which a wire is routed to a number of nodes; 3) increase funding for self-timed design and tool development.
The author would like to thank Peter Robinson, Steve Wilcox and George Taylor (University of Cambridge) for encouraging this work. Thanks are also due to Ivan Sutherland (Sun Microsystems) for suggesting that I look at C-element rings.
Simon Moore is a professor of electrical engineering at Cambridge University (Cambridge, England). His research includes work on self-timed circuits and C-element rings.
Copyright © 2000 Integrated System Design Magazine