PALO ALTO, Calif. - In a technical paper presented at the Hot Chips conference here Monday (Aug. 19), researchers Ting Wu, Chi-Ying Tsui and Mounir Hamdi from Hong Kong University of Science and Technology (China) offered an alternative pipeline approach to crossbar design.
Their approach has yielded a 256-by-256 signal switch with a 2-GHz input bandwidth, simulated in a 0.25-micron, 5-metal process.
The growing importance of crossbar switch matrices, now used for on-chip interconnect as well as for switching fabric in routers, has led to increased study of the best ways to build these parts.
The obvious way to implement a crossbar switch fabric, according to presenter Tsui, is to simply route inputs horizontally and outputs vertically, and then to place a pass transistor at each intersection. Turning on the transistor connects an input line to an output line. The layout is intuitive, and provides easily for multicasting.
But even setting aside the unpleasant characteristics of pass transistors, there are serious disadvantages to this approach. Setting up the switch requires n-squared control bits. More important for high-bandwidth interconnect, the performance of each connection is limited by the on-resistance of the pass transistor and the capacitance of both the input and output lines, all of which must be long enough to span the entire matrix in a fully populated switch.
For this reason most high-performance implementations are done not with pass gates but with multiplexers. Each output is driven by a wide multiplexer that selects one of the input lines.
The routing is variable and less obvious, but contains many shorter segments, reducing both input capacitance and potential crosstalk hot spots. The output capacitance is comparatively quite small.
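To put the control overhead of the two styles side by side, here is a minimal sketch. The n-squared figure for the pass-gate matrix comes from the description above. The multiplexer figure is an assumption, not something stated in the paper: it supposes each output's multiplexer takes a binary-encoded select. Both function names are made up for illustration.

```python
def pass_gate_control_bits(n: int) -> int:
    """One enable bit per crosspoint in an n-by-n pass-transistor matrix."""
    return n * n


def mux_control_bits(n: int) -> int:
    """Assumption: each of the n outputs has a binary-encoded n-to-1 select."""
    select_bits = (n - 1).bit_length()  # bits needed to name one of n inputs
    return n * select_bits


if __name__ == "__main__":
    n = 256
    print(pass_gate_control_bits(n))  # 65536
    print(mux_control_bits(n))        # 2048
```

For a 256-by-256 switch, that is 65,536 crosspoint bits against 2,048 select bits under the encoding assumed here.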
But as signal rates approach the capabilities of the process, problems exist with the multiplexer architecture as well. There are still long wires on the input routes, which must span the whole array of multiplexers, and as the number of inputs and outputs grows, the wire delays increase. Multiplexer complexity increases rapidly with increased switch width as well.
The Hong Kong University researchers decided to take a novel approach to this problem by pipelining the multiplexers. Thus one 256-input multiplexer is replaced by several cascaded narrower multiplexers separated by registers. The result alleviates the problems of multiplexer-based crossbar design, but at the cost of some new issues.
The basic element of the new design is a flip-flop with an embedded multiplexer. The flip-flop chosen was a semi-dynamic device attributed to Klass and Stojanovic, favored for its negative set-up time and small transistor overhead.
The designers chose to break the multiplexers into a cascade of a 2-to-1 static mux, a 4-to-1 flip-flop/mux, an 8-to-1 static mux and finally another 4-to-1 flip-flop/mux. This gave the design an effective two-stage pipelined 256-to-1 multiplexer.
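The arithmetic behind the cascade is that 2 x 4 x 8 x 4 = 256. The behavioral sketch below shows how a flat 256-way select could be split into one select per stage. The stage widths and the choice of which stages are registered come from the description above. The grouping of adjacent lines and the ordering of the select fields are assumptions made for illustration, not details from the paper.

```python
from math import prod

# (width, registered) per stage, in cascade order:
# 2-to-1 static, 4-to-1 flip-flop/mux, 8-to-1 static, 4-to-1 flip-flop/mux.
STAGES = [(2, False), (4, True), (8, False), (4, True)]
WIDTHS = [w for w, _ in STAGES]
PIPELINE_DEPTH = sum(1 for _, registered in STAGES if registered)


def split_select(sel, widths=WIDTHS):
    """Split a flat select into per-stage selects (mixed radix, first stage
    least significant). The field ordering is an assumption."""
    digits = []
    for w in widths:
        digits.append(sel % w)
        sel //= w
    return digits


def cascaded_mux(inputs, sel, widths=WIDTHS):
    """Reduce the input lines stage by stage; each stage keeps one line out
    of every group of w adjacent lines."""
    assert len(inputs) == prod(widths)
    lines = list(inputs)
    for w, s in zip(widths, split_select(sel, widths)):
        lines = [lines[g + s] for g in range(0, len(lines), w)]
    return lines[0]


if __name__ == "__main__":
    inputs = list(range(256))
    # The cascade should behave like a single 256-to-1 multiplexer.
    assert all(cascaded_mux(inputs, s) == inputs[s] for s in range(256))
    print(prod(WIDTHS), PIPELINE_DEPTH)  # 256 2
```

This models only the selection logic. It ignores timing, and the two registered stages appear simply as a pipeline depth of two.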
But upon floorplanning and delay estimation it was found that the length of the input lines, reaching all the way across the row of first-level multiplexers, was still too great to achieve the necessary 1-GHz signal speed. So the researchers added another pipeline stage, this one simply for wire delay. In effect, they broke the crossbar in half with a vertical line of flip-flops. A matching set of flip-flops was placed on the output of the pipelines for the first 128 outputs, so that both sets of pipelines, those for the left half of the array and those for the right half, had the same number of stages. This technique broke the long input lines in two and produced an acceptable capacitance. It had been calculated that buffer insertion would not have yielded a sufficient bandwidth in this process.
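The latency-matching idea can be sketched as a simple cycle count. This is an illustrative reading of the description above, not the authors' model. It assumes the inputs are driven from the side of the first 128 outputs, so only the other half sees the mid-array wire-delay flip-flop. The function name and the `balanced` flag are hypothetical.

```python
MUX_STAGES = 2  # registered multiplexer stages per output pipeline (from the article)


def output_latency(output_index, n_outputs=256, balanced=True):
    """Clock cycles from input to a given output in this simplified model."""
    half = n_outputs // 2
    in_far_half = output_index >= half
    # Far-half outputs see the inputs only after the mid-array flip-flops.
    wire_reg = 1 if in_far_half else 0
    # Near-half outputs get a matching output flip-flop when balancing is on.
    match_reg = 1 if (balanced and not in_far_half) else 0
    return wire_reg + MUX_STAGES + match_reg


if __name__ == "__main__":
    print({output_latency(i) for i in range(256)})                  # {3}
    print({output_latency(i, balanced=False) for i in range(256)})  # {2, 3}
```

Without the matching flip-flops, the two halves would deliver data one cycle apart. With them, every output has the same latency in this model.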
With the addition of control and clocking circuitry, a 256-by-256 crossbar, each line handling 1 Gbit/s, was designed. The design was laid out, extracted and simulated, but has not been fabricated.
While the device meets its performance requirements in simulation, there are some disadvantages to the approach. Obviously, clock distribution becomes a critical issue. Further, because of all the synchronous stages, the device was estimated to consume 40 W in operation.
But as an approach to the very real problems of layout and timing in larger crossbars, the design contributes some valuable new ideas to the toolkit.