Many product road maps show process migration years down the road as a fait accompli. A process shrink alone, however, will not enable developers to keep pace with Moore's Law. Running a longer pipeline 40 percent faster will not result in a 40 percent increase in performance unless changes are made to the internal architecture to handle data and processing more efficiently. For example, in anticipation of moving its DSPs to 1 GHz at the 90-nanometer node, Texas Instruments made some microarchitectural changes. Apart from the resulting performance gain and continued code compatibility, the changes are transparent to designers.
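A simple execution-time model shows why clock gain and performance gain diverge. This is a minimal sketch, assuming a workload with a fixed count of core cycles plus off-chip stall time that stays constant in nanoseconds as the clock rises; the function name and the workload numbers are hypothetical, chosen only to mirror the 720-MHz-to-1-GHz step discussed in this article.

```python
def clock_speedup(core_cycles: float, stall_ns: float, f_old_ghz: float, f_new_ghz: float) -> float:
    """Return old_time / new_time when only the core cycles scale with the clock.

    Hypothetical model: execution time (ns) = core_cycles / f_ghz + stall_ns,
    where stall_ns is off-chip latency that does not shrink as the clock rises.
    """
    t_old = core_cycles / f_old_ghz + stall_ns
    t_new = core_cycles / f_new_ghz + stall_ns
    return t_old / t_new


# Hypothetical workload: 720,000 core cycles plus 0.5 ms of off-chip stalls.
clock_gain = 1.0 / 0.72                                  # 720 MHz -> 1 GHz, about 1.39x
perf_gain = clock_speedup(720_000, 500_000, 0.72, 1.0)  # about 1.23x
print(f"clock gain {clock_gain:.2f}x, performance gain {perf_gain:.2f}x")
```

In this model the performance gain approaches the clock gain only as the stall term shrinks, for example by keeping more data in on-chip memory.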
Because register-file size grows in proportion to the square of the number of ports, splitting the registers into two files in a dual-data-path architecture, with each file serving half the ports, results in approximately 50 percent less porting than a single register file requires. Reducing instruction fetch by one cycle opens a slot in the pipeline for register forwarding and a pipelined move between the paths. Data functional-unit performance was matched to within 5 percent along the critical speed path. Adding subword single-instruction, multiple-data (SIMD) extensions to the 8-wide very-long-instruction-word (VLIW) instructions enabled more compact code and more efficient use of the pipeline's functional units.
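The saving follows directly from the quadratic relationship. The sketch below is a back-of-the-envelope model, assuming area scales with the square of the ports on each file and that the ports divide evenly between the files; the function name and the 24-port total are illustrative, not TI data.

```python
def register_file_area(total_ports: int, num_files: int) -> float:
    """Relative register-file area under the model area ~ ports**2 per file.

    Assumes the ports are divided evenly among num_files files and ignores any
    extra ports needed for moves between data paths.
    """
    ports_per_file = total_ports / num_files
    return num_files * ports_per_file ** 2


ports = 24  # hypothetical total port count, not a TI figure
unified = register_file_area(ports, 1)  # 576.0
split = register_file_area(ports, 2)    # 288.0
print(f"split/unified area ratio: {split / unified:.2f}")  # 0.50
```

Whatever the port count, two half-ported files total half the area of one fully ported file in this model; ports added for moves between the paths would eat into that saving, which is one reason a real figure is approximate.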
Because overall performance is tied to off-chip latencies, we implemented a smaller memory cell, enabling an increase in on-chip memory. Other design factors will include moving from aluminum to copper at 130 nm, along with the use of 300-mm wafers, optical proximity correction and lead-free packaging.
Reaching 1 GHz from 720 MHz required further refinements in the critical speed path and memory pipelines, as well as improved clock skewing at the circuit level. One way to capture those optimizations is to vary the process flow by application. For example, the low-power process flow trades performance for power efficiency, defining adjustments to transistor gate length, threshold voltage, gate oxide thickness and bias conditions. Integrated devices can also be tailored to specific applications. TI, for example, has integrated Viterbi and turbo coding coprocessors alongside the execution pipeline, offloading forward-error-correction processing to compound the efficiencies of a faster clock.
Replacing the fluorosilicate glass (FSG) dielectric used in the previous generation (k = 3.6) with a new organosilicate glass (OSG) dielectric (k = 2.9) at the interconnect level reduces capacitance, and therefore propagation delay, within the interconnect layers, so the same transistor drive current switches the wiring faster. Low-k materials boost overall chip operating frequency and allow metal lines to be packed closer together on a chip with less capacitive coupling (crosstalk) between neighboring signals.
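To first order, interconnect capacitance is proportional to the dielectric constant of the insulator between the lines, so the change of material sets an upper bound on the saving. This assumes the wire geometry is unchanged and that the field lies entirely in the new dielectric; cap and etch-stop layers with higher k make the effective improvement somewhat smaller in practice.

```latex
\frac{C_{\mathrm{OSG}}}{C_{\mathrm{FSG}}} \approx \frac{k_{\mathrm{OSG}}}{k_{\mathrm{FSG}}} = \frac{2.9}{3.6} \approx 0.81,
\qquad
\frac{(RC)_{\mathrm{OSG}}}{(RC)_{\mathrm{FSG}}} \approx 0.81 \quad \text{for unchanged wire resistance } R
```

That corresponds to roughly a 19 percent reduction in wire capacitance and in wire-dominated RC delay.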
To push 90-nm processes closer to their full performance potential, designers will want to consider a 37-nm gate length, nickel-silicide metal gates and a strained-silicon approach with ultrashallow source and drain junctions to boost performance in both NMOS and PMOS transistors. The shorter gate length yields higher performance. Nickel silicide lowers gate resistance; strain induced in the transistor channel increases carrier mobility. TI's approach to strain engineering improves on existing graded silicon-germanium techniques, which tend to introduce defects that depress yields.
While the rest of the industry won't see these enhancements until 65 nm, they are a portent of performance improvements to come: At 90 nm, transistors with such enhancements exhibit 50 percent better performance than equivalent transistors without them.
Integrating a conventional analog radio, though possible, is challenging, given the high frequencies and stringent performance requirements involved. With volumes of several hundred million handsets a year, RF device integration must yield practical as well as technological results. BiCMOS multichip devices, for example, require expensive test methodologies, and RF yield limitations directly constrain digital baseband die yield. Integrating RF in SiGe is not prudent either, since SiGe process technology lags CMOS and cannot keep the system logic at the lowest possible cost.
Even monolithic integration in CMOS falls short. Scaling analog to lower voltage levels is difficult, and device models of new processes are generally inadequate for the highly accurate parametric modeling required for RF design. Clearly, a process shrink alone is not enough. Innovation in architecture is also needed.
TI's fast MOSFET has a physical gate length of 52 nm.
The combination of 90-nm geometries and the clock speeds they enable allows more signal processing to move into the digital domain. Sampled-data processing techniques can condition signals during frequency translation, resulting in receivers that need no off-chip intermediate filtering stages and are suitable for integration in advanced CMOS. Large blocks of logic can be clocked at up to 20 GHz. At these frequencies, cellular radio signals can be oversampled by more than 10x, so that processing traditionally handled by analog circuits can take place in the digital domain. In this way, the benefits of CMOS scaling can be realized.
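Frequency translation and filtering can be combined in sampled-data form. The sketch below is a generic illustration, not TI's circuit: it multiplies samples by a cosine local oscillator, then integrates and dumps over n samples, which is a moving-average low-pass filter that also decimates. Units are normalized to GHz, with an assumed 20-GHz sample rate and an assumed 2-GHz carrier (10x oversampling); the function name and the tone frequencies are hypothetical.

```python
import math


def mix_and_decimate(samples, f_lo, fs, n):
    """Downconvert with a cosine local oscillator, then integrate and dump.

    Each output is the average of n mixed samples, so the stage acts as a
    moving-average low-pass filter while reducing the sample rate by n.
    """
    out = []
    for start in range(0, len(samples) - n + 1, n):
        acc = 0.0
        for k in range(start, start + n):
            acc += samples[k] * math.cos(2 * math.pi * f_lo * k / fs)
        out.append(acc / n)
    return out


fs, f_lo, n = 20.0, 2.0, 100  # GHz, GHz, samples per output (hypothetical)
wanted = [math.cos(2 * math.pi * 2.0 * k / fs) for k in range(1000)]
blocker = [math.cos(2 * math.pi * 5.0 * k / fs) for k in range(1000)]

print(mix_and_decimate(wanted, f_lo, fs, n)[:3])   # each close to 0.5
print(mix_and_decimate(blocker, f_lo, fs, n)[:3])  # each close to 0.0
```

The wanted 2-GHz tone is translated to DC and survives, while the 5-GHz blocker mixes to 3 and 7 GHz, which fall on nulls of the 100-sample moving average (multiples of fs/n = 0.2 GHz) and average away. A practical receiver would also need a quadrature (sine) branch to recover phase.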
Also, systems-on-chip integrated with digital RF radios can self-calibrate their analog circuits to reduce the effect of parametric variations on yield while at the same time reducing the test cost for radio functions. Adapting digital CMOS for processing RF signals further reduces system cost, size and power consumption, and it affords designers the flexibility to mix and match digital, analog, RF and memory as needed. Texas Instruments expects to offer a single-chip implementation for GSM cell phones in 2004.
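One common way a digitally assisted circuit can calibrate itself is a successive-approximation search over a trim code. This is a minimal sketch under stated assumptions: the measure_offset callback, the 8-bit trim range and the simulated offset are hypothetical, and the residual offset is assumed to rise monotonically with the trim code.

```python
from typing import Callable


def calibrate_trim(measure_offset: Callable[[int], float], bits: int) -> int:
    """Successive-approximation search for the trim code where the offset crosses zero.

    Returns the largest code whose measured offset is still <= 0, assuming
    measure_offset is monotonically increasing in the code.
    """
    code = 0
    for bit in reversed(range(bits)):
        trial = code | (1 << bit)
        if measure_offset(trial) <= 0:
            code = trial  # keep this bit: the offset has not yet crossed zero
    return code


# Hypothetical part: a -37.3 mV built-in offset, 0.5 mV added per trim step.
def simulated_offset(code: int) -> float:
    return -37.3 + 0.5 * code


best = calibrate_trim(simulated_offset, bits=8)
print(best, round(simulated_offset(best), 2))  # 74 -0.3
```

Running such a loop at power-up lets each part null its own process-dependent offset to within one trim step, rather than relying on a trim set during production test.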
Ray Simar is chief architect at Texas Instruments Inc. (Dallas) and a TI fellow.