AUSTIN, Texas -- With an $11 million grant from the Defense Advanced Research Projects Agency and collaborative support from IBM Corp.'s Austin Research Lab, a team of computer architects at the University of Texas here plans to develop prototypes of an adaptive, grid-like processor that exploits instruction-level parallelism.
The group developing Trips, or the Tera-op Reliable Intelligently Adaptive Processing System, expects to have operational prototypes ready by the end of 2005 and is looking for commercial partners willing to bring the technology to market.
The prototypes will include four Trips processors, each containing 16 execution units laid out in a 4 x 4 grid. By the end of the decade, when 32-nanometer process technology is available, the goal is to have tens of processing units on a single die, delivering more than 1 trillion operations per second.
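For orientation, those published dimensions are easy to encode directly; the sketch below (Python, with names such as GRID_DIM invented for the example) does nothing more than that:

    GRID_DIM = 4        # each Trips processor is a 4 x 4 grid of execution units
    NUM_PROCESSORS = 4  # four Trips processors per prototype

    # Each execution unit is addressed by its (row, column) position in the grid.
    tiles = [(r, c) for r in range(GRID_DIM) for c in range(GRID_DIM)]
    assert len(tiles) == 16  # 16 execution units per processor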
"One key question is, Will this novel architecture perform well on a variety of commercial applications?" said Jeff Burns, a project leader on the Trips program at IBM Research. IBM, he said, "is mapping commercial applications to this polymorphic architecture. We want to see if it will do what they [the team leaders] think it will do."
IBM also is applying its simulation, workload-evaluation and tuning tools. Burns said the company will develop design tools aimed at keeping power under control and will put them to work devising a power grid that shuts off the supply wherever possible.
Trips is the brainchild of Doug Burger and Steve Keckler, who came to UT-Austin five years ago with freshly minted PhDs and a shared belief that increasing wire delays would limit the ability to scale the performance of conventional processor architectures. They argued that moving data back and forth between a deeply pipelined processing unit and a large cache wasn't going to work well when wiring delays (at the 32-nm technology node, for example) would limit single-clock-cycle access to only about 1 percent of the die area.
The pair set out to develop a grid processor architecture and proposed it to Darpa's polymorphous computer architectures program. Polymorphism implies that the hardware should adjust to the different applications and workloads running on it.
Over the past three years, the two assistant professors have been joined by dozens of collaborators at the university and at IBM's research lab here.
Trips is a general-purpose architecture that adapts to different types of applications by exploiting parallelism at the instruction, thread and data levels. It offers far more instruction-level parallelism than conventional designs, so even single-thread applications can run faster, said Chuck Moore, a senior research fellow at the university working with Keckler and Burger. Moore was previously chief engineer of IBM's Power4 processor design team here.
"Conventional machines are made to run faster as they become more deeply pipelined. The Trips machine has lots of instructions running in parallel," Moore said.
Software is getting harder to write and is far more costly. And for software engineers, conventional hardware is becoming "harder to optimize to," Moore said. "Our thinking is that the hardware really ought to adapt to the software running on it, so Trips morphs to the characteristics of the software running on it."
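One way to read that claim is as a mode choice: the machine decides how to use its grid based on the kind of parallelism a workload exposes. The sketch below illustrates only that selection idea; the mode names and the rule are invented here, not taken from the Trips design:

    def choose_mode(threads, data_parallel):
        """Pick a (hypothetical) grid configuration for a workload."""
        if data_parallel:
            return "data-level"         # stream- or vector-style work
        if threads > 1:
            return "thread-level"       # partition the grid among threads
        return "instruction-level"      # spread one thread's instructions widely

    print(choose_mode(threads=1, data_parallel=False))  # instruction-level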
The architects also took advantage of regularity and structural reuse, repeating a single execution-unit component many times. Since each component, about 1 square millimeter in size, sits close to its neighbors, the approach minimizes the long clock delays of conventional designs, which Moore said will become increasingly problematic over the rest of this decade.
"The regularity of this design means that each of these tiles is designed to be accessed within a single clock cycle. Only if you leave this local region does it takes more than one clock cycle," Moore explained.
UT's Keckler said the approach relies on the compiler to choose which tile each instruction is placed on. Operations communicate with their neighbors, so if one operation needs the result of another, the compiler places the two on nearby tiles. "We are not limited to one deep, skinny pipeline," he said.
The execution units and instructions form a "tree of dependencies. We get a chained-execution-units effect, so that now the critical path of those instructions executes in data order," Keckler said.
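Keckler's description suggests a simple mental model: walk a chain of dependent instructions and place each one on a tile adjacent to its producer, so a result travels at most one hop. The greedy strategy and the names below are invented for illustration; the actual Trips scheduler is certainly more sophisticated:

    GRID_DIM = 4  # a 4 x 4 grid of execution-unit tiles

    def neighbors(tile):
        """Tiles directly adjacent to the given (row, column) position."""
        r, c = tile
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < GRID_DIM and 0 <= nc < GRID_DIM:
                yield (nr, nc)

    def place_chain(chain, start=(0, 0)):
        """Map a list of dependent instructions onto a path of adjacent tiles."""
        placement, used, tile = {}, set(), start
        for instr in chain:
            placement[instr] = tile
            used.add(tile)
            free = [n for n in neighbors(tile) if n not in used]
            if not free:
                break  # out of adjacent tiles; a real scheduler would route farther
            tile = free[0]
        return placement

    # Four dependent operations land on a path of neighboring tiles.
    print(place_chain(["load", "add", "mul", "store"]))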
On Trips, a conventional program is compiled into hyperblocks. The machine maps each block onto a tree of interconnected execution units; as one block executes, the next one is loaded, and so on.
"The program is compiled into hyperblocks [of instructions], loads them onto the grid of execution units, then loads the next one," Burger said. "Older machines used to do that with a single instruction; we're doing it with large blocks of instructions."
Each execution unit has a limited amount of instruction and data cache, memory that is similar to an extended buffer or register, he said.
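Taken together, Burger's description amounts to block-at-a-time dataflow execution: map one hyperblock onto the grid, let each instruction fire as soon as its operands land in a tile's small local buffer, then bring in the next block. The instruction encoding and the flat "values" dictionary below are invented stand-ins for those per-unit stores:

    def run_hyperblock(block, inputs):
        """block: list of (dest, op, operand_names); returns every produced value."""
        values = dict(inputs)  # stands in for the tiles' small operand buffers
        pending = list(block)
        while pending:
            ready = [i for i in pending if all(a in values for a in i[2])]
            assert ready, "block contains an unsatisfiable dependence"
            for dest, op, args in ready:  # operands have arrived, so fire
                values[dest] = op(*(values[a] for a in args))
            pending = [i for i in pending if i[0] not in values]
        return values

    # One hyperblock: t1 = a + b, then t2 = t1 * c, executing in data order.
    block = [("t1", lambda x, y: x + y, ("a", "b")),
             ("t2", lambda x, y: x * y, ("t1", "c"))]
    print(run_hyperblock(block, {"a": 2, "b": 3, "c": 4})["t2"])  # prints 20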