MONTEREY, Calif. – Researchers from leading universities and programmable-logic companies gathered at FPGA 2002 this week to poke holes in the conventional wisdom that has guided field-programmable gate array design since its inception.
Up for reexamination were such issues as the configuration of lookup tables, power consumption and floating-point functionality. Some of the findings disclosed here could influence how vendors approach future architectures, and could shed new light on ways to attain better performance and control power dissipation.
Not even the most basic building blocks of FPGAs have escaped scrutiny. Should a lookup table, for example, have four inputs instead of three or five? There's plenty of evidence to back up the claim that the four-input LUT offers the best balance of chip area and speed. But the assumptions FPGA vendors have been making to arrive at that conclusion are suspect.
In fact, the best LUT size can vary significantly depending on a number of factors, the most noteworthy being the CAD tools used, according to a study conducted at the University of British Columbia. If area efficiency is the chief aim, the technology mapper Chortle concludes that a three-input LUT is best, while alternative mappers such as Flowmap and Cutmap point to a five-input LUT. The differences are even greater when optimizing for delay, said Steven J.E. Wilton, assistant professor at the University of British Columbia.
"The experimental results can be very significantly affected by some of these assumptions," he said. He added that few of the FPGA architectures proposed thus far "really address how sensitive their results are to the experimental assumptions."
The study could force FPGA vendors to take a hard look at how they settle on an architecture. "I was fascinated by the results," said Stephen Trimberger, principal engineer and manager of advanced development for Xilinx Inc. "You read it and say that this is obvious, and other times you think, 'How could this be?' In the end you don't know which one is obvious."
Jonathan Rose, director of Altera Corp.'s Toronto Technology Center, called the study a "wake-up call" to those involved in FPGA research. "When we do our papers we don't do a sensitivity analysis. Computer architects have been getting better at this. We should too," said Rose, who also works part-time as a professor at the University of Toronto.
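Part of the sensitivity comes down to simple geometry. As a rough back-of-envelope illustration (not what Chortle, Flowmap or Cutmap actually do; real technology mappers are far more sophisticated), each k-input LUT consumes k signals and produces one, so covering an n-input function takes roughly ceil((n-1)/(k-1)) LUTs, while the logic depth of a balanced reduction shrinks as k grows. Area and delay thus pull toward different LUT sizes:

```python
import math

def luts_needed(n_inputs: int, k: int) -> int:
    """Rough count of k-input LUTs to cover an n-input function.

    Each LUT consumes k signals and produces 1, removing k-1 signals,
    so reducing n signals to 1 takes ceil((n-1)/(k-1)) LUTs.
    This is only a coarse area proxy, not a real mapping algorithm.
    """
    if n_inputs <= k:
        return 1
    return math.ceil((n_inputs - 1) / (k - 1))

def lut_depth(n_inputs: int, k: int) -> int:
    """Levels of LUT logic in a balanced k-ary reduction (delay proxy)."""
    depth, width = 0, n_inputs
    while width > 1:
        width = math.ceil(width / k)  # one level of k-input LUTs
        depth += 1
    return depth

# Area favors wide LUTs, while the area *cost per LUT* grows with k;
# depth (a delay proxy) also favors wide LUTs -- the balance point
# depends on how the mapper weighs the two.
for k in (3, 4, 5):
    print(f"k={k}: {luts_needed(16, k)} LUTs, depth {lut_depth(16, k)}")
```

The balance point shifts with how each tool weighs area against depth, which is exactly why different mappers crown different LUT sizes.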
Power consumption is another area drawing the attention of FPGA makers. Few in the programmable-logic field, least of all the FPGA vendors, have felt an urgent need to address power issues, even though FPGAs consume much more power than the ASICs they are intended to replace. Not only must FPGAs deal with static power, which gets worse as supply voltages scale down; they must also contend with the active wires, which account for the bulk of their power dissipation.
Top of the list
Moreover, FPGA vendors are trying to cast their chips as the main I/O hub in communications systems, but so far they are not discussing in any great detail how the rapid adoption of new I/O schemes will take a toll on power consumption.
Despite this reluctance to tackle the issue head-on, power consumption could soon be at the top of the to-do list, especially if programmable-logic-device (PLD) makers are serious about winning sockets in power-sensitive consumer electronics applications, said Xilinx's Trimberger.
Xilinx has already taken the first steps to raise awareness of power issues by disclosing a study on the hot spots in its latest Virtex-II architecture. In the paper, the company showed that 60 percent of the power consumption in the Virtex-II family comes from routing, while logic and clocking account for 16 and 14 percent, respectively.
Additionally, Xilinx found that the cluster of LUTs, flip-flops and other circuitry that makes up its configurable logic blocks (CLBs) consumes 5.9 microwatts per MHz for a typical design. But that figure holds only for "typical" designs; actual power consumption within the CLBs can vary widely with switching activity. Extra transitions arise frequently in synchronous circuits when the inputs to a LUT arrive at different times within the same clock cycle. This "glitching" effect can contribute up to 70 percent of the power dissipation in a CMOS circuit, whether it's an ASIC or an FPGA.
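The role of switching activity is visible in the first-order CMOS dynamic-power formula, P = a * C * Vdd^2 * f, where a is the average number of transitions per clock cycle. The numbers below are illustrative only, not Xilinx's measurements; the point is that glitching raises a for the same circuit at the same clock rate:

```python
def dynamic_power_uw(c_eff_pf: float, vdd: float, f_mhz: float, activity: float) -> float:
    """First-order CMOS dynamic power: P = a * C * Vdd^2 * f.

    With capacitance in pF and frequency in MHz, the product comes out
    directly in microwatts (1e-12 F * 1e6 Hz = 1e-6).
    """
    return activity * c_eff_pf * vdd**2 * f_mhz

# Hypothetical CLB-like numbers: 10 pF effective capacitance, 1.5 V core, 100 MHz.
clean   = dynamic_power_uw(10.0, 1.5, f_mhz=100.0, activity=0.125)
glitchy = dynamic_power_uw(10.0, 1.5, f_mhz=100.0, activity=0.375)  # glitch transitions
print(f"clean: {clean:.2f} uW, with glitching: {glitchy:.2f} uW")
```

Tripling the effective activity triples dissipation, which is why pipelining or retiming with flip-flops, both of which stop glitches from propagating, can pay off.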
There are several ways for designers who use FPGAs to reduce power. For example, they can try to model their systems for the lowest amount of switching activity within certain performance guidelines, said Xilinx's Alireza Kaviani, who presented a paper here. Or they could exploit flip-flops for pipelining or retiming. Reductions in power consumption could also come from changes to the CAD tools or FPGA architecture, though Kaviani did not elaborate further.
Another way, proposed by Amit Singh, a PhD student at the University of California, Santa Barbara, is to alleviate routing congestion. This can be done by absorbing small nets into clusters and spreading the clusters uniformly over the FPGA. This reduces the number of external nets that need to be routed, cutting power consumption by as much as 13 percent.
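The congestion idea is easy to picture: a net that lies entirely inside one cluster never touches the power-hungry global routing fabric. The sketch below uses a hypothetical netlist and cluster assignment (not Singh's actual algorithm) and simply counts how many nets remain external:

```python
def external_nets(nets, cluster_of):
    """Count nets spanning more than one cluster; only these need
    the global routing wires that dominate power dissipation."""
    return sum(1 for net in nets if len({cluster_of[c] for c in net}) > 1)

# Hypothetical 6-cell netlist: each net is the set of cells it connects.
nets = [{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "e"}, {"e", "f"}]

flat   = {c: c for c in "abcdef"}                          # every cell on its own
packed = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}  # small nets absorbed

print(external_nets(nets, flat), "->", external_nets(nets, packed))
```

Here packing absorbs four of the five nets into clusters, leaving a single inter-cluster net for the global wires; a real clusterer must also keep the clusters spread out so the survivors don't congest one region.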
While some worry that power could be the next showstopper, it's not stopping others from pushing the performance envelope. In 1997, Brian Von Herzen, president of Rapid Prototypes Inc. (Carson City, Nev.), showed that FPGAs built on 0.6-micron design rules could be coaxed to run at 250 MHz by using low-level manual design tools rather than push-button flows.
Researchers at the University of Toronto have devised a manual floor-planning tool, called EVE, based on the same "event horizon" methodology. It extracts higher performance than a push-button design flow and works much faster than the native Xilinx floor planner it was compared against. It lets users drag components to new destinations, determines slice packing, highlights violations, evaluates the "goodness" of component positions, reroutes nets and summarizes the delays. At its best, EVE can improve maximum operating frequency by as much as 19 percent, according to the University of Toronto's William Chow, now a software engineer at Altera.
"What it says is there's a certain amount of performance left on the table," said Xilinx's Trimberger.
At the same time, researchers are busy finding ways to incorporate new functionality that would have once been considered impractical for FPGAs. One of them is adding floating-point functionality, which is often considered anathema because of the huge amount of resources a standard floating-point unit can take up.
Some, however, are looking for ways around the problem by eschewing the standard IEEE floating-point unit in favor of something less bulky that can still do the job. At Ecole Polytechnique de Montreal, researcher Yvon Savaria has come up with a "configurable" floating-point format with a narrower bit width tuned to a specific algorithm, which uses floating point only to encode the reciprocal used in division. It's so specialized that negative numbers are not even represented. Even so, it's suitable for applications like video processing, where the output image needs only 8-bit accuracy to be considered studio quality. It also takes up fewer hardware resources: A multiplier using this technique needs just 34 slices (17 Xilinx CLBs in this case), vs. the 290 slices required for the 23 x 23-bit multiplier in a standard IEEE floating-point unit.
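Savaria's exact format isn't spelled out here, but the area savings of any narrow, sign-free float are easy to see: an m-bit mantissa needs roughly an (m+1) x (m+1) multiplier instead of the 24 x 24 one implied by IEEE single precision. Below is a toy sketch of such a multiply, assuming a hypothetical 8-bit-mantissa, no-sign format that truncates rather than rounds:

```python
def fmul(a, b, mbits=8, bias=15):
    """Multiply two unsigned custom floats given as (exponent, mantissa).

    Value represented: (1 + mant / 2**mbits) * 2**(exp - bias).
    No sign bit (negatives unsupported, as in the format described),
    and low product bits are simply truncated, keeping hardware small.
    """
    (ea, ma), (eb, mb) = a, b
    sa = (1 << mbits) | ma            # restore the implicit leading 1
    sb = (1 << mbits) | mb
    p = sa * sb                       # the (mbits+1) x (mbits+1) multiply
    e = ea + eb - bias
    if p >> (2 * mbits + 1):          # significand product reached [2, 4)
        p >>= 1
        e += 1
    return e, (p >> mbits) & ((1 << mbits) - 1)

def to_float(x, mbits=8, bias=15):
    """Decode the toy format back to a Python float for checking."""
    e, m = x
    return (1 + m / (1 << mbits)) * 2.0 ** (e - bias)

one_and_half = (15, 128)              # 1.5 = (1 + 128/256) * 2**0
print(to_float(fmul(one_and_half, one_and_half)))
```

Shrinking mbits shrinks the partial-product array roughly quadratically, which is where the slice savings come from; the price is precision the target algorithm must be shown not to need.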
Engineers designing applications that need many samples and averages over a long period of time, such as radar spectrum analyzers, are asking for more floating-point precision, said Ray Andraka, president of Andraka Consulting Group Inc. (North Kingstown, R.I.). But they shouldn't expect to get there using standard IEEE floating-point units, he said.
"Floating point needs to do normalizing and de-normalizing, so it takes more resources," Andraka said. "A lot of times you can reduce the data path width."
Routing structures are another area under constant scrutiny. Altera, for one, showed that there is a limit to how many fast wires should be used in a PLD. For its Dali core, used in its Mercury transceiver devices, Altera chose a heterogeneous set of wires that are functionally identical but run at different speeds.
"You can remove 30 percent of the critical path delay if you speed up 20 percent of the global interconnect wires," said Altera's Michael Hutton. "After that it's diminishing returns."
Furthermore, once 20 percent of the wires are faster than the rest, the performance gains from adding more fast wires start to erode, he said.