United Business Media EE Times



Complex RISC collides with IA-64 parallelism

By Ron Wilson and Alexander Wolfe

SAN JOSE, Calif. -- Complex twists on traditional RISC-based superscalar CPUs vied with Intel's new, highly parallel IA-64 architecture at last week's Microprocessor Forum, here.

The designs prove that all engineers are facing the same grand challenge. While forthcoming 0.25- and 0.18-micron processes will let chip designers pack as many execution units as they want on a high-power die, the problem is keeping all those functional blocks busy.

On the RISC front, new processors from Hewlett-Packard, IBM Microelectronics and Sun Microelectronics offered potential solutions in the form of multiple integer-execution pipes, multiple floating-point pipes, and separate load/store and branch hardware, for totals of between six and eight execution units on a chip.

In contrast, Intel Corp. is forging a novel path toward explicit parallelism, relying on a new kind of cooperation between the hardware and the compiler in its IA-64 architecture. HP worked with Intel to define the IA-64 instruction set.

To date, the fundamental tool for keeping the execution units humming has been to maintain a pool of ready instructions in the CPU and to dispatch as many as possible on the next clock. The problem, according to architects, is that conventional programs usually don't make enough instructions ready at any one time for the CPU to dispatch more than a couple in an average cycle.

Memory latency can be a more expensive problem. Architects' first line of defense has been to build big, single-cycle L1 caches, in an attempt to minimize the number of L1 cache misses. But the hardware cost of this approach is growing. HP's PA-8500, for example, has 1.5 Mbytes of L1 cache on-chip. The company has described the 8500 as a cache SRAM with an ancillary CPU. Taking a different approach, the IBM Power3 design team included only 32 kbytes of instruction and 64 kbytes of data cache, but made each cache 128-way set associative.
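To put the Power3 associativity figure in perspective, the short sketch below works out how many sets such a cache would have. The 128-byte line size is an assumption made for illustration only; the figures quoted above cover capacity and associativity, not line size.

```python
def cache_sets(size_bytes: int, line_bytes: int, ways: int) -> int:
    """Return the number of sets in a set-associative cache."""
    if size_bytes % line_bytes != 0:
        raise ValueError("cache size must be a multiple of the line size")
    lines = size_bytes // line_bytes
    if lines % ways != 0:
        raise ValueError("line count must be a multiple of the associativity")
    return lines // ways


LINE_BYTES = 128  # assumed line size, not stated in the article

# Capacities and 128-way associativity as quoted for the Power3.
data_sets = cache_sets(64 * 1024, LINE_BYTES, ways=128)
instr_sets = cache_sets(32 * 1024, LINE_BYTES, ways=128)

print(f"data cache: {data_sets} sets, instruction cache: {instr_sets} sets")
```

Under that assumed line size, each cache has only a handful of sets, so almost any line can sit almost anywhere. That is the sense in which high associativity trades raw capacity for a lower rate of conflict misses.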

Another way to reduce the miss rate is to cheat. Instruction sets in both Sun's Sparc V9 and IBM's Power3 include prefetch instructions that, in effect, tell the CPU to preload certain information into its L1 caches.

Trying to overcome branches can lead to even more complexity. In conventional RISC theory, only the instructions within the current basic block--that is, those between the previous branch and the next one--are available to place in the pool.

Another weapon is out-of-order execution. Instead of waiting for an instruction's turn to come up before considering it for the pool, the CPU can look all the way to the end of the block and see if there are any more instructions ready.
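The difference between in-order and out-of-order issue can be seen in a toy model. The sketch below is not any vendor's scheduler: the instruction names, latencies and two-wide issue limit are all invented for illustration, and every dependency is assumed to name an earlier instruction in the same block.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    deps: tuple = ()   # names of instructions whose results are needed
    latency: int = 1   # cycles until the result is available


def run(block, width, out_of_order):
    """Return the cycle at which the last result in the block is available."""
    done_at = {}            # instruction name -> cycle its result is ready
    pending = list(block)
    cycle = 0
    while pending:
        issued = []
        for ins in pending:
            if len(issued) == width:
                break
            ready = all(d in done_at and done_at[d] <= cycle for d in ins.deps)
            if ready:
                issued.append(ins)
            elif not out_of_order:
                break       # in-order: stall behind the first unready instruction
        for ins in issued:
            done_at[ins.name] = cycle + ins.latency
            pending.remove(ins)
        cycle += 1
    return max(done_at.values(), default=0)


# Two independent load/add chains feeding a final add (hypothetical block).
block = [
    Instr("load_a", latency=3),
    Instr("add_b", deps=("load_a",)),
    Instr("load_c", latency=3),
    Instr("add_d", deps=("load_c",)),
    Instr("add_e", deps=("add_b", "add_d")),
]

in_order_cycles = run(block, width=2, out_of_order=False)
out_of_order_cycles = run(block, width=2, out_of_order=True)
print(in_order_cycles, out_of_order_cycles)
```

In this model the in-order machine waits behind the first stalled add, while the out-of-order machine looks past it and starts the second load in the same cycle as the first, so the two load latencies overlap.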

Further improvement depends on speculative execution. In essence, the CPU guesses which way the branch will work out. The latest chips, such as the Power3, take the technique one step further, speculating down both possible paths of the branch.

All of these techniques tend to improve the all-important average instructions per clock figure. But all of them can be defeated by careless programming or a sloppy compiler. And much worse, each of these techniques adds another layer of complexity--and hence verification uncertainty--to the design. Indeed, the conventional superscalar RISC engine is approaching the point where no one will be able to tell whether it is designed correctly.

Such considerations led Intel and HP to conclude that it was time for a clean sheet of paper, said HP's senior vice president of research and development, Joel Birnbaum.

The result, according to Intel and HP, is the IA-64, which offers a new kind of cooperation between hardware and compiler. The hardware will offer a large--indeed, potentially unlimited--number and variety of execution units. In exchange, the compiler will organize instructions into simultaneously executable blocks, and give the hardware important assistance in avoiding memory latency and skipping over branches.

"We're going to break the sequential-execution paradigm--the notion that every instruction depends on the previous instruction," said John Crawford, Intel's principal microprocessor architect. "At the hardware level, we're trying to create a machine that can take a large number of instructions and feed them to functional units on every clock cycle."

The IA-64 architecture relies on the dual bulwarks of predication and speculation. The former is intended to remove branches from code, while the latter masks the problem of memory latency.

Execution runs
In practice, predication removes branches from code by essentially executing the instructions on both paths of a branch at the same time. Then, the results from instructions that wouldn't have been executed during a real-world sequential run through the code are thrown out. Because the outcome of the branch is difficult to determine in advance, the selection is made post-execution by performing a series of checks with the aid of sixty-four 1-bit predicate registers.

"In predication, the idea is that every instruction is augmented with a flag that says, 'execute the instruction if the flag is true,' " said Jerry Huck, HP's lead microprocessor architect. "The predicate can remove branches and allow segmented execution. It offers freedom to the compiler to schedule [software] so as to minimize the critical path through the code."
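Huck's description can be illustrated with a small model of how a compiler might turn an if/else into predicated, branch-free code. This is only a sketch of the idea: the register names, the predicate names p1 and p2, and the tuple format for instructions are hypothetical and do not reflect IA-64 encoding.

```python
def predicated_abs_diff(a, b):
    """Compute |a - b| with no branch, using a pair of predicate flags."""
    regs = {"a": a, "b": b, "x": None}

    # One compare sets two complementary predicates.
    taken = a > b
    preds = {"p1": taken, "p2": not taken}

    # Each instruction carries the predicate that guards it.
    program = [
        ("p1", "x", lambda r: r["a"] - r["b"]),   # the "then" side
        ("p2", "x", lambda r: r["b"] - r["a"]),   # the "else" side
    ]

    for pred, dest, op in program:
        result = op(regs)          # both sides are computed
        if preds[pred]:
            regs[dest] = result    # only the true-predicate result is kept
    return regs["x"]


print(predicated_abs_diff(7, 3), predicated_abs_diff(3, 7))
```

Both subtractions are computed on every call, and only the one whose predicate is true writes its result. The `if` in the loop stands in for the hardware's commit check, not for a branch in the program being modelled.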

In theory, a potentially vast performance boost is possible if few unnecessary branches are executed. However, critics aren't so sure the number will be that small. Nevertheless, Huck insists that code typically contains "lots of complex branches that you can collapse away."

There is support in practice for the concept. The ARM architecture from Advanced RISC Machines Ltd. (Cambridge, England) has included a form of predication since its inception. In fact, according to ARM architect Guy Larri, compiler writers have learned to use the ARM conditional-execution facility very effectively.

"It was one of the reasons we decided to leave static-branch prediction out of the ARM9," Larri said. "We had branch prediction in the ARM8. But we found that compilers using conditional execution could simply eliminate many of the branches we were trying to predict."

Speculation, the second technique the IA-64 relies upon, masks memory latency by essentially yanking load instructions out of their normal place in the middle of a branch and bringing them forward "to be initiated as early as possible in the program flow," said Huck.

Though speculation doesn't change the actual latency involved in accessing memory, it masks the problem, since the accesses in question are performed well in advance of when they're actually needed. "We're trying to cover the problem of memory delays," said Huck. "Loads from memory are often the first instruction of a dependency chain, so covering that latency is a big problem."

However, this is no easy task; it requires complex compilation algorithms and, in IA-64, a new technique in which the memory load in question is effectively broken into two separate instructions. This enables the compiler to track what's going on and ensure that the memory access isn't performed so early that the contents will be outdated by the time they're used.
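The article does not spell out how the two-part load works. The sketch below shows one way such a scheme could operate, under stated assumptions: an early load records the address it read, any later store to that address invalidates the record, and a check at the original load position reloads only if the record is gone. The class, method and register names are hypothetical.

```python
class Machine:
    """Toy model of a load split into an early load and a later check."""

    def __init__(self, memory):
        self.memory = dict(memory)
        self.regs = {}
        self.tracked = {}   # register -> address of a still-valid early load

    def early_load(self, reg, addr):
        """First half: fetch the value early and remember where it came from."""
        self.regs[reg] = self.memory[addr]
        self.tracked[reg] = addr

    def store(self, addr, value):
        """A store invalidates any early load that read the same address."""
        self.memory[addr] = value
        self.tracked = {r: a for r, a in self.tracked.items() if a != addr}

    def check_load(self, reg, addr):
        """Second half: reload only if the early value went stale.

        Returns True when a reload was needed.
        """
        if self.tracked.pop(reg, None) == addr:
            return False
        self.regs[reg] = self.memory[addr]
        return True


m = Machine({0x10: 1, 0x20: 2})

m.early_load("r1", 0x10)
m.store(0x20, 99)                          # unrelated address
reloaded_r1 = m.check_load("r1", 0x10)     # early value still good

m.early_load("r2", 0x20)
m.store(0x20, 5)                           # same address: early value is stale
reloaded_r2 = m.check_load("r2", 0x20)

print(reloaded_r1, m.regs["r1"], reloaded_r2, m.regs["r2"])
```

In the common case the check costs almost nothing and the load latency has already been absorbed. Only when an intervening store hits the same address does the program pay for a second access.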

Indeed, Intel's heavy reliance on its two new techniques to untangle applications at compile time--and the belief that this will yield performance advantages in the software run-times of the real world--are the areas where Intel came in for some grilling at last week's Microprocessor Forum. Perhaps the biggest criticism is that IA-64 relies heavily on static analysis--a snapshot of the code prior to its execution on the CPU--and deals little with the dynamics that come into play when software actually begins running.

Nevertheless, Intel's Crawford assured the audience that "we can take something that's brutally sequential and shrink the critical path by a significant amount."

Register-rich
In terms of its implementation, IA-64's most notable characteristic is its massive complement of registers: 128 integer, 128 floating-point and 64 new 1-bit predicate registers for the compiler to work with. This aids performance by obviating the need for the compiler to perform cumbersome register-renaming tasks.

Another hallmark of IA-64 takes a page from very-long-instruction-word (VLIW) architectures. A packed instruction word that's 128 bits wide incorporates three separate instructions and maps them to functional units in the target processor. The word also contains a template field, which specifies dependencies between the instructions and other packed words.
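As a rough illustration of the packed word, the sketch below packs three instruction slots and a template field into a single 128-bit value and unpacks them again. The field widths (a 5-bit template and three 41-bit slots, which together total 128 bits) and the bit positions are assumptions made for this example; the article gives only the overall width and the number of instructions.

```python
TEMPLATE_BITS = 5    # assumed width of the template field
SLOT_BITS = 41       # assumed width of each instruction slot
SLOTS_PER_WORD = 3
assert TEMPLATE_BITS + SLOTS_PER_WORD * SLOT_BITS == 128


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template does not fit in its field")
    if len(slots) != SLOTS_PER_WORD:
        raise ValueError("a packed word holds exactly three instructions")
    word = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError(f"slot {i} does not fit in its field")
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit integer back into (template, (slot0, slot1, slot2))."""
    template = word & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = tuple(
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
        for i in range(SLOTS_PER_WORD)
    )
    return template, slots


word = pack_bundle(0b10000, [1, 2, 3])   # placeholder values, not real opcodes
print(unpack_bundle(word))
```

The template value and slot contents here are placeholders. The point is only that a fixed-width word can carry three instructions plus a small field describing how they relate to one another.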

This template field also gives the architecture its vaunted scalability. That is, traditional VLIW designs have trouble maintaining compatibility between different members of the same CPU family. That's because a fixed-length instruction word is generally mapped to a fixed number of execution units, creating difficulties when an architecture attempts to bulk up or trim down.

In contrast, the template field in IA-64 enables support of more-powerful processors by making it possible to gang together the packed instruction words. Thus, a chip with six functional units would simply work from two packed instruction words each cycle.

"As we get to larger transistor budgets, we'll be able to build wider and wider machines," Crawford said.

But compilers that are orders of magnitude more complex than today's code generators will be required if IA-64 is to succeed, industry experts said. Still, Crawford claimed that Intel and HP have the software savvy to deliver. "We already have the compiler expertise," he said.

Some of Intel's competitors believe that it could be well beyond the planned 1999 introduction of Merced before IA-64 makes it into the mainstream. "Intel is trying to give the impression that the day the systems appear, there will be scads of [applications] software," said Aaron Bauch, technical marketing manager for Digital Semiconductor's Alpha microprocessor group. "But today there isn't even an [IA-64] compiler." For its part, Intel claimed to be already working hard to seed the development of 64-bit applications.

At last week's IA-64 debut, Intel also answered another question that observers had been asking, strongly suggesting that the Merced CPU will be compatible with existing, 32-bit X86 applications via hardware conversion. That would promise far better 32-bit performance than the alternative technique of software translation.

All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.