datasheets.com EBN.com EDN.com EETimes.com Embedded.com PlanetAnalog.com TechOnline.com  
Events
UBM Tech
UBM Tech

Design Article

Comment


JMUNN

11/19/2012 1:09 PM EST

Personally, I believe in the many processor embedded in the RAM future. Scaled ...

More...



rhfish

2/15/2012 9:43 AM EST

Re: eDRAM - In our original 1989 work, Moore and I described making the CPU from ...

More...

Future of computing - Part 3: The ILP Wall and pipelines

Rich Pell

2/14/2012 10:17 PM EST

[Part 1 begins a look at the future of computing and, in particular, what happens when multicore processing "hits the Memory Wall." Part 2 turns its attention to the "Power Wall" - the increasing heat and power issues associated with increased performance.]

Instruction Level Parallelism means executing multiple instructions or pieces of instructions at the same time to make the computer run faster. Computers have hit the parallelism wall. This paper will analyze the problem, but first a little history.

Henry Ford pioneered assembly lines for automobile manufacturing.An assembly line consists of workstations that perform a single
production step before passing the partially completed automobile to the next workstation. All steps at all workstations are performed at the same time so many partially completed automobiles are assembled in parallel. The number of assembly steps determines the rate of production. The more steps in the line, the more cars are built in parallel, and the faster the finished vehicles come out the end.

At Ford's original factory, a Ford Model T could be assembled in 93 minutes from start to finish. There were 31 workstations on the line, and a completed car rolled off every three minutes.

In the early 1960s, Seymour Cray, Jim Thornton, and Dean Roush of CDC began thinking about performing multiple computer instructions at the same time in order to speed up execution.
The resulting product was the CDC-6600 containing 10 parallel arithmetic, logic, and floating point units. Due to this massive parallelism, it was as much as 10 times as fast as its fastest competitor. When introduced in 1964 it was considered by many to be the first supercomputer. Even though it was extremely complex and difficult to program, more than 100 were sold, many for the booming field of nuclear weapons research.

The 6600 executed multiple instructions at the same time.This might be compared to the early days of automobile assembly where factories consisted of many individuals who built a complete car. The 6600 did not yet have an assembly line.

Seymour Cray immediately began work on its successor, which would use a new type of parallelism. Cray recognized that most of the logic in the 6600 execution units was idle at any moment in time. He surmised that by dividing instructions into steps, he could organize his execution units such that one step of several instructions could be performed at the same time. The architecture appeared very similar to an automotive assembly line, and today we know it as an execution pipeline. The resulting product executed programs 10 times the speed of the 6600 and became the CDC 7600.

In 1972 Cray left CDC to form Cray Research.Its first product was one of the most successful supercomputers in history, the Cray-1, lovingly called "The Beast." The massively parallel beast featured 12 pipelined functional units and weighed 5.5 tons including Freon refrigeration. The first Cray-1 cost about $8 million and was installed at Los Alamos National Laboratory in 1976.

Seventeen years later Intel introduced the Pentium microprocessor featuring the P5 superscalar pipelined architecture.The 1993 Pentium approximated the performance of the Cray-1 and cost about $500. The benefits of parallelism had come to microprocessors.

The Good, the Bad, and the Ugly
The benefits of computer parallelism did not come for free.Parallel architectures compared to scalar architectures werelogically more complex, required more transistors, dissipated more heat, and were more difficult to program. The greater the parallelism, the greater the pain in each of these areas.

An execution pipeline breaks each instruction into a series of steps. The steps of many instructions are performed at the same time so many instructions are executed in parallel. The number of steps of the pipeline determines the number of instructions executing at the same time. The more instructions executing in parallel, the faster the programs run, just like on a physical assembly line ... at least in theory.

An automotive assembly line produces only one model. Some special purpose computer pipelines execute only a single instruction over and over.An example is a pipelined multiplier sometimes used in high speed digital signal processing. The multiplier operation might require 4 clock cycles to complete, but if 4 multipliers are connected like an assembly line, a new multiply can be started each clock. A portion of four multiplies can be performed at the same time, and a multiply product can be produced on each clock. The results of this pipelined multiplier will be delivered 4 times faster than a single multiplier.



General purpose computers on the other hand execute many different instructions. A pipeline can accelerate a general purpose computer program if the many different instructions can be crafted so they each have the same number of steps.

One of the byproducts of reduced instruction set computer (RISC) architectures was the regular structure of their instructions.A RISC machine need not be pipelined, but RISC instructions could be pipelined much easier than the variable length instructions common in complex instruction set computers (CISC).

A simple RISC pipeline might break its instructions into 5 steps.

Each step required one clock period.Since one step from five different instructions was executed during each clock period, one instruction completed each clock, unless there was a "problem."

In a perfect world, a CPU with a 5 stage pipeline would execute programs 5 times faster than a non-pipelined CPU. In the real world, not so much.

Two Steps Forward, One Step Back
Computer programs are sequential.That means that one instruction executes after another.Pipelines execute pieces of multiple instructions in parallel. It is therefore very easy for the pipeline to get the logical intent of the program out of sequence. A simple example is two consecutive operations that change the value of a register. The first instruction may execute the operation that changes the value of the register but the second instruction has fetched the register before the changed value has been written back.

There are several ways to resolve this synchronization problem. A simple instruction decoder can detect the possible problem and stall the pipeline logic long enough for the register value to be written back. The stall also slows down execution of the program.

A more sophisticated decoder can detect the problem and try to execute another instruction or instructions while waiting for the synchronization problem to be resolved. This is called "out of order execution."1 It adds significant complexity to an architecture, but can do useful work during the computer cycles that would be stalled otherwise.

A computer's architecture includes a fixed number of registers. These high-speed memory locations can be used to perform operations much faster than ordinary memory. Compilers optimize programs in order to perform operations using registers as often as possible. The primary disadvantage of registers is that they require instruction bits to select them. For example, five opcode bits can select 32 registers in an instruction. For architectures that specify two operands and a destination register, 15 bits of an instruction are consumed just to select registers. For this reason, very few CPU architectures include more than 32 registers.

If a register is in use by an instruction in the pipeline, that register is unavailable until the instruction has completed its modification of the register. Another instruction in the pipeline that uses the same register must wait until the register becomes available. Another technique to avoid stalling the CPU during this wait is called "register renaming."2 A CPU may only have 16 registers defined by its instruction set, but in reality there might be 64 unnamed physical registers present.For example, the instruction decoder might be able to determine that register R4 is used by two different sections of a program, but one section does not depend on the results of the other. The instruction decoder could then rename one of the unnamed physical registers creating an additional temporary R4 register. When the execution sequence was completed, the temporary R4 register would be "retired." Retired means the duplicate register would return to its unnamed state.

An even nastier problem occurs following a conditional branch; the deeper the pipeline, the nastier the problem.One Xeon version reportedly has a 31 deep pipeline.3 Each instruction takes 31 steps to complete. This means that the Xeon could possibly load 30 instructions into the pipeline, begin executing them, and have to stop. Upon fetching the 31st instruction a conditional branch that had been previously fetched could cause the program to jump to a different program area.

The situation is called "branch mis-prediction." The entire pipeline must be reloaded and all the steps done on those 30 instructions will be lost.Furthermore, the power required to execute those 30 instructions will be wasted as heat.The situation will be even more wasteful if a memory access was required by those lost instructions.

Dozens of strategies have been invented to minimize the problems caused by branches within pipelines. The simplest branch prediction strategy is to assume that a branch will always be taken. This can be effective since branches are commonly found at the bottom of software loops, and loops usually run many iterations before completing.

All branches are not found in loops.Also loops can be nested such that sometimes the inner branch will not be taken but the outer branch will be. Sophisticated statistical circuitry can correctly predict many common patterns. That prediction circuitry increases logical complexity, requires thousands of transistors, and increases chip power dissipation.

To further consume transistors and dissipate power, some architectures do "speculative execution."4 Such a CPU will fetch, decode, and execute instructions from multiple possible execution paths using multiple pipelines. When the state of the conditional branch is determined, the instructions from the unused execution path will be discarded and the correct pipeline selected. An aggressive speculative execution architecture can overcome inefficient branch prediction at the cost of lots of redundant circuitry.

Simultaneous Multithreading
A multi-issue deeply pipelined architecture trades off code acceleration for complexity and redundancy. Much of the redundant circuitry is idle during program execution.Sometimes this logic can be put to useful work with "Simultaneous Multithreading."5 Intel calls their version "Hyper-threading."6 Hyper-threading attempts to use the idle logic to run instructions from another program.For example, if one program is using the integer ALU and another one the floating point ALU, it might be possible to allocate steps of the pipeline to two or more independent programs.

Multithreading is implemented by creating multiple sets of registers and latches to hold the state of the multiple programs. The correct register set can be selected for the right pipeline resource by a very smart compiler. From the programmer’s perspective the multiple register sets appear to be multiple CPU cores. Unsophisticated computer buyers often view multithreading the same way. Multithreading sounds sort of like multi-core. Multi-core is good, so multithreading must be good too.


In reality, it is very difficult to find threads from multiple programs that can efficiently run on one execution unit. The improvement is very application dependent. Multithreading can actually slow down computer operation by causing additional cache misses requiring memory access.

According to Intel's marketing description, Hyper-threading can:7

  • Run demanding applications simultaneously while maintaining system responsiveness
  • Provide faster response times for Internet and e-commerce applications, enhancing customer experiences
  • Increase the number of transactions that can be processed simultaneously
  • Utilize existing 32-bit applications technologies while maintaining 64-bit future readiness
A more objective source of Hyper-threading benchmarks found that in some cases, "... the slowdown due to Hyper-Threading was less than five percent."8

A further analysis of the various problems and solutions regarding instruction level parallelism can be found in Ch 2 of John Hennessy and David Patterson's book.9

What Does It All Mean?
The implementation of extensive CPU parallelism combined with Denard Scaling and Moore's Law process enhancement delivered nearly half a century of unprecedented productivity growth directly tied to computer performance.

However that parallelism has reached an impenetrable wall of logical complexity and power consumption that renders further improvement incremental at best and may actually cause execution speeds to regress.

The modern microprocessor era kicked off with the penetrating 1980 RISC processor analysis by Patterson and Dave Ditzel.10 The insights of that study directly led to the SPARC CPU and indirectly to MIPS, the Pentium, and all of its descendants.

However the latest edition of Patterson's book11 is definitely pessimistic:

"...the exploitation of ILP (instruction level parallelism) - namely, speculation - is inherently inefficient."

"...we have reached - and, in some cases, possibly even surpassed - the point of diminishing returns in our attempt to exploit ILP."

"...lower CPIs (clocks per instruction) are likely to lead to lower ratios of performance per watt, simple due to overhead."

"The complexities of implementing these capabilities is likely to mean sacrifices in the maximum clock rate."

"...the widest-issue processor...is the Itanium 2, but it also has the slowest clock rate, despite the fact that is consumes the most power!"

"...modern microprocessors are primarily power limited."

To be properly and profoundly depressed regarding the state of computing, an engineer must read the recent works of Patterson, Kogge, Sterling, Dobberpuhl, Moore, Yelick, Colwell, or the several hundred or so lesser lights. These are some of the smartest individuals of the planet, and they essentially predict the end of computer improvement.

Patterson said it best in his recent interview with the New York Times regarding yet another dismal study. "I’m pretty sure it means we need to innovate, since we don’t want to die!"12

Another Path
The Venray Technology engineers, computer architects, and programmers have been confined by Patterson's Walls for our entire careers. We have spent the past 7 years finding a path to break out.


In language that rings true 30 years later Patterson & Ditzel concluded their seminal work with these words, "While the trend towards architectural complexity may be one path towards improved computers, this paper proposes another path."13

We at Venray Technology believe we may have found another path.

Related links:
The future of computers - Part 1: Multicore and the Memory Wall
Future of computers - Part 2: The Power Wall

About the author:
Russell Fish III's three-decade career dates from the birth of the microprocessor. One or more of his designs are licensed into most computers, cell phones, and video games manufactured today.

Russell and Chuck Moore created the Sh-Boom Processor which was included in the IEEE's "25 Microchips That Shook The World". He has a BSEE from Georgia Tech and an MSEE from Arizona State.

Footnotes:
1. http://en.wikipedia.org/wiki/Out-of-order_execution

2. http://en.wikipedia.org/wiki/Register_renaming

3. http://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=1090&context=etd_projects

4. http://en.wikipedia.org/wiki/Speculative_execution

5. http://en.wikipedia.org/wiki/Simultaneous_multithreading

6. http://en.wikipedia.org/wiki/Hyper-threading

7. http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html

8. http://www.bit-tech.net/hardware/cpus/2009/09/08/intel-core-i5-and-i7-lynnfield-cpu-review/10

9.http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901

10. http://books.google.com/books?id=I7o8teBhz5wC&pg=PA135&dq="The+case+for+the+reduced+instruction+set+computer

11. http://www.amazon.com/Computer-Architecture-Quantitative-Approach-4th/dp/0123704901

12. http://www.nytimes.com/2011/08/01/science/01chips.html

13. http://books.google.com/books?id=I7o8teBhz5wC&pg=PA135&dq=%22The+case+for+the+reduced+instruction+set+computer




John.Bass

2/14/2012 4:57 PM EST

Once upon a time, my computers ALU was a single bit wide processing done with bit serial memory operations. An entire computer was measured in a few hundred bits of logic, and a few thousand bits of core memory registers, and a few million bits of drum storage.

When a bit in a flip flop is several tubes, with a short life, computers were pure KISS, but relatively large, hot, and power hungry per bit compared to today. It's been more than a few years since I've seen a Bryant drum spinning quietly. G15's, B200 series, and the like.

Today, a few trillion logic cells are easily available, and as the author notes, much of the logic in a modern microprocessor isn't actually doing the real work ... it's just meant to minimize pipe line stalls, provide caches, and do speculative prediction.

There is however another form of computing, where code is translated directly to logic, registers, and wiring. While the sequential semantics are honored, the actual parallelism is fully exploited. When there are a trillion gates available to construct registers, logic, and memory that actually perform work nearly every cycle.

We prototype with FPGA's ... but the same logic is easily reduced to ASIC's ... computing without the artificial boundaries imposed by instruction sets, ALU construction, ALU/memory pipelines, and branch prediction. Lumping all the data into a critical path memory/cache/register file is just an unnecessary serialization that logic computing leaves behind. Every memory object is an independent FF register, fully parallel, and not bottle necked as found in your classic CPU/Memory architectures.

A decade ago, the hardware engineers insisted that FPGA's, PLD's, and ASIC's were Verilog/VHDL only. Then SystemC became important for simulation. And then we find that people finally do get that you can write portable C code, that executes just fine without a CPU/Memory, and the other sequentially serial limitations of traditional computer architectures.

At first ... there was a scramble to bend C into an HDL with parallel extensions. That's not really necessary ... anyone that can write for a tiny embedded processor ... can also write C for logic execution.

As multithread programming becomes the norm, more an more programmers understand parallelism better. OpenMP is a nice start for a wonderful parallel C variant, that works just fine on common CPU architectures ... and can exploit logic execution environments naturally.

Sign in to Reply



HankWalker

2/14/2012 7:10 PM EST

ILP progress ended long ago. I listened to talks about embedding a processor in memory at Caltech in the late 1970s. What would be more useful than this history review is a detailed discussion of why a very simple processor tightly coupled to DRAM provides the right balance of computing, memory bandwidth, latency, energy efficiency amd cost compared to other approaches. Most obviously, compared to more complex processor in a logic process with eDRAM, in the style of the POWER7 (with L3 implemented in eDRAM). You would get much more processing performance although less memory. But that might make sense when coupled with off-chip memory. Another issue is that for most applications, many chips will be used, which presents the problem of the interprocessor communication system.

I would delete all the wacky statements about potential applications of the product. A great computer isn't going to bring about world peace. It didn't work with Colossus: The Forbin Project.

Sign in to Reply



rhfish

2/15/2012 9:43 AM EST

Re: eDRAM - In our original 1989 work, Moore and I described making the CPU from DRAM transistors. However the flurry of merged CPU/memory efforts in the mid to late 90s used embedded DRAM instead.

The primary advantage of adding eDRAM to an existing logic design is ease of implementation. Designers could use all the existing tools, logic libraries, and even existing architectures such as MIPS.

The primary disadvantage of eDRAM is stupendous cost. Logic process transistors cost 500x transistor made using DRAM process. Adding the steps to fabricate the eDRAM capacitors increases cost even more. (Confirm the DRAM cost for 1 billion transistors on www.dramexchange.com.) Furthermore eDRAMs are only about 1/4th the density and 1/2 the performance of similar true DRAMs.

However to design TOMI Aurora and TOMI Borealis we had to create our own logic library using DRAM process parameters. Autorouters were not happy with only 3 layers of metal in a DRAM, so our CPUs were laid out by hand.

If 1980 microprocessors cost $50k instead of $100 there would be no PC business, no cellphones, and no iPads. We thought that $2, fast, low-power, multicore chips optimized for Big Data might be interesting. Apparently so do a few others.

Russell Fish

Sign in to Reply



JMUNN

11/19/2012 1:09 PM EST

Personally, I believe in the many processor embedded in the RAM future. Scaled up to thousands and millions of little processors with their RAMs, there seems to be a light at the end of the tunnel of sequencial execution. This sort of massive parallelism with simple processing power seems to somewhat parallel the solutions nature has worked out with neural networks.

However, I also believe that it will take a change of the concepts of programming and also the goals of computing to move beyond the sequencial underpinnings and biases given by our computing history.

Neural network "learning" has not seemed to yield what we want but perhaps we can still crack the code of how to actually make good use of massive parallelism.

Best of luck in your endevour!

Sign in to Reply



Please sign in to post comment

Navigate to related information

Datasheets.com Parts Search

185 million searchable parts
(please enter a part number or hit search to begin)