Any embedded programmer with even a passing familiarity with the PC business might have reason to despair. Computers were once a rich, thriving, vital, and exciting industry ripe with alternatives and possibilities, advances and innovations. Now it seems the computer industry has degenerated into a repackaging exercise for Intel (or AMD, or Cyrix) processors. One by one, Clipper, MIPS, PA-RISC, Alpha, and other once-mighty processors have faded from the scene, replaced by the ubiquitous
Pentium II and its descendants. Don’t microprocessors matter anymore?
You bet they do. They just don’t matter much in the systems we call computers. But they do make all the difference in the world to the other 99.9% of microprocessor applications we lump under the generic heading of “embedded.” Far from collapsing, the number of embedded microprocessors is growing, with more on the way. Microprocessor architecture and instruction-set design are alive and thriving. And as the embedded
horizon expands, we veer further and further from the path once trodden by the computer designers before us. Embedded microprocessor architecture is decidedly different from the architecture that led to the rise — and fall — of so many computer companies.
Chip makers are always designing new instruction sets. Partly, it’s an ego thing: engineers design new microprocessors because they can. But the underlying truth is that embedded systems have different demands than mainstream computers. Embedded
systems need different microprocessors. One size most assuredly does not fit all.
Embedded vs. computers
There’s no denying that the mainstream computer world is collapsing around a single architecture: the x86. That’s because all computers do pretty much the same thing, and the only important criterion is performance. All the chip makers have pretty much figured out how to get performance, and it’s not controlled by CPU architecture, instruction sets, or RISC vs. CISC. It’s
controlled by semiconductor fabrication processes, and that’s the same for every company with the money to stay in the game. PowerPC, SPARC, MIPS, Alpha, Pentium — their speed is all controlled by the same semiconductor physics, and it’s nearly impossible to differentiate one chip from the other in any meaningful way.
Ah, but for embedded processors, it’s different. Embedded programmers and engineers do not value one-dimensional benchmark performance above all else. They
value power consumption, interrupt latency, media-processing ability, cost, code density, development support, and more. A microprocessor’s instruction-set architecture (ISA) affects all of these factors. That, and a healthy dose of optimism-stoked greed, is fueling new microprocessor design.
Every concept is up for grabs
If you were to develop your own microprocessor instruction set, where would you start? You could try eking the best performance out of the most commonly used
instructions — integer math and Boolean operations. But I suspect you wouldn’t get very far. Instruction sets don’t have much of an effect on the performance of basic integer code, as you’d find in a spreadsheet, user interface, or some simpler embedded applications. The implementation of that instruction set matters a great deal, but the ISA itself is hard to improve upon. Older chips (68020, 80386, Z80) are slower because they’re implemented inefficiently (by today’s standards), not
because they’re somehow permanently crippled.
Then there’s always the hoary old RISC vs. CISC debate. RISC is not inherently better than CISC, at least not for computer systems. Intel and AMD proved that. The best Pentium II chip is as fast as — or even faster than — the best SPARC, PowerPC, MIPS, or PA-RISC chip. (The last remaining exception is Alpha, which Intel now builds for Compaq.) If RISC really made a difference, Intel would be sucking exhaust fumes right now instead of taking over
the computer market.
(And please, when you feel compelled to rise to the righteous defense of your favorite processor and set me straight about how PowerPC, or SPARC, or the Z80 is “really” superior to the x86 and how the truth is being covered up by an insidious industry conspiracy led by inept journalists in Intel’s pocket, be sure to address your flames to /dev/null.)
Instruction-level parallelism (ILP) is another interesting, sexy concept that is also going nowhere on the desktop.
Most application software just doesn’t have much parallelism for a CPU to exploit. Today’s microprocessors wring all the ILP there is out of normal applications.
ILP becomes its own worst enemy. Even if you could find two, four, or 50 instructions to execute simultaneously, you probably couldn’t access all the data at once. Data starvation becomes a real problem. For high-end systems, bandwidth is more important than ILP. The processor in your Macintosh, SPARCstation, or PC spends more
cycles waiting on memory than it does processing instructions. (Depressing, huh?) Improving your “CPU efficiency” will simply increase the number of wait states.
Power and density
Instruction sets can affect power consumption. One of Motorola’s design goals during the development of the M-Core was that these 32-bit chips run at full speed over a 16-bit data bus. Why? So that the chips wouldn’t have to wiggle 32 data lines every time they needed to fetch a new instruction. Fewer
data lines means fewer electrical transitions, less power dissipated, and less electromagnetic radiation. It also means 16 fewer pins on a smaller, less expensive package — a nice side effect that I’m sure was not lost on M-Core’s designers.
It’s not just pins Motorola was trying to save with M-Core. Smaller instructions (usually) mean better code density, which means less memory required in the system. Some of Motorola’s newest processors have more memory transistors than logic
transistors. That is, the RAM and ROM take up more silicon than the CPU itself. Since silicon costs the same whether it holds memory or logic, it behooves Motorola — and every chip maker — to improve code density in order to cut the cost of their new chips.
You can’t fake code density. It’s an inherent characteristic of the instruction set and, unless you use postprocessed code compression like IBM’s CodePack (See “Let’s Get Small,” January 1999, p. 9.), you can’t get away from an ISA’s
natural code density.
Code density has never been an issue with mainstream computers. After all, what do you care how much disk space the latest version of Unix requires? But in the embedded realm, where applications sometimes have to ship in a limited ROM, code density takes on vital importance. Embedded chip makers know this and design their instruction sets accordingly. Performance comes second; code density often comes first.
Isn’t that special?
So if integer math, logical
operations, instruction-level parallelism, and even RISC are all dead ends, what’s left to innovate? Quite a lot, it turns out. But it helps to know your market and your applications. New embedded microprocessors jockey for your attention largely on the basis of their unusual, special-purpose instructions.
Hitachi’s newest generation of SH-4 chips (of which the SH7750 is the first example) includes a stunningly impressive matrix-transformation instruction. With a single instruction, the SH7750 can
multiply a 4x4-element matrix with a four-element array to produce a new four-element array. In one clock cycle. At 200MHz. And all of the numbers are single-precision IEEE floating-point values. I don’t think there’s another computer in the world, regardless of price, that can perform that function so quickly. It slings 288 bits of data around inside the chip every cycle (eight 32-bit elements, plus a 32-bit sum). It’s no wonder that Sega chose the SH7750 as the processor for its Dreamcast video-game console.
Sure, it’s a specialized instruction. After all, how often do you need to calculate the angle of refraction between a light source and a planar surface, or other 3D geometry transforms? (If you’re Sonic the Hedgehog, the answer is, a lot.) But that single instruction makes the SH7750 better suited to media-processing and 3D-graphics applications than competing chips that might run as fast but that don’t have the cool matrix-transform instruction.
Another example is the
multiply-accumulate (MAC) instruction. MACs are the lifeblood of most DSP chips, but a weird new concept to conventional microprocessors. Digital signal processing has become so important to embedded designers that most new processors now include one or more MAC instructions so they can make like a DSP. Whole new chip families have been created around this concept. Perennial startup company Hyperstone has a processor family that blurs the line between CPU and DSP. Chip giants like Motorola, Texas Instruments,
IBM, and others are also tinkering with the idea.
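What a MAC instruction buys is easiest to see in the inner loop of an FIR filter, sketched here in C (names and types are illustrative):

```c
/* The heart of almost every DSP routine: multiply-accumulate.
 * An FIR filter is just this loop, run once per output sample.
 * A DSP's MAC instruction does one multiply and one add per cycle,
 * keeping the running total in a hardware accumulator; a conventional
 * CPU without MAC needs separate multiply and add steps. */
long fir_sample(const int *coeff, const int *sample, int taps)
{
    long acc = 0;                            /* the hardware accumulator */
    for (int i = 0; i < taps; i++)
        acc += (long)coeff[i] * sample[i];   /* one MAC per filter tap */
    return acc;
}
```

A chip with a single-cycle MAC turns that loop into one instruction per tap, which is why the instruction shows up in nearly every new embedded design.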
Mighty Siemens created TriCore to mix DSP capabilities with microcontroller practicality. TriCore has the usual assortment of Boolean bit-manipulation instructions, like AND, OR, and XOR, but TriCore can “accumulate” the single-bit result with other Boolean instructions. You can write a complex sequence of ANDs, ORs, XORs, NANDs, XNORs, and so on, and the final result will reflect the logical AND of all these functions. This construct maps well
to common multipart comparisons. It allows programmers (and compilers) to efficiently encode complex relational tests. It also helps you avoid peppering your code with a lot of short conditional branches that bloat code size and cause pipeline bubbles in the processor.
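The kind of multipart test TriCore’s Boolean accumulation targets looks like this in C (a sketch; using bitwise & on the 0-or-1 comparison results mirrors the accumulate-AND sequence and avoids short-circuit branches):

```c
/* A multipart range check. Written with && each comparison typically
 * becomes a conditional branch; TriCore-style Boolean accumulation
 * instead computes each comparison as a single-bit result and ANDs
 * them together, branch-free. The bitwise & below expresses the same
 * idea in C. */
int in_window(int x, int y, int lo, int hi)
{
    return (x >= lo) & (x <= hi) & (y >= lo) & (y <= hi);
}
```

Four comparisons, one result, no pipeline bubbles.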
But special instructions can be misleading, too. The 680x0 has a wealth of wonderful features, such as a register-indirect, indexed, postincremented addressing mode that’s useful for array pointers. Wonderful stuff, really, and you
feel good about yourself when you finally find a use for it. It looks terrifically elegant in your assembly source code. Problem is, this convoluted addressing mode is no more efficient and no faster than doing the pointer arithmetic yourself and then using a simple address register as a pointer. It may look great, but it doesn’t buy you any cycles at run time.
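Seen from C, the trade-off looks like this (function names are mine): both loops below do the same work, and on the 680x0 the generated code for each costs about the same number of cycles. The fancy addressing mode saves source text, not time.

```c
/* Two ways to sum an array. The first maps naturally onto a
 * register-indirect, postincremented addressing mode -- the (An)+
 * form in 680x0 assembly. The second does the pointer arithmetic
 * explicitly. Neither is faster than the other at run time. */
int sum_postinc(const int *p, int n)
{
    int total = 0;
    while (n--)
        total += *p++;        /* postincrement: the elegant-looking form */
    return total;
}

int sum_indexed(const int *p, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += p[i];        /* explicit index arithmetic */
    return total;
}
```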
The 386 processor has a BTS (bit test and set) instruction that’s actually slower than a standard OR operation. Likewise, the BTR (bit
test and reset) and BTC (bit test and complement) instructions are slower than the AND and XOR functions, respectively. In a similar vein, you’re better off using a simple DEC (decrement) instruction followed by a JNZ (jump if not zero) instead of the 386’s LOOP instruction. The two-instruction combo is faster, and it has the advantage that you can use any register or memory location as your loop counter, not just the ECX register. On the plus side, the 386’s LEA (load effective address)
instruction is a very quick but little-used way to perform a multiply.
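LEA computes base + index*scale + displacement in one step without touching the flags, which is why `LEA EAX, [EAX + EAX*4]` multiplies EAX by five in a single fast instruction. The same strength-reduction trick, written out in C:

```c
/* LEA computes base + index*scale + displacement, so
 *   LEA EAX, [EAX + EAX*2]   ; EAX = EAX * 3
 *   LEA EAX, [EAX + EAX*4]   ; EAX = EAX * 5
 *   LEA EAX, [EAX + EAX*8]   ; EAX = EAX * 9
 * give cheap multiplies by 3, 5, and 9. The equivalent shift-and-add
 * strength reduction in C: */
unsigned times5(unsigned x) { return x + (x << 2); }  /* x*5 */
unsigned times9(unsigned x) { return x + (x << 3); }  /* x*9 */
```

Chain two LEAs and you can multiply by 10, 15, 45, and other small constants without ever issuing a MUL.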
Don’t like any of these chips? Then you can even design your own instruction set. ARC Cores sells synthesizable RTL models of a 32-bit microprocessor that you can tweak, adjust, modify, and extend to your heart’s content. Don’t like the way rotate-left-with-carry works? Change it. Always wanted a pixel dot-product function? Create it. Your own secret-recipe algorithm can become its own instruction (or two). The ARC design is a
real playground for the microprocessor do-it-yourselfer.
Newcomer TeraGen goes even a step further. Its eight-bit microprocessors have no native instruction set to speak of. Instead, the entire chip is configurable. Processor, peripherals, buses — everything. The first TeraGen chip emulates the venerable 8051, complete with UARTs and timers. Inside, the chip contains a number of very simple “microthread engines” that run very fast, at about 200MHz. Collectively, these microthread engines
are fast enough to emulate an entire microcontroller. TeraGen says future chips will be configured to emulate other eight-, 16-, and even 32-bit microprocessors, all using the same technique.
Not dead yet
This has been just a sampling. There are lots of other examples of interesting and unusual instruction sets, each with its own advantages and loyal following of programmers who wouldn’t use anything else. Some time in the future we’ll look at media-processing instructions, which are
becoming very valuable. Why else would Intel, Motorola, MIPS Technologies, Sun, AMD, and Digital develop MMX, AltiVec, MDMX, VIS, 3DNow!, and MVI, respectively?
Microprocessor design is far from dead. Instruction sets are not all the same. Your choice of microprocessor really does matter. Don’t give in. Resist the trend to standardize on the desktop leader. Support your local microprocessor designer.
Jim Turley is the senior editor of