SAN JOSE, Calif. - ARM Ltd. announced details of its next-generation core, the ARM10 Thumb, at the Embedded Processor Forum on Thursday (Oct. 15).
The ARM10 has been beefed up from previous ARM cores with instruction set enhancements, an optional floating-point coprocessor and 64-bit on-chip data paths. It will be a better-than-400-Mips, 32-bit processor aimed at driving a range of applications, from third-generation mobile cellular terminals through cable modems to consumer information appliances.
While shooting for high performance, ARM (Cambridge, England) has not abandoned its established strategy of keeping power consumption low: it predicts that the processor with caches will draw 600 mW at full performance. Power requirements will be proportionately lower if an application needs less performance and the clock frequency can be reduced.
But ARM's semiconductor partners will have to wait to get a chance to start making ASICs based on the ARM10. Tape-out of the design is not expected until the second quarter of 1999, and first silicon is not due until the middle of that year.
Dave Jaggar, director of ARM's design center in Austin, Texas, provided details of the ARM10 architecture, saying that next-generation systems will be characterized by sophisticated user interfaces that are rich with graphics, voice control and synthesis, and digital video. Such systems will also be networked, either wired or wirelessly, with high-bandwidth connections. These assumptions are the background against which his design team was working, Jaggar said.
"The move to 0.25- and 0.18-micron, and 2.5- and 1.8-V operation is the opportunity and how to stay in the sweet spot," said Jaggar. "To keep the area and power down, we avoided the complexity and cost of a full super-scalar machine. We still achieved our performance objectives by exploiting unique features of the ARM architecture to achieve a high degree of internal parallelism from a single-issue machine."
Even though the ARM10's transistor budget of 250,000 is more than double the budget for the comparable ARM9, the die area of the device is still small compared with mainstream microprocessors.
The cached version of the ARM10, the ARM1020, will be ready to run Windows CE and most other mainstream computer and real-time operating systems. With the addition of a floating-point unit, it will do real-time MPEG-2 decoding and 3-D graphics rendering.
When the core becomes available for licensing to ARM's semiconductor partners, it will be the highest performing part in the ARM family of cores, eclipsing the performance of the StrongARM SA-1, according to ARM.
The design introduces performance-enhancing extensions to the ARM instruction set while maintaining backward compatibility with previous generations of ARM cores.
The core is designed to deliver 400 Dhrystone 2.1 Mips at a 300-MHz clock frequency, and features the optional VFP10 vector floating-point unit, capable of delivering 600 Mflops. The addition of separate 32-kbyte on-chip instruction and data caches, a memory management unit and a bus interface forms the ARM1020T cached processor core.
To help keep the pipelines busy, ARM has opted for 64-bit-wide datapaths to connect the caches and the coprocessor interface to the integer core. These allow two instructions to be passed into the instruction prefetch unit every cycle, and allow load-multiple and store-multiple instructions to transfer two 32-bit registers every cycle.
ARM has also enhanced the Advanced Microcontroller Bus Architecture (AMBA) by widening the multi-master section of the two-section bus from 32 bits to 64 bits and by adding support for split transactions. With a half-speed, 150-MHz clock, AMBA can now provide 1-Gbyte/second bandwidth. The single-master peripheral bus side of AMBA remains at 16- or 32-bit width, thus preserving compatibility with previously designed peripheral components.
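As a rough check on the AMBA bandwidth figure, the peak transfer rate of a bus is its width in bytes multiplied by its clock rate. The sketch below assumes one full 64-bit transfer on every clock with no wait states or arbitration overhead, which is an idealization and not something ARM stated; the function name is purely illustrative.

```python
def peak_bandwidth_bytes_per_s(width_bits: int, clock_hz: float) -> float:
    """Peak bus bandwidth, assuming one full-width transfer every clock (idealized)."""
    return (width_bits / 8) * clock_hz


# 64-bit multi-master bus at the half-speed 150-MHz clock quoted in the article.
amba_peak = peak_bandwidth_bytes_per_s(64, 150e6)
print(f"{amba_peak / 1e9:.1f} Gbyte/s")  # prints "1.2 Gbyte/s"
```

Under that assumption the theoretical peak works out to 1.2 Gbyte/second, which is in line with the round 1-Gbyte/second figure quoted.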
Jaggar quoted the die size for the ARM1020T at around 50 square mm in 0.25-micron process technology, with a power consumption of 600 mW at 1.5-V operation and of 1 W at 2.0 V.
Jaggar said a 300-MHz clock frequency would be obtainable for the ARM10 at these voltages, allowing a performance-to-power efficiency in excess of 650 Mips/W.
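The efficiency and power figures can be cross-checked with simple arithmetic. The sketch below divides the quoted 400 Mips by the quoted 600 mW, then applies the textbook first-order model for CMOS dynamic power, in which power scales with the square of the supply voltage and linearly with clock frequency. That model ignores leakage and is an assumption used here for illustration, not a method ARM described; the function names are hypothetical.

```python
def mips_per_watt(mips: float, watts: float) -> float:
    """Performance-to-power efficiency."""
    return mips / watts


def scale_dynamic_power(power_w, v_from, v_to, f_from_hz=1.0, f_to_hz=1.0):
    """Scale power with the first-order model P ~ C * V^2 * f (leakage ignored)."""
    return power_w * (v_to / v_from) ** 2 * (f_to_hz / f_from_hz)


# 400 Mips at 600 mW (the 1.5-V figure): about 667 Mips/W.
efficiency = mips_per_watt(400, 0.6)

# Same clock, supply raised from 1.5 V to 2.0 V: about 1.07 W.
p_2v = scale_dynamic_power(0.6, 1.5, 2.0)

# Same 1.5-V supply, clock halved from 300 MHz to 150 MHz: 0.3 W.
p_half = scale_dynamic_power(0.6, 1.5, 1.5, 300e6, 150e6)

print(f"{efficiency:.0f} Mips/W, {p_2v:.2f} W at 2.0 V, {p_half:.2f} W at half clock")
```

On these assumptions, the result of roughly 667 Mips/W agrees with the "in excess of 650" claim, and the voltage-squared estimate of about 1.07 W is close to the quoted 1 W at 2.0 V.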
The new instruction set, denoted ARM version 5T, includes the so-called "Thumb" capability which allows the use of a 16-bit instruction format.
Thumb versions of ARM processor cores, which use 16-bit wide external memory banks and allow reduced system complexity and power consumption, have proved the most popular versions of the ARM architecture for use in such products as mobile phones.
New instructions within version 5T include CLZ [count leading zeros], which speeds up normalizing operations and integer divide, and instructions that support cross-calling of routines between sections of code written using either the Thumb or the full ARM instruction set. "We're now in the position of getting good product and customer feedback," said Jaggar. "For example, we're seeing people writing code very purposefully to make use of Thumb and ARM instruction sets." A third area of enhancement is additional support for debugging in software, enabling debugging of individual tasks in a multitasking environment.
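To illustrate why a count-leading-zeros operation helps with normalization, the sketch below emulates a 32-bit CLZ in Python and uses it to normalize a value in one step, rather than shifting and testing one bit at a time. This is a software model for illustration only, not ARM code; the helper names clz32 and normalize32 are hypothetical.

```python
def clz32(x: int) -> int:
    """Count leading zero bits of a 32-bit unsigned value (32 when x is 0)."""
    x &= 0xFFFFFFFF
    return 32 - x.bit_length()


def normalize32(x: int):
    """Shift x left until bit 31 is set; return (normalized value, shift count)."""
    shift = clz32(x)
    if shift == 32:          # zero has no set bit to normalize
        return 0, 0
    return (x << shift) & 0xFFFFFFFF, shift


value, shift = normalize32(0x000000F0)
print(hex(value), shift)  # prints "0xf0000000 24"
```

With a single-instruction CLZ, the shift count is known immediately, so the loop a software normalizer or divide routine would otherwise need collapses to one count and one shift.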
Although the ARM10 integer core issues a single instruction per cycle, multiple function units, such as the ALU, multiply, branch, load-store and coprocessor units, can work on separate instructions in parallel when certain instructions take more than a single cycle to execute. The use of optimizing compilers allows the best use of resources in these circumstances. ARM10 also includes a 32 x 16 multiply-accumulate array to provide fixed-point DSP support.
One of the most significant enhancements to the ARM10 architecture over previous versions, according to Jaggar, lay not in the core but in the cache architecture.
"The data cache supports non-blocking hit-under-miss operation," he said. "That means we can keep executing other instructions when we have a cache miss. It's totally borrowed from mainframe design, but this is the first time you'll see it in a chip drawing less than 1 W."
Describing the performance benefit of this approach, Jaggar said: "For Dhrystone [benchmark] it makes a big fat zero, because the Dhrystone benchmark stays in the cache and never misses. But for something like Windows CE using certain applications you could expect to see a two-times speed up."
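The effect Jaggar describes can be seen in a toy cycle-count model. The sketch below is not a model of the ARM1020T itself: it assumes one instruction per cycle, a fixed miss penalty, and at most one outstanding miss, and the trace format and function name are invented for illustration.

```python
def run_cycles(trace, miss_penalty, hit_under_miss):
    """Count cycles for a toy trace.

    Trace items: "alu" (independent work), "hit" (load that hits),
    "miss" (load that misses), "use" (needs the data from the last miss).
    """
    t = 0           # current cycle
    data_ready = 0  # cycle at which the outstanding miss completes
    for op in trace:
        if op == "miss":
            if hit_under_miss:
                t = max(t, data_ready)  # only one miss may be outstanding
                t += 1                  # issue the load, then keep going
                data_ready = t + miss_penalty
            else:
                t += 1 + miss_penalty   # blocking cache: stall for the refill
        elif op == "use" and hit_under_miss:
            t = max(t, data_ready) + 1  # wait for the missing data if needed
        else:
            t += 1                      # alu, hit, or use after a blocking miss
    return max(t, data_ready)


missy = ["miss", "alu", "alu", "alu", "alu", "use"]
cached = ["hit", "alu", "hit", "alu"]
print(run_cycles(missy, 10, False), run_cycles(missy, 10, True))    # prints "16 12"
print(run_cycles(cached, 10, False), run_cycles(cached, 10, True))  # prints "4 4"
```

In this simplified model, the trace that never misses takes the same number of cycles either way, matching the "big fat zero" for Dhrystone, while the trace with a miss finishes sooner because independent instructions overlap with the refill. The size of the gain depends entirely on the miss rate and on how much independent work follows each miss.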
The VFP10 executes single- and double-precision floating-point math to the IEEE 754-1985 standard, but uses a relatively simple multiply-add pipeline, with divide, remainder and square root implemented as iterative processes. This kept the gate count of the floating-point unit down, Jaggar said. In the case of exceptions, software is used either to look up cases or to execute algorithms on the ARM10.
The VFP10 uses a register bank consisting of 32 single-precision values or 16 double-precision values for its operations. Entries can be used as a vector of data, which allows a single instruction to operate on multiple data values. "It will operate sequentially and take a number of cycles," said Jaggar. "But having started off the process with a single instruction the VFP can be busy while other instructions are processed in the ARM10."
But "the VFP is what's really new," said Jim Turley, senior editor with Microprocessor Report, and an organizer of the Embedded Processor Forum. "It's clever from an engineering point of view, and useful from a commercial point of view."
Turley said the ARM10 could be applied in both portable battery-operated applications, and in "tethered" digital consumer applications that require higher performance and allow higher power consumption. "You can dial in the clock speed and the power consumption, and that leaves the older cores for lower cost applications," he said.
ARM is "finding it harder and harder to preserve the charms of the ARM architecture," Turley said. "They're having to use the same techniques of other processor developers [to improve performance], but not super-scalarity, which does add complexity."
Although ARM has tried to keep the ARM10 and ARM1020 simple, Jaggar said designing for deep-submicron technology was having its effects. His design team is considering issues of clock skew and different timings for data moving from different parts of the cache. As a result, ARM will offer the ARM1020T as the preferred design for licensing to its semiconductor partners. Many of these issues had already been worked out in that hard design, he said.