San Mateo, Calif. -- MIPS Technologies Inc. will wield a new competitive weapon in the embedded-processor wars this week, when it introduces hardware multithreading as an optional extension to its 32-bit and 64-bit architectures.
The MIPS announcement will cover architectural definition only; no actual CPU cores using the technology are likely until next year. But by bringing a leading-edge concept from the server world to system-on-chip design, MIPS has suddenly made multithreading a talking point in cores for embedded applications.
The company will describe the multithread application-specific extension (MT-ASE), as it is known in MIPS parlance, at the Microprocessor Forum, which begins today in San Jose, Calif. MT-ASE is conceptually similar to the hardware multithreading support offered by IBM Corp.'s POWER5 architecture or the Hyper-Threading technology in Intel Corp.'s Pentium 4. Additions to the architecture permit the CPU to keep the state of several separate program threads on-chip at the same time and to switch from one thread to another within at most a few clock cycles. A hardware scheduler determines which thread will be accessed for each instruction issue slot.
Tags associated with each instruction as it flows through the machine indicate the thread from which the instruction was fetched. The CPU has multiple sets of general registers and control registers, one set assigned to each thread, and the tag tells the hardware which registers to use with each instruction. When an instruction in one thread causes a stall (a cache miss, a branch misprediction or an external bus cycle, for instance), the hardware can switch quickly to another thread without the overhead of a context switch. Cycles that would have been wasted in the stall, often many of them, are used instead to execute instructions from other threads.
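In rough terms, the per-slot scheduling decision can be modeled in software. The C sketch below is purely illustrative: the data structures, the round-robin policy and the four-thread limit are assumptions, not the MIPS design.

    /* Illustrative model of a per-issue-slot hardware thread scheduler.
     * The structures and round-robin policy here are assumptions. */
    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_THREADS 4

    struct thread_ctx {
        uint32_t gpr[32];  /* per-thread copy of the general registers */
        uint32_t pc;       /* per-thread program counter */
        bool stalled;      /* waiting on a cache fill, bus cycle, etc. */
    };

    static struct thread_ctx threads[NUM_THREADS];

    /* Each issue slot, pick the next thread that is not stalled;
     * return -1 if every thread is blocked and the slot is wasted. */
    int pick_thread(int last)
    {
        for (int i = 1; i <= NUM_THREADS; i++) {
            int t = (last + i) % NUM_THREADS;
            if (!threads[t].stalled)
                return t;  /* instructions from thread t carry tag t */
        }
        return -1;
    }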
In the server world, multithreading is seen as a way to hide memory latency. With CPU clock frequencies in the multiple-gigahertz range, the time it takes to fill a cache line after a miss can amount to hundreds of clocks. By switching threads, the CPU can use the waiting time effectively; in effect, cache misses become completely nonblocking, and the average time lost from cache misses approaches zero.
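The arithmetic is easy to sketch. With illustrative numbers (the 2 percent miss rate and 200-clock penalty below are assumptions, not figures from MIPS), the effect looks like this:

    /* Back-of-the-envelope latency-hiding arithmetic; all numbers
     * are assumed for illustration. */
    #include <stdio.h>

    int main(void)
    {
        double miss_rate = 0.02;     /* cache misses per instruction */
        double miss_penalty = 200.0; /* clocks to fill a line at multi-GHz */
        double base_cpi = 1.0;       /* clocks per instruction, no misses */

        /* Single thread: every miss stalls the pipeline. */
        double single = base_cpi + miss_rate * miss_penalty;
        /* Ideal multithreading: every stall cycle is filled with
         * instructions from other threads. */
        double multi = base_cpi;

        printf("effective CPI, single thread: %.1f\n", single);      /* 5.0 */
        printf("effective CPI, ideal multithreading: %.1f\n", multi); /* 1.0 */
        return 0;
    }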
In embedded systems, with much lower clock frequencies, the cache-miss penalty is much smaller. But it is still worth eliminating if hardware thread management can actually fit useful instructions into the few cycles during which the previously active thread was stalled waiting for a cache fill. There are other latencies as well, notably from slow peripheral devices.
But in the system-on-chip (SoC) world, there is another key reason to want hardware multithreading, said Tom Petersen, director of product marketing at MIPS (Mountain View, Calif.). Often, today's SoCs are designed with a DSP core or other elaborate coprocessor sitting on the bus right next to the CPU core. This additional core is not there because of a task the CPU can't do, particularly with the instruction-set extensions and coprocessor interfaces available on virtually all embedded-processor cores today. It is there because it is nearly impossible for a single-thread CPU, even one running a real-time operating system, to guarantee the hard real-time deadlines required with signal-processing code.
Using multithreading, Petersen said, a designer can dedicate one thread to signal processing and guarantee that this task will receive a minimum percentage of CPU cycles. That is not the same as guaranteeing a completion deadline, and it is still not entirely deterministic: The guarantee covers a portion of the overall cycles, not any particular cycles. But it is sufficient in many systems to meet real-time requirements, Petersen maintained.
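One simple way to picture such a guarantee is a weighted round-robin over issue slots. The C sketch below invents the policy and the 40 percent figure for illustration; it does not model the actual MT-ASE scheduling registers.

    /* Illustrative weighted round-robin giving one thread a minimum
     * share of issue slots; the policy and weights are assumptions. */
    #include <stdio.h>

    int main(void)
    {
        int dsp_weight = 4;   /* signal-processing thread: >= 40% of slots */
        int ctrl_weight = 6;  /* control-plane thread gets the rest */
        int dsp_credit = 0, ctrl_credit = 0;

        for (int slot = 0; slot < 10; slot++) {
            dsp_credit += dsp_weight;
            ctrl_credit += ctrl_weight;
            if (dsp_credit >= ctrl_credit) {
                printf("slot %d -> DSP thread\n", slot);
                dsp_credit -= dsp_weight + ctrl_weight;
            } else {
                printf("slot %d -> control thread\n", slot);
                ctrl_credit -= dsp_weight + ctrl_weight;
            }
        }
        /* Over any 10 slots the DSP thread gets exactly 4, but which
         * 4 depends on the interleaving: a share, not a deadline. */
        return 0;
    }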
This allows designers to move both control-plane and data-plane processing onto a single CPU without having to worry that OS overhead, external interrupts, cache misses on application code and the like will affect real-time performance. The critical task runs on its own virtual CPU, appearing to the application code as if it had a dedicated CPU in a symmetric-multiprocessing (SMP) system.
The MIPS model gives the application developer two approaches to multithreading. The first, called the Virtual Processing Engine model, is essentially heavyweight threading. In the VPE model, each thread appears to be running on its own CPU in an SMP system. Threads are launched and terminated by an SMP operating system, and communicate with each other through conventional memory-based mechanisms. The only difference between the VPE model and shared-memory SMP is that all the threads are in fact running on the same CPU, with the hardware thread manager controlling which thread is running when.
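Because the VPE model presents itself to software as ordinary shared-memory SMP, existing SMP code carries over. The POSIX-threads sketch below shows the style of programming involved; pthreads is a stand-in here, and the actual OS and API on a given part may differ.

    /* Under the VPE model, software sees ordinary shared-memory SMP.
     * POSIX threads serve as an assumed stand-in API. */
    #include <pthread.h>
    #include <stdio.h>

    static int shared_count;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        shared_count++;  /* conventional memory-based communication */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        /* The OS launches the thread; the hardware thread manager
         * decides when its instructions actually issue. */
        pthread_create(&t, NULL, worker, NULL);
        pthread_join(t, NULL);
        printf("count = %d\n", shared_count);
        return 0;
    }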
The second approach can use much lighter-weight threads and is potentially more efficient, Petersen said. This scheme exposes the thread-support mechanism to the application level. New MIPS instructions permit an application to fork a new thread, to yield the CPU to another thread and to terminate a thread. As in the VPE model, each thread has its own register-set copy, and switching is handled by the hardware scheduler. Hardware registers determine the details of thread priority. But it is up to the operating software to keep track of threads, which share their context with their parent process, and to provide virtual thread support if the number of requested threads exceeds the number a particular CPU supports.
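The C sketch below illustrates that lightweight model. The mt_fork() and mt_yield() wrappers, their names and their calling convention are invented for illustration; on real hardware they would expand to the new instructions, and sequential stand-ins keep the sketch compilable anywhere.

    /* Sketch of the lightweight-thread model. The wrappers and their
     * semantics are assumptions; on MT-ASE hardware they would expand
     * to the new fork and yield instructions. */
    #include <stdio.h>

    static void mt_yield(void)
    {
        /* Would yield: hand the issue slots to other threads. */
    }

    static void mt_fork(void (*entry)(void *), void *arg)
    {
        /* Would fork: the hardware allocates a free thread context.
         * Software must virtualize threads if the requested count
         * exceeds what the CPU supports. */
        entry(arg);  /* sequential stand-in */
    }

    static void dsp_task(void *arg)
    {
        (void)arg;
        /* The new thread shares its parent's address space and
         * context; only the register set is private. */
        printf("running in a forked thread context\n");
        mt_yield();
    }

    int main(void)
    {
        mt_fork(dsp_task, NULL);
        return 0;
    }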
MIPS expects a range of its silicon partners to adopt the multithreading architecture, and is likely to release an intellectual-property core itself using MT-ASE. Applications will range from increasing the raw throughput of CPUs in high-end networking applications (in effect eliminating the need to migrate to on-chip SMP) to eliminating DSP cores and coprocessors from low-end consumer SoCs such as those in set-top boxes, MIPS believes.
The approach has both supporters and critics. One fan of the idea, if not of the competition, is Ubicom Inc. The Mountain View company has a proprietary processor with hardware support for threading very similar in concept to what MIPS will describe. Ubicom's processor has multiple register sets, a hardware scheduler and hardware allocation tables to drive the scheduler.
Ubicom CTO David Fotland dismissed the latency-hiding aspect of multithreading as an issue that's critical only in the server space. "You don't have really huge memory latencies in the SoC world," he said. "And if you did see a potential problem with memory latency, any SoC designer these days would put a small RAM on the chip to deal with it."
But Fotland said that multithreading has been highly successful for the purpose for which Ubicom uses it: creating a real-time context in which code can run to emulate peripheral devices. He said that a recent Ubicom chip uses the CPU to execute the majority of the functions of Ethernet media-access controllers, reducing the dedicated MAC hardware to "a thin layer just inside the I/O pins."
The Ubicom model interleaves instructions from different threads to blend the real-time peripheral tasks with applications running under an OS such as Linux. To a programmer who chooses to use the threading model, the threads look like tasks on an SMP system.
Other embedded-processor suppliers regard multithreading as futuristic, or as altogether inapplicable to the SoC world. In particular, those with compact cores suggested that at 90-nanometer or even 130-nm design rules, the difference in die area and cost between a single core with multithreading hardware support and multiple cores for on-chip SMP would be small compared with the entire SoC, and that most designers would find the SMP approach simpler. In addition, they suggested, keeping separate threads on physically separate processors could offer increased opportunities for power management.
One issue is memory bandwidth management. Both caches and synchronous DRAMs are highly sensitive to the pattern of accesses: Caches can thrash if most accesses fall outside a fairly narrow set of address ranges, and SDRAM throughput can drop substantially if the memories are not accessed in a pattern that minimizes precharge and page-miss delays.
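The effect is easy to demonstrate in miniature. In the C sketch below, both loops read the same data, but the strided walk, which resembles the traffic two interleaved threads might generate, touches a new cache line (and often a new SDRAM page) on every access. The array size and stride are assumptions.

    /* Two walks over the same array; only the access pattern differs.
     * Array size and stride are assumed for illustration. */
    #include <stdio.h>

    #define N (1 << 20)
    static int a[N];

    int main(void)
    {
        long sum = 0;

        /* Sequential walk: consecutive accesses share cache lines
         * and SDRAM pages. */
        for (int i = 0; i < N; i++)
            sum += a[i];

        /* Strided walk: with 4-byte ints, a stride of 16 touches a
         * new 64-byte line on each access. */
        for (int s = 0; s < 16; s++)
            for (int i = s; i < N; i += 16)
                sum += a[i];

        printf("%ld\n", sum);
        return 0;
    }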
MIPS' Petersen acknowledged that multithreading could compound the problems of memory optimization if it is not thought through; certainly, he said, it is possible for competing threads to pollute a cache. But he pointed out that in some cases, notably when one thread is working on control-plane tasks while another is processing streaming data, there would likely be no interference.
By giving designers a way to migrate from multiple large hardware cores to a single CPU core with hardware multithreading, MIPS, like Ubicom, is offering a potentially important savings in die area. The impact on energy consumption is less clear, since in general, instruction-based engines are less energy-efficient than hardwired machines at a given task.
The impact on software developers is not entirely clear. While both Ubicom and MIPS have attempted to make their application-level software models similar to models for SMP machines, differences remain, particularly with MIPS' decision to expose the details of thread management in one model.