MIPS Technologies Inc. will wield a new competitive weapon in the embedded-processor wars this week when it introduces hardware multithreading as an optional extension to its 32- and 64-bit architectures.
The MIPS announcement will cover architectural definition only--no actual CPU cores using the technology are likely until next year. But by bringing a leading-edge concept from the server world to system-on-chip design, MIPS has suddenly made multithreading a talking point in cores for embedded applications.
The company will describe the multithread application-specific extension (MT-ASE), as it is known in MIPS parlance, at the Microprocessor Forum, which begins today in San Jose. MT-ASE is conceptually similar to the hardware multithreading support offered by IBM Corp.'s Power5 architecture or Intel Corp.'s Hyper-Threading in the Pentium 4. Additions to the architecture permit the CPU to keep the contexts of several program threads resident in hardware at the same time, and to switch from one thread to another within at most a few clock cycles.
When an instruction in one thread causes a stall--for a cache miss, a branch misprediction, or an external bus cycle, for instance--the hardware can switch quickly to another thread without the overhead of a context switch. Cycles--often many cycles--that would have been wasted in the stall are used to execute instructions from other threads.
In the server world, multithreading is seen as a way to hide memory latency. With CPU clock frequencies in multiple gigahertz, the time it takes to fill a cache line after a miss can amount to hundreds of clocks. By switching threads, the CPU can use the waiting time effectively; in effect, cache misses become completely nonblocking, and the average time lost from cache misses approaches zero.
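The arithmetic behind that claim can be sketched in a few lines. The model below is a back-of-the-envelope illustration, not anything from the MIPS announcement: the miss rate, miss penalty and switch cost are assumed numbers, and it optimistically assumes there is always another runnable thread to cover a stall.

```python
# Toy model of hiding cache-miss latency with hardware multithreading.
# All parameters (miss rate, miss penalty, switch cost) are illustrative
# assumptions, not figures from MIPS or IBM.

def run_time(instructions, miss_rate, miss_penalty, threads, switch_cost=1):
    """Cycles to retire `instructions` per thread across `threads` threads.

    Single-threaded: every miss stalls the pipeline for `miss_penalty` cycles.
    Multithreaded: a miss triggers a hardware thread switch costing only
    `switch_cost` cycles; the stall itself is hidden as long as some other
    thread has instructions ready to issue.
    """
    misses = instructions * miss_rate
    if threads == 1:
        return instructions + misses * miss_penalty
    # Optimistic case: enough runnable threads to cover every stall.
    return threads * (instructions + misses * switch_cost)

# 1M instructions per thread, 2% miss rate, 50-cycle miss penalty:
single = run_time(1_000_000, 0.02, 50, threads=1)
quad = run_time(1_000_000, 0.02, 50, threads=4)
print(single / 1_000_000)  # cycles per instruction, single thread: 2.0
print(quad / 4_000_000)    # cycles per instruction, four threads: 1.02
```

With these assumed numbers, half the machine's cycles are lost to stalls in the single-threaded case; with four threads covering each other's misses, the effective cost of a miss collapses to the one-cycle switch overhead.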
In embedded systems, with much lower clock frequencies, the cache-miss penalty is far smaller in cycle terms. But it is still worth eliminating if hardware thread management can actually fit useful instructions into the few cycles during which the previously active thread was stalled waiting for a cache fill. There are other latencies as well, notably from slow peripheral devices.
In the SoC world there is another key reason to want hardware multithreading, said Tom Petersen, director of product marketing at MIPS, Mountain View, Calif. Often, today's SoCs are designed with a DSP core or other elaborate co-processor sitting on the bus right next to the CPU core. This additional core is not there because of a task the CPU can't do. It is there because it is nearly impossible for a single-thread CPU--even running a real-time operating system--to guarantee the hard real-time deadlines required by signal processing code.
Using multithreading, Petersen said, a designer can dedicate one thread to signal processing and guarantee that this task will receive a minimum percentage of CPU cycles. While this is not the same as guaranteeing a completion deadline, and it is still not entirely deterministic (the guarantee is of a portion of the overall cycles, not of any particular cycles), it is sufficient in many systems to meet real-time requirements, Petersen maintained.
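The distinction Petersen draws--a guaranteed share of cycles rather than guaranteed specific cycles--can be made concrete with a small scheduler simulation. The deficit-based policy and the weights below are illustrative assumptions; the article describes only the guarantee, not MIPS's actual scheduling hardware.

```python
# Toy issue-cycle scheduler: guarantee one thread ("dsp") a minimum
# share of CPU cycles. The weighted deficit policy is an assumption
# for illustration, not MIPS's documented mechanism.

def schedule(threads, cycles):
    """threads: {name: weight}. Returns cycles issued to each thread.

    Each cycle goes to whichever thread has fallen furthest behind its
    weighted fair share--so "dsp" gets its fraction of the cycles, but
    no particular cycle is promised to it.
    """
    issued = {name: 0 for name in threads}
    total_weight = sum(threads.values())
    for cycle in range(1, cycles + 1):
        # Pick the thread with the largest deficit against its fair share.
        pick = max(threads,
                   key=lambda n: cycle * threads[n] / total_weight - issued[n])
        issued[pick] += 1
    return issued

shares = schedule({"dsp": 1, "app": 3}, 1000)
print(shares)  # "dsp" receives 25% of the 1,000 cycles
```

This is exactly the property described: the signal-processing thread's cycle budget is enforced regardless of what the application thread does, but which cycles it gets depends on the interleaving--hence "not entirely deterministic."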
The MIPS model gives the application developer two approaches to multithreading. The first, called the virtual processing element (VPE) model, is essentially heavy threading. In the VPE model, each thread appears to be running on its own CPU as if it were in a symmetric-multiprocessing (SMP) system. Threads are launched and terminated by an SMP operating system, and communicate with each other through conventional memory-based mechanisms.
The second approach can use much lighter-weight threads and is potentially more efficient, Petersen said. This scheme exposes the thread-support mechanism to the application level. New MIPS instructions permit an application to fork a new thread, to yield the CPU to another thread, and to terminate a thread. As in the VPE model, each thread has its own register-set copy, and switching is handled by the hardware scheduler. Hardware registers determine the details of thread priority. But it is up to the operating software to keep track of threads, which share their context with their parent process, and to provide virtual thread support if the number of requested threads exceeds the number a particular CPU supports.
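The shape of this lightweight model can be sketched with Python generators standing in for hardware thread contexts. Everything here is an analogy: the class and method names are invented for illustration, the actual MIPS instruction semantics are not shown, and a real implementation switches in hardware rather than in a software loop.

```python
# Sketch of a fork/yield lightweight thread model, using Python
# generators as stand-in thread contexts. Names (ThreadManager, fork,
# worker) are hypothetical; they mirror the operations the article
# describes, not actual MIPS MT-ASE instructions.

from collections import deque

class ThreadManager:
    """Cooperative scheduler: each "thread" is a generator that yields
    to hand back the CPU, mimicking hardware round-robin switching."""

    def __init__(self, hw_contexts=4):
        self.hw_contexts = hw_contexts  # register-set copies in hardware
        self.runnable = deque()

    def fork(self, gen):
        # In the hardware model, a fork must fail (or trap to software
        # for virtualization) when all thread contexts are in use.
        if len(self.runnable) >= self.hw_contexts:
            raise RuntimeError("no free hardware thread context")
        self.runnable.append(gen)

    def run(self):
        trace = []
        while self.runnable:
            thread = self.runnable.popleft()
            try:
                trace.append(next(thread))    # run until it yields
                self.runnable.append(thread)  # requeue behind the others
            except StopIteration:
                pass                          # thread terminated
        return trace

def worker(name, steps):
    for i in range(steps):
        yield f"{name}:{i}"  # yield the CPU after each unit of work

mgr = ThreadManager()
mgr.fork(worker("A", 2))
mgr.fork(worker("B", 3))
print(mgr.run())  # ['A:0', 'B:0', 'A:1', 'B:1', 'B:2']
```

The `fork` failure path corresponds to the article's last point: when requested threads exceed the hardware's thread contexts, the operating software, not the hardware, must virtualize the overflow.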
MIPS expects a range of its silicon partners to adopt the multithreading architecture, and is likely to release an intellectual property core itself using MT-ASE. Applications will range from increasing the raw throughput of CPUs in high-end networking applications to eliminating DSP cores and coprocessors from low-end consumer SoCs such as those in set-top boxes, MIPS believes.
The approach has both supporters and critics in the industry. Some embedded-processor suppliers regard multithreading as future technology or as altogether inapplicable to the SoC world. In particular, those with compact cores suggested that at 90- or even 130-nm design rules, the difference in die area and cost between a single core with multithreading hardware and multiple cores for on-chip SMP would be small compared with the entire SoC, and that most designers would find the SMP approach simpler. In addition, keeping separate threads on physically separate processors could offer increased opportunities for power management, these critics suggested.
One fan of the idea, however, if not of the competition, is Ubicom Inc. The Mountain View company has a proprietary processor with hardware support for threading similar in concept to what MIPS will describe. Ubicom's design has multiple register sets, a hardware scheduler, and hardware allocation tables to drive the scheduler.
Ubicom chief technology officer David Fotland dismissed the latency-hiding aspect of multithreading, saying it's an issue that's critical only in the server space. "You don't have really huge memory latencies in the SoC world," he said. "And if you did see a potential problem with memory latency, any SoC designer these days would put a small RAM on the chip to deal with it."
But Fotland said that multithreading has been highly successful in the purpose for which Ubicom employs it: to create a real-time context in which code can be run to emulate peripheral devices.
Fotland said that a recent Ubicom chip uses the CPU to execute the majority of the functions of Ethernet media-access controllers, reducing the dedicated MAC hardware to "a thin layer just inside the I/O pins."