According to conversations in the hallways of the CPU world, multithreaded architectures are about the best candidate to be the next big thing. The recent announcement of a thread-oriented processor by TeraGen Corp. and an IBM paper at Hot Chips next month on a multithreaded PowerPC suggest that multithreading has at last arrived.
The idea is simple. Suppose you have a superscalar processor. Lots of execution units, all sitting there waiting to be fed by a dispatch unit that scrambles like crazy to find instructions ready to execute. More often than not (OK, most of the time, unless we are running benchmarks) there are not enough instructions in the ready queue to keep the execution units busy.
But while the dispatch unit is frantically running out of instructions that it can dispatch from the current program thread, there are probably lots of other threads sitting in the i-cache, waiting for their turn.
Wouldn't it be nice if the dispatch unit could draw instructions from all the current program threads, or maybe all the current threads in the system, instead of from just the one active thread? Then the odds of keeping all the execution pipelines filled should be a lot better.
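To make the argument concrete, here is a toy sketch of the issue-slot arithmetic. It is not a model of any real processor: the four-wide machine, the per-cycle ready counts, and the function names are all made up for illustration, and it assumes that ready instructions that miss a slot simply vanish rather than wait for the next cycle.

```python
def issued_per_cycle(ready_counts, width):
    """Instructions issued in one cycle: the dispatch unit fills at most
    `width` execution units from whatever instructions are ready."""
    return min(width, sum(ready_counts))


def utilization(ready_by_thread, width):
    """Fraction of issue slots filled over all cycles.

    ready_by_thread[t][c] is the number of ready instructions that
    thread t offers in cycle c (hypothetical numbers, not measurements).
    """
    cycles = len(ready_by_thread[0])
    issued = sum(
        issued_per_cycle([thread[c] for thread in ready_by_thread], width)
        for c in range(cycles)
    )
    return issued / (width * cycles)


WIDTH = 4  # assumed number of execution units

# Made-up ready-instruction counts per cycle for three threads.
thread_a = [1, 0, 2, 1]
thread_b = [2, 1, 0, 3]
thread_c = [0, 2, 1, 1]

single = utilization([thread_a], WIDTH)                     # one active thread
multi = utilization([thread_a, thread_b, thread_c], WIDTH)  # draw from all three

print(f"single-thread utilization: {single:.4f}")
print(f"multithreaded utilization: {multi:.4f}")
```

With these invented numbers the single thread fills a quarter of the slots, while the blended stream fills most of them. Real workloads will behave differently; the point is only that pooling ready instructions can never fill fewer slots than one thread alone.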
There are complications, of course.
If you are going to blend instruction streams from different threads, you have to have different instances of the CPU state (the general registers, special registers, caches, MMU and so forth) all available, so that each instruction finds the register file, caches, memory map, status registers, etc. that are appropriate for its thread. Also, you have to deal in some way with issues like a group of registers that is shared among several tasks.
These things take careful thought and lavish use of extra hardware: loads of general registers so you can keep a register file image for each thread, for example.
But extra hardware doesn't represent much of an expense these days, if you can keep the interconnect local and straightforward.
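The per-thread state can be pictured with a small software sketch. Everything here is hypothetical: the `ThreadContext` and `Instruction` names, the 32-register file, the 4-byte instruction size, and the single add operation are assumptions chosen for illustration, and caches, the MMU and shared registers are left out entirely.

```python
from dataclasses import dataclass, field

NUM_REGS = 32  # assumed register file size


@dataclass
class ThreadContext:
    """One private image of the CPU state for a single thread."""
    pc: int = 0
    status: int = 0
    regs: list = field(default_factory=lambda: [0] * NUM_REGS)


@dataclass
class Instruction:
    """An add instruction tagged with the thread it belongs to."""
    thread_id: int
    dest: int
    src_a: int
    src_b: int


def execute_add(contexts, insn):
    """Run one add against the register file of the owning thread only."""
    ctx = contexts[insn.thread_id]
    ctx.regs[insn.dest] = ctx.regs[insn.src_a] + ctx.regs[insn.src_b]
    ctx.pc += 4  # assumed fixed 4-byte instructions


# Two threads use the same register numbers but hold different values.
contexts = [ThreadContext(), ThreadContext()]
contexts[0].regs[1], contexts[0].regs[2] = 10, 20
contexts[1].regs[1], contexts[1].regs[2] = 1, 2

# A blended stream: the same instruction, once per thread.
stream = [Instruction(0, 3, 1, 2), Instruction(1, 3, 1, 2)]
for insn in stream:
    execute_add(contexts, insn)

print(contexts[0].regs[3], contexts[1].regs[3])
```

The thread tag on each instruction is what selects the right register file, so the two adds never see each other's operands even though they name the same registers.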
The big challenge would appear to be finding the right trade-off between increased parallelism, which brings higher throughput, and added complexity, which forces lower clock frequency. Complexity in peripheral areas, like cache controllers and MMUs, can be particularly insidious. But as we move toward 0.1-micron processes, the trend seems to be toward complexity, even at the expense of frequency.
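The trade-off can be put as back-of-the-envelope arithmetic, treating throughput as instructions per clock times clock frequency. The figures below are invented for illustration and describe no actual design; the function names are likewise hypothetical.

```python
def throughput(ipc, clock_mhz):
    """Millions of instructions per second for a given IPC and clock."""
    return ipc * clock_mhz


def breakeven_clock(base_ipc, base_clock_mhz, new_ipc):
    """Lowest clock at which the more complex design still matches
    the baseline's throughput."""
    return base_ipc * base_clock_mhz / new_ipc


# Hypothetical baseline: one thread, simpler and faster clock.
base = throughput(ipc=1.0, clock_mhz=500.0)

# Hypothetical multithreaded design: better slot usage, slower clock.
threaded = throughput(ipc=2.0, clock_mhz=400.0)

floor = breakeven_clock(base_ipc=1.0, base_clock_mhz=500.0, new_ipc=2.0)

print(base, threaded, floor)
```

Under these assumed numbers the multithreaded design comes out ahead as long as its clock stays above the break-even figure. If added complexity pushed the clock below that point, the extra parallelism would no longer pay for itself.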