The unveiling of the Nintendo GameCube recently gave the industry an interesting look at where chip-level architecture may be going in the next few years. From the traditional point of view, the GameCube silicon looks entirely conventional: a these-days-modest 405-MHz PowerPC CPU on one chip, an ATI-supplied graphics engine on another. But what catches the eye on second glance is the way the Nintendo designers appear to have dealt with system performance not by throwing processing power at it, but by the elegant use of memory.
Like most modern system-on-chip designs, the GameCube includes both a large 24-Mbyte main memory and a variety of specialized memories distributed around the system. But the choices of memory type and location suggest the designers were much more concerned with latency than with raw bandwidth.
One indication is the use of large blocks of on-chip memory, provided by NEC's embedded-DRAM process. While embedded DRAM is still a questionable approach to cost reduction, it is well established as an approach to latency control.
Another indication is the use of MoSys DRAM, or 1-T SRAM, as the company prefers to call it, not just for local buffers but for the main memory. The MoSys scheme for organizing DRAM gives a combination of DRAM cell size and SRAM-like average latency, once again directly addressing the latency issue. Finally, there is the extensive use of caching, not only for the PowerPC but also for textures. Consistently, Nintendo engineers seem to have used memory type, speed and location to reduce the storage latency for computing and rendering tasks.
The message here is that the center of gravity in design-for-performance is shifting away from the processors. It is too grand a generalization, but in effect computing speed is a solved problem at the moment. What is lagging now, and hence what is limiting system performance, is relative storage latency: the delay involved in laying your hands on an object, relative to the time required to apply the appropriate methods to it.
Organizational tricks, such as nonblocking requests, ingenious task and data scheduling and so forth, can deal with underlying memory latency issues. But Nintendo's point seems to be that to make a system accessible to mere human programmers, the latency issues should be worked out in hardware, not left to the applications designer. The answer today seems to be on-chip, low-latency specialized memory blocks, not proprietary high-bandwidth external buses. That may be a crucial lesson for the next few years.
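To illustrate the kind of organizational trick meant here, the sketch below overlaps a memory fetch with useful work while walking a linked list, using the GCC/Clang `__builtin_prefetch` intrinsic as a stand-in for a nonblocking memory request. The node layout and workload are hypothetical; the point is that this burden falls on the applications programmer, which is precisely what hardware-level latency control would spare.

```c
/* Sketch of software latency hiding: request the next node from the
 * memory system before it is needed, so its fetch latency overlaps the
 * work on the current node. The data structure here is hypothetical;
 * __builtin_prefetch is a non-binding hint supported by GCC and Clang. */
typedef struct Node {
    struct Node *next;
    int payload;
} Node;

long sum_list(const Node *n) {
    long total = 0;
    while (n) {
        if (n->next)
            __builtin_prefetch(n->next);  /* start the next fetch early */
        total += n->payload;              /* "useful work" on this node */
        n = n->next;
    }
    return total;
}
```

Note how the correctness of the loop is untouched by the hint; only the timing changes, and only if the programmer remembered to put it there.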