Systems with multiprocessing capabilities typically come in fixed sizes, with usually two, four, eight or 16 processors and so on. This staircase approach to multiprocessing works well in situations where compute needs can be clearly quantified and remain relatively static for long periods. Employed primarily in larger central database applications, these systems provide impressive compute throughput, processing enormous quantities of data.
However, when compute needs increase, the only option is to go out and buy the next larger multiprocessing system and rewrite the application to take advantage of the additional compute resources. Consequently, these fixed-size processor systems force companies to stay within their established compute envelope or else upgrade to the next-size system.
This limited paradigm for multiprocessing is much too restrictive for Internet-centric computing. The developments of the past few years have dramatically demonstrated that Web-based companies can grow exponentially in a matter of months or even weeks. This trend will only accelerate as financial transactions become more secure. E-commerce should explode in the next year and dramatically escalate the volume of business conducted over the Internet. To be successful, Web-based businesses will need to scale their compute resources quickly to meet demand. But scaling with a staircase approach to multiprocessing is no easy task. It can take weeks of effort, a lifetime on the Internet, posing a major threat to the success of the business operation.
Recognizing the need for a more flexible method for scaling, Sun Microsystems has devised a new approach to multiprocessing with its UltraSparc III microprocessors. These chips were carefully designed from the ground up to support multiprocessing, making it easy to add more processing resources when needed. These chips can gracefully scale from two- to 1,000-way processing without requiring massive hardware and software redesign. To appreciate the elegance of this new approach, it is important to understand how multiprocessing works today.
Traditionally, there are two methods for implementing multiprocessing: shared memory and what is typically referred to as clustering. The shared-memory approach is based on multiple groups of general-purpose processors sharing a single memory image, with coherence maintained through "snooping." All the processors within a group use one central memory controller to talk to memory. As a result, every memory interaction and communication must pass through this single point in the snooping-coherence domain.
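To make the bottleneck concrete, here is a toy model of a snooping-coherence group: every processor routes all loads and stores through one shared memory controller, which is what lets it observe every transaction, and also what serializes them. The class names are illustrative stand-ins, not any vendor's API.

```python
# Toy snooping-coherence domain: one controller sees (and serializes)
# every memory transaction from every processor in the group.

class MemoryController:
    def __init__(self, size):
        self.memory = [0] * size
        self.transactions = 0      # every access funnels through here

    def load(self, addr):
        self.transactions += 1
        return self.memory[addr]

    def store(self, addr, value):
        self.transactions += 1     # all writers serialized at this one point
        self.memory[addr] = value

class Processor:
    def __init__(self, controller):
        self.controller = controller

    def write(self, addr, value):
        self.controller.store(addr, value)

    def read(self, addr):
        return self.controller.load(addr)

controller = MemoryController(size=1024)
cpus = [Processor(controller) for _ in range(4)]   # ~4 CPUs per controller

cpus[0].write(0, 42)
print(cpus[3].read(0))            # 42: coherent, but via one shared point
print(controller.transactions)    # 2: both accesses counted centrally
```

The coherence comes for free precisely because nothing bypasses the controller, which is also why its electrical load caps how many processors a group can hold.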
However, there are strict limits to expandability when using the shared-memory approach to multiprocessor designs. Because of the electrical load on a bus segment, one memory controller is usually required for every four processors. It is possible to interleave memories, but only up to a point: some memory controllers can sustain up to 16 interleaved memories, for example, but that is the limit.
Beyond that, it is simply not possible to expand the system without adding groups, each with its own system memory controller. What is called a second-level bus connects these groups to make them act like one big system. Repeaters are used to interconnect the various memory subsections, but these unfortunately add substantial memory latency. In fact, the latency often jumps a whole order of magnitude, going from hundreds of nanoseconds to microseconds. To accommodate such a dramatic change in latency, the application running on the system typically needs to be rewritten to adapt to the increased delays.
To circumvent the latency problem that comes with a shared-memory system, some systems implement an alternative multiprocessing architecture. Commonly referred to as clustering, it relies on a network to communicate among the different groups. These message-passing systems know how many groups there are and provide what is called intracluster communication through messages. But here again, adopting this approach requires major modification to the application to make it work. Moreover, some applications are not amenable to message passing at all; their performance takes a nose dive. Clustering is best suited to legacy applications that were built to take advantage of messaging. Obviously, this is not an attractive option for Web-based businesses that need to react to increased demand in almost real-time.
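The structural difference from shared memory can be sketched in a few lines: in a cluster, each node owns private memory, and the only way to reach another node's data is an explicit message and reply. This is a generic illustration of the programming model, not any particular cluster API.

```python
# Minimal message-passing sketch: nodes have private memory and
# communicate only through explicit messages.

class Node:
    def __init__(self, name):
        self.name = name
        self.memory = {}       # private; no other node can touch it
        self.inbox = []

    def send(self, other, msg):
        other.inbox.append((self.name, msg))

    def handle(self):
        # Service one request: a remote read of a key in local memory.
        sender_name, key = self.inbox.pop(0)
        return self.memory.get(key)

a, b = Node("a"), Node("b")
b.memory["row_17"] = "customer record"

# Node a cannot dereference b's memory directly; it must ask and wait.
a.send(b, "row_17")
reply = b.handle()
print(reply)   # "customer record", after a full message round-trip
```

Every remote access becomes a round-trip like this one, which is why applications not written around messaging see their performance collapse when moved onto a cluster.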
Another major task when employing multiple groups is keeping track of all the memory in the system. A directory, typically built out of SRAMs for performance reasons, tracks which node holds what memory. But SRAMs are expensive, so the directory is kept as small as possible in a design. Consequently, when another group is added later, the directory is often not large enough to accommodate the expansion. In effect, the size of the directory dictates the maximum memory size.
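The sizing problem can be seen in a toy directory: a fixed-capacity table (standing in for the expensive SRAM) maps memory blocks to owning nodes, and once it fills, memory from a newly added group simply cannot be tracked. The capacity and names below are illustrative.

```python
# Toy central directory: fixed capacity chosen at design time, because
# SRAM is costly. Once full, new groups' memory cannot be registered.

DIRECTORY_ENTRIES = 4          # deliberately small, as in a cost-bound design

directory = {}                 # block id -> owning node

def register_block(block, node):
    if block not in directory and len(directory) >= DIRECTORY_ENTRIES:
        raise RuntimeError("directory full: cannot track new memory")
    directory[block] = node

for block in range(4):
    register_block(block, node=0)        # the initial group fits exactly

try:
    register_block(4, node=1)            # a later-added group does not
except RuntimeError as err:
    print(err)                           # directory full: cannot track new memory
```

A real directory is far more elaborate, but the failure mode is the same: the table's size, fixed when the machine was built, caps how far memory can grow.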
Putting memory control on-chip
Aware that the limitations of today's multiprocessing schemes are quickly becoming a serious drawback, Sun developed a new approach with its next-generation microprocessor. The UltraSparc III, built in 0.18-micron CMOS, is a 64-bit processor designed to run at 750 MHz. It incorporates new capabilities that enable systems to expand quickly from just a few processors to hundreds without rewriting applications or adding extensive circuitry.
Employing a methodology devised by Sun called Scalable Shared Memory (SSM), the UltraSparc III processor eliminates the need for an external memory controller. It does this by placing the memory controller on the processor itself, so the processor can talk directly to memory. As a result, memory size and performance scale naturally with additional processors, because every processor brings its own on-chip memory controller. This also makes it possible for memory to be interleaved across as many processors as required.
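The scaling property can be sketched as follows: each processor carries its own memory, and addresses are interleaved across however many processors are present, so capacity and controllers grow in the same step. This is a hypothetical illustration of the idea, not Sun's actual SSM implementation.

```python
# Sketch of per-processor memory with interleaving: each CPU brings its
# own controller and DRAM, and consecutive addresses rotate across CPUs.

class ProcessorWithMemory:
    def __init__(self, size):
        self.local_memory = [0] * size   # controller + DRAM travel with the CPU

def owner(addr, processors):
    # Simple interleave: address modulo processor count picks the owner.
    return processors[addr % len(processors)]

def store(addr, value, processors):
    owner(addr, processors).local_memory[addr // len(processors)] = value

def load(addr, processors):
    return owner(addr, processors).local_memory[addr // len(processors)]

processors = [ProcessorWithMemory(size=256) for _ in range(4)]

store(5, 99, processors)
print(load(5, processors))     # 99, served by the interleave owner of addr 5

# Adding a processor adds memory bandwidth and capacity in one step:
processors.append(ProcessorWithMemory(size=256))
print(sum(len(p.local_memory) for p in processors))   # 1280
```

Contrast this with the fixed-controller design above: there, adding processors eventually forces a new group and a new controller; here, the controller count is the processor count by construction.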
With this elegant approach to multiprocessing, there is no need for multiple groups of processors and therefore no order-of-magnitude latency penalty. Latency does increase as processors are added to an UltraSparc III multiprocessing system, but much more gradually. In very large systems, the difference in latency between the traditional approach and an SSM system can be dramatic.
Sun also developed a way to gracefully extend beyond a single snooping-coherence domain while maintaining a single-memory-image programming model. Instead of a directory, a few bits in each of the DRAM memories indicate whether the memory is local to a node or being used remotely. As memory is added, this distributed form of directory expands naturally. Anywhere from four to 28 processors can be included in the local virtual memory.
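The key contrast with the SRAM directory is that the tracking state lives alongside the data it describes. A tiny sketch, with a single-bit tag per block as a simplification of the "few bits" the scheme uses:

```python
# Distributed directory sketch: each DRAM block carries its own tag
# saying whether it is in remote use. Growing memory grows the directory
# automatically; there is no central table to outgrow.

class DramBlock:
    def __init__(self, value=0):
        self.value = value
        self.remote = False    # tag bits stored alongside the data itself

memory = [DramBlock() for _ in range(8)]
memory[3].remote = True        # block 3 is currently used by another node

# Add more memory: the directory expands in the same operation.
memory.extend(DramBlock() for _ in range(8))
print(len(memory), sum(b.remote for b in memory))   # 16 1
```

Because every new DRAM block arrives with its own tag bits, the expansion ceiling imposed by a fixed SRAM table simply never arises.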
Because it records only whether memory is being used locally, this is not a full memory directory like the type used in traditional multiprocessing systems. To handle memory accesses that are not local, an SSM agent is employed to determine exactly where the data is stored. For example, when a processor checks whether its immediate memory is available, it might find that it is not. It would then reroute the request to the SSM agent, which looks for available memory within its particular segment. If the agent cannot find the memory, it broadcasts to all the DRAMs in the system to see if they have any free. Although this may seem complex, remember that up to 28 processors can be in a segment, so these remote memory requests do not occur often.
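The three-step escalation just described, local check, then segment search via the agent, then system-wide broadcast, can be sketched directly. All names and data structures here are hypothetical stand-ins for the mechanism the article describes.

```python
# Lookup cascade: local memory first, then the SSM agent's segment,
# then a broadcast to every memory in the system as the rare last resort.

def lookup(addr, local, segment, system):
    # Step 1: is the data in this processor's own memory?
    if addr in local:
        return ("local", local[addr])
    # Step 2: the SSM agent searches its segment (up to 28 processors,
    # so most requests never get past this point).
    for node_memory in segment:
        if addr in node_memory:
            return ("segment", node_memory[addr])
    # Step 3: rare case: broadcast to all DRAMs in the system.
    for node_memory in system:
        if addr in node_memory:
            return ("broadcast", node_memory[addr])
    return ("miss", None)

local = {0x10: "here"}
segment = [local, {0x20: "nearby"}]
system = segment + [{0x30: "far away"}]

print(lookup(0x10, local, segment, system))   # ('local', 'here')
print(lookup(0x20, local, segment, system))   # ('segment', 'nearby')
print(lookup(0x30, local, segment, system))   # ('broadcast', 'far away')
```

Each step is progressively more expensive, but with large segments the cheap first two steps satisfy the overwhelming majority of requests.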
The UltraSparc III processor's on-chip memory controller enables end users to quickly expand compute resources where and when they need them, a major boon, especially to Web-based businesses. With little effort, additional processors can be added without an application rewrite and without the penalty of major increases in memory latency. Consequently, everything that can run on traditional multiprocessing platforms can run even better on a multiprocessing system with SSM.
Instead of having to purchase a new server when it needs more compute resources, a company can simply add the amount of processing power required. Freed from the burden of having to rewrite their code as they grow, Web-based businesses can reap the full benefits of the anticipated spectacular growth in e-commerce that lies ahead.