BEAVERTON, Ore. - The real target in IBM Corp.'s acquisition of Sequent Computer Systems Inc. earlier this month was not Sequent's widely debated NUMA-Q nonuniform memory architecture, according to sources at Sequent. Instead, it was an as-yet unreleased next-generation system architecture that IBM intends to apply across its spectrum of CPUs in an attempt to seize the initiative from Sun Microsystems Inc. in the explosive market for Web servers. But Sun will be a moving target. That company, long a critic of NUMA-Q, is preparing its own NUMA system architecture.
Sequent senior marketing manager Steve Wanless said that work is already under way with IBM to bring Big Blue's multi-CPU systems, including the System 390s, into the new NUMA architecture.
Meanwhile, David Yen, Sun vice president and general manager for enterprise server products, said Sun is running NUMA-system prototypes "at various research sites, including MIT. With the market changing and our work maturing, I think you will see a second-generation NUMA technology from us within a year or so."
The growth in the use of Web resources translates directly into a need for servers capable of managing very large databases and supporting large numbers of simultaneous updates and queries.
Symmetric multiprocessing (SMP) architectures have proven themselves in these applications. But as the workload increases, so does the number of processors needed, past the point where systems can be assembled using the conventional SMP layout of a single shared CPU bus and a single shared main memory.
NUMA, proposed in its current form by a research project at Stanford University some years ago, has been the answer to this problem at Sequent, Silicon Graphics Inc., Hewlett-Packard Co. and other server vendors. Essentially, NUMA links a number of board-level SMP systems together through a fast bus. Each board has a hardware interface sufficiently complex to make all the memory in the system appear local.
But there are two outstanding problems with the NUMA approach: latency and coherency. Access to a block of memory on another board requires the NUMA hardware to translate a load or store operation on the board's local system bus into a message over the interboard communications link. The other board must then perform the memory operation and reply with a message of its own. This process can be an order of magnitude slower than a local memory access.
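To see why that gap matters, a rough back-of-the-envelope model helps. The sketch below is illustrative only: the 100-ns local access time and the tenfold remote penalty are assumed figures chosen to match the "order of magnitude" description, not measurements from any Sequent system, and the function name is hypothetical.

```python
def effective_latency_ns(remote_fraction, local_ns=100.0, remote_penalty=10.0):
    """Average memory access time when some accesses go to another board.

    remote_fraction: share of accesses served by another board (0.0 to 1.0).
    local_ns:        assumed local access time (illustrative, not measured).
    remote_penalty:  assumed ratio of remote to local latency.
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    remote_ns = local_ns * remote_penalty
    # Weighted average of the local and remote access times.
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


if __name__ == "__main__":
    for fraction in (0.0, 0.1, 0.5):
        print(fraction, effective_latency_ns(fraction))
```

Under these assumed numbers, sending just one access in ten to another board nearly doubles the average access time, from 100 ns to about 190 ns.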
Coherency is also an issue. Since an application is naive about where a word of data comes from or what process was using it, it is up to the hardware to make sure the data comes from the closest correct source. In addition, if a process changes the data, all the other copies of the data that might reside in other memories or caches around the system must be informed that their data is no longer valid.
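The invalidation step can be illustrated with a toy model. The sketch below is a generic, centralized write-invalidate directory, assumed here purely for illustration; it is not the SCI protocol and not Sequent's design, and the class and method names are hypothetical.

```python
class Directory:
    """Toy write-invalidate directory covering all boards in a system."""

    def __init__(self):
        self.memory = {}   # block address -> value held in home memory
        self.sharers = {}  # block address -> set of board ids with a copy
        self.caches = {}   # board id -> {block address: cached value}

    def read(self, board, addr):
        cache = self.caches.setdefault(board, {})
        if addr not in cache:
            # Miss: fetch from home memory and record the new sharer.
            cache[addr] = self.memory.get(addr, 0)
            self.sharers.setdefault(addr, set()).add(board)
        return cache[addr]

    def write(self, board, addr, value):
        # Tell every other board holding a copy that it is no longer valid.
        invalidated = []
        for other in sorted(self.sharers.get(addr, set()) - {board}):
            self.caches[other].pop(addr, None)
            invalidated.append(other)
        # Write through to home memory (a simplification for this sketch).
        self.memory[addr] = value
        self.caches.setdefault(board, {})[addr] = value
        self.sharers[addr] = {board}
        return invalidated


if __name__ == "__main__":
    d = Directory()
    d.read(0, 0x40)
    d.read(1, 0x40)
    print(d.write(0, 0x40, 7))  # boards whose copies were invalidated
    print(d.read(1, 0x40))      # board 1 misses and refetches the new value
```

Real hardware does this bookkeeping in a state machine on every board, which is where the processing cost of a coherency protocol comes from.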
In Sequent's NUMA-Q implementation, these jobs were assigned to an IEEE Scalable Coherent Interface (SCI) ring that provided memory mapping, communication and coherency. But that is about to change, according to Wanless.
"SCI was the right choice for the time," Wanless said. "But the SCI protocol is heavy; it takes a lot of processing to handle the coherency. SCI will not be an adequate solution for the future."
Two big changes
In the next generation, Wanless said, Sequent "will make two major changes to the NUMA architecture. First, we will move from the SCI ring structure to a cached, switched network for connecting the processor boards together." Second, he said, is a much-simplified coherency protocol.
The switched architecture yields better latency, "mostly because when a piece of data is requested from another board, we will cache the block of data it came from in the board's output port," said Wanless. "But more important, the switched structure will give us greater reliability. In an SCI ring it is very difficult, when a board fails, to switch it out of the ring and switch in a new board. Hot-swapping is not possible."
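The latency benefit Wanless describes follows from ordinary caching behavior. The following sketch is a loose illustration, not Sequent's design: the 64-byte block size, the four-block capacity, the least-recently-used replacement policy and all names are assumptions, and coherency of the cached blocks is ignored.

```python
from collections import OrderedDict

BLOCK_SIZE = 64  # assumed block size in bytes, for illustration only


class PortCache:
    """Toy cache of remote memory blocks held at a board's port."""

    def __init__(self, capacity_blocks=4):
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # block number -> block contents
        self.remote_fetches = 0      # trips across the interboard link

    def load(self, addr, remote_memory):
        block = addr // BLOCK_SIZE
        if block in self.blocks:
            # Hit: served at the port, no message to the other board.
            self.blocks.move_to_end(block)
        else:
            # Miss: fetch the whole block the requested byte came from.
            self.remote_fetches += 1
            start = block * BLOCK_SIZE
            self.blocks[block] = remote_memory[start:start + BLOCK_SIZE]
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        return self.blocks[block][addr % BLOCK_SIZE]


if __name__ == "__main__":
    remote = list(range(256))  # stand-in for another board's memory
    port = PortCache()
    print([port.load(a, remote) for a in (0, 1, 2, 63, 64)])
    print(port.remote_fetches)
```

In this toy run, five loads to nearby addresses cost only two trips across the link, because the first access to each block brings its neighbors along.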
The switched architecture means "we will be able to switch boards in and out readily, and we can hot-swap. In the 24-hour server world that is a very important ability. In addition, we can physically separate the switched system into isolated regions, in effect making separate servers on one piece of hardware."
In terms of coherency, "the coherency state machine in the SCI port [now] has to handle about 18 different states. We can simplify that considerably [in the new architecture] and speed up the processing," he said.
Connection between CPUs is not the only problem Sequent is addressing, however. Wanless pointed out that in a NUMA system, it is vital that disk space not be the private property of any one CPU or SMP cluster. Hence, in the current systems, disk drives are linked through a Fibre Channel network to all the SMP boards in the system. This makes any portion of any disk appear local to any CPU.
The next step, Wanless said, is to make the LAN routers that funnel Internet packets in and out of the server equally transparent. "To do this, we are forcing the convergence of LANs and peripherals," he said.
Sequent has devised a way to use a zone on the Fibre Channel switch to support not a storage-area-network protocol, but TCP/IP. Thus, LAN routers and concentrators can be put on the storage Fibre Channel along with the disk drives. This means any LAN traffic coming into the system will also appear local to any process residing on any CPU.
But when IBM brings the first implementations of the new Sequent architecture to market, it will find a ready answer from market-dominating Sun.
"Sometimes it looks as if Sun were the only company in the market not talking about NUMA," observed Yen. "But we have followed NUMA since the beginning. When Thinking Machines first got into trouble back in 1994, we picked up an entire design team from them. That team has been researching NUMA all this time. Now we have a number of prototypes based on groups of our current servers up and running. We didn't see any reason to rush into production, or to talk a lot about the work."
Yen reported that "our current SMP line does quite well with up to 64 CPUs using a uniform crossbar-switch architecture."
He said that two major philosophical differences would distinguish Sun's project from the work of other server vendors.
"First, we feel it is extremely important to present to the application the appearance of a conventional SMP system. If you have a huge difference in latency between the local memory and memory on another board, applications developers have to tune their code very tightly to your specific configuration, doing everything they can to make sure that huge access latency doesn't harm performance. What you want is a small difference in latency, without harming single-node performance."
The second point, Yen said, "is to remember that small-system performance is crucial. We've seen some vendors jump into NUMA with both feet, as it were. They built systems that from a single-node configuration on up were optimized to be full-up 256-processor systems. I think you see this reflected in the fact that the leaders in the TPC-C benchmark results are rarely based on NUMA ideas. But the sweet spot in this market is still between one and 24 CPUs. You can't compromise performance down there. You have to build the architecture for big systems on the base of best-in-class single-node technology."
Yen declined to describe Sun's new system topology in detail. But he did say it is sufficiently different from the existing Sun servers to require support from other parts of the company.
"This will require new hooks in Solaris 8 [operating system]," Yen said. "And it depends on some new hardware hooks in the Ultrasparc-III [CPU]. Because we have in-house operating systems and CPU development, we don't have to start out our system design with someone else's CPU boards as a given. We can design from the silicon up."