On June 8, America’s Summit supercomputer was announced as up and running, with an impressive maximum theoretical performance of 200 petaflops. (Public benchmarks are expected later this month.) It became the fastest system in the world, retaking the lead from China, which had claimed dominance for several years.
Competition remains stiff. China has multiple exaflop projects expected to be running a year or more before the U.S. has a system at that level.
The Summit supercomputer at the Department of Energy’s Oak Ridge National Laboratory consists of 4,608 compute nodes, containing a total of 9,216 IBM Power 9 processors and 27,648 Nvidia Tesla V100 GPU modules. The Power 9 and V100 chips talk over NVLink, a high-speed, high-bandwidth mesh interconnect.
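Those totals imply two Power 9 chips and six V100s per node; the per-node ratios are inferred here from the published counts rather than stated outright. A quick Python sanity check of the arithmetic:

    # Back-of-envelope check of Summit's published component counts.
    # Per-node ratios (2 CPUs, 6 GPUs) are inferred from the totals above.
    nodes = 4608
    cpus_per_node = 2
    gpus_per_node = 6
    print(nodes * cpus_per_node)  # 9,216 Power 9 processors
    print(nodes * gpus_per_node)  # 27,648 Tesla V100 GPUs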
The proprietary NVLink is fine for building individual high-performance compute nodes. But scaling thousands of nodes into a high-performance cluster requires a state-of-the-art network. Summit’s full fat-tree network is built using InfiniBand EDR cards from Mellanox.
Competitors will find that Summit’s network performance, and with it the system’s overall scalability, would not be possible without PCI Express 4.0. The supercomputer is the first public high-performance cluster to support PCIe 4.0 at a scale of thousands of nodes.
Summit comprises 256 compute racks, each drawing 59 kW, and 40 racks of IBM Spectrum Scale storage, each drawing 38 kW. Each rack includes two top-of-rack switches. Eighteen racks of core switches implement the fat-tree network between compute and storage racks. Overall, the massive system claims a cross-sectional network bandwidth of nearly a petabit per second.
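That petabit figure is roughly what the node count and link speeds imply. A rough Python sketch, assuming every compute node injects traffic through two 100 Gbit/s EDR ports into a non-blocking fat-tree and ignoring the storage racks:

    # Rough estimate of Summit's cross-sectional bandwidth.
    # Assumes two 100 Gbit/s EDR ports per compute node and a non-blocking fat-tree.
    nodes = 4608
    ports_per_node = 2
    gbits_per_port = 100
    total_tbits = nodes * ports_per_node * gbits_per_port / 1000
    print(total_tbits)  # 921.6 Tbit/s, close to a petabit per second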
To link its many elements, Summit uses a multi-host Mellanox ConnectX-5 InfiniBand adapter card in each node, with individual cables that each carry 100 Gbit/s of bidirectional traffic. The Mellanox cards use PCIe 4.0 to talk to IBM’s Power 9, the first mainstream server processor to integrate the interface, which was formally ratified in 2017.
Each node implements one ConnectX-5 card in a single PCIe 4.0 x16 slot shared between the two Power 9 processors. The slot connects eight PCIe 4.0 lanes directly to each of the two processor sockets, so that neither processor becomes a bottleneck for the other or for the GPUs. All network traffic in a node flows through the processors and then over NVLink to reach the GPUs.

The ConnectX-5 has two 100 Gbit/s InfiniBand ports, and each port connects to a different top-of-rack switch. IBM says each adapter card supplies a compute node with a peak bandwidth of 25 GB/s. The shared PCIe 4.0 x16 slot can handle a peak of 32 GB/s, so each node has plenty of bandwidth to support peak InfiniBand EDR rates.
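The headroom is easy to see from nominal, per-direction link rates; the Python sketch below uses round numbers and ignores protocol overhead:

    # Why PCIe 4.0 matters: nominal per-direction rates, rounded.
    edr_gbytes_per_port = 12.5                   # 100 Gbit/s EDR port is about 12.5 GB/s
    infiniband_peak = 2 * edr_gbytes_per_port    # two ports: 25 GB/s
    pcie4_x16 = 32.0                             # ~2 GB/s per lane x 16 lanes (split x8/x8 across sockets)
    pcie3_x16 = 16.0                             # ~1 GB/s per lane x 16 lanes
    print(infiniband_peak <= pcie4_x16)          # True: PCIe 4.0 x16 covers both EDR ports
    print(infiniband_peak <= pcie3_x16)          # False: a PCIe 3.0 x16 slot would be the bottleneck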
The result is that Summit can support both classic compute-intensive workloads with relatively low throughput requirements and high-throughput machine learning training and inference workloads. Summit’s sustained performance will be evaluated after it has been operating for a few months.
–Paul Teich is a principal analyst at Tirias Research



