There is some debate in the InfiniBand community about whether HCAs should employ on-load or off-load protocol processing. Adherents of each camp claim higher performance. In this article, we will look at real-world application testing that compares the two approaches.
QLogic recently conducted a performance study comparing the two major InfiniBand architectures. The interconnect is a key factor in determining the performance of a High Performance Computing (HPC) cluster, and InfiniBand is now the leading choice for major HPC fabrics because it offers the highest bandwidth and lowest latency of the open, standards-based interconnects. However, depending on the design of the InfiniBand architecture, those advantages can be squandered as the number of compute nodes scales into the dozens, hundreds or thousands. One of the main challenges is achieving efficient cluster performance scaling, and the type of InfiniBand architecture used can have a significant impact on how well a cluster scales.
Background - Adapter-based vs. Host-based Processing
There are essentially two types of InfiniBand architectures available today in the marketplace.
* Offload architecture (in which each InfiniBand adapter includes an embedded processing resource that handles a portion of the communications protocols) was used in the InfiniBand adapters initially designed in the early 2000s, when InfiniBand was first being developed as a fabric for the enterprise data center.
* On-load architecture (in which the server's processors are used to process the communications protocols) is a more recent design, created when it became clear that HPC was the major market for InfiniBand.
On-load architecture was designed to run HPC/MPI applications and to accommodate the latest multi-core processor technology. Offload architecture had MPI retrofitted on top of an InfiniBand transport layer that was designed for the enterprise data center. The latest generation of InfiniBand, in contrast, is designed for the HPC market and for the current generation of faster, denser-core-count processors.
The two generations of InfiniBand handle protocol processing very differently. An organization's choice of InfiniBand architecture can make a significant difference in overall fabric and application performance, particularly as the size of the cluster scales. Some vendors rely heavily on on-load processing techniques, while other vendors primarily use offload processing.
Adapter-Based/Offload Adapter Architecture
Just a few years ago, a typical server might have had only one or two single- or dual-core processors and relatively slow PCI or PCI-X buses. Because those processors could issue only one instruction per clock cycle at a relatively low clock rate, servers benefited from having communications processing offloaded to the adapter.
Today's CPUs provide significantly more power than processors from only a short time ago (such as the Xeon 5400 "Harpertown"), and they are far faster than the processing engine typically found in an offload device. The Intel Xeon 5500 "Nehalem" processor issues four instructions per clock cycle and operates at a clock speed of 3GHz. As a result, each Nehalem processor has an execution rate significantly faster than the generic processing engine found in many adapter-based offload designs. This difference in processing power can overload the adapter-based/offload microprocessor, making it a bottleneck for the host and for the HPC cluster's performance.
With the recent releases of denser-core-count processors from Intel and AMD, the potential for overloading that microprocessor has only increased. For example, Intel's Xeon 5600 "Westmere" processor has six cores, which means the processing load on the adapter's microprocessor has now increased 36 times; with a dual-socket server, the processing load approaches 72 times.
Host-Based/On-Load HCA Architecture
InfiniBand designed with host-based processing takes a much different approach. Host-based HCA architecture depends on the node's processors to handle the InfiniBand protocol, which allows protocol processing performance to scale in a much more linear fashion with the number of available cores. More cores mean faster protocol processing, enabling users to leverage Moore's Law: InfiniBand protocol processing can continue to scale, provided the adapter can take full advantage of the added processing power.
Complementing this design is a transport protocol created for HPC/MPI market requirements. This "lightweight" protocol is built around tag-matching semantics, similar in concept to those used by the high performance interconnect pioneers Myricom and Quadrics. By combining the host-based adapter's ability to leverage the greatly increased processing power of today's processors with an efficient InfiniBand protocol design, the host-based/on-load adapter can provide optimal HPC application performance and scaling.
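For readers less familiar with the concept, the short MPI sketch below illustrates what tag-matching semantics mean in practice: the receiver selects a message by its tag rather than by connection or arrival order. It demonstrates only the MPI-level behavior and makes no claim about QLogic's actual transport implementation.

```c
/* Minimal illustration of MPI tag matching: a receiver selects messages by
 * (source, tag) rather than by connection or arrival order. MPI-level
 * semantics only; this is not QLogic's transport code. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int a = 111, b = 222;
        MPI_Request reqs[2];
        /* Two messages to rank 1, distinguished only by their tags. */
        MPI_Isend(&a, 1, MPI_INT, 1, 7, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(&b, 1, MPI_INT, 1, 9, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        int value;
        /* Ask for tag 9 first: the library matches on the tag, so this call
         * returns 222 even though the tag-7 message was sent first. */
        MPI_Recv(&value, 1, MPI_INT, 0, 9, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("tag 9 -> %d\n", value);
        MPI_Recv(&value, 1, MPI_INT, 0, 7, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("tag 7 -> %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```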
Performance Study - Different InfiniBand Architectures
QLogic conducted a study of the two major InfiniBand architectures to determine their performance characteristics for HPC applications using MPI. This study was conducted at the QLogic NETtrack Developer Center in Minnesota. The tests were completed using an Intel-based cluster consisting of 16 nodes and 128 cores. Each node had dual Intel Xeon 5570 "Nehalem" 2.93 GHz processors and 24GB of memory. The InfiniBand interconnects tested were QLogic TrueScale QDR host adapters and switches, and Mellanox ConnectX-2 QDR adapters with Mellanox/Voltaire switches.
Performance Study Objectives
The goal of the study was to analyze the following performance characteristics of the two major InfiniBand architectures:
* The host messaging rate performance of the InfiniBand interconnect architectures. A host's ability to process MPI messages is one of the key factors in determining how MPI applications will perform and scale.
* The MPI application performance and scaling efficiency.
Host Message Rate Performance Test
Host message rate is a key factor in the performance of an application, especially as the cluster scales. As users seek to solve more complex computational problems in less time, HPC clusters continue to grow both in terms of nodes per system and cores per node. Clusters that once used two to four cores per server now typically include eight, with 12-, 16- and 24-core servers now available. To extract the most from the additional compute power, the adapter (or adapters in some cases) must keep up with the steep increase in communications throughput required by clusters based on the latest processor technologies.
The standard test for measuring host message rate is OSU's MPI Message Rate test. The message rate test evaluates the aggregate unidirectional message rate between multiple pairs of processes. Each sending process sends a fixed number of messages back-to-back to its paired receiving process before waiting for a reply from the receiver, and this procedure is repeated for several iterations. The objective of the benchmark is to determine the achieved message rate from one node to another with a configurable number of processes running on each node.
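To make that procedure concrete, the sketch below follows the same windowed send-and-reply pattern in MPI. It is a simplified stand-in for the OSU benchmark, not the benchmark itself, and the window size, message size, and iteration count are assumptions chosen for readability rather than the benchmark's defaults.

```c
/* Simplified sketch of a windowed message rate measurement in the spirit of
 * the OSU test: each sender streams a window of small non-blocking sends to
 * its paired receiver, then waits for a short reply. All sizes and counts
 * below are illustrative assumptions. */
#include <mpi.h>
#include <stdio.h>

#define WINDOW   64        /* messages sent back-to-back before a reply */
#define MSG_SIZE 8         /* bytes per message */
#define ITERS    1000      /* repetitions of the window/reply cycle */

int main(int argc, char **argv)
{
    int rank, size;
    char sbuf[MSG_SIZE] = {0}, rbuf[WINDOW][MSG_SIZE], ack[MSG_SIZE] = {0};
    MPI_Request reqs[WINDOW];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Pair the ranks: the first half send, the second half receive. */
    int pairs = size / 2;
    int peer  = (rank < pairs) ? rank + pairs : rank - pairs;

    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        if (rank < pairs) {                                   /* sender side */
            for (int w = 0; w < WINDOW; w++)
                MPI_Isend(sbuf, MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Recv(ack, MSG_SIZE, MPI_CHAR, peer, 1,        /* wait for reply */
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {                                              /* receiver side */
            for (int w = 0; w < WINDOW; w++)
                MPI_Irecv(rbuf[w], MSG_SIZE, MPI_CHAR, peer, 0,
                          MPI_COMM_WORLD, &reqs[w]);
            MPI_Waitall(WINDOW, reqs, MPI_STATUSES_IGNORE);
            MPI_Send(ack, MSG_SIZE, MPI_CHAR, peer, 1, MPI_COMM_WORLD);
        }
    }
    double seconds = MPI_Wtime() - t0;

    if (rank == 0)
        printf("per-pair rate: %.2f million messages/sec\n",
               (double)ITERS * WINDOW / seconds / 1e6);

    MPI_Finalize();
    return 0;
}
```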
This test was run between two Intel-based servers, each with dual Xeon 5670 (2.93 GHz) processors. The first run used QLogic TrueScale QLE7340 QDR adapters (the host-based architecture) in each server, connected to a 12300 TrueScale QDR switch. The test was then repeated with Mellanox ConnectX-2 QDR adapters (the adapter-based/offload architecture), connected to a Mellanox MTS3600 QDR switch.
Figure 1: Non-coalesced message rate of the host-based/on-load adapter vs. the adapter-based/offload adapter
Figure 1 illustrates that the adapter with on-board protocol processing "tops out" at roughly seven million messages per second. More significantly, the performance of this adapter actually declines as the number of processor cores grows beyond three. In contrast, at twelve cores the host-based adapter delivers more than seven times the message throughput of the adapter-based/offload adapter. Extrapolating this effect over hundreds of nodes, it becomes clear that when adapter-based processing is the primary technique in use, the incremental benefit of adding nodes declines as the cluster grows, because the adapters become the bottleneck.
In summary, the host-based adapter achieved five times more messages per second at scale, while the adapter-based/offload adapter's performance peaked at four cores.
Application Test - Fast Fourier Transform
The HPC Challenge (HPCC) benchmark suite includes an MPI application-level test that is a good example of how InfiniBand host message rate, cluster message rate, and latency can impact the performance of an MPI application. HPCC's MPIFFT test measures the floating point rate of execution of a double-precision complex one-dimensional Discrete Fourier Transform (DFT). The rating is reported in gigaflops, or billions of floating point operations per second.
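For reference, the gigaflops rating for a complex one-dimensional FFT of length N is conventionally derived from the 5 * N * log2(N) operation-count model. The short sketch below shows that calculation; the function name and the sample problem size and timing are illustrative assumptions, not values taken from the HPCC source code or from this study's results.

```c
/* Sketch of the conventional FFT flop-rate calculation: 5 * N * log2(N)
 * operations for a complex 1-D transform of length N, divided by run time.
 * The numbers in main() are purely illustrative. */
#include <math.h>
#include <stdio.h>

static double fft_gflops(double n, double seconds)
{
    return 5.0 * n * log2(n) / seconds / 1.0e9;
}

int main(void)
{
    /* Example: a 2^28-point transform completed in 2.5 seconds. */
    double n = (double)(1 << 28);
    printf("%.2f Gflops\n", fft_gflops(n, 2.5));
    return 0;
}
```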
Figure 2: HPCC MPI Fast Fourier Transform results
The MPIFFT results show that TrueScale has a nine percent performance advantage at 32 cores and a 13 percent advantage at 128 cores, which amounts to almost nine Gflops more performance on the same Intel-based servers than when Mellanox InfiniBand is used. These results reflect TrueScale's advantages in host message rate, cluster message rate, and latency: even though the TrueScale adapter uses the host CPU and memory for protocol processing, it still achieves better performance than the Mellanox offload adapter.
Application Test - ANSYS FLUENT
The final application tested was ANSYS FLUENT, one of the most popular computational fluid dynamics (CFD) applications on the market. FLUENT is an example of a commercial MPI application used in many industries, including aerospace, automotive, oil and gas, and medical, to name just a few. ANSYS has made great strides in designing FLUENT to scale and perform well on HPC clusters.
The size of the CFD model has a direct bearing on how heavily the interconnect is exercised. A CFD simulation is composed of a number of cells, and a relatively small simulation is partitioned into a small number of cells per core. Those cells are solved quickly for each step of the simulation, after which every node must send the results of the completed step to the other nodes, producing frequent MPI communications (i.e., messages) between nodes. A smaller model therefore best shows the performance of the interconnect and is a good indication of how a much larger model would run and scale on a larger cluster.
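To make this pattern concrete, the schematic sketch below shows a per-step local solve followed by a boundary exchange between neighboring MPI ranks. It is a generic illustration of partitioned-solver communication, not FLUENT's implementation, and the ring topology, cell count, and update rule are assumptions chosen for brevity.

```c
/* Schematic of a partitioned solver's per-step pattern: a short local solve
 * over this rank's cells, then an exchange of boundary data with neighbors.
 * Generic illustration only; not FLUENT's actual implementation. */
#include <mpi.h>
#include <stdlib.h>

#define STEPS 100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int cells_per_core = 4000;             /* small model: few cells per core */
    double *cells = calloc(cells_per_core, sizeof(double));
    double halo_out = 0.0, halo_in = 0.0;
    int left  = (rank - 1 + size) % size;  /* 1-D ring of neighbors for brevity */
    int right = (rank + 1) % size;

    for (int step = 0; step < STEPS; step++) {
        /* Local solve: with few cells per core this finishes quickly ... */
        for (int i = 0; i < cells_per_core; i++)
            cells[i] += 0.5 * (halo_in - cells[i]);
        halo_out = cells[cells_per_core - 1];

        /* ... so the ranks spend proportionally more of each step exchanging
         * boundary data, which is why small models stress the interconnect. */
        MPI_Sendrecv(&halo_out, 1, MPI_DOUBLE, right, 0,
                     &halo_in,  1, MPI_DOUBLE, left,  0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(cells);
    MPI_Finalize();
    return 0;
}
```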
The FLUENT test was run on the Intel-based server cluster, which consists of 16 servers and one NFS server node. Each server has dual quad-core Intel Xeon 5570 "Nehalem" 2.93GHz processors and 24GB of memory. The Intel-based server cluster has a total of 128 cores and 384GB of memory. The following InfiniBand configurations were tested:
* QLogic: 12300 QDR switches, QLE7340 QDR HCAs
* Mellanox: MTS3600 QDR switches, ConnectX-2 QDR HCAs
A FLUENT test was performed using the Turbo 500K model, which showed an even more pronounced performance difference at scale. The Turbo 500K model stresses both the processing and communications infrastructure of the cluster. The test results showed that the TrueScale-based cluster had a 22 percent performance advantage over the Mellanox-based cluster.
Figure 3: ANSYS FLUENT 12.1 - Turbo500K Results
The smaller FLUENT models best show the performance capability of the interconnect. The reason is that these types of models are broken into smaller numbers of cells per core, which requires extensive MPI communication between cores and nodes to complete the simulation.
FLUENT is both latency-sensitive and processor-intensive, which places stress on an interconnect, especially at scale. Across the models tested, TrueScale showed a performance advantage ranging from 3.6 to 22 percent, depending on the FLUENT model.
From these tests, it is clear that the on-load processing architecture offers higher performance and better scalability for most HPC clusters.
About the Author
Joe Yaworski is director of Global Alliances and Solution Marketing for QLogic. Within his Global Alliance responsibilities, he manages QLogic's strategic partnerships and alliances in the High Performance Computing space. Joe has helped build one of the industry's broadest HPC ecosystems, which now includes alliances with over 70 companies. His role is to help channel and alliance partners create solution-marketing programs that combine their offerings with QLogic's HPC technologies. He also directs the QLogic NETtrack Developer Center, which is used to test and certify partner applications and to conduct performance benchmarking.