“It was the best of times, it was the worst of times…” opens “A Tale of Two Cities” by Charles Dickens. The passage referred to the French Revolution, but today it could refer to the revolution ongoing in parallel computing. It is the ‘best of times’ for hardware.
Multi-core processors, HPC clusters, grids, clouds, and GPUs are growing in popularity and availability. However, examine the software on the market and it is the ‘worst of times’. There is a gap between the performance the hardware offers and the performance realized by both commercially sold software and software developed in-house. To date, training in parallel programming has been scarce for engineers and scientists.
The situation is changing, though, as technical computing software increasingly uses parallel hardware. High-level technical languages such as MATLAB have been steadily adding features that let their users solve bigger problems, faster, with parallel resources. Parallel computing is no longer limited to C and FORTRAN programmers who understand the nuances of MPI or OpenMP.
Ideal versus Realistic Language Options
It would be ideal if software, new and old, sped up automatically as new cores were added to their systems. It would be ideal if technical computing languages could automatically take advantage of parallel hardware and ‘do the right thing’ without burdening the end user with reprogramming anything. It would be ideal if old programming paradigms were sufficient for utilizing parallel hardware effectively.
Unfortunately, this won’t be the case. Automatic utilization of parallel hardware is called ‘implicit parallelism’ and has been the Holy Grail of parallel computing research.
There has been some success, though typically narrowly focused. For example, several of the BLAS (Basic Linear Algebra Subprograms) implementations have added multi-threaded matrix manipulation. The Intel FORTRAN compiler has implicit technologies designed for looping structures, including ‘Auto-parallelization’ (TLP) for outer loops and ‘Auto-vectorization’ (ILP) for inner loops. Other examples exist, but the overall impact of implicit parallelism has been small compared with the success of explicit parallelism. Explicit parallelism, though, places the burden of extracting performance from multiple cores on the programmer.
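To make the loop-level case concrete, here is a minimal C sketch. The function names `scale_add` and `prefix_sum` are illustrative only and do not come from any library. The first loop has independent iterations, the kind of pattern an auto-vectorizing or auto-parallelizing compiler can typically handle. The second carries a dependence from one iteration to the next, which is one reason implicit techniques tend to stay narrowly focused. Whether a given compiler actually transforms either loop depends on the compiler and its flags.

```c
#include <stddef.h>

/* Independent iterations: y[i] depends only on x[i] and y[i].
   `restrict` promises the arrays do not overlap, which removes one
   common obstacle to automatic vectorization. */
void scale_add(size_t n, double a, const double *restrict x, double *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

/* Loop-carried dependence: each element needs the previous result,
   so the iterations cannot simply be run side by side. */
void prefix_sum(size_t n, double *v)
{
    for (size_t i = 1; i < n; ++i) {
        v[i] += v[i - 1];
    }
}
```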
With explicit parallelism, programmers use specialized functions to call on the power of parallel hardware. These programmers could be anyone, but traditionally parallel programming has required the kind of knowledge a computer scientist has, rather than that of an engineer or other scientist.
There are numerous low-level technologies available. For example, CUDA was developed by NVIDIA to provide access to compatible Graphics Processing Units (GPUs) for general computations. Programmers can use the GPU processing power from languages such as C.
A programmer who wanted to use a GPU with this technique would need to know both C and CUDA. They would also have to understand how to write parallel programs rather than serial programs, since the two are designed differently. If the programmer instead wanted to write a C program that could use multiple cores or processors (CPUs), they would need to learn a technology such as OpenMP or MPI (Message Passing Interface). Their CUDA knowledge would not be directly useful because CUDA GPUs are not built into a standard system.
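As a rough illustration of what explicit parallelism asks of the programmer, the following C sketch sums the squares of an array using an OpenMP directive. The function name `sum_of_squares` is hypothetical. Assuming a compiler with OpenMP support enabled (for example `-fopenmp` with GCC), the `parallel for` directive splits the iterations across threads, and the `reduction` clause gives each thread a private partial sum that is combined at the end. Without OpenMP enabled, the pragma is ignored and the loop runs serially.

```c
#include <stddef.h>

/* Sum of squares, with the loop explicitly marked for parallel execution. */
double sum_of_squares(const double *x, size_t n)
{
    double sum = 0.0;

    /* reduction(+:sum): each thread accumulates privately, then the
       partial sums are added together once the loop finishes. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; ++i) {
        sum += x[i] * x[i];
    }
    return sum;
}
```

Even in this small case the programmer has to recognize that `sum` is shared across threads and declare it a reduction. Leaving the clause out would introduce a data race, which is the sort of parallel-programming detail a serial C programmer never has to consider.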
What these solutions have in common is that they are geared toward programmers who use a low-level language such as C, have experience with parallel programming concepts, and know a particular API for a particular hardware solution. They are in the sweet spot for high performance computing (HPC) experts, and an engineer or scientist going this route has to climb a learning curve and become HPC savvy as well.