Design Article
High-level parallel programming model simplifies multicore design
Michael D. McCool and Stefanus Du Toit
3/26/2008 4:52 AM EDT
Processors recently have added explicit parallelism in the form of multiple cores, and processor road maps are showing the number of cores increasing exponentially over time. This is in addition to existing per-core vector instructions, which also require parallelism. The good news is that processors will continue to scale exponentially in performance. The bad news is that as processors are no longer significantly scaling their clock rate, software apps that are not parallelized will not benefit.
Use of a high-level programming model greatly simplifies software development for multicore processors, including heterogeneous multicore devices. Most important, this approach does not sacrifice performance if the platform implementation includes modern automatic code optimization strategies.
Given the volatility of current processor designs, it is also worthwhile to consider the importance of portable, high-level parallel programming models for future-proofing software development. Portability also allows easy migration of code among processor designs, including handheld, mobile, desktop and server processors.
Heterogeneous designs in which some cores are specialized for some tasks are now being considered. This leads to greater power/performance efficiencies. However, the trend potentially complicates programming, since different cores in the same machine may have different instruction set architectures and require different approaches to optimization.
Examples of heterogeneous architectures include the Cell BE processor (which has one PowerPC core and eight high-performance DSP-like vector cores) and emerging processor designs that combine CPU and accelerator cores on the same chip. Nvidia's handheld GoForce processors combine an ARM core and a GPU, and projects such as Advanced Micro Devices' Fusion and Intel's Larrabee seek to combine X86 cores with GPU cores as well. GPU cores, originally designed for graphics, now support a general programming model and are similar to digital signal processing cores.
GPU cores are applicable to a range of applications, including video processing, simulation, vision, audio processing, speech recognition, and even spam recognition and database search. Heterogeneous designs are likely to be especially common in the mobile and handheld space, where power efficiency is paramount, although they are significant in the desktop and server space as well.
It is important to note that improvements in the power/performance ratio can be used either to maximize performance for a given power or to minimize power for a given performance.
Processors are therefore evolving rapidly, and aside from adding more cores, future processors will likely be capable of executing more operations in every clock cycle on a single core, especially in "accelerator" cores. For example, processors may support four-way to eight-way single-instruction, multiple-data (SIMD) instructions, have a pipeline of five or more stages, and be superscalar, allowing multiple independent instructions to issue (start executing) in the same clock cycle.
Even current processors depend heavily on instruction-level parallelism for performance: It would not be unusual for 20 or more operations to be in progress at once on even a "mainstream" processor, and hundreds or thousands may be in progress at once on a GPU.
Therefore, although multiple cores are the most obvious form of parallelism in today's processors, each core also may include significant additional parallelism at the instruction level. Maximum performance is only achieved when single-core code executes as many operations as possible every clock cycle and also manages memory so that the cores are not starved for data.
Traditionally, each hardware parallelism mechanism and memory management mechanism has been programmed separately using a low-level interface. However, programming multiple low-level parallelism mechanisms and coordinating them is a herculean task. Such an approach is also processor-specific, which means it has to be redone for each change in the instruction set architecture, or to target different deployment processors. This extends development time enormously.
Instead, it makes sense for software organizations to consider a high-level approach to parallelization that lets developers focus on the overall structure of their application. A software development platform is a system that dynamically manages and optimizes code and also manages its execution on a parallel machine.
An appropriate high-level parallel programming model, with a software development platform that manages and optimizes code to target the various parallelism mechanisms available, has major benefits in terms of programmer productivity and software portability. It also leads to higher-performing code overall, since more parallelism mechanisms can be targeted than would be possible with a manual approach.
The first fundamental observation motivating parallel software development platforms is that parallelism can be abstracted and expressed separately from the manifold mechanisms by which it is implemented in hardware. Given a suitable high-level description of the "latent" parallelism in an application, it is possible to map it automatically onto multiple hardware implementation mechanisms. Also, to achieve maximum performance, programs should be designed around massive parallelism.
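As a rough illustration of separating the description of parallelism from its mapping, consider the minimal Python sketch below. The names parallel_map, plan_chunks and saxpy_element are hypothetical and are not taken from any platform discussed in this article. The application supplies only a per-element kernel, while a stand-in "platform" layer decides how to split the work across workers. Python threads are used purely to show the structure; they are not expected to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_chunks(n_items, n_workers):
    """Split range(n_items) into at most n_workers contiguous (start, stop) chunks."""
    if n_items == 0:
        return []
    size = -(-n_items // n_workers)  # ceiling division
    return [(start, min(start + size, n_items)) for start in range(0, n_items, size)]


def parallel_map(kernel, data, n_workers=4):
    """Apply kernel to every element of data.

    The caller states only what is computed per element. The chunking and
    the worker count are mapping decisions made here, not by the caller.
    """
    def run_chunk(bounds):
        start, stop = bounds
        return [kernel(x) for x in data[start:stop]]

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(run_chunk, plan_chunks(len(data), n_workers)))
    # pool.map preserves chunk order, so the flattened result matches input order.
    return [y for part in parts for y in part]


def saxpy_element(x):
    """Hypothetical per-element kernel: independent of every other element."""
    return 2.0 * x + 1.0


values = [float(i) for i in range(10)]
result_1 = parallel_map(saxpy_element, values, n_workers=1)
result_4 = parallel_map(saxpy_element, values, n_workers=4)
print(result_1 == result_4)
```

Because the kernel carries no ordering or placement assumptions, changing n_workers changes only the mapping, not the result.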
Multiplying together the opportunities for parallelism on a modern mainstream processor (i.e., multicore by vectorization by pipelining by superscalar issue) shows that hundreds to thousands of independent operations need to execute at once for full utilization. If a large amount of latent parallelism is available to the software platform, it can be decomposed over the available mechanisms, and any "extra" parallelism can always be serialized. If new hardware emerges with more parallelism, the software platform can automatically exploit it to improve performance. Therefore, software platforms are key technologies enabling innovation in processor design, since they decouple the software design from the details of the hardware.
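The multiplication can be made concrete with a small Python sketch. The core counts, SIMD widths, pipeline depths and issue widths below are illustrative assumptions chosen from the ranges mentioned earlier, not measurements of any real processor, and the function names are hypothetical.

```python
from math import ceil, prod


def hardware_parallelism(cores, simd_width, pipeline_depth, issue_width):
    """Independent operations that must be in flight to keep every mechanism busy."""
    return prod((cores, simd_width, pipeline_depth, issue_width))


def serialized_passes(latent_parallelism, hardware_slots):
    """Sequential passes needed when latent parallelism exceeds what the hardware holds."""
    return ceil(latent_parallelism / hardware_slots)


# Assumed configurations, for illustration only.
mainstream = hardware_parallelism(cores=4, simd_width=4, pipeline_depth=5, issue_width=2)
wider = hardware_parallelism(cores=8, simd_width=8, pipeline_depth=5, issue_width=2)

work_items = 100_000  # latent parallelism exposed by a hypothetical application

print(mainstream, serialized_passes(work_items, mainstream))  # 160 625
print(wider, serialized_passes(work_items, wider))            # 640 157
```

Under these assumptions the same 100,000 independent work items need about a quarter as many serialized passes on the wider design, without any change to how the application describes its parallelism.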


ScottRanville_SoftwareBeretInc
4/14/2008 10:50 AM EDT
Very timely article. We are actually in the process of doing exactly what you talked about. It has taken many hours of study and experimenting, but we have finally gotten code to run on one of the processors you mentioned. Our initial experiments indicate that even small models execute significantly faster on the parallel hardware compared to running the model in its native modeling environment. Time now to start testing with larger models, in which we expect even more impressive improvements in performance.
Thanks
Scott Ranville, Software Beret Inc.