In the world of technical computing, discussions about parallel programming focus on customizing algorithms to utilize hardware effectively. Fueling these discussions are several factors related to high-performance systems: the introduction of multicore and many-core systems, the appearance of programmable devices such as graphics processing units (GPUs), and the growing availability of commercial-off-the-shelf (COTS) computer clusters.
Until very recently, commercial high-level tools to support the development of technical computing applications for high-performance systems did not exist. Parallel programming was an esoteric art applied by specialists who focused on achieving maximum performance by using custom setups and low-level libraries and by tuning their applications for specific hardware.
In a 2007 briefing, IDC highlighted the difficulty of scaling beyond a single node because of the lack of appropriate programming environments. Today, as high-performance systems become more prevalent, there is an urgent need to make these systems more readily programmable by all.
To that end, parallel programming solutions must focus beyond custom algorithms and performance. Ecosystems of tools are being developed that assist engineers in the design, development, and debugging of parallel applications and that fully utilize the capabilities of rapidly evolving hardware. These ecosystems have matured greatly over the last few years; to succeed, they need to:
• Extend the functionality of standard tools used in developing serial applications to support parallel programming
• Support the scalability of applications from a simple multicore desktop to an intricate cluster and grid configuration, without the need to modify the application code
• Provide a robust integrated development environment (IDE)
• Execute applications in batch and interactive modes
Adequate parallel programming support is a key consideration in choosing a programming language. This support ranges in approach from fully implicit to fully explicit language constructs. An implicit approach involves automatic parallelization performed by the system without changes to the existing program, whereas an explicit approach requires users to annotate programs so that they run in parallel. Implicit parallelism is still an active area of research with no general solution, so for the foreseeable future some degree of language explicitness is required. Ideally, languages should directly support high-level parallel constructs that let users achieve good performance with minimal program annotation.
Traditional parallel programming solutions focus mostly on the explicit approach, generally using the Message Passing Interface (MPI). MPI is the de facto standard API for communication in parallel programs, covering both point-to-point and collective operations. Using it requires programmers to work with low-level language constructs, which represents a substantial barrier.
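To make the barrier concrete, the following sketch shows what explicit, MPI-style point-to-point communication looks like using the lab* functions of MATLAB's parallel computing tools. Function names (labSend, labReceive, labindex) are from the Parallel Computing Toolbox releases of that era; the enclosing construct and exact signatures may vary by version.

```matlab
% Explicit message passing, MPI-style: the programmer must manage
% which worker sends, which receives, and in what order.
spmd
    if labindex == 1
        % Worker 1 sends a vector to worker 2 (point-to-point send).
        labSend(rand(1, 1000), 2);
    elseif labindex == 2
        % Worker 2 blocks until the message from worker 1 arrives.
        data = labReceive(1);
        fprintf('Worker 2 received %d elements\n', numel(data));
    end
end
```

Even this minimal exchange forces the programmer to reason about worker identities and message ordering, which is exactly the low-level bookkeeping that higher-level constructs aim to hide.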
Several software companies are addressing the need for parallel programming support. For example, Microsoft Parallel Extensions to the .Net Framework provides language constructs for concurrency in applications written with any .Net language. These constructs abstract away lower-level programming details.
MathWorks parallel computing tools let users program across the implicit/explicit range. For example, Matlab users programming computationally intensive applications might annotate FOR loops as PARFOR loops. A PARFOR loop behaves like a traditional FOR loop when executed on a single-processor system. Yet when executed on a multicore machine or cluster, PARFOR transparently makes effective use of the additional computational resources now available. Moreover, users programming data-intensive applications can annotate arrays as distributed arrays, which are allocated across multiple cores or processors. Distributed arrays enable users to develop parallel applications without having to concern themselves with the low-level details of message passing. In the implicit programming ideal, an application that is parallelized would require no modifications to the code. Working toward this ideal, The MathWorks is gradually adding direct support for parallel computing in its toolboxes, eliminating the need to modify applications that use them.
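The PARFOR and distributed-array annotations described above can be sketched as follows. The function expensiveComputation is a hypothetical placeholder for the user's work per iteration, and constructor details for distributed arrays vary by toolbox release.

```matlab
N = 1000;
y = zeros(1, N);

% A serial loop...
for i = 1:N
    y(i) = expensiveComputation(i);   % hypothetical user function
end

% ...parallelized by a one-word annotation. On a single-processor
% system this behaves like an ordinary FOR loop; on a multicore
% machine or cluster, iterations are divided among workers.
parfor i = 1:N
    y(i) = expensiveComputation(i);
end

% Data-intensive case: a distributed array is partitioned across
% the memory of the workers, and operations on it run in parallel
% without any explicit message passing by the user.
D = distributed.rand(10000);   % 10000-by-10000 array spread across workers
s = sum(D(:));                 % reduction executes in parallel
```

In both cases the annotation, not a rewrite, carries the parallelism: the loop body and the array operations are unchanged from their serial form.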
Desktop computers with multicore processors and general-purpose computation on GPUs (GPGPUs), such as the CUDA Toolkit from Nvidia, are readily available. While programming these systems often involves using a threading model on a single machine, many users need to scale beyond the capacity of a single machine to a computer cluster. In these cases, a process-based model is required. A user application need not be concerned with threads or processes; it should just run as efficiently as possible. Nevertheless, scalability remains a difficult, open problem.
In some cases, users need to run applications on resources with variable capacity. An example of variable resources is Amazon Elastic Compute Cloud (EC2), which provides users with complete control of resources running on Amazon's computing environment. Barriers to this approach include setup difficulty and the loss of control over intellectual property when running on an external system.
Tools must permit programmers to seamlessly scale applications from desktops to clusters and grids without modifying code. The configuration framework provided by MathWorks parallel computing tools is one example of how the scalability problem can be addressed for parallel applications. The configuration framework lets users maintain named settings, such as scheduler type and cluster usage policies. As a result, users can switch between hardware resources by simply changing the configuration name.
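A minimal sketch of this configuration-driven scaling, using the matlabpool command from the Parallel Computing Toolbox releases of that era (the configuration name 'myClusterConfig' is hypothetical; the command and its arguments vary by version):

```matlab
% Target a multicore desktop via the built-in 'local' configuration.
matlabpool('open', 'local', 4);
% ... run the parallel application ...
matlabpool close

% Target a cluster by naming a different stored configuration.
% The application code itself is unchanged.
matlabpool('open', 'myClusterConfig');
% ... run the same parallel application ...
matlabpool close
```

Because scheduler type and cluster policies live in the named configuration rather than in the program, switching hardware resources is a one-argument change.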
IDEs need to support both process-based and thread-based solutions while assisting both novice and experienced programmers in developing their applications. In particular, IDEs need to enable programmers to debug parallel programs that deal with large data sets and may run in desktop computers or in large clusters.
Microsoft has done extensive work on its Visual Studio development system to reflect this requirement. Visual Studio 2005 can execute parallel applications, and its debugger supports process-level and thread-level breakpoints and stepping. Another product, TotalView Debugger from TotalView Technologies, supports multiple platforms and programming languages, is capable of scaling to thousands of threads or processes, and provides tools to explore large and complex data sets.
Tools should allow for batch and interactive workflows. Programmers are more productive when they interactively develop their parallel applications. However, most high-performance computing (HPC) centers support a batch workflow, in which users write applications, submit them to a cluster for processing, and wait for their results. Other solutions based on the Web, such as nanoHUB.org from Purdue University and network.com from Sun Microsystems, focus on interactive on-demand access to computing resources.
MathWorks parallel computing tools support both workflows. They extend Matlab, letting programmers use the familiar Matlab environment for interactively developing parallel applications. Programmers can also use a batch environment that provides them with an offline execution mode.
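The two workflows can be contrasted in a short sketch. The batch, wait, and load functions are from the Parallel Computing Toolbox of that era; myAnalysis is a hypothetical user script, and exact job-management function names vary by release.

```matlab
% Interactive workflow: develop and run directly in the MATLAB
% session, inspecting results as they are produced.
results = myAnalysis(inputData);

% Batch workflow: submit the same work for offline execution on
% a cluster, keep working, and collect results when the job ends.
job = batch('myAnalysis');   % submit to the configured scheduler
wait(job);                   % block until the job completes
load(job);                   % load the job's workspace variables
destroy(job);                % release cluster resources
```

The same application code serves both modes; only the submission wrapper differs.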
Several vendors now offer solutions that let technical computing users build applications for parallel systems as effectively as they build traditional serial programs, but gaps still exist in the ideal ecosystem of tools. Emergent trends such as utility computing and grid availability will bring even more requirements to this rapidly changing field.
Earl Joseph, Jie Wu, and Steve Conway, "IDC HPC Breakfast Briefing," International Supercomputing Conference, Germany, June 2007.
Roy Lurie, "Language Design for an Uncertain Hardware Future," HPCwire, September 2007.
Cleve Moler, "Parallel Matlab: Multiple Processors and Multiple Cores," The MathWorks News and Notes, June 2007.
Loren Dean (Loren.email@example.com) is a director in the Matlab development organization. Before joining The MathWorks, Loren worked for AlliedSignal Aerospace performing systems analysis and integration for aircraft engines, with extensive use of Matlab and Simulink. He has a BS and an MS in Aeronautical Engineering from Purdue University and an MBA from Northeastern University.
Silvina Grad-Freilich (Silvina.Grad-Freilich@mathworks.com) is the manager for parallel computing and application deployment marketing at The MathWorks. She holds both BS and MS degrees in computer science from the National University of La Plata in Argentina and an MS in management from MIT (Massachusetts Institute of Technology).