Debugging code running on multiprocessor computing systems, and, in particular, parallel code on multicore devices, is an old computing problem that has reached a certain prominence and urgency because of the profound transformation of hardware from single-processor to multiprocessor and multicore solutions in the past few years. But beyond software engineers' moaning and groaning that hardware is making life harder, what does this transformation really mean? It's one thing to claim that something is difficult, but it is something else entirely to produce a solution that truly helps.
There are several schools of thought about multiprocessor software design and debugging, depending on the background of those in the discussion. Sometimes the issue is optimizing an "embarrassingly parallel" algorithm by writing a small piece of new code to run on a particular parallel machine. Other times it is taking existing code, many millions of lines of it, and simply making it work correctly in a parallel world. Although I think that most cases involve a mix of both, the bigger immediate problem is posed by large bodies of existing, working code. One obvious example is the battle between SMP (symmetric multiprocessing) and AMP (asymmetric multiprocessing) setups, an argument that is often more religious than technical, and one quite irrelevant to the debugging question.
A survey performed by Freescale and Virtutech at the Embedded Systems Conference Silicon Valley 2008 indicated that the top issues in multicore software development were:
» lack of determinism and repeatability of bugs;
» inability to stop an entire system to debug software;
» getting existing software to run on multicore systems;
» inadequate visibility of all states in an embedded system (e.g., system-on-chip, board and rack).
The survey indicated that performance was less of a problem for software development, yet it was the primary reason for choosing a multicore processor. This could be interpreted as an assumption that getting code running guarantees some kind of performance increase. Given current hardware trends, multiprocessing is the only route to increased performance, so even a 25 percent gain from doubling the number of cores may be worthwhile. The alternative is a single core and no performance increase at all, so this idea has some merit.
It could also be that performance is a secondary concern at present, if only because multiprocessing is still new to most developers. In a few years, as multicore platforms become more common and multiprocessing becomes more accepted, more developers will overcome the initial learning curve, and the emphasis will shift from just getting things to work to getting them to work efficiently. This is sound software engineering practice: first make it work, then make it work well.
Guidance from OSes
Operating systems offer a good case study for those interested in software ports to parallel computers. First, operating systems are by necessity the first code (if you consider the boot code to be part of the OS, an idea that can be controversial in itself!) that must be put in place if a new machine is to be useful. Second, operating systems are fairly well studied, with data available from several generations of parallelization efforts. The first took place on mainframes in the 1960s and the second on desktop operating systems in the 1990s; the third is now taking place in embedded operating systems.
In general, porting an operating system to an SMP system involves two main tasks: splitting OS data into a part local to each processor and a global shared part, and putting locks in place to protect the shared data. For an initial port, most of the work is actually in splitting up the data; the locking can be simple from the beginning, sacrificing performance in order to get things correct. Long term, most of the work on a parallel OS is spent successively refining the locking regime and mechanisms in order to increase parallelism, and thus performance. Note that such refinement is not just a matter of increasing performance: each change to the locking regime requires extensive testing and debugging to make sure that no critical case has been forgotten.
The main issue is getting locking right: either locks are missing where they are needed, or they are taken in an order that leads to deadlock. This is borne out both by anecdote and by published material. Conversations with software engineers working on both general software and fundamental operating systems make it clear that this applies to most ports of existing code onto parallel platforms. First, get locks in place; second, tune them to lock as little and as seldom as possible, making sure correctness is retained.