ARM is providing software support for the CPU migration and global task scheduling versions of big-little multiprocessing, as well as some performance and power comparisons.
On the positive side, this more general approach can be applied to different numbers of big and little cores. SoC designs can be tuned to the appropriate core mix: six Cortex-A7s and two Cortex-A15s for one type of equipment, but five and three for another, for example. However, the software sits close to the operating system, so it becomes an issue that extends beyond the SoC provider.
In the Linux domain, ARM is working through the Linaro organization to provide two pieces of software: an in-kernel switcher that implements big-little CPU migration, and big-little MP, an open-source global task scheduler that supports OS-aware heterogeneous multiprocessing.
Measured results for the CPU migration and global task scheduling versions of big-little under a mixed load. Performance is shown in blue, energy in green.
This image shows that, under a mixed load of web browsing and MP3 audio playback, both the CPU migration and global task scheduling flavors of big-little produce a 50 percent power saving over running on the big cores alone, that is, with half the total number of cores -- four Cortex-A15s versus four Cortex-A15s and four Cortex-A7s, for example. However, because the global task scheduling version can occasionally have all cores operating at the same time, it shows slightly higher computational performance.
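To make the performance difference concrete, here is a minimal Python sketch of how many cores each flavor can keep busy at once. It rests on an assumption not spelled out above: that CPU migration pairs each big core with a little core and runs only one core of each pair at a time, whereas global task scheduling exposes every core to the scheduler. The names Topology, max_active_cpu_migration, and max_active_global_scheduling are hypothetical and exist only for illustration; they do not correspond to any ARM or Linaro interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    """Hypothetical description of an SoC's core mix."""
    big: int     # number of big cores, e.g. Cortex-A15
    little: int  # number of little cores, e.g. Cortex-A7


def max_active_cpu_migration(topo: Topology) -> int:
    """Upper bound on simultaneously active cores under CPU migration.

    Assumption of this sketch: cores are grouped into big/little pairs
    and only one core of each pair runs at any moment.
    """
    if topo.big != topo.little:
        # Limitation of this sketch only, not a statement about real silicon.
        raise ValueError("this sketch only models equal big and little counts")
    return topo.big  # one active core per pair


def max_active_global_scheduling(topo: Topology) -> int:
    """Upper bound under global task scheduling: every core is visible
    to the scheduler, so all of them can be busy at the same time."""
    return topo.big + topo.little


if __name__ == "__main__":
    soc = Topology(big=4, little=4)
    print("CPU migration:", max_active_cpu_migration(soc))
    print("Global task scheduling:", max_active_global_scheduling(soc))
```

For the four-plus-four example, the sketch gives four usable cores under CPU migration and eight under global task scheduling, which is the headroom behind the slightly higher measured performance. It says nothing about power, which depends on how often the big cores are actually woken.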
Charles Garcia-Tobin, a software power architect at ARM, told me and other analysts assembled at Cambridge that ARM is in discussions with all the operating system vendors on how to accommodate the CPU migration and global task scheduling approaches. "Every partner is going to do their specific patches for Linux," he said. "They want to tune for their core counts. We provide the general infrastructure." Six ARM partners are working on versions of big-little based on either CPU migration or global task scheduling, and they are expected to come to market within six to nine months.
My takeaway is that global task scheduling offers the most flexible and highest-performance solution, but in practical terms, not by much. It supports asymmetric topologies, but at the cost of additional software complexity that has to interface with the operating system. It almost goes without saying that focusing on the number of cores is facile when so many layers of software, if done badly, can hobble their performance.
It also has to be considered that, although global task scheduling may bring an extra level of design complexity, it is not the most general case of heterogeneous multiprocessing, which must also take into account GPU cores, caching schemes across multiple ISAs, and hardware accelerators. It is a good thing engineers thrive on complexity.