Those of us who earn our livelihoods working with embedded signal processing software understand that optimization isn’t just something we do for fun (though often it is quite fun). Optimization is usually done to meet hard real-time constraints, trim product costs, or stretch battery life.
We all know that the complexity of the processor core has a significant impact on how difficult the code will be to optimize. But more and more, it's critical to consider the complexity of the whole chip as well. I was painfully reminded of this a few months ago when my colleagues and I set out to optimize some video software for an ARM9E-based SoC.
Since the ARM9E processor is quite simple and we already had plenty of experience with video algorithms, we figured the optimization process for this project would be straightforward. No such luck. The first challenge we faced was the lack of a cycle-accurate, chip-level simulation model. Without such a model, it's very difficult to determine where the cycles are being spent, and therefore which sections of the code would benefit most from optimization. Models are available for various parts of the chip: the core, the buses, memory (including L1 cache and L2 DRAM), and the peripherals. But if you want to tie them all together, you're on your own. For us, as for most engineers trying to get a software product out the door, building our own SoC simulation model wasn't a practical option. So instead we got a development board and hooked it up to an emulator with real-time trace capability. Problem solved? Well, no.
The emulator's trace facility initially couldn't keep up with the processor running at full speed, so we had to dial down the processor clock. But this changed the ratio of the CPU clock to the various bus clocks, making cache misses appear less expensive than they really were. It made no sense to optimize the code for cache behavior that would change once we were running in real time; we needed another solution. With help from the vendor we got tracing working at full speed, but then faced other obstacles. The trace would tell us that a stall had occurred, but not exactly where it occurred relative to our code or why it was happening. We had to devise experiments to map the traces back to what was actually going on in the code. We did finally manage to get the code tightly optimized and meet our application's constraints, but the process was far more painful than we had expected.
The issues I've described here aren't unique to ARM-based chips. They are largely the result of increasing chip complexity and the lack of correspondingly sophisticated, accurate, readily available chip-level simulation, tracing, and analysis tools. Software development tool vendors need to recognize that, for complex chips, their tools aren't giving software developers the information they need. Without this information, optimization isn't effective, efficient, or even fun.