Have you ever profiled your application and asked yourself why a single-line C code function takes hundreds of cycles to execute? If so, you may be asking the wrong question. A better question to ask may be, "Why do I have a single-line function to begin with?"
With today's cache-based memories and long processor pipelines, the partitioning and structuring of code strongly affects its performance. Seemingly arbitrary software architecture decisions can have a significant impact on performance. In particular, code locality is a determining factor in application performance. This article will address code locality and other software architecture issues with the intent of maximizing instruction cache performance.
Cache memories work under the assumption that the processor core accesses the memory system in a local manner. We distinguish between two types of locality:
Temporal locality – the idea that if a memory address is accessed, it is likely to be accessed again in the near future. For example, loops fetch the same instructions over and over. As another example, calling and returning from functions causes stack memory to be accessed repeatedly.
Spatial locality – the idea that if a memory address is accessed, nearby addresses are likely to be accessed in the near future. In the absence of a change of flow (COF), such as a conditional branch, a program typically accesses memory sequentially. When processing an array of data, an access to array member i will very likely be followed by an access to member i+1, with both members located sequentially in memory.
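To make the two kinds of locality concrete, here is a minimal C sketch; the function name, array size, and loop are illustrative rather than taken from the article:

```c
#include <stddef.h>

#define N 1024

long sum_array(const int data[N])
{
    long sum = 0;

    /* Spatial locality: data[i] and data[i+1] sit in adjacent memory,
     * so one cache line fill serves several consecutive iterations.
     * Temporal locality: the loop's own instructions are fetched from
     * the instruction cache on every iteration after the first. */
    for (size_t i = 0; i < N; i++)
        sum += data[i];

    return sum;
}
```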
Figure 1 shows the memory access pattern of an example application. The X axis corresponds to time (in cycles) and the Y axis corresponds to memory locations. The highlighted sections show examples of temporal and spatial locality.
Figure 1. Program access pattern example – good locality.
Figure 2 shows an access pattern with poor locality. We can see that the core accesses a relatively large memory footprint without a specific order, giving the pattern poor spatial locality. We also see little temporal reuse, which makes this code cache-unfriendly.
Figure 2. Program access pattern example – bad locality.
The name of the game in attaining high cache utilization is reuse. The more an application reuses a section of memory, the better the utilization. For that reason, our objective is to reduce the dynamic memory footprint, that is, the size of the memory brought into the cache at runtime.
Another way to increase utilization is to leverage advanced cache features such as prefetch. These features hide memory latencies and squeeze even more performance out of the hardware. They work best when code is accessed by the core as linearly as possible.
To reach these goals we need to look carefully at how our code is fragmented in memory. We will define two types of fragmentation: function-based and code-based.
Function-based fragmentation – When a parent function calls a descendant function, there may be a penalty for two reasons:
- If the instructions of the descendant function are not in the cache, a cache miss will occur, resulting in processor stalls.
- When a change of flow (COF) occurs, the pipeline may need to be flushed, resulting in significant cycle penalties.
We will look at this topic in more detail in part 3 of this series.
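One common mitigation, echoing the single-line-function question at the top of this article, is to let the compiler inline trivial callees so that no call, and therefore no COF, occurs at all. A minimal sketch, with hypothetical function names:

```c
/* A one-line accessor. Defining it static inline lets the compiler
 * absorb it into each caller, removing the call/return changes of
 * flow and the risk of a miss on the callee's code. */
static inline int clamp_to_byte(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : v);
}

int scale_sample(int sample)
{
    /* With inlining, this compiles to straight-line code: no branch
     * to a distant function body, no extra cache lines fetched. */
    return clamp_to_byte(sample * 2);
}
```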
Code-based fragmentation – Functions often contain a significant amount of code that deals with corner cases, that is, cases that are rarely executed. This means that many of the instructions read into the cache are rarely executed, reducing cache utilization. In addition, corner-case code often introduces additional changes of flow, increasing the number of possible cache misses.
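As an illustration, the sketch below moves a corner case out of the hot path into a separate non-inlined function, so its instructions never occupy cache lines fetched for the common case. It assumes the GCC/Clang extensions __builtin_expect and the noinline/cold attributes, and the function names are made up:

```c
#include <stdio.h>

/* Rarely executed corner-case handling, kept out of line so its
 * instructions are not pulled into the cache with the hot path. */
__attribute__((noinline, cold))
static int handle_overflow(int value)
{
    fprintf(stderr, "overflow: %d\n", value);
    return -1;
}

int process(int value)
{
    /* __builtin_expect marks the branch as unlikely, so the compiler
     * keeps the fall-through (hot) path sequential in memory. */
    if (__builtin_expect(value > 10000, 0))
        return handle_overflow(value);

    return value * 3;   /* common case: straight-line code */
}
```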
Function-based and code-based fragmentation combined can cause very poor locality, and as a result, very poor cache utilization. Next we'll look at strategies for avoiding these problems and obtaining optimal performance.
Runtime vs. initialization/corner-case code – First and foremost, clearly divide your code between runtime code (i.e., frequently executed code) and initialization/corner-case code. Runtime code should contain only critical code that is likely to be executed whenever your algorithm runs. Eliminating corner cases from your runtime code will minimize your dynamic memory footprint.
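One way to enforce this split, assuming a GCC- or Clang-style toolchain, is to tag initialization code with a section attribute so the linker can group it away from the runtime code. The section and function names below are illustrative:

```c
/* Initialization code placed in its own section so the linker can
 * locate it apart from the runtime code; ".text.init" is an
 * illustrative section name, not a requirement. */
__attribute__((section(".text.init")))
void codec_init(void)
{
    /* one-time setup: tables, hardware registers, etc. */
}

/* Runtime (hot) code stays in the default .text section, so the
 * frequently executed functions remain packed together. */
int codec_process(int sample)
{
    return sample >> 1;
}
```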
Linker optimization/code positioning – Functions should not be linked in an arbitrary order. The developer should think about the ordering of functions in memory, at least for the runtime code, taking into account the algorithm's typical real-time function call tree.
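As a sketch of this idea, again assuming a GCC/Clang toolchain and a GNU-style linker script, each runtime function below is tagged with a section named after its position in the call tree so the linker can lay the functions out in call order. All names are hypothetical:

```c
/* A linker script can then place the sections sequentially, e.g.:
 *
 *   .text : { *(.text.stage1) *(.text.stage2) *(.text) }
 *
 * so that a caller and its callee end up adjacent in memory. */
__attribute__((section(".text.stage2")))
int filter(int x)
{
    return x - (x >> 2);
}

__attribute__((section(".text.stage1")))
int decode(int x)
{
    /* decode() calls filter(); placing .text.stage1 immediately
     * before .text.stage2 keeps the call target on nearby, possibly
     * already fetched, cache lines. */
    return filter(x ^ 0x55);
}
```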