Design Article
Optimizing for instruction caches, part 3
Mircea Livadariu and Amir Kleen, Freescale
11/12/2007 3:00 AM EST
As we saw in the first two parts of this series, cache optimization is often based on the rearrangement of data and instructions. In this article we discuss an optimization technique for instruction caches based on the placement of functional blocks of code in memory.
The Optimization Concept
The main idea behind code placement is to position functional blocks in memory based on the program's function call tree flow. Linking functions in this manner allows us to harness spatial locality in memory accesses patterns. (For definitions of locality, see part 1.)
Modern caches have mechanisms which prefetch memory that is in the vicinity of the currently executing block. During the execution of a function, code placed after it in main memory is brought into the instruction cache. If the functions have been linked wisely, instruction cache misses can be minimized, resulting in a high cache hit rate.
As an example, consider the function call tree flow in Figure 1. Suppose function F2 is linked near function F1, but function F3 is not linked near F1. When function F1 calls F2, it is possible that F2 is already in the cache and that there will be no cache miss. (The likelihood that F2 is in the cache depends on the sizes of F1 and F2, as well as the location of the call inside F1.) In contrast, there will probably be a cache miss when F1 calls F3. Because F3 is located far away from F1 in memory, it is not likely to be in the cache when F1 calls it.

Figure 1. Function call tree flow example.
The Optimization Method
We saw that rearranging functions in memory can reduce instruction cache misses. The techniques of code positioning (which are performed at the linker level) can be extended to basic blocks of code (which can be performed at compiler level). A basic block is a linear sequence of code with one entry point, one exit point and no jump instructions.
Software optimization techniques using code positioning typically involve the following steps:
- Analyze the behavior of the code
- Generate a new function placement that yields better performance
To analyze the behavior of the code, we profile code using a variety of input data and observe the resulting path of code flow. We can evaluate the code flow for either the most likely code path (typical case) or the most cycle-consuming path (worst case). To evaluate the typical case, we build a tree with every node or leaf representing a function. A node's children represent the functions called by the parent node function, and are placed left to right in the order in which they are invoked by the parent. If a function is called from several places, it is placed only once in the tree in the first place it is called.
The functions are numbered in the order that they are first called. If the tree is traversed using the depth-first (DF) method and the numbers are recorded in the nodes, we should get 1,2,3,4,.,last_number, where last_number equals the number of functions involved.




Comments
zubrowa
11/14/2007 9:36 AM EST
Hi,
I'm little bit confused about the effectiveness of the memory layout achieved by the described algorithm. Unfortunately, there's no detailed reasoning about why the found chains improve I-cache performance.
For example, I don't see an advantage of the third chain 3-8-12-18. After executing node 8, a large amount of instructions of other nodes will be fetched into the cache before node 12 is executed. So, prefetching would not help here (or one assumes that all considered functions are very small which is not realistic).
Can anyone shed some light on this?
Sign in to Reply
mirceal
11/21/2007 4:22 AM EST
Hello,
First of all this is a method aimed at giving a guideline as what kind of optimizations can be done (it is not a general solution to an entire type of problems all involving Instruction Cache Optimizations)
You are right when you say that the size of the functions matters but taking into account that this would make the method far more complex I have prefered to leave it out of the "equation".
Another thing that you must have in mind is that the function 12 is NOT brought into the memory after function 8. The first part of it is when the execution flow reaches the END of the function 8 (let's say at somewhere around the return point of the function 8). How many functions are executing between bringing function 12 into the memory and the actual exection?
The major problem that I see with this method is a function being executed more than once and being evacuated from the cache between calls.
Yet another thing that makes out equation more complicated is that usually caches are n-way associative. Add to that the length of the cache line and the fact that the prefetch mechanism may induce an overhead on the system bus (if prefech is even available) and you will see how tough the general problem really is.
Regards,
Mircea
Sign in to Reply