Tuning C/C++ compilers for optimal parallel performance in multicore apps: Part 1
Max Domeika
2/18/2010 9:36 PM EST
One of the better and easier methods of improving application performance is to apply aggressive compiler optimization. A compiler that targets your processor and features advanced optimizations such as automatic vectorization, interprocedural optimization, and profile-guided optimization can substantially improve the performance of your application.
In addition, applying the usability features of your compiler, such as those that aid compatibility and reduce compile time and code size, can substantially improve development efficiency.
Why scalar optimization is so important
An absolute prerequisite for parallel optimization is highly tuned scalar performance. To understand why, consider a hypothetical performance optimization project with the following requirement: parallel optimization must provide a 30% increase in performance over the scalar version of the application.
A development team, Team M, is created to develop a prototype of the application that employs parallel optimization and multi-core processors to meet the performance requirement. Another team, Team S, is created to see how much scalar optimization techniques can improve performance. Each team prototypes their improvement and a performance comparison is obtained.
Figure 5.1 below is a graphical representation of the performance obtained by the different teams. As you can see, Team M increased performance by 43% over the original code. Team S increased performance by 11% over the original code. The question is: did Team M meet its goal?
Figure 5.1: Hypothetical scalar versus parallel performance improvement
The last column in the graph shows the performance improvement comparing Team M's results against Team S, which can be considered a new scalar version of the application.
Strictly speaking, Team M did not meet the goal. The performance difference between Team M and the new scalar version (Team S) is only 29%. Now it could be argued that the goal should have been clear in specifying that the original version of the code should be used for the performance comparison, but the reality is we are discussing end results.
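The 29% figure comes from dividing the two speedups rather than subtracting the percentages. The following is a minimal sketch in C of that arithmetic, assuming the percentages in Figure 5.1 are gains over the original code where higher is better. The function names are hypothetical and exist only for illustration.

```c
/* Gain of one version over another, in percent, when both gains are
 * expressed as percentage improvements over the same original code.
 * Assumption: "performance" is a higher-is-better metric. */
double relative_gain_pct(double candidate_pct, double baseline_pct)
{
    double candidate = 1.0 + candidate_pct / 100.0; /* e.g. 1.43 for Team M */
    double baseline  = 1.0 + baseline_pct / 100.0;  /* e.g. 1.11 for Team S */
    return (candidate / baseline - 1.0) * 100.0;
}

/* Returns 1 if the candidate beats the baseline by at least goal_pct,
 * and 0 otherwise. */
int meets_goal(double candidate_pct, double baseline_pct, double goal_pct)
{
    return relative_gain_pct(candidate_pct, baseline_pct) >= goal_pct;
}
```

With Team M at 43% and Team S at 11%, `relative_gain_pct(43.0, 11.0)` evaluates to roughly 28.8, so `meets_goal(43.0, 11.0, 30.0)` returns 0. Measured against the original code instead (`baseline_pct` of 0), the same 43% gain would clear the 30% goal.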
If the scalar optimization work could be accomplished with minimal effort and the parallel optimization effort required a large amount of resources, the parallel optimization effort may be terminated. If Team M had known about the performance headroom offered by scalar optimization, Team M could have perhaps applied parallel techniques on more of the application to meet the performance goal.
The importance of compilation
C and C++ compilers are widely used for embedded software development [1]. Compiler optimization can play a large role in increasing application performance, and employing the latest compiler technology specifically targeting your processor can provide even greater benefit.
Figure 5.2 below is a comparison of two different compilers and several optimization settings executing SPEC CINT2000 [1] on an Intel Pentium M processor system. The two compilers employed are labeled "comp1" and "comp2", respectively [2].
Figure 5.2: Compiler and target architecture comparison
Four different optimization settings were employed when comp1 compiled the benchmark. Two different optimization settings were used when comp2 compiled the benchmark. The compiler comp1 targeting an Intel486 processor serves as the baseline and is represented by the far left bar in the graph.
Using comp1 with the -O3 option and targeting an Intel486 processor produces a 4% performance improvement. The -O3 option provides greater optimization over the default optimization setting. Using comp1 with the -O3 option and its default compiler processor target results in a 12% improvement. Using comp1 with the processor target of an Intel Pentium 4 processor results in a 17% performance improvement.
Employing the second compiler, comp2, and targeting the Pentium M processor leads to a 38% performance improvement. Finally, using comp2 targeting the Pentium M processor and advanced optimization leads to a 62% improvement.
The baseline compiler and option setting (comp1 targeting an Intel486 processor) represents legacy applications that employ dated compilation tools and target older processors while relying solely upon new processors to provide application performance improvement.
The comp2 configuration that targets the Pentium M processor and uses advanced optimization represents a compiler with the latest optimization technology and optimization that specifically targets the processor used to execute the application in the field. Terms such as advanced optimization and processor target are explained more fully in later sections of this two-part series. But two points should be clear from this data:
1) Legacy applications that have not been recompiled for new processor targets may be sacrificing performance; and
2) Employing a high-performance compiler specifically tuned for your architecture can lead to big performance improvements.
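As a rough illustration of the kind of code where processor targeting matters, here is a minimal C99 sketch of a loop that an optimizing compiler may be able to vectorize automatically. The function name is hypothetical, and the build options mentioned in the comments are GCC options used purely as an example, since the article does not identify comp1 or comp2.

```c
#include <stddef.h>

/* Computes y[i] = a * x[i] + y[i] for n elements.
 * The restrict qualifiers promise the compiler that x and y do not
 * overlap, which removes one common obstacle to vectorization.
 *
 * Assumed example builds (GCC, not the compilers measured above):
 *   gcc -O2 -c kernel.c                  default processor target
 *   gcc -O3 -march=native -c kernel.c    target the build machine's CPU
 * Whether the loop is actually vectorized depends on the compiler
 * and the target; inspect the generated code or the compiler's
 * optimization report to confirm. */
void scale_and_add(size_t n, float a,
                   const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The source is identical in both builds; only the options change. That is the situation point 1 describes: a legacy binary built for an old target cannot use instructions that the newer processor provides until it is recompiled.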
For developers who use C and C++, knowing the performance features available in their respective compilers is essential. This chapter describes performance features available in many C and C++ compilers, the benefits of these features, and how to employ them.
In Part 1 of this two-part series, I will detail some of the C and C++ compiler performance features, such as general optimizations, advanced optimizations, and user-directed optimization. In Part 2, I will go into detail on the process to use when optimizing your application and discuss usability features that can aid compatibility, as well as methods for reducing compile time and code size. With these techniques, embedded developers can extract higher performance from their applications with less effort.





