Design Article
When good compilers go bad, or What you see is not what you execute
Paul Anderson and Thomas W. Reps
2/3/2010 12:00 AM EST
The source-code representation of computer programs is often thought of as the supreme authoritative, precise, and unambiguous specification of what a software program does when it executes. Of course, plenty of errors can occur in source code, and static-analysis tools that find such errors can be effective in pinpointing where the problems are.
However, tools for analyzing source code have a key weakness: computers don't execute source code; they execute machine-code programs that may be generated from source code. The WYSINWYX phenomenon (What You See Is Not What You eXecute) refers to the mismatch between what the source-code description seems to indicate and what is actually executed by the processor.1 The consequence of WYSINWYX is that source-code-analysis tools are fundamentally blind to some kinds of code weaknesses--weaknesses that can only be detected by directly analyzing the machine code.
Compilers introduce WYSINWYX effects for several reasons. Sometimes they're created by machine-code optimization. Another reason is that the compiler author may have interpreted the source-language specification in an unexpected way. Some effects can even be maliciously introduced. Finally, compilers are themselves fairly complex programs and as such may have their own bugs.
In this article, we'll give describe some of these effects and their consequences with a few real-world examples. The cure for WYSINWYX is to use tools that analyze machine code directly, an approach we've taken in our research. We'll discuss some of the challenges we faced.
Incurable optimizations
A classic example of the WYSINWYX effect was found during a security review at Microsoft and reported by Michael Howard.2 When writing secure code, a good guiding principle is to limit the lifetime of sensitive data. This reduces the risk that the data can be retrieved by an attacker or leaked by accident, such as in a crash dump. A common technique is to overwrite the sensitive data with zeroes. In this example, the requirements were that the code should read a password into a buffer, use it to authorize some kind of secret operation, and then scrub it from memory. The code (slightly simplified) looked something like the code in Listing 1.
![]() Click on image to enlarge. |
The programmer's intent was sound, and few reviewers would say this code had a problem. The problem is not in the source code. When compiled with optimization, the call to memset is removed, so the function returns without scrubbing the password, and this sensitive information is left sitting on the stack. The compiler removes that function call because it uses a standard optimization technique called dead-store removal. It observes that the buffer goes out of scope at the end of the function, and the zeroes written by the call are never read, so it concludes that the entire call to memset is redundant and removes it.
Howard suggests three solutions to prevent the optimizer from removing the call:
• Touch the password after the call to memset so the data appears used,
• Replace the call to memset with something that will not get optimized away, or
• Turn off optimizations for that code.
For the first solution, he suggests adding the source code in Listing 2.
![]() Click on image to enlarge. |
Unfortunately, this fix only helps a little! Some compilers are sophisticated enough to recognize that this statement only touches a single byte of the buffer and therefore conclude that the call to memset can be transformed into code that zeros out only the first byte.3
Other functions could be called to zero out the buffer, but some suffer the same disadvantage as memset. As a consequence of finding this problem, Microsoft introduced the SecureZero Memory() function, whose implementation guarantees that the buffer will be cleared.
A second example where the optimizer unwittingly introduced a vulnerability into code that was otherwise reasonable came to light recently when it was used to demonstrate an exploitable vulnerability in the Linux kernel.4 In this case, the offending code is shown in Listing 3.
![]() Click on image to enlarge. |
In this case, the source code is not entirely innocent. It contains contradictory assumptions: if tun could in fact be NULL, then the first line will (under normal circumstances) cause a null-pointer dereference and the kernel will crash. If tun could never be null, the entire if statement is redundant because the condition will never be true. This is a kind of weakness that can be found by some advanced static-analysis tools for source code (including CodeSonar, the tool we work on).
The WYSINWYX effect arises because the compiler notices this redundancy and eliminates the if statement entirely. Although usually reasonable, removing the if statement in this case introduces the vulnerability because an attacker can memory map the NULL address to user-space and from there take control of the kernel.







rbv
2/4/2010 1:01 AM EST
The first two categories of problems really can't be called compilers gone bad, nor are they introduced only during the compile step (conversion to machine code). Many modern CPUs, especially but not exclusively the x86 instruction set architecture and 64-bit derivatives, do out-of-order execution which is the moral equivalent of C not specifying the order of operations in many expressions. And looking at the machine code won't help, you need to know the allowed transformations of the ISA's memory model which the CPU may perform at execution time (and may perform differently on subsequent passes as it collects branch prediction miss data).
Ultimately these can only be described as source code bugs due to making invalid assumptions about the memory model. Also, many compiler optimizations make memory model effects more apparent (e.g. inlining makes a much larger block available for memory read reordering) and might cause users to improperly blame the compiler.
I do understand that memory model issues are outside the scope of this article, but they do disprove the claim that "With machine code, however, all of these ambiguities disappear because the compiler has already committed to a choice."
Sign in to Reply
vapats
2/4/2010 7:39 AM EST
This is a big reason why I prefer FORTH-based language systems; the slight performance cost is offset by the absolutely guaranteed machine-code sequence.
Out-of-order execution effects are visible and predictable to the assembly-code author (as are pipeline latencies and cache behavior), and thus do not pose a real problem -- at least, to anyone who has actually read the CPU manual...
The assembler is the bottom layer for compilers anyway, but happily assemblers tend to be fairly bug-free -- and completely bypass C's mucky ambiguities.
So, what's next? Should we now worry about malicious microcode in the CPU?
- vic plichota
Sign in to Reply
Ches
2/4/2010 8:21 AM EST
I think "yes". For example see CPU's errata.
And even more, what about a FPGA compilers.
Sign in to Reply
DutchUncle
2/4/2010 9:20 AM EST
Back when C was first introduced, *everyone* I knew - most of us working in PL/1 or assembler on IBM mainframes - thought it was completely insane to create a language that didn't even know what size its intrinsic variables were. The sheer number of things left undefined or implementation-dependent told us this was a researcher's curiosity, despite the obviously good ideas of Unix that were demonstrated with it.
35 years later, C has somehow continues to be the lingua franca, and people keep finding the same problems. (I think Pascal had enough additional rules enforced to be better - PL/1 just wasn't quite enough.)
Some of the examples are due to lack of definition. Does "volatile" mean "this may change while you're not looking, so make sure to read it before using" ? Or does it mean "this variable is hardware / special / magical"? If the former, optimizing away the load/store is completely correct. There's no way to *specify* that you are storing to, say, a UART port, and OF COURSE it's stored but never read. At least it's philosophically clear to everyone nowadays that the attribute belongs on the VARIABLE, so that the compiler can make the right decision automatically, rather than putting the burden on the programmer remembering to turn optimization on and off around the operation!
Sign in to Reply
KarlS
2/4/2010 1:53 PM EST
My background is computer design/architecture with some programming going back to IBM mainframes and PL1. Soon I will have a demo of a new cpu that executes "C" statements and expressions directly without compiling to a typical machine language. The execution takes very few clock cycles so there is no need for optimization, pipelining, and branch prediction so those associated concerns go away. It is a cpu, not a C to H thing so the hardware can be verified once and reused as any other soft core. The hardware size is small enough so that multiple cpus can be used instead of an RTOS on a single cpu.
So now a look at C. The statements and expressions are about the same as any language, so the variable typing and memory management seem to be the real issues. Also some may think the lack of an intermediate language and a JIT compiler are negatives. The need for recursive calls and the amount of parameter passing on the call stack needs to be analyzed, but there can probably be multiple call protocols according to application needs.
The software that parses the source and generates the control memory contents is planned to be open source. The user will be able to see how the source is processed and the simulation step will show the flow of data in the run phase.
Altera's SOPC builder has the capability of building a multi-cpu system so making the cpu core a component in SOPC builder essentially completes the process.
A few bugs have to be fixed before the demo can run.
Thanks for your comments.
Sign in to Reply
vapats
2/5/2010 11:21 AM EST
Good luck, KarlS; I need WYSIWYG, and can get it easily without any extravagant effort.
HLLs have a wonderful power of abstraction, but embedded systems require a closer focus.
DutchUncle's comment says it perfectly: why should C (and it's derivatives) be the lingua franca? I suspect that it's a matter of convenience, or laziness... "Why should I care about the hardware details?"
You should care because if the system fails, people will be hurt -- and the wiki cannot be trusted -- you must do your own painful research -- or are you afraid of taking responsibility?
- vic plichota
Sign in to Reply
willc2010
2/8/2010 12:18 AM EST
The solutions and non-solutions to the password problem that are discussed in the article are hacks that work by exploiting the compiler's degree of knowledge or ignorance of certain particular facts, such as the effect of the 'SecureZeroMemory' function. There is no reason in principle why memset should be any different - the compiler just happens to have been specifically coded for it. This is one example of the many peculiar 'guru' facts that you have to be aware of in order to use a given C installation with even a minimal degree of safety.
The fundamental issue is that there needs to be a clear distinction in the language specification between the internal logic of the program and its points of contact with its environment. So, the solution to the password-blanking example should be to declare that the password array is volatile, i.e. that it is such a point of contact (since we don't want that memory snooped in some way). This requires, in turn, a precise definition of 'volatile' that specifies that all reads and writes of a volatile variable that appear in the program text must be performed literally, and no others. Furthermore, compilers should routinely be tested against an independent test suite that users can have confidence in. This would also address the problem discussed in the 'WATCHDOG = WATCHDOG' example.
The second example (the NULL pointer) is a consequence of the NULL value being used to indicate an error, and then relying on a NULL dereference to cause a segmentation fault (or similar), which is system-specific and not reliable anyway. The solution is to define the effect of a NULL dereference in the language itself and require it to be checked for at run time in cases where it cannot be proved impossible statically.
The 'newstr = (char *)malloc(strlen(s)) + 1' bug is just another example of the deep-rooted fragility of the whole C approach. It should not even be possible to compile such a thing. To allow it, and then rely on sophisticated (but not 100% reliable) tools to detect it, is perilous.
At the risk of being tedious, the password array and watchdog problems are addressed by the defined Ada concept of a volatile variable, the effect of a null pointer is defined in Ada, and the malloc problem would not arise. I use Ada as the most obvious example of a mature alternative to C for embedded programming, not because it is the only possibility. Many, maybe most, C programmers do not realise how deep the problems with the C language family go. Tools can help to some extent, but the basic problems are too fundamental to the language for it to really be fixed.
Sign in to Reply
rbv
2/8/2010 4:44 PM EST
@vapats: "to anyone who has read the CPU manual"... hello? We're talking about programmers who couldn't even cope with C's ambiguous order of operations, and you think they've read the CPU's memory model, let alone understood it? I think not.
@KarlS: "executes "C" statements and expressions directly without compiling to a typical machine language"... "The software that parses the source and generates the control memory contents". So what do you call that control memory contents if not "machine language" and what do you call that software if not "a compiler"? Oh, I see, it's not a "typical" machine language because it's a machine language never before seen, so there's no hope of tools such as those described in this article being able to analyze it. There may be advantages to using a brand-new instruction set, but you've completely failed to convince me that they outweigh the costs.
Sign in to Reply
stevev6
2/15/2010 11:31 AM EST
The use of reverse engineering here is commendable. I have long believed that we have to close the feedback loop and reverse engineer the code back into requirements (eg UML) in order to get automatic improvements.
Georgia Tech has done a lot of reverse engineering work in this area.
What I don't understand is why these guys are still walking the streets. Bush's "Patriot Act" and the DCMA should have put them out of the reverse engineering business long ago.
Sign in to Reply
KarlS
2/19/2010 10:43 AM EST
Part of my purpose is to eliminate the need for the kind of tools in the reference. If there is no "typical" compile process that generates code that is taken out in optimization, but rather a simulation/emulation that shows the execution process you truly see what you get. So far the responses range from "huh What?" to "it's just another C2H gadget". Technically, it is a compiler although that brings the assumption of Generate assembly, optimize, and so on. What really happens is that the ram contents are microprogram controls and references to ram locations that contain data/variables and control bits for the next cycles operation. The data flow is designed to allow a straight forward conversion from C code to "machine language". To me. instead of "an instruction set never before seen", "the instruction set is if, else, for, while, do/while, with a function call sequence". In one other comment I mentioned that a compiler can do unexpected things and immediately got a reply that the compiler only does exactly what it is told. Now we have this article describing research and tools to find such things. I still have bugs in the function call, but should soon have a sample output that shows how few cycles it takes to execute C statements. The objective is to minimize the functions that currently are done in HDL because of time constraints, therefore those applications that have an acceptable amount of HDL do not need my design.
So about the costs: I see this as another custom component in something like Altera's SOPC builder, so it is just another block that can be selected and more or less connected automatically just as their NiosII processor. You no doubt have other concerns, but until I at least get the demo running to show the potential performance cost trade offs cannot be done.
Sign in to Reply
Electro8
1/3/2011 10:58 PM EST
Normally I don't get into these conversations, but this was too good to pass by ---- come on guys, if the programmer doesn't understand the behaviour of the processor at the machine level, C should be a consideration anyway, but Pascal or Basic.... ;-) and Dutch Uncle, we _are_ contemporaries -- Fortran IV // APL360 // PL1 // pForth (Vic) etc... and C for as long as it's been out there . One of the major advantages of C (aside from pointers) is that you DO have easy access (inline and otherwise) to the hardware (better than doing pure assembly...). The horsepower of the pointers and the freedoms allowed is probably what made it the 'lingua franca'; but to underscore your points, those freedoms *demand* a dicipline that quite frankly doesn't seem to be taught today in schools that only understand GB ram, TB storage, and OOP that doesn't have any classical programming foundation (Plauger / K&R / etc) to support it. My take is, first see if they can do something constructive with a 4004 or even a TMS9900 THEN let them play C on a DSP, _then_ they can play with the big toys . . .
BR,
SteveW
Sign in to Reply
Electro8
1/3/2011 11:00 PM EST
Apologies, that should have read "C should NOT be a consideration anyway....
Sign in to Reply