In this blog post, I will discuss the data model that ILSpy uses to store the program and how we are ensuring that the model accurately represents the original program. One of the main goals of the decompiler is to ensure that the decompiled program accurately represents the original assembly. You should be able to trust that the produced C# does indeed represent the program. In the last post, Daniel has briefly described how the type inference works. It is a lot of work to turn 0 to "false", but it is necessary to ensure that the output is correct rather than merely looking nice. In this post, I will describe how the stack-based assembly is translated into variable-based AST form. That is, the stack-based data flow is completely replaced by nested expressions and temporary variables. This is the very first step of the decompilation so it is important to get it right.
For a comparison, I have tried to decompile a program with non-trivial control flow in the .NET Reflector to see what code it produces. It created a reasonably nice output which was however completely wrong (see the end of this post). None of the difficult problems are addressed - it just takes a best guess and hopes for the best. This is not the approach that we would want to take.
Back to the ILSpy - we call the intermediate data model "ILAst" which stands for Intermediate Language Abstract Syntax Tree. Abstract Syntax Tree (AST) is the standard name given to the data structure used by virtually all compilers to represent high-level form of the source code. The compilers start with the AST and eventually convert it to some low-level assembly. The assembly code is register-based when compiled directly for the processor and it is usually stack based when it is destined for some virtual machine.
For example, let's consider the expression 'x = a * b + c'.
The compiler will parse the expression and represent it using an AST similar to this:
This is one valid representation of the data. If you are comping to a native assembly, the code will be eventually translated to register-based assembly like this:
mov r1, &a
mov r2, &b
mul r1, r1, r2
mov r2, &c
add r1, r1, r2
store r1, &x
If you are compiling for .NET you will get a stack-based assembly like this:
The job of the decompiler is to convert the assembly to readable C# code. To make this easier, the very first thing we do is to convert the assembly to AST. We call it ILAst, because it is not a rich AST like one a compiler would have - it does not have a node type for every language construct - it stores everything as "ILExpression" instead. The semantics of the ILExpression is defined by storing the IL opcode in it (for example, ldloc). This data model is simple to work with and maps very well to C# code.
So how can we convert from the stack-based assembly to ILAst? Fortunately, it is much simpler than if the assembly was register based. If I were to ask you to store the AST tree of 'x = a * b + c' (see diagram above) to disk in an compact way, how would you do it? We know how many arguments each node has (for example, "add" has two, "stloc" has one). My first approach would be to store the identifier of the node and then store its arguments after it. So I would store it as the following sequence (the brackets are there only for clarification):
"st X" ("add" ("mul" "ld a" "ld b") "ld c")
This is almost exactly what the stack-machine does! The only difference is the arguments are stored first and then followed by the identified. So the sequence is:
(("ld a" "ld b" "mul") "ld c" "add") "st X"
This is a great way to store the program because it very close to the original AST, the order of instructions is in the order in which it will be executed and it can be easily described as a stack-machine.
That was easy so far, so here is the catch - .NET allows the program to branch when the stack is not empty. This is generally useful, but makes the job of decompiler more difficult because it is no longer trivial to convert the program to ILAst (or any other high level form).
So here is an example of what a control flow of a program might look like:
Translating this to AST is not as easy as in the previous case. It can not be represented as a single expression and it can not be trivially separated into separate expressions because there is data dependency between the nodes. However, in this case the solution is easy. We store the result of the "ld a" into a temporary variable and then use the temporary variable for the stores. In fact, this approach can be easily generalized - whenever something is pushed on the stack, store it in temporary variable, and whenever something is loaded from the stack, load the appropriate temporary variable (it is not that difficult to find which one it is). So the 'x = a * b + c' would look like this:
tmp1 = ldloc a
tmp2 = ldloc b
tmp3 = mul (tmp1, tmp2)
tmp4 = ldloc c
tmp5 = add (tmp3, tmp4)
stloc x (tmp5)
A few inlineing optimizations and we have the original AST.
Unfortunately, there is one more problem - the control flow can also converge:
We have to select a temporary local variable based on which branch was taken. This problem is well known in the Single Static Assignment (SSA) form that compilers often use (See http://en.wikipedia.org/wiki/Static_single_assignment_form). The solution is to define a single temporary variable and assign to it at the end of each branch. The store will then use this temporary variable.
In general, the problem is quite complicated because the result of the branch might not be immediately consumed - the branch might be nested in another branch. Also, the converging and diverging control flows can occur at the same time making all the interactions quite complex.
So without further ado, let's discuss to the approach that ILSpy uses. It is not necessarily the most optimal solution, but it is a correct one. The reasoning is that these cases should happen extremely rarely and the decompiler should be able to completely optimize them away. So the goal is to just make sure the decompiler produces the correct output if some exotic compiler or obfuscator is used.
Here is an example of a quite a complicated control flow. The "ld a" and "ld b" converge and then the result is either used or converges again with "ld c". In the later case, the control flow diverges before the value is used.
ILSpy assigns a temporary variable for each argument (input) of an instruction. So in this case, there will be three temporary variables - one for each of the store instructions. When a load instruction is executed, its result will be stored to all the necessary temporary variables. For example, the value of "ld a" may end up in any of the stores so its result will be assigned to all three temporary variables right after it is executed. "ld c" can be assigned to only two of the temporary variables. Of course, this is the worst case scenario - in most programs, the result will be written to only a single temporary variable and so the variable will be most likely be trivially optimized away by inlineing.
So how do we find what "appropriate" variables to assign to? We find the stack state before and after each instruction. Let's say that the stack is empty just before the "ld a" instruction. Right after the instruction is executed, the stack will contain one value. We do not know what value, but we know that there will be one and that it had to be pushed on the stack by the "ld a" instruction. We note that. In a similar fashion, we analized the stack after "ld b". When we get to the branch which immediately follows these two instructions, we know that the stack might be in two different states depending on where the control flow came from. The stack state before the "br" contains one value and the value was pushed on the stack by either "ld a" or "ld b". We do not know which one, so we just note both of them. We do this kind of analysis for the whole program and we will find that the stack before the "st z" will contain one value which could have been pushed on the stack by any of the three loads. So at this point we create a temporary variable for the argument of "st z" and tell all three of the load instructions that they have to store their result into this temporary variable. The other store instructions will do the same so at the end of the day, each load instruction will know which variables it has to store its result in. That is all. A few inlining optimizations and we have nice AST tree which is guaranteed to be correct.
Just for the record, here is how ILSpy decompiles program with the control flow we have just discussed (see the diagram above, the only difference is that the program uses return instead of store):
public static int Test(int switch1, int br1, int br2)
var expr_14 = 1;
var arg_31_0 = expr_14;
var arg_3B_0 = expr_14;
var arg_42_0 = expr_14;
var expr_1A = 2;
arg_31_0 = expr_1A;
arg_3B_0 = expr_1A;
arg_42_0 = expr_1A;
var expr_20 = 3;
arg_3B_0 = expr_20;
arg_42_0 = expr_20;
if (br1 == 0)
return arg_31_0 + 10;
if (br2 == 0)
return arg_3B_0 + 20;
return arg_42_0 + 30;
It is not nice but it can not be - the control flow is complicated. It does take some temporary variables and gotos to represent it correctly.
For comparison, here is the output from .NET Reflector:
public static int Test(int switch1, int br1, int br2)
if (br1 == 0)
return (3 + 10);
if (br2 == 0)
return (2 + 20);
return (1 + 30);
The code looks much nicer, but it is completely wrong. The non-trivial control flow made the Reflector quite confused. It is easy to see that the code is wrong - the program has three possible inputs and three possible outputs with branches in between. So there should be 8 possible results (3 * 3 - 1 because there is no path from "ld c" to "st x" and all other paths are possible). However, the Reflector's code can return only 3 values: 13, 22, 31 (ignoring the default switch case). Furthermore, the value 13 is the one for the path from "ld c" to "st x" - the one that is not supposed to happen! :-)
Well done for reading the whole post. I hope this gives you some insight into how we are ensuring the correctness and that we are taking it very seriously. Next time I will discuss how we are ensuring correctness during code movement and during restoring of high level constructs like loops and conditions.