SharpDevelop Community

Get your problems solved!
Welcome to SharpDevelop Community Sign in | Join | Help
in Search

Daniel Grunwald

ILSpy - Decompiler Architecture Overview - Part 2

The decompiler pipeline can be separated into two phases: the first phase works on a tree representation of the IL code - I described the steps of that phase in the previous post.

The second phase works on C# code, and is the topic of this blog post.

To give you a reminder: the ILAst is a tree of IL instructions, with pseudo-opcodes inserted for the higher-level language constructs that were detected. Let's take a look how the example C# code from the previous blog post looks in the final ILAst (after type inference).

Original C#:

static IEnumerable<IEnumerable<char>> Test(List<string> list)
{
    foreach (string current in list) {
        yield return (from c in current where char.IsUpper(c) || char.IsDigit(c) select char.ToLower(c));
    }
    yield return new List<char> { 'E', 'N', 'D' };
}

Final ILAst:

As you can see, ILAst already detected the following language constructs:

  • yield return
  • "while (enumerator.MoveNext())" loop
  • collection initializer

Still missing are the "foreach" construct, and the contents of the lambdas that were created by the query expression.

As with the previous blog post, you might want to run a debug build of ILSpy and load the above example into it, so that you can see the full output of the intermediate steps. Only debug builds make the intermediate steps of the decompiler pipeline available in the language dropdown.

C# Code Generation

We generate C# code from the ILAst. The C# code is represented using the C# Abstract Source Tree from the NRefactory library. This step uses the type information to insert casts where required, so that the semantics of the generated C# code match the semantics of the original IL code.

However, some semantics, like the overflow checking context (checked or unchecked) or exact type referenced by a short type name, are not yet translated into C#, but stored as annotations on the C# AST. But apart from those details, the resulting code is valid C#.

This step is probably the biggest source of bugs, as matching the IL and C# semantics to each other isn't easy. For example, IL explicitly specifies which method is being called; but C# uses overload resolution. Inserting casts so that the resulting C# will call the correct method is not simple - we don't want to insert casts everywhere, as that would make the code hard to read.

Implementation: AstMethodBodyBuilder.cs

C# Code Transforms

All the remaining steps are transformations on the C# AST. NRefactory provides an API similar to System.Xml.Linq for working with the C# source tree, which makes modifications relatively easy. Additionally, the visitor pattern can be used to traverse the AST; and NRefactory implements a powerful pattern-matching construct.

The decompiler transformations are implemented as classes in the ICSharpCode.Decompiler.Ast.Transforms namespace.

PushNegation

This transformation eliminates negations where possible. Some ILAst operations introduce additional negations - for example "brtrue(lbl, condition); ...; lbl:" becomes "if (!condition) { ... }". The transform will remove double negations, and will move remaining negations into other operators. "!(a == b)" becomes "a != b".

DelegateConstruction

So far, delegates were compiled into code such as "new D(obj, ldftn(M))". That is, a delegate takes two arguments: the target object, and the method being called. The target object is null if the method is static. This isn't valid C#, so we transform it into "new D(obj.M)". However, if the method is anonymous, then we decompile the target method (up to the DelegateConstruction step) and put the decompiled method body into an anonymous method. Or, if the method body consists of only a single return statement, we use expression lambda syntax.

Applied to our example code, we get "(char c) => char.IsUpper(c) || char.IsDigit(c)" and "(char c) => char.ToLower(c)". Now, in this example, the lambdas did not capture any variables, and thus got compiled to static methods. The transform gets more complicated if the lambda does capture variables: In this case, the C# compiler would have created an inner class (called "DisplayClass") to represent the closure. The compiler puts all captured variables as fields into that closure.

To decompile that correctly, the first part of the DelegateConstruction transform will replace any occurrences of "this" within the lambda body with the target object that was passed to the delegate. This makes the resulting code somewhat correct - but the closure is still visible in the decompiled code. For example, a method implementing curried addition of 3 integers would look like this:

public static Func<int, Func<int, int>> CurriedAddition(int a)
{
    DelegateConstruction.c__DisplayClass13 c__DisplayClass;
    c__DisplayClass = new DelegateConstruction.c__DisplayClass13();
    c__DisplayClass.a = a;
    return delegate(int b)
    {
        DelegateConstruction.c__DisplayClass13.c__DisplayClass15 c__DisplayClass2;
        c__DisplayClass2 = new DelegateConstruction.c__DisplayClass13.c__DisplayClass15();
        c__DisplayClass2.CS$8__locals14 = c__DisplayClass;
        c__DisplayClass2.b = b;
        return (int c) => c__DisplayClass2.CS$8__locals14.+ c__DisplayClass2.+ c;
    };
}

In a second step, the DelegateConstruction transformation looks for such display classes, and removes them by promoting all their fields to local variables. If one of the fields is simply a copy of the function's parameter, no new local variable is introduced, but the parameter is used instead.

So after this cleanup step is complete, the curried addition example will look exactly like the original C# code:

public static Func<int, Func<int, int>> CurriedAddition(int a)
{
    return (int b) => (int c) => a + b + c;
}

PatternStatementTransform

This step does pattern-matching. It defines code patterns for the following language constructs:

  • using
  • foreach (both on generic and on non-generic collections)
  • for
  • do..while
  • lock
  • switch with string
  • Automatic Properties
  • Automatic Events (normal events without explicit add/remove accessor)
  • Destructors
  • try {} catch {} finally

The expanded code pattern is searched for using NRefactory's pattern matching. When it is found, some additional checks are performed to see whether the transformation is valid (e.g. a 'foreach' loop variable must not be assigned to). If it is valid, the matched code pattern is replaced with the detected construct.

Since the ILAst has only one loop construct, all generated C# code initially uses only while-loops. But if a loop looks like this: "while (true) { ... if (condition) break; }", then we can change it into "do { } while (!condition);". Using NRefactory's pattern matching, the pattern we look for can be defined easily:

static readonly WhileStatement doWhilePattern = new WhileStatement {
    Condition = new PrimitiveExpression(true),
    EmbeddedStatement = new BlockStatement {
        Statements = {
            new Repeat(new AnyNode("statement")),
            new IfElseStatement {
                Condition = new AnyNode("condition"),
                TrueStatement = new BlockStatement { new BreakStatement() }
            }
        }}};

Pattern matching reuses some ideas from regular expressions: the "Repeat" node will match any number of nodes (like the star operator in regular expressions ), and the strings passed to the "AnyNode" constructor create capture groups with that names. For a successful match, the "statement" group will contain all statements except for the final "if (condition) break;", and the "condition" group will contain the loop condition.

ReplaceMethodCallsWithOperators

This step eliminates invocations of user-defined operators ("string.op_Equality(a,b)") and replaces them with the operator itself ("a == b").

It also simplifies statements of the form "localVar = localVar + 1;" to use the post-increment operator. Note that this transformation is only valid for statements -- within expressions, we would be required to use pre-increment.

IntroduceUnsafeModifier

This transformation looks through the method body and looks for any operations that are valid only in an unsafe context. If any are found, the method is marked with the unsafe modifier.

The step also contains some code readability improvements for unsafe code: "*(ptr + num)" gets transformed to "ptr[num]", and "(*ptr).Member" gets transformed to "ptr->Member".

AddCheckedBlocks

In IL, there are different opcodes for instructions with and without overflow checking (e.g. "add" vs. "add.ovf"). However, in C# the overflow checking cannot be specified on single operators, but only on whole expressions. For example, "a = checked(b + c)" will also evaluate the sub-expressions b and c with overflow checking. If those contain any IL opcodes that didn't use overflow checking, then the C# code must use unchecked expressions within the checked expression.

Code can quickly get unreadable if you do this around every instruction, so we looked for a way to place the blocks intelligently. We formulated the problem as an optimization problem, with the following goal:

  1. Use the minimum number of checked blocks and expressions
  2. Prefer checked expressions over checked blocks
  3. Make the scope of checked expressions as small as possible
  4. Make the scope of checked blocks as large as possible

We use dynamic programming to calculate the optimal solution in linear time. Essentially, the algorithm calculates two solutions for each node: both have optimal cost, but one expects the parent context to be checked, the other expects it to be unchecked. This allows composing the whole solution (global optimum) from the partial solutions (optimal solutions for each node in the two contexts).

DeclareVariables

So far, all variables were declared at the start of the method. This step aims to make the code more readable by moving the variable declarations so that they have the smallest possible scope.

This step will introduce multiple declarations for the same variable whenever this is allowable. This might happen if two loops use the same variable, but the value assigned to the variable by the first loop will never be read by the second loop.

Basically, we split up a variable whenever this is possible without triggering the C# compiler error "Use of unassigned local variable" - if the second code block ensures it always initializes the variable before reading it, it can impossible read the value assigned by the first code block. For this purpose, I implemented C# definite assignment analysis, which is surprisingly complex - the specification is 10 pages long, and makes heavy use of the reachability rules, which take another 10 pages in the C# specification.

ConvertConstructorCallIntoInitializer

This step is all about constructors. First, we look at all constructors in the current class. If they all start with the same instruction, and that instruction is assigning to an instance field in the current class, then we convert that statement into a field initializer.

After that, all constructors should start with a call to the constructor of the base class. We take that call, and change it into an initializer (" : base()" syntax).

IntroduceUsingDeclarations

When initially creating C# code from the ILAst, ILSpy always uses short type names (without namespace name). However, it annotates the type references, so that the referenced type is still known.

This step looks at the annotations and introduces the appropriate using declarations. Then, the step looks at all referenced assemblies, and looks which types were imported by the using declarations. If several types with the same name were imported, that name is marked as ambiguous.

Now, the transformation again looks at all type references, and fully qualifies those that are ambiguous.

Note that this transformation step is disabled when you use ILSpy to look at a single method. It is used only when decompiling a whole class, or when decompiling a whole assembly.

IntroduceExtensionMethods

This step will replace calls to extension methods with the infix syntax. "Enumerable.Select(a, b)" becomes "a.Select(b)".

Now, let me show you the decompiled running example after this step:

private static IEnumerable<IEnumerable<char>> Test(List<string> list)
{
    foreach (string current in list)
    {
        yield return current.Where((char c) => char.IsUpper(c) || char.IsDigit(c)).Select((char c) => char.ToLower(c));
    }
    yield return new List<char> { 'E',  'N',  'D' };
}

IntroduceQueryExpressions

This step takes a look at method calls, and tries to find patterns that look like the output of C# query expressions. Basically, we apply the same steps as the C# compiler when it translates query expressions into method calls, but in reverse.

This results in the following decompiled code:

        yield return 
            from c in current
            where char.IsUpper(c) || char.IsDigit(c)
            select char.ToLower(c);

The IntroduceQueryExpressions step does a mostly literal translation of method calls to query clauses. However, the C# language defined some query expressions to be translated in terms of other query expressions. Examples are "let" clauses and query continuations with "into". Especially let-Clauses are tricky; since they cause the C# compiler to generate so-called transparent identifiers (see C# specification for details). Such a query might look like this:

from <>h__TransparentIdentifier2b in
    from o in orders
    select new
    {
        o = o, 
        t = o.Details.Sum((QueryExpressions.OrderDetail d) => d.UnitPrice * d.Quantity)
    }
where <>h__TransparentIdentifier2b.>1000m
select new
{
    OrderID = <>h__TransparentIdentifier2b.o.OrderID, 
    Total = <>h__TransparentIdentifier2b.t
};

 

CombineQueryExpressions

This step combines LINQ queries to simplify them (e.g. introduces query continuations); and gets rid of transparent identifiers by re-introducing the original 'let' clause. The above query combined results in the easy-to-understand query:

from o in orders
let t = o.Details.Sum((QueryExpressions.OrderDetail d) => d.UnitPrice * d.Quantity)
where t >1000m
select new
{
    OrderID = o.OrderID, 
    Total = t
};

This concludes the transformations done by the decompiler.

There's only one tiny detail left: we run NRefactory's InsertParenthesesVisitor, which introduces both required parentheses, and some additional parentheses to make the code more readable. The parenthesis-inserting step will run even if you use the language drop down to stop the decompilation at a previous step.

The very last step, of course, is the OutputVisitor, which generates text from the C# AST.

Published Apr 26 2011, 10:05 PM by DanielGrunwald
Filed under:

Comments

 

Michiel said:

I can't say I understand much of it, but I can say I appreciate all of it!

May 2, 2011 10:27 PM
 

DanielZiegler said:

Happy Birthday! :D

May 20, 2011 7:54 PM
 

Rupert said:

Awesome post Daniel. This level of detail is really useful for people like me who are looking to use the SharpDevelop codebase for their own research.

ILSpy deserves a place on every .NET developers desktop.

August 5, 2011 11:37 AM
Powered by Community Server (Commercial Edition), by Telligent Systems
Don't contact us via this (fleischfalle@alphasierrapapa.com) email address.