SharpDevelop Community

Daniel Grunwald

April 2011 - Posts

  • ILSpy - Decompiler Architecture Overview - Part 2

    The decompiler pipeline can be separated into two phases: the first phase works on a tree representation of the IL code - I described the steps of that phase in the previous post.

    The second phase works on C# code, and is the topic of this blog post.

    As a reminder: the ILAst is a tree of IL instructions, with pseudo-opcodes inserted for the higher-level language constructs that were detected. Let's take a look at how the example C# code from the previous blog post looks in the final ILAst (after type inference).

    Original C#:

    static IEnumerable<IEnumerable<char>> Test(List<string> list)
    {
        foreach (string current in list) {
            yield return (from c in current where char.IsUpper(c) || char.IsDigit(c) select char.ToLower(c));
        }
        yield return new List<char> { 'E', 'N', 'D' };
    }

    Final ILAst:

    As you can see, ILAst already detected the following language constructs:

    • yield return
    • "while (enumerator.MoveNext())" loop
    • collection initializer

    Still missing are the "foreach" construct, and the contents of the lambdas that were created by the query expression.

    As with the previous blog post, you might want to run a debug build of ILSpy and load the above example into it, so that you can see the full output of the intermediate steps. Only debug builds make the intermediate steps of the decompiler pipeline available in the language dropdown.

    C# Code Generation

    We generate C# code from the ILAst. The C# code is represented using the C# Abstract Syntax Tree from the NRefactory library. This step uses the type information to insert casts where required, so that the semantics of the generated C# code match the semantics of the original IL code.

    However, some semantics, like the overflow checking context (checked or unchecked) or the exact type referenced by a short type name, are not yet translated into C#, but stored as annotations on the C# AST. Apart from those details, the resulting code is valid C#.

    This step is probably the biggest source of bugs, as matching the IL and C# semantics to each other isn't easy. For example, IL explicitly specifies which method is being called; but C# uses overload resolution. Inserting casts so that the resulting C# will call the correct method is not simple - we don't want to insert casts everywhere, as that would make the code hard to read.
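
    To illustrate the problem with a small example of my own (not taken from ILSpy's code): if the IL explicitly calls the less specific overload, the decompiler has to insert a cast, because plain C# overload resolution would otherwise bind to the more specific one.

    class OverloadExample
    {
        static void M(object o) { }
        static void M(string s) { }

        static void Caller(string s)
        {
            // The IL call instruction names M(object) even though the argument is a string.
            // Emitting just "M(s)" would be wrong - C# overload resolution would pick M(string) -
            // so the decompiler must produce:
            M((object)s);
        }
    }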

    Implementation: AstMethodBodyBuilder.cs

    C# Code Transforms

    All the remaining steps are transformations on the C# AST. NRefactory provides an API similar to System.Xml.Linq for working with the C# syntax tree, which makes modifications relatively easy. Additionally, the visitor pattern can be used to traverse the AST; and NRefactory implements a powerful pattern-matching construct.

    The decompiler transformations are implemented as classes in the ICSharpCode.Decompiler.Ast.Transforms namespace.

    PushNegation

    This transformation eliminates negations where possible. Some ILAst operations introduce additional negations - for example "brtrue(lbl, condition); ...; lbl:" becomes "if (!condition) { ... }". The transform will remove double negations, and will move remaining negations into other operators. "!(a == b)" becomes "a != b".

    DelegateConstruction

    So far, delegates were compiled into code such as "new D(obj, ldftn(M))". That is, a delegate takes two arguments: the target object, and the method being called. The target object is null if the method is static. This isn't valid C#, so we transform it into "new D(obj.M)". However, if the method is anonymous, then we decompile the target method (up to the DelegateConstruction step) and put the decompiled method body into an anonymous method. Or, if the method body consists of only a single return statement, we use expression lambda syntax.

    Applied to our example code, we get "(char c) => char.IsUpper(c) || char.IsDigit(c)" and "(char c) => char.ToLower(c)". Now, in this example, the lambdas did not capture any variables, and thus got compiled to static methods. The transform gets more complicated if the lambda does capture variables: In this case, the C# compiler would have created an inner class (called "DisplayClass") to represent the closure. The compiler puts all captured variables as fields into that closure.

    To decompile that correctly, the first part of the DelegateConstruction transform will replace any occurrences of "this" within the lambda body with the target object that was passed to the delegate. This makes the resulting code somewhat correct - but the closure is still visible in the decompiled code. For example, a method implementing curried addition of 3 integers would look like this:

    public static Func<int, Func<int, int>> CurriedAddition(int a)
    {
        DelegateConstruction.c__DisplayClass13 c__DisplayClass;
        c__DisplayClass = new DelegateConstruction.c__DisplayClass13();
        c__DisplayClass.a = a;
        return delegate(int b)
        {
            DelegateConstruction.c__DisplayClass13.c__DisplayClass15 c__DisplayClass2;
            c__DisplayClass2 = new DelegateConstruction.c__DisplayClass13.c__DisplayClass15();
            c__DisplayClass2.CS$8__locals14 = c__DisplayClass;
            c__DisplayClass2.b = b;
            return (int c) => c__DisplayClass2.CS$8__locals14.a + c__DisplayClass2.b + c;
        };
    }

    In a second step, the DelegateConstruction transformation looks for such display classes, and removes them by promoting all their fields to local variables. If one of the fields is simply a copy of the function's parameter, no new local variable is introduced, but the parameter is used instead.

    So after this cleanup step is complete, the curried addition example will look exactly like the original C# code:

    public static Func<int, Func<int, int>> CurriedAddition(int a)
    {
        return (int b) => (int c) => a + b + c;
    }

    PatternStatementTransform

    This step does pattern-matching. It defines code patterns for the following language constructs:

    • using
    • foreach (both on generic and on non-generic collections)
    • for
    • do..while
    • lock
    • switch with string
    • Automatic Properties
    • Automatic Events (normal events without explicit add/remove accessor)
    • Destructors
    • try {} catch {} finally

    The expanded code pattern is searched for using NRefactory's pattern matching. When it is found, some additional checks are performed to see whether the transformation is valid (e.g. a 'foreach' loop variable must not be assigned to). If it is valid, the matched code pattern is replaced with the detected construct.

    Since the ILAst has only one loop construct, all generated C# code initially uses only while-loops. But if a loop looks like this: "while (true) { ... if (condition) break; }", then we can change it into "do { ... } while (!condition);". Using NRefactory's pattern matching, the pattern we look for can be defined easily:

    static readonly WhileStatement doWhilePattern = new WhileStatement {
        Condition = new PrimitiveExpression(true),
        EmbeddedStatement = new BlockStatement {
            Statements = {
                new Repeat(new AnyNode("statement")),
                new IfElseStatement {
                    Condition = new AnyNode("condition"),
                    TrueStatement = new BlockStatement { new BreakStatement() }
                }
            }}};

    Pattern matching reuses some ideas from regular expressions: the "Repeat" node will match any number of nodes (like the star operator in regular expressions), and the strings passed to the "AnyNode" constructor create capture groups with those names. For a successful match, the "statement" group will contain all statements except for the final "if (condition) break;", and the "condition" group will contain the loop condition.
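
    To give an idea of how such a pattern is used, here is a hedged sketch of mine (assuming NRefactory's Match.Success / Match.Get<T>() pattern-matching API; "whileLoop" is a placeholder, and the real code in PatternStatementTransform differs in detail):

    Match m = doWhilePattern.Match(whileLoop);
    if (m.Success)
    {
        // Build "do { <statements> } while (!condition);" from the captured groups.
        // The captured "if (condition) break;" ends the loop, so the do..while
        // keeps running while the negated condition holds.
        var doWhile = new DoWhileStatement();
        doWhile.Condition = new UnaryOperatorExpression(
            UnaryOperatorType.Not,
            (Expression)m.Get<Expression>("condition").Single().Clone());
        var body = new BlockStatement();
        foreach (Statement stmt in m.Get<Statement>("statement"))
            body.Add((Statement)stmt.Clone());
        doWhile.EmbeddedStatement = body;
        whileLoop.ReplaceWith(doWhile);
    }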

    ReplaceMethodCallsWithOperators

    This step eliminates invocations of user-defined operators ("string.op_Equality(a,b)") and replaces them with the operator itself ("a == b").

    It also simplifies statements of the form "localVar = localVar + 1;" to use the post-increment operator. Note that this transformation is only valid for statements -- within expressions, we would be required to use pre-increment.
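
    A tiny example (my own) of why the restriction to statements matters:

    int x = 5;

    // As a statement, "x = x + 1;" and "x++;" do the same thing.
    x = x + 1;             // x == 6

    // Inside an expression they differ: "x = x + 1" evaluates to the new value,
    // which matches the pre-increment "++x", not the post-increment "x++".
    int y = (x = x + 1);   // x == 7, y == 7
    int z = ++x;           // x == 8, z == 8  (same behaviour as the line above)
    int w = x++;           // x == 9, w == 8  (different: yields the old value)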

    IntroduceUnsafeModifier

    This transformation scans the method body for any operations that are valid only in an unsafe context. If any are found, the method is marked with the unsafe modifier.

    The step also contains some code readability improvements for unsafe code: "*(ptr + num)" gets transformed to "ptr[num]", and "(*ptr).Member" gets transformed to "ptr->Member".

    AddCheckedBlocks

    In IL, there are different opcodes for instructions with and without overflow checking (e.g. "add" vs. "add.ovf"). However, in C# the overflow checking cannot be specified on single operators, but only on whole expressions. For example, "a = checked(b + c)" will also evaluate the sub-expressions b and c with overflow checking. If those contain any IL opcodes that didn't use overflow checking, then the C# code must use unchecked expressions within the checked expression.
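
    For example (an illustrative snippet of mine, not decompiler output): if the addition was compiled to add.ovf but the multiplication to a plain mul, the decompiled C# has to nest an unchecked expression inside the checked one:

    // "add.ovf" needs a checked context, while the inner multiplication must stay unchecked:
    int result = checked(a + unchecked(b * c));

    // The same constraint with a checked block instead of a checked expression:
    checked
    {
        int sum = a + unchecked(b * c);
    }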

    Code can quickly get unreadable if you do this around every instruction, so we looked for a way to place the blocks intelligently. We formulated the problem as an optimization problem, with the following goals:

    1. Use the minimum number of checked blocks and expressions
    2. Prefer checked expressions over checked blocks
    3. Make the scope of checked expressions as small as possible
    4. Make the scope of checked blocks as large as possible

    We use dynamic programming to calculate the optimal solution in linear time. Essentially, the algorithm calculates two solutions for each node: both have optimal cost, but one expects the parent context to be checked, the other expects it to be unchecked. This allows composing the whole solution (global optimum) from the partial solutions (optimal solutions for each node in the two contexts).

    DeclareVariables

    So far, all variables were declared at the start of the method. This step aims to make the code more readable by moving the variable declarations so that they have the smallest possible scope.

    This step will introduce multiple declarations for the same variable whenever this is allowable. This might happen if two loops use the same variable, but the value assigned to the variable by the first loop will never be read by the second loop.

    Basically, we split up a variable whenever this is possible without triggering the C# compiler error "Use of unassigned local variable" - if the second code block always initializes the variable before reading it, it cannot possibly read the value assigned by the first code block. For this purpose, I implemented C# definite assignment analysis, which is surprisingly complex - the specification is 10 pages long, and makes heavy use of the reachability rules, which take another 10 pages in the C# specification.
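
    An illustrative example of such a split (mine, not from the post):

    static void BeforeDeclareVariables()
    {
        // As generated so far: "i" is declared once, at the start of the method.
        int i;
        for (i = 0; i < 10; i++)
            Console.WriteLine(i);
        for (i = 0; i < 20; i++)   // always assigns "i" before reading it
            Console.WriteLine(i);
    }

    static void AfterDeclareVariables()
    {
        // The second loop never reads the value left behind by the first loop,
        // so the variable can be split into two declarations with minimal scope.
        for (int i = 0; i < 10; i++)
            Console.WriteLine(i);
        for (int i = 0; i < 20; i++)
            Console.WriteLine(i);
    }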

    ConvertConstructorCallIntoInitializer

    This step is all about constructors. First, we look at all constructors in the current class. If they all start with the same instruction, and that instruction is assigning to an instance field in the current class, then we convert that statement into a field initializer.

    After that, all constructors should start with a call to the constructor of the base class. We take that call, and change it into an initializer (" : base()" syntax).
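
    A hedged before/after illustration (not actual ILSpy output):

    // Before: every constructor starts with the same assignment to an instance field.
    class Before
    {
        private int count;

        public Before()         { this.count = 10; }
        public Before(string s) { this.count = 10; }
    }

    // After: the shared assignment has become a field initializer.
    class After
    {
        private int count = 10;

        public After()         { }
        public After(string s) { }
    }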

    IntroduceUsingDeclarations

    When initially creating C# code from the ILAst, ILSpy always uses short type names (without namespace name). However, it annotates the type references, so that the referenced type is still known.

    This step looks at the annotations and introduces the appropriate using declarations. Then, the step looks at all referenced assemblies and checks which types were imported by the using declarations. If several types with the same name were imported, that name is marked as ambiguous.

    Now, the transformation again looks at all type references, and fully qualifies those that are ambiguous.
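
    A classic example of such an ambiguity (my own illustration):

    using System.Threading;
    using System.Windows.Forms;

    class AmbiguityExample
    {
        // Both imported namespaces contain a type named "Timer", so the short name
        // is ambiguous and these references stay fully qualified:
        private System.Threading.Timer workerTimer;
        private System.Windows.Forms.Timer uiTimer;
    }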

    Note that this transformation step is disabled when you use ILSpy to look at a single method. It is used only when decompiling a whole class, or when decompiling a whole assembly.

    IntroduceExtensionMethods

    This step will replace calls to extension methods with the infix syntax. "Enumerable.Select(a, b)" becomes "a.Select(b)".

    Now, let me show you the decompiled running example after this step:

    private static IEnumerable<IEnumerable<char>> Test(List<string> list)
    {
        foreach (string current in list)
        {
            yield return current.Where((char c) => char.IsUpper(c) || char.IsDigit(c)).Select((char c) => char.ToLower(c));
        }
        yield return new List<char> { 'E',  'N',  'D' };
    }

    IntroduceQueryExpressions

    This step takes a look at method calls, and tries to find patterns that look like the output of C# query expressions. Basically, we apply the same steps as the C# compiler when it translates query expressions into method calls, but in reverse.

    This results in the following decompiled code:

            yield return 
                from c in current
                where char.IsUpper(c) || char.IsDigit(c)
                select char.ToLower(c);

    The IntroduceQueryExpressions step does a mostly literal translation of method calls to query clauses. However, the C# language defines some query expressions to be translated in terms of other query expressions. Examples are "let" clauses and query continuations with "into". Let clauses in particular are tricky, since they cause the C# compiler to generate so-called transparent identifiers (see the C# specification for details). Such a query might look like this:

    from <>h__TransparentIdentifier2b in
        from o in orders
        select new
        {
            o = o, 
            t = o.Details.Sum((QueryExpressions.OrderDetail d) => d.UnitPrice * d.Quantity)
        }
    where <>h__TransparentIdentifier2b.t > 1000m
    select new
    {
        OrderID = <>h__TransparentIdentifier2b.o.OrderID, 
        Total = <>h__TransparentIdentifier2b.t
    };

     

    CombineQueryExpressions

    This step combines LINQ queries to simplify them (e.g. introduces query continuations); and gets rid of transparent identifiers by re-introducing the original 'let' clause. The above query combined results in the easy-to-understand query:

    from o in orders
    let t = o.Details.Sum((QueryExpressions.OrderDetail d) => d.UnitPrice * d.Quantity)
    where t > 1000m
    select new
    {
        OrderID = o.OrderID, 
        Total = t
    };

    This concludes the transformations done by the decompiler.

    There's only one tiny detail left: we run NRefactory's InsertParenthesesVisitor, which introduces both required parentheses, and some additional parentheses to make the code more readable. The parenthesis-inserting step will run even if you use the language drop down to stop the decompilation at a previous step.

    The very last step, of course, is the OutputVisitor, which generates text from the C# AST.

  • ILSpy - Decompiler Architecture Overview

    When ILSpy was only two weeks old, I blogged about the decompiler architecture. The basic idea of the decompiler pipeline (IL -> ILAst -> C#) is still valid, but there were several changes in the details, and tons of additions as ILSpy learned about more features in the C# language.

    The pipeline has grown a lot - there are now 47 separate steps, while in the middle of February (when the previous architecture post was written), there were only 14.

    If you want to follow this post, grab the source code of ILSpy and create a debug build, so that you can take a look at the intermediate steps while I am discussing them. Only debug builds will show all the intermediate steps in the language dropdown.

    It's impossible to give a short sample where every intermediate step does something (the sample would have to use every possible C# feature), but the following sample should show what is going on in the most important steps:

    static IEnumerable<IEnumerable<char>> Test(List<string> list)
    {
        foreach (string current in list) {
            yield return (from c in current where char.IsUpper(c) || char.IsDigit(c) select char.ToLower(c));
        }
        yield return new List<char> { 'E', 'N', 'D' };
    }

    Take this code, compile it, and then decompile it with a debug build of ILSpy, so that you can take a look at the results of the intermediate steps.

    Essentially, the decompiler pipeline can be separated into two phases: the first phase works on a tree representation of the IL code - we call this representation the ILAst. The second phase works on C# code, stored in the C# Abstract Syntax Tree provided by the NRefactory library.

    ILSpy uses the Mono.Cecil library for reading assembly files. Cecil parses the IL code into a flat list of IL instructions, and also takes care of reading all the metadata. Thus, the decompiler's input is Cecil's object model, giving it approximately the same information as you see when you select 'IL' language in the dropdown.

    ILAst

    We construct the intermediate representation ILAst. Basically, every IL instruction becomes one ILAst instruction. The main difference is that ILAst does not use an implicit evaluation stack, but creates temporary variables for every write to a stack location. However, the ILAst also supports additional opcodes (called pseudo-opcodes) which are used by various decompiler steps to represent higher-level constructs.

    Another difference is that we create a tree structure for try-finally blocks - Cecil just provides us with the exception handler table from the metadata.

    Implementation: ILAstBuilder.cs

    Variable Splitting

    Using data flow analysis, we split up variables where possible. So if you had "x = 1; x = add(x, 1);", that will become "x_1 = 1; x_2 = add(x_1, 1)". We do not use SSA form for this (although there's an unused SSA implementation left over in the codebase), we only split variables up when this is possible without having to introduce phi-functions. The goal of this operation is to make compiler-generated variables eligible for inlining.

    Implementation: ILAstBuilder.cs

    ILAst Optimizations

    • Dead code removal. We remove unreachable code, because it's impossible to infer any information about the stack usage of unreachable code. Also, obfuscators tend to put invalid IL into unreachable code sections. This actually already happens as part of the ILAst construction, before variable splitting.
    • Remove redundant code
      • Delete 'nop' instructions
      • Delete 'br' instructions that jump directly to the next instruction
      • Delete 'dup' instructions - since ILAst works with variables for stack locations, we can just read a variable twice, eliminating the 'dup'.
    • Simplify instruction set for branch instructions
      • Replaces all conditional branches with 'brtrue'. This works by replacing the 'b*' instructions (branch instructions) with 'brtrue(c*)' (branch if the compare instruction returns true). This step makes use of the 'LogicNot' pseudo-opcode.
        The goal is simply to reduce the number of different cases that the following steps have to handle.
    • Copy propagation. This is a classical compiler optimization; however, ILSpy uses it only for two specific cases:
      • Any address-loading instruction is copied to its point of use. This ensures that no decompiler-generated variable has a managed reference as type - "ref int v = someVariable;" wouldn't be valid C# code, so we have to instead use "ref someVariable" in the place where "v" is used.
      • Copies of parameters of the current function are propagated, as long as the parameter is never written to. This mainly exists in order to propagate the "this" parameter, so that the following patterns can detect it more easily.
    • Dead store removal. If a variable is stored and nobody is there to read it, then was it really written?
      Originally we removed all such dead stores; but after some users complained about 'missing code', we restricted this optimization to apply only to stack locations. Dead stores to stack locations occur mainly after the removal of 'pop' instructions.

    The optimizations are primarily meant to even out the differences between debug and release builds, by optimizing away the stuff that the C# compiler adds to debug builds.

    Implementation: ILAstOptimizer.cs

    Inlining

    We perform 'inlining' on the ILAst. That is, if instruction N stores a variable, and instruction N+1 reads it, and there's no other place using that variable, then we move the definition of the variable into the next expression.

    So "stack0 = local1; stack1 = ldc.i4(1); stack2 = add(stack0, stack1); local1 = stack2" will become "local1 = add(local1, ldc.i4(1))". Inlining is the main operation that produces trees from the flat IL.

    Implementation: ILInlining.cs

    Yield Return

    If the method is an iterator (constructs a [CompilerGenerated] type that implements IEnumerator), then we perform the yield-return-transformation.
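
    To give a rough idea of what this transformation has to undo, here is a hand-written approximation of the state machine the C# compiler emits for a simple iterator (my own sketch, heavily simplified - the real compiler-generated class has unspeakable names like "<Test>d__0" and extra bookkeeping):

    // Roughly what the compiler generates for:
    //     static IEnumerable<int> Range2() { yield return 1; yield return 2; }
    [CompilerGenerated]
    sealed class Range2Iterator : IEnumerable<int>, IEnumerator<int>
    {
        private int state;
        private int current;

        public bool MoveNext()
        {
            switch (state)
            {
                case 0: current = 1; state = 1; return true;   // first "yield return"
                case 1: current = 2; state = 2; return true;   // second "yield return"
                default: return false;                         // iterator finished
            }
        }

        public int Current { get { return current; } }
        object IEnumerator.Current { get { return current; } }
        public IEnumerator<int> GetEnumerator() { return this; }
        IEnumerator IEnumerable.GetEnumerator() { return this; }
        public void Dispose() { }
        public void Reset() { throw new NotSupportedException(); }
    }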

    Implementation: YieldReturnDecompiler.cs

    Analysis of higher-level constructs

    After inlining, we tend to have a single C# statement in a single ILAst statement. However, some C# expressions compile to a sequence of statements. We now try to detect those constructs, and replace the statement sequence with a single statement using a pseudo-opcode.

    We can detect and replace a construct only if it's represented by consecutive statements, so when one construct is nested in another, we first have to process the nested construct before processing the outer construct. Because constructs can be nested arbitrarily, we run all the analyses in a "do { ... } while(modified);" loop. If you select "ILAst (after step X)" in the language dropdown, decompilation will stop after that step in the first loop iteration.

    • SimplifyShortCircuit: introduces && and || operators.
    • SimplifyTernaryOperator: introduces ?: operator
    • SimplifyNullCoalescing: introduces ?? operator
    • JoinBasicBlocks: The decompiler tries to use the minimal possible number of basic blocks. Some optimizations might remove branches and therefore it is necessary to check whether two consecutive basic blocks can be joined into one after such optimizations. It is important to do this because other optimizations like inlining might not work if the code is split into two basic blocks.
    • TransformDecimalCtorToConstant: changes invocations of the "new decimal(int lo, int mid, int hi, bool isNegative, byte scale)" constructor into literals.
    • SimplifyLdObjAndStObj: replaces "ldobj(ldloca(X))" with "ldloc(X)", and similar for other kinds of address-loading instructions.
    • TransformArrayInitializers: introduces array initializers
    • TransformObjectInitializers: introduces object and collection initializers
    • MakeAssignmentExpression: detects when the result of an assignment is used in another expression, and inlines the stloc-instruction accordingly. This is essential for decompiling loops like "while ((line = r.ReadLine()) != null)", as otherwise the loop condition couldn't be represented as a single expression.
      This step also introduces the 'CompoundAssignment' opcode for C# code like "this.M().Property *= 10;". Only because this step de-duplicates the expression on the left-hand side of the assignment can the "this.M()" method call be inlined into it.
    • IntroducePostIncrement: While pre-increments are handled as a special case of compound assignments, post-increment expressions need to be handled separately.
    • InlineVariables2: this performs inlining again, since the steps in the loop might have opened up additional inlining possibilities. The next loop iteration depends on the fact that variables are inlined where possible.

    Implementation: ILAstOptimizer.cs, PeepholeTransform.cs, InitializerPeepholeTransform.cs

    To get more of an idea of what is going on, consider the collection initializer "new List<char> { 'E', 'N', 'D' }". In the ILAst, this is represented as 5 separate instructions:

    stloc(g__initLocal0, newobj(List`1<char>::.ctor))
    callvirt(List`1<char>::Add, ldloc(g__initLocal0), ldc.i4(69))
    callvirt(List`1<char>::Add, ldloc(g__initLocal0), ldc.i4(78))
    callvirt(List`1<char>::Add, ldloc(g__initLocal0), ldc.i4(68))
    yieldreturn(ldloc(g__initLocal0))

    The collection initializer transformation will change this into:

    stloc(g__initLocal0, initcollection(newobj(List`1<char>::.ctor), callvirt(List`1<char>::Add, initializedobject(), ldc.i4(69)), callvirt(List`1<char>::Add, initializedobject(), ldc.i4(78)), callvirt(List`1<char>::Add, initializedobject(), ldc.i4(68))))
    yieldreturn(ldloc(g__initLocal0))

    Now after this transformation, the value g__initLocal0 is written to exactly once, and read from exactly once. This allows us to inline the 'initcollection' expression into the 'yieldreturn' statement, thus combining all five of the original statements into a single one.

    Loop Detection and Condition Detection

    Using control flow analysis (finding dominators and dominance frontiers), we detect loops in the control flow graph. A heuristic is then used to find the most likely loop body.

    We also build 'if' statements from the remaining conditional branch instructions.

    Implementation: LoopsAndConditions.cs

    Goto Removal

    Goto statements are removed when they are made redundant by the control flow structures built up in the previous step. Remaining goto statements are converted into 'break;' or 'continue;' statements where possible.

    Implementation: GotoRemoval.cs

    Reduce If Nesting

    We try to re-arrange the if statements to reduce the nesting level. For example, if the end of the then-block is unreachable (e.g. because the then-block ends with 'return;'), we can move the else block below the if statement.
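
    An illustrative before/after (my own example):

    // Before: the else block adds a nesting level.
    static int BeforeReduceIfNesting(int x)
    {
        if (x < 0)
        {
            return -1;
        }
        else
        {
            Console.WriteLine(x);
            return x * 2;
        }
    }

    // After: the then-block ends with 'return', so the else block
    // can simply follow the if statement.
    static int AfterReduceIfNesting(int x)
    {
        if (x < 0)
        {
            return -1;
        }
        Console.WriteLine(x);
        return x * 2;
    }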

    Remove Delegate Initialization

    The C# compiler will use static fields (and in some cases also local variables) to cache the delegate instances associated with lambda expressions. This step will remove such caching, which opens up additional inlining opportunities. In fact, we will have to move this step into the big 'while(modified)' loop so that we can correctly handle lambda expressions within object/collection initializers.
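
    Roughly, the cached form looks like this before the step runs (an illustrative sketch of mine; the real compiler-generated field names look like "CS$<>9__CachedAnonymousMethodDelegate1"):

    // Before (illustrative): the delegate instance is cached in a compiler-generated static field.
    if (cachedPredicate == null)
    {
        cachedPredicate = new Func<char, bool>(char.IsUpper);
    }
    IEnumerable<char> result = text.Where(cachedPredicate);

    // After the caching is removed, the delegate creation sits directly at its point of use
    // and can be inlined by later steps:
    IEnumerable<char> result2 = text.Where(new Func<char, bool>(char.IsUpper));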

    Introduce Fixed Statements

    .NET implements fixed statements as special 'pinned' local variables. As there isn't any representation for those in C#, we translate them into 'fixed' statements.
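
    For reference, the construct being reconstructed looks like this (illustrative example of mine):

    static unsafe int SumFirstTwo(int[] data)
    {
        // In IL, "p" is a pinned local; C# can only express that with a fixed statement.
        fixed (int* p = data)
        {
            return p[0] + p[1];
        }
    }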

    Variable Recombination

    The split-up variables were useful for inlining and some other analyses, but now we don't need them any more. This step simply recombines the variables that we split up earlier.

    Type Analysis

    Here, finally, comes the semantic analysis. All previous steps just transformed the IL code. Some were introducing some higher-level constructs, but those were defined as pseudo-IL-opcodes, which pretty much just are shorthands for certain IL sequences. Semantic analysis now figures out whether "ldc.i4(1)" means "1" or "true" or "StringComparison.CurrentCultureIgnoreCase".

    This is formulated as a type inference problem: we determine the expected type and the actual type for each expression in the ILAst. In case some decompiler-generated variables (for the stack locations) weren't removed by the ILAst transformations, we also need to infer types for those.
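
    For example (illustrative), the same IL constant ends up as different C# literals depending on the inferred type:

    // All three right-hand sides come from the same IL instruction, ldc.i4(1):
    int count = 1;
    bool done = true;
    StringComparison mode = StringComparison.CurrentCultureIgnoreCase;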

    Implementation: TypeAnalysis.cs

    This concludes our discussion of the first phase of the decompiler pipeline. In the next post, I will describe the translation to C# and the remaining transformations.

  • ILSpy - Query Expressions

    ILSpy supports LINQ query expressions - we added that feature shortly before the M2 release.

    Today, I implemented support for decompiling object initializers and fixed some bugs related to deeply nested lambdas. With these two improvements, query expression translation becomes possible in several more cases.

    This screenshot shows Luke Hoban's famous LINQ ray-tracer.

    Why are queries related to object initializers? Simple: LINQ queries allow only the use of expressions. When an object initializer is decompiled into multiple statements, there's no way to fit those into a "let" or "select" clause, so query expression translation has to abort.

    Another issue with this sample was the deep nesting of the compiler-generated lambdas. Once closures are nested more than two levels deep, the C# compiler starts copying the parent-pointer from one closure into its subclosure ("localsZ.localsY = localsX.localsY;"). This case was missing from the lambda decompilation, so some references to the closure classes were left in the decompiled code. This bug has now been fixed, so nested lambdas should decompile correctly.

    We're now close to supporting all features in C# 3.0, the only major missing item is expression tree support. So LINQ queries currently decompile into query syntax only if they're compiled into delegates (LINQ-to-Objects, Parallel LINQ), not if they're compiled into expression trees (LINQ-to-SQL etc.).
