Introduction

The LLVM Compiler Infrastructure is built to provide reusable compiler and toolchain components. The intermediate representation used by LLVM, named LLVM IR, is the basis for various kinds of analysis and instrumentations, both static and dynamic.

For open-source software and operating systems, Clang, the front-end for languages in the C-family, can be leveraged to convert C/C++ source code into LLVM IR, making it possible for doing security analysis on these software and systems.

However, in many situations, we do not always have the source code. One example is the Windows operating system, especially for the binaries that come with the system kernel such as the device drivers.

This restricts the kind of analysis that one can perform. Hence, there is definitely a need for lifting LLVM directly from binaries. In this post, I'm going to talk about lifting LLVM IR from 64-bit x86 Windows binaries.

McSema (by Trail of Bits)

Trail of Bits has developed a framework called McSema for translating compiled code to LLVM bitcode. It supports both x86 and amd64 architectures and can be used for Linux as well as Windows binaries. The translated LLVM IR can even be recompiled as a completely new executable with the exact same functionalities.

However, the LLVM bitcode generated by McSema kinds of mimics the disassembly of the compiled code. Each function's signature, that is lifted off, follows a particular pattern:

; Function Attrs: noinline nounwind
define internal fastcc %struct.Memory* @sub_1c0017f64(%struct.State* dereferenceable(2688) %state2, i64 %pc, %struct.Memory* %memory1) unnamed_addr

The first parameter (%state2) captures the internal state, including all the registers. This format is not easy to analyze since it is more like reading the disassembling. Additional work needs to be done to recover function semantics from the LLVM bitcode generated by McSema.

Recovering Function Semantics

For recovering function semantics, we need to look at %struct.State. The LLVM file generated also provides the following definition:

%struct.State = type {
  %struct.ArchState,
  [32 x %union.VectorReg],
  %struct.ArithFlags,
  %union.Flags,
  %struct.Segments,
  %struct.AddressSpace,
  %struct.GPR,                    # General purpose registers
  %struct.X87Stack,
  %struct.MMX,
  %struct.FPUStatusFlags,
  %union.Flags
}

%struct.GPR represents all the general purpose registers. Here, we only consider 64-bit Windows binaries. While calling a function, the first four parameters are put inside registers rcx, rdx, r8 and r9. The rest of the parameters are pushed on the stack. This is how this struct is defined:

struct alignas(8) GPR final {
  .
  .
  volatile uint64_t _0;
  Reg rax; // 1
  volatile uint64_t _1;
  Reg rbx; // 3
  volatile uint64_t _2;
  Reg rcx; // 5
  volatile uint64_t _3;
  Reg rdx; // 7
  .
  .
  volatile uint64_t _6;
  Reg rsp; // 13
  volatile uint64_t _7;
  Reg rbp; // 15
  volatile uint64_t _8;
  Reg r8; // 17
  volatile uint64_t _9;
  Reg r9; // 19
  .
  .
} __attribute__((packed));

The complete definition can be viewed here. The parameters in the registers are referenced by their pointers returned using getelementptr instruction. For example, the following instruction returns a pointer to rcx, i.e. the first parameter of the function:

%3 = getelementptr inbounds %struct.State, %struct.State* %state2, i64 0, i32 6, i32 5, i32 0

Parameters on the stack are referenced in a slightly complex manner. First, the rsp register value is loaded. As seen from the above structure, rsp is present at index 13.

%11 = getelementptr inbounds %struct.State, %struct.State* %state2, i64 0, i32 6, i32 13, i32 0, i32 0 # A pointer to rsp
%14 = load i64, i64* %11, align 8 # The value loaded in rsp

The fifth parameter is at offset rsp + 40 on the stack. The sixth parameter is at rsp + 48 and so on.

%15 = add i64 %14, 40 # A pointer to the value at 'rsp + 40'

While calling other functions, the same structure %state2 is passed. However, the individual registers are modified to update the parameters using store instructions. Also, the stack grows downwards (to lower memory addresses) by modifying the rsp register. After the first four parameters, the rest are setup up accordingly on the stack. The following example shows how the first parameter is modified while calling another function:

%9 = getelementptr inbounds %struct.State, %struct.State* %state2, i64 0, i32 6, i32 5, i32 0, i32 0 # A pointer to rcx, the first parameter
%87 = add i64 %85, 112 # Computed the parameter to be passed
store i64 %87, i64* %9, align 8, !tbaa !849 # Storing the value in the pointer to rcx

# A call to sub_140002ad0
%92 = tail call fastcc %struct.Memory* @sub_140002ad0(%struct.State* nonnull %state2, i64 %88, %struct.Memory* %MEMORY.0)

A Two-pass Approach

To recover the function semantics from the LLVM bitcode generated by McSema, we use two passes over every LLVM module. The following is calculated for every function:

  • A 'set' of Instruction for every parameter being used by the function. A particular parameter may be referenced by more than one pointers within the same function. The set represents all such possible pointers.
  • A list of functions that it calls.
  • For every function in the above list, a list of parameter values being passed to that function. Note that, there will always be a unique Value being passed so there is no notion of 'set' here.

Apart from this, the following information is also maintained for every function between the two passes:

  • A 'list' of Instruction that are pointers to rsp within the %state2 structure.
  • A 'list' of Instruction that contain actual rsp values 'loaded' from one of the above pointers.
  • For every rsp value loaded above, the corresponding offset of rsp from the value of rsp at the start of the function.
  • For every function that is being called, the corresponding rsp offset being passed.

First Pass

The first pass iterates over the instructions sequentially. The responsibility of this pass is to get the list of parameters being used by this function and also to keep a track of the stack pointer. The following type of instructions are taken into consideration:

  1. GetElementPtrInst: These instructions are used to retrieve pointers to various registers from the %state2 structure. Pointers to the first four parameters (rcx, rdx, r8, r9) along with the stack pointer (rsp) are saved.
  2. CallInst: All function calls to other functions present in the bitcode are recorded. The current rspOffset is also recorded. Also, every function call leads to an increase in the increase of the stack pointer by 8 bytes.
  3. BitCastInst: If the source of a BitCastInst is a recorded pointer to a parameter, the target of the instruction is also added to the same set, since both reference the same parameter.
  4. LoadInst: If a value is being loaded from a pointer to rsp (stored above), the corresponding target is recorded along with the current rspOffset.
  5. StoreInst: If a value is being stored as the new stack pointer (rsp) in %state2, the source must be a value pointing somewhere on the stack. The currentOffset is updated accordingly.
  6. BinaryOperator: Operations on pointers on the stack are performed using this instruction. A new pointer on stack is generated using an existing pointer. Now, this pointer may point to one of the parameters being passed to the coming functions and is recorded accordingly.

Second Pass

Similar to the first pass, the second pass also iterated over the instructions sequentially. The responsibility of this pass is to keep a track of the current state of parameters and associate them with functions that are called. The following types of instructions are taken into consideration:

  1. IntToPtrInst: If the source of this instruction is a pointer on the stack, so is the target.
  2. StoreInst: Parameters are stored in %state2 using this instruction. The target operand is checked for either of the four parameter registers (rcx, rdx, r8, r9) and currentParams is updated accordingly. This instruction is also used to store some parameter on stack. If the target operand is any pointer on the stack, the value pushed is recorded.
  3. Callinst: At this point, a function call is being made and currentParams store the state of the first four parameters being passed. Also, the values pushed on stack earlier are treated as parameters and also recorded with the target function.

Using these two passes we now have a complete list of all parameters being passed to any function. We also have the parameters that a function passes to other functions.

Proofread by Meng Xu