REPT: Reverse Debugging of Failures in Deployed Software

REPT is an intergration in WinDbg. It is a practical system that enables reverse debugging of software failures in deployed systems with low runtime overhead. REPT only relys on the data in a memory dump, and analysis at the binary level instead of source code. REPT uses hardware support to log the control flow and timing information.

REPT comes from Microsoft Research Weidong CUI and Xinyang GE.

REPT is the exact approach I imaged to implement reverse debugging this semester. It uses Intel PT to trace the control flow, and use an iteratively algorithm to reconstruct data flow. However, I was shocked by the paper when it says "we implement the online hardware tracing component as a driver of 8.5K lines of C code", and "we implement offline binary analysis and reverse debugging as a library of 100K lines of C++ code".

After all writing a paper needs that much work, thought the young me.

I was encouraged again when someone tolds me they depoly it into the ecosystem of Microsoft Windows for program tracing, failure reporting, and debugging. Likely, they have to consider which registers are binded to the C# variable x sort of things. I had to assume that Microsoft has a lot of labor.

I'm not interested in how variables binds with registers, at least not now (I do want to know it so I decide to enroll the CS323-Compilers). So, let's dive in and see how it works.

# Control Flow

In debugging, we programmers always wants to know whether a certain block of codes is executed or not. And that's why we need debugging: some codes ran unexpectedly, or some code unexpectedly didn't run.

The control flow, or the execution flow, is mostly about where the program goes. The control flow can told us how the PC register jumps.

Software can certainly trace the control flow, like GDB. But it introduces too much overhead. Hardware components can trace this as well. In REPT, they use Intel Processor Trace (PT) to log control-flow and timing information of a program's execution, with the pre-thread circular buffer mode.

# Data Flow

When debugging, we not only want to know how the programs goes, but also how the data in register and memories changes. However, hardware components won't give us the chance to directly trace that, considering the design complexity it brings.

REPT use offline binary analysis to reconstruct the data flow. When a program halted with an exception, it will generate a memory dump with its final state. REPT use the memory dump and the control flow tracing to guess how the data goes.

Here's how REPT do when it comes to different scenarios.

  • Single instruction sequence with only reversible instructions.
    • Reverse the effects of each instruction to completely recover initial state.
    • e.g. add rax, rbx with {rax=3, rbx=1} can be reversed to {rax=2, rbx=1}.
  • Single instruction sequence with irreversible instructions but without memory access.
    • As long as the destroyed value is derived from some other registers and memory locations, and their values are available, the destroyed value can be recover by forward analysis.
    • Iteratively perform backward and forward analysis to recover data values until no new values are recovered.
  • Single instruction sequence with irreversible instructions and with memory access.
    • The address of memory accesses may be unknown (cannot be deduced)!
    • Error correction
  • Multiple instruction sequences with irreversible instructions and with memory access.

# Data Inference Graph

REPT maintains a data inference graph when performing backward and forward analysis.