
Design of a low-overhead record/replay framework for multi-threaded programs

Setting the scene

A record/replay framework makes it possible to execute a program once and then replay that execution an arbitrary number of times. The main challenge in the record phase is to capture all sources of non-determinism (e.g., the schedule of a multi-threaded program). In the replay phase, this information is used to reconstruct the original (recorded) behavior of the program.

There are various levels of granularity at which information can be recorded. For example, the recorder can log every executed instruction. One advantage of such a fine-grained recording strategy is that all information needed to replay the application deterministically is available. The disadvantage, however, is that recording slows down the program execution by approximately two orders of magnitude. Such a large slowdown renders recording of programs with large workloads impractical. Fine-grained record/replay frameworks are therefore used for debugging programs rather than for testing them. They are usually implemented with dynamic binary translation tools (e.g., PIN (http://www.pintool.org/) or FastBT (http://nebelwelt.net/?id=47)).

Our work takes a different approach in that we do not record (the results of) individual instructions, but only events that introduce non-determinism into the program. We monitor a program (without rewriting its code) and record all events that (1) provide an input to the program or (2) can affect the control flow of the program. A list of such events is given in the next section, “Challenges”. The “Solution” section explains how we eliminate or record these events. Finally, the “Implementation” section highlights the most important implementation details.

Challenges

Here is a list of the sources of non-determinism we are considering so far. The list is probably not complete, so if you have an item to add, please let me know.

  • Address space randomization (ASR)
  • Signals
  • Schedule/interleaving of parallel programs
  • Instructions (e.g., rdtsc, rdmsr)
  • System calls (e.g., gettimeofday())

Solution

To eliminate the non-determinism that is introduced by the OS/hardware we propose the following strategies:

  • ASR: can be disabled by the OS. E.g., on a Linux system, sysctl -w kernel.randomize_va_space=0 disables ASR.
  • Signals: To replay the original program execution faithfully, signals must be delivered at the exact same spot as in the recording phase. This is a non-trivial problem since signals are, in general, asynchronous; a process can receive a signal at an arbitrary program point and, based on the signal, change its control flow. We use hardware performance counters to count the number of retired conditional branches in the recording phase and count that value down in the replay. When the counter reaches zero, the replayed process is stopped and single-stepped to the eip where the original signal was received, and the signal is delivered (see the first sketch after this list). This way, signalling of multi-threaded and multi-process applications can be replayed exactly.
  • Scheduling: Multi-threaded programs that share memory can introduce non-determinism in the form of data races and deadlocks. Replaying parallel programs deterministically requires modifications to the kernel, the language, or the runtime system. All these efforts are still quite “academic”, and time will show which approach is most promising. To avoid these “parallel problems”, we serialize the execution of parallel programs by letting only one thread/process run at a time. More specifically, the recorder determines which process runs and logs the schedule, which the replayer then uses to enforce the same schedule. Now you might object that we are in the era of parallelism, and here we are building a recorder that serializes your parallel program. In fact, I think that many programs will remain mostly sequential, or employ only a couple of threads, because many problems are, at least in large parts, inherently sequential. Keeping Amdahl’s Law in mind, serializing such a program will slow down recording/replay by only a modest factor.
  • Instructions: There are some (x86) instructions that introduce non-determinism. For example, rdtsc reads the processor’s time-stamp counter, a 64-bit value that will clearly differ between the record and replay runs. Since rdtsc is a common instruction, we must make sure that the values are the same in record and replay. In a Linux environment, rdtsc can be virtualized with the prctl system call (PR_SET_TSC), which makes the instruction trap so that the monitor can emulate it (see the second sketch after this list). The rdmsr instruction is used to access the PMU (Performance Monitoring Unit). Most data that can be gathered by the PMU is non-deterministic (e.g., cache hits/misses). Currently, we do not handle the rdmsr instruction specifically. However, since most applications do not make use of PMU data, we do not consider this a major limitation of our approach.
  • System calls: Many system calls provide input to the application. E.g., read, gettimeofday(), or socket calls can give varying (from run to run) inputs to the target application. Therefore, the recorder logs all such inputs in the record phase, and the replayer injects the recorded data in the replay phase. This way we can guarantee that the record and replay executions do not differ due to system calls.
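To illustrate the signal-delivery strategy from the list above, here is a minimal sketch of the last phase of replaying a signal, assuming the retired-conditional-branch counter has already been counted down and the recorded eip and signal number are available from the log (the function and variable names are illustrative, not our actual implementation):

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Sketch: the branch counter has reached the recorded value; single-step
 * until the child is at the recorded eip, then resume it with the
 * recorded signal injected. Assumes a 32-bit x86 target. */
static void deliver_signal_at(pid_t pid, long recorded_eip, int recorded_signo)
{
    struct user_regs_struct regs;
    int status;

    for (;;) {
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        if (regs.eip == recorded_eip)
            break;                        /* reached the recorded spot */
        ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL);
        waitpid(pid, &status, 0);         /* wait for the single-step trap */
    }
    /* resume the child with the recorded signal injected */
    ptrace(PTRACE_CONT, pid, NULL, (void *)(long)recorded_signo);
}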
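And for the rdtsc case, a minimal sketch of how the trap can be armed on Linux (the monitored process arms it before recording starts; the monitor then catches the resulting SIGSEGV and supplies the logged time-stamp value during replay):

#include <sys/prctl.h>

/* Sketch: make every subsequent rdtsc in this process raise SIGSEGV so
 * that the monitor can intercept it and emulate the instruction with
 * the recorded value. */
int trap_rdtsc(void)
{
    return prctl(PR_SET_TSC, PR_TSC_SIGSEGV, 0, 0, 0);
}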

Implementation

The current system is implemented for a 32-bit Linux-based system. Subsequently, you can find a list of the most important implementation/design decisions:

  • The monitoring process (recorder/replayer) is implemented as a separate process. The reason for using separate address spaces is that the execution of the original (unmonitored) program should be influenced as little as possible.
  • We use ptrace to stop at system call entries/exits and to intercept signals (a sketch of this loop follows this list).
  • We virtualize every system call that uses file descriptors (including sockets).
  • Only system calls that must be executed are actually executed (e.g., mmap).
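As promised in the list above, here is a minimal sketch of the ptrace monitoring loop, assuming a 32-bit x86 target whose child has already called PTRACE_TRACEME and exec’d (the logging itself is only hinted at in the comments):

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Sketch: run the child under PTRACE_SYSCALL so it stops at every
 * system-call entry and exit. At an exit stop, the result in eax can
 * be written to the log (record phase) or overwritten with the logged
 * value (replay phase). */
static void monitor(pid_t child)
{
    struct user_regs_struct regs;
    int status, entering = 1;

    waitpid(child, &status, 0);               /* initial stop after exec */
    while (1) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        if (entering) {
            /* entry stop: regs.orig_eax holds the syscall number */
        } else {
            /* exit stop: regs.eax holds the result -> log or inject */
        }
        entering = !entering;
    }
}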

I will add more implementation details over time. The project is still in an experimental phase, and things might change. I’ll keep you informed.

That’s it.

Cheers,

Albert

Deterministic hardware performance counter event on Sandy Bridge

Hi all,

I found a deterministic hardware performance counter event on the Sandy Bridge architecture. The event counts retired conditional branches. That’s great news, since we can use this event to find the exact spot at which to deliver a signal in the replayer, by counting down the retired conditional branches that were recorded.

Best,

Albert

Hardware performance counter jitter – influencing factors

Hi all,

I’ve tried to find the factor(s) in the system that cause non-determinism in various counters that are supposed to be deterministic. After trying various hardware events, it seems that there is at least one event that is not influenced by page faults and hardware interrupts. This event is a sub-category of BR_INST_RETIRED (event 0xC4 on Core i7 architectures) and is called NEAR_CALL (umask 0x02). The good news is that the event seems to be deterministic; the bad news is that this particular event does not occur very frequently. Still, the event can be used as a guaranteed lower bound: if counter 0x2C4 is smaller than the recorded value, we can be sure that we have not yet gone past the spot where we want to stop.
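As an illustration, here is a sketch of how such a raw event can be programmed with perf_events, assuming the raw code 0x2C4 from above (event 0xC4, umask 0x02) and counting in user mode only; perf_event_open has no glibc wrapper, so it is invoked via syscall(2):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch: open a counter for the raw event 0x2C4
 * (BR_INST_RETIRED.NEAR_CALL) for the given pid on any CPU,
 * excluding kernel and hypervisor counts. */
static int open_near_call_counter(pid_t pid)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_RAW;
    attr.size = sizeof(attr);
    attr.config = 0x2C4;                 /* umask 0x02, event 0xC4 */
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    return syscall(__NR_perf_event_open, &attr, pid, -1, -1, 0);
}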

Enough words, here are some graphs – more will follow.

Best,

Albert

Experimental settings:

– CPU: Intel(R) Core(TM) i7 CPU Q 820 @ 1.73GHz

– OS: Ubuntu, kernel 2.6.39-0-generic

– HPC information gathering: perf_events

– SPEC CPU 2006 benchmarks, compiled with gcc 4.5.2 -O2

If you need more data, please let me know!

We use ptrace to control the process that is monitored. Execution is stopped before every system call, and the contents of the performance counters are read. After the data is obtained from the counters, all counters are reset to avoid a drift of the jitter throughout the execution. The graphs presented here do not include data from the Linux dynamic loader, since it seems to be non-deterministic. Address space randomization is disabled (sysctl -w kernel.randomize_va_space=0), and the process is bound to a specific core (sched_setaffinity(..)).
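For completeness, a sketch of the per-stop counter handling just described, assuming fd is a counter opened as in the earlier perf_events sketch (a plain read(2) returns the 64-bit count; the ioctl resets it so jitter cannot accumulate over the run):

#include <linux/perf_event.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Sketch: at every syscall stop, read the current count and reset the
 * counter so that any jitter does not drift throughout the execution. */
static uint64_t read_and_reset(int fd)
{
    uint64_t count = 0;

    read(fd, &count, sizeof(count));     /* current counter value */
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);  /* back to zero */
    return count;
}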

gcc benchmark: invoked with parameters: ./gcc… 166.i -o 166.s

Jitter of retired instruction/store/branch/near_call count as a function of retired instruction count:

[Figure: 403.gcc.166i.retired_instruction_count]

Jitter of retired instruction/store/branch/near_call count as a function of page faults:

[Figure: 403.gcc.166i.page_faults]

Jitter of retired instruction/store/branch/near_call count as a function of hardware interrupts:

[Figure: 403.gcc.i166.hw_interrupts]

bzip2 benchmark: invoked with ./bzip… dryer.jpg 5

[Figure: 401.bzip2.dryer_5_retired_instruction_count]

[Figure: 401.bzip2.dryer_5_page_faults]

[Figure: 401.bzip2.dryer_5_hw_interrupts]

Determinability of PMU events (INST_RET)

One week has now passed, and I am trying to figure out why the number of retired instructions for a deterministic application is not deterministic. I wrote a small framework which should allow me to track all sources of non-determinism.

To do so, we eliminate as much randomness as possible before we start the application: (1) we bind the execution of the thread to a particular CPU, and (2) we disable address space randomization, which is enabled by default in newer Linux distributions. If you know of any more sources, please let me know! So far, so good.

There are two known sources of randomness that we track during execution: (1) page faults and (2) hardware interrupts. More specifically, the RETIRED_INST counter adds one instruction for each page fault and for each hardware interrupt. However, from what I have found out so far (and of which I am not quite sure), the following scenario adds non-determinism (or the counter for hardware interrupts is not precise?):

inst: n

<- hw-signal 1

<- hw-signal 2

inst: n+1

In this example, the RETIRED_INST counter does not reflect the second signal. If anyone has comments on that (or has experienced the same), please let me know.

Have a nice weekend,

Cheers,

Albert

Hi all,

I just started my internship at Mozilla last week. So far, everything is great, and the project I am going to work on this summer is very cool. The main aim of the project is to develop a debugger that enables the programmer to trace a bug down to its exact source. Very often, the effects of a bug (e.g., a write to an unintended memory location) are seen only later (seconds, hours, days) in the program execution. Traditional debuggers cannot reveal such a bug, since (1) they do not record execution traces and (2) they cannot go back in time (reverse-debug).

There exist several approaches that record execution traces to enable reverse debugging. However, recording execution traces is extremely expensive; it can slow down the application by up to a factor of 100! In a real-world environment, such a slowdown is not acceptable. To avoid this excessive overhead, other approaches use checkpointing. Instead of recording every executed instruction, checkpoints are created (e.g., every second) from which the program can be re-executed an arbitrary number of times. The difficulty that comes with checkpointing is that the result of every non-deterministic operation must be recorded in order to guarantee that the replay produces the same output as the original run. E.g., the results of system calls must be recorded during the original run; in the replay run, the system call is not executed, but the recorded result is used instead.

There exist frameworks that can do this. However, it remains unclear whether their replay is 100% deterministic. Furthermore, they incur a slowdown of approximately 50%.

The goal of my project is to enable reverse debugging at almost zero overhead. To achieve this goal, we want to use hardware performance counters (HPCs), which are available on all modern CPUs. In particular, upon every signal that is sent to the application, we intercept the signal and record its type. In a second step, we read the number of retired instructions. Consequently, we can record both the source of non-determinism and the exact point at which it happened. This should (at least we hope so) be enough information to enable a fully deterministic replay.

In a first step, we try to find out how deterministic hardware performance counters actually are. Any results will (hopefully) be published here soon!

Cheers,

Albert