Moved to How to build a Time Machine

You can imagine the worlds most naive time travel algorithm - which is to always replay the recording up until the point that you are interested in. This works great for short recordings like unit tests, for instance. If you want to know the value of x at the 1.5 second mark, the algorithm is clear: you just have to replay 1.5 seconds of recording and then evaluate the expression x. If you want to know what the screen looks like at second five you replay five seconds of recording and then take a screenshot.

In this naive time-travel debugger, the upper bound for how long it can take to answer a single question is the length of the recording. This is terrible for long recordings, or if you have a lot of questions.

In comes the first major performance gain: process forking.

Forking the replaying process allows us to run the program through once while leaving “save points” for asking future questions from. If we leave a fork at every five second interval, then theoretically the worst case scenario for answering any question is replaying five seconds of recording. To see how this works imagine leaving a fork every five seconds through a 30 second recording, and then asking a question about the state of the program 27 seconds in. All we would need to do is find the fork sitting at second 25, create a brand new fork from it, and run that process forward for two seconds, and we would have a running process sitting at second 27, allowing us to skip replaying the first 25 seconds again.

📌

This insight - that recordings can be partially replayed and saved - is the foundation that makes time travel debugging practical from a performance perspective. Interestingly, it shares some intellectual lineage with other time-travel techniques like nondeterministic programming, which is sort of the inverse of time-travel debugging. One uses determinism to traverse execution space with confidence, while the other uses non-determinism to explore execution space as efficiently as possible. The fact that linux forks act as a sort of brute-force continuation has a certain beauty to it. We are definitely not the first people to realize this was possible, but Replay might be the most advanced use case of this idea so far.

But so far this implementation of “save points” still has some major problems:

It still has to run through the recording from the start once so that it can generate the forks.

That’s potentially a lot of forks. Forks don’t take up a ton of memory thanks to copy-on-write, but they still take up *some* memory. This means that as the length of our recording grows so do our memory requirements, and because each fork needs to come from the same original process it’s hard to spread this load out across more than one machine.

We can improve the system further by adding in two additional features:

We can take “snapshots” of a running process which can be used to jump ahead to a particular point in the replay without having to actually rerun the entire process up until that point.

Using snapshots we can slice the recording up into segments of fixed size and run those segments independently of one another - meaning that we can spread the resources required to replay a recording across many different compute nodes. This means that as recording length grows we can also grow the amount of resources available for replaying. Now, as our number of save points grow so can our available memory, and we can expect the system to remain performant.

We introduced those improvements last summer under the term “Turbo Replay”. Using these techniques, we can (in theory) load recordings in a fixed amount of time - regardless of the length of the original recording. In practice, we are still figuring out how to make very long recordings load quickly, but we have made some recent improvements and have many more on the way [could go into more detail about the recording data changes, but I’m not sure how interesting they are].

📌

The following is me trying out several different ways of saying the same thing…

Random tangent on program state

Version A

What would be the next step to making our time travel debugger more performant?

Snapshots can save us a lot of computation, but at the cost of space. Recording data is surprisingly small, and with it we can replay to any state that the original recording reached, but only if we follow the original execution exactly - there are no shortcuts to be had here. Snapshots allow us to skip executing the original program, but in return we have to write down everything about the state of the program that could possibly affect its future execution. Another way of saying that is that replaying has a compact representation, but might require significant computation; snapshots have a verbose representation, but require very little computation to use.

Version B

Snapshots do have tradeoffs though: they are a verbose representation of recording data. You can sort of think of it this way: there are two ways to arrive at a state part-way through a recording: we can start with only the minimal set of data required to enforce effective determinism and the program that was originally recorded. The combination of those two allows us to reconstruct any point in the program’s execution given enough time. Or, we can write down the entire internal state of the program at the point we are interested in, allowing us to jump to that state directly. It’s important to note that both of these techniques get us to the same place, choosing between them is a question of tradeoffs. Replaying has a compact representation, but might require significant computation. Snapshots have a verbose representation, but require very little computation to use.

Version C

We somewhat solved the “lot of forks problem” with turbo replay, so now we can split a large recording across multiple different compute nodes. So now our startup time is roughly correlated to how far apart our snapshots are. The snapshots we use to power turbo replay are relatively large (gigabytes even after compression), so we can’t afford to make lots of them. Not only that, but they continue to grow as the recording goes on. But, much like the insight around copy-on-write, which hinges on the fact that most of the memory that a process uses is unlikely to change in either process, we can take advantage of the fact that much of each snapshot is the same, allowing us to store diffs, rather than complete copies.

📌

There are lots of interesting things to talk about here. For instance, we had to work around ASLR by embedding a second copy of our runtime inside of the replaying process to make sure that snapshot pointers remained valid. We had to fork and tweak an implementation of malloc to get around nasty allocation bugs and their interaction with snapshotting. But I think I’ll leave all of this out, it’s interesting, but not very pertinent.

Josh’s notes/brain dump

Moved to How to build a Time Machine

Version A

Version B

Version C