Let’s imagine the world’s most naive time travel algorithm: given a program which executes deterministically, any point in that program can be reproduced by replaying the recording up until the point that you are interested in. This works great for short recordings like unit tests. If you want to know the value of
x
at the 1.5 second mark, the algorithm is clear: you just have to replay 1.5 seconds of recording and then evaluate the expression x
. If you want to know what the screen looks like at second five you replay five seconds of recording and then take a screenshot.In this naive time-travel debugger, the upper bound for how long it takes to answer a single question is the length of the recording. Of course, this is terrible for long recordings, or if many questions need to be answered quickly about different points in the recording. It would be much better if we could jump backwards in the recording without having to restart from the beginning. For instance, if we could create some sort of “save point” partway through the recording that we could reset back to later. Enter the first major performance improvement for this time machine: forking.
Forking the replaying process allows us to run the program once while leaving save points for asking future questions from. If the program forks at every five second interval, then theoretically the worst case scenario for answering any question is replaying five seconds of recording. To see how this works imagine leaving a fork every five seconds through a 30 second recording, and then asking a question about the state of the program 27 seconds in. To answer that question, the program would need to find the fork sitting at second 25, create a brand new fork from it, and run that process forward for two seconds to have a running process sitting at second 27, skipping the first 25 seconds completely. (see this CodeSandbox for a visual explanation: https://12milh.csb.app/)
This insight - that recordings can be partially replayed and saved - is the foundation that makes time travel debugging practical from a performance perspective. Interestingly, it shares some intellectual lineage with other time-travel techniques like nondeterministic programming, which is sort of the inverse of time-travel debugging. One uses determinism to traverse execution space with confidence, while the other uses non-determinism to explore execution space as efficiently as possible. The fact that linux forks act as a sort of brute-force continuation has a certain beauty to it. We are definitely not the first people to realize this was possible, but Replay might be the most advanced use case of this idea so far.
But so far this implementation of “save points” still has some major problems:
- It still has to run through the recording from the start once so that it can generate the forks.
- That’s potentially a lot of forks. Forks don’t take up a ton of memory thanks to copy-on-write, but they still take up some memory. This means that as the length of the recording grows so do the memory load required for our time machine, and because each fork needs to come from the same original process it’s hard to spread this load out across more than one machine.
The system can be made much more performant by adding two additional features:
- Taking “snapshots” of a running process which can be used to jump ahead to a particular point in the replay without ever having to replay the process up until that point.
- Building on 1, snapshots can be used to slice the recording up into segments of fixed size and those segments can be run independently of one another - meaning that the load requirements of the replay can be spread across many different compute nodes. This means that as recording length grows so do the amount of resources available for replaying. Similarly, as the number of save points grow so can the amount of available memory, and the system can remain equally performant for long and short recordings.
Those improvements were shipped by Replay last summer under the term “Turbo Replay”. Using these techniques, Replay can (in theory) load recordings of any size in a fixed amount of time - regardless of the length of the original recording. In practice, we are still figuring out how to make very long recordings load quickly, but we have made some recent improvements and have many more on the way [could go into more detail about the recording data changes, but I’m not sure how interesting they are].
What would be the next big step to making this time machine more performant?
In order to start faster, the machine needs to spend less time replaying, and the cleanest way to do that is to take snapshots more often. Snapshots can save a lot of computation, but at the cost of space. Recording data is surprisingly small, and with deterministic replay it’s possible to replay to any state that the original recording reached, but only if the time machine follows the original program execution exactly - there are no shortcuts to be had here, which means that replaying time will grow roughly linearly with recording time. Snapshots allow skipping chunks of the original program, but in exchange it has to store everything about the state of the program that could possibly affect future execution. Another way of saying that is that replaying has a compact representation, but might require significant computation; snapshots have a verbose representation, but require a fixed amount of computation to use.
To put some rough numbers on those tradeoffs: the snapshots which power Turbo Replay can easily be over a gigabyte in size, and we often need to retrieve multiple of them from storage at the beginning of a session. If we were to try and take snapshots 10 times more often then we would not have to replay very much, but we would spend way too long shuffling snapshots bytes around at the beginning of the session. The solution is what we’re working on now: incremental snapshots. The idea here is relatively simple: most of the data in a snapshot does not change during any given chunk of the recording. Rather than storing the entire snapshot every time, we can store only the diff across the time that we care about. That allows us to download dozens of snapshots at the beginning of the session while only slightly increasing the amount of data that we need to download overall.
A fun way to think about this is in terms of git - imagine how large a repo would be if for every commit the repository stored a copy of every line of code, rather than just the lines of code that changed. Diffing whole segments of memory is a little bit trickier to debug than diffing text documents, but the performance gains are the same.
We got incremental snapshots working in our runtime test environment last month. We just completed wiring it up in our production backend yesterday. We’re beginning to stress test real recordings today. We’ve crossed all of our collective fingers that we’ll be able to stabilize incremental snapshots over the next couple of weeks. Will report back soon!