
How to debug an Effectively Deterministic Time Travel Debugger? (Seriously…how?!)

Kannan goes deep on how to debug Replay.
Kannan Vijayan

The Pledge

At Replay we’re building a time travel debugger the likes of which hasn’t been seen before. This is not normal software. Not only does it do magical things, but its implementation and design stretch the limits of my imagination and genuinely put me back in that heady mental state where I’m wondering if I know enough to do my job effectively.
Replay turns a program’s entire execution into a first-class object. You and I alike are accustomed to thinking of a program execution as an occurrence, an event. A program will execute, is executing, or has executed. Replay turns that into “an execution”, transmuting verb to noun. And it doesn’t do this for any trivial program, but for entire runtimes whose job it is to execute arbitrary programs themselves: browsers running web pages, and virtual machines like Node.js (and others, as we get around to them) running arbitrary scripts.
And why do we want to turn a program execution from an event to an object? What do we gain? Well, we gain the ability to look at it, to poke at it, to pick it apart, to understand its nature. And why do we want to do that? To put it bluntly: because we don’t understand the programs we write. Or to be more precise: we don’t understand all the possible ways in which the programs we have written can execute. When a program behaves in a way we don’t expect, we are often at a loss when we want to know why.
Traditional debuggers try to do this but they don’t turn executions into objects. All they really do is instrument a particular execution so that, at particular times, you can observe some very limited aspects of what’s happening inside the program, and slow down or pause the program’s execution to happen on your timetable. You step the execution forward, you inspect a variable, you observe how a particular memory location changes, but the program inexorably runs forward all the same.

The Turn

The mind-bending thing for me is how Replay turns executions into first-class objects. It does so by partially recording the low-level system interaction of the program from the start of execution to the end, and then reconstructing the execution on-demand by re-executing it from the start in a virtualized environment, using the previously recorded information to simulate “the world as it was” when the original execution occurred. By doing so, it acquires the ability to force programs to re-execute the same way the 2nd, 3rd, or Nth time. And using that mechanism, we can build tools and UIs that let us answer questions about the state of the program at any point in time during the execution. Powerful questions whose answers let us understand what our programs do, and more critically, where and how they might go wrong.
The insidious word in the prior paragraph is partially. This whole business would be so much easier to deal with if it wasn’t for that accursed partially. And yet, it’s a fundamental part of our design and cannot be eliminated. Without it, we could build a product that’s technically the same, but in practice would be so much slower and less usable that it would simply not be worth the time.
Download our browser and record a web page and it will feel like you’re using a regular web browser. In the background our browser is producing a first-class object capturing that page’s execution down to the system-call level, encompassing all that it does, but the only indication you will receive of that is a small red dot on the “<rec>” button next to the URL bar. Stop recording and you will be whisked away to a debugger view of the execution you just recorded. It was silently being created for you behind the scenes and uploaded to our cloud servers.
The smoothness of this entire experience depends intrinsically on that partially bit. Take that away, and the amount of data we’d need to record to reconstruct the execution shoots up by an order of magnitude. The amount of time the recording browser would take to trap and track every thread synchronization point, and the order of locks handed out to each thread, shoots up by an order of magnitude. The entire product becomes glacier-like, slow, unusable.
The partially is the reason we call our system “effectively deterministic”. It’s not completely deterministic, only as deterministic as you, our user, need it to be. Our original plan to name this behaviour “I can’t believe it’s not fully deterministic” was unfortunately shouted down by our investors, family, and hair care professionals alike.
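To make the record-then-re-execute idea from this section concrete, here is a toy sketch. It is emphatically not how Replay is implemented: the `Recorder`, `Replayer`, and `program` names are invented for illustration, and the "world" is reduced to a clock and a random number generator. It only shows the general shape of the technique: log what the outside world told the program the first time, then feed those same answers back on every later run.

```python
import random
import time


class Recorder:
    """Runs a program against the real world, logging each nondeterministic answer."""

    def __init__(self):
        self.log = []

    def now(self):
        value = time.time()
        self.log.append(("now", value))
        return value

    def rand(self):
        value = random.random()
        self.log.append(("rand", value))
        return value


class Replayer:
    """Re-runs the program, answering each call from the log instead of the world."""

    def __init__(self, log):
        self._entries = iter(log)

    def _next(self, kind):
        recorded_kind, value = next(self._entries)
        if recorded_kind != kind:
            # The re-execution asked a different question than the original did.
            raise RuntimeError(f"divergence: recorded {recorded_kind}, got {kind}")
        return value

    def now(self):
        return self._next("now")

    def rand(self):
        return self._next("rand")


def program(env):
    """A hypothetical program whose result depends on the clock and on randomness."""
    started = env.now()
    rolls = [int(env.rand() * 6) + 1 for _ in range(3)]
    return {"started": started, "rolls": rolls, "total": sum(rolls)}


# Record once against the real world...
recorder = Recorder()
original = program(recorder)

# ...then re-execute as many times as we like against the recorded world.
first_replay = program(Replayer(recorder.log))
second_replay = program(Replayer(recorder.log))
```

Note that the log holds only four entries, the answers that came from outside the program. The list building and the sum are never recorded; they are simply recomputed on each run, and `original`, `first_replay`, and `second_replay` come out identical.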

The Panic

And this aspect of our implementation is what leads me to the head-scratching complexity and development challenges I want to talk about. I’ve spent a good twenty years working on software platforms and problems that I’d venture to classify as “non-trivial”. Things like building ORMs in Perl and then building an mRNA sequence analysis platform on top of that, or designing and implementing a baseline JIT for a production VM, or attempting a novel “better-than-brotli” compression algorithm for a binary syntax encoding for JavaScript. I like to think I can wrap my head around most software development ideas.
And now this problem has me losing my hair at a slightly faster rate than when my first and only child was born. It’s a hard problem. It’s a new problem, and quite unlike any I’ve seen before.
In subsequent articles, I’ll talk about the details of this problem. Why this partially business keeps me up at night, how I’m trying to go about working on it, and what I find frustrating about it.
And if you read these articles and find the challenges intriguing, and you’re a systems developer looking to work on hard problems where it’s not clear what the solution is, and sometimes not even clear what the development approach should be... maybe you want to consider coming and joining us, and helping me out.
Because as you’ll soon find out: I can use the help.