
Debugging a Time-Travel Debugger is a lot like playing Wordle!

Jason Laster
One of the great ironies of building a Time-Travel Debugger is that when it fails, the best way to figure out why is with an elaborate game of Wordle.
It’s painful to type these words, so let me explain how we record and why diagnosing issues feels like an elaborate game of Wordle on a massive two-dimensional board.
How Replay records and replays
The key insight behind Replay is that most software is deterministic. This means that given the same inputs, the program will return the same outputs. You do not need to record a Fibonacci program to be able to replay it.
The primary thing that Replay records is library calls. Take this helloWorld example, which reads the first line of a file.
plain text
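    #include <stdio.h>

    int main(int argc, char** argv) {
      FILE* fp = fopen("hello.txt", "r");
      if (fp == NULL) {
        return 1;  // hello.txt is missing or unreadable
      }
      char buf[256];
      if (fgets(buf, sizeof(buf), fp) != NULL) {  // read the first line
        puts(buf);
      }
      fclose(fp);
      return 0;
    }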
When Replay records helloWorld, the only things it needs to capture are the calls to fopen and fgets. Everything else is deterministic!
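To make the idea of capturing a library call concrete, here is a minimal sketch of a wrapper around fgets. When recording it calls the real fgets and saves what came back. When replaying it returns the saved result and never touches the file. The names call_trace, traced_fgets, MODE_RECORD, and MODE_REPLAY are hypothetical and chosen for illustration only; this is not Replay's actual API or recording format.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Hypothetical names throughout; this is not Replay's actual API.
enum mode { MODE_RECORD, MODE_REPLAY };

#define MAX_EVENTS 64
#define MAX_LINE 256

struct call_trace {
  char lines[MAX_EVENTS][MAX_LINE];  // bytes each fgets call produced
  int ok[MAX_EVENTS];                // 1 if that call returned non-NULL
  int count;                         // calls saved while recording
  int cursor;                        // next saved call to hand back on replay
};

// Recording: run the real fgets and remember its result.
// Replaying: return the remembered result; fp is ignored and may be NULL.
static char* traced_fgets(struct call_trace* trace, enum mode mode,
                          char* buf, int size, FILE* fp) {
  assert(size > 0 && size <= MAX_LINE);
  if (mode == MODE_RECORD) {
    assert(trace->count < MAX_EVENTS);
    char* res = fgets(buf, size, fp);
    trace->ok[trace->count] = (res != NULL);
    if (res != NULL) {
      strcpy(trace->lines[trace->count], buf);  // fits: size <= MAX_LINE
    }
    trace->count++;
    return res;
  }
  assert(trace->cursor < trace->count);  // replay asked for an unrecorded call
  int i = trace->cursor++;
  if (!trace->ok[i]) {
    return NULL;
  }
  strncpy(buf, trace->lines[i], (size_t)size - 1);
  buf[size - 1] = '\0';
  return buf;
}
```

In this sketch the rest of helloWorld would run unchanged during replay: puts sees the same bytes in buf because the only non-deterministic input, the file contents, comes from the trace.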
Where Wordle comes in
Okay, so helloWorld was fine. Once you capture the calls to fopen and fgets, you can replay the program a million times and everything will play back the same way.
The problem is that runtimes are not single-threaded. They are heavily multi-threaded. Take this example, which creates several threads that fill up a buffer and then prints that buffer.
plain text
    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    char buffer[51];
    int bufferIndex;

    void* thread_main(void* ptr) {
      // Each thread appends its own letter five times.
      for (int i = 0; i < 5; i++) {
        pthread_mutex_lock(&mutex);
        buffer[bufferIndex++] = 'a' + (int)(intptr_t)ptr;
        pthread_mutex_unlock(&mutex);
      }
      return NULL;
    }

    int main(int argc, char** argv) {
      for (int i = 0; i < 10; i++) {
        pthread_t thr;
        pthread_create(&thr, NULL, thread_main, (void*)(intptr_t)i);
      }
      // The threads are not joined, so some may still be running when the buffer is printed.
      pthread_mutex_lock(&mutex);
      puts(buffer);
      pthread_mutex_unlock(&mutex);
      return 0;
    }
Compiling and running this program repeatedly prints different strings. Uh oh!
plain text
    aaaaabbbbbcccccdddddfffffeeeeehhhhhgggggiiiii
    aaaaabbbbbccccceeeeedddddfffffhhhhhgggggiiiiijjjjj
    abdcccecbeebbaaaaddddfffffggggghhhhhiiiiijjjjj
    abcaaaabbbbccccdddddeeeeehhhhhgggggfffffiiiii
    aaaaabbbbbcccccdddddfeefefggggghhhhhiiiii
It turns out that inter-thread non-determinism is just as important as intra-thread non-determinism. The primary way we record inter-thread non-determinism is by ordering some of the locks. An ordered lock works like any other lock or mutex, except that when replaying it is acquired by the different threads in the same order in which it was acquired when recording.
This is great, except that in practice recording every lock would bloat the recordings. It also turns out that most of the locks are not needed to maintain the same behavior.
This is where Wordle comes in. When a recording diverges while replaying, the question is which unordered lock is actually important for preserving the original behavior.
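An ordered lock of the kind described above could be sketched roughly as follows. While recording, it logs which thread acquired it each time. While replaying, each thread waits until it is its turn according to that log. Everything here (ordered_lock, the integer thread ids, the fixed-size log, and the run_workers demo) is a hypothetical illustration built on plain pthreads, not Replay's implementation.

```c
#include <assert.h>
#include <pthread.h>
#include <string.h>

#define MAX_ACQUIRES 64
#define NUM_WORKERS 3
#define APPENDS_PER_WORKER 2

// Hypothetical type; not Replay's real ordered lock.
struct ordered_lock {
  pthread_mutex_t mutex;    // protects every field below
  pthread_cond_t changed;   // signalled whenever the lock is released
  int held;                 // 1 while some thread owns the lock
  int replaying;            // 0 = recording, 1 = replaying
  int order[MAX_ACQUIRES];  // thread ids in recorded acquisition order
  int count;                // number of recorded acquisitions
  int cursor;               // next recorded acquisition to replay
};

static void ordered_lock_init(struct ordered_lock* l) {
  memset(l, 0, sizeof(*l));
  pthread_mutex_init(&l->mutex, NULL);
  pthread_cond_init(&l->changed, NULL);
}

static void ordered_lock_start_replay(struct ordered_lock* l) {
  l->replaying = 1;
  l->cursor = 0;
}

static void ordered_lock_acquire(struct ordered_lock* l, int tid) {
  pthread_mutex_lock(&l->mutex);
  if (l->replaying) {
    for (;;) {
      assert(l->cursor < l->count);  // more acquires than recorded: divergence
      if (!l->held && l->order[l->cursor] == tid) break;
      pthread_cond_wait(&l->changed, &l->mutex);  // not this thread's turn yet
    }
    l->cursor++;
  } else {
    while (l->held) pthread_cond_wait(&l->changed, &l->mutex);
    assert(l->count < MAX_ACQUIRES);
    l->order[l->count++] = tid;  // remember who got the lock this time
  }
  l->held = 1;
  pthread_mutex_unlock(&l->mutex);
}

static void ordered_lock_release(struct ordered_lock* l) {
  pthread_mutex_lock(&l->mutex);
  l->held = 0;
  pthread_cond_broadcast(&l->changed);
  pthread_mutex_unlock(&l->mutex);
}

struct worker_arg { struct ordered_lock* lock; char* out; int* out_len; int tid; };

static void* worker_main(void* p) {
  struct worker_arg* a = p;
  for (int i = 0; i < APPENDS_PER_WORKER; i++) {
    ordered_lock_acquire(a->lock, a->tid);
    a->out[(*a->out_len)++] = (char)('a' + a->tid);
    ordered_lock_release(a->lock);
  }
  return NULL;
}

// Runs all workers to completion and leaves a NUL-terminated string in out,
// which must hold NUM_WORKERS * APPENDS_PER_WORKER + 1 bytes.
static void run_workers(struct ordered_lock* l, char* out) {
  pthread_t threads[NUM_WORKERS];
  struct worker_arg args[NUM_WORKERS];
  int out_len = 0;
  for (int t = 0; t < NUM_WORKERS; t++) {
    args[t] = (struct worker_arg){ l, out, &out_len, t };
    int rc = pthread_create(&threads[t], NULL, worker_main, &args[t]);
    assert(rc == 0);
    (void)rc;
  }
  for (int t = 0; t < NUM_WORKERS; t++) pthread_join(threads[t], NULL);
  out[out_len] = '\0';
}
```

Calling run_workers once after ordered_lock_init records an acquisition order. Calling ordered_lock_start_replay and then run_workers again should produce the same string every time, because each thread blocks until the log says it is next. The cost is visible in the struct: every acquisition adds an entry to the log, which is the bloat that makes ordering every lock impractical.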
This is a search game where the search space is the runtime itself (Chromium, Firefox, Node) and our clues are the recording and replaying stack traces. So where do we start? The first step is to add an assert (123). The next step is to wait for one of the asserts to hit. This narrows the search space down. Sometimes that is enough to find the right lock.
Most of the time, we place new asserts and wait for the next hit. We’re guessing a word, not with the expectation that we’re right, but with the hope that we learn enough to be able to guess the answer next time!
If you’d like to learn more about how Replay works, we wrote a great deep dive with more examples here. Also, if you’re really into Wordle or the inner mechanics of runtimes and Time-Travel Debugging, say hello in replay.io/discord.
We believe that in the future replayable runtimes will be a given, and ultimately that is a community effort!
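One way to picture such an assert, purely as an assumption since the post does not spell out how Replay's asserts work internally: a check that saves a value at some point in the runtime while recording, then compares against the saved value while replaying. The first check that disagrees tells you roughly where the two executions parted ways. The names check_log and recorded_assert, and the site labels used with them, are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_CHECKS 128

// Hypothetical structure; not how Replay stores its asserts.
struct check_log {
  const char* site[MAX_CHECKS];  // label for where the check was placed
  long value[MAX_CHECKS];        // value observed there while recording
  int count;                     // checks saved while recording
  int cursor;                    // next saved check to compare on replay
  int replaying;                 // 0 = recording, 1 = replaying
  int first_mismatch;            // index of the first diverging check, or -1
};

static void check_log_init(struct check_log* checks) {
  memset(checks, 0, sizeof(*checks));
  checks->first_mismatch = -1;
}

static void check_log_start_replay(struct check_log* checks) {
  checks->replaying = 1;
  checks->cursor = 0;
  checks->first_mismatch = -1;
}

// Recording: save (site, value) and return 1.
// Replaying: return 1 if this check matches the saved one, 0 if it diverged.
static int recorded_assert(struct check_log* checks, const char* site, long value) {
  if (!checks->replaying) {
    assert(checks->count < MAX_CHECKS);
    checks->site[checks->count] = site;
    checks->value[checks->count] = value;
    checks->count++;
    return 1;
  }
  int i = checks->cursor++;
  int same = i < checks->count &&
             strcmp(checks->site[i], site) == 0 &&
             checks->value[i] == value;
  if (!same && checks->first_mismatch < 0) {
    checks->first_mismatch = i;  // the "hit": where replay first disagreed
  }
  return same;
}
```

In Wordle terms, each check that matches is a green tile and first_mismatch is the first grey one: everything before it replayed faithfully, so the missing ordered lock is likely to be somewhere between the last matching check and the first failing one.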