One of the great ironies of building a Time-Travel Debugger is that when it fails, the best way to figure out why is with an elaborate game of Wordle.
It’s painful to type these words, so let me explain how we record and why diagnosing issues feels like an elaborate game of wordle on a massive two dimensional board.
How Replay records and replays
The key insight behind Replay is that most software is deterministic. This means that given the same inputs, the program with return the same outputs. You do not need to record a fibonacci program to be able to replay it.
The primary thing that Replay records are library calls. Take this
helloWorld
example which reads the first line of a file.plain text#include <stdio.h>int main(int argc, char** argv) { FILE* fp = fopen(“hello.txt”, “r”); char buf[256]; fgets(buf, sizeof(buf), fp); puts(buf); fclose(fp); return 0; }
When Replay records
helloWorld
the only thing it needs to capture are the calls to fopen
and fgets
. Everything else is deterministic!Where Wordle comes in
Okay, so
helloWorld
was fine. Once you capture the calls to fopen
and fgets
you can replay the program a million times and everything will play the same.The problem is that runtimes are not single threaded. They are heavily multi-threaded. Take this example which creates several threads that fill up a buffer and then prints that buffer.
plain text#include <pthread.h> #include <stdio.h>pthread_mutex_t mutex; char buffer[51]; int bufferIndex;void* thread_main(void* ptr) { for (int i = 0; i < 5; i++) { pthread_mutex_lock(&mutex); buffer[bufferIndex++] = ‘a’ + (int)ptr; pthread_mutex_unlock(&mutex); } return NULL; }int main(int argc, char** argv) { for (int i = 0; i < 10; i++) { pthread_t thr; pthread_create(&thr, NULL, thread_main, (void*)i); } pthread_mutex_lock(&mutex); puts(buffer); pthread_mutex_unlock(&mutex); return 0; }
Compiling and running this program repeatedly prints different strings. Uh oh!
plain textaaaaabbbbbcccccdddddfffffeeeeehhhhhgggggiiiii aaaaabbbbbccccceeeeedddddfffffhhhhhgggggiiiiijjjjj abdcccecbeebbaaaaddddfffffggggghhhhhiiiiijjjjj abcaaaabbbbccccdddddeeeeehhhhhgggggfffffiiiii aaaaabbbbbcccccdddddfeefefggggghhhhhiiiii
It turns out that inter-thread non-determinism is just as important as intra-thread non-determinism. The primary way we record inter-thread non-determism is by ordering some of the locks which works like any other lock or mutex, except that when replaying it will be acquired by different threads in the same order that it was acquired when recording.
This is great, except it that in practice if we record all of the locks it would bloat the recordings. It also turns out that most of the locks are not needed to maintain the same behavior.
This is where Wordle comes in. When a recording diverges while replaying the question is what unordered lock is actually important for preserving the original behavior.
This is a search game where the search space is the runtime itself (Chromium, Firefox, Node) and our clues are the recording and replaying stack traces. So where do we start? The first step is to add an assert (1, 2, 3). The next step is to wait for one of the asserts to hit. This narrows the search space down. Sometimes that is enough to find the right lock.
Most of the time, we place new asserts and wait for the next hit. We’re guessing a word, not with the expecation that we’re right, but with the hope that we learn enough to be able to guess the answer next time!
If you’d like to learn more about how Replay Works, we wrote a great deep dive with more examples here. Also if you’re really into Wordle or the inne mechanics of runtimes and Time-Travel Debugging, say hello in replay.io/discord.
We believe in the future is replayable runtimes will be a given and ultimately that is a community effort!