
How to time travel every time

What exactly is time travel? How do we ensure we can reliably record and deterministically replay any website?
Jason Laster
At Replay.io we’re building time-travel-enabled browser DevTools. Time travel lets us add debugging features like retroactive print statements, and it also solves the reproducibility problem because you can now share replays with your team.
But what exactly is time travel? How do we ensure we can reliably record and deterministically replay any website?

How we record and replay

Most of the code we write is deterministic. If you want to replay fibonacci(10), you can simply re-run it. But if you rewrote fibonacci so that instead of receiving its input as an argument it read the input from a file, you’d need to start recording the fopen, fread, and fclose calls.
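To make that concrete, here is a minimal sketch, assuming a hypothetical Recorder/ReadInput wrapper (not Replay.io’s actual API), of how recording a file read during the original run lets the replay return the same bytes without touching the filesystem:
c++
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical recording layer (not Replay.io's actual API): during
// recording we perform the real I/O and save its result; during replay
// we skip the real call and return the saved bytes instead.
struct Recorder {
  bool replaying = false;
  std::vector<std::string> events;
  size_t cursor = 0;

  std::string RecordOrReplay(const std::string& liveValue) {
    if (replaying) return events[cursor++];  // feed back the recorded value
    events.push_back(liveValue);             // remember the live value
    return liveValue;
  }
};

// A wrapper around fopen/fread/fclose: the file's contents are an external
// input, so they must be recorded to make the program deterministic on replay.
std::string ReadInput(Recorder& rec, const char* path) {
  std::string live;
  if (!rec.replaying) {
    if (FILE* f = std::fopen(path, "rb")) {
      char buf[256];
      size_t n;
      while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) live.append(buf, n);
      std::fclose(f);
    }
  }
  return rec.RecordOrReplay(live);
}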
At Replay.io, we record and replay browser tabs and Node scripts. So instead of recording fibonacci, we’re recording content processes like Chrome’s: tens of millions of lines of C++ responsible for JavaScript, graphics, networking, DOM, layout, and more. But the same principles apply.
It turns out that if you can record roughly 400 libc calls, you’re pretty close to being able to record and replay a browser tab. The remaining non-determinism is introduced by inter-process communication and by non-deterministic data structures like hash tables that rely on pointers.
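As a small illustration of the pointer problem (a toy example, not Chrome code), the iteration order of a hash set keyed on heap addresses depends on where the allocator happens to place each object, which can differ between the recording run and the replay run:
c++
#include <cstdio>
#include <unordered_set>

// The iteration order of a hash set keyed on pointers depends on the
// addresses the allocator hands out, so it can differ from run to run.
// If any observable behavior depends on this order, the replay diverges.
int main() {
  std::unordered_set<int*> live_objects;
  for (int i = 0; i < 4; i++) live_objects.insert(new int(i));

  for (int* p : live_objects) std::printf("%d ", *p);  // order is not stable
  std::printf("\n");
  for (int* p : live_objects) delete p;
}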
We aim to deterministically replay at the JS level, not at the C++ level. This means it’s okay if, for example, Chrome’s GC performs sweeps at different times while replaying, as long as everything looks the same from the perspective of the web application: promises should resolve in the same order, and worker messages should be received in the same order. But if you connect LLDB to a recording of Chrome, you’ll notice that it runs differently each time. This lossiness is a feature, and it is critical to achieving a real-time, low-overhead recording.
If you’d like to dive deeper into how our recorder works, we wrote a three-part series here.

How do we improve our recorder?

Because our recorder is lossy, we take an empirical approach to replayability. We run a fairly large test suite daily, categorize the crashes that come in, add diagnostics that help us narrow the problems down, and land fixes that make Chrome more deterministic. This makes it fairly easy to see what our crash rate is, how many crashes are categorized, and, for each class of issue, how many diagnostics we’ve added.
The technical term for a crash is a divergence. We detect divergences by placing asserts in Chrome that act like print statements: when an assert fires in a different order while replaying, we know that we’ve diverged.
When we want to narrow down a divergence, we go in and add more asserts. Here is an example of some asserts Dominik added recently to better understand an issue we’re seeing in the PictureLayerTilingSet component:
c++
bool PictureLayerImpl::UpdateTiles() {
  // https://linear.app/replay/issue/RUN-550
  recordreplay::Assert("[RUN-550] PictureLayerImpl::UpdateTiles Start %d", id());
  if (!CanHaveTilings()) {
    // https://linear.app/replay/issue/RUN-550
    recordreplay::Assert("[RUN-550] PictureLayerImpl::UpdateTiles #1");
    ideal_page_scale_ = 0.f;
    ideal_device_scale_ = 0.f;
    ideal_contents_scale_ = gfx::Vector2dF(0.f, 0.f);
    ideal_source_scale_ = gfx::Vector2dF(0.f, 0.f);
    SanityCheckTilingState();
    return false;
  }
In this case, if UpdateTiles is called in a different order while replaying, we might get this stream while recording:
c++
"[RUN-550] PictureLayerImpl::UpdateTiles Start 1" "[RUN-550] PictureLayerImpl::UpdateTiles #1" "[RUN-550] PictureLayerImpl::UpdateTiles Start 2" "[RUN-550] PictureLayerImpl::UpdateTiles #1" "[RUN-550] PictureLayerImpl::UpdateTiles Start 3" "[RUN-550] PictureLayerImpl::UpdateTiles #1"
and this new stream while replaying:
c++
"[RUN-550] PictureLayerImpl::UpdateTiles Start 1" "[RUN-550] PictureLayerImpl::UpdateTiles #1" "[RUN-550] PictureLayerImpl::UpdateTiles Start 3" <--- notice how 2 & 3 swapped "[RUN-550] PictureLayerImpl::UpdateTiles #1" "[RUN-550] PictureLayerImpl::UpdateTiles Start 2" "[RUN-550] PictureLayerImpl::UpdateTiles #1"
The process for narrowing down a divergence is iterative and somewhat mechanical. But if you know where to look, you can quickly narrow it down to a race condition or a pointer-dependent memory reference.
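To make the mechanism concrete, here is a minimal sketch, using a hypothetical AssertLog type rather than our actual recordreplay::Assert implementation, of how an assert can double as a divergence detector: record the ordered stream of messages, then compare against it while replaying.
c++
#include <cstdio>
#include <string>
#include <vector>

// Sketch only: while recording we append each message to an ordered log;
// while replaying we compare the incoming message against the log and
// report the first mismatch as a divergence.
struct AssertLog {
  bool replaying = false;
  std::vector<std::string> recorded;
  size_t cursor = 0;

  void Assert(const std::string& message) {
    if (!replaying) {
      recorded.push_back(message);  // recording: remember the order
      return;
    }
    if (cursor >= recorded.size() || recorded[cursor] != message) {
      std::fprintf(stderr, "Divergence: expected \"%s\", got \"%s\"\n",
                   cursor < recorded.size() ? recorded[cursor].c_str() : "<end>",
                   message.c_str());
    }
    cursor++;
  }
};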

How do we model our crash rate?

At a high level, it is fairly easy to know how well we’re doing: we can simply look at what percentage of recordings we are able to successfully replay.
This problem is similar to looking at a leaky bucket. You don’t need to know how many holes there are or how big they are to know how leaky the bucket is. You can simply look at how quickly the water flows out of the bucket.
However, once you want to actually plug the holes in the bucket, it’s helpful to know how many there are and which ones are biggest so you can prioritize your efforts. In our case, we want to know how many sources of non-determinism we have and how significant each one is.
Today we have 96 known issues in Chrome, and we estimate that there are 162 remaining issues. When we look at the distribution of crashes across issues, we see that 30% of the issues account for 80% of our crashes. This is great because it means we only need to fix 38 issues in order to reduce our crash rate by 80%. An 80/30 curve is not as good as an 80/20 power curve, but we’ll take it!
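As a rough sketch of the prioritization math (the crash counts below are invented for illustration), you can sort issues by how many crashes they cause and take the smallest prefix that covers 80% of all crashes:
c++
#include <algorithm>
#include <cstdio>
#include <vector>

// Illustrative crash counts per issue; the real distribution comes from
// our categorized test-suite crashes.
int main() {
  std::vector<int> crashes_per_issue = {120, 90, 75, 40, 22, 10, 6, 3, 2, 1};
  std::sort(crashes_per_issue.rbegin(), crashes_per_issue.rend());  // biggest first

  int total = 0;
  for (int c : crashes_per_issue) total += c;

  int covered = 0, fixed = 0;
  for (int c : crashes_per_issue) {
    covered += c;
    fixed++;
    if (covered >= total * 0.8) break;  // stop once 80% of crashes are covered
  }
  std::printf("Fix the top %d of %zu issues to remove ~80%% of crashes\n",
              fixed, crashes_per_issue.size());
}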

Setting an end date

The purpose of a model is to predict the future. Once we have an idea of how many issues there are, we can measure how long it takes us to fix any given issue and begin to predict how long it will take to fix our top 30% of issues and reduce our crash rate by 80%.
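A back-of-the-envelope version of that projection might look like the sketch below; the 38 issues come from the distribution above, while the average fix time and engineer count are purely illustrative assumptions:
c++
#include <cstdio>

// Illustrative projection only: remaining calendar time is roughly
// (issues to fix) * (average days per fix) / (engineers working on it).
int main() {
  const int issues_to_fix = 38;       // top 30% of issues (from the post)
  const double avg_days_per_fix = 3;  // assumed average fix time
  const int engineers = 2;            // assumed parallelism

  double calendar_days = issues_to_fix * avg_days_per_fix / engineers;
  std::printf("Projected: ~%.0f working days\n", calendar_days);
}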