Building the impossible

Time-travel debugging is one of those fun problems that in theory is impossible and in practice is nearly impossible.

Time-travel debugging requires recording the runtime and replaying it deterministically. The runtime thinks it’s talking to a real computer and interacting with a real user when in fact it’s in the Matrix.

Machines perform billions of operations a second. In theory, Replay needs to replay those billions of operations. In practice, being truly deterministic would be too slow, which is why we strive for effective determinism.
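To make the idea of effective determinism a little more concrete, here is a minimal Python sketch. It is not how Replay's recorder works; the `Boundary` class, the `program` function, and the log format are all hypothetical names made up for illustration. The assumption is that a program's logic is deterministic as long as every nondeterministic input (the clock, random numbers) is captured at a boundary, so only those inputs are recorded rather than every operation.

```python
import random
import time


class Boundary:
    """Hypothetical wrapper around nondeterministic calls (not Replay's API)."""

    def __init__(self, log=None):
        # With no log we are recording; with a log we are replaying.
        self.replaying = log is not None
        self.log = list(log) if self.replaying else []
        self.cursor = 0

    def call(self, name, fn):
        if self.replaying:
            # Serve the recorded value instead of touching the real world.
            recorded_name, value = self.log[self.cursor]
            self.cursor += 1
            if recorded_name != name:
                # The replay asked for something the recording never saw.
                raise RuntimeError(
                    f"divergence: expected {recorded_name}, got {name}"
                )
            return value
        value = fn()
        self.log.append((name, value))
        return value


def program(boundary):
    """Deterministic logic that reaches the outside world only via the boundary."""
    started = boundary.call("clock", time.time)
    roll = boundary.call("random", lambda: random.randint(1, 6))
    return f"rolled {roll} at {started:.3f}"


# Record once against the real clock and random number generator.
recording = Boundary()
first_run = program(recording)

# Replay from the log: the program believes it is talking to a real machine.
replay = Boundary(log=recording.log)
second_run = program(replay)
```

In this sketch the two runs produce the same string even though the second one never reads the real clock or random number generator. The log holds two entries, not billions, which is the trade-off the paragraph above describes.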

Here are some of the strategies that are helping us build the impossible. Like everything we do, we’re continually iterating on them.

Bias towards learning

There is a parable about a team that built a plane in a month. They knew how to rebuild the plane in a day and had thirty attempts to get it right. Iterating quickly helps you run cheap experiments, identify risks, and learn quickly.

People often reference Facebook’s value “move fast and break things”, but Facebook only got it half right. We also “break things, and learn fast”. That’s fine to say, but in practice it’s hard to do.

We intentionally set stability OKRs to help align our risk preferences. A week with 50% fewer issues is just as bad as a week with 50% more if it inhibited our learning. We strive to create a culture of psychological safety because you cannot take risks if you do not feel safe. If your goal is to go to Mars, you need some rockets to crash in order to test the boundary conditions.

Communicate early. Communicate often.

On any given Tuesday we are investigating low-level browser internals, operating system library calls, source mapping edge cases, replaying divergences, and backend challenges associated with running thousands of browsers at once. DevTools features like inspecting elements, evaluating print statements, and viewing network calls are the easy stuff.

Communication is the best predictor of success. If you listen to great basketball teams, they’re constantly communicating: calling out what they’re seeing, adapting on the fly, and changing assignments. The Warriors model their culture after jazz musicians and strive to play fast, loose, and disciplined. The contradictions are intentional!

For us, it starts with every team member establishing a strong cadence of communication, which we affectionately refer to as the heartbeat. The heartbeat starts with sharing your plans, your questions, your learnings, your roadblocks, your workarounds… When everyone is sharing what they’re seeing, we can all support one another and adapt on the fly.

Linus’s Law, which Eric S. Raymond named after Linus Torvalds, states that “given enough eyeballs, all bugs are shallow”. We believe that given enough communication, good ideas emerge, and we can tighten the learning loops. There’s a great paper on how hypotheses aid the debugging process, which finds that most bugs are solvable given three hypotheses. The challenge for beginners is coming up with three hypotheses. Real-time collaboration is one of the best ways to come up with them.

There are other benefits of real-time communication as well. Sharing is a great way of unblocking yourself (rubber ducking). Following along as your peers solve hard problems is motivating and normalizes setbacks. It’s also more fun.

Two days. Two weeks. Two years.

When we interview candidates we primarily focus on the soft skills. We “map the potato” to simulate discussing unknowns, we pair program to simulate breaking down a problem, and we discuss our current “best” thinking to assess judgement. Prioritization is a core skill. Being able to plan in two-day, two-week, and two-year increments is critical.

Everyone should be able to hold their roadmap, their team members’ roadmaps, and the company’s roadmap in their head. Not easy! The best decisions come from making informed trade-offs. We look for constraints that simplify the problem.

Conclusion

Replay’s culture would not be great for everyone, but it is great for us. Many people would be more comfortable iterating more slowly, taking fewer risks, and sharing findings after the fact. We work the way we do because it gives us the best chance of building the impossible, and when we’re in a state of flow and everyone is supporting one another, it feels incredible.