
Fixing a browser test failure with AI and time travel

We walk through a demo and experiments using an LLM and Replay to fix a browser test failure
Brian Hackett
Do you want an AI to fix test failures in your PRs?  We think this would be pretty neat and we’ve been experimenting.  LLMs can be uncannily good at this, but they need devtools that surface the right information. Doing program analysis on Replay recordings is a great way to build these devtools.
Let’s look at this PR made a couple of months ago against Replay’s debugger.  The PR removes and streamlines a fair amount of functionality, and as part of that removes a data attribute which some tests depended on.  When those tests failed, the PR author had to investigate and update the PR to fix the problem.  Can we use an LLM instead to fix the PR?
We did an experiment to test whether an LLM can solve the problem from a prompt containing the PR changes, the failing test, and the log from the failure.  With 3 trials each of 4 LLMs (o1-preview, o1-mini, gpt-4o, and claude-3.5-sonnet), the only one able to explain the problem and produce suitable code changes to fix it was o1-preview, which succeeded in 2 of its 3 trials.
We then repeated the experiment, adding a few sentences to the prompt describing the immediate cause of the failure.  With this simple change the LLMs succeeded in 11 out of the 12 trials.  Details and transcripts are available here.
Here are the sentences we added:
I compared this failing run against a passing run and found some interesting information: When the test passes, there is an element matching the locator being searched for. When the test fails, a similar element is present, but the locator fails to match because the element does not have a data-test-type attribute.  This element was created by a Logpoint component.
This information about the runtime behavior of the application is not available from logs.  A human developer can compare the DOM in a passing and failing test run and notice this discrepancy.  For an AI developer, we can analyze the application’s behavior to generate a summary describing the failure.
Replay is great for this: we create a low-overhead recording that completely captures the application’s behavior, and then time travel anywhere in the recording to perform detailed analysis.  In this case we compare the DOM in recordings of the test passing (i.e. without the PR applied) and failing, and use the test’s selectors to identify the element which wasn’t found in the test failure due to different attributes.  Then we look at the element’s properties to find the associated React component.
We built an analysis for test failures and incorporated it into a workflow that takes a PR and its test results, and applies the changes suggested by an AI to fix the test failure.  The demo video below walks through this for the example above.
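To make the shape of that workflow concrete, here is a minimal sketch in TypeScript.  Every name in it (proposeFix, callModel, parseFixAttempt, FixAttempt) is hypothetical rather than Replay’s actual API; the real pipeline drives our recording analysis and an LLM behind calls like these.
typescript
// A minimal sketch of the fix-a-failing-test workflow.  All names are hypothetical.
interface FixAttempt {
  explanation: string; // why the test failed
  patch: string;       // a diff to apply on top of the PR branch
}

declare function callModel(prompt: string): Promise<string>;       // hypothetical LLM call
declare function parseFixAttempt(completion: string): FixAttempt;  // hypothetical response parsing

async function proposeFix(
  prDiff: string,          // the PR's changes
  failureLog: string,      // the Playwright failure output
  analysisSummary: string  // natural-language summary produced from the Replay recordings
): Promise<FixAttempt> {
  // The prompt combines the same ingredients as our experiments: the PR changes,
  // the failing test's log, and the analysis summary that made the difference.
  const prompt = [
    "A test started failing after the following change:",
    prDiff,
    "Test failure output:",
    failureLog,
    "Analysis of the passing and failing test recordings:",
    analysisSummary,
    "Explain the failure and produce a patch that fixes it.",
  ].join("\n\n");

  return parseFixAttempt(await callModel(prompt));
}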
This is just a single example, but we believe that most test failures fall into patterns that can be solved by combining Replay-based analysis with LLMs.  We want to continue improving this, and the best way to do that is to look at the real failures you’re encountering.  If you want to try this out and help us, we’d love to hear from you!  Reach us directly at hi@replay.io, or fill out our contact form and we’ll be in touch.

Analysis

Let’s look more closely at this example to better understand what’s going on.  The original change in the PR which caused the failure is below.  This code sets the data attributes on elements rendered by a Breakpoint React component, and the change removes the data-test-type attribute from these elements.
javascript
  return (
    <div
      className={styles.Point}
-     data-test-name="Breakpoint"
-     data-test-type={type}
+     data-test-name="LogPoint"
      data-test-column-index={point.location.column}
      data-test-line-number={point.location.line}
As a human developer, what information do we need to understand that this is the faulty change?  Let’s look at the test failure log:
plain text
1) packages/e2e-tests/tests/logpoints-02.test.ts:18:5 › logpoints-02: conditional log-points ─────

   Test timeout of 60000ms exceeded.

   Error: locator.getAttribute: Page closed
   =========================== logs ===========================
   waiting for locator('[data-test-name="LogPoint"][data-test-type="logpoint"][data-test-line-number="20"]').locator('[data-test-name="LogPointToggle"]')
   ============================================================

      at packages/e2e-tests/helpers/pause-information-panel.ts:367

      365 |   const targetState = enabled ? POINT_BEHAVIOR_ENABLED : POINT_BEHAVIOR_DISABLED_TEMPORARILY;
      366 |   const toggle = pointLocator.locator('[data-test-name="LogPointToggle"]');
    > 367 |   const currentState = await toggle.getAttribute("data-test-state");
          |                                     ^
      368 |   if (targetState !== currentState) {
      369 |     await debugPrint(page, `Toggling point to ${targetState}`, "togglePoint");
      370 |     await toggle.locator("input").click();

      at togglePoint (/Users/brianhackett/recordreplay/devtools/packages/e2e-tests/helpers/pause-information-panel.ts:367:37)
      at /Users/brianhackett/recordreplay/devtools/packages/e2e-tests/tests/logpoints-02.test.ts:52:20

   Pending operations:
     - locator.getAttribute at packages/e2e-tests/helpers/pause-information-panel.ts:367:37
The test uses Playwright locators to access elements on the page.  From this failure log, we can see the test got stuck waiting for an element matching a locator to appear.  We don’t know why the element never appeared, however.  It could be a problem with the locator being used, but it could also be a problem with the page rendering incorrectly.
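Here is roughly what the test’s helper is doing, reconstructed from the log above; how pointLocator is built is an assumption based on the locator string the test was waiting for.
typescript
import { test } from "@playwright/test";

// Reconstruction of the helper behavior shown in the failure log; the pointLocator
// construction is assumed from the locator string the test was waiting for.
test("togglePoint sketch", async ({ page }) => {
  const pointLocator = page.locator(
    '[data-test-name="LogPoint"][data-test-type="logpoint"][data-test-line-number="20"]'
  );
  const toggle = pointLocator.locator('[data-test-name="LogPointToggle"]');
  // getAttribute waits for a matching element to appear; if nothing ever matches
  // the locator, the call never resolves and the test times out, as in the log.
  const currentState = await toggle.getAttribute("data-test-state");
  console.log(currentState);
});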
One of the attributes in the failing locator is data-test-type.  Because the PR removes this attribute from the Breakpoint component’s element, we can make an educated guess that this removal broke the locator and is causing the test to fail.
OpenAI’s latest and most advanced model, o1-preview, was the only LLM we tested that was able to make this inference, and even then it does not do so consistently.  Generally, without other data, the LLMs tend to get distracted by non-pertinent information in the failure output.  When o1-preview failed, it focused on the LogPointToggle locator not matching and incorrectly explained that the code for rendering the toggle button had been removed.
Even though we have enough information to guess at the cause of the failure, it isn’t easy to do so, even for a human – this is a pretty large PR.  We can make this task easier by using devtools to better understand the failure.
As human developers, we can take the following steps:
  1. Use Chrome’s devtools to look at the DOM when the test fails, and when the test passes by running it without the PR’s changes.
  2. See that the DOM is structured similarly in both cases, and that the data-test-type attribute is present on the matching element when the test passes but not on the corresponding element when the test fails (a console sketch of this check follows the list).
  3. Use React’s devtools to see that this element is part of a Logpoint component, and zero in on the problematic change in the PR.
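One way to carry out the DOM comparison in step 2 is straight from the devtools console.  A hypothetical check in the failing run might look like the following; the selector is the test’s locator with the data-test-type requirement dropped, since that is the part that fails to match.
typescript
// Hypothetical console check in the failing run.  The selector reuses the
// test's locator minus [data-test-type="logpoint"], the piece that fails to match.
const el = document.querySelector('[data-test-name="LogPoint"][data-test-line-number="20"]');
console.log(el !== null);                         // true: a similar element is rendered
console.log(el?.getAttribute("data-test-type"));  // null: the attribute the locator requires is missing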
An LLM is going to have a lot of trouble replicating this workflow.  There are a lot of steps involved in using these tools to study the DOM, and each step is an opportunity for the LLM to confidently say or do something wrong and get stuck.
LLMs deal best with natural language interfaces (“language” is the second L in LLM after all) and we believe the best way to help them be more effective is to give them information in natural language about the problem they’re solving.  We can reliably and accurately get the information we need here using program analysis: running algorithms to mechanically extract and summarize properties of the program’s static source or its dynamic behavior while executing.
Replay’s core technology is designed for statically analyzing recorded executions.  All of a program’s dynamic behavior can be extracted from a Replay recording.  We can reconstruct the application state at any point in time, as if we had a time machine, and build analyses to query this state. Our analysis for locator problems takes the following steps (a minimal code sketch follows the list):
  1. Find the element matched by the locator when the test passes.
  2. Look at the DOM when the test fails for a similar element.
  3. Look for a change to the locator that would make it match that similar element.  From this we determine that the element in the failing run is missing a data-test-type attribute.
  4. Read the JS _debugOwner property on the element to identify the associated React component.
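Here is a minimal sketch of those four steps in TypeScript.  Every helper and type in it (queryDomAtFailure, findSimilarElement, RecordedElement) is a hypothetical stand-in for a query against a Replay recording, not Replay’s actual API; the real analysis runs over the recording’s replayed state rather than a live page.
typescript
// A minimal sketch of the locator analysis.  All helpers here are hypothetical
// stand-ins for queries against Replay recordings.
interface RecordedElement {
  attributes: Record<string, string>;
  reactOwner?: string; // component name read from the element's fiber _debugOwner
}

// Hypothetical: query a recording's DOM at the point where the test is waiting.
declare function queryDomAtFailure(recordingId: string, locator: string): RecordedElement | null;
// Hypothetical: find an element in the failing run's DOM that resembles the reference element.
declare function findSimilarElement(recordingId: string, reference: RecordedElement): RecordedElement | null;

function explainLocatorFailure(passingRecording: string, failingRecording: string, locator: string): string {
  // 1. Find the element matched by the locator when the test passes.
  const matched = queryDomAtFailure(passingRecording, locator);
  if (!matched) return "The locator does not match an element even in the passing run.";

  // 2. Look at the DOM when the test fails for a similar element.
  const similar = findSimilarElement(failingRecording, matched);
  if (!similar) return "No similar element is rendered in the failing run.";

  // 3. Determine which attributes the locator needs that the similar element lacks.
  const missing = Object.keys(matched.attributes).filter(name => !(name in similar.attributes));

  // 4. Name the React component that rendered the element, via its fiber's _debugOwner.
  const owner = similar.reactOwner ?? "an unknown component";
  return (
    "When the test passes, an element matches the locator. When it fails, a similar element " +
    `is present but is missing the ${missing.join(", ")} attribute(s). ` +
    `This element was created by the ${owner} component.`
  );
}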
Reporting this information to the LLM makes it much better at solving the problem. This analysis we’re doing on the Replay recordings is essentially a devtool for the AI.
Analysis by itself isn’t enough to fix the problem in this PR.  Even with an explanation of the failure that makes it easier to understand, we still need the LLM to put things together and create a suitable patch.  The combination of these technologies is more powerful than either one alone, and that will continue to be the case no matter how smart LLMs get.  Good devtools help human developers be more effective, and the same holds for AI developers.
We’re excited to explore the combination of analysis and LLMs to automate common development tasks.  As we said earlier, we want to work on fixing the real problems you’re running into.  If you want to try this out and help us, reach us directly at hi@replay.io, or fill out our contact form.