Our Cloud Development Environment Journey

Dan Miller
Less than a year after Replay started, we invested in a “cloud development environment”, or CDE, as it is becoming known. We ended up writing one ourselves, though there are plenty of off-the-shelf solutions. In this post I’ll describe the motivations for our CDE and what it looks like.

Why a CDE?

Replay requires a lot of compute resources to run. Even replaying a 30-second recording can require many Chrome processes, each with multiple GBs of resident memory. For longer recordings we have to start hundreds or even thousands of those processes and distribute them across many computers to achieve acceptable response times. These workloads are not feasible to run on a single local computer, and even if they were, that setup would not resemble the operating and performance characteristics of production. We get a much more faithful development environment by developing in the cloud, even leaving aside potential differences in operating system (macOS vs Windows vs Linux) and CPU architecture (arm64 vs x86_64).
These requirements lend themselves quite well to Kubernetes, which is what we use to run our backend. Kubernetes is a runtime, like Linux or Windows. It provides APIs that developers can use in their apps. For example, at Replay we use the Kubernetes APIs to create the aforementioned browser instances on the fly, depending on what the user needs. When a user asks for a recording to be replayed, our app can decide which container to launch and what kind of hardware it should run on, and simply tell Kubernetes to make it so. This is exactly the level that developers want to be thinking at. Because we use these APIs, you can’t just run Replay on a computer without Kubernetes, and this in turn means using containers.
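As a rough illustration of "decide which container and hardware, then tell Kubernetes to make it so", here is a minimal sketch that renders a Pod manifest. It is not Replay's actual code: the function name, pod name, image, memory size, and instance type are all hypothetical, and piping to kubectl stands in for whatever Kubernetes API client the app really uses.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: every name and value here is illustrative.

# Render a Pod manifest for one replay browser process.
#   $1 = pod name, $2 = container image, $3 = memory request, $4 = instance type
render_replay_pod() {
  local name="$1" image="$2" memory="$3" instance_type="$4"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  nodeSelector:
    node.kubernetes.io/instance-type: ${instance_type}
  containers:
    - name: browser
      image: ${image}
      resources:
        requests:
          memory: ${memory}
EOF
}

# Usage (requires kubectl and access to a cluster):
#   render_replay_pod replay-abc123 example.com/replay/chrome:latest 4Gi m5.2xlarge \
#     | kubectl apply -f -
```

The nodeSelector is one way to express "what kind of hardware it should run on"; the memory request tells the scheduler how much room each browser process needs.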
Containers on Linux are fine, but they are a recipe for sadness on macOS.

Why containers suck on macOS

Docker doesn't try to be agnostic of the platform it runs on; it ties itself directly to Linux. This is where the problems start for macOS. To run Docker on macOS you need a Linux virtual machine running in the background. When you run a Docker container, or a Kubernetes pod, it actually runs in the VM.
This has worked OK for macOS users for most of Docker's life. It gives you a high-fidelity development environment to test your code before it hits production. But as we use containers for more and more things in development, this high-fidelity environment becomes a liability, primarily due to disk I/O.
Say you wanted to write a bash one-liner that checks the copyright comment header on all of the files in a git repository.
```bash
git ls-files | while read -r file; do head -n 1 "$file" | grep -q "Copyright" || echo "$file"; done
```
Perhaps, because of differences between the BSD and GNU versions of tools like sed, you want to run it in a container. I implemented this using a simple Earthfile:
```earthfile
copyright:
    FROM alpine
    RUN apk add --no-cache git
    COPY . .
    RUN git ls-files | while read -r file; do head -n 1 "$file" | grep -q "Copyright" || echo "$file"; done
```
Here are the results of running this command on Linux and macOS, inside of a container and outside of a container, with the apk add cached:
| Setup | Result |
| --- | --- |
| 2021 M1 Max (container) | 5s |
| t3.xlarge Linux VM (container) | 3s |
| 2021 M1 Max (native) | 0.3s |
| t3.xlarge Linux VM (native) | 1s |
That’s a huge penalty of roughly 16x on macOS, but only a 3x penalty on Linux. The reason is that on macOS you need to copy all of the files you want to operate on from the host OS into the VM. This is much slower than copying files on your hard drive; it’s more akin to copying files over the network. On Linux you don’t incur this cost because the container’s filesystem lives on the same filesystem as the host OS.
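The penalty figures are simply the container time divided by the native time. A small sketch of that arithmetic, using the timings reported above (the function name is made up for illustration):

```bash
# Slowdown = container time / native time, in seconds.
slowdown() {
  awk -v container="$1" -v native="$2" 'BEGIN { printf "%.1f\n", container / native }'
}

slowdown 5 0.3   # 2021 M1 Max: prints 16.7
slowdown 3 1     # t3.xlarge Linux VM: prints 3.0
```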

Our Cloud Development Environment

Because running Replay on a laptop is infeasible, containers carry a penalty on macOS, and a CDE brings the traditional benefit of easy onboarding, we built our own and switched all engineers over to it in 2021. It consists of three parts: a Kubernetes cluster, a development flow orchestrator, and remote builds.
Let’s get the simplest one out of the way first: Kubernetes. We have a Kubernetes cluster in a separate AWS account that runs all of the services. Each developer gets their own namespace. We have a little cronjob that cleans up namespaces that haven’t been used in a while, which lets the underlying node groups scale down on weekends.
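The cleanup job could be sketched roughly as follows. This is an assumption-heavy illustration, not our actual cronjob: how "last used" is recorded, the seven-day limit, and the helper named in the usage comment are all hypothetical, and deleting the namespace is just one possible meaning of "cleans up".

```bash
# Print namespaces whose last-use timestamp is older than the idle limit.
# stdin: lines of "<namespace> <last_used_epoch_seconds>"
#   $1 = current time (epoch seconds), $2 = idle limit in days
stale_namespaces() {
  local now="$1" max_idle_days="$2"
  awk -v now="$now" -v limit="$((max_idle_days * 86400))" \
    '(now - $2) > limit { print $1 }'
}

# Hypothetical wiring inside the cronjob (requires kubectl);
# list_namespaces_with_last_used is an assumed helper, not a real command:
#   list_namespaces_with_last_used \
#     | stale_namespaces "$(date +%s)" 7 \
#     | while read -r ns; do kubectl delete namespace "$ns"; done
```

Keeping the staleness check as a pure filter over stdin makes it easy to dry-run the job and inspect what it would remove before any namespace is touched.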
Next, we use Tilt to orchestrate our development flow. When a developer starts working they just run tilt up, and Tilt logs in to Docker, builds Docker images, and applies configuration to the Kubernetes cluster. Tilt watches all of your files and understands which services it needs to build and deploy depending on which files changed. It also provides other workflows, such as one that starts a web browser pointed at our web UI and pre-configured to send newly created recordings to the development Kubernetes cluster.
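To make the "which services need rebuilding" idea concrete, here is a conceptual sketch only. Tilt works this out itself from the dependencies declared in its configuration; the services/<name>/ directory layout and the service names below are hypothetical.

```bash
# Map changed file paths (one per line on stdin) to the services that own them.
# Assumes a hypothetical layout where each service lives under services/<name>/.
services_for_changes() {
  local path
  while read -r path; do
    case "$path" in
      services/*/*)
        path="${path#services/}"   # drop the leading "services/"
        echo "${path%%/*}"         # keep only the service directory name
        ;;
    esac
  done | sort -u
}

# Example:
#   git diff --name-only | services_for_changes
```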
Initially this was our entire CDE, but because of the macOS issues mentioned above, building 5+ Docker images on a macOS computer could take a really long time. It also really hurt battery life. As a result we switched to using Earthly with Earthly Satellites to do remote Docker builds. Now when a developer starts Tilt, all work, from builds to service execution, happens in the cloud. You don’t need to use Earthly to do this; any remote BuildKit instance works well, but we really like the UX Earthly offers. One trick we use here is to run all Earthly builds with the --no-output and --push flags. This means that when Earthly builds an image it does not download it back to the local laptop, because we have no need for it, and instead pushes it directly to the AWS Docker image repository.
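A minimal sketch of what such an invocation might look like as a small wrapper. The two flags are the ones described above; the wrapper name, the EARTHLY_BIN override, and the +api-image target are hypothetical.

```bash
# Run an Earthly build that pushes images to the registry without
# downloading them back to the local machine.
# EARTHLY_BIN is a made-up override, handy for dry runs (EARTHLY_BIN=echo).
build_remote() {
  "${EARTHLY_BIN:-earthly}" --push --no-output "$@"
}

# Usage (requires earthly and a configured satellite or remote BuildKit):
#   build_remote +api-image
# Dry run that only prints the arguments:
#   EARTHLY_BIN=echo build_remote +api-image
```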