Building a remote cache server for Turborepo

12 February 2022

Turborepo is a tool that came across my virtual desk recently. Monorepo develoment has been around for a long time. This is a strategy where all of your code remains in one repository regardless of services. A lot of people use monorepo's even for microservices. The huge upside is to keep everything in one place, which allows for development efficiency, such as grepping an entire codebase for specific keywords. A quick example would be a top level directory which has child directories that each contain an npm package, unlike publishing these packages, you access them locally as though they were published.

There are many tools in the Javascript ecosystem to manage monorepos. Yarn, and npm both have their own workspaces. Lerna is a tool that people use to run commands a cross these packages. I've been a huge fan of monorepos for years. One of the big problems with this setup is build times. At Quala we have around 38 packages, and some of my previous employers have had over 100. When you have these large repos sometimes you can make a change in a single package, but when you run build you have to wait to build the entire repository which can take a long time.

Turborepo, however caches the build output of packages, so when you change a package it will get cache hits on particular packages, and thus you only build the changes you make. This is not a new idea. Years ago, google built bazel, A lot of people in C++ land have had remote builds. With Turborepo it seems the only official way to have remote caches is to use Vercel, or host your own server. For many reasons at Quala I decided to opt for hosting our own server.

So to add turborepo to your monorepo, you need to add some simple config to the root of your workspace, and your root package.json needs to replace its build command with turborepo build. In the case of remote caches you need to add

--api="https://yourCacheServer.dev" --token="token" --team="team"

Notice, the api flag does not contain a / at the end. Now according to the docs you don't need to pass a team, but I was unable to get the caches to register without it 🤷‍♀️

The API

According to the docs

You can self-host your own Remote Cache or use other remote caching service providers as long as they comply with Turborepo's Remote Caching Server API. I opted to write the server in go, and I used Go Fiber. At first I figured I could copy their structs to my project but honestly the API is so simple, there is no advantage to this.

To get a list of the API's you need, you are linked to some code written in Go. I reverse engineered this code a bit, and came up with 4 APIs, and an AUTH token

Authorization: Bearer ${token}
PUT: /v8/artifacts/:hash
GET: /v8/artifacts/:hash
GET: /v2/user
GET: /v2/teams

Authorization

When turborepo sends requests it appends the Authorization header which will contain our token. Ideally you would add to your server a way to auth a user and give them this token. In the below example we have a single token that comes from an environment variable. You really should have per user auth.

	app.Use(func(c *fiber.Ctx) error {
		authHeader := c.Get("Authorization")
		if authHeader != "Bearer "+token {
			c.Status(401).SendString("Unauthorized")
			return nil
		}
		return c.Next()
	})

Handling Requests

The API pretty much breaks down like this.

PUT: /v8/artifacts/:hash will send a file that you must write somewhere. Some people opt for sending it to S3, I decided to use a persistent disk, and save on the disk. I wanted the fastest responses for the caches. Heck if I'm going to remote cache something that would still be kinda quick on an M1, it better perform.

	app.Put("/v8/artifacts/:hash", func(c *fiber.Ctx) error {
		fmt.Println(string(c.Request().URI().QueryString()))
		return os.WriteFile("./cache/"+c.Params("hash"), c.Request().Body(), 0644)
	})

The same URL but on a get is simple. Retrieve a file and serve it up, or return a 404

	app.Get("/v8/artifacts/:hash", func(c *fiber.Ctx) error {
		fmt.Println(string(c.Request().URI().QueryString()))
		return c.SendFile("./cache/" + c.Params("hash"))
	})

The last two honesty you don't need to make things work. You can just return a 200

	app.Get("/v2/teams", func(c *fiber.Ctx) error {
		return c.SendStatus(fiber.StatusOK)
	})

	app.Get("/v2/user", func(c *fiber.Ctx) error {
		return c.SendStatus(fiber.StatusOK)
	})

The /v2/user API is supposed to return information about the current user in the following shape. I'm pretty sure (not positive) created at is an epoch of the time the user was created. I'm guessing its largely used for Vercel.

{
	ID        string
	Username  string
	Email     string
	Name      string
	CreatedAt int   
}

The team api is supposed to look something like the following.

{
    Pagination {
        Count: int,
        Next: int,
        Prev: int
    }
    Teams [
        Team {
            ID: string,
            Slug: string,
            Name: string,
            CreatedAt: int,
            Created: string
        }
    ]
}

What about the --team flag?

So when requests are made with --team a query string ?slug=team is added to the request. You can use this to ensure a particular user is in the given team, and you can fragment your caches by team. I ommitted that code from the above example, but the easiest way would be to have ./cache/${team}/${hash} directory structure for the caches on disk. Note, on the GET requests you should auth the token against the team ID, and return a 404 if the user is not in the team. I would not opt to return a Unauthorized header, as that can be used by bad actors to cycle through tokens to know which one will work to cause harm.

The Result

An extremely minimal server is in this github repo (although you shouldn't probably use it without building it out more).

Tagged In:
go javascript turborepo devops build node.js