Using shard merging to tackle the long tail of tiny and stale repos
In this post we give you a brief overview of how we're tackling the long tail of tiny repositories by introducing shard merging to Zoekt, in our quest to index the OSS universe.
What is Zoekt?
Zoekt is a code search engine that performs trigram-based regex search. Originally created by Han-Wen Niehuys, Zoekt is fast, easy to deploy, and easy to maintain, which makes it a great choice for our self-hosted customer deployments. Sourcegraph is actually taking on maintainership of Zoekt, which you can read about here.
Why not just scale out?
Naturally, Zoekt's size of the index scales with the size of the input data. Zoekt’s index is about 2–3 times larger than the input data. Some of that data, like the trigrams, is kept in memory. At the scale of the open source universe, it quickly becomes too costly to just scale out. Luckily, Zoekt still has a lot of untapped potential when it comes to more efficient data structures. For example, in a previous post, Ryan Hitchman explained how we changed one of Zoekt's core data structures to reduce memory by 5x.
The core idea of this project is to merge the long tail of small, stale repositories into more efficient representations on disk and in memory. To understand the motivation behind this idea we have to first dive into Zoekt's data model.
Zoekt’s data model
Zoekt indexes repositories and stores the index in one or more files on disk. A repository contains a hierarchy of documents, such as source code, binaries, text files and so forth. At index time documents are added, one at a time, to the index builder. Once we cross a threshold of input data, currently configured to be 100 MiB, the builder flushes the index to a file on disk. This file is called a shard. For small repositories, there is a 1:1 relationship between the repository and its shard. Larger repositories, such as Kubernetes or the Linux Kernel map to several shards. At query time, shards are treated independently.
The long tail
Most repositories are tiny and fit within a single shard. The following histogram shows the distribution of shard sizes on one of our production instances.
75% of the shards in our sample are smaller than 2.1 MiB. Each shard contains, among other data, the trigrams we created during indexing. On startup, Zoekt loads those trigrams into memory. Trigrams for content and file names make up more than 70% of the heap of Zoekt’s web server. The more unique trigrams a shard has, the more costly its in-memory representation is.
The two charts below show the number of trigrams in a shard vs. the shard's size. Plot A shows that most shards have less than 500k trigrams. Plot B shows a subset of the data in A (red box).