This post was written by Derrick Stolee, a Git contributor since 2017 who focuses on performance. Some of his contributions include speeding up git log --graph and git push for large repositories. You can hear him speak at Git Merge in Los Angeles on March 4.
March 31, 2021 update
To fix a bug, we’ve updated the code provided in the Cloning in Sparse Mode and Sparse-checkout and partial clones sections of this post. Thanks to readers for catching this.
Git 2.25.0 includes a new experimental git sparse-checkout command that makes the existing feature easier to use, along with some important performance benefits for large repositories.
Does your repository have so many files at root that your source directory is growing out of control? Do commands like git checkout or git status slow to a crawl? These problems can be extremely frustrating, especially for developers who just need to modify a small fraction of the files available. Now, an improved and experimental sparse-checkout feature allows users to restrict their working directory to only the files they care about. Specifically, if you use the “microservices in a monorepo” pattern, you can ensure the developer workflow is as fast as possible while maintaining all the benefits of a monorepo.
Previously, in order to use the sparse-checkout feature you needed to manually edit a file in your .git directory, change a config setting, and run an obscure plumbing command. The instructions for getting back to normal were even more complicated. The new git sparse-checkout command makes this process much easier.
To follow along, I created a sample Git repository that you can clone and test yourself. This repository doesn’t actually build a real application, but imagine that it’s the monorepo for a photo storage and sharing application.
This repository includes the following file structure in the initial two levels:
Imagine that this repository is a monorepo containing many microservices and clients.
The client directory contains independent clients for three different platforms: Android, Desktop (using Electron), and iOS.
The service directory contains all of the server-side logic for several independently-deployable microservices.
The web directory contains the “serverless” static web pages that use JavaScript to communicate with those microservices.
While all of this code is in the monorepo, not every user of the repository needs every directory. Still, the users are constantly updating all 1,557 files in the repo when they use git pull to update their local branch with the latest master. Note that this number is very small, so imagine adding an extra three zeroes to the end of it.
To get started with the sparse-checkout feature, you can run git sparse-checkout init --cone to restrict the working directory to only the files at root (and in the .git directory).
A developer can do very little with only these files. However, the team building the Android app can usually get away with only the files in client/android and run all integration testing with the currently-deployed services. The Android team needs a much smaller set of files as they work. This means they can use the git sparse-checkout set command to restrict to that directory:
$ git sparse-checkout set client/android
bootstrap.sh* client/ LICENSE.md README.md
$ ls client/
android/
$ find . -type f | wc -l
For more complicated scenarios, the architecture team created a bootstrap.sh script at the root of the repository. This script exists even when the repository only contains the files at root, and provides the commands for teams that need more detailed sparse-checkout cones.
For example, the team running the identity service needs the code for their microservice, as well as the directory containing common code for all microservices.
The team that builds the photo browser for the web runs a web app and occasionally needs to deploy their own version of the microservices to do local testing of their web features. Their bootstrap code requires a larger cone, including all microservice directories and the web/browser directory.
$ ./bootstrap.sh browser
Running ‘git sparse-checkout init --cone’
Running ‘git sparse-checkout set service web/browser’
bootstrap.sh* LICENSE.md README.md service/ web/
$ ls service/
common/ identity/ items/ list/
$ ls web
browser/
$ find . -type f | wc -l
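The contents of bootstrap.sh are not shown in this post. As a rough idea of how such a script could work, here is a minimal sketch that maps a persona name to its cone directories and then runs the two commands seen in the transcript. The function names persona_dirs and bootstrap are hypothetical. The browser directories come from the transcript and the android directory from the earlier example, while the identity mapping is only inferred from the description above and may not match the real script.

```shell
#!/bin/sh
# Hypothetical sketch of a persona-based bootstrap script; this is NOT the
# script from the sample repository.

# Print the cone directories for a persona, or fail for an unknown name.
persona_dirs() {
  case "$1" in
    android)  echo "client/android" ;;
    # Assumption: inferred from the prose (own service plus common code).
    identity) echo "service/common service/identity" ;;
    browser)  echo "service web/browser" ;;
    *)        return 1 ;;
  esac
}

# Enable cone mode, then restrict the working directory to the persona's cone.
bootstrap() {
  dirs=$(persona_dirs "$1") || {
    echo "usage: bootstrap <android|identity|browser>" >&2
    return 1
  }
  echo "Running 'git sparse-checkout init --cone'"
  git sparse-checkout init --cone || return 1
  echo "Running 'git sparse-checkout set $dirs'"
  # Intentionally unquoted so that each directory becomes its own argument.
  git sparse-checkout set $dirs
}
```

Calling bootstrap browser from inside a clone would then run the same two git sparse-checkout commands as the transcript. Keeping the mapping in one small function means a new team only needs one extra case line.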
At any point, we can check which directories are included in our working directory using the list subcommand.
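To try the list subcommand without cloning anything, the following sketch builds a throwaway repository and inspects its cone. It assumes Git 2.25.0 or later is installed. The directory layout is loosely modeled on the sample repository, but the file names and contents are made up for illustration.

```shell
#!/bin/sh
# Sketch: try `git sparse-checkout list` in a throwaway repository.
# The layout and file names below are hypothetical.
set -e
repo=$(mktemp -d)   # left behind afterwards; remove it when done
cd "$repo"
git init --quiet .
mkdir -p client/android service/common web/browser
echo "readme" > README.md
echo "app" > client/android/app.txt
echo "lib" > service/common/lib.txt
echo "page" > web/browser/index.html
git add .
git -c user.name="Example" -c user.email="example@example.com" \
    commit --quiet -m "Initial layout"

# Restrict the working directory to one cone, then ask Git what is included.
git sparse-checkout init --cone
git sparse-checkout set client/android
listed=$(git sparse-checkout list)
echo "$listed"
```

In cone mode, list prints the directories that define the cone rather than the raw patterns stored in the sparse-checkout file.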
Given a file path such as /A/B/C/D.txt, Git includes the path in the working directory if any of the following are true:
A/B/C is in the parent set,
A/B/C is in the recursive set,
A/B is in the recursive set, or
A is in the recursive set.
In this way, Git matches the M patterns across N files in O(M + N*d) time, where d is the maximum folder depth of a file. That bound assumes that Git inspects the files in an arbitrary order. In practice, when Git evaluates which files match the sparse-checkout patterns, it inspects the files in sorted order. This means that when the start of a folder matches a recursive pattern exactly, Git marks everything in that folder as “included” without doing any hashset lookups. Git similarly detects the start of a folder that’s outside of our cone and marks everything in that folder as “excluded”. This reduces the typical time to be closer to O(M+N).
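The inclusion rule can be sketched in a few lines of shell. This is only an illustration, not Git's implementation: space-separated strings stand in for the two hash sets, so each membership check here is a linear scan rather than a constant-time lookup, and paths containing spaces are not handled. The set contents are the ones assumed for git sparse-checkout set client/android, and the file names in the loop are hypothetical.

```shell
#!/bin/sh
# Sketch of the cone-mode inclusion rule (not Git's actual code).
# Assumed pattern sets for `git sparse-checkout set client/android`.
PARENT_SET="client"
RECURSIVE_SET="client/android"

# in_set NAME LIST: succeed if NAME is one of the words in LIST.
in_set() {
  case " $2 " in
    *" $1 "*) return 0 ;;
  esac
  return 1
}

# in_cone PATH: succeed if a file path (relative to the repository root)
# belongs in the working directory.
in_cone() {
  case "$1" in
    */*) dir=${1%/*} ;;   # directory that directly contains the file
    *)   return 0 ;;      # files at root are always included
  esac
  # Files directly inside a parent directory are included.
  in_set "$dir" "$PARENT_SET" && return 0
  # Otherwise walk up the ancestors (at most d steps), looking for a
  # recursive match.
  while :; do
    in_set "$dir" "$RECURSIVE_SET" && return 0
    case "$dir" in
      */*) dir=${dir%/*} ;;
      *)   return 1 ;;
    esac
  done
}

# Hypothetical file names, chosen to exercise each branch of the rule.
for path in README.md client/README.md client/android/src/Main.java \
            client/ios/App.swift service/common/lib.txt; do
  if in_cone "$path"; then
    echo "included  $path"
  else
    echo "excluded  $path"
  fi
done
```

The loop in in_cone is where the factor d in O(M + N*d) comes from: a file nested d folders deep needs at most d ancestor checks.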
Continuing the process, the files at root get included immediately. For each directory, Git checks whether it appears in the recursive or parent pattern set. The only recursive match is “client/android”, so all paths it contains are automatically marked as “matched”. The “client” directory appears in the parent pattern set, so we test all entries in that directory, but the only other match is the README file. The parent and recursive pattern sets don’t include the “service” and “web” directories, so we automatically skip the entries in those directories.
In the figure, each green check mark is a successful hashset containment query and each red X is an unsuccessful hashset containment query. There are many fewer hashset containment queries than there were pattern match queries. This logic to auto-match all paths within a folder replaced a TODO comment from eight years ago. It’s unlikely that this kind of logic could be applied consistently without restricting the set of patterns the way that cone mode does.
Finally, in our real example that took five minutes before, our pattern matching now takes less than one second.
Sparse-checkout and partial clones
Pairing sparse-checkout with the partial clone feature accelerates these workflows even more. This combination speeds up the data transfer process since you don’t need every reachable Git object, and instead, can download only those you need to populate your cone of the working directory. You can test it right now with the example repository by adding --filter=blob:none to the clone command:
$ git clone --filter=blob:none --no-checkout https://github.com/derrickstolee/sparse-checkout-example
Cloning into 'sparse-checkout-example'...
Receiving objects: 100% (373/373), 75.98 KiB | 2.71 MiB/s, done.
Resolving deltas: 100% (23/23), done.
$ cd sparse-checkout-example/
$ git sparse-checkout init --cone
$ git checkout main
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 1 (delta 0), pack-reused 1
Receiving objects: 100% (3/3), 1.41 KiB | 1.41 MiB/s, done.
Already on 'main'
Your branch is up to date with 'origin/main'.
$ git sparse-checkout set client/android
remote: Enumerating objects: 3, done.
remote: Counting objects: 100% (3/3), done.
remote: Compressing objects: 100% (3/3), done.
remote: Total 26 (delta 0), reused 1 (delta 0), pack-reused 23
Receiving objects: 100% (26/26), 985.91 KiB | 13.69 MiB/s, done.
In the examples above, look at the number of objects each step downloaded. The initial clone with --filter=blob:none needed only 373 objects instead of the ~2,000 needed to clone the entire repository.
Similarly, look at the total object count in the “Receiving objects” line for each git sparse-checkout command. When initializing the sparse-checkout feature, three objects are downloaded for the files at root. Finally, 26 more objects are downloaded when populating the client/android directory.
That’s the wonder of partial clone combined with sparse-checkout: you pay for what you need. Keep an eye out for the partial clone feature to become generally available[1].