Managing large repositories

GitLab, like any Git-based system, is subject to performance constraints when repositories grow into the gigabytes.

In the following sections, we detail several best practices for improving performance with these large repositories on GitLab.

Large File Storage (LFS)

It’s strongly recommended in any Git system that binary or blob files (for example, packages, audio, video, or graphics) are stored as Large File Storage (LFS) objects. In such a setup, the objects are stored elsewhere, such as in object storage, which can reduce the repository size significantly and thus improve performance.

To check whether a repository contains these kinds of objects, it’s recommended to run a tool like git-sizer to get a detailed analysis. Such tools show in detail what makes up the repository and highlight any areas of concern. If any large objects are found, it’s then recommended to remove them with a tool such as git filter-repo.
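
As a rough complement to a dedicated tool, large blobs can also be listed with plain Git plumbing. The following is a minimal sketch, not a substitute for git-sizer: the function names `parse_batch_check` and `large_blobs` and the 10 MiB default threshold are assumptions made for illustration. It relies only on `git rev-list --objects --all` and `git cat-file --batch-check`.

```python
import subprocess


def parse_batch_check(output: str, min_bytes: int) -> list[tuple[int, str]]:
    """Return (size, path) pairs for blobs of at least min_bytes, largest first."""
    blobs = []
    for line in output.splitlines():
        # Each line is "<type> <size> <path>"; the path may be empty or contain spaces.
        parts = line.split(" ", 2)
        if len(parts) < 2 or parts[0] != "blob":
            continue
        size = int(parts[1])
        path = parts[2] if len(parts) == 3 else ""
        if size >= min_bytes:
            blobs.append((size, path))
    return sorted(blobs, reverse=True)


def large_blobs(repo: str, min_bytes: int = 10 * 1024 * 1024) -> list[tuple[int, str]]:
    """List blobs reachable from any ref in repo that are at least min_bytes.

    The 10 MiB default is an arbitrary threshold chosen for this sketch.
    """
    # `git rev-list --objects --all` prints "<oid> <path>" for every reachable object.
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    # `--batch-check` reads those lines on stdin; %(rest) echoes the path back.
    checked = subprocess.run(
        ["git", "-C", repo, "cat-file",
         "--batch-check=%(objecttype) %(objectsize) %(rest)"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(checked, min_bytes)
```

Calling `large_blobs(".")` from inside a clone would return the largest blobs in the history, which are the natural candidates for Git LFS. The sizes reported are uncompressed object sizes, so they only approximate each object's on-disk cost.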

Refer to the Git LFS documentation for more information.

Gitaly Pack Objects Cache

Gitaly, the service that provides storage for Git repositories, can be configured to cache a short rolling window of Git fetch responses. This is recommended for large repositories as it can notably reduce server load when your server receives lots of fetch traffic.
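
To illustrate why a short rolling window helps, here is a toy sketch of the idea. It is not how Gitaly implements its cache: the class `RollingWindowCache`, its `fetch` method, and the `build_pack` callback are all hypothetical names. The point is only that identical fetch requests arriving within the window reuse one expensive result.

```python
import time
from typing import Callable


class RollingWindowCache:
    """Toy cache: entries older than max_age_seconds are treated as missing."""

    def __init__(self, max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_age = max_age_seconds
        self.clock = clock
        # Maps a request key to (time stored, cached response bytes).
        self.entries: dict[str, tuple[float, bytes]] = {}

    def fetch(self, request_key: str, build_pack: Callable[[], bytes]) -> bytes:
        now = self.clock()
        cached = self.entries.get(request_key)
        if cached is not None and now - cached[0] <= self.max_age:
            return cached[1]  # Hit: reuse the earlier response, no repacking.
        data = build_pack()  # Miss or expired: do the expensive work once.
        self.entries[request_key] = (now, data)
        return data
```

When many CI jobs fetch the same commit within a few minutes, only the first request in this model pays the cost of building the response. That is the kind of traffic pattern where the cache is expected to help most.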

Refer to the Gitaly Pack Objects Cache documentation for more information.

Reference Architectures

Large repositories tend to be found in larger organisations with many users. The GitLab Quality and Support teams provide several Reference Architectures that are the recommended way to deploy GitLab at scale.

In these types of setups, it’s recommended that the GitLab environment matches a Reference Architecture to improve performance.

Gitaly Cluster

Gitaly Cluster can notably improve large repository performance as it holds multiple replicas of the repository across several nodes. As a result, Gitaly Cluster can load balance read requests against those repositories and is also fault-tolerant.

Gitaly Cluster is recommended for large repositories. However, it is a large solution with additional setup and management complexity. Refer to the Gitaly Cluster documentation for more information, specifically the Before deploying Gitaly Cluster section.

Keep GitLab up to date

Performance improvements and fixes are added continuously in GitLab. As such, it’s recommended you keep GitLab updated to the latest version where possible to benefit from these.

Reduce concurrent clones in CI/CD

Large repositories tend to be monorepos. This in turn typically means that these repositories get a lot of traffic not only from users, but from CI/CD.

CI/CD loads tend to be concurrent because pipelines are scheduled at set times. As a result, Git requests against the repositories can spike notably during these times and reduce performance for CI and users alike.

When designing CI/CD pipelines, it’s advisable to reduce their concurrency by staggering them to run at different times, for example, a set running at one time, and another set running several minutes later.
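
One way to plan such staggering is to compute offset start times for batches of scheduled pipelines. The sketch below is illustrative only: the function `staggered_cron`, its parameters, and the pipeline names are hypothetical. It produces daily cron expressions that you would still have to apply to your pipeline schedules yourself.

```python
def staggered_cron(pipelines: list[str], start_hour: int,
                   gap_minutes: int, batch_size: int) -> dict[str, str]:
    """Map each pipeline name to a daily cron expression, batch by batch."""
    schedule = {}
    for index, name in enumerate(pipelines):
        # Every batch_size pipelines share a slot; each slot starts gap_minutes later.
        offset = (index // batch_size) * gap_minutes
        hour = (start_hour + offset // 60) % 24
        minute = offset % 60
        schedule[name] = f"{minute} {hour} * * *"
    return schedule


# Hypothetical pipeline names: two run at 02:00, two at 02:10, one at 02:20.
plan = staggered_cron(["build", "lint", "test", "docs", "scan"], 2, 10, 2)
```

Smaller batches and wider gaps flatten the spike in Git requests further, at the cost of a longer overall scheduling window.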

There are several other actions that can be explored to improve CI/CD performance with large repositories. Refer to the Runner documentation for more information.