Troubleshooting monorepo performance

Review these suggestions for performance problems with monorepos.

Slowness during git clone or git fetch

There are a few key causes of slowness with clones and fetches.

High CPU utilization

If the CPU utilization on your Gitaly nodes is high, check how much of that CPU is consumed by clones and fetches by filtering the Gitaly logs.

In particular, the command.cpu_time_ms field records how much CPU time each Git command used, so aggregating it shows how much CPU clones and fetches are taking up.
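
For example, here is a minimal sketch that aggregates command.cpu_time_ms per RPC from Gitaly's JSON logs. The log path and the grpc.method field name are assumptions about a typical deployment; adjust them to match yours.

    import json
    from collections import defaultdict

    # Assumption: Omnibus-style location of Gitaly's JSON log. Adjust as needed.
    LOG_PATH = "/var/log/gitlab/gitaly/current"

    cpu_ms_by_rpc = defaultdict(float)

    with open(LOG_PATH) as f:
        for line in f:
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines
            cpu_ms = entry.get("command.cpu_time_ms")
            if cpu_ms is None:
                continue
            # Assumption: the RPC name is logged under grpc.method
            # (clones and fetches typically appear as upload-pack RPCs).
            rpc = entry.get("grpc.method", "unknown")
            cpu_ms_by_rpc[rpc] += float(cpu_ms)

    for rpc, cpu_ms in sorted(cpu_ms_by_rpc.items(), key=lambda kv: -kv[1]):
        print(f"{rpc}: {cpu_ms / 1000:.1f} s of CPU time")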

In most cases, the bulk of server load comes from git-pack-objects processes, which are started during clones and fetches. Monorepos are often very busy, and CI/CD systems send many clone and fetch commands to the server.
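
As a quick sanity check on a Gitaly node, you can count the running pack-objects processes and their combined CPU share. This sketch assumes a Linux ps that supports the pcpu and args output columns:

    import subprocess

    # List every process with its CPU share and command line.
    out = subprocess.run(
        ["ps", "-eo", "pcpu,args"], capture_output=True, text=True, check=True
    ).stdout

    pack_objects = [line for line in out.splitlines()[1:] if "pack-objects" in line]
    total_cpu = sum(float(line.split(None, 1)[0]) for line in pack_objects)

    print(f"{len(pack_objects)} pack-objects processes, ~{total_cpu:.0f}% CPU combined")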

High CPU utilization is a common cause of slow performance. The following causes, which are not mutually exclusive, are possible:

Cause: too many large clones

You might have too many large clones for Gitaly to handle. Gitaly can struggle to keep up because of a number of factors:

  • The size of a repository.
  • The volume of clones and fetches.
  • Lack of CPU capacity.

To help Gitaly process many large clones, reduce the burden on Gitaly servers through optimization strategies such as enabling the Gitaly pack-objects cache or reducing the number and depth of clones performed by CI/CD jobs (for example, by using shallow clones).

The other option is to increase CPU capacity on Gitaly servers.

Cause: poor read distribution

You might have poor read distribution on Gitaly Cluster.

To check whether most read traffic is going to the primary Gitaly node instead of being distributed across the cluster, use the read distribution Prometheus metric.
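
As an illustration, the sketch below queries the Prometheus HTTP API and prints the recent read rate per Gitaly storage. The Prometheus URL and the metric name gitaly_praefect_read_distribution are assumptions; check your Praefect /metrics endpoint and monitoring setup for the exact names.

    import requests

    # Assumptions: Prometheus address and the Praefect read-distribution counter name.
    PROMETHEUS_URL = "http://prometheus.example.com:9090"
    QUERY = "sum by (storage) (rate(gitaly_praefect_read_distribution[5m]))"

    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": QUERY}, timeout=10
    )
    resp.raise_for_status()

    # Print reads per second per storage; a healthy cluster spreads these values
    # across the primary and secondary nodes rather than piling them on one storage.
    for result in resp.json()["data"]["result"]:
        storage = result["metric"].get("storage", "unknown")
        rate = float(result["value"][1])
        print(f"{storage}: {rate:.2f} reads/s")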

If the secondary Gitaly nodes aren’t receiving much traffic, they might be perpetually out of sync. This problem is exacerbated in a monorepo.

Monorepos are often both large and busy: they receive frequent pushes and have many CI jobs running against them. Under that write volume, proxied write operations such as deleting a branch can sometimes fail on the secondary nodes. Each failure triggers a replication job in Gitaly Cluster so that the secondary node eventually catches up.

The replication job is essentially a git fetch performed by the secondary node from the primary node. Because monorepos are often very large, this fetch can take a long time.

If the next proxied write fails before the previous replication job completes, and this keeps happening, the secondary copies of your monorepo are constantly behind. As a result, all read traffic is routed to the primary node.

One reason these proxied writes fail is a known issue with the Git $GIT_DIR/packed-refs file. The file must be locked to remove an entry from it, so concurrent deletes can race with each other and cause some of them to fail.

Engineers at GitLab have developed mitigations that batch reference deletions.
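
To illustrate the idea (this is not GitLab's implementation), the sketch below deletes many refs in a single git update-ref --stdin transaction instead of spawning one process, and taking the packed-refs lock, once per ref:

    import subprocess

    def delete_refs_batched(repo_path: str, refs: list[str]) -> None:
        """Delete many refs in one git transaction rather than one call per ref."""
        # Each line follows the update-ref --stdin format: "delete <ref>".
        commands = "".join(f"delete {ref}\n" for ref in refs)
        subprocess.run(
            ["git", "-C", repo_path, "update-ref", "--stdin"],
            input=commands, text=True, check=True,
        )

    # Hypothetical usage: remove stale temporary refs in one batch.
    # delete_refs_batched("/path/to/repo.git", ["refs/pipelines/101", "refs/pipelines/102"])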

Turn on the following feature flags to allow GitLab to batch reference deletions; one way to enable them is sketched after the list. Enabling these feature flags does not require downtime.

  • merge_request_cleanup_ref_worker_async
  • pipeline_cleanup_ref_worker_async
  • pipeline_delete_gitaly_refs_in_batches
  • merge_request_delete_gitaly_refs_in_batches
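
The sketch below enables the flags through the admin Features REST API (Feature.enable in the Rails console also works). The instance URL and token are placeholders, and this is a sketch of one approach rather than the documented procedure:

    import requests

    GITLAB_URL = "https://gitlab.example.com"  # placeholder: your instance URL
    ADMIN_TOKEN = "glpat-REDACTED"             # placeholder: admin personal access token

    FLAGS = [
        "merge_request_cleanup_ref_worker_async",
        "pipeline_cleanup_ref_worker_async",
        "pipeline_delete_gitaly_refs_in_batches",
        "merge_request_delete_gitaly_refs_in_batches",
    ]

    for flag in FLAGS:
        # POST /api/v4/features/:name sets the flag for the whole instance.
        resp = requests.post(
            f"{GITLAB_URL}/api/v4/features/{flag}",
            headers={"PRIVATE-TOKEN": ADMIN_TOKEN},
            data={"value": "true"},
            timeout=10,
        )
        resp.raise_for_status()
        print(f"{flag}: {resp.json().get('state')}")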

Epic 4220 proposes to add RefTable support in GitLab, which is considered a long-term solution.