Using NFS with GitLab

NFS can be used as an alternative to object storage, but this isn’t typically recommended for performance reasons.

For data objects such as LFS, Uploads, Artifacts, and so on, an Object Storage service is recommended over NFS where possible because it performs better. When eliminating NFS, you need to take further steps beyond moving to Object Storage.

File system performance can impact overall GitLab performance, especially for actions that read or write to Git repositories. For steps you can use to test file system performance, see File System Performance Benchmarking.
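
As a rough illustration of the kind of test involved, the sketch below creates many small files in a directory, which loosely approximates a Git-like write pattern. The function name write_small_files and its arguments are hypothetical and not part of GitLab; the benchmarking documentation referenced above remains the procedure to follow.

```shell
# Hypothetical helper, not shipped with GitLab: create COUNT small files
# under DIR and print how many files ended up on the file system.
write_small_files() (
  dir=$1
  count=$2
  mkdir -p "$dir" || exit 1
  i=0
  while [ "$i" -lt "$count" ]; do
    printf 'test\n' > "$dir/test$i.txt" || exit 1
    i=$((i + 1))
  done
  # Count what actually landed in the directory.
  find "$dir" -type f -name 'test*.txt' | wc -l | tr -d ' '
)
```

In bash, for example, running time write_small_files /mnt/nfs/benchmark 1000 against an NFS mount and then against local disk gives a first impression of the difference. The path /mnt/nfs/benchmark is a placeholder.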

Gitaly and NFS deprecation

Starting with GitLab version 14.0, support for NFS to store Git repository data is deprecated. Technical customer support and engineering support are available for the 14.x releases. Engineering is fixing bugs and security vulnerabilities consistent with our release and maintenance policy.

Upon the release of GitLab 15.6, technical and engineering support for using NFS to store Git repository data is officially at end-of-life. No product changes or troubleshooting are provided through Engineering, Security, or Paid Support channels after the release date of 15.6, regardless of your GitLab version.

Until the release of 15.6, for customers running 14.x releases, we continue to help with Git-related tickets from customers running one or more Gitaly servers with their data stored on NFS. Examples may include:

  • Performance issues or timeouts accessing Git data
  • Commits or branches vanish
  • GitLab intermittently returns the wrong Git data (such as reporting that a repository has no branches)

Assistance is limited to activities like:

  • Verifying developers’ workflow uses features like protected branches
  • Reviewing GitLab event data from the database to advise if it looks like a force push overwrote branches
  • Verifying that NFS client mount options match our documented recommendations
  • Analyzing the GitLab Workhorse and Rails logs, and determining that 500 errors being seen in the environment are caused by slow responses from Gitaly

GitLab Support is unable to continue with the investigation if both:

  • The date of the request is on or after the release of GitLab version 15.6.
  • Support Engineers and Management determine that all reasonable non-NFS root causes have been exhausted.

If the issue is reproducible, or if it happens intermittently but regularly, GitLab Support can investigate, provided that the issue reproduces without the use of NFS. To reproduce without NFS, migrate the affected repositories to a different Gitaly shard, such as Gitaly Cluster or a standalone Gitaly VM, backed with block storage.

Why remove NFS for Git repository data

NFS is not well-suited to a workload consisting of many small files, like Git repositories. NFS does provide a number of configuration options designed to improve performance. However, over time, a number of these mount options have proven to result in inconsistencies across multiple nodes mounting the NFS volume, up to and including data loss. Addressing these inconsistencies consumes extraordinary development and support engineer time, which hampers our ability to develop Gitaly Cluster, our purpose-built solution for addressing the deficiencies of NFS in this environment.

Gitaly Cluster provides highly available Git repository storage. If this is not a requirement, single-node Gitaly backed by block storage is a suitable substitute.

Engineering support for NFS for Git repositories is deprecated, and technical support ends with the release of GitLab 15.6, as stated above. No further enhancements are planned for this feature.


Known kernel version incompatibilities

Red Hat Enterprise Linux (RHEL) and CentOS v7.7 and v7.8 ship with kernel version 3.10.0-1127, which contains a bug that causes uploads to fail to copy over NFS. The following GitLab versions include a fix to work properly with that kernel version:

If you are using that kernel version, be sure to upgrade GitLab to avoid errors.
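
To check whether a node runs the kernel version named above, a minimal sketch such as the following can be used. The helper is_affected_kernel is hypothetical. It matches the release string by prefix, which is an assumption: later errata builds with the same prefix may already contain a fix.

```shell
# Hypothetical helper: succeed if the given kernel release string (as printed
# by `uname -r`) is 3.10.0-1127 or starts with "3.10.0-1127.".
# Prefix matching is an assumption made for illustration.
is_affected_kernel() {
  case "$1" in
    3.10.0-1127|3.10.0-1127.*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_affected_kernel "$(uname -r)" && echo "upgrade GitLab before relying on NFS uploads" prints the reminder only on a matching kernel.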

Fast lookup of authorized SSH keys

The fast SSH key lookup feature can improve performance of GitLab instances even if they’re using block storage.

Fast SSH key lookup is a replacement for authorized_keys (in /var/opt/gitlab/.ssh) using the GitLab database.

NFS increases latency, so fast lookup is recommended if /var/opt/gitlab is moved to NFS.

We are investigating the use of fast lookup as the default.

Improving NFS performance with GitLab

NFS performance with GitLab can in some cases be improved with direct Git access using Rugged.

From GitLab 12.1, GitLab automatically detects if Rugged can and should be used per storage. If you previously enabled Rugged using the feature flag and you want to use automatic detection instead, you must unset the feature flag:

sudo gitlab-rake gitlab:features:unset_rugged

If the Rugged feature flag is explicitly set to either true or false, GitLab uses the value explicitly set.

From GitLab 12.7, Rugged is only automatically enabled for use with Puma if the Puma thread count is set to 1.

To use Rugged with a Puma thread count of more than 1, enable Rugged using the feature flag.

NFS server

Installing the nfs-kernel-server package allows you to share directories with the clients running the GitLab application:

sudo apt-get update
sudo apt-get install nfs-kernel-server

Required features

File locking: GitLab requires advisory file locking, which is only supported natively in NFS version 4. NFSv3 also supports locking as long as Linux Kernel 2.6.5+ is used. We recommend using version 4 and do not specifically test NFSv3.

When you define your NFS exports, we recommend you also add the following options:

  • no_root_squash - NFS normally changes the root user to nobody. This is a good security measure when NFS shares are accessed by many different users. However, in this case only GitLab uses the NFS share so it is safe. GitLab recommends the no_root_squash setting because we need to manage file permissions automatically. Without the setting you may receive errors when the Omnibus package tries to alter permissions. GitLab and other bundled components do not run as root but as non-privileged users. The recommendation for no_root_squash is to allow the Omnibus package to set ownership and permissions on files, as needed. In some cases where the no_root_squash option is not available, the root flag can achieve the same result.
  • sync - Force synchronous behavior. Default is asynchronous and under certain circumstances it could lead to data loss if a failure occurs before data has synced.
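
As a sketch of how the two options above fit into an export definition, the following helper prints one /etc/exports entry. The function name exports_entry is hypothetical, and the rw option is an assumption added for illustration; only sync and no_root_squash come from the recommendations above.

```shell
# Hypothetical helper: print an /etc/exports entry for one client.
#   $1 = directory to export, $2 = client address or network
# `rw` is an assumption added for illustration; `sync` and `no_root_squash`
# are the options recommended above.
exports_entry() {
  printf '%s %s(rw,sync,no_root_squash)\n' "$1" "$2"
}
```

For example, exports_entry /var/opt/gitlab/git-data 10.1.0.2 prints a line that could be appended to /etc/exports on the NFS server, followed by exportfs -ra to re-export. The client address 10.1.0.2 is a placeholder.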

Running Omnibus with LDAP is complex, and so is maintaining ID mapping without LDAP. In most cases, you should therefore enable numeric UIDs and GIDs (which is off by default in some cases) to simplify permission management between systems.
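
One way to do this on Linux is through the nfs4_disable_idmapping module parameters of the nfsd (server) and nfs (client) kernel modules. The sketch below assumes a kernel that exposes these parameters under /sys/module. The helper name idmapping_param is hypothetical.

```shell
# Sketch: print the sysfs parameter that controls NFSv4 ID mapping for a role.
# Writing Y to the parameter disables ID mapping, so numeric UIDs and GIDs
# are sent instead of user@domain strings.
idmapping_param() {
  case "$1" in
    server) echo /sys/module/nfsd/parameters/nfs4_disable_idmapping ;;
    client) echo /sys/module/nfs/parameters/nfs4_disable_idmapping ;;
    *) echo "usage: idmapping_param server|client" >&2; return 1 ;;
  esac
}
```

For example, on the NFS server, echo Y | sudo tee "$(idmapping_param server)" applies the setting, and the same command with client applies it on each NFS client.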

Disable NFS server delegation

We recommend that all NFS users disable the NFS server delegation feature. This is to avoid a Linux kernel bug which causes NFS clients to slow precipitously due to excessive network traffic from numerous TEST_STATEID NFS messages.

To disable NFS server delegation, do the following:

  1. On the NFS server, run:

    echo 0 > /proc/sys/fs/leases-enable
    sysctl -w fs.leases-enable=0
    
  2. Restart the NFS server process. For example, on CentOS run service nfs restart.

Note: The kernel bug may be fixed in more recent kernels. Red Hat Enterprise Linux 7 shipped a kernel update on August 6, 2019 that may also have resolved this problem. You may not need to disable NFS server delegation if you know you are using a version of the Linux kernel that has been fixed. That said, GitLab still encourages instance administrators to keep NFS server delegation disabled.
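
To confirm that the setting from the steps above is in effect, a small check like the following can be run on the NFS server. The helper delegation_disabled is hypothetical. Its optional file argument exists only so the function can be exercised without touching /proc.

```shell
# Hypothetical helper: succeed if file leases (and with them NFS server
# delegation) are disabled, based on the contents of a leases-enable file.
#   $1 = file to read (defaults to /proc/sys/fs/leases-enable)
delegation_disabled() {
  leases_file=${1:-/proc/sys/fs/leases-enable}
  [ "$(cat "$leases_file" 2>/dev/null)" = "0" ]
}
```

For example, delegation_disabled || echo "fs.leases-enable is not 0" prints a warning when the value is still enabled. A value written with sysctl -w does not persist across reboots. Adding fs.leases-enable = 0 to a file under /etc/sysctl.d/ is one common way to persist it.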

NFS client

The nfs-common package provides NFS functionality without installing server components, which aren’t needed on the application nodes.

sudo apt-get update
sudo apt-get install nfs-common

Mount options

Here is an example snippet to add to /etc/fstab:

10.1.0.1:/var/opt/gitlab/.ssh /var/opt/gitlab/.ssh nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,_netdev,lookupcache=positive 0 2
10.1.0.1:/var/opt/gitlab/gitlab-rails/uploads /var/opt/gitlab/gitlab-rails/uploads nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,_netdev,lookupcache=positive 0 2
10.1.0.1:/var/opt/gitlab/gitlab-rails/shared /var/opt/gitlab/gitlab-rails/shared nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,_netdev,lookupcache=positive 0 2
10.1.0.1:/var/opt/gitlab/gitlab-ci/builds /var/opt/gitlab/gitlab-ci/builds nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,_netdev,lookupcache=positive 0 2
10.1.0.1:/var/opt/gitlab/git-data /var/opt/gitlab/git-data nfs4 defaults,vers=4.1,hard,rsize=1048576,wsize=1048576,noatime,nofail,_netdev,lookupcache=positive 0 2

You can view information and options set for each of the mounted NFS file systems by running nfsstat -m and cat /etc/fstab.

Note there are several options that you should consider using:

  • vers=4.1 - NFS v4.1 should be used instead of v4.0 because there is a Linux NFS client bug in v4.0 that can cause significant problems due to stale data.
  • nofail - Don’t halt the boot process waiting for this mount to become available.
  • lookupcache=positive - Tells the NFS client to honor positive cache results but invalidate any negative cache results. Negative cache results cause problems with Git. Specifically, a git push can fail to register uniformly across all NFS clients. The negative cache causes the clients to ‘remember’ that the files did not exist previously.
  • hard - Use instead of soft. For further details, see the soft mount option section.
  • cto - cto is the default option, which you should use. Do not use nocto. For further details, see the nocto mount option section.
  • _netdev - Wait to mount the file system until the network is online. See also the high_availability['mountpoint'] option.
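
The options listed above can be checked mechanically. The following is a minimal sketch, not a GitLab tool: the hypothetical helper check_mount_opts takes a comma-separated option string, such as the fourth field of an /etc/fstab entry, and reports recommended options that are missing and discouraged options that are present.

```shell
# Hypothetical helper: check a comma-separated NFS mount option string
# against the recommendations above. Prints one line per problem found
# and returns non-zero if there is any.
check_mount_opts() (
  opts=",$1,"
  status=0
  # Options that should be present.
  for want in vers=4.1 hard lookupcache=positive; do
    case "$opts" in
      *",$want,"*) ;;
      *) echo "missing: $want"; status=1 ;;
    esac
  done
  # Options that should not be present.
  for bad in soft nocto; do
    case "$opts" in
      *",$bad,"*) echo "discouraged: $bad"; status=1 ;;
    esac
  done
  exit "$status"
)
```

For example, check_mount_opts 'vers=4.0,soft,lookupcache=positive' reports that vers=4.1 and hard are missing and that soft is discouraged, while the option string from the /etc/fstab snippet above produces no output.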

soft mount option

It’s recommended that you use hard in your mount options, unless you have a specific reason to use soft.

When GitLab.com used NFS, we used soft because there were times when we had NFS servers reboot and soft improved availability, but everyone’s infrastructure is different. If your NFS is provided by on-premises storage arrays with redundant controllers, for example, you shouldn’t need to worry about NFS server availability.

The NFS man page states:

“soft” timeout can cause silent data corruption in certain cases

Read the Linux man page to understand the difference, and if you do use soft, ensure that you’ve taken steps to mitigate the risks.

If you experience behavior that might have been caused by writes to disk on the NFS server not occurring, such as commits going missing, use the hard option, because (from the man page):

use the soft option only when client responsiveness is more important than data integrity

Other vendors make similar recommendations, including System Applications and Products in Data Processing (SAP) and NetApp’s knowledge base. They highlight that if the NFS client driver caches data, soft means there is no certainty whether writes by GitLab are actually on disk.

Mount points set with the option hard may not perform as well, and if the NFS server goes down, hard causes processes to hang when interacting with the mount point. Use SIGKILL (kill -9) to deal with hung processes. The intr option stopped working in the 2.6 kernel.
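
Before sending SIGKILL, it helps to find the candidate processes. Processes blocked on an unresponsive hard mount typically sit in uninterruptible sleep (state D), although other I/O can put a process in that state too. The sketch below uses a hypothetical helper, pids_in_d_state, that filters ps output read from standard input.

```shell
# Hypothetical helper: read lines of "PID STAT COMMAND" on stdin (for example
# from `ps -e -o pid= -o stat= -o comm=`) and print the PIDs of processes
# whose state starts with D (uninterruptible sleep).
pids_in_d_state() {
  awk '$2 ~ /^D/ { print $1 }'
}
```

For example, ps -e -o pid= -o stat= -o comm= | pids_in_d_state lists the PIDs to inspect. Confirm that a process is actually stuck on the NFS mount before running kill -9 on it.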

nocto mount option

Do not use nocto. Instead, use cto, which is the default.

When using nocto, the dentry cache is always used, up to acdirmax seconds (attribute cache time) from the time it’s created.

This results in stale dentry cache issues with multiple clients,