- High availability
- GitLab components and configuration instructions
GitLab offers high availability options for organizations that require the fault tolerance and redundancy necessary to maintain high-uptime operations.
To resolve performance bottlenecks in individual GitLab components without taking on the additional complexity of a highly available architecture, consult our scaling documentation.
On this page, we present examples of self-managed instances that demonstrate how GitLab can be scaled out and made highly available. The examples progress from simple to complex as scaling or highly available components are added.
For larger setups serving 2,000 or more users, we provide reference architectures based on GitLab’s experience with GitLab.com and internal scale testing that aim to achieve the right balance of scalability and availability.
GitLab offers several options to manage availability and resiliency. The options and their trade-offs are listed below.
| Event | GitLab Feature | Recovery Point Objective (RPO) | Recovery Time Objective (RTO) | Cost |
|---|---|---|---|---|
| Availability Zone failure | “GitLab HA” | No loss | No loss | 2x Git storage, multiple nodes balanced across AZs |
| Region failure | “GitLab Disaster Recovery” | 5-10 minutes | 30 minutes | 2x primary cost |
| All failures | Backup/Restore | Last backup | Hours to Days | Cost of storing the backups |
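The trade-offs in the table above can also be kept as data, for example in a planning script. The following is a minimal sketch only: the `Option` type and the `option_for` helper are hypothetical names introduced for illustration, and the values are copied verbatim from the table rather than converted to numbers.

```python
from typing import NamedTuple


class Option(NamedTuple):
    """One row of the availability trade-off table (hypothetical helper type)."""
    event: str
    feature: str
    rpo: str
    rto: str
    cost: str


# Values are copied from the table; they are kept as strings on purpose,
# because entries such as "Hours to Days" are not precise numbers.
OPTIONS = [
    Option("Availability Zone failure", "GitLab HA", "No loss", "No loss",
           "2x Git storage, multiple nodes balanced across AZs"),
    Option("Region failure", "GitLab Disaster Recovery", "5-10 minutes",
           "30 minutes", "2x primary cost"),
    Option("All failures", "Backup/Restore", "Last backup", "Hours to Days",
           "Cost of storing the backups"),
]


def option_for(event: str) -> Option:
    """Return the table row for a failure event (case-insensitive match)."""
    for option in OPTIONS:
        if option.event.lower() == event.lower():
            return option
    raise KeyError(f"no option listed for event: {event}")


if __name__ == "__main__":
    row = option_for("Region failure")
    print(f"{row.feature}: RPO {row.rpo}, RTO {row.rto}, cost {row.cost}")
```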
Adding automatic failover for database systems enables higher uptime at the cost of an additional layer of complexity.
- For PostgreSQL, we provide repmgr for server cluster management and failover, and a combination of PgBouncer and Consul for database client cutover.
- For Redis, we use Redis Sentinel for server failover and client cutover.
You can also optionally run additional Sidekiq processes on dedicated hardware and configure individual Sidekiq processes to process specific background job queues if you need to scale out background job processing.
GitLab Geo allows you to replicate your GitLab instance to other geographical locations as a read-only, fully operational instance that can also be promoted in case of disaster.
This configuration is supported in GitLab Premium and Ultimate.
The GitLab application depends on the following components. It can also depend on several third-party services, depending on your environment setup. Both are detailed here in the order in which you would typically configure them, along with our recommendations for their use and configuration.
The following third-party services are ones a typical environment depends on. They can be provided by numerous applications or providers. Configure them first, before the GitLab components.
| Component | Description | Configuration instructions |
|---|---|---|
| Load Balancer(s) [1] | Handles load balancing for the GitLab nodes where required | Load balancer HA configuration |
| Cloud Object Storage service [2] | Recommended store for shared data objects | Cloud Object Storage configuration |
| NFS [3] [4] | Shared disk storage service. Can be used as an alternative for Gitaly or Object Storage. Required for GitLab Pages | NFS configuration |
Next are all of the components provided directly by GitLab. As mentioned earlier, they are presented in the typical order you would configure them.
| Component | Description | Configuration instructions |
|---|---|---|
| Consul [5] | Service discovery and health checks/failover | Consul HA configuration |
| PostgreSQL | Database | Database HA configuration |
| PgBouncer | Database Pool Manager | PgBouncer HA configuration |
| Redis [5] with Redis Sentinel | Key/Value store for shared data with HA watcher service | Redis HA configuration |
| Gitaly [6] [3] [4] | Recommended high-level storage for Git repository data | Gitaly HA configuration |
| Sidekiq | Asynchronous/Background jobs | Sidekiq configuration |
| GitLab application nodes [7] | (Unicorn / Puma, Workhorse) - Web requests (UI, API, Git over HTTP) | GitLab app HA/scaling configuration |
| Prometheus and Grafana | GitLab environment monitoring | Monitoring node for scaling/HA |
In some cases, components can also be combined on the same nodes to reduce complexity.
[1] Our architectures have been tested and validated with HAProxy as the load balancer. Other reputable load balancers with similar feature sets should also work, but be aware that these have not been validated.
[3] NFS can be used as an alternative for both repository data (replacing Gitaly) and object storage, but this isn’t typically recommended for performance reasons. Note, however, that it is required for GitLab Pages.
[4] We strongly recommend that any Gitaly and/or NFS nodes use SSD disks rather than HDD, with a throughput of at least 8,000 IOPS for read operations and 2,000 IOPS for write operations, because these components have heavy I/O. These IOPS values are only a starting point; over time they may need to be adjusted higher or lower depending on the scale of your environment’s workload. If you’re running the environment on a cloud provider, you may need to refer to their documentation on how to configure IOPS correctly.
[5] The recommended Redis setup differs depending on the size of the architecture. For smaller architectures (up to 5,000 users), we suggest one Redis cluster for all classes, with Redis Sentinel hosted alongside Consul. For larger architectures (10,000 users or more), we suggest running one separate Redis cluster for the Cache class and another for the Queues and Shared State classes. We also recommend running a separate Redis Sentinel cluster for each Redis cluster.
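As an illustration of the Redis recommendation above, the size thresholds can be written as a small decision function. This is a sketch under stated assumptions: `redis_layout` and the dictionary keys are hypothetical names, and because the text gives no recommendation for sizes between 5,000 and 10,000 users, the function returns `None` for that range rather than guessing.

```python
from typing import Optional


def redis_layout(users: int) -> Optional[dict]:
    """Map a user count to the Redis layout suggested in the text.

    Returns None where the text gives no recommendation.
    """
    if users < 0:
        raise ValueError("users must be non-negative")
    if users <= 5_000:
        # Smaller architectures: one Redis cluster for all classes.
        return {
            "redis_clusters": [["all classes"]],
            "sentinel": "hosted alongside Consul",
        }
    if users >= 10_000:
        # Larger architectures: Cache is split from Queues and Shared State.
        return {
            "redis_clusters": [["Cache"], ["Queues", "Shared State"]],
            "sentinel": "separate Sentinel cluster per Redis cluster",
        }
    # Assumption: sizes strictly between the two thresholds are not covered.
    return None


if __name__ == "__main__":
    for size in (2_000, 7_500, 25_000):
        print(size, redis_layout(size))
```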
[6] Gitaly node requirements are dependent on customer data, specifically the number of projects and their sizes. We recommend 2 nodes as an absolute minimum for HA environments, and at least 4 nodes should be used when supporting 50,000 or more users. We also recommend that each Gitaly node should store no more than 5TB of data and have the number of gitaly-ruby workers set to 20% of available CPUs. Additional nodes should be considered in conjunction with a review of expected data size and spread based on the recommendations above.
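The Gitaly guidance above can be turned into a rough sizing calculation. The sketch below is not an official sizing tool: the function names are hypothetical, the node count is simply the largest of the stated minimums and the 5TB-per-node limit, and rounding the 20% worker share down (with a floor of one worker) is an assumption, since the text does not say how to round.

```python
import math

MAX_TB_PER_NODE = 5            # text: no more than 5TB of data per Gitaly node
MIN_NODES_HA = 2               # text: absolute minimum for HA environments
MIN_NODES_LARGE = 4            # text: at least 4 nodes for 50,000 or more users
LARGE_USER_THRESHOLD = 50_000
RUBY_WORKER_PERCENT = 20       # text: gitaly-ruby workers at 20% of CPUs


def min_gitaly_nodes(users: int, repo_data_tb: float) -> int:
    """Smallest node count that satisfies every minimum stated in the text."""
    by_data = math.ceil(repo_data_tb / MAX_TB_PER_NODE)
    by_users = MIN_NODES_LARGE if users >= LARGE_USER_THRESHOLD else MIN_NODES_HA
    return max(by_data, by_users)


def gitaly_ruby_workers(cpus: int) -> int:
    """20% of available CPUs; rounding down with a minimum of 1 is an assumption."""
    return max(1, cpus * RUBY_WORKER_PERCENT // 100)


if __name__ == "__main__":
    print(min_gitaly_nodes(users=20_000, repo_data_tb=23))
    print(gitaly_ruby_workers(cpus=16))
```

The result is a lower bound only; the text also asks for a review of expected data size and spread before settling on a node count.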
[7] In our architectures, we run each GitLab Rails node using the Puma web server and have its number of workers set to 90% of available CPUs, along with 4 threads.
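The Puma setting described above is simple arithmetic, shown here as a sketch. The `puma_settings` helper is a hypothetical name, and rounding the 90% share down with a minimum of one worker is an assumption, because the text does not specify rounding.

```python
PUMA_WORKER_PERCENT = 90   # text: workers set to 90% of available CPUs
PUMA_THREADS = 4           # text: along with 4 threads


def puma_settings(cpus: int) -> dict:
    """Derive Puma worker and thread counts for one GitLab Rails node."""
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    # Integer arithmetic avoids floating-point surprises; rounding down
    # and the minimum of 1 worker are assumptions, not stated in the text.
    workers = max(1, cpus * PUMA_WORKER_PERCENT // 100)
    return {"workers": workers, "threads": PUMA_THREADS}


if __name__ == "__main__":
    for cpu_count in (1, 8, 16, 32):
        print(cpu_count, puma_settings(cpu_count))
```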