Troubleshooting the GitLab chart

UPGRADE FAILED: Job failed: BackoffLimitExceeded

If you received this error when upgrading to the 6.0 version of the chart, then it’s probably because you didn’t follow the right upgrade path, as you first need to upgrade to the latest 5.10.x version:

  1. List all your releases to identify your GitLab Helm release name (you will need to include -n <namespace> if your release was not deployed to the default K8s namespace):

     helm ls
    
  2. Assuming that your GitLab Helm release is called gitlab, look at the release history and identify the last successful revision (you can see the status of a revision under DESCRIPTION):

     helm history gitlab
    
  3. Assuming your most recent successful revision is 1, use this command to roll back:

    helm rollback gitlab 1
    
  4. Re-run the upgrade command, replacing <x> with the appropriate chart version:

    helm upgrade gitlab gitlab/gitlab --version=5.10.<x> <other_options>
    
  5. At this point you can use the --version option to pass a specific 6.x.x chart version or remove the option for upgrading to the latest version of GitLab:

    helm upgrade --install gitlab gitlab/gitlab <other_options>
    

More information about command line arguments can be found in our Deploy using Helm section. For mappings between chart versions and GitLab versions, read GitLab version mappings.

UPGRADE FAILED: “$name” has no deployed releases

This error occurs on your second install/upgrade if your initial install failed.

If your initial install completely failed, and GitLab was never operational, you should first purge the failed install before installing again.

helm uninstall <release-name>

If instead, the initial install command timed out, but GitLab still came up successfully, you can add the --force flag to the helm upgrade command to ignore the error and attempt to update the release.

Otherwise, if you received this error after having previously had successful deploys of the GitLab chart, then you are encountering a bug. Please open an issue on our issue tracker, and also check out issue #630 where we recovered our CI server from this problem.

Error: this command needs 2 arguments: release name, chart path

An error like this could occur when you run helm upgrade and there are some spaces in the parameters. In the following example, Test Username is the culprit:

helm upgrade gitlab gitlab/gitlab --timeout 600s --set global.email.display_name=Test Username ...

To fix it, pass the parameters in single quotes:

helm upgrade gitlab gitlab/gitlab --timeout 600s --set global.email.display_name='Test Username' ...
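
The split happens in the shell before Helm runs: unquoted whitespace separates arguments, so Username arrives as an extra positional argument. The sketch below illustrates this with a hypothetical count_args function, which stands in for helm and only reports how many arguments it received:

```shell
# count_args is a hypothetical stand-in for helm: it prints the number of
# arguments the shell passed to it.
count_args() {
  echo "$#"
}

# Unquoted: the shell splits the value at the space, so "Username" becomes
# a separate argument.
count_args --set global.email.display_name=Test Username      # prints 3

# Quoted: the whole value stays attached to the --set option.
count_args --set global.email.display_name='Test Username'    # prints 2
```

Double quotes work the same way here. Single quotes additionally stop the shell from expanding characters such as $ inside the value.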

Application containers constantly initializing

If you see Sidekiq, Webservice, or other Rails-based containers in a constant state of Initializing, you’re likely waiting on the dependencies container to pass.

If you check the logs of a given Pod specifically for the dependencies container, you may see the following repeated:

Checking database connection and schema version
WARNING: This version of GitLab depends on gitlab-shell 8.7.1, ...
Database Schema
Current version: 0
Codebase version: 20190301182457

This indicates that the migrations Job has not yet completed. The purpose of this Job is to ensure both that the database is seeded and that all relevant migrations are in place. The application containers wait for the database to be at or above their expected schema version, so that the application does not malfunction due to the schema not matching the expectations of the codebase.

  1. Find the migrations Job. kubectl get job -lapp=migrations
  2. Find the Pod being run by the Job. kubectl get pod -ljob-name=<job-name>
  3. Examine the output, checking the STATUS column.

If the STATUS is Running, continue. If the STATUS is Completed, the application containers should start shortly after the next check passes.

Examine the logs from this pod. kubectl logs <pod-name>

Any failures during the run of this job should be addressed. These will block the use of the application until resolved. Possible problems are:

  • Unreachable or failed authentication to the configured PostgreSQL database
  • Unreachable or failed authentication to the configured Redis services
  • Failure to reach a Gitaly instance
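
The lookup steps above can be sketched as a single shell function. The name migrations_status is hypothetical. The sketch assumes kubectl already points at the namespace where GitLab is installed, and that the most recently created Job labelled app=migrations is the one of interest:

```shell
# Hypothetical helper: show the pods and recent logs of the newest migrations Job.
# Assumes the current kubectl context and namespace are those of the GitLab release.
migrations_status() {
  # List Jobs oldest-first and keep the last one, in the form "job.batch/<name>".
  mig_job=$(kubectl get job -lapp=migrations \
    --sort-by=.metadata.creationTimestamp -o name | tail -n 1)
  if [ -z "$mig_job" ]; then
    echo "no migrations Job found" >&2
    return 1
  fi
  # Strip the resource prefix to get the bare Job name.
  mig_job=${mig_job##*/}
  # The STATUS column shows whether the Pod is Running, Completed, or failing.
  kubectl get pod -ljob-name="$mig_job"
  # Print the last log lines to look for connection or authentication errors.
  kubectl logs -ljob-name="$mig_job" --tail=50
}
```

Call it as migrations_status, adding -n <namespace> to the kubectl commands inside if your release is not in the current namespace.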

Applying configuration changes

The following command will perform the necessary operations to apply any updates made to gitlab.yaml:

helm upgrade <release name> <chart path> -f gitlab.yaml

Included GitLab Runner failing to register

This can happen when the runner registration token has been changed in GitLab. This often happens after you have restored a backup.

  1. Find the new shared runner token located on the admin/runners webpage of your GitLab installation.
  2. Find the name of the existing runner token Secret stored in Kubernetes:

    kubectl get secrets | grep gitlab-runner-secret
    
  3. Delete the existing secret

    kubectl delete secret <runner-secret-name>
    
  4. Create the new secret with two keys (runner-registration-token with your shared token, and an empty runner-token):

    kubectl create secret generic <runner-secret-name> --from-literal=runner-registration-token=<new-shared-runner-token> --from-literal=runner-token=""
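
Steps 3 and 4 can be combined into one shell function, sketched below. The name reset_runner_secret is hypothetical, and the sketch assumes kubectl targets the namespace of the GitLab release:

```shell
# Hypothetical helper: replace the runner Secret with one holding a new
# registration token and an empty runner-token.
# Usage: reset_runner_secret <runner-secret-name> <new-shared-runner-token>
reset_runner_secret() {
  if [ "$#" -ne 2 ]; then
    echo "usage: reset_runner_secret <secret-name> <token>" >&2
    return 2
  fi
  # --ignore-not-found keeps the function usable if the Secret is already gone.
  kubectl delete secret "$1" --ignore-not-found || return 1
  kubectl create secret generic "$1" \
    --from-literal=runner-registration-token="$2" \
    --from-literal=runner-token=""
}
```

Passing the token as an argument leaves it in your shell history, so consider reading it from a file or variable instead.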
    

Too many redirects

This can happen when you have TLS termination before the NGINX Ingress, and the tls-secrets are specified in the configuration.

  1. Update your values to set global.ingress.annotations."nginx.ingress.kubernetes.io/ssl-redirect": "false"

    Via a values file:

    # values.yaml
    global:
      ingress:
        annotations:
          "nginx.ingress.kubernetes.io/ssl-redirect": "false"
    

    Via the Helm CLI:

    helm ... --set-string global.ingress.annotations."nginx.ingress.kubernetes.io/ssl-redirect"=false
    
  2. Apply the change.

Note: When using an external service for SSL termination, that service is responsible for redirecting to https (if so desired).

Upgrades fail with Immutable Field Error

spec.clusterIP

Prior to the 3.0.0 release of these charts, the spec.clusterIP property had been populated into several Services despite having no actual value (""). This was a bug, and causes problems with Helm 3’s three-way merge of properties.

Once the chart was deployed with Helm 3, there would be no possible upgrade path unless one collected the clusterIP properties from the various Services and populated those into the values provided to Helm, or the affected services are removed from Kubernetes.

The 3.0.0 release of this chart corrected this error, but it requires manual correction.

This can be solved by simply removing all of the affected services.

  1. Remove all affected services:

    kubectl delete services -lrelease=RELEASE_NAME
    
  2. Perform an upgrade via Helm.
  3. Future upgrades will not face this error.

Note: This will change any dynamic value for the LoadBalancer for NGINX Ingress from this chart, if in use. See global Ingress settings documentation for more details regarding externalIP. You may be required to update DNS records!

spec.selector

Sidekiq pods did not receive a unique selector prior to chart release 3.0.0. The problems with this were documented in.

Upgrades to 3.0.0 using Helm will automatically delete the old Sidekiq deployments and create new ones by appending -v1 to the name of the Sidekiq Deployments, HPAs, and Pods.

Starting from 5.5.0, Helm will delete old Sidekiq deployments from prior versions and will use the -v2 suffix for Pods, Deployments, and HPAs.

If you continue to run into this error on the Sidekiq deployment when installing 3.0.0, resolve these with the following steps:

  1. Remove the Sidekiq Deployments:

    kubectl delete deployment --cascade -lrelease=RELEASE_NAME,app=sidekiq
    
  2. Perform an upgrade via Helm.

cannot patch “RELEASE-NAME-cert-manager” with kind Deployment

Upgrading from CertManager version 0.10 introduced a number of breaking changes. The old Custom Resource Definitions must be uninstalled and removed from Helm’s tracking and then re-installed.

The Helm chart attempts to do this by default but if you encounter this error you may need to take manual action.

If this error message was encountered, then upgrading requires one more step than normal in order to ensure the new Custom Resource Definitions are actually applied to the deployment.

  1. Remove the old CertManager Deployment.

     kubectl delete deployments -l app=cert-manager --cascade
    
  2. Run the upgrade again. This time, install the new Custom Resource Definitions:

     helm upgrade --install --values - YOUR-RELEASE-NAME gitlab/gitlab < <(helm get values YOUR-RELEASE-NAME)
    

cannot patch gitlab-kube-state-metrics with kind Deployment

Upgrading from Prometheus version 11.16.9 to 15.0.4 changes the selector labels used on the kube-state-metrics Deployment, which is disabled by default (prometheus.kubeStateMetrics.enabled=false).

If this error message is encountered, meaning prometheus.kubeStateMetrics.enabled=true, then upgrading requires an additional step:

  1. Remove the old kube-state-metrics Deployment.

    kubectl delete deployments.apps -l app.kubernetes.io/instance=RELEASE_NAME,app.kubernetes.io/name=kube-state-metrics --cascade=orphan
    
  2. Perform an upgrade via Helm.

ImagePullBackOff, Failed to pull image and manifest unknown errors

If you are using global.gitlabVersion, start by removing that property. Check the version mappings between the chart and GitLab and specify a compatible version of the gitlab/gitlab chart in your helm command.

UPGRADE FAILED: “cannot patch …” after helm 2to3 convert

This is a known issue. After migrating a Helm 2 release to Helm 3, the subsequent upgrades may fail. You can find the full explanation and workaround in Migrating from Helm v2 to Helm v3.

UPGRADE FAILED: type mismatch on mailroom: %!t(<nil>)

An error like this can happen if you do not provide a valid map for a key that expects a map.

For example, the configuration below will cause this error:

gitlab:
  mailroom:

To fix this, either:

  1. Provide a valid map for gitlab.mailroom.
  2. Remove the mailroom key entirely.

Note that for optional keys, an empty map ({}) is a valid value.

Restoration failure: ERROR: cannot drop view pg_stat_statements because extension pg_stat_statements requires it

You may face this error when restoring a backup on your Helm chart instance. Use the following steps as a workaround:

  1. Inside your toolbox pod open the DB console:

    /srv/gitlab/bin/rails dbconsole -p
    
  2. Drop the extension:

    DROP EXTENSION pg_stat_statements;
    
  3. Perform the restoration process.
  4. After the restoration is complete, re-create the extension in the DB console:

    CREATE EXTENSION pg_stat_statements;
    

If you encounter the same issue with the pg_buffercache extension, follow the same steps above to drop and re-create it.

You can find more details about this error in issue #2469.

Bundled PostgreSQL pod fails to start: database files are incompatible with server

The following error message may appear in the bundled PostgreSQL pod after upgrading to a new version of the GitLab Helm chart:

gitlab-postgresql FATAL:  database files are incompatible with server
gitlab-postgresql DETAIL:  The data directory was initialized by PostgreSQL version 11, which is not compatible with this version 12.7.

To address this, perform a Helm rollback to the previous version of the chart and then follow the steps in the upgrade guide to upgrade the bundled PostgreSQL version. Once PostgreSQL is properly upgraded, try the GitLab Helm chart upgrade again.

Bundled NGINX Ingress pod fails to start: Failed to watch *v1beta1.Ingress

The following error message may appear in the bundled NGINX Ingress controller pod if running Kubernetes version 1.22 or later:

Failed to watch *v1beta1.Ingress: failed to list *v1beta1.Ingress: the server could not find the requested resource

To address this, ensure the Kubernetes version is 1.21 or older. See #2852 for more information regarding NGINX Ingress support for Kubernetes 1.22 or later.

Increased load on /api/v4/jobs/request endpoint

You may face this issue if the option workhorse.keywatcher was set to false for the deployment servicing /api/*. Use the following steps to verify:

  1. Access the container gitlab-workhorse in the pod serving /api/*:

    kubectl exec -it --container=gitlab-workhorse <gitlab_api_pod> -- /bin/bash
    
  2. Inspect the file /srv/gitlab/config/workhorse-config.toml. The [redis] configuration might be missing:

    grep '\[redis\]' /srv/gitlab/config/workhorse-config.toml
    

If the [redis] configuration is not present, the workhorse.keywatcher flag was set to false during deployment thus causing the extra load in the /api/v4/jobs/request endpoint. To fix this, enable the keywatcher in the webservice chart:

workhorse:
  keywatcher: true
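
The verification steps above can also be run non-interactively. This is a minimal sketch: the function name workhorse_has_redis is hypothetical, and it relies on kubectl exec returning the exit status of the command it runs in the container:

```shell
# Hypothetical helper: succeed if the Workhorse config in the given pod has a
# [redis] section. Usage: workhorse_has_redis <gitlab_api_pod>
workhorse_has_redis() {
  kubectl exec "$1" --container=gitlab-workhorse -- \
    grep -q '\[redis\]' /srv/gitlab/config/workhorse-config.toml
}

# Example (the pod name is a placeholder):
#   workhorse_has_redis <gitlab_api_pod> || echo "keywatcher appears to be disabled"
```

A non-zero status can also mean the pod or container was not found, so check any error output before concluding that the keywatcher is disabled.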

Git over SSH: the remote end hung up unexpectedly

Git operations over SSH might fail intermittently with the following error:

fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

There are a number of potential causes for this error:

  • Network timeouts:

    Git clients sometimes open a connection and leave it idling, like when compressing objects. Settings like timeout client in HAProxy might cause these idle connections to be terminated.

    In GitLab 14.0 (chart version 5.0) and later, you can set a keepalive in sshd:

    gitlab:
      gitlab-shell:
        config:
          clientAliveInterval: 15
    
  • gitlab-shell memory:

    By default, the chart does not set a limit on GitLab Shell memory. If gitlab.gitlab-shell.resources.limits.memory is set too low, Git operations over SSH may fail with these errors.

    Run kubectl describe nodes to confirm that this is caused by memory limits rather than timeouts over the network.

    System OOM encountered, victim process: gitlab-shell
    Memory cgroup out of memory: Killed process 3141592 (gitlab-shell)
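
To avoid reading the full node description by eye, the output can be filtered for messages like the ones above. This is a sketch only: the function name gitlab_shell_oom_lines is hypothetical, and the patterns are taken from the sample messages, so other kernels or Kubernetes versions may word them differently:

```shell
# Hypothetical helper: print node description lines that report an
# out-of-memory kill involving gitlab-shell.
gitlab_shell_oom_lines() {
  kubectl describe nodes | grep -i -E 'oom|out of memory' | grep 'gitlab-shell'
}
```

No output suggests the failures are more likely caused by network timeouts than by the memory limit.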
    

YAML configuration: mapping values are not allowed in this context

The following error message may appear when YAML configuration contains leading spaces:

template: /var/opt/gitlab/templates/workhorse-config.toml.tpl:16:98:
  executing \"/var/opt/gitlab/templates/workhorse-config.toml.tpl\" at <data.YAML>:
    error calling YAML:
      yaml: line 2: mapping values are not allowed in this context

To address this, ensure that there are no leading spaces in configuration.

For example, change this:

  key1: value1
  key2: value2

… to this:

key1: value1
key2: value2

This change ensures that the configuration can be populated correctly by gomplate, which was added in GitLab 14.5 (chart version 5.5.0) via MR 2218.
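
A quick way to spot the problem is to check whether the first non-blank line of the configuration starts with whitespace. The sketch below assumes the configuration has been saved to a file of its own, and the function name first_line_indented is hypothetical:

```shell
# Hypothetical helper: succeed if the first non-blank line of a file starts
# with a space or tab, which suggests the whole block is indented.
first_line_indented() {
  awk 'NF { if ($0 ~ /^[ \t]/) found = 1; exit }
       END { exit(found ? 0 : 1) }' "$1"
}

# Example (the file name is a placeholder):
#   first_line_indented snippet.yaml && echo "remove the leading spaces"
```

Only the first non-blank line is inspected, so legitimately nested keys further down do not trigger the check.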

TLS and certificates

If your GitLab instance needs to trust a private TLS certificate authority, GitLab might fail to handshake with other services like object storage, Elasticsearch, Jira, or Jenkins:

error: certificate verify failed (unable to get local issuer certificate)

Partial trust of certificates signed by private certificate authorities can occur if:

  • The supplied certificates are not in separate files.
  • The certificates init container doesn’t perform all the required steps.

Also, GitLab is mostly written in Ruby on Rails and Go, and each language’s TLS libraries work differently. This difference can result in issues like job logs failing to render in the GitLab UI but raw job logs downloading without issue.

Additionally, depending on the proxy_download configuration, your browser is redirected to the object storage with no issues if the trust store is correctly configured. At the same time, TLS handshakes by one or more GitLab components could still fail.

Certificate trust setup and troubleshooting

As part of troubleshooting certificate issues, be sure to:

  • Create secrets for each certificate you need to trust.
  • Provide only one certificate per file.

    kubectl create secret generic custom-ca --from-file=unique_name=/path/to/cert
    

    In this example, the certificate is stored using the key name unique_name.

If you supply a bundle or a chain, some GitLab components won’t work.
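
Before creating the secret, you can check that a file holds exactly one certificate. This is a minimal sketch, assuming PEM-encoded files. The function name count_pem_certs is hypothetical:

```shell
# Hypothetical helper: print how many PEM certificates a file contains.
count_pem_certs() {
  # -e is needed because the pattern itself starts with dashes.
  grep -c -e '-----BEGIN CERTIFICATE-----' "$1"
}

# Example (the path is a placeholder):
#   count_pem_certs /path/to/cert
# Any result other than 1 means the file is a bundle or chain, or is not a
# PEM certificate at all.
```

If the count is greater than 1, split the file so that each certificate gets its own key in the secret.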

Query secrets with kubectl get secrets and kubectl describe secrets/secretname, which shows the key name for the certificate under Data.

Supply additional certificates to trust using global.certificates.customCAs in the chart globals.

When a pod is deployed, an init container mounts the certificates and sets them up so the GitLab components can use them. The init container is registry.gitlab.com/gitlab-org/build/cng/alpine-certificates.

Additional certificates are mounted into the container at /usr/local/share/ca-certificates, using the secret key name as the certificate filename.

The init container runs /scripts/bundle-certificates (source). In that script, update-ca-certificates:

  1. Copies custom certificates from /usr/local/share/ca-certificates to /etc/ssl/certs.
  2. Compiles a bundle ca-certificates.crt.
  3. Generates hashes for each certificate and creates a symlink using the hash, which is required for Rails. Certificate bundles are skipped with a warning:

    WARNING: unique_name does not contain exactly one certificate or CRL: skipping
    

Troubleshoot the init container’s status and logs. For example, to view the logs for the certificates init container and check for warnings:

kubectl logs gitlab-webservice-default-pod -c certificates

Check on the Rails console

Use the toolbox pod to verify if Rails trusts the certificates you supplied.

  1. Start a Rails console (replace <namespace> with the namespace where GitLab is installed):

    kubectl exec -ti $(kubectl get pod -n <namespace> -lapp=toolbox -o jsonpath='{.items[0].metadata.name}') -n <namespace> -- bash
    /srv/gitlab/bin/rails console
    
  2. Verify the location Rails checks for certificate authorities:

    OpenSSL::X509::DEFAULT_CERT_DIR
    
  3. Execute an HTTPS query in the Rails console:

    ## Configure a web server to connect to:
    uri = URI.parse("https://myservice.example.com")
    
    require 'openssl'
    require 'net/http'
    Rails.logger.level = 0
    OpenSSL.debug=1
    http = Net::HTTP.new(uri.host, uri.port)
    http.set_debug_output($stdout)
    http.use_ssl = true
    
    http.verify_mode = OpenSSL::SSL::VERIFY_PEER
    # http.verify_mode = OpenSSL::SSL::VERIFY_NONE # TLS verification disabled
    
    response = http.request(Net::HTTP::Get.new(uri.request_uri))
    

Troubleshoot the init container

Run the certificates container using Docker.

  1. Set up a directory structure and populate it with your certificates:

    mkdir -p etc/ssl/certs usr/local/share/ca-certificates
    
      # The secret name is: my-root-ca
      # The key name is: corporate_root
    
    kubectl get secret my-root-ca -ojsonpath='{.data.corporate_root}' | \
         base64 --decode > usr/local/share/ca-certificates/corporate_root
    
      # Check the certificate is correct:
    
    openssl x509 -in usr/local/share/ca-certificates/corporate_root -text -noout
    
  2. Determine the correct container version:

    kubectl get deployment -lapp=webservice -ojsonpath='{.items[0].spec.template.spec.initContainers[0].image}'
    
  3. Run the container, which prepares the etc/ssl/certs content:

    docker run -ti --rm \
         -v $(pwd)/etc/ssl/certs:/etc/ssl/certs \
         -v $(pwd)/usr/local/share/ca-certificates:/usr/local/share/ca-certificates \
         registry.gitlab.com/gitlab-org/build/cng/gitlab-base:v15.10.3
    
  4. Check your certificates have been correctly built:

    • etc/ssl/certs/corporate_root.pem should have been created.
    • There should be a hashed filename, which is a symlink to the certificate itself (such as etc/ssl/certs/1234abcd.0).
    • The file and the symbolic link should display with:

      ls -l etc/ssl/certs/ | grep corporate_root
      

      For example:

      lrwxrwxrwx   1 root root      20 Oct  7 11:34 28746b42.0 -> corporate_root.pem
      -rw-r--r--   1 root root    1948 Oct  7 11:34 corporate_root.pem
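
The symlink check can also be scripted. This is a sketch under the assumption that the link target is the bare file name, as in the listing above. The function name hashed_link_for is hypothetical:

```shell
# Hypothetical helper: print the symlink in a directory that points at the
# given certificate file name; fail if there is none.
# Usage: hashed_link_for etc/ssl/certs corporate_root.pem
hashed_link_for() {
  for link_path in "$1"/*; do
    # Keep only symlinks whose target is exactly the certificate file name.
    if [ -L "$link_path" ] && [ "$(readlink "$link_path")" = "$2" ]; then
      echo "$link_path"
      return 0
    fi
  done
  return 1
}
```

A non-zero exit status means no hashed link was created for that certificate, which matches the case where the init container skipped the file with a warning.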
      

308: Permanent Redirect causing a redirect loop

308: Permanent Redirect can happen if your Load Balancer is configured to send unencrypted traffic (HTTP) to NGINX. Because NGINX defaults to redirecting HTTP to HTTPS, you may end up in a “redirect loop”.

To fix this, enable NGINX’s use-forwarded-headers setting.

“Invalid Word” errors in the nginx-controller logs and 404 errors

After upgrading to Helm chart 6.6 or later, you might experience 404 return codes when visiting your GitLab domain or third-party domains for applications installed in your cluster, and also see “invalid word” errors in the gitlab-nginx-ingress-controller logs:

gitlab-nginx-ingress-controller-899b7d6bf-688hr controller W1116 19:03:13.162001       7 store.go:846] skipping ingress gitlab/gitlab-minio: nginx.ingress.kubernetes.io/configuration-snippet annotation contains invalid word proxy_pass
gitlab-nginx-ingress-controller-899b7d6bf-688hr controller W1116 19:03:13.465487       7 store.go:846] skipping ingress gitlab/gitlab-registry: nginx.ingress.kubernetes.io/configuration-snippet annotation contains invalid word proxy_pass
gitlab-nginx-ingress-controller-899b7d6bf-lqcks controller W1116 19:03:12.233577       6 store.go:846] skipping ingress gitlab/gitlab-kas: nginx.ingress.kubernetes.io/configuration-snippet annotation contains invalid word proxy_pass
gitlab-nginx-ingress-controller-899b7d6bf-lqcks controller W1116 19:03:12.536534       6 store.go:846] skipping ingress gitlab/gitlab-webservice-default: nginx.ingress.kubernetes.io/configuration-snippet annotation contains invalid word proxy_pass
gitlab-nginx-ingress-controller-899b7d6bf-lqcks controller W1116 19:03:12.848844       6 store.go:846] skipping ingress gitlab/gitlab-webservice-default-smartcard: nginx.ingress.kubernetes.io/configuration-snippet annotation contains invalid word proxy_pass
gitlab-nginx-ingress-controller-899b7d6bf-lqcks controller W1116 19:03:13.161640       6 store.go:846] skipping ingress gitlab/gitlab-minio: nginx.ingress.kubernetes.io/configuration-snippet annotation contains invalid word proxy_pass
gitlab-nginx-ingress-controller-899b7d6bf-lqcks controller W1116 19:03:13.465425       6 store.go:846] skipping ingress gitlab/gitlab-registry: nginx.ingress.kubernetes.io/configuration-snippet annotation contains invalid word proxy_pass

In that case, review your GitLab values and any third-party Ingress objects for the use of configuration snippets. You may need to adjust or modify the nginx-ingress.controller.config.annotation-value-word-blocklist setting.

See Annotation value word blocklist for additional details.

Volume mount takes a long time

Mounting large volumes, such as the gitaly or toolbox chart volumes, can take a long time because Kubernetes recursively changes the permissions of the volume’s contents to match the Pod’s securityContext.

Starting with Kubernetes 1.23 you can set the securityContext.fsGroupChangePolicy to OnRootMismatch to mitigate this issue. This flag is supported by all GitLab subcharts.

For example, for the Gitaly subchart:

gitlab:
  gitaly:
    securityContext:
      fsGroupChangePolicy: "OnRootMismatch"

See the Kubernetes documentation for more details.

For Kubernetes versions not supporting fsGroupChangePolicy you can mitigate the issue by changing or fully deleting the settings for the securityContext.

gitlab:
  gitaly:
    securityContext:
      fsGroup: ""
      runAsUser: ""

Note: The example syntax eliminates the securityContext setting entirely. Setting securityContext: {} or securityContext: does not work due to the way Helm merges default values with user provided configuration.

Intermittent 502 errors

When a request being handled by a Puma worker crosses the memory limit threshold, it is killed by the node’s OOMKiller. However, killing the request does not necessarily kill or restart the webservice pod itself. This situation causes the request to return a 502 timeout. In the logs, this appears as a Puma worker being created shortly after the 502 error is logged.

2024-01-19T14:12:08.949263522Z {"correlation_id":"XXXXXXXXXXXX","duration_ms":1261,"error":"badgateway: failed to receive response: context canceled"....
2024-01-19T14:12:24.214148186Z {"component": "gitlab","subcomponent":"puma.stdout","timestamp":"2024-01-19T14:12:24.213Z","pid":1,"message":"- Worker 2 (PID: 7414) booted in 0.84s, phase: 0"}

To solve this problem, raise memory limits for the webservice pods.