Troubleshooting GitLab chart development environment
All steps noted here are for DEVELOPMENT ENVIRONMENTS ONLY. Administrators may find the information insightful, but the outlined fixes are destructive and would have a major negative impact on production systems.
Passwords and secrets failing or unsynchronized
Developers commonly deploy, delete, and re-deploy a release into the same cluster multiple times. Kubernetes secrets and persistent volume claims created by StatefulSets are intentionally not removed by `helm delete RELEASE_NAME`.
Removing only the Kubernetes secrets leads to interesting problems. For example, a new deployment's migration pod will fail because GitLab Rails no longer has the correct database password and cannot connect to the persisted database.
To completely wipe a release from a development environment, including secrets, a developer must remove both the secrets and the persistent volume claims.
```shell
# DO NOT run these commands in a production environment. Disaster will strike.
kubectl delete secrets,pvc -lrelease=RELEASE_NAME
```
Database is broken and needs reset
To reset the database in a development environment:

- Delete the PostgreSQL StatefulSet.
- Delete the PostgreSQL PersistentVolumeClaim.
- Deploy GitLab again with `helm upgrade --install` (see the sketch below).
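A minimal sketch of those steps, assuming the default release name `gitlab` and the bundled PostgreSQL chart. The StatefulSet and PVC names are typical defaults, not guaranteed; verify them first with `kubectl get statefulsets,pvc`:

```shell
# DO NOT run these commands in a production environment.
# Resource names below are assumptions based on the default bundled PostgreSQL chart;
# confirm them first with: kubectl get statefulsets,pvc
kubectl delete statefulset gitlab-postgresql
kubectl delete pvc data-gitlab-postgresql-0

# Re-deploy the release so a fresh database is created and migrated.
helm upgrade --install gitlab gitlab/gitlab -f values.yaml
```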
Backup used for testing needs to be updated
Certain jobs in CI use a backup of GitLab during testing. Complete the steps below to update this backup when needed:
- Install the latest version of the chart that is compatible with the current backup into a development cluster.
- Restore the backup currently used in CI (see the command sketch after this list). The backup is available at `https://storage.cloud.google.com/gitlab-charts-ci/test-backups/<BACKUP_PREFIX>_gitlab_backup.tar`. The current `BACKUP_PREFIX` is defined in `.gitlab-ci.yml`.
  - If you are using the bundled MinIO with a self-signed certificate, you may want to use `awscli` instead of `s3cmd` to avoid SSL errors. To do this, first configure `awscli` inside your toolbox, and then pass `--s3tool awscli --aws-s3-endpoint-url http://gitlab-minio-svc:9000` to your backup and restore commands.
- Ensure that all background migrations complete, forcing them to finish if needed.
- Upgrade the Helm release to use the new CNG images that contain the new backup/restore changes by setting `global.gitlabVersion=<CNG tag>`.
- Create a new backup from the `toolbox` Pod.
- Download the new backup from the `gitlab-backups` bucket.
- Ask in `#g_distribution` to upload the backup to Google Cloud Storage (GCS):
  - Project: `cloud-native-182609`, path: `gitlab-charts-ci/test-backups/`
  - Edit access and add `Entity=Public`, `Name=allUsers`, and `Access=Reader`.
- Finally, update `.variables.TEST_BACKUP_PREFIX` in `.gitlab-ci.yml` and open a merge request.
  - For example: if the filename is `1708623546_2024_02_22_16.9.1-ee_gitlab_backup`, then the prefix is `1708623546_2024_02_22_16.9.1-ee`.
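A minimal command sketch of the restore, upgrade, and backup-creation steps above. The release name `gitlab`, the `gitlab-toolbox` deployment name, and the exact `backup-utility` invocation are assumptions; adjust them to match your environment:

```shell
# Open a shell in the toolbox Pod (deployment name assumed; check with `kubectl get pods`).
kubectl exec -it deploy/gitlab-toolbox -- bash

# Inside the toolbox: restore the backup currently used in CI.
# If using bundled MinIO with a self-signed certificate, also pass
# `--s3tool awscli --aws-s3-endpoint-url http://gitlab-minio-svc:9000`.
backup-utility --restore -f https://storage.cloud.google.com/gitlab-charts-ci/test-backups/<BACKUP_PREFIX>_gitlab_backup.tar

# Back on your workstation: switch to the CNG images with the new backup/restore changes.
helm upgrade gitlab gitlab/gitlab --reuse-values --set global.gitlabVersion=<CNG tag>

# From the toolbox Pod again: create the new backup, which is written to the gitlab-backups bucket.
backup-utility
```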
Future pipelines will now use the new backup artifact during testing.
CI clusters are low on available resources
You may notice one or more CI clusters running low on available resources, such as CPU and memory. Our clusters are configured to automatically scale the available nodes, but sometimes we hit the upper limit, and no more nodes can be created. In this case, a good first step is to check whether any installations of the GitLab Helm Charts in the clusters can be removed.
Installations are usually cleaned up automatically by the Review Apps logic in the pipeline, but this can fail for various reasons. See the following issues for more details:
- What can we do about cleaning up failed deploys in CI?
- https://gitlab.com/gitlab-org/charts/gitlab/-/issues/5338
As a workaround, these installations can be manually deleted by running the associated `stop_review` job(s) in CI. To make this easier, use the `helm_ci_triage.sh` script to get a list of running installations and open the associated pipeline to run the `stop_review` job(s). Further usage details are available in the script.
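If you want a quick manual look at the cluster before running the script, a hedged check (assuming you already have `kubectl`/`helm` access to the CI cluster) is to list the releases and their ages:

```shell
# List all Helm releases across namespaces, sorted by release date,
# to spot review-app installations that were never cleaned up.
helm list --all-namespaces --date

# Deletion itself should still be done by running the associated stop_review CI job,
# as described above.
```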