Troubleshooting the Kubernetes executor

The following errors are commonly encountered when using the Kubernetes executor.

Job failed (system failure): timed out waiting for pod to start

If the cluster cannot schedule the build pod before the timeout defined by poll_timeout, the build pod returns an error. The Kubernetes Scheduler should be able to delete it.

To fix this issue, increase the poll_timeout value in your config.toml file.
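
Before raising the timeout, it can help to confirm why the build pod is not starting. The following is a minimal sketch, not a definitive procedure. The helper name show_pending_pods is hypothetical, and the default namespace gitlab-runner is an assumption; pass the namespace your runner actually uses.

```shell
# Hypothetical helper: list Pending pods and recent events in a namespace.
# The default namespace "gitlab-runner" is an assumption; pass your own.
show_pending_pods() {
  ns="${1:-gitlab-runner}"
  # Pods that have not been scheduled yet or are still starting.
  kubectl get pods --namespace "$ns" --field-selector=status.phase=Pending
  # Recent events often name the cause, for example FailedScheduling.
  kubectl get events --namespace "$ns" --sort-by=.lastTimestamp
}
```

For example, show_pending_pods my-build-namespace prints the pending pods followed by the namespace events in chronological order.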

context deadline exceeded

The context deadline exceeded errors in job logs usually indicate that the Kubernetes API client hit a timeout for a given cluster API request.

Check the metrics of the kube-apiserver cluster component for any signs of:

  • Increased response latencies.
  • Error rates for common create or delete operations over pods, secrets, ConfigMaps, and other core (v1) resources.
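
If no monitoring stack is available, the raw metrics endpoint of the API server can be sampled directly. This is a sketch under the assumption that your account is allowed to read /metrics; the helper name apiserver_metrics is hypothetical. It filters the standard apiserver_request_duration_seconds and apiserver_request_total series down to pods, secrets, and ConfigMaps.

```shell
# Hypothetical helper: print kube-apiserver latency and request-count samples
# for pods, secrets, and ConfigMaps.
# Assumes permission to read the /metrics endpoint of the API server.
apiserver_metrics() {
  kubectl get --raw /metrics |
    grep -E '^apiserver_request_(duration_seconds|total)' |
    grep -E 'resource="(pods|secrets|configmaps)"'
}
```

In the apiserver_request_total samples, the code label carries the HTTP status, so a growing count of 5xx codes for POST or DELETE verbs points at the failures described here.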

Logs for timeout-driven errors from the kube-apiserver operations may appear as:

Job failed (system failure): prepare environment: context deadline exceeded
Job failed (system failure): prepare environment: setting up build pod: context deadline exceeded

In some cases, the kube-apiserver error response might provide additional details of its sub-components failing (such as the Kubernetes cluster’s etcdserver):

Job failed (system failure): prepare environment: etcdserver: request timed out
Job failed (system failure): prepare environment: etcdserver: leader changed
Job failed (system failure): prepare environment: Internal error occurred: resource quota evaluates timeout

These kube-apiserver service failures can occur during the creation of the build pod and also during cleanup attempts after completion:

Error cleaning up secrets: etcdserver: request timed out
Error cleaning up secrets: etcdserver: leader changed

Error cleaning up pod: etcdserver: request timed out, possibly due to previous leader failure
Error cleaning up pod: etcdserver: request timed out
Error cleaning up pod: context deadline exceeded

Dial tcp xxx.xx.x.x:xxx: i/o timeout

This is a Kubernetes error that generally indicates the Kubernetes API server is unreachable by the runner manager. To resolve this issue:

  • If you use network security policies, grant access to the Kubernetes API, typically on port 443 or port 6443, or both.
  • Ensure that the Kubernetes API is running.
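
A quick reachability test, run from the host or pod where the runner manager runs, can separate a network block from an API server outage. The sketch below assumes curl is installed there; the helper name check_api_reachable and the example address are hypothetical.

```shell
# Hypothetical helper: test whether a connection to the API server can be
# opened within a few seconds. The address is supplied by the caller.
check_api_reachable() {
  api_url="$1"   # for example https://203.0.113.10:6443 (placeholder address)
  # --insecure skips certificate checks because only reachability matters here.
  # --max-time bounds the wait so a dropped connection does not hang.
  if curl --silent --insecure --max-time 5 --output /dev/null "$api_url/version"; then
    echo "reachable: $api_url"
  else
    echo "unreachable: $api_url (curl exit code $?)"
    return 1
  fi
}
```

Any HTTP response, including an authorization error, counts as reachable in this sketch. A curl exit code of 28 indicates that the request timed out.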

Connection refused when attempting to communicate with the Kubernetes API

When GitLab Runner makes a request to the Kubernetes API and it fails, it is likely because kube-apiserver is overloaded and can’t accept or process API requests.

Error cleaning up pod and Job failed (system failure): prepare environment: waiting for pod running

The following errors occur when Kubernetes fails to schedule the job pod in a timely manner. GitLab Runner waits for the pod to become ready. When that wait fails, the runner tries to clean up the pod, and the cleanup can also fail.

Error: Error cleaning up pod: Delete "https://xx.xx.xx.x:443/api/v1/namespaces/gitlab-runner/runner-0001": dial tcp xx.xx.xx.x:443 connect: connection refused

Error: Job failed (system failure): prepare environment: waiting for pod running: Get "https://xx.xx.xx.x:443/api/v1/namespaces/gitlab-runner/runner-0001": dial tcp xx.xx.xx.x:443 connect: connection refused

To troubleshoot, check the Kubernetes primary node and all nodes that run a kube-apiserver instance. Ensure they have enough resources to manage the number of pods you intend to scale up to on the cluster.

To change the time GitLab Runner waits for a pod to reach its Ready status, use the poll_timeout setting.

To better understand how pods are scheduled or why they might not get scheduled on time, read about the Kubernetes Scheduler.

request did not complete within requested timeout

The message request did not complete within requested timeout observed during build pod creation indicates that a configured admission control webhook on the Kubernetes cluster is timing out.

Admission control webhooks are a cluster-level administrative control that intercepts every API request in their scope. They can cause failures if they do not respond in time.

Admission control webhooks support filters that finely control which API requests and source namespaces they intercept. If the Kubernetes API calls from GitLab Runner do not need to pass through an admission control webhook, you can either alter the webhook’s selector or filter configuration to ignore the GitLab Runner namespace, or apply exclusion labels or annotations to the GitLab Runner pod by configuring podAnnotations or podLabels in the GitLab Runner Helm Chart values.yaml.

For example, to prevent the Datadog Admission Controller webhook from intercepting API requests made by the GitLab Runner manager pod, you can add the following:

podLabels:
  admission.datadoghq.com/enabled: "false"

To list a Kubernetes cluster’s admission control webhooks, run:

kubectl get validatingwebhookconfiguration -o yaml
kubectl get mutatingwebhookconfiguration -o yaml
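
The full YAML output can be long. As a sketch, the same two resources can be summarized with custom columns that show each webhook's timeoutSeconds and failurePolicy fields. The helper name list_webhook_timeouts is hypothetical.

```shell
# Hypothetical helper: summarize each webhook's timeout and failure policy.
list_webhook_timeouts() {
  for kind in validatingwebhookconfiguration mutatingwebhookconfiguration; do
    kubectl get "$kind" -o custom-columns='CONFIG:.metadata.name,WEBHOOKS:.webhooks[*].name,TIMEOUT_SECONDS:.webhooks[*].timeoutSeconds,FAILURE_POLICY:.webhooks[*].failurePolicy'
  done
}
```

A configuration that holds several webhooks shows their values as a comma-separated list in each column.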

The following forms of logs can be observed when an admission control webhook times out:

Job failed (system failure): prepare environment: Timeout: request did not complete within requested timeout
Job failed (system failure): prepare environment: setting up credentials: Timeout: request did not complete within requested timeout

A failure from an admission control webhook may instead appear as:

Job failed (system failure): prepare environment: setting up credentials: Internal error occurred: failed calling webhook "example.webhook.service"

fatal: unable to access 'https://gitlab-ci-token:token@example.com/repo/proj.git/': Could not resolve host: example.com

If you use the alpine flavor of the helper image, you might encounter DNS issues related to the DNS resolver in Alpine’s musl libc.

Using the helper_image_flavor = "ubuntu" option should resolve this.

docker: Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?

This error can occur when using Docker-in-Docker if attempts are made to access the DIND service before it has had time to fully start up. For a more detailed explanation, see this issue.
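
One way to avoid the race is to poll the daemon before running any Docker command. The following is a minimal sketch; the helper name wait_for_docker and the default of 30 attempts are assumptions, not values taken from GitLab Runner.

```shell
# Hypothetical helper: poll the Docker daemon until it answers or a limit is hit.
wait_for_docker() {
  attempts="${1:-30}"   # maximum number of tries (assumed default)
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # "docker info" succeeds only when the client can reach the daemon.
    if docker info >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  echo "Docker daemon not reachable after $attempts attempts" >&2
  return 1
}
```

In a job script, a call such as wait_for_docker 60 placed before the first docker build or docker pull makes the job fail with a clear message instead of the connection error above.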

curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to github.com:443

This error can happen when using Docker-in-Docker if the DIND Maximum Transmission Unit (MTU) is larger than the MTU of the Kubernetes overlay network. DIND uses a default MTU of 1500, which is too large to route across the default overlay network. The DIND MTU can be changed within the service definition:

services:
  - name: docker:dind
    command: ["--mtu=1450"]

MountVolume.SetUp failed for volume "kube-api-access-xxxxx" : chown is not supported by windows

When you run your CI/CD job, you might receive an error like the following:

MountVolume.SetUp failed for volume "kube-api-access-xxxxx" : chown c:\var\lib\kubelet\pods\xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\volumes\kubernetes.io~projected\kube-api-access-xxxxx\..2022_07_07_20_52_19.102630072\token: not supported by windows

This issue occurs when you use node selectors to run builds on nodes with different operating systems and architectures.

To fix the issue, configure nodeSelector so that the runner manager pod is always scheduled on a Linux node. For example, your values.yaml file should contain the following:

nodeSelector:
  kubernetes.io/os: linux
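
To check which operating system each node reports and where the runner manager pod was scheduled, a sketch like the following can be used. The helper name show_node_os is hypothetical, and the default namespace gitlab-runner is an assumption.

```shell
# Hypothetical helper: show node OS and architecture labels, then the nodes
# that the pods in the runner namespace were scheduled on.
show_node_os() {
  ns="${1:-gitlab-runner}"   # assumed namespace; pass your own
  # -L adds a column for each listed node label.
  kubectl get nodes -L kubernetes.io/os -L kubernetes.io/arch
  # -o wide includes the NODE column for every pod.
  kubectl get pods --namespace "$ns" -o wide
}
```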

Build pods are assigned the worker node’s IAM role instead of Runner IAM role

This issue happens when the worker node IAM role does not have the permission to assume the correct role. To fix this, add the sts:AssumeRole permission to the trust relationship of the worker node’s IAM role:

{
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::<AWS_ACCOUNT_NUMBER>:role/<IAM_ROLE_NAME>"
    },
    "Action": "sts:AssumeRole"
}

Preparation failed: failed to pull image 'image-name:latest': pull_policy ([Always]) defined in GitLab pipeline config is not one of the allowed_pull_policies ([])

This issue happens if you specified a pull_policy in your .gitlab-ci.yml but there is no policy configured in the Runner’s config file. To fix this, add allowed_pull_policies to your config according to Restrict Docker pull policies.

Background processes cause jobs to hang and time out

Background processes started during job execution can prevent the build job from exiting. To avoid this you can:

  • Double fork the process. For example, command_to_run < /dev/null &> /dev/null &.
  • Kill the process before exiting the job script.
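
The two options above can be combined. The sketch below detaches a command from the job's input and output streams and stops it before the script ends. The helper names start_background and stop_background are hypothetical.

```shell
# Hypothetical helpers: start a command detached from the job's streams,
# then stop it before the job script ends.
start_background() {
  # Redirect stdin, stdout, and stderr so the process does not hold the
  # job's streams open.
  "$@" < /dev/null > /dev/null 2>&1 &
  bg_pid=$!
}

stop_background() {
  # Ignore errors if the process has already exited.
  kill "$bg_pid" 2>/dev/null || true
  wait "$bg_pid" 2>/dev/null || true
}
```

For example, call start_background my_server --port 8080 (a placeholder command) early in the script and stop_background as the last step, or register it with trap stop_background EXIT so that it also runs when the script fails.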

Permission-related errors when creating the cache

Files and folders that are generated in your job have certain UNIX ownerships and permissions. When your files and folders are archived or extracted, these UNIX details are retained. However, they might not match the USER configuration of the helper image.

If you encounter permission-related errors in the Creating cache ... step, you can:

  • As a solution, investigate whether the source data is modified, for example in the job script that creates the cached files.
  • As a workaround, add matching chown and chmod commands to your (before_/after_)script: directives.

Apparently redundant shell process in build container with init system

The process tree might include a shell process when either:

  • FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY is false and FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR is true.
  • The ENTRYPOINT of the build image is an init system (like tini-init or dumb-init).

UID    PID   PPID  C STIME TTY          TIME CMD
root     1      0  0 21:58 ?        00:00:00 /scripts-37474587-5556589047/dumb-init -- sh -c if [ -x /usr/local/bin/bash ]; then .exec /usr/local/bin/bash  elif [ -x /usr/bin/bash ]; then .exec /usr/bin/bash  elif [ -x /bin/bash ]; then .exec /bin/bash  elif [ -x /usr/local/bin/sh ]; then .exec /usr/local/bin/sh  elif [ -x /usr/bin/sh ]; then .exec /usr/bin/sh  elif [ -x /bin/sh ]; then .exec /bin/sh  elif [ -x /busybox/sh ]; then .exec /busybox/sh  else .echo shell not found .exit 1 fi
root     7      1  0 21:58 ?        00:00:00 /usr/bin/bash <---------------- WHAT IS THIS???
root    26      1  0 21:58 ?        00:00:00 sh -c (/scripts-37474587-5556589047/detect_shell_script /scripts-37474587-5556589047/step_script 2>&1 | tee -a /logs-37474587-5556589047/output.log) &
root    27     26  0 21:58 ?        00:00:00  \_ /usr/bin/bash /scripts-37474587-5556589047/step_script
root    32     27  0 21:58 ?        00:00:00  |   \_ /usr/bin/bash /scripts-37474587-5556589047/step_script
root    37     32  0 21:58 ?        00:00:00  |       \_ ps -ef --forest
root    28     26  0 21:58 ?        00:00:00  \_ tee -a /logs-37474587-5556589047/output.log

This shell process, which might be sh, bash or busybox, with a PPID of 1 and a PID of 6 or 7, is the shell started by the shell detection script run by the init system (PID 1 above). The process is not redundant, and is the typical operation when the build container runs with an init system.

Runner pod fails to run job results and times out despite successful registration

After the runner pod registers with GitLab, it attempts to run a job but fails to do so, and the job eventually times out. The following errors are reported:

There has been a timeout failure or the job got stuck. Check your timeout limits or try again.

This job does not have a trace.

In this case, the runner might receive an HTTP 204 No content response code when connecting to the jobs/request API.

To troubleshoot this issue, manually send a POST request to the API to validate if the TCP connection is hanging. If the TCP connection is hanging, the runner might not be able to request CI job payloads.
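
The manual request can be sketched with curl as follows. The helper name request_job, the example URL, and the 30-second limit are assumptions; the token is the runner authentication token, supplied by you. Note that a 201 response likely means a pending job was handed to this request rather than to the runner, so prefer running this check when no jobs are queued.

```shell
# Hypothetical helper: send a time-bounded POST request to the jobs/request
# endpoint and print only the HTTP status code.
request_job() {
  gitlab_url="$1"     # for example https://gitlab.example.com (placeholder)
  runner_token="$2"   # runner authentication token (placeholder)
  # --max-time makes a hanging TCP connection fail instead of waiting forever.
  curl --silent --show-error --max-time 30 \
    --request POST \
    --header "Content-Type: application/json" \
    --data "{\"token\": \"$runner_token\"}" \
    --output /dev/null --write-out '%{http_code}\n' \
    "$gitlab_url/api/v4/jobs/request"
}
```

If the command prints 204 promptly, the connection works and no job was available. If it stops after 30 seconds with a timeout error, the TCP connection is hanging.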

failed to reserve container name for init-permissions container when gcs-fuse-csi-driver is used

The gcs-fuse-csi-driver CSI driver does not support mounting volumes for the init container. This can cause failures when starting the init container with this driver. Features introduced in Kubernetes 1.28 must be supported in the driver’s project to resolve this bug.