Troubleshooting the Kubernetes executor
The following errors are commonly encountered when using the Kubernetes executor.
Job failed (system failure): timed out waiting for pod to start
If the cluster cannot schedule the build pod before the timeout defined by poll_timeout, the build pod returns an error. The Kubernetes Scheduler should be able to delete it.
To fix this issue, increase the poll_timeout value in your config.toml file.
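As a minimal sketch, assuming the runner uses the Kubernetes executor, the setting goes in the [runners.kubernetes] section of config.toml. The value 600 is an arbitrary illustration, not a recommendation; pick a value that reflects how long your cluster usually needs to schedule a pod.

```toml
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Time, in seconds, that GitLab Runner waits for the build pod
    # before the job fails with a timeout.
    # 600 is an illustrative value (assumption); tune it for your cluster.
    poll_timeout = 600
```

A larger value only hides scheduling delays, so it is still worth checking why the cluster is slow to place pods.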
context deadline exceeded
The context deadline exceeded errors in job logs usually indicate that the Kubernetes API client hit a timeout for a given cluster API request.
Check the metrics of the kube-apiserver cluster component for any signs of:
- Increased response latencies.
- Error rates for common create or delete operations over pods, secrets, ConfigMaps, and other core (v1) resources.
Logs for timeout-driven errors from the kube-apiserver operations may appear as:
Job failed (system failure): prepare environment: context deadline exceeded
Job failed (system failure): prepare environment: setting up build pod: context deadline exceeded
In some cases, the kube-apiserver error response might provide additional details of its sub-components failing (such as the Kubernetes cluster’s etcdserver):
Job failed (system failure): prepare environment: etcdserver: request timed out
Job failed (system failure): prepare environment: etcdserver: leader changed
Job failed (system failure): prepare environment: Internal error occurred: resource quota evaluates timeout
These kube-apiserver service failures can occur during the creation of the build pod and also during cleanup attempts after completion:
Error cleaning up secrets: etcdserver: request timed out
Error cleaning up secrets: etcdserver: leader changed
Error cleaning up pod: etcdserver: request timed out, possibly due to previous leader failure
Error cleaning up pod: etcdserver: request timed out
Error cleaning up pod: context deadline exceeded
Dial tcp xxx.xx.x.x:xxx: i/o timeout
This is a Kubernetes error that generally indicates the Kubernetes API server is unreachable by the runner manager. To resolve this issue:
- If you use network security policies, grant access to the Kubernetes API, typically on port 443 or port 6443, or both.
- Ensure that the Kubernetes API is running.
Connection refused when attempting to communicate with the Kubernetes API
When GitLab Runner makes a request to the Kubernetes API and it fails, it is likely because kube-apiserver is overloaded and can’t accept or process API requests.
Error cleaning up pod and Job failed (system failure): prepare environment: waiting for pod running
The following errors occur when Kubernetes fails to schedule the job pod in a timely manner. GitLab Runner waits for the pod to be ready, but it fails and then tries to clean up the pod, which can also fail.
Error: Error cleaning up pod: Delete "https://xx.xx.xx.x:443/api/v1/namespaces/gitlab-runner/runner-0001": dial tcp xx.xx.xx.x:443 connect: connection refused
Error: Job failed (system failure): prepare environment: waiting for pod running: Get "https://xx.xx.xx.x:443/api/v1/namespaces/gitlab-runner/runner-0001": dial tcp xx.xx.xx.x:443 connect: connection refused
To troubleshoot, check the Kubernetes primary node and all nodes that run a kube-apiserver instance. Ensure they have all of the resources needed to manage the target number of pods that you hope to scale up to on the cluster.
To change the time GitLab Runner waits for a pod to reach its Ready status, use the poll_timeout setting.
To better understand how pods are scheduled or why they might not get scheduled on time, read about the Kubernetes Scheduler.
request did not complete within requested timeout
The message request did not complete within requested timeout observed during build pod creation indicates that a configured admission control webhook on the Kubernetes cluster is timing out.
Admission control webhooks are cluster-level administrative controls that intercept all API requests they’re scoped for, and they can cause failures if they do not execute in time.
Admission control webhooks support filters that finely control which API requests and namespace sources they intercept. If the Kubernetes API calls from GitLab Runner do not need to pass through an admission control webhook, you can either alter the webhook’s selector or filter configuration to ignore the GitLab Runner namespace, or apply exclusion labels or annotations to the GitLab Runner pod by configuring podAnnotations or podLabels in the GitLab Runner Helm Chart values.yaml.
For example, to prevent the DataDog Admission Controller webhook from intercepting API requests made by the GitLab Runner manager pod, add the following:
podLabels:
  admission.datadoghq.com/enabled: false
To list a Kubernetes cluster’s admission control webhooks, run:
kubectl get validatingwebhookconfiguration -o yaml
kubectl get mutatingwebhookconfiguration -o yaml
The following forms of logs can be observed when an admission control webhook times out:
Job failed (system failure): prepare environment: Timeout: request did not complete within requested timeout
Job failed (system failure): prepare environment: setting up credentials: Timeout: request did not complete within requested timeout
A failure from an admission control webhook may instead appear as:
Job failed (system failure): prepare environment: setting up credentials: Internal error occurred: failed calling webhook "example.webhook.service"
fatal: unable to access 'https://gitlab-ci-token:token@example.com/repo/proj.git/': Could not resolve host: example.com
If you use the alpine flavor of the helper image, there can be DNS issues related to the DNS resolver in Alpine’s musl.
Using the helper_image_flavor = "ubuntu" option should resolve this.
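A minimal sketch of where this option sits in config.toml, assuming the runner uses the Kubernetes executor and the option is set in the [runners.kubernetes] section:

```toml
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Use the Ubuntu-based helper image instead of the Alpine-based one,
    # so the helper container does not rely on the musl DNS resolver.
    helper_image_flavor = "ubuntu"
```

This changes only the helper image. If the build image itself is Alpine-based, DNS lookups made by the job script still go through musl.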
docker: Cannot connect to the Docker daemon at tcp://docker:2375. Is the docker daemon running?
This error can occur when using Docker-in-Docker if attempts are made to access the DIND service before it has had time to fully start up. For a more detailed explanation, see this issue.
curl: (35) OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to github.com:443
This error can happen when using Docker-in-Docker if the DIND Maximum Transmission Unit (MTU) is larger than the Kubernetes overlay network. DIND uses a default MTU of 1500, which is too large to route across the default overlay network. The DIND MTU can be changed within the service definition:
services:
  - name: docker:dind
    command: ["--mtu=1450"]
MountVolume.SetUp failed for volume "kube-api-access-xxxxx" : chown is not supported by windows
When you run your CI/CD job, you might receive an error like the following:
MountVolume.SetUp failed for volume "kube-api-access-xxxxx" : chown c:\var\lib\kubelet\pods\xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx\volumes\kubernetes.io~projected\kube-api-access-xxxxx\..2022_07_07_20_52_19.102630072\token: not supported by windows
This issue occurs when you use node selectors to run builds on nodes with different operating systems and architectures.
To fix the issue, configure nodeSelector so that the runner manager pod is always scheduled on a Linux node. For example, your values.yaml file should contain the following:
nodeSelector:
  kubernetes.io/os: linux
Build pods are assigned the worker node’s IAM role instead of Runner IAM role
This issue happens when the worker node IAM role does not have the permission to assume the correct role. To fix this, add the sts:AssumeRole permission to the trust relationship of the worker node’s IAM role:
{
    "Effect": "Allow",
    "Principal": {
        "AWS": "arn:aws:iam::<AWS_ACCOUNT_NUMBER>:role/<IAM_ROLE_NAME>"
    },
    "Action": "sts:AssumeRole"
}
Preparation failed: invalid pull policy for image 'image-name:latest': pull_policy ([Always]) defined in GitLab pipeline config is not one of the allowed_pull_policies ([])
This issue happens if you specified a pull_policy in your .gitlab-ci.yml but there is no policy configured in the Runner’s config file. To fix this, add allowed_pull_policies to your config according to Restrict Docker pull policies.
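A minimal sketch of such a configuration, assuming the Kubernetes executor and the [runners.kubernetes] section of config.toml. The list of policies is illustrative; include only the policies you want jobs to be able to request.

```toml
[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Pull policies that a job may request with pull_policy in .gitlab-ci.yml.
    # The chosen values are an assumption for illustration.
    allowed_pull_policies = ["always", "if-not-present"]
```

With this in place, a job that sets pull_policy: always passes validation, while a job that requests a policy missing from the list is still rejected.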
Background processes cause jobs to hang and timeout
Background processes started during job execution can prevent the build job from exiting. To avoid this you can:
- Double fork the process. For example, command_to_run < /dev/null &> /dev/null &.
- Kill the process before exiting the job script.
Cache-related permission denied errors
Files and folders that are generated in your job have certain UNIX ownerships and permissions.
When your files and folders are archived or extracted, UNIX details are retained.
However, the files and folders might not match the USER configurations of helper images.
If you encounter permission-related errors in the Creating cache ... step, you can:
- As a solution, investigate whether the source data is modified, for example in the job script that creates the cached files.
- As a workaround, add matching chown and chmod commands to your (before_/after_)script: directives.
Apparently redundant shell process in build container with init system
The process tree might include a shell process when either:
- FF_USE_LEGACY_KUBERNETES_EXECUTION_STRATEGY is false and FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR is true.
- The ENTRYPOINT of the build image is an init system (like tini-init or dumb-init).
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 21:58 ? 00:00:00 /scripts-37474587-5556589047/dumb-init -- sh -c if [ -x /usr/local/bin/bash ]; then .exec /usr/local/bin/bash elif [ -x /usr/bin/bash ]; then .exec /usr/bin/bash elif [ -x /bin/bash ]; then .exec /bin/bash elif [ -x /usr/local/bin/sh ]; then .exec /usr/local/bin/sh elif [ -x /usr/bin/sh ]; then .exec /usr/bin/sh elif [ -x /bin/sh ]; then .exec /bin/sh elif [ -x /busybox/sh ]; then .exec /busybox/sh else .echo shell not found .exit 1 fi
root 7 1 0 21:58 ? 00:00:00 /usr/bin/bash <---------------- WHAT IS THIS???
root 26 1 0 21:58 ? 00:00:00 sh -c (/scripts-37474587-5556589047/detect_shell_script /scripts-37474587-5556589047/step_script 2>&1 | tee -a /logs-37474587-5556589047/output.log) &
root 27 26 0 21:58 ? 00:00:00 \_ /usr/bin/bash /scripts-37474587-5556589047/step_script
root 32 27 0 21:58 ? 00:00:00 | \_ /usr/bin/bash /scripts-37474587-5556589047/step_script
root 37 32 0 21:58 ? 00:00:00 | \_ ps -ef --forest
root 28 26 0 21:58 ? 00:00:00 \_ tee -a /logs-37474587-5556589047/output.log
This shell process, which might be sh, bash or busybox, with a PPID of 1 and a PID of 6 or 7, is the shell started by the shell detection script run by the init system (PID 1 above). The process is not redundant, and is the typical operation when the build container runs with an init system.
Runner pod fails to run job results and times out despite successful registration
After the runner pod registers with GitLab, it attempts to run a job but does not and the job eventually times out. The following errors are reported:
There has been a timeout failure or the job got stuck. Check your timeout limits or try again.
This job does not have a trace.
In this case, the runner might receive the error,
HTTP 204 No content response code when connecting to the `jobs/request` API.
To troubleshoot this issue, manually send a POST request to the API to validate if the TCP connection is hanging. If the TCP connection is hanging, the runner might not be able to request CI job payloads.
failed to reserve container name for init-permissions container when gcs-fuse-csi-driver is used
The gcs-fuse-csi-driver CSI driver does not support mounting volumes for the init container. This can cause failures starting the init container when using this driver. Features introduced in Kubernetes 1.28 must be supported in the driver’s project to resolve this bug.
Error: only read-only root filesystem container is allowed
In clusters with admission policies that force containers to run on read-only mounted root filesystems, this error might appear when:
- You install GitLab Runner.
- GitLab Runner tries to schedule a build pod.
These admission policies are usually enforced by an admission controller like
Gatekeeper or Kyverno.
For example, a policy forcing containers to run on read-only root filesystems is the readOnlyRootFilesystem Gatekeeper policy.
To resolve this issue:
- All pods that are deployed to the cluster must adhere to the admission policies by setting securityContext.readOnlyRootFilesystem to true for their containers so the admission controller does not block the pod.
- The containers must run successfully and be able to write to the filesystem even though the root filesystem is mounted read-only.
For GitLab Runner
If GitLab Runner is deployed with the GitLab Runner Helm chart, you must update the GitLab chart configuration to have:
- A proper securityContext value:
  <...>
  securityContext:
    readOnlyRootFilesystem: true
  <...>
- A writable filesystem mounted where the pod can write:
  <...>
  volumeMounts:
  - name: tmp-dir
    mountPath: /tmp
  volumes:
  - name: tmp-dir
    emptyDir:
      medium: "Memory"
  <...>
For the build pod
To make the build pod run on a read-only root filesystem, configure the different containers’ security contexts in config.toml.
You can set the GitLab chart variable runners.config, which is passed to the build pod:
runners:
  config: |
    <...>
    [[runners]]
      [runners.kubernetes.build_container_security_context]
        read_only_root_filesystem = true
      [runners.kubernetes.init_permissions_container_security_context]
        read_only_root_filesystem = true
      [runners.kubernetes.helper_container_security_context]
        read_only_root_filesystem = true
      # This section is only needed if jobs with services are used
      [runners.kubernetes.service_container_security_context]
        read_only_root_filesystem = true
    <...>
To make the build pod and its containers run successfully on a read-only filesystem, you must have writable filesystems in locations where the build pod can write. At a minimum, these locations are the build and home directories. Ensure the build process has write access to other locations if necessary.
The home directory must generally be writable so programs can store their configuration and other data they need for successful execution. The git binary is one example of a program that expects to be able to write to the home directory.
To make the home directory writable regardless of its path in different container images:
- Mount a volume on a stable path (regardless of which build image you use).
- Change the home directory by setting the environment variable $HOME globally for all builds.
You can configure the build pod and its containers in config.toml by updating the value of the GitLab chart variable runners.config.
runners:
  config: |
    <...>
    [[runners]]
      environment = ["HOME=/build_home"]
      [[runners.kubernetes.volumes.empty_dir]]
        name = "repo"
        mount_path = "/builds"
      [[runners.kubernetes.volumes.empty_dir]]
        name = "build-home"
        mount_path = "/build_home"
    <...>
Instead of emptyDir, you can use any other supported volume types.
Because all files that are not explicitly handled and stored as build artefacts are usually ephemeral, emptyDir works for most cases.
AWS EKS: Error cleaning up pod: pods “runner-**” not found or status is “Failed”
The Amazon EKS zone rebalancing feature balances the availability zones in an autoscaling group. This feature might stop a node in one availability zone and create it in another.
Runner jobs cannot be stopped and moved to another node. Disable this feature for runner jobs to resolve this error.