Optimize GitLab Runner manager pod performance
- Tier: Free, Premium, Ultimate
- Offering: GitLab.com, GitLab Self-Managed, GitLab Dedicated
To monitor and optimize GitLab Runner manager pod performance in Kubernetes environments, GitLab recommends the following best practices. Apply them to identify performance bottlenecks and implement solutions for optimal CI/CD pipeline execution.
Prerequisites
Before you implement these recommendations:
- Deploy GitLab Runner in Kubernetes using the Kubernetes executor
- Have administrator access to your Kubernetes cluster
- Configure Prometheus monitoring for GitLab Runner
- Have a basic understanding of Kubernetes resource management
GitLab Runner manager pod responsibilities
The GitLab Runner manager pod coordinates all CI/CD job execution in Kubernetes. Its performance directly impacts your pipeline efficiency.
It handles:
- Log processing: Collects and forwards job logs from worker pods to GitLab
- Cache management: Coordinates local and cloud-based caching operations
- Kubernetes API requests: Creates, monitors, and deletes worker pods
- GitLab API communication: Polls for jobs and reports status updates
- Pod lifecycle management: Manages worker pod provisioning and cleanup
%%{init: { "fontFamily": "GitLab Sans" }}%%
flowchart LR
accTitle: GitLab Runner manager pod architecture
accDescr: The manager pod polls GitLab for jobs, creates job pods through the Kubernetes API, manages the S3 cache, and forwards logs from job pods to GitLab.
subgraph "External Services"
GL[GitLab Instance]
S3[S3 Cache Storage]
end
subgraph "Manager Pod"
MP[Manager Process]
LB[Log Buffer]
CM[Cache Manager]
end
subgraph "Kubernetes API"
K8S[API Server]
end
subgraph "Job Pods"
JP1[Job Pod 1]
JP2[Job Pod 2]
JP3[Job Pod N]
end
GL <-->|Poll Jobs<br/>Update Status| MP
MP <-->|Create/Delete<br/>Monitor Pods| K8S
MP <-->|Cache Operations| S3
JP1 -->|Stream Logs| LB
JP2 -->|Stream Logs| LB
JP3 -->|Stream Logs| LB
LB -->|Forward Logs| GL
CM <-->|Manage Cache| S3
Each responsibility affects performance differently:
- CPU intensive: Kubernetes API operations, log processing
- Memory intensive: Log buffering, job queue management
- Network intensive: GitLab API communication, log streaming
Deploy GitLab Runner in Kubernetes
Install GitLab Runner through the GitLab Runner Operator, which actively receives new features and improvements. The GitLab Runner team installs the Operator through the experimental GRIT framework.
The easiest way to install GitLab Runner in Kubernetes is to apply the operator.k8s.yaml manifest from the latest release, and then follow the instructions in the Operator install documentation.
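A minimal sketch of that sequence, assuming you have downloaded operator.k8s.yaml locally and that the Operator uses its default namespace:

```shell
# Apply the Operator manifest downloaded from the latest release
kubectl apply -f operator.k8s.yaml

# Verify that the Operator is running (the namespace may differ in your installation)
kubectl get pods -n gitlab-runner-system
```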
Configure monitoring
Observability is critical for GitLab Runner administration in Kubernetes because pods are ephemeral and metrics provide the primary operational visibility.
For monitoring, install kube-prometheus-stack.
To configure monitoring for the Operator, see Monitor GitLab Runner Operator.
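If you configure scraping yourself instead, a minimal PodMonitor sketch might look like the following. The namespace, pod labels, and port name are assumptions; match them to your deployment and make sure the runner exposes metrics on port 9252 (for example, by setting listen_address in config.toml):

```yaml
# Sketch: scrape manager pod metrics with the Prometheus Operator.
# Adjust namespace, labels, and port name to match your GitLab Runner deployment.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gitlab-runner
  namespace: gitlab-runner
spec:
  selector:
    matchLabels:
      app: gitlab-runner          # assumed pod label
  podMetricsEndpoints:
    - port: metrics               # assumed name of the container port serving :9252/metrics
      interval: 30s
```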
Performance monitoring
Effective monitoring is crucial for maintaining optimal manager pod performance.
%%{init: { "fontFamily": "GitLab Sans" }}%%
flowchart TD
accTitle: Metrics collection and monitoring flow
accDescr: The manager pod exposes metrics, Prometheus scrapes the metrics using PodMonitor configuration, Grafana visualizes the data, and Alertmanager notifies operators.
subgraph "Metrics Collection Flow"
MP[Manager Pod<br/>:9252/metrics]
PM[PodMonitor]
P[Prometheus]
G[Grafana]
A[Alertmanager]
MP -->|Expose Metrics| PM
PM -->|Scrape| P
P -->|Query| G
P -->|Alerts| A
A -->|Notify| O[Operators]
end
Key performance metrics
Monitor these essential metrics:
| Metric | Description | Performance Indicator |
|---|---|---|
| `gitlab_runner_jobs` | Current running jobs | Job queue saturation |
| `gitlab_runner_limit` | Configured job concurrency limit | Capacity utilization |
| `gitlab_runner_request_concurrency_exceeded_total` | Requests above concurrency limit | API throttling |
| `gitlab_runner_errors_total` | Total caught errors | System stability |
| `container_cpu_usage_seconds_total` | Container CPU usage | Resource consumption |
| `container_memory_working_set_bytes` | Container memory usage | Memory pressure |
Prometheus queries
Track manager pod performance with these queries:
# Manager pod memory usage in MB
container_memory_working_set_bytes{pod=~"gitlab-runner.*"} / 1024 / 1024
# Manager pod CPU utilization in Millicores
rate(container_cpu_usage_seconds_total{pod=~"gitlab-runner.*"}[5m]) * 1000
# Job queue saturation
gitlab_runner_jobs / gitlab_runner_limit
# Jobs per runner
gitlab_runner_jobs
# API request rate
sum(rate(apiserver_request_total[5m]))
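Two follow-up queries, based on the runner metrics listed earlier, can help surface API throttling and error trends (the `level` label on the error counter is an assumption; adjust the grouping to the labels your metrics expose):

```promql
# Rate of requests rejected because the configured request concurrency was exceeded
rate(gitlab_runner_request_concurrency_exceeded_total[5m])

# Error rate grouped by severity level
sum by (level) (rate(gitlab_runner_errors_total[5m]))
```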
Example dashboard
The following dashboard shows manager pod utilization across all pods using the Prometheus queries described previously:
This dashboard can help you visualize:
- Memory usage trends across manager pods
- CPU utilization patterns during job execution
- Job queue saturation levels
- Individual pod resource consumption
Identify overloaded manager pods
Recognize performance degradation before it impacts your pipelines.
Resource utilization indicators
By default, GitLab Runner Operator does not apply CPU or memory limits to manager pods.
To set resource limits:
kubectl patch deployment gitlab-runner -p '{"spec":{"template":{"spec":{"containers":[{"name":"gitlab-runner","resources":{"requests":{"cpu":"500m","memory":"256Mi"},"limits":{"cpu":"1000m","memory":"512Mi"}}}]}}}}'
The feature to allow deployment patching from the Operator configuration is under development. For more information, see merge request 197.
High CPU usage patterns:
- CPU consistently above 70% during standard operations
- CPU spikes exceeding 90% during job creation
- Sustained high CPU without corresponding job activity
Memory consumption trends:
- Memory usage above 80% of allocated limits
- Continuous memory growth without workload increase
- Out-of-memory (OOM) events in manager pod logs
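To check whether the manager pod was OOM-killed, inspect the container's last state (the pod name below is an example; use your manager pod's name):

```shell
# "OOMKilled" in the last state confirms a memory-limit kill
kubectl describe pod gitlab-runner-runner-86cd68d899-m6qqm | grep -i -A 3 "last state"
```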
Performance degradation signs
Watch for these operational symptoms:
- Jobs remaining pending longer than usual
- Pod creation times exceeding 30 seconds
- Delayed log output in GitLab job interfaces
- `etcdserver: request timed out` errors in logs
Diagnostic commands
# Current resource usage
kubectl top pods --containers
> POD NAME CPU(cores) MEMORY(bytes)
> gitlab-runner-runner-86cd68d899-m6qqm runner 7m 32Mi
# Check for performance errors
kubectl logs gitlab-runner-runner-86cd68d899-m6qqm --since=2h | grep -E "(error|timeout|failed)"
Resource configuration
Proper resource configuration is essential for optimal performance.
Performance testing methodology
GitLab Runner manager pod performance is tested using a job that maximizes log output:
Performance test job definition
performance_test:
stage: build
timeout: 30m
tags:
- kubernetes_runner
image: alpine:latest
parallel: 100
variables:
FILE_SIZE_MB: 4
CHUNK_SIZE_BYTES: 1024
FILE_NAME: "test_file_${CI_JOB_ID}_${FILE_SIZE_MB}MB.dat"
KUBERNETES_CPU_REQUEST: "200m"
KUBERNETES_CPU_LIMIT: "200m"
KUBERNETES_MEMORY_REQUEST: "200Mi"
KUBERNETES_MEMORY_LIMIT: "200Mi"
script:
- echo "Starting performance test job ${CI_PARALLEL_ID}/${CI_PARALLEL_TOTAL} with ${FILE_SIZE_MB}MB file size, ${CHUNK_SIZE_BYTES} bytes chunk size"
- dd if=/dev/urandom of="${FILE_NAME}" bs=1M count=${FILE_SIZE_MB}
- echo "File generated successfully. Size:"
- ls -lh "${FILE_NAME}"
- echo "Reading file in ${CHUNK_SIZE_BYTES} byte chunks"
- |
TOTAL_SIZE=$(stat -c%s "${FILE_NAME}")
BLOCKS=$((TOTAL_SIZE / CHUNK_SIZE_BYTES))
echo "Processing $BLOCKS blocks of $CHUNK_SIZE_BYTES bytes each"
for i in $(seq 0 99 $BLOCKS); do
echo "Processing blocks $i to $((i+99))"
dd if="${FILE_NAME}" bs=${CHUNK_SIZE_BYTES} skip=$i count=100 2>/dev/null | xxd -l $((CHUNK_SIZE_BYTES * 100)) -c 16
sleep 0.5
done
This test generates 4 MB of log output per job, which reaches the default output_limit, to stress test the manager pod's log processing capabilities.
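For reference, this limit is controlled by the output_limit setting (in kilobytes) in the [[runners]] section of config.toml; a minimal sketch that doubles the default:

```toml
[[runners]]
  # Default is 4096 KB (4 MB); raise it only if jobs legitimately need more log output
  output_limit = 8192
```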
Test results:
| Parallel Jobs | Peak CPU Usage | Peak Memory Usage |
|---|---|---|
| 50 | 308m | 261 MB |
| 100 | 657m | 369 MB |
%%{init: { "fontFamily": "GitLab Sans" }}%%
xychart-beta
accTitle: Manager pod resource usage compared to concurrent jobs
accDescr: Chart showing CPU usage (10-610 millicores) and memory usage (50-300 MB) that scale with concurrent jobs (0-100).
x-axis [0, 25, 50, 75, 100]
y-axis "Resource Usage" 0 --> 700
line "CPU (millicores)" [10, 160, 310, 460, 610]
line "Memory (MB)" [50, 112, 175, 237, 300]
Key findings:
- CPU usage scales approximately linearly with concurrent jobs
- Memory usage increases with job count but not linearly
- All jobs run concurrently without queuing
CPU requirements
Based on GitLab performance testing, calculate manager pod CPU requirements:
Manager pod CPU = Base CPU + (Concurrent jobs × CPU per job factor)
Where:
- Base CPU: 10m (baseline overhead)
- CPU per job factor: ~6m per concurrent job (based on testing)
Examples based on test results:
For 50 concurrent jobs:
resources:
requests:
cpu: "310m" # 10m + (50 × 6m) = 310m
limits:
cpu: "465m" # 50% headroom for burst trafficFor 100 concurrent jobs:
resources:
requests:
cpu: "610m" # 10m + (100 × 6m) = 610m
limits:
cpu: "915m" # 50% headroomMemory requirements
Based on GitLab testing, calculate memory requirements:
Manager pod memory = Base memory + (Concurrent jobs × Memory per job)
Where:
- Base memory: 50 MB (baseline overhead)
- Memory per job: ~2.5 MB per concurrent job (with 4MB log output)
Examples based on test results:
For 50 concurrent jobs:
resources:
requests:
memory: "175Mi" # 50 + (50 × 2.5) = 175 MB
limits:
memory: "350Mi" # 100% headroomFor 100 concurrent jobs:
resources:
requests:
memory: "300Mi" # 50 + (100 × 2.5) = 300 MB
limits:
memory: "600Mi" # 100% headroomMemory usage varies significantly based on log volume. Jobs producing more than 4 MB of logs require proportionally more memory.
Configuration examples
Small-scale (1-20 concurrent jobs):
resources:
limits:
cpu: 300m
memory: 256Mi
requests:
cpu: 150m
memory: 128Mi
runners:
config: |
concurrent = 20
[[runners]]
limit = 20
request_concurrency = 5Large-scale (75+ concurrent jobs):
resources:
limits:
cpu: 1000m
memory: 1Gi
requests:
cpu: 600m
memory: 600Mi
runners:
config: |
concurrent = 150
[[runners]]
limit = 150
request_concurrency = 20
Horizontal pod autoscaler
Configure automatic scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: gitlab-runner-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: gitlab-runner
minReplicas: 2
maxReplicas: 5
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 80
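After you apply the autoscaler, you can confirm that it tracks the deployment and acts on the configured utilization targets:

```shell
# Shows current vs. target CPU and memory utilization and the active replica count
kubectl get hpa gitlab-runner-hpa
```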
Troubleshoot performance issues
Address common manager pod performance problems with these solutions.
API rate limiting
Problem: Manager pod exceeds Kubernetes API rate limits.
Solution: Optimize API polling:
[[runners]]
[runners.kubernetes]
poll_interval = "5s" # Increase from default 3s
poll_timeout = "180s"Performance optimization
Apply these performance optimization strategies for challenging scenarios.
Cache optimization
Configure distributed caching to reduce manager pod load. A shared cache also reduces duplicated work in job pods because cached files are reused across jobs:
[runners.cache]
Type = "s3"
Shared = true
[runners.cache.s3]
ServerAddress = "cache.example.com"
BucketName = "gitlab-runner-cache"
PreSignedURLDisabled = false
Node segregation
Segregate manager pods from job pods by using dedicated nodes to ensure stable performance and prevent resource contention. This isolation prevents job pods from disrupting critical manager pod operations.
%%{init: { "fontFamily": "GitLab Sans" }}%%
graph TB
accTitle: Kubernetes node segregation architecture
accDescr: Node segregation with manager pods on dedicated manager nodes and job pods on worker nodes, separated by taints.
subgraph "Kubernetes Cluster"
subgraph "Manager Nodes"
MN1[Manager Node 1<br/>Taint: runner.gitlab.com/manager]
MN2[Manager Node 2<br/>Taint: runner.gitlab.com/manager]
MP1[Manager Pod 1]
MP2[Manager Pod 2]
MN1 --> MP1
MN2 --> MP2
end
subgraph "Worker Nodes"
WN1[Worker Node 1<br/>Taint: runner.gitlab.com/job]
WN2[Worker Node 2<br/>Taint: runner.gitlab.com/job]
WN3[Worker Node 3<br/>Taint: runner.gitlab.com/job]
JP1[Job Pod 1]
JP2[Job Pod 2]
JP3[Job Pod 3]
JP4[Job Pod 4]
WN1 --> JP1
WN1 --> JP2
WN2 --> JP3
WN3 --> JP4
end
end
MP1 -.->|Creates & Manages| JP1
MP1 -.->|Creates & Manages| JP2
MP2 -.->|Creates & Manages| JP3
MP2 -.->|Creates & Manages| JP4
Configure node taints
For manager nodes:
# Taint nodes dedicated to Manager Pods
kubectl taint nodes <manager-node-name> runner.gitlab.com/manager=:NoExecute
# Label nodes for easier selection
kubectl label nodes <manager-node-name> runner.gitlab.com/workload-type=manager
For worker nodes:
# Taint nodes dedicated to job pods
kubectl taint nodes <worker-node-name> runner.gitlab.com/job=:NoExecute
# Label nodes for job scheduling
kubectl label nodes <worker-node-name> runner.gitlab.com/workload-type=job
Configure manager pod scheduling
Update the GitLab Runner Operator configuration to schedule manager pods only on dedicated nodes:
apiVersion: apps.gitlab.com/v1beta2
kind: Runner
metadata:
name: gitlab-runner
spec:
gitlabUrl: https://gitlab.example.com
token: gitlab-runner-secret
buildImage: alpine
podSpec:
name: "manager-node-affinity"
patch: |
{
"spec": {
"nodeSelector": {
"runner.gitlab.com/workload-type": "manager"
},
"tolerations": [
{
"key": "runner.gitlab.com/manager",
"operator": "Exists",
"effect": "NoExecute"
}
]
}
}
patchType: "strategic"Configure job pod scheduling
Ensure job pods run only on worker nodes by updating config.toml.
[runners.kubernetes.node_selector]
"runner.gitlab.com/workload-type" = "job"
[runners.kubernetes.node_tolerations]
"runner.gitlab.com/job=" = "NoExecute"Benefits of node segregation:
- Dedicated resources for manager pods without job interference
- Predictable performance without resource contention
- Option to run without resource limits when using dedicated nodes
- Simplified capacity planning with node-based scaling
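To confirm that the segregation works as intended, check which node each pod was scheduled on. The namespace is an assumption; use the one where your runner and job pods run:

```shell
# The NODE column should show manager pods on manager nodes and job pods on worker nodes
kubectl get pods -o wide -n gitlab-runner
```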
Emergency procedures
Graceful restart:
# Scale down to stop accepting new jobs
kubectl scale deployment gitlab-runner --replicas=0
# Wait for active jobs to complete (max 10 minutes)
timeout 600 bash -c 'while kubectl get pods -l job-type=user-job | grep Running; do sleep 10; done'
# Scale back up
kubectl scale deployment gitlab-runner --replicas=1
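If waiting for jobs to drain is not an option, a rollout restart recreates the manager pod immediately; jobs supervised by the old pod can fail when it terminates, so prefer the graceful procedure when possible:

```shell
# Force-recreate the manager pod. Jobs coordinated by the old pod may be interrupted.
kubectl rollout restart deployment gitlab-runner
```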
Capacity planning
These calculations are based on tests with 4 MB log output per job. Your resource requirements might vary based on:
- Log volume per job
- Job execution patterns
- Cache usage
- Network latency to GitLab
Calculate optimal resources using this Python function:
def calculate_manager_resources(concurrent_jobs, avg_log_mb_per_job=4):
"""Calculate Manager Pod resources based on performance testing."""
# CPU: ~6m per concurrent job + 10m base
base_cpu = 0.01 # 10m
cpu_per_job = 0.006 # 6m per job
total_cpu = base_cpu + (concurrent_jobs * cpu_per_job)
# Memory: ~2.5MB per job + 50MB base (for 4MB log output)
base_memory = 50
memory_per_job = 2.5 * (avg_log_mb_per_job / 4) # Scale with log size
total_memory = base_memory + (concurrent_jobs * memory_per_job)
return {
'cpu_request': f"{int(total_cpu * 1000)}m",
'cpu_limit': f"{int(total_cpu * 1.5 * 1000)}m", # 50% headroom
'memory_request': f"{int(total_memory)}Mi",
'memory_limit': f"{int(total_memory * 2.0)}Mi" # 100% headroom
}
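For example, calling the function for 100 concurrent jobs reproduces the values used in the 100-job configuration shown earlier:

```python
# Example usage of calculate_manager_resources defined above
print(calculate_manager_resources(concurrent_jobs=100))
# {'cpu_request': '610m', 'cpu_limit': '915m', 'memory_request': '300Mi', 'memory_limit': '600Mi'}
```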
Performance thresholds
Establish thresholds for proactive intervention:
| Metric | Warning | Critical | Action Required |
|---|---|---|---|
| CPU Usage | 70% sustained | 85% sustained | Scale or optimize |
| Memory Usage | 80% of limit | 90% of limit | Increase limits |
| API Error Rate | 2% of requests | 5% of requests | Investigate bottlenecks |
| Job Queue Time | 30 seconds | 2 minutes | Review capacity |
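As a sketch of how these thresholds could become alerts, the following PrometheusRule (assuming kube-prometheus-stack is installed and the pod naming from the earlier queries) fires a warning when a manager pod stays above 0.7 cores for 15 minutes; adapt the expression if your pods have different CPU limits:

```yaml
# Sketch: warning alert for sustained manager pod CPU usage.
# 0.7 cores corresponds to the 70% warning threshold for a 1-core (1000m) limit;
# adjust the threshold, pod regex, and rule labels to your environment.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: gitlab-runner-manager-alerts
spec:
  groups:
    - name: gitlab-runner-manager
      rules:
        - alert: RunnerManagerHighCPU
          expr: rate(container_cpu_usage_seconds_total{pod=~"gitlab-runner.*"}[5m]) > 0.7
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "GitLab Runner manager pod CPU above the 70% warning threshold"
```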
Related topics
- GitLab Runner fleet configuration and best practices - Job pod performance optimization
- GitLab Runner executors - Execution environment performance characteristics
- GitLab Runner monitoring - General monitoring setup
- Plan and operate a fleet of runners - Strategic fleet deployment
Summary
Optimizing GitLab Runner manager pod performance requires systematic monitoring, proper resource allocation, and proactive troubleshooting.
Key strategies include:
- Proactive monitoring by using Prometheus metrics and Grafana dashboards
- Resource planning based on concurrent job capacity and log volume
- Multi-manager architecture for fault tolerance and load distribution
- Emergency procedures for quick issue resolution
Implement these strategies to ensure reliable CI/CD pipeline execution while maintaining optimal resource utilization.
