Geo background jobs

  • Tier: Premium, Ultimate
  • Offering: GitLab Self-Managed, GitLab Dedicated

Geo persists all sync and verification intent in registry tables in the Geo tracking database, not in Sidekiq job arguments. If a job is killed, the registry record retains its state and cron-based schedulers re-enqueue the work. This design makes Geo background jobs crash-safe: a killed job delays the work rather than losing it.

Registry sync states

Every replicable data type has a registry record on the secondary site that tracks sync state:

| State | Description |
| --- | --- |
| pending | Needs sync. Picked up by the sync scheduler cron. |
| started | Sync in progress. If the worker is killed, the record stays in this state. |
| synced | Successfully replicated. |
| failed | Sync failed. Has a retry_at timestamp for exponential backoff retry. |

If a sync job is killed, the registry stays in started. Geo::SyncTimeoutCronWorker, which runs every 10 minutes, detects registries stuck in started and marks them as failed with a retry backoff. The sync scheduler cron worker then re-enqueues those registries for sync.
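The recovery path for a killed sync job can be pictured as a small state machine. The following Python sketch is a simplified model for illustration, not Geo's implementation: the Registry class, the started_at and retry_count fields, and the timeout and backoff values are all assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional

# Hypothetical values for illustration; Geo's real thresholds may differ.
SYNC_TIMEOUT = timedelta(hours=8)
BASE_BACKOFF = timedelta(minutes=1)
MAX_BACKOFF = timedelta(hours=1)


@dataclass
class Registry:
    """Simplified stand-in for a registry record in the tracking database."""
    state: str = "pending"
    retry_count: int = 0
    started_at: Optional[datetime] = None
    retry_at: Optional[datetime] = None


def start_sync(reg: Registry, now: datetime) -> None:
    """A sync job picks up the registry and marks it as in progress."""
    reg.state = "started"
    reg.started_at = now


def mark_failed(reg: Registry, now: datetime) -> None:
    """Record a failure and schedule the next attempt with capped exponential backoff."""
    reg.retry_count += 1
    delay = min(BASE_BACKOFF * 2 ** reg.retry_count, MAX_BACKOFF)
    reg.state = "failed"
    reg.retry_at = now + delay


def timeout_sweep(registries: List[Registry], now: datetime) -> int:
    """Model of the timeout cron: fail registries that stayed in started too long."""
    recovered = 0
    for reg in registries:
        stuck = (
            reg.state == "started"
            and reg.started_at is not None
            and now - reg.started_at > SYNC_TIMEOUT
        )
        if stuck:
            mark_failed(reg, now)
            recovered += 1
    return recovered


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0)
    reg = Registry()
    start_sync(reg, t0)  # the job is then killed; nothing updates the record
    timeout_sweep([reg], t0 + timedelta(minutes=10))  # still within the timeout
    timeout_sweep([reg], t0 + timedelta(hours=9))     # now treated as stuck
    print(reg.state, reg.retry_at)
```

The point of the model is that the job itself never has to report its own death. The record's state plus a timestamp is enough for a periodic sweep to move it back into a retryable state.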

Recovery mechanisms

The following cron workers provide automatic recovery. Their schedules are defined in ee/config/schedule.yml and are configurable.

The cron job name matches the name shown in the Cron tab of the Sidekiq dashboard in the Admin area.

| Mechanism | Worker | Cron job name | Default schedule | Purpose |
| --- | --- | --- | --- | --- |
| Sync timeout recovery | Geo::SyncTimeoutCronWorker | geo_sync_timeout_cron_worker | Every 10 min (secondary) | Marks registries stuck in started state as failed with retry backoff. |
| Blob sync scheduler | Geo::RegistrySyncWorker | geo_registry_sync_worker | Every 1 min (secondary) | Polls for pending and failed blob registries and enqueues Geo::SyncWorker. |
| Repository sync scheduler | Geo::RepositoryRegistrySyncWorker | geo_repository_registry_sync_worker | Every 1 min (secondary) | Polls for pending and failed repository registries and enqueues Geo::SyncWorker. |
| Registry consistency | Geo::Secondary::RegistryConsistencyWorker | geo_secondary_registry_consistency_worker | Every 1 min (secondary) | Creates missing registry records for untracked replicables. Detects orphaned registries. |
| Verification timeout | Geo::VerificationTimeoutWorker | Triggered by geo_verification_cron_worker | Every 1 min (primary and secondary) | Marks verification stuck in verification_started as verification_failed. |
| Verification scheduler | Geo::VerificationCronWorker | geo_verification_cron_worker | Every 1 min (primary and secondary) | Triggers verification batch, timeout, re-verification, and state backfill workers. |
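To make the scheduler rows more concrete, the following Python sketch models how a sync scheduler might choose which registries to enqueue on each tick. It is an illustration under stated assumptions, not Geo's query: the dictionary layout, the due_for_sync function, the limit parameter, and the rule that a failed registry is only picked up once its retry_at has passed are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def due_for_sync(registries: List[Dict], now: datetime, limit: int) -> List[Dict]:
    """Return up to `limit` registries that a scheduler tick would enqueue.

    Assumed rule: pending registries are always due; failed registries are
    due only once their retry_at timestamp has passed.
    """
    due = []
    for reg in registries:
        if reg["state"] == "pending":
            due.append(reg)
        elif reg["state"] == "failed" and reg["retry_at"] <= now:
            due.append(reg)
    return due[:limit]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0)
    registries = [
        {"id": 1, "state": "pending", "retry_at": None},
        {"id": 2, "state": "failed", "retry_at": now - timedelta(minutes=5)},
        {"id": 3, "state": "failed", "retry_at": now + timedelta(minutes=5)},
        {"id": 4, "state": "started", "retry_at": None},
        {"id": 5, "state": "synced", "retry_at": None},
    ]
    for reg in due_for_sync(registries, now, limit=10):
        print("enqueue sync for registry", reg["id"])
```

Because the selection depends only on registry state, a tick that is skipped or cleared from the queue changes nothing in the database, and the next tick selects the same work.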

Queue safety reference

The following sections provide per-worker safety information for Geo Sidekiq queues.

Cron workers

Cron workers are scheduled automatically and re-run on the next cron tick if their queue is cleared. All cron worker queues are safe to clear.

| Worker | What it does | Safe to clear queue | Negative consequences of clearing | Recovery mechanism |
| --- | --- | --- | --- | --- |
| Geo::RegistrySyncWorker | Polls for pending and failed blob registries. Enqueues Geo::SyncWorker. | Yes | Sync is delayed until the next cron tick. | Re-runs every 1 min. |
| Geo::RepositoryRegistrySyncWorker | Polls for pending and failed repository registries. Enqueues Geo::SyncWorker. | Yes | Sync is delayed until the next cron tick. | Re-runs every 1 min. |
| Geo::SyncTimeoutCronWorker | Finds registries stuck in started state. Marks them failed with retry backoff. | Yes | Registries stuck in started are not transitioned to failed until the next tick. Sync resumes after the next tick. | Re-runs every 10 min. |
| Geo::Secondary::RegistryConsistencyWorker | Scans all registry types. Creates missing registries. Detects orphaned registries and enqueues Geo::DestroyWorker. | Yes | Missing registries are not created and orphaned registries are not cleaned up until the next tick. | Re-runs every 1 min. |
| Geo::VerificationCronWorker | Triggers all verification sub-workers. | Yes | Verification is delayed until the next cron tick. | Re-runs every 1 min. |
| Geo::VerificationTimeoutWorker | Recovers records stuck in verification_started. | Yes | Records stuck in verification_started are not transitioned until the next tick. | Re-runs every 1 min. |
| Geo::PruneEventLogWorker | Deletes old event log entries that all secondaries have consumed (primary only). | Yes | The event log grows until the worker runs again. No data loss. | Re-runs every 5 min. |
| Geo::MetricsUpdateWorker | Computes node status, updates Prometheus gauges, sends status to primary. | Yes | Metrics become stale until the worker runs again. | Re-runs every 1 min. |
| Geo::SidekiqCronConfigWorker | Enables and disables cron jobs based on node type (primary or secondary). | Yes | Cron job configuration may be incorrect until the worker runs again. | Re-runs every 1 min. |

Sync workers (secondary)

| Worker | What it does | Safe to clear queue | Negative consequences of clearing | Recovery mechanism |
| --- | --- | --- | --- | --- |
| Geo::SyncWorker | Downloads a single blob or fetches a single repository from the primary site. | Yes | Registries for in-flight jobs stay in started. Sync is delayed until recovery. | SyncTimeoutCronWorker transitions stuck registries to failed. Sync scheduler re-enqueues. |
| Geo::ContainerRepositorySyncWorker | Syncs a single container repository from the primary site. | Yes | Registry retains its state. Sync is delayed until recovery. | Sync scheduler re-enqueues. |
| Geo::BulkRegistryResyncWorker | Triggers bulk re-sync of a registry class. | Yes | Bulk re-sync does not start. Individual registries retain their state. | Caller re-enqueues. |

Event workers (primary and secondary)

When you clear event worker queues, events in the queue might be lost. Lost events can cause data to be temporarily out of date on the secondary site.

| Worker | What it does | Safe to clear queue | Negative consequences of clearing | Recovery mechanism |
| --- | --- | --- | --- | --- |
| Geo::EventWorker | Processes Geo replication events (created, updated, deleted) on the secondary site. | Use caution. | Losing updated events can leave a resource out of date until re-verification or the next update event. Losing deleted events can leave orphaned files on the secondary site (wasted disk, no data loss). Losing created events has no lasting effect. Has 3 Sidekiq retries. | Even if all retries fail, the registry record exists and the sync scheduler re-enqueues. RegistryConsistencyWorker also detects orphaned registries. |
| Geo::BatchEventCreateWorker | Bulk-inserts Geo events on the primary site. | Use caution. | Events in the queue are lost if cleared. Secondary sites may not learn about changes until re-verification. | RegistryConsistencyWorker on the secondary site eventually detects missing registries (every 1 min). |
| Geo::CreateRepositoryUpdatedEventWorker | Creates a Geo event when a repository is updated on the primary site. | Use caution. | Same as Geo::BatchEventCreateWorker. | Same as Geo::BatchEventCreateWorker. |
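The reason lost created and deleted events are recoverable is that the registry consistency check does not depend on events at all. Conceptually, it compares the set of replicable records with the set of registry records. The following Python sketch shows that idea only; the reconcile function and the use of integer IDs are assumptions for illustration, not how Geo::Secondary::RegistryConsistencyWorker is implemented.

```python
from typing import Set, Tuple


def reconcile(replicable_ids: Set[int], registry_ids: Set[int]) -> Tuple[Set[int], Set[int]]:
    """Compare replicables with registries and report both kinds of drift.

    missing:  replicables with no registry record (for example, after a lost
              created event); a registry would be created for each.
    orphaned: registries with no replicable (for example, after a lost
              deleted event); a destroy job would be enqueued for each.
    """
    missing = replicable_ids - registry_ids
    orphaned = registry_ids - replicable_ids
    return missing, orphaned


if __name__ == "__main__":
    missing, orphaned = reconcile({1, 2, 3}, {2, 3, 4})
    print("create registries for:", sorted(missing))
    print("enqueue destroy for:", sorted(orphaned))
```

Lost updated events are not covered by this comparison, because both records still exist. As the table notes, those are corrected by re-verification or by the next update event.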

Verification workers (primary and secondary)

| Worker | What it does | Safe to clear queue | Negative consequences of clearing | Recovery mechanism |
| --- | --- | --- | --- | --- |
| Geo::VerificationBatchWorker | Checksums batches of records. | Yes | Verification is delayed. | Cron re-enqueues. VerificationTimeoutWorker catches stuck records. |
| Geo::ReverificationBatchWorker | Marks already-verified records for periodic re-verification. | Yes | Re-verification is delayed. | Cron re-enqueues. |
| Geo::VerificationStateBackfillWorker | Backfills verification state table for replicable types. | Yes | Backfill is delayed. Exclusive lease expires in 30 min. | Re-enqueues itself. |
| Geo::BulkPrimaryVerificationWorker | Triggers bulk verification of a model class on the primary site. | Yes | Bulk verification does not start. | Caller re-enqueues. |
| Geo::BulkRegistryReverificationWorker | Triggers bulk re-verification of a registry class on the secondary site. | Yes | Bulk re-verification does not start. | Caller re-enqueues. |

Destroy workers (secondary)

When you clear the destroy worker queue, lost jobs can leave orphaned files or repositories on the secondary site.

| Worker | What it does | Safe to clear queue | Negative consequences of clearing | Recovery mechanism |
| --- | --- | --- | --- | --- |
| Geo::DestroyWorker | Deletes a replicated file or repository on the secondary site after deletion on the primary site. | Use caution. | Orphaned files or repositories remain on the secondary site, wasting disk space. No data loss. Has 3 Sidekiq retries. | RegistryConsistencyWorker detects orphaned registries and re-enqueues DestroyWorker. |
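For runbooks or maintenance scripts, the "Safe to clear queue" column from the tables in this section can be encoded as a lookup. The following Python sketch is one possible way to do that. The QUEUE_SAFETY dictionary restates the tables above, while the function names and the "unknown" fallback are assumptions for this example. The sketch does not talk to Sidekiq or clear anything itself.

```python
# "yes" and "caution" mirror the "Safe to clear queue" column in this section.
QUEUE_SAFETY = {
    # Cron workers
    "Geo::RegistrySyncWorker": "yes",
    "Geo::RepositoryRegistrySyncWorker": "yes",
    "Geo::SyncTimeoutCronWorker": "yes",
    "Geo::Secondary::RegistryConsistencyWorker": "yes",
    "Geo::VerificationCronWorker": "yes",
    "Geo::VerificationTimeoutWorker": "yes",
    "Geo::PruneEventLogWorker": "yes",
    "Geo::MetricsUpdateWorker": "yes",
    "Geo::SidekiqCronConfigWorker": "yes",
    # Sync workers
    "Geo::SyncWorker": "yes",
    "Geo::ContainerRepositorySyncWorker": "yes",
    "Geo::BulkRegistryResyncWorker": "yes",
    # Event workers
    "Geo::EventWorker": "caution",
    "Geo::BatchEventCreateWorker": "caution",
    "Geo::CreateRepositoryUpdatedEventWorker": "caution",
    # Verification workers
    "Geo::VerificationBatchWorker": "yes",
    "Geo::ReverificationBatchWorker": "yes",
    "Geo::VerificationStateBackfillWorker": "yes",
    "Geo::BulkPrimaryVerificationWorker": "yes",
    "Geo::BulkRegistryReverificationWorker": "yes",
    # Destroy workers
    "Geo::DestroyWorker": "caution",
}


def clear_guidance(worker: str) -> str:
    """Return "yes", "caution", or "unknown" for a worker class name."""
    return QUEUE_SAFETY.get(worker, "unknown")


def safe_to_clear(worker: str) -> bool:
    """True only for workers this section marks as safe to clear outright."""
    return clear_guidance(worker) == "yes"


if __name__ == "__main__":
    for name in ("Geo::SyncWorker", "Geo::EventWorker", "Some::OtherWorker"):
        print(name, "->", clear_guidance(name))
```

Treating unlisted workers as "unknown" rather than safe keeps the check conservative: anything not covered by this section needs a manual review before its queue is cleared.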