Geo blob replication
Blobs such as uploads, LFS objects, and CI job artifacts, are replicated to the secondary site
with the Self-Service Framework. To track the state of syncing, each model has a corresponding registry table,
for example Upload has Geo::UploadRegistry in the PostgreSQL Geo Tracking Database.
Blob replication happy path workflows between services
Job artifacts are used in the diagrams below, as one example of a blob.
Replicating a new job artifact
Primary site:
sequenceDiagram participant R as Runner participant P as Puma participant DB as PostgreSQL participant SsP as Secondary site PostgreSQL R->>P: Upload artifact P->>DB: Insert `ci_job_artifacts` row P->>DB: Insert `geo_events` row P->>DB: Insert `geo_event_log` row DB->>SsP: Replicate rows
- A Runner uploads an artifact
- Puma inserts
ci_job_artifactsrow - Puma inserts
geo_eventsrow with data like “Job Artifact with ID 123 was updated” - Puma inserts
geo_event_logrow pointing to thegeo_eventsrow (because we built SSF on top of some legacy logic) - PostgreSQL streaming replication inserts the rows in the read replica
Secondary site, after the PostgreSQL DB rows have been replicated:
sequenceDiagram participant DB as PostgreSQL participant GLC as Geo Log Cursor participant R as Redis participant S as Sidekiq participant TDB as PostgreSQL Tracking DB participant PP as Primary site Puma GLC->>DB: Query `geo_event_log` GLC->>DB: Query `geo_events` GLC->>R: Enqueue `Geo::EventWorker` S->>R: Pick up `Geo::EventWorker` S->>TDB: Insert to `job_artifact_registry`, "starting sync" S->>PP: GET <primary site internal URL>/geo/retrieve/job_artifact/123 S->>TDB: Update `job_artifact_registry`, "synced"
- Geo Log Cursor loop finds the new
geo_event_logrow - Geo Log Cursor processes the
geo_eventsrow- Geo Log Cursor enqueues
Geo::EventWorkerjob passing through thegeo_eventsrow data
- Geo Log Cursor enqueues
- Sidekiq picks up
Geo::EventWorkerjob- Sidekiq inserts
job_artifact_registryrow in the PostgreSQL Geo Tracking Database because it doesn’t exist, and marks it “started sync” - Sidekiq does a GET request on an API endpoint at the primary Geo site and downloads the file
- Sidekiq marks the
job_artifact_registryrow as “synced” and “pending verification”
- Sidekiq inserts
Backfilling existing job artifacts
- Sysadmin has an existing GitLab site without Geo
- There are existing CI jobs and job artifacts
- Sysadmin sets up a new GitLab site and configures it to be a secondary Geo site
Secondary site:
There are two cronjobs running every minute: Geo::Secondary::RegistryConsistencyWorker and Geo::RegistrySyncWorker.
The workflow below is split into two, along those lines.
sequenceDiagram participant SC as Sidekiq-cron participant R as Redis participant S as Sidekiq participant DB as PostgreSQL participant TDB as PostgreSQL Tracking DB SC->>R: Enqueue `Geo::Secondary::RegistryConsistencyWorker` S->>R: Pick up `Geo::Secondary::RegistryConsistencyWorker` S->>DB: Query `ci_job_artifacts` S->>TDB: Query `job_artifact_registry` S->>TDB: Insert to `job_artifact_registry`
- Sidekiq-cron enqueues a
Geo::Secondary::RegistryConsistencyWorkerjob every minute. As long as it is actively doing work (creating and deleting rows), this job immediately re-enqueues itself. This job uses an exclusive lease to prevent multiple instances of itself from running simultaneously. - Sidekiq picks up
Geo::Secondary::RegistryConsistencyWorkerjob- Sidekiq queries
ci_job_artifactstable for up to 10000 rows - Sidekiq queries
job_artifact_registrytable for up to 10000 rows - Sidekiq inserts a
job_artifact_registryrow in the PostgreSQL Geo Tracking Database corresponding to the existing Job Artifact
- Sidekiq queries
sequenceDiagram participant SC as Sidekiq-cron participant R as Redis participant S as Sidekiq participant DB as PostgreSQL participant TDB as PostgreSQL Tracking DB participant PP as Primary site Puma SC->>R: Enqueue `Geo::RegistrySyncWorker` S->>R: Pick up `Geo::RegistrySyncWorker` S->>TDB: Query `*_registry` tables S->>R: Enqueue `Geo::EventWorker`s S->>R: Pick up `Geo::EventWorker` S->>TDB: Insert to `job_artifact_registry`, "starting sync" S->>PP: GET <primary site internal URL>/geo/retrieve/job_artifact/123 S->>TDB: Update `job_artifact_registry`, "synced"
- Sidekiq-cron enqueues a
Geo::RegistrySyncWorkerjob every minute. As long as it is actively doing work, this job loops for up to an hour scheduling sync jobs. This job uses an exclusive lease to prevent multiple instances of itself from running simultaneously. - Sidekiq picks up
Geo::RegistrySyncWorkerjob- Sidekiq queries all
registrytables in the PostgreSQL Geo Tracking Database for “never attempted sync” rows. It interleaves rows from each table and adds them to an in-memory queue. - If the previous step yielded less than 1000 rows, then Sidekiq queries all
registrytables for “failed sync and ready to retry” rows and interleaves those and adds them to the in-memory queue. - Sidekiq enqueues
Geo::EventWorkerjobs with arguments like “Job Artifact with ID 123 was updated” for each item in the queue, and tracks the enqueued Sidekiq job IDs. - Sidekiq stops enqueuing
Geo::EventWorkerjobs when “maximum concurrency limit” settings are reached - Sidekiq loops doing this kind of work until it has no more to do
- Sidekiq queries all
- Sidekiq picks up
Geo::EventWorkerjob- Sidekiq marks the
job_artifact_registryrow as “started sync” - Sidekiq does a GET request on an API endpoint at the primary Geo site and downloads the file
- Sidekiq marks the
job_artifact_registryrow as “synced” and “pending verification”
- Sidekiq marks the
Verifying a new job artifact
Primary site:
sequenceDiagram participant Ru as Runner participant P as Puma participant DB as PostgreSQL participant SC as Sidekiq-cron participant Rd as Redis participant S as Sidekiq participant F as Filesystem Ru->>P: Upload artifact P->>DB: Insert `ci_job_artifacts` P->>DB: Insert `ci_job_artifact_states` SC->>Rd: Enqueue `Geo::VerificationCronWorker` S->>Rd: Pick up `Geo::VerificationCronWorker` S->>DB: Query `ci_job_artifact_states` S->>Rd: Enqueue `Geo::VerificationBatchWorker` S->>Rd: Pick up `Geo::VerificationBatchWorker` S->>DB: Query `ci_job_artifact_states` S->>DB: Update `ci_job_artifact_states` row, "started" S->>F: Checksum file S->>DB: Update `ci_job_artifact_states` row, "succeeded"
- A Runner uploads an artifact
- Puma creates a
ci_job_artifactsrow - Puma creates a
ci_job_artifact_statesrow to store verification state.- The row is marked “pending verification”
- Sidekiq-cron enqueues a
Geo::VerificationCronWorkerjob every minute - Sidekiq picks up the
Geo::VerificationCronWorkerjob- Sidekiq queries
ci_job_artifact_statesfor the number of rows marked “pending verification” or “failed verification and ready to retry” - Sidekiq enqueues one or more
Geo::VerificationBatchWorkerjobs, limited by the “maximum verification concurrency” setting
- Sidekiq queries
- Sidekiq picks up
Geo::VerificationBatchWorkerjob- Sidekiq queries
ci_job_artifact_statesfor rows marked “pending verification” - If the previous step yielded less than 10 rows, then Sidekiq queries
ci_job_artifact_statesfor rows marked “failed verification and ready to retry” - For each row
- Sidekiq marks it “started verification”
- Sidekiq gets the SHA256 checksum of the file
- Sidekiq saves the checksum in the row and marks it “succeeded verification”
- Now secondary Geo sites can compare against this checksum
- Sidekiq queries
Secondary site:
sequenceDiagram participant SC as Sidekiq-cron participant R as Redis participant S as Sidekiq participant TDB as PostgreSQL Tracking DB participant F as Filesystem participant DB as PostgreSQL SC->>R: Enqueue `Geo::VerificationCronWorker` S->>R: Pick up `Geo::VerificationCronWorker` S->>TDB: Query `job_artifact_registry` S->>R: Enqueue `Geo::VerificationBatchWorker` S->>R: Pick up `Geo::VerificationBatchWorker` S->>TDB: Query `job_artifact_registry` S->>TDB: Update `job_artifact_registry` row, "started" S->>F: Checksum file S->>DB: Query `ci_job_artifact_states` S->>TDB: Update `job_artifact_registry` row, "succeeded"
- After the artifact is successfully synced, it becomes “pending verification”
- Sidekiq-cron enqueues a
Geo::VerificationCronWorkerjob every minute - Sidekiq picks up the
Geo::VerificationCronWorkerjob- Sidekiq queries
job_artifact_registryin the PostgreSQL Geo Tracking Database for the number of rows marked “pending verification” or “failed verification and ready to retry”- Sidekiq enqueues one or more
Geo::VerificationBatchWorkerjobs, limited by the “maximum verification concurrency” setting
- Sidekiq enqueues one or more
- Sidekiq queries
- Sidekiq picks up
Geo::VerificationBatchWorkerjob- Sidekiq queries
job_artifact_registryin the PostgreSQL Geo Tracking Database for rows marked “pending verification” - If the previous step yielded less than 10 rows, then Sidekiq queries
job_artifact_registryfor rows marked “failed verification and ready to retry” - For each row
- Sidekiq marks it “started verification”
- Sidekiq gets the SHA256 checksum of the file
- Sidekiq saves the checksum in the row
- Sidekiq compares the checksum against the checksum in the
ci_job_artifact_statesrow which was replicated by PostgreSQL - If the checksum matches, then Sidekiq marks the
job_artifact_registryrow “succeeded verification”
- Sidekiq queries