Sharding guidelines
The sharding initiative is a long-running project to ensure that most GitLab database tables can be related to an Organization, either directly or indirectly. This involves adding an `organization_id`, `namespace_id`, or `project_id` column to tables, and backfilling the data so that a `NOT NULL` constraint can be added. This work is important for the delivery of Cells and Organizations. For more information, see the design goals of Organizations.
Sharding principles
Follow this guidance to complete the remaining sharding key work and resolve outstanding issues.
Use unique issues for each table
We have a number of tables that share an issue. For example, eight tables point to the same issue here. This makes tracking progress and resolving blockers difficult. You should break these shared issues out into one issue per table, and update the YAML files to match.
Define a sharding key
1: Add column, triggers, indexes, and foreign keys
In this step, we add the sharding key column, indexes, foreign keys, and any necessary triggers.
1. Steps to set up:
1. Set up the keys required by Housekeeper:
   - Set the target project: `export HOUSEKEEPER_TARGET_PROJECT_ID=278964` (project ID of the GitLab project).
   - Create a new PAT for yourself: `export HOUSEKEEPER_GITLAB_API_TOKEN=<your-pat>`.
2. Update the local master branch: `git checkout master && git pull origin master --rebase`.
3. Switch to the `sharding-key-backfill-keeps` branch with the following command: `git checkout sharding-key-backfill-keeps`. You can also find the branch in the MR.
4. Rebase this branch on top of master and push the changes back to origin. This makes sure that this branch is aware of changes in master: `git pull origin master --rebase`. Run the following command to push the rebased changes back to the branch, and include `LEFTHOOK=0` (otherwise, RuboCop fails): `LEFTHOOK=0 git push --force-with-lease -o ci.skip`
5. Run `bundle install` and migrations.
6. Do not push any changes to this branch, just keep rebasing.
2. Steps to create automated MR:
We store sharding key keeps for small tables and large tables inside the `keeps` directory. The file names match `backfill_desired_sharding_key_*.rb`.
Let’s understand the small table keep:
The keep file contains code that:

- Defines a `Housekeeper::Keep` class for backfilling desired sharding keys on small tables
- Sets change types to include `::Keeps::DesiredShardingKey::CHANGE_TYPES`
- Iterates through entries in `database_yaml_entries`
- For each entry, adds the sharding key column if needed, including indexes, triggers, and foreign keys
Open the keep file and add `next unless entry.table_name == 'table name'`, where the table name is the name of the table we want to create migrations for.

Perform a quick check based on the `desired_sharding_key` configuration of the table. Is the configuration correct? Does the sharding key of the parent table include a `NOT NULL` constraint as expected? If not, skip that table and go for another one. If the table is OK, we can proceed.

Let's check the table's primary key. Run the following commands in your terminal:

```shell
gdk psql
\d <table_name>
```

Running the above commands gives you useful information about the table, such as indexes, the primary key, and so on. For example, for the `security_scans` table, the output shows:
- Table: `public.security_scans`
- Columns: `id` (bigint, not null), `created_at` (timestamp), `updated_at` (timestamp), `build_id` (bigint, not null), `scan_type` (smallint, not null), and other fields
- Primary key: `security_scans_pkey` (btree on `id`)
- Indexes: including `index_security_scans_on_build_id` and `index_security_scans_on_project_id`
- Foreign keys: including references to `ci_builds` and `projects`
This is important because we have many cases where the primary key is composite, non-unique, and so on, which requires some manual changes in the keep.
Do a dry run of the execution, as it shows you the generated changes and won't commit anything:

```shell
bundle exec gitlab-housekeeper -k Keeps::BackfillDesiredShardingKeySmallTable -d
```
If the dry run looks good, then run the same command without the `-d` flag:

```shell
bundle exec gitlab-housekeeper -k Keeps::BackfillDesiredShardingKeySmallTable
```

Running the above command creates an MR with the changes 🎉
Follow the same method for large tables, using the large table keep. The only difference is that the table won't get a foreign key, for performance reasons.
Some useful hacks:
1. Tables with a non-`id` primary key
Replace `:id` with the non-`id` primary key in the first diff, in the second diff, and in the third diff.
Comment out this line in this diff.
Add a line in this file, `let(:batch_column) { :non-id-pk }`, like we did in this example. Example MR: !165940 (merged).

If the table uses the `gitlab_ci` database, then we need to specify `migration: :gitlab_ci` in the migration files.
2. Tables with composite primary keys (more than one column)
Get the table information and check whether either of the primary key columns has a UNIQUE index defined. If yes, use that column as the primary key and batch column, like we did above.
If none of the columns are unique, then we need to manually add changes.
Let's take the `deployment_merge_requests` table as an example. This table has a non-unique composite primary key: `"deployment_merge_requests_pkey" PRIMARY KEY, btree (deployment_id, merge_request_id)`. We have used cursor-based batching.
First, generate the changes using keep, then edit the changes.
Open the queue backfill post-deployment migration file, remove all the generated changes, and add new ones. For example:
Example MR: !183738 (merged)
The same style of changes can be used for other tables with these characteristics.
Open `lib/gitlab/background_migration/backfill_*.rb`, remove all the changes generated by the keep, and add new ones. See !183738.
If the table is large, add the sharding key to the ignored FK list `:ignored_fk_columns_map` in `schema_spec.rb`. Make sure to also update the specs.
More examples: !183047 (merged), !176714 (merged).
3. Tables in a different database
It might be the case that the table is in the `ci` database and the sharding key is in the `main` database. For example, `dast_site_profiles_builds` is in the `sec` database and the sharding key table `projects` is in the `main` database. For this you may need to add a loose foreign key (LFK). Example: the MR created by Housekeeper, and later we added the LFK.
Make sure to add `migration: :gitlab_sec` in the backfill spec and queue spec. A normal FK won't work, as the tables are in different databases.
For the parent table `dast_site_profiles` we have an LFK to `projects`. If the table `dast_site_profiles_builds` has an FK relation to its parent table `dast_site_profiles` with CASCADE delete, then its records are deleted when the associated `dast_site_profiles` records are deleted. But it's also good to add an LFK entry for `dast_site_profiles_builds`:

```yaml
dast_site_profiles_builds:
  - table: projects
    column: project_id
    on_delete: async_delete
```
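This YAML entry is a loose foreign key definition; in the GitLab repository these definitions normally live in `config/gitlab_loose_foreign_keys.yml` (double-check the exact location against the loose foreign keys documentation).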
2: Finalization Migration
Once the column has been added and the backfill is finished, we need to finalize the migration. We can check the status of the queued migration in the `#chat-ops-test` Slack channel.
Run `/chatops run batched_background_migrations list --job-class-name=<desired_sharding_key_migration_job_name>` to check the status of a particular job. The chatops output displays:

- Job class name
- Table name
- Status (for example, `finished`, `active`, `paused`)
- Progress percentage
Once it's at 100%, create a new branch from master and run:
```shell
bundle exec rails g post_deployment_migration finalize_<table><sharding_key>
```

This creates a post-deployment migration file; edit it. For example, for the table `subscription_user_add_on_assignments` it will look like:

```ruby
class FinalizeBackfillSubscriptionUserAddOnAssignmentsOrganizationId < Gitlab::Database::Migration[2.2]
  milestone '17.6'

  disable_ddl_transaction!

  restrict_gitlab_migration gitlab_schema: :gitlab_main_cell

  def up
    ensure_batched_background_migration_is_finished(
      job_class_name: 'BackfillSubscriptionUserAddOnAssignmentsOrganizationId',
      table_name: :subscription_user_add_on_assignments,
      column_name: :id,
      job_arguments: [:organization_id, :subscription_add_on_purchases, :organization_id, :add_on_purchase_id],
      finalize: true
    )
  end

  def down; end
end
```

It will be similar for every other table except for the `job_class_name`, `table_name`, `column_name`, and `job_arguments`. Make sure the job arguments are correct; you can check the add sharding key & backfill MR to match the job arguments. Once it's done, run `bin/rails db:migrate` and update the key `finalized_by` in `db/docs`. Example MR: !169834.
You are all set. Commit the changes and create the MR 🎉
3. Add a NOT NULL constraint
The last step is to make sure the sharding key has a NOT NULL constraint.
1. Small tables
Create a post-deployment migration using:

```shell
bundle exec rails g post_deployment_migration <table_name>_not_null
```

For example, for the table `subscription_user_add_on_assignments`:

```ruby
class AddSubscriptionUserAddOnAssignmentsOrganizationIdNotNull < Gitlab::Database::Migration[2.2]
  milestone '17.6'
  disable_ddl_transaction!

  def up
    add_not_null_constraint :subscription_user_add_on_assignments, :organization_id
  end

  def down
    remove_not_null_constraint :subscription_user_add_on_assignments, :organization_id
  end
end
```

Run `bin/rails db:migrate`. Open the corresponding `db/docs/*.yml` file, in this case `db/docs/subscription_user_add_on_assignments.yml`, remove the `desired_sharding_key` and `desired_sharding_key_migration_job_name` configuration, and add the `sharding_key`:

```yaml
sharding_key:
  organization_id: organizations
```
2. Large tables or tables that exceed the allowed migration runtime
In this case we have to add async validation before we can enforce the `NOT NULL` constraint on the sharding key. It is a two-MR process. Let's take the `packages_package_files` table as an example.
Step 1 (MR 1): Add NOT NULL for sharding key on packages_package_files
Create a post-deployment migration to add the NOT NULL constraint with `validate: false`:

```ruby
class AddPackagesPackageFilesProjectIdNotNull < Gitlab::Database::Migration[2.2]
  milestone '17.11'
  disable_ddl_transaction!

  def up
    add_not_null_constraint :packages_package_files, :project_id, validate: false
  end

  def down
    remove_not_null_constraint :packages_package_files, :project_id
  end
end
```

Create another post-deployment migration to prepare the async constraint validation:

```ruby
class PreparePackagesPackageFilesProjectIdNotNullValidation < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!
  milestone '17.11'

  CONSTRAINT_NAME = :check_43773f06dc

  def up
    prepare_async_check_constraint_validation :packages_package_files, name: CONSTRAINT_NAME
  end

  def down
    unprepare_async_check_constraint_validation :packages_package_files, name: CONSTRAINT_NAME
  end
end
```
Run `bin/rails db:migrate` and create the MR with the changes.
Step 2 (MR 2): Validate project_id NOT NULL on packages_package_files
Once the MR in Step 1 is merged, wait a couple of days for it to be prepared. You can check the status on https://console.postgres.ai/: ask the Joe instance bot for the table information and look for `Check constraints`. The sharding key `project_id` will appear as `NOT VALID`:

```plaintext
Check constraints:
    "check_43773f06dc" CHECK (project_id IS NOT NULL) NOT VALID
```
Once it's there, we can create a new post-deployment migration to validate the NOT NULL constraint. The down migration will be a no-op.
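A minimal sketch of what this validation migration could look like, assuming the standard `validate_not_null_constraint` migration helper; the class name and milestone are illustrative, so verify against the current migration helpers:

```ruby
class ValidatePackagesPackageFilesProjectIdNotNull < Gitlab::Database::Migration[2.2]
  milestone '17.11'
  disable_ddl_transaction!

  CONSTRAINT_NAME = :check_43773f06dc

  def up
    # Validates the constraint that was added with `validate: false` in MR 1.
    validate_not_null_constraint :packages_package_files, :project_id, constraint_name: CONSTRAINT_NAME
  end

  def down
    # No-op: the constraint itself is removed by reverting the migration from MR 1.
  end
end
```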
Run `bin/rails db:migrate`, then remove the following `ADD CONSTRAINT` statement from `structure.sql` and add the constraint to the table definition instead:

- Remove:

  ```sql
  ALTER TABLE packages_package_files
      ADD CONSTRAINT check_43773f06dc CHECK ((project_id IS NOT NULL)) NOT VALID;
  ```

- Add:

  ```sql
  CREATE TABLE packages_package_files (
      ...
      CONSTRAINT check_43773f06dc CHECK ((project_id IS NOT NULL)),
  );
  ```

Open the corresponding `db/docs/*.yml` file, in this case `db/docs/packages_package_files.yml`, remove the `desired_sharding_key` and `desired_sharding_key_migration_job_name` configuration, and add the `sharding_key`. Create the MR with the label `pipeline:skip-check-migrations`, as reverting this migration is intended to be a no-op.
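Following the pattern shown for small tables, the `sharding_key` entry in `db/docs/packages_package_files.yml` would presumably look like this (assuming `project_id` references the `projects` table):

```yaml
sharding_key:
  project_id: projects
```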
Pipelines might complain about a missing FK. You must add the FK to `allowed_to_be_missing_foreign_key` in `sharding_key_spec.rb`.
4. Debug Failures
1. Using Kibana
There will be certain cases where you get a failure notification after queuing the backfill job. One way to investigate is to use the Kibana logs.
Note: We only store Kibana logs for 7 days.
Let’s take the recent BackfillPushEventPayloadsProjectId BBM failure as an example.
Failures are also reported as a comment on the original backfill MR. Example: MR !183123.
We can also check the status of the job in the `#chat-ops-test` Slack channel, using `/chatops run batched_background_migrations list --job-class-name=<desired_sharding_key_migration_job_name>`. Let's figure out the reason for the failure using the Kibana dashboard.
Make sure the data view is set to `pubsub-sidekiq-inf-gprd*`. On the left side, you can see all the available fields. We only need `json.job_class_name` (that is, the desired sharding key migration job name) and `json.new_state: failed`. Let's add those filters to get the desired logs.

Set `json.job_class_name` to `BackfillPushEventPayloadsProjectId` (in this case) and `json.new_state` to `failed`, then apply the filters. Make sure to select the right time range; since this migration was reported as a failure a few days ago, I filter it to show only the last 7 days.
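If you prefer the search bar, the equivalent KQL query would look something like this (field values taken from the filters above):

```plaintext
json.job_class_name : "BackfillPushEventPayloadsProjectId" and json.new_state : "failed"
```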
After that you will see the desired logs with added filters.
Let's expand the logs and find `json.exception_message`. As you can see, this BBM failed due to `Sidekiq::Shutdown` 😡. To fix it, just requeue the migration.
2. Using Grafana
Sometimes you won't find anything in Kibana, since we only store logs for up to 7 days. In that case, we can use the Grafana dashboard.
Let’s take the recent BackfillApprovalMergeRequestRulesUsersProjectId BBM failure as an example.
We can also check the status of the job in the `#chat-ops-test` Slack channel, using `/chatops run batched_background_migrations list --job-class-name=<desired_sharding_key_migration_job_name>`. Let's check the Kibana dashboard: there are no logs for this job.

Let's go to the Grafana dashboard. Click on `Explore` and add a new query. The easiest way to debug sharding key failures is to check for a table size anomaly.
- Metric: `gitlab_component_utilization:pg_table_size_bytes:1h`
- Label filters: `env: gprd`, `type: patroni`, `relname: approval_merge_request_rules_users`
- Timeline: select at least a few days prior to the date of job creation. You can see in the MR that the failure was reported on 2025-03-31 and the job was created on 2025-03-11. I have selected the time range from 2025-03-01 to 2025-04-02. You can adjust it accordingly.
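Put together, the PromQL query would look something like this (assembled from the metric and label filters above):

```plaintext
gitlab_component_utilization:pg_table_size_bytes:1h{env="gprd", type="patroni", relname="approval_merge_request_rules_users"}
```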
After running the query, a graph is generated for the selected timeframe.
Let's make sense of this graph. The backfill job started on 2025-03-11; you can see a slight increase in table size starting on this date. This is very normal.
Let's look at the changes we added in the post-deployment migration. First we added the `prepare_async` index. Checking its size on postgres.ai, it's 10 GB. It was created on 2025-03-15 at 00:00, as we can see from the spike in the graph. Once the index is created, the backfill starts.
The BBM fails on 2025-03-29; you can see in the graph that the table size dropped at this point.
The index plus the column backfill increased the table size by approximately 20 GB compared to before the backfill, an increase of ~22%, from ~90 GB to ~110 GB 🫨.
We have a goal to keep all tables under 100 GB.
Update unresolved, closed issues
Some of the issues linked in the database YAML docs have been closed, sometimes in favor of new issues, but the YAML files still point to the original URL. You should update these to point to the correct items to ensure we’re accurately measuring progress.
Add more information to sharding issues
Every sharding issue should have an assignee, an associated milestone, and links to blockers, if applicable. This helps us plan the work and estimate completion dates, ensures each issue names someone to contact in case of problems or concerns, and helps us visualize the project work by highlighting blocker issues so we can help resolve them.
Note that a blocker can be a dependency. For example, the notes table needs to be fully migrated before other tables can proceed. Any downstream issues should mark the related item as a blocker to help us understand these relationships.