Pipeline data partitioning design
Disclaimer: The following contains information related to upcoming products, features, and functionality.
It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes.
As with all projects, the items mentioned in this document and linked pages are subject to change or delay. The development, release and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
What problem are we trying to solve?
We want to partition the CI/CD dataset because some of the database tables are extremely large. Even after we ship the CI/CD database decomposition, their size might make it challenging to scale single-node reads.
We want to reduce the risk of database performance degradation by transforming a few of the largest database tables into smaller ones using PostgreSQL declarative partitioning.
See more details about this effort in the parent blueprint.
How are CI/CD data decomposition, partitioning, and time-decay related?
CI/CD decomposition is an extraction of a CI/CD database cluster out of the “main” database cluster, to make it possible to have a different primary database receiving writes. The main benefit is doubling the capacity for writes and data storage. The new database cluster will not have to serve reads / writes for non-CI/CD database tables, so this offers some additional capacity for reads too.
CI/CD partitioning is dividing large CI/CD database tables into smaller ones. This will improve read capacity on every CI/CD database node, because it is much less expensive to read data from small tables than from large, multi-terabyte tables. We can add more CI/CD database replicas to better handle the increase in the number of SQL queries that read data, but we need partitioning to make a single read more efficient. Performance in other aspects will improve too, because PostgreSQL is more efficient at maintaining many small tables than one very large database table.
CI/CD time-decay allows us to benefit from the strong time-decay characteristics of pipeline data. It can be implemented in many different ways, but using partitioning to implement time-decay might be especially beneficial. When implementing time decay we usually mark data as archived and migrate it out of the database to a different place once it is no longer relevant or needed. Our dataset is extremely large (tens of terabytes), so moving such a high volume of data is challenging. When time-decay is implemented using partitioning, we can archive an entire partition (or set of partitions) by simply updating a single record in one of our database tables. It is one of the least expensive ways to implement time-decay patterns at a database level.
Why do we need to partition CI/CD data?
We need to partition CI/CD data because our database tables storing pipelines, builds, and artifacts are too large. The ci_builds database table size is currently around 2.5 TB with an index of around 1.4 GB. This is too much and violates our principle of a 100 GB maximum table size. We also want to build alerting to notify us when this limit is exceeded.
We’ve seen numerous S1 and S2 database-related production environment incidents over the last couple of months, for example:

- S1: 2022-03-17 Increase in writes in ci_builds table
- S1: 2021-11-22 Excessive buffer read in replicas for ci_job_artifacts
- S2: 2022-04-12 Transactions detected that have been running for more than 10m
- S2: 2022-04-06 Database contention plausibly caused by excessive ci_builds reads
- S2: 2022-03-18 Unable to remove a foreign key on ci_builds
We have approximately 50 ci_* prefixed database tables, and some of them would benefit from partitioning.
A simple SQL query to get this data:
```sql
WITH tables AS (
  SELECT table_name FROM information_schema.tables WHERE table_name LIKE 'ci_%'
)
SELECT table_name,
       pg_size_pretty(pg_total_relation_size(quote_ident(table_name))) AS total_size,
       pg_size_pretty(pg_relation_size(quote_ident(table_name))) AS table_size,
       pg_size_pretty(pg_indexes_size(quote_ident(table_name))) AS index_size,
       pg_total_relation_size(quote_ident(table_name)) AS total_size_bytes
FROM tables ORDER BY total_size_bytes DESC;
```
See data from March 2022:
| Table name | Total size | Index size |
|---|---|---|
| ci_builds | 3.5 TB | 1 TB |
| ci_builds_metadata | 1.8 TB | 150 GB |
| ci_job_artifacts | 600 GB | 300 GB |
| ci_pipelines | 400 GB | 300 GB |
| ci_stages | 200 GB | 120 GB |
| ci_pipeline_variables | 100 GB | 20 GB |
| (…around 40 more) | | |
Based on the table above, it is clear that there are tables with a lot of stored data.
While we have almost 50 CI/CD-related database tables, we are initially interested in partitioning only 6 of them. We can start by partitioning the most interesting tables in an iterative way, but we should also have a strategy for partitioning the remaining ones if needed. This document is an attempt to capture that strategy and describe as many details as possible, to share this knowledge among engineering teams.
How do we want to partition CI/CD data?
We want to partition the CI/CD tables in iterations: it might not be feasible to partition all of the 6 initial tables at once. We also want to have a strategy for partitioning the remaining database tables when it becomes necessary.
It is also important to avoid large data migrations. We store almost 6 terabytes of data in the biggest CI/CD tables, in many different columns and indexes. Migrating this amount of data might be challenging and could cause instability in the production environment. Because of this concern, we’ve developed a way to attach an existing database table as partition zero without downtime or excessive database locking, which has been demonstrated in one of the first proofs of concept.

This makes it possible to create a partitioned schema without downtime (for example, using a routing table p_ci_pipelines), by attaching an existing ci_pipelines table as partition zero without exclusive locking. It will be possible to use the legacy table as usual, but we can create the next partition when needed, and the p_ci_pipelines table will be used to route queries. To use the routing table we need to find a good partitioning key.
Our plan is to use logical partition IDs. We want to start with the ci_pipelines table and create a partition_id column with a DEFAULT value of 100 or 1000. Using a DEFAULT value avoids the challenge of backfilling this value for every row. Adding a CHECK constraint prior to attaching the first partition tells PostgreSQL that we’ve already ensured consistency and there is no need to check it while holding an exclusive table lock when attaching this table as a partition to the routing table (the partitioned schema definition). We will increment this value every time we create a new partition for p_ci_pipelines, and the partitioning strategy will be LIST partitioning.
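A minimal SQL sketch of this approach, assuming a simplified schema and the example DEFAULT value of 100 used above (constraint names and column types are illustrative assumptions, not the final migration):

```sql
-- Add the logical partition key to the existing table. A DEFAULT avoids
-- backfilling the value for existing rows.
ALTER TABLE ci_pipelines
  ADD COLUMN partition_id bigint NOT NULL DEFAULT 100;

-- Add the CHECK constraint as NOT VALID first, then validate it separately,
-- so that validation does not hold an exclusive lock on the table.
ALTER TABLE ci_pipelines
  ADD CONSTRAINT partitioning_constraint CHECK (partition_id = 100) NOT VALID;
ALTER TABLE ci_pipelines VALIDATE CONSTRAINT partitioning_constraint;

-- Create the routing table using LIST partitioning.
CREATE TABLE p_ci_pipelines (
  LIKE ci_pipelines INCLUDING DEFAULTS
) PARTITION BY LIST (partition_id);

-- Attach the existing table as partition zero. Because the validated CHECK
-- constraint already proves that every row has partition_id = 100,
-- PostgreSQL does not need to scan the table while holding the lock.
ALTER TABLE p_ci_pipelines
  ATTACH PARTITION ci_pipelines FOR VALUES IN (100);
```

One caveat worth noting: primary keys and unique indexes on a partitioned table must include the partition key, so the real migration also has to adjust those, which the sketch above omits.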
We will also create a partition_id column in the other initial 6 database tables we want to iteratively partition. After a new pipeline is created, it will get a partition_id assigned, and all the related resources, like builds and artifacts, will share the same value. We want to add the partition_id column to all 6 problematic tables because we can then avoid backfilling this data when we decide it is time to start partitioning them.
We want to partition CI/CD data iteratively, so we will start with the pipelines table, and create at least one, but likely two, partitions. The pipelines table will be partitioned using the LIST partitioning strategy. It is possible that, after some time, p_ci_pipelines will store data in two partitions with IDs of 100 and 101. Then we will try partitioning ci_builds. Therefore we might want to use RANGE partitioning in p_ci_builds with IDs 100 and 101, because builds for the two logical partitions used will still be stored in a single table.
Physical partitioning and logical partitioning will be separated, and a strategy will be determined when we implement partitioning for the respective database tables. Using RANGE partitioning works similarly to using LIST partitioning in database tables other than ci_pipelines, but because we can guarantee continuity of partition_id values, using RANGE partitioning might be a better strategy.
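For illustration only, a hypothetical sketch of attaching ci_builds with RANGE partitioning, so that builds belonging to logical partitions 100 and 101 keep living in a single physical table (the same partition_id column and constraints as above are assumed):

```sql
-- Routing table for builds, partitioned by RANGE on the logical partition ID.
CREATE TABLE p_ci_builds (
  LIKE ci_builds INCLUDING DEFAULTS
) PARTITION BY RANGE (partition_id);

-- The upper bound of a RANGE partition is exclusive, so this single physical
-- partition covers partition_id values 100 and 101.
ALTER TABLE p_ci_builds
  ATTACH PARTITION ci_builds FOR VALUES FROM (100) TO (102);
```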
Why do we want to use explicit logical partition ids?
Partitioning CI/CD data using a logical partition_id has several benefits. We could partition by a primary key, but this would introduce much more complexity and additional cognitive load required to understand how the data is structured and stored in partitions.
CI/CD data is hierarchical data. Stages belong to pipelines, builds belong to stages, artifacts belong to builds (with rare exceptions). We are designing a partitioning strategy that reflects this hierarchy, to reduce the complexity and therefore the cognitive load for contributors. With an explicit partition_id associated with a pipeline, we can cascade the partition ID number when retrieving all resources associated with a pipeline. We know that for a pipeline 12345 with a partition_id of 102, we are always able to find associated resources in logical partitions with number 102 in the other routing tables, and PostgreSQL will know in which partitions these records are stored for every table.
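As an illustration, retrieving the builds of that pipeline could look like the sketch below; commit_id is assumed here to be the foreign key pointing from builds to their pipeline, and the literal values come from the example above:

```sql
-- Filtering on the partition key lets PostgreSQL prune every partition of
-- p_ci_builds except the one that holds logical partition 102.
SELECT *
FROM p_ci_builds
WHERE partition_id = 102
  AND commit_id = 12345;
```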
Another interesting benefit of using a single and incremental latest partition_id number, associated with pipelines, is that in theory we can cache it in Redis or in memory to avoid excessive reads from the database to find this number, though we might not need to do this.

The single and uniform partition_id value for pipeline data gives us more choices later on than primary-keys-based partitioning.
Splitting large partitions into smaller ones
We want to start with the initial partition_id number 100 (or higher, like 1000, depending on our calculations and estimations). We do not want to start from 1, because the existing tables are already large, and we might want to split them into smaller partitions. If we start with 100, we will be able to create partitions for partition_id of 1, 20, 45, and move existing records there by updating partition_id from 100 to a smaller number.
PostgreSQL will move these records into their respective partitions in a consistent way, provided that we do it in a transaction for all pipeline resources at the same time. If we ever decide to split large partitions into smaller ones (it’s not yet clear if we will need to do this), we might be able to just use background migrations to update partition IDs, and PostgreSQL is smart enough to move rows between partitions on its own.
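A hypothetical sketch of such a move for a single pipeline, assuming commit_id is the builds-to-pipeline foreign key and that a partition covering partition_id = 45 already exists:

```sql
BEGIN;
-- Updating the partition key makes PostgreSQL physically move each row into
-- the partition that matches the new value.
UPDATE p_ci_pipelines SET partition_id = 45 WHERE id = 12345 AND partition_id = 100;
UPDATE p_ci_builds    SET partition_id = 45 WHERE commit_id = 12345 AND partition_id = 100;
-- ...and the same for the remaining pipeline resources...
COMMIT;
```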
Storing partitions metadata in the database
In order to build an efficient mechanism that will be responsible for creating new partitions, and to implement time decay, we want to introduce a partitioning metadata table, called ci_partitions. In that table we would store metadata about all the logical partitions, with many pipelines per partition. We may need to store a range of pipeline IDs per logical partition. Using it we will be able to find the partition_id number for a given pipeline ID, and we will also find information about which logical partitions are “active” or “archived”, which will help us to implement a time-decay pattern using database declarative partitioning.
The ci_partitions table will store information about a partition identifier, the range of pipeline IDs it is valid for, and whether the partition has been archived or not. Additional columns with timestamps may be helpful too.
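A sketch of what ci_partitions could look like, based only on the columns described above; names and types are assumptions rather than the final schema:

```sql
CREATE TABLE ci_partitions (
  id bigint PRIMARY KEY,                   -- the logical partition ID, for example 100
  pipeline_id_from bigint NOT NULL,        -- first pipeline ID stored in this logical partition
  pipeline_id_to bigint,                   -- last pipeline ID, NULL while the partition is still open
  active boolean NOT NULL DEFAULT true,    -- set to false once the partition is archived
  created_at timestamptz NOT NULL DEFAULT now(),
  archived_at timestamptz
);
```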
Implementing a time-decay pattern using partitioning
We can use ci_partitions to implement a time-decay pattern using declarative partitioning. By telling PostgreSQL which logical partitions are archived, we can stop reading from these partitions using a SQL query like the one below.
```sql
SELECT * FROM ci_builds WHERE partition_id IN (
  SELECT id FROM ci_partitions WHERE active = true
);
```
This query makes it possible to limit the number of partitions we read from, and therefore cuts off access to “archived” pipeline data, in line with our data retention policy for CI/CD data. Ideally, we do not want to read from more than two partitions at once, so we need to align the automatic partitioning mechanisms with the time-decay policy. We will still need to implement new access patterns for the archived data, presumably through the API, but the cost of storing archived data in PostgreSQL will be reduced significantly this way.
There are some technical details here that are out of the scope of this description, but by using this strategy we can “archive” data, and make it much less expensive to reside in our PostgreSQL cluster by simply toggling a boolean column value.
Accessing partitioned data
It will be possible to access partitioned data whether it has been archived or
not, in most places in GitLab. On a merge request page, we will always show
pipeline details even if the merge request was created years ago. We can do
that because