- Attempt B: Denormalize using an array column
GitLab has labels that can be assigned to issues,
merge requests, and epics. Labels on those objects are a many-to-many relation
through the polymorphic
To filter these objects by multiple labels - for instance, ‘all open
issues with the label ~Plan and the label ~backend’ - we generate a
query containing a
GROUP BY clause. In a simple form, this looks like:
SELECT issues.* FROM issues INNER JOIN label_links ON label_links.target_id = issues.id AND label_links.target_type = 'Issue' INNER JOIN labels ON labels.id = label_links.label_id WHERE issues.project_id = 13083 AND (issues.state IN ('opened')) AND labels.title IN ('Plan', 'backend') GROUP BY issues.id HAVING (COUNT(DISTINCT labels.title) = 2) ORDER BY issues.updated_at DESC, issues.id DESC LIMIT 20 OFFSET 0
In particular, note that:
GROUP BY issues.idso that we can …
- Use the
HAVING (COUNT(DISTINCT labels.title) = 2)condition to ensure that all matched issues have both labels.
This is more complicated than is ideal. It makes the query construction more prone to errors (such as issue #15557).
WHERE (EXISTS ( SELECT TRUE FROM label_links INNER JOIN labels ON labels.id = label_links.label_id WHERE labels.title = 'Plan' AND target_type = 'Issue' AND target_id = issues.id)) AND (EXISTS ( SELECT TRUE FROM label_links INNER JOIN labels ON labels.id = label_links.label_id WHERE labels.title = 'backend' AND target_type = 'Issue' AND target_id = issues.id))
While this worked without schema changes, and did improve readability somewhat, it did not improve query performance.
In merge request #34503, we followed a similar approach to A1. But this time, we
did a separate query to fetch the IDs of the labels used in the filter so that we avoid the
JOIN in the
EXISTS clause and filter directly by
label_links.label_id. We also added a new index on
label_links for the
target_type columns to speed up this query.
Finding the label IDs wasn’t straightforward because there could be multiple labels with the same title within a single root namespace. We solved
this by grouping the label IDs by title and then using the array of IDs in the
This resulted in a significant performance improvement. However, this optimization could not be applied to the dashboard pages where we do not have a project or group context. We could not easily search for the label IDs here because that would mean searching across all projects and groups that the user has access to.
Having removed MySQL support in GitLab 12.1,
using PostgreSQL’s arrays became more
tractable as we didn’t have to support two databases. We discussed denormalizing
label_links table for querying in
with two options: label IDs and titles.
We can think of both of those as array columns on
issues.label_ids would be an array column of label IDs, and
issues.label_titles would be an array of label titles.
These array columns can be complemented with GIN indexes to improve matching.
This has some strong advantages over titles:
- Unless a label is deleted, or a project is moved, we never need to bulk-update the denormalized column.
- It uses less storage than the titles.
Unfortunately, our application design makes this hard. If we were able to query
just by label ID easily, we wouldn’t need the
INNER JOIN labels in the initial
query at the start of this document. GitLab allows users to filter by label
title across projects and even across groups, so a filter by the label ~Plan may
include labels with multiple distinct IDs.
We do not want users to have to know about the different IDs, which means that given this data set:
|Project||~Plan label ID||~backend label ID|
We would need something like:
WHERE label_ids @> ARRAY[11, 12] OR label_ids @> ARRAY[21, 22] OR label_ids @> ARRAY[31, 32]
This can get even more complicated when we consider that in some cases, there might be two ~backend labels - with different IDs - that could apply to the same object, so the number of combinations would balloon further.
From the perspective of updating the object, this is the worst option. We have to bulk update the objects when:
- The objects are moved from one project to another.
- The project is moved from one group to another.
- The label is renamed.
- The label is deleted.
It also uses much more storage. Querying is simple, though:
WHERE label_titles @> ARRAY['Plan', 'backend']
And our tests in issue #49651 showed that this could be fast.
However, at present, the disadvantages outweigh the advantages.
We found a method A2 that does not need denormalization and improves the query performance significantly. This
did not apply to all cases, but we were able to apply method A1 to the rest of the cases so that we remove the
GROUP BY and
HAVING clauses in all scenarios.
This simplified the query and improved the performance in the most common cases.