# Embeddings

Embeddings are a way of representing data in a vectorised format, making it easy and efficient to find similar documents.

Currently, embeddings are generated only for issues, which enables features such as semantic issue search.

## Architecture

Embeddings are stored in Elasticsearch, which is also used for Advanced Search.

```mermaid
graph LR
    A[database record] --> B[ActiveRecord callback]
    B --> C[build embedding reference]
    C -->|add to queue| N[queue]
    E[cron worker every minute] <-->|pull from queue| N
    E --> G[deserialize reference]
    G --> H[generate embedding]
    H <--> I[AI Gateway]
    I <--> J[Vertex API]
    H --> K[upsert document with embedding]
    K --> L[Elasticsearch]
```

The process is driven by `Search::Elastic::ProcessEmbeddingBookkeepingService`, which adds references to and pulls them from a Redis queue.
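
The bookkeeping pattern can be sketched as follows. This is a simplified illustration with hypothetical names: the real service serializes references into a Redis data structure, while this sketch uses an in-memory array as a stand-in.

```ruby
# Sketch of the bookkeeping pattern behind
# Search::Elastic::ProcessEmbeddingBookkeepingService.
# The in-memory array stands in for the Redis queue.
class EmbeddingBookkeepingSketch
  def initialize
    @queue = [] # stand-in for the Redis queue
  end

  # Serialize a record into a lightweight reference and enqueue it.
  def track!(klass, id)
    @queue << "#{klass}|#{id}"
  end

  # Pull up to `limit` references off the queue for processing.
  def pull(limit)
    @queue.shift(limit).map do |ref|
      klass, id = ref.split('|')
      { class_name: klass, id: Integer(id) }
    end
  end

  def queue_size
    @queue.size
  end
end
```

Storing small references rather than full records keeps the queue cheap; the cron worker re-fetches the record when it actually generates the embedding.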

### Adding to the embedding queue

The following process description uses issues as an example.

An issue embedding is generated from the content `"issue with title '#{issue.title}' and description '#{issue.description}'"`.
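
For example, the content string can be built with plain interpolation (a simple `Struct` stands in for the `Issue` model here):

```ruby
# Stand-in for the Issue model, for illustration only.
Issue = Struct.new(:title, :description)

# Build the embedding content string from the template above.
def embedding_content(issue)
  "issue with title '#{issue.title}' and description '#{issue.description}'"
end

issue = Issue.new('Fix login bug', 'Users cannot sign in with SSO')
embedding_content(issue)
# => "issue with title 'Fix login bug' and description 'Users cannot sign in with SSO'"
```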

`Search::Elastic::IssuesSearch` defines ActiveRecord callbacks that add an embedding reference to the embedding queue when an issue is created, or when its title or description is updated, provided that embedding generation is available for the issue.
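
The enqueue decision can be sketched as a pure function (a plain-Ruby approximation; the real callbacks rely on ActiveRecord's dirty tracking rather than an explicit `changed_fields` list):

```ruby
# Sketch of the enqueue decision made in the ActiveRecord callback.
# `created` and `changed_fields` stand in for the callback context
# and dirty tracking; `embedding_available` stands in for the
# availability check.
def enqueue_embedding_reference?(created:, changed_fields:, embedding_available:)
  return false unless embedding_available

  created || (changed_fields & %i[title description]).any?
end

enqueue_embedding_reference?(created: true,  changed_fields: [],        embedding_available: true) # => true
enqueue_embedding_reference?(created: false, changed_fields: [:title],  embedding_available: true) # => true
enqueue_embedding_reference?(created: false, changed_fields: [:labels], embedding_available: true) # => false
```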

### Pulling from the embedding queue

A `Search::ElasticIndexEmbeddingBulkCronWorker` cron worker runs every minute and does the following:

```mermaid
graph LR
    A[cron] --> B{endpoint throttled?}
    B -->|no| C[schedule 16 workers]
    C -.->|each worker| D{endpoint throttled?}
    D -->|no| E[fetch 19 references from queue]
    E -.->|each reference| F[increment endpoint]
    F --> G{endpoint throttled?}
    G -->|no| H[call AI Gateway to generate embedding]
```

This ensures that the rate-limit setting of 450 embeddings per minute is never exceeded, even with 16 processes generating embeddings concurrently.
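
A worst-case cron run makes 16 workers × 19 references = 304 embedding calls, comfortably under the 450-per-minute limit; the shared counter check catches anything beyond that. A sketch of the counter-based throttle (hypothetical names; in production the counter would live in Redis with a per-minute window, while a plain integer stands in here):

```ruby
# Sketch of the shared-counter throttle checked at each stage of
# the cron worker flow above.
class EmbeddingThrottle
  LIMIT_PER_MINUTE = 450

  def initialize(limit: LIMIT_PER_MINUTE)
    @limit = limit
    @count = 0
  end

  # True once the per-minute budget is exhausted.
  def throttled?
    @count >= @limit
  end

  # Incremented before each embedding call.
  def increment!
    @count += 1
  end
end

throttle = EmbeddingThrottle.new
(16 * 19).times { throttle.increment! } # one full cron run: 304 calls
throttle.throttled? # => false
```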

### Backfilling

An Advanced Search migration performs the backfill. It adds references to the queue in batches, which are then processed by the cron worker as described above.
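
A minimal sketch of that batching pattern (hypothetical method names; the real migration enqueues serialized references rather than bare IDs):

```ruby
# Sketch of a backfill that enqueues records in batches. The cron
# worker then drains the queue at the rate limit over time.
def backfill_in_batches(ids, batch_size: 1000)
  ids.each_slice(batch_size) do |batch|
    batch.each { |id| yield id } # enqueue an embedding reference
  end
end

queue = []
backfill_in_batches((1..2500).to_a, batch_size: 1000) { |id| queue << id }
queue.size # => 2500
```

Because only references are enqueued, the migration itself finishes quickly; the expensive embedding generation happens asynchronously and stays within the rate limit.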

## Adding a new embedding type

The following process outlines the steps to get embeddings generated and stored in Elasticsearch.

  1. Do a cost and resource calculation to see if the Elasticsearch cluster can handle embedding generation or if it needs additional resources.
  2. Decide where to store embeddings. Look at the existing indices in Elasticsearch and if there isn’t a suitable existing index, create a new index.
  3. Add embedding fields to the index: example.
  4. Update the way content is generated to accommodate the new type.
  5. Add a new unit primitive: here and here.
  6. Use `Elastic::ApplicationVersionedSearch` to access callbacks and add the necessary checks for when to generate embeddings. See `Search::Elastic::IssuesSearch` for an example.
  7. Backfill embeddings: example.

## Adding issue embeddings locally

### Prerequisites

  1. Make sure Elasticsearch is running.
  2. If you have an existing Elasticsearch setup, make sure the `AddEmbeddingToIssues` migration has completed by running the following until it returns:

    Elastic::MigrationWorker.new.perform
    
  3. Make sure you can run GitLab Duo features on your local environment.
  4. Ensure that running the following in a Rails console outputs an embedding (a vector of 768 dimensions). If it does not, there is a problem with the AI setup.

    Gitlab::Llm::VertexAi::Embeddings::Text.new('text', user: nil, tracking_context: {}, unit_primitive: 'semantic_search_issue').execute
    

### Running the backfill

To backfill embeddings for a project's issues, run the following in a Rails console:

Gitlab::Duo::Developments::BackfillIssueEmbeddings.execute(project_id: project_id)

The task adds the issues to a queue and processes them in batches, indexing embeddings into Elasticsearch. It respects the rate limit of 450 embeddings per minute. Reach out to @maddievn or #g_global_search in Slack if you run into any problems.

### Verify

If the following returns 0, all issues for the project have embeddings:

```shell
curl "http://localhost:9200/gitlab-development-issues/_count" \
  --header "Content-Type: application/json" \
  --data '{"query": {"bool": {"filter": [{"term": {"project_id": PROJECT_ID}}], "must_not": [{"exists": {"field": "embedding"}}]}}}' | jq '.count'
```

Replace `PROJECT_ID` with your project ID.
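
The query counts the project's issue documents that do not yet have an `embedding` field. The same body can be built in Ruby, which also clarifies its structure (a sketch; the field and index names match the `curl` command above):

```ruby
require 'json'

# Builds the verification query used above: a bool query that filters
# to one project and excludes documents that already have an
# `embedding` field, so the count is the number of issues still
# missing embeddings.
def missing_embedding_query(project_id)
  {
    query: {
      bool: {
        filter: [{ term: { project_id: project_id } }],
        must_not: [{ exists: { field: 'embedding' } }]
      }
    }
  }
end

puts JSON.generate(missing_embedding_query(42))
```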