Embeddings

Embeddings are a way of representing data in a vectorised format, making it easy and efficient to find similar documents.

Currently, embeddings are generated only for issues.

Architecture

Embeddings are stored in Elasticsearch, which is also used for Advanced Search.

The flow has two halves:

  1. Add to queue: database record → ActiveRecord callback → build embedding reference → queue.
  2. Pull from queue: queue → cron worker (every minute) → deserialize reference → generate embedding (AI gateway → Vertex API) → upsert document with embedding → Elasticsearch.

The process is driven by Search::Elastic::ProcessEmbeddingBookkeepingService, which adds references to and pulls references from a Redis queue.
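The add and pull steps can be sketched roughly as follows. This is a minimal illustration, not the service's actual code: an in-memory array stands in for the Redis queue, and the EmbeddingReference struct, its "Class|id" string format, and the EmbeddingQueue class are all hypothetical.

```ruby
# Hypothetical reference: identifies a record by class name and ID.
# The "Class|id" string format is an assumption made for this sketch.
EmbeddingReference = Struct.new(:klass, :id) do
  # Turn the reference into a string that can be stored in a queue.
  def serialize
    "#{klass}|#{id}"
  end

  # Rebuild a reference from its serialized string.
  def self.deserialize(string)
    klass, id = string.split('|', 2)
    new(klass, Integer(id))
  end
end

# In-memory stand-in for the Redis queue (hypothetical).
class EmbeddingQueue
  def initialize
    @items = []
  end

  # Add a serialized reference to the end of the queue.
  def add(reference)
    @items.push(reference.serialize)
  end

  # Remove up to `count` items from the front and deserialize them.
  def pull(count)
    @items.shift(count).map { |item| EmbeddingReference.deserialize(item) }
  end

  def size
    @items.size
  end
end

queue = EmbeddingQueue.new
queue.add(EmbeddingReference.new('Issue', 42))
puts queue.pull(1).first.serialize
```

Storing only a small reference in the queue, rather than the record or the embedding itself, keeps the queue cheap; the record is looked up again when the reference is processed.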

Adding to the embedding queue

The following process description uses issues as an example.

An issue embedding is generated from the content "issue with title '#{issue.title}' and description '#{issue.description}'".
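As a minimal sketch of how that content string is assembled, assuming a plain Struct stands in for the real issue model (SketchIssue and embedding_content are hypothetical names, not GitLab code):

```ruby
# Hypothetical stand-in for the issue model; only the two fields
# that feed the embedding are included.
SketchIssue = Struct.new(:title, :description, keyword_init: true)

# Build the text sent for embedding generation, following the
# template quoted in the paragraph above.
def embedding_content(issue)
  "issue with title '#{issue.title}' and description '#{issue.description}'"
end

issue = SketchIssue.new(title: 'Fix login', description: 'Login fails with SSO')
puts embedding_content(issue)
```

Because only the title and description appear in the content, changes to other fields do not alter the embedding.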

Using ActiveRecord callbacks defined in Search::Elastic::IssuesSearch, an embedding reference is added to the embedding queue when an issue is created or when its title or description is updated, provided that embedding generation is available for the issue.
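The condition described above can be sketched as a small predicate. This is an illustration only: the method name, its keyword arguments, and the EMBEDDING_FIELDS constant are hypothetical and do not reflect the actual callback code.

```ruby
# Fields whose changes should trigger a new embedding (from the text above).
EMBEDDING_FIELDS = %w[title description].freeze

# Returns true when an embedding reference should be queued.
#   created:        whether the record was just created
#   changed_fields: names of the fields changed by this save
#   available:      whether embedding generation is available for the issue
def queue_embedding?(created:, changed_fields:, available:)
  return false unless available

  created || changed_fields.any? { |field| EMBEDDING_FIELDS.include?(field) }
end

puts queue_embedding?(created: false, changed_fields: %w[title], available: true)
```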

Pulling from the embedding queue

A Search::ElasticIndexEmbeddingBulkCronWorker cron worker runs every minute and does the following:

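  1. The cron worker checks whether the endpoint is throttled. If not, it schedules 16 workers.
  2. Each worker checks whether the endpoint is throttled. If not, it fetches 19 references from the queue.
  3. For each reference, the worker increments the endpoint counter and checks whether the endpoint is throttled. If not, it calls the AI gateway to generate the embedding.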

Because the throttle is checked before every step, the rate limit setting of 450 embeddings per minute is never exceeded, even with 16 concurrent processes generating embeddings at the same time.
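The throttling checks can be simulated with a short sketch. It makes several assumptions: the counter is a simple in-memory object (SketchThrottle is hypothetical), "throttled" means the counter has gone above the limit, the workers run one after another rather than concurrently, and references fetched after the limit is hit are simply dropped, whereas a real implementation would need to keep them for a later run.

```ruby
RATE_LIMIT = 450            # embeddings per minute (from the text)
WORKERS = 16                # workers scheduled per cron run
REFERENCES_PER_WORKER = 19  # references fetched by each worker

# Hypothetical in-memory counter standing in for the endpoint rate limiter.
class SketchThrottle
  attr_reader :count

  def initialize(limit)
    @limit = limit
    @count = 0
  end

  def increment
    @count += 1
  end

  # Assumption: throttled once the counter exceeds the limit.
  def throttled?
    @count > @limit
  end
end

# Simulate one cron run; returns how many embeddings were generated.
def run_cron(queue, throttle)
  generated = 0
  return generated if throttle.throttled?   # check made by the cron worker

  WORKERS.times do
    break if throttle.throttled?            # check made by each worker

    queue.shift(REFERENCES_PER_WORKER).each do |_reference|
      throttle.increment
      break if throttle.throttled?          # check made for each reference

      generated += 1                        # stands in for the AI gateway call
    end
  end

  generated
end

throttle = SketchThrottle.new(RATE_LIMIT)
puts run_cron((1..1000).to_a, throttle)
```

Starting from an empty counter, a single run generates at most 16 × 19 = 304 embeddings, which is below the limit of 450; the per-reference check matters when the counter already holds calls from earlier in the same minute.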

Backfilling

An Advanced Search migration performs the backfill. It adds references to the queue in batches, which the cron worker then processes as described above.
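The batching idea can be sketched as follows. This is not the migration's code: the method name, the batch size, and the use of a plain array as the queue are assumptions made for illustration.

```ruby
# Hypothetical batch size; the real migration's value is not stated here.
BACKFILL_BATCH_SIZE = 3

# Add record IDs to the queue one batch at a time.
# Returns the number of batches that were queued.
def backfill_in_batches(record_ids, queue, batch_size: BACKFILL_BATCH_SIZE)
  batches = 0
  record_ids.each_slice(batch_size) do |batch|
    queue.concat(batch)   # stands in for adding references to the Redis queue
    batches += 1
  end
  batches
end

queue = []
puts backfill_in_batches((1..7).to_a, queue)
```

Queueing in batches keeps each step small, and the rate limiting described earlier still applies because the cron worker does the actual embedding generation.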

Adding a new embedding type

The following process outlines the steps to get embeddings generated and stored in Elasticsearch.

  1. Do a cost and resource calculation to see if the Elasticsearch cluster can handle embedding generation or if it needs additional resources.
  2. Decide where to store embeddings. Look at the existing indices in Elasticsearch and if there isn’t a suitable existing index, create a new index.
  3. Add embedding fields to the index: example.
  4. Update the way content is generated to accommodate the new type.
  5. Add a new unit primitive: here and here.
  6. Use Elastic::ApplicationVersionedSearch to access callbacks and add the necessary checks for when to generate embeddings. See Search::Elastic::IssuesSearch for an example.
  7. Backfill embeddings: example.

Adding work item embeddings locally

Prerequisites

  1. Make sure Elasticsearch is running.
  2. If you have an existing Elasticsearch setup, make sure the AddEmbeddingToWorkItems migration has been completed by executing the following until it returns:

    Elastic::MigrationWorker.new.perform
    
  3. Make sure you can run GitLab Duo features on your local environment.
  4. Make sure that running the following in a Rails console outputs an embedding (a vector of 768 dimensions). If it does not, there is a problem with the AI setup.

    Gitlab::Llm::VertexAi::Embeddings::Text.new('text', user: nil, tracking_context: {}, unit_primitive: 'semantic_search_issue').execute
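
To check the shape of the result rather than reading it by eye, a small helper like the one below could be used. It is a sketch under the assumption that the call returns a Ruby Array of numbers; valid_embedding? is a hypothetical helper, not part of GitLab.

```ruby
EXPECTED_DIMENSIONS = 768   # vector size stated in the prerequisite above

# Returns true when the value looks like an embedding of the expected size.
def valid_embedding?(vector, dimensions: EXPECTED_DIMENSIONS)
  vector.is_a?(Array) &&
    vector.size == dimensions &&
    vector.all? { |value| value.is_a?(Numeric) }
end

puts valid_embedding?(Array.new(768, 0.0))
```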
    

Running the backfill

To backfill work item embeddings for a project’s work items, run the following in a Rails console:

    Gitlab::Duo::Developments::BackfillWorkItemEmbeddings.execute(project_id: project_id)

The task adds the work items to a queue and processes them in batches, indexing embeddings into Elasticsearch. It respects a rate limit of 450 embeddings per minute. Reach out to #g_global_search in Slack if there are any issues.

Verify

If the following returns 0, all work items for the project have embeddings:

    curl "http://localhost:9200/gitlab-development-work_items/_count" \
    --header "Content-Type: application/json" \
    --data '{"query": {"bool": {"filter": [{"term": {"project_id": PROJECT_ID}}], "must_not": [{"exists": {"field": "embedding_0"}}]}}}' | jq '.count'

Replace PROJECT_ID with your project ID.
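To avoid hand-editing the JSON, the same request body can be built in Ruby. The sketch below only constructs the body and does not send the request; missing_embedding_query is a hypothetical helper name.

```ruby
require 'json'

# Build the _count query body: documents in the given project
# that do not have the embedding_0 field.
def missing_embedding_query(project_id)
  {
    query: {
      bool: {
        filter: [{ term: { project_id: project_id } }],
        must_not: [{ exists: { field: 'embedding_0' } }]
      }
    }
  }
end

puts JSON.generate(missing_embedding_query(1))
```

The printed JSON can be passed to the curl command's --data option in place of the inline string.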