Model Migration Process

Introduction

Large language models are constantly evolving, and GitLab needs to regularly update its AI features to support newer models. This guide provides a structured approach for migrating AI features to new models while maintaining stability and reliability.

Purpose

Provide a comprehensive guide for migrating AI models within GitLab.

Expected Duration

Model migrations typically follow these general timelines:

  • Simple Model Updates (Same Provider): 2-3 weeks
    • Example: Upgrading from Claude Sonnet 3.5 to 3.6
    • Involves model validation, testing, and staged rollout
    • Primary focus on maintaining stability and performance
    • Can sometimes be expedited when urgent, but 2 weeks is standard
  • Complex Migrations: 1-2 months (full milestone or longer)
    • Example: Adding support for a new provider like AWS Bedrock
    • Example: Major version upgrades with breaking changes (e.g., Claude 2 to 3)
    • Requires significant API integration work
    • May need infrastructure changes
    • Extensive testing and validation required

Timeline Factors

Several factors can impact migration timelines:

  • Current system stability and recent incidents
  • Resource availability and competing priorities
  • Complexity of behavioral changes in the new model
  • Scale of testing required
  • Feature flag rollout strategy

Best Practices

  • Always err on the side of caution with initial timeline estimates
  • Use feature flags for gradual rollouts to minimize risk
  • Plan for buffer time to handle unexpected issues
  • Communicate conservative timelines externally while working to deliver faster
  • Prioritize system stability over speed of deployment
Note: While some migrations can technically be completed quickly, we typically plan for longer timelines to ensure proper testing and staged rollouts. This approach helps maintain system stability and reliability.

Scope

Applicable to all AI model-related teams at GitLab. We currently support using Anthropic and Google Vertex models. Support for AWS Bedrock models is proposed in issue 498119.

Prerequisites

Before starting a model migration:

  • Create an issue under the AI Model Version Migration Initiative epic with the following:
    • Label with group::ai framework
    • Document any known behavioral changes or improvements in the new model
    • Include any breaking changes or compatibility issues
    • Reference any model provider documentation about the changes
  • Verify the new model is supported in our current AI-Gateway API specification by:
    • Check model definitions in AI gateway:
      • For LiteLLM models: ai_gateway/models/v2/container.py
      • For Anthropic models: ai_gateway/models/anthropic.py
      • For new providers: Create a new model definition file in ai_gateway/models/
    • Verify model configurations:
      • Model enum definitions
      • Stop tokens
      • Timeout settings
      • Completion type (text or chat)
      • Max token limits
    • Test the model locally in AI gateway.
    • If the model isn’t supported, create an issue in the AI gateway repository to add support.
    • Review the provider’s API documentation for any breaking changes.
  • Ensure you have access to testing environments and monitoring tools
  • Complete model evaluation using the Prompt Library
Note: Documentation of model changes is crucial for tracking the impact of migrations and helping with future troubleshooting. Always create an issue to track these changes before beginning the migration process.

Migration Tasks

Migration Tasks for Anthropic Model

  • Optional: Investigate whether the new model is supported within our current AI-Gateway API specification. This step can usually be skipped; however, supporting a newer model sometimes requires accommodating a new API format.
  • Add the new model to our available models list.
  • Change the default model in our AI-Gateway client. Wrap the change in a feature flag so that it can be rolled back quickly if needed (see the sketch after this list).
  • Update the model definitions in AI gateway following the prompt definition guidelines. Note: while we’re moving toward the AI gateway holding the prompts, feature flag implementation still requires a GitLab release.
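
A minimal sketch of that feature flag guard, assuming a hypothetical flag name and placeholder model constants (the real values live in the AI-Gateway client and the available models list):

# Illustrative sketch for ee/lib/gitlab/llm/chain/requests/ai_gateway.rb.
# :new_claude_model_rollout is a hypothetical feature flag name; the model
# constants are placeholders for entries in the available models list.
def default_model
  if Feature.enabled?(:new_claude_model_rollout, user)
    NEW_CLAUDE_MODEL      # newly added model identifier
  else
    PREVIOUS_CLAUDE_MODEL # previous default, kept until the flag is removed
  end
end

Keeping both code paths until the flag is removed is what makes an instant rollback possible.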

Migration Tasks for Vertex Models

Work in Progress

Feature Flag Process

Implementation Steps

For implementing feature flags, refer to our Feature Flags Development Guidelines.

Note: Feature flag implementations will affect self-hosted cloud-connected customers. These customers won’t receive the model upgrade until the feature flag is removed from the AI gateway codebase, as they won’t have access to the new GitLab release.

Model Selection Implementation

The model selection logic should be implemented in:

  • AI gateway client (ee/lib/gitlab/llm/chain/requests/ai_gateway.rb)
  • Model definitions in AI gateway
  • Any custom implementations in specific features that override the default model
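
As a rough sketch of how these layers fit together, the client can honor a feature-specific override before falling back to the flag-controlled default (method and option names here are hypothetical):

# Illustrative resolution order: an explicit per-feature override first,
# then the flag-controlled default from the sketch above.
def model_for_request(options = {})
  options[:model].presence || default_model
end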

Rollout Strategy

  • Enable the feature flag for a small percentage of users/groups initially
  • Monitor performance metrics and error rates (see the Monitoring and Metrics section below)
  • Gradually increase the rollout percentage
  • If issues arise, quickly disable the feature flag to roll back to the previous model
  • Once stability is confirmed, remove the feature flag and make the migration permanent
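
As an illustration, a percentage-of-actors rollout can be driven from a Rails console (or the equivalent ChatOps commands); the flag name is again hypothetical:

# Start small, watch the dashboards, then widen the rollout.
Feature.enable_percentage_of_actors(:new_claude_model_rollout, 10)
Feature.enable_percentage_of_actors(:new_claude_model_rollout, 50)

# Instant rollback to the previous model if error rates spike.
Feature.disable(:new_claude_model_rollout)

# Once stability is confirmed, enable fully, then delete the flag from the code.
Feature.enable(:new_claude_model_rollout)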

For more details on monitoring during migrations, see the Monitoring and Metrics section below.

Scope of Work

AI Features to Migrate

  • Duo Chat Tools:
    • ci_editor_assistant/prompts/anthropic.rb - CI Editor
    • gitlab_documentation/executor.rb - GitLab Documentation
    • epic_reader/prompts/anthropic.rb - Epic Reader
    • issue_reader/prompts/anthropic.rb - Issue Reader
    • merge_request_reader/prompts/anthropic.rb - Merge Request Reader
  • Chat Slash Commands:
    • refactor_code/prompts/anthropic.rb - Refactor
    • write_tests/prompts/anthropic.rb - Write Tests
    • explain_code/prompts/anthropic.rb - Explain Code
    • explain_vulnerability/executor.rb - Explain Vulnerability
  • Experimental Tools:
    • Summarize Comments Chat
    • Fill MR Description

Testing and Validation

Model Evaluation

The ai-model-validation team created the Prompt Library to evaluate the performance of prompt changes as well as model changes. The Prompt Library README.md provides details on how to evaluate the performance of AI features.

Another use case for chat evaluation is during the feature development cycle: verifying how changes to the code base and prompts affect the quality of chat responses before the code reaches the production environment.

For evaluation in merge request pipelines, we use:

Seed project and group resources for testing and evaluation

To seed project and group resources for testing and evaluation, run the following command:

SEED_GITLAB_DUO=1 FILTER=gitlab_duo bundle exec rake db:seed_fu

This command executes the development seed file for GitLab Duo, which creates the gitlab-duo group in your GDK.

This command seeds the group and project resources used for testing GitLab Duo features. It’s mainly used in the following scenarios:

  • Developers or UX designers have a local GDK but don’t know how to set up the group and project resources to test a feature in the UI.
  • Evaluators (for example, CEF) have an input dataset that refers to a group or project resource (for example, summarizing issue #123 requires a corresponding issue record in PostgreSQL).

Currently, the input dataset of evaluators and this development seed file are managed separately. To ensure that the integration keeps working, this seeder has to create the same group and project resources every time. For example, the ID and IID of the inserted PostgreSQL records must be the same every time the seeding process runs.
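
A minimal sketch of that idempotency requirement, with hypothetical helper names: look up records by their fixed identifiers before creating them, so repeated runs insert nothing new.

# Illustrative only: deterministic seeding via fixed path and IID lookups.
group = Group.find_by(path: 'gitlab-duo') || create_group!(path: 'gitlab-duo')
project = group.projects.find_by(path: 'duo-test') ||
  create_project!(group: group, path: 'duo-test')

# Evaluator datasets reference issues by IID, so the IID must stay stable.
project.issues.find_by(iid: 123) ||
  create_issue!(project: project, iid: 123, title: 'Seeded issue for evaluation')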

The following projects depend on these fixtures:

See this architecture doc for more information.

Local Development

A valuable tool for local development, beyond unit tests, is LangSmith tracing: it lets you trace LLM calls within Duo Chat to verify that the correct model is being used.

To prevent regressions, we also have CI jobs to make sure our tools are working correctly. For more details, see the Duo Chat testing section.

Monitoring and Metrics

Monitor the following during migration: