GitLab Duo Self-Hosted supported platforms
- Tier: Ultimate
- Add-on: GitLab Duo Enterprise
- Offering: GitLab Self-Managed
There are multiple platforms available to host your self-hosted Large Language Models (LLMs). Each platform has unique features and benefits that can cater to different needs. The following documentation summarizes the currently supported options:
For self-hosted model deployments
vLLM
vLLM is a high-performance inference server optimized for serving LLMs with memory efficiency. It supports model parallelism and integrates easily with existing workflows.
To install vLLM, see the vLLM Installation Guide. You should install version v0.6.4.post1 or later.
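For example, if you install vLLM with pip (one common approach; check the installation guide for the method that matches your environment and hardware), you can pin the minimum supported version:
pip install "vllm>=0.6.4.post1"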
Endpoint configuration
When configuring the endpoint URL for any OpenAI API-compatible platform (such as vLLM) in GitLab:
- The URL must be suffixed with /v1.
- If using the default vLLM configuration, the endpoint URL would be https://<hostname>:8000/v1.
- If your server is configured behind a proxy or load balancer, you might not need to specify the port, in which case the URL would be https://<hostname>/v1.
Finding the model name
After the model has been deployed, you can obtain the model name for the model identifier field in GitLab by querying the vLLM server's /v1/models endpoint:
curl \
--header "Authorization: Bearer API_KEY" \
--header "Content-Type: application/json" \
http://your-vllm-server:8000/v1/models
The model name is the value of the data.id field in the response.
Example response:
{
"object": "list",
"data": [
{
"id": "Mixtral-8x22B-Instruct-v0.1",
"object": "model",
"created": 1739421415,
"owned_by": "vllm",
"root": "mistralai/Mixtral-8x22B-Instruct-v0.1",
// Additional fields removed for readability
}
]
}
In this example, if the model's id is Mixtral-8x22B-Instruct-v0.1, you would set the model identifier in GitLab as custom_openai/Mixtral-8x22B-Instruct-v0.1.
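If you have jq installed (an assumption; any JSON tool works), you can extract the value directly from the same endpoint instead of reading the raw response:
curl --silent \
  --header "Authorization: Bearer API_KEY" \
  http://your-vllm-server:8000/v1/models | jq -r '.data[].id'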
For more information on:
- vLLM supported models, see the vLLM Supported Models documentation.
- Available options when using vLLM to run a model, see the vLLM documentation on engine arguments.
- The hardware needed for the models, see the Supported models and Hardware requirements documentation.
Examples:
Mistral-7B-Instruct-v0.3
Download the model from HuggingFace:
git clone https://<your-hugging-face-username>:<your-hugging-face-token>@huggingface.co/mistralai/Mistral-7B-Instruct-v0.3
Run the server:
vllm serve <path-to-model>/Mistral-7B-Instruct-v0.3 \
  --served_model_name <choose-a-name-for-the-model> \
  --tokenizer_mode mistral \
  --tensor_parallel_size <number-of-gpus> \
  --load_format mistral \
  --config_format mistral \
  --tokenizer <path-to-model>/Mistral-7B-Instruct-v0.3
Mixtral-8x7B-Instruct-v0.1
Download the model from HuggingFace:
git clone https://<your-hugging-face-username>:<your-hugging-face-token>@huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
Rename the token config:
cd <path-to-model>/Mixtral-8x7B-Instruct-v0.1
cp tokenizer.model tokenizer.model.v3
Run the model:
vllm serve <path-to-model>/Mixtral-8x7B-Instruct-v0.1 \
  --tensor_parallel_size 4 \
  --served_model_name <choose-a-name-for-the-model> \
  --tokenizer_mode mistral \
  --load_format safetensors \
  --tokenizer <path-to-model>/Mixtral-8x7B-Instruct-v0.1
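After either server is running, you can send a test request to the OpenAI-compatible chat completions endpoint to confirm it responds before connecting GitLab. This is a minimal sketch: the hostname, port, and model name are placeholders that must match your deployment and the value you passed to --served_model_name.
curl http://<hostname>:8000/v1/chat/completions \
  --header "Content-Type: application/json" \
  --data '{"model": "<choose-a-name-for-the-model>", "messages": [{"role": "user", "content": "Hello"}]}'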
Disable request logging to reduce latency
When running vLLM in production, you can significantly reduce latency by using the --disable-log-requests flag to disable request logging.
Use this flag only when you do not need detailed request logging.
Disabling request logging minimizes the overhead introduced by verbose logs, especially under high load, and can help improve performance.
vllm serve <path-to-model>/<model-version> \
--served_model_name <choose-a-name-for-the-model> \
--disable-log-requests
This change has been observed to notably improve response times in internal benchmarks.
For cloud-hosted model deployments
- AWS Bedrock. A fully managed service that allows developers to build and scale generative AI applications using pre-trained models from leading AI companies. It seamlessly integrates with other AWS services and offers a pay-as-you-go pricing model. You must configure the GitLab instance with the appropriate AWS IAM permissions before accessing Bedrock models. You cannot do this in the GitLab Duo Self-Hosted UI. For example, you can authenticate the AI Gateway instance by defining AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION_NAME when starting the Docker image, as shown in the sketch after this list. For more information, see the AWS Identity and Access Management (IAM) Guide.
- Azure OpenAI. Provides access to OpenAI's powerful models, enabling developers to integrate advanced AI capabilities into their applications with robust security and scalable infrastructure.
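As a minimal sketch of the Bedrock case, assuming you run the AI Gateway as a Docker container, you can pass the credentials as environment variables. The image name, credential values, region, and published port below are placeholders, not the exact values for your deployment:
# <ai-gateway-image> and the port mapping are illustrative; adjust for your deployment.
docker run \
  -e AWS_ACCESS_KEY_ID=<your-access-key-id> \
  -e AWS_SECRET_ACCESS_KEY=<your-secret-access-key> \
  -e AWS_REGION_NAME=<your-aws-region> \
  -p 5052:5052 \
  <ai-gateway-image>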
Use multiple models and platforms
With GitLab Duo Self-Hosted, you can use multiple models and platforms in the same GitLab instance.
For example, you can configure one feature to use Azure OpenAI, and another feature to use AWS Bedrock or self-hosted models served with vLLM.
This setup gives you flexibility to choose the best model and platform for each use case. Models must be supported and served through a compatible platform.
For more information on setting up different providers, see the documentation for each platform.