Set up a self-hosted large language model with LiteLLM

LiteLLM is a proxy server that exposes an OpenAI-compatible API. You can use LiteLLM to simplify integration with different large language models (LLMs), because every model is addressed through the OpenAI API spec, and to switch easily between LLMs.

%%{init: { "fontFamily": "GitLab Sans" }}%%
sequenceDiagram
    accTitle: LiteLLM architecture
    accDescr: Shows how GitLab sends requests to the AI Gateway when set up with the LiteLLM OpenAI proxy server.
    actor Client
    participant GitLab
    participant AIGateway as AI Gateway
    box Self Hosted setup using an OpenAI Proxy Server
        participant LiteLLM
        participant SelfHostedModel as Ollama
    end
    Client ->> GitLab: Send request
    GitLab ->> AIGateway: Create prompt and send request
    AIGateway ->> LiteLLM: Perform API request to the AI model <br> using the OpenAI format
    LiteLLM ->> SelfHostedModel: Translate and forward the request<br>to the model provider specific format
    SelfHostedModel -->> LiteLLM: Respond to the prompt
    LiteLLM -->> AIGateway: Forward AI response
    AIGateway -->> GitLab: Forward AI response
    GitLab -->> Client: Forward AI response
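
Because LiteLLM speaks the OpenAI format to the AI Gateway regardless of which backend serves the model, switching between self-hosted LLMs is mostly a configuration change. The following proxy configuration is a minimal sketch; the second model entry (a mistral model served from another Ollama host) is an illustrative assumption, not part of the setup described below:

    # config.yaml
    model_list:
      # Requests for "codegemma" are routed to a local Ollama instance
      - model_name: codegemma
        litellm_params:
          model: ollama/codegemma:2b
          api_base: http://localhost:11434
      # Requests for "mistral" are routed to a second Ollama host (illustrative)
      - model_name: mistral
        litellm_params:
          model: ollama/mistral:7b
          api_base: http://other-ollama-host:11434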

On Kubernetes

In Kubernetes environments, you can install Ollama with a Helm chart or by following the example in the official documentation.
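
For example, a community Helm chart can install Ollama with a few commands. This is a minimal sketch; the chart repository URL, release name, and namespace are assumptions to adapt to your cluster:

    # Add the community Ollama Helm chart repository (assumed URL) and install Ollama
    helm repo add ollama-helm https://otwld.github.io/ollama-helm/
    helm repo update
    helm install ollama ollama-helm/ollama --namespace ollama --create-namespace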

Example setup with LiteLLM and Ollama

  1. Pull and serve the model with Ollama:

    ollama pull codegemma:2b
    ollama serve
    
  2. Create the LiteLLM proxy configuration that routes requests the AI Gateway sends for the generic codegemma model name to a specific model version. In this example, the target is codegemma:2b, which Ollama serves at http://localhost:11434:

    # config.yaml
    model_list:
      - model_name: codegemma
        litellm_params:
          model: ollama/codegemma:2b
          api_base: http://localhost:11434
    
  3. Run the proxy:

    litellm --config config.yaml
    
  4. Send a test request through the AI Gateway (a sketch for querying the LiteLLM proxy directly follows these steps):

    curl --request 'POST' \
      'http://localhost:5052/v2/code/completions' \
      -H 'accept: application/json' \
      -H 'Content-Type: application/json' \
      -d '{
        "current_file": {
          "file_name": "app.py",
          "language_identifier": "python",
          "content_above_cursor": "<|fim_prefix|>def hello_world():<|fim_suffix|><|fim_middle|>",
          "content_below_cursor": ""
        },
        "model_provider": "litellm",
        "model_endpoint": "http://127.0.0.1:4000",
        "model_name": "codegemma",
        "telemetry": [],
        "prompt_version": 2,
        "prompt": ""
      }' | jq
    
    {
       "id": "id",
       "model": {
          "engine": "litellm",
          "name": "text-completion-openai/codegemma",
          "lang": "python"
       },
       "experiments": [],
       "object": "text_completion",
       "created": 1718631985,
       "choices": [
          {
             "text": "print(\"Hello, World!\")",
             "index": 0,
             "finish_reason": "length"
          }
       ]
    }
    
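You can also query the LiteLLM proxy directly, without going through the AI Gateway, to confirm that the proxy reaches Ollama. This request is a minimal sketch that assumes the proxy is listening on its default port 4000 and that no authentication is configured:

    curl --request 'POST' \
      'http://localhost:4000/v1/completions' \
      -H 'Content-Type: application/json' \
      -d '{
        "model": "codegemma",
        "prompt": "def hello_world():",
        "max_tokens": 32
      }' | jq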

Example setup for Codestral with Ollama

When serving the Codestral model through Ollama, an additional step is required to make Codestral work with both code completions and code generations. A matching LiteLLM configuration sketch follows these steps.

  1. Pull the Codestral model:

    ollama pull codestral
    
  2. Edit the default template used for Codestral:

    ollama run codestral
    
    >>> /set template {{ .Prompt }}
    Set prompt template.
    >>> /save codestral
    Created new model 'codestral'
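
After you save the modified template, point the LiteLLM proxy at the saved model. The following configuration is a sketch that mirrors the codegemma example above and assumes Ollama is still serving at http://localhost:11434:

    # config.yaml
    model_list:
      - model_name: codestral
        litellm_params:
          model: ollama/codestral
          api_base: http://localhost:11434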