Serve Large Language Model APIs Locally
There are several ways to serve large language models (LLMs) for local or self-deployment purposes.
MistralAI recommends two different serving frameworks for their models:
- vLLM: A Python-only serving framework that exposes an API matching OpenAI’s spec. vLLM provides a PagedAttention kernel to improve serving throughput (see the launch sketch after this list).
- Nvidia’s TensorRT-LLM served with Nvidia’s Triton Inference Server: TensorRT-LLM provides a DSL to build fast inference engines with dedicated kernels for large language models. Triton Inference Server allows efficient serving of these inference engines.
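To give an idea of what this involves, here is a minimal, non-authoritative sketch of starting vLLM’s OpenAI-compatible server on a CUDA-capable machine. The model name is only an example, and the exact entry point can vary between vLLM versions:

# Install vLLM (requires an Nvidia GPU with CUDA support).
pip install vllm
# Start an OpenAI-compatible HTTP server; the model name is an example.
vllm serve mistralai/Mistral-7B-Instruct-v0.3 --port 8000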
These solutions require access to an Nvidia GPU because they rely on CUDA for computation. Ollama, however, offers a low-configuration, cross-platform alternative, and it is the solution we are going to explore.
Ollama
Ollama is an open-source framework that helps you get up and running with large language models locally. You can serve any supported LLM, and you can also build your own model and push it to Hugging Face.
Be aware that LLMs are usually resource-intensive to run. We are therefore going to focus on serving a single model, mistral:instruct, as it is relatively lightweight to run for its accuracy.
Set up Ollama
Install Ollama by following these instructions for your OS. On macOS, you can alternatively use Homebrew by running brew install ollama in your terminal.
Once installed, pull the model with ollama pull mistral:instruct in your terminal. If the model was successfully pulled, give it a run with ollama run mistral:instruct. Exit the process once you’ve tested the model.
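For illustration, a typical terminal session looks like the following (model output omitted):

# Download the model weights; this can take a while.
ollama pull mistral:instruct
# Start an interactive chat with the model.
ollama run mistral:instruct
# Inside the chat, type /bye to exit once you are done testing.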
Now you can use the Ollama server. Visit http://localhost:11434/; you should see Ollama is running, which means your server is already up. If that’s not the case, run ollama serve in your terminal, or brew services start ollama if you installed it with Homebrew.
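You can also check the server from the command line. Assuming the default port, the following should print the same message:

# Prints "Ollama is running" when the server is up.
curl http://localhost:11434/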
The Ollama serving framework exposes its own REST API as well as an OpenAI-compatible one. The API reference is documented here. Here is a simple example you can try against the native chat endpoint:
curl "http://localhost:11434/api/chat" \
--data '{
"model": "mistral:instruct",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
}
],
"stream": false
}'
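As a sketch, the same question can also be sent to Ollama’s OpenAI-compatible chat completions endpoint:

curl "http://localhost:11434/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mistral:instruct",
    "messages": [
      {"role": "user", "content": "why is the sky blue?"}
    ]
  }'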
The server listens on port 11434 by default. If you run into issues because this port is already in use by another application, you can follow these instructions.
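For example, one common approach (assuming a recent Ollama release) is to bind the server to another port through the OLLAMA_HOST environment variable:

# Serve on port 11435 instead of the default 11434.
OLLAMA_HOST=127.0.0.1:11435 ollama serve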