Serve Large Language Model APIs Locally
There are several ways to serve large language models (LLMs) for local or self-deployment purposes.
MistralAI recommends two different serving frameworks for their models:
- vLLM: A Python-only serving framework that exposes an API matching OpenAI’s spec. vLLM provides a PagedAttention kernel to improve serving throughput (a minimal launch sketch follows this list).
- Nvidia’s TensorRT-LLM served with Nvidia’s Triton Inference Server: TensorRT-LLM provides a DSL to build fast inference engines with dedicated kernels for large language models. Triton Inference Server allows efficient serving of these inference engines.
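As a quick illustration of the first option, vLLM’s OpenAI-compatible server can be started with a couple of commands. The model name and port below are only examples, not something this guide prescribes:

```bash
# Install vLLM, then start its OpenAI-compatible HTTP server.
# The model and port are illustrative; pick whatever fits your hardware.
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mistral-7B-Instruct-v0.2 \
  --port 8000
```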
These solutions require access to an Nvidia GPU, as they rely on CUDA for computation. Ollama, however, offers a low-configuration, cross-platform alternative, and that is the solution we are going to explore.
Ollama
Ollama is an open-source framework that helps you get up and running with large language models locally. You can serve any supported LLM, or build your own model and push it to Hugging Face.
Be aware that LLMs are usually very resource-intensive to run. We are therefore going to focus on serving a single model, `mistral:instruct`, as it is relatively lightweight to run given its accuracy.
Set up Ollama
Install Ollama by following these instructions for your OS. On macOS, you can alternatively use Homebrew by running `brew install ollama` in your terminal.
Once installed, pull the model with `ollama pull mistral:instruct` in your terminal.
If the model was successfully pulled, give it a run with `ollama run mistral:instruct`. Exit the process once you’ve tested the model.
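For reference, the whole model setup fits in two commands once Ollama itself is installed:

```bash
# Download the model weights (several GB, cached locally after the first pull).
ollama pull mistral:instruct

# Open an interactive session to sanity-check the model;
# type a prompt, then exit with /bye (or Ctrl+D) when you are done.
ollama run mistral:instruct
```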
Now you can use the Ollama server. Visit http://localhost:11434/; you should see the message `Ollama is running`, which means your server is already up. If that’s not the case, run `ollama serve` in your terminal, or `brew services start ollama` if you installed it with Homebrew.
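You can also check from the command line; a plain GET on the root endpoint returns the same message:

```bash
# Prints "Ollama is running" if the server is up on the default port.
curl http://localhost:11434/
```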
The Ollama server exposes its own REST API as well as an OpenAI-compatible API. The API reference is documented here. Here is a simple example against the native chat endpoint that you can try:
curl "http://localhost:11434/api/chat" \
--data '{
"model": "mistral:instruct",
"messages": [
{
"role": "user",
"content": "why is the sky blue?"
}
],
"stream": false
}'
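Because recent Ollama versions also expose an OpenAI-compatible endpoint, the same request can be written in OpenAI’s format; this is a sketch assuming that compatibility layer is available in your version:

```bash
# OpenAI-style chat completion against Ollama's compatibility endpoint.
curl "http://localhost:11434/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "mistral:instruct",
    "messages": [
      {"role": "user", "content": "why is the sky blue?"}
    ]
  }'
```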
The server listens on port 11434 by default. If you run into issues because this port is already in use by another application, you can follow these instructions.
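One common way to move the server to another port is to set the OLLAMA_HOST environment variable before starting it; the port below is only an example, and you will need to point your API calls at the same address afterwards:

```bash
# Bind the Ollama server to an alternative port (example value).
OLLAMA_HOST=127.0.0.1:11435 ollama serve

# Subsequent requests must then target that port.
curl "http://127.0.0.1:11435/api/chat" \
  --data '{"model": "mistral:instruct", "messages": [{"role": "user", "content": "Hello"}], "stream": false}'
```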