Configure models with Ollama

Ollama is a simple way to run local LLMs without a full vLLM or GPU stack. Ollama exposes an OpenAI-compatible API on port 11434, which Semantic Router can use as a model backend during first-run setup or in hand-authored YAML.

This guide walks through:

Installing Ollama and pulling a model on your host
Verifying the Ollama API
Registering the model in the Semantic Router setup dashboard
Activating the config and sending a test request

Tip: Semantic Router runs in Docker during vllm-sr serve. Point the router at host.docker.internal:11434, not localhost:11434, so the container can reach Ollama on your host.

Prerequisites

Semantic Router installed and runnable with vllm-sr serve (Linux, macOS, or WSL2 with Docker)
Ollama installed on the same machine that runs Docker
Enough disk space for at least one model (for example, llama3.2:3b is about 2 GB)

1. Install Ollama

Install Ollama from ollama.com/download for your platform, then confirm the CLI is available:

ollama --version

On Linux you can also use the install script:

curl -fsSL https://ollama.com/install.sh | sh

Ollama starts a background service automatically. It listens on http://127.0.0.1:11434 by default.
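As a quick check that the background service is up, the sketch below queries Ollama's /api/version endpoint on the default address. The helper name ollama_base_url and the OLLAMA_ADDR variable are hypothetical names chosen for this example; they are not Ollama settings.

```shell
# OLLAMA_ADDR is an assumed variable name for this sketch only.
OLLAMA_ADDR="${OLLAMA_ADDR:-127.0.0.1:11434}"

# ollama_base_url ADDR: prefix http:// unless a scheme is already present.
ollama_base_url() {
  case "$1" in
    http://*|https://*) printf '%s\n' "$1" ;;
    *) printf 'http://%s\n' "$1" ;;
  esac
}

base_url="$(ollama_base_url "$OLLAMA_ADDR")"

# Ask the service for its version; --max-time keeps the check from hanging.
if curl -fsS --max-time 5 "$base_url/api/version" 2>/dev/null; then
  printf '\nOllama is reachable at %s\n' "$base_url"
else
  printf 'Ollama is not reachable at %s\n' "$base_url"
fi
```

If the check fails right after installation, the service may still be starting; rerunning it after a few seconds is usually enough.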

2. Pull a model

Pull a model tag from the Ollama library. This example uses llama3.2:3b, a small general-purpose model that works well for local testing:

ollama pull llama3.2:3b

List locally available models:

ollama list

Note: Use the exact Ollama tag (for example llama3.2:3b, qwen2.5-coder:7b) as the model name in Semantic Router. The router forwards that name to Ollama unchanged.

3. Verify Ollama is serving

Before opening the Semantic Router dashboard, confirm Ollama responds on the host:

curl http://localhost:11434/v1/models

Send a quick chat completion:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}]
  }'

If either command fails, fix Ollama on the host before continuing. Semantic Router cannot reach a backend that is not already serving on port 11434.
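To turn the first check into a pass/fail test, the following sketch searches the /v1/models response for the exact tag. It assumes the response follows the OpenAI list format, with each model in an "id" field. lists_model is a hypothetical helper name, and stripping whitespace is a simplification rather than real JSON parsing.

```shell
# lists_model TAG: read a /v1/models JSON response on stdin and succeed
# only if it contains an "id" field equal to TAG.
lists_model() {
  # Strip whitespace so pretty-printed and compact JSON look the same,
  # then search for the literal key/value pair.
  tr -d ' \t\r\n' | grep -F "\"id\":\"$1\"" >/dev/null
}

model="llama3.2:3b"
if curl -fsS --max-time 5 http://localhost:11434/v1/models 2>/dev/null \
    | lists_model "$model"; then
  echo "Ollama is serving $model"
else
  echo "Ollama is not serving $model (service down or tag not pulled)"
fi
```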

4. Configure the model in the setup dashboard

Start Semantic Router (or use the instance already started by the installer):

vllm-sr serve

If config.yaml does not exist yet in the current directory, the dashboard opens in setup mode at http://localhost:8700.

On Step 1 — Connect model, register your Ollama model:

Field	Value
Model name	Your Ollama tag, for example `llama3.2:3b`
Provider	Local vLLM
Base URL or host	`host.docker.internal:11434`
Endpoint label	`primary` (or any short label)
Default	Select this model if it is your only backend

Why Local vLLM and not OpenAI-compatible API?

Ollama serves an OpenAI-compatible API, including /v1/chat/completions, so either provider type can reach it. The Local vLLM provider type is used here because it matches how other local backends are configured in vllm-sr serve, and it resolves to host.docker.internal:11434 in the generated config.yaml.

Alternative: you can choose OpenAI-compatible API and set the base URL to http://host.docker.internal:11434/v1. Both paths work; pick one and stay consistent when adding more models.

Click Continue when the model card validates.

5. Choose routing and activate

On Step 2 — Choose routing, keep the Single-model baseline if you only registered one Ollama model. You can import a preset or remote config later when you add more backends.

On Step 3 — Review & activate, confirm the model summary, then click Activate configuration.

Activation writes config.yaml to the current directory and exits setup mode. Envoy starts on port 8888 and routes requests through Semantic Router to your Ollama backend.
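Envoy may take a moment to accept connections after activation. The sketch below polls port 8888 until it answers any HTTP request. wait_for_http is a hypothetical helper, and the retry count and delay are arbitrary choices; it only shows that something is listening, not that routing works.

```shell
# wait_for_http URL TRIES DELAY: poll URL until the server answers with any
# HTTP response (even an error status), or give up after TRIES attempts.
wait_for_http() {
  url="$1"; tries="$2"; delay="$3"
  while [ "$tries" -gt 0 ]; do
    # Without -f, curl exits 0 for any HTTP response, so this only tests
    # that a server is listening and speaking HTTP.
    if curl -s -o /dev/null --max-time 2 "$url"; then
      return 0
    fi
    tries=$((tries - 1))
    sleep "$delay"
  done
  return 1
}

if wait_for_http http://localhost:8888/ 5 1; then
  echo "Router proxy is accepting connections on port 8888"
else
  echo "Nothing answered on port 8888 yet"
fi
```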

6. Test through Semantic Router

Send a request through the router proxy:

curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.2:3b",
    "messages": [{"role": "user", "content": "Hello from Semantic Router!"}]
  }'

If you kept the default single-model baseline, you can also send the request to the auto-routing alias MoM instead of naming the model:

curl http://localhost:8888/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "MoM",
    "messages": [{"role": "user", "content": "Hello from Semantic Router!"}]
  }'

A JSON chat completion response means Ollama is wired correctly.
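To compare the two model names side by side, the sketch below sends the same prompt under each name and prints the raw response. chat_payload is a hypothetical helper. It does no JSON escaping, so the message must not contain double quotes or backslashes.

```shell
# chat_payload MODEL MESSAGE: build a minimal chat completion request body.
# No JSON escaping is done, so keep MESSAGE free of quotes and backslashes.
chat_payload() {
  printf '{"model":"%s","messages":[{"role":"user","content":"%s"}]}\n' "$1" "$2"
}

for model in "llama3.2:3b" "MoM"; do
  # -f makes curl exit non-zero on HTTP errors so failures are visible.
  if response="$(curl -fsS --max-time 120 \
      http://localhost:8888/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d "$(chat_payload "$model" "Say hello in one sentence.")" 2>/dev/null)"; then
    printf '%s -> %s\n' "$model" "$response"
  else
    printf '%s -> request failed\n' "$model"
  fi
done
```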

YAML configuration (advanced)

If you prefer to edit YAML directly instead of using the dashboard, a minimal config.yaml for a single Ollama model looks like this:

version: v0.3

providers:
  defaults:
    default_model: llama3.2:3b
  models:
    - name: llama3.2:3b
      provider_model_id: llama3.2:3b
      api_format: openai
      backend_refs:
        - name: local-ollama
          endpoint: host.docker.internal:11434
          protocol: http
          weight: 100

routing:
  modelCards:
    - name: llama3.2:3b
  decisions:
    - name: default-route
      description: Route all requests to the local Ollama model.
      priority: 100
      rules:
        operator: AND
        conditions: []
      modelRefs:
        - model: llama3.2:3b
          use_reasoning: false

Validate and serve:

vllm-sr validate config.yaml
vllm-sr serve --config config.yaml
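Before serving a hand-written config, it can help to scan it for endpoints that point at localhost, since those resolve to the router container rather than the host. The sketch below is a plain-text search, not a YAML parser, and localhost_endpoints is a hypothetical helper name.

```shell
# localhost_endpoints FILE: print (with line numbers) every endpoint line
# that points at localhost or 127.0.0.1; exit non-zero if none are found.
localhost_endpoints() {
  grep -nE "endpoint:[[:space:]]*[\"']?(localhost|127\\.0\\.0\\.1)" "$1"
}

config="config.yaml"
if [ -f "$config" ]; then
  if localhost_endpoints "$config"; then
    echo "Replace the endpoints above with host.docker.internal:11434"
  else
    echo "No localhost endpoints found in $config"
  fi
fi
```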

Troubleshooting

Router cannot reach Ollama

Use host.docker.internal:11434 in config, not localhost:11434. Inside the router container, localhost refers to the container itself.
On Linux, vllm-sr serve adds --add-host=host.docker.internal:host-gateway automatically. If connectivity still fails, see Container connectivity.
Confirm Ollama responds on the host: curl http://localhost:11434/v1/models.
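One way to test the container's view of the network without involving the router is to run curl from a throwaway container. This sketch assumes Docker 20.10 or later (for host-gateway) and uses the public curlimages/curl image; container_endpoint is a hypothetical helper. If the host check succeeds but this one fails on Linux, a likely cause is that Ollama is bound only to 127.0.0.1; Ollama's OLLAMA_HOST environment variable controls the bind address.

```shell
# container_endpoint HOST:PORT: rewrite a loopback endpoint to the name
# that reaches the host from inside a container.
container_endpoint() {
  printf '%s\n' "$1" | sed -e 's/^localhost:/host.docker.internal:/' \
                           -e 's/^127\.0\.0\.1:/host.docker.internal:/'
}

endpoint="$(container_endpoint localhost:11434)"

if command -v docker >/dev/null 2>&1; then
  # --add-host maps host.docker.internal to the host gateway on Linux.
  if docker run --rm --add-host=host.docker.internal:host-gateway \
       curlimages/curl -fsS --max-time 5 "http://$endpoint/v1/models"; then
    printf '\nContainers can reach Ollama at %s\n' "$endpoint"
  else
    echo "Containers cannot reach Ollama at $endpoint"
  fi
fi
```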

Model not found or 404 from Ollama

The Model name in Semantic Router must match the Ollama tag exactly (llama3.2:3b, not llama3.2).
Run ollama list and pull the tag if it is missing: ollama pull <tag>.
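The exact-match rule can be checked mechanically. The sketch below compares a tag against the first column of ollama list output, assuming the usual layout of one header row followed by one row per model. has_model is a hypothetical helper name.

```shell
# has_model TAG: read `ollama list` output on stdin and succeed only if
# the first column of some row (after the header) equals TAG exactly.
has_model() {
  awk -v tag="$1" 'NR > 1 && $1 == tag { found = 1 } END { exit !found }'
}

model="llama3.2:3b"
if command -v ollama >/dev/null 2>&1; then
  if ollama list | has_model "$model"; then
    echo "$model is available locally"
  else
    echo "$model is missing; run: ollama pull $model"
  fi
fi
```

Because the comparison is exact, llama3.2 does not match a row named llama3.2:3b, which mirrors how the router forwards the name unchanged.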

Slow first request

Ollama loads models on demand. The first request after idle time may take longer while weights are loaded into memory.
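To avoid paying the load time on a user-facing request, the model can be loaded ahead of time. The sketch below uses Ollama's native /api/generate endpoint, which loads a model when called without a prompt, and its keep_alive parameter, which sets how long the model stays in memory afterwards. The 30m value and the warmup_payload helper name are illustrative choices.

```shell
# warmup_payload MODEL KEEP_ALIVE: request body that loads MODEL without
# generating text and keeps it in memory for KEEP_ALIVE (for example 30m).
warmup_payload() {
  printf '{"model":"%s","keep_alive":"%s"}\n' "$1" "$2"
}

model="llama3.2:3b"
if curl -fsS --max-time 120 http://localhost:11434/api/generate \
     -d "$(warmup_payload "$model" "30m")" >/dev/null 2>&1; then
  echo "$model is loaded; keep_alive is set to 30m"
else
  echo "Warm-up request for $model failed"
fi
```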

Reasoning models (Qwen3 and similar)

Some reasoning models spend the full token budget on internal thinking when called through Ollama's OpenAI-compatible endpoint. For advanced local setups with Qwen3-style models, see bench/grounded_fusion/ollama_proxy.py in the repository.

Next steps

Add more backends and turn on semantic routing presets in the dashboard
Read the Configuration guide for decisions, signals, and model cards
See the agentgateway homelab blog post for a multi-model setup that includes local Ollama
