Willem Medendorp's Blog

Running LLMs locally

🏷️ [Linux, LLM, ollama, Docker]

Why would you want to do this?

LLMs (Large Language Models) are all the buzz at the moment. They are a great companion and tool for developers, researchers, and data scientists. Everyone knows about ChatGPT, most likely your grandma as well. It is a great tool, fast and easy to use, but it is not "free": all your data might be used to train the model. If you are working with sensitive data this is less than ideal. DeepSeek R1 was just released and is an awesome alternative to ChatGPT, however it stores your data in China, which is not ideal for everyone. Luckily for us DeepSeek R1 and many other models are "open source" and can be run locally. This is a blog post on how to run an LLM locally; all you need is Docker and a beefy GPU with at least 8GB of VRAM.
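
If you are not sure how much VRAM your GPU has, a quick check with nvidia-smi (assuming an NVIDIA card with the driver installed) looks like this:

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv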

Ollama

Ollama is a great utility to run and swap between models. Initially it was made for Llama, the LLM from Meta, but it has been expanded to support many other models. Ollama can be seen as the Docker of LLMs: it can fetch any model from the internet and run it locally, swapping it into memory when needed, without you having to think about all the details.

You can use ollama as is:

ollama run deepseek-r1:1.5b

This will fetch the 1.5 billion parameter deepseek-r1 model and run it, which requires at least 1.1GB of VRAM. After the model is fetched you are greeted with a prompt:

>>> Send a message (/? for help)
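
From here you can chat directly in the terminal; /? lists the available commands and /bye exits. A few other ollama commands come in handy while experimenting (the model tag is just an example):

ollama pull deepseek-r1:1.5b   # fetch a model without starting a chat
ollama list                    # show the models stored locally
ollama ps                      # show the models currently loaded in memory
ollama rm deepseek-r1:1.5b     # remove a model again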

Or you can run ollama in a Docker container, by first starting the ollama container:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

You might need --device nvidia.com/gpu=all instead of --gpus=all depending on your setup.

and then running the ollama command in the container:

docker exec -it ollama ollama run deepseek-r1:1.5b
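
Because the container publishes port 11434, you can also talk to ollama's REST API directly from the host, which is handy for scripting. A minimal sketch (the prompt is just a placeholder):

curl http://localhost:11434/api/generate -d '{"model": "deepseek-r1:1.5b", "prompt": "Why is the sky blue?", "stream": false}'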

Running ollama like this is great, but not very user friendly or practical. To give it more of a ChatGPT-like experience we can use Open Web UI.

Open Web UI

Open Web UI is a feature-rich and user-friendly web interface for running LLMs. It works similar to ChatGPT, but on your local machine. It is just a front-end, so you can use different back-ends like ollama or external APIs. The easiest way to run it is with Docker:

docker run -d -p 3000:8080 --gpus all --add-host=host.docker.internal:host-gateway -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda

You might need --device nvidia.com/gpu=all instead of --gpus=all depending on your setup.
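
The --add-host flag is there so the container can reach an ollama instance running directly on the host. If you started the ollama container from the previous section instead, a rough sketch is to put both containers on a shared Docker network and point Open Web UI at it with OLLAMA_BASE_URL (the network name llm-net is just an example):

docker network create llm-net
docker network connect llm-net ollama
docker run -d -p 3000:8080 --gpus all --network llm-net -e OLLAMA_BASE_URL=http://ollama:11434 -v open-webui:/app/backend/data --name open-webui --restart always ghcr.io/open-webui/open-webui:cuda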

After running the command you can go to http://localhost:3000 in your browser and you will be greeted with the Open Web UI. You can now run any model you want, and it will be run locally on your machine.

In the top left you can select your model: [screenshot: Open Web UI model selector]

If you don't have the model yet, you can type the correct ollama tag in the search bar and fetch it directly: [screenshot: Open Web UI model download]

Now we can start to chat: [screenshot: Open Web UI chat]

Because it is just a 1.5b model the output is blazing fast. Even more wonderful is that it is correct as well. This is a great way to play around with LLMs without having to worry about any cloud services.

Docker compose file

To make it even easier to run ollama and Open Web UI together, you can use the following docker-compose.yml file. This starts and links both services with a simple docker compose up command.

# services section to define individual services
services:
  open-webui:
    # Name of the image to use
    image: ghcr.io/open-webui/open-webui:cuda
    
    # Container name (based on the image name)
    container_name: open-webui
    
    # Expose ports from the container
    ports:
      - "80:8080"
    
    # Environment variables to set in the container
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434

    # Equivalent of the --add-host flag, so the container can also reach the host
    extra_hosts:
      - host.docker.internal:host-gateway
    
    # Volumes to mount between the host and the container
    volumes:
      - open-webui:/app/backend/data
    
    # Healthcheck configuration to ensure the service is running correctly
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 40s
    
    # Restart the container unless it is explicitly stopped
    restart: unless-stopped
    
    # GPU configuration using CDI
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=all
              capabilities: [gpu]

  ollama:
    # Name of the image to use
    image: ollama/ollama
    
    # Container name (based on the image name)
    container_name: ollama
    
    # Expose ports from the container
    ports:
      - "11434:11434"
    
    # Volumes to mount between the host and the container
    volumes:
      - ollama:/root/.ollama
    
    # Healthcheck configuration to ensure the service is running correctly
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 40s
    
    # Restart the container unless it is explicitly stopped
    restart: unless-stopped
    
    # GPU configuration using CDI
    deploy:
      resources:
        reservations:
          devices:
            - driver: cdi
              device_ids:
                - nvidia.com/gpu=all
              capabilities: [gpu]

# Volumes declaration
volumes:
  ollama:
  open-webui:

# Default network declaration for containers to communicate
networks:
  default:
    driver: bridge
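
With this in place, a typical first run looks something like this (the model tag is again just an example). Note that the compose file maps Open Web UI to port 80, so it will be reachable on http://localhost instead of port 3000:

docker compose up -d
docker compose exec ollama ollama pull deepseek-r1:1.5b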

Things to try in the future

Now that we can run our "own" LLMs we can try to adapt them to different use cases or needs. Open Web UI has many features, such as pipelines, tools and standard prompts, which we could set up for different tasks.

Although more advanced, we could also explore further training/fine-tuning of a model on specific data and then play around to see how it behaves.

Copyright 2025
Willem Medendorp
