This guide explains how to self-host Meta-Llama-3.1-8B-Instruct using vLLM on a Hetzner Ubuntu 22.04 server with an NVIDIA RTX 4000 SFF Ada GPU, and then use it as a Translate5 Language Resource.

1. System Preparation

Install Kernel Headers

Required for NVIDIA driver builds:

Code Block

language	bash

uname -r

dpkg -l | grep headers | grep $(uname -r)

If no headers are found:

Code Block

language	bash

sudo apt install -y linux-headers-$(uname -r)

Install NVIDIA Drivers

Code Block

language	bash

# Detect recommended driver
ubuntu-drivers devices

# Remove old drivers if any
sudo apt purge -y 'nvidia-*'
sudo apt autoremove -y

# Install prerequisites
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential

# Install driver (580 recommended for RTX 4000 Ada)
sudo apt install -y nvidia-driver-580

# Reboot is important: the new driver is only loaded after a restart
sudo reboot

# Verify installation after the reboot:

# The nvidia kernel modules should be listed
lsmod | grep nvidia

# You should be able to see the GPU info
nvidia-smi

# Driver build status. The output should look something like:
# nvidia/580.65.06, 5.15.0-153-generic, x86_64: installed
dkms status

✅ Example output shows the driver, CUDA version, and GPU details.
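
The `dkms status` check can also be scripted. The following is a minimal sketch and not part of the original setup: the function name `dkms_module_installed` is hypothetical, and it assumes the `module/version, kernel, arch: state` line format shown in the example output above.

```shell
# Hypothetical helper: reads `dkms status` output on stdin and succeeds
# if the nvidia module is reported as installed for the given kernel.
dkms_module_installed() {
    dkms_kernel="$1"
    # Expected line format (taken from the example above):
    #   nvidia/580.65.06, 5.15.0-153-generic, x86_64: installed
    awk -F', ' -v k="$dkms_kernel" '
        $1 ~ /^nvidia\// && $2 == k && $3 ~ /: installed/ { found = 1 }
        END { exit !found }
    '
}

# Usage on the server (not executed here):
#   dkms status | dkms_module_installed "$(uname -r)" && echo "driver built for running kernel"
```

Comparing against `uname -r` is useful because a driver built for an older kernel does not load after a kernel upgrade.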

2. Docker & NVIDIA Runtime

Install Docker & Compose

Code Block

language	bash

curl -fsSL https://get.docker.com -o get-docker.sh 
sudo sh get-docker.sh 
sudo apt install docker-compose-plugin -y

# Add user to Docker group:
sudo usermod -aG docker $USER
newgrp docker

Python for monitoring

Code Block

# Install Python dependencies for monitoring
sudo apt install python3-pip -y
pip3 install requests psutil nvidia-ml-py tabulate

# Create project directory in /opt
sudo mkdir -p /opt/vllm
sudo chown -R $USER:$USER /opt/vllm

Install NVIDIA Container Toolkit

Code Block

language	bash

# Remove old packages
sudo apt-get remove nvidia-docker nvidia-docker2 nvidia-container-runtime

# Add repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU passthrough inside Docker:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
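
If the passthrough test fails, one thing worth checking is whether `nvidia-ctk runtime configure` actually registered the runtime with Docker. The sketch below is an assumption-laden text check, not an official tool: the function name `has_nvidia_runtime` is hypothetical, and it assumes the runtime entry is written to `/etc/docker/daemon.json` and references the `nvidia-container-runtime` binary.

```shell
# Hypothetical helper: succeeds if the given Docker daemon config file
# mentions the NVIDIA container runtime binary.
# Assumption: the default config location is /etc/docker/daemon.json.
has_nvidia_runtime() {
    daemon_config="${1:-/etc/docker/daemon.json}"
    [ -f "$daemon_config" ] && grep -q 'nvidia-container-runtime' "$daemon_config"
}

# Usage on the server (not executed here):
#   has_nvidia_runtime || echo "NVIDIA runtime not configured; rerun nvidia-ctk and restart Docker"
```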

3. Project Setup

Create Directories

sudo mkdir -p /opt/vllm
sudo chown -R $USER:$USER /opt/vllm
cd /opt/vllm

Create `.env` Configuration

# HuggingFace
HUGGING_FACE_HUB_TOKEN=your_token_here
MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct

# vLLM
VLLM_PORT=8000
VLLM_API_KEY=your_random_secret
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.90
MAX_MODEL_LEN=4096

# Nginx
NGINX_PORT=80
NGINX_SSL_PORT=443

# Server
SERVER_IP=your.server.ip
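
A missing or mistyped key in `.env` usually only shows up when the containers start. The following sketch checks the file up front. It is an illustration, not part of the original setup: the function name `validate_env_file` is hypothetical, and the list of required keys is an assumption based on the variables used in this guide.

```shell
# Hypothetical helper: checks that an env file defines the keys this
# guide relies on and that GPU_MEMORY_UTILIZATION is a valid fraction.
validate_env_file() {
    env_file="$1"
    [ -f "$env_file" ] || { echo "missing file: $env_file"; return 1; }
    rc=0
    for key in HUGGING_FACE_HUB_TOKEN MODEL_NAME VLLM_PORT VLLM_API_KEY \
               TENSOR_PARALLEL_SIZE GPU_MEMORY_UTILIZATION MAX_MODEL_LEN; do
        # A key counts as set when a KEY=value line with a non-empty value exists.
        if ! grep -Eq "^${key}=.+" "$env_file"; then
            echo "missing or empty: $key"
            rc=1
        fi
    done
    # The value is a fraction of GPU memory, so it must lie in (0, 1].
    util=$(sed -n 's/^GPU_MEMORY_UTILIZATION=//p' "$env_file" | head -n 1)
    if [ -n "$util" ] && ! awk -v u="$util" 'BEGIN { exit !(u > 0 && u <= 1) }'; then
        echo "GPU_MEMORY_UTILIZATION out of range: $util"
        rc=1
    fi
    return $rc
}

# Usage (not executed here):
#   validate_env_file /opt/vllm/.env && echo ".env looks complete"
```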

Create `docker-compose.yml`

The file is located at /opt/vllm/docker-compose.yml.

It contains the vLLM service and an Nginx reverse proxy (see the full file in the notes).

4. Nginx Configuration

Create the config file at /opt/vllm/nginx/nginx.conf.

Note: the current configuration does not use SSL!

Code Block

language	bash

events {
    worker_connections 1024;
}

http {
    upstream vllm_backend {
        server vllm:8000;
        keepalive 32;
    }

    limit_req_zone $binary_remote_addr zone=api_limit:100m rate=100r/s;

    server {
        listen 80;
        server_name _;

        access_log /var/log/nginx/vllm_access.log;
        error_log /var/log/nginx/vllm_error.log warn;

        location /health {
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }

        location / {
            limit_req zone=api_limit burst=20 nodelay;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;
            proxy_read_timeout 300s;

            proxy_buffering off;
            proxy_cache off;

            add_header Access-Control-Allow-Origin * always;
            add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
            add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;
        }
    }
}
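
With `limit_req zone=api_limit burst=20 nodelay`, requests from one client IP that exceed the configured rate plus the burst allowance are rejected immediately, by default with HTTP 503 (unless `limit_req_status` is set). A rough way to see this in practice is to fire many requests and count the status codes. The sketch below only does the counting. The function name `tally_status_codes` is hypothetical, and the load loop in the comment is an assumption about how one might drive it.

```shell
# Hypothetical helper: reads one HTTP status code per line on stdin and
# prints "code count" pairs, sorted by code.
tally_status_codes() {
    awk 'NF { count[$1]++ } END { for (code in count) print code, count[code] }' | sort
}

# Usage against the proxy (not executed here). The /health location has no
# limit_req, so the requests target an API path under "location /":
#   for i in $(seq 300); do
#       curl -s -o /dev/null -w '%{http_code}\n' \
#           -H "Authorization: Bearer $VLLM_API_KEY" http://localhost/v1/models &
#   done | tally_status_codes
```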

5. Deployment Script

Create /opt/vllm/scripts/startup.sh (see full script in notes).

It handles:

Env validation
GPU check (nvidia-smi)
Docker service checks
Container startup
Health checks (/health)

Run it:

bash /opt/vllm/scripts/startup.sh

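6. Verify Deployment

Logs:

docker compose logs -f vllm
docker compose logs -f nginx

Health check:

curl -s -H "Authorization: Bearer $VLLM_API_KEY" http://localhost:8000/health

Sample API call: send a POST request with curl to /v1/chat/completions. The full request is listed in the notes, after the startup script.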

Notes: full docker-compose.yml (in /opt/vllm):

services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    # Option 1: Use runtime (requires nvidia-container-toolkit)
    runtime: nvidia
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./models:/models
      - ./logs:/logs
    ports:
      - "127.0.0.1:${VLLM_PORT}:8000"
    command: >
      --model ${MODEL_NAME}
      --tensor-parallel-size ${TENSOR_PARALLEL_SIZE}
      --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}
      --max-model-len ${MAX_MODEL_LEN}
      --api-key ${VLLM_API_KEY}
      --port 8000
      --host 0.0.0.0
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 300s
    restart: unless-stopped
    networks:
      - vllm-network
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"

  nginx:
    image: nginx:alpine
    container_name: nginx-proxy
    ports:
      - "${NGINX_PORT}:80"
      - "${NGINX_SSL_PORT}:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/logs:/var/log/nginx
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      - vllm
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
    networks:
      - vllm-network
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"

networks:
  vllm-network:
    driver: bridge

Notes: the output of the GPU passthrough test should look like this:

aleksandar@Ubuntu-2204-jammy-amd64-base:/opt/vllm/scripts$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
aece8493d397: Pull complete
5e3b7ee77381: Pull complete
5bd037f007fd: Pull complete
4cda774ad2ec: Pull complete
775f22adee62: Pull complete
Digest: sha256:f895871972c1c91eb6a896eee68468f40289395a1e58c492e1be7929d0f8703b
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
Thu Sep 25 10:20:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 4000 SFF Ada ...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   31C    P8              5W /   70W |       2MiB /  20475MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
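
The output above reports 20475MiB of GPU memory, and `.env` sets GPU_MEMORY_UTILIZATION=0.90, the fraction of GPU memory vLLM is allowed to use. As a rough worked example, the resulting budget can be computed as follows. The function name `vllm_memory_budget_mib` is hypothetical, and the result is only an estimate of the upper bound, not a measurement.

```shell
# Hypothetical helper: prints the memory budget in MiB (rounded down)
# for a given total GPU memory in MiB and a utilization fraction.
vllm_memory_budget_mib() {
    awk -v total="$1" -v frac="$2" 'BEGIN { printf "%d\n", total * frac }'
}

# Example with the values from this guide: 20475 MiB total, 0.90 utilization.
vllm_memory_budget_mib 20475 0.90   # prints 18427
```

Model weights, activations and the KV cache all have to fit into this budget, which is why MAX_MODEL_LEN is kept at 4096 on this card.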

Notes: extended nginx.conf. Compared with the configuration in section 4, this variant adds a request-timing log format, the X-Real-IP header, and WebSocket upgrade headers.

mkdir -p /opt/vllm/nginx
touch /opt/vllm/nginx/nginx.conf

events {
    worker_connections 1024;
}

http {
    upstream vllm_backend {
        server vllm:8000;
        keepalive 32;
    }

    # Rate limiting
    limit_req_zone $binary_remote_addr zone=api_limit:100m rate=100r/s;

    # Logging format
    log_format vllm_log '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" rt=$request_time '
                        'uct="$upstream_connect_time" uht="$upstream_header_time" '
                        'urt="$upstream_response_time"';

    server {
        listen 80;
        server_name _;

        access_log /var/log/nginx/vllm_access.log vllm_log;
        error_log /var/log/nginx/vllm_error.log warn;

        # Health check endpoint
        location /health {
            access_log off;
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }

        # vLLM API endpoints
        location / {
            limit_req zone=api_limit burst=20 nodelay;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            # Timeouts for LLM streaming
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;
            proxy_read_timeout 300s;

            # Buffer settings for streaming
            proxy_buffering off;
            proxy_cache off;

            # CORS headers (adjust as needed)
            add_header Access-Control-Allow-Origin * always;
            add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
            add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;
        }
    }
}

Notes: the startup script. Create it under /opt/vllm/scripts and run it from there.

...

set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"

cd "$PROJECT_DIR"

echo "=========================================="
echo "vLLM Deployment Startup"
echo "=========================================="

# Check if .env file exists
if [ ! -f .env ]; then
    echo "❌ Error: .env file not found!"
    echo "Please create .env file with your configuration"
    exit 1
fi

# Load environment variables
source .env

# Validate required environment variables
if [ -z "$HUGGING_FACE_HUB_TOKEN" ]; then
    echo "❌ Error: HUGGING_FACE_HUB_TOKEN not set in .env!"
    exit 1
fi

if [ -z "$VLLM_API_KEY" ]; then
    echo "⚠️ Warning: VLLM_API_KEY not set. Using default (not secure!)"
fi

# Check GPU availability
echo "🔍 Checking GPU availability..."
if ! nvidia-smi &> /dev/null; then
    echo "❌ Error: NVIDIA GPU not detected!"
    echo "Please ensure NVIDIA drivers are installed"
    exit 1
fi

echo "✅ GPU detected:"
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader

# Check Docker
echo "🔍 Checking Docker..."
if ! docker --version &> /dev/null; then
    echo "❌ Error: Docker not installed!"
    exit 1
fi

# Check Docker Compose
if ! docker compose version &> /dev/null; then
    echo "❌ Error: Docker Compose not installed!"
    exit 1
fi

# Create necessary directories
echo "📁 Creating directories..."
mkdir -p nginx/logs
mkdir -p logs
mkdir -p models

# Stop existing containers if running
echo "🔄 Stopping existing containers (if any)..."
docker compose down 2>/dev/null || true

# Pull latest images
echo "📥 Pulling Docker images..."
docker compose pull

# Start services
echo "🚀 Starting services..."
docker compose up -d

# Wait for services to be ready
echo "⏳ Waiting for services to start (this may take a few minutes for model loading)..."
sleep 10

# Check startup logs
echo "📋 Checking startup logs..."
echo "--- vLLM logs ---"
docker compose logs --tail=20 vllm

echo ""
echo "--- Nginx logs ---"
docker compose logs --tail=10 nginx

# Wait more for model loading
echo "⏳ Waiting for model to load completely..."
MAX_WAIT=300 # 5 minutes maximum
WAITED=0

while [ $WAITED -lt $MAX_WAIT ]; do
    if curl -s -f -H "Authorization: Bearer ${VLLM_API_KEY}" http://localhost:8000/health > /dev/null 2>&1; then
        echo "✅ vLLM is ready!"
        break
    fi
    echo -n "."
    sleep 5
    WAITED=$((WAITED + 5))
done

if [ $WAITED -ge $MAX_WAIT ]; then
    echo ""
    echo "⚠️ Warning: vLLM health check timeout. Check logs with: docker compose logs vllm"
fi

# Run health check
echo ""
echo "🏥 Running health check..."
python3 scripts/health_check.py

Sample API call:

curl -X POST http://${SERVER_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${VLLM_API_KEY}" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'

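
The request body for /v1/chat/completions is hand-written JSON with the model name hard-coded. When the model name comes from `.env`, the body can be built from variables instead, so that it always matches MODEL_NAME. The sketch below shows one way to do that. The function names `json_escape` and `chat_payload` are hypothetical, and the escaping only covers backslashes and double quotes in single-line text.

```shell
# Hypothetical helper: escapes backslashes and double quotes so that a
# single-line string can be embedded in a JSON string literal.
json_escape() {
    printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: prints a chat completion request body for the
# given model name ($1) and user message ($2).
chat_payload() {
    printf '{"model": "%s", "messages": [{"role": "user", "content": "%s"}]}\n' \
        "$(json_escape "$1")" "$(json_escape "$2")"
}

# Usage with the variables from .env (not executed here):
#   curl -X POST "http://${SERVER_IP}/v1/chat/completions" \
#     -H "Content-Type: application/json" \
#     -H "Authorization: Bearer ${VLLM_API_KEY}" \
#     -d "$(chat_payload "$MODEL_NAME" "Hello!")"
```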

7. Integration with Translate5

Go to Administration → Language Resources in Translate5.
Add a new Llama resource.
Configure:
- Base URL: http://your.server.ip or your domain if you have one
- API Key: use the value from .env (VLLM_API_KEY)
- Model: meta-llama/Meta-Llama-3.1-8B-Instruct
Test the connection.
Enable for desired language combinations.
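
Before entering the values in Translate5, it can help to confirm from another machine that the Base URL and API key work together. The sketch below queries the OpenAI-compatible model listing endpoint that vLLM serves at /v1/models. The function name `check_llm_endpoint` is hypothetical, and whether Translate5 itself calls this endpoint is not stated in this guide, so treat it only as a reachability and authentication check.

```shell
# Hypothetical helper: succeeds if GET <base_url>/v1/models answers with
# HTTP 200 when called with the given API key.
check_llm_endpoint() {
    llm_base_url="${1%/}"   # drop one trailing slash, if any
    llm_api_key="$2"
    llm_code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        -H "Authorization: Bearer ${llm_api_key}" "${llm_base_url}/v1/models")
    [ "$llm_code" = "200" ]
}

# Usage (not executed here):
#   check_llm_endpoint "http://your.server.ip" "$VLLM_API_KEY" && echo "endpoint reachable, key accepted"
```

A non-200 result means either that the server is unreachable or that the key was rejected. The Nginx access log and `docker compose logs vllm` show which of the two applies.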