This guide explains how to self-host Meta-Llama-3.1-8B-Instruct using vLLM on a Hetzner Ubuntu 22.04 server with an NVIDIA RTX 4000 SFF Ada GPU, and then use it as a Translate5 Language Resource.
1. System Preparation
Install Kernel Headers
Required for NVIDIA driver builds:
uname -r
dpkg -l | grep headers | grep $(uname -r)
If no headers are found:
sudo apt install -y linux-headers-$(uname -r)
Install NVIDIA Drivers
# Detect the recommended driver
ubuntu-drivers devices
# Remove old drivers, if any
sudo apt purge -y 'nvidia-*'
sudo apt autoremove -y
# Install prerequisites
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential
# Install the driver (580 recommended for RTX 4000 Ada)
sudo apt install -y nvidia-driver-580
# Reboot (important: the new driver is only loaded after a reboot)
sudo reboot
# Verify the installation: the nvidia kernel modules should be listed
lsmod | grep nvidia
# You should be able to see the GPU info
nvidia-smi
# Driver build status. The output should look something like:
# nvidia/580.65.06, 5.15.0-153-generic, x86_64: installed
dkms status
✅ Example output shows the driver, CUDA version, and GPU details.
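The driver verification can also be scripted, for example before an unattended deployment. The following is a minimal sketch that checks `dkms status` output for the running kernel. The function name `nvidia_module_installed` is made up for this example, and the regular expression assumes the two layouts dkms commonly prints (`nvidia/580.65.06, ...` and the older `nvidia, 580.65.06, ...`).

```python
import re

# Matches "nvidia/<version>, <kernel>, <arch>: <state>" and the older
# "nvidia, <version>, <kernel>, <arch>: <state>" layout.
DKMS_LINE = re.compile(r"nvidia[/,]\s*[\w.\-]+,\s*([^,]+),\s*[^:]+:\s*(\w+)")


def nvidia_module_installed(dkms_output: str, kernel: str) -> bool:
    """Return True if dkms reports an installed nvidia module for `kernel`.

    `dkms_output` is the text printed by `dkms status`; `kernel` is the
    value printed by `uname -r`.
    """
    for line in dkms_output.splitlines():
        match = DKMS_LINE.match(line.strip())
        if match and match.group(1).strip() == kernel and match.group(2) == "installed":
            return True
    return False
```

To use it on the server, pass in the output of `dkms status` and `uname -r`, for instance captured with `subprocess.run(..., capture_output=True, text=True)`.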
2. Docker & NVIDIA Runtime
Install Docker & Compose
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo apt install docker-compose-plugin -y
# Add your user to the docker group
sudo usermod -aG docker $USER
newgrp docker
Python for monitoring
# Install Python dependencies for monitoring
sudo apt install python3-pip -y
pip3 install requests psutil nvidia-ml-py tabulate
# Create project directory in /opt
sudo mkdir -p /opt/vllm
sudo chown -R $USER:$USER /opt/vllm
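The packages above are installed for monitoring, but the guide does not show a monitoring script. As one possible starting point, here is a minimal sketch that reads GPU metrics through the CSV query mode of `nvidia-smi`. It deliberately uses only the standard library instead of nvidia-ml-py, and the names `parse_gpu_csv` and `query_gpus` are hypothetical.

```python
import subprocess

# Fields requested from nvidia-smi, in the order they are parsed below.
QUERY_FIELDS = ["name", "memory.used", "memory.total", "utilization.gpu", "temperature.gpu"]


def parse_gpu_csv(csv_text: str) -> list[dict]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one GPU per line.

    Assumes every field is reported as a number; GPUs that print "[N/A]"
    for a field would need extra handling.
    """
    gpus = []
    for line in csv_text.strip().splitlines():
        name, mem_used, mem_total, util, temp = [v.strip() for v in line.split(",")]
        gpus.append({
            "name": name,
            "memory_used_mib": int(mem_used),
            "memory_total_mib": int(mem_total),
            "utilization_pct": int(util),
            "temperature_c": int(temp),
        })
    return gpus


def query_gpus() -> list[dict]:
    """Run nvidia-smi on this host and return the parsed per-GPU metrics."""
    cmd = [
        "nvidia-smi",
        f"--query-gpu={','.join(QUERY_FIELDS)}",
        "--format=csv,noheader,nounits",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_gpu_csv(out)
```

Calling `query_gpus()` in a loop and printing the result (for example with tabulate) would give a simple live view of memory use while the model loads.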
Install NVIDIA Container Toolkit
# Remove old packages
sudo apt-get remove nvidia-docker nvidia-docker2 nvidia-container-runtime
# Add repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU passthrough inside Docker
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
3. Project Setup
Create Directories
sudo mkdir -p /opt/vllm
sudo chown -R $USER:$USER /opt/vllm
cd /opt/vllm
Create .env Configuration
# HuggingFace
HUGGING_FACE_HUB_TOKEN=your_token_here
MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct
# vLLM
VLLM_PORT=8000
VLLM_API_KEY=your_random_secret
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.90
MAX_MODEL_LEN=4096
# Nginx
NGINX_PORT=80
NGINX_SSL_PORT=443
# Server
SERVER_IP=your.server.ip
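The values GPU_MEMORY_UTILIZATION=0.90 and MAX_MODEL_LEN=4096 can be sanity-checked with rough arithmetic. The sketch below is an estimate only, under stated assumptions: 20475 MiB of GPU memory (the figure nvidia-smi reports for this card), 16-bit weights, about 8.03 billion parameters, and the published Llama 3.1 8B architecture (32 layers, 8 key/value heads, head dimension 128). It ignores activations, CUDA context and other overhead, so the real KV cache capacity will be lower. The function names are made up for this example.

```python
MIB = 1024 ** 2


def kv_cache_bytes_per_token(layers: int = 32, kv_heads: int = 8,
                             head_dim: int = 128, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V tensor per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def memory_budget(total_mib: float = 20475, utilization: float = 0.90,
                  params: float = 8.03e9, dtype_bytes: int = 2) -> dict:
    """Estimate how much of the vLLM memory budget is left for the KV cache."""
    budget_mib = total_mib * utilization          # share vLLM may claim
    weights_mib = params * dtype_bytes / MIB      # model weights in 16-bit
    kv_free_mib = budget_mib - weights_mib        # upper bound for the KV cache
    tokens = int(kv_free_mib * MIB // kv_cache_bytes_per_token())
    return {
        "budget_mib": budget_mib,
        "weights_mib": weights_mib,
        "kv_free_mib": kv_free_mib,
        "kv_cache_tokens": tokens,
    }
```

Each token costs 131072 bytes (128 KiB) of KV cache under these assumptions, so a single 4096-token sequence needs about 512 MiB. The remaining budget after loading the weights is a few GiB, which is why a context length of 4096 is a workable setting on this 20 GB card while much longer contexts or many parallel requests would not be.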
Create docker-compose.yml
Located at /opt/vllm/docker-compose.yml. It contains the vLLM service and the Nginx reverse proxy (see full file in notes).
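Docker Compose fills `${VAR}` references in docker-compose.yml from .env, and an unset variable is replaced by an empty string with only a warning. A small pre-flight check can catch that before the containers start. The following is a minimal sketch with hypothetical function names; it only recognises the plain `${NAME}` style of reference used in this guide's compose file.

```python
import re
from pathlib import Path

# Captures NAME from ${NAME} (and from ${NAME:-default} style references).
VAR_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        env[key.strip()] = value.strip()
    return env


def missing_variables(compose_text: str, env: dict[str, str]) -> list[str]:
    """Names referenced in the compose file that are unset or empty in env."""
    referenced = set(VAR_REF.findall(compose_text))
    return sorted(name for name in referenced if not env.get(name))


def check_project(project_dir: str = "/opt/vllm") -> list[str]:
    """Run the check against the files of a project directory."""
    root = Path(project_dir)
    compose_text = (root / "docker-compose.yml").read_text()
    return missing_variables(compose_text, parse_env((root / ".env").read_text()))
```

An empty list from `check_project()` means every referenced variable has a value; anything else names the entries still to be added to .env.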
4. Nginx Configuration
Create the config file at /opt/vllm/nginx/nginx.conf.
Note: the current configuration does not use SSL!
events {
    worker_connections 1024;
}
http {
    upstream vllm_backend {
        server vllm:8000;
        keepalive 32;
    }
    limit_req_zone $binary_remote_addr zone=api_limit:100m rate=100r/s;
    server {
        listen 80;
        server_name _;
        access_log /var/log/nginx/vllm_access.log;
        error_log /var/log/nginx/vllm_error.log warn;
        location /health {
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }
        location / {
            limit_req zone=api_limit burst=20 nodelay;
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;
            proxy_read_timeout 300s;
            proxy_buffering off;
            proxy_cache off;
            add_header Access-Control-Allow-Origin * always;
            add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
            add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;
        }
    }
}
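The `limit_req` settings in this configuration allow a sustained 100 requests per second per client IP, plus a burst of 20 that `nodelay` forwards immediately instead of queueing. The sketch below is a simplified model of that accounting, not nginx's actual implementation (which works in millisecond resolution on shared memory); `simulate_limit_req` is a name made up for this illustration.

```python
def simulate_limit_req(arrival_times, rate: float = 100.0, burst: int = 20):
    """Return one accept (True) / reject (False) decision per request.

    Models a single client key. `arrival_times` are seconds in ascending
    order. The "excess" counter drains at `rate` per second and grows by
    one for every accepted request; a request is rejected when accepting
    it would push the excess above `burst`.
    """
    decisions = []
    excess = 0.0
    last = None
    for t in arrival_times:
        if last is None:
            candidate = 0.0  # first request from this client
        else:
            candidate = max(0.0, excess - rate * (t - last) + 1.0)
        if candidate > burst:
            decisions.append(False)  # rejected, state is left unchanged
        else:
            decisions.append(True)
            excess, last = candidate, t
    return decisions
```

Under this model, 30 requests arriving at the same instant give 21 accepted (the first request plus the burst of 20) and 9 rejected, while requests spaced 10 ms apart are all accepted. nginx answers rejected requests with status 503 unless `limit_req_status` is set.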
5. Deployment Script
Create /opt/vllm/scripts/startup.sh (see full script in notes).
It handles:
Env validation
GPU check (nvidia-smi)
Docker service checks
Container startup
Health checks (/health)
Run it:
bash /opt/vllm/scripts/startup.sh
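The startup script in the notes finishes by calling `python3 scripts/health_check.py`, a file this guide does not include. The following is a minimal sketch of what such a script could look like, not the original. It assumes vLLM's OpenAI-compatible endpoints `/health` and `/v1/models`, uses only the standard library, and all function names are hypothetical.

```python
import json
import os
import urllib.error
import urllib.request


def fetch(url: str, api_key: str = "", timeout: float = 10.0,
          opener=urllib.request.urlopen):
    """Return (status_code, body_text); status 0 means no HTTP response."""
    headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}
    request = urllib.request.Request(url, headers=headers)
    try:
        with opener(request, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as exc:
        return exc.code, ""
    except (urllib.error.URLError, OSError) as exc:
        return 0, str(exc)


def run_checks(base_url: str, api_key: str, opener=urllib.request.urlopen) -> dict:
    """Check the health endpoint and list the models the server exposes."""
    status, _ = fetch(f"{base_url}/health", api_key, opener=opener)
    results = {"health": status == 200}
    status, body = fetch(f"{base_url}/v1/models", api_key, opener=opener)
    results["models"] = (
        [m["id"] for m in json.loads(body).get("data", [])] if status == 200 else []
    )
    return results


def main() -> int:
    """Exit code 0 if the server is healthy and serves at least one model."""
    base_url = f"http://localhost:{os.environ.get('VLLM_PORT', '8000')}"
    results = run_checks(base_url, os.environ.get("VLLM_API_KEY", ""))
    print(json.dumps(results, indent=2))
    return 0 if results["health"] and results["models"] else 1
```

To run it as a script, end the file with `raise SystemExit(main())`. Note that the startup script loads .env with `source .env` without exporting the variables, so VLLM_API_KEY would have to be exported (or .env loaded with `set -a`) for a child Python process to see it.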
6. Verify Deployment
Logs:
docker compose logs -f vllm
docker compose logs -f nginx
Health check:
curl -s -H "Authorization: Bearer $VLLM_API_KEY" http://localhost:8000/health
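Sample API call:
curl -X POST http://${SERVER_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${VLLM_API_KEY}" \
  -d '{ "model": "meta-llama/Meta-Llama-3.1-8B-Instruct", "messages": [{"role": "user", "content": "Hello!"}] }'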
Notes: example nvidia-smi output from the driver verification in step 1:
aleksandar@Ubuntu-2204-jammy-amd64-base:~$ nvidia-smi
Thu Sep 25 09:55:20 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 4000 SFF Ada ...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   48C    P8              7W /   70W |       2MiB /  20475MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Example dkms status output:
nvidia/580.65.06, 5.15.0-153-generic, x86_64: installed
Full docker-compose.yml (create it in /opt/vllm):
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    # Option 1: Use runtime (requires nvidia-container-toolkit)
    runtime: nvidia
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./models:/models
      - ./logs:/logs
    ports:
      - "127.0.0.1:${VLLM_PORT}:8000"
    command: >
      --model ${MODEL_NAME}
      --tensor-parallel-size ${TENSOR_PARALLEL_SIZE}
      --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}
      --max-model-len ${MAX_MODEL_LEN}
      --api-key ${VLLM_API_KEY}
      --port 8000
      --host 0.0.0.0
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 300s
    restart: unless-stopped
    networks:
      - vllm-network
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"
  nginx:
    image: nginx:alpine
    container_name: nginx-proxy
    ports:
      - "${NGINX_PORT}:80"
      - "${NGINX_SSL_PORT}:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/logs:/var/log/nginx
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      - vllm
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
    networks:
      - vllm-network
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"
networks:
  vllm-network:
    driver: bridge
Example output of the GPU passthrough test from step 2:
aleksandar@Ubuntu-2204-jammy-amd64-base:/opt/vllm/scripts$ docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Unable to find image 'nvidia/cuda:11.8.0-base-ubuntu22.04' locally
11.8.0-base-ubuntu22.04: Pulling from nvidia/cuda
aece8493d397: Pull complete
5e3b7ee77381: Pull complete
5bd037f007fd: Pull complete
4cda774ad2ec: Pull complete
775f22adee62: Pull complete
Digest: sha256:f895871972c1c91eb6a896eee68468f40289395a1e58c492e1be7929d0f8703b
Status: Downloaded newer image for nvidia/cuda:11.8.0-base-ubuntu22.04
Thu Sep 25 10:20:14 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.65.06              Driver Version: 580.65.06      CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA RTX 4000 SFF Ada ...    Off |   00000000:01:00.0 Off |                  Off |
| 30%   31C    P8              5W /   70W |       2MiB /  20475MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Extended nginx.conf (the configuration from step 4 plus a custom log format, upgrade headers and X-Real-IP). Create the file first:
mkdir -p /opt/vllm/nginx
touch /opt/vllm/nginx/nginx.conf
events {
    worker_connections 1024;
}
http {
    upstream vllm_backend {
        server vllm:8000;
        keepalive 32;
    }
    # Rate limiting
    limit_req_zone $binary_remote_addr zone=api_limit:100m rate=100r/s;
    # Logging format
    log_format vllm_log '$remote_addr - $remote_user [$time_local] "$request" '
                        '$status $body_bytes_sent "$http_referer" '
                        '"$http_user_agent" rt=$request_time '
                        'uct="$upstream_connect_time" uht="$upstream_header_time" '
                        'urt="$upstream_response_time"';
    server {
        listen 80;
        server_name _;
        access_log /var/log/nginx/vllm_access.log vllm_log;
        error_log /var/log/nginx/vllm_error.log warn;
        # Health check endpoint
        location /health {
            access_log off;
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }
        # vLLM API endpoints
        location / {
            limit_req zone=api_limit burst=20 nodelay;
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header Host $host;
            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            # Timeouts for LLM streaming
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;
            proxy_read_timeout 300s;
            # Buffer settings for streaming
            proxy_buffering off;
            proxy_cache off;
            # CORS headers (adjust as needed)
            add_header Access-Control-Allow-Origin * always;
            add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
            add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;
        }
    }
}
Finally, the startup script. Create it under /opt/vllm/scripts and run it from there.
...
set -e
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
cd "$PROJECT_DIR"
echo "=========================================="
echo "vLLM Deployment Startup"
echo "=========================================="
# Check if .env file exists
if [ ! -f .env ]; then
    echo "❌ Error: .env file not found!"
    echo "Please create .env file with your configuration"
    exit 1
fi
# Load environment variables
source .env
# Validate required environment variables
if [ -z "$HUGGING_FACE_HUB_TOKEN" ]; then
    echo "❌ Error: HUGGING_FACE_HUB_TOKEN not set in .env!"
    exit 1
fi
if [ -z "$VLLM_API_KEY" ]; then
    echo "⚠️ Warning: VLLM_API_KEY not set. Using default (not secure!)"
fi
# Check GPU availability
echo "🔍 Checking GPU availability..."
if ! nvidia-smi &> /dev/null; then
    echo "❌ Error: NVIDIA GPU not detected!"
    echo "Please ensure NVIDIA drivers are installed"
    exit 1
fi
echo "✅ GPU detected:"
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader
# Check Docker
echo "🔍 Checking Docker..."
if ! docker --version &> /dev/null; then
    echo "❌ Error: Docker not installed!"
    exit 1
fi
# Check Docker Compose
if ! docker compose version &> /dev/null; then
    echo "❌ Error: Docker Compose not installed!"
    exit 1
fi
# Create necessary directories
echo "📁 Creating directories..."
mkdir -p nginx/logs
mkdir -p logs
mkdir -p models
# Stop existing containers if running
echo "🔄 Stopping existing containers (if any)..."
docker compose down 2>/dev/null || true
# Pull latest images
echo "📥 Pulling Docker images..."
docker compose pull
# Start services
echo "🚀 Starting services..."
docker compose up -d
# Wait for services to be ready
echo "⏳ Waiting for services to start (this may take a few minutes for model loading)..."
sleep 10
# Check startup logs
echo "📋 Checking startup logs..."
echo "--- vLLM logs ---"
docker compose logs --tail=20 vllm
echo ""
echo "--- Nginx logs ---"
docker compose logs --tail=10 nginx
# Wait more for model loading
echo "⏳ Waiting for model to load completely..."
MAX_WAIT=300 # 5 minutes maximum
WAITED=0
while [ $WAITED -lt $MAX_WAIT ]; do
    if curl -s -f -H "Authorization: Bearer ${VLLM_API_KEY}" http://localhost:8000/health > /dev/null 2>&1; then
        echo "✅ vLLM is ready!"
        break
    fi
    echo -n "."
    sleep 5
    WAITED=$((WAITED + 5))
done
if [ $WAITED -ge $MAX_WAIT ]; then
    echo ""
    echo "⚠️ Warning: vLLM health check timeout. Check logs with: docker compose logs vllm"
fi
# Run health check
echo ""
echo "🏥 Running health check..."
python3 scripts/health_check.py
7. Integration with Translate5
Go to Administration → Language Resources in Translate5.
Add a new Llama resource.
Configure:
Base URL: http://your.server.ip (or your domain, if you have one)
API Key: the VLLM_API_KEY value from .env
Model: meta-llama/Meta-Llama-3.1-8B-Instruct
Test the connection.
Enable for desired language combinations.
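Before relying on the connection test in Translate5, it can help to send one translation-style request to the endpoint yourself. The sketch below builds such a request against the OpenAI-compatible `/v1/chat/completions` route used earlier in this guide. The prompt wording is purely an assumption for illustration; the prompt Translate5 actually sends is not documented here, and the function names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, text: str,
                       source_lang: str, target_lang: str) -> urllib.request.Request:
    """Build a POST request asking the model to translate `text`."""
    payload = {
        "model": model,
        "messages": [
            # Hypothetical instruction, not the prompt Translate5 uses.
            {"role": "system",
             "content": f"Translate from {source_lang} to {target_lang}. "
                        "Reply with the translation only."},
            {"role": "user", "content": text},
        ],
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def extract_translation(response_body: str) -> str:
    """Pull the assistant message out of a chat completions response."""
    return json.loads(response_body)["choices"][0]["message"]["content"].strip()
```

Sending the request with `urllib.request.urlopen(request, timeout=300)` and passing the decoded body to `extract_translation` should return the translated text if the base URL, API key and model name match the values entered in Translate5.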