This guide explains how to configure and self-host Meta-Llama-3.1-8B-Instruct using vLLM on a Hetzner Ubuntu 22.04 server with an NVIDIA RTX 4000 SFF Ada GPU, and then use it as a Translate5 Language Resource.
1. System Preparation
Install Kernel Headers
Required for NVIDIA driver builds:
uname -r
dpkg -l | grep headers | grep $(uname -r)
If no headers are found:
sudo apt install -y linux-headers-$(uname -r)
Install NVIDIA Drivers
# Detect recommended driver
ubuntu-drivers devices
# Remove old drivers if any
sudo apt purge -y 'nvidia-*'
sudo apt autoremove -y
# Install prerequisites
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential
# Install driver (580 recommended for RTX 4000 Ada)
sudo apt install -y nvidia-driver-580
# Reboot is important
sudo reboot
# Verify installation:
# lsmod | grep nvidia
# You should be able to see the GPU info
nvidia-smi
# Driver build status. Output something like: nvidia/580.65.06, 5.15.0-153-generic, x86_64: installed
dkms status
✅ The nvidia-smi output should list the driver version, CUDA version, and GPU details.
2. Docker & NVIDIA Runtime
Install Docker & Compose
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo apt install docker-compose-plugin -y
# Add user to Docker group:
sudo usermod -aG docker $USER
newgrp docker
Python for monitoring
# Install Python dependencies for monitoring
sudo apt install python3-pip -y
pip3 install requests psutil nvidia-ml-py tabulate
# Create project directory in /opt
sudo mkdir -p /opt/vllm
sudo chown -R $USER:$USER /opt/vllm
Install NVIDIA Container Toolkit
# Remove old packages
sudo apt-get remove nvidia-docker nvidia-docker2 nvidia-container-runtime
# Add repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
# Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Test GPU passthrough inside Docker:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
3. Project Setup
Create Directories
sudo mkdir -p /opt/vllm
sudo chown -R $USER:$USER /opt/vllm
cd /opt/vllm
Create .env
Configuration
# HuggingFace
HUGGING_FACE_HUB_TOKEN=your_token_here
MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct
# vLLM
VLLM_PORT=8000
VLLM_API_KEY=your_random_secret
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.90
MAX_MODEL_LEN=4096
# Nginx
NGINX_PORT=80
NGINX_SSL_PORT=443
# Server
SERVER_IP=your.server.ip
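The placeholder values above (your_token_here, your_random_secret, your.server.ip) have to be replaced before the first start. The following is a minimal sketch, assuming a POSIX shell; the function names gen_secret and missing_env_keys are hypothetical helpers and not part of vLLM or Translate5. It shows one way to generate a random value for VLLM_API_KEY and to list keys that are still missing or left at their placeholder.

```shell
#!/bin/sh
# Hypothetical helpers for the .env file shown above (illustrative names).

# Print a 64-character hex string that can be used as VLLM_API_KEY.
gen_secret() {
  head -c 32 /dev/urandom | od -An -tx1 | tr -dc '0-9a-f'
  echo
}

# Print every required key that is absent, empty, or still a placeholder
# in the env file given as $1. Prints nothing if all keys look filled in.
missing_env_keys() {
  for mek_key in HUGGING_FACE_HUB_TOKEN MODEL_NAME VLLM_API_KEY SERVER_IP; do
    # Take the last assignment of the key and keep everything after the first "=".
    mek_value="$(grep -E "^${mek_key}=" "$1" | tail -n 1 | cut -d= -f2-)"
    case "$mek_value" in
      ""|your_token_here|your_random_secret|your.server.ip) echo "$mek_key" ;;
    esac
  done
}
```

Possible usage: run gen_secret once, paste the result into .env as VLLM_API_KEY, then call missing_env_keys /opt/vllm/.env and confirm it prints nothing.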
Create docker-compose.yml
Located at /opt/vllm/docker-compose.yml. It contains the vLLM service and the Nginx reverse proxy (see full file in notes).
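For orientation only, a compose file with these two services might look roughly like the sketch below. This is not the file from the notes: the image tags, the ./models cache path, the choice to publish vLLM only on 127.0.0.1, and the function name write_compose_sketch are all assumptions. The vLLM flags used (--model, --api-key, --tensor-parallel-size, --gpu-memory-utilization, --max-model-len) map onto the variables defined in .env, which Docker Compose substitutes when the file sits next to .env.

```shell
#!/bin/sh
# Hypothetical sketch: writes an illustrative docker-compose.yml into the directory $1.
# The quoted 'EOF' keeps ${...} literal so that Docker Compose fills it in from .env.
write_compose_sketch() {
  mkdir -p "$1" || return 1
  cat > "$1/docker-compose.yml" <<'EOF'
services:
  vllm:
    image: vllm/vllm-openai:latest          # assumed image tag
    ipc: host
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    volumes:
      - ./models:/root/.cache/huggingface   # assumed cache location
    command: >
      --model ${MODEL_NAME}
      --api-key ${VLLM_API_KEY}
      --tensor-parallel-size ${TENSOR_PARALLEL_SIZE}
      --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}
      --max-model-len ${MAX_MODEL_LEN}
    ports:
      - "127.0.0.1:${VLLM_PORT}:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
  nginx:
    image: nginx:stable                     # assumed image tag
    depends_on:
      - vllm
    ports:
      - "${NGINX_PORT}:80"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
EOF
}
```

The sketch passes only the Hugging Face token into the container rather than the whole .env, so that unrelated variables such as VLLM_PORT do not leak into the vLLM process environment.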
4. Nginx Configuration
Create the config file at /opt/vllm/nginx/nginx.conf.
Note: the current configuration does not use SSL!
events {
    worker_connections 1024;
}
http {
    upstream vllm_backend {
        server vllm:8000;
        keepalive 32;
    }
    limit_req_zone $binary_remote_addr zone=api_limit:100m rate=100r/s;
    server {
        listen 80;
        server_name _;
        access_log /var/log/nginx/vllm_access.log;
        error_log /var/log/nginx/vllm_error.log warn;
        location /health {
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }
        location / {
            limit_req zone=api_limit burst=20 nodelay;
            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;
            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;
            proxy_read_timeout 300s;
            proxy_buffering off;
            proxy_cache off;
            add_header Access-Control-Allow-Origin * always;
            add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
            add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;
        }
    }
}
5. Deployment Script
Create /opt/vllm/scripts/startup.sh (see full script in notes).
It handles:
Env validation
GPU check (nvidia-smi)
Docker service checks
Container startup
Health checks (/health)
Run it:
bash /opt/vllm/scripts/startup.sh
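The steps the script handles could be sketched roughly as follows. This is a hypothetical outline and not the script from the notes: the helper names require_cmd and wait_until, the retry counts, and the "run" argument are assumptions made for illustration. It polls /health repeatedly because the model can take several minutes to load after the container starts.

```shell
#!/bin/sh
# Hypothetical sketch of the listed checks; NOT the startup.sh from the notes.

# Fail with a message if a required command is not on PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || { echo "missing command: $1" >&2; return 1; }
}

# Retry a command: wait_until ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until() {
  wu_left="$1"; wu_delay="$2"; shift 2
  while [ "$wu_left" -gt 0 ]; do
    "$@" && return 0
    wu_left=$((wu_left - 1))
    sleep "$wu_delay"
  done
  return 1
}

main() {
  cd /opt/vllm || return 1
  # Env validation: .env must exist and define the API key.
  [ -f .env ] || { echo ".env not found in /opt/vllm" >&2; return 1; }
  set -a; . ./.env; set +a
  [ -n "${VLLM_API_KEY:-}" ] || { echo "VLLM_API_KEY is empty" >&2; return 1; }
  # GPU check
  require_cmd nvidia-smi || return 1
  nvidia-smi >/dev/null || return 1
  # Docker service check
  require_cmd docker || return 1
  docker info >/dev/null 2>&1 || { echo "Docker daemon not reachable" >&2; return 1; }
  # Container startup
  docker compose up -d || return 1
  # Health check: up to 60 attempts, 10 seconds apart.
  wait_until 60 10 curl -fsS -o /dev/null \
    -H "Authorization: Bearer $VLLM_API_KEY" \
    "http://localhost:${VLLM_PORT:-8000}/health"
}

# Run the steps only when called with the argument "run",
# so the helpers can also be sourced on their own.
if [ "${1:-}" = "run" ]; then
  main
fi
```

Saved under a separate, hypothetical name such as startup-sketch.sh, it would be started with sh startup-sketch.sh run.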
6. Verify Deployment
Logs:
docker compose logs -f vllm
docker compose logs -f nginx
Health check:
curl -s -H "Authorization: Bearer $VLLM_API_KEY" http://localhost:8000/health
Sample API call:
curl -X POST http://${SERVER_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${VLLM_API_KEY}" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
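The curl commands in this section expand $VLLM_API_KEY and ${SERVER_IP} from the current shell, which does not read /opt/vllm/.env by itself. A minimal sketch for loading the file first, assuming it contains only comments and plain KEY=VALUE lines as in the example configuration; load_env is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: export every KEY=VALUE pair from an env file into the current shell.
load_env() {
  [ -f "$1" ] || { echo "env file not found: $1" >&2; return 1; }
  set -a                      # auto-export every variable assigned while sourcing
  case "$1" in
    */*) . "$1" ;;            # path contains a slash: source it as given
    *)   . "./$1" ;;          # bare file name: avoid a PATH lookup by "."
  esac
  set +a
}
```

After load_env /opt/vllm/.env, the health check and the sample API call can be run unchanged in the same shell session.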
7. Integration with Translate5
Go to Administration → Language Resources in Translate5.
Add a new Llama resource.
Configure:
Base URL: http://your.server.ip (or your domain, if you have one)
API Key: use the value of VLLM_API_KEY from .env
Model: meta-llama/Meta-Llama-3.1-8B-Instruct
Test the connection.
Enable for desired language combinations.
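Before entering these values in Translate5, it can help to confirm from the Translate5 host that the base URL, API key, and model name work together. The sketch below queries the OpenAI-compatible /v1/models endpoint through Nginx; it uses that endpoint rather than /health because the Nginx configuration above answers /health itself without contacting vLLM. The function names model_listed and check_endpoint are hypothetical, and the match is a plain text search rather than real JSON parsing.

```shell
#!/bin/sh
# Hypothetical pre-check for the values entered in Translate5.

# Succeed if the JSON on stdin contains the quoted model id $1 (text match only).
model_listed() {
  grep -Fq "\"$1\""
}

# check_endpoint BASE_URL API_KEY MODEL
# Succeeds only if the server is reachable, accepts the key, and lists the model.
check_endpoint() {
  curl -fsS -H "Authorization: Bearer $2" "$1/v1/models" | model_listed "$3"
}
```

Example call: check_endpoint "http://your.server.ip" "$VLLM_API_KEY" "meta-llama/Meta-Llama-3.1-8B-Instruct" && echo "ready for Translate5".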