This guide explains how to configure and self-host Meta-Llama-3.1-8B-Instruct using vLLM on a Hetzner Ubuntu 22.04 server with an NVIDIA RTX 4000 SFF Ada GPU, and then use it as a Translate5 Language Resource.
1. System Preparation
Install Kernel Headers
Required for NVIDIA driver builds:
uname -r
dpkg -l | grep headers | grep $(uname -r)
If no headers are found:
sudo apt install -y linux-headers-$(uname -r)
Install NVIDIA Drivers
# Detect recommended driver
ubuntu-drivers devices

# Remove old drivers if any
sudo apt purge -y 'nvidia-*'
sudo apt autoremove -y

# Install prerequisites
sudo apt update
sudo apt install -y linux-headers-$(uname -r) build-essential

# Install driver (580 recommended for RTX 4000 Ada)
sudo apt install -y nvidia-driver-580

# Reboot is important
sudo reboot

# Verify installation:
# lsmod | grep nvidia

# You should be able to see the GPU info
nvidia-smi

# Driver build status; dkms should report something like: nvidia/580.65.06, 5.15.0-153-generic, x86_64: installed
dkms status
Example output shows the driver, CUDA version, and GPU details.
2. Docker & NVIDIA Runtime
Install Docker & Compose
curl -fsSL https://get.docker.com -o get-docker.sh
sudo sh get-docker.sh
sudo apt install docker-compose-plugin -y

# Add user to Docker group:
sudo usermod -aG docker $USER
newgrp docker
Python for monitoring
# Install Python dependencies for monitoring
sudo apt install python3-pip -y
pip3 install requests psutil nvidia-ml-py tabulate

# Create project directory in /opt
sudo mkdir -p /opt/vllm
sudo chown -R $USER:$USER /opt/vllm
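These packages are used by the helper scripts that the startup script in section 5 calls (scripts/health_check.py and scripts/monitor.py), which are not listed in this guide. A minimal sketch of what scripts/monitor.py could look like, built only on the packages just installed (its exact contents are an assumption), is:

#!/usr/bin/env python3
# Sketch of /opt/vllm/scripts/monitor.py (assumed contents): print GPU and host
# utilization every few seconds using nvidia-ml-py (pynvml), psutil and tabulate.
import time

import psutil
import pynvml  # provided by the nvidia-ml-py package
from tabulate import tabulate

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # single-GPU server

try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        rows = [
            ["GPU utilization %", util.gpu],
            ["GPU memory used / total (MiB)", f"{mem.used // 2**20} / {mem.total // 2**20}"],
            ["GPU temperature (C)", temp],
            ["CPU utilization %", psutil.cpu_percent()],
            ["RAM used %", psutil.virtual_memory().percent],
        ]
        print(tabulate(rows, headers=["Metric", "Value"]))
        time.sleep(5)
except KeyboardInterrupt:
    pass
finally:
    pynvml.nvmlShutdown()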
Install NVIDIA Container Toolkit
# Remove old packages
sudo apt-get remove nvidia-docker nvidia-docker2 nvidia-container-runtime

# Add repo
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

# Install toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

# Configure Docker runtime
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker

# Test GPU passthrough inside Docker:
docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
3. Project Setup
Create Directories
sudo mkdir -p /opt/vllm
sudo chown -R $USER:$USER /opt/vllm
cd /opt/vllm
Create .env
Located at /opt/vllm/.env and holds all deployment settings:
# HuggingFace
HUGGING_FACE_HUB_TOKEN=your_token_here
MODEL_NAME=meta-llama/Meta-Llama-3.1-8B-Instruct

# vLLM
VLLM_PORT=8000
VLLM_API_KEY=your_random_secret
TENSOR_PARALLEL_SIZE=1
GPU_MEMORY_UTILIZATION=0.90
MAX_MODEL_LEN=4096

# Nginx
NGINX_PORT=80
NGINX_SSL_PORT=443

# Server
SERVER_IP=your.server.ip
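VLLM_API_KEY is sent by API clients as a Bearer token, so use a long random value. One way to generate one (any strong random string works):

python3 -c "import secrets; print(secrets.token_hex(32))"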
Create docker-compose.yml
Located at /opt/vllm/docker-compose.yml. It contains the vLLM service and an Nginx reverse proxy; note that the vLLM port is bound to 127.0.0.1 only, so external traffic reaches the model solely through Nginx.
services:
  vllm:
    image: vllm/vllm-openai:latest
    container_name: vllm-server
    # Runtime (requires nvidia-container-toolkit)
    runtime: nvidia
    environment:
      - HUGGING_FACE_HUB_TOKEN=${HUGGING_FACE_HUB_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
      - ./models:/models
      - ./logs:/logs
    ports:
      - "127.0.0.1:${VLLM_PORT}:8000"
    command: >
      --model ${MODEL_NAME}
      --tensor-parallel-size ${TENSOR_PARALLEL_SIZE}
      --gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}
      --max-model-len ${MAX_MODEL_LEN}
      --api-key ${VLLM_API_KEY}
      --port 8000
      --host 0.0.0.0
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 300s
    restart: unless-stopped
    networks:
      - vllm-network
    logging:
      driver: "json-file"
      options:
        max-size: "100m"
        max-file: "10"

  nginx:
    image: nginx:alpine
    container_name: nginx-proxy
    ports:
      - "${NGINX_PORT}:80"
      - "${NGINX_SSL_PORT}:443"
    volumes:
      - ./nginx/nginx.conf:/etc/nginx/nginx.conf:ro
      - ./nginx/logs:/var/log/nginx
      - ./nginx/ssl:/etc/nginx/ssl:ro
    depends_on:
      - vllm
    healthcheck:
      test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost/health"]
      interval: 30s
      timeout: 10s
      retries: 3
    restart: unless-stopped
    networks:
      - vllm-network
    logging:
      driver: "json-file"
      options:
        max-size: "50m"
        max-file: "5"

networks:
  vllm-network:
    driver: bridge
4. Nginx Configuration
Create the config file at /opt/vllm/nginx/nginx.conf:
Note: the current configuration does not use SSL!
events {
    worker_connections 1024;
}

http {
    upstream vllm_backend {
        server vllm:8000;
        keepalive 32;
    }

    limit_req_zone $binary_remote_addr zone=api_limit:100m rate=100r/s;

    server {
        listen 80;
        server_name _;

        access_log /var/log/nginx/vllm_access.log;
        error_log /var/log/nginx/vllm_error.log warn;

        location /health {
            return 200 "healthy\n";
            add_header Content-Type text/plain;
        }

        location / {
            limit_req zone=api_limit burst=20 nodelay;

            proxy_pass http://vllm_backend;
            proxy_http_version 1.1;
            proxy_set_header Host $host;
            proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
            proxy_set_header X-Forwarded-Proto $scheme;

            proxy_connect_timeout 60s;
            proxy_send_timeout 300s;
            proxy_read_timeout 300s;

            proxy_buffering off;
            proxy_cache off;

            add_header Access-Control-Allow-Origin * always;
            add_header Access-Control-Allow-Methods "GET, POST, OPTIONS" always;
            add_header Access-Control-Allow-Headers "Authorization, Content-Type" always;
        }
    }
}
5. Deployment Script
Create /opt/vllm/scripts/startup.sh
#!/bin/bash
set -e

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_DIR="$(dirname "$SCRIPT_DIR")"
cd "$PROJECT_DIR"

echo "=========================================="
echo "vLLM Deployment Startup"
echo "=========================================="

# Check if .env file exists
if [ ! -f .env ]; then
    echo "Error: .env file not found!"
    echo "Please create .env file with your configuration"
    exit 1
fi

# Load environment variables
source .env

# Validate required environment variables
if [ -z "$HUGGING_FACE_HUB_TOKEN" ]; then
    echo "Error: HUGGING_FACE_HUB_TOKEN not set in .env!"
    exit 1
fi

if [ -z "$VLLM_API_KEY" ]; then
    echo "Warning: VLLM_API_KEY not set. Using default (not secure!)"
fi

# Check GPU availability
echo "Checking GPU availability..."
if ! nvidia-smi &> /dev/null; then
    echo "Error: NVIDIA GPU not detected!"
    echo "Please ensure NVIDIA drivers are installed"
    exit 1
fi
echo "GPU detected:"
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv,noheader

# Check Docker
echo "Checking Docker..."
if ! docker --version &> /dev/null; then
    echo "Error: Docker not installed!"
    exit 1
fi

# Check Docker Compose
if ! docker compose version &> /dev/null; then
    echo "Error: Docker Compose not installed!"
    exit 1
fi

# Create necessary directories
echo "Creating directories..."
mkdir -p nginx/logs
mkdir -p logs
mkdir -p models

# Stop existing containers if running
echo "Stopping existing containers (if any)..."
docker compose down 2>/dev/null || true

# Pull latest images
echo "Pulling Docker images..."
docker compose pull

# Start services
echo "Starting services..."
docker compose up -d

# Wait for services to be ready
echo "Waiting for services to start (this may take a few minutes for model loading)..."
sleep 10

# Check startup logs
echo "Checking startup logs..."
echo "--- vLLM logs ---"
docker compose logs --tail=20 vllm
echo ""
echo "--- Nginx logs ---"
docker compose logs --tail=10 nginx

# Wait more for model loading
echo "Waiting for model to load completely..."
MAX_WAIT=300  # 5 minutes maximum
WAITED=0
while [ $WAITED -lt $MAX_WAIT ]; do
    if curl -s -f -H "Authorization: Bearer ${VLLM_API_KEY}" http://localhost:8000/health > /dev/null 2>&1; then
        echo "vLLM is ready!"
        break
    fi
    echo -n "."
    sleep 5
    WAITED=$((WAITED + 5))
done

if [ $WAITED -ge $MAX_WAIT ]; then
    echo ""
    echo "Warning: vLLM health check timeout. Check logs with: docker compose logs vllm"
fi

# Run health check
echo ""
echo "Running health check..."
python3 scripts/health_check.py

echo ""
echo "=========================================="
echo " Deployment Complete!"
echo "=========================================="
echo ""
echo " Access Points:"
echo "  - vLLM API: http://${SERVER_IP}"
echo "  - API Docs: http://${SERVER_IP}/docs"
echo ""
echo " Useful Commands:"
echo "  - View logs: docker compose logs -f [vllm|nginx]"
echo "  - Health check: python3 scripts/health_check.py"
echo "  - Start monitoring: python3 scripts/monitor.py"
echo "  - Stop services: docker compose down"
echo "  - Restart services: docker compose restart"
echo ""
echo " API Usage:"
echo "  curl -X POST http://${SERVER_IP}/v1/chat/completions \\"
echo "    -H \"Content-Type: application/json\" \\"
echo "    -H \"Authorization: Bearer ${VLLM_API_KEY}\" \\"
echo "    -d '{\"model\": \"${MODEL_NAME}\", \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}]}'"
echo ""
It handles:
Env validation
GPU check (nvidia-smi)
Docker service checks
Container startup
Health checks (/health; a sketch of scripts/health_check.py follows below)
Run it:
bash /opt/vllm/scripts/startup.sh
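The startup script calls scripts/health_check.py, which is not listed in this guide. A minimal sketch of such a check (assumed behavior: read the API key and ports from .env, then probe vLLM directly and through the Nginx proxy) could be:

#!/usr/bin/env python3
# Sketch of /opt/vllm/scripts/health_check.py (assumed contents): probe the vLLM
# container directly and through the Nginx proxy, exit non-zero on failure.
import sys
from pathlib import Path

import requests


def read_env(path="/opt/vllm/.env"):
    """Parse simple KEY=VALUE lines, skipping comments and blanks."""
    env = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if line and not line.startswith("#") and "=" in line:
            key, _, value = line.partition("=")
            env[key.strip()] = value.strip()
    return env


env = read_env()
headers = {"Authorization": f"Bearer {env.get('VLLM_API_KEY', '')}"}
vllm_port = env.get("VLLM_PORT", "8000")
nginx_port = env.get("NGINX_PORT", "80")

checks = {
    "vLLM /health (direct)": f"http://localhost:{vllm_port}/health",
    "vLLM /v1/models (direct)": f"http://localhost:{vllm_port}/v1/models",
    "Nginx /health (proxy)": f"http://localhost:{nginx_port}/health",
}

failed = False
for name, url in checks.items():
    try:
        status = requests.get(url, headers=headers, timeout=10).status_code
    except requests.RequestException as exc:
        print(f"FAIL {name}: {exc}")
        failed = True
        continue
    if status == 200:
        print(f"OK   {name}")
    else:
        print(f"FAIL {name}: HTTP {status}")
        failed = True

sys.exit(1 if failed else 0)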
6. Verify Deployment
Logs:
docker compose logs -f vllm
docker compose logs -f nginx
Health check:
curl -s -H "Authorization: Bearer $VLLM_API_KEY" http://localhost:8000/health
Sample API call:
curl -X POST http://${SERVER_IP}/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ${VLLM_API_KEY}" \
  -d '{
    "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
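Since vLLM exposes an OpenAI-compatible API, the same request can be made with the official openai Python client (install it with pip install openai); the server address and key below are placeholders to replace with your .env values:

from openai import OpenAI

# Point the client at the self-hosted endpoint instead of api.openai.com.
client = OpenAI(
    base_url="http://your.server.ip/v1",  # Nginx proxy in front of vLLM
    api_key="your_random_secret",         # VLLM_API_KEY from .env
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=128,
)
print(response.choices[0].message.content)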
7. Integration with Translate5
Go to Administration → Language Resources in Translate5.
Add a new Llama resource.
Configure:
Base URL: http://your.server.ip (or your domain if you have one)
API Key: the value of VLLM_API_KEY from .env
Model: meta-llama/Meta-Llama-3.1-8B-Instruct
Test the connection.
Enable for desired language combinations.