Bazzite-AI Environment Setup¶
One-time setup for the LLMs on Supercomputers Course
This notebook configures your bazzite-ai environment for running the course notebooks. Run this once at the start of each JupyterLab session.
Attribution¶
The course notebooks are adapted from the Foundations of LLM Mastery training series by:
- Simeon Harrison (INiTS and AI Factory Austria AI:AT)
- Thomas Haschka (Campus IT / HPC, TU Wien)
- Martin Pfister (Advanced Computing Austria ACA GmbH)
Original source: gitlab.tuwien.ac.at/vsc-public/training/LLMs-on-supercomputers
Adapted for Bazzite.AI by Andreas Trawöger
License: CC BY-SA 4.0
Bazzite-AI vs Supercomputer Environment¶
The original course was designed for the Vienna Scientific Cluster (VSC) supercomputer. In bazzite-ai, we run everything locally with:
| Aspect | VSC Supercomputer | Bazzite-AI |
|---|---|---|
| GPU Access | SLURM job scheduler | Direct GPU access via container |
| LLM Inference | vLLM server | Ollama pod (containerized) |
| Model Loading | Shared NFS storage | HuggingFace Hub / Ollama pull |
| API Compatibility | OpenAI-compatible vLLM | OpenAI-compatible Ollama |
Key Difference: Ollama as OpenAI Drop-in¶
Instead of OpenAI's paid API or a vLLM server, we use Ollama, which provides an OpenAI-compatible endpoint locally:
from openai import OpenAI

# OpenAI (paid cloud API)
client = OpenAI(api_key="sk-...")

# Ollama (free local inference - same code works!)
client = OpenAI(base_url="http://ollama:11434/v1", api_key="ollama")
This means most code samples work unchanged - just point to Ollama instead of OpenAI.
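For example, the same client setup can serve a complete chat request locally. A minimal sketch, assuming the llama3.2:latest model has already been pulled (step 3 of this notebook pulls it):

from openai import OpenAI

# Point the standard OpenAI client at the local Ollama endpoint
client = OpenAI(base_url="http://ollama:11434/v1", api_key="ollama")

# Any model already pulled into Ollama can be used as the model name
response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[{"role": "user", "content": "Explain in one sentence what an LLM is."}],
)
print(response.choices[0].message.content)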
1. GPU Access & Environment Testing¶
First, let's verify GPU access and check available memory.
import torch
import gc
print("=" * 50)
print("GPU & Environment Status")
print("=" * 50)
# Check PyTorch CUDA availability
print(f"\nPyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"CUDA version: {torch.version.cuda}")
print(f"GPU count: {torch.cuda.device_count()}")
print(f"Current device: {torch.cuda.current_device()}")
print(f"GPU name: {torch.cuda.get_device_name(0)}")
# System-wide GPU memory check using pynvml
try:
import pynvml
pynvml.nvmlInit()
device_count = pynvml.nvmlDeviceGetCount()
print(f"\n--- System-Wide GPU Memory ---")
for i in range(device_count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):
            name = name.decode()  # older pynvml versions return bytes
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
total_gb = info.total / 1024**3
used_gb = info.used / 1024**3
free_gb = info.free / 1024**3
usage_pct = (info.used / info.total) * 100
print(f"\nGPU {i}: {name}")
print(f" Total: {total_gb:.2f} GB")
print(f" Used: {used_gb:.2f} GB ({usage_pct:.1f}%)")
print(f" Free: {free_gb:.2f} GB")
# Warning thresholds
if free_gb < 4.0:
print(f" \u26a0\ufe0f CRITICAL: Very low GPU memory!")
print(f" Shutdown other notebook kernels before proceeding.")
elif free_gb < 6.0:
print(f" \u26a0\ufe0f WARNING: Low GPU memory.")
print(f" 7B models need ~5GB with 4-bit quantization.")
pynvml.nvmlShutdown()
except ImportError:
print("\n\u26a0\ufe0f pynvml not installed - using PyTorch memory info (per-process only)")
if torch.cuda.is_available():
for i in range(torch.cuda.device_count()):
total = torch.cuda.get_device_properties(i).total_memory / 1024**3
allocated = torch.cuda.memory_allocated(i) / 1024**3
print(f" GPU {i}: {allocated:.2f} / {total:.2f} GB (this process only)")
# Quick GPU test
if torch.cuda.is_available():
print("\n--- GPU Computation Test ---")
try:
x = torch.randn(1000, 1000, device="cuda")
y = torch.matmul(x, x)
del x, y
torch.cuda.empty_cache()
print("\u2705 GPU computation test passed!")
except Exception as e:
print(f"\u274c GPU test failed: {e}")
else:
print("\n\u26a0\ufe0f No GPU available - running on CPU only")
print("\n" + "=" * 50)
==================================================
GPU & Environment Status
==================================================
PyTorch version: 2.9.1+cu130
CUDA available: True
CUDA version: 13.0
GPU count: 1
Current device: 0
GPU name: NVIDIA GeForce RTX 4080 SUPER
--- System-Wide GPU Memory ---
GPU 0: NVIDIA GeForce RTX 4080 SUPER
Total: 15.99 GB
Used: 10.37 GB (64.8%)
Free: 5.62 GB
⚠️ WARNING: Low GPU memory.
7B models need ~5GB with 4-bit quantization.
--- GPU Computation Test ---
✅ GPU computation test passed!
==================================================
2. Ollama Pod Management¶
Ollama runs as a containerized pod in bazzite-ai. Let's check if it's running.
import os
import requests
# Ollama configuration
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
print(f"Ollama host: {OLLAMA_HOST}")
print("\n--- Checking Ollama Connection ---")
def check_ollama_health():
"""Check if Ollama server is running and healthy."""
try:
response = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
if response.status_code == 200:
return True, response.json()
return False, f"Unexpected status: {response.status_code}"
except requests.exceptions.ConnectionError:
return False, "Connection refused - Ollama pod not running"
except requests.exceptions.Timeout:
return False, "Connection timed out"
except Exception as e:
return False, str(e)
is_running, result = check_ollama_health()
if is_running:
print("\u2705 Ollama server is running!")
models = result.get("models", [])
if models:
print(f"\nAvailable models ({len(models)}):")
for m in models:
name = m.get("name", "Unknown")
size_gb = m.get("size", 0) / 1024**3
print(f" - {name} ({size_gb:.1f} GB)")
else:
print("\nNo models pulled yet (we'll pull them in the next step).")
else:
print(f"\u274c Ollama is not running: {result}")
print("\n--- How to Start Ollama ---")
print("Run this command in a terminal:")
print("")
print(" ujust ollama start")
print("")
print("Then re-run this cell to verify the connection.")
Ollama host: http://ollama:11434

--- Checking Ollama Connection ---
✅ Ollama server is running!

Available models (2):
  - hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M (4.1 GB)
  - llama3.2:latest (1.9 GB)
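The notebooks read the endpoint from the OLLAMA_HOST environment variable, so you can repoint them at a different Ollama instance without editing any code. A minimal sketch; http://localhost:11434 is only an example address, not the course default, and you should re-run the configuration cell above afterwards so it picks up the new value:

import os

# Example override: point the notebooks at a different Ollama endpoint.
# The address below is purely illustrative - substitute your own.
os.environ["OLLAMA_HOST"] = "http://localhost:11434"

# Re-read the variable the same way the configuration cell does
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
print(f"Ollama host is now: {OLLAMA_HOST}")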
3. Model Management (Auto-Pull)¶
The course notebooks require specific models. Let's check if they're available and pull any missing ones.
import json
# Required models for the course
REQUIRED_MODELS = [
{
"name": "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M",
"used_by": "D1 notebooks (Prompt Engineering)",
"size_hint": "~4.4 GB"
},
{
"name": "llama3.2:latest",
"used_by": "D2 notebooks (RAG)",
"size_hint": "~2.0 GB"
}
]
def get_available_models():
"""Get list of models available in Ollama."""
try:
response = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
if response.status_code == 200:
return [m.get("name", "") for m in response.json().get("models", [])]
    except requests.exceptions.RequestException:
        pass
return []
def pull_model(model_name):
"""Pull a model from Ollama, showing progress."""
print(f"\nPulling '{model_name}'...")
print("(This may take several minutes for large models)")
try:
response = requests.post(
f"{OLLAMA_HOST}/api/pull",
json={"name": model_name},
stream=True,
timeout=1800 # 30 minute timeout for large models
)
last_status = ""
for line in response.iter_lines():
if line:
data = json.loads(line)
status = data.get("status", "")
# Show download progress
if "pulling" in status or "downloading" in status:
completed = data.get("completed", 0)
total = data.get("total", 0)
if total > 0:
pct = (completed / total) * 100
print(f"\r Progress: {pct:.1f}%", end="", flush=True)
elif status != last_status:
if last_status:
print() # newline after progress
print(f" {status}")
last_status = status
if status == "success":
print(f"\n\u2705 Model '{model_name}' pulled successfully!")
return True
return True
except Exception as e:
print(f"\n\u274c Failed to pull model: {e}")
return False
# Check connection first
is_running, _ = check_ollama_health()
if not is_running:
print("\u274c Ollama is not running. Start it first with: ujust ollama start")
else:
print("Checking required models...\n")
available = get_available_models()
all_ready = True
for model_info in REQUIRED_MODELS:
model_name = model_info["name"]
        # Check if the model is available (substring match in either direction, so tag suffixes like ':latest' still match)
is_available = any(model_name in m or m in model_name for m in available)
if is_available:
print(f"\u2705 {model_name}")
print(f" Used by: {model_info['used_by']}")
else:
print(f"\u274c {model_name} - NOT FOUND")
print(f" Used by: {model_info['used_by']}")
print(f" Size: {model_info['size_hint']}")
# Auto-pull missing model
success = pull_model(model_name)
if not success:
all_ready = False
print("\n" + "=" * 50)
if all_ready:
print("\u2705 All required models are available!")
else:
print("\u26a0\ufe0f Some models failed to download. Try manually:")
print(" ujust ollama pull <model-name>")
Checking required models...

✅ hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
   Used by: D1 notebooks (Prompt Engineering)
✅ llama3.2:latest
   Used by: D2 notebooks (RAG)

==================================================
✅ All required models are available!
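If you want to confirm what was pulled (model family, parameter count, quantization level), Ollama exposes this metadata via its /api/show endpoint. A minimal sketch, reusing OLLAMA_HOST from above and assuming llama3.2:latest is present; the exact field names reflect current Ollama releases and may differ slightly between versions:

import requests

# Ask Ollama for metadata about an already-pulled model
resp = requests.post(
    f"{OLLAMA_HOST}/api/show",
    json={"name": "llama3.2:latest"},
    timeout=10,
)
resp.raise_for_status()
details = resp.json().get("details", {})
print(f"Family: {details.get('family')}")
print(f"Parameters: {details.get('parameter_size')}")
print(f"Quantization: {details.get('quantization_level')}")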
4. Ollama as OpenAI Drop-in Replacement¶
Ollama provides an OpenAI-compatible API, which means you can use the same code for both:
| Aspect | OpenAI API | Ollama (bazzite-ai) |
|---|---|---|
| Cost | Pay per token | Free (runs locally) |
| API Key | Required | Not needed |
| Privacy | Data sent to cloud | Data stays local |
| Models | OpenAI models only | Any GGUF model |
| base_url | https://api.openai.com/v1 | http://ollama:11434/v1 |
Configuration Pattern¶
In the course notebooks, you'll see this minimal configuration:
import os
from openai import OpenAI
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
MODEL = "hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M"
client = OpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama" # Required by library but ignored by Ollama
)
The same pattern works with LangChain:
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama",
model=MODEL
)
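A quick smoke test of the LangChain client (a sketch that reuses the llm object defined above; invoke returns an AIMessage whose content field holds the reply):

# Send a single prompt through the LangChain wrapper and print the reply
reply = llm.invoke("Name one advantage of running inference locally.")
print(reply.content)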
# Quick API test
from openai import OpenAI
# === Model Configuration ===
HF_LLM_MODEL = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF"
OLLAMA_LLM_MODEL = f"hf.co/{HF_LLM_MODEL}:Q4_K_M"
print("Testing OpenAI-compatible API...")
print(f"Model: {OLLAMA_LLM_MODEL}")
try:
client = OpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama"
)
response = client.chat.completions.create(
model=OLLAMA_LLM_MODEL,
messages=[{"role": "user", "content": "Say 'Hello from Ollama!' in exactly 5 words."}],
max_tokens=20
)
print(f"\u2705 API test passed!")
print(f"\nResponse: {response.choices[0].message.content}")
except Exception as e:
print(f"\u274c API test failed: {e}")
print("\nMake sure Ollama is running and the model is pulled.")
Testing OpenAI-compatible API...
Model: hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
✅ API test passed!

Response: Hello, Ollama! Welcome!
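The OpenAI-compatible endpoint also supports streaming responses, which is handy for long generations. A minimal sketch, reusing the client and OLLAMA_LLM_MODEL defined in the cell above:

# Stream tokens as they are generated instead of waiting for the full reply
stream = client.chat.completions.create(
    model=OLLAMA_LLM_MODEL,
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    max_tokens=30,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()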
5. Datasets Directory¶
Notebooks can use relative paths to access datasets. The Jupyter kernel runs in the notebook's directory.
from pathlib import Path
# With kernel cwd fix, notebooks run in their own directory
# Datasets are in ./datasets/ relative to this notebook
DATASETS_DIR = Path("./datasets")
print(f"Datasets directory: {DATASETS_DIR.resolve()}")
datasets = list(DATASETS_DIR.glob('*.csv'))
if datasets:
print(f"Available datasets: {[d.name for d in datasets]}")
Datasets directory: /workspace/Sync/AI/bazzite/bazzite-ai-testing/notebooks/llms_on_supercomputers/datasets
Available datasets: ['booking_queries_dataset.csv', 'code_review_dataset.csv', 'health_and_fitness_qna.csv']
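To load one of these files in a later notebook, pandas works directly with the same relative path. A minimal sketch, assuming pandas is installed in the environment and using the health_and_fitness_qna.csv file listed above:

import pandas as pd

# Load a course dataset via the relative datasets/ path
df = pd.read_csv(DATASETS_DIR / "health_and_fitness_qna.csv")
print(df.shape)
print(df.head())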
6. Environment Verification & Readiness Check¶
Final verification that everything is working.
print("=" * 60)
print("ENVIRONMENT READINESS CHECK")
print("=" * 60)
checks = []
# 1. GPU Check
gpu_ok = torch.cuda.is_available()
checks.append(("GPU Access", gpu_ok, "CUDA available" if gpu_ok else "No GPU - CPU only"))
# 2. Ollama Check
ollama_ok, _ = check_ollama_health()
checks.append(("Ollama Server", ollama_ok, "Running" if ollama_ok else "Not running"))
# 3. Models Check
if ollama_ok:
available = get_available_models()
model_count = len(available)
models_ok = model_count > 0
checks.append(("Ollama Models", models_ok, f"{model_count} models available" if models_ok else "No models"))
else:
checks.append(("Ollama Models", False, "Ollama not running"))
# 4. API Test
if ollama_ok and models_ok:
try:
client = OpenAI(base_url=f"{OLLAMA_HOST}/v1", api_key="ollama")
# Quick test with small output
response = client.chat.completions.create(
model=available[0],
messages=[{"role": "user", "content": "Hi"}],
max_tokens=5
)
api_ok = True
except:
api_ok = False
checks.append(("API Inference", api_ok, "Working" if api_ok else "Failed"))
else:
checks.append(("API Inference", False, "Prerequisites not met"))
# Print results
print("\n")
all_ok = True
for name, ok, detail in checks:
status = "\u2705" if ok else "\u274c"
print(f"{status} {name}: {detail}")
if not ok and name not in ["GPU Access"]: # GPU is optional
all_ok = False
print("\n" + "=" * 60)
if all_ok:
print("\u2705 ENVIRONMENT READY!")
print("\nYou can now proceed to D1_01_Prompting_with_LangChain.ipynb")
else:
print("\u26a0\ufe0f SOME ISSUES DETECTED")
print("\nPlease resolve the issues above before continuing.")
if not ollama_ok:
print("\nTo start Ollama, run in a terminal:")
print(" ujust ollama start")
print("=" * 60)
============================================================
ENVIRONMENT READINESS CHECK
============================================================

✅ GPU Access: CUDA available
✅ Ollama Server: Running
✅ Ollama Models: 2 models available
✅ API Inference: Working

============================================================
✅ ENVIRONMENT READY!

You can now proceed to D1_01_Prompting_with_LangChain.ipynb
============================================================
Next Steps¶
Your bazzite-ai environment is configured! You can now proceed with the course:
D1 - Prompt Engineering Essentials¶
- D1_01_Prompting_with_LangChain.ipynb - Start here!
- D1_02_Prompt_templates_and_parsing.ipynb
- D1_05_Chaining.ipynb
- D1_08_LLM_Evaluation.ipynb
- D1_09_LLM_as_a_Judge.ipynb
- D1_10_Prompt_Optimization.ipynb
D2 - Retrieval Augmented Generation¶
- D2_01_rag_with_basic_tools.ipynb
- D2_02_rag_with_langchain_and_chromadb.ipynb
D3 - Fine-tuning on One GPU¶
- D3_01_Transformer_Architecture.ipynb
- D3_02_Finetuning_LLM_with_PyTorch.ipynb
- D3_03_Finetuning_LLM_with_Huggingface.ipynb
- D3_04_Quantization.ipynb
- D3_05_PEFT.ipynb
- D3_06_Unsloth.ipynb
Note: You only need to run this setup notebook once per JupyterLab session. The Ollama pod persists between notebook runs.