HPC & Apptainer
Test Bazzite AI containers locally, then deploy on HPC clusters using SLURM.
Workflow Overview
| Phase | Environment | Purpose |
|-------|-------------|---------|
| 1. Local Testing | Bazzite AI OS | Verify the container works before moving to the cluster |
| 2. HPC Deployment | Cluster | Run at scale with SLURM |
Phase 1: Local Testing
Test containers on your local Bazzite AI system before submitting to HPC.
Pull Container Image
# Pull the nvidia-python image
ujust apptainer pull -i nvidia-python -t stable
# Verify download
ujust apptainer cache status
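The cached SIF files live in your home directory (the same location used for the transfer step in Phase 2), so you can also inspect them directly:
# SIF files are stored under the bazzite-ai data directory
ls -lh ~/.local/share/bazzite-ai/apptainer/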
Verify GPU Detection
# Check which GPU is detected and what flags to use
ujust apptainer gpu
Output shows:
- Detected GPU type (NVIDIA, AMD, Intel, or none)
- Apptainer flags to use (--nv, --rocm, or --bind /dev/dri)
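If the detection result looks wrong, it can help to check GPU visibility on the host before involving the container. A quick sketch, assuming the matching vendor tool (nvidia-smi or rocm-smi) is installed on your system:
# NVIDIA: list GPUs visible to the host driver
nvidia-smi -L
# AMD: show ROCm-visible devices
rocm-smi --showproductname
# Intel / generic: confirm the DRI render nodes exist
ls -l /dev/dri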
Interactive Testing
# Open interactive shell
ujust apptainer shell -i nvidia-python
# Inside container, verify GPU
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"
Run Your Training Script
# Test your training script locally first
ujust apptainer exec -i nvidia-python -- python train.py --epochs=1
# Run with specific arguments
ujust apptainer exec -i nvidia-python -- python train.py --config=test.yaml
Phase 2: HPC Deployment
Once local testing succeeds, deploy to your HPC cluster.
Transfer SIF to Cluster
# Find your SIF files
ls ~/.local/share/bazzite-ai/apptainer/
# Copy to cluster
scp ~/.local/share/bazzite-ai/apptainer/bazzite-ai-pod-nvidia-python_stable.sif \
user@cluster:/path/to/containers/
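For multi-gigabyte SIF files over an unreliable connection, rsync can resume an interrupted transfer where scp cannot. A sketch, assuming rsync is available on both ends and using the same placeholder destination path as above:
# Resume-friendly transfer with progress output
rsync -avP ~/.local/share/bazzite-ai/apptainer/bazzite-ai-pod-nvidia-python_stable.sif \
  user@cluster:/path/to/containers/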
SLURM Job Script (Basic)
#!/bin/bash
#SBATCH --job-name=ml-training
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
#SBATCH --output=%j.out
#SBATCH --error=%j.err
module load apptainer
apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
python /workspace/train.py
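To submit and monitor the job, the usual SLURM client commands apply; exact partition and account flags are cluster-specific. This assumes the script above is saved as train.slurm (the name is arbitrary):
# Submit the job script and note the job ID
sbatch train.slurm
# Check queue state for your jobs
squeue -u $USER
# Follow the job's stdout once it starts (12345 is a placeholder job ID)
tail -f 12345.out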
SLURM Job Script (Multi-GPU)
#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00
#SBATCH --partition=gpu
module load apptainer
export APPTAINER_BIND="/scratch/$USER:/workspace"
apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
python -m torch.distributed.launch --nproc_per_node=4 train.py
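Recent PyTorch releases deprecate torch.distributed.launch in favor of torchrun. If the PyTorch version inside the image provides it, an equivalent launch looks like this (a sketch, not verified against a specific image version):
# torchrun replaces python -m torch.distributed.launch in newer PyTorch
apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
  torchrun --nproc_per_node=4 train.py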
SLURM Array Jobs (Hyperparameter Sweep)
#!/bin/bash
#SBATCH --job-name=hp-sweep
#SBATCH --array=1-10
#SBATCH --gres=gpu:1
#SBATCH --time=2:00:00
module load apptainer
apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
python train.py --config=configs/run_${SLURM_ARRAY_TASK_ID}.yaml
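The array job expects one config file per task ID. How you generate them is up to you; a minimal sketch that writes hypothetical configs/run_N.yaml files varying the learning rate (the field names are illustrative, not required by the image):
# Generate configs/run_1.yaml .. configs/run_10.yaml
mkdir -p configs
lrs=(1e-2 5e-3 1e-3 5e-4 1e-4 5e-5 1e-5 5e-6 1e-6 5e-7)
for i in $(seq 1 10); do
  printf 'learning_rate: %s\nepochs: 10\n' "${lrs[$((i-1))]}" > "configs/run_${i}.yaml"
done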
GPU Flags Reference
| GPU Vendor | Apptainer Flag | SLURM Resource |
|------------|----------------|----------------|
| NVIDIA | --nv | --gres=gpu:N |
| AMD | --rocm | Cluster-specific |
| Intel | --bind /dev/dri | Cluster-specific |
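As a concrete example of the non-NVIDIA flags, the same exec pattern works by swapping the GPU option; the SLURM resource request syntax for AMD and Intel GPUs varies by cluster, so check your site documentation:
# AMD GPUs: ROCm runtime passthrough
apptainer exec --rocm container.sif python train.py
# Intel GPUs: bind the DRI render nodes
apptainer exec --bind /dev/dri container.sif python train.py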
Common Patterns
Bind Mount Data Directories
apptainer exec --nv \
--bind /scratch/data:/data \
--bind /home/$USER/models:/models \
container.sif python train.py
Environment Variables
apptainer exec --nv \
--env CUDA_VISIBLE_DEVICES=0,1 \
--env WANDB_API_KEY=$WANDB_API_KEY \
container.sif python train.py
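To keep secrets like API keys out of job scripts and shell history, Apptainer also accepts an environment file, provided your Apptainer version supports --env-file. A sketch, assuming a file named secrets.env readable only by you:
# secrets.env contains lines like WANDB_API_KEY=...
chmod 600 secrets.env
apptainer exec --nv \
  --env-file secrets.env \
  container.sif python train.py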
Available Images
| Image | Use Case | Size | Pull Command |
|-------|----------|------|--------------|
| nvidia-python | ML/AI training | ~14GB | ujust apptainer pull -i nvidia-python |
| jupyter | Interactive notebooks | ~17GB | ujust apptainer pull -i jupyter |
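The jupyter image is typically used interactively rather than through batch scripts. A sketch of one common pattern on a compute node, assuming JupyterLab is on the PATH inside the image, that the SIF filename follows the nvidia-python naming (it may differ), and that your cluster allows SSH tunnels (port 8888 is arbitrary):
# On the compute node: start JupyterLab from the container
apptainer exec --nv bazzite-ai-pod-jupyter_stable.sif \
  jupyter lab --no-browser --ip=0.0.0.0 --port=8888
# On your workstation: tunnel the port through the cluster (computenode is a placeholder)
ssh -L 8888:computenode:8888 user@cluster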
Apptainer Commands Reference
See Also