HPC & Apptainer

Test Bazzite AI containers locally, then deploy on HPC clusters using SLURM.

Workflow Overview

Phase              Environment     Purpose
1. Local Testing   Bazzite AI OS   Verify the container works before submitting to the cluster
2. HPC Deployment  Cluster         Run at scale with SLURM

Phase 1: Local Testing

Test containers on your local Bazzite AI system before submitting jobs to the HPC cluster.

Pull Container Image

Command                                  Description
ujust apptainer pull -i nvidia-python    Download the ML container
ujust apptainer pull -i jupyter          Download the Jupyter container
# Pull the nvidia-python image
ujust apptainer pull -i nvidia-python -t stable

# Verify download
ujust apptainer cache status
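
To double-check a downloaded image, the inspect subcommand listed in the reference below shows the SIF metadata. A minimal sketch, assuming it accepts the same -i image flag as the other subcommands:

# Show metadata (labels, build info) for the downloaded SIF
ujust apptainer inspect -i nvidia-python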

Verify GPU Detection

Command                Description
ujust apptainer gpu    Show GPU type and flags
# Check which GPU is detected and what flags to use
ujust apptainer gpu

The output shows:

  • Detected GPU type (NVIDIA, AMD, Intel, or none)
  • Apptainer flags (--nv, --rocm, or --bind /dev/dri)

Interactive Testing

Command                                   Description
ujust apptainer shell -i nvidia-python    Open an interactive shell
# Open interactive shell
ujust apptainer shell -i nvidia-python

# Inside container, verify GPU
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"

Run Your Training Script

Command                                         Description
ujust apptainer exec -i nvidia-python -- CMD    Execute a command in the container
# Test your training script locally first
ujust apptainer exec -i nvidia-python -- python train.py --epochs=1

# Run with specific arguments
ujust apptainer exec -i nvidia-python -- python train.py --config=test.yaml

Phase 2: HPC Deployment

Once local testing succeeds, deploy to your HPC cluster.

Transfer SIF to Cluster

# Find your SIF files
ls ~/.local/share/bazzite-ai/apptainer/

# Copy to cluster
scp ~/.local/share/bazzite-ai/apptainer/bazzite-ai-pod-nvidia-python_stable.sif \
    user@cluster:/path/to/containers/
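
The images here are 14-17 GB, so a resumable transfer helps on unreliable connections. As an optional alternative to scp, rsync with -P resumes interrupted copies:

# Resumable transfer; -P keeps partial files and shows progress
rsync -avP ~/.local/share/bazzite-ai/apptainer/bazzite-ai-pod-nvidia-python_stable.sif \
    user@cluster:/path/to/containers/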

SLURM Job Script (Basic)

#!/bin/bash
#SBATCH --job-name=ml-training
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load apptainer

apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
    python /workspace/train.py
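
Submitting and monitoring follows the standard SLURM workflow (assuming the script above is saved as train-job.sh; exact commands and partitions vary by site):

# Submit the job script
sbatch train-job.sh

# List your queued and running jobs
squeue -u $USER

# Follow the job output (replace 12345 with the job ID printed by sbatch)
tail -f 12345.out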

SLURM Job Script (Multi-GPU)

#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00
#SBATCH --partition=gpu

module load apptainer

export APPTAINER_BIND="/scratch/$USER:/workspace"

apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
    python -m torch.distributed.launch --nproc_per_node=4 train.py
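
torch.distributed.launch is deprecated in recent PyTorch releases. If the container ships a current PyTorch, the equivalent launch with torchrun would look roughly like this (a sketch for the same single-node, four-GPU allocation):

# torchrun replaces torch.distributed.launch in PyTorch >= 1.10
apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
    torchrun --nproc_per_node=4 train.py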

SLURM Array Jobs (Hyperparameter Sweep)

#!/bin/bash
#SBATCH --job-name=hp-sweep
#SBATCH --array=1-10
#SBATCH --gres=gpu:1
#SBATCH --time=2:00:00

module load apptainer

apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
    python train.py --config=configs/run_${SLURM_ARRAY_TASK_ID}.yaml
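
The array job expects one config file per task, named configs/run_1.yaml through configs/run_10.yaml. One way to generate them is from a template; this sketch assumes a hypothetical configs/base.yaml containing a LEARNING_RATE placeholder:

# Generate numbered configs matching ${SLURM_ARRAY_TASK_ID} above
mkdir -p configs
i=1
for lr in 1e-2 5e-3 1e-3 5e-4 1e-4 5e-5 1e-5 5e-6 1e-6 5e-7; do
    sed "s/LEARNING_RATE/${lr}/" configs/base.yaml > "configs/run_${i}.yaml"
    i=$((i + 1))
done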

GPU Flags Reference

GPU Vendor   Apptainer Flag     SLURM Resource
NVIDIA       --nv               --gres=gpu:N
AMD          --rocm             Cluster-specific
Intel        --bind /dev/dri    Cluster-specific
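
On an AMD cluster, for example, the basic job body changes only in the GPU flag; a sketch (the SIF path is illustrative, and GPU scheduling options are cluster-specific):

# Same pattern as the NVIDIA example, with ROCm GPU passthrough
apptainer exec --rocm /path/to/container.sif \
    python /workspace/train.py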

Common Patterns

Bind Mount Data Directories

apptainer exec --nv \
    --bind /scratch/data:/data \
    --bind /home/$USER/models:/models \
    container.sif python train.py
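
Apptainer bind mounts also accept a :ro option to make a path read-only, useful for shared datasets the job should not modify:

# Mount the dataset read-only; the models directory stays writable
apptainer exec --nv \
    --bind /scratch/data:/data:ro \
    --bind /home/$USER/models:/models \
    container.sif python train.py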

Environment Variables

apptainer exec --nv \
    --env CUDA_VISIBLE_DEVICES=0,1 \
    --env WANDB_API_KEY=$WANDB_API_KEY \
    container.sif python train.py
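
To keep secrets such as API keys out of job scripts, Apptainer can also read environment variables from a file via --env-file (train.env is a hypothetical file name):

# train.env contains lines like WANDB_API_KEY=...
apptainer exec --nv \
    --env-file ~/train.env \
    container.sif python train.py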

Available Images

Image           Use Case                Size      Pull Command
nvidia-python   ML/AI training          ~14 GB    ujust apptainer pull -i nvidia-python
jupyter         Interactive notebooks   ~17 GB    ujust apptainer pull -i jupyter

Apptainer Commands Reference

Command                    Description
ujust apptainer pull       Download an image to a SIF file
ujust apptainer shell      Open an interactive shell
ujust apptainer exec       Execute a command in the container
ujust apptainer run        Run the image's default command
ujust apptainer gpu        Show GPU detection results
ujust apptainer inspect    Show SIF metadata
ujust apptainer cache      Manage the local image cache

See Also