HPC & Apptainer

Test Bazzite AI containers locally, then deploy on HPC clusters using SLURM.

Workflow Overview

Phase              Environment     Purpose
1. Local Testing   Bazzite AI OS   Verify the container works before submitting to the cluster
2. HPC Deployment  Cluster         Run at scale with SLURM

Phase 1: Local Testing

Test containers on your local Bazzite AI system before submitting jobs to the HPC cluster.

Pull Container Image

Command                                  Description
ujust apptainer pull -i nvidia-python    Download the ML container
ujust apptainer pull -i jupyter          Download the Jupyter container
# Pull the nvidia-python image
ujust apptainer pull -i nvidia-python -t stable

# Verify download
ujust apptainer cache status
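
To double-check a downloaded image, the inspect subcommand listed in the reference below shows the SIF metadata. A minimal sketch, assuming it accepts the same -i image flag as the other subcommands:

# Show metadata (labels, build info) for the downloaded SIF
ujust apptainer inspect -i nvidia-python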

Verify GPU Detection

Command                Description
ujust apptainer gpu    Show GPU type and flags
# Check which GPU is detected and what flags to use
ujust apptainer gpu

The output shows:

  • Detected GPU type (NVIDIA, AMD, Intel, or none)
  • Apptainer flags (--nv, --rocm, or --bind /dev/dri)

Interactive Testing

Command                                   Description
ujust apptainer shell -i nvidia-python    Open an interactive shell
# Open interactive shell
ujust apptainer shell -i nvidia-python

# Inside container, verify GPU
nvidia-smi
python -c "import torch; print(torch.cuda.is_available())"

Run Your Training Script

Command                                         Description
ujust apptainer exec -i nvidia-python -- CMD    Execute a command in the container
# Test your training script locally first
ujust apptainer exec -i nvidia-python -- python train.py --epochs=1

# Run with specific arguments
ujust apptainer exec -i nvidia-python -- python train.py --config=test.yaml

Phase 2: HPC Deployment

Once local testing succeeds, deploy to your HPC cluster.

Transfer SIF to Cluster

# Find your SIF files
ls ~/.local/share/bazzite-ai/apptainer/

# Copy to cluster
scp ~/.local/share/bazzite-ai/apptainer/bazzite-ai-pod-nvidia-python_stable.sif \
    user@cluster:/path/to/containers/
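
The images here are 14-17 GB, so a resumable transfer helps on unreliable connections. As an optional alternative to scp, rsync with -P resumes interrupted copies:

# Resumable transfer; -P keeps partial files and shows progress
rsync -avP ~/.local/share/bazzite-ai/apptainer/bazzite-ai-pod-nvidia-python_stable.sif \
    user@cluster:/path/to/containers/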

SLURM Job Script (Basic)

#!/bin/bash
#SBATCH --job-name=ml-training
#SBATCH --gres=gpu:1
#SBATCH --time=4:00:00
#SBATCH --output=%j.out
#SBATCH --error=%j.err

module load apptainer

apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
    python /workspace/train.py
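
Submitting and monitoring follows the standard SLURM workflow (assuming the script above is saved as train-job.sh; exact commands and partitions vary by site):

# Submit the job script
sbatch train-job.sh

# List your queued and running jobs
squeue -u $USER

# Follow the job output (replace 12345 with the job ID printed by sbatch)
tail -f 12345.out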

SLURM Job Script (Multi-GPU)

#!/bin/bash
#SBATCH --job-name=distributed-training
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gres=gpu:4
#SBATCH --time=24:00:00
#SBATCH --partition=gpu

module load apptainer

export APPTAINER_BIND="/scratch/$USER:/workspace"

apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
    python -m torch.distributed.launch --nproc_per_node=4 train.py
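
torch.distributed.launch is deprecated in recent PyTorch releases. If the container ships a current PyTorch, the equivalent launch with torchrun would look roughly like this (a sketch for the same single-node, four-GPU allocation):

# torchrun replaces torch.distributed.launch in PyTorch >= 1.10
apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
    torchrun --nproc_per_node=4 train.py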

SLURM Array Jobs (Hyperparameter Sweep)

#!/bin/bash
#SBATCH --job-name=hp-sweep
#SBATCH --array=1-10
#SBATCH --gres=gpu:1
#SBATCH --time=2:00:00

module load apptainer

apptainer exec --nv /path/to/bazzite-ai-pod-nvidia-python_stable.sif \
    python train.py --config=configs/run_${SLURM_ARRAY_TASK_ID}.yaml
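
The array job expects one config file per task, named configs/run_1.yaml through configs/run_10.yaml. One way to generate them is from a template; this sketch assumes a hypothetical configs/base.yaml containing a LEARNING_RATE placeholder:

# Generate numbered configs matching ${SLURM_ARRAY_TASK_ID} above
mkdir -p configs
i=1
for lr in 1e-2 5e-3 1e-3 5e-4 1e-4 5e-5 1e-5 5e-6 1e-6 5e-7; do
    sed "s/LEARNING_RATE/${lr}/" configs/base.yaml > "configs/run_${i}.yaml"
    i=$((i + 1))
done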

GPU Flags Reference

GPU Vendor   Apptainer Flag     SLURM Resource
NVIDIA       --nv               --gres=gpu:N
AMD          --rocm             Cluster-specific
Intel        --bind /dev/dri    Cluster-specific
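
On an AMD cluster, for example, the basic job body changes only in the GPU flag; a sketch (the SIF path is illustrative, and GPU scheduling options are cluster-specific):

# Same pattern as the NVIDIA example, with ROCm GPU passthrough
apptainer exec --rocm /path/to/container.sif \
    python /workspace/train.py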

Common Patterns

Bind Mount Data Directories

apptainer exec --nv \
    --bind /scratch/data:/data \
    --bind /home/$USER/models:/models \
    container.sif python train.py
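
Apptainer bind mounts also accept a :ro option to make a path read-only, useful for shared datasets the job should not modify:

# Mount the dataset read-only; the models directory stays writable
apptainer exec --nv \
    --bind /scratch/data:/data:ro \
    --bind /home/$USER/models:/models \
    container.sif python train.py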

Environment Variables

apptainer exec --nv \
    --env CUDA_VISIBLE_DEVICES=0,1 \
    --env WANDB_API_KEY=$WANDB_API_KEY \
    container.sif python train.py
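
To keep secrets such as API keys out of job scripts, Apptainer can also read environment variables from a file via --env-file (train.env is a hypothetical file name):

# train.env contains lines like WANDB_API_KEY=...
apptainer exec --nv \
    --env-file ~/train.env \
    container.sif python train.py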

Available Images

Image           Use Case                Size      Pull Command
nvidia-python   ML/AI training          ~14 GB    ujust apptainer pull -i nvidia-python
jupyter         Interactive notebooks   ~17 GB    ujust apptainer pull -i jupyter

Apptainer Commands Reference

Command                    Description
ujust apptainer pull       Download an image to a SIF file
ujust apptainer shell      Open an interactive shell
ujust apptainer exec       Execute a command in the container
ujust apptainer run        Run the image's default command
ujust apptainer gpu        Show GPU detection results
ujust apptainer inspect    Show SIF metadata
ujust apptainer cache      Manage the local image cache

See Also