QLoRA Test: Quantization Method Comparison - Ministral-3B-Reasoning¶
Compares different quantization methods for QLoRA training to understand memory/quality tradeoffs.
Quantization methods tested:
- 4-bit NF4: QLoRA standard - Normal Float 4-bit with double quantization (see the config sketch after this intro)
- BF16: Full precision baseline (16-bit bfloat)
Measurements:
- Peak GPU memory usage (MB)
- Model loading time
- Training time
- Final loss
- Inference quality with [THINK] reasoning traces
Expected outcomes:
- 4-bit: ~50-70% reduction in model-weight memory vs BF16 (peak training memory savings are smaller)
- Quality degradation: minimal (<2% loss difference)
Model: Ministral-3B-Reasoning with [THINK]...[/THINK] reasoning format
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
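For context, the 4-bit path relies on Unsloth's load_in_4bit=True, which configures NF4 quantization internally. As a rough illustration only (this object is not used by the notebook, and Unsloth's actual defaults may differ), "NF4 with double quantization" corresponds to a bitsandbytes config along these lines:
# Illustration only: "NF4 with double quantization" expressed as a bitsandbytes config.
# Unsloth sets this up internally when load_in_4bit=True; this config is not used below.
import torch
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # Normal Float 4-bit weights
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in BF16
)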
# Environment Setup
import os
import time
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
import gc
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues. if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
# Benchmark Helper Functions
import subprocess
def measure_gpu_memory():
"""Measure current GPU memory usage in MB using nvidia-smi"""
try:
result = subprocess.run(
['nvidia-smi', '--query-gpu=memory.used', '--format=csv,noheader,nounits'],
capture_output=True, text=True
)
return int(result.stdout.strip().split('\n')[0])
    except Exception:
        # nvidia-smi unavailable or output not parseable; report 0 MB
        return 0
def count_parameters(model):
"""Count trainable vs total parameters"""
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
return {
"trainable": trainable,
"total": total,
"pct": 100 * trainable / total
}
def cleanup_memory():
"""Force garbage collection and clear CUDA cache"""
gc.collect()
if torch.cuda.is_available():
torch.cuda.empty_cache()
torch.cuda.synchronize()
print("Benchmark functions defined.")
print(f"Initial GPU memory: {measure_gpu_memory()} MB")
Benchmark functions defined.
Initial GPU memory: 1871 MB
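Note that measure_gpu_memory() reports device-wide usage from nvidia-smi, which includes the CUDA context and any other processes (hence the 1871 MB baseline above). For a per-process view, the PyTorch allocator statistics could be read instead; a minimal sketch, assuming a CUDA device is present as in this run (torch only tracks its own tensor allocations, so the numbers will be lower than nvidia-smi's):
# Alternative (not used below): per-process memory from the PyTorch allocator.
torch_allocated_mb = torch.cuda.memory_allocated() / 1024**2
torch_peak_mb = torch.cuda.max_memory_allocated() / 1024**2  # peak since last reset_peak_memory_stats()
print(f"PyTorch allocator: {torch_allocated_mb:.0f} MB allocated, {torch_peak_mb:.0f} MB peak")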
# Create minimal synthetic instruction dataset with thinking content (5 samples)
from datasets import Dataset
# Same dataset across all quantization methods for fair comparison
synthetic_data = [
{
"instruction": "What is machine learning?",
"thinking": "What is the user asking here? They want to understand machine learning. What are the key concepts I should cover? It's a subset of AI... and it involves learning from data. How should I keep this accessible? Short and clear definition.",
"response": "Machine learning is a subset of artificial intelligence where computers learn patterns from data."
},
{
"instruction": "Explain Python in one sentence.",
"thinking": "One sentence only - what's most important about Python? Its readability and versatility are the defining features. How do I capture both in one sentence?",
"response": "Python is a high-level programming language known for its readability and versatility."
},
{
"instruction": "What is a neural network?",
"thinking": "How do I explain neural networks simply? What's the core concept? They're inspired by biological neurons... they process information in layers. Should I mention deep learning? Maybe keep it basic for now.",
"response": "A neural network is a computational model inspired by biological neurons that processes information through connected layers."
},
{
"instruction": "Define supervised learning.",
"thinking": "What makes supervised learning 'supervised'? It's the labels! The data has known outputs. How do I explain this clearly? Focus on the training process with labeled data.",
"response": "Supervised learning is training a model on labeled data to predict outcomes for new inputs."
},
{
"instruction": "What is gradient descent?",
"thinking": "This is a bit technical. What's the intuition behind gradient descent? It's like walking downhill to find the lowest point. The gradient tells us which direction is 'down'. Keep it conceptual.",
"response": "Gradient descent is an optimization algorithm that minimizes loss by iteratively adjusting parameters in the direction of steepest descent."
},
]
print(f"Synthetic dataset prepared: {len(synthetic_data)} samples")
Synthetic dataset prepared: 5 samples
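Each sample is later rendered into Ministral's [THINK]...[/THINK] format by format_conversation inside the training loop below. As a quick preview, this builds the assistant-turn string the same way that function does (before the chat template is applied):
# Preview: the assistant-turn string for the first sample, constructed exactly as in
# format_conversation below (chat template not yet applied).
sample = synthetic_data[0]
assistant_content = f"[THINK]\n{sample['thinking']}\n[/THINK]\n\n{sample['response']}"
print(assistant_content)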
# Quantization Configuration
QUANT_CONFIGS = {
"4bit_nf4": {
"model_name": "unsloth/Ministral-3-3B-Reasoning-2512",
"load_in_4bit": True,
"description": "4-bit NF4 (QLoRA standard)",
},
"bf16": {
"model_name": "unsloth/Ministral-3-3B-Reasoning-2512",
"load_in_4bit": False,
"description": "BF16 (full precision baseline)",
},
}
print("Quantization configurations:")
for name, config in QUANT_CONFIGS.items():
print(f" - {name}: {config['description']}")
print("\nNote: 8-bit INT8 excluded due to Unsloth compatibility considerations")
Quantization configurations:
 - 4bit_nf4: 4-bit NF4 (QLoRA standard)
 - bf16: BF16 (full precision baseline)

Note: 8-bit INT8 excluded due to Unsloth compatibility considerations
# Quantization Comparison Loop
from trl import SFTTrainer, SFTConfig
results = []
for quant_name, quant_config in QUANT_CONFIGS.items():
print(f"\n{'='*60}")
print(f"Testing: {quant_config['description']}")
print(f"{'='*60}")
# Cleanup and measure baseline
cleanup_memory()
mem_baseline = measure_gpu_memory()
# Load model with timing
print(f"Loading model: {quant_config['model_name'].split('/')[-1]}...")
load_start = time.time()
try:
# Build load kwargs based on quantization config
load_kwargs = {
"max_seq_length": 512,
"dtype": None,
}
# Handle different quantization modes
if quant_config.get("load_in_4bit"):
load_kwargs["load_in_4bit"] = True
model, tokenizer = FastLanguageModel.from_pretrained(
quant_config["model_name"],
**load_kwargs
)
load_time = time.time() - load_start
mem_after_load = measure_gpu_memory()
print(f"Load time: {load_time:.1f}s, Memory: {mem_after_load} MB")
# Apply LoRA
print(f"Applying LoRA (r=16, alpha=16)...")
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
mem_after_lora = measure_gpu_memory()
params = count_parameters(model)
print(f"Trainable: {params['trainable']:,} ({params['pct']:.2f}%)")
# Format dataset - using Ministral's [THINK] format
def format_conversation(sample):
assistant_content = f"[THINK]\n{sample['thinking']}\n[/THINK]\n\n{sample['response']}"
messages = [
{"role": "user", "content": [{"type": "text", "text": sample["instruction"]}]},
{"role": "assistant", "content": [{"type": "text", "text": assistant_content}]}
]
return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}
dataset = Dataset.from_list(synthetic_data)
dataset = dataset.map(format_conversation, remove_columns=["instruction", "thinking", "response"])
# Training config - use bf16 for all (supported on modern GPUs)
sft_config = SFTConfig(
output_dir=f"outputs_qlora_quant_ministral/{quant_name}",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=3,
warmup_steps=1,
learning_rate=2e-4,
logging_steps=1,
fp16=False,
bf16=True,
optim="adamw_8bit",
weight_decay=0.01,
max_seq_length=512,
seed=42,
report_to="none",
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
args=sft_config,
)
# Train with timing
print(f"Training (3 steps)...")
train_start = time.time()
trainer_stats = trainer.train()
train_time = time.time() - train_start
final_loss = trainer_stats.metrics.get('train_loss', 0)
mem_peak = measure_gpu_memory()
print(f"Train time: {train_time:.1f}s, Final loss: {final_loss:.4f}")
print(f"Peak memory: {mem_peak} MB")
# Quick inference test
FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": [{"type": "text", "text": "What is deep learning?"}]}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
# Use the text tokenizer directly for text-only inference (the checkpoint's tokenizer may be a processor wrapping a text tokenizer)
text_tokenizer = tokenizer.tokenizer if hasattr(tokenizer, 'tokenizer') else tokenizer
inputs = text_tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.6,
do_sample=True,
pad_token_id=text_tokenizer.pad_token_id,
)
response = text_tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
# Store results
results.append({
"quantization": quant_name,
"description": quant_config["description"],
"load_time_sec": load_time,
"train_time_sec": train_time,
"model_memory_mb": mem_after_load - mem_baseline,
"peak_memory_mb": mem_peak,
"final_loss": final_loss,
"sample_response": response[:150],
"success": True,
})
except Exception as e:
print(f"ERROR: {e}")
results.append({
"quantization": quant_name,
"description": quant_config["description"],
"success": False,
"error": str(e),
})
finally:
# Cleanup
try:
del model, tokenizer, trainer, dataset
        except Exception:
            # Variables may not exist if an earlier step failed
            pass
cleanup_memory()
print(f"\n{'='*60}")
print("All quantization methods tested!")
print(f"{'='*60}")
# Results Visualization
import pandas as pd
import matplotlib.pyplot as plt
# Filter successful results
successful_results = [r for r in results if r.get('success', False)]
if successful_results:
df = pd.DataFrame(successful_results)
print("\n" + "="*60)
print("Quantization Comparison Results")
print("="*60)
print(df[["quantization", "peak_memory_mb", "train_time_sec", "final_loss"]].to_string(index=False))
# Calculate memory savings vs BF16
if "bf16" in df["quantization"].values:
bf16_mem = df[df["quantization"] == "bf16"]["peak_memory_mb"].values[0]
print(f"\nMemory savings vs BF16 ({bf16_mem} MB):")
for _, row in df.iterrows():
if row["quantization"] != "bf16":
savings = 100 * (1 - row["peak_memory_mb"] / bf16_mem)
print(f" {row['quantization']}: {savings:.1f}% reduction")
# Create visualization
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
colors = ['#2ecc71', '#3498db'] # green, blue
# Plot 1: Peak GPU Memory
bars1 = axes[0].bar(df['quantization'], df['peak_memory_mb'], color=colors[:len(df)])
axes[0].set_title('Peak GPU Memory', fontsize=12)
axes[0].set_xlabel('Quantization Method')
axes[0].set_ylabel('Memory (MB)')
axes[0].tick_params(axis='x', rotation=45)
for bar, val in zip(bars1, df['peak_memory_mb']):
axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
f'{val:.0f}', ha='center', va='bottom')
# Plot 2: Training Time
bars2 = axes[1].bar(df['quantization'], df['train_time_sec'], color=colors[:len(df)])
axes[1].set_title('Training Time (3 steps)', fontsize=12)
axes[1].set_xlabel('Quantization Method')
axes[1].set_ylabel('Seconds')
axes[1].tick_params(axis='x', rotation=45)
for bar, val in zip(bars2, df['train_time_sec']):
axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5,
f'{val:.1f}s', ha='center', va='bottom')
# Plot 3: Final Loss
bars3 = axes[2].bar(df['quantization'], df['final_loss'], color=colors[:len(df)])
axes[2].set_title('Final Training Loss', fontsize=12)
axes[2].set_xlabel('Quantization Method')
axes[2].set_ylabel('Loss')
axes[2].tick_params(axis='x', rotation=45)
for bar, val in zip(bars3, df['final_loss']):
axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.05,
f'{val:.4f}', ha='center', va='bottom')
plt.tight_layout()
plt.savefig('outputs_qlora_quant_ministral/quantization_comparison.png', dpi=150)
plt.show()
print("\nVisualization saved to outputs_qlora_quant_ministral/quantization_comparison.png")
else:
print("No successful results to visualize.")
============================================================
Quantization Comparison Results
============================================================
quantization peak_memory_mb train_time_sec final_loss
4bit_nf4 4926 5.678381 3.776793
bf16 5730 3.393777 3.777261
Memory savings vs BF16 (5730 MB):
4bit_nf4: 14.0% reduction
Visualization saved to outputs_qlora_quant_ministral/quantization_comparison.png
Analysis and Key Findings¶
Memory Usage¶
| Method | Peak Memory (this run) | Savings vs BF16 |
|---|---|---|
| 4-bit NF4 | ~4.9 GB | ~14% |
| BF16 | ~5.7 GB | baseline |
Note: Savings shown are for peak training memory as reported by nvidia-smi, which includes the ~1.9 GB already in use on the GPU before loading. The difference at model-load time is larger, since 4-bit stores the base weights in roughly a quarter of the space of BF16.
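For reference, the peak-memory figure in the table follows directly from the measured values (copied from this run's results output):
# Quick check: reproduce the peak-memory saving printed in the results cell.
nf4_peak_mb, bf16_peak_mb = 4926, 5730
savings_pct = 100 * (1 - nf4_peak_mb / bf16_peak_mb)
print(f"4bit_nf4 peak-memory reduction vs BF16: {savings_pct:.1f}%")  # ~14.0%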
Training Performance¶
4-bit NF4 (QLoRA):
- Uses quantized model for memory efficiency
- Excellent memory efficiency
- Minimal quality loss compared to full precision
- Standard choice for GPU-constrained environments
BF16 (Full Precision):
- Faster per-step training (no dequantization overhead): ~3.4 s vs ~5.7 s for the 3-step run here
- Better numerical precision
- Best for maximum quality when GPU memory allows
- Requires more GPU memory for model weights
Quality Comparison¶
Final loss should be nearly identical between 4-bit and BF16, demonstrating that 4-bit quantization preserves training quality effectively. The [THINK]...[/THINK] reasoning capability of Ministral is fully preserved.
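As a quick check, the relative difference between the two measured final losses (values copied from the results table above) is far below the <2% threshold stated in the expected outcomes:
# Relative final-loss difference between 4-bit NF4 and BF16 (this run's values).
loss_4bit, loss_bf16 = 3.776793, 3.777261
rel_diff_pct = 100 * abs(loss_4bit - loss_bf16) / loss_bf16
print(f"Relative loss difference: {rel_diff_pct:.3f}%")  # ~0.012%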
Recommendations¶
| GPU Memory | Recommended | Notes |
|---|---|---|
| <8 GB | 4-bit NF4 | Memory efficient, minimal quality loss |
| 8-12 GB | 4-bit NF4 or BF16 | Choose based on batch size needs |
| >12 GB | BF16 | Maximum precision when memory allows |
Key Insight¶
4-bit NF4 quantization (QLoRA) provides excellent training quality while reducing memory requirements, making it the recommended choice for most Ministral-3B-Reasoning fine-tuning scenarios.
Model Notes¶
- Model: Ministral-3B-Reasoning (3B parameters)
- Reasoning Format: [THINK]...[/THINK] tags (native Ministral format)
- Smaller than Qwen-4B, so memory requirements are generally lower
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
{'status': 'ok', 'restart': False}