Model Quantization

Overview

Quantization reduces model precision to save memory and speed up inference. A 7B model at FP32 requires ~28 GB for weights alone; at 4-bit, only ~3.5 GB.

Quick Reference

| Precision | Bits | Memory (bytes/param) | Quality | Speed |
|-----------|------|----------------------|------------|----------|
| FP32 | 32 | 4 | Best | Slowest |
| FP16 | 16 | 2 | Excellent | Fast |
| BF16 | 16 | 2 | Excellent | Fast |
| INT8 | 8 | 1 | Good | Faster |
| INT4 | 4 | 0.5 | Acceptable | Fastest |

Memory Estimation

def estimate_memory(params_billions, precision_bits):
    """Estimate weight memory in GB (weights only; excludes activations and KV cache)."""
    bytes_per_param = precision_bits / 8
    return params_billions * bytes_per_param

# Example: 7B model
model_size = 7  # billion parameters

print(f"FP32: {estimate_memory(model_size, 32):.1f} GB")  # 28 GB
print(f"FP16: {estimate_memory(model_size, 16):.1f} GB")  # 14 GB
print(f"INT8: {estimate_memory(model_size, 8):.1f} GB")   # 7 GB
print(f"INT4: {estimate_memory(model_size, 4):.1f} GB")   # 3.5 GB

Measure Model Size

def get_model_size(model):
    """Get model size in GB including buffers."""
    param_size = sum(p.numel() * p.element_size() for p in model.parameters())
    buffer_size = sum(b.numel() * b.element_size() for b in model.buffers())
    total = (param_size + buffer_size) / 1024**3
    return total

print(f"Model size: {get_model_size(model):.2f} GB")

Load Model at Different Precisions

FP32 (Default)

from transformers import AutoModelForCausalLM

model_32bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto"
)

print(f"FP32 size: {get_model_size(model_32bit):.2f} GB")

FP16 / BF16

import torch

model_16bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    torch_dtype=torch.float16,  # or torch.bfloat16
    device_map="auto"
)

print(f"FP16 size: {get_model_size(model_16bit):.2f} GB")

8-bit Quantization

from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True
)

model_8bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=quantization_config,
    device_map="auto"
)

print(f"8-bit size: {get_model_size(model_8bit):.2f} GB")
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True  # Nested quantization
)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=quantization_config,
    device_map="auto"
)

print(f"4-bit size: {get_model_size(model_4bit):.2f} GB")

BitsAndBytesConfig Options

4-bit Configuration

from transformers import BitsAndBytesConfig
import torch

config = BitsAndBytesConfig(
    load_in_4bit=True,

    # Quantization type
    bnb_4bit_quant_type="nf4",  # "nf4" or "fp4"

    # Compute dtype for dequantized weights
    bnb_4bit_compute_dtype=torch.bfloat16,

    # Double quantization (saves more memory)
    bnb_4bit_use_double_quant=True,
)

Options Explained

| Option | Values | Effect |
|--------|--------|--------|
| load_in_4bit | True/False | Enable 4-bit loading |
| bnb_4bit_quant_type | "nf4", "fp4" | NF4 usually works better for LLM weights |
| bnb_4bit_compute_dtype | float16, bfloat16 | Computation precision |
| bnb_4bit_use_double_quant | True/False | Quantize the quantization constants |
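
As a quick sanity check you can confirm that the config actually swapped the model's Linear layers for bitsandbytes 4-bit layers. A minimal sketch, assuming the model_4bit loaded earlier and that bitsandbytes is importable:

import bitsandbytes as bnb

# Collect the Linear layers that were replaced by 4-bit bitsandbytes layers
quantized_layers = [
    name for name, module in model_4bit.named_modules()
    if isinstance(module, bnb.nn.Linear4bit)
]
print(f"{len(quantized_layers)} Linear4bit layers, e.g. {quantized_layers[:3]}")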

Compare Precision Performance

from transformers import AutoTokenizer, pipeline
import time

model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Test message
messages = [{"role": "user", "content": "Explain quantum computing."}]

def benchmark(model, tokenizer, name):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    start = time.time()
    output = pipe(messages, max_new_tokens=100, return_full_text=False)
    elapsed = time.time() - start

    print(f"{name}:")
    print(f"  Time: {elapsed:.2f}s")
    print(f"  Size: {get_model_size(model):.2f} GB")
    print(f"  Output: {output[0]['generated_text'][:50]}...")
    print()

# Benchmark each precision
benchmark(model_32bit, tokenizer, "FP32")
benchmark(model_16bit, tokenizer, "FP16")
benchmark(model_8bit, tokenizer, "8-bit")
benchmark(model_4bit, tokenizer, "4-bit")
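
Wall-clock time depends on how many tokens each run actually produces, so a per-token rate is often a fairer comparison. A small variant of the benchmark above (a sketch that reuses the imports, messages, and tokenizer from the previous block and simply re-tokenizes the output to count tokens):

def benchmark_tps(model, tokenizer, name):
    pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

    start = time.time()
    output = pipe(messages, max_new_tokens=100, return_full_text=False)
    elapsed = time.time() - start

    text = output[0]["generated_text"]
    # Re-tokenize the generated text to count new tokens
    n_tokens = len(tokenizer(text)["input_ids"])
    print(f"{name}: {n_tokens / elapsed:.1f} tokens/s ({elapsed:.2f}s total)")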

Quantization for Training

QLoRA Setup

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit base model
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    quantization_config=quantization_config,
    device_map="auto"
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Add LoRA adapters
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
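
With the 4-bit base weights frozen, only the LoRA adapters train. PEFT can report the trainable fraction directly, which is a useful sanity check before launching a run:

# Only the LoRA adapter weights are trainable; the 4-bit base weights stay frozen
model.print_trainable_parameters()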

Precision Comparison

| Precision | Memory (bytes/param) | Quality | Training | Best For |
|-----------|----------------------|------------|----------|----------------------|
| FP32 | 4 | Perfect | Yes | Research, baselines |
| FP16 | 2 | Excellent | Yes | Standard training |
| BF16 | 2 | Excellent | Yes | Large models |
| INT8 | 1 | Good | Limited | Inference |
| INT4 | 0.5 | Acceptable | QLoRA | Memory-constrained |

FP16 vs BF16

| Aspect | FP16 | BF16 |
|---------------|-----------|---------------------|
| Range | Smaller | Larger (like FP32) |
| Precision | Higher | Lower |
| Overflow risk | Higher | Lower |
| Hardware | All GPUs | Ampere+ |
| Best for | Inference | Training |
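
BF16 requires Ampere-class (compute capability 8.0) or newer GPUs, and PyTorch can check this at runtime, so a dtype fallback is easy to add:

import torch

# Fall back to FP16 on GPUs without BF16 support
dtype = (
    torch.bfloat16
    if torch.cuda.is_available() and torch.cuda.is_bf16_supported()
    else torch.float16
)
print(f"Using {dtype}")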

4-bit NF4 vs BF16 Comparison (Tested)

Based on experiments with Qwen3-4B-Thinking models:

Comparison Results

| Method | Peak Memory | Final Loss | Quality |
|-----------|-------------|------------|-----------|
| 4-bit NF4 | ~5.7 GB | 3.0742 | Excellent |
| BF16 | ~6.5 GB | 3.0742 | Reference |

Key Finding: 4-bit NF4 achieves identical final loss with 11-15% memory savings.

Use pre-quantized models for faster loading:

from unsloth import FastLanguageModel

# Pre-quantized (fast loading)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",  # -bnb-4bit suffix
    max_seq_length=1024,
    load_in_4bit=True,
)

# vs. On-demand quantization (slower)
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3-4B-Thinking-2507",  # Full precision
    max_seq_length=1024,
    load_in_4bit=True,  # Quantize during load
)

GPU Memory Recommendations

| GPU VRAM | Recommended | Notes |
|----------|---------------|----------------------------|
| <12 GB | 4-bit NF4 | Required for training |
| 12-16 GB | 4-bit NF4 | Allows larger batches |
| >16 GB | BF16 or 4-bit | Choose based on batch needs |
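
These thresholds can be applied automatically at load time. A minimal sketch, assuming at least one CUDA GPU; the pick_quantization helper and its cutoff are illustrative, following the table above rather than any hard limit:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def pick_quantization():
    # Read total VRAM of the first GPU and choose a loading strategy
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb > 16:
        return {"torch_dtype": torch.bfloat16}  # BF16; switch to 4-bit for larger batches
    return {"quantization_config": BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )}

model = AutoModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    device_map="auto",
    **pick_quantization(),
)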

Quality Preservation

4-bit NF4 preserves:

  • Training convergence (identical final loss)
  • Thinking tag structure (<think>...</think>)
  • Response quality and coherence
  • Model reasoning capabilities
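
A quick way to spot-check the thinking tags after quantized fine-tuning is to generate once and look for the closing tag. A sketch, assuming the 4-bit Qwen3-Thinking model and tokenizer loaded via Unsloth above; the prompt is arbitrary:

# Build a chat prompt and generate; the thinking models close their reasoning with </think>
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is 17 * 24?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256)
generated = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False,
)
print("thinking tags intact:", "</think>" in generated)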

Troubleshooting

Out of Memory

Symptom: CUDA OOM error

Fix:

# Use 4-bit quantization
config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True
)
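
If 4-bit loading alone is not enough during training, also reduce the batch size or sequence length, and enable gradient checkpointing to trade compute for memory (the standard transformers call, applied to the loaded model):

# Recompute activations in the backward pass instead of storing them
model.gradient_checkpointing_enable()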

Quality Degradation

Symptom: Poor model outputs after quantization

Fix:

  • Use nf4 instead of fp4
  • Try 8-bit instead of 4-bit
  • Increase LoRA rank if fine-tuning

Slow Loading

Symptom: Model takes long to load

Fix:

  • On-the-fly quantization happens at load time; prefer pre-quantized checkpoints (the -bnb-4bit variants shown above) for faster loading
  • Use device_map="auto" to spread layers across available GPUs

When to Use This Skill

Use when:

  • Model doesn't fit in GPU memory
  • Need faster inference
  • Training with limited resources (QLoRA)
  • Deploying to edge devices

Cross-References

  • bazzite-ai-jupyter:qlora - Advanced QLoRA experiments
  • bazzite-ai-jupyter:peft - LoRA with quantization (QLoRA)
  • bazzite-ai-jupyter:finetuning - Full fine-tuning
  • bazzite-ai-jupyter:sft - SFT training with quantization
  • bazzite-ai-jupyter:inference - Fast inference patterns
  • bazzite-ai-jupyter:transformers - Model architecture