Unsloth Vision Training Verification¶
This notebook tests the complete vision model fine-tuning pipeline:
- FastVisionModel loading
- LoRA adapter configuration
- Dataset loading and formatting
- SFTTrainer training loop
- Inference after training
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory after the vision training test.
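For context, releasing GPU memory without restarting the kernel is possible but unreliable. A minimal sketch of the in-process alternative (the helper name is illustrative, not part of Unsloth):

```python
import gc

def release_references(namespace: dict, names: list) -> list:
    """Drop named references (e.g. 'model', 'trainer') so their GPU tensors
    become collectable, then run the garbage collector. CUDA caching and
    fragmentation can still pin memory even after this, which is why this
    notebook prefers a full kernel shutdown instead."""
    dropped = [n for n in names if namespace.pop(n, None) is not None]
    gc.collect()
    try:
        import torch  # optional: also return cached blocks to the driver
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
    return dropped
```

Calling `release_references(globals(), ["model", "trainer"])` frees most memory in practice, but a shutdown is the only guaranteed way to return everything to the driver.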
In [1]:
# Environment Setup
from dotenv import load_dotenv
import os
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
import transformers
import vllm
import trl
import torch
print(f"unsloth: {unsloth.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"TRL: {trl.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
HF_TOKEN loaded: Yes
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
unsloth: 2025.12.10
transformers: 5.0.0rc1
vLLM: 0.14.0rc1.dev201+gadcf682fc
TRL: 0.26.2
PyTorch: 2.9.1+cu130
CUDA: True
GPU: NVIDIA GeForce RTX 4080 SUPER
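The UserWarning above comes from a simple version compatibility check. A minimal sketch of how such a check works (the helper names and the supported-version set below are illustrative, mirroring the warning text, not TRL's actual API):

```python
def parse_version(v: str) -> tuple:
    """Extract the leading numeric (major, minor, patch) components from a
    version string, ignoring suffixes like 'rc1', '.dev201', or '+cu130'."""
    parts = []
    for token in v.split(".")[:3]:
        digits = ""
        for ch in token:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

# Hypothetical supported list, taken from the warning message above.
SUPPORTED_VLLM = {(0, 10, 2), (0, 11, 0), (0, 11, 1), (0, 11, 2)}

def vllm_version_supported(installed: str) -> bool:
    """True when the installed vLLM matches a supported release."""
    return parse_version(installed) in SUPPORTED_VLLM
```

The installed `0.14.0rc1.dev201+gadcf682fc.cu130` build parses to `(0, 14, 0)`, which is not in the supported set, hence the warning. It is non-fatal here because this notebook does not exercise vLLM.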
Ministral VL (Vision) Training Verification¶
This section tests the complete vision model fine-tuning pipeline:
- FastVisionModel loading
- LoRA adapter configuration
- Dataset loading and formatting
- SFTTrainer training loop
- Inference after training
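Before training, it can help to sanity-check that each converted sample matches the multimodal chat schema the vision data collator expects. A minimal validator sketch (the function name and checks are assumptions for illustration, not part of Unsloth):

```python
def validate_conversation(sample: dict) -> bool:
    """Check one converted sample: a 'messages' list of role/content turns,
    where each content entry is a list of typed parts ('text' or 'image')."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for turn in messages:
        if turn.get("role") not in {"system", "user", "assistant"}:
            return False
        content = turn.get("content")
        if not isinstance(content, list):
            return False
        for part in content:
            if part.get("type") not in {"text", "image"}:
                return False
            if part["type"] == "text" and "text" not in part:
                return False
    return True
```

Running this over `converted_dataset` before constructing the trainer catches formatting mistakes (e.g. a plain-string `content`) earlier than the collator would.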
In [ ]:
# Complete Vision Pipeline Test (self-contained)
# Tests: Model loading, LoRA, Dataset, Training (2 steps), Inference
print("=== Vision Training Pipeline Test ===")
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# 1. Load model
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/Ministral-3-3B-Reasoning-2512",
load_in_4bit=True,
use_gradient_checkpointing="unsloth",
)
print(f"✓ FastVisionModel loaded: {type(model).__name__}")
# 2. Apply LoRA
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers=True,
finetune_language_layers=True,
finetune_attention_modules=True,
finetune_mlp_modules=True,
r=16,
lora_alpha=16,
lora_dropout=0,
bias="none",
random_state=3407,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"✓ LoRA applied ({trainable:,} trainable params)")
# 3. Load dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:5]")
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
return {
"messages": [
{"role": "user", "content": [
{"type": "text", "text": instruction},
{"type": "image", "image": sample["image"]}
]},
{"role": "assistant", "content": [
{"type": "text", "text": sample["text"]}
]}
]
}
converted_dataset = [convert_to_conversation(s) for s in dataset]
print(f"✓ Dataset loaded ({len(converted_dataset)} samples)")
# 4. Train (2 steps)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
data_collator=UnslothVisionDataCollator(model, tokenizer),
train_dataset=converted_dataset,
args=SFTConfig(
per_device_train_batch_size=1,
max_steps=2,
warmup_steps=0,
learning_rate=2e-4,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
output_dir="outputs_ministral_vl_test",
remove_unused_columns=False,
dataset_text_field="",
dataset_kwargs={"skip_prepare_dataset": True},
max_seq_length=1024,
),
)
trainer_stats = trainer.train()
loss = trainer_stats.metrics.get("train_loss")
print(f"✓ Training completed (loss: {loss:.4f})" if loss is not None else "✓ Training completed (loss unavailable)")
# 5. Inference test
FastVisionModel.for_inference(model)
test_image = dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64, temperature=1.5, min_p=0.1)
print("✓ Inference test passed")
print("✓ Vision Training Pipeline test PASSED")
Test Complete¶
The Vision Training Pipeline test has completed. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastVisionModel loading with 4-bit quantization
- LoRA adapter application (vision + language layers)
- Dataset loading and conversation formatting
- SFTTrainer training loop (2 steps)
- Post-training inference
Ready for Production¶
If this test passed, your environment is ready for:
- Ministral_3_VL_(3B)_Vision.ipynb - Full vision fine-tuning tutorial
In [3]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
Out[3]:
{'status': 'ok', 'restart': False}