Unsloth Vision Training Verification¶
This notebook tests the complete vision model fine-tuning pipeline:
- FastVisionModel loading
- LoRA adapter configuration
- Dataset loading and formatting
- SFTTrainer training loop
- Inference after training
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory after the vision training test.
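For context, releasing GPU memory without restarting the kernel is possible but unreliable. A minimal sketch of the in-process alternative (the helper name is illustrative, not part of Unsloth):

```python
import gc

def release_references(namespace: dict, names: list) -> list:
    """Drop named references (e.g. 'model', 'trainer') so their GPU tensors
    become collectable, then run the garbage collector. CUDA caching and
    fragmentation can still pin memory even after this, which is why this
    notebook prefers a full kernel shutdown instead."""
    dropped = [n for n in names if namespace.pop(n, None) is not None]
    gc.collect()
    try:
        import torch  # optional: also return cached blocks to the driver
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass
    return dropped
```

Calling `release_references(globals(), ["model", "trainer"])` frees most memory in practice, but a shutdown is the only guaranteed way to return everything to the driver.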
In [1]:
# Environment Setup
from dotenv import load_dotenv
import os
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
import transformers
import vllm
import trl
import torch
print(f"unsloth: {unsloth.__version__}")
print(f"transformers: {transformers.__version__}")
print(f"vLLM: {vllm.__version__}")
print(f"TRL: {trl.__version__}")
print(f"PyTorch: {torch.__version__}")
print(f"CUDA: {torch.cuda.is_available()}")
if torch.cuda.is_available():
print(f"GPU: {torch.cuda.get_device_name(0)}")
HF_TOKEN loaded: Yes
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
unsloth: 2025.12.10
transformers: 5.0.0rc1
vLLM: 0.14.0rc1.dev201+gadcf682fc
TRL: 0.26.2
PyTorch: 2.9.1+cu130
CUDA: True
GPU: NVIDIA GeForce RTX 4080 SUPER
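The UserWarning above comes from a simple version compatibility check. A minimal sketch of how such a check works (the helper names and the supported-version set below are illustrative, mirroring the warning text, not TRL's actual API):

```python
def parse_version(v: str) -> tuple:
    """Extract the leading numeric (major, minor, patch) components from a
    version string, ignoring suffixes like 'rc1', '.dev201', or '+cu130'."""
    parts = []
    for token in v.split(".")[:3]:
        digits = ""
        for ch in token:
            if ch.isdigit():
                digits += ch
            else:
                break
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

# Hypothetical supported list, taken from the warning message above.
SUPPORTED_VLLM = {(0, 10, 2), (0, 11, 0), (0, 11, 1), (0, 11, 2)}

def vllm_version_supported(installed: str) -> bool:
    """True when the installed vLLM matches a supported release."""
    return parse_version(installed) in SUPPORTED_VLLM
```

The installed `0.14.0rc1.dev201+gadcf682fc.cu130` build parses to `(0, 14, 0)`, which is not in the supported set, hence the warning. It is non-fatal here because this notebook does not exercise vLLM.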
Ministral VL (Vision) Training Verification¶
This section tests the complete vision model fine-tuning pipeline:
- FastVisionModel loading
- LoRA adapter configuration
- Dataset loading and formatting
- SFTTrainer training loop
- Inference after training
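Before training, it can help to sanity-check that each converted sample matches the multimodal chat schema the vision data collator expects. A minimal validator sketch (the function name and checks are assumptions for illustration, not part of Unsloth):

```python
def validate_conversation(sample: dict) -> bool:
    """Check one converted sample: a 'messages' list of role/content turns,
    where each content entry is a list of typed parts ('text' or 'image')."""
    messages = sample.get("messages")
    if not isinstance(messages, list) or not messages:
        return False
    for turn in messages:
        if turn.get("role") not in {"system", "user", "assistant"}:
            return False
        content = turn.get("content")
        if not isinstance(content, list):
            return False
        for part in content:
            if part.get("type") not in {"text", "image"}:
                return False
            if part["type"] == "text" and "text" not in part:
                return False
    return True
```

Running this over `converted_dataset` before constructing the trainer catches formatting mistakes (e.g. a plain-string `content`) earlier than the collator would.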
In [ ]:
# Complete Vision Pipeline Test (self-contained)
# Tests: Model loading, LoRA, Dataset, Training (2 steps), Inference
print("=== Vision Training Pipeline Test ===")
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
# 1. Load model
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/Ministral-3-3B-Reasoning-2512",
load_in_4bit=True,
use_gradient_checkpointing="unsloth",
)
print(f"✓ FastVisionModel loaded: {type(model).__name__}")
# 2. Apply LoRA
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers=True,
finetune_language_layers=True,
finetune_attention_modules=True,
finetune_mlp_modules=True,
r=16,
lora_alpha=16,
lora_dropout=0,
bias="none",
random_state=3407,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"✓ LoRA applied ({trainable:,} trainable params)")
# 3. Load dataset
dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:5]")
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
return {
"messages": [
{"role": "user", "content": [
{"type": "text", "text": instruction},
{"type": "image", "image": sample["image"]}
]},
{"role": "assistant", "content": [
{"type": "text", "text": sample["text"]}
]}
]
}
converted_dataset = [convert_to_conversation(s) for s in dataset]
print(f"✓ Dataset loaded ({len(converted_dataset)} samples)")
# 4. Train (2 steps)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
data_collator=UnslothVisionDataCollator(model, tokenizer),
train_dataset=converted_dataset,
args=SFTConfig(
per_device_train_batch_size=1,
max_steps=2,
warmup_steps=0,
learning_rate=2e-4,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
output_dir="outputs_ministral_vl_test",
remove_unused_columns=False,
dataset_text_field="",
dataset_kwargs={"skip_prepare_dataset": True},
max_seq_length=1024,
),
)
trainer_stats = trainer.train()
loss = trainer_stats.metrics.get("train_loss")
print(f"✓ Training completed (loss: {loss:.4f})" if loss is not None else "✓ Training completed (loss unavailable)")
# 5. Inference test
FastVisionModel.for_inference(model)
test_image = dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=64, temperature=1.5, min_p=0.1)
print("✓ Inference test passed")
print("✓ Vision Training Pipeline test PASSED")
Test Complete¶
The Vision Training Pipeline test has completed. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastVisionModel loading with 4-bit quantization
- LoRA adapter application (vision + language layers)
- Dataset loading and conversation formatting
- SFTTrainer training loop (2 steps)
- Post-training inference
Ready for Production¶
If this test passed, your environment is ready for:
- Ministral_3_VL_(3B)_Vision.ipynb - Full vision fine-tuning tutorial
In [3]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
Out[3]:
{'status': 'ok', 'restart': False}