SFT Training Test: Ministral (Vision)
Tests Supervised Fine-Tuning with Unsloth's optimized SFTTrainer on Ministral-3B using vision mode.
Model Variant: Vision (FastVisionModel)
Expected Result: Works - uses native vision capabilities
Key features tested:
- FastVisionModel loading with 4-bit quantization
- LoRA adapter configuration (vision + language layers)
- SFTTrainer with UnslothVisionDataCollator
- Vision dataset (LaTeX_OCR) with image inputs
- Post-training inference verification
Key Differences from Text-Only:
- Uses FastVisionModel instead of FastLanguageModel
- Uses UnslothVisionDataCollator for vision data
- Dataset includes actual images
- Chat format includes {"type": "image"} elements (see the sketch below)
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
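For orientation, here is a minimal sketch of the multimodal message layout described above. It is illustrative only and not a cell from the test; `img` is a placeholder PIL image standing in for a real dataset sample.

# Sketch: the vision chat format with a {"type": "image"} element.
# `img` is a placeholder; real images come from the dataset loaded later.
from PIL import Image

img = Image.new("RGB", (64, 64))
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Write the LaTeX representation for this image."},
        {"type": "image", "image": img},
    ],
}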
In [2]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
import torch
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [3]:
# Load Ministral-3B with FastVisionModel for vision capabilities
MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with FastVisionModel...")
model, tokenizer = FastVisionModel.from_pretrained(
    MODEL_NAME,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
print(f"Model loaded: {type(model).__name__}")
Loading Ministral-3-3B-Reasoning-2512 with FastVisionModel...
==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded: Mistral3ForConditionalGeneration
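As an optional sanity check (a sketch, not part of the recorded run), the footprint of the 4-bit model can be read from PyTorch's allocator statistics; exact numbers will vary by GPU and driver.

# Sketch: report GPU memory reserved after loading the 4-bit model.
reserved_gb = torch.cuda.max_memory_reserved() / 1024**3
print(f"GPU memory reserved after load: {reserved_gb:.2f} GB")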
In [4]:
# Apply LoRA adapters for vision training
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients
LoRA applied: 33,751,040 trainable / 2,160,030,720 total (1.56%)
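To confirm adapters actually landed on both the vision tower and the language layers, one option (a sketch, assuming standard PEFT module naming where LoRA layers own lora_A/lora_B submodules) is to scan the module tree:

# Sketch: count LoRA-wrapped projections and how many sit in the vision tower.
# Assumes PEFT's standard lora_A/lora_B submodule naming.
lora_names = [n for n, _ in model.named_modules() if n.endswith("lora_A")]
vision_hits = [n for n in lora_names if "vision" in n]
print(f"{len(lora_names)} LoRA-wrapped projections, {len(vision_hits)} in the vision tower")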
In [5]:
# Load vision dataset (LaTeX_OCR - 5 samples for testing)
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
dataset = load_dataset("unsloth/LaTeX_OCR", split="train[:5]")
instruction = "Write the LaTeX representation for this image."
def convert_to_conversation(sample):
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "text", "text": instruction},
                {"type": "image", "image": sample["image"]}
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": sample["text"]}
            ]}
        ]
    }
converted_dataset = [convert_to_conversation(s) for s in dataset]
print(f"Dataset loaded: {len(converted_dataset)} vision samples")
Dataset loaded: 5 vision samples
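A quick structural check (illustrative; the trainer does not require it) confirms each converted sample carries the expected turns and an image part:

# Sketch: verify the converted sample structure before training.
sample = converted_dataset[0]
for msg in sample["messages"]:
    part_types = [part["type"] for part in msg["content"]]
    print(f"{msg['role']}: {part_types}")
# Expected: user: ['text', 'image'], assistant: ['text']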
In [ ]:
# SFT Training with Vision (minimal steps for testing)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=converted_dataset,
    args=SFTConfig(
        per_device_train_batch_size=1,
        max_steps=3,  # Minimal steps for testing
        warmup_steps=1,
        learning_rate=2e-4,
        logging_steps=1,
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        output_dir="outputs_sft_ministral_vision_test",
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
        max_seq_length=1024,
    ),
)
print("Starting SFT Vision training (3 steps)...")
try:
    trainer_stats = trainer.train()
    # Format the loss only when it is present; the previous 'N/A' fallback
    # would crash the f-string's :.4f formatting.
    final_loss = trainer_stats.metrics.get("train_loss")
    loss_str = f"{final_loss:.4f}" if final_loss is not None else "N/A"
    print(f"Training completed. Final loss: {loss_str}")
    SFT_VISION_SUPPORTED = True
except Exception as e:
    print(f"Training failed: {e}")
    SFT_VISION_SUPPORTED = False
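The test itself does not persist anything, but the usual next step after a successful run would be to save the adapters. A sketch, with "lora_ministral_vision" as a hypothetical output directory:

# Sketch (not run here): save the trained LoRA adapters for later reuse.
# "lora_ministral_vision" is a hypothetical output directory.
if SFT_VISION_SUPPORTED:
    model.save_pretrained("lora_ministral_vision")
    tokenizer.save_pretrained("lora_ministral_vision")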
In [7]:
# Post-training inference test with vision
FastVisionModel.for_inference(model)
test_image = dataset[0]["image"]
messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": instruction}]}]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(test_image, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=64, temperature=1.5, min_p=0.1)
response = tokenizer.decode(output[0], skip_special_tokens=True)
# Clean up BPE artifacts from Ministral tokenizer (Ġ=space, Ċ=newline)
response = response.replace('Ġ', ' ').replace('Ċ', '\n').strip()
# Clear success/failure banner
print("=" * 60)
if SFT_VISION_SUPPORTED:
    print("SFT Training: SUPPORTED for Ministral (Vision)")
    print("Model: FastVisionModel + Ministral-3-3B-Reasoning-2512")
    print("Components: UnslothVisionDataCollator, LaTeX_OCR dataset")
else:
    print("SFT Training: NOT SUPPORTED for Ministral (Vision)")
    print("Reason: See error above")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
============================================================
SFT Training: SUPPORTED for Ministral (Vision)
Model: FastVisionModel + Ministral-3-3B-Reasoning-2512
Components: UnslothVisionDataCollator, LaTeX_OCR dataset
============================================================
Sample generation:
.Okay, I need to write the LaTeX representation for this image. Let me look at the equation in the image.
The equation is:
\[ \frac{N}{M} \in \mathbb{Z}, \frac{M}{P} \in \mathbb{Z}, \frac{P}{Q} \in \

Test Complete
The SFT Training Pipeline test for Ministral (Vision) has completed. The kernel will now shut down to release all GPU memory.
What Was Verified
- FastVisionModel loading with 4-bit quantization (Ministral-3B)
- LoRA adapter configuration (vision + language layers)
- Vision dataset loading (LaTeX_OCR)
- UnslothVisionDataCollator integration
- SFTTrainer training loop (3 steps)
- Post-training vision inference
Ministral Vision Notes
- Uses FastVisionModel for native multimodal support
- Requires UnslothVisionDataCollator for vision data
- Dataset must include actual images with {"type": "image"} format
Comparison with Text-Only
| Aspect | Text-Only | Vision |
|---|---|---|
| Model Class | FastLanguageModel | FastVisionModel |
| Data Collator | None | UnslothVisionDataCollator |
| Dataset | Synthetic text | LaTeX_OCR (images) |
| LoRA Layers | Language only | Vision + Language |
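For contrast, the text-only column of this table would look roughly like the sketch below. This is illustrative only, assuming the same checkpoint also loads through the text-only API; it is not verified in this notebook.

# Sketch (not run here): the text-only counterpart loads via FastLanguageModel
# and needs no vision collator. Assumes the checkpoint supports this path.
from unsloth import FastLanguageModel

text_model, text_tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Ministral-3-3B-Reasoning-2512",
    max_seq_length=1024,
    load_in_4bit=True,
)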
In [8]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Shutting down kernel to release GPU memory...
Out[8]:
{'status': 'ok', 'restart': False}