GRPO Training Test: Ministral (Text-Only)¶
Tests Group Relative Policy Optimization (GRPO) reinforcement learning with Unsloth on Ministral-3B using text-only mode.
Model Variant: Text-only (FastLanguageModel)
Expected Result: Under test; Ministral uses a multimodal architecture, so text-only GRPO support is not guaranteed
Key features tested:
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- GRPOTrainer with synthetic reward function
- Post-training inference verification
GRPO Overview: GRPO is a reinforcement learning method that optimizes language models from group-relative rewards. It samples multiple completions per prompt, scores each with a reward function, and learns from how each completion compares to the rest of its group.
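To make the "relative" part concrete, here is a minimal illustrative sketch of group-relative advantage normalization (a simplified view of the idea, not Unsloth's or TRL's internal implementation):

# Illustrative only: normalize each completion's reward against the rest of its group
def group_relative_advantages(rewards_per_group):
    advantages = []
    for rewards in rewards_per_group:
        mean = sum(rewards) / len(rewards)
        std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5 or 1.0
        advantages.append([(r - mean) / std for r in rewards])
    return advantages

# Two completions per prompt: the better-scored completion gets a positive advantage
print(group_relative_advantages([[0.5, 1.0], [-1.0, 0.8]]))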
Key Differences from Qwen:
- Uses unsloth/Ministral-3-3B-Reasoning-2512 (multimodal architecture)
- Chat template uses the multimodal content format: {"type": "text", "text": "..."}
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
In [10]:
# Environment Setup
import os
# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
Out[10]:
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
Out[10]:
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues. if is_vllm_available():
Out[10]:
🦥 Unsloth Zoo will now patch everything to make training faster!
Out[10]:
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
ACCELERATE_MIXED_PRECISION: bf16
HF_TOKEN loaded: Yes
In [11]:
# Load Ministral-3B with 4-bit quantization (using FastLanguageModel for text-only)
MODEL_NAME = "unsloth/Ministral-3-3B-Reasoning-2512"
print(f"\nLoading {MODEL_NAME.split('/')[-1]} with FastLanguageModel (text-only mode)...")
model, tokenizer = FastLanguageModel.from_pretrained(
    MODEL_NAME,
    max_seq_length=512,
    load_in_4bit=True,
    dtype=None,  # Auto-detect
)
# Ensure pad token is set
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Out[11]:
Loading Ministral-3-3B-Reasoning-2512 with FastLanguageModel (text-only mode)...
Out[11]:
==((====))==  Unsloth 2025.12.10: Fast Ministral3 patching. Transformers: 5.0.0rc1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Out[11]:
Loading weights: 0%| | 0/458 [00:00<?, ?it/s]
Out[11]:
Model loaded: Mistral3ForConditionalGeneration
In [12]:
# Apply LoRA adapters for GRPO training
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Out[12]:
Unsloth: Making `model.base_model.model.model.vision_tower.transformer` require gradients
Out[12]:
LoRA applied: 33,751,040 trainable / 2,160,030,720 total (1.56%)
In [13]:
# Create minimal synthetic prompt dataset for GRPO (5 prompts)
# Using Ministral's multimodal chat format for text-only content
prompts = [
"Explain the concept of recursion in programming.",
"What are the benefits of using version control?",
"Describe how a hash table works.",
"What is the difference between a stack and a queue?",
"Explain what an API is to a beginner.",
]
# Format prompts for GRPO using Ministral's multimodal format
dataset = Dataset.from_dict({
    "prompt": [
        tokenizer.apply_chat_template(
            [{"role": "user", "content": [{"type": "text", "text": p}]}],
            tokenize=False,
            add_generation_prompt=True
        ) for p in prompts
    ]
})
print(f"Dataset created: {len(dataset)} prompts")
print(f"Sample prompt:\n{dataset[0]['prompt'][:150]}...")
Out[13]:
Dataset created: 5 prompts
Sample prompt:
<s>[SYSTEM_PROMPT]# HOW YOU SHOULD THINK AND ANSWER First draft your thinking process (inner monologue) until you arrive at a response. Format your r...
In [14]:
# Define a simple length-based reward function for testing
def simple_reward_fn(completions, prompts=None, **kwargs):
    """
    Simple reward function for testing GRPO.
    Rewards longer, more informative responses.
    """
    rewards = []
    for completion in completions:
        length = len(completion.split())
        if length < 5:
            reward = -1.0  # Too short
        elif length < 20:
            reward = 0.5  # Acceptable
        elif length < 50:
            reward = 1.0  # Good length
        else:
            reward = 0.8  # Slightly penalize very long
        rewards.append(reward)
    return rewards
print("Reward function defined: simple_reward_fn")
Out[14]:
Reward function defined: simple_reward_fn
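Before wiring the reward function into the trainer, a quick sanity check on made-up completions confirms the length buckets behave as intended:

# Hypothetical completions chosen to hit the -1.0, 0.5, and 0.8 buckets
print(simple_reward_fn([
    "Too short.",
    "Version control lets teams track changes, revert mistakes, and collaborate safely.",
    " ".join(["filler"] * 60),
]))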
In [ ]:
# GRPO Training Configuration (minimal steps for testing)
grpo_config = GRPOConfig(
    output_dir="outputs_grpo_ministral_text_test",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=2,  # Minimal steps for testing
    warmup_steps=0,
    learning_rate=1e-5,
    logging_steps=1,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    max_completion_length=64,  # Cap on generated tokens per completion
    num_generations=2,  # Completions sampled per prompt (the "group" in GRPO)
    beta=0.1,  # KL penalty weight toward the reference policy
    seed=42,
)
print("Starting GRPO training (2 steps)...")
try:
    trainer = GRPOTrainer(
        model=model,
        args=grpo_config,
        train_dataset=dataset,
        processing_class=tokenizer,
        reward_funcs=simple_reward_fn,
    )
    trainer_stats = trainer.train()
    print("GRPO training completed!")
    GRPO_TEXT_SUPPORTED = True
except Exception as e:
    print(f"GRPO training failed: {e}")
    GRPO_TEXT_SUPPORTED = False
In [16]:
# Post-training inference test
FastLanguageModel.for_inference(model)
test_prompt = "Explain what machine learning is in simple terms."
messages = [{"role": "user", "content": [{"type": "text", "text": test_prompt}]}]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
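# Ministral's processing class is multimodal, so the first positional argument is images (None for text-only input)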
inputs = tokenizer(None, input_text, add_special_tokens=False, return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=64,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
# Clean up BPE artifacts from Ministral tokenizer (Ġ=space, Ċ=newline)
response = response.replace('Ġ', ' ').replace('Ċ', '\n').strip()
# Clear success/failure banner
print("=" * 60)
if GRPO_TEXT_SUPPORTED:
    print("GRPO Training: SUPPORTED for Ministral (Text-Only)")
    print("Model: FastLanguageModel + Ministral-3-3B-Reasoning-2512")
else:
    print("GRPO Training: NOT SUPPORTED for Ministral (Text-Only)")
    print("Reason: See error above")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
Out[16]:
============================================================
GRPO Training: SUPPORTED for Ministral (Text-Only)
Model: FastLanguageModel + Ministral-3-3B-Reasoning-2512
============================================================
Sample generation:
n. First, what is machine learning? It's about computers learning from data, right? But how do I put that in simple terms? Maybe I can compare it to how humans learn. When we learn, we see something
Test Complete¶
The GRPO Training Pipeline test for Ministral (Text-Only) has completed. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastLanguageModel loading with 4-bit quantization (Ministral-3B)
- LoRA adapter configuration for RL training
- Synthetic prompt dataset with Ministral's multimodal format
- Simple reward function integration
- GRPOTrainer training loop (2 steps)
- Post-training inference generation
GRPO Concepts Demonstrated¶
- Group Relative Policy Optimization: multiple completions sampled per prompt and scored against each other
- Reward-based learning: custom reward function integration
- KL Penalty (beta): keeps the policy from drifting too far from the reference model (sketched below)
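As a rough sketch of how these pieces fit together (the standard GRPO formulation in simplified form, ignoring the PPO-style clipping that trainers typically apply):

$$ A_i = \frac{r_i - \mathrm{mean}(r_1,\dots,r_G)}{\mathrm{std}(r_1,\dots,r_G)}, \qquad \mathcal{L} \approx -\frac{1}{G}\sum_{i=1}^{G} A_i \log \pi_\theta(o_i \mid q) + \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big) $$

Here G is num_generations, r_i is the reward for completion o_i to prompt q, and beta weights the KL penalty toward the reference model.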
Next Steps¶
- Compare with 04_GRPO_Training_Ministral_Vision.ipynb for vision GRPO
In [17]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Out[17]:
Shutting down kernel to release GPU memory...
Out[17]:
{'status': 'ok', 'restart': False}