GRPO Training Test: Qwen3-4B
This notebook tests Group Relative Policy Optimization (GRPO) reinforcement learning with Unsloth on Qwen3-4B.
Key features tested:
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- GRPOTrainer with synthetic reward function
- Post-training inference verification
GRPO Overview: GRPO is a reinforcement learning method that optimizes a language model without a separate value network. For each prompt it samples a group of completions, scores them with a reward function, and uses each completion's reward relative to the rest of its group as the advantage for the policy update.
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
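To make the "relative rewards" idea above concrete, here is a minimal sketch of the group-relative advantage computation described in the GRPO paper: the rewards for each prompt's group of completions are mean-centered and divided by the group's standard deviation. The helper name group_relative_advantages is ours for illustration, not an Unsloth or TRL API.

# Illustrative sketch (not trainer internals): group-relative advantages
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    # rewards has shape (num_prompts, num_generations); returns same-shape advantages
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: two completions per prompt; the better-rewarded one gets a positive advantage
print(group_relative_advantages(torch.tensor([[0.5, 1.0], [-1.0, 0.8]])))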
In [1]:
# Environment Setup
import os
# FIX: Set ACCELERATE_MIXED_PRECISION BEFORE importing unsloth
# This ensures autocast dtype matches model dtype (bfloat16)
os.environ['ACCELERATE_MIXED_PRECISION'] = 'bf16'
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
from trl import GRPOConfig, GRPOTrainer
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"ACCELERATE_MIXED_PRECISION: {os.environ.get('ACCELERATE_MIXED_PRECISION', 'not set')}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
ACCELERATE_MIXED_PRECISION: bf16
HF_TOKEN loaded: Yes
In [2]:
# Load Qwen3-4B with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")
model, tokenizer = FastLanguageModel.from_pretrained(
MODEL_NAME,
max_seq_length=512,
load_in_4bit=True,
dtype=None, # Auto-detect - let Unsloth manage dtype
)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Loading Qwen3-4B-unsloth-bnb-4bit...
==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Loading weights: 0%| | 0/398 [00:00<?, ?it/s]
Model loaded: Qwen3ForCausalLM
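As an optional sanity check, the VRAM footprint of the 4-bit checkpoint can be reported with standard PyTorch calls (this assumes the CUDA device shown in the banner above; exact numbers will vary by GPU and driver):

# Optional sanity check: VRAM footprint after the 4-bit load
print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
print(f"GPU memory reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")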
In [3]:
# Apply LoRA adapters for GRPO training
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth", # Use Unsloth-optimized checkpointing
random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth 2025.12.10 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
LoRA applied: 33,030,144 trainable / 2,541,616,640 total (1.30%)
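Optionally, the injected adapter settings can be inspected through the PEFT configuration attached to the returned model (this assumes, as is typical for Unsloth, that get_peft_model returns a standard PEFT model object):

# Optional: inspect the adapter configuration attached by get_peft_model
print(model.peft_config)  # adapter name -> LoraConfig with r, lora_alpha, target_modules
# With r=16 and lora_alpha=16, the effective LoRA scaling factor is lora_alpha / r = 1.0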
In [4]:
# Create minimal synthetic prompt dataset for GRPO (5 prompts)
# GRPO requires prompts only - completions are generated during training
prompts = [
"Explain the concept of recursion in programming.",
"What are the benefits of using version control?",
"Describe how a hash table works.",
"What is the difference between a stack and a queue?",
"Explain what an API is to a beginner.",
]
# Format prompts for GRPO (requires "prompt" field)
dataset = Dataset.from_dict({
"prompt": [
tokenizer.apply_chat_template(
[{"role": "user", "content": p}],
tokenize=False,
add_generation_prompt=True
) for p in prompts
]
})
print(f"Dataset created: {len(dataset)} prompts")
print(f"Sample prompt:\n{dataset[0]['prompt'][:150]}...")
Dataset created: 5 prompts
Sample prompt:
<|im_start|>user
Explain the concept of recursion in programming.<|im_end|>
<|im_start|>assistant
...
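As an optional check, confirm the templated prompts fit comfortably within the 512-token max_seq_length set earlier, leaving room for the 64-token completion budget configured below:

# Optional sanity check: templated prompt lengths vs. max_seq_length=512
lengths = [len(tokenizer(p)["input_ids"]) for p in dataset["prompt"]]
print(f"Prompt token lengths: min={min(lengths)}, max={max(lengths)}")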
In [5]:
# Define a simple length-based reward function for testing
# In production, this would be a learned reward model or rule-based system
def simple_reward_fn(completions, prompts=None, **kwargs):
"""
Simple reward function for testing GRPO.
Rewards longer, more informative responses.
"""
rewards = []
for completion in completions:
# Basic heuristics for testing:
# - Reward longer responses (up to a point)
# - Penalize very short responses
length = len(completion.split())
if length < 5:
reward = -1.0 # Too short
elif length < 20:
reward = 0.5 # Acceptable
elif length < 50:
reward = 1.0 # Good length
else:
reward = 0.8 # Slightly penalize very long
rewards.append(reward)
return rewards
print("Reward function defined: simple_reward_fn")
print("Rewards based on response length (testing only)")
Reward function defined: simple_reward_fn
Rewards based on response length (testing only)
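Before training, the reward function can be exercised directly on a couple of mock completions to confirm the length thresholds behave as intended:

# Quick sanity check of the heuristic (mock completions, not model output)
mock_completions = [
    "Short.",
    "Recursion is when a function calls itself to break a problem into smaller subproblems.",
]
print(simple_reward_fn(mock_completions))  # expected: [-1.0, 0.5]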
In [ ]:
# GRPO Training Configuration (minimal steps for testing)
grpo_config = GRPOConfig(
output_dir="outputs_grpo_qwen_test",
per_device_train_batch_size=2, # Match num_generations
gradient_accumulation_steps=1,
max_steps=2, # Minimal steps for testing
warmup_steps=0,
learning_rate=1e-5, # Lower LR for RL
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
max_completion_length=64,
num_generations=2, # Completions per prompt
beta=0.1, # KL penalty coefficient
seed=42,
)
# Initialize GRPO Trainer
trainer = GRPOTrainer(
model=model,
args=grpo_config,
train_dataset=dataset,
processing_class=tokenizer,
reward_funcs=simple_reward_fn,
)
print("Starting GRPO training (2 steps)...")
trainer_stats = trainer.train()
print(f"GRPO training completed!")
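Once the training cell above has run, the metrics logged over the two steps can be inspected via the standard Trainer log history (an optional check; the exact keys, such as reward or kl, depend on your TRL version):

# Optional: inspect metrics logged during the two GRPO steps (run after trainer.train())
for entry in trainer.state.log_history:
    print({k: v for k, v in entry.items() if isinstance(v, (int, float))})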
In [7]:
# Post-training inference test
FastLanguageModel.for_inference(model)
test_prompt = "Explain what machine learning is in simple terms."
messages = [{"role": "user", "content": test_prompt}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=64,
temperature=0.7,
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("=" * 60)
print("GRPO Training Pipeline Test PASSED")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
============================================================
GRPO Training Pipeline Test PASSED
============================================================
Sample generation:
ling what I know. Machine learning is a subset of artificial intelligence, right? So I need to make sure I connect it to AI but keep it simple. First, I should define machine learning in a way that's
Test Complete
The GRPO Training Pipeline test has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B)
- LoRA adapter configuration for RL training
- Synthetic prompt dataset creation
- Simple reward function integration
- GRPOTrainer training loop (2 steps)
- Post-training inference generation
GRPO Concepts Demonstrated
- Group Relative Policy Optimization: multiple completions are sampled per prompt, and each completion's advantage is its reward relative to the rest of its group
- Reward-based learning: integration of a custom reward function
- KL Penalty (beta): keeps the policy from drifting too far from the reference model (see the sketch below)
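To make the KL penalty concrete, here is an illustrative sketch of the per-token KL estimator described in the GRPO paper; it is not the trainer's internal code, just the quantity that beta (0.1 above) scales in the loss:

# Illustrative sketch: unbiased per-token KL estimate, exp(d) - d - 1 with d = logp_ref - logp_policy
import torch

def approx_kl(logp_policy: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    delta = logp_ref - logp_policy
    return torch.exp(delta) - delta - 1.0  # always >= 0 since exp(x) >= 1 + x

# Larger beta keeps the trained policy closer to the frozen reference model.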
Ready for Production
If this test passed, your environment is ready for the following (a sketch of task-specific reward functions comes after the list):
- GRPO training with learned reward models
- RLHF pipelines
- Preference optimization workflows
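As a next step toward those workflows, the length heuristic used in this test can be swapped for task-specific reward functions. The sketch below is illustrative only: format_reward, correctness_reward, and the "Answer:" convention are made up for this example, and it assumes a TRL version in which reward_funcs accepts a list of callables and extra dataset columns (here a hypothetical answer column) are forwarded to them as keyword arguments.

# Illustrative sketch: rule-based reward functions (hypothetical names and conventions)
import re

def format_reward(completions, **kwargs):
    # Reward completions that contain an explicit "Answer: ..." line
    return [1.0 if re.search(r"Answer:\s*\S+", c) else 0.0 for c in completions]

def correctness_reward(completions, answer=None, **kwargs):
    # Compare the extracted answer against a ground-truth "answer" column from the dataset
    answer = answer or [None] * len(completions)
    rewards = []
    for completion, gold in zip(completions, answer):
        match = re.search(r"Answer:\s*(\S+)", completion)
        rewards.append(2.0 if match and gold is not None and match.group(1) == str(gold) else 0.0)
    return rewards

# trainer = GRPOTrainer(model=model, args=grpo_config, train_dataset=dataset,
#                       processing_class=tokenizer,
#                       reward_funcs=[format_reward, correctness_reward])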
In [ ]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)