DPO Training Test: Qwen3-4B-Thinking-2507¶
Tests Direct Preference Optimization (DPO) with Unsloth on Qwen3-4B-Thinking-2507.
Key features tested:
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- DPOTrainer with preference pairs that reward thinking quality
- Chosen responses include self-questioning <think> blocks
- Rejected responses have poor/no thinking
DPO Overview: DPO learns from preference pairs (chosen vs rejected responses) without an explicit reward model. It directly optimizes the policy using the Bradley-Terry preference model.
Thinking Preference:
- Chosen: Responses with quality self-questioning reasoning in <think> blocks
- Rejected: Responses with poor, minimal, or no thinking
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
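For reference, the loss that DPOTrainer minimizes can be written directly from per-sequence log-probabilities. The sketch below is illustrative only (the function name and arguments are hypothetical; TRL's implementation additionally handles batching, masking, and the reference model for you):

import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how far the policy has moved away from the frozen
    # reference model on each response, scaled by beta.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry objective: maximize the probability that the chosen
    # response is preferred, i.e. minimize -log sigmoid(reward margin).
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()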
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
from trl import DPOConfig, DPOTrainer
from datasets import Dataset
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Load Qwen3-4B-Thinking-2507 with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")
model, tokenizer = FastLanguageModel.from_pretrained(
MODEL_NAME,
max_seq_length=1024, # Increased for thinking content
load_in_4bit=True,
dtype=None,
)
# Ensure pad token is set
if tokenizer.pad_token is None:
tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
print(f"Model loaded: {type(model).__name__}")
Out[2]:
Loading Qwen3-4B-Thinking-2507-unsloth-bnb-4bit...
Out[2]:
==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Out[2]:
Model loaded: Qwen3ForCausalLM
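If you want to confirm headroom before attaching LoRA adapters, an optional check of current CUDA memory usage (not part of the original run) looks like this:

# Optional sanity check: report current CUDA memory usage for the 4-bit base model.
print(f"Allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"Reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")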
In [3]:
# Apply LoRA adapters for DPO training
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Out[3]:
Unsloth 2025.12.10 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
Out[3]:
LoRA applied: 33,030,144 trainable / 2,526,543,360 total (1.31%)
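The trainable count can be cross-checked by hand: each LoRA adapter on a module with input dimension d_in and output dimension d_out adds r*(d_in + d_out) parameters. A back-of-the-envelope sketch, assuming the published Qwen3-4B shapes (hidden size 2560, 32 query / 8 KV heads of dimension 128, MLP intermediate size 9728, 36 layers); note the "total" above reads ~2.5B rather than ~4B because .numel() counts the packed 4-bit base weights:

# Back-of-the-envelope check (dimensions assumed from the Qwen3-4B config, not read from the model).
r, hidden, q_dim, kv_dim, inter, layers = 16, 2560, 4096, 1024, 9728, 36
per_layer = r * ((hidden + q_dim)          # q_proj
                 + 2 * (hidden + kv_dim)   # k_proj, v_proj
                 + (q_dim + hidden)        # o_proj
                 + 3 * (hidden + inter))   # gate_proj, up_proj, down_proj
print(f"{per_layer * layers:,}")  # should land on the ~33M trainable figure reported above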
In [4]:
# Create minimal synthetic preference dataset with thinking content (5 samples)
# DPO requires: prompt, chosen response (with quality thinking), rejected response (poor/no thinking)
preference_data = [
{
"prompt": "Explain recursion in programming.",
"chosen": "<think>\nWhat is recursion exactly? It's when a function calls itself. But why would you do that? To break down problems into smaller pieces. What's the key thing users need to understand? The base case - without it you get infinite loops.\n</think>\n\nRecursion is when a function calls itself with a simpler version of the problem, including a base case to stop infinite loops.",
"rejected": "Recursion is just loops."
},
{
"prompt": "What is an API?",
"chosen": "<think>\nHow do I explain API to someone? What does it stand for? Application Programming Interface. But what does that mean practically? It's like a contract between software systems. What's a good analogy? Like a waiter taking orders between you and the kitchen.\n</think>\n\nAn API (Application Programming Interface) is a set of protocols that allows different software applications to communicate with each other.",
"rejected": "API is code."
},
{
"prompt": "Describe version control.",
"chosen": "<think>\nWhat's the core purpose of version control? Tracking changes over time. Why is that useful? You can go back to previous versions, see who changed what. What systems exist? Git is the most popular. How should I frame this?\n</think>\n\nVersion control is a system that records changes to files over time, allowing you to recall specific versions and collaborate with others.",
"rejected": "Version control saves files."
},
{
"prompt": "What is a database?",
"chosen": "<think>\nWhat is the essential definition of a database? It stores data, but that's too simple. What makes it different from just files? It's organized and structured. What manages it? A DBMS. How do I make this clear?\n</think>\n\nA database is an organized collection of structured data stored electronically, typically managed by a database management system (DBMS).",
"rejected": "A database stores stuff."
},
{
"prompt": "Explain object-oriented programming.",
"chosen": "<think>\nWhat are the key concepts of OOP? Objects, classes, encapsulation, inheritance, polymorphism. But what's the core idea? Organizing code around objects that have both data and behavior. How do I explain this simply?\n</think>\n\nObject-oriented programming (OOP) is a paradigm that organizes code into objects containing data (attributes) and behavior (methods).",
"rejected": "OOP uses objects."
},
]
# Format for DPO
def format_for_dpo(sample):
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": sample["prompt"]}],
tokenize=False,
add_generation_prompt=True
)
return {
"prompt": prompt,
"chosen": sample["chosen"],
"rejected": sample["rejected"],
}
dataset = Dataset.from_list(preference_data)
dataset = dataset.map(format_for_dpo)
print(f"Dataset created: {len(dataset)} preference pairs")
print(f"Chosen responses include <think> blocks with self-questioning")
print(f"Rejected responses have minimal/no reasoning")
Out[4]:
Dataset created: 5 preference pairs
Chosen responses include <think> blocks with self-questioning
Rejected responses have minimal/no reasoning
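Before training, it can be worth eyeballing one formatted pair to confirm the chat template wrapped the prompt as expected (an optional check, not part of the original run):

# Optional inspection: print the templated prompt and a slice of each response.
sample = dataset[0]
print(sample["prompt"])
print("CHOSEN  :", sample["chosen"][:120], "...")
print("REJECTED:", sample["rejected"])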
In [ ]:
# DPO Training Configuration (minimal steps for testing)
dpo_config = DPOConfig(
output_dir="outputs_dpo_qwen_think_test",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=2, # Minimal steps for testing
warmup_steps=0,
learning_rate=5e-6, # Lower LR for DPO
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
beta=0.1, # DPO temperature
max_length=1024, # Increased for thinking content
max_prompt_length=256,
seed=42,
)
# Initialize DPO Trainer
trainer = DPOTrainer(
model=model,
args=dpo_config,
train_dataset=dataset,
processing_class=tokenizer,
)
print("Starting DPO training with thinking preferences (2 steps)...")
trainer_stats = trainer.train()
print(f"DPO training completed!")
In [6]:
# Post-training inference test
FastLanguageModel.for_inference(model)
test_prompt = "What is machine learning?"
messages = [{"role": "user", "content": test_prompt}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=1024, # Increased to allow full thinking + response
temperature=0.6,
top_p=0.95,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
# Get generated token IDs only (exclude prompt)
input_length = inputs["input_ids"].shape[1]
generated_ids = outputs[0][input_length:].tolist()
# Token-based parsing using </think> token ID
THINK_END_TOKEN_ID = 151668
if THINK_END_TOKEN_ID in generated_ids:
end_idx = generated_ids.index(THINK_END_TOKEN_ID)
thinking = tokenizer.decode(generated_ids[:end_idx], skip_special_tokens=True).strip()
final_resp = tokenizer.decode(generated_ids[end_idx + 1:], skip_special_tokens=True).strip()
think_tag_found = True
else:
thinking = tokenizer.decode(generated_ids, skip_special_tokens=True).strip()
final_resp = "(Model did not complete thinking - increase max_new_tokens)"
think_tag_found = False
print("=" * 60)
print("DPO Training Pipeline Test (Thinking Mode)")
print("=" * 60)
print(f"</think> token found: {'✅ YES' if think_tag_found else '❌ NO'}")
print(f"Output tokens: {len(generated_ids)}")
print(f"\nTHINKING: {thinking[:300]}..." if len(thinking) > 300 else f"\nTHINKING: {thinking}")
print(f"\nRESPONSE: {final_resp[:200]}..." if len(final_resp) > 200 else f"\nRESPONSE: {final_resp}")
if think_tag_found and thinking and final_resp:
print("\n✅ DPO Training Pipeline Test PASSED")
else:
print("\n⚠️ Test completed but output may need review")
Out[6]:
============================================================
DPO Training Pipeline Test (Thinking Mode)
============================================================
</think> token found: ✅ YES
Output tokens: 1024
THINKING: Okay, the user is asking "What is machine learning?" Hmm, this seems like a beginner-level question. They might be completely new to the field, or maybe they've heard the term somewhere but don't understand it deeply. First, I should consider who this user could be. Could be a student, a curious p...
RESPONSE: Machine learning (ML) is a **subset of artificial intelligence (AI)** where computer programs "learn" from data instead of being explicitly programmed for every single task. Instead of hardcoding rule...
✅ DPO Training Pipeline Test PASSED
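The parser above hard-codes 151668 as the </think> token id. A sketch of a more portable alternative is to resolve the id from the tokenizer at runtime (this assumes </think> is a single token in the vocabulary, which the run above confirms):

# Resolve the </think> id instead of hard-coding it; fall back if the token is absent.
think_end_id = tokenizer.convert_tokens_to_ids("</think>")
if think_end_id is None or think_end_id == tokenizer.unk_token_id:
    print("No dedicated </think> token in this vocabulary - fall back to string parsing")
else:
    print(f"</think> token id: {think_end_id}")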
Test Complete¶
The DPO Training Pipeline test with thinking preferences has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B-Thinking-2507)
- LoRA adapter configuration for preference learning
- Preference dataset with thinking quality contrast (chosen vs rejected)
- DPOTrainer training loop (2 steps)
- Post-training inference with thinking output
DPO Concepts with Thinking¶
- Thinking Preference: Chosen responses have quality self-questioning <think> blocks
- Contrast Learning: Rejected responses have poor/no reasoning
- Beta Parameter: Controls the strength of the preference signal (see the sketch below)
Ready for Production¶
If this test passed, your environment is ready for:
- DPO training with real preference data including thinking quality
- Human preference alignment for chain-of-thought reasoning
- Post-SFT thinking preference optimization
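One step this smoke test deliberately skips is persisting the result. A minimal sketch of saving the DPO-trained LoRA adapter (the output path is hypothetical, not from this run):

# Save only the LoRA adapter weights plus the tokenizer.
model.save_pretrained("outputs_dpo_qwen_think_test/final_adapter")
tokenizer.save_pretrained("outputs_dpo_qwen_think_test/final_adapter")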
In [7]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Out[7]:
Shutting down kernel to release GPU memory...
Out[7]:
{'status': 'ok', 'restart': False}