Reward Model Training¶
Overview¶
Reward models learn to score responses based on human preferences. They're used in RLHF pipelines (PPO, GRPO, RLOO) to provide reward signals for policy optimization. The model outputs a scalar reward for each response. This skill includes patterns for scoring thinking/reasoning quality.
Quick Reference¶
| Component | Purpose |
|---|---|
| RewardTrainer | Trainer for reward model |
| RewardConfig | Training hyperparameters |
| AutoModelForSequenceClassification | Model with num_labels=1 |
| task_type="SEQ_CLS" | LoRA task type for reward models |
| Preference pairs | Training data format |
| Token ID 151668 | </think> boundary for Qwen3-Thinking models |
Critical Environment Setup¶
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
Critical Import Order¶
# Standard transformers for reward models (not Unsloth)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
import torch
Reward Model Concepts¶
How Reward Models Work¶
- Take prompt + response as input
- Output scalar reward score
- Trained on preference pairs (chosen > rejected)
- Used to guide RL policy optimization
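Training uses the standard pairwise (Bradley-Terry) objective: the chosen response should score higher than the rejected one. A minimal sketch of that loss with hypothetical reward tensors:
import torch
import torch.nn.functional as F

# Hypothetical scalar rewards for a batch of three preference pairs
rewards_chosen = torch.tensor([1.2, 0.4, 0.9])
rewards_rejected = torch.tensor([0.3, 0.5, -0.1])

# Pairwise loss: -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(f"Pairwise loss: {loss.item():.4f}")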
Architecture¶
Input: [prompt + response]
↓
Base LLM (frozen or LoRA)
↓
Classification Head (Linear → Scalar)
↓
Output: Reward score (float)
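The reward is simply the single logit produced by the num_labels=1 head. A quick shape check, assuming model and tokenizer are loaded as in the Setup section below:
inputs = tokenizer("Example prompt and response", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs)
print(out.logits.shape)  # torch.Size([1, 1]): one scalar reward per sequence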
Dataset Format¶
Required Fields¶
dataset = [
{
"prompt": "What is recursion?",
"chosen": "Recursion is a function calling itself with a base case.",
"rejected": "Recursion is loops."
},
# ... more preference pairs
]
Preprocessing¶
def format_for_reward(sample):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["prompt"]}],
        tokenize=False, add_generation_prompt=True
    )
    chosen = tokenizer(prompt + sample["chosen"])
    rejected = tokenizer(prompt + sample["rejected"])
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }
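If you pre-tokenize like this (required by older TRL versions), map the function over a datasets.Dataset before training; recent RewardTrainer versions can instead tokenize the raw prompt/chosen/rejected columns themselves via processing_class, as in the Training section below:
# Assumes `dataset` is a datasets.Dataset, e.g. built with Dataset.from_list(...)
tokenized_dataset = dataset.map(format_for_reward, remove_columns=dataset.column_names)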
Thinking Quality Preference Dataset¶
Train the reward model to score thinking quality:
# Chosen = Good thinking, Rejected = Poor/no thinking
thinking_preference_data = [
{
"prompt": "Explain recursion in programming.",
"chosen": """<think>
What is recursion exactly? It's when a function calls itself.
Why would we use this? To break down problems into smaller pieces.
What's a good example? Factorial: 5! = 5 * 4!
</think>
Recursion is a technique where a function calls itself with a simpler version of the problem.""",
"rejected": "Recursion is just loops."
},
{
"prompt": "What is 15 + 27?",
"chosen": """<think>
I need to add 15 and 27.
Let me break it down: 15 + 27 = 15 + 20 + 7 = 35 + 7 = 42.
</think>
15 + 27 = 42""",
"rejected": "42"
},
]
dataset = Dataset.from_list(thinking_preference_data)
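When inspecting or filtering this data, the </think> tag separates the reasoning from the final answer (for Qwen3-Thinking models it is a single token, ID 151668, per the quick reference). A small sketch for splitting the two parts:
THINK_END = "</think>"

def split_thinking(response):
    """Return (thinking, answer); thinking is empty if no think block is present."""
    if THINK_END in response:
        thinking, answer = response.split(THINK_END, 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()

thinking, answer = split_thinking(thinking_preference_data[0]["chosen"])
print(f"Thinking: {len(thinking)} chars | Answer: {answer[:60]}")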
Setup¶
Load Reward Model¶
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
# Quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
)
# Load as sequence classification model
model = AutoModelForSequenceClassification.from_pretrained(
"Qwen/Qwen3-4B-Thinking-2507", # Non-quantized base
num_labels=1, # Single scalar reward output
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")
# Setup pad token (the sequence-classification model also needs it on its config)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
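Because the base is loaded in 4-bit, it is common (though optional) to prepare it for k-bit training before attaching LoRA adapters:
from peft import prepare_model_for_kbit_training

# Freezes base weights, upcasts norm layers, and enables input gradients
model = prepare_model_for_kbit_training(model)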
Apply LoRA¶
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
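To confirm that only the LoRA adapters (plus the classification head) will be updated, print the trainable parameter counts:
model.print_trainable_parameters()  # prints trainable vs. total parameters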
RewardTrainer Configuration¶
Basic Configuration¶
from trl import RewardConfig
reward_config = RewardConfig(
output_dir="./reward_output",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
max_steps=100,
learning_rate=1e-5,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
optim="adamw_8bit",
max_length=512,
)
Key Parameters¶
| Parameter | Typical Values | Effect |
|---|---|---|
| learning_rate | 1e-5 to 5e-5 | Convergence speed and stability |
| max_length | 512-1024 | Maximum tokenized input length before truncation |
| center_rewards_coefficient | 0.0-0.1 | Strength of the reward-centering auxiliary loss |
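If reward magnitudes drift to extremes during training, center_rewards_coefficient adds an auxiliary loss that pulls the mean reward toward zero. A sketch of enabling it; the 0.01 value is an assumed starting point, not a tuned recommendation:
reward_config = RewardConfig(
    output_dir="./reward_output",
    max_length=512,
    center_rewards_coefficient=0.01,  # assumed starting value; see the range above
)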
Training¶
Basic Training¶
from trl import RewardTrainer
trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
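After training, save the adapters and tokenizer so the reward model can be reloaded for scoring or RL (the output path here is arbitrary):
trainer.save_model("./reward_output/final")
tokenizer.save_pretrained("./reward_output/final")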
Using the Reward Model¶
Score Responses¶
def get_reward(prompt, response):
text = prompt + response
inputs = tokenizer(text, return_tensors="pt", truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
reward = outputs.logits[0, 0].item()
return reward
# Example
score = get_reward("What is Python?", "A programming language.")
print(f"Reward: {score:.3f}")
Batch Scoring¶
def get_rewards_batch(prompts, responses):
texts = [p + r for p, r in zip(prompts, responses)]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
rewards = outputs.logits[:, 0].tolist()
return rewards
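For example, with hypothetical inputs:
scores = get_rewards_batch(
    ["What is Python?", "What is Python?"],
    ["A programming language.", "A snake."],
)
print(scores)  # two floats; a trained model should score the better answer higher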
In GRPO/RLOO¶
from trl import GRPOTrainer

# GRPO passes prompts/completions as keyword arguments; **kwargs absorbs
# any extra dataset columns the trainer forwards.
def reward_fn(completions, prompts, **kwargs):
    return get_rewards_batch(prompts, completions)
grpo_trainer = GRPOTrainer(
model=policy_model,
args=grpo_config,
train_dataset=dataset,
reward_funcs=reward_fn,
)
Reward Scaling¶
Normalize Rewards¶
def normalized_reward(completions, prompts):
raw_rewards = get_rewards_batch(prompts, completions)
mean = sum(raw_rewards) / len(raw_rewards)
std = (sum((r - mean) ** 2 for r in raw_rewards) / len(raw_rewards)) ** 0.5
return [(r - mean) / (std + 1e-8) for r in raw_rewards]
Clip Rewards¶
def clipped_reward(completions, prompts):
rewards = get_rewards_batch(prompts, completions)
return [max(-1.0, min(1.0, r)) for r in rewards]
Troubleshooting¶
Poor Discrimination¶
Symptom: Similar scores for chosen and rejected
Fix:
- More training steps
- Higher learning rate
- Check data quality
Reward Hacking¶
Symptom: RL model exploits reward model
Fix:
- Add diversity in training data
- Ensemble multiple reward models
- Regularization during RL
Overconfident Scores¶
Symptom: Extreme reward values
Fix:
- Use center_rewards_coefficient
- Normalize outputs
- Clip reward range
Memory Issues¶
Symptom: OOM during training
Fix:
- Use LoRA instead of full fine-tuning
- Reduce max_length
- Smaller batch size
Kernel Shutdown (Jupyter)¶
Reward model training uses significant GPU memory. Shut down the kernel to release it:
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Important: Always run this at the end of training notebooks before switching to different models.
When to Use This Skill¶
Use when:
- Building RLHF pipelines
- Need explicit reward signal
- Have preference data
- Want interpretable scoring
- Planning to use GRPO or RLOO
Cross-References¶
- bazzite-ai-jupyter:grpo - Uses reward models for RL
- bazzite-ai-jupyter:rloo - Uses reward models for RL
- bazzite-ai-jupyter:dpo - Alternative that doesn't need a reward model
- bazzite-ai-jupyter:peft - LoRA for efficient reward training
- bazzite-ai-jupyter:sft - Pre-training before reward modeling
- bazzite-ai-jupyter:inference - Inference for reward scoring