Reward Model Training¶
Overview¶
Reward models learn to score responses based on human preferences. They're used in RLHF pipelines (PPO, GRPO, RLOO) to provide reward signals for policy optimization. The model outputs a scalar reward for each response. This skill includes patterns for scoring thinking/reasoning quality.
Quick Reference¶
| Component | Purpose |
|---|---|
| RewardTrainer | Trainer for reward model |
| RewardConfig | Training hyperparameters |
| AutoModelForSequenceClassification | Model with num_labels=1 |
| task_type="SEQ_CLS" | LoRA task type for reward models |
| Preference pairs | Training data format |
| Token ID 151668 | </think> boundary for Qwen3-Thinking models |
Critical Environment Setup¶
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"
Critical Import Order¶
# Standard transformers for reward models (not Unsloth)
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
from trl import RewardTrainer, RewardConfig
from datasets import Dataset
import torch
Reward Model Concepts¶
How Reward Models Work¶
- Take prompt + response as input
- Output scalar reward score
- Trained on preference pairs (chosen > rejected)
- Used to guide RL policy optimization
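Training uses the standard pairwise (Bradley-Terry) objective: the chosen response should score higher than the rejected one. A minimal sketch of that loss with hypothetical reward tensors:
import torch
import torch.nn.functional as F

# Hypothetical scalar rewards for a batch of three preference pairs
rewards_chosen = torch.tensor([1.2, 0.4, 0.9])
rewards_rejected = torch.tensor([0.3, 0.5, -0.1])

# Pairwise loss: -log(sigmoid(r_chosen - r_rejected)), averaged over the batch
loss = -F.logsigmoid(rewards_chosen - rewards_rejected).mean()
print(f"Pairwise loss: {loss.item():.4f}")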
Architecture¶
Input: [prompt + response]
↓
Base LLM (frozen or LoRA)
↓
Classification Head (Linear → Scalar)
↓
Output: Reward score (float)
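The reward is simply the single logit produced by the num_labels=1 head. A quick shape check, assuming model and tokenizer are loaded as in the Setup section below:
inputs = tokenizer("Example prompt and response", return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model(**inputs)
print(out.logits.shape)  # torch.Size([1, 1]): one scalar reward per sequence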
Dataset Format¶
Required Fields¶
dataset = [
{
"prompt": "What is recursion?",
"chosen": "Recursion is a function calling itself with a base case.",
"rejected": "Recursion is loops."
},
# ... more preference pairs
]
Preprocessing¶
def format_for_reward(sample):
    prompt = tokenizer.apply_chat_template(
        [{"role": "user", "content": sample["prompt"]}],
        tokenize=False, add_generation_prompt=True
    )
    chosen = tokenizer(prompt + sample["chosen"])
    rejected = tokenizer(prompt + sample["rejected"])
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }
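If you pre-tokenize like this (required by older TRL versions), map the function over a datasets.Dataset before training; recent RewardTrainer versions can instead tokenize the raw prompt/chosen/rejected columns themselves via processing_class, as in the Training section below:
# Assumes `dataset` is a datasets.Dataset, e.g. built with Dataset.from_list(...)
tokenized_dataset = dataset.map(format_for_reward, remove_columns=dataset.column_names)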
Thinking Quality Preference Dataset¶
Train the reward model to score thinking quality:
# Chosen = Good thinking, Rejected = Poor/no thinking
thinking_preference_data = [
{
"prompt": "Explain recursion in programming.",
"chosen": """<think>
What is recursion exactly? It's when a function calls itself.
Why would we use this? To break down problems into smaller pieces.
What's a good example? Factorial: 5! = 5 * 4!
</think>
Recursion is a technique where a function calls itself with a simpler version of the problem.""",
"rejected": "Recursion is just loops."
},
{
"prompt": "What is 15 + 27?",
"chosen": """<think>
I need to add 15 and 27.
Let me break it down: 15 + 27 = 15 + 20 + 7 = 35 + 7 = 42.
</think>
15 + 27 = 42""",
"rejected": "42"
},
]
dataset = Dataset.from_list(thinking_preference_data)
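When inspecting or filtering this data, the </think> tag separates the reasoning from the final answer (for Qwen3-Thinking models it is a single token, ID 151668, per the quick reference). A small sketch for splitting the two parts:
THINK_END = "</think>"

def split_thinking(response):
    """Return (thinking, answer); thinking is empty if no think block is present."""
    if THINK_END in response:
        thinking, answer = response.split(THINK_END, 1)
        return thinking.replace("<think>", "").strip(), answer.strip()
    return "", response.strip()

thinking, answer = split_thinking(thinking_preference_data[0]["chosen"])
print(f"Thinking: {len(thinking)} chars | Answer: {answer[:60]}")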
Setup¶
Load Reward Model¶
from transformers import AutoModelForSequenceClassification, AutoTokenizer, BitsAndBytesConfig
# Quantization config
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_quant_type="nf4",
)
# Load as sequence classification model
model = AutoModelForSequenceClassification.from_pretrained(
"Qwen/Qwen3-4B-Thinking-2507", # Non-quantized base
num_labels=1, # Single scalar reward output
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Thinking-2507")
# Setup pad token (the sequence-classification model also needs it on its config)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.pad_token_id = tokenizer.eos_token_id
model.config.pad_token_id = tokenizer.pad_token_id
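Because the base is loaded in 4-bit, it is common (though optional) to prepare it for k-bit training before attaching LoRA adapters:
from peft import prepare_model_for_kbit_training

# Freezes base weights, upcasts norm layers, and enables input gradients
model = prepare_model_for_kbit_training(model)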
Apply LoRA¶
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
task_type="SEQ_CLS",
)
model = get_peft_model(model, lora_config)
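To confirm that only the LoRA adapters (plus the classification head) will be updated, print the trainable parameter counts:
model.print_trainable_parameters()  # prints trainable vs. total parameters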
RewardTrainer Configuration¶
Basic Configuration¶
from trl import RewardConfig
reward_config = RewardConfig(
output_dir="./reward_output",
per_device_train_batch_size=2,
gradient_accumulation_steps=4,
max_steps=100,
learning_rate=1e-5,
    fp16=not torch.cuda.is_bf16_supported(),
    bf16=torch.cuda.is_bf16_supported(),
optim="adamw_8bit",
max_length=512,
)
Key Parameters¶
| Parameter | Typical Values | Effect |
|---|---|---|
| learning_rate | 1e-5 to 5e-5 | Convergence speed and stability |
| max_length | 512-1024 | Maximum tokenized input length before truncation |
| center_rewards_coefficient | 0.0-0.1 | Strength of the reward-centering auxiliary loss |
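If reward magnitudes drift to extremes during training, center_rewards_coefficient adds an auxiliary loss that pulls the mean reward toward zero. A sketch of enabling it; the 0.01 value is an assumed starting point, not a tuned recommendation:
reward_config = RewardConfig(
    output_dir="./reward_output",
    max_length=512,
    center_rewards_coefficient=0.01,  # assumed starting value; see the range above
)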
Training¶
Basic Training¶
from trl import RewardTrainer
trainer = RewardTrainer(
model=model,
args=reward_config,
train_dataset=dataset,
processing_class=tokenizer,
)
trainer.train()
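After training, save the adapters and tokenizer so the reward model can be reloaded for scoring or RL (the output path here is arbitrary):
trainer.save_model("./reward_output/final")
tokenizer.save_pretrained("./reward_output/final")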
Using the Reward Model¶
Score Responses¶
def get_reward(prompt, response):
text = prompt + response
inputs = tokenizer(text, return_tensors="pt", truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
reward = outputs.logits[0, 0].item()
return reward
# Example
score = get_reward("What is Python?", "A programming language.")
print(f"Reward: {score:.3f}")
Batch Scoring¶
def get_rewards_batch(prompts, responses):
texts = [p + r for p, r in zip(prompts, responses)]
inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}
with torch.no_grad():
outputs = model(**inputs)
rewards = outputs.logits[:, 0].tolist()
return rewards
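For example, with hypothetical inputs:
scores = get_rewards_batch(
    ["What is Python?", "What is Python?"],
    ["A programming language.", "A snake."],
)
print(scores)  # two floats; a trained model should score the better answer higher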
In GRPO/RLOO¶
from trl import GRPOTrainer

# GRPO passes prompts/completions as keyword arguments; **kwargs absorbs
# any extra dataset columns the trainer forwards.
def reward_fn(completions, prompts, **kwargs):
    return get_rewards_batch(prompts, completions)
grpo_trainer = GRPOTrainer(
model=policy_model,
args=grpo_config,
train_dataset=dataset,
reward_funcs=reward_fn,
)
Reward Scaling¶
Normalize Rewards¶
def normalized_reward(completions, prompts):
raw_rewards = get_rewards_batch(prompts, completions)
mean = sum(raw_rewards) / len(raw_rewards)
std = (sum((r - mean) ** 2 for r in raw_rewards) / len(raw_rewards)) ** 0.5
return [(r - mean) / (std + 1e-8) for r in raw_rewards]
Clip Rewards¶
def clipped_reward(completions, prompts):
rewards = get_rewards_batch(prompts, completions)
return [max(-1.0, min(1.0, r)) for r in rewards]
Troubleshooting¶
Poor Discrimination¶
Symptom: Similar scores for chosen and rejected
Fix:
- More training steps
- Higher learning rate
- Check data quality
Reward Hacking¶
Symptom: RL model exploits reward model
Fix:
- Add diversity in training data
- Ensemble multiple reward models
- Regularization during RL
Overconfident Scores¶
Symptom: Extreme reward values
Fix:
- Use center_rewards_coefficient
- Normalize outputs
- Clip reward range
Memory Issues¶
Symptom: OOM during training
Fix:
- Use LoRA instead of full fine-tuning
- Reduce max_length
- Smaller batch size
Kernel Shutdown (Jupyter)¶
Reward model training uses significant GPU memory. Shut down the kernel to release it:
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Important: Always run this at the end of training notebooks before switching to different models.
When to Use This Skill¶
Use when:
- Building RLHF pipelines
- Need explicit reward signal
- Have preference data
- Want interpretable scoring
- Planning to use GRPO or RLOO
Cross-References¶
- bazzite-ai-jupyter:grpo - Uses reward models for RL
- bazzite-ai-jupyter:rloo - Uses reward models for RL
- bazzite-ai-jupyter:dpo - Alternative that doesn't need a reward model
- bazzite-ai-jupyter:peft - LoRA for efficient reward training
- bazzite-ai-jupyter:sft - Pre-training before reward modeling
- bazzite-ai-jupyter:inference - Inference for reward scoring