Supervised Fine-Tuning (SFT)

Overview

SFT adapts a pre-trained LLM to follow instructions by training it on instruction-response pairs. Unsloth provides an optimized SFTTrainer for up to 2x faster training with reduced memory usage. This skill also covers patterns for training thinking/reasoning models.

Quick Reference

Component            Purpose
------------------   --------------------------------------------
FastLanguageModel    Load model with Unsloth optimizations
SFTTrainer           Trainer for instruction tuning
SFTConfig            Training hyperparameters
dataset_text_field   Column containing formatted text
Token ID 151668      </think> boundary for Qwen3-Thinking models

Critical Environment Setup

import os
from dotenv import load_dotenv
load_dotenv()

# Force text-based progress in Jupyter
os.environ["TQDM_NOTEBOOK"] = "false"

Critical Import Order

# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported

# Then other imports
from trl import SFTTrainer, SFTConfig
from datasets import Dataset
import torch

Warning: Importing TRL before Unsloth will disable optimizations and may cause errors.

Dataset Formats

Instruction-Response Format

dataset = [
    {"instruction": "What is Python?", "response": "A programming language."},
    {"instruction": "Explain ML.", "response": "Machine learning is..."},
]

Chat/Conversation Format

dataset = [
    {"messages": [
        {"role": "user", "content": "What is Python?"},
        {"role": "assistant", "content": "A programming language."}
    ]},
]

Using Chat Templates

def format_conversation(sample):
    messages = [
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": sample["response"]}
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}

dataset = dataset.map(format_conversation)
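
The rendered text depends on the model's chat template. For Qwen-family models it looks roughly like this (illustrative, not exact):

print(dataset[0]["text"])
# <|im_start|>user
# What is Python?<|im_end|>
# <|im_start|>assistant
# A programming language.<|im_end|>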

Thinking Model Format

For models like Qwen3-Thinking, include <think> tags in the assistant response and use a self-questioning, internal-dialogue style for the thinking content:

def format_thinking_conversation(sample):
    """Format with thinking/reasoning tags."""
    # Combine thinking and response with tags
    assistant_content = f"<think>\n{sample['thinking']}\n</think>\n\n{sample['response']}"

    messages = [
        {"role": "user", "content": sample["instruction"]},
        {"role": "assistant", "content": assistant_content}
    ]
    return {"text": tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )}

# Sample dataset with self-questioning thinking style
thinking_data = [
    {
        "instruction": "What is machine learning?",
        "thinking": "What is the user asking here? They want to understand machine learning. What are the key concepts I should cover? It's a subset of AI... and it involves learning from data. How should I keep this accessible? Short and clear definition.",
        "response": "Machine learning is a subset of artificial intelligence where computers learn patterns from data."
    },
    {
        "instruction": "Explain Python in one sentence.",
        "thinking": "One sentence only - what's most important about Python? Its readability and versatility are the defining features. How do I capture both in one sentence?",
        "response": "Python is a high-level programming language known for its readability and versatility."
    },
    {
        "instruction": "What is a neural network?",
        "thinking": "How do I explain neural networks simply? What's the core concept? They're inspired by biological neurons... they process information in layers. Should I mention deep learning? Maybe keep it basic for now.",
        "response": "A neural network is a computational model inspired by biological neurons that processes information through connected layers."
    },
]

dataset = Dataset.from_list(thinking_data)
dataset = dataset.map(format_thinking_conversation, remove_columns=["instruction", "thinking", "response"])

Thinking Style Patterns:

  • "What is the user asking here?"
  • "Let me think about the key concepts..."
  • "How should I structure this explanation?"
  • "What's most important about X?"
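
It is also worth a quick sanity check that no sample has an empty thinking or response field before formatting, since empty text destabilizes training (see NaN Loss under Troubleshooting). A minimal sketch:

# Fail fast on empty fields before formatting the dataset.
bad = [i for i, s in enumerate(thinking_data)
       if not s["thinking"].strip() or not s["response"].strip()]
assert not bad, f"Empty thinking/response at indices: {bad}"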

Unsloth SFT Setup

Load Model

from unsloth import FastLanguageModel

# Standard model
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-unsloth-bnb-4bit",
    max_seq_length=512,
    load_in_4bit=True,
)

# Thinking model (for reasoning tasks)
model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/Qwen3-4B-Thinking-2507-unsloth-bnb-4bit",
    max_seq_length=1024,  # Increased for thinking content
    load_in_4bit=True,
)
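
To confirm the 4-bit load fits your GPU, check allocated memory right after loading (a quick sketch; torch was imported earlier):

print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")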

Apply LoRA

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing="unsloth",
)
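
PEFT models expose print_trainable_parameters(); assuming the Unsloth-wrapped model follows the standard PEFT interface, this confirms only the adapters (typically 1-2% of weights) will be updated:

# Expect a small trainable fraction, e.g. "trainable params: ... || trainable%: ..."
model.print_trainable_parameters()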

Training Configuration

from trl import SFTConfig

sft_config = SFTConfig(
    output_dir="./sft_output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    max_steps=100,
    learning_rate=2e-4,
    fp16=not is_bf16_supported(),
    bf16=is_bf16_supported(),
    optim="adamw_8bit",
    max_seq_length=512,
)

SFTTrainer Usage

Basic Training

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    args=sft_config,
)

trainer.train()
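
train() also returns a TrainOutput with summary metrics; capture it instead of discarding it if you want to log the run:

trainer_stats = trainer.train()
print(f"Final loss: {trainer_stats.training_loss:.4f}")
print(f"Runtime: {trainer_stats.metrics['train_runtime']:.1f}s")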

With Custom Formatting

def formatting_func(examples):
    texts = []
    for instruction, response in zip(examples["instruction"], examples["response"]):
        text = f"### Instruction:\n{instruction}\n\n### Response:\n{response}"
        texts.append(text)
    return texts

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    formatting_func=formatting_func,
    args=sft_config,
)

Key Parameters

Parameter                     Typical Values   Effect
---------------------------   --------------   ----------------------------
learning_rate                 2e-4 to 2e-5     Training speed vs. stability
per_device_train_batch_size   1-4              Memory usage
gradient_accumulation_steps   2-8              Effective batch size
max_seq_length                512-2048         Context window
optim                         "adamw_8bit"     Memory-efficient optimizer
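
The effective batch size per optimizer step is per_device_train_batch_size × gradient_accumulation_steps (times the number of GPUs). With the values used in this skill's SFTConfig:

# 2 samples/device * 4 accumulation steps = 8 samples per optimizer update
effective_batch_size = 2 * 4
print(effective_batch_size)  # 8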

Save and Load

Save Model

# Save LoRA adapters only (small)
model.save_pretrained("./sft_lora")

# Save merged model (full size)
model.save_pretrained_merged("./sft_merged", tokenizer)

Load for Inference

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained("./sft_lora")
FastLanguageModel.for_inference(model)
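
A minimal generation sketch for a standard (non-thinking) model; the prompt and sampling parameters are illustrative:

messages = [{"role": "user", "content": "What is Python?"}]
inputs = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(input_ids=inputs, max_new_tokens=128,
                         temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))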

Thinking Model Inference

Parse thinking content from model output using token IDs:

THINK_END_TOKEN_ID = 151668  # </think> token for Qwen3-Thinking

def generate_with_thinking(model, tokenizer, prompt):
    """Generate and parse thinking model output."""
    messages = [{"role": "user", "content": prompt}]
    inputs = tokenizer.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    # Setup pad token if needed
    if tokenizer.pad_token_id is None:
        tokenizer.pad_token_id = tokenizer.eos_token_id

    outputs = model.generate(
        input_ids=inputs,
        max_new_tokens=1024,
        temperature=0.6,
        top_p=0.95,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
    )

    # Extract only generated tokens
    input_length = inputs.shape[1]
    generated_ids = outputs[0][input_length:].tolist()

    # Parse thinking and response
    if THINK_END_TOKEN_ID in generated_ids:
        end_idx = generated_ids.index(THINK_END_TOKEN_ID)
        thinking = tokenizer.decode(generated_ids[:end_idx], skip_special_tokens=True)
        response = tokenizer.decode(generated_ids[end_idx + 1:], skip_special_tokens=True)
    else:
        thinking = tokenizer.decode(generated_ids, skip_special_tokens=True)
        response = "(incomplete - increase max_new_tokens)"

    return thinking.strip(), response.strip()

# Usage
FastLanguageModel.for_inference(model)
thinking, response = generate_with_thinking(model, tokenizer, "What is 15 + 27?")
print(f"Thinking: {thinking}")
print(f"Response: {response}")

Ollama Integration

Export to GGUF

# Export to GGUF for Ollama
model.save_pretrained_gguf(
    "model",
    tokenizer,
    quantization_method="q4_k_m"
)

Deploy to Ollama

ollama create mymodel -f Modelfile
ollama run mymodel
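
ollama create expects a Modelfile pointing at the exported GGUF. A minimal sketch; the path is hypothetical, so match whatever filename the export step actually wrote:

# Modelfile (hypothetical path)
FROM ./model/unsloth.Q4_K_M.gguf
PARAMETER temperature 0.6
PARAMETER top_p 0.95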

Troubleshooting

Out of Memory

Symptom: CUDA out of memory error

Fix:

  • Use use_gradient_checkpointing="unsloth"
  • Reduce per_device_train_batch_size to 1
  • Use 4-bit quantization (load_in_4bit=True)

NaN Loss

Symptom: Loss becomes NaN during training

Fix:

  • Lower learning_rate to 1e-5
  • Check data quality (no empty samples)
  • Use gradient clipping (see the sketch below)
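
Gradient clipping is set via max_grad_norm, which SFTConfig inherits from transformers.TrainingArguments:

# Clip gradient norm and lower the learning rate for stability.
sft_config = SFTConfig(
    output_dir="./sft_output",
    learning_rate=1e-5,
    max_grad_norm=1.0,
)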

Slow Training

Symptom: Training slower than expected

Fix:

  • Ensure Unsloth is imported FIRST (before TRL)
  • Use bf16=True if supported
  • Enable use_gradient_checkpointing="unsloth"

Kernel Shutdown (Jupyter)

SFT training uses significant GPU memory. Shut down the kernel at the end of a run to release it:

import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)

Important: Always run this at the end of a training notebook before switching to a different model.

When to Use This Skill

Use when:

  • Creating instruction-following models
  • Fine-tuning for chat/conversation
  • Adapting to domain-specific tasks
  • Building custom assistants
  • First step before preference optimization (DPO/GRPO)

Cross-References

  • bazzite-ai-jupyter:peft - LoRA configuration details
  • bazzite-ai-jupyter:qlora - Advanced QLoRA experiments (alpha, rank, modules)
  • bazzite-ai-jupyter:finetuning - General fine-tuning concepts
  • bazzite-ai-jupyter:dpo - Direct Preference Optimization after SFT
  • bazzite-ai-jupyter:grpo - GRPO reinforcement learning after SFT
  • bazzite-ai-jupyter:inference - Fast inference with vLLM
  • bazzite-ai-jupyter:vision - Vision model fine-tuning
  • bazzite-ai-ollama:api - Ollama deployment