SFT Training Test: Qwen3-4B¶
Tests supervised fine-tuning (SFT) on Qwen3-4B using TRL's SFTTrainer with Unsloth's optimizations.
Key features tested:
- FastLanguageModel loading with 4-bit quantization
- LoRA adapter configuration
- SFTTrainer with minimal synthetic dataset
- Post-training inference verification
Important: This notebook includes a kernel shutdown cell at the end to release all GPU memory.
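As an optional preflight (not part of the original test cells), you can confirm that CUDA is visible and roughly how much VRAM is free before loading anything; the snippet below is a minimal sketch using only standard PyTorch calls.
# Optional preflight: confirm CUDA is available and report free VRAM (sketch, not part of the test)
import torch
if torch.cuda.is_available():
    free_bytes, total_bytes = torch.cuda.mem_get_info()
    print(f"CUDA OK: {torch.cuda.get_device_name(0)}, "
          f"{free_bytes / 1e9:.1f} GB free of {total_bytes / 1e9:.1f} GB")
else:
    print("No CUDA device found - this test requires a GPU.")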
In [1]:
# Environment Setup
import os
from dotenv import load_dotenv
load_dotenv()
# Force text-based progress instead of HTML widgets
os.environ["TQDM_NOTEBOOK"] = "false"
# CRITICAL: Import unsloth FIRST for proper TRL patching
import unsloth
from unsloth import FastLanguageModel, is_bf16_supported
import torch
# Environment summary
gpu = torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU"
print(f"Environment: unsloth {unsloth.__version__}, PyTorch {torch.__version__}, {gpu}")
print(f"HF_TOKEN loaded: {'Yes' if os.environ.get('HF_TOKEN') else 'No'}")
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/trl/__init__.py:203: UserWarning: TRL currently supports vLLM versions: 0.10.2, 0.11.0, 0.11.1, 0.11.2. You have version 0.14.0rc1.dev201+gadcf682fc.cu130 installed. We recommend installing a supported version to avoid compatibility issues.
  if is_vllm_available():
🦥 Unsloth Zoo will now patch everything to make training faster!
Environment: unsloth 2025.12.10, PyTorch 2.9.1+cu130, NVIDIA GeForce RTX 4080 SUPER
HF_TOKEN loaded: Yes
In [2]:
# Load Qwen3-4B with 4-bit quantization
MODEL_NAME = "unsloth/Qwen3-4B-unsloth-bnb-4bit"
print(f"\nLoading {MODEL_NAME.split('/')[-1]}...")
model, tokenizer = FastLanguageModel.from_pretrained(
MODEL_NAME,
max_seq_length=512,
load_in_4bit=True,
dtype=None, # Auto-detect
)
print(f"Model loaded: {type(model).__name__}")
Loading Qwen3-4B-unsloth-bnb-4bit...
==((====))==  Unsloth 2025.12.10: Fast Qwen3 patching. Transformers: 5.0.0.1. vLLM: 0.14.0rc1.dev201+gadcf682fc.cu130.
   \\   /|    NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1
\        /    Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False]
 "-____-"     Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model loaded: Qwen3ForCausalLM
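If you want to sanity-check that the 4-bit load fits comfortably on a 16 GB card, a small follow-up cell like the one below (an optional sketch, not part of the test) reports the VRAM currently in use:
# Optional: report VRAM used by the 4-bit model (sketch)
allocated_gb = torch.cuda.memory_allocated() / 1e9
reserved_gb = torch.cuda.memory_reserved() / 1e9
print(f"GPU memory after load: {allocated_gb:.2f} GB allocated, {reserved_gb:.2f} GB reserved")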
In [3]:
# Apply LoRA adapters
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0,
target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj"],
bias="none",
use_gradient_checkpointing="unsloth",
random_state=42,
)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"LoRA applied: {trainable:,} trainable / {total:,} total ({100*trainable/total:.2f}%)")
Unsloth 2025.12.10 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
LoRA applied: 33,030,144 trainable / 2,541,616,640 total (1.30%)
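To double-check which layers actually received adapters, the model returned by get_peft_model is a PEFT model; assuming the usual PEFT API is available, the short optional sketch below prints its own parameter summary and lists a few injected LoRA weights by name.
# Optional: inspect the injected LoRA modules (sketch; assumes the standard PEFT API on the wrapped model)
model.print_trainable_parameters()
lora_params = [name for name, _ in model.named_parameters() if "lora_" in name]
print(f"{len(lora_params)} LoRA parameter tensors, e.g. {lora_params[:2]}")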
In [4]:
# Create minimal synthetic instruction dataset (5 samples)
from datasets import Dataset
# Synthetic instruction-response pairs for testing
synthetic_data = [
{"instruction": "What is machine learning?",
"response": "Machine learning is a subset of artificial intelligence where computers learn patterns from data."},
{"instruction": "Explain Python in one sentence.",
"response": "Python is a high-level programming language known for its readability and versatility."},
{"instruction": "What is a neural network?",
"response": "A neural network is a computational model inspired by biological neurons that processes information."},
{"instruction": "Define supervised learning.",
"response": "Supervised learning is training a model on labeled data to predict outcomes for new inputs."},
{"instruction": "What is gradient descent?",
"response": "Gradient descent is an optimization algorithm that minimizes loss by iteratively adjusting parameters."},
]
# Format as chat conversations
def format_conversation(sample):
messages = [
{"role": "user", "content": sample["instruction"]},
{"role": "assistant", "content": sample["response"]}
]
return {"text": tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False)}
dataset = Dataset.from_list(synthetic_data)
dataset = dataset.map(format_conversation, remove_columns=["instruction", "response"])
print(f"Dataset created: {len(dataset)} samples")
print(f"Sample: {dataset[0]['text'][:100]}...")
Dataset created: 5 samples
Sample: <|im_start|>user What is machine learning?<|im_end|> <|im_start|>assistant <think> </think> Machin...
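Because max_seq_length is set to 512 in the training cell, it can be worth confirming that every formatted sample tokenizes well under that limit; the following is a minimal optional sketch of such a check.
# Optional: verify each formatted sample fits within the 512-token limit (sketch)
lengths = [len(tokenizer(sample["text"])["input_ids"]) for sample in dataset]
print(f"Token lengths: {lengths} (max allowed: 512)")
assert max(lengths) <= 512, "A sample exceeds max_seq_length"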
In [ ]:
# SFT Training (minimal steps for testing)
from trl import SFTTrainer, SFTConfig
sft_config = SFTConfig(
output_dir="outputs_sft_qwen_test",
per_device_train_batch_size=1,
gradient_accumulation_steps=1,
max_steps=3, # Minimal steps for testing
warmup_steps=1,
learning_rate=2e-4,
logging_steps=1,
fp16=not is_bf16_supported(),
bf16=is_bf16_supported(),
optim="adamw_8bit",
weight_decay=0.01,
max_seq_length=512,
seed=42,
)
trainer = SFTTrainer(
model=model,
tokenizer=tokenizer,
train_dataset=dataset,
dataset_text_field="text",
args=sft_config,
)
print("Starting SFT training (3 steps)...")
trainer_stats = trainer.train()
# Use a numeric default so the f-string format never fails if the metric is missing
final_loss = trainer_stats.metrics.get("train_loss", float("nan"))
print(f"Training completed. Final loss: {final_loss:.4f}")
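After trainer.train() returns, trainer_stats.metrics also carries timing information, and PyTorch can report the peak VRAM reserved during the run; the cell below is an optional sketch for logging both (the metric key follows the usual Hugging Face Trainer naming).
# Optional: report training runtime and peak GPU memory (sketch)
runtime_s = trainer_stats.metrics.get("train_runtime", "N/A")
peak_gb = torch.cuda.max_memory_reserved() / 1e9
print(f"Train runtime: {runtime_s} s, peak reserved VRAM: {peak_gb:.2f} GB")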
In [6]:
# Post-training inference test
FastLanguageModel.for_inference(model)
messages = [{"role": "user", "content": "What is deep learning?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=64,
temperature=0.7,
top_p=0.9,
do_sample=True,
pad_token_id=tokenizer.pad_token_id,
)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("=" * 60)
print("SFT Training Pipeline Test PASSED")
print("=" * 60)
print(f"Sample generation:\n{response[-200:]}")
============================================================
SFT Training Pipeline Test PASSED
============================================================
Sample generation:
g is a subset of machine learning, right? So I should mention that first. Then, I need to explain the key components. Neural networks, specifically deep neural networks, are the core of deep learning.
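Note that for_inference() switches the model into an inference-optimized mode. If you intend to run further training steps in the same session (this notebook shuts the kernel down instead), Unsloth also provides a training-mode counterpart, used roughly as sketched below.
# Optional: switch back to training mode before any further trainer.train() calls
# (sketch; assumes FastLanguageModel.for_training is available in this Unsloth version)
FastLanguageModel.for_training(model)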
Test Complete¶
The SFT Training Pipeline test has completed successfully. The kernel will now shut down to release all GPU memory.
What Was Verified¶
- FastLanguageModel loading with 4-bit quantization (Qwen3-4B)
- LoRA adapter configuration (r=16, all projection modules)
- Synthetic dataset creation and formatting
- SFTTrainer training loop (3 steps)
- Post-training inference generation
Ready for Production¶
If this test passed, your environment is ready for:
- Full SFT fine-tuning on larger datasets
- Chat/instruction tuning workflows
- Model saving and deployment (see the saving sketch below)
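As a starting point for the saving step, the sketch below writes just the LoRA adapter weights and the tokenizer with the standard save_pretrained calls; the directory name is a placeholder, and Unsloth additionally offers merged-weight export if you need a standalone model.
# Sketch: persist the LoRA adapters and tokenizer (path is a placeholder)
model.save_pretrained("qwen3_4b_sft_lora")
tokenizer.save_pretrained("qwen3_4b_sft_lora")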
In [ ]:
# Shutdown kernel to release all GPU memory
import IPython
print("Shutting down kernel to release GPU memory...")
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)