D3 06 Unsloth
Unsloth: Optimizing Training and Inference Performance
For many software algorithms, performance depends not only on the number and kind of calculations performed; the exact order of operations and the size of the memory chunks they work on also have an enormous influence on speed. For large language models, the Unsloth library provides optimized GPU kernels, created by manually deriving the compute-heavy math steps. Using these kernels can yield a significant speed-up.
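To make this concrete, the following short sketch (plain PyTorch, nothing Unsloth-specific) times the same matrix multiplication with a strided and a contiguous operand. The absolute numbers depend on your hardware and backend; the point is only that memory layout and access order can change throughput even when the math is identical.
# Illustration only (not part of Unsloth): identical math, different memory layout.
# Timings vary by hardware and backend; the exact numbers are not the point.
import time
import torch
device = 'cuda' if torch.cuda.is_available() else 'cpu'
a = torch.randn(4096, 4096, device=device)
b = torch.randn(4096, 4096, device=device)
b_view = b.t()                # transposed view: same data, strided layout
b_copy = b.t().contiguous()   # same operand, stored contiguously
def avg_time(x, y, n=10):
    if device == 'cuda':
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n):
        _ = x @ y
    if device == 'cuda':
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / n
print(f"strided operand:    {avg_time(a, b_view) * 1e3:.2f} ms")
print(f"contiguous operand: {avg_time(a, b_copy) * 1e3:.2f} ms")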
Key Techniques in Unsloth:
- Efficient Data Loading: Optimizing data pipelines to reduce latency and improve throughput during training.
- Batching and Padding Strategies: Dynamically adjusting batch sizes and minimizing padding to optimize memory usage.
- Half-Precision and Quantized Inference: Using mixed precision or quantized models to speed up inference and reduce memory footprint (see the sketch after this list).
- Model Pruning and Distillation: Reducing the size of the model by removing redundant parameters or training smaller models to mimic larger ones.
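As a minimal sketch of the half-precision/quantized loading idea from the list above, here is how a 4-bit model can be loaded with the standard Hugging Face transformers API. The TinyLlama checkpoint is only an example; Unsloth's pre-quantized "-bnb-4bit" models used later in this notebook package a similar configuration for you.
# Sketch: 4-bit (NF4) quantized loading with bitsandbytes via transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit NF4
    bnb_4bit_quant_type='nf4',
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed/stability
)
model = AutoModelForCausalLM.from_pretrained(
    'TinyLlama/TinyLlama-1.1B-Chat-v1.0',   # example checkpoint
    quantization_config=bnb_config,
    device_map='auto',
)
tokenizer = AutoTokenizer.from_pretrained('TinyLlama/TinyLlama-1.1B-Chat-v1.0')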
Benefits of Unsloth:
- Reduced Training Time: Optimizing data loading and model architecture reduces the time required for each epoch.
- Lower Memory Usage: Using techniques like mixed precision and quantization reduces the amount of GPU memory required.
- Faster Inference: Optimizing the model for deployment can significantly reduce latency during inference.
Hands-On Example: Efficient Data Loading and Mixed Precision Training
In this example, we take the fine-tuning setup from the previous notebook ("PEFT") and adjust it to use Unsloth.
Bazzite-AI Setup Required
Run D0_00_Bazzite_AI_Setup.ipynb first to verify GPU access.
vLLM Integration Status
Unsloth supports fast_inference=True, which uses vLLM as a backend for 2x faster inference.
Current Status: vLLM 0.14.0 cu130 is installed, but fast_inference=True requires an Unsloth update to support the vLLM 0.14.x API. Use fast_inference=False for now.
# Once Unsloth updates to support vLLM 0.14.x, enable fast_inference:
# model, tokenizer = FastLanguageModel.from_pretrained(
# "unsloth/tinyllama-chat-bnb-4bit",
# fast_inference=True, # Enable vLLM backend
# gpu_memory_utilization=0.6,
# )
# outputs = model.fast_generate(["Hello!"], max_new_tokens=50)
# Verify vLLM installation
import vllm
print(f"vLLM version: {vllm.__version__}")
print("vLLM is installed and ready for standalone server mode")
# Import libraries
# Unsloth provides optimized model loading and LoRA implementation
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from transformers import BitsAndBytesConfig
from trl import SFTTrainer, SFTConfig
torch.__version__
🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
'2.9.1+cu130'
# Use Unsloth's pre-quantized TinyLlama for consistency with D3_02-D3_05
# Unsloth models are optimized with custom CUDA kernels for faster training
HF_LLM_MODEL = "unsloth/tinyllama-chat-bnb-4bit"
model, tokenizer = FastLanguageModel.from_pretrained(
    HF_LLM_MODEL,
    max_seq_length=512,  # Reduced for memory efficiency on 16GB GPU
    load_in_4bit=True,
)
# Set padding for batch training
tokenizer.padding_side = 'right'
print(f"Model: {HF_LLM_MODEL}")
print(f"Tokenizer padding_side: {tokenizer.padding_side}")
Unsloth 2025.11.1: Fast Llama patching. Transformers: 4.57.2.
NVIDIA GeForce RTX 4080 SUPER. Num GPUs = 1. Max memory: 15.568 GB. Platform: Linux.
Torch: 2.9.1+cu130. CUDA: 8.9. CUDA Toolkit: 13.0. Triton: 3.5.1.
Bfloat16 = TRUE. FA [Xformers = 0.0.33.post2. FA2 = False].
Free license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Model: unsloth/tinyllama-chat-bnb-4bit
Tokenizer padding_side: right
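To see the lower-memory-usage benefit of 4-bit loading in practice, you can check how much GPU memory the model actually occupies at this point (plain PyTorch; the numbers will differ per setup):
# Rough check of GPU memory occupied after loading the 4-bit model.
import torch
if torch.cuda.is_available():
    print(f"Allocated: {torch.cuda.memory_allocated() / 1024**3:.2f} GB")
    print(f"Reserved:  {torch.cuda.memory_reserved() / 1024**3:.2f} GB")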
# Load the guanaco dataset from HuggingFace Hub
guanaco_train = load_dataset('timdettmers/openassistant-guanaco', split='train')
[huggingface_hub.repocard|WARNING] Repo card metadata block was not found. Setting CardData to empty.
def reformat_text(text, include_answer=True):
    question1 = text.split('###')[1].removeprefix(' Human: ')
    answer1 = text.split('###')[2].removeprefix(' Assistant: ')
    if include_answer:
        messages = [
            {'role': 'user', 'content': question1},
            {'role': 'assistant', 'content': answer1}
        ]
    else:
        messages = [
            {'role': 'user', 'content': question1}
        ]
    reformatted_text = tokenizer.apply_chat_template(messages, tokenize=False)
    return reformatted_text
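To sanity-check the helper before mapping it over the whole dataset, you can reformat a single raw entry and compare it with the chat-templated result:
# Quick check: reformat one raw guanaco entry.
sample = guanaco_train[0]['text']
print(sample[:200])           # raw '### Human: ... ### Assistant: ...' format
print('---')
print(reformat_text(sample))  # same content rendered with the chat template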
# Now, apply reformat_text(..) to the dataset:
guanaco_train = guanaco_train.map(lambda entry: {
    'reformatted_text': reformat_text(entry['text'])
})
# Apply LoRA adapters using Unsloth's optimized implementation
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=32,  # rule: lora_alpha should be 2*r
    lora_dropout=0,  # Use 0 for Unsloth's optimized fast patching (all layers)
    bias='none',  # Unsloth supports any, but 'none' is optimized
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing='unsloth',  # True or 'unsloth' for very long context
)
Unsloth 2025.11.1 patched 22 layers with 22 QKV layers, 22 O layers and 22 MLP layers.
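As a quick sanity check, you can count how many parameters the LoRA adapters actually make trainable. The Unsloth training banner below prints its own breakdown; note that 4-bit base weights are stored packed, so a naive total-parameter count would undercount the base model, which is why the sketch only counts the trainable ones.
# Count trainable (LoRA) parameters after adding the adapters.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")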
training_arguments = SFTConfig(
    output_dir='output/unsloth-tinyllama-chat-guanaco',
    per_device_train_batch_size=2,  # Reduced from 8 for 16GB GPU
    gradient_accumulation_steps=4,  # Increased to maintain effective batch size
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={'use_reentrant': False},
    optim='adamw_torch',
    learning_rate=2e-4,
    logging_strategy='steps',
    logging_steps=10,
    save_strategy='no',
    max_steps=100,
    bf16=True,
    report_to='none',
    max_seq_length=512,  # Reduced from 1024 for memory efficiency
    dataset_text_field='reformatted_text',
)
trainer = SFTTrainer(
    model=model,
    args=training_arguments,
    train_dataset=guanaco_train,
    processing_class=tokenizer,
)
train_result = trainer.train()
print("Training result:")
print(train_result)
The model is already on multiple devices. Skipping the move to device specified in `args`.
Unsloth - 2x faster free finetuning | Num GPUs used = 1
Num examples = 9,846 | Num Epochs = 1 | Total steps = 100
Batch size per device = 2 | Gradient accumulation steps = 4
Data Parallel GPUs = 1 | Total batch size (2 x 4 x 1) = 8
Trainable parameters = 12,615,680 of 1,112,664,064 (1.13% trained)
Training result:
TrainOutput(global_step=100, training_loss=1.5187408638000488, metrics={'train_runtime': 41.0695, 'train_samples_per_second': 19.479, 'train_steps_per_second': 2.435, 'total_flos': 1889668246855680.0, 'train_loss': 1.5187408638000488, 'epoch': 0.08125126955108673})
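Before shutting the kernel down, you can try the freshly trained adapter. A minimal sketch, assuming Unsloth's FastLanguageModel.for_inference helper (available in current Unsloth releases) and the same chat template used for training; fast_inference stays off, as noted above.
# Quick generation with the LoRA-adapted model.
FastLanguageModel.for_inference(model)  # switch Unsloth to inference mode
messages = [{'role': 'user', 'content': 'Explain LoRA fine-tuning in one sentence.'}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
outputs = model.generate(**inputs, max_new_tokens=80)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))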
# Shut down the kernel to release memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
{'status': 'ok', 'restart': False}