Fine-tuning an LLM with plain PyTorch¶

In this notebook we will:

Use the same model and dataset as in later notebooks.
Build a PyTorch Dataset and DataLoader for instruction → response pairs.
Fine-tune a pretrained language model using a vanilla PyTorch training loop (no LoRA, no DeepSpeed, or any other fancy technique).
Save the fine-tuned model.
Compare inference before and after fine-tuning.

Learning objectives¶

After this notebook, you should be able to:

Explain what supervised fine-tuning does to an LLM.
Describe in words what an epoch, batch size, and learning rate are.
Read and write a standard PyTorch training loop for an LLM.
Run inference with a base model vs. a fine-tuned version and interpret the difference.

Bazzite-AI Setup Required
Run D0_00_Bazzite_AI_Setup.ipynb first to verify GPU access.

1. What does fine-tuning actually do?¶

We assume we start from a pretrained language model. It has already learned:

Grammar and spelling.
General knowledge.
How to continue text in a plausible way.

Now we want the model to behave well on our task (for example: answer domain-specific instructions in a certain style).
We show it many pairs of:

Input (prompt, instruction, context) → Target (ideal response / completion)

The model assigns a probability to each possible next token. During fine-tuning, we change the weights to increase the probability of the correct tokens.

Mathematically, if

$x = (x_1, \dots, x_T)$ is the input sequence (tokens),
$y = (y_1, \dots, y_T)$ is the target sequence,

we minimize the cross-entropy loss:

$$ L(\theta) = - \sum_t \log p_\theta(y_t \mid x_{\le t}) $$

You don’t need to derive this formula; the key idea is:

During fine-tuning, the model is nudged so that the ideal answer becomes more likely on future inputs that look similar.

Imports¶

In [50]:

  Copied!     
 
import os
import math
import random
from dataclasses import dataclass

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    get_linear_schedule_with_warmup,
)
from datasets import load_dataset

import pandas as pd

torch.__version__
import os import math import random from dataclasses import dataclass import torch from torch import nn from torch.utils.data import Dataset, DataLoader from transformers import ( AutoTokenizer, AutoModelForCausalLM, get_linear_schedule_with_warmup, ) from datasets import load_dataset import pandas as pd torch.__version__

Out[50]:

'2.9.1+cu130'

In [51]:

  Copied!     
 
@dataclass
class Config:
    
    # Data
    max_length: int = 256        # max tokens per example

    # Optimization
    batch_size: int = 1          # reduced for memory efficiency
    num_epochs: int = 1
    learning_rate: float = 5e-6
    weight_decay: float = 0.01

    warmup_ratio: float = 0.1
    gradient_accumulation_steps: int = 16  # increased to compensate for smaller batch

    seed: int = 42
    device: str = "cuda" if torch.cuda.is_available() else "cpu"

cfg = Config()
cfg
@dataclass class Config: # Data max_length: int = 256 # max tokens per example # Optimization batch_size: int = 1 # reduced for memory efficiency num_epochs: int = 1 learning_rate: float = 5e-6 weight_decay: float = 0.01 warmup_ratio: float = 0.1 gradient_accumulation_steps: int = 16 # increased to compensate for smaller batch seed: int = 42 device: str = "cuda" if torch.cuda.is_available() else "cpu" cfg = Config() cfg

Out[51]:

Config(max_length=256, batch_size=1, num_epochs=1, learning_rate=5e-06, weight_decay=0.01, warmup_ratio=0.1, gradient_accumulation_steps=16, seed=42, device='cuda')

Understanding the main hyperparameters¶

We will use the following hyperparameters during training:

Epoch:
One epoch means one full pass through the training dataset.
- If you have 1,000 training examples and batch_size = 10, there are 100 steps per epoch.
- If we train for 3 epochs, the model will see each example 3 times.
Batch size (batch_size):
Number of examples processed together in one forward & backward pass.
- Larger batch sizes give a more stable estimate of the gradient, but use more GPU memory.
- On limited GPU memory, we often use small batches and gradient accumulation (see below).
Learning rate (learning_rate):
How big a step we take in the direction suggested by the gradients.
- Too large → training may diverge (loss explodes).
- Too small → training is very slow and may get stuck in poor local minima.
Weight decay (weight_decay):
A regularization term that slowly pulls weights towards zero to avoid overfitting.
Maximum sequence length (max_length):
We truncate / pad sequences to this number of tokens.
- Longer sequences capture more context but cost more memory and time.
- Shorter sequences are cheaper but might cut off important text.
Warmup ratio (warmup_ratio):
Fraction of the total training steps where the learning rate increases linearly from 0 to the target value.
- Helps avoid instability at the beginning of training, especially for large models.
Gradient accumulation steps (gradient_accumulation_steps):
Instead of updating the model after every batch, we:
1. Compute gradients for several small batches, and
2. Accumulate them in memory,
3. Then apply one optimizer step.
This simulates a larger effective batch size:

$$ \text{effective batch size} = \text{batch\_size} \times \text{gradient\_accumulation\_steps} $$

It’s a common trick to work around GPU memory limits.

In [52]:

  Copied!     
 
def set_seed(seed: int):
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(cfg.seed)
def set_seed(seed: int): random.seed(seed) torch.manual_seed(seed) torch.cuda.manual_seed_all(seed) set_seed(cfg.seed)

Out[52]:

[No output generated]

In [53]:

  Copied!     
 
dataset = load_dataset("timdettmers/openassistant-guanaco")
dataset = load_dataset("timdettmers/openassistant-guanaco")

Out[53]:

Repo card metadata block was not found. Setting CardData to empty.

Dataset loading¶

We use the HuggingFace datasets library to load the OpenAssistant Guanaco dataset directly from the Hub. This handles downloading, caching, and provides a clean interface for accessing train/test splits.

The dataset contains conversation pairs in the format:

{"text": "### Human: ...### Assistant: ..."}

In [54]:

  Copied!     
 
train_data = dataset["train"].to_list()
val_data = dataset["test"].to_list()
train_data = dataset["train"].to_list() val_data = dataset["test"].to_list()

Out[54]:

[No output generated]

To speed up the training, we will reduce the dataset size considerably for demonstration purposes:

In [55]:

  Copied!     
 
train_data = train_data[0:500]
val_data = val_data[0:100]

train_df = pd.DataFrame(train_data)
val_df   = pd.DataFrame(val_data)

print("Columns:", train_df.columns.tolist())
print("Train size:", len(train_df))
print("Validation size:", len(val_df))

train_df.head()
train_data = train_data[0:500] val_data = val_data[0:100] train_df = pd.DataFrame(train_data) val_df = pd.DataFrame(val_data) print("Columns:", train_df.columns.tolist()) print("Train size:", len(train_df)) print("Validation size:", len(val_df)) train_df.head()

Out[55]:

Columns: ['text']
Train size: 500
Validation size: 100

Out[55]:

                                                text
0  ### Human: Can you write a short introduction ...
1  ### Human: ¿CUales son las etapas del desarrol...
2  ### Human: Can you explain contrastive learnin...
3  ### Human: I want to start doing astrophotogra...
4  ### Human: Método del Perceptrón biclásico: de...

In [56]:

  Copied!     
 
len(train_data)
len(train_data)

Out[56]:

Dataset structure¶

We are using the Guanaco / OpenAssistant dataset (timdettmers/openassistant-guanaco).

Each line in openassistant_best_replies_train.jsonl is a JSON object with a single key: text.

The text value is a formatted conversation snippet, for example:

### Human: Hola### Assistant: ¡Hola! ¿En qué puedo ayudarte hoy?

In [57]:

  Copied!     
 
HF_LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
HF_LLM_MODEL = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

Out[57]:

[No output generated]

In [58]:

  Copied!     
 
tokenizer = AutoTokenizer.from_pretrained(HF_LLM_MODEL)
tokenizer = AutoTokenizer.from_pretrained(HF_LLM_MODEL)

Out[58]:

[No output generated]

Ensure we have a pad token (common for causal LMs):

In [59]:

  Copied!     
 
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

print("Pad token:", tokenizer.pad_token, "ID:", tokenizer.pad_token_id)
if tokenizer.pad_token is None: tokenizer.pad_token = tokenizer.eos_token print("Pad token:", tokenizer.pad_token, "ID:", tokenizer.pad_token_id)

Out[59]:

Pad token: </s> ID: 2

Prompt for inference function¶

The model was trained on Guanaco-style conversation strings of the form:

### Human: <instruction>### Assistant: <response>

At inference time we only have a new user instruction, so we must recreate the same format the model saw during training. This function builds that template and leaves the assistant part empty, so the model can generate the response naturally.

In [60]:

  Copied!     
 
def build_prompt_for_inference(user_instruction: str) -> str:
    """
    Build a Guanaco-style prompt for a NEW instruction at inference time.
    The dataset format looks like:
    "### Human: ...### Assistant: ..."
    """
    return f"### Human: {user_instruction}### Assistant:"
def build_prompt_for_inference(user_instruction: str) -> str: """ Build a Guanaco-style prompt for a NEW instruction at inference time. The dataset format looks like: "### Human: ...### Assistant: ..." """ return f"### Human: {user_instruction}### Assistant:"

Out[60]:

[No output generated]

Tensorization Class¶

This class converts each training example (a single "text" string in Guanaco format) into the tensors needed for fine-tuning a causal language model.

For every row in the dataset it:

Reads the full conversation text (e.g. "### Human: ...### Assistant: ...").
Tokenizes it using the model’s tokenizer.
Pads or truncates the sequence to a fixed length.
Creates:
- input_ids → the tokenized input
- attention_mask → which tokens are real vs padding
- labels → a copy of input_ids used as training targets
Replaces padding positions in labels with -100 so they are ignored in the loss.

The result is a dictionary of tensors (input_ids, attention_mask, labels) that PyTorch’s DataLoader can batch and feed directly into the model during training.

In [61]:

  Copied!     
 
class SupervisedTextDataset(Dataset):
    """
    Each row in train_df/val_df has a 'text' field like:

        "### Human: ...### Assistant: ..."

    For supervised fine-tuning of a causal LM, we feed in the full text and
    ask the model to learn to predict the next token at every position.
    """
    def __init__(self, dataframe, tokenizer, max_length: int = 256):
        self.df = dataframe.reset_index(drop=True)
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        text = row["text"]

        enc = self.tokenizer(
            text,
            truncation=True,
            max_length=self.max_length,
            padding="max_length",
            return_tensors="pt",
        )

        input_ids = enc["input_ids"].squeeze(0)
        attention_mask = enc["attention_mask"].squeeze(0)

        # For causal LM SFT: labels = input_ids (shift is handled internally)
        labels = input_ids.clone()
        labels[attention_mask == 0] = -100  # ignore padding in loss

        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }
class SupervisedTextDataset(Dataset): """ Each row in train_df/val_df has a 'text' field like: "### Human: ...### Assistant: ..." For supervised fine-tuning of a causal LM, we feed in the full text and ask the model to learn to predict the next token at every position. """ def __init__(self, dataframe, tokenizer, max_length: int = 256): self.df = dataframe.reset_index(drop=True) self.tokenizer = tokenizer self.max_length = max_length def __len__(self): return len(self.df) def __getitem__(self, idx): row = self.df.iloc[idx] text = row["text"] enc = self.tokenizer( text, truncation=True, max_length=self.max_length, padding="max_length", return_tensors="pt", ) input_ids = enc["input_ids"].squeeze(0) attention_mask = enc["attention_mask"].squeeze(0) # For causal LM SFT: labels = input_ids (shift is handled internally) labels = input_ids.clone() labels[attention_mask == 0] = -100 # ignore padding in loss return { "input_ids": input_ids, "attention_mask": attention_mask, "labels": labels, }

Out[61]:

[No output generated]

In [62]:

  Copied!     
 
train_dataset = SupervisedTextDataset(train_df, tokenizer, max_length=cfg.max_length)
val_dataset   = SupervisedTextDataset(val_df, tokenizer, max_length=cfg.max_length)
train_dataset = SupervisedTextDataset(train_df, tokenizer, max_length=cfg.max_length) val_dataset = SupervisedTextDataset(val_df, tokenizer, max_length=cfg.max_length)

Out[62]:

[No output generated]

Dataloader¶

The PyTorch DataLoader is responsible for:

dividing the dataset into batches of size batch_size
shuffling the data each epoch (because shuffle=True)
fetching items by calling __getitem__ from our SupervisedTextDataset
returning ready-to-use batches during training

train_loader is therefore the object our training loop iterates over:

for batch in train_loader:
    ...

This is the standard PyTorch way to feed data into a model during training.

In [63]:

  Copied!     
 
train_loader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=cfg.batch_size)

len(train_loader), len(val_loader)
train_loader = DataLoader(train_dataset, batch_size=cfg.batch_size, shuffle=True) val_loader = DataLoader(val_dataset, batch_size=cfg.batch_size) len(train_loader), len(val_loader)

Out[63]:

(500, 100)

In [64]:

  Copied!     
 
batch = next(iter(train_loader))
{k: v.shape for k, v in batch.items()}
batch = next(iter(train_loader)) {k: v.shape for k, v in batch.items()}

Out[64]:

{'input_ids': torch.Size([1, 256]),
 'attention_mask': torch.Size([1, 256]),
 'labels': torch.Size([1, 256])}

What’s inside one batch?¶

batch = next(iter(train_loader))
{k: v.shape for k, v in batch.items()}

iter(train_loader) creates a Python iterator over the batches.
next(...) retrieves the first batch from the DataLoader.
batch is a dictionary containing tensors like:
- "input_ids"
- "attention_mask"
- "labels"

The second line builds a new dictionary showing the shape of each tensor in that batch. This is a quick way to inspect what one batch looks like and confirm that batching and padding work as expected.

For a batch size of B and sequence length L, we typically get:

input_ids: shape (B, L)
Integer token IDs that the model reads.
attention_mask: shape (B, L)
1 = real token, 0 = padding token.
labels: shape (B, L)
Token IDs that we want the model to predict, with -100 at positions to ignore in the loss (padding).

The model will compute a probability distribution over the vocabulary for each position and compare it against labels using cross-entropy loss.

Model, optimizer, and scheduler¶

In [65]:

  Copied!     
 
model = AutoModelForCausalLM.from_pretrained(
    HF_LLM_MODEL,
    dtype=torch.bfloat16,  # Use bfloat16 for memory efficiency
)

model.to(cfg.device)

# Enable gradient checkpointing for memory efficiency
model.gradient_checkpointing_enable()

n_params = sum(p.numel() for p in model.parameters())
print(f"Number of parameters: {n_params / 1e6:.1f}M")
print(f"Model dtype: {next(model.parameters()).dtype}")
model = AutoModelForCausalLM.from_pretrained( HF_LLM_MODEL, dtype=torch.bfloat16, # Use bfloat16 for memory efficiency ) model.to(cfg.device) # Enable gradient checkpointing for memory efficiency model.gradient_checkpointing_enable() n_params = sum(p.numel() for p in model.parameters()) print(f"Number of parameters: {n_params / 1e6:.1f}M") print(f"Model dtype: {next(model.parameters()).dtype}")

Out[65]:

Number of parameters: 1100.0M
Model dtype: torch.bfloat16

In [66]:

  Copied!     
 
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=cfg.learning_rate,
    weight_decay=cfg.weight_decay,
)
optimizer = torch.optim.AdamW( model.parameters(), lr=cfg.learning_rate, weight_decay=cfg.weight_decay, )

Out[66]:

[No output generated]

How many optimizer steps in total? Note: we divide by gradient_accumulation_steps

In [67]:

  Copied!     
 
steps_per_epoch = math.ceil(len(train_loader))
total_steps = (steps_per_epoch * cfg.num_epochs) // cfg.gradient_accumulation_steps

warmup_steps = int(cfg.warmup_ratio * total_steps)
steps_per_epoch = math.ceil(len(train_loader)) total_steps = (steps_per_epoch * cfg.num_epochs) // cfg.gradient_accumulation_steps warmup_steps = int(cfg.warmup_ratio * total_steps)

Out[67]:

[No output generated]

This creates a learning-rate scheduler that changes the optimizer’s learning rate during training.

In [68]:

  Copied!     
 
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps,
)

total_steps, warmup_steps
scheduler = get_linear_schedule_with_warmup( optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps, ) total_steps, warmup_steps

Out[68]:

(31, 3)

Time to train a model¶

In [69]:

  Copied!     
 
model.train()

single_batch = next(iter(train_loader))
single_batch = {k: v.to(cfg.device) for k, v in single_batch.items()}

# Forward pass
out = model(
    input_ids=single_batch["input_ids"],
    attention_mask=single_batch["attention_mask"],
    labels=single_batch["labels"],
)

loss = out.loss
print("Single batch loss:", loss.item())

# Backward pass
loss.backward()

# Parameter update
optimizer.step()
scheduler.step()
optimizer.zero_grad()
model.train() single_batch = next(iter(train_loader)) single_batch = {k: v.to(cfg.device) for k, v in single_batch.items()} # Forward pass out = model( input_ids=single_batch["input_ids"], attention_mask=single_batch["attention_mask"], labels=single_batch["labels"], ) loss = out.loss print("Single batch loss:", loss.item()) # Backward pass loss.backward() # Parameter update optimizer.step() scheduler.step() optimizer.zero_grad()

Out[69]:

`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.

Out[69]:

Single batch loss: 2.5328707695007324

What just happened?¶

For one batch we did:

Forward pass:
out = model(...)
The model returns:
- out.logits: raw predictions for each token.
- out.loss: cross-entropy loss between logits and labels.
Loss computation:
loss = out.loss
A single scalar summarizing how “wrong” the model is on this batch.
Backward pass:
loss.backward()
Computes gradients of the loss with respect to all trainable parameters.
Optimizer step:
optimizer.step()
Updates the weights using the gradients (and the learning rate).
LR scheduler step:
scheduler.step()
Adjusts the learning rate according to the warmup + decay schedule.
Zero gradients:
optimizer.zero_grad()
Clears old gradients so they don’t accumulate accidentally.

The full training loop is just many repetitions of this pattern over all batches and epochs.

Evaluation function¶

This function computes the average validation loss of the model without updating its weights.

Step-by-step:

model.eval()
Puts the model in evaluation mode (disables dropout, layer norm behavior, etc.).
torch.no_grad()
Turns off gradient calculation → faster and uses less memory.
Iterate over the validation dataloader
- Move each batch to the correct device
- Run a forward pass with model(**batch) (The double asterisk ** unpacks the dictionary into keyword arguments.)
- Extract the loss (out.loss.item()) and store it
Return the model to training mode with model.train()
Compute and return the mean loss across all validation batches.

This function is used at the end of each epoch to check how well the model performs on unseen data.

In [70]:

  Copied!     
 
def evaluate(model, dataloader):
    model.eval()
    losses = []

    with torch.no_grad():
        for batch in dataloader:
            batch = {k: v.to(cfg.device) for k, v in batch.items()}
            out = model(**batch)
            losses.append(out.loss.item())

    model.train()
    return sum(losses) / len(losses)
def evaluate(model, dataloader): model.eval() losses = [] with torch.no_grad(): for batch in dataloader: batch = {k: v.to(cfg.device) for k, v in batch.items()} out = model(**batch) losses.append(out.loss.item()) model.train() return sum(losses) / len(losses)

Out[70]:

[No output generated]

Full training loop¶

In [71]:

  Copied!     
 
from tqdm.auto import tqdm

global_step = 0
best_val_loss = float("inf")
save_dir = "ft_model"

os.makedirs(save_dir, exist_ok=True)

for epoch in range(cfg.num_epochs):
    model.train()
    running_loss = 0.0

    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{cfg.num_epochs}")

    for step, batch in enumerate(progress_bar):
        batch = {k: v.to(cfg.device) for k, v in batch.items()}

        out = model(**batch)
        loss = out.loss / cfg.gradient_accumulation_steps
        loss.backward()

        running_loss += loss.item()

        if (step + 1) % cfg.gradient_accumulation_steps == 0:
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
            global_step += 1

            avg_loss = running_loss / cfg.gradient_accumulation_steps
            running_loss = 0.0

            progress_bar.set_postfix(
                train_loss=f"{avg_loss:.4f}",
                lr=f"{scheduler.get_last_lr()[0]:.2e}",
            )

    # Validation at the end of the epoch
    val_loss = evaluate(model, val_loader)
    print(f"\nValidation loss after epoch {epoch+1}: {val_loss:.4f}")

    # Simple checkpointing: keep the best model
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        model.save_pretrained(save_dir)
        tokenizer.save_pretrained(save_dir)
        print(f"→ New best model saved to {save_dir}\n")
    else:
        print("No improvement, keeping previous best model.\n")
from tqdm.auto import tqdm global_step = 0 best_val_loss = float("inf") save_dir = "ft_model" os.makedirs(save_dir, exist_ok=True) for epoch in range(cfg.num_epochs): model.train() running_loss = 0.0 progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{cfg.num_epochs}") for step, batch in enumerate(progress_bar): batch = {k: v.to(cfg.device) for k, v in batch.items()} out = model(**batch) loss = out.loss / cfg.gradient_accumulation_steps loss.backward() running_loss += loss.item() if (step + 1) % cfg.gradient_accumulation_steps == 0: optimizer.step() scheduler.step() optimizer.zero_grad() global_step += 1 avg_loss = running_loss / cfg.gradient_accumulation_steps running_loss = 0.0 progress_bar.set_postfix( train_loss=f"{avg_loss:.4f}", lr=f"{scheduler.get_last_lr()[0]:.2e}", ) # Validation at the end of the epoch val_loss = evaluate(model, val_loader) print(f"\nValidation loss after epoch {epoch+1}: {val_loss:.4f}") # Simple checkpointing: keep the best model if val_loss < best_val_loss: best_val_loss = val_loss model.save_pretrained(save_dir) tokenizer.save_pretrained(save_dir) print(f"→ New best model saved to {save_dir}\n") else: print("No improvement, keeping previous best model.\n")

Out[71]:

Epoch 1/1:   0%|          | 0/500 [00:00<?, ?it/s]

Out[71]:

Validation loss after epoch 1: 1.8335

Out[71]:

→ New best model saved to ft_model

In [72]:

  Copied!     
 
# Fine-tuned model – load from checkpoint
ft_model = AutoModelForCausalLM.from_pretrained(
    save_dir,
    dtype=torch.bfloat16,
).to(cfg.device)
ft_model.eval()
# Fine-tuned model – load from checkpoint ft_model = AutoModelForCausalLM.from_pretrained( save_dir, dtype=torch.bfloat16, ).to(cfg.device) ft_model.eval()

Out[72]:

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 2048)
    (layers): ModuleList(
      (0-21): 22 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=256, bias=False)
          (v_proj): Linear(in_features=2048, out_features=256, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
          (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
          (act_fn): SiLUActivation()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=32000, bias=False)
)

Time for inference¶

This function runs inference: it takes a new user instruction and makes the model generate a reply.

In [73]:

  Copied!     
 
def generate_response(model, instruction: str, max_new_tokens: int = 128):
    """
    Generate a reply from the model given a human instruction.
    We create a Guanaco-style prompt:
        "### Human: ...### Assistant:"
    and let the model continue.
    """
    prompt_text = build_prompt_for_inference(instruction)

    inputs = tokenizer(prompt_text, return_tensors="pt").to(cfg.device)

    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy for clarity
            pad_token_id=tokenizer.pad_token_id,
        )

    text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
    return text
def generate_response(model, instruction: str, max_new_tokens: int = 128): """ Generate a reply from the model given a human instruction. We create a Guanaco-style prompt: "### Human: ...### Assistant:" and let the model continue. """ prompt_text = build_prompt_for_inference(instruction) inputs = tokenizer(prompt_text, return_tensors="pt").to(cfg.device) with torch.no_grad(): output_ids = model.generate( **inputs, max_new_tokens=max_new_tokens, do_sample=False, # greedy for clarity pad_token_id=tokenizer.pad_token_id, ) text = tokenizer.decode(output_ids[0], skip_special_tokens=True) return text

Out[73]:

[No output generated]

Let's find an example text that is actually in english

In [74]:

  Copied!     
 
example_text = val_df.iloc[11]["text"]
example_text
example_text = val_df.iloc[11]["text"] example_text

Out[74]:

"### Human: How do I build a PC?### Assistant: Building a PC can be a rewarding experience, but it requires some technical knowledge and patience. Here are the general steps you can follow:\n\n1. Determine your budget and what you want to use the PC for (gaming, work, etc.). This will help you choose the appropriate components.\n\n2. Choose the components you want to use (CPU, motherboard, RAM, storage, GPU, power supply, case, and peripherals).\n\n3. Make sure all the components are compatible with each other. Check the manufacturer's website and online forums for compatibility information.\n\n4. Gather the necessary tools (screwdriver, thermal paste, etc.).\n\n5. Install the CPU onto the motherboard, following the manufacturer's instructions.\n\n6. Install the RAM onto the motherboard, following the manufacturer's instructions.\n\n7. Install the storage (hard drive or solid-state drive) onto the motherboard or into the case, following the manufacturer's instructions.\n\n8. Install the power supply into the case, following the manufacturer's instructions.\n\n9. Install the GPU onto the motherboard, following the manufacturer's instructions.\n\n10. Connect all the necessary cables to the motherboard, power supply, and peripherals.\n\n11. Turn on the PC and make sure it boots up properly.\n\n12. Install the operating system and necessary drivers.\n\n13. Test the PC to make sure all the components are working properly.\n\nThese are the basic steps, but there may be some variation depending on the specific components you choose. It's important to follow the manufacturer's instructions and take your time to avoid damaging any components. There are also many online resources, such as YouTube tutorials and PC building forums, that can provide additional guidance and tips.### Human: Thank you. Is it better to build my own PC or to just buy one off the shelf? Give me the pros and cons of each approach.### Assistant: Building a computer \nPros:\n* Cheaper in the long run\n* Easier to fix\n* Better overall quality\n\nCons:\n* Can be more expensive upfront\n* Higher chance of user error\n* You need to figure out part compatibility\n\n---\n\nOff-the-Shelf computer\nPros:\n* Faster to buy than to build\n* Plug and Play\n* Normally Cheaper to buy rather than build\n\nCons:\n* Repairs are harder to do\n* Build Quality can be lower\n* Limited configuration available\n\nThere are plenty of other reasons that can influence your decisions but it comes down to how soon you need a computer, and how confident you are working on a computer."

Let's also load the original model that has not been fine-tuned:

In [75]:

  Copied!     
 
base_model = AutoModelForCausalLM.from_pretrained(
    HF_LLM_MODEL,
    dtype=torch.bfloat16,
).to(cfg.device)
base_model = AutoModelForCausalLM.from_pretrained( HF_LLM_MODEL, dtype=torch.bfloat16, ).to(cfg.device)

Out[75]:

[No output generated]

In [76]:

  Copied!     
 
# crude split to get the human message
if "### Human:" in example_text and "### Assistant:" in example_text:
    human_part = example_text.split("### Human:")[1].split("### Assistant:")[0].strip()
    assistant_part = example_text.split("### Assistant:")[1].strip()
else:
    human_part = example_text
    assistant_part = ""

print("### HUMAN (PROMPT) ###")
print(human_part)

print("\n### GROUND TRUTH ASSISTANT ###")
print(assistant_part)

print("\n### BASE MODEL ###")
print(generate_response(base_model, human_part))

print("\n### FINE-TUNED MODEL ###")
print(generate_response(ft_model, human_part))
# crude split to get the human message if "### Human:" in example_text and "### Assistant:" in example_text: human_part = example_text.split("### Human:")[1].split("### Assistant:")[0].strip() assistant_part = example_text.split("### Assistant:")[1].strip() else: human_part = example_text assistant_part = "" print("### HUMAN (PROMPT) ###") print(human_part) print("\n### GROUND TRUTH ASSISTANT ###") print(assistant_part) print("\n### BASE MODEL ###") print(generate_response(base_model, human_part)) print("\n### FINE-TUNED MODEL ###") print(generate_response(ft_model, human_part))

Out[76]:

### HUMAN (PROMPT) ###
How do I build a PC?

### GROUND TRUTH ASSISTANT ###
Building a PC can be a rewarding experience, but it requires some technical knowledge and patience. Here are the general steps you can follow:

1. Determine your budget and what you want to use the PC for (gaming, work, etc.). This will help you choose the appropriate components.

2. Choose the components you want to use (CPU, motherboard, RAM, storage, GPU, power supply, case, and peripherals).

3. Make sure all the components are compatible with each other. Check the manufacturer's website and online forums for compatibility information.

4. Gather the necessary tools (screwdriver, thermal paste, etc.).

5. Install the CPU onto the motherboard, following the manufacturer's instructions.

6. Install the RAM onto the motherboard, following the manufacturer's instructions.

7. Install the storage (hard drive or solid-state drive) onto the motherboard or into the case, following the manufacturer's instructions.

8. Install the power supply into the case, following the manufacturer's instructions.

9. Install the GPU onto the motherboard, following the manufacturer's instructions.

10. Connect all the necessary cables to the motherboard, power supply, and peripherals.

11. Turn on the PC and make sure it boots up properly.

12. Install the operating system and necessary drivers.

13. Test the PC to make sure all the components are working properly.

These are the basic steps, but there may be some variation depending on the specific components you choose. It's important to follow the manufacturer's instructions and take your time to avoid damaging any components. There are also many online resources, such as YouTube tutorials and PC building forums, that can provide additional guidance and tips.### Human: Thank you. Is it better to build my own PC or to just buy one off the shelf? Give me the pros and cons of each approach.

### BASE MODEL ###

Out[76]:

### Human: How do I build a PC?### Assistant: Sure, here's a step-by-step guide on how to build a PC:

1. Choose your components

- Choose the right components for your needs. For example, if you're a gamer, you'll want a powerful CPU, GPU, and RAM.
- Consider the budget. You don't want to spend too much on components that won't be used.

2. Choose your motherboard

- Choose a motherboard that meets your needs. Some motherboards come with pre-installed components, while others require additional components.
- Consider

### FINE-TUNED MODEL ###

Out[76]:

### Human: How do I build a PC?### Assistant: Sure, here's a step-by-step guide on how to build a PC:

1. Choose your components

- CPU: Choose a processor that meets your needs. Intel Core i5 or AMD Ryzen 5 are good options.
- RAM: Choose a RAM speed that matches your processor speed. DDR4 is the most common speed.
- Motherboard: Choose a motherboard that meets your needs. Intel and AMD motherboards are the most common.
- Graphics card: Choose a graphics card that meets your needs. Nvidia GeForce GTX

In [77]:

  Copied!     
 
my_instruction = "WRITE YOUR INSTRUCTION HERE"

print("\nFINE-TUNED MODEL:\n")
print(generate_response(ft_model, my_instruction))
my_instruction = "WRITE YOUR INSTRUCTION HERE" print("\nFINE-TUNED MODEL:\n") print(generate_response(ft_model, my_instruction))

Out[77]:

FINE-TUNED MODEL:

Out[77]:

### Human: WRITE YOUR INSTRUCTION HERE### Assistant: Here's an example:

1. Start by setting up your project on your computer.

2. Install the necessary software and tools, such as a text editor, a development environment, and a database management system.

3. Create a database schema and design the database tables.

4. Write the database queries and stored procedures to retrieve and update data.

5. Implement security measures, such as user authentication and authorization, to ensure data privacy and security.

6. Test the application thoroughly to ensure it works as expected and meets the requirements.

7. Deploy the

In [78]:

  Copied!     
 
use_amp = torch.cuda.is_available()  # only useful on GPU
print("Using AMP (mixed precision)?", use_amp)

scaler = torch.amp.GradScaler('cuda', enabled=use_amp)

def train_one_epoch_amp(model, dataloader, optimizer, scheduler, scaler, epoch_idx: int):
    model.train()
    running_loss = 0.0
    progress_bar = tqdm(dataloader, desc=f"[AMP] Epoch {epoch_idx+1}/{cfg.num_epochs}")

    for step, batch in enumerate(progress_bar):
        batch = {k: v.to(cfg.device) for k, v in batch.items()}

        with torch.amp.autocast('cuda', enabled=use_amp):
            out = model(**batch)
            loss = out.loss / cfg.gradient_accumulation_steps

        scaler.scale(loss).backward()
        running_loss += loss.item()

        if (step + 1) % cfg.gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            scheduler.step()

            avg_loss = running_loss / cfg.gradient_accumulation_steps
            running_loss = 0.0

            progress_bar.set_postfix(
                train_loss=f"{avg_loss:.4f}",
                lr=f"{scheduler.get_last_lr()[0]:.2e}",
            )
use_amp = torch.cuda.is_available() # only useful on GPU print("Using AMP (mixed precision)?", use_amp) scaler = torch.amp.GradScaler('cuda', enabled=use_amp) def train_one_epoch_amp(model, dataloader, optimizer, scheduler, scaler, epoch_idx: int): model.train() running_loss = 0.0 progress_bar = tqdm(dataloader, desc=f"[AMP] Epoch {epoch_idx+1}/{cfg.num_epochs}") for step, batch in enumerate(progress_bar): batch = {k: v.to(cfg.device) for k, v in batch.items()} with torch.amp.autocast('cuda', enabled=use_amp): out = model(**batch) loss = out.loss / cfg.gradient_accumulation_steps scaler.scale(loss).backward() running_loss += loss.item() if (step + 1) % cfg.gradient_accumulation_steps == 0: scaler.step(optimizer) scaler.update() optimizer.zero_grad() scheduler.step() avg_loss = running_loss / cfg.gradient_accumulation_steps running_loss = 0.0 progress_bar.set_postfix( train_loss=f"{avg_loss:.4f}", lr=f"{scheduler.get_last_lr()[0]:.2e}", )

Out[78]:

Using AMP (mixed precision)? True

Note: For the main run we implemented above, we used full precision for simplicity.
If you switch to this AMP-based loop instead, you will usually see:

Lower GPU memory usage.

Faster training (especially on modern GPUs like A100/H100).

Summary¶

In this notebook you:

Loaded a pretrained language model and tokenizer.
Prepared a dataset of (prompt, response) pairs.
Tokenized and formatted the data for causal language modeling.
Implemented a vanilla PyTorch training loop:
- Forward pass → loss
- Backward pass → gradients
- Optimizer + scheduler → weight updates
Saved the fine-tuned model.
Compared inference:
- Base model vs. fine-tuned model on real examples.

In [ ]:

  Copied!     
 
# Shut down the kernel to release memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
# Shut down the kernel to release memory import IPython app = IPython.Application.instance() app.kernel.do_shutdown(restart=False)