Transformer Anatomy¶
The Transformer architecture, introduced in the paper "Attention Is All You Need" (Vaswani et al., 2017), has revolutionized the field of NLP. Unlike previous models that relied on recurrent or convolutional layers, Transformers use self-attention mechanisms to capture dependencies between words in a sentence, regardless of their distance.
Key Components of the Transformer Architecture:¶
- Positional Encoding: Adds information about the position of words in a sequence since the model itself does not inherently understand word order.
- Self-Attention Mechanism: Allows the model to weigh the importance of each word in a sentence relative to all other words.
- Multi-Head Attention: Enables the model to focus on different parts of the input simultaneously.
- Layer Normalization and Residual Connections: Stabilize training and help gradients flow through deep stacks of layers.
- Feed-Forward Neural Networks: Apply a position-wise feed-forward network to each position independently, further transforming each token's representation.
Let's explore each of these components in detail and understand how they work together to create powerful language models.
Bazzite-AI Setup Required
Run D0_00_Bazzite_AI_Setup.ipynb first to verify GPU access.
Self-Attention Mechanism¶
The self-attention mechanism is the core component of the Transformer architecture. It allows the model to dynamically assign different levels of importance to different words in a sentence when encoding a particular word.
How Self-Attention Works:¶
- Input Embeddings: Before we can apply self-attention, the input words must first be converted into embeddings (dense vector representations).
- Query, Key, and Value Vectors: For each word, the model creates three vectors: a Query vector (Q), a Key vector (K), and a Value vector (V).
- Attention Scores: The attention score for each word is computed as the dot product of its Query vector with the Key vectors of all words, scaled by the square root of the key dimension. These scores determine how much focus to place on the other words.
- Softmax Normalization: The scaled scores are passed through a softmax function so that, for each word, the scores over all words become weights that sum to 1.
- Weighted Sum: Each word's output representation is computed as a weighted sum of the Value vectors, using these softmax weights (the whole procedure is sketched as a small helper function right after this list).
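Putting those steps together, the attention computation fits in a few lines. The helper below is a minimal sketch of our own (the function name and arguments are illustrative, not a library API); the walkthrough in the next section performs the same steps one at a time.
import math
import torch
import torch.nn.functional as F

def attention(queries, keys, values):
    # Dot product of every query with every key, scaled by the square root of the key dimension
    scores = (queries @ keys.transpose(-2, -1)) / math.sqrt(keys.shape[-1])
    # Softmax turns each row of scaled scores into weights that sum to 1
    weights = F.softmax(scores, dim=-1)
    # Each output representation is a weighted sum of the value vectors
    return weights @ values, weights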
Visualizing Self-Attention¶
Let's visualize how the self-attention mechanism works for a simple sentence.
# Import required libraries
import torch
import torch.nn.functional as F
import math
# Example sentence and tokens
sentence = "Transformers are revolutionary in NLP."
tokens = ["Transformers", "are", "revolutionary", "in", "NLP"]
# Embedding dimension
embedding_dim = 8
# Random input embeddings (for illustration purposes)
torch.manual_seed(42)
input_embeddings = torch.randn(len(tokens), embedding_dim)
# In this example we end up with a 5x8 matrix
input_embeddings
tensor([[ 1.9269, 1.4873, 0.9007, -2.1055, 0.6784, -1.2345, -0.0431, -1.6047],
[-0.7521, 1.6487, -0.3925, -1.4036, -0.7279, -0.5594, -0.7688, 0.7624],
[ 1.6423, -0.1596, -0.4974, 0.4396, -0.7581, 1.0783, 0.8008, 1.6806],
[ 0.0349, 0.3211, 1.5736, -0.8455, 1.3123, 0.6872, -1.0892, -0.3553],
[-1.4181, 0.8963, 0.0499, 2.2667, 1.1790, -0.4345, -1.3864, -1.2862]])
# Initialize the Query, Key, and Value weight (projection) matrices
Q = torch.randn(embedding_dim, embedding_dim)
K = torch.randn(embedding_dim, embedding_dim)
V = torch.randn(embedding_dim, embedding_dim)
# Compute Query, Key, and Value vectors
queries = input_embeddings @ Q
keys = input_embeddings @ K
values = input_embeddings @ V
# Calculate attention scores using dot product of queries and keys
attention_scores = queries @ keys.T
# Scale by sqrt(d) and apply softmax to normalize the scores into attention weights
attention_weights = F.softmax(attention_scores / math.sqrt(embedding_dim), dim=-1)
# Compute the weighted sum of values
output = attention_weights @ values
# Display the attention weights
print("Attention Weights:\n", attention_weights)
Attention Weights:
tensor([[9.1079e-01, 4.6710e-03, 4.2964e-08, 6.3779e-02, 2.0756e-02],
[5.4914e-05, 7.8533e-02, 3.1973e-10, 5.2326e-02, 8.6909e-01],
[1.0379e-02, 9.8962e-01, 7.4063e-10, 8.1366e-09, 2.0209e-14],
[7.0323e-01, 9.9272e-02, 7.0980e-08, 1.2708e-01, 7.0416e-02],
[4.7277e-11, 7.3541e-17, 5.1235e-01, 1.4105e-05, 4.8764e-01]])
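Because of the softmax, each row of the attention weights forms a probability distribution over the five tokens. A quick sanity check (our own addition, not part of the original walkthrough):
# Each row of the attention weights should sum to 1
print("Row sums:", attention_weights.sum(dim=-1))
print("Rows sum to 1:", torch.allclose(attention_weights.sum(dim=-1), torch.ones(len(tokens))))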
Multi-Head Attention¶
The multi-head attention mechanism allows the model to focus on different parts of the input simultaneously. Instead of having a single attention mechanism, the model uses multiple attention "heads" in parallel. Each head can learn different aspects of the input.
How Multi-Head Attention Works:¶
- The input is projected into multiple sets of Query, Key, and Value vectors.
- Each set of vectors is processed independently through a self-attention mechanism.
- The outputs from each head are concatenated and projected back into a single vector space.
This approach provides the model with a richer understanding of the input by capturing different types of relationships between words.
Example: Multi-Head Attention¶
Let's compute multi-head attention with two attention heads and inspect the output.
# Number of attention heads
num_heads = 2
# Initialize weight matrices for each head
Q_heads = [torch.randn(embedding_dim, embedding_dim) for _ in range(num_heads)]
K_heads = [torch.randn(embedding_dim, embedding_dim) for _ in range(num_heads)]
V_heads = [torch.randn(embedding_dim, embedding_dim) for _ in range(num_heads)]
# Compute outputs for each head
head_outputs = []
for i in range(num_heads):
    queries = input_embeddings @ Q_heads[i]
    keys = input_embeddings @ K_heads[i]
    values = input_embeddings @ V_heads[i]
    # Calculate attention scores and apply softmax
    attention_scores = queries @ keys.T
    attention_weights = F.softmax(attention_scores / math.sqrt(embedding_dim), dim=-1)
    # Compute the weighted sum of values
    output = attention_weights @ values
    head_outputs.append(output)
# Concatenate outputs from all heads
multi_head_output = torch.cat(head_outputs, dim=-1)
# Display multi-head attention output
print("Multi-Head Attention Output:\n", multi_head_output)
Multi-Head Attention Output:
tensor([[-1.9480, -0.9693, 1.8384, -0.4820, -4.0619, -0.5366, 0.5428, -5.0823,
-2.2401, 0.1107, -0.2224, -4.6833, 2.7012, -0.7170, -3.8741, 4.0510],
[-2.7226, 4.1570, 3.3591, 0.2399, -3.3297, -0.8993, 1.6060, -5.8826,
-0.5336, -0.5165, -0.5723, 1.5201, -1.9711, 4.3885, -2.2671, 0.1774],
[-2.6901, 3.5823, -1.4147, -0.0957, 4.6669, 5.1801, 4.0165, 4.3587,
2.5703, -0.4536, 0.4015, 3.6330, -2.1548, 3.2844, 1.0156, -4.4502],
[-1.9242, 2.0470, 2.4765, -2.4018, -1.3216, 0.0932, -0.7773, 3.1542,
5.0544, 0.6194, 3.0915, 6.3284, -1.5117, 2.3767, -1.4484, -5.5771],
[-0.8583, -5.6953, 0.2093, 5.0692, -2.0687, -4.6651, -8.9582, -3.1550,
5.0470, 0.6191, 3.0886, 6.3169, -1.5068, 2.3731, -1.4518, -5.5668]])
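For simplicity, the example above keeps every head at the full embedding width and omits the final output projection that maps the concatenated heads back to the model dimension. PyTorch's torch.nn.MultiheadAttention handles the head splitting and the output projection internally; the snippet below is a minimal, untrained sketch (random initial weights, so the numbers themselves are not meaningful).
# Built-in multi-head self-attention: splits embed_dim across heads and applies the output projection
mha = torch.nn.MultiheadAttention(embed_dim=embedding_dim, num_heads=num_heads, batch_first=True)
x = input_embeddings.unsqueeze(0)              # add a batch dimension: (1, 5, 8)
attn_output, attn_weights = mha(x, x, x)       # self-attention: query, key, and value are all x
print("Output shape:", attn_output.shape)      # (1, 5, 8)
print("Attention weights shape:", attn_weights.shape)  # (1, 5, 5), averaged over heads by default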
Feed-Forward Neural Networks¶
Each position's output from the multi-head attention mechanism is passed through a position-wise feed-forward network. This consists of two linear transformations with a ReLU activation in between, applied to each position independently.
Example: Feed-Forward Network¶
Let's implement a simple feed-forward network.
# Define a position-wise feed-forward network
class FeedForwardNN(torch.nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.linear1 = torch.nn.Linear(input_dim, hidden_dim)
        self.relu = torch.nn.ReLU()
        self.linear2 = torch.nn.Linear(hidden_dim, input_dim)

    def forward(self, x):
        # Two linear transformations with a ReLU in between, applied to each position independently
        return self.linear2(self.relu(self.linear1(x)))
# Instantiate and apply the feed-forward network
ffn = FeedForwardNN(input_dim=embedding_dim * num_heads, hidden_dim=32)
ffn_output = ffn(multi_head_output)
# Display the feed-forward network output
print("Feed-Forward Network Output:\n", ffn_output)
Feed-Forward Network Output:
tensor([[-1.2893, 1.9233, -1.0605, 0.6202, 0.1357, -0.5937, 0.8868, -0.2316,
-0.4667, -0.2131, 0.3348, -0.0262, -0.1767, 0.6221, 0.0415, -0.3378],
[-1.2682, 1.5913, -0.8217, 0.3472, -0.4825, -1.0574, 1.2204, -0.3321,
-0.0194, -0.3422, 0.7382, -0.3878, -0.1983, 0.1369, -0.0458, 0.4435],
[ 0.3856, -0.3332, -0.1363, -0.0926, 1.2902, 0.3517, 0.1051, -1.2283,
-1.2062, -1.1978, 1.9378, -0.0443, 0.3887, -0.2704, -0.5029, 1.6625],
[-0.0733, 0.7897, -0.6835, -0.4999, 1.0672, 0.1900, 0.4367, -1.2058,
-0.2689, -0.0920, 0.7118, 0.0686, 1.1384, -0.0678, 0.4759, 1.3752],
[-0.3010, 2.2048, 0.3056, 0.1502, -0.0891, -0.3435, -0.2976, -1.4355,
0.6019, 2.2246, 1.3221, 0.4200, 0.6200, -0.1875, 1.0654, 0.8433]],
grad_fn=<AddmmBackward0>)
Layer Normalization and Residual Connections¶
Layer normalization is used to stabilize training by normalizing the inputs to each layer. Residual connections help maintain gradient flow through the network, enabling deeper architectures.
Example: Adding Layer Normalization and Residual Connections¶
Let's see how these components are added to the Transformer block.
# Define Layer Normalization
layer_norm = torch.nn.LayerNorm(embedding_dim * num_heads)
# Add residual connection and apply layer normalization
residual_output = layer_norm(multi_head_output + ffn_output)
# Display the final output with residual connection
print("Output with Residual Connections:\n", residual_output)
Output with Residual Connections:
tensor([[-0.8733, 0.7359, 0.6683, 0.4227, -1.1378, -0.0643, 0.9185, -1.6705,
-0.6696, 0.3303, 0.4128, -1.4385, 1.3389, 0.3332, -1.1018, 1.7952],
[-1.2147, 1.9334, 0.8954, 0.2650, -1.1570, -0.5572, 0.9889, -1.9336,
-0.1035, -0.2023, 0.1289, 0.4413, -0.6260, 1.5380, -0.6724, 0.2759],
[-1.3421, 0.6645, -1.0699, -0.5775, 1.6430, 1.4893, 0.9798, 0.6216,
-0.0166, -1.1062, 0.3358, 0.7872, -1.1476, 0.5796, -0.3242, -1.5168],
[-1.0462, 0.7284, 0.3452, -1.3782, -0.4064, -0.2090, -0.4380, 0.4023,
1.4437, -0.1194, 1.0831, 2.0353, -0.4500, 0.5346, -0.6700, -1.8554],
[-0.2054, -0.7356, 0.1754, 1.2453, -0.4325, -1.0808, -2.0468, -0.9857,
1.3430, 0.7050, 1.0614, 1.5904, -0.1434, 0.5553, -0.0296, -1.0160]],
grad_fn=<NativeLayerNormBackward0>)
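Note that the example above applies a single residual connection and normalization for brevity. In the original Transformer encoder block there are two sub-layers, multi-head attention and the feed-forward network, and each is wrapped in its own residual connection followed by layer normalization. The sketch below illustrates that "post-norm" pattern using the toy shapes from this notebook; the attention module and variable names are our own illustrative choices, not the notebook's earlier objects.
# Schematic post-norm encoder block: residual + layer norm around each of the two sub-layers
d_model = embedding_dim * num_heads                      # 16 in this toy setup
attn = torch.nn.MultiheadAttention(embed_dim=d_model, num_heads=num_heads, batch_first=True)
ffn_block = FeedForwardNN(input_dim=d_model, hidden_dim=32)
norm1 = torch.nn.LayerNorm(d_model)
norm2 = torch.nn.LayerNorm(d_model)

x = multi_head_output.unsqueeze(0)                       # treat the earlier output as the block input: (1, 5, 16)
attn_out, _ = attn(x, x, x)                              # attention sub-layer
h = norm1(x + attn_out)                                  # residual connection + layer norm
block_out = norm2(h + ffn_block(h))                      # feed-forward sub-layer, then residual + norm
print("Encoder block output shape:", block_out.shape)    # (1, 5, 16)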
Positional Encoding¶
Since the Transformer does not inherently capture the order of words, positional encoding is added to provide the model with information about the relative position of words in a sentence.
Example: Implementing Positional Encoding¶
Let's implement positional encoding for a sequence of words.
import numpy as np

def positional_encoding(seq_len, model_dim):
    # Sinusoidal positional encoding: sine on even dimensions, cosine on odd dimensions.
    # Note: i steps over the even dimension indices here, so the exponent 2 * i / model_dim
    # decays faster than the paper's 2i/d_model (where i indexes sine/cosine pairs);
    # the result is still a valid sinusoidal encoding for illustration purposes.
    pos_enc = np.zeros((seq_len, model_dim))
    for pos in range(seq_len):
        for i in range(0, model_dim, 2):
            pos_enc[pos, i] = np.sin(pos / (10000 ** (2 * i / model_dim)))
            pos_enc[pos, i + 1] = np.cos(pos / (10000 ** (2 * i / model_dim)))
    return torch.tensor(pos_enc, dtype=torch.float)
# Random input embeddings (for illustration purposes)
torch.manual_seed(42)
input_embeddings = torch.randn(len(tokens), embedding_dim)
# Apply positional encoding
position_encodings = positional_encoding(len(tokens), embedding_dim)
# Add positional encoding to input embeddings
encoded_input = input_embeddings + position_encodings
print("Positional Encodings:\n", position_encodings)
print("Encoded Input with Positional Information:\n", encoded_input)
Positional Encodings:
tensor([[ 0.0000e+00, 1.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
1.0000e+00, 0.0000e+00, 1.0000e+00],
[ 8.4147e-01, 5.4030e-01, 9.9998e-03, 9.9995e-01, 1.0000e-04,
1.0000e+00, 1.0000e-06, 1.0000e+00],
[ 9.0930e-01, -4.1615e-01, 1.9999e-02, 9.9980e-01, 2.0000e-04,
1.0000e+00, 2.0000e-06, 1.0000e+00],
[ 1.4112e-01, -9.8999e-01, 2.9996e-02, 9.9955e-01, 3.0000e-04,
1.0000e+00, 3.0000e-06, 1.0000e+00],
[-7.5680e-01, -6.5364e-01, 3.9989e-02, 9.9920e-01, 4.0000e-04,
1.0000e+00, 4.0000e-06, 1.0000e+00]])
Encoded Input with Positional Information:
tensor([[ 1.9269, 2.4873, 0.9007, -1.1055, 0.6784, -0.2345, -0.0431, -0.6047],
[ 0.0893, 2.1890, -0.3825, -0.4037, -0.7278, 0.4406, -0.7688, 1.7624],
[ 2.5516, -0.5757, -0.4774, 1.4394, -0.7579, 2.0783, 0.8008, 2.6806],
[ 0.1760, -0.6689, 1.6036, 0.1541, 1.3126, 1.6872, -1.0892, 0.6447],
[-2.1749, 0.2426, 0.0899, 3.2659, 1.1794, 0.5655, -1.3864, -0.2862]])
Conclusion¶
In this notebook, we explored the key components of the Transformer architecture, including self-attention, multi-head attention, feed-forward networks, layer normalization, and positional encoding. These components work together to form the basis of modern NLP models.
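As a closing illustration, PyTorch bundles all of these components (multi-head attention, the position-wise feed-forward network, residual connections, and layer normalization) into torch.nn.TransformerEncoderLayer. The snippet below is a minimal, untrained sketch applied to the positionally encoded toy input from above, just to show the pieces fitting together; the hyperparameters are arbitrary.
# One complete (untrained) Transformer encoder layer applied to the toy encoded input
encoder_layer = torch.nn.TransformerEncoderLayer(
    d_model=embedding_dim, nhead=2, dim_feedforward=32, batch_first=True
)
layer_output = encoder_layer(encoded_input.unsqueeze(0))   # (1, 5, 8)
print("Encoder layer output shape:", layer_output.shape)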
# Shut down the kernel to release memory
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
{'status': 'ok', 'restart': False}