RAG Introduction with Ollama OpenAI API¶
This notebook demonstrates building a RAG (Retrieval-Augmented Generation) application from scratch using:
- Ollama as the LLM backend (OpenAI-compatible API)
- llama3.2 for both embeddings and chat completions
- Pandas DataFrame as a simple vector database
How RAG Works¶
- Chunk the source document into smaller pieces
- Embed each chunk into a vector representation
- Store embeddings in a vector database
- Query: When a user asks a question:
  - Embed the question
  - Find the most similar chunks (cosine similarity)
  - Include those chunks as context in the LLM prompt
  - Generate a response using the LLM with the retrieved context
Sample Document¶
We use a sample excerpt about COVID-19 variants to demonstrate RAG capabilities.
The notebook will automatically pull required models if they're not already available.
Bazzite-AI Setup Required
Run D0_00_Bazzite_AI_Setup.ipynb first to configure Ollama, pull models, and verify GPU access.
1. Setup & Configuration¶
import os
import requests
import numpy as np
import pandas as pd
from textwrap import wrap
from math import sqrt
from openai import OpenAI
# === Configuration ===
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
# === Model Configuration ===
OLLAMA_EMBEDDING_MODEL = "llama3.2:latest"
OLLAMA_LLM_MODEL = "llama3.2:latest"
# Initialize OpenAI client pointing to Ollama
client = OpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama" # Required by library but ignored by Ollama
)
print(f"Ollama host: {OLLAMA_HOST}")
print(f"Embedding model: {OLLAMA_EMBEDDING_MODEL}")
print(f"LLM model: {OLLAMA_LLM_MODEL}")
Ollama host: http://ollama:11434
Embedding model: llama3.2:latest
LLM model: llama3.2:latest
2. Verify Models¶
Models should already be pulled by D0_00. If you see errors below, run D0_00_Bazzite_AI_Setup.ipynb first.
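No verification code is included in this section; a minimal sketch could look like the following, assuming Ollama exposes the OpenAI-compatible /v1/models endpoint:
# Verification sketch (assumes Ollama serves the OpenAI-compatible
# /v1/models endpoint): list available models and check ours are present
available = {m.id for m in client.models.list().data}
for name in (OLLAMA_EMBEDDING_MODEL, OLLAMA_LLM_MODEL):
    marker = "✓" if name in available else "✗ missing (run D0_00 first)"
    print(f"{marker} {name}")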
3. Load and Chunk Document¶
For this demo, we use a sample excerpt about COVID-19 variants. The text is embedded directly in the notebook to make it self-contained.
The wrap function from textwrap splits text into chunks of a specified character length, breaking at word boundaries.
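For example, with a tiny width (an illustration only, not part of the pipeline):
from textwrap import wrap
# Greedily packs whole words up to the width limit
wrap("one two three four", 8)  # -> ['one two', 'three', 'four']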
# Sample document: COVID-19 Omicron variant information
# (Embedded for self-contained demo - based on scientific literature)
SAMPLE_TEXT = """
The Omicron variant of SARS-CoV-2, first identified in South Africa in November 2021,
rapidly spread across the globe and became the dominant variant in many countries by early 2022.
This variant exhibited significant mutations in the spike protein, raising concerns about
vaccine efficacy and therapeutic interventions.
In France, the emergence of Omicron led to a rapid replacement of the Delta variant during
the winter of 2021-2022. Epidemiological surveillance showed that Omicron cases doubled
approximately every two to three days during its initial spread, significantly faster than
previous variants.
The Omicron variant is characterized by approximately 30 mutations in the spike protein alone,
including mutations at positions K417N, N440K, G446S, S477N, T478K, E484A, Q493R, G496S,
Q498R, N501Y, and Y505H. Many of these mutations are located in the receptor-binding domain
(RBD), which is crucial for viral entry into host cells.
Studies in France demonstrated that while Omicron showed increased transmissibility compared
to Delta, it was associated with reduced severity of disease. Hospitalization rates and
intensive care unit admissions were lower per infection compared to the Delta wave, though
the sheer number of cases still strained healthcare systems.
The immune evasion properties of Omicron were substantial. Research showed reduced neutralization
by antibodies elicited by previous infection with earlier variants or by primary vaccination
series. However, booster doses significantly improved protection against severe disease.
Mathematical modeling of the Omicron invasion in France utilized multi-variant epidemiological
models to understand the dynamics of variant replacement. These models incorporated factors
such as cross-immunity between variants, vaccine coverage, and waning immunity over time.
The basic reproduction number (R0) of Omicron was estimated to be significantly higher than
Delta, with estimates ranging from 8 to 15 depending on the population and setting. This
high transmissibility was a key factor in its rapid global spread.
French public health authorities responded to the Omicron wave with enhanced testing capacity,
acceleration of booster vaccination campaigns, and implementation of sanitary passes requiring
up-to-date vaccination status for access to certain venues and activities.
Subsequent sub-lineages of Omicron, including BA.2, BA.4, BA.5, and later BQ and XBB variants,
continued to evolve with additional mutations conferring further immune evasion properties.
This ongoing evolution necessitated updates to vaccine formulations and continued surveillance.
The experience with Omicron in France and globally highlighted the importance of genomic
surveillance, rapid response capabilities, and adaptable public health strategies in managing
emerging variants of concern during a pandemic.
"""
# Chunk the text into smaller pieces for embedding
wrapped_text = wrap(SAMPLE_TEXT.strip(), 1000)
print(f"Document chunked into {len(wrapped_text)} pieces")
Document chunked into 3 pieces
The text is wrapped into chunks of maximum 1000 characters each. The wrap function breaks at word boundaries, so actual chunk sizes vary slightly.
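We can confirm this by inspecting the chunk lengths directly:
# Actual chunk lengths in characters (each at most 1000)
[len(chunk) for chunk in wrapped_text]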
len(wrapped_text)
3
4. Generate Embeddings¶
First, let's test embedding a single chunk using the OpenAI-compatible API:
# Test embedding a single chunk
response = client.embeddings.create(
model=OLLAMA_EMBEDDING_MODEL,
input=wrapped_text[0]
)
embedding = response.data[0].embedding
print(f"Text chunk (first 100 chars): {wrapped_text[0][:100]}...")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
Text chunk (first 100 chars): The Omicron variant of SARS-CoV-2, first identified in South Africa in November 2021, rapidly sprea...
Embedding dimensions: 3072
First 5 values: [-0.005387044046074152, 0.0025414149276912212, -0.02499299868941307, -0.006784075405448675, -0.008193740621209145]
Now let's create embeddings for all text chunks. Each embedding is stored as a numpy array in a list.
# Generate embeddings for all chunks
embeddings = []
for text in wrapped_text:
response = client.embeddings.create(
model=OLLAMA_EMBEDDING_MODEL,
input=text
)
embedding = np.array(response.data[0].embedding)
embeddings.append(embedding)
print(f"Created {len(embeddings)} embeddings")
Created 3 embeddings
Each embedding vector contains multiple dimensions (the exact count depends on the model):
len(embeddings[0])
3072
The total number of embeddings matches our chunk count:
len(embeddings)
3
5. Create Vector Database¶
For simplicity, we use a Pandas DataFrame as our vector database. Each row contains a text chunk and its corresponding embedding vector.
vector_data_base = pd.DataFrame({
    'text': wrapped_text,
    'embeddings': embeddings
})
[No output generated]
vector_data_base.head()
                                                text  \
0  The Omicron variant of SARS-CoV-2, first ident...
1  Omicron showed increased transmissibility comp...
2  depending on the population and setting. This ...

                                          embeddings
0  [-0.005387044046074152, 0.0025414149276912212,...
1  [0.020577766001224518, -0.0041334908455610275,...
2  [-0.001117408974096179, -0.004059193190187216,...

vector_data_base['embeddings'][0]
array([-0.00538704,  0.00254141, -0.024993  , ..., -0.02018412,
       -0.02819293,  0.03002225], shape=(3072,))

6. Query Functions¶
To search the vector database, we use cosine distance to find the chunks most similar to a query embedding. Cosine distance is 1 minus the cosine similarity, i.e. 1 - (a·b)/(||a|| ||b||): a distance of 0 means the vectors point the same way, and 1 means they are orthogonal. A quick sanity check follows the definitions below.
def cosine_distance(a, b):
    """Cosine distance: 0 for identical directions, 1 for orthogonal vectors."""
    return float(1. - np.dot(a, b) / (sqrt(np.dot(a, a)) * sqrt(np.dot(b, b))))

def best_answers(n, query, database):
    """Return the n text chunks from `database` closest to the query embedding."""
    distances = []
    for i in range(len(database)):
        distances.append(cosine_distance(database['embeddings'][i], query))
    local_db = database.copy()
    local_db['distances'] = distances
    local_db = local_db.nsmallest(n, 'distances')
    return list(local_db['text'])
[No output generated]
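As a sanity check, here is cosine_distance applied to hypothetical toy vectors: parallel vectors give 0, orthogonal vectors give 1.
# Toy vectors (hypothetical): parallel -> 0.0, orthogonal -> 1.0
print(cosine_distance(np.array([1., 0.]), np.array([2., 0.])))  # 0.0
print(cosine_distance(np.array([1., 0.]), np.array([0., 1.])))  # 1.0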
7. Query Encoding¶
To search for relevant context, we need to encode user questions into embeddings. For long queries, we chunk them, embed each chunk, and combine by averaging and normalizing.
def encode_into_single_embedding(intext):
"""Encode text into a single embedding vector using Ollama."""
embedded_chunks = []
wrapped = wrap(intext, 1000)
for text in wrapped:
response = client.embeddings.create(
model=OLLAMA_EMBEDDING_MODEL,
input=text
)
embedding = np.array(response.data[0].embedding)
embedded_chunks.append(embedding)
# If only one chunk, return it directly
if len(embedded_chunks) == 1:
return embedded_chunks[0]
# Combine multiple embeddings by averaging
combined = np.mean(embedded_chunks, axis=0)
# Normalize the combined embedding
norm = np.linalg.norm(combined)
if norm > 0:
combined = combined / norm
return combined
[No output generated]
# Test query encoding
encoded = encode_into_single_embedding("What do you know about the Omicron variant?")
print(f"Query embedding shape: {len(encoded)} dimensions")
Query embedding shape: 3072 dimensions
# Find the most relevant chunks
relevant_chunks = best_answers(3, encoded, vector_data_base)
print("Top 3 relevant chunks:")
for i, chunk in enumerate(relevant_chunks):
print(f"\n--- Chunk {i+1} ---")
print(chunk[:200] + "...")
Top 3 relevant chunks:

--- Chunk 1 ---
depending on the population and setting. This high transmissibility was a key factor in its rapid global spread. French public health authorities responded to the Omicron wave with enhanced testing ...

--- Chunk 2 ---
The Omicron variant of SARS-CoV-2, first identified in South Africa in November 2021, rapidly spread across the globe and became the dominant variant in many countries by early 2022. This variant ex...

--- Chunk 3 ---
Omicron showed increased transmissibility compared to Delta, it was associated with reduced severity of disease. Hospitalization rates and intensive care unit admissions were lower per infection com...
8. RAG Chat System¶
With the OpenAI-compatible API, we don't need to manually construct prompt tokens. The API handles the chat template automatically through the messages format.
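For instance, a bare chat completion with no retrieval looks like this (a minimal illustration of the messages format; the prompt text is made up):
# Minimal chat-completions call, no retrieval, just the messages format
demo = client.chat.completions.create(
    model=OLLAMA_LLM_MODEL,
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Say hello in five words."},
    ],
    max_tokens=20,
)
print(demo.choices[0].message.content)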
# System prompt for RAG
SYSTEM_PROMPT = """You are a helpful AI assistant. Answer questions based on the provided context.
If the answer is not in the context, say so clearly. Be concise but thorough."""
# Conversation history for multi-turn chat
conversation_history = []
[No output generated]
The build_messages function constructs the messages list with retrieved context for the LLM.
def build_messages(context_chunks, user_query, history=None):
"""Build messages list for chat completion with RAG context."""
context = "\n\n---\n\n".join(context_chunks)
system_content = f"""{SYSTEM_PROMPT}
## Retrieved Context:
{context}
"""
messages = [{"role": "system", "content": system_content}]
# Add conversation history if provided
if history:
messages.extend(history)
# Add current user query
messages.append({"role": "user", "content": user_query})
return messages
[No output generated]
The get_llm_response function calls the LLM using the OpenAI-compatible chat completions API.
def get_llm_response(messages, max_tokens=500):
"""Get response from LLM via Ollama's OpenAI-compatible API."""
response = client.chat.completions.create(
model=OLLAMA_LLM_MODEL,
messages=messages,
max_tokens=max_tokens,
temperature=0.7
)
return response.choices[0].message.content
[No output generated]
The main chat function ties everything together: encode query → retrieve context → build messages → get response → update history.
def chat(user_query, n_docs=5):
"""RAG chat function with conversation memory."""
global conversation_history
# Encode query and retrieve relevant chunks
query_embedding = encode_into_single_embedding(user_query)
relevant_chunks = best_answers(n_docs, query_embedding, vector_data_base)
# Build messages with context and history
messages = build_messages(relevant_chunks, user_query, conversation_history)
# Get response from LLM
response = get_llm_response(messages)
# Update conversation history (without the system message)
conversation_history.append({"role": "user", "content": user_query})
conversation_history.append({"role": "assistant", "content": response})
print(response)
return response
[No output generated]
9. Try It Out!¶
Now let's chat with our RAG-enabled assistant. The conversation history is maintained across calls.
# First question about the Omicron variant
chat("What do you know about the Omicron variant in France?")
Based on the provided context, here's what I know about the Omicron variant in France:

1. **Rapid spread**: The Omicron variant rapidly spread across France and became the dominant variant by early 2022.
2. **High transmissibility**: The variant exhibited significant mutations in the spike protein, raising concerns about vaccine efficacy and therapeutic interventions. Omicron was associated with increased transmissibility compared to the Delta variant.
3. **Reduced severity of disease**: Despite its high transmissibility, Omicron showed reduced severity of disease in France, with lower hospitalization rates and intensive care unit admissions per infection compared to the Delta wave.
4. **Immune evasion properties**: The immune evasion properties of Omicron were substantial, with research showing reduced neutralization by antibodies elicited by previous infection with earlier variants or by primary vaccination series.
5. **Booster doses improved protection**: Booster doses significantly improved protection against severe disease caused by Omicron.
6. **Enhanced testing capacity and vaccination campaigns**: French public health authorities responded to the Omicron wave by enhancing testing capacity, accelerating booster vaccination campaigns, and implementing sanitary passes requiring up-to-date vaccination status for access to certain venues and activities.

Overall, France's experience with the Omicron variant highlighted the importance of genomic surveillance, rapid response capabilities, and adaptable public health strategies in managing emerging variants of concern during a pandemic.
"Based on the provided context, here's what I know about the Omicron variant in France:\n\n1. **Rapid spread**: The Omicron variant rapidly spread across France and became the dominant variant by early 2022.\n2. **High transmissibility**: The variant exhibited significant mutations in the spike protein, raising concerns about vaccine efficacy and therapeutic interventions. Omicron was associated with increased transmissibility compared to the Delta variant.\n3. **Reduced severity of disease**: Despite its high transmissibility, Omicron showed reduced severity of disease in France, with lower hospitalization rates and intensive care unit admissions per infection compared to the Delta wave.\n4. **Immune evasion properties**: The immune evasion properties of Omicron were substantial, with research showing reduced neutralization by antibodies elicited by previous infection with earlier variants or by primary vaccination series.\n5. **Booster doses improved protection**: Booster doses significantly improved protection against severe disease caused by Omicron.\n6. **Enhanced testing capacity and vaccination campaigns**: French public health authorities responded to the Omicron wave by enhancing testing capacity, accelerating booster vaccination campaigns, and implementing sanitary passes requiring up-to-date vaccination status for access to certain venues and activities.\n\nOverall, France's experience with the Omicron variant highlighted the importance of genomic surveillance, rapid response capabilities, and adaptable public health strategies in managing emerging variants of concern during a pandemic."
# Follow-up question (uses conversation history)
chat("What mutations does it have?")
According to the provided context, the Omicron variant has approximately 30 mutations in the spike protein alone, including:

1. K417N
2. N440K
3. G446S
4. S477N
5. T478K
6. E484A
7. Q493R
8. G496S
9. Q498R
10. N501Y
11. Y505H

These mutations are primarily located in the receptor-binding domain (RBD), which is crucial for viral entry into host cells.
# View conversation history
print("=== Conversation History ===")
for msg in conversation_history:
role = msg["role"].upper()
content = msg["content"][:200] + "..." if len(msg["content"]) > 200 else msg["content"]
print(f"\n[{role}]: {content}")
=== Conversation History ===

[USER]: What do you know about the Omicron variant in France?

[ASSISTANT]: Based on the provided context, here's what I know about the Omicron variant in France: 1. **Rapid spread**: The Omicron variant rapidly spread across France and became the dominant variant by early 2...

[USER]: What mutations does it have?

[ASSISTANT]: According to the provided context, the Omicron variant has approximately 30 mutations in the spike protein alone, including: 1. K417N 2. N440K 3. G446S 4. S477N 5. T478K 6. E484A 7. Q493R 8. G496S 9....
10. Utility Functions¶
def reset_conversation():
"""Reset conversation history to start fresh."""
global conversation_history
conversation_history = []
print("✓ Conversation history cleared")
# Uncomment to reset:
# reset_conversation()
[No output generated]
# Try another question!
# chat("How did France respond to the Omicron wave?")
[No output generated]
# === Unload Ollama Model & Shutdown Kernel ===
# Unloads the model from GPU memory before shutting down
try:
import ollama
print(f"Unloading Ollama model: {OLLAMA_LLM_MODEL}")
ollama.generate(model=OLLAMA_LLM_MODEL, prompt="", keep_alive=0)
print("Model unloaded from GPU memory")
except Exception as e:
print(f"Model unload skipped: {e}")
# Shut down the kernel to fully release resources
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)