RAG Introduction with Ollama OpenAI API¶
This notebook demonstrates building a RAG (Retrieval-Augmented Generation) application from scratch using:
- Ollama as the LLM backend (OpenAI-compatible API)
- llama3.2 for both embeddings and chat completions
- Pandas DataFrame as a simple vector database
How RAG Works¶
- Chunk the source document into smaller pieces
- Embed each chunk into a vector representation
- Store embeddings in a vector database
- Query: When a user asks a question:
  - Embed the question
  - Find the most similar chunks (cosine similarity)
  - Include those chunks as context in the LLM prompt
  - Generate a response using the LLM with the retrieved context
Sample Document¶
We use a sample excerpt about COVID-19 variants to demonstrate RAG capabilities.
The notebook will automatically pull required models if they're not already available.
Bazzite-AI Setup Required
Run D0_00_Bazzite_AI_Setup.ipynb first to configure Ollama, pull models, and verify GPU access.
1. Setup & Configuration¶
import os
import requests
import numpy as np
import pandas as pd
from textwrap import wrap
from math import sqrt
from openai import OpenAI
# === Configuration ===
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
# === Model Configuration ===
OLLAMA_EMBEDDING_MODEL = "llama3.2:latest"
OLLAMA_LLM_MODEL = "llama3.2:latest"
# Initialize OpenAI client pointing to Ollama
client = OpenAI(
base_url=f"{OLLAMA_HOST}/v1",
api_key="ollama" # Required by library but ignored by Ollama
)
print(f"Ollama host: {OLLAMA_HOST}")
print(f"Embedding model: {OLLAMA_EMBEDDING_MODEL}")
print(f"LLM model: {OLLAMA_LLM_MODEL}")
Ollama host: http://ollama:11434
Embedding model: llama3.2:latest
LLM model: llama3.2:latest
2. Verify Models¶
Models should already be pulled by D0_00. If you see errors below, run D0_00_Bazzite_AI_Setup.ipynb first.
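No verification code is included in this section; a minimal sketch could look like the following, assuming Ollama exposes the OpenAI-compatible /v1/models endpoint:
# Verification sketch (assumes Ollama serves the OpenAI-compatible
# /v1/models endpoint): list available models and check ours are present
available = {m.id for m in client.models.list().data}
for name in (OLLAMA_EMBEDDING_MODEL, OLLAMA_LLM_MODEL):
    marker = "✓" if name in available else "✗ missing (run D0_00 first)"
    print(f"{marker} {name}")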
3. Load and Chunk Document¶
For this demo, we use a sample excerpt about COVID-19 variants. The text is embedded directly in the notebook to make it self-contained.
The wrap function from textwrap splits text into chunks of a specified character length, breaking at word boundaries.
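For example, with a tiny width (an illustration only, not part of the pipeline):
from textwrap import wrap
# Greedily packs whole words up to the width limit
wrap("one two three four", 8)  # -> ['one two', 'three', 'four']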
# Sample document: COVID-19 Omicron variant information
# (Embedded for self-contained demo - based on scientific literature)
SAMPLE_TEXT = """
The Omicron variant of SARS-CoV-2, first identified in South Africa in November 2021,
rapidly spread across the globe and became the dominant variant in many countries by early 2022.
This variant exhibited significant mutations in the spike protein, raising concerns about
vaccine efficacy and therapeutic interventions.
In France, the emergence of Omicron led to a rapid replacement of the Delta variant during
the winter of 2021-2022. Epidemiological surveillance showed that Omicron cases doubled
approximately every two to three days during its initial spread, significantly faster than
previous variants.
The Omicron variant is characterized by approximately 30 mutations in the spike protein alone,
including mutations at positions K417N, N440K, G446S, S477N, T478K, E484A, Q493R, G496S,
Q498R, N501Y, and Y505H. Many of these mutations are located in the receptor-binding domain
(RBD), which is crucial for viral entry into host cells.
Studies in France demonstrated that while Omicron showed increased transmissibility compared
to Delta, it was associated with reduced severity of disease. Hospitalization rates and
intensive care unit admissions were lower per infection compared to the Delta wave, though
the sheer number of cases still strained healthcare systems.
The immune evasion properties of Omicron were substantial. Research showed reduced neutralization
by antibodies elicited by previous infection with earlier variants or by primary vaccination
series. However, booster doses significantly improved protection against severe disease.
Mathematical modeling of the Omicron invasion in France utilized multi-variant epidemiological
models to understand the dynamics of variant replacement. These models incorporated factors
such as cross-immunity between variants, vaccine coverage, and waning immunity over time.
The basic reproduction number (R0) of Omicron was estimated to be significantly higher than
Delta, with estimates ranging from 8 to 15 depending on the population and setting. This
high transmissibility was a key factor in its rapid global spread.
French public health authorities responded to the Omicron wave with enhanced testing capacity,
acceleration of booster vaccination campaigns, and implementation of sanitary passes requiring
up-to-date vaccination status for access to certain venues and activities.
Subsequent sub-lineages of Omicron, including BA.2, BA.4, BA.5, and later BQ and XBB variants,
continued to evolve with additional mutations conferring further immune evasion properties.
This ongoing evolution necessitated updates to vaccine formulations and continued surveillance.
The experience with Omicron in France and globally highlighted the importance of genomic
surveillance, rapid response capabilities, and adaptable public health strategies in managing
emerging variants of concern during a pandemic.
"""
# Chunk the text into smaller pieces for embedding
wrapped_text = wrap(SAMPLE_TEXT.strip(), 1000)
print(f"Document chunked into {len(wrapped_text)} pieces")
Document chunked into 3 pieces
The text is wrapped into chunks of maximum 1000 characters each. The wrap function breaks at word boundaries, so actual chunk sizes vary slightly.
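We can confirm this by inspecting the chunk lengths directly:
# Actual chunk lengths in characters (each at most 1000)
[len(chunk) for chunk in wrapped_text]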
len(wrapped_text)
3
4. Generate Embeddings¶
First, let's test embedding a single chunk using the OpenAI-compatible API:
# Test embedding a single chunk
response = client.embeddings.create(
model=OLLAMA_EMBEDDING_MODEL,
input=wrapped_text[0]
)
embedding = response.data[0].embedding
print(f"Text chunk (first 100 chars): {wrapped_text[0][:100]}...")
print(f"Embedding dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")
Text chunk (first 100 chars): The Omicron variant of SARS-CoV-2, first identified in South Africa in November 2021, rapidly sprea...
Embedding dimensions: 3072
First 5 values: [-0.005387044046074152, 0.0025414149276912212, -0.02499299868941307, -0.006784075405448675, -0.008193740621209145]
Now let's create embeddings for all text chunks. Each embedding is stored as a numpy array in a list.
# Generate embeddings for all chunks
embeddings = []
for text in wrapped_text:
response = client.embeddings.create(
model=OLLAMA_EMBEDDING_MODEL,
input=text
)
embedding = np.array(response.data[0].embedding)
embeddings.append(embedding)
print(f"Created {len(embeddings)} embeddings")
Created 3 embeddings
Each embedding vector contains multiple dimensions (the exact count depends on the model):
len(embeddings[0])
3072
The total number of embeddings matches our chunk count:
len(embeddings)
3
5. Create Vector Database¶
For simplicity, we use a Pandas DataFrame as our vector database. Each row contains a text chunk and its corresponding embedding vector.
vector_data_base = pd.DataFrame({
    'text': wrapped_text,
    'embeddings': embeddings
})
[No output generated]
vector_data_base.head()
                                                text  \
0  The Omicron variant of SARS-CoV-2, first ident...
1  Omicron showed increased transmissibility comp...
2  depending on the population and setting. This ...

                                          embeddings
0  [-0.005387044046074152, 0.0025414149276912212,...
1  [0.020577766001224518, -0.0041334908455610275,...
2  [-0.001117408974096179, -0.004059193190187216,...

vector_data_base['embeddings'][0]
array([-0.00538704,  0.00254141, -0.024993  , ..., -0.02018412,
       -0.02819293,  0.03002225], shape=(3072,))

6. Query Functions¶
To search the vector database, we use cosine distance to find the chunks most similar to a query embedding. Cosine distance is 1 minus the cosine similarity, i.e. 1 - (a·b)/(||a|| ||b||): a distance of 0 means the vectors point the same way, and 1 means they are orthogonal. A quick sanity check follows the definitions below.
def cosine_distance(a, b):
    """Cosine distance: 0 for identical directions, 1 for orthogonal vectors."""
    return float(1. - np.dot(a, b) / (sqrt(np.dot(a, a)) * sqrt(np.dot(b, b))))

def best_answers(n, query, database):
    """Return the n text chunks from `database` closest to the query embedding."""
    distances = []
    for i in range(len(database)):
        distances.append(cosine_distance(database['embeddings'][i], query))
    local_db = database.copy()
    local_db['distances'] = distances
    local_db = local_db.nsmallest(n, 'distances')
    return list(local_db['text'])
[No output generated]
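As a sanity check, here is cosine_distance applied to hypothetical toy vectors: parallel vectors give 0, orthogonal vectors give 1.
# Toy vectors (hypothetical): parallel -> 0.0, orthogonal -> 1.0
print(cosine_distance(np.array([1., 0.]), np.array([2., 0.])))  # 0.0
print(cosine_distance(np.array([1., 0.]), np.array([0., 1.])))  # 1.0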
7. Query Encoding¶
To search for relevant context, we need to encode user questions into embeddings. For long queries, we chunk them, embed each chunk, and combine by averaging and normalizing.
def encode_into_single_embedding(intext):
"""Encode text into a single embedding vector using Ollama."""
embedded_chunks = []
wrapped = wrap(intext, 1000)
for text in wrapped:
response = client.embeddings.create(
model=OLLAMA_EMBEDDING_MODEL,
input=text
)
embedding = np.array(response.data[0].embedding)
embedded_chunks.append(embedding)
# If only one chunk, return it directly
if len(embedded_chunks) == 1:
return embedded_chunks[0]
# Combine multiple embeddings by averaging
combined = np.mean(embedded_chunks, axis=0)
# Normalize the combined embedding
norm = np.linalg.norm(combined)
if norm > 0:
combined = combined / norm
return combined
[No output generated]
# Test query encoding
encoded = encode_into_single_embedding("What do you know about the Omicron variant?")
print(f"Query embedding shape: {len(encoded)} dimensions")
Query embedding shape: 3072 dimensions
# Find the most relevant chunks
relevant_chunks = best_answers(3, encoded, vector_data_base)
print("Top 3 relevant chunks:")
for i, chunk in enumerate(relevant_chunks):
print(f"\n--- Chunk {i+1} ---")
print(chunk[:200] + "...")
Top 3 relevant chunks:

--- Chunk 1 ---
depending on the population and setting. This high transmissibility was a key factor in its rapid global spread. French public health authorities responded to the Omicron wave with enhanced testing ...

--- Chunk 2 ---
The Omicron variant of SARS-CoV-2, first identified in South Africa in November 2021, rapidly spread across the globe and became the dominant variant in many countries by early 2022. This variant ex...

--- Chunk 3 ---
Omicron showed increased transmissibility compared to Delta, it was associated with reduced severity of disease. Hospitalization rates and intensive care unit admissions were lower per infection com...
8. RAG Chat System¶
With the OpenAI-compatible API, we don't need to manually construct prompt tokens. The API handles the chat template automatically through the messages format.
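For instance, a bare chat completion with no retrieval looks like this (a minimal illustration of the messages format; the prompt text is made up):
# Minimal chat-completions call, no retrieval, just the messages format
demo = client.chat.completions.create(
    model=OLLAMA_LLM_MODEL,
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Say hello in five words."},
    ],
    max_tokens=20,
)
print(demo.choices[0].message.content)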
# System prompt for RAG
SYSTEM_PROMPT = """You are a helpful AI assistant. Answer questions based on the provided context.
If the answer is not in the context, say so clearly. Be concise but thorough."""
# Conversation history for multi-turn chat
conversation_history = []
[No output generated]
The build_messages function constructs the messages list with retrieved context for the LLM.
def build_messages(context_chunks, user_query, history=None):
"""Build messages list for chat completion with RAG context."""
context = "\n\n---\n\n".join(context_chunks)
system_content = f"""{SYSTEM_PROMPT}
## Retrieved Context:
{context}
"""
messages = [{"role": "system", "content": system_content}]
# Add conversation history if provided
if history:
messages.extend(history)
# Add current user query
messages.append({"role": "user", "content": user_query})
return messages
[No output generated]
The get_llm_response function calls the LLM using the OpenAI-compatible chat completions API.
def get_llm_response(messages, max_tokens=500):
"""Get response from LLM via Ollama's OpenAI-compatible API."""
response = client.chat.completions.create(
model=OLLAMA_LLM_MODEL,
messages=messages,
max_tokens=max_tokens,
temperature=0.7
)
return response.choices[0].message.content
[No output generated]
The main chat function ties everything together: encode query → retrieve context → build messages → get response → update history.
def chat(user_query, n_docs=5):
"""RAG chat function with conversation memory."""
global conversation_history
# Encode query and retrieve relevant chunks
query_embedding = encode_into_single_embedding(user_query)
relevant_chunks = best_answers(n_docs, query_embedding, vector_data_base)
# Build messages with context and history
messages = build_messages(relevant_chunks, user_query, conversation_history)
# Get response from LLM
response = get_llm_response(messages)
# Update conversation history (without the system message)
conversation_history.append({"role": "user", "content": user_query})
conversation_history.append({"role": "assistant", "content": response})
print(response)
return response
[No output generated]
9. Try It Out!¶
Now let's chat with our RAG-enabled assistant. The conversation history is maintained across calls.
# First question about the Omicron variant
chat("What do you know about the Omicron variant in France?")
Based on the provided context, here's what I know about the Omicron variant in France:

1. **Rapid spread**: The Omicron variant rapidly spread across France and became the dominant variant by early 2022.
2. **High transmissibility**: The variant exhibited significant mutations in the spike protein, raising concerns about vaccine efficacy and therapeutic interventions. Omicron was associated with increased transmissibility compared to the Delta variant.
3. **Reduced severity of disease**: Despite its high transmissibility, Omicron showed reduced severity of disease in France, with lower hospitalization rates and intensive care unit admissions per infection compared to the Delta wave.
4. **Immune evasion properties**: The immune evasion properties of Omicron were substantial, with research showing reduced neutralization by antibodies elicited by previous infection with earlier variants or by primary vaccination series.
5. **Booster doses improved protection**: Booster doses significantly improved protection against severe disease caused by Omicron.
6. **Enhanced testing capacity and vaccination campaigns**: French public health authorities responded to the Omicron wave by enhancing testing capacity, accelerating booster vaccination campaigns, and implementing sanitary passes requiring up-to-date vaccination status for access to certain venues and activities.

Overall, France's experience with the Omicron variant highlighted the importance of genomic surveillance, rapid response capabilities, and adaptable public health strategies in managing emerging variants of concern during a pandemic.
"Based on the provided context, here's what I know about the Omicron variant in France:\n\n1. **Rapid spread**: The Omicron variant rapidly spread across France and became the dominant variant by early 2022.\n2. **High transmissibility**: The variant exhibited significant mutations in the spike protein, raising concerns about vaccine efficacy and therapeutic interventions. Omicron was associated with increased transmissibility compared to the Delta variant.\n3. **Reduced severity of disease**: Despite its high transmissibility, Omicron showed reduced severity of disease in France, with lower hospitalization rates and intensive care unit admissions per infection compared to the Delta wave.\n4. **Immune evasion properties**: The immune evasion properties of Omicron were substantial, with research showing reduced neutralization by antibodies elicited by previous infection with earlier variants or by primary vaccination series.\n5. **Booster doses improved protection**: Booster doses significantly improved protection against severe disease caused by Omicron.\n6. **Enhanced testing capacity and vaccination campaigns**: French public health authorities responded to the Omicron wave by enhancing testing capacity, accelerating booster vaccination campaigns, and implementing sanitary passes requiring up-to-date vaccination status for access to certain venues and activities.\n\nOverall, France's experience with the Omicron variant highlighted the importance of genomic surveillance, rapid response capabilities, and adaptable public health strategies in managing emerging variants of concern during a pandemic."
# Follow-up question (uses conversation history)
chat("What mutations does it have?")
According to the provided context, the Omicron variant has approximately 30 mutations in the spike protein alone, including:

1. K417N
2. N440K
3. G446S
4. S477N
5. T478K
6. E484A
7. Q493R
8. G496S
9. Q498R
10. N501Y
11. Y505H

These mutations are primarily located in the receptor-binding domain (RBD), which is crucial for viral entry into host cells.
# View conversation history
print("=== Conversation History ===")
for msg in conversation_history:
role = msg["role"].upper()
content = msg["content"][:200] + "..." if len(msg["content"]) > 200 else msg["content"]
print(f"\n[{role}]: {content}")
=== Conversation History ===

[USER]: What do you know about the Omicron variant in France?

[ASSISTANT]: Based on the provided context, here's what I know about the Omicron variant in France: 1. **Rapid spread**: The Omicron variant rapidly spread across France and became the dominant variant by early 2...

[USER]: What mutations does it have?

[ASSISTANT]: According to the provided context, the Omicron variant has approximately 30 mutations in the spike protein alone, including: 1. K417N 2. N440K 3. G446S 4. S477N 5. T478K 6. E484A 7. Q493R 8. G496S 9....
10. Utility Functions¶
def reset_conversation():
"""Reset conversation history to start fresh."""
global conversation_history
conversation_history = []
print("✓ Conversation history cleared")
# Uncomment to reset:
# reset_conversation()
[No output generated]
# Try another question!
# chat("How did France respond to the Omicron wave?")
[No output generated]
# === Unload Ollama Model & Shutdown Kernel ===
# Unloads the model from GPU memory before shutting down
try:
import ollama
print(f"Unloading Ollama model: {OLLAMA_LLM_MODEL}")
ollama.generate(model=OLLAMA_LLM_MODEL, prompt="", keep_alive=0)
print("Model unloaded from GPU memory")
except Exception as e:
print(f"Model unload skipped: {e}")
# Shut down the kernel to fully release resources
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)