Ollama OpenAI Compatibility

Overview

Ollama exposes an OpenAI-compatible API under its /v1/* endpoints, so the official openai Python library can be pointed at a local Ollama server. This enables:

  • Migration - Drop-in replacement for the OpenAI API
  • Tool ecosystem - Works with LangChain, LlamaIndex, and other OpenAI-based tools
  • Familiar interface - Standard OpenAI client patterns

Quick Reference

Endpoint              Method  Purpose
/v1/models            GET     List models
/v1/completions       POST    Text generation
/v1/chat/completions  POST    Chat completion
/v1/embeddings        POST    Generate embeddings
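
These endpoints can also be exercised over plain HTTP to confirm the server is reachable; a minimal sketch against /v1/models with requests (default localhost port assumed):

import requests

# List models through the OpenAI-compatible endpoint; the response follows
# the OpenAI list format: {"object": "list", "data": [{"id": ...}, ...]}
resp = requests.get("http://localhost:11434/v1/models", timeout=5)
resp.raise_for_status()
for m in resp.json().get("data", []):
    print(m.get("id"))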

Limitations

The OpenAI compatibility layer does not support:

  • Show model details (/api/show)
  • List running models (/api/ps)
  • Copy model (/api/copy)
  • Delete model (/api/delete)

Use bazzite-ai-jupyter:chat or bazzite-ai-jupyter:ollama for these operations.
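
If those operations are needed from the same environment anyway, the native REST endpoints can be called directly with requests; a minimal sketch, with request and response fields assumed to follow the native Ollama API:

import requests

OLLAMA_HOST = "http://localhost:11434"

# Show model details (not exposed through /v1/*)
show = requests.post(f"{OLLAMA_HOST}/api/show", json={"model": "llama3.2:latest"}, timeout=10)
print(show.json().get("details", {}))

# List running models (not exposed through /v1/*)
ps = requests.get(f"{OLLAMA_HOST}/api/ps", timeout=5)
for m in ps.json().get("models", []):
    print(m.get("name"))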

Setup

import os
from openai import OpenAI

OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")

client = OpenAI(
    base_url=f"{OLLAMA_HOST}/v1",
    api_key="ollama"  # Required by library but ignored by Ollama
)

List Models

models = client.models.list()

for model in models.data:
    print(f"  - {model.id}")

Text Completions

response = client.completions.create(
    model="llama3.2:latest",
    prompt="Why is the sky blue? Answer in one sentence.",
    max_tokens=100
)

print(response.choices[0].text)
print(f"Tokens used: {response.usage.completion_tokens}")

Chat Completion

Single Turn

response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain machine learning in one sentence."}
    ],
    temperature=0.7,
    max_tokens=100
)

print(response.choices[0].message.content)
print(f"Tokens used: {response.usage.total_tokens}")

Multi-Turn Conversation

messages = [
    {"role": "system", "content": "You are a helpful math tutor."}
]

# Turn 1
messages.append({"role": "user", "content": "What is 2 + 2?"})
response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=messages,
    max_tokens=50
)
assistant_msg = response.choices[0].message.content
messages.append({"role": "assistant", "content": assistant_msg})
print(f"User: What is 2 + 2?")
print(f"Assistant: {assistant_msg}")

# Turn 2
messages.append({"role": "user", "content": "And what is that multiplied by 3?"})
response = client.chat.completions.create(
    model="llama3.2:latest",
    messages=messages,
    max_tokens=50
)
print(f"User: And what is that multiplied by 3?")
print(f"Assistant: {response.choices[0].message.content}")

Streaming

stream = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True
)

for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
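
If the complete response text is also needed after streaming, the deltas can be accumulated as they arrive:

chunks = []
stream = client.chat.completions.create(
    model="llama3.2:latest",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    stream=True
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
        chunks.append(delta)

full_text = "".join(chunks)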

Generate Embeddings

response = client.embeddings.create(
    model="llama3.2:latest",
    input="Ollama makes running LLMs locally easy."
)

embedding = response.data[0].embedding
print(f"Dimensions: {len(embedding)}")
print(f"First 5 values: {embedding[:5]}")

Error Handling

try:
    response = client.chat.completions.create(
        model="invalid-model",
        messages=[{"role": "user", "content": "Hello"}]
    )
except Exception as e:
    print(f"Error: {type(e).__name__}")

Migration from OpenAI

Before (OpenAI)

from openai import OpenAI

client = OpenAI()  # Uses OPENAI_API_KEY

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

After (Ollama)

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama"
)

response = client.chat.completions.create(
    model="llama3.2:latest",  # Change model name
    messages=[{"role": "user", "content": "Hello!"}]
)
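
For applications that need to switch between the two backends, the base URL, API key, and model can be selected from the environment; the variable name USE_OLLAMA below is illustrative:

import os
from openai import OpenAI

# Illustrative switch: USE_OLLAMA=1 targets the local Ollama server,
# anything else falls back to the hosted OpenAI API.
use_ollama = os.getenv("USE_OLLAMA", "1") == "1"

client = OpenAI(
    base_url=f"{os.getenv('OLLAMA_HOST', 'http://localhost:11434')}/v1" if use_ollama else None,
    api_key="ollama" if use_ollama else os.getenv("OPENAI_API_KEY"),
)
model = "llama3.2:latest" if use_ollama else "gpt-4"

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Hello!"}]
)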

LangChain Integration

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",
    model="llama3.2:latest"
)

response = llm.invoke("What is Python?")
print(response.content)

LlamaIndex Integration

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    api_base="http://localhost:11434/v1",
    api_key="ollama",
    model="llama3.2:latest"
)

response = llm.complete("What is Python?")
print(response.text)
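
Some llama-index versions validate model names against the known OpenAI catalogue and reject local model names; in that case the OpenAILike class from the separate llama-index-llms-openai-like package is intended for OpenAI-compatible servers (parameters below assumed from that package):

from llama_index.llms.openai_like import OpenAILike

llm = OpenAILike(
    api_base="http://localhost:11434/v1",
    api_key="ollama",
    model="llama3.2:latest",
    is_chat_model=True
)

response = llm.complete("What is Python?")
print(response.text)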

Connection Health Check

import os
import requests

def check_ollama_health(model="llama3.2:latest"):
    """Check if Ollama server is running and model is available."""
    OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://localhost:11434")
    try:
        response = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
        if response.status_code == 200:
            models = response.json()
            model_names = [m.get("name", "") for m in models.get("models", [])]
            return True, model in model_names
        return False, False
    except requests.exceptions.RequestException:
        return False, False

server_ok, model_ok = check_ollama_health()
print(f"Server reachable: {server_ok}, model available: {model_ok}")

When to Use This Skill

Use when:

  • Migrating from OpenAI to local LLMs
  • Using LangChain, LlamaIndex, or other OpenAI-based tools
  • You prefer the OpenAI client interface
  • Building applications that may switch between OpenAI and Ollama

Cross-References

  • bazzite-ai-jupyter:ollama - Native Ollama library (more features)
  • bazzite-ai-jupyter:chat - Direct REST API access