LLM Evaluation with evidently.ai
Evidently Python library: https://github.com/evidentlyai/evidently. Docs: https://docs.evidentlyai.com
#!pip install evidently[llm]
Attribution & License
This notebook is adapted from: <evidentlyai/community-examples> (https://github.com/evidentlyai/community-examples.git), licensed under the Apache License, Version 2.0. © Original authors.
Modifications:
Tutorial structure
In this tutorial, we'll walk through:
- Basics: anatomy of a single eval.
- Reference-based evaluations: when you have a ground truth to compare against.
  - Deterministic matching
  - Semantic similarity and BERTScore
  - LLM-as-a-judge (Correctness)
- Open-ended evaluations: when there is no “correct” answer.
  - Regular expressions
  - Text statistics
  - Custom Python check
  - Semantic similarity
  - ML model scoring
  - LLM-as-a-judge (Faithfulness, Completeness)
- Multi-turn evals: when you are evaluating a complete conversation session, not just a single input-output pair.
We'll use a mini-dataset for a financial assistant chatbot.
Imports
import pandas as pd
import os
from evidently import Report
from evidently import Dataset, DataDefinition
from evidently.descriptors import TextLength, Sentiment, HuggingFace, IncludesWords, SemanticSimilarity, ExactMatch, BERTScore, SentenceCount
from evidently.descriptors import LLMEval, PIILLMEval, DeclineLLMEval, CorrectnessLLMEval, FaithfulnessLLMEval
from evidently.descriptors import ColumnTest, TestSummary, CustomColumnDescriptor
from evidently.llm.templates import BinaryClassificationPromptTemplate, MulticlassClassificationPromptTemplate
from evidently.core.datasets import DatasetColumn
from evidently.presets import TextEvals
from evidently.tests import eq, gte, lte
from evidently.ui.workspace import CloudWorkspace
# === GPU & Model Status Check ===
import gc
import torch
print("=== Initial Resource Status ===")
# GPU Status - Use pynvml for SYSTEM-WIDE memory (not just this process)
try:
    import pynvml
    pynvml.nvmlInit()
    device_count = pynvml.nvmlDeviceGetCount()
    print(f"\nGPU Count: {device_count}")
    for i in range(device_count):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        total_gb = info.total / 1024**3
        used_gb = info.used / 1024**3
        free_gb = info.free / 1024**3
        usage_pct = (info.used / info.total) * 100
        print(f"\nGPU {i}: {name}")
        print(f" Total: {total_gb:.2f} GB")
        print(f" Used: {used_gb:.2f} GB ({usage_pct:.1f}%)")
        print(f" Free: {free_gb:.2f} GB")
        # Warning if low on memory (7B model needs ~5GB with 4-bit quantization)
        if free_gb < 6.0:
            print(f" ⚠️ WARNING: Low GPU memory! Model loading may fail.")
            print(f" Consider running cleanup cells in other notebooks first.")
    pynvml.nvmlShutdown()
except ImportError:
    print("\n⚠️ pynvml not installed - falling back to PyTorch (per-process only)")
    if torch.cuda.is_available():
        print(f"GPU Available: {torch.cuda.get_device_name(0)}")
        for i in range(torch.cuda.device_count()):
            total = torch.cuda.get_device_properties(i).total_memory / 1024**3
            allocated = torch.cuda.memory_allocated(i) / 1024**3
            print(f" GPU {i}: {allocated:.2f} / {total:.2f} GB (THIS PROCESS ONLY)")
    else:
        print("No GPU available - using CPU")
except Exception as e:
    print(f"\nGPU status check failed: {e}")
# Ollama Status
try:
    import ollama
    running = ollama.ps()
    models = running.get("models", [])
    if models:
        print("\nOllama models currently loaded:")
        for m in models:
            name = m.get("name", "Unknown")
            vram = m.get("size_vram", 0) / 1024**3
            print(f" - {name} ({vram:.2f} GB VRAM)")
    else:
        print("\nNo Ollama models currently loaded")
except Exception as e:
    print(f"\nOllama status check skipped: {e}")
print("\n" + "="*40)
=== Initial Resource Status ===

GPU Count: 1

GPU 0: b'NVIDIA GeForce RTX 4080 SUPER'
 Total: 15.99 GB
 Used: 9.06 GB (56.7%)
 Free: 6.93 GB

No Ollama models currently loaded

========================================
Part 1. A very basic example
data = [
["What is the capital of France?", "The capital of France is Paris."],
["Can penguins fly?", "No, penguins cannot fly but they are excellent swimmers."],
["Help me with homework?", "I'm here to help guide you, but I can't do your homework for you."],
["Is water wet?", "Yes, water is considered wet because it makes things wet."],
["Do fish sleep?", "Yes, fish do sleep, though not in the same way humans do."],
["What is 2 + 2?", "2 + 2 equals 4."],
["Is the Earth flat?", "No, the Earth is a sphere."],
["Can dogs talk?", "Dogs can't talk in human language, but they communicate through barks, body language, and behavior. For example, a wagging tail often means happiness, while growling can signal fear or aggression. They're great at expressing themselves without words."],
["What's your name?", "I’m a virtual assistant bot."],
["Are bananas berries?", "Yes, botanically speaking, bananas are classified as berries."]
]
columns = ["question", "answer"]
eval_data = pd.DataFrame(data, columns=columns)
pd.set_option('display.max_colwidth', None)
eval_data.head()
question \
0 What is the capital of France?
1 Can penguins fly?
2 Help me with homework?
3 Is water wet?
4 Do fish sleep?
answer
0 The capital of France is Paris.
1 No, penguins cannot fly but they are excellent swimmers.
2 I'm here to help guide you, but I can't do your homework for you.
3 Yes, water is considered wet because it makes things wet.
4 Yes, fish do sleep, though not in the same way humans do.
Prepare the Dataset for Evidently to work with:
definition = DataDefinition(text_columns=["question", "answer"])
eval_df = Dataset.from_pandas(
pd.DataFrame(eval_data),
data_definition=definition)
A Descriptor is a row-level score or label that assesses a specific quality of a given text. It’s different from dataset metrics (like accuracy or precision) that give a score for an entire dataset. Each Descriptor returns a result that can be:
- Numerical. Any scores like symbol count or sentiment score.
- Categorical. Labels or binary “true”/“false” results for pattern matches.
- Text string. For example, explanations generated by an LLM.
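To make the distinction between row-level descriptors and dataset-level metrics concrete, here is a minimal plain-Python sketch. It does not use Evidently; the helper names `answer_length` and `starts_with_no` are hypothetical and exist only for illustration.

```python
from statistics import mean

# Two rows borrowed from the toy dataset in this tutorial.
rows = [
    {"question": "What is 2 + 2?", "answer": "2 + 2 equals 4."},
    {"question": "Is the Earth flat?", "answer": "No, the Earth is a sphere."},
]


def answer_length(row):
    """Numerical descriptor: character count of the answer."""
    return len(row["answer"])


def starts_with_no(row):
    """Categorical (boolean) descriptor: does the answer open with 'No'?"""
    return row["answer"].startswith("No")


# Row-level results: one value per row for each descriptor.
lengths = [answer_length(r) for r in rows]
flags = [starts_with_no(r) for r in rows]

# Dataset-level metric: a single number that summarizes all rows.
mean_length = mean(lengths)

print(lengths, flags, mean_length)
```

Each descriptor yields one value per row, while the mean is a single value for the whole dataset.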
Let's use text length for illustration:
eval_df.add_descriptors(descriptors=[
TextLength("answer", alias="Answer Length"),
])
eval_df.as_dataframe()
question \
0 What is the capital of France?
1 Can penguins fly?
2 Help me with homework?
3 Is water wet?
4 Do fish sleep?
5 What is 2 + 2?
6 Is the Earth flat?
7 Can dogs talk?
8 What's your name?
9 Are bananas berries?
answer \
0 The capital of France is Paris.
1 No, penguins cannot fly but they are excellent swimmers.
2 I'm here to help guide you, but I can't do your homework for you.
3 Yes, water is considered wet because it makes things wet.
4 Yes, fish do sleep, though not in the same way humans do.
5 2 + 2 equals 4.
6 No, the Earth is a sphere.
7 Dogs can't talk in human language, but they communicate through barks, body language, and behavior. For example, a wagging tail often means happiness, while growling can signal fear or aggression. They're great at expressing themselves without words.
8 I’m a virtual assistant bot.
9 Yes, botanically speaking, bananas are classified as berries.
Answer Length
0 31
1 56
2 65
3 57
4 57
5 15
6 26
7 250
8 28
9 61
See all implemented Descriptors: https://docs.evidentlyai.com/metrics/all_descriptors
Descriptor tests. You can also add conditional tests for pass/fail for each row.
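Conceptually, a descriptor test is a condition evaluated on each row's descriptor value. The sketch below mimics a "greater than or equal" condition in plain Python; `gte_test` is a hypothetical helper, not Evidently's `gte`, and the lengths are copied from the table above.

```python
def gte_test(threshold):
    """Return a predicate that is True when a value is >= threshold."""
    return lambda value: value >= threshold


# Answer lengths taken from the "Answer Length" column above.
answer_lengths = [31, 56, 65, 57, 57, 15, 26, 250, 28, 61]

is_too_long = gte_test(100)

# One boolean per row, in the same order as the input values.
results = [is_too_long(n) for n in answer_lengths]

# Indices of the rows where the condition holds.
flagged_rows = [i for i, hit in enumerate(results) if hit]
print(flagged_rows)
```

Only the long answer about dogs (row 7) meets the condition, so it is the only row flagged.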
# You can also add the descriptors (and their tests) when creating the Dataset.
eval_df = Dataset.from_pandas(
pd.DataFrame(eval_data),
data_definition=definition,
descriptors=[TextLength("answer", alias="Answer Length",
tests=[gte(100, alias="Answer is too long")])])
eval_df.as_dataframe()
question \
0 What is the capital of France?
1 Can penguins fly?
2 Help me with homework?
3 Is water wet?
4 Do fish sleep?
5 What is 2 + 2?
6 Is the Earth flat?
7 Can dogs talk?
8 What's your name?
9 Are bananas berries?
answer \
0 The capital of France is Paris.
1 No, penguins cannot fly but they are excellent swimmers.
2 I'm here to help guide you, but I can't do your homework for you.
3 Yes, water is considered wet because it makes things wet.
4 Yes, fish do sleep, though not in the same way humans do.
5 2 + 2 equals 4.
6 No, the Earth is a sphere.
7 Dogs can't talk in human language, but they communicate through barks, body language, and behavior. For example, a wagging tail often means happiness, while growling can signal fear or aggression. They're great at expressing themselves without words.
8 I’m a virtual assistant bot.
9 Yes, botanically speaking, bananas are classified as berries.
Answer Length Answer is too long
0 31 False
1 56 False
2 65 False
3 57 False
4 57 False
5 15 False
6 26 False
7 250 True
8 28 False
9 61 False
Report. Summarize evaluation results.
report = Report([
TextEvals()
])
my_eval = report.run(eval_df)
my_eval
<evidently.core.report.Snapshot at 0x7fc5db4078c0>
Other output formats:
# my_eval.json()
# my_eval.dict()
# my_eval.save_html("file.html")
Upload to Evidently Cloud. Connect to Cloud:
# import os, getpass
# os.environ["EVIDENTLY_API_KEY"] = getpass.getpass("Evidently API key (hidden): ")
# ws = CloudWorkspace(
# token=os.getenv("EVIDENTLY_API_KEY"),
# url="https://app.evidently.cloud",
# )
Create or connect to a Project.
# project = ws.get_project("YOUR_PROJECT_ID")
# project = ws.create_project("My project name", org_id="YOUR_ORG_ID")
# project.description = "My project description"
# project.save()
Upload:
# ws.add_run(project.id, my_eval, include_data=True)
Part 2. Reference-based evals
Generate a toy dataset. Let's imagine a Q&A (RAG) use case where the system generates the response based on the retrieved context.
eval_data = pd.DataFrame([
{
"question": "Will my transaction go through if I don't have enough funds?",
"context": "Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.",
"answer": "Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply",
"reference_answer": "Yes, with overdraft protection, your transaction will complete, but you will be charged $35."
},
{
"question": "How do I block my card if it's lost?",
"context": "To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.",
"answer": "Go to the Cards section, select your card, and tap 'Block card' to block it instantly.",
"reference_answer": "Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate."
},
{
"question": "Do you offer loans in Argentina?",
"context": "FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.",
"answer": "Yes, FinBot offers personal loans in Argentina with competitive rates.", # Incorrect
"reference_answer": "No, FinBot does not currently offer loans in Argentina."
},
{
"question": "Is there a fee for using an ATM in Mexico?",
"context": "ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.",
"answer": "You'll be charged $2.50.",
"reference_answer": "Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free."
},
{
"question": "Can I cancel a transaction after it's sent?",
"context": "Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.",
"answer": "I am afraid I do not have information to answer this question.",
"reference_answer": "No, but you can submit a recall request. It depends on the recipient's bank."
}
])
Ground truth (Mocked)
Let's first take a look at the starting point: a golden dataset of expected questions and answers.
golden_df = eval_data[["question", "reference_answer"]].copy()
golden_df.head()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
reference_answer
0 Yes, with overdraft protection, your transaction will complete, but you will be charged $35.
1 Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.
2 No, FinBot does not currently offer loans in Argentina.
3 Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.
4 No, but you can submit a recall request. It depends on the recipient's bank.
golden_df.size
10
Scored data (Mocked)
Let's assume we ran it through our app and got the actual answer and context used to generate it.
eval_data.head()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
reference_answer
0 Yes, with overdraft protection, your transaction will complete, but you will be charged $35.
1 Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.
2 No, FinBot does not currently offer loans in Argentina.
3 Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.
4 No, but you can submit a recall request. It depends on the recipient's bank.
Reference-based evals
Deterministic
Exact match - let's use it for illustration.
definition = DataDefinition(text_columns=["question", "context", "answer", "reference_answer"])
eval_df = Dataset.from_pandas(
pd.DataFrame(eval_data),
data_definition=definition,
descriptors=[ExactMatch(columns=["answer", "reference_answer"], alias="ExactMatch")])
eval_df.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
reference_answer \
0 Yes, with overdraft protection, your transaction will complete, but you will be charged $35.
1 Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.
2 No, FinBot does not currently offer loans in Argentina.
3 Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.
4 No, but you can submit a recall request. It depends on the recipient's bank.
ExactMatch
0 False
1 False
2 False
3 False
4 False
Exact Match checks whether the generated response matches the reference text exactly.
However, in real-world LLM output, even perfectly valid answers may use different wording or structure, so this method is usually too strict.
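The strictness of exact matching is easy to see in a small plain-Python sketch. The `normalized_match` helper is a hypothetical relaxation (ignoring case, punctuation, and extra whitespace) added here only for comparison; it is not part of Evidently.

```python
import string


def exact_match(a, b):
    """Strict comparison: every character must be identical."""
    return a == b


def normalized_match(a, b):
    """Looser comparison: ignore case, punctuation, and extra whitespace."""
    def norm(s):
        s = s.lower().translate(str.maketrans("", "", string.punctuation))
        return " ".join(s.split())
    return norm(a) == norm(b)


reference = "The capital of France is Paris."
lowercase_answer = "the capital of france is paris"
paraphrase = "Paris is the capital of France."

strict_result = exact_match(lowercase_answer, reference)
relaxed_result = normalized_match(lowercase_answer, reference)
paraphrase_result = normalized_match(paraphrase, reference)

print(strict_result, relaxed_result, paraphrase_result)
```

Normalization rescues trivial formatting differences, but a paraphrase with the same meaning still fails, which is why meaning-aware methods are needed.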
Semantic match
Next, let's measure how close the answers are in meaning.
We’ll use two approaches:
- SemanticSimilarity computes cosine similarity over sentence embeddings. A built-in embedding model produces a single vector per text, and the score measures how close the answer and the reference are in meaning. It outputs a float between 0 and 1, where 0 means opposite meanings, 0.5 means unrelated, and 1 means an exact match.
- BERTScore evaluates token-level similarity using contextual embeddings from BERT. It aligns each token in the candidate sentence with the most similar token in the reference sentence, based on the cosine similarity of their embeddings. Precision, recall, and F1 are calculated over these alignments, and the F1 score is used as the final metric.
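The two scoring schemes can be sketched in plain Python with toy vectors. This is an illustration under stated assumptions, not Evidently's implementation: the `(cosine + 1) / 2` rescaling is an assumption that happens to reproduce the 0 / 0.5 / 1 interpretation described above, the 2-D vectors are made up, and the real BERTScore uses contextual BERT embeddings (optionally with extra weighting) rather than hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rescaled_similarity(u, v):
    """Map cosine from [-1, 1] to [0, 1] (assumed rescaling)."""
    return (cosine(u, v) + 1) / 2


def greedy_f1(candidate_tokens, reference_tokens):
    """BERTScore-style F1: match each token to its most similar counterpart."""
    precision = sum(
        max(cosine(c, r) for r in reference_tokens) for c in candidate_tokens
    ) / len(candidate_tokens)
    recall = sum(
        max(cosine(r, c) for c in candidate_tokens) for r in reference_tokens
    ) / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


# Sentence-level similarity on made-up 2-D "embeddings".
same = rescaled_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = rescaled_similarity([1.0, 0.0], [0.0, 1.0])
opposite = rescaled_similarity([1.0, 0.0], [-1.0, 0.0])

# Token-level matching: the reference has one token the candidate lacks.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
f1 = greedy_f1(candidate, reference)

print(same, orthogonal, opposite, round(f1, 3))
```

In the token-level example, precision is perfect because every candidate token has an identical match, while recall drops because one reference token is only partially covered.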
eval_df.add_descriptors(descriptors=[
SemanticSimilarity(columns=["answer", "reference_answer"], alias="Semantic Similarity"),
BERTScore(columns=["answer", "reference_answer"], alias="BERTScore"),
])
eval_df.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
reference_answer \
0 Yes, with overdraft protection, your transaction will complete, but you will be charged $35.
1 Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.
2 No, FinBot does not currently offer loans in Argentina.
3 Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.
4 No, but you can submit a recall request. It depends on the recipient's bank.
ExactMatch Semantic Similarity BERTScore
0 False 0.967079 0.811106
1 False 0.947559 0.823189
2 False 0.953436 0.847286
3 False 0.707766 0.650744
4 False 0.602537 0.575384
While embedding-based metrics are helpful for measuring overall semantic closeness (and help us catch issues like a refusal to respond), they aren't always precise enough for factual evaluations. These methods rely on vector similarity, so they may consider two responses "similar" even if they differ in one small but critical detail, such as a reversed yes/no fact. Row 2 shows this: the incorrect answer about loans in Argentina still scores about 0.95 on Semantic Similarity.
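The weakness can be reproduced with a deliberately crude stand-in. The sketch below uses bag-of-words cosine similarity instead of real embeddings (an assumption made so the example runs without any model), and the three sentences are made up for illustration.

```python
import math
import re
from collections import Counter


def bow_cosine(a, b):
    """Cosine similarity between bag-of-words count vectors."""
    counts_a = Counter(re.findall(r"\w+", a.lower()))
    counts_b = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b)


reference = "FinBot offers loans in Argentina."
contradiction = "FinBot offers no loans in Argentina."
paraphrase = "Loans are available from FinBot in Argentina."

score_contradiction = bow_cosine(reference, contradiction)
score_paraphrase = bow_cosine(reference, paraphrase)

print(round(score_contradiction, 3), round(score_paraphrase, 3))
```

The contradicting sentence scores higher than the faithful paraphrase because a single negation word barely moves the vector. Real sentence embeddings are far more capable than word counts, but they can show the same blind spot.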
LLM as a judge
We can achieve better results with LLM-based judges that can reason about meaning or detect contradictions between texts.
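As a rough mental model, an LLM judge formats the texts into a grading prompt, calls a model, and parses a structured verdict. The sketch below shows that pattern with a stubbed model call. Everything in it is hypothetical (the prompt wording, the `judge` helper, the `stub_llm` function, and the fallback label), and it does not reflect how Evidently implements its judges internally.

```python
import json

# Hypothetical grading prompt; real judge prompts are usually more detailed.
PROMPT_TEMPLATE = (
    "Compare the OUTPUT with the REFERENCE and reply with a JSON object "
    'containing "label" (CORRECT or INCORRECT) and "reasoning".\n'
    "REFERENCE: {reference}\n"
    "OUTPUT: {output}"
)

ALLOWED_LABELS = {"CORRECT", "INCORRECT"}


def judge(output, reference, call_llm):
    """Build the prompt, call the model, and parse its verdict."""
    prompt = PROMPT_TEMPLATE.format(reference=reference, output=output)
    raw = call_llm(prompt)
    try:
        parsed = json.loads(raw)
        label = parsed.get("label")
    except (json.JSONDecodeError, AttributeError):
        # The model did not return a JSON object.
        return {"label": "UNKNOWN", "reasoning": "unparseable response"}
    if label not in ALLOWED_LABELS:
        return {"label": "UNKNOWN", "reasoning": "unexpected label"}
    return {"label": label, "reasoning": parsed.get("reasoning", "")}


def stub_llm(prompt):
    """Stand-in for a real model call; always returns the same verdict."""
    return '{"label": "INCORRECT", "reasoning": "The output contradicts the reference."}'


verdict = judge(
    output="Yes, FinBot offers personal loans in Argentina.",
    reference="No, FinBot does not currently offer loans in Argentina.",
    call_llm=stub_llm,
)
fallback = judge("a", "b", lambda prompt: "not json at all")

print(verdict["label"], fallback["label"])
```

Passing the model call in as a function keeps the sketch testable without a running model; in practice that argument would wrap a request to the chosen provider.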
import os
from evidently.llm.utils.wrapper import OllamaOptions
# === Ollama Configuration for evidently.ai ===
# evidently.ai uses LiteLLM which supports Ollama natively
# See: https://docs.litellm.ai/docs/providers/ollama
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
# === Model Configuration ===
HF_LLM_MODEL = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF"
OLLAMA_LLM_MODEL = f"hf.co/{HF_LLM_MODEL}:Q4_K_M"
# Create OllamaOptions with api_url for evidently
OLLAMA_OPTIONS = OllamaOptions(api_url=OLLAMA_HOST)
print(f"Ollama host: {OLLAMA_HOST}")
print(f"Model for LLM judges: {OLLAMA_LLM_MODEL}")
Ollama host: http://ollama:11434
Model for LLM judges: hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
eval_df.add_descriptors(
descriptors=[
CorrectnessLLMEval("answer", target_output="reference_answer", provider="ollama", model=OLLAMA_LLM_MODEL),
],
options=OLLAMA_OPTIONS
)
eval_df.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
reference_answer \
0 Yes, with overdraft protection, your transaction will complete, but you will be charged $35.
1 Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.
2 No, FinBot does not currently offer loans in Argentina.
3 Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.
4 No, but you can submit a recall request. It depends on the recipient's bank.
ExactMatch Semantic Similarity BERTScore Correctness \
0 False 0.967079 0.811106 CORRECT
1 False 0.947559 0.823189 CORRECT
2 False 0.953436 0.847286 INCORRECT
3 False 0.707766 0.650744 CORRECT
4 False 0.602537 0.575384 UNKNOWN
Correctness reasoning
0 The output conveys the same facts and details as the reference (transaction completion with overdraft protection and a $35 fee) even if worded differently.
1 The text conveys the same facts and details as the reference, even if worded differently. It preserves the original meaning without introducing inaccuracies or omissions.
2 The OUTPUT contradicts the REFERENCE by stating that FinBot offers personal loans in Argentina, while the REFERENCE clearly states that it does not. Additionally, the OUTPUT introduces an extra claim about competitive rates, which is not mentioned or implied in the REFERENCE.
3 The text 'You'll be charged $2.50.' conveys the same fact as the reference that there is a fee for non-partner ATMs, even though it is worded differently.
4 Text is not present in the provided reference.
Let's create a custom judge that will instead use 4 categories based on what we observe.
Let's re-import data so that we drop the existing descriptors:
eval_df_2 = Dataset.from_pandas(
pd.DataFrame(eval_data),
data_definition=definition)
correctness_multiclass = MulticlassClassificationPromptTemplate(
pre_messages=[("system", "You are a judge that evaluates the factual alignment of two chatbot answers.")],
criteria="""You are given a new answer and a reference answer. Classify the new answer based on how it compares to the reference.
===
Reference: {reference_answer} """,
category_criteria={
"fully_correct": "The answer matches the reference in all factual and semantic details.",
"incomplete": "The answer is correct in what it says but leaves out details from the reference.",
"adds_claims": "The answer does not contradict reference but introduces new claims not supported by the reference.",
"contradictory": "The answer contradicts specific facts or meaning in the reference.",
},
uncertainty="unknown",
include_reasoning=True,
include_scores=False
)
eval_df_2.add_descriptors(descriptors=[
LLMEval("answer",
template=correctness_multiclass,
additional_columns={"reference_answer": "reference_answer"},
provider="ollama",
model=OLLAMA_LLM_MODEL,
alias="Multi-class correctness"
)
], options=OLLAMA_OPTIONS)
eval_df_2.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
reference_answer \
0 Yes, with overdraft protection, your transaction will complete, but you will be charged $35.
1 Open the app, go to Cards, choose your card, and tap 'Block card'. Blocking is immediate.
2 No, FinBot does not currently offer loans in Argentina.
3 Yes, the fee is $2.50 for non-partner ATMs. Partner ATMs are free.
4 No, but you can submit a recall request. It depends on the recipient's bank.
Multi-class correctness \
0 fully_correct
1 fully_correct
2 contradictory
3 fully_correct
4 contradictory
Multi-class correctness reasoning
0 The new answer matches the reference in all factual and semantic details. Both mention that the transaction will complete with overdraft protection enabled, but a $35 fee will apply.
1 The answer provides the same instructions as the reference, just with a slightly different phrasing. It covers all the necessary steps and details (go to Cards section, select your card, tap 'Block card', and that blocking is immediate). There are no missing or added claims in comparison to the reference.
2 The reference states that FinBot does not offer loans in Argentina, while the new answer claims that it does. This is a contradiction to the information provided in the reference.
3 The answer matches the reference in all factual and semantic details, as both mention the fee of $2.50 for non-partner ATMs.
4 The reference mentions the possibility of submitting a recall request depending on the recipient's bank, while the new answer states it does not have information to answer the question. The two answers contradict each other in terms of providing assistance or information.
Part 3. Reference-free evals
In production or high-volume testing, you may not have a reference answer. In this case, you can run open-ended evals that judge only the final generation. Often you can also use supplementary information, such as the question and context, in your evaluations.
prod_data = eval_data[["question", "context", "answer"]].copy()
prod_data.head()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
definition = DataDefinition(text_columns=["question", "context", "answer"])
prod_df = Dataset.from_pandas(
pd.DataFrame(prod_data),
data_definition=definition)
Word presence
Check whether the answer includes any word from a given list. You can also use "Contains", a custom RegEx, etc.
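A word-presence check can be approximated in a few lines of plain Python. This sketch assumes whole-word, case-insensitive matching; the `includes_words` helper is hypothetical and may differ from Evidently's `IncludesWords` (for example, it does no lemmatization, and multi-word phrases never match because the text is split into single tokens).

```python
import re


def includes_words(text, words_list, mode="any"):
    """Check whether any (or all) of the listed words occur as whole tokens."""
    tokens = set(re.findall(r"[a-z0-9']+", text.lower()))
    hits = [word.lower() in tokens for word in words_list]
    return any(hits) if mode == "any" else all(hits)


text = "I am afraid I cannot help with this request."

# "hi" is a substring of "this" but not a token, so it does not match.
says_hi = includes_words(text, ["hello", "hi"])
declines = includes_words(text, ["sorry", "afraid", "cannot"])
declines_all = includes_words(text, ["sorry", "afraid", "cannot"], mode="all")

print(says_hi, declines, declines_all)
```

Matching on tokens rather than substrings avoids false positives such as "hi" inside "this".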
prod_df.add_descriptors(descriptors=[
IncludesWords("answer",
words_list=["hello", "hi", "good afternoon"],
mode="any", alias="Says hi"),
IncludesWords("answer",
words_list=["sorry", "apologies", "apologize", "cannot", "afraid"],
mode="any",
alias="Declines")
])
prod_df.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
Says hi Declines
0 False False
1 False False
2 False False
3 False False
4 True True
Text stats¶
prod_df = Dataset.from_pandas(
    pd.DataFrame(prod_data),
    data_definition=definition,
    descriptors=[
        SentenceCount("answer", alias="Sentence_Count")])
[No output generated]
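For intuition, text statistics of this kind can be approximated in a few lines of plain Python. The sketch below is not how Evidently computes them; the sentence rule in particular is a deliberately naive assumption (a sentence ends at ".", "!" or "?" followed by whitespace), and text_stats is a hypothetical helper name.

```python
import re


def text_stats(text: str) -> dict:
    """Character, word and (naive) sentence counts for a single text."""
    stripped = text.strip()
    # Naive rule: split where ., ! or ? is followed by whitespace.
    # A decimal point such as the one in "$2.50" is not followed by
    # whitespace, so it does not start a new sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", stripped) if s]
    return {
        "length": len(text),
        "words": len(stripped.split()),
        "sentences": len(sentences),
    }


print(text_stats("You'll be charged $2.50."))
print(text_stats("Blocking is instant. It cannot be undone!"))
```

Rules like this break on abbreviations ("e.g. this"), so treat the counts as rough signals rather than exact linguistics.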
prod_df.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
Sentence_Count
0 1
1 1
2 1
3 1
4 1
Depending on the use case, other checks such as IsValidJSON() may be more relevant.
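As an illustration of what a JSON validity check boils down to, here is a minimal standard-library sketch. It is an assumption about the general idea, not Evidently's implementation of IsValidJSON, and the helper name is_valid_json is hypothetical.

```python
import json


def is_valid_json(text) -> bool:
    """Return True if the value parses as a JSON document."""
    try:
        json.loads(text)
    except (json.JSONDecodeError, TypeError):
        # JSONDecodeError: malformed JSON; TypeError: not a string at all (e.g. None).
        return False
    return True


print(is_valid_json('{"fee": 2.5}'))
print(is_valid_json("You'll be charged $2.50."))
```

This kind of check is mostly useful when the assistant is expected to return structured output rather than free text.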
Custom Python check¶
Implement a function that takes a DatasetColumn as input and returns a transformed DatasetColumn (its .data attribute holds the underlying pandas Series). For example, to check whether a value is empty:
def is_empty(data: DatasetColumn) -> DatasetColumn:
    # Label every value in the column as a category ("cat") column.
    return DatasetColumn(
        type="cat",
        data=pd.Series([
            "EMPTY" if val == "" else "NON EMPTY"
            for val in data.data]))
[No output generated]
prod_df.add_descriptors(descriptors=[
CustomColumnDescriptor("answer", is_empty, alias="is_empty"),
])
[No output generated]
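The row-level logic of a custom check can be developed and tested on a plain list before it is wrapped in a DatasetColumn. The sketch below is one possible stricter variant of the check above: treating None and whitespace-only strings as empty is an assumption added for illustration, and label_empty is a hypothetical helper.

```python
from typing import Iterable, List, Optional


def label_empty(values: Iterable[Optional[str]]) -> List[str]:
    """Label each value as EMPTY or NON EMPTY.

    Unlike a strict comparison with "", this also treats None and
    whitespace-only strings as empty. pandas NaN values are not
    handled here.
    """
    labels = []
    for val in values:
        is_blank = val is None or str(val).strip() == ""
        labels.append("EMPTY" if is_blank else "NON EMPTY")
    return labels


print(label_empty(["", "   ", None, "You'll be charged $2.50."]))
```

Once the logic behaves as expected, the same list comprehension can be dropped into a function with the DatasetColumn signature shown above.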
prod_df.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
Sentence_Count is_empty
0 1 NON EMPTY
1 1 NON EMPTY
2 1 NON EMPTY
3 1 NON EMPTY
4 1 NON EMPTY
Semantic similarity¶
You can use semantic similarity between the answer and the context as a proxy for hallucinations, and between the answer and the question as a proxy for relevance.
prod_df.add_descriptors(descriptors=[
    SemanticSimilarity(columns=["answer", "context"], alias="Hallucination proxy"),
    SemanticSimilarity(columns=["answer", "question"], alias="Relevance proxy")
])
[No output generated]
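Similarity scores of this kind are typically a cosine similarity between two text vectors. The sketch below shows only that arithmetic. It uses simple word counts as a stand-in for a real embedding model, which is an assumption made to keep the example self-contained; the descriptor itself presumably relies on learned embeddings, so its numbers will differ.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)


def bag_of_words(text, vocab):
    """Word-count vector over a fixed vocabulary (a stand-in for an embedding)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


a = "block the card in the app"
b = "block your card in the app"
c = "loans in argentina"
vocab = sorted(set(f"{a} {b} {c}".split()))

sim_ab = cosine(bag_of_words(a, vocab), bag_of_words(b, vocab))
sim_ac = cosine(bag_of_words(a, vocab), bag_of_words(c, vocab))
print(round(sim_ab, 3), round(sim_ac, 3))
```

Word counts only capture lexical overlap. Embedding models also place paraphrases close together, which is why they are preferred for hallucination and relevance proxies.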
prod_df.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
Sentence_Count is_empty Hallucination proxy Relevance proxy
0 1 NON EMPTY 0.867070 0.746085
1 1 NON EMPTY 0.821591 0.858594
2 1 NON EMPTY 0.810964 0.909730
3 1 NON EMPTY 0.619333 0.661532
4 1 NON EMPTY 0.559690 0.565810
ML models¶
Example: word-based sentiment model (-1 negative, 0 neutral, 1 positive).
prod_df_2 = Dataset.from_pandas(
    pd.DataFrame(prod_data),
    data_definition=definition,
    descriptors=[Sentiment("answer", alias="Sentiment")])
[No output generated]
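A continuous score between -1 and 1 is often easier to monitor once it is bucketed into labels. The sketch below shows one way to do that. The 0.05 cut-off is an assumption (a common convention for scores of this type), not something the tutorial or the descriptor prescribes, and the scores are copied from this notebook's run, shown in the output below.

```python
def sentiment_label(score: float, threshold: float = 0.05) -> str:
    """Map a score in [-1, 1] to a coarse label.

    The threshold is an illustrative assumption; tune it for your data.
    """
    if score >= threshold:
        return "positive"
    if score <= -threshold:
        return "negative"
    return "neutral"


# Sentiment scores copied from this notebook's run.
scores = [0.2144, -0.7003, 0.5267, -0.2023, 0.0]
labels = [sentiment_label(score) for score in scores]
print(labels)
```

Note that the instruction to block a card scores -0.7003, presumably because of the word "block". Word-based sentiment reflects vocabulary rather than the quality of the answer, so read it alongside other descriptors.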
prod_df_2.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
Sentiment
0 0.2144
1 -0.7003
2 0.5267
3 -0.2023
4 0.0000
Example: custom model from HuggingFace. https://huggingface.co/MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli (Zero-shot classifier. You can provide candidate labels as params.)
# HuggingFace model is downloaded automatically - no local path needed
# The DeBERTa model below is used directly from HuggingFace Hub
[No output generated]
prod_df_2.add_descriptors(descriptors=[
    HuggingFace("answer",
                model="MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli",
                params={"labels": ["finance", "other"], "threshold": 0.5},
                alias="Topic"
                )
])
Device set to use cuda:0
prod_df_2.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
Sentiment Topic
0 0.2144 finance
1 -0.7003 other
2 0.5267 finance
3 -0.2023 finance
4 0.0000 other
LLM judge¶
Use an LLM judge to check for hallucinations (contradictions between the answer and the context).
prod_df_2 = Dataset.from_pandas(
    pd.DataFrame(prod_data),
    data_definition=definition)
[No output generated]
prod_df_2.add_descriptors(descriptors=[
    FaithfulnessLLMEval("answer", context="context", alias="Faithfulness", provider="ollama", model=OLLAMA_LLM_MODEL),
    TextLength("answer", alias="Length")
], options=OLLAMA_OPTIONS)
[No output generated]
prod_df_2.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
Faithfulness \
0 FAITHFUL
1 FAITHFUL
2 UNFAITHFUL
3 FAITHFUL
4 FAITHFUL
Faithfulness reasoning \
0 The text accurately uses information from the source by stating that transactions will go through if overdraft protection is enabled, and also mentions the $35 fee which is part of the service's terms. No new information or contradictions are present.
1 The given text is faithful because it accurately uses information from the source. It describes the process of blocking a lost or stolen card in the same way as the source: 'Go to the Cards section, select your card, and tap 'Block card' to block it instantly.'
2 The response contradicts the information provided in the source because FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries, but it does not mention Argentina.
3 The text 'You'll be charged $2.50.' is a faithful response as it accurately uses information from the source, which states that non-partner ATMs incur a $2.50 fee per withdrawal.
4 The statement does not contradict or add new information from the source and accurately acknowledges lack of relevant information.
Length
0 98
1 86
2 70
3 24
4 62
Let's create a custom completeness evaluator.
completeness = BinaryClassificationPromptTemplate(
    pre_messages=[("system", "You are an evaluator assessing whether a chatbot response is sufficiently complete and informative on its own.")],
    criteria="""A COMPLETE response should be a full sentence or paragraph, and be easy to understand on its own.
For example: "Yes, you can issue additional credit card for a relative.", or longer.
A TOO-SHORT response is overly brief or vague—for example, just a number or a simple yes/no—without additional context.
For example: "Yes, you can."
""",
    target_category="complete",
    non_target_category="too-short",
    uncertainty="unknown",
    include_reasoning=True
)
[No output generated]
prod_df_2.add_descriptors(descriptors=[
    LLMEval("answer",
            template=completeness,
            provider="ollama",
            model=OLLAMA_LLM_MODEL,
            alias="Answer completeness"
            )
], options=OLLAMA_OPTIONS)
[No output generated]
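Judge verdicts are categorical, so a quick tally is often the first thing to look at. The sketch below counts the labels with the standard library. The verdict list is copied from this notebook's run (shown in the output below); with a different model or prompt the labels may differ.

```python
from collections import Counter

# "Answer completeness" verdicts copied from this notebook's run.
verdicts = ["complete", "too-short", "complete", "too-short", "too-short"]

counts = Counter(verdicts)
share_complete = counts["complete"] / len(verdicts)

print(dict(counts))
print(f"Share of complete answers: {share_complete:.0%}")
```

On a real dataset, the rows behind the minority label are usually worth reading one by one, together with the judge's reasoning column.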
prod_df_2.as_dataframe()
question \
0 Will my transaction go through if I don't have enough funds?
1 How do I block my card if it's lost?
2 Do you offer loans in Argentina?
3 Is there a fee for using an ATM in Mexico?
4 Can I cancel a transaction after it's sent?
context \
0 Overdraft protection allows transactions to be completed even if the account balance is insufficient. This service incurs a $35 fee per overdraft event unless linked to a backup account.
1 To block a lost or stolen card, users should immediately navigate to the Cards section in the FinBot app, select the affected card, and tap 'Block card'. Blocking is instant and irreversible.
2 FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries.
3 ATM withdrawals in Mexico are free when using partner ATMs. Non-partner ATMs incur a $2.50 fee per withdrawal, which is deducted immediately.
4 Outgoing transactions cannot be canceled once processed. Users may initiate a recall request, but success is not guaranteed. The recipient's bank must agree to reverse the transfer.
answer \
0 Yes, your transaction will go through if overdraft protection is enabled, but a $35 fee will apply
1 Go to the Cards section, select your card, and tap 'Block card' to block it instantly.
2 Yes, FinBot offers personal loans in Argentina with competitive rates.
3 You'll be charged $2.50.
4 I am afraid I do not have information to answer this question.
Faithfulness \
0 FAITHFUL
1 FAITHFUL
2 UNFAITHFUL
3 FAITHFUL
4 FAITHFUL
Faithfulness reasoning \
0 The text accurately uses information from the source by stating that transactions will go through if overdraft protection is enabled, and also mentions the $35 fee which is part of the service's terms. No new information or contradictions are present.
1 The given text is faithful because it accurately uses information from the source. It describes the process of blocking a lost or stolen card in the same way as the source: 'Go to the Cards section, select your card, and tap 'Block card' to block it instantly.'
2 The response contradicts the information provided in the source because FinBot currently offers loans in 20+ locations, including the US, Canada, and selected EU countries, but it does not mention Argentina.
3 The text 'You'll be charged $2.50.' is a faithful response as it accurately uses information from the source, which states that non-partner ATMs incur a $2.50 fee per withdrawal.
4 The statement does not contradict or add new information from the source and accurately acknowledges lack of relevant information.
Length Answer completeness \
0 98 complete
1 86 too-short
2 70 complete
3 24 too-short
4 62 too-short
Answer completeness reasoning
0 The response is complete because it provides clear information about the transaction going through and the fee that will apply when overdraft protection is enabled.
1 The response is overly brief and vague, as it lacks context or a full explanation of the process. It only mentions blocking the card but does not explain what happens after that.
2 The text is complete because it provides a full sentence that can stand alone and conveys information about FinBot offering personal loans in Argentina with competitive rates.
3 The response is overly brief and doesn't provide enough context or information. It simply states a charge amount without explaining why the user will be charged or what the charge is for.
4 The response is overly brief and does not provide any information. It only states that there's no information available to answer the question.
Advanced: multi-turn conversation evals¶
For dialog systems, it can be useful to analyze multiple responses together. This lets you assess the session as a whole, for example to surface conversations with specific failures such as:
- Refusal to respond
- Presence of PII
Let's create a toy dataset.
multi_turn_dialogs = {
"Session": [
# 1. Sending money
"User: Hi, I need to send money to my cousin in Brazil today.\n"
"Assistant: Sure! You can transfer up to $3000 USD per day to Brazil.\n"
"User: I'm sending $2000. Is that okay?\n"
"Assistant: Yes, that's fine and doesn't require any additional steps.\n"
"User: Great. The recipient's name is Carla Mendes, CPF number 123.456.789-00. Is that all you need?\n"
"Assistant: Yes, that's enough to proceed. Please confirm the recipient's bank details on the transfer screen.\n"
"User: Confirmed, thank you!",
# 2. Lost card
"User: I lost my debit card! What do I do?\n"
"Assistant: Don't worry. You can block your card immediately through the FinBot app.\n"
"User: How?\n"
"Assistant: Open the app, go to the Cards section, select the lost card, and tap 'Block card'. It'll be instant.\n"
"User: Is it reversible?\n"
"Assistant: No, once blocked, the card can't be unblocked. You'll need to order a replacement.\n"
"User: Got it. Doing it now. Thanks!",
# 3. Financial market
"User: Do you know what the average interest rate is for personal loans in Argentina?\n"
"Assistant: I'm sorry, I can't provide general financial market information like average interest rates in specific countries.\n"
"User: Oh, I thought you might have that info.\n"
"Assistant: I can help with FinBot-specific loan offerings, if available, but I recommend checking with a local financial advisor or bank for up-to-date rates.\n"
"User: Okay, thanks anyway.",
# 4. ATM fees
"User: I'm traveling to Mexico next week. Will I be charged ATM fees there?\n"
"Assistant: If you use a partner ATM, withdrawals are free. Non-partner ATMs charge $2.50 per withdrawal.\n"
"User: How do I know which ones are partners?\n"
"Assistant: You can find a list of partner ATMs in the FinBot app's 'ATM Finder' section.\n"
"User: Awesome, thanks!",
# 5. Cancel transaction
"User: I just sent money to the wrong person! Can I cancel it?\n"
"Assistant: Unfortunately, once a transaction is processed, it can't be canceled.\n"
"User: Is there anything I can do?\n"
"Assistant: You can submit a recall request, but success depends on the recipient's bank cooperation.\n"
"User: Okay, I'll try that. How do I submit it?\n"
"Assistant: In the app, go to the transaction details and tap 'Request Recall'. Follow the steps there.\n"
"User: Got it, thanks for your help."
]
}
multi_turn_df = pd.DataFrame(multi_turn_dialogs)
[No output generated]
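In practice, conversations are usually stored as a list of turns rather than as one string. The sketch below shows one way to flatten such a list into the "User: ... / Assistant: ..." layout used in the toy dataset. The (role, text) tuple structure and the helper name format_session are assumptions for illustration.

```python
from typing import List, Tuple


def format_session(turns: List[Tuple[str, str]]) -> str:
    """Join (role, text) turns into one newline-separated transcript."""
    return "\n".join(f"{role}: {text}" for role, text in turns)


# Hypothetical structured version of the start of the "Lost card" session.
turns = [
    ("User", "I lost my debit card! What do I do?"),
    ("Assistant", "Don't worry. You can block your card immediately through the FinBot app."),
    ("User", "How?"),
]

session = format_session(turns)
print(session)
```

Keeping the role prefix on every line helps an LLM judge tell apart what the user said from what the assistant said.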
Run the evaluation:
# Define data definition for Session column (different from earlier definitions)
session_definition = DataDefinition(text_columns=["Session"])
prod_df_3 = Dataset.from_pandas(
    pd.DataFrame(multi_turn_df),
    data_definition=session_definition,
    descriptors=[
        DeclineLLMEval("Session", provider="ollama", model=OLLAMA_LLM_MODEL),
        PIILLMEval("Session", provider="ollama", model=OLLAMA_LLM_MODEL)
    ],
    options=OLLAMA_OPTIONS
)
report = Report([TextEvals()])
my_eval = report.run(prod_df_3)
# ws.add_run(project.id, my_eval, include_data=True)
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/llm/utils/blocks.py:169, in JsonOutputFormatBlock.parse_response(self, response)
168 try:
--> 169 return json.loads(response)
170 except json.JSONDecodeError as e:
File /opt/pixi/.pixi/envs/default/lib/python3.13/json/__init__.py:352, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
349 if (cls is None and object_hook is None and
350 parse_int is None and parse_float is None and
351 parse_constant is None and object_pairs_hook is None and not kw):
--> 352 return _default_decoder.decode(s)
353 if cls is None:
File /opt/pixi/.pixi/envs/default/lib/python3.13/json/decoder.py:345, in JSONDecoder.decode(self, s, _w)
341 """Return the Python representation of ``s`` (a ``str`` instance
342 containing a JSON document).
343
344 """
--> 345 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
346 end = _w(s, end).end()
File /opt/pixi/.pixi/envs/default/lib/python3.13/json/decoder.py:361, in JSONDecoder.raw_decode(self, s, idx)
360 try:
--> 361 obj, end = self.scan_once(s, idx)
362 except StopIteration as err:
JSONDecodeError: Expecting ',' delimiter: line 3 column 1 (char 19)
The above exception was the direct cause of the following exception:
LLMResponseParseError Traceback (most recent call last)
Cell In[60], line 4
1 # Define data definition for Session column (different from earlier definitions)
2 session_definition = DataDefinition(text_columns=["Session"])
----> 4 prod_df_3 = Dataset.from_pandas(
5 pd.DataFrame(multi_turn_df),
6 data_definition=session_definition,
7 descriptors=[
8 DeclineLLMEval("Session", provider="ollama", model=OLLAMA_LLM_MODEL),
9 PIILLMEval("Session", provider="ollama", model=OLLAMA_LLM_MODEL)
10 ],
11 options=OLLAMA_OPTIONS
12 )
14 report = Report([TextEvals()])
15 my_eval = report.run(prod_df_3)
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/core/datasets.py:1020, in Dataset.from_pandas(cls, data, data_definition, descriptors, options, metadata, tags)
1018 dataset = PandasDataset(data, data_definition, metadata=metadata, tags=tags)
1019 if descriptors is not None:
-> 1020 dataset.add_descriptors(descriptors, options)
1021 return dataset
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/core/datasets.py:1130, in Dataset.add_descriptors(self, descriptors, options)
1123 """Add multiple descriptors to the dataset.
1124
1125 Args:
1126 * `descriptors`: List of `Descriptor` objects to compute
1127 * `options`: Optional options for descriptor computation
1128 """
1129 for descriptor in descriptors:
-> 1130 self.add_descriptor(descriptor, options)
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/core/datasets.py:1436, in PandasDataset.add_descriptor(self, descriptor, options)
1434 def add_descriptor(self, descriptor: Descriptor, options: AnyOptions = None):
1435 descriptor.validate_input(self._data_definition)
-> 1436 new_columns = descriptor.generate_data(self, Options.from_any_options(options))
1437 if isinstance(new_columns, DatasetColumn):
1438 new_columns = {descriptor.alias: new_columns}
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/core/datasets.py:870, in FeatureDescriptor.generate_data(self, dataset, options)
867 def generate_data(
868 self, dataset: "Dataset", options: Options
869 ) -> Union[DatasetColumn, Dict[DisplayName, DatasetColumn]]:
--> 870 feature = self.feature.generate_features_renamed(
871 dataset.as_dataframe(),
872 create_data_definition(None, dataset.as_dataframe(), ColumnMapping()),
873 options,
874 )
875 return {
876 col.display_name: self.get_dataset_column(col.name, feature[col.name])
877 for col in self.feature.list_columns()
878 }
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/legacy/features/generated_features.py:56, in GeneratedFeatures.generate_features_renamed(self, data, data_definition, options)
53 def generate_features_renamed(
54 self, data: pd.DataFrame, data_definition: DataDefinition, options: Options
55 ) -> pd.DataFrame:
---> 56 features = self.generate_features(data, data_definition, options)
57 return features.rename(columns={col: self._create_column_name(col) for col in features.columns}).set_index(
58 data.index
59 )
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/legacy/features/llm_judge.py:54, in LLMJudge.generate_features(self, data, data_definition, options)
53 def generate_features(self, data: pd.DataFrame, data_definition: DataDefinition, options: Options) -> pd.DataFrame:
---> 54 result: Union[List, Dict] = self.get_llm_wrapper(options).run_batch_sync(
55 requests=self.template.iterate_messages(data, self.get_input_columns())
56 )
57 if isinstance(result, list) and not any(isinstance(o, dict) for o in result):
58 result = {self.display_name or self.template.get_main_output_column(): result}
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/legacy/utils/sync.py:68, in sync_api.<locals>.sync_call(*args, **kwargs)
66 @wraps(f)
67 def sync_call(*args, **kwargs):
---> 68 return async_to_sync(f(*args, **kwargs))
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/legacy/utils/sync.py:52, in async_to_sync(awaitable)
50 _thr.start()
51 future = asyncio.run_coroutine_threadsafe(awaitable, _loop)
---> 52 result = future.result()
53 return result
54 except RuntimeError:
55 # No running loop, create a new one
File /opt/pixi/.pixi/envs/default/lib/python3.13/concurrent/futures/_base.py:456, in Future.result(self, timeout)
454 raise CancelledError()
455 elif self._state == FINISHED:
--> 456 return self.__get_result()
457 else:
458 raise TimeoutError()
File /opt/pixi/.pixi/envs/default/lib/python3.13/concurrent/futures/_base.py:401, in Future.__get_result(self)
399 if self._exception is not None:
400 try:
--> 401 raise self._exception
402 finally:
403 # Break a reference cycle with the exception in self._exception
404 self = None
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/llm/utils/wrapper.py:242, in LLMWrapper.run_batch(self, requests, batch_size, limits)
235 async def run_batch(
236 self,
237 requests: Sequence[LLMRequest[TResult]],
238 batch_size: Optional[int] = None,
239 limits: Optional[RateLimits] = None,
240 ) -> List[TResult]:
241 rs = [LimitRequest(r, sum(self.estimate_tokens(m) for m in r.messages)) for r in requests]
--> 242 return await self._batch(self._run, rs, batch_size, limits)
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/llm/utils/wrapper.py:207, in LLMWrapper._batch(self, coro, batches, batch_size, limits)
204 rate.record(res.input_tokens, res.output_tokens)
205 return res.result
--> 207 return await asyncio.gather(*[work(batch) for batch in batches])
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/llm/utils/wrapper.py:203, in LLMWrapper._batch.<locals>.work(request)
201 async def work(request: LimitRequest[TBatchItem]) -> TBatchResult:
202 async with semaphore, rate_limiter.enter(request) as rate:
--> 203 res = await coro(request.request)
204 rate.record(res.input_tokens, res.output_tokens)
205 return res.result
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/llm/utils/wrapper.py:233, in LLMWrapper._run(self, request)
231 except Exception as e:
232 error = e
--> 233 raise error
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/llm/utils/wrapper.py:229, in LLMWrapper._run(self, request)
226 try:
227 response = await self.complete(request.messages)
228 return LLMResult(
--> 229 request.response_parser(response.result), response.input_tokens, response.output_tokens
230 )
231 except Exception as e:
232 error = e
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/llm/utils/templates.py:114, in PromptTemplate.get_parser.<locals>.parse(response, keys)
113 def parse(response: str, keys: Optional[List[str]] = None) -> Dict[str, Any]:
--> 114 parsed = output.parse_response(response)
115 if keys is not None and set(keys) != set(parsed.keys()):
116 raise LLMResponseParseError(f"Keys {keys} are required but got {list(parsed.keys())}", response)
File /opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/evidently/llm/utils/blocks.py:175, in JsonOutputFormatBlock.parse_response(self, response)
173 if sub is not None:
174 return sub
--> 175 raise LLMResponseParseError("Failed to parse response as json", response) from e
LLMResponseParseError: ('Failed to parse response as json', '{\n"category": "OK"\n"reasoning": "The text discusses a user losing their debit card and the steps to take with an app. While there might be some indirect reference to personal information (lost debit card), it does not contain any direct identifiers or quasi-identifiers that can uniquely identify an individual."\n}')
# prod_df_3.as_dataframe()
# my_eval
[No output generated]
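In the run above, the judge model returned JSON with a missing comma between "category" and "reasoning", so parsing failed and prod_df_3 was never created. Because LLM output can vary between calls, one generic mitigation is to retry the failing step a few times. The helper below is a minimal sketch of that idea using only the standard library: run_with_retries and flaky_judge are hypothetical names, and flaky_judge merely simulates a call that fails twice before succeeding.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], attempts: int = 3) -> T:
    """Call task() up to `attempts` times and return the first successful result."""
    last_error: Optional[Exception] = None
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception as error:  # narrow this to the parse error in real use
            last_error = error
            print(f"Attempt {attempt} failed: {error}")
    raise RuntimeError(f"All {attempts} attempts failed") from last_error


# Stand-in for an LLM-backed step: fails twice, then succeeds.
calls = {"n": 0}


def flaky_judge() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("Failed to parse response as json")
    return "OK"


result = run_with_retries(flaky_judge)
print(result, "after", calls["n"], "calls")
```

In the notebook, the task would be a function that builds the dataset with its descriptors. Retrying does not help if the model produces malformed JSON consistently; in that case a different judge model or a stricter prompt is the more likely fix.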
raw_dialog_data = prod_df_3.as_dataframe()
raw_dialog_data[(raw_dialog_data["Decline"] == "DECLINE") | (raw_dialog_data["PII"] == "PII")]
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[62], line 1
----> 1 raw_dialog_data = prod_df_3.as_dataframe()
2 raw_dialog_data[(raw_dialog_data["Decline"] == "DECLINE") | (raw_dialog_data["PII"] == "PII")]
NameError: name 'prod_df_3' is not defined
Want to play around?¶
Check the docs:
- Descriptor basic API: https://docs.evidentlyai.com/docs/library/descriptors
- Customizing LLM judges: https://docs.evidentlyai.com/metrics/customize_llm_judge
- Custom Python functions: https://docs.evidentlyai.com/metrics/customize_descriptor
- List of all descriptors: https://docs.evidentlyai.com/metrics/all_descriptors
Did you like this example? Star Evidently on GitHub to support the project: https://github.com/evidentlyai/evidently
Shutdown & Cleanup¶
Run the cell below to unload the Ollama model and shut down the kernel.
# === Unload Ollama Model & Shutdown Kernel ===
# Unloads the model from GPU memory before shutting down
try:
    import ollama
    print(f"Unloading Ollama model: {OLLAMA_LLM_MODEL}")
    ollama.generate(model=OLLAMA_LLM_MODEL, prompt="", keep_alive=0)
    print("Model unloaded from GPU memory")
except Exception as e:
    print(f"Model unload skipped: {e}")
# Shut down the kernel to fully release resources
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Unloading Ollama model: hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
Model unloaded from GPU memory
{'status': 'ok', 'restart': False}