LLM as a Judge with evidently.ai¶
Bazzite-AI Setup Required
Run D0_00_Bazzite_AI_Setup.ipynb first to configure Ollama and verify GPU access.
Attribution & License
This notebook is adapted from: evidentlyai/community-examples, licensed under the Apache License, Version 2.0. © Original authors.
Modifications: by Simeon Harrison/EuroCC Austria, © 2025.
#!pip install evidently[llm]
[No output generated]
import pandas as pd
import numpy as np
from evidently import Dataset
from evidently import DataDefinition
from evidently import Report
from evidently import BinaryClassification
from evidently.descriptors import *
from evidently.presets import TextEvals, ValueStats, ClassificationPreset, DataSummaryPreset
from evidently.metrics import *
from evidently.llm.templates import BinaryClassificationPromptTemplate
from evidently.sdk.models import PanelMetric
from evidently.sdk.panels import DashboardPanelPlot
from evidently.ui.workspace import CloudWorkspace
[No output generated]
# === Datasets Path ===
from pathlib import Path
DATASETS_DIR = Path("./datasets")
print(f"Datasets: {[f.name for f in DATASETS_DIR.glob('*.csv')]}")
Datasets: ['booking_queries_dataset.csv', 'code_review_dataset.csv', 'health_and_fitness_qna.csv']
In this tutorial, we will:
- Define the evaluation criteria for our LLM judge
- Build an LLM-as-a-Judge using different prompts/models
- Evaluate the quality of the judge by comparing its results to human labels
(Optional) Set up Evidently Cloud¶
Set up API keys for LLM judges:
# import os
# os.environ["OPENAI_API_KEY"] = "OPEN_AI_API_KEY"
# os.environ["ANTHROPIC_API_KEY"] = "ANTHROPIC_API_KEY"
[No output generated]
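If you do use a hosted judge, you can prompt for the key at runtime instead of hardcoding it in the notebook; a small optional sketch using Python's standard getpass module:
import getpass, os
# Read the key interactively so it never lands in the notebook file (hosted judges only; the local Ollama setup below needs no key)
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API key: ")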
Optional. Connect to Cloud and create a Project:
# ws = CloudWorkspace(token="YOUR_API_TOKEN", url="https://app.evidently.cloud")
[No output generated]
#project = ws.create_project("My project name", org_id="YOUR_ORG_ID")
#project.description = "My project description"
# or project = ws.get_project("PROJECT_ID")
[No output generated]
Prepare the dataset¶
We start with an expert-labeled dataset. We will use it as the ground truth for our LLM judge.
dataset_path = DATASETS_DIR / "code_review_dataset.csv"
review_dataset = pd.read_csv(dataset_path)
[No output generated]
Preview:
pd.set_option('display.max_colwidth', None)
review_dataset.head(10)
Generated review \
0 This implementation appears to work, but the approach used does not align with modern best practices. There are ways to make this more efficient.
1 Great job! Keep it up!
2 It would be advisable to think about modularity. Possibly revise?
3 You’ve structured the class very well, and the use of dependency injection is nicely done. One suggestion is to simplify the constructor - it has too many responsibilities. You might break it into helper methods. Otherwise, the overall structure is clean and easy to follow. Nice work!
4 Great job! This is clean and well-organized. The architecture is sound and everything is in its place. I don’t really have any feedback - just wanted to say this is excellent. Well done!
5 You’ve done a solid job here. The tests are comprehensive and the logic is sound. One thing I noticed is that error handling might be incomplete - could we add a case for null input? That would make it more robust. Overall, great submission.
6 There is too much complexity in this function. It is possible to simplify it by extracting the inner conditional blocks into separate helper methods. This would improve readability and maintainability. Functions that perform multiple tasks should be split to adhere to the single-responsibility principle.
7 The loop is functioning correctly, but it could be enhanced by using a forEach method instead of a traditional for loop. This would align better with idiomatic JavaScript. Additionally, variable naming inside the loop could be improved.
8 Excellent submission overall. Everything looks good. There is nothing to improve here. Great job once again. Code is very clean. No changes needed.
9 It would be more efficient to not mutate the state directly. This approach can lead to bugs. Immutability is generally better.
Expert label \
0 bad
1 bad
2 bad
3 good
4 bad
5 good
6 bad
7 good
8 bad
9 bad
Expert comment
0 The tone is slighly condescending, no actionable help.
1 Not actionable
2 there is a suggestion, but no real guidance
3 Good tone, actionable
4 Pure praise
5 want more like this
6 there is some subtance but too sounds too harsh
7 constructive and specific, but passive voice sounds a bit too passive aggressive
8 uncritical praise, offers no value for improvement
9 some truth in the suggestion, but the phrasing is blunt, reads accusatorily.
Create an Evidently dataset:
definition = DataDefinition(
text_columns=["Generated review", "Expert comment"],
categorical_columns=["Expert label"]
)
[No output generated]
eval_dataset = Dataset.from_pandas(
pd.DataFrame(review_dataset),
data_definition=definition)
[No output generated]
Preview the distribution of classes:
report = Report([
ValueStats(column="Expert label")
])
my_eval = report.run(eval_dataset)
my_eval
<evidently.core.report.Snapshot at 0x7f6dab6946e0>
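For a quick look without building a report, plain pandas gives the same class counts:
# Class balance of the expert labels
review_dataset["Expert label"].value_counts()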
Optional. Upload the source dataset to Evidently Cloud (uncomment once ws and project are set up above):
# ws.add_dataset(
#     dataset=eval_dataset,
#     name="source_dataset",
#     project_id=project.id,
#     description="Dataset with expert labels on review quality"
# )
Our goal: create an LLM judge that matches the human labels¶
Options:
- Split the criteria and judge each aspect separately (actionable / non-actionable, appropriate tone / inappropriate tone).
- Try to create a single good/bad judge. (It may be useful to introduce a borderline or "needs review" tag for subtle or new cases; see the sketch after this list.)
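One lightweight way to get a "needs review" bucket without changing the binary template is to treat the judge's uncertainty label as that bucket. A minimal sketch in plain pandas (the "LLM-judged quality" column and the "UNKNOWN" value come from the judge applied in Exp 1 below, so run this after that step):
# Flag rows the judge could not classify for human follow-up
# (uncertainty="unknown" in the template surfaces as "UNKNOWN" in the output column)
df = eval_dataset.as_dataframe()
needs_review = df[df["LLM-judged quality"] == "UNKNOWN"]
print(f"{len(needs_review)} review(s) flagged for human follow-up")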
Exp 1. Design the LLM judge - First try¶
For the tutorial flow, we'll keep the steps explicit and run 5 sequential experiments.
First attempt to create the judge:
import os
from typing import Dict, Any, List, Optional
from evidently.llm.utils.wrapper import OpenAIOptions, OpenAIWrapper, LLMMessage, LLMResult
# === Ollama Configuration via OpenAI-Compatible API ===
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
# === Model Configuration ===
HF_LLM_MODEL = "NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF"
OLLAMA_LLM_MODEL = f"hf.co/{HF_LLM_MODEL}:Q4_K_M"
OLLAMA_OPTIONS = OpenAIOptions(
api_key="ollama",
api_url=f"{OLLAMA_HOST}/v1"
)
# === Patch OpenAIWrapper for smart JSON mode detection ===
# Evidently's OpenAI wrapper doesn't enable JSON mode by default.
# This patch detects when JSON output is expected and enables it.
_original_openai_complete = OpenAIWrapper.complete
async def _json_aware_complete(self, messages: List[LLMMessage], seed: Optional[int] = None) -> LLMResult[str]:
import openai
from openai.types.chat.chat_completion import ChatCompletion
message_text = " ".join(m.content for m in messages if m.content)
needs_json = "json" in message_text.lower() or '"category"' in message_text
needs_xml = "<new_prompt>" in message_text
formatted_messages = [{"role": msg.role, "content": msg.content} for msg in messages]
try:
kwargs = {"model": self.model, "messages": formatted_messages, "seed": seed}
if needs_json and not needs_xml:
kwargs["response_format"] = {"type": "json_object"}
response: ChatCompletion = await self.client.chat.completions.create(**kwargs)
except openai.RateLimitError as e:
from evidently.llm.utils.wrapper import LLMRateLimitError
raise LLMRateLimitError(e.message) from e
except openai.APIError as e:
from evidently.llm.utils.wrapper import LLMRequestError
raise LLMRequestError(f"Failed to call OpenAI complete API: {e.message}", original_error=e) from e
content = response.choices[0].message.content
assert content is not None
if response.usage is None:
return LLMResult(content, 0, 0)
return LLMResult(content, response.usage.prompt_tokens, response.usage.completion_tokens)
OpenAIWrapper.complete = _json_aware_complete
print(f"Ollama host: {OLLAMA_HOST}")
print(f"Model: {OLLAMA_LLM_MODEL}")
print(f"Using OpenAI-compatible API with smart JSON mode detection")
Ollama host: http://ollama:11434
Model: hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
Using OpenAI-compatible API with smart JSON mode detection
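Before running the evaluations, it can help to confirm that the Ollama server is reachable and the model tag is available. A small optional check against Ollama's /api/tags endpoint (assumes the requests package is installed):
# List the models currently served by Ollama at OLLAMA_HOST
import requests
resp = requests.get(f"{OLLAMA_HOST}/api/tags", timeout=5)
print("Ollama reachable:", resp.ok)
print("Available models:", [m["name"] for m in resp.json().get("models", [])])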
# 1. Name the experiment
name = "naive_prompt"
# 2. Define LLM judge prompt template
feedback_quality = BinaryClassificationPromptTemplate(
pre_messages=[("system", "You are evaluating the quality of code reviews given to junior developers.")],
criteria = """A review is GOOD when it's actionable and constructive.
A review is BAD when it is non-actionable or overly critical.
""",
target_category="bad",
non_target_category="good",
uncertainty="unknown",
include_reasoning=True,
)
# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
pd.DataFrame(review_dataset),
data_definition=definition,
descriptors=[
LLMEval("Generated review",
template=feedback_quality,
provider="openai", # Use OpenAI provider with Ollama's OpenAI-compatible API
model=OLLAMA_LLM_MODEL,
alias="LLM-judged quality")
],
options=OLLAMA_OPTIONS
)
# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])
[No output generated]
#print(feedback_quality.get_template())
[No output generated]
LLM judgments. View all results locally:
eval_dataset.as_dataframe()
Generated review \
0 This implementation appears to work, but the approach used does not align with modern best practices. There are ways to make this more efficient.
1 Great job! Keep it up!
2 It would be advisable to think about modularity. Possibly revise?
3 You’ve structured the class very well, and the use of dependency injection is nicely done. One suggestion is to simplify the constructor - it has too many responsibilities. You might break it into helper methods. Otherwise, the overall structure is clean and easy to follow. Nice work!
4 Great job! This is clean and well-organized. The architecture is sound and everything is in its place. I don’t really have any feedback - just wanted to say this is excellent. Well done!
5 You’ve done a solid job here. The tests are comprehensive and the logic is sound. One thing I noticed is that error handling might be incomplete - could we add a case for null input? That would make it more robust. Overall, great submission.
6 There is too much complexity in this function. It is possible to simplify it by extracting the inner conditional blocks into separate helper methods. This would improve readability and maintainability. Functions that perform multiple tasks should be split to adhere to the single-responsibility principle.
7 The loop is functioning correctly, but it could be enhanced by using a forEach method instead of a traditional for loop. This would align better with idiomatic JavaScript. Additionally, variable naming inside the loop could be improved.
8 Excellent submission overall. Everything looks good. There is nothing to improve here. Great job once again. Code is very clean. No changes needed.
9 It would be more efficient to not mutate the state directly. This approach can lead to bugs. Immutability is generally better.
10 It's great to see error handling implemented, but it is incomplete. You may want to consider covering edge cases like null or undefined values. Another potential improvement is logging the error message for better observability.
11 Consider introducing a constant for the hardcoded value 500. This would make the code more maintainable and readable. Also, using descriptive naming for the constant would clarify its purpose.
12 This logic seems overly imperative. I would suggest a declarative approach to enhance readability. You could refactor this using array methods like map and filter. This would simplify the flow and reduce side effects.
13 Great job!
14 Good implementation! One idea to consider: instead of checking for null manually each time, you could create a utility function to centralize the check. That might simplify repeated logic and reduce potential for missed cases.
15 The use of constants here is helpful. Just a minor note: grouping them into a dedicated config or constants file might improve organization, especially as the codebase grows. This can help with reusability too.
16 Thanks for adding tests, those are really helpful. You might want to add one more case for when the input array is empty. That would improve coverage and make the function more robust against edge inputs.
17 Looks solid! One thing to think about is breaking the validateInput() method into smaller parts - it currently handles multiple concerns. Splitting it might make it easier to test and maintain.
18 The caching logic works nicely. Possible improvement: documenting the TTL (time-to-live) value and its rationale could help future readers understand the intent and avoid misconfiguration.
19 Good use of async here. One thing to watch for - make sure all async calls are being awaited where necessary, especially inside loops. You might double-check the loop in processData() just to be safe.
20 The abstraction is working well here. As a possible improvement, consider whether the Handler interface could be renamed for more clarity about its role - it might be too generic for this context.
21 You've handled the edge case for missing inputs well. To take it a step further, you could log a warning when this happens to improve observability in production. That might help with debugging.
22 The response formatting is consistent and readable. One additional thought - adding a comment on why certain headers are included might prevent confusion for others unfamiliar with the API expectations.
23 Great review! Really nice work.
24 The approach taken here is functional, although it might benefit from certain improvements related to structure or logic. You could explore refactoring possibilities.
25 There’s a potential opportunity to enhance this section of code for better performance, although it works fine as-is. Consider whether an update would be beneficial.
26 This part could be optimized, though it’s not strictly necessary. You may want to consider a more modern way of expressing the logic.
27 This structure is slightly unconventional. You could consider refactoring this step.
28 The naming here is a bit nonstandard. It might be better to use more readable terms.
29 Good job on this function. You could maybe simplify it a bit, though I understand if there’s a reason you kept it like this.
30 Overall this is well written. Just be mindful of potential issues with larger datasets, though it may not be a concern here.
31 This solution is functional. There may be stylistic adjustments worth considering, depending on your team’s conventions.
32 Looks good! Possibly worth reviewing the loop conditions one more time to ensure they behave as expected.
33 Nicely done. There might be a more idiomatic way to do this in your language of choice, but this works well.
34 Nice job here! One improvement to consider would be extracting the retry logic into a separate function. That would help isolate concerns and make the flow easier to test.
35 The structure is clean and easy to follow. A minor suggestion—consider adding a brief comment explaining the fallback logic in the error handler. That could help future maintainers.
36 This looks good! One small optimization might be to cache the result of getConfig() since it's called multiple times.
37 The abstraction is helpful and reusable. Just double-check whether config.options is always defined before accessing options.timeout, to avoid potential runtime issues.
38 Looks great! A possible enhancement could be logging unexpected input shapes—could help with debugging in production without affecting control flow.
39 The implementation of pagination is solid. One minor thing to review is whether pageSize defaults are clearly handled if undefined—might be worth an explicit fallback.
40 The use of early returns here simplifies the logic nicely. To make it even better, adding a short inline comment to explain the special-case condition could improve readability.
41 This works well overall. To make the validation logic more reusable, you might move it into a utility module. That could help if similar checks are needed elsewhere.
42 Great job! One thing to consider is breaking this into smaller pieces. Great job again!
43 This function is quite messy and hard to follow, but overall, great work! You clearly put effort into it.
44 You did a great job here!
45 Absolutely! I can review the code you wrote. In this function
46 The implementation has several issues around control flow and readability. That said, great job overall!
47 There’s quite a bit of duplication in this logic that might lead to bugs. But other than that, this is a great working implementation!
48 This method aligns with the functional requirements, though further alignment with idiomatic standards could enhance maintainability in cross-functional scenarios.
49 Not sure if this matters, but I noticed that config might be null in some cases. Just flagging it in case it causes issues later.
Expert label \
0 bad
1 bad
2 bad
3 good
4 bad
5 good
6 bad
7 good
8 bad
9 bad
10 good
11 good
12 good
13 bad
14 good
15 good
16 good
17 good
18 good
19 good
20 good
21 good
22 good
23 bad
24 bad
25 bad
26 bad
27 bad
28 bad
29 bad
30 bad
31 bad
32 bad
33 bad
34 good
35 good
36 good
37 good
38 good
39 good
40 good
41 good
42 bad
43 bad
44 bad
45 bad
46 bad
47 bad
48 bad
49 bad
Expert comment \
0 The tone is slighly condescending, no actionable help.
1 Not actionable
2 there is a suggestion, but no real guidance
3 Good tone, actionable
4 Pure praise
5 want more like this
6 there is some subtance but too sounds too harsh
7 constructive and specific, but passive voice sounds a bit too passive aggressive
8 uncritical praise, offers no value for improvement
9 some truth in the suggestion, but the phrasing is blunt, reads accusatorily.
10 Strong suggestion and tactful tone, a solid example of effective LLM feedback.
11 could be a bit more positive but overall good comment and neutral
12 neutral and actionable
13 not actionable
14 friendly and constructive
15 good, includes a suggestion without sounding critical
16 encouraging and practical
17 good, non-demanding suggestion
18 NaN
19 NaN
20 NaN
21 NaN
22 NaN
23 adds nothing
24 vague, suggests improvement without naming a single concrete step
25 passive and unclear
26 sounds like helpful feedback, but gives no concrete ideas
27 not concrete
28 i don't want it to pick on names, this adds nothing
29 hedged feedback, doesn’t say how to simplify
30 raises a theoretical problem without any evidence or action point
31 safe, vague feedback, looks like a filler
32 suggests doubt with no basis or detail.
33 suggests change but offers no alternative or reasoning, sounds dismissive
34 NaN
35 NaN
36 NaN
37 NaN
38 NaN
39 NaN
40 i like the constructive comment
41 neutral
42 too positive and repetetive
43 too positive and inconsistent (praise and critique at the same time)
44 non actionable
45 partial answer, part of LLM response
46 NaN
47 strange combo of praise and critique
48 too formal and abstract
49 too apologetic in tone
LLM-judged quality \
0 good
1 good
2 good
3 good
4 good
5 good
6 good
7 good
8 UNKNOWN
9 good
10 good
11 good
12 good
13 good
14 good
15 good
16 good
17 good
18 good
19 good
20 good
21 good
22 good
23 good
24 good
25 good
26 good
27 good
28 good
29 good
30 good
31 good
32 good
33 good
34 good
35 good
36 good
37 good
38 good
39 good
40 good
41 good
42 good
43 good
44 good
45 good
46 good
47 good
48 good
49 good
LLM-judged quality reasoning \
0 The review points out a potential inefficiency and provides constructive feedback.
1 Complimentary and encouraging feedback, providing a positive reinforcement to the junior developer.
2 The review suggests an improvement by considering modularity, which is constructive and actionable.
3 The review provides constructive feedback by suggesting a specific improvement (simplify the constructor), while also acknowledging the well-structured class and the use of dependency injection. This helps the developer understand the next steps to improve their code without feeling overwhelmed or discouraged.
4 The review is actionable and constructive. It provides praise for the work, states that it's clean and well-organized, acknowledges the sound architecture, and encourages the developer with a well-done message.
5 The review provides actionable feedback in the form of suggesting to add error handling for null input, and overall praises the submission.
6 The review provides specific suggestions to improve the code, identifies a principle to adhere to, and explains why it's beneficial. It is constructive and actionable.
7 The comment provides specific suggestions for improvement, making it actionable and constructive.
8 The review doesn't provide any specific feedback, actionable suggestions, or guidance for improving the code. As a result, it's not possible to classify the review as either good or bad.
9 The review is actionable, providing a suggestion to use immutability and explaining the benefits and potential issues with mutating the state directly.
10 The review provides actionable suggestions for improvement, such as covering edge cases and better error handling. It is constructive and specifically points out areas that could be enhanced.
11 Actionable and constructive feedback, suggesting to improve maintainability and readability by using a constant for hardcoded value.
12 The review provides actionable suggestions by mentioning a declarative approach and suggesting specific array methods to refactor the code. It also highlights potential benefits, such as improved readability and reduced side effects.
13 The text is constructive and provides positive recognition to the developer.
14 The review is actionable because it provides a suggestion to improve code organization and prevent potential errors. It is also constructive by offering a potential solution (creating a utility function) for the identified issue.
15 The text provides a constructive suggestion for improving organization and increasing reusability as the codebase grows.
16 The review is actionable since it suggests adding a test case for an edge input, and constructive because it helps to improve the function's robustness.
17 The review provides a specific suggestion to improve the code (breaking the validateInput() method into smaller parts) and is presented in an actionable and constructive manner.
18 The given feedback is actionable and constructive, pointing out a possible improvement (documentation) without being overly critical.
19 The feedback is actionable and constructive, providing a specific suggestion to ensure all async calls are being awaited where necessary.
20 The text is providing a clear suggestion for improvement, which makes it actionable and constructive.
21 It's actionable because it provides a suggestion to improve the code's observability, and it's constructive in that it helps with debugging.
22 The review is constructive and provides a suggestion for improvement. It is actionable, as it suggests adding a comment to clarify the purpose of certain headers.
23 The review text is positive and encourages the junior developer, which is a constructive approach.
24 The review provides constructive feedback by mentioning possible improvements related to structure and logic, as well as suggesting refactoring possibilities. It is actionable for the developer to work on these areas.
25 The review provides a suggestion for potential improvement, which is actionable. It also acknowledges that the code works fine as-is, making it constructive.
26 The feedback is actionable and constructive, suggesting an optimized approach or considering a more modern way to express the logic. It encourages improvement without harsh criticism.
27 The review is actionable and constructive, indicating a possible improvement by refactoring the step.
28 The feedback is actionable and constructive. The reviewer suggests using more readable terms for naming, which would improve the code's understandability.
29 The review is constructive and provides a suggestion for improvement, while also acknowledging the developer's reasoning.
30 The text is constructive, actionable and provides valuable feedback without being overly critical.
31 The review is actionable and constructive, providing a suggestion to consider the team's conventions for stylistic adjustments. This helps junior developers improve their code while allowing room for individual team preferences.
32 The review provides specific feedback by mentioning loop conditions, suggesting a possible issue, and encouraging the junior developer to review them before moving forward.
33 The review is actionable and constructive. It points out potential improvements without being overly critical.
34 The review is actionable and provides constructive feedback, suggesting a specific improvement that would help isolate concerns and make the code easier to test.
35 The text provides a clear and constructive suggestion for improving the code, which is actionable and helps future maintainers. It is neither non-actionable nor overly critical.
36 The review is actionable by suggesting an optimization and provides constructive feedback to improve the code.
37 The review provided is actionable and constructive. The reviewer identifies an issue and suggests a specific solution to address it.
38 The review provides a constructive enhancement suggestion helping with debugging in production, making it a GOOD review.
39 The review is actionable and constructive, providing a specific suggestion for improvement by handling pageSize defaults explicitly.
40 The review is actionable, providing a suggestion to improve readability with a short inline comment. It also acknowledges the simplification contributed by early returns, making it constructive.
41 The text provides actionable feedback with a reusability suggestion and explains its potential benefits, making it constructive.
42 The text provides constructive feedback and is actionable, acknowledging the junior developer's work with positive remarks. It suggests an improvement while encouraging them.
43 The review acknowledges the messiness and difficulty to follow of the function but commends the effort put into it. While critiquing, it also provides encouragement, making it constructive.
44 The phrase 'You did a great job here!' is actionable and constructive, which makes it a good code review. It provides positive feedback on the work done.
45 It's encouraging and offers to review the code, although it doesn't address any specific issues.
46 The review acknowledges the issues but also gives praise, which encourages the junior developer. Making it more likely to be constructive and actionable.
47 The review provides both positive feedback and identifies an area for improvement, making it constructive and actionable.
48 The review is actionable because it provides a suggestion to improve 'maintainability in cross-functional scenarios' by aligning with idiomatic standards. It is also constructive as it helps the junior developer identify an area of improvement.
49 The review provides a potential concern and suggests an improvement, making it actionable and constructive.
Judge_alignment
0 False
1 False
2 False
3 True
4 False
5 True
6 False
7 True
8 False
9 False
10 True
11 True
12 True
13 False
14 True
15 True
16 True
17 True
18 True
19 True
20 True
21 True
22 True
23 False
24 False
25 False
26 False
27 False
28 False
29 False
30 False
31 False
32 False
33 False
34 True
35 True
36 True
37 True
38 True
39 True
40 True
41 True
42 False
43 False
44 False
45 False
46 False
47 False
48 False
49 False
Report. Let's summarize:
report = Report([
TextEvals()
])
my_eval = report.run(eval_dataset)
my_eval
<evidently.core.report.Snapshot at 0x7f6da9db2ad0>
Classification quality. This helper runs the classification report to evaluate the LLM judge quality and, if a Cloud workspace is provided, uploads the tagged run to Evidently Cloud.
def run_classification_report(eval_dataset, name=None, cloud_ws=None, project_id=None):
df = eval_dataset.as_dataframe()
# Filter out UNKNOWN predictions
df_filtered = df[df["LLM-judged quality"] != "UNKNOWN"].copy()
# Normalize case so judge labels match the lowercase expert labels (good/bad)
df_filtered["LLM-judged quality"] = df_filtered["LLM-judged quality"].str.lower()
# Set the classification Data Definition
definition_class = DataDefinition(
classification=[BinaryClassification(
target="Expert label",
prediction_labels="LLM-judged quality",
pos_label="bad"
)],
categorical_columns=["Expert label", "LLM-judged quality"]
)
# Create a Dataset object
eval_data = Dataset.from_pandas(df_filtered, data_definition=definition_class)
# Build classification report
report = Report([
ClassificationPreset(),
ValueStats("LLM-judged quality"),
ValueStats("Expert label")
])
# Apply tag(s)
tags = [name] if name else []
my_eval = report.run(eval_data, tags=tags)
# Optional: upload to Evidently Cloud
if cloud_ws and project_id:
cloud_ws.add_run(project_id, my_eval, include_data=True)
return my_eval
[No output generated]
(See all Evidently Metrics and Presets: https://docs.evidentlyai.com/metrics/all_metrics)
Run the function to evaluate the LLM judge quality:
my_eval = run_classification_report(
eval_dataset,
name=name,
#cloud_ws=ws, #Optional
#project_id=project.id #Optional
)
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
You can also preview the classification report locally:
my_eval
<evidently.core.report.Snapshot at 0x7f6da0257b10>
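As a quick sanity check alongside the report, the judge/expert agreement can also be computed directly with pandas:
# Share of rows where the judge matches the expert label (UNKNOWN predictions excluded)
df = eval_dataset.as_dataframe()
labeled = df[df["LLM-judged quality"] != "UNKNOWN"]
agreement = (labeled["LLM-judged quality"].str.lower() == labeled["Expert label"]).mean()
print(f"Judge/expert agreement: {agreement:.0%} on {len(labeled)} rows")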
Exp 2. Try another prompt¶
Let's try writing a more detailed prompt:
# 1. Name the experiment <- new name
name = "detailed_prompt"
# 2. Define LLM judge prompt template <- new prompt
feedback_quality_2 = BinaryClassificationPromptTemplate(
pre_messages=[("system", "You are evaluating the quality of code reviews given to junior developers.")],
criteria="""
A review is **GOOD** if it is actionable and constructive. It should:
- Offer clear, specific suggestions or highlight issues in a way that the developer can address
- Be respectful and encourage learning or improvement
- Use professional, helpful language—even when pointing out problems
A review is **BAD** if it is non-actionable or overly critical. For example:
- It may be vague, generic, or hedged to the point of being unhelpful
- It may focus on praise only, without offering guidance
- It may sound dismissive, contradictory, harsh, or robotic
- It may raise a concern but fail to explain what should be done
""",
target_category="bad",
non_target_category="good",
uncertainty="unknown",
include_reasoning=True,
)
# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
pd.DataFrame(review_dataset),
data_definition=definition,
descriptors=[
LLMEval("Generated review",
template=feedback_quality_2,
provider="openai", # Use OpenAI provider with Ollama's OpenAI-compatible API
model=OLLAMA_LLM_MODEL,
alias="LLM-judged quality")
],
options=OLLAMA_OPTIONS
)
# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])
[No output generated]
Evaluate the LLM judge quality:
my_eval = run_classification_report(
eval_dataset,
name=name,
#cloud_ws=ws, #Optional
#project_id=project.id #Optional
)
[No output generated]
my_eval
<evidently.core.report.Snapshot at 0x7f6da3be0b00>
Exp 3. Can we make it better?¶
# 1. Name the experiment <- new name
name = "detailed_prompt_think_better"
# 2. Define LLM judge prompt template <- new prompt
feedback_quality_3 = BinaryClassificationPromptTemplate(
pre_messages=[
("system", "You are evaluating the quality of code reviews given to junior developers.")],
criteria="""
A review is **GOOD** if it is actionable and constructive. It should:
- Offer clear, specific suggestions or highlight issues in a way that the developer can address
- Be respectful and encourage learning or improvement
- Use professional, helpful language—even when pointing out problems
A review is **BAD** if it is non-actionable or overly critical. For example:
- It may be vague, generic, or hedged to the point of being unhelpful
- It may focus on praise only, without offering guidance
- It may sound dismissive, contradictory, harsh, or robotic
- It may raise a concern but fail to explain what should be done
Always explain your reasoning.
""",
target_category="bad",
non_target_category="good",
uncertainty="unknown",
include_reasoning=True,
)
# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
pd.DataFrame(review_dataset),
data_definition=definition,
descriptors=[
LLMEval("Generated review",
template=feedback_quality_3,
provider="openai", # Use OpenAI provider with Ollama's OpenAI-compatible API
model=OLLAMA_LLM_MODEL,
alias="LLM-judged quality")
],
options=OLLAMA_OPTIONS
)
# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])
[No output generated]
my_eval = run_classification_report(
eval_dataset,
name=name,
#cloud_ws=ws, #Optional
#project_id=project.id #Optional
)
[No output generated]
my_eval
<evidently.core.report.Snapshot at 0x7f6da3be36f0>
Exp 4. Try a different model¶
Can a cheaper, simpler model perform as well? (The source notebook swapped in a cheaper hosted "Turbo" model at this step; in this local adaptation the cell below re-runs the same Ollama model, so point the judge at another model tag if you want a real comparison; see the sketch below.)
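To make this an actual comparison, point the judge at a different model served by Ollama. A sketch, assuming a smaller model such as qwen2.5:3b has been pulled on the Ollama host (hypothetical tag; substitute any model you have locally):
# Hypothetical smaller judge model; pull it first, e.g. `ollama pull qwen2.5:3b`
SMALL_LLM_MODEL = "qwen2.5:3b"

eval_dataset_small = Dataset.from_pandas(
    pd.DataFrame(review_dataset),
    data_definition=definition,
    descriptors=[
        LLMEval("Generated review",
                template=feedback_quality_3,   # reuse the detailed prompt from Exp 3
                provider="openai",
                model=SMALL_LLM_MODEL,
                alias="LLM-judged quality")
    ],
    options=OLLAMA_OPTIONS
)
Add the ExactMatch descriptor and call run_classification_report on it, as in the other experiments, to compare the two judges.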
# 1. Name the experiment <- new name
name = "ollama_local"
# 2. Define LLM judge prompt template
feedback_quality_3 = BinaryClassificationPromptTemplate(
pre_messages=[
("system", "You are evaluating the quality of code reviews given to junior developers.")],
criteria="""
A review is **GOOD** if it is actionable and constructive. It should:
- Offer clear, specific suggestions or highlight issues in a way that the developer can address
- Be respectful and encourage learning or improvement
- Use professional, helpful language—even when pointing out problems
A review is **BAD** if it is non-actionable or overly critical. For example:
- It may be vague, generic, or hedged to the point of being unhelpful
- It may focus on praise only, without offering guidance
- It may sound dismissive, contradictory, harsh, or robotic
- It may raise a concern but fail to explain what should be done
Always explain your reasoning.
""",
target_category="bad",
non_target_category="good",
uncertainty="unknown",
include_reasoning=True,
)
# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
pd.DataFrame(review_dataset),
data_definition=definition,
descriptors=[
LLMEval("Generated review",
template=feedback_quality_3,
provider="openai", # Use OpenAI provider with Ollama's OpenAI-compatible API
model=OLLAMA_LLM_MODEL,
alias="LLM-judged quality")
],
options=OLLAMA_OPTIONS
)
# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])
[No output generated]
my_eval = run_classification_report(
eval_dataset,
name=name,
#cloud_ws=ws, #Optional
#project_id=project.id #Optional
)
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
my_eval
<evidently.core.report.Snapshot at 0x7f6da3bee9f0>
Exp 5. Final run¶
A final re-run of the chosen prompt and model, logged under a new experiment name (ollama_final).
# 1. Name the experiment <- new name
name = "ollama_final"
# 2. Define LLM judge prompt template
feedback_quality_3 = BinaryClassificationPromptTemplate(
pre_messages=[
("system", "You are evaluating the quality of code reviews given to junior developers.")],
criteria="""
A review is **GOOD** if it is actionable and constructive. It should:
- Offer clear, specific suggestions or highlight issues in a way that the developer can address
- Be respectful and encourage learning or improvement
- Use professional, helpful language—even when pointing out problems
A review is **BAD** if it is non-actionable or overly critical. For example:
- It may be vague, generic, or hedged to the point of being unhelpful
- It may focus on praise only, without offering guidance
- It may sound dismissive, contradictory, harsh, or robotic
- It may raise a concern but fail to explain what should be done
Always explain your reasoning.
""",
target_category="bad",
non_target_category="good",
uncertainty="unknown",
include_reasoning=True,
)
# 3. Apply the LLM judge
eval_dataset = Dataset.from_pandas(
pd.DataFrame(review_dataset),
data_definition=definition,
descriptors=[
LLMEval("Generated review",
template=feedback_quality_3,
provider="openai", # Use OpenAI provider with Ollama's OpenAI-compatible API
model=OLLAMA_LLM_MODEL,
alias="LLM-judged quality")
],
options=OLLAMA_OPTIONS
)
# 4. Add TRUE/FALSE for judge alignment
eval_dataset.add_descriptors([
ExactMatch(columns=["LLM-judged quality", "Expert label"], alias="Judge_alignment")
])
[No output generated]
my_eval = run_classification_report(
eval_dataset,
name=name,
#cloud_ws=ws, #Optional
#project_id=project.id #Optional
)
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 due to no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
/opt/pixi/.pixi/envs/default/lib/python3.13/site-packages/sklearn/metrics/_classification.py:1833: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
# === Unload Ollama Model & Shutdown Kernel ===
# Unloads the model from GPU memory before shutting down
try:
import ollama
print(f"Unloading Ollama model: {OLLAMA_LLM_MODEL}")
ollama.generate(model=OLLAMA_LLM_MODEL, prompt="", keep_alive=0)
print("Model unloaded from GPU memory")
except Exception as e:
print(f"Model unload skipped: {e}")
# Shut down the kernel to fully release resources
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(restart=False)
Unloading Ollama model: hf.co/NousResearch/Nous-Hermes-2-Mistral-7B-DPO-GGUF:Q4_K_M
Model unloaded from GPU memory
{'status': 'ok', 'restart': False}