Evaluating LLMs has emerged as a pivotal challenge in advancing the reliability and utility of artificial intelligence across both academic and industrial settings. As the capabilities of these models expand, so too does the need for rigorous, reproducible, and multi-faceted evaluation methodologies. In this tutorial, we provide a comprehensive examination of one of the field’s most critical frontiers: systematically evaluating the strengths and limitations of LLMs across various dimensions of performance. Using Google’s Generative AI models as the systems under evaluation and the LangChain library as our orchestration tool, we present a robust and modular evaluation pipeline tailored for implementation in Google Colab. This framework integrates criterion-based scoring, encompassing correctness, relevance, coherence, and conciseness, with pairwise model comparisons and rich visual analytics to deliver nuanced and actionable insights. Grounded in a curated question set paired with ground-truth reference answers, this approach balances quantitative rigor with practical adaptability, offering researchers and developers a ready-to-use, extensible toolkit for high-fidelity LLM evaluation.
!pip install langchain langchain-google-genai ragas pandas matplotlib
We install the key Python libraries for building and running AI-powered evaluation workflows: LangChain for orchestrating LLM interactions (with the langchain-google-genai extension for Google’s generative AI models), Ragas for evaluating retrieval-augmented generation pipelines, and pandas plus matplotlib for data manipulation and visualization.
import os
import pandas as pd
import matplotlib.pyplot as plt
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain
from langchain.evaluation import load_evaluator
from langchain.schema import HumanMessage
We import core Python utilities, including os for environment management, pandas for handling DataFrames, and matplotlib.pyplot for plotting, alongside LangChain’s Google Generative AI client, prompt templating, chain construction, evaluator loader, and the HumanMessage schema, to build and assess conversational LLM pipelines.
os.environ["GOOGLE_API_KEY"] = "Use Your API Key"
Here we configure the environment by storing the Google API key in the GOOGLE_API_KEY environment variable, which lets the LangChain Google Generative AI client authenticate requests. Replace the placeholder with your own key before running the notebook.
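Hardcoding the key works for a quick demo, but in Colab it is safer to read it at runtime so it never lands in the notebook file. Below is a minimal sketch you could use instead of the line above; getpass is part of the standard library, and the commented-out alternative assumes you have added a secret named GOOGLE_API_KEY in Colab's Secrets panel.

import os
from getpass import getpass

# Prompt for the key at runtime so it is not stored in the notebook source.
if not os.environ.get("GOOGLE_API_KEY"):
    os.environ["GOOGLE_API_KEY"] = getpass("Enter your Google API key: ")

# Alternatively, in Colab, read it from the Secrets panel
# (assumes a secret named GOOGLE_API_KEY has been created there):
# from google.colab import userdata
# os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")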
def create_evaluation_dataset():
    """Create a simple dataset for evaluation."""
    questions = [
        "Explain the concept of quantum computing in simple terms.",
        "How does a neural network learn?",
        "What are the main differences between SQL and NoSQL databases?",
        "Explain how blockchain technology works.",
        "What is the difference between supervised and unsupervised learning?"
    ]
    ground_truth = [
        "Quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously, unlike classical bits. This allows quantum computers to process certain types of information much faster than classical computers for specific problems.",
        "Neural networks learn through a process called backpropagation where they adjust the weights between neurons based on the error between predicted and actual outputs, gradually minimizing this error through many iterations of training data.",
        "SQL databases are relational with structured schemas, fixed tables, and use SQL for queries. NoSQL databases are non-relational, schema-flexible, and designed for specific data models like document, key-value, wide-column, or graph formats.",
        "Blockchain is a distributed ledger technology where data is stored in blocks that are linked cryptographically. Each block contains transaction data and a timestamp, creating an immutable chain. Consensus mechanisms verify transactions without central authority.",
        "Supervised learning uses labeled data where the algorithm learns to predict outputs based on input-output pairs. Unsupervised learning works with unlabeled data to find patterns or structures without predefined outputs."
    ]
    return pd.DataFrame({"question": questions, "ground_truth": ground_truth})
We construct a small evaluation DataFrame by pairing five example questions on AI and database concepts with their corresponding ground‑truth answers, making it easy to benchmark an LLM’s responses against predefined correct outputs.
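The same two-column structure makes it easy to swap in your own benchmark. As a sketch, assuming a hypothetical my_eval_set.csv with "question" and "ground_truth" columns (and reusing the pandas import from above), a drop-in loader could look like this:

def load_evaluation_dataset(path="my_eval_set.csv"):
    """Load a custom question/ground-truth set from a CSV file (hypothetical file name)."""
    df = pd.read_csv(path)
    # Keep only the two columns the rest of the pipeline expects.
    return df[["question", "ground_truth"]]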
def setup_models():
    """Set up different Google Generative AI models for comparison."""
    models = {
        "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
        "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0)
    }
    return models
Now, this function instantiates two zero‑temperature ChatGoogleGenerativeAI clients, one using the lightweight “gemini‑2.0‑flash‑lite” model and the other the full “gemini‑2.0‑flash” model, so you can easily compare their outputs side‑by‑side.
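Because the rest of the pipeline only iterates over this dictionary, adding another candidate is a one-line change. A sketch, assuming you also want a higher-temperature variant of the same model for comparison (the entries shown are illustrative; any model name must be available to your API key):

def setup_models_extended():
    """Variant of setup_models() with an extra, higher-temperature candidate."""
    return {
        "gemini-2.0-flash-lite": ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0),
        "gemini-2.0-flash": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0),
        # Same model with non-deterministic sampling, to see how temperature affects scores.
        "gemini-2.0-flash (t=0.7)": ChatGoogleGenerativeAI(model="gemini-2.0-flash", temperature=0.7),
    }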
def generate_responses(models, dataset):
    """Generate responses from each model for the questions in the dataset."""
    responses = {}
    for model_name, model in models.items():
        model_responses = []
        for question in dataset["question"]:
            try:
                response = model.invoke([HumanMessage(content=question)])
                model_responses.append(response.content)
            except Exception as e:
                print(f"Error with model {model_name} on question: {question}")
                print(f"Error: {e}")
                model_responses.append("Error generating response")
        responses[model_name] = model_responses
    return responses
This function loops through each configured model and each question in the dataset, invokes the model to generate a response, catches any errors (logging them and inserting a placeholder), and returns a dictionary mapping each model’s name to its list of generated answers.
def evaluate_responses(models, dataset, responses):
    """Evaluate model responses using different evaluation criteria."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
    reference_criteria = ["correctness"]
    reference_free_criteria = [
        "relevance",
        "coherence",
        "conciseness"
    ]
    results = {model_name: {criterion: [] for criterion in reference_criteria + reference_free_criteria}
               for model_name in models.keys()}
    for criterion in reference_criteria:
        evaluator = load_evaluator("labeled_criteria", criteria=criterion, llm=evaluator_model)
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                ground_truth = dataset["ground_truth"][i]
                response = responses[model_name][i]
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        reference=ground_truth,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)
    for criterion in reference_free_criteria:
        evaluator = load_evaluator("criteria", criteria=criterion, llm=evaluator_model)
        for model_name in models.keys():
            for i, question in enumerate(dataset["question"]):
                response = responses[model_name][i]
                if response != "Error generating response":
                    eval_result = evaluator.evaluate_strings(
                        prediction=response,
                        input=question
                    )
                    normalized_score = float(eval_result.get('score', 0)) * 2
                    results[model_name][criterion].append(normalized_score)
                else:
                    results[model_name][criterion].append(0)
    return results
This function leverages a “gemini‑2.0‑flash‑lite” evaluator to score each model’s answers on both reference‑based correctness and reference‑free criteria (relevance, coherence, conciseness), rescales the evaluators’ binary 0/1 scores to a 0–2 range, and returns a nested dictionary mapping each model and criterion to its list of evaluation results.
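To make the scoring step concrete, here is what a single call to one of these evaluators looks like in isolation. This is a sketch using LangChain’s standard criteria evaluators (reusing the imports from above); they typically return a dictionary with 'reasoning', 'value' (Y/N), and a binary 'score' of 1 or 0, though the exact fields can vary across LangChain versions.

evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
correctness_eval = load_evaluator("labeled_criteria", criteria="correctness", llm=evaluator_model)

result = correctness_eval.evaluate_strings(
    prediction="A qubit can be in a superposition of 0 and 1, which helps with certain problems.",
    reference="Quantum computing uses quantum bits or qubits that can exist in multiple states simultaneously.",
    input="Explain the concept of quantum computing in simple terms."
)
# Typically something like {'reasoning': '...', 'value': 'Y', 'score': 1},
# which the pipeline above rescales with float(result.get('score', 0)) * 2.
print(result)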
def calculate_average_scores(evaluation_results):
    """Calculate average scores for each model and criterion."""
    avg_scores = {}
    for model_name, criteria in evaluation_results.items():
        avg_scores[model_name] = {}
        for criterion, scores in criteria.items():
            if scores:
                avg_scores[model_name][criterion] = sum(scores) / len(scores)
            else:
                avg_scores[model_name][criterion] = 0
        all_scores = [score for criterion_scores in criteria.values() for score in criterion_scores if score is not None]
        if all_scores:
            avg_scores[model_name]["overall"] = sum(all_scores) / len(all_scores)
        else:
            avg_scores[model_name]["overall"] = 0
    return avg_scores
This function processes the nested evaluation results to compute the mean score for each criterion across all questions for every model. Also, it calculates an overall average by pooling all individual metric scores. The returned dictionary maps each model to its per‑criterion averages and an aggregated “overall” performance score.
def visualize_results(avg_scores):
    """Visualize evaluation results with bar charts."""
    models = list(avg_scores.keys())
    criteria = list(avg_scores[models[0]].keys())
    plt.figure(figsize=(14, 8))
    bar_width = 0.8 / len(models)
    positions = range(len(criteria))
    for i, model in enumerate(models):
        model_scores = [avg_scores[model][criterion] for criterion in criteria]
        plt.bar([p + i * bar_width for p in positions], model_scores,
                width=bar_width, label=model)
    plt.xlabel('Evaluation Criteria', fontsize=12)
    plt.ylabel('Average Score (0-2)', fontsize=12)
    plt.title('LLM Model Comparison by Evaluation Criteria', fontsize=14)
    plt.xticks([p + bar_width * (len(models) - 1) / 2 for p in positions], criteria)
    plt.legend()
    plt.grid(axis="y", linestyle="--", alpha=0.7)
    plt.tight_layout()
    plt.show()
    plt.figure(figsize=(10, 8))
    categories = [c for c in criteria if c != 'overall']
    N = len(categories)
    angles = [n / float(N) * 2 * 3.14159 for n in range(N)]
    angles += angles[:1]
    plt.polar(angles, [0] * (N + 1))
    plt.xticks(angles[:-1], categories)
    for model in models:
        values = [avg_scores[model][c] for c in categories]
        values += values[:1]
        plt.polar(angles, values, label=model)
    plt.legend(loc="upper right")
    plt.title('LLM Model Comparison - Radar Chart', fontsize=14)
    plt.tight_layout()
    plt.show()
This function creates side-by-side bar charts to compare each model’s average scores across all evaluation criteria. Then it renders a radar chart to visualize their performance profiles, enabling quick identification of relative strengths and weaknesses.
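If you want the charts to persist beyond the Colab session (for example, to attach them to a report), you can write each figure to disk before displaying it. A small sketch, assuming you add the call just before the corresponding plt.show() inside visualize_results; the file name and dpi are arbitrary choices:

# Save the current figure to disk before rendering it in the notebook.
plt.savefig("llm_criteria_comparison.png", dpi=150, bbox_inches="tight")
plt.show()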
def main():
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
    print("Setting up models...")
    models = setup_models()
    print("Generating responses...")
    responses = generate_responses(models, dataset)
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f" {criterion}: {score:.2f}")
    print("\nVisualizing results...")
    visualize_results(avg_scores)
    print("Saving results to CSV...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                   ignore_index=True)
    results_df.to_csv("llm_evaluation_results.csv", index=False)
    print("Results saved to llm_evaluation_results.csv")
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    print("Detailed responses saved to llm_response_comparison.csv")
The main function orchestrates the entire evaluation workflow end‑to‑end: it builds the dataset, initializes models, generates and scores responses, computes and displays average metrics, visualizes performance with charts, and finally exports both summary and detailed results as CSV files.
def pairwise_model_comparison(models, dataset, responses):
    """Compare two models side by side using an LLM as judge."""
    evaluator_model = ChatGoogleGenerativeAI(model="gemini-2.0-flash-lite", temperature=0)
    pairwise_template = """
    Question: {question}
    Response A: {response_a}
    Response B: {response_b}
    Which response better answers the user's question? Consider factors like accuracy,
    helpfulness, clarity, and completeness.
    First, analyze each response point by point. Then conclude with your choice of either:
    A is better, B is better, or They are equally good/bad.
    Your analysis:
    """
    pairwise_prompt = PromptTemplate(
        input_variables=["question", "response_a", "response_b"],
        template=pairwise_template
    )
    pairwise_chain = LLMChain(llm=evaluator_model, prompt=pairwise_prompt)
    model_names = list(models.keys())
pairwise_results = {f"{model_a} vs {model_b}": [] for model_a in model_names for model_b in model_names if model_a != model_b}
    for i, question in enumerate(dataset["question"]):
        for j, model_a in enumerate(model_names):
            for model_b in model_names[j+1:]:
                response_a = responses[model_a][i]
                response_b = responses[model_b][i]
                if response_a != "Error generating response" and response_b != "Error generating response":
                    comparison_result = pairwise_chain.run(
                        question=question,
                        response_a=response_a,
                        response_b=response_b
                    )
                    key_ab = f"{model_a} vs {model_b}"
                    pairwise_results[key_ab].append({
                        "question": question,
                        "result": comparison_result
                    })
    return pairwise_results
This function runs head-to-head comparisons for each unique model pair by prompting a “gemini-2.0-flash-lite” judge to analyze and rank their responses on accuracy, clarity, and completeness, collecting per-question verdicts into a structured dictionary for side-by-side evaluation.
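Because the judge returns free-text analyses, turning them into a quick scoreboard takes one extra pass. Below is a minimal sketch; tally_pairwise_verdicts is a hypothetical helper that is not part of the tutorial code, and its string matching assumes the judge follows the "A is better / B is better" phrasing requested in the prompt, which an LLM will not always do reliably.

def tally_pairwise_verdicts(pairwise_results):
    """Count wins and ties per comparison by scanning the judge's concluding phrase (hypothetical helper)."""
    tallies = {}
    for comparison, verdicts in pairwise_results.items():
        counts = {"A is better": 0, "B is better": 0, "Tie/unclear": 0}
        for verdict in verdicts:
            text = verdict["result"].lower()
            if "a is better" in text and "b is better" not in text:
                counts["A is better"] += 1
            elif "b is better" in text and "a is better" not in text:
                counts["B is better"] += 1
            else:
                counts["Tie/unclear"] += 1
        tallies[comparison] = counts
    return tallies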
def enhanced_main():
    """Enhanced main function with additional evaluations."""
    print("Creating evaluation dataset...")
    dataset = create_evaluation_dataset()
    print("Setting up models...")
    models = setup_models()
    print("Generating responses...")
    responses = generate_responses(models, dataset)
    print("Evaluating responses...")
    evaluation_results = evaluate_responses(models, dataset, responses)
    print("Calculating average scores...")
    avg_scores = calculate_average_scores(evaluation_results)
    print("Average scores:")
    for model, scores in avg_scores.items():
        print(f"\n{model}:")
        for criterion, score in scores.items():
            print(f" {criterion}: {score:.2f}")
    print("\nVisualizing results...")
    visualize_results(avg_scores)
    print("\nPerforming pairwise model comparison...")
    pairwise_results = pairwise_model_comparison(models, dataset, responses)
    print("\nPairwise comparison results:")
    for comparison, results in pairwise_results.items():
        print(f"\n{comparison}:")
        for i, result in enumerate(results[:2]):
            print(f" Question {i+1}: {result['question']}")
            print(f" Analysis: {result['result'][:100]}...")
    print("\nSaving all results...")
    results_df = pd.DataFrame(columns=["Model", "Criterion", "Score"])
    for model, criteria in avg_scores.items():
        for criterion, score in criteria.items():
            results_df = pd.concat([results_df, pd.DataFrame([{"Model": model, "Criterion": criterion, "Score": score}])],
                                   ignore_index=True)
    results_df.to_csv("llm_evaluation_results.csv", index=False)
    detailed_df = pd.DataFrame(columns=["Question", "Ground Truth"] + list(models.keys()))
    for i, question in enumerate(dataset["question"]):
        row = {
            "Question": question,
            "Ground Truth": dataset["ground_truth"][i]
        }
        for model_name in models.keys():
            row[model_name] = responses[model_name][i]
        detailed_df = pd.concat([detailed_df, pd.DataFrame([row])], ignore_index=True)
    detailed_df.to_csv("llm_response_comparison.csv", index=False)
    pairwise_df = pd.DataFrame(columns=["Comparison", "Question", "Analysis"])
    for comparison, results in pairwise_results.items():
        for result in results:
            pairwise_df = pd.concat([pairwise_df, pd.DataFrame([{
                "Comparison": comparison,
                "Question": result["question"],
                "Analysis": result["result"]
            }])], ignore_index=True)
    pairwise_df.to_csv("llm_pairwise_comparison.csv", index=False)
    print("All results saved to CSV files.")
The enhanced_main function extends the core evaluation pipeline by adding automated pairwise model comparisons, printing concise progress updates at each stage, and exporting three CSV files (summary scores, detailed responses, and pairwise analyses), so you end up with a complete, side-by-side evaluation workspace.
if __name__ == "__main__":
    enhanced_main()
Finally, this guard ensures that when the script is executed directly (not imported), it calls enhanced_main() to run the full evaluation and comparison pipeline end‑to‑end.
In conclusion, this tutorial has introduced a versatile and principled framework for evaluating and comparing the performance of LLMs, leveraging Google’s Generative AI capabilities alongside the LangChain library for orchestration. Unlike simplistic accuracy-based metrics, the methodology presented here embraces the multidimensional nature of language understanding, combining granular criterion-based evaluation, structured model-to-model comparison, and intuitive visualizations. By capturing key attributes, including correctness, relevance, coherence, and conciseness, our evaluation pipeline enables practitioners to identify subtle yet significant performance differences that directly impact downstream applications. The outputs, including CSV-based reporting, radar plots, and bar graphs, not only support transparent benchmarking but also guide data-driven decision-making in model selection and deployment.