llm-evaluation-framework

Here are 47 public repositories matching this topic...

Repositories by language: Python 30, Jupyter Notebook 6, TypeScript 6, Go 1, Java 1, PHP 1

Sort: Most stars

promptfoo / promptfoo

Star 15.9k

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated Mar 15, 2026
TypeScript

confident-ai / deepeval

Star 14.1k

The LLM Evaluation Framework

python evaluation-metrics evaluation-framework llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Mar 13, 2026
Python

msoedov / agentic_security

Star 1.8k

Agentic LLM Vulnerability Scanner / AI red teaming kit

agent-framework ai-red-team prompt-testing llm-security llm-vulnerabilities llm-evaluation llm-fuzzing llm-evaluation-framework llm-guardrails llm-scanner llm-jailbreaks llm-fuzzer llm-fuzzer-aggregator agent-security

Updated Feb 3, 2026
Python

rhesis-ai / rhesis

Star 296

Open-source platform & SDK for testing LLM and agentic apps. Define expected behavior, generate and run test scenarios, and review failures collaboratively.

open-source test-generation quality-assessment test-management test-execution responsible-ai trustworthy-ai generative-ai llmops llm-evaluation llm-evaluation-framework

Updated Mar 14, 2026
Python

JinjieNi / MixEval

Star 255

The official evaluation suite and dynamic data release for MixEval.

benchmark evaluation benchmarking-suite evaluation-framework benchmarking-framework foundation-models large-language-models large-language-model llm-inference llm-evaluation large-multimodal-models llm-evaluation-framework benchmark-mixture mixeval

Updated Nov 10, 2024
Python

cvs-health / langfair

Star 255

LangFair is a Python library for conducting use-case-level LLM bias and fairness assessments.

python ai artificial-intelligence bias fairness ai-safety fairness-testing bias-detection fairness-ai fairness-ml responsible-ai ethical-ai large-language-models llm llm-evaluation llm-evaluation-framework llm-evaluation-metrics

Updated Jan 9, 2026
Python

Addepto / contextcheck

Star 91

MIT-licensed framework for testing LLMs, RAG systems, and chatbots. Configurable via YAML and integrable into CI pipelines for automated testing.

open-source ci testing-tools chatbot-framework testing-framework chatbot-testing rag ai-chat large-language-models llm ai-testing llm-evaluation llm-evaluation-framework prompt-test llm-testing ai-testing-tool generative-ai-testing rag-testing summarization-testing

Updated Dec 11, 2024
Python

parea-ai / parea-sdk-py

Star 82

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Feb 13, 2025
Python

zli12321 / qa_metrics

Star 61

An easy-to-use Python package for running quick, basic QA evaluations. It includes standardized QA evaluation metrics and semantic evaluation metrics: black-box and open-source large language model prompting and evaluation, exact match, F1 score, PEDANT semantic match, and transformer match. The package also supports prompting the OpenAI and Anthropic APIs.

qa-automation-test rl-training llm exact-matching llm-evaluation llm-evaluation-toolkit llm-evaluation-framework reward-modeling

Updated Jul 18, 2025
Python

multinear / multinear

Star 44

Develop reliable AI apps

reliability evaluation llm llms llm-eval llm-evaluation llms-benchmarking llm-evaluation-framework

Updated Sep 2, 2025
Python

flexpa / llm-fhir-eval

Star 42

Benchmarking Large Language Models for FHIR

fhir fhirpath fhir-resources llm evals llm-evaluation-framework fhir-llm

Updated Feb 4, 2026
TypeScript

zhuohaoyu / KIEval

Star 39

[ACL'24] A Knowledge-grounded Interactive Evaluation Framework for Large Language Models

machine-learning explainable-ai llm llm-evaluation llm-evaluation-toolkit llm-evaluation-framework llm-evaluation-metrics acl2024

Updated Jul 19, 2024
Python

vero-labs-ai / vero-eval

Star 28

Open source framework for evaluating AI Agents

python testing evaluation datasets dataset-generation evaluation-metrics evaluation-framework testing-framework testing-library synthetic-dataset-generation user-persona evals llm-evaluation rag-evaluation llm-evaluation-framework langgraph rag-testing

Updated Feb 24, 2026
Python

aws-samples / fm-leaderboarder

Star 19

FM-Leaderboard-er lets you create a leaderboard to find the best LLM/prompt for your own business use case, based on your data, tasks, and prompts.

llm-evaluation llm-evaluation-framework llm-benchmarking

Updated Oct 31, 2024
Python

honeyhiveai / realign

Star 18

Realign is a testing and simulation framework for AI applications.

ai simulation evaluation alignment red-teaming rag prompt-engineering llms llmops llm-eval llm-evaluation aiengineering llm-evaluation-framework

Updated Dec 4, 2024
Python

dokimos-dev / dokimos

Star 18

Evaluation Framework for LLM applications in Java and Kotlin

kotlin java evaluation junit evaluation-metrics evaluation-framework junit-extension rag llm retrieval-augmented-generation langchain4j llm-evaluation rag-evaluation spring-ai llm-evaluation-framework llm-evaluation-metrics agentic-ai agent-evaluation koog spring-ai-evaluation

Updated Mar 8, 2026
Java

Human-Centric-Machine-Learning / prediction-powered-ranking

Star 9

Code for "Prediction-Powered Ranking of Large Language Models", NeurIPS 2024.

ranking-algorithm llm-eval llm-evaluation llm-evaluation-framework prediction-powered-inference rank-sets

Updated Oct 28, 2024
Jupyter Notebook

ronniross / confidence-scorer

Star 9

A measure of the estimated confidence that outputs generated by Transformer-based language models are not hallucinated.

dataset datasets llm llms llm-training llm-evaluation llms-reasoning llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework llm-evaluation-metrics llms-efficency llms-evalution

Updated Feb 26, 2026
Python

pyladiesams / eval-llm-based-apps-jan2025

Star 8

Create an evaluation framework for your LLM-based app. Incorporate it into your test suite. Lay the monitoring foundation.

workshop llm llms llmops llm-eval llm-test llm-evaluation-framework llm-evaluation-metrics llm-monitoring llm-testing llm-evals

Updated May 6, 2025
Jupyter Notebook

petmal / MindTrial

Star 7

MindTrial: Evaluate and compare AI language models (LLMs) on text-based tasks with optional file/image attachments and tool use. Supports multiple providers (OpenAI, Google, Anthropic, DeepSeek, Mistral AI, xAI, Alibaba, Moonshot AI, OpenRouter), custom tasks in YAML, and HTML/CSV reports.

opensource openai xai artificial-intelligence-projects anthropic ai-tool openrouter qwen deepseek mistral-ai llm-evaluation-framework google-gemini-ai llm-benchmarking moonshot-ai language-models-ai llm-comparison ai-benchmark ai-evaluation-tools grok-ai ai-model-comparison

Updated Mar 12, 2026
Go
