evals

Here are 169 public repositories matching this topic...

By language: Python 79, TypeScript 35, Jupyter Notebook 14, Go 6, Rust 4, HTML 3, C++ 2, C# 1, Dockerfile 1, JavaScript 1

Sort: Most stars

mastra-ai / mastra

Star 22.1k

From the team behind Gatsby, Mastra is a framework for building AI-powered applications and agents with a modern TypeScript stack.

nodejs javascript typescript ai reactjs mcp nextjs tts chatbots workflows agents llm evals

Updated Mar 19, 2026
TypeScript

Arize-ai / phoenix

Star 8.9k

AI Observability & Evaluation

openai datasets agents ai-monitoring ai-observability prompt-engineering llms langchain llmops anthropic llamaindex llm-eval evals llm-evaluation aiengineering smolagents

Updated Mar 19, 2026
Jupyter Notebook

AgentOps-AI / agentops

Python SDK for AI agent monitoring, LLM cost tracking, benchmarking, and more. Integrates with most LLMs and agent frameworks including CrewAI, Agno, OpenAI Agents SDK, Langchain, Autogen, AG2, and CamelAI

agent ai openai evaluation-metrics mistral cost-estimation autogen groq agentops llm langchain anthropic evals ollama crewai agents-sdk openai-agents

Updated Oct 30, 2025
Python

Kiln-AI / Kiln

Star 4.7k

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

python windows macos machine-learning ai mcp evaluation prompt ml collaboration openai dataset-generation evaluation-framework synthetic-data fine-tuning prompt-engineering chain-of-thought rlhf evals ollama

Updated Mar 19, 2026
Python

pydantic / logfire

Star 4.1k

AI observability platform for production LLM and agent systems.

python ai metrics logging trace openai observability pydantic fastapi opentelemetry ai-tools ai-observability evals llm-observability pydantic-ai agent-observability

Updated Mar 17, 2026
Python

truera / trulens

Star 3.2k

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated Mar 18, 2026
Python

lmnr-ai / lmnr

Star 2.7k

Laminar - open-source observability platform purpose-built for AI agents. YC S24.

rust open-source typescript ai monitoring analytics evaluation ts self-hosted rust-lang developer-tools agents observability aiops ai-observability llmops evals llm-evaluation llm-observability agent-observability

Updated Mar 19, 2026
TypeScript

GitHamza0206 / simba

Star 1.4k

Open-source, production-ready customer service with built-in evals and monitoring

knowledge-base customer-service rag llm evals

Updated Jan 12, 2026
TypeScript

mattpocock / evalite

Star 1.4k

Evaluate your LLM-powered apps with TypeScript

typescript ai evals

Updated Feb 20, 2026
TypeScript

superlinear-ai / raglite

Star 1.1k

RAGLite is a Python toolkit for Retrieval-Augmented Generation (RAG) with DuckDB or PostgreSQL

markdown pdf postgres sqlite postgresql reranking rag vector-search duckdb colbert llm pgvector chainlit retrieval-augmented-generation evals late-interaction late-chunking query-adapter

Updated Mar 17, 2026
Python

harbor-framework / harbor

Star 1k

Harbor is a framework for running agent evaluations and creating and using RL environments.

rl-environments evals terminal-bench

Updated Mar 18, 2026
Python

waynesutton / opensync

Star 318

Cloud-synced dashboards for OpenCode and Claude Code. Track sessions, search with semantic lookup, export eval datasets.

open-source ai sessions convex opensync dasbhoard evals

Updated Feb 23, 2026
TypeScript

microsoft / promptpex

Star 158

Test Generation for Prompts

testing evaluations prompt-engineering llms chatgpt evals gpt-4o

Updated Mar 18, 2026
TeX

keshik6 / HourVideo

Star 157

[NeurIPS 2024] Official code for HourVideo: 1-Hour Video Language Understanding

navigation perception summarization reasoning visual-reasoning egocentric-videos gpt-4 multiple-choice-questions benchmark-dataset video-language-understanding multimodal-large-language-models evals gemini-pro spatial-intelligence neurips-2024 1-hour-video-language-understanding long-form-video-language-understanding long-context-understanding

Updated Jul 12, 2025
Jupyter Notebook

METR / vivaria

Star 135

Vivaria is METR's tool for running evaluations and conducting agent elicitation research.

ai elicitation ai-evaluation evals

Updated Feb 15, 2026
TypeScript

mclenhard / mcp-evals

Star 125

A Node.js package and GitHub Action for evaluating MCP (Model Context Protocol) tool implementations using LLM-based scoring. This helps ensure your MCP server's tools are working correctly and performing well.

ai mcp evals

Updated Jun 23, 2025
TypeScript

dustalov / evalica

Star 62

Evalica, your favourite evaluation toolkit

python rust library statistics arena rating leaderboard evaluation pagerank elo ranking hacktoberfest serbia pairwise-comparison pyo3 bradley-terry winrate llm evals evalica

Updated Mar 10, 2026
Python

voratiq / voratiq

Star 61

Agent ensembles to design, generate, and select the best code for every task.

cli ai orchestration-framework multi-agent sandboxing code-generation agents evals spec-driven-development

Updated Mar 19, 2026
TypeScript

ombharatiya / ai-system-design-guide

Star 60

AI system design guide for engineers building production AI systems and evals.

aws machine-learning natural-language-processing azure gcp artificial-intelligence gemini llama interview-questions claude open-ai rag system-design-interview llm gen-ai evals agentic-workflow agentic-ai

Updated Mar 1, 2026

AgentEvalHQ / AgentEval

Star 52

AgentEval is the comprehensive .NET toolkit for AI agent evaluation--tool usage validation, RAG quality metrics, stochastic evaluation, and model comparison--built first for Microsoft Agent Framework (MAF) and Microsoft.Extensions.AI. What RAGAS, PromptFoo and DeepEval do for Python, AgentEval does for .NET

testing agent framework evaluations net workflows red-teaming agentic evals

Updated Mar 15, 2026
C#
