OpenCompass · GitHub


What is OpenCompass?

OpenCompass is a platform focused on understanding AGI, including large language models and multi-modality models.

We aim to:

develop high-quality libraries that make evaluation easier
provide convincing leaderboards to improve the understanding of large models
create powerful toolchains targeting a variety of abilities and tasks
build solid benchmarks to support large model research
research the inference behavior of large models (analysis, reasoning, prompt engineering)

Toolkit

OpenCompass

OpenCompass is an LLM evaluation platform that supports a wide range of models (LLaMA, LLaMa2, ChatGLM2, ChatGPT, Claude, etc.) across 80+ datasets.
https://github.com/open-compass/opencompass

VLMEvalKit

VLMEvalKit is a toolkit for evaluating large vision-language models (LVLMs), currently supporting ~20 LVLMs and five multi-modal benchmarks.
https://github.com/open-compass/vlmevalkit

Models

CompassVerifier

CompassVerifier is an accurate and robust lightweight verifier model for evaluation and outcome reward.
https://github.com/open-compass/CompassVerifier

CompassJudger

CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
https://github.com/open-compass/CompassJudger

Benchmarks and Methods

Project	Topic	Paper
DevBench	Automated Software Development	DevBench: Towards LLMs based Automated Software Development
CriticBench	Critic Reasoning	CriticBench: Evaluating Large Language Models as Critic
ANAH	Hallucination Annotation	ANAH: Analytical Annotation of Hallucinations in Large Language Models
MathBench	Mathematical Reasoning	MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
T-Eval	Tool Utilization	T-Eval: Evaluating the Tool Utilization Capability Step by Step
MMBench	Multi Modality	MMBench: Is Your Multi-modal Model an All-around Player?
BotChat	Subjective Evaluation	BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues
LawBench	Domain Evaluation	LawBench: Benchmarking Legal Knowledge of Large Language Models

Pinned

opencompass Public

OpenCompass is an LLM evaluation platform that supports a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.

Python, 6.7k stars, 738 forks

VLMEvalKit Public

Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

Python, 3.8k stars, 640 forks

MMBench Public

Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"

287 stars, 15 forks

CompassVerifier Public

[EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Jupyter Notebook, 63 stars, 2 forks

CompassJudger Public

The all-in-one judge models introduced by OpenCompass

116 stars, 6 forks

MMBench-GUI Public

Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, includi...

Python, 100 stars, 6 forks
