open-compass

What is OpenCompass?

OpenCompass is a platform focused on understanding AGI, including large language models and multi-modality models.

We aim to:

  • develop high-quality libraries that make evaluation easier
  • provide convincing leaderboards that improve understanding of large models
  • create powerful toolchains targeting a variety of abilities and tasks
  • build solid benchmarks to support large model research
  • research large model inference (analysis, reasoning, prompt engineering)

Toolkit

  • OpenCompass
  • VLMEvalKit

Models

  • CompassVerifier
  • CompassJudger

Benchmarks and Methods

| Project | Topic | Paper |
|---|---|---|
| DevBench | Automated Software Development | DevBench: Towards LLMs based Automated Software Development |
| CriticBench | Critic Reasoning | CriticBench: Evaluating Large Language Models as Critic |
| ANAH | Hallucination Annotation | ANAH: Analytical Annotation of Hallucinations in Large Language Models |
| MathBench | Mathematical Reasoning | MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark |
| T-Eval | Tool Utilization | T-Eval: Evaluating the Tool Utilization Capability Step by Step |
| MMBench | Multi Modality | MMBench: Is Your Multi-modal Model an All-around Player? |
| BotChat | Subjective Evaluation | BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues |
| LawBench | Domain Evaluation | LawBench: Benchmarking Legal Knowledge of Large Language Models |

Pinned

  1. opencompass (Public)

    OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.

    Python · 6.7k stars · 738 forks

  2. VLMEvalKit (Public)

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

    Python · 3.8k stars · 640 forks

  3. MMBench (Public)

    Official repo of "MMBench: Is Your Multi-modal Model an All-around Player?"

    287 stars · 15 forks

  4. CompassVerifier (Public)

    [EMNLP 2025] CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

    Jupyter Notebook · 63 stars · 2 forks

  5. CompassJudger (Public)

    The all-in-one judge models introduced by OpenCompass.

    116 stars · 6 forks

  6. MMBench-GUI (Public)

    Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, includi...

    Python · 100 stars · 6 forks

Repositories

Showing 10 of 43 repositories
  • VLMEvalKit (Public)

    Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks.

    Python · 3,841 stars · Apache-2.0 · 640 forks · 202 issues · 27 pull requests · Updated Feb 26, 2026

  • opencompass (Public)

    OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.

    Python · 6,693 stars · Apache-2.0 · 738 forks · 367 issues (1 issue needs help) · 66 pull requests · Updated Feb 26, 2026

  • SWE-bench-server

    Python · 0 stars · 0 forks · 0 issues · 1 pull request · Updated Feb 24, 2026

  • GTA (Public)

    [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents

    Python · 135 stars · Apache-2.0 · 9 forks · 0 issues · 0 pull requests · Updated Feb 16, 2026

  • MiroFlow (Public, forked from MiroMindAI/miroflow)

    MiroMind Research Agent: Fully Open-Source Deep Research Agent with Reproducible State-of-the-Art Performance on FutureX, GAIA, HLE, BrowserComp and xBench.

    Python · 0 stars · Apache-2.0 · 261 forks · 0 issues · 0 pull requests · Updated Dec 30, 2025

  • RePro (Public)

    [ICLR 2026] Rectifying LLM Thought From Lens of Optimization

    Python · 14 stars · MIT · 4 forks · 1 issue · 0 pull requests · Updated Dec 5, 2025

  • SAGA (Public)

    The code repository for the NeurIPS 2025 paper "Rethinking Verification for LLM Code Generation: From Generation to Testing."

    10 stars · 0 forks · 1 issue · 0 pull requests · Updated Nov 27, 2025

  • ATLAS (Public)

    ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

    6 stars · 1 fork · 0 issues · 0 pull requests · Updated Nov 20, 2025

  • OASIS (Public)

    Python · 3 stars · 0 forks · 0 issues · 0 pull requests · Updated Nov 12, 2025

  • InteractScience

    JavaScript · 8 stars · Apache-2.0 · 0 forks · 0 issues · 0 pull requests · Updated Oct 31, 2025