FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

Chenxi Zhang1,2*, Ziliang Gan1,3*, Liyun Zhu1*, Youwei Pang4, Qing Zhang5, Rongjunchen Zhang1♠

1HiThink Research, 2Wuhan University, 3Zhejiang University, 4Nanyang Technological University, 5Shanghai Institute of Technology

*Equal Contribution  ♠Corresponding Author
Correspondence: zhangrongjunchen@myhexin.com

[Paper] | [Project Page] | [Hugging Face]


Overview of FinMTM: task types and capability coverage.


Updates

  • 2026-01: Initial release of benchmark dataset and paper.
  • TBD: Online leaderboard opens for submissions.

Contents

  • Overview
  • Results
  • Evaluation
  • Quickstart
  • License
  • Citation

Overview

Financial reasoning is challenging for VLMs due to specialized chart formats, dense domain knowledge, long-horizon dependencies, and evidence-grounded tool use. Existing benchmarks are mostly single-turn and do not sufficiently measure multi-turn dialogue stability, session-level memory, or agentic planning and execution.

FinMTM addresses this gap by providing:

  • Objective questions: single- and multiple-choice questions grounded in financial visuals.

  • Open-ended questions: multi-turn conversations that stress compositional reasoning, multi-step calculation, self-correction, and memory.

  • Financial agent task: tool-augmented multi-source workflows with long-horizon planning and evidence-grounded answers.


Data Construction Pipeline

We propose a novel multi-stage data construction pipeline to scale multi-turn financial sessions, ensuring alignment with targeted cognitive requirements and traceability to verifiable evidence.

Our multi-stage construction pipeline. We progressively build (i) objective visual-grounded items, (ii) multi-turn open-ended sessions emphasizing composition/calculation/self-correction/memory, and (iii) agentic workflows with tool planning, tool execution, and evidence-grounded responses.

Results

We benchmark 22 leading VLMs on FinMTM. The final score is the average across three task families: Objective Questions, Open-Ended Questions, and Financial Agent.

Comparison of leading VLMs on FinMTM. Final score is the average of Objective, Open-Ended, and Agent tasks.
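
As a worked example of this averaging, here is a minimal sketch using ChatGPT-4o's row from the table below. Collapsing each task's sub-columns with a simple mean is our assumption; the paper may aggregate sub-columns differently.

```python
# ASSUMPTION: each task score is the simple mean of its sub-columns;
# the final score is then the mean of the three task scores (per the text above).
objective = (79.3 + 49.1) / 2                  # Obj-Single, Obj-Multi
open_ended = (77.2 + 76.8 + 46.2 + 38.9) / 4   # Com., Cal., SelfCorr., Mem.
agent = (29.7 + 34.8) / 2                      # Agent w/ fuzz, w/o fuzz
final_score = (objective + open_ended + agent) / 3
print(f"{final_score:.1f}")  # 52.1 under these assumptions
```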

Benchmark Results

Column Definitions

  • Objective Questions: Single-choice (Obj-Single), Multiple-choice (Obj-Multi)
  • Open-Ended Questions: Comprehension (Open-Com.), Calculation (Open-Cal.), Self-Correction (Open-SelfCorr.), Memory (Open-Mem.)
  • Financial Agent Tasks: With fuzzing (Agent-w fuzz), Without fuzzing (Agent-w/o fuzz)
| Method | Obj-Single | Obj-Multi | Open-Com. | Open-Cal. | Open-SelfCorr. | Open-Mem. | Agent-w fuzz | Agent-w/o fuzz |
|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** |  |  |  |  |  |  |  |  |
| ChatGPT-4o | 79.3 | 49.1 | 77.2 | 76.8 | 46.2 | 38.9 | 29.7 | 34.8 |
| ChatGPT-o3* | 85.8 | 73.3 | 83.8 | 78.6 | 52.8 | 43.6 | 31.4 | 35.2 |
| ChatGPT-5* | 89.0 | 79.6 | 86.9 | 80.7 | 56.9 | 46.7 | 35.9 | 49.7 |
| Gemini 3 Flash | 91.9 | 78.1 | 82.2 | 76.0 | 55.4 | 41.6 | 53.6 | 62.6 |
| Grok-4-fast-non-reasoning* | 71.0 | 46.8 | 66.0 | 61.2 | 39.9 | 24.8 | 30.2 | 39.7 |
| Gemini 3 Pro | 92.1 | 78.4 | 87.5 | 82.8 | 58.8 | 48.5 | 48.3 | 54.3 |
| **InternVL Series** |  |  |  |  |  |  |  |  |
| InternVL2.5-8B | 63.8 | 25.7 | 55.1 | 49.2 | 26.5 | 16.7 | 8.4 | 10.5 |
| InternVL2.5-26B | 70.5 | 31.3 | 61.7 | 57.7 | 32.3 | 22.8 | 11.2 | 14.0 |
| InternVL2.5-40B | 72.3 | 35.2 | 66.1 | 64.6 | 36.2 | 26.7 | 13.5 | 16.8 |
| InternVL3-78B | 75.6 | 42.4 | 76.2 | 77.6 | 43.6 | 32.6 | 18.2 | 22.8 |
| **Other VL Series** |  |  |  |  |  |  |  |  |
| MiMo-VL-7B | 61.1 | 21.4 | 75.1 | 75.4 | 47.2 | 39.9 | 20.2 | 25.5 |
| GLM4.5V-108B | 73.7 | 51.0 | 85.4 | 79.6 | 51.1 | 42.2 | 26.5 | 32.4 |
| **Qwen VL Series** |  |  |  |  |  |  |  |  |
| Qwen2.5-VL-3B | 64.5 | 16.4 | 68.2 | 67.7 | 40.5 | 27.6 | 9.4 | 11.9 |
| Qwen2.5-VL-7B | 73.4 | 24.1 | 74.3 | 73.4 | 43.1 | 33.9 | 11.1 | 14.2 |
| Qwen3-VL-4B-Instruct | 73.3 | 34.2 | 74.5 | 71.2 | 39.5 | 25.9 | 15.1 | 19.1 |
| Qwen3-VL-4B-Thinking | 66.1 | 24.3 | 71.2 | 68.5 | 42.5 | 31.0 | 12.8 | 15.6 |
| Qwen3-VL-30B-A3B-Instruct | 77.2 | 47.3 | 82.1 | 76.5 | 42.5 | 33.7 | 16.2 | 20.8 |
| Qwen3-VL-30B-A3B-Thinking | 71.5 | 49.4 | 80.7 | 67.1 | 44.2 | 35.1 | 18.9 | 23.3 |
| Qwen3-VL-32B-Instruct | 84.5 | 39.9 | 84.3 | 80.7 | 50.8 | 40.3 | 19.6 | 25.1 |
| Qwen3-VL-32B-Thinking | 83.4 | 46.5 | 80.3 | 68.6 | 43.5 | 33.7 | 23.2 | 28.6 |
| Qwen3-VL-235B-A22B-Instruct | 81.3 | 48.5 | 85.5 | 80.9 | 54.5 | 41.5 | 32.1 | 38.7 |
| Qwen3-VL-235B-A22B-Thinking | 80.5 | 42.3 | 84.5 | 79.4 | 52.5 | 43.0 | 35.2 | 41.5 |

Key Observations

  • Agentic settings expose larger performance gaps between models than reasoning-only settings.
  • Removing identifiable entities increases difficulty and stresses evidence-grounded reasoning.
  • Scaling helps, but robust tool planning and execution remain a major bottleneck for open-source models.

Evaluation

FinMTM uses task-aware evaluation protocols across the three tasks.

1) Objective Questions

  • Exact-match scoring over the predicted option(s).
  • Multiple-choice uses a set-overlap rule (precision/recall/F-score style) to penalize missing or spurious selections; a sketch follows this list.
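
A minimal sketch of such a set-overlap rule, assuming an F1-style combination of precision and recall over the selected options (the exact weighting used by FinMTM may differ):

```python
def multi_choice_f1(pred: set[str], gold: set[str]) -> float:
    """F1-style set-overlap score for multiple-choice answers (illustrative)."""
    if not pred or not gold:
        return 0.0
    hit = len(pred & gold)        # correctly selected options
    precision = hit / len(pred)   # penalizes spurious selections
    recall = hit / len(gold)      # penalizes missing selections
    return 0.0 if hit == 0 else 2 * precision * recall / (precision + recall)

# Gold answer {A, C}: predicting {A, B} scores 0.5; predicting {A, C} scores 1.0.
print(multi_choice_f1({"A", "B"}, {"A", "C"}))  # 0.5
```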

2) Open-Ended Dialogues (Multi-turn)

We score dialogues with a weighted combination of:

  • turn-level quality (per-turn correctness, grounding, reasoning quality)
  • session-level quality (cross-turn consistency, long-context stability, memory correctness)

Notably, the level taxonomy is defined at the session level, i.e., each level characterizes the overall cognitive requirement of an entire multi-turn conversation rather than any single turn in isolation.
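
A minimal sketch of how such a weighted combination could be computed; the weights and score names below are illustrative assumptions, not the released metric:

```python
from statistics import mean

# Hypothetical scores in [0, 1]; FinMTM's actual metric names and weights may differ.
turn_scores = [0.9, 0.7, 0.8]     # per-turn correctness, grounding, reasoning quality
session_scores = [0.8, 0.7, 0.9]  # cross-turn consistency, long-context stability, memory

ALPHA = 0.5  # assumed weight on turn-level quality
dialogue_score = ALPHA * mean(turn_scores) + (1 - ALPHA) * mean(session_scores)
print(f"dialogue score: {dialogue_score:.2f}")  # 0.80 in this toy example
```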

3) Financial Agent Tasks

We evaluate:

  • planning quality (step ordering, tool selection, decomposition)
  • tool execution (tool name + core-argument correctness; evidence sufficiency; see the sketch after this list)
  • final outcome (answer correctness + evidence-grounded summarization)
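
To make the tool-execution criterion concrete, here is a minimal sketch that matches a predicted tool call against a reference by tool name and core arguments; the call schema and field names are assumptions for illustration, not FinMTM's exact format:

```python
def tool_call_correct(pred: dict, ref: dict, core_args: list[str]) -> bool:
    """A predicted call counts as correct when the tool name and all core
    arguments match the reference (illustrative schema)."""
    if pred.get("tool") != ref.get("tool"):
        return False
    return all(
        pred.get("args", {}).get(k) == ref.get("args", {}).get(k)
        for k in core_args
    )

pred = {"tool": "stock_price", "args": {"ticker": "AAPL", "date": "2025-06-30"}}
ref = {"tool": "stock_price", "args": {"ticker": "AAPL", "date": "2025-06-30"}}
print(tool_call_correct(pred, ref, core_args=["ticker", "date"]))  # True
```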

Quickstart (code still under refinement)

1. Environment Setup

Download the dataset from the Hugging Face link above. For evaluation, run the following commands to set up the environment:

```bash
cd finmtm
conda create -n finmtm_env python=3.10 -y
conda activate finmtm_env
pip install -r requirements.txt
```

2. Inference

2.1 Inference for Objective Questions (Single/Multiple Choice)

```bash
cd ./inference/SC_MC
chmod +x etest.sh
./etest.sh
```

2.2 Inference for Multi-Turn QA

```bash
cd ./inference/MTQA
chmod +x etest.sh
./etest.sh
```

2.3 General Inference Command (Optional)

To customize inference parameters, run the command below directly:

```bash
python inference.py \
  --backend qwen3vl \
  --api-base http://localhost:8000/v1 \
  --model qwen3vl-4b-instruct \
  --input-dir ./inputs \
  --output-dir ./outputs \
  --include "*.jsonl"
```

3. Evaluation

To evaluate the results of multi-turn QA tasks, run the following commands:

```bash
# --dirs     directory of data to evaluate
# --client   client type
# --api_base API service address
# --model    model used for evaluation
python -m eval_runner.main \
  --dirs /path/to/data \
  --client qwen \
  --api_base http://127.0.0.1:8000/v1 \
  --model Qwen3-VL-30B-A3B-Instruct

# Alternatively, run via the script (optional)
chmod +x etest.sh
./etest.sh
```

License

  • Code: Apache 2.0
  • Dataset: CC BY-NC 4.0, research use only. Use must also comply with https://openai.com/policies/terms-of-use.

Citation

If you find our work useful, please consider citing:

```bibtex
@misc{zhang2026finmtm,
  title={FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation},
  author={Chenxi Zhang and Ziliang Gan and Liyun Zhu and Youwei Pang and Qing Zhang and Rongjunchen Zhang},
  year={2026},
  eprint={2602.03130},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.03130},
}
```
