FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation

Chenxi Zhang1,2*, Ziliang Gan1,3*, Liyun Zhu1*, Youwei Pang4, Qing Zhang5, Rongjunchen Zhang1♠

1HiThink Research, 2Wuhan University, 3Zhejiang University, 4Nanyang Technological University, 5Shanghai Institute of Technology

*Equal Contribution  ♠Corresponding Author
Correspondence: zhangrongjunchen@myhexin.com

[Paper] | [Project Page] | [Hugging Face]


Overview of FinMTM: task types and capability coverage.


Updates

  • 2026-01: Initial release of benchmark dataset and paper.
  • TBD: Online leaderboard opens for submissions.

Contents

  • Overview
  • Results
  • Evaluation
  • Quickstart
  • License
  • Citation

Overview

Financial reasoning is challenging for VLMs due to specialized chart formats, dense domain knowledge, long-horizon dependencies, and evidence-grounded tool use. Existing benchmarks are mostly single-turn and do not sufficiently measure multi-turn dialogue stability, session-level memory, or agentic planning and execution.

FinMTM addresses this gap by providing:

  • Objective questions: single- and multiple-choice questions grounded in financial visuals.

  • Open-ended questions: multi-turn conversations that stress compositional reasoning, multi-step calculation, self-correction, and memory.

  • Financial agent task: tool-augmented multi-source workflows with long-horizon planning and evidence-grounded answers.


Data Construction Pipeline

We propose a novel multi-stage data construction pipeline to scale multi-turn financial sessions, ensuring alignment with targeted cognitive requirements and traceability to verifiable evidence.

Our multi-stage construction pipeline. We progressively build (i) objective visual-grounded items, (ii) multi-turn open-ended sessions emphasizing composition/calculation/self-correction/memory, and (iii) agentic workflows with tool planning, tool execution, and evidence-grounded responses.

Results

We benchmark 22 leading VLMs on FinMTM. The final score is the average across three task families: Objective Questions, Open-Ended Questions, and Financial Agent.

Comparison of leading VLMs on FinMTM. Final score is the average of Objective, Open-Ended, and Agent tasks.
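
As a worked example of this averaging, here is a minimal sketch using ChatGPT-4o's row from the table below. Collapsing each task's sub-columns with a simple mean is our assumption; the paper may aggregate sub-columns differently.

```python
# ASSUMPTION: each task score is the simple mean of its sub-columns;
# the final score is then the mean of the three task scores (per the text above).
objective = (79.3 + 49.1) / 2                  # Obj-Single, Obj-Multi
open_ended = (77.2 + 76.8 + 46.2 + 38.9) / 4   # Com., Cal., SelfCorr., Mem.
agent = (29.7 + 34.8) / 2                      # Agent w/ fuzz, w/o fuzz
final_score = (objective + open_ended + agent) / 3
print(f"{final_score:.1f}")  # 52.1 under these assumptions
```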

Benchmark Results

Column Definitions

  • Objective Questions: Single-choice (Obj-Single), Multiple-choice (Obj-Multi)
  • Open-Ended Questions: Comprehension (Open-Com.), Calculation (Open-Cal.), Self-Correction (Open-SelfCorr.), Memory (Open-Mem.)
  • Financial Agent Tasks: With fuzzing (Agent-w fuzz), Without fuzzing (Agent-w/o fuzz)
| Method | Obj-Single | Obj-Multi | Open-Com. | Open-Cal. | Open-SelfCorr. | Open-Mem. | Agent-w fuzz | Agent-w/o fuzz |
|---|---|---|---|---|---|---|---|---|
| **Proprietary Models** |  |  |  |  |  |  |  |  |
| ChatGPT-4o | 79.3 | 49.1 | 77.2 | 76.8 | 46.2 | 38.9 | 29.7 | 34.8 |
| ChatGPT-o3* | 85.8 | 73.3 | 83.8 | 78.6 | 52.8 | 43.6 | 31.4 | 35.2 |
| ChatGPT-5* | 89.0 | 79.6 | 86.9 | 80.7 | 56.9 | 46.7 | 35.9 | 49.7 |
| Gemini 3 Flash | 91.9 | 78.1 | 82.2 | 76.0 | 55.4 | 41.6 | 53.6 | 62.6 |
| Grok-4-fast-non-reasoning* | 71.0 | 46.8 | 66.0 | 61.2 | 39.9 | 24.8 | 30.2 | 39.7 |
| Gemini 3 Pro | 92.1 | 78.4 | 87.5 | 82.8 | 58.8 | 48.5 | 48.3 | 54.3 |
| **InternVL Series** |  |  |  |  |  |  |  |  |
| InternVL2.5-8B | 63.8 | 25.7 | 55.1 | 49.2 | 26.5 | 16.7 | 8.4 | 10.5 |
| InternVL2.5-26B | 70.5 | 31.3 | 61.7 | 57.7 | 32.3 | 22.8 | 11.2 | 14.0 |
| InternVL2.5-40B | 72.3 | 35.2 | 66.1 | 64.6 | 36.2 | 26.7 | 13.5 | 16.8 |
| InternVL3-78B | 75.6 | 42.4 | 76.2 | 77.6 | 43.6 | 32.6 | 18.2 | 22.8 |
| **Other VL Series** |  |  |  |  |  |  |  |  |
| MiMo-VL-7B | 61.1 | 21.4 | 75.1 | 75.4 | 47.2 | 39.9 | 20.2 | 25.5 |
| GLM4.5V-108B | 73.7 | 51.0 | 85.4 | 79.6 | 51.1 | 42.2 | 26.5 | 32.4 |
| **Qwen VL Series** |  |  |  |  |  |  |  |  |
| Qwen2.5-VL-3B | 64.5 | 16.4 | 68.2 | 67.7 | 40.5 | 27.6 | 9.4 | 11.9 |
| Qwen2.5-VL-7B | 73.4 | 24.1 | 74.3 | 73.4 | 43.1 | 33.9 | 11.1 | 14.2 |
| Qwen3-VL-4B-Instruct | 73.3 | 34.2 | 74.5 | 71.2 | 39.5 | 25.9 | 15.1 | 19.1 |
| Qwen3-VL-4B-Thinking | 66.1 | 24.3 | 71.2 | 68.5 | 42.5 | 31.0 | 12.8 | 15.6 |
| Qwen3-VL-30B-A3B-Instruct | 77.2 | 47.3 | 82.1 | 76.5 | 42.5 | 33.7 | 16.2 | 20.8 |
| Qwen3-VL-30B-A3B-Thinking | 71.5 | 49.4 | 80.7 | 67.1 | 44.2 | 35.1 | 18.9 | 23.3 |
| Qwen3-VL-32B-Instruct | 84.5 | 39.9 | 84.3 | 80.7 | 50.8 | 40.3 | 19.6 | 25.1 |
| Qwen3-VL-32B-Thinking | 83.4 | 46.5 | 80.3 | 68.6 | 43.5 | 33.7 | 23.2 | 28.6 |
| Qwen3-VL-235B-A22B-Instruct | 81.3 | 48.5 | 85.5 | 80.9 | 54.5 | 41.5 | 32.1 | 38.7 |
| Qwen3-VL-235B-A22B-Thinking | 80.5 | 42.3 | 84.5 | 79.4 | 52.5 | 43.0 | 35.2 | 41.5 |

Key Observations

  • Agentic settings expose larger performance gaps between models than reasoning-only settings.
  • Removing identifiable entities increases difficulty and stresses evidence-grounded reasoning.
  • Scaling helps, but robust tool planning and execution remain a major bottleneck for open-source models.

Evaluation

FinMTM uses task-aware evaluation protocols across the three tasks.

1) Objective Questions

  • Exact-match scoring over the predicted option(s).
  • Multiple-choice uses a set-overlap rule (precision/recall/F-score style) to penalize missing or spurious selections; a sketch follows this list.
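
A minimal sketch of such a set-overlap rule, assuming an F1-style combination of precision and recall over the selected options (the exact weighting used by FinMTM may differ):

```python
def multi_choice_f1(pred: set[str], gold: set[str]) -> float:
    """F1-style set-overlap score for multiple-choice answers (illustrative)."""
    if not pred or not gold:
        return 0.0
    hit = len(pred & gold)        # correctly selected options
    precision = hit / len(pred)   # penalizes spurious selections
    recall = hit / len(gold)      # penalizes missing selections
    return 0.0 if hit == 0 else 2 * precision * recall / (precision + recall)

# Gold answer {A, C}: predicting {A, B} scores 0.5; predicting {A, C} scores 1.0.
print(multi_choice_f1({"A", "B"}, {"A", "C"}))  # 0.5
```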

2) Open-Ended Dialogues (Multi-turn)

We score dialogues with a weighted combination of:

  • turn-level quality (per-turn correctness, grounding, reasoning quality)
  • session-level quality (cross-turn consistency, long-context stability, memory correctness)

Notably, the level taxonomy is defined at the session level, i.e., each level characterizes the overall cognitive requirement of an entire multi-turn conversation rather than any single turn in isolation.
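
A minimal sketch of how such a weighted combination could be computed; the weights and score names below are illustrative assumptions, not the released metric:

```python
from statistics import mean

# Hypothetical scores in [0, 1]; FinMTM's actual metric names and weights may differ.
turn_scores = [0.9, 0.7, 0.8]     # per-turn correctness, grounding, reasoning quality
session_scores = [0.8, 0.7, 0.9]  # cross-turn consistency, long-context stability, memory

ALPHA = 0.5  # assumed weight on turn-level quality
dialogue_score = ALPHA * mean(turn_scores) + (1 - ALPHA) * mean(session_scores)
print(f"dialogue score: {dialogue_score:.2f}")  # 0.80 in this toy example
```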

3) Financial Agent Tasks

We evaluate:

  • planning quality (step ordering, tool selection, decomposition)
  • tool execution (tool name + core-argument correctness; evidence sufficiency; see the sketch after this list)
  • final outcome (answer correctness + evidence-grounded summarization)
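
To make the tool-execution criterion concrete, here is a minimal sketch that matches a predicted tool call against a reference by tool name and core arguments; the call schema and field names are assumptions for illustration, not FinMTM's exact format:

```python
def tool_call_correct(pred: dict, ref: dict, core_args: list[str]) -> bool:
    """A predicted call counts as correct when the tool name and all core
    arguments match the reference (illustrative schema)."""
    if pred.get("tool") != ref.get("tool"):
        return False
    return all(
        pred.get("args", {}).get(k) == ref.get("args", {}).get(k)
        for k in core_args
    )

pred = {"tool": "stock_price", "args": {"ticker": "AAPL", "date": "2025-06-30"}}
ref = {"tool": "stock_price", "args": {"ticker": "AAPL", "date": "2025-06-30"}}
print(tool_call_correct(pred, ref, core_args=["ticker", "date"]))  # True
```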

Quickstart (code still under refinement)

1. Environment Setup

Download the dataset from the Hugging Face link above. For evaluation, run the following commands to set up the environment:

```bash
cd finmtm
conda create -n finmtm_env python=3.10 -y
conda activate finmtm_env
pip install -r requirements.txt
```

2. Inference

2.1 Inference for Objective Questions (Single/Multiple Choice)

```bash
cd ./inference/SC_MC
chmod +x etest.sh
./etest.sh
```

2.2 Inference for Multi-Turn QA

```bash
cd ./inference/MTQA
chmod +x etest.sh
./etest.sh
```

2.3 General Inference Command (Optional)

To customize inference parameters, run the command below directly:

```bash
python inference.py \
  --backend qwen3vl \
  --api-base http://localhost:8000/v1 \
  --model qwen3vl-4b-instruct \
  --input-dir ./inputs \
  --output-dir ./outputs \
  --include "*.jsonl"
```

3. Evaluation

To evaluate the results of multi-turn QA tasks, run the following commands:

```bash
# --dirs     directory of data to evaluate
# --client   client type
# --api_base API service address
# --model    model used for evaluation
python -m eval_runner.main \
  --dirs /path/to/data \
  --client qwen \
  --api_base http://127.0.0.1:8000/v1 \
  --model Qwen3-VL-30B-A3B-Instruct

# Alternatively, run via the script (optional)
chmod +x etest.sh
./etest.sh
```

License

  • Code: Apache 2.0
  • Dataset: CC BY-NC 4.0, research use only. Use must also comply with https://openai.com/policies/terms-of-use.

Citation

If you find our work useful, please consider citing:

```bibtex
@misc{zhang2026finmtm,
  title={FinMTM: A Multi-Turn Multimodal Benchmark for Financial Reasoning and Agent Evaluation},
  author={Chenxi Zhang and Ziliang Gan and Liyun Zhu and Youwei Pang and Qing Zhang and Rongjunchen Zhang},
  year={2026},
  eprint={2602.03130},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.03130},
}
```
