
FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding

Introduction

Multimodal Large Language Models (MLLMs) have shown remarkable capabilities in video content understanding but still struggle with fine-grained motion comprehension. To comprehensively assess the motion understanding ability of existing MLLMs, we introduce FAVOR-Bench, comprising 1,776 videos with structured manual annotations of various motions. Our benchmark includes both close-ended and open-ended tasks. For close-ended evaluation, we carefully design 8,184 multiple-choice question-answer pairs spanning six distinct sub-tasks. For open-ended evaluation, we develop both a GPT-assisted caption assessment method and a novel, cost-efficient LLM-free method, where the latter enhances benchmarking interpretability and reproducibility. Comprehensive experiments with 21 state-of-the-art MLLMs reveal significant limitations in their ability to comprehend and describe detailed temporal dynamics in video motions. To alleviate this limitation, we further build FAVOR-Train, a dataset consisting of 17,152 videos with fine-grained motion annotations. Finetuning Qwen2.5-VL on FAVOR-Train yields consistent improvements on the motion-related tasks of TVBench, MotionBench, and our FAVOR-Bench. These assessment results demonstrate that the proposed FAVOR-Bench and FAVOR-Train provide valuable tools for the community to develop more powerful video understanding models.

Leaderboard

Evaluation results on FAVOR-Bench. ALL is the overall close-ended accuracy (%); AS, HAC, SAD, MAD, CM, and NSM are the six close-ended sub-tasks; GPT-C, GPT-D, and LLM-Free are the open-ended scores.

| Model | Date | Input | ALL | AS | HAC | SAD | MAD | CM | NSM | GPT-C | GPT-D | LLM-Free |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Gemini-1.5-Pro | 2024-04 | 1 fps* | 49.87 | 49.22 | 53.73 | 48.80 | 54.85 | 41.58 | 56.25 | 4.52 | 4.68 | 52.91 |
| GPT-4o | 2024-08 | 1 fps* | 42.09 | 40.65 | 45.10 | 42.84 | 45.48 | 36.00 | 48.44 | 4.33 | 4.01 | 49.50 |
| Claude-3.7-Sonnet | 2025-02 | 1 fps* | 43.73 | 45.20 | 43.02 | 41.82 | 48.05 | 39.07 | 46.88 | 4.32 | 4.63 | 43.03 |
| Video-LLaVA-7B | 2023-11 | 8 frms | 25.37 | 24.91 | 21.54 | 25.45 | 30.54 | 26.23 | 21.88 | 2.18 | 2.31 | 41.36 |
| LLaVA-NeXT-Video-7B | 2024-05 | 8 frms | 23.45 | 21.27 | 22.45 | 26.05 | 26.72 | 23.07 | 14.06 | 2.57 | 2.02 | 29.48 |
| LLaVA-NeXT-Video-34B | 2024-05 | 8 frms | 30.44 | 31.70 | 31.99 | 32.31 | 22.99 | 29.58 | 46.88 | 2.83 | 2.67 | 39.41 |
| Tarsier-7B | 2024-07 | 8 frms | 17.46 | 12.55 | 21.16 | 17.87 | 17.93 | 22.23 | 31.25 | 3.47 | 2.80 | 46.25 |
| Tarsier-34B | 2024-07 | 8 frms | 30.34 | 28.56 | 34.98 | 26.90 | 31.29 | 31.91 | 37.50 | 3.79 | 2.97 | 47.13 |
| Aria | 2024-10 | 8 frms | 34.63 | 33.33 | 41.14 | 30.14 | 35.27 | 33.21 | 59.38 | 2.85 | 2.61 | 42.78 |
| LLaVA-Video-7B-Qwen2 | 2024-10 | 64 frms | 38.60 | 36.14 | 41.27 | 41.28 | 44.48 | 29.58 | 46.88 | 3.57 | 3.40 | 45.41 |
| LLaVA-Video-72B-Qwen2 | 2024-10 | 64 frms | 46.08 | 48.35 | 47.50 | 45.25 | 51.70 | 33.02 | 53.12 | 3.42 | 3.42 | 46.06 |
| Tarsier2-Recap-7B | 2024-12 | 16 frms | -- | -- | -- | -- | -- | -- | -- | 4.60 | 4.38 | 56.58 |
| InternVL2.5-2B | 2024-12 | 8 frms | 22.90 | 18.70 | 28.23 | 23.71 | 27.47 | 19.16 | 23.44 | 2.80 | 2.99 | 43.23 |
| InternVL2.5-8B | 2024-12 | 8 frms | 34.59 | 31.97 | 38.68 | 38.09 | 37.76 | 26.14 | 35.94 | 3.11 | 3.38 | 44.18 |
| InternVL2.5-78B | 2024-12 | 8 frms | 38.54 | 38.38 | 40.62 | 39.05 | 43.65 | 29.40 | 39.06 | 2.98 | 3.41 | 44.01 |
| VideoChat-Flash-Qwen2-7B | 2025-01 | 1 fps | 43.82 | 41.90 | 48.41 | 42.84 | 50.95 | 35.07 | 50.00 | 3.25 | 2.55 | 40.82 |
| VideoLLaMA3-2B | 2025-01 | 1 fps | 32.98 | 28.97 | 36.60 | 34.90 | 38.01 | 28.56 | 40.62 | 3.14 | 2.98 | 39.29 |
| VideoLLaMA3-7B | 2025-01 | 1 fps | 41.46 | 40.20 | 44.13 | 42.42 | 48.30 | 31.53 | 42.19 | 3.64 | 3.24 | 48.63 |
| Qwen2.5-VL-3B | 2025-01 | 1 fps | 37.05 | 38.45 | 38.22 | 36.64 | 39.75 | 29.77 | 32.81 | 2.77 | 2.91 | 47.32 |
| Qwen2.5-VL-7B | 2025-01 | 1 fps | 40.76 | 39.48 | 43.28 | 43.14 | 43.65 | 33.49 | 39.06 | 3.28 | 3.41 | 48.46 |
| Qwen2.5-VL-72B | 2025-01 | 1 fps | 48.14 | 50.28 | 46.98 | 48.13 | 51.78 | 40.28 | 51.56 | 3.37 | 3.44 | 49.72 |
| Qwen2.5-VL-7B+FAVOR-Train (Ours) | -- | 1 fps | 42.13 | 41.75 | 45.17 | 40.91 | 43.57 | 39.16 | 39.06 | 3.55 | 3.53 | 56.33 |

* Due to API response limitations, the video input of proprietary MLLMs is restricted to 16 frames when the video is longer than 16 seconds (denoted as "1 fps*").
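For reference, the close-ended columns above are plain multiple-choice accuracies. The following is a minimal sketch of how such per-sub-task scores could be computed from saved predictions; the JSONL layout and field names ("task", "answer", "prediction") are our assumptions for illustration, not the benchmark's released schema.

```python
import json
from collections import defaultdict

def score_close_ended(results_path: str) -> dict:
    """Compute overall and per-sub-task multiple-choice accuracy (%)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    with open(results_path) as f:
        for line in f:
            item = json.loads(line)               # one MCQ record per line
            task = item["task"]                   # e.g. "AS", "HAC", ...
            hit = int(item["prediction"] == item["answer"])
            for key in (task, "ALL"):             # count toward sub-task and ALL
                correct[key] += hit
                total[key] += 1
    return {key: 100.0 * correct[key] / total[key] for key in total}

# Example (hypothetical file):
# print(score_close_ended("favor_bench_predictions.jsonl"))
```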

FAVOR-Bench

Motivation

Illustration of the motion understanding capabilities of proprietary and open-source MLLMs (taking GPT-4o and Qwen2-VL-72B as examples). Both models correctly answer the coarse-grained summarization question (Task 1) but fail on the fine-grained action detail question (Task 2). For the open-ended description task (Task 3), despite being instructed to focus on temporal dynamics, the responses emphasize static content, and the motion descriptions are either coarse-grained or contain errors.

Evaluation Tasks

Overview of evaluation tasks. FAVOR-Bench comprises close-ended and open-ended evaluations. The close-ended evaluation is composed of six tasks, focusing on different aspects of fine-grained motion understanding. The open-ended evaluation comprises a GPT-assisted evaluation and a novel LLM-free framework. In the GPT-assisted evaluation, model responses are directly compared with manual captions. The LLM-free framework parses structured motion elements from responses and compares them with the structured annotations.
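To make the LLM-free idea concrete, here is a minimal, hypothetical sketch that scores the overlap between motion elements parsed from a model caption and the structured ground-truth motion list. The actual FAVOR-Bench scorer is more elaborate; exact string matching on lowercase phrases is purely illustrative.

```python
def llm_free_score(predicted_motions: list[str], gt_motions: list[str]) -> float:
    """F1-style overlap between predicted and annotated motion elements."""
    pred = {m.strip().lower() for m in predicted_motions}
    gold = {m.strip().lower() for m in gt_motions}
    hits = len(pred & gold)                      # exact matches only (illustrative)
    if hits == 0:
        return 0.0
    precision = hits / len(pred)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

# Example: one of two predicted motions matches one of two annotated ones.
# llm_free_score(["raise right hand", "turn head"],
#                ["raise right hand", "step forward"])  # -> 0.5
```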

Statistics

Data statistics of FAVOR-Bench. Left: Task type distribution across close-ended and open-ended evaluation in FAVOR-Bench. Middle: Distribution of the number of motions per video. Right: Word cloud of the motion vocabulary in FAVOR-Bench.

More data statistics of FAVOR-Bench. Left: Index distribution of correct answers for the close-ended tasks. For example, "(1)" indicates that the correct option is ranked first. Middle: Video duration distribution of FAVOR-Bench. Right: Question number distribution for videos of FAVOR-Bench.

Benchmark Comparison


Comparison of FAVOR-Bench with existing video understanding benchmarks. #Videos and #Close-Ended QA refer to the number of videos and close-ended question-answer pairs, respectively. FAVOR-Bench covers a wide range of video types (Third-Person, Ego-Centric, Simulation) while focusing on fine-grained motion understanding. Moreover, FAVOR-Bench provides comprehensive evaluation, including close-ended QA and open-ended tasks (both GPT-assisted evaluation and our novel LLM-Free framework).

Experimental Results

Results on FAVOR-Bench


The overall performance of 21 MLLMs on FAVOR-Bench, including close-ended multiple-choice evaluation and open-ended evaluation with GPT-assisted and LLM-free scores. GPT-C and GPT-D denote the correctness and detailedness scores generated by GPT-4o. The highest and second-highest results among all MLLMs are indicated in bold and underlined. Due to API response limitations, the video input of proprietary MLLMs is restricted to 16 frames when the video is longer than 16 seconds (denoted as "1 fps*"). Tarsier2-Recap-7B is specially designed for captioning and therefore cannot complete the close-ended evaluation.

Fine-tuning with FAVOR-Train

Comparison on TVBench and MotionBench after fine-tuning with our proposed FAVOR-Train. "AVG" denotes the average score over all 10 TVBench tasks. "ALL" denotes the accuracy on all 4,018 questions of MotionBench-Dev. Qwen2.5-VL gains considerable performance improvements from fine-tuning with FAVOR-Train.
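As a rough illustration of how FAVOR-Train annotations could be turned into supervised fine-tuning data, the sketch below converts one hypothetical record into a conversation-style sample. The field names ("video", "motion_caption"), the prompt text, and the message schema are assumptions for illustration, not the released format or the exact training recipe used in the paper.

```python
import json

def to_sft_sample(record: dict) -> dict:
    """Wrap one annotated video into a conversation-style SFT sample."""
    return {
        "messages": [
            {"role": "user",
             "content": "<video>\nDescribe the motions in this video in "
                        "temporal order, focusing on fine-grained action "
                        "details."},
            {"role": "assistant",
             "content": record["motion_caption"]},   # assumed field name
        ],
        "videos": [record["video"]],                 # assumed field name
    }

def convert(annotations_path: str, output_path: str) -> None:
    """Convert a JSONL annotation file into JSONL SFT samples."""
    with open(annotations_path) as fin, open(output_path, "w") as fout:
        for line in fin:
            fout.write(json.dumps(to_sft_sample(json.loads(line))) + "\n")
```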

Question-Answer Examples

Citation

@article{tu2025favor,
  title={FAVOR-Bench: A Comprehensive Benchmark for Fine-Grained Video Motion Understanding},
  author={Tu, Chongjun and Zhang, Lin and Chen, Pengtao and Ye, Peng and Zeng, Xianfang and Cheng, Wei and Yu, Gang and Chen, Tao},
  journal={arXiv preprint arXiv:2503.14935},
  year={2025}
}