[NeurIPS 2024] This repo contains evaluation code for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models"

MMStar-Benchmark/MMStar

MMStar

🌐 Homepage | 🤗 Dataset | 📖 Paper | 🏆 Leaderboard

This repo contains the official evaluation code and dataset for the paper "Are We on the Right Way for Evaluating Large Vision-Language Models?"

💡 Highlights

  • 🔥 Two Key Issues that Lead to Misjudgment of the current LVLMs' Capabilities
  • 🔥 An Elite Vision-indispensable Multi-modal Benchmark, MMStar
  • 🔥 Two Metrics: Multi-modal Gain (MG) and Multi-modal Leakage (ML)

📜 News

[2024.9.26] 🚀 MMStar was accepted by NeurIPS 2024!

[2024.4.16] 🚀 MMStar has been supported in the VLMEvalKit repository and OpenCompass leaderboard.

[2024.4.2] 🚀 Huggingface Dataset and evaluation code are available!

[2024.4.1] 🚀 We released the ArXiv paper.

👨‍💻 Todo

  • Evaluation code for MMStar
  • Support online Leaderboard
  • Curate an online test set, MMStar-test (this involves working with existing multi-modal benchmarks that contain protected test sets; feel free to contact us!)

👀 Introduction

We dig into current evaluation works and identify two primary issues:

(1) Visual content is unnecessary for many samples.

(2) Unintentional data leakage exists in LLM and LVLM training.

Both problems lead to misjudgments of actual multi-modal performance gains and can misguide the study of LVLMs. To this end, we present MMStar, an elite vision-indispensable multi-modal benchmark comprising 1,500 challenge samples meticulously selected by humans. Starting from a total of 22,401 samples, a coarse filtering process narrows the pool to 11,607 candidate samples, and a manual review then selects the final 1,500 high-quality samples that make up the MMStar benchmark.

In MMStar, we display 6 core capabilities in the inner ring, with 18 detailed axes presented in the outer ring. The middle ring showcases the number of samples for each detailed dimension. Each core capability contains a meticulously balanced 250 samples. We further ensure a relatively even distribution across the 18 detailed axes.
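The balance described above (6 core capabilities with 250 samples each, 1,500 in total) can be checked with a few lines of Python. This is only a sketch: the `category` field name, the `check_balance` helper, and the placeholder capability names are assumptions made for illustration, not names taken from the dataset.

```python
from collections import Counter

SAMPLES_PER_CAPABILITY = 250  # stated in the README
NUM_CAPABILITIES = 6          # stated in the README


def check_balance(records, expected=SAMPLES_PER_CAPABILITY):
    """Return the capabilities whose sample count differs from `expected`.

    `records` is an iterable of dicts with a (hypothetical) "category" key.
    An empty result means every capability is balanced.
    """
    counts = Counter(record["category"] for record in records)
    return {cap: n for cap, n in counts.items() if n != expected}


# Toy records with placeholder capability names (not the real ones).
toy_records = [
    {"category": f"capability_{i}"}
    for i in range(NUM_CAPABILITIES)
    for _ in range(SAMPLES_PER_CAPABILITY)
]

unbalanced = check_balance(toy_records)
print(len(toy_records), unbalanced)
```

On real data, the same helper could be pointed at whatever column holds the core capability label.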

🤖 Evaluation

You can evaluate any LLM or LVLM on MMStar by following the evaluation guidelines.

🏆 Leaderboard

🎯 The Leaderboard for MMStar is continuously updated, and we welcome contributions of your LVLMs!

Please note that to thoroughly evaluate your own LVLM, you are required to provide us with three result files in xlsx format. These should include the results of your LVLM with visual input, the results of your LVLM without visual input, and the results of your original LLM base without visual input. We have provided a submission format in the submits folder. After completing the aforementioned steps, please contact us via chlin@mail.ustc.edu.cn to submit your results and to update the leaderboard.
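The three result files map onto the two metrics listed in the Highlights. The sketch below assumes the definitions as we read them in the paper: Multi-modal Gain is the LVLM score with visual input minus its score without visual input, and Multi-modal Leakage is the LVLM score without visual input minus the score of its LLM base, floored at zero. The function names and the toy predictions are hypothetical, and reading the xlsx files is left out.

```python
def accuracy(predictions, answers):
    """Percentage of predictions that exactly match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def multimodal_gain(score_with_vision, score_without_vision):
    """MG: what the LVLM actually gains from seeing the image (assumed definition)."""
    return score_with_vision - score_without_vision


def multimodal_leakage(score_without_vision, score_llm_base):
    """ML: how far the image-free LVLM exceeds its LLM base, floored at 0 (assumed definition)."""
    return max(0.0, score_without_vision - score_llm_base)


# Toy example with four multiple-choice questions (made-up predictions).
answers = ["A", "B", "C", "D"]
lvlm_with_vision = ["A", "B", "C", "A"]     # 3 of 4 correct
lvlm_without_vision = ["A", "B", "A", "A"]  # 2 of 4 correct
llm_base = ["A", "C", "A", "A"]             # 1 of 4 correct

s_v = accuracy(lvlm_with_vision, answers)
s_wv = accuracy(lvlm_without_vision, answers)
s_t = accuracy(llm_base, answers)

mg = multimodal_gain(s_v, s_wv)
ml = multimodal_leakage(s_wv, s_t)
print(f"MG = {mg:.1f}, ML = {ml:.1f}")
```

A high MG with a low ML suggests the model benefits from the images themselves rather than from samples it may have seen during multi-modal training.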

📧 Contact

✒️ Citation

If you find our work helpful for your research, please consider giving us a star ⭐ and a citation 📝

@article{chen2024we,
  title={Are We on the Right Way for Evaluating Large Vision-Language Models?},
  author={Chen, Lin and Li, Jinsong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Chen, Zehui and Duan, Haodong and Wang, Jiaqi and Qiao, Yu and Lin, Dahua and others},
  journal={arXiv preprint arXiv:2403.20330},
  year={2024}
}