arxiv:2512.14620

JMMMU-Pro: Image-based Japanese Multi-discipline Multimodal Understanding Benchmark via Vibe Benchmark Construction

Published on Dec 16
Submitted by Atsuyuki Miyai on Dec 17

AI-generated summary

JMMMU-Pro, an image-based Japanese multi-discipline multimodal understanding benchmark, challenges open-source large multimodal models through integrated visual-textual understanding and is constructed using Vibe Benchmark Construction, a cost-effective method leveraging realistic image generation.

Abstract

This paper introduces JMMMU-Pro, an image-based Japanese Multi-discipline Multimodal Understanding Benchmark, and Vibe Benchmark Construction, a scalable construction method. Following the evolution from MMMU to MMMU-Pro, JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, thereby creating a benchmark that requires integrated visual-textual understanding through visual perception. To build JMMMU-Pro, we propose Vibe Benchmark Construction, a methodology in which an image generative model (e.g., Nano Banana Pro) produces candidate visual questions, and humans verify the outputs and, when necessary, regenerate with adjusted prompts to ensure quality. By leveraging Nano Banana Pro's highly realistic image generation capabilities and its ability to embed clean Japanese text, we construct a high-quality benchmark at low cost, covering a wide range of background and layout designs. Experimental results show that all open-source LMMs struggle substantially with JMMMU-Pro, underscoring JMMMU-Pro as an important benchmark for guiding future efforts in the open-source community. We believe that JMMMU-Pro provides a more rigorous evaluation tool for assessing the Japanese capabilities of LMMs and that our Vibe Benchmark Construction also offers an efficient guideline for future development of image-based VQA benchmarks.
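For illustration only, here is a minimal Python sketch of the generate → verify → regenerate loop that Vibe Benchmark Construction describes. The helper functions (`generate_candidate`, `human_review`), the `VisualQuestion` fields, and the retry budget are hypothetical assumptions made for this sketch; they are not the authors' tooling and do not call any real Nano Banana Pro API.

```python
# Minimal sketch of the generate -> verify -> regenerate loop described above.
# generate_candidate / human_review are hypothetical placeholders, not the
# authors' tooling or a real Nano Banana Pro API.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class VisualQuestion:
    image_bytes: bytes   # rendered question image (figure, Japanese text, options)
    answer: str          # ground-truth answer, kept outside the image
    prompt: str          # prompt that produced the image

def generate_candidate(prompt: str) -> VisualQuestion:
    """Placeholder: ask an image generative model to render a complete
    visual question (background, layout, embedded Japanese text)."""
    raise NotImplementedError

def human_review(candidate: VisualQuestion) -> Tuple[bool, str]:
    """Placeholder: a human checks legibility and correctness; if the
    candidate is rejected, an adjusted prompt is returned for regeneration."""
    raise NotImplementedError

def vibe_construct(seed_prompt: str, max_rounds: int = 3) -> Optional[VisualQuestion]:
    """Generate a candidate, have a human verify it, and regenerate with an
    adjusted prompt until it is accepted or the retry budget runs out."""
    prompt = seed_prompt
    for _ in range(max_rounds):
        candidate = generate_candidate(prompt)
        accepted, prompt = human_review(candidate)
        if accepted:
            return candidate   # passes human verification -> keep in the benchmark
    return None                # discard items that never reach acceptable quality
```

The point of the pattern is that the image model produces the entire visual question, while the human's role is reduced to acceptance checks and prompt adjustments, which is what keeps construction cost low.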

Community

Paper author · Paper submitter

“Bro, benchmarks like MMMU-Pro are too expensive to build, right?”

One month ago: Yes.
Now: No.

🚀 Proposing Vibe Benchmark Construction!

  • Nano Banana Pro generates the VQA itself, and humans only check the outputs or lightly edit prompts for regeneration.

🚀 Building JMMMU-Pro incredibly quickly!

  • JMMMU-Pro extends JMMMU by composing the question image and question text into a single image, requiring integrated visual-textual understanding through visual perception (a rough sketch of this composition is below).
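For intuition only, a rough PIL sketch of composing a question image and its text into one image, in the spirit of the JMMMU → JMMMU-Pro change. The layout constants and the CJK font path are illustrative assumptions; the actual JMMMU-Pro images are generated with Nano Banana Pro rather than with a script like this.

```python
# Rough PIL sketch: render the question text and choices onto the same canvas
# as the original figure, so the model must read everything visually.
# Layout values and the font path are illustrative assumptions only.

from PIL import Image, ImageDraw, ImageFont

def compose_visual_question(question_img: Image.Image,
                            question_text: str,
                            choices: list,
                            font_path: str = "NotoSansCJK-Regular.ttc") -> Image.Image:
    font = ImageFont.truetype(font_path, 28)      # needs a Japanese-capable font
    text_block = question_text + "\n" + "\n".join(choices)

    width = max(question_img.width + 40, 900)
    canvas = Image.new("RGB", (width, question_img.height + 400), "white")
    canvas.paste(question_img, (20, 20))          # original figure on top

    draw = ImageDraw.Draw(canvas)
    draw.multiline_text((20, question_img.height + 40),   # question + options below
                        text_block, fill="black", font=font)
    return canvas
```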

🧐 Most open-source LMMs seem to perform close to random guessing on JMMMU-Pro. Let's take on the challenge!

Paper: https://arxiv.org/pdf/2512.14620
Project Page: https://mmmu-japanese-benchmark.github.io/JMMMU_Pro/
