Abstract
START enhances multimodal large language models by integrating spatial and textual learning through chart-element grounding and chart-to-code generation, improving chart understanding and performance across benchmarks.
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
Community
Does visual grounding help visual reasoning in Chart Understanding? 📊🧠
I am excited to share our latest paper, "START: Spatial and Textual Learning for Chart Understanding," which explores how we can teach Multimodal LLMs (MLLMs) to better understand complex, real-world charts.
The Challenge: In real-world scenarios (like scientific papers), charts often have complex layouts with multiple subplots. Current models often fail because they jump to reasoning without first "grounding" (locating) the correct visual elements or understanding the underlying data.
Our Solution - START: We propose a spatial and textual learning framework that trains MLLMs using two auxiliary tasks alongside Chart QA:
Chart Element Grounding (Spatial): Explicitly teaching the model to locate specific components (legends, subplots), which boosts spatial reasoning.
Chart-to-Code Generation (Textual): Recovering the Python code used to render the chart to understand data details (a toy example of both task formats follows this list).
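To make the two auxiliary tasks concrete, here is a toy sketch of what the paired supervision could look like. The schema is illustrative only -- the field names and values are assumptions, not the released START-Dataset format.

```python
# Hypothetical supervision for the two auxiliary tasks (illustrative schema,
# not the released START-Dataset format).

# Spatial task: chart-element grounding -- the target is a bounding box.
grounding_sample = {
    "image": "arxiv_chart_001.png",
    "question": "Locate the legend of the top-left subplot.",
    "answer_bbox": [412, 38, 118, 52],  # (x, y, width, height) in pixels
}

# Textual task: chart-to-code -- the target is executable plotting code,
# which exposes the chart's underlying data values.
chart_to_code_sample = {
    "image": "arxiv_chart_001.png",
    "instruction": "Write matplotlib code that reproduces this chart.",
    "target_code": (
        "import matplotlib.pyplot as plt\n"
        "fig, ax = plt.subplots()\n"
        "ax.plot([2019, 2020, 2021], [0.71, 0.74, 0.79], label='accuracy')\n"
        "ax.legend()\n"
        "plt.show()\n"
    ),
}
```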
Key Contributions:
START-Dataset: We developed a novel pipeline that converts real chart images (from arXiv) into executable Python code and precise element locations, preserving real-world visual complexity (a minimal extraction sketch appears at the end of this post).
CS-Bench: A new benchmark specifically designed to evaluate chart spatial understanding.
SOTA Results: Our model, START-RL-7B, outperforms previous state-of-the-art models (such as Chart-R1) by a clear margin on benchmarks like CharXiv, ChartMimic, and ChartQAPro.
This work has been accepted to WACV 2026.
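As a rough illustration of how element locations could be recovered once a chart has been translated into executable matplotlib code, here is a minimal sketch. It is not the authors' released pipeline; the helper extract_element_boxes and the example chart are hypothetical.

```python
# Minimal sketch (an assumption, not the paper's implementation): render
# reconstructed chart code headlessly and read back pixel-space bounding
# boxes for subplots and legends to use as grounding targets.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def extract_element_boxes(fig):
    """Return pixel-space (x, y, width, height) boxes for subplots and legends."""
    fig.canvas.draw()  # force a layout pass so window extents are valid
    boxes = {}
    for i, ax in enumerate(fig.axes):
        boxes[f"subplot_{i}"] = tuple(ax.get_window_extent().bounds)
        legend = ax.get_legend()
        if legend is not None:
            boxes[f"legend_{i}"] = tuple(legend.get_window_extent().bounds)
    return boxes

# Chart code as an MLLM might reconstruct it from a real arXiv figure.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot([1, 2, 3], [2.1, 3.8, 7.9], label="model A")
ax1.legend()
ax2.bar(["x", "y", "z"], [3, 1, 2])

print(extract_element_boxes(fig))
```

Executing the reconstructed code this way yields both the underlying data (textual) and element bounding boxes (spatial), the kind of paired supervision described above.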
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- ChartM3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension (2025)
- ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning (2025)
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards (2025)
- Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark (2025)
- Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models (2025)
- UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models (2025)
- ChartAB: A Benchmark for Chart Grounding & Dense Alignment (2025)