Abstract
START enhances multimodal large language models by integrating spatial and textual learning through chart-element grounding and chart-to-code generation, improving chart understanding and performance across benchmarks.
Chart understanding is crucial for deploying multimodal large language models (MLLMs) in real-world scenarios such as analyzing scientific papers and technical reports. Unlike natural images, charts pair a structured visual layout (spatial property) with an underlying data representation (textual property) -- grasping both is essential for precise, fine-grained chart reasoning. Motivated by this observation, we propose START, the Spatial and Textual learning for chART understanding. Specifically, we introduce (i) chart-element grounding and (ii) chart-to-code generation to strengthen an MLLM's understanding of both chart visual layout and data details. To facilitate spatial and textual learning, we propose the START-Dataset generated with a novel data-generation pipeline that first leverages an MLLM to translate real chart images into executable chart code, recovering the underlying data representation while preserving the visual distribution of real-world charts. We then evolve the code with a Large Language Model (LLM) to ascertain the positions of chart elements that capture the chart's visual structure, addressing challenges that existing methods cannot handle. To evaluate a model's ability to understand chart spatial structures, we propose the Chart Spatial understanding Benchmark (CS-Bench), filling a critical gap in comprehensive chart understanding evaluation. Leveraging spatial and textual learning, START delivers consistent gains across model sizes and benchmarks over the base models and surpasses prior state-of-the-art by a clear margin. Code, data and models will be publicly available.
Community
Does visual grounding help visual reasoning in Chart Understanding? 📊🧠
I am excited to share our latest paper, "START: Spatial and Textual Learning for Chart Understanding," which explores how we can teach Multimodal LLMs (MLLMs) to better understand complex, real-world charts.
The Challenge: In real-world scenarios (like scientific papers), charts often have complex layouts with multiple subplots. Current models often fail because they jump to reasoning without first "grounding" (locating) the correct visual elements or understanding the underlying data.
Our Solution - START: We propose a spatial and textual learning framework that trains MLLMs using two auxiliary tasks alongside Chart QA:
Chart Element Grounding (Spatial): Explicitly teaching the model to locate specific components (legends, subplots), which boosts spatial reasoning.
Chart-to-Code Generation (Textual): Recovering the Python code used to render the chart to understand data details (a toy example of both task formats follows this list).
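To make the two auxiliary tasks concrete, here is a toy sketch of what the paired supervision could look like. The schema is illustrative only -- the field names and values are assumptions, not the released START-Dataset format.

```python
# Hypothetical supervision for the two auxiliary tasks (illustrative schema,
# not the released START-Dataset format).

# Spatial task: chart-element grounding -- the target is a bounding box.
grounding_sample = {
    "image": "arxiv_chart_001.png",
    "question": "Locate the legend of the top-left subplot.",
    "answer_bbox": [412, 38, 118, 52],  # (x, y, width, height) in pixels
}

# Textual task: chart-to-code -- the target is executable plotting code,
# which exposes the chart's underlying data values.
chart_to_code_sample = {
    "image": "arxiv_chart_001.png",
    "instruction": "Write matplotlib code that reproduces this chart.",
    "target_code": (
        "import matplotlib.pyplot as plt\n"
        "fig, ax = plt.subplots()\n"
        "ax.plot([2019, 2020, 2021], [0.71, 0.74, 0.79], label='accuracy')\n"
        "ax.legend()\n"
        "plt.show()\n"
    ),
}
```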
Key Contributions:
START-Dataset: We developed a novel pipeline that converts real chart images (from arXiv) into executable Python code and precise element locations, preserving real-world visual complexity (a minimal extraction sketch appears at the end of this post).
CS-Bench: A new benchmark specifically designed to evaluate chart spatial understanding.
SOTA Results: Our model, START-RL-7B, outperforms previous state-of-the-art models (such as Chart-R1) by a clear margin on benchmarks like CharXiv, ChartMimic, and ChartQAPro.
This work has been accepted to WACV 2026.
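As a rough illustration of how element locations could be recovered once a chart has been translated into executable matplotlib code, here is a minimal sketch. It is not the authors' released pipeline; the helper extract_element_boxes and the example chart are hypothetical.

```python
# Minimal sketch (an assumption, not the paper's implementation): render
# reconstructed chart code headlessly and read back pixel-space bounding
# boxes for subplots and legends to use as grounding targets.
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def extract_element_boxes(fig):
    """Return pixel-space (x, y, width, height) boxes for subplots and legends."""
    fig.canvas.draw()  # force a layout pass so window extents are valid
    boxes = {}
    for i, ax in enumerate(fig.axes):
        boxes[f"subplot_{i}"] = tuple(ax.get_window_extent().bounds)
        legend = ax.get_legend()
        if legend is not None:
            boxes[f"legend_{i}"] = tuple(legend.get_window_extent().bounds)
    return boxes

# Chart code as an MLLM might reconstruct it from a real arXiv figure.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.plot([1, 2, 3], [2.1, 3.8, 7.9], label="model A")
ax1.legend()
ax2.bar(["x", "y", "z"], [3, 1, 2])

print(extract_element_boxes(fig))
```

Executing the reconstructed code this way yields both the underlying data (textual) and element bounding boxes (spatial), the kind of paired supervision described above.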
This is an automated message from the Librarian Bot. The following papers, similar to this one, were recommended by the Semantic Scholar API:
- ChartM3: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension (2025)
- ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning (2025)
- SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards (2025)
- Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark (2025)
- Zooming into Comics: Region-Aware RL Improves Fine-Grained Comic Understanding in Vision-Language Models (2025)
- UFVideo: Towards Unified Fine-Grained Video Cooperative Understanding with Large Language Models (2025)
- ChartAB: A Benchmark for Chart Grounding & Dense Alignment (2025)