MMSI-Video-Bench: A Holistic Benchmark for Video-Based Spatial Intelligence Paper • 2512.10863 • Published 21 days ago • 21
MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations Paper • 2406.09401 • Published Jun 13, 2024
Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models Paper • 2505.17015 • Published May 22, 2025 • 9
MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence Paper • 2505.23764 • Published May 29, 2025 • 3
OST-Bench: Evaluating the Capabilities of MLLMs in Online Spatio-temporal Scene Understanding Paper • 2507.07984 • Published Jul 10, 2025 • 42
VFlowOpt: A Token Pruning Framework for LMMs with Visual Information Flow-Guided Optimization Paper • 2508.05211 • Published Aug 7, 2025 • 1
ChangingGrounding: 3D Visual Grounding in Changing Scenes Paper • 2510.14965 • Published Oct 16, 2025
G²VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning Paper • 2511.21688 • Published Nov 26, 2025 • 8
Seedream 4.0: Toward Next-generation Multimodal Image Generation Paper • 2509.20427 • Published Sep 24, 2025 • 82