SmolVLM: Redefining small and efficient multimodal models Paper • 2504.05299 • Published Apr 7, 2025 • 205
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 89
Improving Vision-Language-Action Model with Online Reinforcement Learning Paper • 2501.16664 • Published Jan 28, 2025 • 1
JARVIS-VLA: Post-Training Large-Scale Vision Language Models to Play Visual Games with Keyboards and Mouse Paper • 2503.16365 • Published Mar 20, 2025 • 41
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey Paper • 2503.12605 • Published Mar 16, 2025 • 35
DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation Paper • 2501.16764 • Published Jan 28, 2025 • 22
VideoRAG: Retrieval-Augmented Generation over Video Corpus Paper • 2501.05874 • Published Jan 10, 2025 • 75
LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs Paper • 2501.06186 • Published Jan 10, 2025 • 65
rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking Paper • 2501.04519 • Published Jan 8, 2025 • 288
SPAR3D: Stable Point-Aware Reconstruction of 3D Objects from Single Images Paper • 2501.04689 • Published Jan 8, 2025 • 17
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization Paper • 2501.03271 • Published Jan 5, 2025 • 10
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM Paper • 2501.00599 • Published Dec 31, 2024 • 46
Molar: Multimodal LLMs with Collaborative Filtering Alignment for Enhanced Sequential Recommendation Paper • 2412.18176 • Published Dec 24, 2024 • 16
Health AI Developer Foundations (HAI-DEF) Collection Groups models released for use in health AI by Google. Read more about HAI-DEF at http://goo.gle/hai-def • 22 items • Updated 21 days ago • 179
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration Paper • 2412.04440 • Published Dec 5, 2024 • 22
ARCLE: The Abstraction and Reasoning Corpus Learning Environment for Reinforcement Learning Paper • 2407.20806 • Published Jul 30, 2024 • 1
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark Paper • 2410.19168 • Published Oct 24, 2024 • 24
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities Paper • 2401.12168 • Published Jan 22, 2024 • 29