arxiv:2512.15715

In Pursuit of Pixel Supervision for Visual Pre-training

Published on Dec 17 · Submitted by Lihe Yang on Dec 18
Abstract

AI-generated summary: Pixio, an enhanced masked autoencoder, achieves competitive performance across a wide range of downstream tasks through pixel-space self-supervised learning, matching or outperforming latent-space approaches such as DINOv3.

At the most basic level, pixels are the source of the visual information through which we perceive the world. Pixels contain information at all levels, ranging from low-level attributes to high-level concepts. Autoencoders represent a classical and long-standing paradigm for learning representations from pixels or other raw inputs. In this work, we demonstrate that autoencoder-based self-supervised learning remains competitive today and can produce strong representations for downstream tasks, while remaining simple, stable, and efficient. Our model, codenamed "Pixio", is an enhanced masked autoencoder (MAE) with more challenging pre-training tasks and more capable architectures. The model is trained on 2B web-crawled images using a self-curation strategy that requires minimal human curation. Pixio performs competitively across a wide range of downstream tasks in the wild, including monocular depth estimation (e.g., Depth Anything), feed-forward 3D reconstruction (i.e., MapAnything), semantic segmentation, and robot learning, outperforming or matching DINOv3 trained at similar scales. Our results suggest that pixel-space self-supervised learning can serve as a promising alternative and a complement to latent-space approaches.
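For readers unfamiliar with the MAE recipe the abstract builds on, the sketch below illustrates a generic masked-autoencoder pre-training objective in PyTorch: most image patches are masked, the encoder sees only the visible patches, and a light decoder regresses the raw pixels of the masked patches. This is not the Pixio implementation; the TinyMAE module, the 75% mask ratio, and all layer sizes are illustrative assumptions, and positional embeddings are omitted for brevity.

```python
# Minimal masked-autoencoder (MAE) pixel-reconstruction sketch in PyTorch.
# Not Pixio's actual recipe: it only shows the generic pixel-space objective
# (mask most patches, encode the visible ones, regress masked pixels).
import torch
import torch.nn as nn


def patchify(imgs, patch=16):
    """Split (B, 3, H, W) images into (B, N, patch*patch*3) pixel patches."""
    b, c, h, w = imgs.shape
    imgs = imgs.reshape(b, c, h // patch, patch, w // patch, patch)
    imgs = imgs.permute(0, 2, 4, 3, 5, 1)          # (B, h/p, w/p, p, p, C)
    return imgs.reshape(b, -1, patch * patch * c)  # (B, N, p*p*C)


class TinyMAE(nn.Module):
    # Assumed, illustrative sizes; the real model and mask ratio differ.
    def __init__(self, patch=16, dim=256, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Linear(patch * patch * 3, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=4,
        )
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, patch * patch * 3)  # predict raw pixels

    def forward(self, imgs):
        patches = patchify(imgs, self.patch)           # (B, N, p*p*3)
        b, n, _ = patches.shape
        keep = int(n * (1 - self.mask_ratio))

        # Random masking: the encoder only sees a small subset of patches.
        noise = torch.rand(b, n, device=imgs.device)
        ids = noise.argsort(dim=1)
        ids_keep, ids_mask = ids[:, :keep], ids[:, keep:]

        visible = torch.gather(
            self.embed(patches), 1,
            ids_keep.unsqueeze(-1).expand(-1, -1, self.embed.out_features),
        )
        latent = self.encoder(visible)

        # Decoder sees encoded visible tokens plus mask tokens, then
        # regresses raw pixels; the loss covers masked patches only.
        mask_tokens = self.mask_token.expand(b, n - keep, -1)
        decoded = self.decoder(torch.cat([latent, mask_tokens], dim=1))
        pred_masked = self.head(decoded[:, keep:])     # (B, N-keep, p*p*3)
        target = torch.gather(
            patches, 1, ids_mask.unsqueeze(-1).expand(-1, -1, patches.shape[-1])
        )
        return ((pred_masked - target) ** 2).mean()


if __name__ == "__main__":
    model = TinyMAE()
    loss = model(torch.randn(2, 3, 224, 224))
    loss.backward()
    print(float(loss))
```

The abstract's "more challenging pre-training tasks and more capable architectures" would replace the toy masking scheme and small encoder/decoder above; the underlying objective of reconstructing raw pixels from a heavily masked input is what distinguishes this pixel-space approach from latent-space methods such as DINOv3.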

Community


arXiv lens breakdown of this paper 👉 https://arxivlens.com/PaperView/Details/in-pursuit-of-pixel-supervision-for-visual-pre-training-8810-5e30657e

  • Executive Summary
  • Detailed Breakdown
  • Practical Applications


Models citing this paper: 5
Datasets citing this paper: 0
Spaces citing this paper: 0
Collections including this paper: 0