Modern vision models often fall into two regimes: those that learn strong semantics for recognition, and those that preserve spatial detail for reconstruction.
In this talk, we present STELLAR, a self-supervised framework for learning sparse visual concepts as a unified representation for vision models. The key idea is to factorize visual features into semantic concept tokens (the "what") and spatial assignment maps (the "where"), so that the model can align concepts across views while preserving the geometry needed for reconstruction.
This sparse, low-rank representation creates a compact interface that supports recognition, dense prediction, and image reconstruction, while also suggesting future directions for efficient visual encoding, video self-supervision, generative modeling, and world-model-style visual reasoning.
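To make the "what"/"where" factorization concrete, here is a minimal NumPy sketch under stated assumptions. It is not STELLAR's implementation: the function name `factorize_features`, the dot-product scoring, and the top-k softmax used to sparsify the assignment map are all illustrative choices. It only shows how N patch features can be approximated by a sparse (N, K) assignment map multiplied by K concept tokens, which bounds the rank of the reconstruction by K.

```python
import numpy as np


def factorize_features(features, concepts, top_k=1):
    """Approximate patch features with concept tokens and a sparse assignment map.

    features: (N, D) array of patch features (hypothetical encoder output).
    concepts: (K, D) array of concept tokens (the "what").
    top_k:    number of concepts each patch may use (assumed sparsity rule).

    Returns (assignments, reconstruction) with shapes (N, K) and (N, D).
    """
    # Score every patch against every concept.
    logits = features @ concepts.T  # (N, K)

    # Keep only the top_k highest-scoring concepts for each patch.
    idx = np.argpartition(-logits, top_k - 1, axis=1)[:, :top_k]
    mask = np.zeros(logits.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)

    # Softmax over the kept concepts; masked-out entries get weight 0.
    masked = np.where(mask, logits, -np.inf)
    masked = masked - masked.max(axis=1, keepdims=True)
    assignments = np.exp(masked)
    assignments /= assignments.sum(axis=1, keepdims=True)  # the "where"

    # Low-rank reconstruction: rank is at most K.
    reconstruction = assignments @ concepts
    return assignments, reconstruction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(196, 64))   # e.g. a 14x14 grid of patch features
    tokens = rng.normal(size=(8, 64))    # 8 hypothetical concept tokens
    A, recon = factorize_features(feats, tokens, top_k=2)
    print(A.shape, recon.shape)
```

Reshaping each column of the assignment map back to the patch grid gives one spatial map per concept, which is the sense in which the concept tokens can be compared across views while the maps retain layout.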
We discuss the core method, empirical results, and why concept-centric visual representations may be a useful building block for the next generation of unified vision systems.