CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Aniket Rajiv Didolkar1,2*     Andrii Zadaianchuk3*†     Rabiul Awal1,2*     Maximilian Seitzer4     Efstratios Gavves3,5     Aishwarya Agrawal1,2†
1Mila - Quebec AI Institute   2Université de Montréal   3University of Amsterdam   4University of Tübingen   5Archimedes/Athena RC, Greece
*Equal contribution. Order determined by flipping a coin.   †Equal advising
Published at CVPR 2025

Abstract

Object-centric representation learning aims to decompose visual scenes into sets of fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success at object discovery across diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects and parts, without allowing user input to guide which objects are represented.

Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. We then apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. We find that the proposed approach enables instance-specific text-to-image generation and achieves strong performance on visual question answering.

CTRL-O model

(a) CTRL-O architecture. An input image is processed by a frozen DINOv2 ViT, yielding patch features. A learnable transformer encoder then transforms these features to align the feature space with the control queries. The control queries are introduced in the Slot Attention (SA) module, where they condition the initial slots and guide the grouping of the encoded features into slots. Finally, an MLP decoder, also conditioned on the control queries, reconstructs the DINOv2 features.
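To make the pipeline in (a) concrete, here is a minimal PyTorch-style sketch of the forward pass: frozen DINOv2 patch features pass through a learnable transformer encoder, Slot Attention initializes its slots from the control queries, and a broadcast MLP decoder reconstructs the DINOv2 features. All names and hyperparameters (ControlledSlotAttention, BroadcastMLPDecoder, CTRLO, hidden sizes, iteration counts) are illustrative assumptions, not the authors' released implementation, and the decoder's conditioning on the control queries is omitted for brevity.

import torch
import torch.nn as nn

class ControlledSlotAttention(nn.Module):
    """Slot Attention whose initial slots are the language control queries."""
    def __init__(self, dim: int, n_iters: int = 3):
        super().__init__()
        self.n_iters, self.scale = n_iters, dim ** -0.5
        self.norm_feat, self.norm_slot = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, feats, control_queries):
        # feats: (B, N, D) encoded patch features; control_queries: (B, S, D)
        slots = control_queries  # assumption: queries directly initialize slots
        feats = self.norm_feat(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.n_iters):
            q = self.to_q(self.norm_slot(slots))
            # Softmax over slots makes slots compete for each patch.
            attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # per-slot mean
            updates = attn @ v                                      # (B, S, D)
            slots = self.gru(updates.flatten(0, 1),
                             slots.flatten(0, 1)).view_as(slots)
        return slots, attn

class BroadcastMLPDecoder(nn.Module):
    """DINOSAUR-style decoder: broadcast each slot to every patch position and
    mix the per-slot predictions with learned alpha weights."""
    def __init__(self, dim: int, n_patches: int):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, 1, n_patches, dim))
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim + 1))

    def forward(self, slots):
        x = slots.unsqueeze(2) + self.pos             # (B, S, N, D)
        out = self.mlp(x)                             # feature + alpha per patch
        feats, alpha = out[..., :-1], out[..., -1:]
        weights = torch.softmax(alpha, dim=1)         # mixing weight per slot
        return (weights * feats).sum(dim=1)           # (B, N, D) reconstruction

class CTRLO(nn.Module):
    def __init__(self, dim: int = 768, n_patches: int = 256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.slot_attention = ControlledSlotAttention(dim)
        self.decoder = BroadcastMLPDecoder(dim, n_patches)

    def forward(self, dinov2_feats, control_queries):
        feats = self.encoder(dinov2_feats)            # align features with queries
        slots, attn = self.slot_attention(feats, control_queries)
        recon = self.decoder(slots)                   # reconstruct DINOv2 features
        return slots, attn, recon

Training would minimize a feature-reconstruction loss such as ((recon - dinov2_feats) ** 2).mean(), alongside the control contrastive loss sketched below.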

(b) Control contrastive loss. To ensure that slots use the query information to represent specific objects, we apply a contrastive loss between the control queries and the Slot Attention-modulated weighted DINO features (referred to as weighted DINO slots).
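One plausible instantiation of this loss is an InfoNCE-style objective: each control query should be most similar to its own slot's attention-weighted DINO features, with all other query/slot pairs in the batch acting as negatives. The temperature and the exact positive/negative construction below are assumptions, not necessarily the paper's formulation.

import torch
import torch.nn.functional as F

def control_contrastive_loss(control_queries, attn, dino_feats, tau: float = 0.1):
    # control_queries: (B, S, D); attn: (B, S, N) Slot Attention weights over
    # N patches; dino_feats: (B, N, D) frozen DINO(v2) patch features.
    weighted_slots = attn @ dino_feats                       # weighted DINO slots
    q = F.normalize(control_queries.flatten(0, 1), dim=-1)   # (B*S, D)
    w = F.normalize(weighted_slots.flatten(0, 1), dim=-1)    # (B*S, D)
    logits = q @ w.t() / tau             # cosine similarities of all pairs
    targets = torch.arange(logits.size(0), device=logits.device)
    # The matched query/slot pair on the diagonal is the positive; every other
    # slot, from this image or another in the batch, is a negative.
    return F.cross_entropy(logits, targets)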

Visual Grounding with Language Queries

CTRL-O visual grounding example
CTRL-O binds language queries to specific objects in complex scenes and infers object-centric representations of the corresponding objects.

Downstream Tasks

CTRL-O enables powerful downstream applications by providing controllable object-centric representations.

CTRL-O downstream applications
We demonstrate the effectiveness of CTRL-O on two key downstream applications: instance-controllable text-to-image generation and visual question answering. These applications showcase how language-controlled object-centric representations can enable precise control and reasoning.

Key Findings

  • Language Control: CTRL-O allows users to control which objects are represented in the slot representations using language queries, including simple object category names and complex referring expressions.
  • No Mask Supervision: Our approach works without requiring mask supervision, making it applicable to a wide range of real-world scenarios.
  • Downstream Applications: We demonstrate the effectiveness of CTRL-O on two downstream applications:
    • Instance-Controllable Image Generation: CTRL-O enables generating images in which specific object instances are controlled while the surrounding scene context is preserved (see the conditioning sketch after this list).
    • Visual Question Answering: Controllable object-centric representations achieve strong performance on visual question answering tasks.
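To illustrate how slot conditioning for generation might be wired, here is a generic, hedged sketch in which a generator's activations cross-attend to CTRL-O slots in place of text embeddings. SlotCrossAttention and its placement inside a specific generator are illustrative assumptions, not the paper's actual generation pipeline.

import torch
import torch.nn as nn

class SlotCrossAttention(nn.Module):
    """Generic cross-attention block: image features attend to object slots."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, image_feats, slots):
        # image_feats: (B, N, D) generator activations; slots: (B, S, D)
        out, _ = self.attn(query=image_feats, key=slots, value=slots)
        return image_feats + out  # residual conditioning on the chosen objects

Swapping one query's slot for a slot extracted from another image would then change that instance in the generated output while the remaining slots keep the rest of the scene fixed.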

BibTeX

@inproceedings{didolkar2025ctrlo,
    title={CTRL-O: Language-Controllable Object-Centric Visual Representation Learning},
    author={Didolkar, Aniket Rajiv and Zadaianchuk, Andrii and Awal, Rabiul and Seitzer, Maximilian and Gavves, Efstratios and Agrawal, Aishwarya},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
}