CTRL-O: Language-Controllable Object-Centric Visual Representation Learning

Aniket Rajiv Didolkar1,2*     Andrii Zadaianchuk3*†     Rabiul Awal1,2*     Maximilian Seitzer4     Efstratios Gavves3,5     Aishwarya Agrawal1,2†
1Mila - Quebec AI Institute   2Université de Montréal   3University of Amsterdam   4University of Tübingen   5Archimedes/Athena RC, Greece
*Equal contribution. Order determined by flipping a coin.   †Equal advising
Published at CVPR 2025

Overview CTRL-O at a Glance

Language Control

Control which objects are represented using natural language queries and referring expressions

Weak Supervision

Works without requiring mask supervision, applicable to a wide range of real-world scenarios

Multimodal Applications

Enables both instance-specific text-to-image generation and visual question answering capabilities

Comparison of Standard OCL vs CTRL-O
Left: Standard object-centric learning assigns arbitrary slots with no semantic labels. Right: CTRL-O introduces language-based control, enabling specific object targeting and multimodal applications.

Summary Abstract

Object-centric representation learning aims to decompose visual scenes into fixed-size vectors called "slots" or "object files", where each slot captures a distinct object. Current state-of-the-art object-centric models have shown remarkable success in object discovery in diverse domains, including complex real-world scenes. However, these models suffer from a key limitation: they lack controllability. Specifically, current object-centric models learn representations based on their preconceived understanding of objects and parts, without allowing user input to guide which objects are represented.

Introducing controllability into object-centric models could unlock a range of useful capabilities, such as the ability to extract instance-specific representations from a scene. In this work, we propose a novel approach for user-directed control over slot representations by conditioning slots on language descriptions. The proposed ConTRoLlable Object-centric representation learning approach, which we term CTRL-O, achieves targeted object-language binding in complex real-world scenes without requiring mask supervision. We then apply these controllable slot representations to two downstream vision-language tasks: text-to-image generation and visual question answering. We find that the proposed approach enables instance-specific text-to-image generation and also achieves strong performance on visual question answering.

Architecture Model Design

CTRL-O architecture (figure). The model is built around three components; sketches of the language-control and contrastive-loss steps follow below:

01 Language Control: control queries guide the slot attention process
02 Slot Attention Module: features group into object-centric slots
03 Contrastive Loss: binds queries to visual objects
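The language-control step can be pictured as conditioning a subset of slot initializations on encoded control queries before running standard Slot Attention over the visual features. The following is a minimal PyTorch sketch of this idea; it assumes the queries are already encoded as vectors (for example by a frozen text encoder), and names such as `ControlledSlotAttention` and the specific initialization scheme are illustrative rather than the paper's exact implementation.

```python
# Minimal sketch of query-conditioned Slot Attention (illustrative, not the exact CTRL-O code).
import torch
import torch.nn as nn


class ControlledSlotAttention(nn.Module):
    def __init__(self, dim=256, num_slots=7, num_iters=3):
        super().__init__()
        self.num_iters = num_iters
        self.scale = dim ** -0.5
        self.slot_init = nn.Parameter(torch.randn(1, num_slots, dim))  # learned default slot init
        self.query_proj = nn.Linear(dim, dim)  # maps encoded control queries into slot space
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_inputs = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, feats, control_queries=None):
        # feats: (B, N, dim) visual features, e.g. frozen ViT patch tokens
        # control_queries: (B, Q, dim) embeddings of language queries / referring expressions
        B = feats.size(0)
        slots = self.slot_init.expand(B, -1, -1).clone()
        if control_queries is not None:
            q = control_queries.size(1)
            # condition the first Q slots on the control queries; remaining slots stay free
            conditioned = slots[:, :q] + self.query_proj(control_queries)
            slots = torch.cat([conditioned, slots[:, q:]], dim=1)
        feats = self.norm_inputs(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.num_iters):
            q_slots = self.to_q(self.norm_slots(slots))                               # (B, S, dim)
            attn = torch.softmax(q_slots @ k.transpose(1, 2) * self.scale, dim=1)     # slots compete
            attn = attn / attn.sum(dim=-1, keepdim=True)                              # mean over inputs
            updates = attn @ v                                                        # (B, S, dim)
            slots = self.gru(updates.reshape(-1, updates.size(-1)),
                             slots.reshape(-1, slots.size(-1))).reshape(B, -1, updates.size(-1))
        return slots, attn
```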

"

CTRL-O transforms visual features into language-controllable object-centric representations, enabling precise binding between language descriptions and specific objects in complex scenes.
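This binding can be illustrated with a CLIP-style InfoNCE objective that pulls each control query toward the slot it conditioned and pushes it away from the other slots in the batch. The sketch below assumes paired (query, slot) batches; it shows the general pattern, not necessarily the paper's exact loss formulation.

```python
# Illustrative query-slot contrastive loss (InfoNCE-style); the paper's formulation may differ.
import torch
import torch.nn.functional as F


def query_slot_contrastive_loss(queries, slots, temperature=0.1):
    """queries: (B, Q, D) language-query embeddings; slots: (B, Q, D) slots conditioned
    on those queries. Matching (query, slot) pairs are positives; all other slots in the
    batch act as negatives."""
    B, Q, D = queries.shape
    q = F.normalize(queries.reshape(B * Q, D), dim=-1)
    s = F.normalize(slots.reshape(B * Q, D), dim=-1)
    logits = q @ s.t() / temperature                    # (B*Q, B*Q) similarity matrix
    targets = torch.arange(B * Q, device=q.device)      # i-th query matches i-th slot
    # symmetric cross-entropy, as in CLIP-style contrastive objectives
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```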

Evaluation Model Performance Results

Our evaluation shows that CTRL-O significantly outperforms existing approaches and benefits from key architectural components.

Component Ablation Analysis

Binding Hits | FG-ARI | mBO
71.2 | 69.8 | 35.4
8.1 | 34.52 | 22.42
10.11 | 43.83 | 25.76
56.3 | 44.8 | 27.3
61.3 | 47.5 | 27.2

CTRL-O model component ablation for grounding, varying Slot Init., GT Masks, CL, and DC. CL = Contrastive Loss, DC = Decoder Conditioning.
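For reference, the two segmentation metrics reported here follow their standard definitions: FG-ARI is the adjusted Rand index between predicted slot assignments and ground-truth instances, restricted to foreground pixels, and mBO averages the best IoU achieved for each ground-truth mask. The snippet below sketches these standard definitions (it is not code from the CTRL-O repository).

```python
# Sketch of the standard FG-ARI and mBO metrics (common definitions, not CTRL-O repo code).
import numpy as np
from sklearn.metrics import adjusted_rand_score


def fg_ari(gt_instance_map, pred_slot_map, background_id=0):
    """Adjusted Rand Index over foreground pixels only.
    gt_instance_map, pred_slot_map: (H, W) integer maps of instance / slot ids."""
    fg = gt_instance_map != background_id
    return adjusted_rand_score(gt_instance_map[fg].ravel(), pred_slot_map[fg].ravel())


def mbo(gt_masks, pred_masks):
    """Mean best overlap: for every ground-truth mask, take the highest IoU over all
    predicted masks, then average. gt_masks: (G, H, W) bool, pred_masks: (S, H, W) bool."""
    best_ious = []
    for g in gt_masks:
        best = 0.0
        for p in pred_masks:
            inter = np.logical_and(g, p).sum()
            union = np.logical_or(g, p).sum()
            best = max(best, inter / union if union > 0 else 0.0)
        best_ious.append(best)
    return float(np.mean(best_ious))
```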

Object Discovery Performance

Approach | Model | FG-ARI | mBO
Unsupervised | DINOSAUR (MLP Dec.) | 40.5 | 27.7
Unsupervised | DINOSAUR (TF. Dec.) | 34.1 | 31.6
Unsupervised | Stable-LSD | 35.0 | 30.4
Unsupervised | SlotDiffusion | 37.3 | 31.4
Weak Supervision | Stable-LSD (Bbox Supervision) | - | 30.3
Weak Supervision | CTRL-O (Trained on COCO) | 47.5 | 27.2

Comparison of CTRL-O with unsupervised and weakly-supervised object-centric approaches on the COCO dataset.

Results Visual Grounding with Language Queries

Our approach enables precise binding between language descriptions and visual objects in complex scenes without requiring mask supervision.

Visual Grounding Examples

CTRL-O visual grounding example
CTRL-O binds language queries to specific objects in complex scenes and infers object-centric representations of the corresponding objects.

Referring Expression Comprehension Performance

Method | Sup. | Pretraining dataset | Fine-tuning | RefCOCO (val / testA / testB) | RefCOCO+ (val / testA / testB) | Gref (val)
GroupViT | T | CC12M+YFCC | | 7.99 / 6.16 / 10.51 | 8.49 / 6.79 / 10.59 | 10.68
GroupViT | T | CC12M+YFCC | | 10.82 / 11.11 / 11.29 | 11.14 / 10.78 / 11.84 | 12.77
MaskCLIP | T | WIT | | 11.52 / 11.85 / 12.06 | 11.87 / 12.01 / 12.57 | 12.74
MaskCLIP | T | WIT | | 19.45 / 18.69 / 21.37 | 19.97 / 18.93 / 21.48 | 21.11
Shatter & Gather | T | VG | | 21.80 / 19.00 / 24.96 | 22.20 / 19.86 / 24.85 | 25.89
CTRL-O | T | VG | | 21.80 / 20.10 / 21.57 | 21.90 / 21.54 / 21.36 | 25.32
CTRL-O | T+P | VG | | 28.2 / 33.13 / 27.05 | 25.87 / 30.58 / 22.58 | 30.50

Performance comparison on referring expression comprehension. Sup: T = Text supervision, T+P = Text + Phrase supervision.

CTRL-O with combined text and phrase supervision significantly outperforms existing approaches, achieving the best results on most benchmarks.
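One plausible way such scores are obtained with a slot-based model is to treat the attention mask of the query-conditioned slot as the predicted region for the referring expression and score it by IoU against the ground-truth mask. The hypothetical evaluation loop below assumes exactly that protocol; the paper's actual evaluation setup may differ.

```python
# Hypothetical referring-expression evaluation for slot-based models: the region predicted
# for an expression is the attention mask of the query-conditioned slot, scored by IoU
# against the ground-truth mask. A sketch of one plausible protocol, not the paper's code.
import numpy as np


def referring_iou(slot_attention, gt_mask, threshold=0.5):
    """slot_attention: (H, W) soft attention of the query-conditioned slot in [0, 1];
    gt_mask: (H, W) bool mask of the referred object."""
    pred = slot_attention >= threshold
    inter = np.logical_and(pred, gt_mask).sum()
    union = np.logical_or(pred, gt_mask).sum()
    return inter / union if union > 0 else 0.0


def mean_iou(samples):
    """samples: iterable of (slot_attention, gt_mask) pairs for a split such as RefCOCO val."""
    return float(np.mean([referring_iou(a, m) for a, m in samples]))
```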

Applications Downstream Applications

CTRL-O enables powerful downstream applications by providing controllable object-centric representations.

Overview of CTRL-O Downstream Tasks

(a) T2I Generation: instance-specific text-to-image generation with context preservation
(b) Visual QA: language-object binding for precise query answering

Instance-Controllable T2I

CTRL-O instance-controllable text-to-image generation
Instance-specific control over text-to-image generation while maintaining scene context. CTRL-O enables targeted modification of specific objects while preserving the surrounding scene details.
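A common way to plug object-centric slots into a generator, used by models such as SlotDiffusion and Stable-LSD, is to let image tokens cross-attend to the slots in place of text tokens. The sketch below illustrates this general pattern; the module name, dimensions, and wiring are illustrative and are not CTRL-O's actual generation pipeline.

```python
# Illustrative cross-attention block that conditions a generator on slots instead of text
# tokens (in the spirit of SlotDiffusion / Stable-LSD); not the paper's implementation.
import torch
import torch.nn as nn


class SlotCrossAttention(nn.Module):
    def __init__(self, dim=320, slot_dim=256, num_heads=8):
        super().__init__()
        self.proj = nn.Linear(slot_dim, dim)                 # map slots into the generator's width
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, slots):
        # x: (B, L, dim) latent image tokens inside the denoiser / decoder
        # slots: (B, S, slot_dim) slots, e.g. one per referred object plus context slots
        ctx = self.proj(slots)
        out, _ = self.attn(self.norm(x), ctx, ctx)           # image tokens attend to object slots
        return x + out                                       # residual conditioning
```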

Visual Question Answering

CTRL-O achieves the highest accuracy on VQAv2 in both No-Coupling and Coupling settings, outperforming baseline models.

Citation BibTeX

@inproceedings{didolkar2025ctrlo,
    title={CTRL-O: Language-Controllable Object-Centric Visual Representation Learning},
    author={Didolkar, Aniket Rajiv and Zadaianchuk, Andrii and Awal, Rabiul and Seitzer, Maximilian and Gavves, Efstratios and Agrawal, Aishwarya},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2025}
}