News

Latest publications in ICCV 2025
yesterday

A paper from our group got accepted to ICCV 2025 as an ORAL! A second paper was accepted to the ICCV 2025 Workshops.


[ORAL] Diffusion Image Prior

Hamadi Chihaoui, Paolo Favaro, in the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025.

Zero-shot image restoration (IR) methods based on pretrained diffusion models have recently achieved significant success. These methods typically require at least a parametric form of the degradation model. However, in real-world scenarios, the degradation may be too complex to define explicitly without relying on crude approximations. To handle this general case, we introduce the DIffusion Image Prior (DIIP). We take inspiration from the Deep Image Prior (DIP), since it can be used to remove artifacts without the need for an explicit degradation model. However, in contrast to DIP, we find that pretrained diffusion models offer a much stronger prior, despite being trained without any knowledge of corrupted data. We show that the optimization process in DIIP first reconstructs a clean version of the image before eventually overfitting to the degraded input, but it does so for a broader range of degradations than DIP. In light of this result, we propose a blind IR method based on early stopping, which does not require prior knowledge of the degradation model. We validate DIIP on various degradation-blind IR tasks, including JPEG artifact removal, waterdrop removal, denoising, and super-resolution, achieving state-of-the-art results.

Paper: https://openaccess.thecvf.com/content/ICCV2025/html/Chihaoui_Diffusion_Image_Prior_ICCV_2025_paper.html
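
For a concrete picture of the early-stopping idea, here is a minimal, hypothetical sketch in PyTorch. It is not the authors' implementation: `sampler` stands in for a frozen pretrained diffusion model wrapped as a differentiable map from a latent to an image, and the snapshot-based selection is only a placeholder for the stopping criterion studied in the paper.

```python
# Hypothetical sketch of DIIP-style blind restoration with early stopping
# (illustration only, not the authors' code). `sampler` is assumed to be a
# frozen pretrained diffusion model wrapped as a differentiable map z -> image.
import torch
import torch.nn.functional as F

def fit_diffusion_prior(degraded: torch.Tensor, sampler, steps: int = 300,
                        lr: float = 1e-2, snapshot_every: int = 10):
    """Fit the frozen prior to the degraded image and keep intermediate iterates.

    The optimization first produces a clean-looking image and only later
    overfits to the degradation, so an early snapshot serves as the restored
    output; the paper's actual stopping rule is not reproduced here.
    """
    z = torch.randn_like(degraded, requires_grad=True)  # latent to optimize
    opt = torch.optim.Adam([z], lr=lr)
    snapshots = []
    for t in range(steps):
        opt.zero_grad()
        x = sampler(z)                  # image proposed by the frozen prior
        loss = F.mse_loss(x, degraded)  # note: no degradation model appears here
        loss.backward()
        opt.step()
        if t % snapshot_every == 0:
            snapshots.append((t, x.detach().clone()))
    return snapshots  # pick a pre-overfitting snapshot as the blind restoration
```

The point of the sketch is that no degradation operator appears in the loss; the quality of the blind restoration depends entirely on where the optimization is stopped.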

 


MIRAGE: Unsupervised Single Image to Novel View Generation with Cross Attention Guidance

Llukman Cerkezi, Aram Davtyan, Sepehr Sameni, Paolo Favaro, in the Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025.

This paper introduces a novel pipeline to generate novel views of an object from a single image. Our method, MIRAGE, trains a pose-conditioned diffusion model on a dataset of real images of multiple unknown categories, in a completely unsupervised manner. The conditioning is obtained via clustering pre-trained self-supervised features to identify approximate object categories and poses. At inference time, we introduce hard-attention guidance and apply cross-view attention to align the appearance of the objects in the generated views with that in the input image. Through our experiments, we show that MIRAGE generates novel views that are on par with or better than those of supervised methods in terms of image realism and 3D consistency. Furthermore, MIRAGE is robust to diverse textures and geometries, is not restricted to simple rigid rotations, and is capable of generating plausible deformations of nonrigid objects, such as animals.

Paper: https://openaccess.thecvf.com/content/ICCV2025W/3D-VAST/html/Cerkezi_MIRAGE_Unsupervised_Single_Image_to_Novel_View_Generation_with_Cross_ICCVW_2025_paper.html

Code: https://github.com/llukmancerkezi/mirage
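
As an illustration of how conditioning signals can be mined without labels, here is a hypothetical sketch of the clustering step described in the abstract. The feature extractor, array shapes, and two-level clustering (categories first, then poses within each category) are assumptions made for illustration, not MIRAGE's exact pipeline.

```python
# Hypothetical sketch of mining pseudo category/pose labels by clustering
# self-supervised features (illustration only, not MIRAGE's exact pipeline).
import numpy as np
from sklearn.cluster import KMeans

def pseudo_labels(features: np.ndarray, n_categories: int = 20, n_poses: int = 8):
    """features: (num_images, feature_dim) pooled self-supervised descriptors
    (e.g. from a DINO-like backbone). Returns pseudo category and pose indices."""
    cats = KMeans(n_clusters=n_categories, n_init=10).fit_predict(features)
    poses = np.zeros(len(features), dtype=int)
    for c in range(n_categories):
        idx = np.where(cats == c)[0]
        k = min(n_poses, len(idx))
        if k > 1:  # cluster poses *within* each pseudo category
            poses[idx] = KMeans(n_clusters=k, n_init=10).fit_predict(features[idx])
    return cats, poses  # used to condition the pose-conditioned diffusion model
```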

Latest publications in NeurIPS 2025
yesterday

A paper from our group got accepted to NeurIPS 2025!


KOALA++: Efficient Kalman-Based Optimization with Gradient-Covariance Products

Zixuan Xia, Aram Davtyan, Paolo Favaro, in Advances in Neural Information Processing Systems (NeurIPS), 2025.

We propose KOALA++, a scalable Kalman-based optimization algorithm that explicitly models structured gradient uncertainty in neural network training. Unlike second-order methods, which rely on expensive second-order gradient computations, our method directly estimates the parameter covariance matrix by recursively updating compact gradient-covariance products. This design improves upon the original KOALA framework, which assumed a diagonal covariance, by implicitly capturing richer uncertainty structure without storing the full covariance matrix or performing large matrix inversions. Across diverse tasks, including image classification and language modeling, KOALA++ achieves accuracy on par with or better than state-of-the-art first- and second-order optimizers while maintaining the efficiency of first-order methods.

Paper: https://arxiv.org/abs/2506.04432

Code: https://github.com/Sumxiaa/KOALA_Plus_Plus
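
To give an intuition for the Kalman view of optimization, here is a toy, scalar-uncertainty sketch. The update rules below are simplified placeholders and not the KOALA++ equations; see the paper and code above for the actual algorithm.

```python
# Toy illustration of a Kalman-style optimizer step (scalar uncertainty).
# These are simplified placeholder update rules, NOT the KOALA++ equations.
import numpy as np

def kalman_style_step(w, grad, P, q=1e-4, r=1e-2, lr=1.0):
    """w: parameters, grad: gradient, P: scalar estimate of state uncertainty."""
    P = P + q                 # predict: uncertainty grows between steps
    S = P + r                 # innovation covariance (r = observation noise)
    K = P / S                 # Kalman gain in (0, 1)
    w = w - lr * K * grad     # higher uncertainty -> larger effective step
    P = (1.0 - K) * P         # update: uncertainty shrinks after the correction
    return w, P

# Toy usage on the quadratic loss 0.5 * w**2, whose gradient is w itself.
w, P = np.array([5.0]), 1.0
for _ in range(100):
    w, P = kalman_style_step(w, grad=w, P=P)
```

Per the abstract, KOALA++ replaces a simple uncertainty estimate like the scalar `P` above with compact gradient-covariance products that are updated recursively, which is what avoids storing or inverting a full covariance matrix.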

Latest publications in CVPR 2025
March 6, 2025

A paper from our group got accepted to CVPR 2025!

 


GEM: A Generalizable Ego-Vision Multimodal World Model for Fine-Grained Ego-Motion, Object Dynamics, and Scene Composition Control

Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Pedro M B Rezende, Yasaman Haghighi, David Brüggemann, Isinsu Katircioglu, Lin Zhang, Xiaoran Chen, Suman Saha, Marco Cannici, Elie Aljalbout, Botao Ye, Xi Wang, Aram Davtyan, Mathieu Salzmann, Davide Scaramuzza, Marc Pollefeys, Paolo Favaro, Alexandre Alahi, in the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

We present GEM, a Generalizable Ego-vision Multimodal world model that predicts future frames using a reference frame, sparse features, human poses, and ego-trajectories. Hence, our model has precise control over object dynamics, ego-agent motion, and human poses. GEM generates paired RGB and depth outputs for richer spatial understanding. We introduce autoregressive noise schedules to enable stable long-horizon generations. Our dataset comprises 4000+ hours of multimodal data across domains such as autonomous driving, egocentric human activities, and drone flights. Pseudo-labels are used to obtain depth maps, ego-trajectories, and human poses. We use a comprehensive evaluation framework, including a new Control of Object Manipulation (COM) metric, to assess controllability. Experiments show that GEM excels at generating diverse, controllable scenarios and at maintaining temporal consistency over long generations. Code, models, and datasets are fully open-sourced.

Paper: https://arxiv.org/abs/2412.11198
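
As a rough illustration of what a per-frame noise schedule for long-horizon video generation can look like, here is a small hypothetical sketch in which frames further in the future keep higher noise levels for longer. This is a generic construction for illustration only; the specific autoregressive schedule used in GEM is described in the paper.

```python
# Hypothetical per-frame noise schedule for long-horizon video generation:
# later frames keep higher noise levels for longer, so earlier frames settle
# first and condition the rest. Illustration only, not GEM's actual schedule.
import torch

def per_frame_noise_levels(num_frames: int, t: float) -> torch.Tensor:
    """t sweeps from 1.0 (start of sampling) down to -1.0 (end). Returns one
    noise level per frame in [0, 1], increasing with the frame index."""
    offsets = torch.linspace(0.0, 1.0, num_frames)  # later frames lag behind
    return (t + offsets).clamp(0.0, 1.0)
```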

Latest publications in ICLR 2025
Feb. 2, 2025

A paper from our group got accepted to ICLR 2025!

 


Faster Inference of Flow-Based Generative Models via Improved Data-Noise Coupling

Aram Davtyan, Leello Tadesse Dadi, Volkan Cevher, Paolo Favaro, in the International Conference on Learning Representations (ICLR), 2025.

Conditional Flow Matching (CFM), a simulation-free method for training continuous normalizing flows, provides an efficient alternative to diffusion models for key tasks like image and video generation. The performance of CFM in solving these tasks depends on the way data is coupled with noise. A recent approach uses minibatch optimal transport (OT) to reassign noise-data pairs in each training step to streamline sampling trajectories and thus accelerate inference. However, its optimization is restricted to individual minibatches, limiting its effectiveness on large datasets. To address this shortcoming, we introduce LOOM-CFM (Looking Out Of Minibatch-CFM), a novel method to extend the scope of minibatch OT by preserving and optimizing these assignments across minibatches over training time. Our approach demonstrates consistent improvements in the sampling speed-quality trade-off across multiple datasets. LOOM-CFM also enhances distillation initialization and supports high-resolution synthesis in latent space training.

Paper: https://openreview.net/forum?id=rsGPrJDIhh
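
For context, here is a minimal sketch of the per-minibatch optimal-transport coupling that LOOM-CFM builds on: noise samples are re-paired with data samples inside a batch to shorten the training trajectories. LOOM-CFM's contribution, preserving and optimizing these assignments across minibatches, is not reproduced here; the function names and shapes are illustrative assumptions.

```python
# Sketch of the per-minibatch OT coupling that LOOM-CFM extends (illustrative
# names and shapes). LOOM-CFM's cross-minibatch assignment is not shown here.
import numpy as np
from scipy.optimize import linear_sum_assignment

def ot_coupled_pairs(data: np.ndarray, noise: np.ndarray):
    """data, noise: (batch, dim). Re-pairs noise with data by solving a
    discrete optimal-transport (assignment) problem on squared distances."""
    cost = ((data[:, None, :] - noise[None, :, :]) ** 2).sum(-1)  # pairwise costs
    rows, cols = linear_sum_assignment(cost)                      # Hungarian matching
    return data[rows], noise[cols]

# For a matched pair (x1, x0) and time t, the CFM regression target is the
# straight-line velocity x1 - x0 evaluated at x_t = (1 - t) * x0 + t * x1.
```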

Latest publications in AAAI 2025
Dec. 11, 2024

A paper from our group got accepted to AAAI 2025!

 


CAGE: Unsupervised Visual Composition and Animation for Controllable Video Generation

Aram Davtyan, Sepehr Sameni, Björn Ommer, Paolo Favaro, in the AAAI Conference on Artificial Intelligence, 2025.

The field of video generation has expanded significantly in recent years, with controllable and compositional video generation garnering considerable interest. Traditionally, achieving this has relied on leveraging annotations such as text, object bounding boxes, and motion cues, which require substantial human effort and thus limit scalability. We instead address the challenge of controllable and compositional video generation without any annotations by introducing a novel unsupervised approach. Once trained from scratch on a dataset of unannotated videos, our model can effectively compose scenes by assembling predefined object parts and animating them in a plausible and controlled manner. The core innovation of our method lies in its training process, where video generation is conditioned on a randomly selected subset of pre-trained self-supervised local features. This conditioning compels the model to learn how to inpaint the missing information in the video both spatially and temporally, and thereby to capture the inherent compositionality and dynamics of the scene. The abstraction level of the conditioning and its imposed invariance to minor visual perturbations enable control over object motion by simply moving the features to the desired future locations. We call our model CAGE, which stands for visual Composition and Animation for video GEneration. We conduct extensive experiments to validate the effectiveness of CAGE across various scenarios, demonstrating its capability to accurately follow the control signal and to generate high-quality videos that exhibit coherent scene composition and realistic animation.

Project website: https://araachie.github.io/cage

Paper: https://arxiv.org/abs/2403.14368
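
To make the conditioning idea more tangible, here is a hypothetical sketch of how a random sparse subset of self-supervised patch features could be kept as the conditioning signal. Tensor shapes, the keep probability, and the masking scheme are assumptions for illustration, not CAGE's actual code.

```python
# Hypothetical sketch of sparse-feature conditioning: keep a random subset of
# pre-trained self-supervised patch features and drop the rest, so the model
# must inpaint the missing content. Shapes and masking are assumptions only.
import torch

def sparse_feature_conditioning(feats: torch.Tensor, keep_prob: float = 0.1):
    """feats: (T, H, W, C) per-frame patch features (e.g. from a DINO-like
    encoder). Returns masked features plus the keep-mask used as conditioning."""
    mask = (torch.rand(feats.shape[:3]) < keep_prob).float()  # (T, H, W)
    return feats * mask.unsqueeze(-1), mask

# At inference time, control amounts to editing where the kept features are
# placed (e.g. moving them to desired future locations) before generation.
```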