News

Latest publication in AAAI 2024

Feb. 20, 2024

A paper from our group got accepted to AAAI 2024!

Learn the Force We Can: Enabling Sparse Motion Control in Multi-Object Video Generation

Aram Davtyan and Paolo Favaro, in AAAI Conference on Artificial Intelligence, 2024

We propose a novel unsupervised method to autoregressively generate videos from a single frame and a sparse motion input. Our trained model can generate unseen realistic object-to-object interactions. Although our model has never been given the explicit segmentation and motion of each object in the scene during training, it is able to implicitly separate their dynamics and extents. Key components of our method are the randomized conditioning scheme, the encoding of the input motion control, and the randomized and sparse sampling to enable generalization to out-of-distribution but realistic correlations. Our model, which we call YODA, therefore has the ability to move objects without physically touching them. Through extensive qualitative and quantitative evaluations on several datasets, we show that YODA is on par with or better than state-of-the-art prior work on video generation in terms of both controllability and video quality.

Project website: https://araachie.github.io/yoda

Paper: https://arxiv.org/abs/2306.03988

Latest publications in ICCV 2023

Aug. 21, 2023

Two papers from our group got accepted to ICCV 2023!

Efficient Video Prediction via Sparsely Conditioned Flow Matching

Aram Davtyan*, Sepehr Sameni* and Paolo Favaro, in IEEE International Conference on Computer Vision (ICCV 2023)

We introduce a novel generative model for video prediction based on latent flow matching, an efficient alternative to diffusion-based models. In contrast to prior work, we keep the high costs of modeling the past during training and inference at bay by conditioning only on a small random set of past frames at each integration step of the image generation process. Moreover, to enable the generation of high-resolution videos and to speed up the training, we work in the latent space of a pretrained VQGAN. Finally, we propose to approximate the initial condition of the flow ODE with the previous noisy frame. This allows us to reduce the number of integration steps and hence speed up the sampling at inference time. We call our model Random frame conditioned flow Integration for VidEo pRediction, or, in short, RIVER. We show that RIVER achieves performance superior to or on par with prior work on common video prediction benchmarks, while requiring an order of magnitude fewer computational resources.

Project website: https://araachie.github.io/river

Paper: https://arxiv.org/abs/2211.14575
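
The sampling idea in the RIVER abstract above (integrate a flow ODE while conditioning on one randomly chosen past frame per integration step) can be illustrated with a toy sketch. This is not the RIVER implementation: `toy_velocity` is a hypothetical stand-in for the learned vector field, frames are plain lists of floats rather than VQGAN latents, and the integration starts from pure noise rather than from the noisy previous frame.

```python
import random

def toy_velocity(x, t, reference, context):
    """Hypothetical stand-in for a learned vector field.

    It points from x towards the mean of the reference frame and the
    randomly chosen context frame; a real model would be a neural network.
    """
    target = [0.5 * (r + c) for r, c in zip(reference, context)]
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]

def sample_next_frame(past_frames, num_steps=8, rng=None):
    """Euler-integrate the toy flow ODE from t=0 to t=1.

    At every step a single past frame is drawn at random as context,
    so the cost per step does not grow with the length of the history.
    """
    if rng is None:
        rng = random.Random(0)
    reference = past_frames[-1]                    # most recent frame
    x = [rng.gauss(0.0, 1.0) for _ in reference]   # start from noise
    dt = 1.0 / num_steps
    used = []                                      # context indices, for inspection
    for k in range(num_steps):
        t = k * dt                                 # t < 1, so 1 - t is never zero
        idx = rng.randrange(len(past_frames))      # sparse random conditioning
        used.append(idx)
        v = toy_velocity(x, t, reference, past_frames[idx])
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, used

frames = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
next_frame, used = sample_next_frame(frames, num_steps=8, rng=random.Random(1))
```

With this particular toy field, the last Euler step lands on the target built from the last sampled context frame, which makes the loop easy to check by hand. A learned field would instead average over the random contexts it saw during training.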

Spatio-Temporal Crop Aggregation for Video Representation Learning

Sepehr Sameni, Simon Jenni and Paolo Favaro, in IEEE International Conference on Computer Vision (ICCV 2023)

We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time. Our model builds long-range video features by learning from sets of video clip-level features extracted with a pretrained backbone. To train the model, we propose a self-supervised objective consisting of masked clip feature predictions. We apply sparsity to both the input, by extracting a random set of video clips, and to the loss function, by only reconstructing the sparse inputs. Moreover, we use dimensionality reduction by working in the latent space of a pre-trained backbone applied to single video clips. These techniques make our method not only extremely efficient to train but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, nonlinear, and k-NN probing on common action classification and video understanding datasets.

Paper: https://arxiv.org/abs/2211.17042

Latest publication in NeurIPS 2022

Sept. 14, 2022

Our paper on unsupervised object segmentation got accepted to NeurIPS 2022!

MOVE: Unsupervised Movable Object Segmentation and Detection

Adam Bielski and Paolo Favaro, in 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

We introduce MOVE, a novel method to segment objects without any form of supervision. MOVE exploits the fact that foreground objects can be shifted locally relative to their initial position and still result in realistic (undistorted) new images. This property allows us to train a segmentation model on a dataset of images without annotation and to achieve state-of-the-art (SotA) performance on several evaluation datasets for unsupervised salient object detection and segmentation. In unsupervised single object discovery, MOVE gives an average CorLoc improvement of 7.2% over the SotA, and in unsupervised class-agnostic object detection it gives a relative AP improvement of 53% on average. Our approach is built on top of self-supervised features (e.g. from DINO or MAE), an inpainting network (based on the Masked AutoEncoder) and adversarial training.

Paper: https://arxiv.org/abs/2210.07920
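
The shift-and-composite operation that the MOVE abstract above relies on can be sketched on a tiny grid. This is only an illustration under simplifying assumptions: the mask is binary and given (MOVE predicts it), the inpainted background is given (MOVE uses an inpainting network), and the function name `shift_and_composite` is hypothetical.

```python
def shift_and_composite(image, mask, background, dx, dy):
    """Paste the masked foreground, shifted by (dx, dy), onto the background.

    image, mask and background are equally sized 2D lists. Pixels shifted
    outside the image are dropped.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in background]       # copy so inputs stay untouched
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    out[ny][nx] = image[y][x]
    return out

# A 3x4 "image" with a single foreground pixel (value 9) at row 1, column 1.
image = [[0, 0, 0, 0],
         [0, 9, 0, 0],
         [0, 0, 0, 0]]
mask = [[0, 0, 0, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 0]]
background = [[0] * 4 for _ in range(3)]       # assumed inpainting result
shifted = shift_and_composite(image, mask, background, dx=1, dy=0)
```

When the mask matches the object, the composite is a plausible image with the object moved by one pixel. A mask that cuts through the object or includes background would leave visible artifacts, which is the signal adversarial training can exploit.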

Latest publication in ECCV 2022

July 5, 2022

Our paper on controllable video generation through global and local motion analysis got accepted to ECCV 2022!

Controllable Video Generation through Global and Local Motion Dynamics

Aram Davtyan and Paolo Favaro, in European Conference on Computer Vision, 2022.

We present GLASS, a method for Global and Local Action-driven Sequence Synthesis. GLASS is a generative model that is trained on video sequences in an unsupervised manner and that can animate an input image at test time. The method learns to segment frames into foreground-background layers and to generate transitions of the foregrounds over time through a global and local action representation. Global actions are explicitly related to 2D shifts, while local actions are instead related to (both geometric and photometric) local deformations. GLASS uses a recurrent neural network to transition between frames and is trained through a reconstruction loss. We also introduce W-Sprites (Walking Sprites), a novel synthetic dataset with a predefined action space. We evaluate our method on both W-Sprites and real datasets, and find that GLASS is able to generate realistic video sequences from a single input image and to successfully learn a more advanced action space than in prior work.

Paper: https://arxiv.org/pdf/2204.06558.pdf

Latest publication in EMBC 2022

May 27, 2022

Our work tackling the generalization problem of automatic sleep scoring models got accepted to EMBC 2022. Poor generalization is one of the main hurdles limiting the adoption of such models in clinical and research sleep studies.

Towards Sleep Scoring Generalization Through Self-Supervised Meta-Learning

Abdelhak Lemkhenter and Paolo Favaro, in EMBC, 2022.

In this work, we introduce a novel meta-learning method for sleep scoring based on self-supervised learning. Our approach aims at building models for sleep scoring that can generalize across different patients and recording facilities, but do not require a further adaptation step to the target data. Towards this goal, we build our method on top of the Model Agnostic Meta-Learning (MAML) framework. In our analysis, we show that the performance of MAML can be significantly boosted by incorporating a self-supervised learning (SSL) stage. This SSL stage is based on a general-purpose pseudo-task that limits overfitting to the subject-specific patterns present in the training dataset. We show that our proposed method outperforms the baseline methods and state-of-the-art meta-learning methods on the Sleep Cassette, Sleep Telemetry, ISRUC, UCD and CAP datasets.
