Seminars and Talks

Learnings from Building Video Game World Models
by Abdelhak Lemkhenter
Date: Friday, Mar. 21
Time: 15:00
Location: N10_302, Institute of Computer Science

Our guest speaker is Abdelhak Lemkhenter from Microsoft, Cambridge.

You are all cordially invited to the CVG Seminar on March 21st, 2025 at 3:00 pm CET

Abstract

Over the last few years, the research community has continuously pushed the boundaries of
video generative modelling, with many impressive demos of open- and closed-source models.
This has led to increasing interest in the steerability of such models and in their ability to
capture the different dynamics present in the data. In this talk, we will discuss recent advances
in world modelling applied to video games, an interesting setting for training such models. We
will discuss the recently published World and Human Action Model (WHAM) through the lens of
its design, its evaluation, and the key learnings that came from scaling world models to a
modern video game title.

Bio

Abdelhak Lemkhenter is a Researcher at Microsoft Research Cambridge, currently working on
few-shot imitation learning and world modeling in complex modern video games. His research
interests also include robust and scalable representation learning and data-centric
learning. He completed his PhD in Informatics at the University of Bern and obtained his
Master's degree from the École Centrale de Paris.

Synthetic Realities: possibilities, frontiers and societal challenges
by Anderson Rocha
Date: Friday, Jul. 5
Time: 14:30
Location: N10_302, Institute of Computer Science

Our guest speaker is Prof. Anderson Rocha from the University of Campinas (Unicamp), Brazil.

You are all cordially invited to the CVG Seminar on July 5th at 2:30 pm CEST

Abstract

We explore the burgeoning landscape of synthetic realities (AI-enabled synthetic content allied with narratives and contexts), detailing their impact, technological advancements, and ethical quandaries. Synthetic realities provide innovative solutions and opportunities for immersive experiences across various sectors, including education, healthcare, and commerce. However, these advancements also usher in substantial challenges, such as the propagation of misinformation, privacy concerns, and ethical dilemmas. In this talk, we discuss the specifics of synthetic media, including deepfakes and their generation techniques, and the imperative need for robust detection methods to combat the potential misuse of such technologies, as well as concerted efforts on regulation, standardization, and technological literacy. We show the dual-edged nature of synthetic realities and advocate for interdisciplinary research, informed public discourse, and collaborative efforts to harness their benefits while mitigating risks. This talk contributes to the discourse on the responsible development and application of artificial intelligence and synthetic media in modern society.

Bio

Anderson Rocha (IEEE Fellow) is Full Professor of Artificial Intelligence and Digital Forensics at the Institute of Computing, University of Campinas (Unicamp), Brazil. He is the Head of the Artificial Intelligence Lab, Recod.ai, at Unicamp. He is a three-term elected member of the IEEE Information Forensics and Security Technical Committee (IFS-TC) and a former chair of that committee. He is also chair-elect for the 2025-2026 term. He is a Microsoft Research and a Google Research Faculty Fellow, as well as a Tan Chin Tuan (TCT) Fellow. Since 2023, he has also been an Asia Pacific Artificial Intelligence Association Fellow. He is ranked among the top 2% of research scientists worldwide, according to PlosOne/Stanford and Research.com studies. Finally, he is now a LinkedIn Top Voice in Artificial Intelligence for continuously raising awareness of AI and its potential impacts on society at large.

Sparse-view 3D in the Wild
by Jason Y. Zhang
Date: Friday, Apr. 26
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Jason Y. Zhang from Carnegie Mellon University.

You are all cordially invited to the CVG Seminar on April 26th at 4 pm CEST

  • via Zoom (passcode is 003713).

Abstract

Reconstructing 3D scenes and objects from images alone has been a long-standing goal in computer vision. However, typical methods require a large number of images with precisely calibrated camera poses, which is cumbersome for end users. We propose a probabilistic framework that can predict distributions over relative camera rotations. These distributions are then composed into coherent camera poses given sparse image sets. To improve precision, we then propose a diffusion-based model that represents camera poses as a distribution over rays instead of camera extrinsics. We demonstrate that our system is capable of recovering accurate camera poses from a variety of self-captures and is sufficient for high-quality 3D reconstruction.

Bio

Jason Y. Zhang is a final-year PhD student at Carnegie Mellon University, advised by Deva Ramanan and Shubham Tulsiani. Jason completed his undergraduate degree at UC Berkeley, where he worked with Jitendra Malik and Angjoo Kanazawa. He is interested in scaling single-view and multi-view 3D to unconstrained environments. Jason is supported in part by the NSF GRFP.

Understanding and Harnessing Foundation Models
by Narek Tumanyan
Date: Friday, Mar. 22
Time: 14:30
Location: Online Call via Zoom

Our guest speaker is Narek Tumanyan from the Weizmann Institute of Science.

You are all cordially invited to the CVG Seminar on March 22nd at 2:30 pm CET

  • via Zoom (passcode is 696673).

Abstract

The field of computer vision has been undergoing a paradigm shift, moving from task-specific models to "foundation models" - large-scale networks trained on a massive amount of data that can be adapted to a variety of downstream tasks. However, current state-of-the-art foundation models are largely "black boxes". That is, despite being successfully leveraged for downstream tasks, the underlying mechanisms that are responsible for their performance are not well understood. In this talk, we will study the internal representations of two prominent foundation models: DINO-ViT - a self-supervised vision transformer, and Stable Diffusion - a text-to-image generative latent diffusion model. This will enable us to

  1. Unveil novel visual descriptors;
  2. Devise efficient frameworks for semantic image manipulation based on these novel visual descriptors.

We demonstrate how gaining an understanding of internal representations enables a more creative use of foundation models and expands their capabilities to a broader set of tasks.

Bio

I am a PhD student at the Weizmann Institute of Science, Faculty of Mathematics and Computer Science, advised by Tali Dekel. My research focuses on analyzing and understanding the internal representations of large-scale models and leveraging them as priors for downstream tasks in images and videos, such as image manipulation, editing, and point tracking. I completed my Master's degree at the Weizmann Institute in Tali Dekel's lab, where I also started my PhD in March 2023.

Towards Perceptually-Enabled Task Assistants
by Ehsan Elhamifar
Date: Wednesday, Mar. 13
Time: 11:00
Location: N10_302, Institute of Computer Science

Our guest speaker is Prof. Ehsan Elhamifar from the Khoury College of Computer Sciences, Northeastern University.

You are all cordially invited to the CVG Seminar on March 13th at 11:00 am CET

Abstract

Humans perform a wide range of complex activities, such as cooking hour-long recipes, assembling and repairing devices and performing surgeries. Many of these activities are procedural: they consist of sequences of steps that must be followed to achieve the desired goals. Learning complex procedures from videos of humans performing them allows us to design intelligent task assistants, robots and coaching platforms that perform or guide people through tasks. In this talk, we present new neural architectures as well as learning and inference frameworks to understand complex activity videos, addressing the following challenges:

  1. Procedural videos are long and uncurated, containing many task-irrelevant activities, with different videos showing different ways of performing the same task.
  2. Gathering frame-wise video annotations is costly and does not scale to many videos and tasks.
  3. At inference time, we must recognize actions accurately as data arrive in real time, especially when only a few frames are available.

Bio

Ehsan Elhamifar is an Associate Professor in the Khoury College of Computer Sciences, the Director of the Mathematical Data Science (MCADS) Lab, and the Director of the MS in AI program at Northeastern University. He has broad research interests in computer vision, machine learning, and AI. The overarching goal of his research is to develop AI that learns from and makes inferences about data analogously to humans. He is a recipient of the DARPA Young Faculty Award. Prior to Northeastern, he was a postdoctoral scholar in the EECS department at UC Berkeley. He obtained his PhD in ECE at Johns Hopkins University (JHU) and received two Master's degrees, one in EE from Sharif University of Technology in Iran and another in Applied Mathematics and Statistics from JHU.