Seminars and Talks

Synthetic Realities: possibilities, frontiers and societal challenges
by Anderson Rocha
Date: Friday, Jul. 5
Time: 14:30
Location: N10_302, Institute of Computer Science

Our guest speaker is Prof. Anderson Rocha from the University of Campinas (Unicamp), Brazil.

You are all cordially invited to the CVG Seminar on July 5th at 2:30 pm CEST.


We explore the burgeoning landscape of synthetic realities (AI-enabled synthetic contents allied with narratives and contexts), detailing their impact, technological advancements, and ethical quandaries. Synthetic realities provide innovative solutions and opportunities for immersive experiences across various sectors, including education, healthcare, and commerce. However, these advancements also usher in substantial challenges, such as the propagation of misinformation, privacy concerns, and ethical dilemmas. In this talk, we discuss the specifics of synthetic media, including deepfakes and their generation techniques, and the imperative need for robust detection methods to combat the potential misuse of such technologies, as well as concerted efforts on regulation, standardization and technological literacy. We show the dual-edged nature of synthetic realities and advocate for interdisciplinary research, informed public discourse, and collaborative efforts to harness their benefits while mitigating risks. This talk contributes to the discourse on the responsible development and application of artificial intelligence and synthetic media in modern society.


Anderson Rocha (IEEE Fellow) is Full Professor of Artificial Intelligence and Digital Forensics at the Institute of Computing, University of Campinas (Unicamp), Brazil. He is the Head of the Artificial Intelligence Lab at Unicamp. He is a three-term elected member of the IEEE Information Forensics and Security Technical Committee (IFS-TC) and a former chair of that committee. He is also chair-elect for the 2025-2026 term. He is a Microsoft Research and a Google Research Faculty Fellow as well as a Tan Chin Tuan (TCT) Fellow. Since 2023, he has also been an Asia Pacific Artificial Intelligence Association Fellow. He is ranked among the Top 2% of research scientists worldwide, according to PlosOne/Stanford studies. Finally, he is now a LinkedIn Top Voice in Artificial Intelligence for continuously raising awareness of AI and its potential impacts on society at large.

Sparse-view 3D in the Wild
by Jason Y. Zhang
Date: Friday, Apr. 26
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Jason Y. Zhang from Carnegie Mellon University.

You are all cordially invited to the CVG Seminar on April 26th at 4 pm CEST

  • via Zoom (passcode is 003713).


Reconstructing 3D scenes and objects from images alone has been a long-standing goal in computer vision. However, typical methods require a large number of images with precisely calibrated camera poses, which is cumbersome for end users. We propose a probabilistic framework that can predict distributions over relative camera rotations. These distributions are then composed into coherent camera poses given sparse image sets. To improve precision, we then propose a diffusion-based model that represents camera poses as a distribution over rays instead of camera extrinsics. We demonstrate that our system is capable of recovering accurate camera poses from a variety of self-captures and is sufficient for high-quality 3D reconstruction.


Jason Y. Zhang is a final-year PhD student at Carnegie Mellon University, advised by Deva Ramanan and Shubham Tulsiani. Jason completed his undergraduate degree at UC Berkeley, where he worked with Jitendra Malik and Angjoo Kanazawa. He is interested in scaling single-view and multi-view 3D to unconstrained environments. Jason is supported in part by the NSF GRFP.

Understanding and Harnessing Foundation Models
by Narek Tumanyan
Date: Friday, Mar. 22
Time: 14:30
Location: Online Call via Zoom

Our guest speaker is Narek Tumanyan from the Weizmann Institute of Science.

You are all cordially invited to the CVG Seminar on March 22nd at 2:30 pm CET

  • via Zoom (passcode is 696673).


The field of computer vision has been undergoing a paradigm shift, moving from task-specific models to "foundation models" - large-scale networks trained on a massive amount of data that can be adapted to a variety of downstream tasks. However, current state-of-the-art foundation models are largely "black boxes". That is, despite being successfully leveraged for downstream tasks, the underlying mechanisms responsible for their performance are not well understood. In this talk, we will study the internal representations of two prominent foundation models: DINO-ViT - a self-supervised vision transformer, and StableDiffusion - a text-to-image generative latent diffusion model. This will enable us to:

  1. Unveil novel visual descriptors;
  2. Devise efficient frameworks of semantic image manipulation based on the novel visual descriptors.

We demonstrate how gaining an understanding of internal representations enables more creative use of foundation models and expands their capabilities to a broader set of tasks.


I am a PhD student at the Weizmann Institute of Science, Faculty of Mathematics and Computer Science, advised by Tali Dekel. My research focuses on analyzing and understanding the internal representations of large-scale models and leveraging them as priors for downstream tasks in images and videos, such as image manipulation, editing, and point tracking. I completed my Master’s degree at the Weizmann Institute in Tali Dekel's lab, where I also started my PhD in March 2023.

Towards Perceptually-Enabled Task Assistants
by Ehsan Elhamifar
Date: Wednesday, Mar. 13
Time: 11:00
Location: N10_302, Institute of Computer Science

Our guest speaker is Prof. Ehsan Elhamifar from the Khoury College of Computer Sciences, Northeastern University.

You are all cordially invited to the CVG Seminar on March 13th at 11:00 am CET.


Humans perform a wide range of complex activities, such as cooking hour-long recipes, assembling and repairing devices, and performing surgeries. Many of these activities are procedural: they consist of sequences of steps that must be followed to achieve the desired goals. Learning complex procedures from videos of humans performing them allows us to design intelligent task assistants, robots, and coaching platforms that perform tasks or guide people through them. In this talk, we present new neural architectures as well as learning and inference frameworks to understand complex activity videos, addressing the following challenges:

  1. Procedural videos are long, uncurated and contain many task-irrelevant activities, with different videos showing different ways of performing the same task.
  2. Gathering framewise video annotation is costly and not scalable to many videos and tasks.
  3. At inference time, we must accurately recognize actions as data arrive in real time, especially with only a few frames.


Ehsan Elhamifar is an Associate Professor in the Khoury College of Computer Sciences, the Director of the Mathematical Data Science (MCADS) Lab and the Director of MS in AI at Northeastern University. He has broad research interests in computer vision, machine learning and AI. The overarching goal of his research is to develop AI that learns from data and makes inferences about it in a manner analogous to humans. He is a recipient of the DARPA Young Faculty Award. Prior to Northeastern, he was a postdoctoral scholar in the EECS department at UC Berkeley. He obtained his PhD in ECE at the Johns Hopkins University (JHU) and received two Master's degrees, one in EE from Sharif University of Technology in Iran and another in Applied Mathematics and Statistics from JHU.

Dense FixMatch: a Simple Semi-supervised Learning Method for Pixel-wise Prediction Tasks
by Atsuto Maki
Date: Friday, Feb. 2
Time: 14:30
Location: N10_302, Institute of Computer Science

Our guest speaker is Prof. Atsuto Maki from the KTH Royal Institute of Technology.

You are all cordially invited to the CVG Seminar on February 2nd at 2:30 pm CET.


We discuss Dense FixMatch, a simple method for online semi-supervised learning of dense and structured prediction tasks that combines pseudo-labeling and consistency regularization via strong data augmentation. It extends FixMatch beyond image classification by adding a matching operation on the pseudo-labels. This allows us to still use the full strength of data augmentation pipelines, including geometric transformations. We evaluated it on semi-supervised semantic segmentation on Cityscapes and Pascal VOC with different percentages of labeled data, and ablated design choices and hyper-parameters. Dense FixMatch significantly improves results compared to supervised learning using only labeled data, approaching the performance of supervised learning while using only 1/4 of the labeled samples.

[1] Dense FixMatch: a simple semi-supervised learning method for pixel-wise prediction tasks

[2] An analysis of over-sampling labeled data in semi-supervised learning with FixMatch


Atsuto Maki is a Professor of Computer Science at KTH Royal Institute of Technology, Sweden. He obtained his BEng and MEng degrees in electrical engineering from Kyoto University and the University of Tokyo, respectively, and his PhD degree in computer science from KTH. Previously he was an associate professor at the Graduate School of Informatics, Kyoto University, and then a senior researcher at Toshiba’s Cambridge Research Lab in the UK. His research interests cover a broad range of topics in machine learning, deep learning, and computer vision, including motion and object recognition, clustering, subspace analysis, stereopsis, and representation learning. He has served as a program committee member at major computer vision conferences, e.g. as an area chair of ICCV and ECCV.