Seminars and Talks

3D-Awareness and Frequency Bias of Generative Models
by Katja Schwarz
Date: Friday, Sep. 2
Time: 14:30
Location: N10_302, Institute of Computer Science

Our guest speaker is Katja Schwarz from the University of Tuebingen.

You are all cordially invited to the CVG Seminar on September 2nd at 2:30 p.m. CEST.


What can we learn from 2D images? While 2D generative adversarial networks have enabled high-resolution image synthesis, they largely lack an understanding of the 3D world and the image formation process. Recently, 3D-aware GANs have enabled explicit control over the camera pose and the generated content while training on 2D images only. However, state-of-the-art 3D-aware generative models rely on coordinate-based MLPs, which need to be queried for each sample along a camera ray, making volume rendering slow. Motivated by recent results in voxel-based novel view synthesis, I will introduce a sparse voxel grid representation for fast and 3D-consistent generative modeling in the first part of the talk.

In the second part, we will dive deeper into 2D GANs and investigate which spectral properties are learned from 2D images. Surprisingly, multiple recent works report an elevated amount of high frequencies in the spectral statistics, which makes it straightforward to distinguish real and generated images. Explanations for this phenomenon are controversial: while most works attribute the artifacts to the generator, other works point to the discriminator. I will present our study on the frequency bias of generative models, which takes a sober look at those explanations and provides insights into what makes proposed measures against high-frequency artifacts effective.


Katja is a 4th-year PhD student in the Autonomous Vision Group at Tuebingen University and is currently doing an internship with Sanja Fidler at NVIDIA. Katja received her BSc degree in 2016 and MSc degree in 2018 from Heidelberg University. In July 2019 she started her PhD at Tuebingen University under the supervision of Andreas Geiger. Her research lies at the intersection of computer vision and graphics and focuses on generative modeling in 2D and 3D.

Slot Attention: Recent progress towards object discovery in real-world video & 3D scenes
by Thomas Kipf
Date: Friday, Jul. 22
Time: 14:30
Location: Online Call via Zoom

Our guest speaker is Thomas Kipf from Google Brain and you are all cordially invited to the CVG Seminar on July 22nd at 2:30 p.m. CET on Zoom (passcode is 809285).


The world around us — and our understanding of it — is rich in compositional structure: from atoms and their interactions to objects and entities in our environments. How can we learn models of the world that take this structure into account and generalize to new compositions in systematic ways? This talk focuses on an emerging class of slot-based neural architectures that utilize attention mechanisms to perform perceptual grouping of scenes into objects and abstract entities without direct supervision.
I will briefly introduce the Slot Attention mechanism as a core representative for this class of models and cover our recent extension of Slot Attention to multi-object video (SAVi). I will further give an overview of our new work on 1) extending SAVi to real-world video on the Waymo Open dataset (SAVi++), and 2) using Slot Attention in a scene representation transformer architecture to radically speed up 3D-centric object discovery via novel view synthesis (Object Scene Representation Transformer, OSRT).


Thomas is a Senior Research Scientist at Google Brain. He obtained his Ph.D. at the University of Amsterdam working with Max Welling. For his Ph.D. thesis on Deep Learning with Graph-Structured Representations, he received the ELLIS Ph.D. Award 2021. He is broadly interested in developing and studying machine learning models that can reason about the rich structure of both the physical and digital world and its combinatorial complexity. This includes topics in graph representation learning, object-centric learning, and causal representation learning.

CV applications in medicine and biology
by Dmitry Dylov
Date: Friday, Jun. 3
Time: 14:30
Location: Online Call via Zoom

Our guest speaker is Dmitry Dylov from Skoltech and you are all cordially invited to the CVG Seminar on June 3rd at 2:30 p.m. on Zoom (passcode is 859119) or in person (room 302 at the Institute of Informatics).


Dr. Dmitry V. Dylov is an Associate Professor at the Center for Computational and Data-Intensive Science and Engineering at Skoltech. He earned his Ph.D. degree in Electrical Engineering at Princeton University (Princeton, NJ, USA) in 2010, and M.Sc. degree in Applied Physics and Mathematics at Moscow Institute of Physics and Technology (Moscow, Russia) in 2006.

Prior to joining Skoltech, Dr. Dylov was a Lead Scientist at GE Global Research Center (Niskayuna, NY, USA), where he led various projects ranging from bioimaging and computational optics to medical image analytics. Dr. Dylov’s innovation record includes IP contributions to GE Healthcare, frequent technical consulting to emerging start-ups, and the foundation of two spin-off companies with clinical validation in major hospitals in the USA (MSKCC, MGH, UCSF, Albany Med).

Dr. Dylov has established a new theoretical and computational paradigm for treating noise in imaging systems, resulting in impactful publications in reputable journals such as Physical Review Letters and Nature Photonics. His career record includes more than 50 peer-reviewed publications, 16 international patents, and more than 80 invited and contributed talks. Dr. Dylov has earned the McGraw Teaching Excellence certificate at Princeton and has been an instructor in the Edison Engineering Development Program at GE. He has served as an avid professional service volunteer, a scientific reviewer, and an advocate for educational outreach within the SPIE, OSA, APS, and IEEE societies.

Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
by Greg Yang
Date: Friday, Apr. 8
Time: 14:30
Location: NS302

Our guest speaker is Greg Yang from Microsoft Research and you are all cordially invited to the CVG Seminar on April 8th at 2:30 p.m. on Zoom (passcode is 757974) or in person (room 302 at the Institute of Informatics).


You can’t train GPT-3 on a single GPU, much less tune its hyperparameters (HPs)…or so it seems. I’m here to tell you this is not true: you can tune its HPs on a single GPU even if you can’t train it that way! In the first half of this talk, I’ll describe how, in the so-called maximal update parametrization (abbreviated µP), narrow and wide neural networks share the same set of optimal HPs. This lets us tune any large model by just tuning a small version of it — we call this µTransfer. In particular, this allowed us to tune the 6.7 billion parameter version of GPT-3 using only 7% of its pretraining compute budget, and, with some asterisks, we get a performance comparable to the original GPT-3 model with twice the parameter count. In the second half of this talk, I’ll discuss the theoretical reason µP has this special property and the connection to the study of infinite-width neural networks and, more generally, the theory of Tensor Programs. The first half will target general practitioners or empirical researchers in machine learning, while the second half targets those who are more theoretically curious. This talk is based on


Greg Yang is a researcher at Microsoft Research in Redmond, Washington. He joined MSR after obtaining a Bachelor's degree in Mathematics and a Master's degree in Computer Science from Harvard University, advised by ST Yau and Alexander Rush, respectively. He won the Hoopes Prize at Harvard for best undergraduate thesis as well as an Honorable Mention for the AMS-MAA-SIAM Morgan Prize, the highest honor in the world for an undergraduate in mathematics. He gave an invited talk at the International Congress of Chinese Mathematicians 2019.

Are Large-scale Datasets Necessary for Self-Supervised Pre-training?
by Alaa El-Nouby
Date: Friday, Mar. 11
Time: 15:30
Location: Online Call via Zoom

Our guest speaker is Alaa El-Nouby from Meta AI Research and Inria Paris and you are all cordially invited to the CVG Seminar on March 11th at 3:30 p.m. on Zoom (passcode is 913674).


Pre-training models on large-scale datasets, like ImageNet, is standard practice in computer vision. This paradigm is especially effective for tasks with small training sets, for which high-capacity models tend to overfit. In this work, we consider a self-supervised pre-training scenario that only leverages the target task data. We consider datasets, like Stanford Cars, Sketch or COCO, which are order(s) of magnitude smaller than ImageNet. Our study shows that denoising autoencoders, such as BEiT or a variant that we introduce in this paper, are more robust to the type and size of the pre-training data than popular self-supervised methods trained by comparing image embeddings. We obtain competitive performance compared to ImageNet pre-training on a variety of classification datasets from different domains. On COCO, when pre-training solely using COCO images, the detection and instance segmentation performance surpasses the supervised ImageNet pre-training in a comparable setting.


Alaa El-Nouby is a PhD student at Meta AI Research and Inria Paris advised by Hervé Jégou and Ivan Laptev. His research interests are metric learning, self-supervised learning and transformers for computer vision. Prior to pursuing his PhD, Alaa received his MSc from the University of Guelph and the Vector Institute, advised by Graham Taylor, where he conducted research in spatio-temporal representation learning and text-to-image synthesis with generative models.