Seminars and Talks

Dense FixMatch: A Simple Semi-supervised Learning Method for Pixel-wise Prediction Tasks
by Atsuto Maki
Date: Friday, Feb. 2
Time: 14:30
Location: N10_302, Institute of Computer Science

Our guest speaker is Prof. Atsuto Maki from the KTH Royal Institute of Technology.

You are all cordially invited to the CVG Seminar on February 2nd at 2:30 pm CET.

Abstract

We discuss Dense FixMatch, a simple method for online semi-supervised learning of dense and structured prediction tasks that combines pseudo-labeling and consistency regularization via strong data augmentation. It is an application of FixMatch beyond image classification, enabled by adding a matching operation on the pseudo-labels. This allows us to still use the full strength of data augmentation pipelines, including geometric transformations. We evaluate it on semi-supervised semantic segmentation on Cityscapes and Pascal VOC with different percentages of labeled data, and ablate design choices and hyper-parameters. Dense FixMatch significantly improves results compared to supervised learning using only labeled data, approaching its performance while using only 1/4 of the labeled samples.

[1] Dense FixMatch: a simple semi-supervised learning method for pixel-wise prediction tasks [link]

[2] An analysis of over-sampling labeled data in semi-supervised learning with FixMatch [link]
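
To make the matching operation mentioned in the abstract concrete, here is a minimal PyTorch-style sketch of how the unlabeled-data loss could look, based on the abstract and [1]: pseudo-labels are taken from a weakly augmented view, and the geometric part of the strong augmentation is also applied to the pseudo-labels so that the targets stay pixel-aligned with the strongly augmented input. The names model, weak_aug, strong_photo, strong_geom and the threshold value are illustrative placeholders, not the authors' code.

import torch
import torch.nn.functional as F

def dense_fixmatch_unlabeled_loss(model, images, weak_aug, strong_photo,
                                  strong_geom, tau=0.95):
    # images: (B, 3, H, W) batch of unlabeled images.
    # 1) Per-pixel pseudo-labels from the weakly augmented view (no gradients).
    with torch.no_grad():
        probs = torch.softmax(model(weak_aug(images)), dim=1)  # (B, C, H, W)
        conf, pseudo = probs.max(dim=1)                        # both (B, H, W)
        mask = (conf >= tau).float()                           # confidence mask

    # 2) Strong augmentation = photometric jitter plus a geometric transform.
    #    Applying the same geometric transform to the pseudo-labels and the
    #    mask (nearest-neighbour) keeps the targets pixel-aligned with the
    #    strongly augmented input; this is the matching step.
    strong_images = strong_photo(strong_geom.apply_to_image(images))
    pseudo = strong_geom.apply_to_label(pseudo)
    mask = strong_geom.apply_to_label(mask)

    # 3) Masked per-pixel cross-entropy against the matched pseudo-labels.
    logits = model(strong_images)
    loss = F.cross_entropy(logits, pseudo.long(), reduction="none")  # (B, H, W)
    return (loss * mask).sum() / mask.sum().clamp(min=1.0)

During training, this term would be added to the usual supervised cross-entropy on the labeled batch, as in FixMatch.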

Bio

Atsuto Maki is a Professor of Computer Science at KTH Royal Institute of Technology, Sweden. He obtained his BEng and MEng in electrical engineering from Kyoto University and the University of Tokyo, respectively, and his PhD in computer science from KTH. Previously, he was an associate professor at the Graduate School of Informatics, Kyoto University, and then a senior researcher at Toshiba’s Cambridge Research Lab in the UK. His research interests cover a broad range of topics in machine learning, deep learning, and computer vision, including motion and object recognition, clustering, subspace analysis, stereopsis, and representation learning. He has served as a program committee member at major computer vision conferences, e.g. as an area chair for ICCV and ECCV.

Supercharging Multimodal Video Representations
by Rohit Girdhar
Date: Friday, Jan. 19
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Rohit Girdhar from the GenAI Research group, Meta.

You are all cordially invited to the CVG Seminar on January 19th at 4 pm CET

  • via Zoom (passcode is 659431).

Abstract

The last few years have seen an explosion in the capabilities of representations learned by large models trained on large amounts of data. From LLMs like GPT4 for natural language processing, to multimodal models like CLIP or Flamingo for visual reasoning, to text-to-image models like DALLE-3 for image generation, these models have revolutionized the way computers understand their respective modalities. One modality, however, has somewhat been left behind: video. While GPT4V and DALLE-3 have made huge strides in image understanding and generation, understanding or generating videos is still an open problem. What are the reasons for this, and will video representations ever catch up? I believe that instead of treating this as a competition between video and the other modalities, we should view strong language, image, and generative representations as an asset for bootstrapping strong video representations. In this talk, I will share some of my recent work on building better video representations by leveraging these advanced representations, specifically for the tasks of video understanding, multimodal understanding, and video generation.

Bio

Rohit is a Research Scientist in the GenAI Research group at Meta. His current research focuses on understanding and generating multimodal data using minimal human supervision. He obtained an MS and a PhD in Robotics from Carnegie Mellon University, where he worked on learning from and understanding videos. He was previously part of the Facebook AI Research (FAIR) group at Meta, and has spent time at DeepMind, Adobe, and Facebook as an intern. His research has won multiple international challenges and has been recognized with a Best Paper Finalist award at CVPR’22, a Best Paper Award at the ICCV’19 HVU Workshop, a Siebel Scholarship at CMU, and a Gold Medal and Research Award for undergraduate research at IIIT Hyderabad.

Segmenting Objects without Manual Supervision
by Laurynas Karazija
Date: Friday, Jan. 12
Time: 14:30
Location: Online Call via Zoom

Our guest speaker is Laurynas Karazija from the Visual Geometry Group, University of Oxford.

You are all cordially invited to the CVG Seminar on January 12th at 2:30 pm CET

  • via Zoom (passcode is 043728).

Abstract

Detecting, localising, and representing the objects that make up the visual world is an important and interesting problem with many downstream applications. Today's systems are supervised, relying on extensive and expensive manual annotations. In this talk, I will introduce some recent works that explore learning from appearance, motion, and language in an unsupervised or weakly supervised manner. In particular, I will focus on the drawbacks of appearance-based object-centric models, explain how segmentation networks can be taught using optical flow in an end-to-end manner, and show how pretrained generative diffusion models can be used to synthesise segmenters directly by sampling and representing objects and their context.

Bio

Laurynas Karazija is a PhD student at the Visual Geometry Group at the University of Oxford, UK, working with Prof Andrea Vedaldi, Prof Christian Rupprecht and Dr Iro Laina. He focuses on learning to understand and decompose the visual world into distinct objects with as little supervision as possible.

Three Views on View Synthesis
by Kyle Sargent
Date: Friday, Dec. 15
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Kyle Sargent from Stanford Vision Lab.

You are all cordially invited to the CVG Seminar on December 15th at 4 pm CET

  • via Zoom (passcode is 520944).

Abstract

Novel view synthesis from a single image is an important problem in computer vision. Several sources of randomness and ill-posedness make the problem extremely challenging. I will present three papers from across my research career, each taking a very different perspective and technical approach to this problem. As the talk progresses, I will explain how I have come to regard 3D generative modeling and 3D novel view synthesis as closely connected, and give supporting evidence. The final paper, ZeroNVS: Zero-shot 360-degree View Synthesis from a Single Real Image, is my most recent work and is currently in submission.

Bio

Kyle Sargent is a second-year PhD student in the Stanford Vision Lab, advised by Jiajun Wu and Fei-Fei Li. He works on 3D generative models and novel view synthesis. He has written several papers for top vision conferences, including two first- or co-first-authored Best Paper Award finalists at CVPR 2022 and ICCV 2023. Prior to joining Stanford, he was an AI Resident at Google Research, and before that he was an undergraduate at Harvard.

Using Deep Generative Models for Representation Learning and Beyond
by Daiqing Li
Date: Thursday, Dec. 7
Time: 16:00
Location: Online Call via Zoom

Our guest speaker is Daiqing Li from Playground.

You are all cordially invited to the CVG Seminar on December 7th at 4 pm CET

  • via Zoom (passcode is 102781).

Abstract

Diffusion-based deep generative models have demonstrated remarkable performance on text-conditioned synthesis tasks for images, videos, and 3D. In this talk, I will discuss how to use large-scale text-to-image (T2I) models as vision foundation models for representation learning and other downstream tasks, such as synthetic dataset generation and semantic segmentation.
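
As a rough illustration of what using a T2I model as a vision foundation model can look like in practice (in the spirit of works such as DreamTeacher, not a description of the speaker's exact method), the sketch below extracts intermediate U-Net features from a pretrained Stable Diffusion checkpoint via the Hugging Face diffusers library; the checkpoint name, block indices, timestep, and the use of an empty prompt are all assumptions made for the example.

import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDPMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

ckpt = "runwayml/stable-diffusion-v1-5"  # illustrative checkpoint
vae = AutoencoderKL.from_pretrained(ckpt, subfolder="vae").eval()
unet = UNet2DConditionModel.from_pretrained(ckpt, subfolder="unet").eval()
scheduler = DDPMScheduler.from_pretrained(ckpt, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(ckpt, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(ckpt, subfolder="text_encoder").eval()

# Capture decoder-block activations with forward hooks; which blocks to use is
# a design choice, up_blocks 1 and 2 are just an example.
features = {}
def save_to(name):
    def hook(module, inputs, output):
        features[name] = output
    return hook
for i in (1, 2):
    unet.up_blocks[i].register_forward_hook(save_to(f"up{i}"))

@torch.no_grad()
def extract_features(images, t=100):
    # images: (B, 3, H, W) scaled to [-1, 1]; returns two multi-scale feature maps.
    latents = vae.encode(images).latent_dist.sample() * vae.config.scaling_factor
    timesteps = torch.full((images.shape[0],), t, dtype=torch.long)
    noisy = scheduler.add_noise(latents, torch.randn_like(latents), timesteps)
    tokens = tokenizer([""] * images.shape[0], padding="max_length",
                       max_length=tokenizer.model_max_length, return_tensors="pt")
    text_emb = text_encoder(tokens.input_ids)[0]
    unet(noisy, timesteps, encoder_hidden_states=text_emb)  # run only to fill the hooks
    return [features["up1"], features["up2"]]

A small trainable head that upsamples and fuses these maps could then be trained for semantic segmentation, or used to distill the features into a standard backbone, while the diffusion model stays frozen.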

Bio

Daiqing Li is a research lead at Playground, where his primary focus is advancing pixel foundation models. Previously, he was a senior research scientist at the NVIDIA Toronto AI Lab, where his research spanned computer vision, computer graphics, generative models, and machine learning; he collaborated closely with Sanja Fidler and Antonio Torralba, and several of his works have been integrated into NVIDIA products, notably Omniverse and Clara. Daiqing graduated from the University of Toronto and was a runner-up for the MICCAI Young Scientist Award. His recent research focuses on using generative models for dataset synthesis, perception tasks, and representation learning. He is the author of SemanticGAN, BigDatasetGAN, and DreamTeacher.