News

Graduation Announcement
June 25, 2019

We are pleased to announce that two students in our group have recently defended their doctoral theses. Qiyang Hu and Attila Szabó now hold the title of Doctor of Philosophy. Congratulations!

  • Dr. Qiyang Hu defended her thesis "Learning Controllable Representations for Image Synthesis" on June 10. The external examiner was Prof. Dr. Alex Bronstein. A list of her contributions can be found on our publication page.
     
  • Dr. Attila Szabó started his studies in 2012 and worked primarily on representation learning. His defense was on June 24 with the title "Learning Interpretable Representations of Images" (see images attached below). The external examiner was Prof. Dr. Pascal Fua.

Both theses can be found here

Prof. Dr. Paolo Favaro and the members of CVG congratulate Qiyang and Attila on their incredible achievement and wish them all the best for the future.

 

Learning Controllable Representations for Image Synthesis

In this thesis, we focus on learning a controllable representation and applying the learned controllable feature representation to image synthesis, video generation, and even 3D reconstruction. We propose different methods to disentangle the feature representation in a neural network and analyze the challenges in disentanglement, such as reference ambiguity and the shortcut problem when using weak labels. We use the disentangled feature representation to transfer attributes between images, such as exchanging hairstyles between two face images. Furthermore, we study how another type of feature, the sketch, works in a neural network. A sketch can provide the shape and contour of an object, such as the silhouette of a side-view face. We leverage the silhouette constraint to improve 3D face reconstruction from 2D images. A sketch can also provide the moving direction of an object, so we investigate how one can manipulate the object to follow the trajectory provided by a user sketch. We propose a method to automatically generate video clips from a single input image, using the sketch as motion and trajectory guidance to animate the object in that image. We demonstrate the efficiency of our approaches on several synthetic and real datasets.

 

Learning Interpretable Representations of Images

Computers represent images with pixels and each pixel contains three numbers for red, green and blue colour values. These numbers are meaningless for humans and they are mostly useless when used directly with classical machine learning techniques like linear classifiers. Interpretable representations are the attributes that humans understand: the colour of the hair, viewpoint of a car or the 3D shape of the object in the scene. Many computer vision tasks can be viewed as learning interpretable representations, for example a supervised classification algorithm directly learns to represent images with their class labels.
In this work we aim to learn interpretable representations (or features) indirectly with lower levels of supervision. This approach has the advantage of cost savings on dataset annotations and the flexibility of using the features for multiple follow-up tasks. We made contributions in three main areas: weakly supervised learning, unsupervised learning and 3D reconstruction.
In the weakly supervised case we use image pairs as supervision. Each pair shares a common attribute and differs in a varying attribute. We propose a training method that learns to separate the attributes into separate feature vectors. These features are then used for attribute transfer and classification. We also show theoretical results on the ambiguities of the learning task and the ways to avoid degenerate solutions.
We show a method for unsupervised representation learning that separates semantically meaningful concepts. We explain, and show through ablation studies, how the components of our proposed method work: a mixing autoencoder, a generative adversarial net and a classifier.
We propose a method for learning single image 3D reconstruction. It uses only images: no human annotation, stereo, synthetic renderings or ground truth depth maps are needed. We train a generative model that learns the 3D shape distribution and an encoder to reconstruct the 3D shape. For that we exploit the notion of image realism. It means that the 3D reconstruction of the object has to look realistic when it is rendered from different random angles. We prove the efficacy of our method from first principles.

PhD Defense of Attila Szabó
June 18, 2019

Attila Szabó will defend his thesis with the title Learning Interpretable Representations of Images on June 24, 15:00 at Engehaldenstrasse 8 in room 002. 

Abstract

Computers represent images with pixels and each pixel contains three numbers for red, green and blue colour values. These numbers are meaningless for humans and they are mostly useless when used directly with classical machine learning techniques like linear classifiers. Interpretable representations are the attributes that humans understand: the colour of the hair, viewpoint of a car or the 3D shape of the object in the scene. Many computer vision tasks can be viewed as learning interpretable representations, for example a supervised classification algorithm directly learns to represent images with their class labels. In this work we aim to learn interpretable representations (or features) indirectly with lower levels of supervision. This approach has the advantage of cost savings on dataset annotations and the flexibility of using the features for multiple follow-up tasks. We made contributions in three main areas: weakly supervised learning, unsupervised learning and 3D reconstruction. In the weakly supervised case we use image pairs as supervision. Each pair shares a common attribute and differs in a varying attribute. We propose a training method that learns to separate the attributes into separate feature vectors. These features are then used for attribute transfer and classification. We also show theoretical results on the ambiguities of the learning task and the ways to avoid degenerate solutions. We show a method for unsupervised representation learning that separates semantically meaningful concepts. We explain, and show through ablation studies, how the components of our proposed method work: a mixing autoencoder, a generative adversarial net and a classifier. We propose a method for learning single image 3D reconstruction. It uses only images: no human annotation, stereo, synthetic renderings or ground truth depth maps are needed. We train a generative model that learns the 3D shape distribution and an encoder to reconstruct the 3D shape. For that we exploit the notion of image realism. It means that the 3D reconstruction of the object has to look realistic when it is rendered from different random angles. We prove the efficacy of our method from first principles.

Our CVPR 2019 Papers
March 27, 2019

We recently had two papers accepted at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), which will take place in Long Beach, California, from June 16 to 20. With over 5000 submissions and 1300 accepted papers (an acceptance rate of about 25%), CVPR is one of the largest top-tier conferences for Computer Vision and Machine Learning. Below you will find the abstracts of our two papers. Keep an eye on our publication page as it will be updated with new materials in the coming weeks.
 
Learning to Extract Flawless Slow Motion from Blurry Videos
by M. Jin, Z. Hu and P. Favaro
 
In this paper, we introduce the task of generating a sharp slow-motion video given a low frame rate blurry video. We propose a data-driven approach, where the training data is captured with a high frame rate camera and blurry images are simulated through an averaging process. While it is possible to train a neural network to recover the sharp frames from their average, there is no guarantee of temporal smoothness in the resulting video, as the frames are estimated independently. To solve this problem we introduce two networks: one, DeblurNet, to predict sharp keyframes, and a second, InterpNet, to predict intermediate frames between the generated keyframes. To address the temporal smoothness requirement we obtain two sets of keyframes from two subsequent blurry input images and then apply InterpNet between all subsequent pairs of keyframes, including the case where one keyframe is generated from one blurry image and the other keyframe is generated from the other blurry image. Therefore, a smooth transition is ensured by interpolating between consecutive keyframes using InterpNet. Moreover, the proposed scheme enables a further increase in frame rate without retraining the network, by applying InterpNet recursively between pairs of sharp frames. We demonstrate the performance of our approach in increasing the frame rate of real blurry videos up to 20 times. We also evaluate on several datasets, including a novel dataset captured with a Sony RX V camera, which we will make publicly available.
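The averaging process and the recursive interpolation described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names `simulate_blur`, `interpolate_recursively` and `midpoint` are hypothetical, frames are assumed to be flat lists of pixel values, and a plain linear blend stands in for the learned InterpNet.

```python
def simulate_blur(sharp_frames, window):
    """Average each run of `window` consecutive sharp frames into one blurry frame.

    Frames are flat lists of pixel values (an assumption made for brevity).
    Trailing frames that do not fill a complete window are dropped.
    """
    blurry = []
    for start in range(0, len(sharp_frames) - window + 1, window):
        group = sharp_frames[start:start + window]
        # zip(*group) walks the same pixel position across all frames in the group.
        blurry.append([sum(px) / window for px in zip(*group)])
    return blurry


def midpoint(a, b):
    """Hypothetical stand-in for a learned interpolation network: a linear blend."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def interpolate_recursively(keyframes, interp, passes):
    """Insert one interpolated frame between each consecutive pair, `passes` times.

    Each pass turns n frames into 2n - 1 frames, so the frame rate roughly
    doubles per pass without changing the interpolation function.
    """
    frames = list(keyframes)
    for _ in range(passes):
        out = [frames[0]]
        for a, b in zip(frames, frames[1:]):
            out.append(interp(a, b))
            out.append(b)
        frames = out
    return frames


if __name__ == "__main__":
    sharp = [[float(i), float(2 * i)] for i in range(8)]
    print(simulate_blur(sharp, 4))
    print(len(interpolate_recursively([[0.0], [4.0]], midpoint, 2)))
```

In this sketch the number of frames grows from n to 2n - 1 with each pass, which is one way to read the remark that the frame rate can be raised further without retraining.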
 
 
On Stabilizing Generative Adversarial Training with Noise
 

We present a novel method and analysis to train generative adversarial networks (GANs) in a stable manner. As shown in recent analysis, training is often undermined by the limited support of the probability distribution of the data samples. We notice that the distributions of real and generated data should match even when they undergo the same filtering. Therefore, to address the limited support problem we propose to train GANs by using different filtered versions of the real and generated data distributions. In this way, filtering does not prevent the exact matching of the data distribution, while helping training by extending the support of both distributions. As filtering we consider adding samples from an arbitrary distribution to the data, which corresponds to a convolution of the data distribution with the arbitrary one. We also propose to learn the generation of these samples so as to challenge the discriminator in the adversarial training. We show that our approach results in a stable and well-behaved training of even the original minimax GAN formulation. Moreover, our technique can be incorporated in most modern GAN formulations and leads to a consistent improvement on several common datasets.
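The statement that adding noise corresponds to a convolution can be written out as follows. The symbols $p_d$, $p_g$ and $p_\epsilon$ (data, generator and noise densities) are notation assumed here for illustration, not taken from the abstract, and the noise is assumed independent of the sample it is added to.

```latex
% Assumed notation: x ~ p_d (real data), \epsilon ~ p_\epsilon (independent noise).
\tilde{x} = x + \epsilon
\quad\Longrightarrow\quad
p_{\tilde{x}}(\tilde{x})
  = (p_d * p_\epsilon)(\tilde{x})
  = \int p_d(\tilde{x} - \epsilon)\, p_\epsilon(\epsilon)\, \mathrm{d}\epsilon .

% Applying the same noise to generated samples gives p_g * p_\epsilon, and
p_d = p_g \;\Longrightarrow\; p_d * p_\epsilon = p_g * p_\epsilon .

% The support of the filtered density is the Minkowski sum of the supports:
\operatorname{supp}(p_d * p_\epsilon)
  = \overline{\operatorname{supp}(p_d) + \operatorname{supp}(p_\epsilon)} .
```

Under these assumptions, matching distributions stay matched after filtering, while the support of each filtered distribution is enlarged by the support of the noise.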
 

Graduation Announcements
Jan. 28, 2019

We proudly announce the graduation of two of our group members.

Dr. Meiguang Jin, who started his PhD in 2014, defended his thesis "Motion Deblurring from a Single Image" in December last year and is currently working with us as a postdoc until he returns to China. He recently received the Alumni Award from our Institute for his outstanding dissertation.

Dr. Mehdi Noroozi held his defense presentation "Beyond Supervised Representation Learning" last week on the closing day of the CUSO Deep Learning Winter School. He will continue his research at Bosch in Germany. 

Congratulations to both on your achievement. We wish you a successful and bright life ahead. 

Deep Learning in Lenk
Jan. 27, 2019

In one exciting week, the 40 participants of the CUSO Deep Learning Winter School had a chance to catch up on the newest developments in the world of Deep Learning. The school covered a wide range of topics, including theoretical background, unsupervised learning, generative models, language modelling and more. With inspiring speakers such as Alyosha Efros, René Vidal, Paolo Favaro and others, the school offered the students an opportunity to connect and discuss, and hopefully to spark new and exciting ideas for future work.

The school was organized by Paolo Favaro, François Fleuret and Marcus Liwicki under the Doctoral Program in Computer Science of the CUSO universities. Head over to the school's website to watch recordings of the talks and download presentation slides.