A list of completed theses and new thesis topics from the Computer Vision Group.
Are you about to start a BSc or MSc thesis? Please read our instructions for preparing and delivering your work.
Below we list possible thesis topics for Bachelor and Master students in the areas of Computer Vision, Machine Learning, Deep Learning and Pattern Recognition. The project descriptions leave plenty of room for your own ideas. If you would like to discuss a topic in detail, please contact the supervisor listed below and Prof. Paolo Favaro to schedule a meeting.
Today we have many 3D scanning techniques that allow us to capture the shape and appearance of objects. It is easier than ever to scan real 3D objects and transform them into a digital model for further processing, such as modeling, rendering or animation. However, the output of a 3D scanner is often a raw point cloud with little to no annotations. The unstructured nature of the point cloud representation makes processing, e.g. surface reconstruction, difficult. One application is the detection and segmentation of an object of interest. In this project, the student is challenged to design a system that takes a point cloud (a 3D scan) as input and outputs the names of objects contained in the scan. This output can then be used to eliminate outliers or points that belong to the background. The approach involves collecting a large dataset of 3D scans and training a neural network on it.
Contact: Adrian Wälchli
A photograph accurately captures the world in a moment of time and from a specific perspective. Since it is a projection of the 3D space to a 2D image plane, the depth information is lost. Is it possible to restore it, given only a single photograph? In general, the answer is no. This problem is ill-posed, meaning that many different plausible depth maps exist, and there is no way of telling which one is the correct one. However, if we cover one of our eyes, we are still able to recognize objects and estimate how far away they are. This motivates the exploration of an approach where prior knowledge can be leveraged to reduce the ill-posedness of the problem. Such a prior could be learned by a deep neural network, trained with many images and depth maps.
Deblurring finds many applications in our everyday life. It is particularly useful when taking pictures on handheld devices (e.g. smartphones) where camera shake can degrade important details. Therefore, it is desired to have a good deblurring algorithm implemented directly in the device. In this project, the student will implement and optimize a state-of-the-art deblurring method based on a deep neural network for deployment on mobile phones (Android). The goal is to reduce the number of network weights in order to reduce the memory footprint while preserving the quality of the deblurred images. The result will be a camera app that automatically deblurs the pictures, giving the user a choice of keeping the original or the deblurred image.
If an object in front of the camera or the camera itself moves while the aperture is open, the region of motion becomes blurred because the incoming light is accumulated in different positions across the sensor. If there is camera motion, there is also parallax. Thus, a motion-blurred image contains depth information. In this project, the student will tackle the problem of recovering a depth map from a motion-blurred image. This includes the collection of a large dataset of blurred and sharp images or videos using a pair or triplet of GoPro action cameras. Two cameras will be used in stereo to estimate the depth map, and the third captures the blurred frames. This data is then used to train a convolutional neural network that will predict the depth map from the blurry image.
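As a rough illustration of how the stereo pair can provide depth for training, the sketch below applies the standard relation for a rectified stereo rig, depth = focal length x baseline / disparity. The function name, the parameter names and the example values are assumptions made for illustration; the project description does not prescribe a particular stereo method.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in metres of a point seen by a rectified stereo pair.

    disparity_px: horizontal shift of the point between the two views (pixels)
    focal_px:     focal length of the cameras (pixels), assumed identical
    baseline_m:   distance between the two camera centres (metres)
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical values: nearby points have a larger disparity than distant ones.
near = depth_from_disparity(disparity_px=40.0, focal_px=1000.0, baseline_m=0.1)
far = depth_from_disparity(disparity_px=20.0, focal_px=1000.0, baseline_m=0.1)
print(near, far)
```

Because depth is inversely proportional to disparity, small disparity errors on distant points translate into large depth errors, which is one reason a wide baseline between the two cameras can help.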
The idea of this project is that we have two types of neural networks that work together: There is one network A that assigns images to k clusters and k (simple) networks of type B perform a self-supervised task on those clusters. The goal of all the networks is to make the k networks of type B perform well on the task. The assumption is that clustering in semantically similar groups will help the networks of type B to perform well. This could be done on the MNIST dataset with B being linear classifiers and the task being rotation prediction.
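The rotation prediction task mentioned above can be sketched as follows: each image is rotated by a random multiple of 90 degrees, and the number of quarter turns serves as a free label for the networks of type B. This is a minimal sketch on plain nested lists; the helper names are hypothetical and no particular deep learning framework is assumed.

```python
import random


def rotate90(image):
    """Rotate a 2D list of pixels by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotation_example(image, rng):
    """Return (rotated_image, label); label in {0, 1, 2, 3} counts quarter turns."""
    label = rng.randrange(4)
    rotated = image
    for _ in range(label):
        rotated = rotate90(rotated)
    return rotated, label


rng = random.Random(0)
image = [[1, 2],
         [3, 4]]
rotated, label = rotation_example(image, rng)
print(label, rotated)
```

In the proposed setup, network A would first assign each image to one of the k clusters, and the classifier of type B responsible for that cluster would be trained to predict the label from the rotated image.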
The student designs a data augmentation network that transforms training images in such a way that image realism is preserved (e.g. with a constrained spatial transformer network) and the transformed images are more difficult to classify (trained via adversarial loss against an image classifier). The model will be evaluated for different data settings (especially in the low data regime), for example on the MNIST and CIFAR datasets.
People with sensory impairment (hearing, speech, vision) depend heavily on assistive technologies to communicate and navigate in everyday life. The mass production of media content today makes it impossible to manually translate everything into a common language for assistive technologies, e.g. captions or sign language. In this project, the student employs a neural network to learn a representation for lip-movement in videos in an unsupervised fashion, possibly with an encoder-decoder structure where the decoder reconstructs the audio signal. This requires collecting a large dataset of videos (e.g. from YouTube) of speakers or conversations where lip movement is visible. The outcome will be a neural network that learns an audio-visual representation of lip movement in videos, which can then be leveraged to generate captions for hearing impaired persons.
Satellite images have many applications, e.g. in meteorology, geography, education, cartography and warfare. They are an accurate and detailed depiction of the surface of the earth from above. Although it is relatively simple to collect many satellite images in an automated way, challenges arise when processing them for use in navigation and cartography. The idea of this project is to automatically convert an arbitrary satellite image, of e.g. a city, to a map of simple 2D shapes (streets, houses, forests) and label them with colors (semantic segmentation). The student will collect a dataset of satellite images and topological maps and train a deep neural network that learns to map from one domain to the other. The data could be obtained from a Google Maps database or similar.
The idea of inferring the emotional state of a subject by looking at their face is nothing new. Neither is the idea of automating this process using computers. Researchers used to computationally extract handcrafted features from face images that had proven themselves to be effective and then used machine learning techniques to classify the facial expressions using these features. Recently, there has been a trend towards using deep learning and especially Convolutional Neural Networks (CNNs) for the classification of these facial expressions. Researchers were able to achieve good results on images that were taken in laboratories under the same or at least similar conditions. However, these models do not perform very well on more arbitrary face images with different head poses and illumination. This thesis aims to show the challenges of Facial Expression Recognition (FER) in this wild setting. It presents the currently used datasets and the present state-of-the-art results on one of the biggest facial expression datasets currently available. The contributions of this thesis are twofold. Firstly, I analyze three famous neural network architectures and their effectiveness on the classification of facial expressions. Secondly, I present two modifications of one of these networks that lead to the proposed STN-COV model. While this model does not outperform all of the current state-of-the-art models, it does beat several of them.
This work covers a new approach to 3D reconstruction. In traditional 3D reconstruction one uses multiple images of the same object to calculate a 3D model by taking information gained from the differences between the images, like camera position, illumination of the images, rotation of the object and so on, to compute a point cloud representing the object. The characteristic trait shared by all these approaches is that one can change almost everything about the image, but it is not possible to change the object itself, because one needs to find correspondences between the images. To be able to use different instances of the same object, we used a 3D DPM model that can find different parts of an object in an image, thereby detecting the correspondences between the different pictures, which we can then use to calculate the 3D model. To put this theory into practice, we gave a 3D DPM model, which was trained to detect cars, pictures of different car brands, where no pair of images showed the same vehicle, and used the detected correspondences and the Factorization Method to compute the 3D point cloud. This technique leads to a completely new approach in 3D reconstruction, because changing the object itself was never done before.
This thesis explores the field of artificial neural networks with realistic looking visual outputs. It aims at morphing face pictures of a specific identity to look like another individual by only modifying key features, such as eye color, while leaving identity-independent features unchanged. Prior works have covered the topic of symmetric translation between two specific domains but failed to optimize it on faces where only parts of the image may be changed. This work applies a face masking operation to the output at training time, which forces the image generator to preserve colors while altering the face, fitting it naturally inside the unmorphed surroundings. Various experiments are conducted including an ablation study on the final setting, decreasing the baseline identity switching performance from 81.7% to 75.8% whilst improving the average χ2 color distance from 0.551 to 0.434. The provided code-based software gives users easy access to apply this neural face swap to images and videos of arbitrary crop and brings Computer Vision one step closer to replacing Computer Graphics in this specific area.
In the digital age of ever increasing data amassment and accessibility, the demand for scalable machine learning models effective at refining the new oil is unprecedented. Unsupervised representation learning methods present a promising approach to exploit this invaluable yet unlabeled digital resource at scale. However, a majority of these approaches focuses on synthetic or simplified datasets of images. What if a method could learn directly from natural Internet-scale image data? In this thesis, we propose a novel approach for unsupervised learning of object representations by mixing natural image scenes. Without any human help, our method mixes visually similar images to synthesize new realistic scenes using adversarial training. In this process the model learns to represent and understand the objects prevalent in natural image data and makes them available for downstream applications. For example, it enables the transfer of objects from one scene to another. Through qualitative experiments on complex image data we show the effectiveness of our method along with its limitations. Moreover, we benchmark our approach quantitatively against state-of-the-art works on the STL-10 dataset. Our proposed method demonstrates the potential that lies in learning representations directly from natural image data and reinforces it as a promising avenue for future research.
In computer vision, Visual Odometry is the problem of recovering the camera motion from a video. It is related to Structure from Motion, the problem of reconstructing the 3D geometry from a collection of images. Decades of research in these areas have brought successful algorithms that are used in applications like autonomous navigation, motion capture, augmented reality and others. Despite the success of these prior works in real-world environments, their robustness is highly dependent on manual calibration and the magnitude of noise present in the images in the form of, e.g., non-Lambertian surfaces, dynamic motion and other forms of ambiguity. This thesis explores an alternative approach to the Visual Odometry problem via Deep Learning, that is, a specific form of machine learning with artificial neural networks. It describes and focuses on the implementation of a recent work that proposes the use of Recurrent Neural Networks to learn dependencies over time due to the sequential nature of the input. Together with a convolutional neural network that extracts motion features from the input stream, the recurrent part accumulates knowledge from the past to make camera pose estimations at each point in time. An analysis of the performance of this system is carried out on real and synthetic data. The evaluation covers several ways of training the network as well as the impact and limitations of the recurrent connection for Visual Odometry.
Computers represent images with pixels and each pixel contains three numbers for red, green and blue colour values. These numbers are meaningless for humans and they are mostly useless when used directly with classical machine learning techniques like linear classifiers. Interpretable representations are the attributes that humans understand: the colour of the hair, viewpoint of a car or the 3D shape of the object in the scene. Many computer vision tasks can be viewed as learning interpretable representations, for example a supervised classification algorithm directly learns to represent images with their class labels. In this work we aim to learn interpretable representations (or features) indirectly with lower levels of supervision. This approach has the advantage of cost savings on dataset annotations and the flexibility of using the features for multiple follow-up tasks. We made contributions in three main areas: weakly supervised learning, unsupervised learning and 3D reconstruction. In the weakly supervised case we use image pairs as supervision. Each pair shares a common attribute and differs in a varying attribute. We propose a training method that learns to separate the attributes into separate feature vectors. These features are then used for attribute transfer and classification. We also show theoretical results on the ambiguities of the learning task and the ways to avoid degenerate solutions. We show a method for unsupervised representation learning that separates semantically meaningful concepts. We explain how the components of our proposed method work and support this with ablation studies: a mixing autoencoder, a generative adversarial net and a classifier. We propose a method for learning single image 3D reconstruction. It uses only the images; no human annotation, stereo, synthetic renderings or ground truth depth map is needed. We train a generative model that learns the 3D shape distribution and an encoder to reconstruct the 3D shape.
For that we exploit the notion of image realism. It means that the 3D reconstruction of the object has to look realistic when it is rendered from different random angles. We prove the efficacy of our method from first principles.
In this thesis, our focus is learning a controllable representation and applying the learned controllable feature representation to image synthesis, video generation, and even 3D reconstruction. We propose different methods to disentangle the feature representation in neural networks and analyze the challenges in disentanglement, such as the reference ambiguity and the shortcut problem, when using weak labels. We use the disentangled feature representation to transfer attributes between images, such as exchanging hairstyle between two face images. Furthermore, we study the problem of how another type of feature, the sketch, works in a neural network. The sketch can provide the shape and contour of an object, such as the silhouette of the side-view face. We leverage the silhouette constraint to improve the 3D face reconstruction from 2D images. The sketch can also provide the moving directions of an object, thus we investigate how one can manipulate the object to follow the trajectory provided by a user sketch. We propose a method to automatically generate video clips from a single image input using the sketch as motion and trajectory guidance to animate the object in that image. We demonstrate the efficiency of our approaches on several synthetic and real datasets.
The complexity of any information processing task is highly dependent on the space where data is represented. Unfortunately, pixel space is not appropriate for computer vision tasks such as object classification. The traditional computer vision approaches involve a multi-stage pipeline where images are first transformed to a feature space through a handcrafted function, followed by a solution in the feature space. The challenge with this approach is the complexity of designing handcrafted functions that extract robust features. The deep learning based approaches address this issue by end-to-end training of a neural network for some tasks, which lets the network discover the appropriate representation for the training tasks automatically. It turns out that the image classification task on large-scale annotated datasets yields a representation transferable to other computer vision tasks. However, supervised representation learning is limited by the need for annotations. In this thesis we study self-supervised representation learning where the goal is to alleviate these limitations by substituting the classification task with pseudo tasks where the labels come for free. We discuss self-supervised learning by solving jigsaw puzzles that uses context as a supervisory signal. The rationale behind this task is that the network needs to extract features about object parts and their spatial configurations to solve the jigsaw puzzles. We also discuss a method for representation learning that uses an artificial supervisory signal based on counting visual primitives. This supervisory signal is obtained from an equivariance relation. We use two image transformations in the context of counting: scaling and tiling. The first transformation exploits the fact that the number of visual primitives should be invariant to scale. The second transformation allows us to equate the total number of visual primitives in each tile to that in the whole image.
The most effective transfer strategy is fine-tuning, which restricts one to using the same model or parts thereof for both pretext and target tasks. We discuss a novel framework for self-supervised learning that overcomes limitations in designing and comparing different tasks, models, and data domains. In particular, our framework decouples the structure of the self-supervised model from the final task-specific fine-tuned model. Finally, we study the problem of multi-task representation learning. A naive approach to enhance the representation learned by a task is to train the task jointly with other tasks that capture orthogonal attributes. Having a diverse set of auxiliary tasks imposes challenges on multi-task training from scratch. We propose a framework that allows us to combine arbitrarily different feature spaces into a single deep neural network. We reduce the auxiliary tasks to classification tasks and, consequently, the multi-task learning to a multi-label classification task. Nevertheless, combining multiple representation spaces without being aware of the target task might be suboptimal. As our second contribution, we show empirically that this is indeed the case and propose to combine multiple tasks after the fine-tuning on the target task.
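The tiling relation used for counting in the abstract above can be illustrated with a toy example: the counts obtained on the tiles of an image should add up to the count obtained on the whole image. In this sketch a trivial hand-written function (counting non-zero pixels) stands in for the learned counting network, and the image is split into four tiles; both choices, like the function names, are assumptions made for illustration only.

```python
def count_primitives(image):
    """Toy stand-in for a learned counting network: counts non-zero pixels."""
    return sum(1 for row in image for value in row if value != 0)


def split_into_tiles(image):
    """Split an image with even height and width into four equal tiles."""
    h, w = len(image), len(image[0])
    top, bottom = image[: h // 2], image[h // 2:]
    return [
        [row[: w // 2] for row in top],
        [row[w // 2:] for row in top],
        [row[: w // 2] for row in bottom],
        [row[w // 2:] for row in bottom],
    ]


image = [
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]
tile_counts = [count_primitives(tile) for tile in split_into_tiles(image)]
# The supervisory signal: counts on the tiles add up to the count on the whole.
print(tile_counts, sum(tile_counts), count_primitives(image))
```

For a learned network the equality does not hold automatically; the difference between the two sides can serve as a training loss that requires no human labels.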
With the information explosion, a tremendous number of photos are captured and shared via social media every day. Technically, a photo requires a finite exposure to accumulate light from the scene. Thus, objects moving during the exposure generate motion blur in a photo. Motion blur is an image degradation that makes visual content less interpretable and is therefore often seen as a nuisance. Although motion blur can be reduced by setting a short exposure time, an insufficient amount of light has to be compensated through increasing the sensor’s sensitivity, which will inevitably introduce a large amount of sensor noise. This motivates removing motion blur computationally. Motion deblurring is an important problem in computer vision and it is challenging due to its ill-posed nature, which means the solution is not well defined. Mathematically, a blurry image caused by uniform motion is formed by the convolution operation between a blur kernel and a latent sharp image. Potentially there are infinitely many pairs of blur kernel and latent sharp image that can result in the same blurry image. Hence, some prior knowledge or regularization is required to address this problem. Even if the blur kernel is known, restoring the latent sharp image is still difficult as the high frequency information has been removed. Although we can model the uniform motion deblurring problem mathematically, this model only covers in-plane translational camera motion. Practically, motion is more complicated and can be non-uniform. Non-uniform motion blur can come from many sources: camera out-of-plane rotation, scene depth change, object motion and so on. Thus, it is more challenging to remove non-uniform motion blur. In this thesis, our focus is motion blur removal. We aim to address four challenging motion deblurring problems. We start from the noise blind image deblurring scenario where the blur kernel is known but the noise level is unknown.
We introduce an efficient and robust solution based on a Bayesian framework using a smooth generalization of the 0−1 loss to address this problem. Then we study the blind uniform motion deblurring scenario where both the blur kernel and the latent sharp image are unknown. We explore the relative scale ambiguity between the latent sharp image and blur kernel to address this issue. Moreover, we study the face deblurring problem and introduce a novel deep learning network architecture to solve it. We also address the general motion deblurring problem and particularly we aim at recovering a sequence of 7 frames each depicting some instantaneous motion of the objects in the scene.
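To make the uniform blur model and the scale ambiguity mentioned above concrete, the following minimal sketch blurs a 1D signal by convolution and shows that a second, differently scaled pair of kernel and sharp signal produces exactly the same blurry observation. The 1D setting, the function name and the example values are assumptions chosen for brevity; real images are 2D and noisy.

```python
def convolve(signal, kernel):
    """Full discrete 1D convolution of two sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


sharp = [0.0, 1.0, 4.0, 1.0, 0.0]
kernel = [0.5, 0.5]  # box blur, sums to one
blurry = convolve(sharp, kernel)

# Scale ambiguity: doubling the kernel and halving the signal changes nothing.
scaled_kernel = [2.0 * k for k in kernel]
scaled_sharp = [0.5 * s for s in sharp]
same_blurry = convolve(scaled_sharp, scaled_kernel)

print(blurry)
print(same_blurry)
```

Constraining the kernel to sum to one is a common way to remove this particular ambiguity, but many other kernel and image pairs can still explain the same observation, which is why additional priors are needed.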
In this thesis we study the blind deconvolution problem. Blind deconvolution consists in the estimation of a sharp image and a blur kernel from an observed blurry image. Because the blur model admits several solutions it is necessary to devise an image prior that favors the true blur kernel and sharp image. Recently it has been shown that a class of blind deconvolution formulations and image priors has the no-blur solution as global minimum. Despite this shortcoming, algorithms based on these formulations and priors can successfully solve blind deconvolution. In this thesis we show that a suitable initialization can exploit the non-convexity of the problem and yield the desired solution. Based on these conclusions, we propose a novel “vanilla” algorithm stripped of any enhancement typically used in the literature. Our algorithm, despite its simplicity, is able to compete with the top performers on several datasets. We have also investigated a remarkable behavior of a 1998 algorithm, whose formulation has the no-blur solution as global minimum: even when initialized at the no-blur solution, it converges to the correct solution. We show that this behavior is caused by an apparently insignificant implementation strategy that makes the algorithm no longer minimize the original cost functional. We also demonstrate that this strategy improves the results of our “vanilla” algorithm. Finally, we present a study of image priors for blind deconvolution. We provide experimental evidence supporting the recent belief that a good image prior is one that leads to a good blur estimate rather than being a good natural image statistical model. By focusing the attention on the blur estimation alone, we show that good blur estimates can be obtained even when using images quite different from the true sharp image. This allows using image priors, such as those leading to “cartooned” images, that avoid the no-blur solution. 
By using an image prior that produces “cartooned” images we achieve state-of-the-art results on different publicly available datasets. We therefore suggest a shift of paradigm in blind deconvolution: from modeling natural image statistics to modeling cartooned image statistics.
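For readers unfamiliar with the no-blur solution discussed above, one common way to write a blind deconvolution problem is sketched below. The notation is an assumption made for illustration and is not taken from the thesis: y is the observed blurry image, x the sharp image, k the blur kernel, n the noise, J an image prior with weight λ, and δ the identity (Dirac) kernel.

```latex
% Blur model (uniform blur, additive noise)
y = k * x + n

% A generic regularized formulation with a normalized, non-negative kernel
(\hat{x}, \hat{k}) = \arg\min_{x,\,k} \; \tfrac{1}{2}\,\lVert k * x - y \rVert_2^2 + \lambda\, J(x)
\quad \text{s.t.} \quad k \ge 0,\ \ \textstyle\sum_i k_i = 1

% The no-blur solution explains the observation with zero data error
(x, k) = (y, \delta) \;\Longrightarrow\; \lVert \delta * y - y \rVert_2^2 = 0
```

Since the no-blur pair always has zero data error, whether it is the global minimum depends on how the prior J scores the blurry image y compared with the true sharp image.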
This thesis investigates the problem of 3D reconstruction of a scene from 2D images. In particular, we focus on photometric stereo which is a technique that computes the 3D geometry from at least three images taken from the same viewpoint and under different illumination conditions. When the illumination is unknown (uncalibrated photometric stereo) the problem is ambiguous: different combinations of geometry and illumination can generate the same images. First, we solve the ambiguity by exploiting the Lambertian reflectance maxima. These are points defined on curved surfaces where the normals are parallel to the light direction. Then, we propose a solution that can be computed in closed-form and thus very efficiently. Our algorithm is also very robust and always yields the same estimate regardless of the initial ambiguity. We validate our method on real world experiments and achieve state-of-the-art results. In this thesis we also solve for the first time the uncalibrated photometric stereo problem under the perspective projection model. We show that unlike in the orthographic case, one can uniquely reconstruct the normals of the object and the lights given only the input images and the camera calibration (focal length and image center). We also propose a very efficient algorithm which we validate on synthetic and real world experiments and show that the proposed technique is a generalization of the orthographic case. Finally, we investigate the uncalibrated photometric stereo problem in the case where the lights are distributed near the scene. In this case we propose an alternating minimization technique which converges quickly and overcomes the limitations of prior work that assumes distant illumination. We show experimentally that adopting a near-light model for real world scenes yields very accurate reconstructions.
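As background for the abstract above, the sketch below shows the classical calibrated case of photometric stereo for a single pixel: with three known, distant light directions and a Lambertian surface, each intensity is the dot product of the light direction with the albedo-scaled normal, so the normal follows from a 3x3 linear system. This is not the uncalibrated method of the thesis, where the lights are unknown; the function names and example values are hypothetical, and shadows and noise are ignored.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def solve3(a, b):
    """Solve the 3x3 linear system a x = b with Cramer's rule."""
    d = det3(a)
    if abs(d) < 1e-12:
        raise ValueError("light directions must be linearly independent")
    solution = []
    for col in range(3):
        replaced = [row[:col] + [b[i]] + row[col + 1:] for i, row in enumerate(a)]
        solution.append(det3(replaced) / d)
    return solution


def normal_and_albedo(lights, intensities):
    """Recover the unit normal and albedo of one Lambertian pixel."""
    g = solve3(lights, intensities)  # g = albedo * normal
    albedo = math.sqrt(sum(c * c for c in g))
    if albedo == 0.0:
        raise ValueError("pixel is completely dark; normal is undefined")
    return [c / albedo for c in g], albedo


# Three known unit light directions (one per image) and a synthetic pixel.
lights = [[0.0, 0.0, 1.0], [0.6, 0.0, 0.8], [0.0, 0.6, 0.8]]
true_normal, true_albedo = [0.0, 0.0, 1.0], 0.5
intensities = [
    true_albedo * sum(l * n for l, n in zip(light, true_normal)) for light in lights
]
normal, albedo = normal_and_albedo(lights, intensities)
print(normal, albedo)
```

When the light directions are unknown, any invertible 3x3 transformation applied to the lights can be compensated by transforming the scaled normals accordingly, which is the source of the ambiguity the thesis addresses.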