Research Topics

Self-supervised learning

Self-supervised learning is a machine learning paradigm in which features are learned without manual annotation. The main principle is to take the available data samples, split each sample into two parts, and learn to predict one part given the other as input. This principle allows a model to learn structure in the data. We have proposed a method that learns to solve puzzles: we split each image into a set of 9 tiles (input) and use the pixel coordinates of the center of each tile as the target (output). By learning to arrange the tiles in the correct order, the model learns to distinguish object parts and how these parts are typically arranged.
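A minimal sketch of such a puzzle-based pretext model is shown below (PyTorch, with a toy encoder whose layer sizes are placeholder assumptions, and a permutation-classification head standing in for the coordinate-regression output described above): a shared encoder embeds each tile, and a head predicts how the tiles were shuffled.

```python
import torch
import torch.nn as nn

class JigsawNet(nn.Module):
    """Sketch of a puzzle pretext task: embed 9 tiles with a shared encoder,
    concatenate, and predict how they were shuffled."""
    def __init__(self, n_permutations=100, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(                      # shared tile encoder (toy size)
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.head = nn.Linear(9 * feat_dim, n_permutations)

    def forward(self, tiles):                              # tiles: (batch, 9, 3, H, W)
        b = tiles.shape[0]
        feats = self.encoder(tiles.flatten(0, 1))          # (batch*9, feat_dim)
        return self.head(feats.reshape(b, -1))             # permutation logits

# Example: logits = JigsawNet()(torch.randn(4, 9, 3, 64, 64))
```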

Disentangling factors of variation

We assume that visual data can be described by a finite set of attributes, or factors, such as object identity, 3D shape, pose, viewpoint, and global illumination. Computer graphics rendering engines are an example of how these factors can be used to generate images. We are interested in the inverse process: obtaining these factors from an image. The collection of such factors forms a feature vector that can be used efficiently for object classification, detection, and segmentation. We explore completely unsupervised methods as well as partly supervised methods, where only some factors (e.g., the object category) are specified.
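As a toy illustration of the partly supervised setting (with a hypothetical PyTorch backbone and made-up factor names and sizes, not our actual architecture), the encoder output can be split into named chunks, with a supervised loss attached only to the factor that is specified:

```python
import torch
import torch.nn as nn

class FactoredEncoder(nn.Module):
    """Toy encoder whose output is split into per-factor chunks."""
    def __init__(self, dims=None):
        super().__init__()
        # Hypothetical factor names and sizes, purely for illustration
        self.dims = dims or {"category": 32, "pose": 16, "illumination": 8}
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, sum(self.dims.values())),
        )

    def forward(self, x):
        z = self.backbone(x)
        chunks = torch.split(z, list(self.dims.values()), dim=1)
        return dict(zip(self.dims.keys(), chunks))   # e.g. out["category"] gets a supervised loss
```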

Unsupervised learning

We are interested in building useful feature representations of images. In our approaches, a good representation is one that makes future learning easier. We currently use neural networks to solve tasks, and because of their compositional architecture, a feature is naturally identified with one of the many possible intermediate outputs of the trained model. The question is then: how do we build a feature that can be used as input to a weak classifier or regressor for other tasks? We have explored the use of feature constraints based on known image transformations.
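One way such a constraint can be written is sketched below (the invariance form only, with a horizontal flip standing in for the known transformation; the function and its arguments are illustrative, not our exact formulation):

```python
import torch
import torch.nn.functional as F

def transformation_constraint_loss(encoder, x, transform):
    """Penalize the difference between features of an image and of its
    transformed version, for a transformation with known parameters."""
    z = encoder(x)                 # features of the original images
    z_t = encoder(transform(x))    # features of the transformed images
    return F.mse_loss(z_t, z)      # invariance form of the constraint

# Example with a horizontal flip as the known transformation:
# loss = transformation_constraint_loss(model, images,
#                                       lambda im: torch.flip(im, dims=[-1]))
```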

Learning 3D and motion

We aim to reconstruct 3D objects and their motion from video sequences. The main challenges are dealing with multiple moving objects and separating their deformations from their rigid motion. We explore data-driven approaches, using neural networks as motion and 3D estimators.

Image deblurring

If either the camera or objects in the scene move during the exposure, images are degraded by an artifact known as motion blur. To remove this degradation, we consider explicit models of blur (shift-invariant, camera shake, non-uniform) and design energy-minimization or data-driven methods (e.g., via deep learning) to retrieve the latent sharp image. Our approaches introduce priors for sharp images and models of the blurry-image noise in an energy formulation. We then build novel iterative algorithms to solve the minimization task.
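A minimal sketch of the energy-minimization route is given below, for the simplest non-blind case: a known shift-invariant kernel and a quadratic smoothness prior standing in for our actual priors and noise models.

```python
import numpy as np
from scipy.signal import fftconvolve

def deblur_gd(g, k, lam=1e-3, lr=1.0, iters=200):
    """Gradient descent on ||k * u - g||^2 + lam * ||grad u||^2 (non-blind)."""
    u = g.copy()                      # initialize the sharp estimate with the blurry image
    k_flip = k[::-1, ::-1]            # flipped kernel implements the adjoint (correlation)
    for _ in range(iters):
        resid = fftconvolve(u, k, mode="same") - g
        data_grad = fftconvolve(resid, k_flip, mode="same")
        # Gradient of the quadratic smoothness prior (discrete Laplacian of u)
        lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0)
               + np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
        u = u - lr * (data_grad - lam * lap)
    return u
```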

Light field superresolution and blind deconvolution

In a single snapshot, light field cameras capture samples of both the position and the direction of light. When objects are Lambertian, we show how light field images can be used to recover scene surfaces and texture at approximately the sensor resolution, enabling extended depth of field and digital refocusing. We also show that light field cameras make it possible to remove motion blur effectively, even with general depth-varying scenes.
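For reference, digital refocusing itself can be illustrated with the classic shift-and-sum operation over sub-aperture views (a standard light field technique, not our super-resolution or deblurring method; the array layout and the alpha parameterization below are illustrative assumptions):

```python
import numpy as np
from scipy.ndimage import shift as nd_shift

def refocus(lf, alpha):
    """Shift-and-sum refocusing. lf: (U, V, H, W) sub-aperture views;
    alpha selects the synthetic focal plane."""
    U, V, H, W = lf.shape
    cu, cv = (U - 1) / 2.0, (V - 1) / 2.0
    out = np.zeros((H, W))
    for u in range(U):
        for v in range(V):
            # Shift each view proportionally to its offset from the central view
            dy, dx = alpha * (u - cu), alpha * (v - cv)
            out += nd_shift(lf[u, v].astype(np.float64), (dy, dx),
                            order=1, mode="nearest")
    return out / (U * V)
```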

Shape learning

Given a few views of an object, humans are surprisingly good at predicting what new views will look like, even when they see that object for the first time. Whether they do it implicitly or explicitly, they somehow know how to deal with the 3D shape of objects. Does this ability come from a lot of visual experience? Or does all the information lie in the few examples we have observed? We explore the latter hypothesis. Currently, we are focusing on the first image analysis step: how can one perform segmentation/grouping regardless of the object texture?

Coded photography

Out-of-focus blur increases with the distance from the focal plane, so by measuring the extent of the blur one can recover the depth of an object. This task is ambiguous, as one cannot distinguish between a sharp image of a blurry object and a blurry image of a sharp object. However, the ambiguity is reduced if the shape of the blur is not a disk. We explore methods for depth estimation from defocus information by testing different aperture masks and achieve state-of-the-art results.
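The geometric link between depth and blur extent can be sketched with the thin-lens model for a conventional circular aperture (the shape that coded masks then replace); the function below is a textbook relation, with illustrative names and consistent units assumed:

```python
def defocus_blur_diameter(f, aperture, d_focus, d_object):
    """Blur-circle diameter on the sensor for an object at d_object when the
    lens (focal length f, aperture diameter 'aperture') is focused at d_focus."""
    # Thin-lens equation: an object at distance d images at v = f*d / (d - f)
    v_focus = f * d_focus / (d_focus - f)       # sensor position (in-focus plane)
    v_object = f * d_object / (d_object - f)    # where the object actually focuses
    # Similar triangles give the blur-circle diameter at the sensor plane
    return aperture * abs(v_focus - v_object) / v_object
```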

Uncalibrated photometric stereo

Photometric stereo computes 3D geometry from images taken under different illumination conditions. When the illumination is unknown, the problem suffers from the so-called generalized bas-relief (GBR) ambiguity. Prior work resolves the ambiguity with heuristics that themselves depend on the ambiguity, and thus yield non-unique, inconsistent answers. We solve the problem by exploiting Lambertian reflectance maxima and achieve state-of-the-art results with the highest computational efficiency.
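For context, the calibrated Lambertian baseline (lights known) reduces to a per-pixel least-squares problem, sketched below; in the uncalibrated case, the factorization of the images into lights and pseudo-normals is only determined up to an invertible linear transform, whose integrability-preserving part is the GBR transformation. This sketch is the standard baseline, not our maxima-based method.

```python
import numpy as np

def photometric_stereo(I, L):
    """Calibrated Lambertian photometric stereo.
    I: (m, n_pixels) stack of m images, L: (m, 3) known light directions."""
    B, *_ = np.linalg.lstsq(L, I, rcond=None)   # pseudo-normals B = albedo * normal
    albedo = np.linalg.norm(B, axis=0)
    normals = B / np.maximum(albedo, 1e-8)      # unit surface normals per pixel
    return albedo, normals
```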

Subspace clustering

Principal component analysis is perhaps one of the most ubiquitous methods to analyze data and reduce its dimensionality. This method is designed for data that lives in a single subspace. When data lives in multiple subspaces, a different approach is needed. We model data in multiple subspaces via the self-expressive model A = AC, where A is the matrix whose columns are the data points and C is the matrix of coefficients of the linear combinations. Although the model becomes nonlinear when both A and C must be recovered, we show how to recover them exactly for a number of optimization formulations.
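A minimal sketch of the self-expressive pipeline is shown below, assuming clean data A, a Frobenius-norm penalty in place of the sparse or low-rank formulations that come with exact-recovery guarantees, and scikit-learn's spectral clustering for the final grouping:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def subspace_cluster(A, n_clusters, lam=1e-2):
    """Cluster the columns of A by their subspace membership."""
    n = A.shape[1]
    G = A.T @ A
    # Closed-form minimizer of ||A - AC||_F^2 + lam * ||C||_F^2
    C = np.linalg.solve(G + lam * np.eye(n), G)
    np.fill_diagonal(C, 0.0)                 # discourage trivial self-representation
    W = np.abs(C) + np.abs(C).T              # symmetric affinity between points
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(W)
    return labels, C
```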

Segmentation of dynamic textures

Natural events, such as smoke, flames, or water waves, exhibit motion patterns whose complexity can be captured by stochastic models known as dynamic textures. We show how stochastically homogeneous regions can be found in video sequences and automatically segmented with invariance to viewpoint.
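The underlying dynamic-texture model is typically a linear dynamical system; a minimal fitting sketch in the style of the standard SVD-based procedure is below (this identifies a single texture model and is not the segmentation method itself):

```python
import numpy as np

def fit_dynamic_texture(Y, n_states=10):
    """Fit a linear dynamical system y_t = C x_t, x_{t+1} = A x_t.
    Y: (n_pixels, n_frames) matrix of vectorized frames."""
    U, S, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n_states]                          # observation matrix (spatial appearance)
    X = np.diag(S[:n_states]) @ Vt[:n_states]    # hidden state trajectory over time
    # Least-squares state-transition matrix: x_{t+1} ~= A x_t
    M, *_ = np.linalg.lstsq(X[:, :-1].T, X[:, 1:].T, rcond=None)
    A = M.T
    return A, C, X
```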

Multiview stereo

Depth estimation from multiple views is known to be sensitive to occlusions and clutter. We propose a globally optimal solution that is robust to these challenges. We do so by studying how depth maps are integrated into a single 3D representation (a level set function).
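To illustrate what integrating depth maps into a level set function can mean in the simplest setting, here is a toy truncated signed-distance fusion for depth maps already registered to a single reference view (an illustrative simplification, not our globally optimal formulation):

```python
import numpy as np

def fuse_depth_maps(depth_maps, z_grid, trunc=0.05):
    """depth_maps: list of (H, W) maps in one reference view;
    z_grid: (D,) array of depth samples along the viewing axis."""
    H, W = depth_maps[0].shape
    volume = np.zeros((len(z_grid), H, W))
    for d in depth_maps:
        sdf = z_grid[:, None, None] - d[None]    # signed distance to the observed surface
        volume += np.clip(sdf, -trunc, trunc)    # truncation limits each map's influence
    return volume / len(depth_maps)              # zero level set = fused surface
```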

Real-time structure from motion and virtual object insertion

Given a monocular video sequence portraying a rigidly moving scene, one can recover both the camera motion and the 3D position of points or surfaces in the scene. We have demonstrated, at the leading computer vision conferences, the first real-time system based on Kalman filtering. The system incorporates a point tracker and can deal with outliers and changes in illumination.
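The recursive estimator at the core of such a system is the Kalman filter; a generic linear predict/update sketch is below (the actual structure-from-motion state, which couples camera pose with 3D point positions, and its nonlinear measurement model are considerably more involved):

```python
import numpy as np

class KalmanFilter:
    """Minimal linear Kalman filter: predict with the dynamics, update with a measurement."""
    def __init__(self, F, H, Q, R, x0, P0):
        self.F, self.H, self.Q, self.R = F, H, Q, R   # dynamics, measurement, noise covariances
        self.x, self.P = x0, P0                       # state estimate and its covariance

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x

    def update(self, z):
        y = z - self.H @ self.x                       # innovation
        S = self.H @ self.P @ self.H.T + self.R       # innovation covariance
        K = self.P @ self.H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(len(self.x)) - K @ self.H) @ self.P
        return self.x
```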

Retinal imaging

The 3D shape of the optic nerve is used to diagnose glaucoma from fundus images. We show the difference between images captured by a monocular and by a stereo fundus camera, and characterize when one can truly recover 3D structure from these images.