
Periodic Reporting for period 2 - RSM (Rich, Structured Models for Scene Recovery, Understanding and Interaction)

Summary

** What is the problem/issue being addressed?
Computer vision has gained considerable momentum in recent years, both in industry and academia. There is a widespread feeling that the time is ripe to realize grand goals and to bring computer vision from the lab into real life. But is a vision system already as good as a human? The answer is: unfortunately, not yet. Given a single image, a child can describe the objects and their relationships in far more detail than any computer can. Humans can also quite effortlessly “visually extract” an object from its background, even in the presence of fine details such as hair; computers cannot yet achieve this automatically. Yet many real-world applications absolutely require such levels of rich output, accuracy, quality, robustness, and system autonomy.

In this project we try to get closer to this overarching goal. We believe that the key to success is a richer representation. Here “rich” stands for rich, detailed output; modelling rich physical and semantic constraints; and learning rich statistical relations between different aspects of a scene. Towards this end we propose the Rich Scene Model (RSM): one joint statistical, structured model of many physical and semantic scene aspects that can take full advantage of the synergy between all its components. This effort goes beyond previous attempts in many respects. However, it is easy to say “we will build the best ever joint, rich scene model”; the crux of the project is therefore to design novel models, learning methods, and inference techniques that make the RSM a reality. The project addresses not only theoretical questions such as “What can we infer from a few images of a dynamically changing 3D scene?” and “Is our RSM rich enough to make statistical learning work better than deterministic learning?”, but also proposes a model that can give new forms of output, deal better with challenging real-world scenarios, and adapt well to human and application needs.

** Why is it important for society?

Artificial intelligence, and in particular computer vision, is a fast-growing field with a huge impact on society, in areas such as autonomous driving, robotics, communication, and health. In this work we address core computer vision questions, and our research has a direct impact on autonomous driving, robotics, augmented reality, and related fields. Since the project has drifted towards research areas of the natural sciences, we are also beginning to have an impact in bio-imaging, medicine and, recently, astrophysics.

** What are the overall objectives?

The goal of this work is to develop the “Rich Scene Model (RSM)”, which treats a large number of scene aspects jointly in a statistical fashion. In particular:
1. To find new ways to combine hand-crafted and learned potentials. In this way, physics-based vision can benefit from the latest progress in data-driven learning, and semantic-based vision can in turn benefit from physics, such as “stability” and the “rendering equation”.
2. To perform (approximate) inference for heterogeneous scene aspects, such as continuous and discrete label-states, as well as for new, higher-order constraints and global variables.
3. To explore a novel, and highly original representation, which we denote as the “implicit RSM”.
4. To find new trade-offs between good modelling and feasible inference.
5. To push the limits of 3D-model-based vision for scene understanding and recovery, by exploiting the latest trends in graphics on low-dimensional 3D shape manifolds [3].
6. To realize novel, application/human-driven loss functions, e.g. perceptual loss, user-interaction loss, and detection loss. Such loss functions are highly relevant in practice when putting the RSM into action.
7. Finally, to launch the synergy challenge with external partners. Here the implicit RSM plays an important role, since it lets researchers easily explore the synergy effects between the different scene aspects.

Work performed

In the first half of the ERC grant we performed work in work packages W0, W1, W2.1, W2.2, and W3.3. In the remaining time we will also conduct work in work packages W2.3 and W3.1. Work package W3.2 is currently not planned to be executed; this was foreseen in the Risk Management plan of the RSM-ERC proposal. This work package is not needed for the final success of the project, and we would rather spend more time on the other work packages.
The work performed in each of the work packages (W0, W1, W2.1, W2.2, W3.3), together with the main results, is listed below.

** W0. Ground Truth Data collection
We conducted all aspects of this work package. Firstly, we collected a new dataset that combines semantic properties (instance segmentation) with physical properties (depth). We also worked on generating synthetic data, in the context of augmenting real scenes with synthetic vehicles; with this we were able to improve the state of the art in instance segmentation by a considerable margin. Finally, the ERC team decided not to run the synergy challenge, but instead to join forces with many other groups, from MPI Tübingen, TU Munich, ETH Zürich, etc., in conducting a robust vision challenge (ROB 2018). This is highly relevant since many methods over-fit to specific datasets. The ROB challenge ran for the first time at CVPR 2018.

** W1. Theoretical Foundations for the Rich Scene Model (RSM)
The proposal was written at the beginning of 2014, when deep learning did not yet play a major role in computer vision; hence deep learning was not mentioned in the proposal. Since about 2015/16, deep learning has dominated computer vision. This unexpected change in methodology affected basically all sub-work-packages of W1. We still continued some work on graphical models, but began to focus more on deep-learning-based techniques. In brief, the following work was conducted in W1. As described in WP 1.5, we aim at computing not only one solution of a graphical model, but many good and diverse solutions. We had the idea of formulating this problem differently from all existing methods, which led to a series of papers at top machine learning (NIPS) and computer vision (ICCV) conferences. With the rise of deep learning, we looked at the combination of deep learning and graphical models, which is an essential building block for practical applications, in work package W2. We also examined the relationship between deep neural networks and random forests, two prominent and important methodologies in our field. Finally, we continued to push the state of the art in inference for challenging graphical models.

** W2. Building the RSM in parts - physical and semantic aspects

** W2.1. Joint motion estimation, image enhancement, and object segmentation
In a first work we looked at the question of how to correct motion blur in conjunction with motion estimation. We conducted this work for stereo video input, which had not been done before, and achieved high-quality results, published at ECCV. We were also able to push the state of the art in the highly competitive field of image enhancement (in particular non-blind image deconvolution). This work also showed a new way to combine hand-crafted and learned potentials; a sketch of the general idea follows below.
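
To make the combination of hand-crafted and learned potentials concrete, here is a minimal, illustrative sketch (not the published method) of a common half-quadratic-splitting scheme for non-blind deconvolution: the data step uses the known blur kernel in closed form (the physics), while the prior step would, in practice, call a trained denoising network; a simple Gaussian filter stands in for it here, and all parameter values are illustrative.

import numpy as np
from scipy.ndimage import gaussian_filter

def data_step(y_fft, k_fft, z, mu):
    # Hand-crafted physics: closed-form minimizer of
    # ||k*x - y||^2 + mu*||x - z||^2, computed in the Fourier domain.
    num = np.conj(k_fft) * y_fft + mu * np.fft.fft2(z)
    return np.real(np.fft.ifft2(num / (np.abs(k_fft) ** 2 + mu)))

def prior_step(x):
    # Learned potential: a trained CNN denoiser would be called here;
    # a Gaussian filter is a crude stand-in for this sketch.
    return gaussian_filter(x, sigma=1.0)

def deconvolve(y, k, iters=10, mu=0.05):
    # Half-quadratic splitting: alternate the physics-based data step
    # with the (learned) prior step.
    k_pad = np.zeros_like(y)
    k_pad[:k.shape[0], :k.shape[1]] = k
    shift = (-(k.shape[0] // 2), -(k.shape[1] // 2))
    k_fft = np.fft.fft2(np.roll(k_pad, shift, axis=(0, 1)))
    y_fft = np.fft.fft2(y)
    x = y.copy()
    for _ in range(iters):
        x = data_step(y_fft, k_fft, prior_step(x), mu)
    return x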

** W2.3. Joint motion, objects, semantics, attributes, and model-based scene completion
We conducted two works which address core questions of the ERC project: what is the synergy effect between different tasks, how big is it, and how can we measure it quantitatively? The answers we give in the related articles are not as simple as I had hoped. For instance, the synergy effect does not always help (see the ICRA paper). Also, for scene flow estimation we have seen that bounding box information (for object instances) is as useful as a more precise, but potentially slightly wrong, pixel-wise segmentation. In other works we focused on pushing the state of the art further.

Final results

All of our peer-reviewed publications associated with the ERC contain results and methodological advances that were novel and beyond the state of the art at the time of publication. In my view, the five most significant pieces of work are the following:

1. The idea of incorporating a sampling-based technique into a neural network. We did this in the context of integrating RANSAC, one of the most widely used algorithms in our field (dating from 1981), into a neural network. Technically, it involves minimizing the expected loss of the network. The associated publication:
E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, C. Rother, “DSAC – Differentiable RANSAC for Camera Localization”, CVPR 2017 (oral).
was nominated for the best student paper award at CVPR 2017, one of the primary computer vision conferences, with around 3000 submissions.
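
The core trick can be shown in a few lines: replace RANSAC’s hard argmax selection of the best hypothesis by a softmax distribution over hypothesis scores, and minimize the expected task loss, so that gradients flow back into the network that produced the scores. The PyTorch toy below (all numbers are made up; in the actual system a CNN scores camera-pose hypotheses) illustrates this soft-selection variant.

import torch

def expected_loss(scores, losses):
    # Soft hypothesis selection: a softmax over hypothesis scores
    # replaces RANSAC's hard argmax, making the expected task loss
    # differentiable with respect to the scores.
    return (torch.softmax(scores, dim=0) * losses).sum()

# Toy numbers: three sampled hypotheses, their (hypothetical) scores,
# and their task losses with respect to the ground truth.
scores = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)
losses = torch.tensor([0.1, 0.7, 1.5])
loss = expected_loss(scores, losses)
loss.backward()  # gradients would flow into the scoring network
print(loss.item(), scores.grad)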

2. Training data generation is becoming a fundamental task in computer vision. In this work we explored a new avenue for training data generation: we take existing footage and combine it with virtual models. In this way we were able to improve a state-of-the-art method for instance segmentation by about 8%. A toy sketch of the compositing idea follows the reference.
H. Abu Alhaija, S.K. Mustikovela, L. Mescheder, A. Geiger, C. Rother, “Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes”, IJCV 2018.
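
As a toy illustration of the augmentation step (the published pipeline additionally handles camera pose, lighting, and post-processing, all of which are omitted here), compositing a rendered vehicle into a real photograph boils down to alpha blending; conveniently, the alpha mask doubles as a pixel-accurate instance label obtained for free. All names and values are illustrative.

import numpy as np

def composite(real_img, render_rgb, render_alpha, top, left):
    # Alpha-blend a rendered object (RGB plus alpha mask) into a real
    # photograph at the given location. The alpha mask also serves as
    # a free instance-segmentation label for the pasted object.
    out = real_img.astype(np.float32).copy()
    h, w = render_rgb.shape[:2]
    a = render_alpha[..., None].astype(np.float32)  # values in [0, 1]
    patch = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = a * render_rgb + (1.0 - a) * patch
    return out.astype(np.uint8)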

3. In this line of work we address the problem of finding multiple, different solutions of an energy function (in terms of a structured model) that all have low energy; technically, this is the problem of finding the M-best-diverse solutions of a graphical model. This is interesting for many applications, even bio-imaging (such as cell tracking). One article from this line of work is the following (a toy illustration of the objective is given after the reference):
A. Kirillov, A. Shekhovtsov, C. Rother, B. Savchynskyy, “Joint M-Best-Diverse Labelings as a Parametric Submodular Minimization”, NIPS 2016.


4. Our work on relating auto-context decision forests to deep neural networks is interesting since it connects two important areas of research. This work was awarded the best science paper award at BMVC 2016. The work also highlights my collaboration with researchers from the life sciences. (A toy version of the tree-to-network construction is sketched after the reference.)
D. L. Richmond, D. Kainmueller, M. Y. Yang, E. W. Myers, C. Rother, “Mapping Auto-context Decision Forests to Deep ConvNets for Semantic Segmentation”, BMVC 2016.
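
The flavor of such a mapping can be seen in a toy construction (the paper's actual mapping for auto-context forests and ConvNets is more involved; all features, thresholds, and distributions below are made up): each split node of a decision tree becomes a thresholding neuron, each leaf becomes a neuron that fires exactly when its path of split decisions is taken, and the output layer stores the leaf class distributions.

import numpy as np

def step(v):
    # Hard-threshold "activation": +1 if the split test passes, else -1.
    return np.where(v > 0, 1.0, -1.0)

def tree_as_network(x):
    # A depth-2 decision tree rewritten as a 3-layer network with
    # fixed weights.
    # Layer 1: one neuron per internal node (feature index, threshold).
    splits = [(0, 0.5), (1, 0.2), (2, -0.1)]
    h1 = step(np.array([x[f] - t for f, t in splits]))

    # Layer 2: one neuron per leaf; each row picks out the two split
    # decisions on the leaf's root-to-leaf path, and the bias of -1.5
    # makes the neuron fire only if both decisions match.
    W2 = np.array([[-1, -1, 0],   # root left,  then left
                   [-1, 1, 0],    # root left,  then right
                   [1, 0, -1],    # root right, then left
                   [1, 0, 1]],    # root right, then right
                  dtype=float)
    h2 = (W2 @ h1 - 1.5 > 0).astype(float)  # exactly one leaf fires

    # Layer 3: each leaf stores a class distribution (two classes here).
    leaf_dists = np.array([[0.9, 0.1], [0.6, 0.4],
                           [0.3, 0.7], [0.05, 0.95]])
    return h2 @ leaf_dists

print(tree_as_network(np.array([0.8, 0.0, 0.3])))  # -> [0.05, 0.95]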

5. Our work on combining object instance recognition and scene flow is at the core of the ERC task of combining physical and semantic information. In this work we explored various levels of integration and concluded that a mid-level integration, in the form of bounding box detection, works best overall. With this work we were able to outperform all state-of-the-art methods for scene flow estimation (KITTI 2015 challenge).
A. Behl, O. Hosseini Jafari, S. K. Mustikovela, H. Abu Alhaija, C. Rother, A. Geiger, “Bounding Boxes, Segmentations and Object Coordinates: How Important is Recognition for 3D Scene Flow Estimation in Autonomous Driving Scenarios?”, ICCV 2017.

** Expected results until the end of the project

We expect to reach nearly all of our goals; in particular, we will reach our main objective of building an RSM. Given the rapid methodological advances in our field, in particular in deep learning, the solution will look different from what was anticipated at the start of the project. With respect to the detailed objectives (1-7, see above), we will reach goals 1, 2, 4, and 5. We will quite likely not execute objective 3, i.e. building the “implicit RSM”, since other tasks demand more time; this was foreseen in the Risk Management plan of the RSM-ERC proposal. Objective 6 (human-driven loss functions) is interesting; however, due to the move from Dresden to Heidelberg we are no longer in close contact with partners from human-computer interaction. On the other hand, we now collaborate more with partners from the natural sciences, e.g. biology and astrophysics.

Website & more info

More info: https://hci.iwr.uni-heidelberg.de/vislearn/.