The objective of SEED is to extract a description that identifies the objects contained in a video, their precise boundaries and spatial layout, and the manner in which those objects move, interact and change over time, based on weakly supervised large-scale machine learning techniques. 
The goal of SEED is to fundamentally advance the methodology of computer vision by exploiting a dynamic analysis perspective in order to acquire accurate yet tractable models that can automatically learn to sense our visual world. Such models localize still and animate objects (e.g. chairs, phones, computers, bicycles or cars, people and animals), actions and interactions, as well as qualitative geometrical and physical scene properties, by propagating and consolidating temporal information, with minimal system training and supervision. SEED will extract descriptions that identify the precise boundaries and spatial layout of the different scene components, and the manner in which they move, interact, and change over time. For this purpose, SEED will develop novel high-order compositional methodologies for the semantic segmentation of video data acquired by observers of dynamic scenes, by adaptively integrating figure-ground reasoning based on bottom-up and top-down information, and by using weakly supervised machine learning techniques that support continuous learning towards an open-ended number of visual categories.
The methodology emerging from this research has the potential to impact fields as diverse as automatic personal assistance, video editing and indexing, robotics, environmental awareness, augmented reality, human-computer interaction, and manufacturing.
The achievements attained during this reporting period follow the general project planning, and are as follows:
-- Semantic video segmentation. We developed models based on convolutional architectures and a spatial transformer recurrent layer that temporally propagate labeling information by means of optical flow, adaptively gated based on its locally estimated uncertainty. The flow, recognition and gated propagation modules can be trained jointly, end-to-end. The gated recurrent flow propagation component of our model can be plugged into any static semantic segmentation architecture, turning it into a weakly supervised video processing one.
-- Active visual search. One of the most widely used strategies for visual object detection is exhaustive spatial hypothesis search. While methods like sliding windows have been successful and effective for many years, they remain brute-force, independent of both the image content and the visual category being searched. In this line of work we have developed principled sequential models that accumulate evidence collected at a small set of image locations in order to detect visual objects effectively. By formulating sequential search for visual object categories as deep reinforcement learning of the search policy (including the stopping condition) and the detector response function, our fully trainable model can explicitly balance, for each class specifically, the conflicting goals of exploration (sampling more image regions for better accuracy) and exploitation (stopping the search efficiently when sufficiently confident about the target's location). The methodology is general and applicable to any detector response function.
-- Dynamic structured models for the detection, recognition (semantic segmentation) and 3d reconstruction of humans based on a multi-task architecture. We proposed a deep multi-task architecture for fully automatic 2d and 3d human sensing (DMHS), including recognition and reconstruction, in monocular images. The system computes the figure-ground segmentation, semantically identifies the human body parts at pixel level, and estimates the 2d and 3d pose of the person. The model supports the joint training of all components by means of multi-task losses, where early processing stages recursively feed into advanced ones for increasingly complex calculations, accuracy and robustness. The design allows us to tie together a complete training protocol by taking advantage of multiple datasets that would otherwise each cover only some of the model components: complex 2d image data with no body part labeling and without associated 3d ground truth, or complex 3d data with limited 2d background variability.
-- Large-scale weakly supervised kernel methods based on Fourier approximation. We developed methodologies that allow, for the first time, the application of non-linear, kernel-based semi-supervised learning methods (so far limited to datasets of only thousands of examples) to large-scale data repositories of millions of datapoints.
-- Matrix back-propagation for training deep networks with structured layers, with applications to image segmentation, higher-order pooling and learning for graph matching. We have recently developed the methodology of matrix back-propagation, which allows the construction and propagation of gradients, in a reverse-mode automatic differentiation framework, through complex (structured) deep processing layers like singular value decomposition and eigen-decomposition. Such calculations would allow the end-to-end training of complex models like deep normalized cuts, deep higher-order pooling or deep graph matching. The deep learning
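As an illustration of the Fourier approximation idea mentioned in the kernel methods item above, the following is a minimal sketch of random Fourier features for a Gaussian (RBF) kernel. This is one standard form of Fourier kernel approximation and is assumed here for illustration only; the report does not specify which approximation the project uses. All names (make_rff_map, rbf_kernel, dot) and parameter values are hypothetical. The point of the sketch is that a non-linear kernel evaluation is replaced by an inner product of explicit finite-dimensional features, so that linear methods, whose cost grows linearly with the number of datapoints, can be applied.

```python
import math
import random


def make_rff_map(dim, n_features, gamma, seed=0):
    """Build a random Fourier feature map z(x) such that z(x).z(y)
    approximates the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2).
    Illustrative sketch only; names and defaults are assumptions."""
    rng = random.Random(seed)
    # Frequencies are sampled from the kernel's spectral density,
    # which for this kernel is a Gaussian with variance 2 * gamma.
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)]
               for _ in range(n_features)]
    # One random phase per feature, uniform over a full period.
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def feature_map(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return feature_map


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel, used only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def dot(u, v):
    """Plain inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    gamma = 0.5
    phi = make_rff_map(dim=3, n_features=2000, gamma=gamma, seed=0)
    x = [0.1, -0.4, 0.7]
    y = [0.3, 0.2, -0.1]
    print("exact kernel value :", rbf_kernel(x, y, gamma))
    print("feature-map approx.:", dot(phi(x), phi(y)))
```

The approximation error shrinks roughly with the inverse square root of the number of features, so the feature count trades accuracy against memory and compute.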
Progress beyond the state of the art has been achieved in the following areas:
-- Weakly supervised semantic video segmentation, by Nilsson and Sminchisescu, CVPR 2018
-- Deep reinforcement learning of region proposal networks for object detection, by Pirinen and Sminchisescu, CVPR 2018
-- Deep learning of graph matching, by Zanfir and Sminchisescu, CVPR 2018
-- Appearance transfer, by Zanfir, Popa and Sminchisescu, CVPR 2018
-- 3d human pose reconstruction of multiple people in monocular images and video, CVPR 2018
More info: http://www.maths.lth.se/sminchisescu/.