The goal of the project is to automatically analyze human activities observed in videos. A solution to this problem would enable novel applications: it could be used to create short videos that summarize daily activities in order to support patients suffering from Alzheimer's disease, or for education, e.g., by providing a trainee in a hospital with a video analysis that shows whether the tasks have been executed correctly. The analysis of complex activities in videos, however, is very challenging: activities vary in temporal duration between minutes and hours, involve interactions with several objects that change their appearance and shape, e.g., food during cooking, and are composed of many sub-activities, which can happen at the same time or in various orders.
While the majority of recent works in action recognition focuses on developing better feature encoding techniques for classifying sub-activities in short video clips of a few seconds, this project moves forward and aims to develop a higher-level representation of complex activities to overcome the limitations of current approaches. This includes the handling of large temporal variations and the ability to recognize and locate complex activities in videos. To this end, we aim to develop a unified model that provides detailed information about the activities and sub-activities in terms of time and spatial location, as well as the involved pose motion, the objects, and their transformations.
We developed a hierarchical model that represents complex activities at different granularities. At the top level, complex activities like "preparing pancakes" or "preparing a fruit salad" are modelled. These complex activities consist of several sub-activities that need to be executed, like "take egg", "crack egg", or "stir dough". These sub-activities form the intermediate representation of the hierarchy. At the lowest level, fine-grained activities or motion primitives are modelled; for instance, cracking an egg involves a sequence of human movements. The hierarchical model processes continuous video streams and predicts for each frame which sub-activity is being executed as well as the overall complex activity.
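To make the structure of the hierarchy concrete, the following sketch shows how a per-frame prediction interface for such a model could look. The activity names, the per-frame scores, and the simple voting rule for the top level are illustrative placeholders, not the actual model developed in the project:

```python
# Minimal sketch of the two upper levels of the hierarchy and a
# frame-wise prediction interface. All names and rules are illustrative.
from dataclasses import dataclass
from typing import Dict, List

# Top level: complex activities; intermediate level: their sub-activities.
HIERARCHY = {
    "preparing pancakes": ["take egg", "crack egg", "stir dough", "fry pancake"],
    "preparing a fruit salad": ["take fruit", "peel fruit", "cut fruit", "mix fruit"],
}

@dataclass
class FramePrediction:
    frame_index: int
    sub_activity: str      # intermediate level of the hierarchy
    complex_activity: str  # top level, inferred from the sub-activity sequence

def predict_video(frame_scores: List[Dict[str, float]]) -> List[FramePrediction]:
    """frame_scores[t] maps each sub-activity to a score for frame t."""
    # 1) Frame-wise sub-activity decision (in practice, temporal models
    #    smooth these decisions instead of picking the maximum per frame).
    sub_seq = [max(scores, key=scores.get) for scores in frame_scores]
    # 2) Top-level decision: pick the complex activity whose sub-activity
    #    vocabulary explains most of the observed sequence.
    def coverage(activity: str) -> int:
        vocab = set(HIERARCHY[activity])
        return sum(s in vocab for s in sub_seq)
    complex_act = max(HIERARCHY, key=coverage)
    return [FramePrediction(t, s, complex_act) for t, s in enumerate(sub_seq)]
```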
In order to learn the parameters of the model, annotated videos are required. The developed model has the advantage that it can be trained in two ways. In the first setting, we assume that the videos have been annotated in the same way as the model is expected to analyze them, i.e., the ongoing sub-activity is annotated for each frame. This setting is known as learning with full supervision. Providing such a frame-wise labeling of videos, however, requires enormous effort and can be too expensive for practical applications. We therefore developed a learning procedure that allows the model to be learned with less supervision: instead of annotating each frame, only the sequence of activities occurring in the video is annotated. Such annotations are commonly available in the form of transcripts, subtitles, or protocols. While learning with weak supervision does not yet achieve the same accuracy as learning with full supervision, it scales better with the amount of training data since it massively reduces the annotation cost.
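The sketch below illustrates one common way such weak supervision can be exploited (the concrete functions are a simplified illustration, not the project's exact training procedure): frame-wise pseudo-labels are derived by aligning the ordered transcript to the video, starting from a uniform split and later refining the alignment with the scores of the current classifier, which is then retrained on the pseudo-labels:

```python
# Sketch of transcript-to-video alignment for weakly supervised learning.
# Assumes at least as many frames as transcript entries.
import numpy as np

def uniform_alignment(num_frames: int, transcript: list) -> list:
    """Initial pseudo-labels: split the video evenly among the transcript."""
    bounds = np.linspace(0, num_frames, len(transcript) + 1).astype(int)
    labels = []
    for seg, a, b in zip(transcript, bounds[:-1], bounds[1:]):
        labels.extend([seg] * (b - a))
    return labels

def transcript_alignment(log_probs: np.ndarray, transcript: list,
                         classes: list) -> list:
    """Best monotonic assignment of frames to the transcript, found by
    dynamic programming. log_probs: (T, C) frame-wise log-probabilities."""
    T, S = log_probs.shape[0], len(transcript)
    idx = [classes.index(c) for c in transcript]
    dp = np.full((T, S), -np.inf)
    move = np.zeros((T, S), dtype=int)
    dp[0, 0] = log_probs[0, idx[0]]
    for t in range(1, T):
        for s in range(min(t + 1, S)):
            stay = dp[t - 1, s]                       # remain in segment s
            adv = dp[t - 1, s - 1] if s > 0 else -np.inf  # advance to s
            move[t, s] = int(adv > stay)
            dp[t, s] = max(stay, adv) + log_probs[t, idx[s]]
    # Backtrack from the last frame, which must lie in the last segment.
    s, labels = S - 1, []
    for t in range(T - 1, -1, -1):
        labels.append(transcript[s])
        s -= move[t, s]
    return labels[::-1]
```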
An important aspect of activities is the motion of the involved humans. During the project, the accuracy of human pose estimation, i.e., the estimation of the positions of the body joints in an image, has been greatly improved. Starting with estimating the pose of a single person, we moved towards estimating the poses of multiple persons who might occlude each other in unconstrained videos. We created a large-scale benchmark as well as an approach that jointly models multi-person pose estimation and tracking in a single formulation. Given the tracked humans, we are also able to localize the activities in the video.
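As a simplified illustration of the tracking part, the sketch below links per-frame pose estimates into tracks by bipartite matching between consecutive frames. The project's actual approach solves pose estimation and tracking jointly in a single formulation, so this greedy frame-to-frame variant, including the distance threshold, only conveys the basic data flow:

```python
# Illustrative sketch: link per-frame pose estimates into tracks via
# bipartite matching (Hungarian algorithm) between consecutive frames.
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_poses(frames, max_dist=50.0):
    """frames: list of (P_t, J, 2) arrays with the J joint positions of
    the P_t persons detected in frame t. Returns a track id per pose."""
    next_id, prev_poses, prev_ids, all_ids = 0, [], [], []
    for poses in frames:
        ids = [-1] * len(poses)
        if prev_ids and len(poses):
            # Matching cost: mean joint distance between two poses.
            cost = np.array([[np.linalg.norm(p - q, axis=-1).mean()
                              for q in poses] for p in prev_poses])
            for r, c in zip(*linear_sum_assignment(cost)):
                if cost[r, c] < max_dist:   # discard implausible matches
                    ids[c] = prev_ids[r]
        for i, pid in enumerate(ids):
            if pid == -1:                   # unmatched poses start new tracks
                ids[i], next_id = next_id, next_id + 1
        all_ids.append(ids)
        prev_poses, prev_ids = list(poses), ids
    return all_ids
```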
We also addressed the problem of how a model that is trained on one domain can be adapted to recognize activities in another domain. For instance, we might have a model trained on videos from Youtube, but we want it to recognize activities in a video captured by a camera mounted on a service robot. Since the videos the model has been trained on and the videos the model has to analyze look different, the model needs to be adapted to handle the differences. This problem is also called domain adaptation.
The developed models substantially improved the state-of-the-art for temporally localizing activities in videos. When trained with full supervision, the models already achieve an accuracy that is sufficient for many applications. While there is still a gap in accuracy between models trained with full supervision and models trained with weak supervision, we were able to improve the accuracy of weakly supervised approaches by a factor of two.
While multi-person tracking and human pose estimation had so far been considered independent research topics, we proposed the first approach that solves both tasks together. The corresponding PoseTrack dataset, which comprises about 250,000 annotated human poses in 1,000 videos, also makes it possible for the first time to train and evaluate such models. The PoseTrack dataset has been downloaded over 10,000 times, and about 430 users registered for submitting their results to the evaluation server.
We furthermore introduced the concept of open sets to the domain adaptation problem, i.e., adapting a model trained on a source domain to another target domain. Previous methods for domain adaptation assumed that the sets of categories of the source and target domain are closed, i.e., the images of both domains contain only instances of the same set of categories. The assumption that the target domain contains only images of the categories of the source domain is, however, unrealistic: for most applications, the target domain contains many images, and only a small portion of them belongs to the classes of interest. Open set domain adaptation does not make this unrealistic assumption, and we developed a generic domain adaptation method that can be applied to closed as well as open sets.
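The sketch below conveys the open set idea in a simplified form: target samples that do not match any source class well are rejected as "unknown" instead of being forced into a known category. The nearest-centroid classifier and the fixed rejection threshold are illustrative choices, not the assignment-based method developed in the project:

```python
# Simplified illustration of open set label assignment: every target
# sample receives a source class or -1 ("unknown"). Assumes integer
# class labels and feature vectors of equal dimension in both domains.
import numpy as np

def open_set_assign(source_feats, source_labels, target_feats, reject_dist):
    classes = np.unique(source_labels)
    # One centroid per known source class.
    centroids = np.stack([source_feats[source_labels == c].mean(axis=0)
                          for c in classes])
    # Distances of all target samples to all class centroids.
    dists = np.linalg.norm(target_feats[:, None] - centroids[None], axis=-1)
    labels = classes[dists.argmin(axis=1)]
    labels[dists.min(axis=1) > reject_dist] = -1  # too far from every class
    return labels
```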
More info: http://pages.iai.uni-bonn.de/gall_juergen/.