The objective of this project is to investigate scalability questions arising with a new wave of smart relational data management systems that integrate analytics and query processing. To answer these questions, we take a first-principles approach to machine learning over relational databases that exploits and advances recent developments in database systems and theory. These developments are centred on so-called factorization techniques, which reduce the amount of redundancy in the relational data representation and in computation over such data. The project also investigates the interaction of factorization with distribution, approximation, and the maintenance of processing results under data updates.
Providing scalable solutions for analytics over relational databases is a defining challenge of our time. While this has always been an important topic of research and commercial interest, three major trends are driving its current widespread recognition and adoption. It is much cheaper to generate data, due to inexpensive sensors, smart devices, social software, multiplayer games, and the Internet of Things, which connects homes, cars, appliances, and other devices. It is much cheaper to process data, due to advances in multicore CPUs, inexpensive cloud computing, and open source software. Finally, our society has become increasingly computational, with many categories of people intimately involved in generating, processing, and consuming data: decision makers, domain scientists, application users, journalists, crowd workers, and everyday consumers. This wide adoption comes with pressing demands for scalable large-scale data management from businesses, governments, academic disciplines, engineering, communities, and individuals.
Ad-hoc mainstream solutions for analytics over relational databases currently represent a highly lucrative business. The input to learning classification and regression models is defined by feature extraction queries over relational databases. The mainstream approach is to materialize the training dataset defined by such a query, export it out of the database, and then learn over it using statistical software packages. These three steps are expensive and unnecessary. Instead, one can cast the machine learning problem as a database problem by decomposing the learning task into a batch of aggregates over the feature extraction query and computing this batch directly over the input database. Using factorization techniques, the relational learning problem can be solved with lower computational complexity than even the materialization of the training dataset used by the mainstream solutions. This is because factorization techniques benefit tremendously from structural properties of the relational data and of the feature extraction query; such properties may be algebraic (semiring), combinatorial (hypertree width), or statistical (sampling).
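To make the aggregate-batch idea concrete, here is a minimal sketch (with hypothetical toy relations R and S, not from the project's benchmarks) of how one statistic needed by least-squares regression, the sum of products of two features over the join result, decomposes into per-relation partial aggregates so that the join is never materialized:

```python
from collections import defaultdict

# Toy relations R(key, x1) and S(key, x2); the training set is R JOIN S on key.
R = [(1, 2.0), (1, 3.0), (2, 1.0)]
S = [(1, 5.0), (1, 7.0), (2, 4.0)]

# Mainstream approach: materialize the join, then aggregate over it.
joined = [(x1, x2) for (k1, x1) in R for (k2, x2) in S if k1 == k2]
naive = sum(x1 * x2 for x1, x2 in joined)

# Factorized approach: push partial sums into each relation and combine
# them per join key; SUM(x1 * x2) distributes over the join thanks to
# the semiring structure of sum and product.
sum_x1, sum_x2 = defaultdict(float), defaultdict(float)
for k, x1 in R:
    sum_x1[k] += x1
for k, x2 in S:
    sum_x2[k] += x2
pushed = sum(sum_x1[k] * sum_x2[k] for k in sum_x1.keys() & sum_x2.keys())

assert naive == pushed == 64.0
```

The same pushdown applies to every entry of the matrix of feature co-occurrence sums that least-squares regression needs, which is why a batch of such aggregates can replace the materialized training dataset.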
So far, we have investigated foundational aspects of factorized data management, with contributions on database covers and factorization synthesis. Covers are lossless, succinct representations of query results. We investigate covers of worst-case optimal size for the results of join queries, and also covers for functional aggregate queries, which express computational problems such as aggregate-join queries, matrix chain multiplication, and inference in probabilistic graphical models. Factorization synthesis is the reverse of query evaluation: given a relation, we would like to express it as a query over much smaller input relations. We investigate the problem of factorizing relations defined by conjunctive queries with negated sparse relations into unions of products of unary relations.
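As a small illustration of the redundancy that factorization removes, the sketch below (hypothetical toy relations) contrasts the flat result of a two-way join with a factorized representation that is a union, over join values, of products of unary relations, i.e. the shape targeted by the synthesis results above:

```python
# Flat result of R(A, B) JOIN S(B, C): each A-value is repeated once per
# matching C-value and vice versa.
R = [("a1", "b1"), ("a2", "b1")]
S = [("b1", "c1"), ("b1", "c2"), ("b1", "c3")]

flat = [(a, b, c) for (a, b) in R for (b2, c) in S if b == b2]
assert len(flat) == 6  # 6 tuples, 18 stored values

# Factorized representation: for each B-value, a product of two unary
# relations, e.g. {a1, a2} x {b1} x {c1, c2, c3}.
factorized = {
    b: ({a for (a, b2) in R if b2 == b}, {c for (b2, c) in S if b2 == b})
    for b in {b for (_, b) in R}
}
# Stores 2 + 1 + 3 = 6 values for b1; the size stays linear in the
# input, while the flat result can grow quadratically.
```

In general, for two relations of size N joined on one shared attribute, the flat result can reach Θ(N²) tuples while this factorized form stays in O(N).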
Our key contributions are on factorized relational learning and dynamic factorized relational processing. We establish a number of fundamental results in this space. We pinpoint the computational complexity of learning a variety of models over relational data as a function of data sparsity, via (fractional hypertree and submodular) width parameters of the feature extraction queries. We show how to exploit join dependencies and how to reparameterize models under functional dependencies. We further propose a unified dynamic computation approach for linear regression, matrix chain multiplication, and SQL queries. Our latest result is a worst-case optimal dynamic algorithm for counting triangles in a graph under edge updates. This result received the Best Paper Award at the International Conference on Database Theory 2019 (ICDT'19).
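To convey the flavour of the dynamic triangle result, here is a deliberately naive maintenance rule for the triangle count under single-edge insertions. It is only a sketch: the ICDT'19 algorithm is substantially more involved, partitioning vertices by degree and pre-materializing auxiliary views to achieve amortized sublinear time per update, none of which is attempted below.

```python
from collections import defaultdict

adj = defaultdict(set)  # adjacency sets of an undirected graph
triangles = 0           # maintained triangle count

def insert_edge(u, v):
    """Update the triangle count on insertion of edge (u, v).

    Each common neighbour of u and v closes exactly one new triangle,
    so the delta is |N(u) & N(v)|. This naive rule can cost time linear
    in the vertex degrees per update; the worst-case optimal algorithm
    trades off deltas against materialized views to do better (not shown).
    """
    global triangles
    triangles += len(adj[u] & adj[v])
    adj[u].add(v)
    adj[v].add(u)

insert_edge("x", "y")
insert_edge("y", "z")
insert_edge("x", "z")  # closes the triangle {x, y, z}
assert triangles == 1
```

Deletions are symmetric: remove the edge from the adjacency sets first, then subtract the number of remaining common neighbours.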
Our theoretical work is complemented by a significant systems-building effort. We have built systems for learning models over databases and for maintaining results of analytics under data updates.
Factorized relational learning is a key contribution of this project so far, with results published in top database venues. There are several early indicators of success for this paradigm. First, there is strong interest from industry. Second, it is drawing sustained attention from academia, as detailed below. This learning paradigm has been recognized as key to the relevance of database research in a research landscape largely dominated by machine learning contributions. Its message is that the data-dependent computation for learning models can be cast as a database problem and solved much more efficiently and with lower complexity than by the approaches put forward by the machine learning community. We support this message with both theoretical arguments and extensive system development and benchmarking. The PI has been invited to deliver keynotes on this topic at several international venues: Highlights of Logics, Games and Automata (2018); Conférence sur la Gestion de Données – Principes, Technologies et Applications (BDA); and a joint keynote at the International Conference on Database Theory (ICDT) and the International Conference on Extending Database Technology (EDBT), 2019. This is complemented by many invited talks at academic institutions, workshops, and industry labs. The PI has also been invited to give PhD-level courses on factorized learning at: PhD Open, University of Warsaw (2018); Turing Data Science, The Alan Turing Institute (2018); and the LogiCS/RiSE Summer School (2017). We also see the preparation of a nine-hour course (with publicly available slides, video, and exam) on the research results developed in this project as a key achievement essential for dissemination. Our ACM PODS'18 paper on factorized (in-database) learning with sparse tensors has been invited to the ACM TODS special issue of best papers of PODS'18.
A second key contribution is the worst-case optimal incremental maintenance paradigm. Our result opens up a new line of research on charting the exact complexity of incremental maintenance for relational queries. Its novel technical development, which pinpoints the amortized complexity of counting triangles in a graph under updates, can be applied to a large class of queries.
We expect a strengthening of the two aforementioned paradigms in the second half of the project, with a more complete picture of the models that can be learned in a factorized way and of the complexity of functional aggregate queries under data updates. We further expect both theoretical results and system development on the interaction between approximation and factorization for learning models over relational data.
More info: https://fdbresearch.github.io/.