The availability of data and the possibilities for analysis are revolutionising our society and our businesses. But the data science process is painful and requires highly skilled experts. Inspired by recent successes in AI at automating highly complex jobs, the goal of SYNTH is to automate the task of the data scientist. This should make data science accessible to the non-expert.
SYNTH aims to automate data science by developing the foundations of a theory and methodology for automatically synthesising inductive data models.
An inductive data model (IDM) consists of 1) a data model (DM) that specifies an adequate data structure for the dataset (just like a database or a spreadsheet), and 2) a set of inductive models (IMs), that is, a set of patterns and models that have been discovered in the data. While the DM can be used to retrieve information about the dataset and to answer questions about specific data points, the IMs can be used to make predictions, propose values for missing data, identify outliers, find inconsistencies or violations of constraints, identify prototypical instances, answer what-if questions, find redundancies, etc. A typical IDM will contain multiple IMs, each corresponding to the outcome of a specific learning task. The goal is now to automatically synthesise such inductive data models from past data with only minimal supervision by a data scientist, that is, to automate the task of the data scientist to the maximal possible extent. The induced IDM could then be further improved and validated by a data scientist, or directly employed by an end-user. It is assumed that the dataset consists of a set of tables in a spreadsheet or relational database, that the end-user interacts with the IDM via a visual interface, and that the data scientist has access to a unifying IDM language offering a number of core IMs and learning algorithms.
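To make the notion concrete, the following is a minimal, hypothetical sketch of an IDM in Python. The class names, the choice of pandas tables as the data model, and the callable interface for inductive models are illustrative assumptions, not the SYNTH design.

```python
# A minimal, hypothetical sketch of the IDM notion: a data model (here a set of
# named tables) paired with a set of inductive models, each tied to a learning
# task. Names and interfaces are illustrative assumptions, not the SYNTH design.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import pandas as pd


@dataclass
class InductiveModel:
    task: str                                    # e.g. "predict shift", "detect outliers"
    apply: Callable[[pd.DataFrame], pd.Series]   # run the learned model on (part of) the data


@dataclass
class InductiveDataModel:
    data_model: Dict[str, pd.DataFrame] = field(default_factory=dict)     # the DM: tables
    inductive_models: List[InductiveModel] = field(default_factory=list)  # the IMs

    def lookup(self, table: str) -> pd.DataFrame:
        """Answer questions about specific data points via the DM."""
        return self.data_model[table]

    def models_for(self, task: str) -> List[InductiveModel]:
        """Retrieve the IMs learned for a given learning task."""
        return [m for m in self.inductive_models if m.task == task]
```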
The key challenges to be tackled in SYNTH are: 1) the system must “learn the learning task”: it should identify the right learning tasks and learn appropriate IMs for each of these; 2) the system may need to restructure the dataset before IM synthesis can start: it should perform the data wrangling step automatically; and 3) a unifying IDM language for a set of core patterns and models must be developed in order to support the data scientist; the IDM language should integrate concepts from logic and databases, probabilistic and constraint programming, machine learning, and inductive querying. The approach will be implemented in open source software and evaluated on two challenging application areas: rostering and sports analytics.
One simple yet operational view of automated data science that Synth has contributed is that of predictive autocompletion in a spreadsheet environment. Imagine the end-user of spreadsheet software filling out some entries, and assume that there are regularities in the data and that the data has been entered in a systematic manner. The predictive autocompletion task is then to automatically predict not only which cells the user will fill out next, but also the right values for those cells, together with an estimate of the confidence of each prediction. Solving the autocompletion task is, in a nutshell, the overall task addressed by Synth, as it requires solutions to all three challenges mentioned above. Initial approaches to autocompletion have been contributed.
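As an illustration, the following is a minimal, hypothetical sketch of the value-prediction part of autocompletion: a classifier is trained on the fully observed rows of a toy table and used to propose a value and a confidence for a missing cell. The column names, the toy data, and the choice of a random-forest classifier are assumptions made for the sketch, not the Synth approach.

```python
# A minimal sketch of cell-level autocompletion: train a classifier on fully
# observed rows and predict the missing cells of the remaining rows, together
# with a confidence score. Data and column names are illustrative only.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

df = pd.DataFrame({
    "department": ["sales", "sales", "it", "it", "it", "sales"],
    "contract":   ["full",  "part",  "full", "full", "part", None],  # one missing cell
    "remote":     [0, 1, 0, 0, 1, 1],
})

target = "contract"
known = df[df[target].notna()]
missing = df[df[target].isna()]

# One-hot encode the categorical predictors for the sketch.
X_known = pd.get_dummies(known.drop(columns=[target]))
X_missing = pd.get_dummies(missing.drop(columns=[target])).reindex(
    columns=X_known.columns, fill_value=0
)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_known, known[target])
predicted = clf.predict(X_missing)                 # proposed cell values
confidence = clf.predict_proba(X_missing).max(1)   # confidence estimate per cell
print(list(zip(missing.index, predicted, confidence)))
```

In Synth itself the harder parts are deciding automatically which cells to predict and which models to learn; the sketch only covers the final prediction step.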
Important progress has been made: the key components of an automatic data scientist have been identified, and prototypes for all of these are underway.
1) The SynthLog language as the unifying IDM language for supporting data scientists.
The underlying idea of SynthLog is, just like in inductive databases, that data science becomes a querying and inference process in which patterns and models become first-class citizens. But while traditional inductive databases are based on relational databases, SynthLog’s data model is based on the much more expressive probabilistic logical language ProbLog. ProbLog tightly integrates probabilistic models with logic, databases, and constraints, and supports deductive, probabilistic inference as well as constraint solving. SynthLog’s data models (DMs) are viewed as ProbLog programs, which can be used for inference and learning and which can be combined (through a set of algebraic operations) with other data models. Furthermore, the results of learning and inference, the inductive models (IMs), are also assimilated as regular data models, so that they become first-class citizens and a convenient closure property is satisfied.
SynthLog serves as the back-end of the Synth automated data scientist and it is intended to support the data science expert.
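As a rough illustration of the kind of probabilistic-logical data models SynthLog builds on, the sketch below runs a tiny ProbLog program through the publicly available problog Python package (assumed to be installed). The model is a toy example written in plain ProbLog, not SynthLog syntax.

```python
# A minimal sketch of a probabilistic-logical data model in ProbLog, queried
# from Python via the problog package (assumed installed). Illustrative only;
# this is plain ProbLog, not the SynthLog language itself.
from problog.program import PrologString
from problog import get_evaluatable

model = PrologString("""
% A tiny data model: probabilistic facts plus deterministic rules.
0.3::rain.
0.5::sprinkler.
wet :- rain.
wet :- sprinkler.
query(wet).
""")

# Deductive, probabilistic inference: P(wet) = 1 - (1 - 0.3) * (1 - 0.5) = 0.65
print(get_evaluatable().create_from(model).evaluate())
```

The closure property mentioned above means that the result of such a learning or inference step is itself again a data model that can be queried and combined further.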
2) A (partially) automated data wrangling system.
In particular, Synth-a-Sizer is a preliminary automated data wrangling system that starts from a set of .csv files and transforms them into a format that can be processed by machine learning algorithms. Synth-a-Sizer is being extended towards more automation and towards coping with richer data formats.
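The following is a minimal, hypothetical sketch of the kind of transformation such a wrangling step performs, written with pandas. The file layout, the shared "id" key, and the one-hot encoding are assumptions made for the sketch and say nothing about how Synth-a-Sizer itself works.

```python
# A minimal sketch of a wrangling step: load several .csv files, join them on a
# shared key, and encode the result into a single numeric table a learner can
# consume. File names, the join key and the encoding are illustrative assumptions.
import glob

import pandas as pd

tables = [pd.read_csv(path) for path in sorted(glob.glob("data/*.csv"))]

# Assume every table shares an "id" column; fold them into one wide table.
wide = tables[0]
for t in tables[1:]:
    wide = wide.merge(t, on="id", how="left")

# One-hot encode categoricals so standard ML algorithms can process the result.
ml_ready = pd.get_dummies(wide.drop(columns=["id"]))
ml_ready.to_csv("data/ml_ready.csv", index=False)
```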
3) Methods for learning inductive models, in particular constraints and predictive models.
The Synth project has devoted special attention to constraints and has contributed a wide variety of techniques for learning them. Synth is also contributing new algorithms for predictive learning: in particular, it is extending the structure learning approaches for the ProbLog language, and it has contributed to the MERCS approach for learning multi-directional ensembles of multi-target decision trees.
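To give a flavour of the multi-directional idea, the following is a minimal sketch in the spirit of MERCS (not the actual implementation): several multi-target decision trees are trained, each with a different subset of attributes as targets and the remaining attributes as inputs, so that at prediction time any attribute can be completed from the others. The toy data and parameter choices are assumptions for the sketch.

```python
# A minimal sketch in the spirit of MERCS (illustrative, not the actual code):
# an ensemble of multi-target decision trees, each predicting a different
# subset of attributes, so any attribute can be completed from the others.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))        # toy table with 4 numeric attributes
X[:, 3] = X[:, 0] + 0.1 * X[:, 1]    # inject a regularity for the trees to learn

ensemble = []
n_attrs = X.shape[1]
for _ in range(10):
    targets = rng.choice(n_attrs, size=2, replace=False)  # pick 2 target attributes
    inputs = [a for a in range(n_attrs) if a not in targets]
    tree = DecisionTreeRegressor(max_depth=4).fit(X[:, inputs], X[:, targets])
    ensemble.append((inputs, targets, tree))

# To complete attribute 3 of a new row, average over the trees that predict it.
row = np.array([1.0, -0.5, 0.3, np.nan])
preds = [tree.predict(row[inputs].reshape(1, -1))[0][list(targets).index(3)]
         for inputs, targets, tree in ensemble if 3 in targets]
print(np.mean(preds))
```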
4) Prototype implementation.
The Synth framework will consist of both a front- and a back-end. The front-end of SYNTH will extend traditional spreadsheet software with facilities for autocompletion and is intended to support the naive end-user. The back-end of SYNTH is the SynthLog language.
When comparing the Synth project and its approach to the state of the art, the following points are important.
First, while most approaches to automating data science and machine learning focus on the modeling step, Synth focusses on the overall data science process.
Second, Synth focusses on symbolic and probabilistic modeling and learning methods (rather than neural networks) because it wants to support both learning and reasoning. In addition, Synth pays particular attention to the learning of constraints: while constraints are widely used in machine learning and in problem solving, there have been only a few attempts at learning them.
Third, Synth’s grand challenge is to automate data science to such an extent that it becomes accessible to non-expert users; Synth addresses this through its front-end and its back-end. Fourth, Synth aims to identify a small and principled set of necessary components for automating data science and to develop a unifying language and framework that incorporates these.
The project is on track. Its expected results by the end of the project include a fully worked-out proof of principle of the sketched approach to automated data science, a novel language for data science (SynthLog), a novel semi-automatic data wrangling system, various approaches to learning constraints and predictive models, prototype implementations of both the front- and back-end, and an evaluation of the approach by end-users and on applications in rostering and sports analytics.
More info: https://synth.cs.kuleuven.be/.