Integrating heterogeneous pieces of content has traditionally been a tedious process, involving domain experts who manually and carefully repurpose the different types of content using rigid formats in order to meet the needs of a specific application. The rapid increase in the volume and variety of information being made available today, by both humans and machines, makes this fully manual process impractical in many settings. Instead, large organizations are now exploring automated, data-driven approaches to integrate heterogeneous pieces of information.
However, classical information integration techniques are ill-suited to meaningfully integrate dynamic and heterogeneous data originating from many different sources. On the one hand, Information Retrieval techniques based on synonym sets, query expansion and pseudo-relevance feedback, though highly scalable and efficient, offer limited abstractions beyond Boolean keyword search; typically, they do not support higher-level processing based, for example, on declarative queries. Database techniques, on the other hand, offer powerful abstractions to syntactically reformulate complex queries posed against one database into equivalent (or subsumed) queries in a different database. However, they require the a priori definition of a global federated schema and complex mappings (views) between the global schema and each source, thus drastically limiting their dynamicity and scalability in practice.
The goal of this project is to propose an ambitious overhaul of data integration techniques to better match today’s world of information profusion. The main idea behind those techniques revolves around a new integration abstraction flexibly connecting all heterogeneous pieces of information through extremely large, data-driven, probabilistic and heterogeneous graphs. The scientific contribution of this project is divided into three distinct though highly interwoven endeavors: i) the creation of new information extraction and semantic lifting approaches to interconnect unstructured and structured content through novel abstraction layers called Heterogeneous Information Graphs (HIGs), ii) the development of new physical structures aiming at efficiently storing, managing, and retrieving the wealth of information considered by a HIG on sets of commodity machines, and iii) the design of new logical abstractions responsible for exposing and serving the richness of HIG data to external entities through high-level declarative interfaces.
We have developed a number of novel techniques as part of the project. The main methods so far include:
- a new technique to infer links in large graphs leveraging hierarchical and overlapping clustering algorithms (see our BigData 2018 publication);
- a new method for link prediction using uncertainty sampling and deep active learning (see our WWW2019 [Ostapuk] paper);
- a new method combining deep probabilistic modeling and crowdsourcing for data debugging (see our WWW2019 [Jie Yang] paper);
- new techniques to create vector spaces from heterogeneous graphs and hypergraphs leveraging random walks (see our CIKM 2018 and WWW 2019 [Dingqi Yang] papers);
- a new method to obfuscate private data prior to publishing leveraging a fixed distortion budget (see our TKDE 2019 article);
- a new method to create and dynamically maintain compact and fixed-size sketches to approximate similarity metrics (see our TKDE 2018 article and our ICDM 2017 paper).
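To give a concrete feel for the last item in the list above, the following is a minimal sketch of a compact, fixed-size sketch that is updated incrementally and approximates a similarity metric. It uses classic MinHash to estimate Jaccard similarity between sets; this is a stand-in chosen for illustration and is not the method from the cited TKDE 2018 or ICDM 2017 work. The class name `MinHashSketch`, the helper `_hash`, and the parameter `num_hashes` are hypothetical names introduced here.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Hypothetical helper: a 64-bit hash of `item`, salted with `seed`."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class MinHashSketch:
    """Fixed-size summary of a set (illustrative MinHash, not the cited method).

    The sketch keeps one minimum hash value per seeded hash function, so its
    size is `num_hashes` regardless of how many items are added.
    """

    def __init__(self, num_hashes: int = 256):
        self.num_hashes = num_hashes
        self.mins = [None] * num_hashes  # None means "no item seen yet"

    def add(self, item: str) -> None:
        """Incrementally update the sketch with one new item."""
        for seed in range(self.num_hashes):
            h = _hash(seed, item)
            if self.mins[seed] is None or h < self.mins[seed]:
                self.mins[seed] = h

    def similarity(self, other: "MinHashSketch") -> float:
        """Estimate Jaccard similarity as the fraction of matching minima."""
        if self.num_hashes != other.num_hashes:
            raise ValueError("sketches must use the same number of hashes")
        matches = sum(
            1
            for a, b in zip(self.mins, other.mins)
            if a is not None and a == b
        )
        return matches / self.num_hashes


def sketch_of(items) -> MinHashSketch:
    """Build a sketch from any iterable of strings."""
    sketch = MinHashSketch()
    for item in items:
        sketch.add(item)
    return sketch


if __name__ == "__main__":
    a = sketch_of(f"node-{i}" for i in range(0, 100))
    b = sketch_of(f"node-{i}" for i in range(50, 150))
    # The true Jaccard similarity of the two sets is 50 / 150.
    print(f"estimated similarity: {a.similarity(b):.3f}")
```

The estimate becomes more accurate as `num_hashes` grows, at the cost of a larger sketch and more hashing per added item. Note that this simple version only supports insertions; handling deletions or weighted items would need a different construction.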
This project will dramatically reduce the time-to-value for personal and enterprise data and hence will have a tremendous impact, both scientific and practical. The issues tackled by this project are exceedingly important today, given that human attention is limited and that neither individuals nor organizations can keep up with the number and variety of data sources available in our information society. Many application domains such as Social Networking, Big Data analytics, Personal Digital Assistants, Data-Driven Security, eScience, or Enterprise Data Integration require the proper integration of different data silos, and as such will directly benefit from the results of this project.
More info: https://exascale.info/GraphInt.