Opendata, web and dolomites

Report

Teaser, summary, work performed and final results

Periodic Reporting for period 2 - DMAP (Data Mining Algorithms in Practice)

Teaser

Data mining and machine learning have brought about some of the technologies that drive today\'s world --- their importance is hard to overstate.Some of the techniques used in these areas are heuristic in nature and, therefore, their behavior is not completely understood. A...

Summary

Data mining and machine learning have brought about some of the technologies that drive today\'s world --- their importance is hard to overstate.
Some of the techniques used in these areas are heuristic in nature and, therefore, their behavior is not completely understood. A deep understanding of these techniques and algorithms is necessary for harnessing their full potential, and for controlling their behavior.

This project has two main goals: (i) provide a theoretical study of existing techniques, and (ii) develop new fast, and provably correct, algorithms for important problems in data mining and machine learning, and to study the behavior of existing well-known algorithms; these new algorithms will be designed so to satisfy another important requirement: they will have to be as simple as possible, in order to facilitate their understanding, and their implementation.

Work performed

\"Our first set of results so far revolves around the basic question of approximating properties of users in a social network: e.g., which fraction of people in this social network study computer science? Or, what is the approximate average star-rating of movie X in this social network? These questions can be answered by sampling nodes uniformly at random from the social network. Online social networks, however, severely restrict the type of access to their nodes.
In our WWW 2016 paper (“On Sampling Nodes in a Network”, Chierichetti, Dasgupta, Kumar, Lattanzi, Sarlos) we show how one can sample nodes almost uniformly at random from a social network; the algorithms that we propose are extremely simple to implement. We also analyzed those algorithms both theoretically and empirically.
In a recent ICALP 2018 paper (\"\"On the Complexity of Sampling Vertices Uniformly from a Graph\"\", Chierichetti, Haddadan) we prove that those algorithms are optimal in a natural graph query model.

The second set of results studies some topological properties of a social network --- in particular, the distribution of the k-graphlets: that is, for some integer k >= 2, and for each connected graph H of k nodes, the (approximate) fraction of the induced connected subgraphs of a given social network G, that are isomorphic to H. The distributions of graphlets of size k are used, e.g., to classify networks: for instance, social networks tend to have a large number of cycles of length 3 (people tend to introduce their friends to each other), whereas road networks tend to have fewer triangles (roads tend to be orthogonal to each other).
In \"\"Counting Graphlets: Space vs Time\"\" (Bressan, Chierichetti, Kumar, Leucci, Panconesi, WSDM 2017) and in \"\"Motif Counting Beyond Five Nodes\"\" (Bressan, Chierichetti, Kumar, Leucci, Panconesi, ACM Transactions on KDD 2018) we proposed a number of algorithms for the approximate graphlet-estimation problem. Our algorithms have theoretical guarantees (on the quality of the approximation), and practically outperform existing ones in many directions, for instance, the amount of memory and time required for a given k, and a given network. When applied to real-world networks, our algorithms return good approximations for larger k\'s than in previous work.

The third set of results deals with learning patterns of user behavior. We have studied various classes of the so-called Random Utility Model, which has been used for decades for representing how rational users pick a choice in a set --- in this model, a user is a permutation over all the possible choices; the system picks some set of choices, and presents it to a random user: this user will pick the one choice in the set that she likes best (i.e., according to her permutation). Since a rock and roll fan is likely to order, e.g., venues differently from a classical music fan, as the user changes we might see different choices from the same set.
This model is particularly significant for, e.g., advertisements on the web: we have some distribution over the users (e.g., some percentage of rock fans, and some percentage of classical music fans) and, given a set of choices, we observe some distribution over its elements. The general goal is to learn the distributions of each set of choices, by querying some small number of sets.
In SODA 2018 (\"\"Discrete Choice, Permutations, and Reconstruction\"\", Chierichetti, Kumar, Tomkins) we have studied this general learning problem from a theoretical perspective, obtaining some algorithms, and some strong lower bounds. In ICML 2018 (\"\"Learning a Mixture of Two Multinomial Logits\"\", Chierichetti, Kumar, Tomkins) we have studied a more restricted class of Random Utility Models (that is, a mixture of Multinomial Logits), obtaining some quite simple algorithms. In fact, our results can be seen as an explanation of why neural networks work so well in identifying mixtures of multinomial logits.

Moreover, we have studied:
- recommender systems (\"\"The Limits\"

Final results

We will keep studying machine learning, and data mining, problems, with the goal of obtaining new efficient algorithms for solving them.
The problems that we plan tackle include topic-reconstruction (for which, again, we will look for very simple algorithms, with provable guarantees), clustering and optimization of functions in the presence of noise.

Website & more info

More info: https://sites.google.com/a/di.uniroma1.it/dmap/.