Report

Periodic Reporting for period 1 - INFERNET (New algorithms for inference and optimization from large-scale biological data)

Summary

Issues addressed

The scope of the project’s research activity has been divided into:

A) Research Themes characterised by the toolbox and methods developed:

(i) the inference of interaction networks from data,
(ii) the analysis of static and dynamical processes on networks.

B) Application Domains divided into four main areas:

(i) the inference and modelling of multi-scale biological networks,
(ii) the rational design of biological molecules,
(iii) the quantitative study of cell energetics in proliferative regimes,
(iv) the characterisation of functional states of large-scale regulatory networks.


Relevance for society

INFERNET's application domains and research themes lie at the heart of current trends and emerging paradigms in science and technology. The main activity of the project has been the development of new conceptual tools to organise vast amounts of heterogeneous biological data and to unveil their hidden large-scale relational order for the benefit of the scientific community. The work of WP2 starts from the explicit aim of integrating our results into accessible web-based services, fostering their dissemination (WP7). WP3 concerns the prediction and high-resolution structural modeling of protein-protein interaction networks via high-throughput in-silico methods developed within the work package. The issue of proliferative metabolism, tackled in WP4 and WP5, is clearly connected with relevant physiological aspects of cancer development.

Overall objectives

Biological data sets are limited and noisy, so fitting a model perfectly to the data is not necessarily a good strategy (because of the risk of over-fitting). The central problem is finding an optimal strategy for learning from noisy data sets. To do so we aim at:

(i) Setting up a coherent and effective theoretical framework for the statistical mechanics of inverse problems;
(ii) Exploiting this framework for the development of efficient and distributed algorithms;
(iii) Analysing the intrinsic limits of the techniques developed in terms of theoretical bounds on the statistical relevance of the inferred results as a function of both quality and quantity of available data;
(iv) Integrating different inference schemes from small modules to the full system, so as to go towards true multi-scale inference.
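
The over-fitting concern behind these objectives can be made concrete with a penalised-likelihood estimator. This is a generic textbook formulation shown for illustration only; the symbols (parameters \theta, M observed configurations x^{(m)}, penalty R, regularisation strength \lambda) are assumptions introduced here and are not taken from the project's own framework.

```latex
\hat{\theta}
  = \arg\max_{\theta}
    \left[
      \frac{1}{M} \sum_{m=1}^{M} \log P\!\left(x^{(m)} \mid \theta\right)
      \;-\; \lambda \, R(\theta)
    \right],
\qquad \lambda \ge 0 .
```

With \lambda = 0 the estimator fits the sample as closely as the model allows; a larger \lambda trades fidelity to a small, noisy sample for smoother parameters, for example with R(\theta) = \lVert \theta \rVert_2^2.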

Work performed

Work performed so far

The first six months (M1-M6) of the project were mostly dedicated to setting up the administrative organisation of the project: kick-off meeting (M2), project web portal (M3), Data Management Plan [DMP] (M6). The products of the first semester of activity were presented in a progress report on M12. The first public outreach activity of the project was the successful organisation of an international school held in Bardonecchia (Torino) on January 22-26, 2018 (M16), organised in collaboration with Bocconi University of Milan. The status of the project's activity was thoroughly discussed on M16 during the Mid-Term meeting held in Brussels with the project officer. A second international school, followed by a workshop, was held at the University of Havana on M24.

On a per-work-package basis, the main scientific results achieved so far have been:

WP1 [Algorithms]

Development of novel inference techniques including prior knowledge. Study of the maximum inter-alignment matching of families with paralogs. Message-passing method for sampling high-dimensional polytopes including prior knowledge.

WP2 [Multi-scale biological networks]

Development of a gold-standard database consisting of proteins and protein complexes with sufficient sequences for statistical analysis, and with resolved structures for systematic validation of results against experimental data.

Development of improved versions of DCA integrating prior biological and structural knowledge (e.g. solvent accessibility, amino-acid properties, secondary structure, etc.).

WP3 [Design of Biological Molecules]

The problem of accurately predicting how somatic mutations affect protein affinities is one of the most fundamental open problems in both biology and medicine. Thanks to the so-called “next generation sequencing” techniques, the problem has in the last decade acquired a big-data perspective. Although some interesting attempts have been made on the problem of inference from “in vivo” Repertoire Sequencing data, in recent years in vitro selection experiments based on combinatorial libraries have become very popular for:

WP4 [Proliferative Metabolism]

We have been developing algorithms to simulate cell metabolism that take into consideration (4.1.1) physical, (4.1.2) biochemical and (4.1.3) cell-population effects, as well as (4.1.4) heterogeneity. In the first stage we have been working on these four modules separately.


WP5 [Regulatory Networks]

We have been developing algorithms to simulate cell metabolism and regulation that take into consideration tissue-specific gene regulation constraints.

Final results

Expected results and potential impact

Establishing novel protocols for the rational design of biological molecules

Molecular modeling spans theoretical and computational methods aimed at the design of synthetic biomolecules with desired characteristics. We aim at exploiting the co-evolutionary information gathered either from existing databases (e.g. Pfam) or from the experimental assessment of thousands of functionally active mutants, to build reliable multivariate models of the observed sequence variability. The final goal is to use the inferred multivariate statistical model to rationally predict novel sequences with the sought characteristics. INFERNET will pursue two specific subprojects:

1. Predicting super-binding Antibodies from Repertoire Sequencing Data

Thanks to high-throughput sequencing techniques it is now possible to access a fairly representative sample (of the order of 10^5 to 10^6 sequences) of the immune repertoire of a given individual. Our approach will be (i) to use the inference machinery discussed above on the sequenced portion of the immune repertoire; (ii) to use the inferred probabilistic model as a score when predicting the neutralization power of a given antibody sequence for the antigen of interest. In the framework of INFERNET, we aim at developing an integrated protocol to: (i) analyse existing RepSeq data to tune the algorithm in general cases of interest, (ii) use the inferred model to design new super-binding antibodies, (iii) test the sequences in wet-lab experiments.
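
The idea of using an inferred probabilistic model as a score can be sketched with a deliberately simplified stand-in. The example below fits an independent-site frequency model with pseudocounts, not the project's inference machinery, and ranks candidate sequences by log-probability. The function names, the pseudocount value and the toy sequences are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def fit_site_model(sequences, pseudocount=1.0):
    """Estimate per-position amino-acid frequencies with additive smoothing.

    Assumption: sequences are already aligned and contain no gaps.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("all sequences must have the same aligned length")
    total = len(sequences) + pseudocount * len(ALPHABET)
    model = []
    for i in range(length):
        counts = Counter(s[i] for s in sequences)
        model.append({a: (counts[a] + pseudocount) / total for a in ALPHABET})
    return model


def log_score(model, sequence):
    """Log-probability of a sequence under the independent-site model."""
    if len(sequence) != len(model):
        raise ValueError("sequence length does not match the model")
    return sum(math.log(site[a]) for site, a in zip(model, sequence))


# Made-up aligned sequences standing in for a sequenced repertoire.
repertoire = ["CARDY", "CARDY", "CARGY", "CAKDY", "CTRDY"]
model = fit_site_model(repertoire)

# Rank hypothetical candidates: a higher log-score means the candidate
# looks more like the sequences the model was fitted on.
candidates = ["CARDY", "CWRDY", "WWWWW"]
ranked = sorted(candidates, key=lambda s: log_score(model, s), reverse=True)
print(ranked)
```

An independent-site model ignores correlations between positions. A model with pairwise couplings would add terms that depend on pairs of residues, but the way the score is used to rank candidates stays the same.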

2. Protein Design

DCA, a probabilistic inference technique developed in the framework of the project, produces accurate models for full-length protein sequences, thus assigning a quantitative measure to each amino-acid sequence of interest. This gives preliminary evidence that DCA models can help to design novel amino-acid sequences. To reach this aim, we need to improve DCA along a number of lines: (i) Highly precise inference methods going beyond mean-field and pseudo-likelihood inference are needed to advance from the aforementioned topological network description to quantitative generative statistical models. (ii) Phylogenetic biases in the sequence data need to be accounted for by novel models of epistatic protein evolution. (iii) Integrative Bayesian inference allows for incorporating prior structural and functional information about the protein of interest.
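
For reference, the quantitative measure that DCA assigns to a sequence is usually written as a pairwise (Potts-type) model over an aligned sequence of length L. The notation below (fields h_i, couplings J_{ij}, normalisation Z) is the form commonly used in the DCA literature and is given here as an assumption, since the report does not state its own notation.

```latex
P(a_1, \dots, a_L)
  = \frac{1}{Z}
    \exp\!\left(
      \sum_{i=1}^{L} h_i(a_i)
      + \sum_{1 \le i < j \le L} J_{ij}(a_i, a_j)
    \right),
\qquad
E(a_1, \dots, a_L)
  = -\sum_{i=1}^{L} h_i(a_i)
    - \sum_{1 \le i < j \le L} J_{ij}(a_i, a_j).
```

In this form, a lower energy E corresponds to a higher model probability. Mean-field and pseudo-likelihood methods are two approximate ways of estimating h_i and J_{ij} from a multiple sequence alignment, because Z cannot be computed exactly for realistic sequence lengths.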

Website & more info

More info: https://www.infernet.eu/.