Periodic Reporting for period 2 - HBMAP (Decoding, Mapping and Designing the Structural Complexity of Hydrogen-Bond Networks: from Water to Proteins to Polymers)

Summary

Several of the compounds that are most crucial for life, and that underlie major societal challenges from health to energy, are held together by a stable yet labile chemical bond that involves two electronegative atoms and one hydrogen atom: the so-called hydrogen bond. To emphasize the versatile nature of the hydrogen bond and its ubiquity, it suffices to say that water, DNA, proteins, several polymers such as Kevlar, as well as most small organic molecules that are used as drugs, have a structure that is largely determined by hydrogen bonds.

One of the main reasons behind the staggering flexibility of the hydrogen bond is that hydrogen bonds rarely come alone, but often give rise to cooperative networks in which the whole is much more than the sum of the parts. Understanding the complexity that arises when thousands of these relatively simple chemical units combine to form a protein or an extended crystal is an enormous challenge that limits our ability to tune the behavior and performance of all of these materials.

Computer simulations can help significantly to elucidate the structure-property relations of H-bonded materials, by giving direct access to the behavior of individual atoms on a length scale of a billionth of a meter, and on a time scale of less than a billionth of a second. To develop their full potential, however, simulations must achieve greater predictive accuracy, e.g. by including a full treatment of the quantum mechanical nature of both electrons and light nuclei (such as hydrogen itself). Furthermore, there is a great need to use techniques borrowed from artificial-intelligence research to sift through the enormous amount of data generated by large-scale simulations.

The objectives of HBMAP revolve around the use of machine-learning techniques to gain a better understanding of hydrogen-bonded materials, from water to drug molecules, and thereby clarify their structure-property relations and help design more effective drugs and more resistant, lightweight or biodegradable materials. Specific applications of machine learning that will be pursued in HBMAP include: pattern-recognition methods to identify recurring motifs in simulations of H-bonded systems, challenging and extending conventional heuristics (e.g. secondary-structure patterns in proteins); dimensionality-reduction techniques to obtain a simplified, more intuitive representation of the collective effects that underlie molecular self-assembly; and statistical regression to predict experimentally accessible properties that can be used to draw a more direct link between simulations and experiments.

Work performed

This project rests on a strong methodological effort that has produced a number of fundamental breakthroughs in the application of machine-learning methods to atomistic modeling. We have introduced a probabilistic analysis of molecular motifs (PAMM) scheme that is proving invaluable for identifying recurring patterns in an atomistic simulation, such as hydrogen-bonding modes in water, protein folds and misfolds, and the packing of molecules in a crystal. This method is based on an analysis of the probability of observing a molecular pattern in a computer simulation that is consistent with the relevant thermodynamic conditions, and makes it possible to identify the most frequently occurring configurations as the most important in determining the material's behavior.
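The idea of reading recurring motifs off the probability of observing a pattern can be illustrated with a small sketch. This is not the PAMM implementation: it assumes a single, hypothetical one-dimensional descriptor (for instance a distance-like coordinate), estimates its probability density with Gaussian kernels, treats the density maxima as "motifs", and assigns each sample to the nearest maximum. The toy data, the bandwidth, and all function names are assumptions made for illustration only.

```python
import math

def kde(samples, grid, bandwidth):
    """Gaussian kernel density estimate of `samples`, evaluated on `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]

def find_modes(grid, density):
    """Grid points where the density is a strict local maximum."""
    return [
        grid[i]
        for i in range(1, len(grid) - 1)
        if density[i] > density[i - 1] and density[i] > density[i + 1]
    ]

def assign(samples, modes):
    """Label each sample with the index of the nearest density maximum.

    Nearest-mode assignment is a simplification chosen for brevity.
    """
    return [
        min(range(len(modes)), key=lambda k: abs(s - modes[k]))
        for s in samples
    ]

# Hypothetical descriptor values sampled from a simulation (toy data):
# two groups of configurations, around 1.0 and around 3.0.
samples = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05, 3.0, 3.1, 2.9, 3.0]
grid = [i * 0.05 for i in range(81)]          # evaluation grid from 0 to 4
density = kde(samples, grid, bandwidth=0.2)   # bandwidth is an assumption
modes = find_modes(grid, density)             # candidate "motifs"
labels = assign(samples, modes)               # motif index for each sample
```

In this toy setting the more populated maximum corresponds to the more frequently observed configuration, which is the sense in which a probabilistic analysis ranks motifs by importance.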

We have also improved methods to represent the structural landscape of materials, and used them to reveal structure-property relations in materials for organic electronics. Since representing the structural relations between candidate configurations of a given system is not necessarily sufficient to determine which of them can be realized in experiments, we also incorporated energetic information into our analysis procedure. By generalizing in a data-driven fashion the convex hull construction (one of the main tools used for computational materials discovery), we have been able to rationalize the rich variety of polymorphs of ice, and to propose a couple of dozen candidate structures that show potential for being stabilized by the use of additives, pressure, or external fields.
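For readers unfamiliar with the convex hull construction, the following is a minimal sketch of the conventional version only, assuming each candidate structure is described by one hypothetical coordinate (for example a density-like descriptor) and an energy. Structures on the lower convex hull are the ones that can be stabilized by a constraint acting on that coordinate; the others are ranked by their energy above the hull. The data-driven generalization described above, which replaces the single coordinate with descriptors learned from the structures, is not reproduced here, and all names and numbers are illustrative.

```python
def cross(o, a, b):
    """2D cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull of (descriptor, energy) points (monotone chain)."""
    lowest = {}
    for x, e in points:  # keep only the lowest energy per descriptor value
        if x not in lowest or e < lowest[x]:
            lowest[x] = e
    hull = []
    for p in sorted(lowest.items()):
        # Drop the last vertex while it lies on or above the new segment.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def energy_above_hull(point, hull):
    """Vertical distance in energy between a structure and the hull."""
    x, e = point
    if len(hull) == 1:
        return e - hull[0][1]
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            return e - (e0 + (e1 - e0) * (x - x0) / (x1 - x0))
    raise ValueError("descriptor value outside the range spanned by the hull")

# Hypothetical candidate structures: (descriptor, energy), arbitrary units.
structures = {
    "A": (1.0, 0.0),
    "B": (2.0, -1.0),
    "C": (3.0, 0.0),
    "D": (2.0, -0.5),
    "E": (1.5, 0.2),
}
hull = lower_hull(list(structures.values()))
above = {name: energy_above_hull(p, hull) for name, p in structures.items()}
stable = sorted(name for name, de in above.items() if abs(de) < 1e-9)
```

Here `stable` lists the structures that sit on the hull, while the values in `above` give a simple measure of how far each remaining candidate is from being stabilizable along the chosen coordinate.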

In-depth work on physically and mathematically sound ways of representing atomic structures, to be used as inputs to machine-learning techniques, has given us the capability of predicting the properties of materials and molecules with great accuracy. One of the defining features of our framework is its universal applicability: materials as diverse as silicon and organic molecules, and properties as diverse as the interatomic potential and the activity of a candidate drug on a given protein target, can be predicted with unprecedented accuracy. We have made available online tools (http://shiftml.org, http://alphaml.org) that demonstrate some of the applications of this approach.

Final results

Analyses of simulations of materials, particularly those as complex as proteins or molecular crystals, often rely on heuristic rules or empirical principles to rationalize structure-property relations. This project has made it possible to interpret simulations and experiments using less biased approaches that rely on the analysis of correlations present in the data, rather than on preconceived notions based e.g. on prior knowledge of similar systems. For instance, this has allowed us to objectively assess the extent of hydrogen bonding in water, and to recognize the link between packing motifs in molecular crystals and their electronic properties.

A better understanding of the interplay between data-analytical techniques and materials modeling has allowed us to draw an atlas of the known solid phases of water, and to propose more than 20 new candidates that show substantial promise for being synthesized. The need to draw more direct connections to experiments, and to put pattern recognition on a firmer basis in terms of the underlying mathematical representation of atomic structures, has pushed us to a leap forward in the machine learning of molecular and materials properties, including the prediction of complex properties that transform in non-trivial ways under symmetry operations (e.g. tensors).
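What it means for a property to "transform in a non-trivial way under symmetry operations" can be made concrete with a small check. A rank-2 tensor property T (the dielectric response is of this kind) must satisfy T(R x) = R T(x) R^T when the structure x is rotated by R. The sketch below uses a purely hypothetical toy model, a sum of distance-weighted outer products of interatomic vectors, which satisfies this rule by construction; it is not the symmetry-adapted regression scheme developed in the project, and the atomic positions are made up.

```python
import math

def rotation_z(theta):
    """Rotation matrix for an angle theta (radians) about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matvec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def toy_tensor(positions):
    """Hypothetical tensor model: sum over pairs of exp(-r) * outer(d, d)."""
    t = [[0.0] * 3 for _ in range(3)]
    for a in range(len(positions)):
        for b in range(a + 1, len(positions)):
            d = [positions[b][k] - positions[a][k] for k in range(3)]
            w = math.exp(-math.sqrt(sum(x * x for x in d)))
            for i in range(3):
                for j in range(3):
                    t[i][j] += w * d[i] * d[j]
    return t

# Made-up three-atom geometry (arbitrary units).
positions = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]
R = rotation_z(0.7)
rotated = [matvec(R, p) for p in positions]

# Prediction for the rotated structure versus the rotated prediction.
lhs = toy_tensor(rotated)
rhs = matmul(matmul(R, toy_tensor(positions)), transpose(R))
max_dev = max(abs(lhs[i][j] - rhs[i][j]) for i in range(3) for j in range(3))
```

A generic regression model does not obey this rule automatically, which is why representations and models that build the transformation behavior in are needed for tensorial targets.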

In the coming months we expect that the methodological effort will bear even more fruit, with applications to other classes of materials including proteins, bio-mimetic polymers and porous compounds. Our groundbreaking symmetry-adapted machine-learning techniques, which we have used to predict the dielectric response properties of aqueous clusters and bulk water, will be applied to other classes of materials, and used to machine-learn other complex properties such as the electron density.

While we have already released several tools to perform most of the analyses and regression tasks we have introduced thus far, the implementation is not always user-friendly. Work is underway to release an open-source implementation of many of our recent developments, built to very high standards of efficiency and documentation, which will dramatically reduce the effort needed for other academic and industrial researchers to adopt them.

Website & more info

More info: https://cosmo.epfl.ch/research/hbmap/.