Periodic Reporting for period 2 - HAP-PHEN (From haplotype to phenotype: a systems integration of allelic variation, chromatin state and 3D genome data)

Summary

High-throughput sequencing methods are breaching the barrier of $1000 per genome. This means that it will become feasible to sequence the genomes of many individuals and create a deep catalog of the bulk of human genetic variation. A great task will lie in assigning function to all this genetic variation. Genome-wide association studies have already shown that the vast majority of loci significantly associated with disease lie in non-coding, presumably regulatory regions. One of the current challenges in human genetics is that variants affecting expression on a single allele cannot be directly linked to that allele, because we only have genotype information rather than haplotype information. The overarching aim of the project is to resolve haplotypes in order to identify genetic variants that affect gene expression.

When we better understand the genetics of complex human traits (i.e. traits that are influenced by multiple genetic loci), we will be able to better predict disease risk. A genetic profile can be used to encourage people to make lifestyle choices that promote healthy living and aging by preventing the onset of disease.

Every individual has millions of genetic variants that can affect phenotypic traits (e.g. height, blood pressure or cardiovascular disease). Because not every genetic variant is functional, assigning function to genetic variants is extremely difficult, particularly when a variant lies in a region that does not code for a protein. However, genome-wide association studies have already shown that non-coding genetic variants make up the majority of functional genetic variants. We will use a combination of multiple genomics methods to assign function to non-coding genetic variants.

Work performed

In the project we proposed to generate chromosome-wide haplotypes for a set of lymphoblastoid cell lines. We have generated these data and developed a computational analysis pipeline that enables the haplotyping of these cell lines. We have generated RNAseq, nascent RNAseq and RNA polymerase ChIPseq data to measure transcription rates in these cells. In addition, we have generated ATACseq data to identify regulatory regions in these cells. We can use the haplotype information to identify allele-biased expression and intersect this with allele-biased regulatory regions. This will enable us to prioritize functional non-coding variants.
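The idea of intersecting allele-biased expression with allele-biased regulatory regions can be illustrated with a minimal sketch. It is not the project's pipeline: it assumes that reads have already been assigned to one of the two haplotypes, uses a simple exact binomial test against a 50/50 expectation, and applies no multiple-testing correction. All names and read counts (`allelic_bias`, `candidate_regions`, `GENE_A`, `peak_1` and so on) are hypothetical.

```python
from math import exp, lgamma, log


def binomial_two_sided_p(k: int, n: int, p: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Sums the probabilities of all outcomes that are no more likely
    than the observed count k under success probability p.
    """
    if n == 0:
        return 1.0
    # Work in log space so that large read counts do not overflow.
    pmf = [
        exp(lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log(1 - p))
        for i in range(n + 1)
    ]
    threshold = pmf[k] * (1 + 1e-7)  # tolerate floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def allelic_bias(hap1_reads: int, hap2_reads: int, alpha: float = 0.05):
    """Return (haplotype-1 read fraction, p-value, biased?) for one feature."""
    n = hap1_reads + hap2_reads
    ratio = hap1_reads / n if n else 0.5
    p_value = binomial_two_sided_p(hap1_reads, n)
    return ratio, p_value, p_value < alpha


def candidate_regions(expression, accessibility, alpha: float = 0.05):
    """Pair biased regulatory regions with biased genes on the same haplotype.

    expression:    {gene: (hap1_reads, hap2_reads)}
    accessibility: {peak: (gene, hap1_reads, hap2_reads)}
    """
    hits = []
    for peak, (gene, h1, h2) in accessibility.items():
        peak_ratio, _, peak_biased = allelic_bias(h1, h2, alpha)
        gene_ratio, _, gene_biased = allelic_bias(*expression[gene], alpha)
        same_direction = (peak_ratio - 0.5) * (gene_ratio - 0.5) > 0
        if peak_biased and gene_biased and same_direction:
            hits.append((peak, gene))
    return hits


# Hypothetical haplotype-resolved read counts, for illustration only.
expression = {"GENE_A": (90, 30), "GENE_B": (52, 48)}
accessibility = {"peak_1": ("GENE_A", 40, 10), "peak_2": ("GENE_B", 21, 19)}

print(candidate_regions(expression, accessibility))
```

In this toy input only `peak_1` and `GENE_A` are skewed towards the same haplotype, so that pair is the only one reported. A real analysis would also have to handle mapping bias, overdispersion and multiple testing, which this sketch ignores.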

Final results

We have used 10X Genomics long-range DNA sequencing in combination with Hi-C data to generate whole-chromosome de novo haplotypes for four different primary lymphoblastoid cell lines. As far as we are aware, this is the first example of high-resolution, de novo (i.e. not requiring trio, parent or population information) whole-chromosome haplotypes generated using short-read sequencing. We have phased >99% of all single nucleotide variants and insertions/deletions (indels).
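As an illustration of how a phasing rate such as the one above could be computed, the sketch below counts phased heterozygous genotypes in VCF-style genotype strings, where `|` separates phased alleles and `/` separates unphased ones. Only heterozygous sites are counted, because homozygous sites carry no phase information. Whether the reported figure was calculated in exactly this way is an assumption, and the function names and genotype list are hypothetical.

```python
import re


def is_heterozygous(gt: str) -> bool:
    """True if a diploid genotype string such as '0|1' or '0/1' has two different called alleles."""
    alleles = re.split(r"[|/]", gt)
    return len(alleles) == 2 and "." not in alleles and alleles[0] != alleles[1]


def phased_fraction(genotypes) -> float:
    """Fraction of heterozygous genotypes that are phased ('|' separator)."""
    het = [gt for gt in genotypes if is_heterozygous(gt)]
    if not het:
        return 0.0  # no heterozygous sites, so nothing to phase
    return sum("|" in gt for gt in het) / len(het)


# Hypothetical genotype calls: three heterozygous sites, two of them phased.
calls = ["0|1", "1|0", "0/1", "1/1", "0|0", "./."]
print(f"{phased_fraction(calls):.2%} of heterozygous sites are phased")
```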

We are now building a statistical framework to identify functional non-coding variants. Once we have identified putative functional non-coding variants, we will validate them by using CRISPR-Cas9 to introduce them into a recipient cell line. If successful, our method will be the first to identify functional non-coding variants in single individuals. This means that identifying putative functional genetic variants will no longer require the analysis of large cohorts of individuals, which is the current practice.