Nowadays computers are used to process huge amounts of information. For example, a major search engine processes tens of thousands of searches per second, and more than 4 terabytes of Internet traffic is generated in 1 minute. This volume of information is expected to be...
Nowadays computers are used to process huge amounts of information. For example, a major search engine processes tens of thousands of searches per second, and more than 4 terabytes of Internet traffic is generated in 1 minute. This volume of information is expected to be multiplied by a factor of 10 in the following 10 years. We are just beginning to be capable of managing this huge amount of information. However, we are not prepared to address the big challenge posed by the explosion of data coming from the DNA sequencing.
Thanks to the falling cost of genome sequencing, biological data doubles every six months. Recent studies estimate that there will be up to 2 billion human genomes sequenced by 2025, increasing five orders of magnitude in 10 years. This increment can be even greater; for instance, for understanding cancer it will be necessary to sequence thousands of cells of the same tumor.
Processing the amount of information that will make personalized medicine a reality, requires a new approach. A new class of data structures has recently been developed to address the new challenges in storing, processing, indexing, searching and navigating biological data. Similar tasks have been also tackled by researchers in the information retrieval community. Synergies of researchers from both fields can lead to new efficient approaches to improve the technology used for analysis of genome-scale data. Only by being capable of processing large amounts of information we will be able to understand diseases or the development, behavior and evolution of species.
The overall goal of BIRDS is to establish a long term international network involving leading researchers in bioinformatics and information retrieval from four different continents, to strengthen the partnership through the exchange of knowledge and expertise, and to develop integrated approaches to improve current ones in both fields.
The project has achieved most of the goals planned for the first periodic report regarding research, training and networking activities. The consortium has achieved relevant scientific contributions, created new collaborations and enhanced existing ones, and proposed and implemented well-attended training and networking activities.
Regarding the first research task (Algorithms for Sequence Analysis), the most important results were in the field of Bioinformatics, where a new time-optimal algorithm for the subproblem contig assembly of the genome assembly problem was proposed. In addition, there were also some advances on the variation calling problem, which allows detecting mutations of an individual with respect to a reference genome.
For the second research task (Compression and indexing techniques for repetitive data), the knowledge of the partners on both fields were combined. As important results, we can enumerate: new universal indexes for highly repetitive document collections or efficient representation of trajectories and event sequences. In addition, there have been some important advances on new compression and indexing techniques for repetitive data not published yet. They will be an important contribution to the field, with applications on both bioinformatics and information retrieval.
Finally, results of the third research task (Data structures and algorithms for networks analysis) include a new compact trip representation over networks, an efficient parallel construction for wavelet trees planar graphs, an efficient representation of multidimensional data over hierarchical domains, a succinct data structure for self-indexing ternary relations, which can be applied to RDF datasets representing information from biological fields, and a compressed representation of dynamic binary relations, in particular of graphs and networks that change over time, which can be exploited to dynamic graphs and networks from any field.
BIRDS will lead to results in the shape of products that have an impact on the research community and can be potentially transferred to the market as complete tools, libraries or frameworks for bioinformatics tasks or in other fields of IR. Some of the potentially exploitable results are:
- Compressed representations of repetitive sequences and repetitive collections. Compression and indexing of repetitive sequences provide the basis for basic search tools and also building blocks to run complex sequence analysis algorithms in reduced space. Preliminary results are already available as scientific publications. Repetitive sequences and collections arise not only in biological data, but also in many other areas of IR. These data structures and algorithms can potentially be exploited as tools in document-based databases or digital libraries and other document repository software, leading to overall improvements in space usage and query efficiency of software tools.
- Compressed representation of trajectories. Preliminary results have already been obtained regarding the compact representation of moving objects. The representation of trajectories is an open problem, especially in the field of geographical information retrieval. The storage of positions and trajectories of cars, ships, people, etc. is a usual requirement in many systems, which need to manage very large amounts of data. Efficient compression of trajectories is an active research field, but most of the solutions for efficient representation are limited to the research community. This will be exploited as a tool inside larger management tools, for companies that manage mobility workers, or container ships. In most cases, this kind of information is stored and transmitted uncompressed, leading to inefficiency.
- Compressed representation of multidimensional data. New algorithms and data structures for the self-indexed representation of multidimensional data provide the basis for advanced information retrieval over aggregated data. Several preliminary research results have already been obtained in this topic, which could be integrated and benchmarked to obtain a prototype compression tool. Also, the integration of compressed representations of multidimensional data and compressed representations of trajectories would provide a better solution for the use cases defined in the previous point.
- Compressed representations of binary and ternary relations, graphs and networks. The representation of binary/ternary relations has many applications in the field of information retrieval, and can be applied in practice as a building block for network analysis and for the representation of sequences in general information retrieval. Specific compressed representations of relations, graphs and networks have been obtained as preliminary research results in the project, and improvements are expected. Representations of graphs, binary and ternary relations arise in many research fields, and efficient data structures and algorithms for the storage of graphs are relevant in many areas.
- New algorithms for sequence alignment/assembly, and sequence analysis. Sequence analysis and specific tasks such as sequence alignment or genome assembly are essential for bioinformatics research. Any prototype or tool that improves space or time efficiency over state-of-the-art solutions would be immediately applied to further research and would also be of great impact in the industry.
More info: http://birdsproject.eu/.