Next Generation Sequencing: Big Data meets HPC

Recent years have seen a tremendous increase in the volume of data generated in the life sciences, propelled especially by the rapid progress of next-generation sequencing (NGS) technologies. Consequently, both the number of genome-sequencing projects and the amount of sequencing data are increasing dramatically. It has even been proposed recently to sequence all existing eukaryotic species within 10 years. The analysis of these massive datasets, however, poses difficult computational challenges. The goal of this project is to make effective use of the sequence data through the design of big data algorithms and their efficient implementation on modern high-performance computing (HPC) systems. In particular, we want to investigate the following two applications/algorithms as case studies:
(i) Detection of cross-species contamination in NGS datasets using a novel k-mer counting algorithm on a multi-GPU system
(ii) Design of big data algorithms for clustering and searching massive sequence datasets using a Spark cluster
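
To make the k-mer counting in case study (i) concrete, the following is a minimal single-threaded sketch in Python. It is not the novel multi-GPU algorithm this project aims to develop; it only illustrates the basic operation. The function names (`reverse_complement`, `count_kmers`, `shared_kmer_fraction`), the use of canonical k-mers, and the shared-k-mer score are all assumptions introduced for illustration.

```python
from collections import Counter

# Complement table for the four DNA bases; other characters pass through unchanged.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_kmers(seq: str, k: int) -> Counter:
    """Count canonical k-mers (the lexicographically smaller of a k-mer and
    its reverse complement) in seq. Windows containing 'N' are skipped."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if "N" in kmer:
            continue
        counts[min(kmer, reverse_complement(kmer))] += 1
    return counts


def shared_kmer_fraction(read: str, reference_kmers: Counter, k: int) -> float:
    """Hypothetical score: fraction of the read's k-mer occurrences that also
    occur in a reference k-mer set (here the keys of a Counter)."""
    read_kmers = count_kmers(read, k)
    total = sum(read_kmers.values())
    if total == 0:
        return 0.0
    hits = sum(c for kmer, c in read_kmers.items() if kmer in reference_kmers)
    return hits / total
```

Under these assumptions, a read could be compared against k-mer sets built from several candidate species, and a high shared fraction with an unexpected species would hint at contamination. A realistic implementation would need compact k-mer encodings and hash tables sized for billions of entries, which is where GPUs become relevant.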