Recent years have seen a tremendous increase in the volume of data generated in the life sciences, propelled especially by the rapid progress of next-generation sequencing (NGS) technologies. Consequently, the number of genome-sequencing projects and the amount of sequencing data are increasing dramatically. It has even been proposed recently to sequence all existing eukaryotic species within 10 years. The analysis of these massive datasets, however, poses difficult computational challenges. The goal of this project is to make effective use of the sequence data by designing big data algorithms and implementing them efficiently on modern high-performance computing (HPC) systems. In particular, we want to investigate the following two applications/algorithms as case studies:
(i) Detection of cross-species contamination in NGS datasets using a novel k-mer counting algorithm on a multi-GPU system
(ii) Design of big data algorithms for clustering and searching massive sequence datasets using a Spark cluster
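To make the idea behind case study (i) concrete, the following is a minimal single-threaded Python sketch of k-mer counting used to score a read against candidate source species. It is not the project's novel algorithm and does not use GPUs. The function names (`count_kmers`, `shared_kmer_fraction`), the toy sequences, and the choice of k = 5 are assumptions made purely for illustration.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping k-mer in seq, skipping windows that contain non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts

def shared_kmer_fraction(read, reference_kmers, k):
    """Fraction of the read's k-mers (counted with multiplicity) that also occur in the reference."""
    read_kmers = count_kmers(read, k)
    total = sum(read_kmers.values())
    if total == 0:
        return 0.0
    hits = sum(n for kmer, n in read_kmers.items() if kmer in reference_kmers)
    return hits / total

# Toy data (hypothetical): a "host" genome, a "contaminant" genome, and one read.
k = 5
host_kmers = count_kmers("ACGTACGTTAGCTAGCTAGGATCCGATCGATCG", k)
contaminant_kmers = count_kmers("TTTTGGGGCCCCAAAATTTTGGGGCCCCAAAA", k)
read = "GGGGCCCCAAAATTTT"

scores = {
    "host": shared_kmer_fraction(read, host_kmers, k),
    "contaminant": shared_kmer_fraction(read, contaminant_kmers, k),
}
# A read whose k-mers match the contaminant far better than the host is a contamination candidate.
likely_source = max(scores, key=scores.get)
```

On real NGS data the k-mer tables would be far too large for a plain dictionary on one machine, which is the kind of scaling problem a multi-GPU implementation is meant to address.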
Drug development suffers from high failure rates due to unexpected toxic effects, some of which are discovered only after a drug has been marketed. Detecting all possible toxic effects of a drug is difficult because clinical trials cannot cover every influence of lifestyle and person-to-person genetic variability.
In this project, we are exploring the use of social media as a promising and currently largely untapped source of clinical data. Platforms such as Twitter offer a wealth of data in which users report their own experiences. We are laying the groundwork for exploiting social media to find hints of connections between drug use and side effects (the Small Data).
Texts indicative of adverse effects will be compared against toxicogenomics databases, which report the effects of drugs and pollutants in biological samples (human and animal cell lines, and animal organs). In particular, profiles that track the expression changes of thousands of genes describe the effects of chemicals at the molecular level in great detail. We will focus on hepatotoxicity and nephrotoxicity, where the interplay between the chemical, its metabolized derivatives, and the involvement of other tissues (e.g. the kidney clearing metabolized derivatives) generates scenarios that are difficult to relate to the biological models used. Here we will use social media to find the optimal models and cohorts that address particular aspects of drug toxicity.