Novel scalable approaches for the computational analysis of bacterial genomes
Over the last decades, enormous progress in DNA sequencing has increased throughput and drastically reduced costs, making these technologies broadly accessible and applicable and thereby revolutionizing the entire field of microbial genomics. Today, these developments allow the sequencing of large groups and entire cohorts of ... bacterial genomes in a timely manner, whereas a mere decade ago this was feasible only for a few individual genomes. Hundreds of thousands of sequenced bacterial genomes are now available in public databases, and vast numbers of genomes are sequenced worldwide every day with no foreseeable peak. Many fields of research benefit from these developments, in particular medical microbiology and epidemiology. Genome-based analyses have therefore become essential tools for the detection, classification, typing and comparison of special-interest genes and pathogenic genomes at various levels. At the same time, information technology has likewise been revolutionized by developments such as cloud computing and software containerization. Modern software engineering paradigms and frameworks have recently emerged that provide new opportunities for scalable computation on distributed and heterogeneous infrastructures, which in turn impose new technical requirements. Although the sequencing of bacterial genomes and computing capacity in general are no longer the major limiting factors, the comprehensive, timely and standardized analysis of large bacterial whole-genome sequencing data sets remains an issue of rising importance. Therefore, the aim of this thesis was to address these challenges and to provide novel bioinformatic approaches and software tools for the scalable high-throughput analysis of whole-genome sequencing data from large bacterial cohorts. An automated and comprehensive workflow was designed and implemented in a portable, scalable and user-friendly software tool, ASA³P.
It supports data from all contemporary DNA sequencing platforms and conducts streamlined processing and analysis from raw reads to assembled, annotated and comprehensively characterized genomes, including comparative analyses. The software provides both vertical and horizontal scalability, allowing researchers to take advantage of distributed and versatile computing infrastructures. Results are presented as integrated, human-readable and interactive reports. Two further contributions address issues that arose from the design of this workflow. For the integrated analysis of plasmids, a novel methodology was developed for the automated and taxonomy-independent detection and characterization of plasmid-borne contigs in fragmented bacterial draft assemblies. As a new approach to this problem, it exploits the natural distribution bias of protein-coding gene families between chromosomes and plasmids, which achieves a robust and competitive classification performance. This methodology was implemented in the software tool Platon, which also provides additional plasmid characterizations. A third contribution addresses the robust, accurate and rapid computation of mutual genome distances, which is required for the automated selection of high-quality reference genomes and for whole-genome-based taxonomic classification. Because the large number of available genome sequences poses increasing hurdles to these steps in terms of data accessibility, performance and runtime, a new software tool called ReferenceSeeker was developed that combines existing methodologies and is complemented by integrated and customizable databases. Notably, its application is not limited to microbial genomes but extends to DNA sequences in general, including plasmids. These three bioinformatics solutions have been used in various published and unpublished studies and have proven to be useful software tools for researchers in the field of medical microbiology.
In particular, ASA³P enables researchers to take advantage of modern and scalable IT resources and provides access to a diverse set of proven bioinformatics software tools. Hence, more bacterial genomes and larger cohorts thereof can be processed, characterized and compared with one another, allowing researchers to keep pace with DNA sequencing technologies and future demands. Owing to its extensible framework, ASA³P is not restricted to medical microbiology but can be expanded and adapted to applications within the much larger field of microbial genomics. Furthermore, several ideas for further improvements and potential new software solutions emerged from this work, opening new research questions and establishing interesting subjects for future investigation.