Bio-informatics: Drastically improving the performance of Genome Sequencing

The five years since the introduction of Next-Generation Sequencing (NGS) technology have seen a major transformation in the way scientists extract genetic information from biological systems, revealing extensive insight into the genome, transcriptome, and epigenome of any species. This ability has catalyzed a number of important breakthroughs, advancing scientific fields from human disease research to agriculture and evolutionary science. However, NGS data output has grown at a rate that outpaces Moore's law, more than doubling each year since the technology was introduced; by 2011, output had nearly reached 1 TB of data in a single sequencing run. Because NGS can rapidly generate such large volumes of sequencing data, it needs efficient hardware and software to assemble this data, that is, to align and merge fragments in order to reconstruct the original sequence, in a reasonable amount of time.
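
To make the idea of "aligning and merging fragments" concrete, the following is a minimal sketch of a greedy overlap merge in Python. It is only an illustration, not the assembler used in this work: the function names `overlap` and `greedy_assemble`, the `min_len` parameter, and the toy reads are all assumptions made for the example. Production assemblers work on overlap or de Bruijn graphs and must also handle sequencing errors and reverse-complement reads, which this sketch ignores.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is also a prefix of `b`.

    Returns 0 when the longest match is shorter than `min_len`.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of fragments with the largest overlap.

    Hypothetical helper for illustration: exact matches only, forward
    strand only, quadratic pair search per merge step.
    """
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_k, best_i, best_j = 0, -1, -1
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining pair overlaps by at least min_len
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


if __name__ == "__main__":
    # Three toy fragments that overlap by four bases each.
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Even in this toy form, every merge step compares all pairs of fragments, which hints at why assembly at the scale of NGS output places such heavy demands on hardware and software.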

Current sequencing machines can generate approximately 1,400 whole genomes (equivalent to 200 TB of data) per month. The SAGE storage and processing platform is required to enable efficient analysis of such large genomic data sets. Within SAGE, we aim to build a scalable system for whole-genome sequencing that supports fast storage, alignment, and operational queries. In SAGE, computation on genomes will be built around the principle of Percipient Storage, in which computation is enabled at different tiers of the I/O stack rather than streaming data from storage to compute nodes for processing.
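
As a rough back-of-the-envelope check on these volumes, the short Python sketch below derives the average data size per genome and the sustained ingest rate implied by the figures above. The 30-day month and the use of decimal units (1 TB = 1,000 GB = 10^12 bytes) are assumptions made for the calculation, and the constant and function names are purely illustrative.

```python
GENOMES_PER_MONTH = 1_400   # from the text
DATA_PER_MONTH_TB = 200     # from the text
DAYS_PER_MONTH = 30         # assumption: a 30-day month
BYTES_PER_TB = 10**12       # assumption: decimal units


def gb_per_genome() -> float:
    """Average data volume per whole genome, in GB."""
    return DATA_PER_MONTH_TB * 1_000 / GENOMES_PER_MONTH


def tb_per_day() -> float:
    """Average data volume generated per day, in TB."""
    return DATA_PER_MONTH_TB / DAYS_PER_MONTH


def sustained_mb_per_s() -> float:
    """Average ingest rate if the data arrives evenly over the month, in MB/s."""
    seconds = DAYS_PER_MONTH * 24 * 60 * 60
    return DATA_PER_MONTH_TB * BYTES_PER_TB / seconds / 10**6


if __name__ == "__main__":
    print(f"{gb_per_genome():.0f} GB per genome")
    print(f"{tb_per_day():.1f} TB per day")
    print(f"{sustained_mb_per_s():.1f} MB/s sustained")
```

Under these assumptions each genome accounts for roughly 143 GB and the machines produce close to 7 TB per day. The average ingest rate is modest, which suggests that the harder problem is repeatedly moving and analysing the accumulated data, the case that computing close to the storage tiers is meant to address.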