Persona: A High-Performance Bioinformatics Framework

Authors:

Stuart Byma and Sam Whitlock, EPFL; Laura Flueratoru, University Politehnica of Bucharest; Ethan Tseng, CMU; Christos Kozyrakis, Stanford University; Edouard Bugnion and James Larus, EPFL

Abstract:

Next-generation genome sequencing technology has reached a point at which it is becoming cost-effective to sequence all patients. Biobanks and researchers are faced with an oncoming deluge of genomic data, whose processing requires new and scalable bioinformatics architectures and systems. Processing raw genetic sequence data is computationally expensive and datasets are large. Current software systems can require many hours to process a single genome and generally run only on a single computer. Common file formats are monolithic and row-oriented, a barrier to distributed computation.

To address these challenges, we built Persona, a cluster-scale, high-throughput bioinformatics framework. Persona currently supports paired-read alignment, sorting, and duplicate marking using well-known algorithms and techniques. Persona can significantly reduce end-to-end processing times for bioinformatics computations. A new Aggregate Genomic Data (AGD) format unifies sample data and analysis results, while enabling efficient distributed computation and I/O.
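The contrast between a monolithic, row-oriented file and a layout that is easier to distribute can be illustrated with a small sketch. The code below is not the AGD specification: the field names (`name`, `bases`, `quals`), the chunk size, and the function `to_column_chunks` are all hypothetical, chosen only to show how splitting records into fixed-size chunks with one list per field lets workers fetch just the chunks and fields they need.

```python
from typing import Dict, List

# A read as it might appear in a row-oriented file: every field of one
# record is stored together. Field names here are illustrative assumptions.
Record = Dict[str, str]


def to_column_chunks(records: List[Record], chunk_size: int) -> List[Dict[str, List[str]]]:
    """Split row-oriented records into fixed-size chunks, one list per field.

    Hypothetical layout for illustration only; it does not reproduce AGD.
    """
    chunks = []
    for start in range(0, len(records), chunk_size):
        rows = records[start:start + chunk_size]
        chunks.append({
            "names": [r["name"] for r in rows],
            "bases": [r["bases"] for r in rows],
            "quals": [r["quals"] for r in rows],
        })
    return chunks


# Five toy reads split into chunks of two records each.
reads = [{"name": f"read{i}", "bases": "ACGT", "quals": "IIII"} for i in range(5)]
chunks = to_column_chunks(reads, chunk_size=2)

# An aligner-like worker that only needs bases can take one chunk's
# "bases" column and ignore names and quality scores entirely.
bases_for_worker = chunks[0]["bases"]
```

In this toy layout each chunk is an independent unit of work, which is the property a single monolithic file lacks.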

In a case study on sequence alignment, Persona sustains 1.353 gigabases aligned per second with 101-base-pair reads on a 32-node cluster and can align a full genome in ~16.7 seconds using the SNAP algorithm. Our results demonstrate that: (1) alignment computation with Persona scales linearly across servers with no measurable completion-time imbalance and negligible framework overheads; (2) on a single server, sorting with Persona and AGD is up to 2.3× faster than commonly used tools, while duplicate marking is 3× faster; (3) with AGD, a 7-node COTS network storage system can service up to 60 alignment compute nodes; (4) server cost dominates for a balanced system running Persona, while long-term data storage dwarfs the cost of computation.
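The headline figures above can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted in the abstract; the per-node figure assumes throughput divides evenly across the 32 nodes, and the storage ratio is an upper bound because the abstract says "up to 60" compute nodes. The function and key names are illustrative.

```python
# Figures quoted in the abstract.
THROUGHPUT_GBASES_PER_S = 1.353  # sustained gigabases aligned per second
ALIGN_TIME_S = 16.7              # approximate time to align one full genome
READ_LENGTH_BP = 101             # bases per read
CLUSTER_NODES = 32               # servers in the alignment cluster
STORAGE_NODES = 7                # servers in the network storage system
MAX_COMPUTE_NODES = 60           # compute nodes that storage system can service


def derived_figures() -> dict:
    """Derive rough totals implied by the abstract's numbers."""
    total_gbases = THROUGHPUT_GBASES_PER_S * ALIGN_TIME_S
    total_reads_millions = total_gbases * 1e9 / READ_LENGTH_BP / 1e6
    # Assumes an even split across nodes (an assumption, not a reported value).
    per_node_mbases_per_s = THROUGHPUT_GBASES_PER_S * 1e3 / CLUSTER_NODES
    compute_nodes_per_storage_node = MAX_COMPUTE_NODES / STORAGE_NODES
    return {
        "total_gbases": total_gbases,
        "total_reads_millions": total_reads_millions,
        "per_node_mbases_per_s": per_node_mbases_per_s,
        "compute_nodes_per_storage_node": compute_nodes_per_storage_node,
    }


figs = derived_figures()
for key, value in figs.items():
    print(f"{key}: {value:.2f}")
```

Under these assumptions, the run corresponds to roughly 22.6 gigabases (about 224 million reads) aligned, about 42 megabases per second per node, and at most about 8.6 compute nodes per storage node.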

BibTeX

@inproceedings {203159,
author = {Stuart Byma and Sam Whitlock and Laura Flueratoru and Ethan Tseng and Christos Kozyrakis and Edouard Bugnion and James Larus},
title = {Persona: A {High-Performance} Bioinformatics Framework},
booktitle = {2017 USENIX Annual Technical Conference (USENIX ATC 17)},
year = {2017},
isbn = {978-1-931971-38-6},
address = {Santa Clara, CA},
pages = {153--165},
url = {https://www.usenix.org/conference/atc17/technical-sessions/presentation/byma},
publisher = {USENIX Association},
month = jul
}
