Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data. Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang, The Ohio State University.

Presentation transcript:

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data
Yi Wang, Gagan Agrawal, Gulcin Ozer and Kun Huang
The Ohio State University
HiCOMB 2014, May 19th, Phoenix, Arizona

Outline
– Introduction
– Sequence Data Format Converter Design
– Experimental Results
– Conclusion

Explosion of Next-Generation Sequencing Data
NGS Advantages
– Faster and cheaper, e.g., over one billion short reads per instrument run
– More accurate: higher resolution and deeper coverage
Challenges
– Urgent need for turning raw data into knowledge
– Parallelism is the key

Historical Trends in Storage Prices vs. DNA Sequencing Costs
[Figure: storage prices vs. DNA sequencing costs over time, as reported by Lincoln Stein]

Varieties of NGS Data Formats
Different Formats
– SAM (Sequence Alignment/Map): the de facto text format for storing large nucleotide sequence alignments
– BAM (Binary Alignment/Map): the compressed, indexable, binary form of the SAM format; indexing is supported by the BAI (BAM Index) file
– Other formats: BED (Browser Extensible Data), FASTA, FASTQ, WIG (wiggle), GFF (General Feature Format), etc.

Analysis Pipeline
Current Pipeline
– Parallelism mainly focuses on the analysis steps, e.g., SNP discovery and BLAST
Reality
– Cross-utilization problem: sequencing data ≠ input
– Some other analysis steps stay sequential
– Need to remove the other sequential bottlenecks

Motivation: Removing Other Sequential Bottlenecks
Parallel Format Conversion
– Current format conversion commonly makes use of a single core
– Current downstream tools often cannot be shared across different aligners
– Not hard to implement, but important to scale out
Parallelizing Certain Statistical Analysis Steps
– E.g., parallel analysis on the histogram data

Framework
Sequence Data Format Converter
– Input: SAM/BAM
– Output: BAM/SAM, FASTA, FASTQ, BED, BEDGRAPH, JSON and YAML
Statistical Analysis Module
– Parallelize other statistical analysis steps
– E.g., non-local means (NL-Means) and false discovery rate (FDR) computation
Only the first component is discussed today.
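To make the converter's per-record work concrete, here is a minimal Python sketch of turning one SAM alignment line into a BED line. It is an illustration only, not the authors' implementation: the function name `sam_record_to_bed` is hypothetical, and the choice of six BED columns with MAPQ as the score is an assumption.

```python
import re

# One CIGAR element: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that advance the position on the reference.
REF_CONSUMING = set("MDN=X")

def sam_record_to_bed(line):
    """Convert one SAM alignment line to a 6-column BED line (None if unmapped)."""
    fields = line.rstrip("\n").split("\t")
    qname, flag, rname, pos, mapq, cigar = fields[:6]
    flag = int(flag)
    if flag & 0x4 or rname == "*" or cigar == "*":
        return None  # unmapped read: no interval to report
    start = int(pos) - 1  # SAM POS is 1-based; BED start is 0-based
    ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar)
                  if op in REF_CONSUMING)
    strand = "-" if flag & 0x10 else "+"  # 0x10 marks a reverse-strand alignment
    return "\t".join([rname, str(start), str(start + ref_len),
                      qname, mapq, strand])
```

Because each record is converted on its own, a function of this shape can be applied to any partition of the input without coordination between processes.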

Sequence Data Format Converter
Three Converter Instances
– SAM Format Converter
– BAM Format Converter
– Preprocessing-Optimized SAM Format Converter
Supports partial format conversion on a specific chromosome region

SAM Format Converter
– No communication among processes after partitioning
– Partitioning is the key step for parallelization
– Extensibility and programmability

Partitioning Algorithm
Key: each SAM record is delimited by a line break
1. Initial even partitioning
2. Adjust partition boundaries by detecting line breaks
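The two steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function name `partition_by_lines` is hypothetical, the input is a plain text file on a local file system, and SAM header lines (those starting with `@`) are not treated specially.

```python
import os

def partition_by_lines(path, num_parts):
    """Return num_parts (start, end) byte ranges that each cover whole lines."""
    size = os.path.getsize(path)
    # Step 1: initial even partitioning by byte offset.
    cuts = [size * i // num_parts for i in range(num_parts + 1)]
    with open(path, "rb") as f:
        # Step 2: push every interior cut forward to the start of the next line.
        for i in range(1, num_parts):
            if cuts[i] == 0:
                continue  # already at the start of the file
            f.seek(cuts[i] - 1)
            f.readline()  # reads through the next "\n" (or to end of file)
            cuts[i] = f.tell()
    return list(zip(cuts[:-1], cuts[1:]))
```

Each process can then read only its own byte range and parse complete records. A range may come out empty when a single line spans more than one initial partition.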

BAM Format Converter
Challenge
– No explicit delimiter
– Even partitioning -> unparsable records
Solution: add a preprocessing phase
– Partition data by supporting random access
– The preprocessing phase cannot be parallelized because of the third-party API

BAMX and BAIX
BAMX (BAM eXtended) File
– Transform each varying-length BAM record into a regular-layout BAMX record
– Align varying-length BAM fields by padding
BAIX (BAI eXtended) File
– Index file of the BAMX file
– Store the alignment starting positions in BAM (logically) and in BAMX (physically)
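The padding idea can be illustrated with a small Python sketch. The slides do not specify the actual BAMX layout, so everything here is an assumption made for illustration: the record holds only a start position and a sequence, the field widths are arbitrary, and the names `pack_fixed` and `read_record` are hypothetical. The point is only that once every record has the same size, record i sits at byte offset i × record size and can be read by random access.

```python
import struct

def pack_fixed(records, seq_width):
    """Pack (pos, seq) pairs into fixed-size records; seq is padded to seq_width.

    Assumes seq_width is at least the length of the longest sequence.
    """
    # Little-endian, no alignment: 4-byte position, 2-byte true length, padded bytes.
    fmt = "<iH%ds" % seq_width
    record_size = struct.calcsize(fmt)
    blob = b"".join(struct.pack(fmt, pos, len(seq), seq) for pos, seq in records)
    return blob, record_size

def read_record(blob, seq_width, i):
    """Read record i directly by offset, without scanning earlier records."""
    fmt = "<iH%ds" % seq_width
    pos, length, padded = struct.unpack_from(fmt, blob, i * struct.calcsize(fmt))
    return pos, padded[:length]  # drop the padding bytes
```

The trade-off in a layout like this is extra storage for the padding in exchange for records that can be located, and therefore partitioned, by simple arithmetic.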

Partial Conversion
If only a subset is of interest, there is no need for full conversion
Based on the BAIX file
– Given logical alignment starting and ending positions, locate the physical starting and ending positions in the BAMX file (by binary search)
– Evenly partition the subset and proceed in parallel
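A minimal Python sketch of this lookup, under simplifying assumptions: the index is modeled as a sorted list of alignment start positions for one chromosome, where the list offset of an entry equals its record number in the fixed-layout file. The names `locate_region` and `split_evenly` are hypothetical, and the real BAIX structure may differ.

```python
from bisect import bisect_left, bisect_right

def locate_region(index_positions, region_start, region_end):
    """Binary-search the sorted start positions for the records in the region.

    Returns a half-open record range (lo, hi). Only alignments that START
    inside [region_start, region_end] are selected in this sketch.
    """
    lo = bisect_left(index_positions, region_start)
    hi = bisect_right(index_positions, region_end)
    return lo, hi

def split_evenly(lo, hi, num_procs):
    """Divide the record range [lo, hi) into num_procs near-equal sub-ranges."""
    n = hi - lo
    return [(lo + n * p // num_procs, lo + n * (p + 1) // num_procs)
            for p in range(num_procs)]
```

With fixed-size records, each sub-range maps directly to a byte range, so every process can seek to its own slice and convert it independently.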

Preprocessing-Optimized SAM Format Converter
Main Ideas
– Preprocessing can also optimize the SAM format conversion
– Such preprocessing can be parallelized because of the easy partitioning on the SAM format
[Figure labels: M procs, N procs, M × N target files]

Outline
– Introduction
– Sequence Data Format Converter Design
– Parallelization of Statistical Analysis Steps
– Experimental Results
– Conclusion

Experimental Setup
Dataset
– Whole-genome DNA sequencing of three mouse samples
– Approximately 125 million sequences providing about 40-fold coverage of the genome
– In the SAM/BAM format
Cluster
– 8 GB memory
– Up to 32 8-core machines (256 cores in total)

Performance of SAM Format Converter
– Input: 100 GB SAM data
– Output: BED, BEDGRAPH and FASTA

Performance of BAM Format Converter
– Input: 117 GB BAM data
– Output: BED, BEDGRAPH and FASTA

SAM Format Converter Comparison: Preprocessing-Optimized vs. Original
– Input: 15.7 GB BAM data
– Output: BED, BEDGRAPH and FASTA

Conclusion
In the NGS analysis pipeline, the overall latency cannot be reduced unless all sequential bottlenecks are removed
The first framework that can easily support parallel sequence format conversion in a distributed environment
– SAM format converter
– BAM format converter
– Preprocessing-optimized SAM format converter