PAGE: A Framework for Easy Parallelization of Genomic Applications

Presentation transcript:

PAGE: A Framework for Easy Parallelization of Genomic Applications
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
IPDPS 2014, Phoenix, Arizona

Motivation
– Sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts

Motivation
Big data problem
– The 1000 Genomes Project has already produced 200 TB of data
– Parallel processing is inevitable!

Typical Analysis on Genomic Data
Single Nucleotide Polymorphism (SNP) calling
– Reference: AGCGTACC
– Alignment File-1 sequences: Read-1 AGCG, Read-2 GCGG, Read-3 GCGTA, Read-4 CGTTCC
– Alignment File-2 sequences: Read-1 AGAG, Read-2 AGAGT, Read-3 GAGT, Read-4 GTTCC
A single SNP may cause Mendelian disease!
*Adapted from Wikipedia
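
To make the slide's example concrete, the following Python sketch piles the reads up against the reference and reports positions where a non-reference base has enough support. The read offsets, the `pileup` and `call_snps` helpers, and the `min_support`/`min_frac` thresholds are all assumptions made for illustration: the slide does not give alignment positions, and real callers such as VarScan use more careful criteria than this simple threshold rule.

```python
from collections import Counter, defaultdict

REFERENCE = "AGCGTACC"

# (offset, read) pairs. The offsets are assumed, not taken from the slide.
FILE_1 = [(0, "AGCG"), (1, "GCGG"), (1, "GCGTA"), (2, "CGTTCC")]
FILE_2 = [(0, "AGAG"), (0, "AGAGT"), (1, "GAGT"), (3, "GTTCC")]


def pileup(alignments):
    """Count the bases observed at each reference position."""
    counts = defaultdict(Counter)
    for offset, read in alignments:
        for i, base in enumerate(read):
            counts[offset + i][base] += 1
    return counts


def call_snps(reference, alignments, min_support=2, min_frac=0.5):
    """Report positions where a non-reference base has enough supporting reads."""
    snps = {}
    for pos, bases in sorted(pileup(alignments).items()):
        depth = sum(bases.values())
        for base, support in bases.items():
            if (base != reference[pos]
                    and support >= min_support
                    and support / depth >= min_frac):
                snps[pos] = (reference[pos], base)
    return snps


print(call_snps(REFERENCE, FILE_2))
print(call_snps(REFERENCE, FILE_1 + FILE_2))
```

Under these assumed offsets, File-2 on its own supports a C to A change at position 2, File-1 on its own supports no change, and a change at position 5 only gains enough support when both files are piled up together.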

Outline
– Motivation
– Existing Solutions for Implementation
– Our Work
– Experimental Evaluation
– Conclusion

Existing Solutions for Implementation
Serial tools
– SamTools, VCFTools, BedTools: file merging, sorting, etc.
– VarScan: SNP calling
Parallel implementations
– Turboblast: searching local alignments
– SEAL: read mapping and duplicate removal
– Biodoop: statistical analysis
Middleware systems
– Hadoop
  Not designed for the specific needs of genetic data
  Limited programmability
– Genome Analysis Tool Kit (GATK)
  Designed for genetic data processing
  Provides special data traversal patterns
  Limited parallelization for some of its tools

Our Goal
We want to develop a middleware system that
– Is specific to parallel genetic data processing
– Allows parallelization of a variety of algorithms for genetic data
– Works with different popular genetic data formats
– Allows the use of existing programs

Challenges
– Load imbalance due to the nature of genomic data
  It is not just an array of A, G, C and T characters
– High overhead of tasks
– I/O contention
(Figure: coverage variance)

Our Work
PAGE: a Map-Reduce-like middleware for easy parallelization of genomic applications
Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language

Intra-dependent Processing
Each file is processed independently
(Diagram: for each input file File-1 through File-m, map tasks process Region-1 through Region-n and produce one intermediate output per region; a reduce task combines them into that file's output, Output-1 through Output-m.)

Inter-dependent Processing
Each map task processes a particular region of ALL files
(Diagram: map tasks for Region-1, ..., Region-k, ..., Region-n read the input files and produce intermediate outputs O1, ..., Ok, ..., On; a reduce task combines them into the output.)
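
The two execution models differ in how map tasks are enumerated and in how many reductions run. A minimal Python sketch follows; the function names and the `(files, region)` task tuples are hypothetical and are not PAGE's actual interface.

```python
def intra_dependent_tasks(files, regions):
    """One map task per (file, region) pair, grouped per file.

    Each file's group is reduced separately, giving one output per file.
    """
    return {f: [((f,), r) for r in regions] for f in files}


def inter_dependent_tasks(files, regions):
    """One map task per region, each reading that region of ALL files.

    A single reduction combines the per-region outputs.
    """
    return [(tuple(files), r) for r in regions]


files = ["File-1", "File-2"]
regions = ["Region-1", "Region-2", "Region-3"]
print(intra_dependent_tasks(files, regions))
print(inter_dependent_tasks(files, regions))
```

With m files and n regions, the intra-dependent model yields m times n map tasks and m reductions, while the inter-dependent model yields n map tasks and one reduction.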

What Can PAGE Parallelize?
PAGE can parallelize all applications that have the following property:
– $M$ is the map task
– $R$, $R_1$ and $R_2$ are three regions such that $R$ is the concatenation of $R_1$ and $R_2$
– $M(R) = M(R_1) \oplus M(R_2)$, where $\oplus$ is the reduction function
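
The property can be checked on a toy map task. In the Python sketch below, the map task reports positions where a sample differs from the reference, and the reduction function is concatenation. The sequences and the names `map_task`, `reduce_outputs` and `is_decomposable` are invented for illustration only.

```python
REFERENCE = "AGCGTACCAGGT"
SAMPLE = "AGAGTTCCAGCT"


def map_task(start, end):
    """M: report mismatching positions in the half-open region [start, end)."""
    return [(pos, REFERENCE[pos], SAMPLE[pos])
            for pos in range(start, end)
            if REFERENCE[pos] != SAMPLE[pos]]


def reduce_outputs(left, right):
    """The reduction function: concatenate outputs in region order."""
    return left + right


def is_decomposable(start, split, end):
    """Check M(R) == M(R1) (+) M(R2) for R = [start, end) cut at split."""
    whole = map_task(start, end)
    parts = reduce_outputs(map_task(start, split), map_task(split, end))
    return whole == parts


print(all(is_decomposable(0, split, len(REFERENCE))
          for split in range(len(REFERENCE) + 1)))
```

The check holds for every split point here because each position is examined independently. A computation whose result at one position depends on data outside its region would not satisfy the property with plain concatenation.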

Data Partitioning
Data is NOT packaged into equal-size data blocks as in Hadoop
– Each application has a different way of reading the data
– Equal-size data block packaging ignores nucleotide base location information
The genome is divided into regions, and each map task is assigned to a region
– Takes location information into account
– The map task is responsible for accessing its particular region of the input files
Region-based access is a common feature of many genomic tools (GATK, SamTools)

Genome Partition
PAGE provides two data partitioning methods
– By-locus partitioning: chromosomes are divided into regions
– By-chromosome partitioning: chromosomes preserve their unity
(Figure: chromosomes Chr-1 through Chr-6 under each partitioning method)
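
The two partitioning methods can be sketched in Python as follows. The `region_size` parameter, the 1-based inclusive coordinates and the function names are assumptions; the slides do not say how PAGE chooses region boundaries.

```python
def by_chromosome(chrom_lengths):
    """Each chromosome is kept whole: one (name, start, end) region per chromosome."""
    return [(name, 1, length) for name, length in chrom_lengths.items()]


def by_locus(chrom_lengths, region_size):
    """Split each chromosome into consecutive regions of at most region_size bases."""
    regions = []
    for name, length in chrom_lengths.items():
        for start in range(1, length + 1, region_size):
            end = min(start + region_size - 1, length)
            regions.append((name, start, end))
    return regions


# Toy chromosome lengths, far smaller than real ones.
lengths = {"Chr-1": 250, "Chr-2": 100}
print(by_chromosome(lengths))
print(by_locus(lengths, region_size=100))
```

By-locus partitioning produces more, smaller tasks, which gives a scheduler more room to balance load; by-chromosome partitioning suits tools that need a whole chromosome at once.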

Task Scheduling
PAGE provides two types of scheduling schemes.
Static
– Each processor is responsible for regions of equal length.
– All map tasks must finish before the reduce tasks execute.
Dynamic
– Map and reduce tasks are assigned by a master process.
– Reduce tasks can start once enough intermediate results are available.
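
A small Python simulation shows why the choice of scheme matters when regions of equal length take unequal time to process. The per-region costs are made up, only the map phase is modeled, and the two functions are a sketch of the general static and dynamic strategies rather than PAGE's implementation.

```python
import heapq


def static_makespan(costs, workers):
    """Static: regions are split into contiguous, equal-sized blocks, one per worker."""
    block = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + block]) for i in range(0, len(costs), block))


def dynamic_makespan(costs, workers):
    """Dynamic: a master hands the next region to whichever worker is idle first."""
    finish_times = [0] * workers
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical processing times for eight equal-length regions;
# the first two stand in for high-coverage regions.
costs = [9, 8, 1, 1, 1, 1, 1, 2]
print(static_makespan(costs, workers=4))
print(dynamic_makespan(costs, workers=4))
```

With these costs the static split leaves one worker with both expensive regions, while the dynamic scheme spreads them across workers and finishes the map phase sooner.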

Applications Developed Using PAGE
We parallelized 4 applications
– VarScan: SNP detection
– Realigner Target Creator: detects insertions/deletions in alignment files
– Indel Realigner: applies local realignment to improve the quality of alignment files
– Unified Genotyper: SNP detection

Sample Application Development with PAGE
Serial execution command of the VarScan software
– samtools mpileup -b file_list -f reference | java -jar VarScan.jar mpileup2snp
To parallelize VarScan with PAGE, the user needs to define:
– Genome Partition: By-Locus
– Scheduling Scheme: Dynamic (or Static)
– Execution Model: Inter-dependent
– Map command: samtools mpileup -b file_list -r regionloc -f reference | java -jar VarScan.jar mpileup2snp >outputloc
– Reduction: the cat bash shell command
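
One way to picture what happens to the map command is placeholder substitution per region. The Python sketch below fills `regionloc` and `outputloc` in the slide's map command and builds the `cat` reduction. The substitution mechanism, the output file naming and the function names are assumptions; the slides do not describe how PAGE expands these placeholders. The `chrom:start-end` region string follows the usual samtools region syntax.

```python
MAP_TEMPLATE = ("samtools mpileup -b file_list -r regionloc -f reference"
                " | java -jar VarScan.jar mpileup2snp >outputloc")


def map_commands(template, regions, output_prefix="out"):
    """Fill regionloc and outputloc for each (chrom, start, end) region."""
    commands, outputs = [], []
    for index, (chrom, start, end) in enumerate(regions):
        region = f"{chrom}:{start}-{end}"
        output = f"{output_prefix}.{index:04d}.snp"  # hypothetical naming scheme
        commands.append(
            template.replace("regionloc", region).replace("outputloc", output))
        outputs.append(output)
    return commands, outputs


def reduce_command(outputs, final_output="result.snp"):
    """The reduction concatenates the per-region outputs in region order."""
    return "cat " + " ".join(outputs) + " >" + final_output


commands, outputs = map_commands(
    MAP_TEMPLATE, [("chr1", 1, 1000), ("chr1", 1001, 2000)])
for command in commands:
    print(command)
print(reduce_command(outputs))
```

The sketch only builds command strings; it does not run samtools or VarScan.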

Experiments
Experimental setup
– In our cluster, each node has 12 GB of memory and 8 cores (2.53 GHz)
– We obtained the data from the 1000 Genomes Project
– We evaluated PAGE with 4 applications
– We compared PAGE with Hadoop Streaming and GATK

Comparison with GATK
Indel Realigner tool of GATK
(Figure: Scalability and Data Size Impact panels. Data size: 11 GB. Speedup label: 9x (PAGE).)

Comparison with GATK
Unified Genotyper tool of GATK
(Figure: Scalability and Data Size Impact panels. Data size: 34 GB; # of cores: 128. Speedup labels: 10.9x (GATK), 12.8x (PAGE).)

Comparison with Hadoop Streaming
VarScan application
(Figure: Scalability and Data Size Impact panels. Data size: 52 GB; # of cores: 128. Speedup labels: 6.9x (Hadoop Streaming), 12.7x (PAGE).)

Summary of Experimental Results
Speedup when the computing power is increased by 16 times:

                   Indel Realigner   Unified Genotyper   VarScan   Realigner Target Creator
PAGE               9x                12.8x               12.7x     14.1x
GATK               3.3x              10.9x               -         -
Hadoop Streaming   -                 -                   6.9x      -
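
The speedups can be turned into parallel efficiency by dividing each by the 16-fold increase in computing power. The Python sketch below does this arithmetic on the numbers from the summary slide; the `efficiency` helper and the dictionary layout are illustrative only.

```python
RESOURCE_INCREASE = 16  # computing power was increased 16 times

# Speedups as reported on the summary slide.
SPEEDUPS = {
    "PAGE": {
        "Indel Realigner": 9.0,
        "Unified Genotyper": 12.8,
        "VarScan": 12.7,
        "Realigner Target Creator": 14.1,
    },
    "GATK": {"Indel Realigner": 3.3, "Unified Genotyper": 10.9},
    "Hadoop Streaming": {"VarScan": 6.9},
}


def efficiency(speedup, resource_increase=RESOURCE_INCREASE):
    """Fraction of the ideal (linear) speedup that was achieved."""
    return speedup / resource_increase


for system, apps in SPEEDUPS.items():
    for app, speedup in apps.items():
        print(f"{system:17s} {app:25s} {efficiency(speedup):.0%}")
```

For example, PAGE's 9x on Indel Realigner corresponds to about 56% efficiency, against about 21% for GATK's 3.3x on the same tool.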

Conclusion
We developed a middleware that
– Easily parallelizes genomic applications
– Has high applicability
  No restriction on programming language or data format
  Allows the use of existing applications
– Lets the user control the parallel execution while hiding the details
  Alternative scheduling schemes, execution models and data partitioning types
– Scales well

Thank you for listening …
Questions