Cluster-based SNP Calling on Large Scale Genome Sequencing Data
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University


Cluster-based SNP Calling on Large Scale Genome Sequencing Data
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
CCGrid 2014, Chicago, IL

What is an SNP?
– Stands for Single-Nucleotide Polymorphism
– A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species
– Essential for medical research and for developing personalized medicine
– A single SNP may cause a Mendelian disease
*Adapted from Wikipedia

Motivation
– Sequencing costs are decreasing.
*Adapted from genome.gov/sequencingcosts

Motivation
– Big data problem: the 1000 Genomes Project has already produced 200 TB of data.
– Parallel processing is inevitable!

Outline
– Motivation
– Parallel SNP Calling
– Proposed Scheduling Schemes
– Experiments
– Conclusion

General Idea of SNP Calling Algorithms
Reference: AGCGTACC
Alignment File-1 reads: AGCG, GCGG, GCGTA, CGTTCC
Alignment File-2 reads: AGAG, AGAGT, GAGT, GTTCC
Two main observations:
– To detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location.
– The existence of an SNP at one location is independent of the others.
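The per-location check above can be sketched in Python. The function name, the pileup input format, and the simple frequency threshold are illustrative assumptions, not the actual VarScan criteria:

```python
def call_snp(ref_base, pileups, min_alt_frac=0.2):
    """pileups: one list of aligned bases per sample, all at a single location.
    Returns the alternate base if an SNP is called, else None."""
    bases = [b for sample in pileups for b in sample]  # pool ALL samples
    if not bases:
        return None
    # Count bases that disagree with the reference
    alt_counts = {}
    for b in bases:
        if b != ref_base:
            alt_counts[b] = alt_counts.get(b, 0) + 1
    if not alt_counts:
        return None
    # Call the most frequent alternate base if it clears the threshold
    alt = max(alt_counts, key=alt_counts.get)
    return alt if alt_counts[alt] / len(bases) >= min_alt_frac else None
```

Because each location is decided independently once all samples' alignments for it are available, locations can be processed in parallel, which is the basis of the schemes that follow.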

Parallel SNP Calling
How should the data be distributed among nodes? Three ways to divide the genome files across processors:
– Location-based: each processor handles a contiguous range of locations across all samples.
– Sample-based: each processor handles all locations of a subset of samples.
– Checkerboard: each processor handles a block of locations for a subset of samples.
Sample-based and checkerboard division require communication among processes, since calling an SNP at a location needs the alignments from all samples.
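A minimal sketch of location-based division, the scheme all three proposed schedulers build on; the uniform split shown here is exactly what coverage variance later breaks:

```python
def location_based_division(genome_length, num_procs):
    """Split [0, genome_length) into contiguous, near-equal regions, one per
    process.  Every sample's alignments for a region go to the same process,
    so no inter-process communication is needed at call time."""
    base, extra = divmod(genome_length, num_procs)
    regions, start = [], 0
    for p in range(num_procs):
        size = base + (1 if p < extra else 0)  # spread the remainder
        regions.append((start, start + size))
        start += size
    return regions
```

Equal-length regions only give equal work if alignments are spread uniformly, which the coverage-variance histogram on a later slide shows is far from true.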

Challenges
– Load imbalance due to the nature of genomic data: it is not just an array of A, G, C and T characters; coverage varies widely across the genome (coverage variance)
– I/O contention
– High overhead of random access to a particular region

Histogram Showing Coverage Variance
(Figure: alignment counts per interval for Chromosome 1, locations 1-200M, 256 samples, 1M interval size.)

Outline
– Motivation
– Parallel SNP Calling
– Proposed Scheduling Schemes
– Experiments
– Conclusion

Proposed Scheduling Schemes
– Dynamic Scheduling
– Static Scheduling
– Combined Scheduling
Each scheduling scheme uses location-based data division: the genome is divided into regions, and each task is responsible for a region.

Dynamic Scheduling
– Master and worker approach; tasks are assigned dynamically.
– Two types of data chunks are used:
– Big chunk: covers B locations
– Small chunk: covers S locations
– B > S
– Big chunks are assigned first, then small chunks.
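The big-then-small chunk ordering can be sketched as a work queue that the master drains as workers request tasks. `build_chunk_queue` and `assign_next` are hypothetical names, and the real implementation dispatches regions of alignment files rather than bare coordinates:

```python
from collections import deque

def build_chunk_queue(genome_length, B, S, num_big):
    """Big chunks (B locations each) first, then small chunks (S locations)
    over whatever remains; B > S.  Small chunks at the tail let the master
    even out load near the end of the run."""
    assert B > S
    queue, start = deque(), 0
    for _ in range(num_big):
        if start + B > genome_length:
            break
        queue.append((start, start + B))
        start += B
    while start < genome_length:
        queue.append((start, min(start + S, genome_length)))
        start += S
    return queue

def assign_next(queue):
    """Master side: pop the next chunk for a requesting worker."""
    return queue.popleft() if queue else None
```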

Static Scheduling
– Pre-processing step: count the number of alignments in each region and generate a histogram.
– Estimated cost: an estimation function over the histogram drives the data partitioning. Its parameters are:
– k: the histogram interval size
– T_R: the cost of accessing/reading a region
– T_P: the cost of processing one alignment
– N(l): the number of alignments at location l
– Each task is responsible for regions having the same estimated cost.
– Tasks are scheduled statically; no master-worker approach is needed.
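The transcript lists the parameters of the estimation function but not the function itself. A plausible form consistent with those parameters (a fixed access cost T_R per region plus T_P per alignment) can be sketched as follows, together with a greedy equal-cost partitioner; both are assumptions, not the paper's exact scheme:

```python
def estimated_cost(hist, region, T_R, T_P):
    """hist[i] = N(l) summed over the i-th interval of k locations.
    Assumed model: a fixed read/seek cost T_R per region plus T_P per
    alignment contained in it."""
    start, end = region  # indices into the histogram
    return T_R + T_P * sum(hist[start:end])

def static_partition(hist, num_tasks, T_R, T_P):
    """Greedy split: grow each task's region until it reaches the average
    estimated cost, so every task gets roughly the same work."""
    total = T_R * num_tasks + T_P * sum(hist)
    target = total / num_tasks
    parts, start, acc = [], 0, T_R
    for i, n in enumerate(hist):
        acc += T_P * n
        if acc >= target and len(parts) < num_tasks - 1:
            parts.append((start, i + 1))
            start, acc = i + 1, T_R
    parts.append((start, len(hist)))
    return parts
```

With a skewed histogram such as [40, 10, 10, 40], this splits into regions of unequal length but equal estimated cost, unlike the uniform split of the basic scheme.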

Combined Scheduling
– A combination of static and dynamic scheduling
– Small and big chunks are used, as in dynamic scheduling
– The sizes of the chunks are determined according to the histogram
– Master-worker approach
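Sizing chunks from the histogram can be sketched as cutting until each chunk reaches a target estimated cost, so dense regions get short chunks and sparse regions long ones; `cost_balanced_chunks` is an illustrative name and the target-cost model is an assumption:

```python
def cost_balanced_chunks(hist, target_cost, T_P=1):
    """Cut the genome into variable-length chunks whose estimated processing
    cost (T_P x alignments, per histogram interval) is roughly target_cost
    each.  A large target yields the big chunks, a small target the small
    chunks; the resulting queue is then dispatched by the master as in
    dynamic scheduling."""
    chunks, start, acc = [], 0, 0
    for i, n in enumerate(hist):
        acc += T_P * n
        if acc >= target_cost:
            chunks.append((start, i + 1))
            start, acc = i + 1, 0
    if start < len(hist):
        chunks.append((start, len(hist)))  # remainder becomes a final chunk
    return chunks
```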

Parameters of Scheduling Schemes
Our proposed scheduling schemes have user-defined parameters:
– Dynamic Scheduling: the lengths of the big and small chunks
– Static Scheduling: the histogram interval size and the estimation function parameters
– Combined Scheduling: all parameters of dynamic and static scheduling
All parameters can be determined with an offline training phase.

Outline
– Motivation
– Parallel SNP Calling
– Proposed Scheduling Schemes
– Experiments
– Conclusion

Experiments
– Local cluster; each node has two quad-core 2.53 GHz Xeon(R) processors and 12 GB RAM.
– We obtained the genomes of 256 samples from the 1000 Genomes Project.
– The data is replicated to all local disks unless noted otherwise.
– Parallel implementation:
– We implemented VarScan in the C programming language, modified so that BAM files can be read directly.
– Used the MPI library for parallelization.

Experiments: Scalability
First 192M locations of Chr. 1.

Scheduling Scheme   Scalability
Basic                8.4x
Dynamic             10.9x
Static              19.7x
Combined            23.5x

Experiments: Data Size Impact
(Figure: impact of growing data size; a fixed number of cores is allocated.)

Experiments: I/O Contention Impact
(A fixed number of cores is allocated.)

Scheduling Scheme   I/O Contention Impact (sec)
Basic               174
Dynamic             229
Static              251
Combined            220

Comparison with Hadoop
– First 192M locations of Chr. 2 in 512 samples are analyzed.
– Lower (dark) portions of the bars show pre-processing time.

Scheduling With Replication
– Data-intensive processing motivates new schemes
– Replicate each chunk a fixed/variable number of times
– Dynamic scheduling while processing only local chunks
– Interesting new tradeoffs
– Under submission

Other Work
PAGE: a MapReduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)
– Mappers and reducers are executable programs
– Allows us to exploit existing applications
– No restriction on programming language

PAGE vs. State-of-the-Art
A middleware system that is:
– Specific to parallel genomic data processing
– Able to parallelize a variety of genomic algorithms
– Able to work with different popular genomic data formats
– Able to use existing programs

Conclusion
– We have developed a methodology for parallel identification of variants in large-scale genome sequencing data.
– Coverage variance and I/O contention are the two main problems.
– We proposed three scheduling schemes; combined scheduling gives the best results.
– Our approach achieves good speedup and outperforms Hadoop.