Cluster-based SNP Calling on Large Scale Genome Sequencing Data. Mucahid Kutlu, Gagan Agrawal. Department of Computer Science and Engineering, The Ohio State University.


1 Cluster-based SNP Calling on Large Scale Genome Sequencing Data
Mucahid Kutlu, Gagan Agrawal
Department of Computer Science and Engineering, The Ohio State University
CCGrid 2014, Chicago, IL

2 What is SNP?
Stands for Single-Nucleotide Polymorphism
A DNA sequence variation that occurs when a single nucleotide differs between members of a biological species.
Essential for medical research and for developing personalized medicine.
A single SNP may cause a Mendelian disease.
*Adapted from Wikipedia

3 Motivation
Sequencing costs are decreasing
*Adapted from genome.gov/sequencingcosts

4 Motivation
Big data problem
  – The 1000 Human Genome Project has already produced 200 TB of data
  – Parallel processing is inevitable!
*Adapted from https://www.nlm.nih.gov/about/2015CJ.html

5 Outline
Motivation
Parallel SNP Calling
Proposed Scheduling Schemes
Experiments
Conclusion

6 General Idea of SNP Calling Algorithms
Reference (positions 1-8): AGCGTACC
Alignment File-1: Read-1 AGCG, Read-2 GCGG, Read-3 GCGTA, Read-4 CGTTCC
Alignment File-2: Read-1 AGAG, Read-2 AGAGT, Read-3 GAGT, Read-4 GTTCC
Two main observations:
In order to detect an SNP at a certain location, we have to check the alignments in ALL genomes at that location.
The existence of an SNP at one location is independent of the others.
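The two observations above can be illustrated with a toy caller. This is a minimal sketch, not VarScan's algorithm: the read start positions are assumptions (the slide's figure offsets did not survive; they were chosen so each read has at most one mismatch against the reference), and the thresholds `min_alt_reads` and `min_alt_fraction` are hypothetical.

```python
from collections import Counter

REFERENCE = "AGCGTACC"

# (1-based start position, read sequence). Start positions are ASSUMED,
# not taken from the slide.
ALIGNMENT_FILE_1 = [(1, "AGCG"), (2, "GCGG"), (2, "GCGTA"), (3, "CGTTCC")]
ALIGNMENT_FILE_2 = [(1, "AGAG"), (1, "AGAGT"), (2, "GAGT"), (4, "GTTCC")]


def pileup(alignment_files, length):
    """Count the bases observed at every location across ALL alignment files."""
    columns = [Counter() for _ in range(length)]
    for reads in alignment_files:
        for start, seq in reads:
            for offset, base in enumerate(seq):
                columns[start - 1 + offset][base] += 1
    return columns


def call_snps(reference, alignment_files, min_alt_reads=2, min_alt_fraction=0.3):
    """Each location is decided independently of every other location."""
    calls = []
    columns = pileup(alignment_files, len(reference))
    for pos, (ref_base, column) in enumerate(zip(reference, columns), start=1):
        depth = sum(column.values())
        alts = {b: n for b, n in column.items() if b != ref_base}
        if not alts:
            continue
        alt_base, alt_count = max(alts.items(), key=lambda kv: kv[1])
        # Hypothetical filter: enough supporting reads and a large enough share.
        if alt_count >= min_alt_reads and alt_count / depth >= min_alt_fraction:
            calls.append((pos, ref_base, alt_base))
    return calls


if __name__ == "__main__":
    print(call_snps(REFERENCE, [ALIGNMENT_FILE_1, ALIGNMENT_FILE_2]))
```

Because each location is decided on its own column of the pileup, the loop over positions can be split across processes by region, provided every process can see the reads of all samples for its region.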

7 Parallel SNP Calling
How to distribute data among nodes?
Figure: the genome files divided among Processors 1-4 in three ways: location-based, sample-based, and checkerboard.
Figure annotation: requires communication among processes.

8 Challenges
Load imbalance due to the nature of genomic data
  – It is not just an array of A, G, C and T characters
I/O contention
High overhead of random access to a particular region
Figure: coverage variance.

9 Histogram Showing Coverage Variance
Chromosome: 1
Locations: 1-200M
Number of samples: 256
Interval size: 1M

10 Outline
Motivation
Parallel SNP Calling
Proposed Scheduling Schemes
Experiments
Conclusion

11 Proposed Scheduling Schemes
Dynamic Scheduling
Static Scheduling
Combined Scheduling
Each scheduling scheme uses location-based data division. That is, the genome is divided into regions and each task is responsible for a region.

12 Dynamic Scheduling
Master & Worker approach
Tasks are assigned dynamically
Two types of data chunks are used
  – Big chunk: covers B locations
  – Small chunk: covers S locations
  – B > S
Big chunks are assigned first, then small chunks are assigned.
Figure: chunks of length B marked on Alignment File-1 and Alignment File-2.
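The big-then-small chunk idea can be sketched as a simulated master-worker loop. This is an illustration under assumptions, not the authors' MPI implementation: the slide does not say how much of the genome is covered by big chunks, so `big_fraction` is a hypothetical parameter, and `cost` stands in for the real processing time of a chunk.

```python
import heapq


def build_chunks(n_locations, big, small, big_fraction=0.75):
    """Return (start, end) chunks: big chunks first, then small ones.

    big_fraction is a HYPOTHETICAL knob; the slide only states B > S.
    """
    if not big > small > 0:
        raise ValueError("expected B > S > 0")
    chunks, start = [], 0
    big_limit = int(n_locations * big_fraction)
    while start + big <= big_limit:
        chunks.append((start, start + big))
        start += big
    while start < n_locations:
        end = min(start + small, n_locations)
        chunks.append((start, end))
        start = end
    return chunks


def simulate_master_worker(chunks, n_workers, cost):
    """The master hands the next chunk to whichever worker becomes idle first."""
    idle = [(0.0, w) for w in range(n_workers)]  # (time worker is free, id)
    heapq.heapify(idle)
    assignment = {w: [] for w in range(n_workers)}
    for chunk in chunks:
        free_at, worker = heapq.heappop(idle)
        assignment[worker].append(chunk)
        heapq.heappush(idle, (free_at + cost(chunk), worker))
    makespan = max(t for t, _ in idle)
    return assignment, makespan


if __name__ == "__main__":
    chunks = build_chunks(100, big=20, small=5, big_fraction=0.6)
    _, makespan = simulate_master_worker(chunks, 2, lambda c: c[1] - c[0])
    print(len(chunks), makespan)
```

Big chunks keep the number of master-worker exchanges low early on, while the small chunks at the end let workers that finish early pick up the remaining work in fine-grained pieces.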

13 Static Scheduling
Pre-processing step
  – We count the number of alignments for each region and generate a histogram
Estimated cost
  – We use an estimation function and our histogram for data partitioning.
  – k: histogram interval
  – T_R: cost of accessing/reading the region
  – T_P: cost of processing an alignment
  – N(l): number of alignments at location l
  – Each task is responsible for regions having the same estimated cost.
Tasks are scheduled statically; there is no master-worker approach.
Figure: Alignment File-1 and Alignment File-2.
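The estimation function itself did not survive in this transcript. The sketch below therefore ASSUMES a simple linear form, cost(k) = T_R + T_P * (sum of N(l) over the locations l in interval k), and uses a greedy prefix split to form tasks of roughly equal estimated cost. Both the formula and the splitting rule are illustrative assumptions, not the authors' stated method.

```python
def interval_costs(histogram, t_read, t_process):
    """Estimated cost per histogram interval.

    histogram[k] is the number of alignments counted in interval k.
    ASSUMED form: fixed access cost plus per-alignment processing cost.
    """
    return [t_read + t_process * n for n in histogram]


def partition_by_cost(costs, n_tasks):
    """Split consecutive intervals into tasks with roughly equal total cost.

    Returns (first_interval, end_interval) pairs, end exclusive. If a single
    interval is very expensive, fewer than n_tasks tasks may be returned.
    """
    total = sum(costs)
    tasks, start, running = [], 0, 0.0
    for i, cost in enumerate(costs):
        running += cost
        target = total * (len(tasks) + 1) / n_tasks
        # Close the current task once its cumulative target is reached;
        # the last task always takes whatever remains.
        if running >= target and len(tasks) < n_tasks - 1:
            tasks.append((start, i + 1))
            start = i + 1
    tasks.append((start, len(costs)))
    return tasks


if __name__ == "__main__":
    costs = interval_costs([30, 10, 10, 10, 20, 40], t_read=2, t_process=1)
    for first, end in partition_by_cost(costs, 3):
        print(first, end, sum(costs[first:end]))
```

With the sample histogram the three tasks cover two, three, and one interval respectively, yet their estimated costs (44, 46, 42) are close, which is the point of partitioning by cost rather than by length.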

14 Combined Scheduling
Combination of static and dynamic scheduling
We use small and big chunks as in dynamic scheduling
The sizes of the chunks are determined according to the histogram
Master-Worker approach
Figure: Alignment File-1 and Alignment File-2 divided into big chunks and small chunks.
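One way to read "chunk sizes determined according to the histogram" is that chunks are sized by estimated cost rather than by a fixed number of locations. The sketch below follows that reading; the cost budgets `big_cost` and `small_cost` and the switch point `big_share` are hypothetical parameters, and `costs` is assumed to hold one estimated cost per histogram interval.

```python
def cost_sized_chunks(costs, big_cost, small_cost, big_share=0.75):
    """Cut chunks at histogram-interval boundaries by estimated cost.

    Chunks are closed once they reach big_cost until big_share of the total
    cost has been chunked; after that the smaller budget small_cost is used.
    All three parameters are ASSUMPTIONS made for illustration.
    """
    total = sum(costs)
    chunks, start, running, done = [], 0, 0.0, 0.0
    for i, cost in enumerate(costs):
        running += cost
        budget = big_cost if done < big_share * total else small_cost
        if running >= budget:
            chunks.append((start, i + 1))  # end index is exclusive
            done += running
            start, running = i + 1, 0.0
    if start < len(costs):
        chunks.append((start, len(costs)))  # leftover tail, if any
    return chunks


if __name__ == "__main__":
    costs = [10, 10, 10, 10, 40, 40, 10, 10, 10, 10]
    print(cost_sized_chunks(costs, big_cost=40, small_cost=10))
```

In the sample, a big chunk spans four intervals in the sparse region but only one in the dense region, so chunks handed out by the master carry comparable work even when coverage varies.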

15 Parameters of Scheduling Schemes
Our proposed scheduling schemes have user-defined parameters
  – Dynamic Scheduling: length of big and small chunks
  – Static Scheduling: histogram interval size, estimation function parameters
  – Combined Scheduling: all parameters for dynamic and static scheduling
All parameters can be determined with an offline training phase

16 Outline
Motivation
Parallel SNP Calling
Proposed Scheduling Schemes
Experiments
Conclusion

17 Experiments
Local cluster; each node has two quad-core 2.53 GHz Xeon(R) processors and 12 GB RAM
We obtained genomes of 256 samples from the 1000 Human Genome Project
The data is replicated to all local disks unless noted otherwise
Parallel implementation:
  – We implemented VarScan in the C programming language
  – We also modified VarScan so that BAM files can be read directly
  – Used the MPI library for parallelization

18 Experiments: Scalability
First 192M locations of Chr. 1
Scheduling Scheme    Scalability
Basic                8.4x
Dynamic              10.9x
Static               19.7x
Combined             23.5x

19 Experiments: Data Size Impact
128 cores are allocated

20 Experiments: I/O Contention Impact
128 cores are allocated
Scheduling Scheme    I/O Contention Impact (sec)
Basic                174
Dynamic              229
Static               251
Combined             220

21 Comparison with Hadoop
The first 192M locations of Chr. 2 in 512 samples are analyzed.
Lower (dark) portions of the bars show pre-processing time.

22 Scheduling With Replication
Data-intensive processing motivates new schemes
Replicate each chunk a fixed/variable number of times
Dynamic scheduling while processing only local chunks
Interesting new tradeoffs
Under submission

23 Other Work
PAGE: A Map-Reduce-like middleware for easy parallelization of genomic applications (IPDPS 2014)
Mappers and reducers are executable programs
  – Allows us to exploit existing applications
  – No restriction on programming language

24 PAGE vs. State-of-the-Art
A middleware system
  – Specific to parallel genetic data processing
  – Allows parallelization of a variety of genetic algorithms
  – Works with different popular genetic data formats
  – Allows use of existing programs

25 Conclusion
We have developed a methodology for parallel identification of variants in large-scale genome sequencing data.
Coverage variance and I/O contention are the two main problems.
We proposed 3 scheduling schemes.
Combined scheduling gives the best results.
Our approach has good speedup and outperforms Hadoop.

