Presentation on theme: "Integrated genome analysis using"— Presentation transcript:

1 Integrated genome analysis using Makeflow + friends
Scott Emrich, UND Director of Bioinformatics, Computer Science & Engineering, University of Notre Dame

2 VectorBase is a Bioinformatics Resource Center
VectorBase is a genome resource for invertebrate vectors of human pathogens.
Funded by NIH-NIAID as part of a wider group of NIAID Bioinformatics Resource Centers (BRCs, see above) for biodefense and emerging and re-emerging infectious diseases.
The 3rd contract started in the Fall (for up to 2 more years).

3 VectorBase: relevant species of interest

4 Assembly required…

5 Current challenges genome informaticians are focusing on
Refactoring genome mapping tools to use HPC/HTC for speed-up, esp. when new, faster algorithms are not yet available
Using "data intensive" frameworks: MapReduce/Hadoop and Spark
Efficiently harnessing resources from heterogeneous systems
Scalable, elastic workflows with flexible granularity

6 Accelerating Genomics Workflows In Distributed Environments
Research Update, March 8, 2016
Olivia Choudhury, Nicholas Hazekamp, Douglas Thain, Scott Emrich
Department of Computer Science and Engineering, University of Notre Dame, IN

7 Scaling Up Bioinformatic Workflows with Dynamic Job Expansion: A Case Study Using Galaxy and Makeflow
Nicholas Hazekamp, Joseph Sarro, Olivia Choudhury, Sandra Gesing, Scott Emrich and Douglas Thain
Cooperative Computing Lab, University of Notre Dame

8 Using makeflow to express genome variation workflow
Work Queue master-worker framework
Sun Grid Engine (SGE) batch system

9 Overview of CCL-based solution
We use Work Queue, a master-worker framework for submitting, monitoring, and retrieving tasks.
We support a number of different execution engines such as Condor, SLURM, TORQUE, etc.
TCP communication allows us to utilize systems and resources that are not part of the shared filesystem, opening up the opportunity for a larger number of machines and workers.
Workers can be scaled to better accommodate the structure of the DAG and the busyness of the overall system.
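As a concrete illustration of the notes above, here is a minimal master-side sketch assuming the CCTools work_queue Python binding; the BWA command, file names, read chunks, and port are placeholders rather than VectorBase code.

```python
# Minimal Work Queue master sketch (CCTools "work_queue" Python binding).
# The command, file names, and port are illustrative placeholders.
import work_queue as wq

q = wq.WorkQueue(port=9123)              # workers connect back over TCP
print("master listening on port", q.port)

for chunk in ["reads.0.fq", "reads.1.fq", "reads.2.fq"]:   # hypothetical read chunks
    out = chunk.replace(".fq", ".sam")
    t = wq.Task("bwa mem ref.fa {} > {}".format(chunk, out))
    t.specify_input_file("ref.fa")       # files are shipped to the worker, so no
    t.specify_input_file(chunk)          # shared filesystem is required
    t.specify_output_file(out)           # result is fetched back to the master
    q.submit(t)                          # (BWA index files omitted for brevity)

while not q.empty():
    t = q.wait(5)                        # wait up to 5 s for a completed task
    if t:
        print("task", t.id, "finished with exit code", t.return_status)
```

Workers are started separately and connect back over TCP, e.g. with work_queue_worker, or in bulk via sge_submit_workers / condor_submit_workers, matching the SGE and Condor back ends mentioned on slide 8.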

10 Realized concurrency (in practice)

11 Related Work (HPC/HTC; not extensive!)
Jarvis et al. - performance models efficiently manage workloads on clouds
Ibrahim et al., Grossman - balance the number of resources and the duration of usage
Grandl et al., Ranganathan et al., Buyya et al. - scheduling techniques reduce resource utilization
Hadoop, Dryad, CIEL support data-intensive workloads
How to write + discuss related work? Why are we not doing scheduling?

12 Observations
Multi-level concurrency is not high with current bioinformatics tools

13 Observations
Task-level parallelism can get worse
Balancing multi-level concurrency and task-level parallelism is easy with Work Queue

14 Results – Predictive Capability for three tools
Avg. MAPE (mean absolute percentage error) = 3.1
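For reference, MAPE is the mean absolute percentage error between predicted and observed values; a minimal sketch of the metric (argument names are illustrative):

```python
def mape(actual, predicted):
    """MAPE = 100/n * sum(|actual - predicted| / actual)"""
    return 100.0 / len(actual) * sum(
        abs(a - p) / a for a, p in zip(actual, predicted))
```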

15 Results – Cost Optimization

# Cores/Task | # Tasks | Predicted Time (min) | Speedup | Est. EC2 Cost ($) | Est. Azure Cost ($)
1 | 360 | 70 | 6.6 | 50.4 | 64.8
2 | 180 | 38 | 12.3 | 25.2 | 32.4
4 | 90 | 24 | 19.5 | 18.9 | -
8 | 45 | 27 | 17.3 | - | -

16 Galaxy: popular with biologists and bioinformaticians
Emphasis on reproducibility
Varying levels of difficulty, but it mostly boils down to: once a tool is installed, it has turn-key execution (if everything is defined properly, it runs)
Provides an interface for chaining tools into workflows, storing them, and sharing them

17 Workflows in Galaxy
Intro of a short Galaxy workflow
To the user, each tool is a black box; they don't have to know what is happening in the back end
Turn-key execution
The user doesn't have to see any of this interaction, just tool execution success or failure
Define GALAXY JOB

18 User-System Interaction

19 Workflow Dynamically Expanded behind Galaxy
The user needs to know nothing of the specific execution; the complexities and verification are hidden behind the Galaxy façade.
As computational needs increase, so too do the resources needed and how we interact with them.
A programmer with a better grasp of the workings of the software can determine a safe means of decomposition that can then be harnessed by many different scientists.

20 New User-System Interaction

21 Results – Optimal Configuration
For the given dataset, K* = 90 (tasks) and N* = 4 (cores per task)

22 Best Data Partitioning Approaches
Split Ref, Split SAM, SAM→BAM, ReadGroups
Granularity-based partitioning for parallelized BWA
Alignment-based partitioning for parallelized HaplotypeCaller
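As a rough illustration of the granularity-based idea for BWA, one way to realize it is to split the input reads into K chunks so each chunk becomes an independent alignment task; the sketch below is hypothetical (file names, helper, and K are not the pipeline's actual code), and the alignment-based step for HaplotypeCaller would analogously split the aligned SAM, e.g. by reference region.

```python
# Hypothetical sketch of granularity-based partitioning: split a FASTQ file
# into K chunks so each chunk can be aligned by an independent BWA task.
def split_fastq(path, k, prefix="chunk"):
    outs = [open("{}.{}.fq".format(prefix, i), "w") for i in range(k)]
    with open(path) as f:
        record, n = [], 0
        for line in f:
            record.append(line)
            if len(record) == 4:             # one FASTQ record = 4 lines
                outs[n % k].writelines(record)
                record, n = [], n + 1
    for o in outs:
        o.close()

split_fastq("reads.fq", 90)                  # e.g. K* = 90 tasks, as on slide 21
```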

23 Full Scale Run
61.5X speedup (Galaxy)
Test tools – BWA and GATK's HaplotypeCaller
Test data – fold-coverage Illumina HiSeq single-end genome data of 50 northern red oak individuals
[Chart: run times in HH:MM]

24 Comparison of Sequential and Parallelized Pipelines

Run | BWA | Intermediate Steps | HaplotypeCaller | Pipeline
Sequential | 4 hrs. 04 mins. | 5 hrs. 37 mins. | 12 days | -
Parallel | 0 hr. 56 mins. | 2 hrs. 45 mins. | 0 hr. 24 mins. | 4 hrs. 05 mins.
Run time of the parallelized BWA–HaplotypeCaller pipeline with optimized data partitioning

25 Performance in Real-Life (summer 2016)
100+ different runs through the workflow
Utilizing 500+ cores with heavy load
Data sets ranging from >1 GB to 50 GB+

26 VectorBase production example

27 VB running blast (before)
[Diagram: frontend submits blast jobs directly to Condor]
Talk directly to Condor.
Custom Condor submit scripts per database.
One Condor job designated to wait on the rest (idle-wait).

28 VB running blast (now)
[Diagram: frontend hands blast jobs to Makeflow, which submits them to Condor]
Makeflow manages the workflow and the connections to Condor.
Makeflow files are created on the fly (PHP code, no custom scripts per database).
All Condor slots run computations.
Jobs take 1/3 the time (saves about 18 s in response time).
Similar changes for hmmer and clustal.
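The production code that builds these workflows is PHP; purely as a sketch of the "Makeflow files are created on the fly" idea, here is a hypothetical Python equivalent with placeholder query and database names and a generic blastp command.

```python
# Hypothetical sketch: build a Makeflow file on the fly, one BLAST rule per
# database partition. Paths, database names, and blastp options are placeholders.
def write_blast_makeflow(query, db_parts, path="blast.mf"):
    with open(path, "w") as mf:
        for i, db in enumerate(db_parts):
            result = "hits.{}.txt".format(i)
            # Makeflow rule: "outputs : inputs" followed by a tab-indented command
            mf.write("{} : {}\n".format(result, query))   # db index files omitted
            mf.write("\tblastp -query {} -db {} -out {}\n\n".format(query, db, result))
    return path

write_blast_makeflow("query.fa", ["vb_part0", "vb_part1", "vb_part2"])
```

The generated file can then be handed to `makeflow -T condor` so that every rule runs as a Condor job, as described on the slide.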

29 VB jobs – future?
Future direction: move VectorBase jobs to Work Queue, the master-worker framework for submitting, monitoring, and retrieving tasks described earlier.
It supports a number of different execution engines such as Condor, SLURM, TORQUE, etc.
TCP communication allows us to utilize systems and resources that are not part of the shared filesystem, opening up the opportunity for a larger number of machines and workers.
Workers can be scaled to better accommodate the structure of the DAG and the busyness of the overall system.

30 Acknowledgements
Notre Dame Bioinformatics Lab and The Cooperative Computing Lab, University of Notre Dame
NIH/NIAID grant HHSN C and NSF grants SI2-SSE and OCI
Questions?

31 Small Scale Run Query: 600MB Ref: 36MB

32 Data Transfer – A Hindrance

Workers | Data Transferred (MB) | Transfer Time (s)
2 | 64,266 | 594
5 | 65,913 | 593
10 | 67,522 | 598
20 | 70,350 | 623
50 | 74,534 | 754
100 | 80,267 | 765
Amount and time of data transfer with increasing workers

33 MinHash from 1,000 feet
Similarity: SIM(s1, s2) = Intersection / Union
Sequence s1: ACGTGCGAAATTTCTC
Sequence s2: AAGTGCGAAATTACTT
Signatures: SIG(s1) = [h1(s1), h2(s1), ..., hk(s1)], SIG(s2) = [h1(s2), h2(s2), ..., hk(s2)]
Comparing 2 sequences requires k integer comparisons, where k is constant
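A minimal sketch of the MinHash idea on this slide: each sequence gets a fixed-length signature of minimum hash values over its k-mers, and the fraction of matching signature slots estimates the Jaccard similarity with only k integer comparisons. The k-mer length, number of hash functions, and md5-based hash family are illustrative choices, not the actual tool's parameters.

```python
# Illustrative MinHash sketch (parameters and hash family are arbitrary choices).
import hashlib

def kmers(seq, k=4):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def h(x, seed):
    return int(hashlib.md5((str(seed) + x).encode()).hexdigest(), 16)

def signature(seq, num_hashes=64, k=4):
    # SIG(s) = [min over k-mers of h_1, ..., min over k-mers of h_num_hashes]
    return [min(h(km, seed) for km in kmers(seq, k)) for seed in range(num_hashes)]

def estimated_similarity(sig1, sig2):
    # Fraction of matching slots estimates SIM(s1, s2) = intersection / union
    return sum(a == b for a, b in zip(sig1, sig2)) / len(sig1)

s1, s2 = "ACGTGCGAAATTTCTC", "AAGTGCGAAATTACTT"
print(estimated_similarity(signature(s1), signature(s2)))
```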

34 Three stages of scaffolding

35 E. coli K12 50 rearrangements

36 E. coli K12 500 rearrangements

37 Application-level Model for Runtime

38 Application-level Model for Memory

39 System-level Model for Runtime

40 System-level Model for Memory

41 Distribution of Regression Coefficients
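Slides 37–41 show the application-level and system-level models and their coefficients as figures. Purely as a hypothetical sketch of the modeling step, a runtime model can be fit by least squares and scored with MAPE; the features, model form, and numbers below are placeholders, not the study's data.

```python
# Hypothetical sketch: fit a linear runtime model and evaluate it with MAPE.
# Features and measurements are toy placeholders, not the models behind these figures.
import numpy as np

data_gb = np.array([1.0, 2.0, 4.0, 8.0])        # input size (toy values)
cores   = np.array([1.0, 2.0, 4.0, 8.0])        # cores per task (toy values)
runtime = np.array([30.0, 52.0, 60.0, 110.0])   # observed minutes (toy values)

# Toy model: runtime ~ b0 + b1*data_gb + b2*(data_gb/cores)
X = np.column_stack([np.ones_like(runtime), data_gb, data_gb / cores])
coef, *_ = np.linalg.lstsq(X, runtime, rcond=None)
pred = X @ coef

mape = 100.0 * np.mean(np.abs(runtime - pred) / runtime)
print("coefficients:", coef, " MAPE: %.1f%%" % mape)
```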

