High-Performance Reconfigurable Computing for Genome Analysis Jason D. Bakos Dept. of Computer Science and Engineering University of South Carolina Columbia,

Presentation transcript:

High-Performance Reconfigurable Computing for Genome Analysis
Jason D. Bakos
Dept. of Computer Science and Engineering
University of South Carolina
Columbia, SC USA

UNC-Charlotte, Mar. 28
High-Performance Reconfigurable Computing
Use FPGA as co-processor
Example:
– Application requires a week of CPU time
– One computation consumes 99% of execution time
[Table: kernel speedup, application speedup, and execution time in hours; values not preserved in the transcript]
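The relationship between kernel speedup and application speedup in this example follows Amdahl's law. The sketch below is a minimal illustration, not a reproduction of the slide's table: the function name and the kernel speedups in the loop are assumptions chosen for illustration, while the 99% fraction and the one-week runtime come from the slide.

```python
def application_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the kernel is accelerated."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


WEEK_HOURS = 7 * 24        # "a week of CPU time"
KERNEL_FRACTION = 0.99     # "one computation consumes 99% of execution time"

if __name__ == "__main__":
    # Illustrative kernel speedups (assumed values, not the slide's table).
    for s in (10, 50, 100, 1000):
        app = application_speedup(KERNEL_FRACTION, s)
        print(f"{s:>5}x kernel -> {app:6.1f}x application, "
              f"{WEEK_HOURS / app:6.2f} hours")
```

With 99% of the time in the kernel, the application speedup can never reach 100X, no matter how fast the kernel becomes, because the remaining 1% still runs on the CPU.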

HPRC: Requirements, Pros, Cons
Application criteria:
– computationally expensive
– bottleneck computation…
    fits on FPGA
    is finely parallelizable
    has low I/O and storage requirements (relative to computation)
Advantage of HPRC:
– Cost
    FPGA card => ~ $15K
    128-processor cluster => ~ $150K + maintenance + cooling + electricity + recycling
Disadvantage of HPRC:
– Programming the FPGA

Programming
Requires large-scale digital logic design
Must finely parallelize algorithm across FPGA resources
– Especially difficult for control-dependent computations
Our goal:
– Identify, characterize, and accelerate applications in computational biology
Our strategy:
1. Develop a library of optimized, parameterizable kernel designs for common applications
2. Develop a design automation tool to generate accelerator architectures

FPGA Acceleration of Computational Biology
Aho-Corasick string set matching
– Bit-sliced state machines
    Dandass et al., Mississippi State Univ.
Sequence alignment
– BLASTP, Smith-Waterman, Needleman-Wunsch
– Systolic array
– Examples:
    Chamberlain et al., WUSTL
    Herbordt et al., Boston University
    Sotiriades et al., Univ. of Crete
    Knowles et al., Flinders Univ.
    Benkrid et al., Univ. of Edinburgh
    Underwood, Sass et al.
    etc.

Computational Phylogenetics
[Figure: genus Drosophila]

Phylogenetic Analysis
Phylogenies are used to infer common characteristics among related species

Phylogenetic Analysis
Phylogenies help biologists understand and predict:
– functions and interactions of genes
– genotype => phenotype
– host/parasite co-evolution
– origins and spread of disease
– drug and vaccine development
– origins and migrations of humans

Phylogeny Data Structure
[Figure: example unrooted binary trees over leaves g1 to g6]
Unrooted binary tree
– n leaf vertices
– n - 2 internal vertices (degree 3)
Tree configurations = (2n - 5) * (2n - 7) * (2n - 9) * … * 3
Roughly 213 trillion trees for 16 leaves
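The tree-count product can be checked directly. This is a small sketch that evaluates the product of odd factors from 3 up to 2n - 5; the function name is an assumption for illustration.

```python
def unrooted_tree_count(n_leaves: int) -> int:
    """Number of distinct unrooted binary trees on n labeled leaves.

    Evaluates (2n - 5) * (2n - 7) * ... * 3, which is 1 for n = 3.
    """
    if n_leaves < 3:
        raise ValueError("an unrooted binary tree needs at least 3 leaves")
    count = 1
    for factor in range(3, 2 * n_leaves - 4, 2):  # odd factors 3, 5, ..., 2n-5
        count *= factor
    return count


if __name__ == "__main__":
    for n in (4, 8, 16):
        print(n, unrooted_tree_count(n))
```

The growth is what makes exhaustive search over tree topologies impractical beyond a handful of leaves.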

Phylogenetic Reconstruction
Given input genomes, reconstruct an evolutionary tree
– Leaves are inputs, internal nodes are common ancestors
– Edges represent evolutionary lineage
Several methods exist:
– Distance-based (clustering) methods: clustering technique based on pairwise distances
– Bayesian methods: maximize the likelihood of a phylogenetic tree based on probabilistic models
– Maximum parsimony: minimizes the sum of edge lengths

Reconstruction Method
Maximum parsimony:
– Goal: Accuracy
– Relies on a direct evolutionary model
– Search for the tree with minimum total edge length
Direct-optimization method:
– To evaluate a fixed tree…
    1. Label all internal vertices with gene orders
       Initialize and iteratively refine until the labels converge
    2. Measure edge lengths using a distance estimator

Gene Rearrangement Data
Gene rearrangement analysis
– Evolution analysis using gene order data
Assumes gene-rearrangement model for evolution, i.e.:
– Inversion
    g0 g1 g2 g3 g4 g5 => g0 g1 -g4 -g3 -g2 g5
– Transposition
    g0 g1 g2 g3 g4 g5 => g0 g2 g3 g4 g1 g5
– Transversion
    g0 g1 g2 g3 g4 g5 => g0 -g4 -g3 -g2 g1 g5
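The three rearrangement events can be sketched as operations on a signed gene order stored as a list of nonzero integers. The function names and the index convention (half-open slices) are assumptions for illustration. Because 0 cannot carry a sign, the slide's genes g0 to g5 are written here as 1 to 6.

```python
from typing import List

Genome = List[int]  # signed gene order; the sign encodes gene orientation


def inversion(g: Genome, i: int, j: int) -> Genome:
    """Reverse the segment g[i:j] and flip the sign of every gene in it."""
    return g[:i] + [-x for x in reversed(g[i:j])] + g[j:]


def transposition(g: Genome, i: int, j: int, k: int) -> Genome:
    """Swap the adjacent blocks g[i:j] and g[j:k], keeping orientations."""
    return g[:i] + g[j:k] + g[i:j] + g[k:]


def transversion(g: Genome, i: int, j: int, k: int) -> Genome:
    """Transposition in which the block g[j:k] is also inverted."""
    inverted = [-x for x in reversed(g[j:k])]
    return g[:i] + inverted + g[i:j] + g[k:]


if __name__ == "__main__":
    start = [1, 2, 3, 4, 5, 6]            # stands for g0 g1 g2 g3 g4 g5
    print(inversion(start, 2, 5))         # g0 g1 -g4 -g3 -g2 g5
    print(transposition(start, 1, 2, 5))  # g0 g2 g3 g4 g1 g5
    print(transversion(start, 1, 2, 5))   # g0 -g4 -g3 -g2 g1 g5
```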

Breakpoint Distance Metric
Estimates the number of rearrangement events between gene orders A and B
Number of adjacencies g h in A that do not correspond to g h or -h -g in B
Example (gene orders A and B not preserved in the transcript):
– Breakpoint distance = 2
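The definition translates into a short function. This is a minimal sketch that assumes linear genomes over the same gene set and ignores the genome ends; the function name and the example gene orders are made up for illustration, since the slide's own example values were lost.

```python
from typing import List


def breakpoint_distance(a: List[int], b: List[int]) -> int:
    """Count adjacencies g h in a that appear in b as neither g h nor -h -g."""
    preserved = set()
    for g, h in zip(b, b[1:]):
        preserved.add((g, h))     # same orientation
        preserved.add((-h, -g))   # same adjacency read on the opposite strand
    return sum((g, h) not in preserved for g, h in zip(a, a[1:]))


if __name__ == "__main__":
    # Hypothetical example: one inversion of the segment 2 3 creates two breakpoints.
    print(breakpoint_distance([1, 2, 3, 4], [1, -3, -2, 4]))
```

In the example, the adjacency 2 3 survives as -3 -2, while 1 2 and 3 4 are broken.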

Median
[Figure: median M joined to A, B, and C by edges of length d(A,M), d(B,M), and d(C,M)]
Ancestral vertices are computed using a median computation
All internal vertices have degree 3
Find M that minimizes the median score
score = d(A,M) + d(B,M) + d(C,M)
Breakpoint median:
– d() is breakpoint distance

Breakpoint Median Implementation
Optimal TSP is feasible due to small graph
Implemented as a depth-first branch-and-bound search
Upper bound is the current best tour
Lower bound is computed using a linear greedy algorithm
– Select a set of minimal-weight edges to complete a partially-constructed tour
– To tighten, skip edges that:
    have been pruned at or above the current level of the search tree
    would create a cycle not including all cities

Execution Behavior
[Plot: execution time ratio for medians (0 to 1) versus evolution rate of inputs]
Application behavior depends on evolution rate of inputs
Execution time ratio for median computations:
– Asymptotically approaches 100% with diameter of input set
Median adopted as kernel computation

Breakpoint Median
Construct a fully connected graph containing vertices g and -g for each gene
– w(g, -g) = -∞
– Initialize all other weights to 3
– For each adjacency g h in the three genomes, decrement the weight between vertices -g and h
Solve TSP
[Figure: example TSP graph for genomes A, B, and C (gene orders not preserved in the transcript), with edges drawn by cost; edges not shown have cost = 3; an optimal solution and its corresponding genome are highlighted]
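The graph construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: genomes are treated as circular (so the last gene is adjacent to the first), genes are numbered 1 to n, and the function name and the dictionary-of-frozensets representation are chosen for convenience.

```python
import math
from itertools import combinations
from typing import Dict, FrozenSet, List


def build_median_graph(genomes: List[List[int]],
                       n_genes: int) -> Dict[FrozenSet[int], float]:
    """Return edge weights of the breakpoint-median TSP graph.

    Vertices are g and -g for every gene g in 1..n_genes.
    """
    vertices = [v for g in range(1, n_genes + 1) for v in (g, -g)]
    # All edges start at weight 3 (present in none of the three genomes).
    weights: Dict[FrozenSet[int], float] = {
        frozenset(edge): 3 for edge in combinations(vertices, 2)
    }
    # The two ends of one gene must always be joined in the tour.
    for g in range(1, n_genes + 1):
        weights[frozenset((g, -g))] = -math.inf
    # Each adjacency g h lowers the weight of the edge between -g and h.
    for genome in genomes:
        for i, g in enumerate(genome):
            h = genome[(i + 1) % len(genome)]  # wrap around: circular genome
            weights[frozenset((-g, h))] -= 1
    return weights


if __name__ == "__main__":
    w = build_median_graph([[1, 2, 3], [1, 2, -3], [1, 3, 2]], n_genes=3)
    print(w[frozenset((-1, 2))])  # adjacency 1 2 occurs in two genomes
```

An edge of weight 0 is an adjacency shared by all three genomes, so a tour using it adds nothing to the median score.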

Breakpoint Median Algorithm
Optimal solution is feasible due to small graph
Algorithm:
– Represent TSP graph as a list of edges
– Test every possible valid combination of edges
Implemented as a branch-and-bound search
Upper bound is the best tour found so far
Lower bound is computed using a greedy algorithm
– Loop that inspects each vertex in TSP graph
– Accumulates lower bound value (based on search state)
– Performed each time an edge is added or deleted from solution state
– Requires nearly 100% of median execution time (bottleneck)
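The search over a sorted edge list can be sketched in software as below. This is a simplified illustration under several assumptions: gene edges (g, -g) are kept implicit rather than stored with weight -∞, the lower bound is just the cost accumulated so far instead of the greedy bound described on the slide, and all names are hypothetical. The `used` and `other_end` tables play the role of the search state.

```python
from typing import Dict, FrozenSet, List, Tuple


def solve_median_tsp(n_genes: int, weights: Dict[FrozenSet[int], int],
                     default_weight: int = 3) -> Tuple[float, List[int]]:
    """Branch-and-bound over adjacency edges; returns (cost, median gene order)."""
    vertices = [v for g in range(1, n_genes + 1) for v in (g, -g)]
    edges = sorted(
        (weights.get(frozenset((u, v)), default_weight), u, v)
        for i, u in enumerate(vertices) for v in vertices[i + 1:] if u != -v
    )
    used = {v: False for v in vertices}   # vertex already has an adjacency edge
    other_end = {v: -v for v in vertices}  # far endpoint of the path through v
    best = {"cost": float("inf"), "edges": []}
    chosen: List[Tuple[int, int]] = []

    def search(start: int, cost: int) -> None:
        if len(chosen) == n_genes:         # every vertex matched: full tour
            best["cost"], best["edges"] = cost, list(chosen)
            return
        for idx in range(start, len(edges)):
            w, u, v = edges[idx]
            if cost + w >= best["cost"]:
                break                      # sorted list: later edges are no better
            if used[u] or used[v]:
                continue
            if other_end[u] == v and len(chosen) != n_genes - 1:
                continue                   # would close a cycle missing some genes
            a, b = other_end[u], other_end[v]
            used[u] = used[v] = True
            other_end[a], other_end[b] = b, a
            chosen.append((u, v))
            search(idx + 1, cost + w)
            chosen.pop()                   # undo in reverse order
            other_end[a], other_end[b] = u, v
            used[u] = used[v] = False

    search(0, 0)
    partner = {}
    for u, v in best["edges"]:
        partner[u], partner[v] = v, u
    median, v = [], 1
    for _ in range(n_genes):               # enter a gene at v, leave at -v
        median.append(v)
        v = partner[-v]
    return best["cost"], median
```

For instance, if the only edges cheaper than the default are (-1,2), (-2,3), and (-3,1), each with weight 0, the search returns cost 0 and the median 1 2 3.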

Example Breakpoint Median
sorted edge list: (-3,4,w=0) (2,3,w=1) (1,2,w=2) (-1,-2,w=2) (1,-2,w=2) (-2,-4,w=2) (-1,3,w=2) (-1,-4,w=2) (1,-4,w=2)

Initial state, cost = 0
vertex     1  -1   2  -2   3  -3   4  -4
used       0   0   0   0   0   0   0   0
otherEnd  -1   1  -2   2  -3   3  -4   4

After including edge (-3,4), cost = 0
vertex     1  -1   2  -2   3  -3   4  -4
used       0   0   0   0   0   1   1   0
otherEnd  -1   1  -2   2  -4   3  -4   3

After including edge (2,3), cost = 1 (pruned)
vertex     1  -1   2  -2   3  -3   4  -4
used       0   0   1   0   1   1   1   0
otherEnd  -1   1  -2  -4  -4   3  -4  -2

Example Breakpoint Median (continued)
sorted edge list: (-3,4,w=0) (2,3,w=1) (1,2,w=2) (-1,-2,w=2) (1,-2,w=2) (-2,-4,w=2) (-1,3,w=2) (-1,-4,w=2) (1,-4,w=2)

Initial state, cost = 0
vertex     1  -1   2  -2   3  -3   4  -4
used       0   0   0   0   0   0   0   0
otherEnd  -1   1  -2   2  -3   3  -4   4

After including edge (-3,4), cost = 0
vertex     1  -1   2  -2   3  -3   4  -4
used       0   0   0   0   0   1   1   0
otherEnd  -1   1  -2   2  -4   3  -4   3

Exclude edge (2,3); after including edge (1,2), cost = 2
vertex     1  -1   2  -2   3  -3   4  -4
used       1   0   1   0   0   1   1   0
otherEnd  -1  -2  -2  -1  -4   3  -4   3

After including edge (-2,-4), cost = 4
vertex     1  -1   2  -2   3  -3   4  -4
used       1   0   1   1   0   1   1   1
otherEnd  -1   3  -2  -1  -1   3  -4   3

After including edge (-1,3), cost = 6
vertex     1  -1   2  -2   3  -3   4  -4
used       1   1   1   1   1   1   1   1
otherEnd  -1   3  -2  -1  -1   3  -4   3

tour is -1, 1, 2, -2, -4, 4, -3, 3
median is -1, 2, -4, -3

Hardware Median Core Design
[Figure: median core block diagram, including the top-level controller]

Accelerator Architecture
Fill FPGAs with median cores
Fan-outs and fan-ins are pipelined to meet PCI-X timing
Platform:
– Annapolis Wild-Star II Pro
– Virtex-2 Pro
I/O
– Programmed I/O
– Host polls each core for state
– Communication overhead is significant for easy medians

Phylogeny Scoring Steps
[Figure: example tree before and after labeling of internal vertices]
1. Initialize unlabeled tree
– Use 3 nearest labels
– Initialize upper bound from inputs
2. Iteratively refine tree to convergence
– Use 3 immediate neighbors
– Initialize upper bound using score of previous label

First Approach for Parallelization
[Figure: inputs A, B, C broadcast to core 0, core 1, core 2, …, core n-1]
Initial upper bound: ub = min{ d(A,B) + d(A,C), d(B,A) + d(B,C), d(C,A) + d(C,B) }
Cores 0 through n-1 start with initial upper bounds ub, ub - 1, ub - 2, …, ub - (n - 1)
Core with a lower initial upper bound will converge on solution fastest
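A small sketch of how the per-core starting bounds could be derived from the three pairwise distances. It assumes that ub is the smallest of the three sums, which corresponds to using A, B, or C itself as the median; the function name and the example distances are hypothetical.

```python
from typing import List


def initial_upper_bounds(d_ab: int, d_ac: int, d_bc: int,
                         n_cores: int) -> List[int]:
    """Return the starting upper bound handed to each of n_cores cores.

    Assumption: ub is the best score obtained by picking one of the
    inputs as the median, and core k starts from ub - k.
    """
    ub = min(d_ab + d_ac,   # median = A
             d_ab + d_bc,   # median = B
             d_ac + d_bc)   # median = C
    return [ub - k for k in range(n_cores)]


if __name__ == "__main__":
    print(initial_upper_bounds(4, 6, 5, n_cores=3))
```

A core whose starting bound lies below the optimal score finds no tour at all, so the answer comes from the core with the lowest bound that still admits a solution.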

Performance Results: Median Computation
Average over 1000 median computations
12 cores => 25X speedup

Performance Results: Accelerated GRAPPA
Replace software median with driver for FPGA card
Initialization phase:
– Use 12 median cores
Re-labeling phase:
– Parallel labeling
– Use n - 2 median cores
Average over 10 GRAPPA runs

Second Approach for Parallelization
Exploit both fine- and coarse-grain parallelism
1. Fine-grain
– Unroll loop for lower bound computation
– Perform multiple iterations in parallel
2. Coarse-grain
– Use parallel median cores for single median computation
– Partition search space

Fine-Grain Parallelism
TSP graph representation (one row per vertex, listing its edges and weights):
  1:  (1,-4),w=0
 -1:  (-1,9),w=1  (-1,25),w=2
  2:  (2,11),w=2  (2,-19),w=2  (2,-49),w=2
 -2:  (-2,17),w=2  (-2,20),w=1
  …
-19:  (-19,2),w=2  (-19,-4),w=2  (-19,10),w=2
  …
Lower bound unit, shown for v = 2 (e0 = 11, e1 = -19, e2 = -49; weight0 = weight1 = weight2 = 2)
Inputs read in parallel from the used, otherEnd, excluded, and edge_count tables:
used(v), used(e0), used(e1), used(e2), otherEnd(v), excluded0(v), excluded1(v), excluded2(v), edge_count(v)

if used(v) = 0 then
  VALID_WEIGHTS = ∅
  for i = 0 to edge_count(v) - 1
    if used(e_i) = 0 and otherEnd(v) != e_i and excluded_i(v) != 1 then
      add weight_i to VALID_WEIGHTS
    end if
  end loop
  if VALID_WEIGHTS is empty
    lower_bound = lower_bound + 3
  else
    lower_bound = lower_bound + min(VALID_WEIGHTS)
  end if
end if
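The per-vertex lower bound step can be written as a software sketch. It assumes the bound accumulates across vertices, as the earlier description of the greedy loop suggests, and the table layouts (dictionaries keyed by vertex) and function names are illustrative choices. In hardware the per-edge checks run side by side; here they are a list comprehension.

```python
from typing import Dict, List, Tuple


def vertex_bound(v: int,
                 edges: Dict[int, List[Tuple[int, int]]],
                 used: Dict[int, bool],
                 other_end: Dict[int, int],
                 excluded: Dict[int, List[bool]]) -> int:
    """Contribution of vertex v to the lower bound.

    edges[v] lists (neighbor, weight) pairs; excluded[v][i] marks edge i
    of v as ruled out by the current search state.
    """
    if used[v]:
        return 0                          # v already has its adjacency edge
    valid = [w for i, (e, w) in enumerate(edges[v])
             if not used[e]               # neighbor still free
             and other_end[v] != e        # edge would not close v's own path
             and not excluded[v][i]]      # edge not excluded by the search
    return min(valid) if valid else 3     # 3 = weight of any unlisted edge


def lower_bound(cost_so_far: int, edges, used, other_end, excluded) -> int:
    """Accumulate the per-vertex contributions on top of the current cost."""
    return cost_so_far + sum(
        vertex_bound(v, edges, used, other_end, excluded) for v in edges)
```

Because each vertex's contribution depends only on table lookups for that vertex and its neighbors, the loop iterations are independent, which is what makes unrolling them in hardware possible.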

Coarse-Grain Parallelism
Parallelize search => partition TSP search space
– Problems:
    High amount of state information (communication overhead)
    Dynamic load balancing would be complex (control overhead)
Solution: “virtually” partition the TSP search space
– Search order determined by ordering of edge list
– Use parallel median cores
– Each core uses unique search order
– All cores share a global upper bound value

Experimental Results: Median Acceleration
Average speedup for 1000 median computations

Experimental Results: Application Acceleration
Perform end-to-end reconstruction procedure
Dispatch all median computations to FPGA
Average speedup for 10 end-to-end reconstructions

Tree Generation Accelerator
Generate trees in hardware, score in software
Core generates and bounds trees
– Given number of leaves, step, and offset
– Upper bound is global and updates are broadcast
Currently operating 64 cores in parallel on FPGA
Core array is scanned and the core with the lowest lower bound is scored first
Currently achieving 10X speedup
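One way a core could use the number of leaves, a step, and an offset is to enumerate every step-th tree index starting at its offset. The sketch below is an assumption about that scheme, not the actual core design: it identifies each tree by the sequence of edge choices made when leaves 4 through n are inserted one at a time into the growing tree, and all names are hypothetical.

```python
from typing import Iterator, List, Tuple


def tree_count(n_leaves: int) -> int:
    """Number of unrooted binary trees: 3 * 5 * ... * (2n - 5)."""
    count = 1
    for factor in range(3, 2 * n_leaves - 4, 2):
        count *= factor
    return count


def decode_tree(index: int, n_leaves: int) -> List[int]:
    """Turn a tree index into per-leaf insertion choices (mixed radix).

    Inserting leaf k into a tree with k - 1 leaves offers 2k - 5 edges.
    """
    choices = []
    for k in range(4, n_leaves + 1):
        index, choice = divmod(index, 2 * k - 5)
        choices.append(choice)
    return choices


def core_trees(n_leaves: int, step: int,
               offset: int) -> Iterator[Tuple[int, List[int]]]:
    """Trees assigned to one core: indices offset, offset + step, ..."""
    for index in range(offset, tree_count(n_leaves), step):
        yield index, decode_tree(index, n_leaves)


if __name__ == "__main__":
    for index, choices in core_trees(n_leaves=5, step=4, offset=1):
        print(index, choices)
```

With step equal to the number of cores and a distinct offset per core, the cores cover every tree exactly once without exchanging any state.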

Future Work
In Progress:
– Additional kernel designs
    tree generation complete, but working to increase speedup to 100X
– Implement heterogeneous mix of kernels on the FPGA according to evolution rate of input set
– Design automation tool