
Parallel EST Clustering, by Kalyanaraman, Aluru, and Kothari. Nargess Memarsadeghi, CMSC 838 Presentation.


1 Parallel EST Clustering
by Kalyanaraman, Aluru, and Kothari
Nargess Memarsadeghi
CMSC 838 Presentation

2 CMSC 838T – Presentation 2 Talk Overview
- Overview of talk
  - Motivation
  - Background
  - Techniques
  - Evaluation
  - Related work
  - Observations

3 Motivation: EST Clustering
- Problem: EST clustering
  - Cluster fragments of cDNA
- Related to the 'fragment assembly' problem
  - Detecting overlapping fragments
- Overlaps can be computed with:
  - A pairwise alignment algorithm
  - Dynamic programming
- Alternative:
  - Approximate overlap detection algorithms
  - Dynamic programming
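The slide only says that overlaps are computed by pairwise alignment with dynamic programming. As a minimal sketch of what that involves, the function below computes a Smith-Waterman local alignment score. The choice of Smith-Waterman, the linear gap model, and the score values are assumptions made for illustration, not details taken from the talk.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap penalty).

    The scoring values are illustrative defaults, not those used by any EST tool.
    """
    prev = [0] * (len(b) + 1)          # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)      # current row of the DP table
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Filling the table takes time proportional to the product of the two sequence lengths, which is why running it on every pair of ESTs is the expensive step the later slides try to avoid.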

4 Motivation
- Common tools:
  - Take too long
    - Days for 100,000 ESTs
  - Run out of memory
- This paper:
  - PaCE:
    - Parallel Clustering of ESTs
  - Efficient parallel EST clustering
    - Space-efficient algorithm
    - Reduces total work
    - Reduces run-time

5 Background: EST Clustering Tools
- Three traditional software tools:
  - Originally designed for fragment assembly:
    - TIGR Assembler
    - Phrap
    - CAP3
- One parallel software tool:
  - UICLUSTER: assumes ESTs from the 3' end

6 EST Clustering Tools
- Basic approach
  - Find pairs of similar sequences
  - Align similar pairs
    - Dynamic programming
- Quality of EST clustering
  - Phrap: fastest
    - Avoids dynamic programming
    - Relies on approximation, lower quality
  - CAP3: fewest erroneous clusters

7 EST Clustering Tools' Performance
- With 50,000 maize ESTs
  - Using a PC with dual 450 MHz Pentium processors and 512 MB RAM:
    - TIGR Assembler: ran out of memory
    - Phrap: 40 min
    - CAP3: > 24 hours
- With 100,000 maize ESTs
  - All ran out of memory
  - CAP3 would require 4 days

8 Goal
- Space-efficient algorithm
  - Space requirement linear in the size of the input data set
- Reduce total work
  - Without sacrificing quality of clustering
- Reduce run-time and facilitate the clustering of large data sets
  - Through parallel processing
  - Scale memory with the number of processors

9 Approach
- Expense:
  - Pairwise alignment (time + memory)
  - Promising pairs ≈ pairs that share a common substring s
    - Common string: |s| = w
    - Cost: if the common substring has length |s| = l > w, the same pair is generated l - w + 1 times
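The l - w + 1 cost can be seen with a naive index of all length-w windows. The sketch below is an illustration of the problem, not the method used by PaCE; the function name and the toy sequences are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

def promising_pair_hits(seqs, w):
    """Count how many shared length-w windows report each pair of sequences."""
    index = defaultdict(set)           # window -> set of (sequence id, position)
    for sid, s in enumerate(seqs):
        for i in range(len(s) - w + 1):
            index[s[i:i + w]].add((sid, i))
    hits = defaultdict(int)
    for occurrences in index.values():
        # Every two occurrences in different sequences report the pair once.
        for (s1, _), (s2, _) in combinations(sorted(occurrences), 2):
            if s1 != s2:
                hits[(s1, s2)] += 1
    return dict(hits)

# Two toy sequences whose longest common substring "ACGTAC" has length l = 6.
print(promising_pair_hits(["xxACGTAC", "ACGTACyy"], w=4))
```

With w = 4 the pair (0, 1) is reported 6 - 4 + 1 = 3 times, once for each of the windows ACGT, CGTA, and GTAC.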

10 Approach (Cont.)
- Approach:
  - Use a trie structure
  - Identify promising pairs
    - Merge clusters with strong overlaps
    - Avoid storing/testing all similar pairs
  - Parallel EST clustering software:
    - Generalized Suffix Tree (GST)
    - Multiple processors:
      - One maintains and updates the EST clusters
      - Others generate batches of promising pairs and perform alignment

11 Approach (Cont.)

12 Tries
1) Index for each char
2) N leaves
3) Height N

13 Suffix Tries (Cont.)
1) TRIM suffix trie

14 Suffix Tries (Cont.)
1) Indices
2) Storage O(n), though the constant is high
3) Common string
4) Longest common substring

15 Suffix Tries (Cont.)
[Figure: suffix tree with edges labeled a, b, and $, and leaves numbered 1 to 5]
Given a pattern P = ab, we traverse the tree according to the pattern.
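The traversal can be sketched with a plain, uncompacted suffix trie. The input string abab$ is an assumption inferred from the figure labels (five leaves over a, b, and $), and the node layout is chosen for readability rather than space efficiency.

```python
def build_suffix_trie(text):
    """Uncompacted suffix trie; each node records the 1-based start positions
    of all suffixes whose path passes through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)
    return root

def find_occurrences(root, pattern):
    """Follow the pattern from the root; the suffixes below the node reached
    are exactly the positions where the pattern occurs."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []
    return sorted(node["starts"])

trie = build_suffix_trie("abab$")
print(find_occurrences(trie, "ab"))   # suffixes 1 and 3 start with "ab"
```

This version uses quadratic space; the compacted tree described on the slides merges single-child chains to bring storage down to O(n).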

16 Parallel Generation of GST
- GST: Generalized Suffix Tree
  - Compacted trie
  - Longest common prefix found in constant time
  - Used for on-demand pair generation
  - Sequential: O(nl)
  - Parallel: O(nl/p)

17 Parallel Generation of GST (Cont.)
- Previous implementations:
  - CRCW/CREW PRAM model
  - Work-optimal
    - Involves alphabetical ordering of characters
  - Unrealistic assumptions
    - Synchronous operation of processors
    - Infinite network bandwidth
    - No memory contention
- Not practically efficient

18 Parallel Generation of GST (Cont.)
- Paper's approach:
  - ESTs equally distributed among processors
  - Each processor
    - Partitions suffixes of ESTs into buckets
  - Distribute buckets to the processors:
    - All suffixes in a bucket allocated to the same processor
    - Total number of suffixes allocated to a processor ≈ O( )
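A rough sketch of the bucketing step follows. It assumes that buckets are keyed by the first w characters of each suffix and that buckets are handed out greedily, largest first, to the least-loaded processor. The slide states neither detail, so both are assumptions for illustration, as are the function names.

```python
from collections import defaultdict

def bucket_suffixes(ests, w):
    """Group suffixes, stored as (EST id, offset), by their first w characters.

    Suffixes shorter than w are ignored in this sketch.
    """
    buckets = defaultdict(list)
    for eid, s in enumerate(ests):
        for off in range(len(s) - w + 1):
            buckets[s[off:off + w]].append((eid, off))
    return buckets

def assign_buckets(buckets, p):
    """Greedy balancing: largest bucket first, to the least-loaded processor."""
    loads = [0] * p
    owner = {}
    for key in sorted(buckets, key=lambda k: (-len(buckets[k]), k)):
        proc = loads.index(min(loads))
        owner[key] = proc              # the whole bucket goes to one processor
        loads[proc] += len(buckets[key])
    return owner, loads

buckets = bucket_suffixes(["ACGTACGT", "CGTACG"], w=2)
owner, loads = assign_buckets(buckets, p=2)
print(owner, loads)
```

Because all suffixes in a bucket share a prefix, each bucket corresponds to one subtree of the GST, which lets every processor build its part of the tree independently.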

19 Parallel Generation of GST (Cont.)
- Each bucket's processor:
  - Computes the compacted trie of all its suffixes
  - Cannot use sequential construction
    - Suffixes of a string are not all in the same bucket
- Each bucket:
  - Subtree in the GST
- Nodes:
  - Depth-first search traversal of the trie
  - Pointer to the rightmost child

20 On-demand Pair Generation
- A pair should be generated if
  - The two ESTs share a substring of length ≥ threshold
  - The shared substring is maximal
  - Leaves in a common node
    - Share a substring of length = depth of node
- Parallel algorithm
  - Each processor works with its trie if
    - Depth of its root in GST < threshold

21 On-demand Pair Generation
- To process
  - Sort internal nodes
    - Decreasing order of depth
  - Lists of a node
    - Generated after the node is processed
    - Removed after the parent is processed
    - Limits space to O(nl)
    - Run time ≈ number of pairs generated + cost of sorting
    - Rejected pairs increase run-time by a factor of 2
    - Eliminating duplicates reduces run-time
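The processing order above can be sketched on a toy tree. This is a simplified illustration under several assumptions: the tree is given as a dictionary with a hypothetical layout, each node keeps a plain set of EST ids instead of the per-character lists a real implementation would need, and the maximality check and duplicate elimination are left out.

```python
from itertools import combinations, product

def generate_pairs(tree, threshold):
    """Yield (depth, (est_a, est_b)) in decreasing order of shared-substring depth.

    tree maps node id -> (string depth, parent id or None,
                          EST ids of leaves attached directly to the node).
    """
    children = {nid: [] for nid in tree}
    for nid, (_, parent, _) in tree.items():
        if parent is not None:
            children[parent].append(nid)
    lists = {}                          # node id -> EST ids found below it
    # Deepest nodes first, so every child is finished before its parent.
    for nid in sorted(tree, key=lambda n: -tree[n][0]):
        depth, _, own = tree[nid]
        # Taking a child's list with pop() removes it once the parent is processed.
        groups = [{e} for e in own] + [lists.pop(c) for c in children[nid]]
        if depth >= threshold:
            for g1, g2 in combinations(groups, 2):
                for a, b in product(g1, g2):
                    if a != b:
                        yield depth, tuple(sorted((a, b)))
        lists[nid] = set().union(*groups)

toy = {0: (0, None, []), 1: (3, 0, [2]), 2: (5, 1, [0, 1]), 3: (2, 0, [3, 4])}
print(list(generate_pairs(toy, threshold=3)))
```

Because the function is a generator, pairs are produced only when the caller asks for the next one, and the pair with the longest shared substring comes out first.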

22 Parallel Clustering
- Master-slave paradigm:
  - Master processor:
    - Maintains and updates clusters
      - Using a union-find data structure
      - Receives messages from slave processors
        - A batch of next promising pairs generated by the slave
        - Results of the pairwise alignment
      - Determines which ones to explore
      - Determines if merging should occur
  - Slave processors:
    - Generate pairs on demand
    - Perform pairwise alignments of pairs dispatched by the master processor
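The master's bookkeeping can be sketched with a standard union-find structure. The sketch runs in a single process, so the `align` callback stands in for the work the slave processors would do, and `master_step` and `min_score` are hypothetical names introduced for the example.

```python
class UnionFind:
    """Disjoint sets with union by size and path halving."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

def master_step(uf, pairs, align, min_score):
    """Skip pairs already in one cluster, align the rest, merge on a strong score.

    Returns the number of alignments actually performed.
    """
    aligned = 0
    for a, b in pairs:
        if uf.find(a) == uf.find(b):
            continue                   # already clustered together: no alignment needed
        aligned += 1
        if align(a, b) >= min_score:
            uf.union(a, b)
    return aligned
```

Skipping pairs whose ESTs are already in the same cluster is where the reduction in total work comes from: the more clusters have merged, the fewer alignments remain worth doing.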

23 Parallel Clustering (Cont.)
[Figure: Organization of Parallel Clustering Software. One master processor exchanges messages with several slave processors.]
Message labels in the figure:
- Batch of promising pairs generated + results of pairwise alignment
- Batchsize or fewer pairs + results of pairwise alignment on each pair

24 Parallel Clustering (Cont.)
- To start:
  - Slave P starts with 3 × batchsize pairs
    - Sends the 3rd batch to Master P
    - Starts alignment on the 1st batch
    - Sends results on the 1st batch + a newly generated batch
    - While waiting to receive results from Master P, aligns the 2nd batch
      - The processor always has the next batch to work on between:
        - Submitting the results of the previous batch
        - Receiving another set of pairs

25 Parallel Clustering (Cont.)
- Improve and control quality
  - Parameters:
    - Match and mismatch scores
    - Gap penalties
  - Post-processing:
    - Detection of alternative splicing
    - Consulting protein databases
    - Organism-specific

26 Experimental Environment
- Used C and MPI
- Tested
  - Quality of software:
    - Arabidopsis thaliana (due to availability of its genome)
  - Run-time behavior:
    - 50,000 maize ESTs on a 32-processor IBM SP
    - Number of processors
    - Data size
    - Number of promising pairs vs. data size
    - Batchsize vs. number of processors
    - Number of clusters
    - Master processor's time

27 Quality Assessment
- To assess quality
  - A data set and its correct clustering
  - ESTs from the plant Arabidopsis thaliana
  - Splice program
    - Align ESTs to the genome
    - Discard ESTs that
      - Don't align
      - Align in multiple spots

28 Quality Assessment (Cont.)
- False negative:
  - A pair in the correct clustering is not paired in the output
  - 5%
- False positive:
  - A pair not in the correct clustering appears in the results
  - Negligible (< 0.04%)
  - Due to the conservative nature of the algorithm
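These pair-based definitions can be turned into a small calculation. The slide does not say what the percentages are measured against, so the denominators below (reference pairs for false negatives, predicted pairs for false positives) are an assumption, and the clusterings are toy data.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """All unordered pairs of items that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pair_error_rates(reference, predicted):
    """Return (false negative rate, false positive rate) over co-clustered pairs."""
    ref = co_clustered_pairs(reference)
    pred = co_clustered_pairs(predicted)
    fn = len(ref - pred) / len(ref) if ref else 0.0     # correct pairs that were missed
    fp = len(pred - ref) / len(pred) if pred else 0.0   # reported pairs that are wrong
    return fn, fp

# Toy example: the reference has pairs {1,2}, {1,3}, {2,3}, {4,5}.
print(pair_error_rates([[1, 2, 3], [4, 5]], [[1, 2], [3, 4, 5]]))
```

In the toy example two of the four reference pairs are missed and two of the four predicted pairs are wrong, so both rates are 0.5.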

29 Quality Assessment
Cluster results   Number of singleton clusters   Number of non-singleton clusters
Benchmark         10,803                         18,727
CAP3              17,930                         17,556
PaCE              14,802                         19,536
Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.

30 Quality Assessment (Cont.)

31 Run-time Assessment
- Experiment with 50,000 maize ESTs:
  - 32-processor IBM SP-2
  - 16 minutes

32 Run-time Assessment (Cont.)
p     Preprocessing   Clustering   Total
4     273             102          375
8     119             50           169
16    61              26           87
32    38              15           53
64    29              10           39
Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p, number of processors.
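The totals in the table can be read as relative speedup and parallel efficiency. The table has no single-processor run, so the sketch below takes the 4-processor run as the baseline, which is a choice made for this example.

```python
# p -> total run-time in seconds, copied from the table
runtimes = {4: 375, 8: 169, 16: 87, 32: 53, 64: 39}

def relative_speedup(runtimes, base):
    """Map p -> (speedup, efficiency), both relative to the run on `base` processors."""
    t_base = runtimes[base]
    return {
        p: (t_base / t, (t_base / t) / (p / base))
        for p, t in sorted(runtimes.items())
    }

for p, (speedup, efficiency) in relative_speedup(runtimes, base=4).items():
    print(f"p={p:2d}  speedup={speedup:5.2f}  efficiency={efficiency:4.2f}")
```

Going from 4 to 64 processors (16 times as many) cuts the total from 375 s to 39 s, a speedup of about 9.6, or roughly 60% efficiency relative to the 4-processor run.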

33 Run-time Assessment (Cont.)
- Run-time as a function of batchsize
  - Small batchsize
    - Increase in communication overhead
  - Large batchsize
    - Slaves less responsive to the need of generating pairs
    - Slave does not use the latest clustering results
  - Optimal batchsize
    - Determined by experiment
- Master processor's time
  - Fixed batchsize, increase in number of processors
    - Gradual increase in Master P's time
  - With 32 processors, increase < 1%
  - Using 1 master processor is not a bottleneck

34 Results
- Space linear in the size of the input data set
- Reduced total work without sacrificing quality
- Reduced run-time
  - Parallel processors
  - Eliminating pairs
- Facilitates clustering
  - Scale memory with the number of processors

35 Observations
- PaCE: approaches the EST clustering problem directly
  - Better than
    - CAP3
    - Phrap
    - TIGR Assembler
  - Compare time/quality
    - TGICL (TIGR Gene Indices Clustering Tools)
      - Support for PVM
    - MegaBlast
    - STACK
  - Large data sets
    - Lots of processors
  - Can clustering time be improved?
    - Clustering algorithm

36 References
- http://www.cs.berkeley.edu/~kubitron/courses/cs258-S02/lectures/eval10-logp.pdf
- A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.

