1 Parallel EST Clustering by Kalyanaraman, Aluru, and Kothari Nargess Memarsadeghi CMSC 838 Presentation.

CMSC 838T – Presentation 2 Talk Overview
- Overview of talk
  - Motivation
  - Background
  - Techniques
  - Evaluation
  - Related work
  - Observations

CMSC 838T – Presentation 3 Motivation: EST Clustering
- Problem: EST clustering
  - Cluster fragments of cDNA
- Related to the 'fragment assembly' problem
  - Detecting overlapping fragments
- Overlaps can be computed with:
  - Pairwise alignment algorithm
  - Dynamic programming
- Alternative:
  - Approximate overlap detection algorithms
  - Dynamic programming
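To make the dynamic-programming overlap computation on this slide concrete, here is a minimal Python sketch of a suffix-prefix alignment score. The function name `overlap_score` and the default scoring values are illustrative assumptions, not the parameters used by PaCE or by any of the tools discussed in the talk.

```python
def overlap_score(a, b, match=2, mismatch=-3, gap=-4):
    """Best score of aligning some suffix of `a` with some prefix of `b`.

    Scoring values are hypothetical defaults chosen for illustration.
    """
    # prev[j]: best score for aligning a suffix of a[:i] with b[:j].
    # Row 0: no characters of `a` used, so b[:j] is aligned against gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # cur[0] = 0 because the overlap may start anywhere inside `a`.
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The overlap must reach the end of `a` but may stop anywhere in `b`.
    return max(prev)


print(overlap_score("ACGTTGCA", "TGCAGGAT"))
```

The table has (len(a) + 1) × (len(b) + 1) cells, which is why running this on every pair of ESTs is the expensive step that the rest of the talk tries to avoid.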

CMSC 838T – Presentation 4 Motivation
- Common tools:
  - Take too long
    - Days for 100,000 ESTs
  - Run out of memory
- This paper:
  - PaCE:
    - Parallel Clustering of ESTs
  - Efficient parallel EST clustering
    - Space-efficient algorithm
    - Reduce total work
    - Reduce run-time

CMSC 838T – Presentation 5 Background: EST Clustering Tools
- Three traditional software packages:
  - Originally designed for fragment assembly:
    - TIGR Assembler
    - Phrap
    - CAP3
- One parallel software package:
  - UICLUSTER: assumes ESTs from the 3' end

CMSC 838T – Presentation 6 EST Clustering Tools
- Basic approach
  - Find pairs of similar sequences
  - Align similar pairs
    - Dynamic programming
- Quality of EST clustering
  - Phrap: fastest
    - Avoids dynamic programming
    - Relies on approximation, lower quality
  - CAP: least # of erroneous clusters

CMSC 838T – Presentation 7 EST Clustering Tools' Performance
- With 50,000 maize ESTs
  - Using a PC with dual Pentium 450 MHz processors and 512 MB RAM:
    - TIGR: ran out of memory
    - Phrap: 40 min
    - CAP: > 24 hours
- With 100,000 maize ESTs
  - All ran out of memory
  - CAP would require 4 days

CMSC 838T – Presentation 8 Goal
- Space-efficient algorithm
  - Space requirement linear in the size of the input data set
- Reduce total work
  - Without sacrificing quality of clustering
- Reduce run-time and facilitate the clustering of large data sets
  - Through parallel processing
  - Scale memory with # of processors

CMSC 838T – Presentation 9 Approach
- Expense:
  - Pairwise alignment (time + memory)
  - Promising pairs ≈ pairs that share a common substring
    - Common string s: |s| = w
    - Cost: if the common string has length |s| = l > w, the same pair is repeated l - w + 1 times
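The l - w + 1 repetition count can be checked with a short Python sketch. It indexes every length-w window of one sequence and counts how often windows of the other sequence hit that index. The function name `count_window_hits` and the two toy sequences are made up for illustration.

```python
from collections import defaultdict


def count_window_hits(a, b, w):
    """Count index pairs (i, j) with a[i:i+w] == b[j:j+w]."""
    index = defaultdict(list)
    for j in range(len(b) - w + 1):
        index[b[j:j + w]].append(j)
    # .get avoids creating empty entries for windows that never occur in b.
    return sum(len(index.get(a[i:i + w], ())) for i in range(len(a) - w + 1))


# Toy sequences (hypothetical): the only shared region is "ACGTA", so l = 5.
a = "TTACGTACC"
b = "GGACGTAGG"
w = 3
print(count_window_hits(a, b, w))  # the same pair is reported l - w + 1 times
```

With l = 5 and w = 3 the same pair is reported three times, once for each of the windows ACG, CGT, and GTA, which is the redundant work a fixed-length lookup table incurs.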

CMSC 838T – Presentation 10 Approach (Cont..)
- Approach:
  - Use trie structure
  - Identify promising pairs
    - Merge clusters with strong overlaps
    - Avoid storing/testing all similar pairs
  - Parallel EST clustering software:
    - Generalized Suffix Tree (GST)
    - Multiple processors:
      - One maintains and updates the EST clusters
      - Others generate batches of promising pairs and perform alignment

CMSC 838T – Presentation 11 Approach (Cont …)

CMSC 838T – Presentation 12 Tries
1) Index for each char
2) N leaves
3) Height N

CMSC 838T – Presentation 13 Suffix Tries (Cont..) 1)TRIM suffix trie

CMSC 838T – Presentation 14 Suffix Tries (Cont..)
1) Indices
2) Storage O(n), though the constant is high
3) Common string
4) Longest common substring

CMSC 838T – Presentation 15 Suffix Tries (Cont..)
[Figure: suffix tree with edges labeled a, b, and $, and leaves numbered 1 to 5]
Given a pattern P = ab, we traverse the tree according to the pattern.
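The traversal described on this slide can be sketched in Python with an uncompacted suffix trie. The text "abab$" is an assumption chosen to fit a tree with five numbered leaves over the characters a, b, and $; the function names `build_suffix_trie` and `find` are hypothetical. A compacted suffix tree would use O(n) space, whereas this simple version uses O(n²) and is only meant to show the pattern walk.

```python
def build_suffix_trie(text):
    """Uncompacted suffix trie; each node records the suffixes passing through it."""
    root = {"children": {}, "starts": []}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(start + 1)  # 1-based suffix number
    return root


def find(root, pattern):
    """Follow `pattern` from the root; return the suffix numbers where it occurs."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []  # fell off the trie: pattern does not occur
    return node["starts"]


trie = build_suffix_trie("abab$")  # assumed example text
print(find(trie, "ab"))
```

Every suffix that passes through the node reached by the pattern is an occurrence, so the search cost depends on the pattern length rather than on the text length.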

CMSC 838T – Presentation 16 Parallel Generation of GST
- GST: Generalized Suffix Tree
  - Compacted trie
  - Longest common prefix found in constant time
  - Used for on-demand pair generation
  - Sequential: O(nl)
  - Parallel: O(nl/p)

CMSC 838T – Presentation 17 Parallel Generation of GST (Cont …)
- Previous implementations:
  - CRCW/CREW PRAM model
  - Work-optimal
    - Involves alphabetical ordering of characters
  - Unrealistic assumptions
    - Synchronous operation of processors
    - Infinite network bandwidth
    - No memory contention
- Not practically efficient

CMSC 838T – Presentation 18 Parallel Generation of GST (Cont …)
- Paper's approach:
  - ESTs equally distributed among processors
  - Each processor
    - Partitions suffixes of ESTs into buckets
  - Distribute buckets to the processors:
    - All suffixes in a bucket allocated to the same processor
    - Total # of suffixes allocated to a processor ≈ O( )
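A minimal Python sketch of the bucketing step follows. It assumes suffixes are bucketed by their first w characters and that buckets are handed out greedily, largest first, to the least-loaded processor. Both the bucketing key and the greedy policy are assumptions made for illustration; the slide does not state how the paper balances the load. The names `bucket_suffixes` and `assign_buckets` are hypothetical.

```python
from collections import defaultdict
import heapq


def bucket_suffixes(ests, w):
    """Group every suffix (est_id, position) by its first w characters.

    Suffixes shorter than w are skipped in this sketch.
    """
    buckets = defaultdict(list)
    for est_id, seq in enumerate(ests):
        for pos in range(len(seq) - w + 1):
            buckets[seq[pos:pos + w]].append((est_id, pos))
    return buckets


def assign_buckets(buckets, p):
    """Give each whole bucket to one of p processors, balancing suffix counts."""
    heap = [(0, rank) for rank in range(p)]  # (current load, processor rank)
    heapq.heapify(heap)
    owner = {}
    # Largest buckets first; ties broken by key so the result is deterministic.
    for key in sorted(buckets, key=lambda k: (-len(buckets[k]), k)):
        load, rank = heapq.heappop(heap)
        owner[key] = rank
        heapq.heappush(heap, (load + len(buckets[key]), rank))
    return owner


ests = ["ACGTAC", "CGTACG"]  # toy input
buckets = bucket_suffixes(ests, w=2)
print(assign_buckets(buckets, p=2))
```

Because a bucket is never split, all suffixes that start with the same w characters end up on one processor, which lets that processor build the corresponding subtree without communication.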

CMSC 838T – Presentation 19 Parallel Generation of GST (Cont …)
- Each bucket's processor:
  - Computes the compacted trie of all its suffixes
  - Cannot use sequential construction
    - Suffixes of a string are not all in the same bucket
- Each bucket:
  - Subtree in the GST
- Nodes:
  - Depth-first search traversal of the trie
  - Pointer to the rightmost child

CMSC 838T – Presentation 20 On-demand Pair Generation
- A pair should be generated if
  - The two ESTs share a substring of length ≥ threshold
  - The shared substring is maximal
  - Leaves in a common node
    - Share a substring of length = depth of node
- Parallel algorithm
  - Each processor works with its trie if
    - Depth of its root in GST < threshold
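The ordering idea behind on-demand pair generation (deepest shared substrings first, each pair reported once) can be illustrated with a deliberately naive Python generator. This is not the GST-based algorithm from the paper: it replaces the tree with a brute-force grouping of substrings by length, ignores the maximality condition, and takes far more time and memory. The function name `promising_pairs` and the toy sequences are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def promising_pairs(ests, threshold):
    """Yield (i, j, depth) with i < j, longest shared substrings first.

    Brute-force stand-in for walking GST nodes in decreasing depth order.
    """
    seen = set()
    max_len = max((len(s) for s in ests), default=0)
    for depth in range(max_len, threshold - 1, -1):
        groups = defaultdict(set)  # substring of this length -> EST ids
        for est_id, seq in enumerate(ests):
            for pos in range(len(seq) - depth + 1):
                groups[seq[pos:pos + depth]].add(est_id)
        for key in sorted(groups):
            for pair in combinations(sorted(groups[key]), 2):
                if pair not in seen:  # drop duplicates of an earlier pair
                    seen.add(pair)
                    yield pair[0], pair[1], depth


ests = ["AAACGTACGG", "TTACGTACTT", "CCCGTAGGGG"]  # toy input
for pair in promising_pairs(ests, threshold=4):
    print(pair)
```

Because it is a generator, a consumer can stop pulling pairs as soon as the clusters no longer change, which mirrors the "on demand" aspect of the slide.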

CMSC 838T – Presentation 21 On-demand Pair Generation
- To process
  - Sort internal nodes
    - Decreasing order of depth
  - Lists of a node
    - Generated after the node is processed
    - Removed after parent is processed
    - Limits space to O(nl)
    - Run time ≈ # pairs generated + cost of sorting
    - Rejected pairs increase run-time by a factor of 2
    - Eliminating duplicates reduces run-time

CMSC 838T – Presentation 22 Parallel Clustering
- Master-slave paradigm:
  - Master processor:
    - Maintains and updates clusters
      - Using union-find data structure
      - Receives messages from slave processors
        - A batch of next promising pairs generated by the slave
        - Results of the pairwise alignment
      - Determines which ones to explore
      - Determines if merging should occur
  - Slave processors:
    - Generate pairs on demand
    - Perform pairwise alignments of pairs dispatched by the master processor
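The master's bookkeeping can be sketched in Python with a standard union-find structure. The sketch runs in a single process and assumes one plausible reading of "determines which ones to explore": a pair whose ESTs already sit in the same cluster is not aligned again. The `aligns_well` callback stands in for the slaves' pairwise alignment and is a hypothetical placeholder, as are the names `UnionFind` and `cluster`.

```python
class UnionFind:
    """Disjoint sets with union by size and path halving."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def cluster(n, pairs, aligns_well):
    """Merge clusters for pairs that align well; skip pairs already together."""
    uf = UnionFind(n)
    alignments = 0
    for i, j in pairs:
        if uf.find(i) == uf.find(j):
            continue  # already in one cluster: no alignment needed
        alignments += 1
        if aligns_well(i, j):
            uf.union(i, j)
    return uf, alignments


pairs = [(0, 1), (1, 2), (0, 2), (3, 4)]
uf, alignments = cluster(5, pairs, lambda i, j: (i, j) != (3, 4))
print(alignments, uf.find(0) == uf.find(2), uf.find(3) == uf.find(4))
```

In this toy run the pair (0, 2) is never aligned because ESTs 0 and 2 were already joined through EST 1, which is where the reduction in total work comes from.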

CMSC 838T – Presentation 23 Parallel Clustering (Cont…)
[Figure: Organization of Parallel Clustering Software. One master processor (Master P) exchanges messages with several slave processors (Slave P). Message labels: "Batch of promising pairs generated + results of pairwise alignment" and "Batchsize or fewer # of pairs + results of pairwise alignment on each pair".]

CMSC 838T – Presentation 24 Parallel Clustering (Cont..)
- To start:
  - Slave P starts with 3 × batchsize pairs
    - Sends the 3rd batch to Master P
    - Starts alignment on 1st batch
    - Sends results on 1st + a newly generated batch
    - While waiting to receive results from Master P, aligns 2nd batch
      - Processor always has the next batch to work on between:
        - Submitting the results of the previous batch
        - Receiving another set of pairs

CMSC 838T – Presentation 25 Parallel Clustering (Cont..)
- Improve and control quality
  - Parameters:
    - Match and mismatch scores
    - Gap penalties
  - Post-processing:
    - Detection of alternative splicing
    - Consulting protein databases
    - Organism specific

CMSC 838T – Presentation 26 Experimental environment
- Used C and MPI
- Tested
  - Quality of software:
    - Arabidopsis thaliana (due to availability of its genome)
  - Run-time behavior:
    - 50,000 maize ESTs with 32-processor IBM SP
    - # of processors
    - Data size
    - (# of promising pairs) vs data size
    - Batchsize vs (# processors)
    - # of clusters
    - Master processor's time

CMSC 838T – Presentation 27 Quality Assessment
- To assess quality
  - A data set and its correct clustering
  - ESTs from the plant Arabidopsis thaliana
  - Splice program
    - Align ESTs to the genome
    - Discard ESTs that
      - Don't align
      - Align in multiple spots

CMSC 838T – Presentation 28 Quality Assessment (Cont …)
- False negative:
  - A pair in the correct clustering is not paired in the output
  - 5%
- False positive:
  - A pair not in the correct clustering appears in the results
  - Negligible (< 0.04%)
  - Due to the conservative nature of the algorithm
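The pair-based definitions on this slide can be turned into a small Python check. The sketch takes two clusterings as lists of cluster labels, one label per EST. The choice of denominators (false negatives relative to the correct pairs, false positives relative to the reported pairs) is an assumption, since the slide gives only the percentages; the names `co_clustered_pairs` and `pair_errors` are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """All pairs (i, j), i < j, that carry the same cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pair_errors(correct, predicted):
    """Return (fn_count, fp_count, fn_rate, fp_rate) for two clusterings."""
    truth = co_clustered_pairs(correct)
    found = co_clustered_pairs(predicted)
    fn = len(truth - found)  # paired in the correct clustering, missed in output
    fp = len(found - truth)  # paired in the output, not in the correct clustering
    # Denominators are an assumption of this sketch.
    fn_rate = fn / len(truth) if truth else 0.0
    fp_rate = fp / len(found) if found else 0.0
    return fn, fp, fn_rate, fp_rate


correct = [0, 0, 0, 1, 1]    # toy benchmark clustering
predicted = [0, 0, 1, 1, 1]  # toy tool output
print(pair_errors(correct, predicted))
```

Enumerating all pairs is quadratic in the number of ESTs, so a data set of the size used in the talk would need a count-based formulation instead; the sketch is only meant to pin down the definitions.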

CMSC 838T – Presentation 29 Quality Assessment

Cluster results   Number of singleton clusters   Number of non-singleton clusters
Benchmark         10,803                         18,727
CAP3              17,930                         17,556
PaCE              14,802                         19,536

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.

CMSC 838T – Presentation 30 Quality Assessment (Cont..)

CMSC 838T – Presentation 31 Run-time Assessment
- Experiment with 50,000 maize ESTs
- 32-processor IBM SP
[Figure: run-time in minutes]

CMSC 838T – Presentation 32 Run-time Assessment (Cont …)
[Table: columns p, Preprocessing, Clustering, Total]
Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p, number of processors.

CMSC 838T – Presentation 33 Run-time Assessment (Cont..)
- Run-time as a function of batchsize
  - Small batchsize
    - Increase in communication overhead
  - Large batchsize
    - Slaves less responsive to the need of generating pairs
    - Slave does not use latest clustering results
  - Optimal batchsize
    - Determined by experiment
- Master processor's time
  - Fixed batchsize, increase in # of processors
    - Gradual increase in Master P's time
  - With 32 processors, increase < 1%
  - Using 1 master processor is not a bottleneck

CMSC 838T – Presentation 34 Results
- Space linear in the size of the input data set
- Reduced total work without sacrificing quality
- Reduced run-time
  - Parallel processors
  - Eliminating pairs
- Facilitate clustering
  - Scale memory with # of processors

CMSC 838T – Presentation 35 Observations
- PaCE: approaches the EST clustering problem directly
  - Better than
    - CAP3
    - Phrap
    - TIGR Assembler
  - Compare time/quality
    - TGICL (TIGR Gene Indices Clustering Tools)
      - Support for PVM
    - MegaBlast
    - STACK
  - Large data sets
    - Lots of processors
  - Can improve clustering time?
    - Clustering algorithm

CMSC 838T – Presentation 36 References
- S02/lectures/eval10-logp.pdf
- A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.