BMI 731 - Biomedical Data Management, March 2, 2004: Improving Performance of Multiple Sequence Alignment Analysis in Multi-client Environments - Use of Inexpensive Storage as Grid Cache.

Presentation transcript:

Improving Performance of Multiple Sequence Alignment Analysis in Multi-client Environments: Use of Inexpensive Storage as Grid Cache
Umit Catalyurek, Mike Gray, Eric Stahlberg, Renato Ferreira, Tahsin Kurc, Joel Saltz
Department of Biomedical Informatics, The Ohio State University / Ohio Supercomputer Center

Outline
– Multiple Sequence Alignment
– CLUSTAL W
– Sequence Analysis in a Multi-client Environment
  – Caching Intermediate Results
  – Deployment on an SMP Machine
  – Deployment on a Distributed Memory Machine
– Experimental Results
– Conclusion

Sequence Alignment
An alignment is a mutual arrangement of two sequences: it shows where the two sequences are similar and where they differ.

Sequence s:    AAT    AGCAA    AGCACACA
Sequence t:    TAA    ACATA    ACACACTA
Hamming dist:    2        3           6
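The Hamming distances above simply count mismatched positions between equal-length sequences. A minimal sketch (the function name is ours, not from the slides):

```python
def hamming(s: str, t: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(a != b for a, b in zip(s, t))

# The three pairs from the slide:
assert hamming("AAT", "TAA") == 2
assert hamming("AGCAA", "ACATA") == 3
assert hamming("AGCACACA", "ACACACTA") == 6
```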

Edit Distance (unit cost)
Two of the possible alignments of s = AGCACACA and t = ACACACTA:

s: AGCACAC-A        s: AG-CACACA
t: A-CACACTA   or   t: ACACACT-A
   cost 2              cost 4

distance(s, t) = 2, the minimum cost over all alignments.
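The minimum is found by dynamic programming. A self-contained sketch of the standard unit-cost (Levenshtein) recurrence; the names are ours:

```python
def edit_distance(s: str, t: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) by dynamic programming."""
    m, n = len(s), len(t)
    # d[i][j] = distance between prefixes s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i              # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j              # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # match / substitution
    return d[m][n]

assert edit_distance("AGCACACA", "ACACACTA") == 2
```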

Multiple Sequence Alignment
Two alternative alignments of the same eight sequences:

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWESNG--

or

VTISCTGSSSNIG-AGNHVKWYQQLPG
VTISCTGTSSNIG--SITVNWYQQLPG
LRLSCSSSGFIFS--SYAMYWVRQAPG
LSLTCTVSGTSFD--DYYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNW--YVDG
ATLVCLISDFYPG--AVTVAW--KADS
AALGCLVKDYFPE--PVTVSW--NS-G
VSLTCLVKGFYPS--DIAVEW--ESNG

Optimal alignment costs O(2^n · Π|s_i|):
– 6 sequences of length 100, with a constant of 10^-9 seconds: running time 6.4 x 10^4 seconds.
– Add 2 more sequences: running time 2.6 x 10^9 seconds.
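Assuming the implied constant of 10^-9 seconds per unit of work (which reproduces both figures exactly), the arithmetic is:

```python
# Work is 2^n * prod(|s_i|); take |s_i| = 100 for all i and c = 1e-9 s per unit.
c = 1e-9
t6 = (2 ** 6) * (100 ** 6) * c  # 6 sequences: 6.4e4 s, about 18 hours
t8 = (2 ** 8) * (100 ** 8) * c  # 8 sequences: 2.56e9 s, about 81 years
print(f"6 sequences: {t6:.1e} s, 8 sequences: {t8:.1e} s")
```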

CLUSTAL W
Based on the CLUSTAL program of Higgins & Sharp [Gene88], a progressive alignment strategy with three phases (n = number of sequences in the query, l = average sequence length):
– Pairwise Alignment, O(n^2 l^2): a distance matrix is computed using either an approximate method (fast) or dynamic programming (more accurate, but slower).
– Computation of the Guide Tree, O(n^3): a phylogenetic tree is computed from the distance matrix by iteratively selecting aligned pairs and linking them.
– Progressive Alignment, O(n l^2): a series of pairwise alignments computed with full dynamic programming aligns larger and larger groups of sequences. The order in the guide tree determines the ordering of the alignments; at each step, either two sequences are aligned, a new sequence is aligned with a group, or two groups are aligned.
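A schematic of how the three phases fit together; the callables are hypothetical stand-ins of ours, not ClustalW's actual interfaces:

```python
from itertools import combinations

def clustalw_phases(seqs, pairwise_distance, build_guide_tree, align_in_tree_order):
    """Skeleton of CLUSTAL W's three phases with placeholder callables."""
    n = len(seqs)
    # Phase 1, O(n^2 l^2): all-pairs distance matrix -- the dominant cost,
    # and the part this work caches across queries.
    dist = {(i, j): pairwise_distance(seqs[i], seqs[j])
            for i, j in combinations(range(n), 2)}
    # Phase 2, O(n^3): guide tree computed from the distance matrix.
    tree = build_guide_tree(dist)
    # Phase 3, O(n l^2): progressive alignment in guide-tree order; each step
    # aligns two sequences, a sequence with a group, or two groups.
    return align_in_tree_order(tree, seqs)
```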

Sequence Analysis in a Multi-client Environment
– Many gene and protein databases can be accessed over the Internet, producing multiple requests from multiple clients.
– Data caching: cache the pairwise alignments, since they are the most expensive phase and their computations are independent of one another.

Data Caching
Low-cost, high-performance, high-capacity commodity hardware:
– Disks are cheap: 100GB EIDE disks around $250.
– A PC costs around $700-$1,000 (no monitor, no high-end graphics card, moderate memory of 128MB-512MB).
– Switched Fast Ethernet; better performance with channel bonding.
– In 2001: 6 Pentium III PCs with 1TB of disk storage for under $10,000.
– In 2002: 5 Pentium 4 PCs with 2.5TB of disk storage for under $9,000.
– BMI storage cluster: 24 PCs, 7.2TB, $50,000-$55,000.
– UMD storage cluster: 50 PCs, 9.5TB.

Caching Pairwise Alignment Scores
Sequence -> unique ID (UID):
– Use a hash of the sequence (10 hash functions were tested, including MD5; 4 of them gave results similar to MD5).
– Resolve collisions and assign a UID to each sequence. For more than 1 million sequences from GenBank, the maximum number of collisions per hash value was 3, so lookup takes constant time.
For each pairwise alignment, store the two UIDs and a float score:
– B-tree: the GiST B-tree implementation was used.
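A minimal sketch of the UID assignment and score cache, assuming MD5 as the hash and a Python dict standing in for the GiST B-tree; all names here are ours:

```python
import hashlib
from collections import defaultdict

buckets = defaultdict(list)  # md5 digest -> [(sequence, uid)] collision chain (max ~3 per slide)
score_cache = {}             # (uid_lo, uid_hi) -> float score; stands in for the B-tree
_next_uid = 0

def uid(seq: str) -> int:
    """Map a sequence to a unique integer ID via its hash, resolving collisions."""
    global _next_uid
    chain = buckets[hashlib.md5(seq.encode()).hexdigest()]
    for stored_seq, stored_uid in chain:
        if stored_seq == seq:
            return stored_uid
    chain.append((seq, _next_uid))
    _next_uid += 1
    return _next_uid - 1

def cached_score(s: str, t: str, align) -> float:
    """Return the pairwise score, computing it only on a cache miss."""
    key = tuple(sorted((uid(s), uid(t))))  # order-independent (UID, UID) key
    if key not in score_cache:
        score_cache[key] = align(s, t)     # the expensive dynamic-programming step
    return score_cache[key]
```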

March 2, 2004, BMI Biomedical Data Management Sequence -> Unique ID (UID):

Deployment on SMP Machine
– A hash table associates each sequence with a unique integer ID (UID).
– A partitioned B-tree stores the pairwise alignment results; the cache partition is chosen by min(UID1, UID2) % #partitions.
– Multiple threads perform the pairwise alignment computation.
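A sketch of that partition rule with one lock per partition, so concurrent threads mostly touch different partitions; the partition count and data structures are illustrative assumptions:

```python
import threading

NUM_PARTITIONS = 8  # illustrative value; the slides do not fix a count
partitions = [{} for _ in range(NUM_PARTITIONS)]           # one score table per partition
locks = [threading.Lock() for _ in range(NUM_PARTITIONS)]  # one lock per partition

def partition_of(uid1: int, uid2: int) -> int:
    """The slide's rule: partition chosen by min(UID1, UID2) % #partitions."""
    return min(uid1, uid2) % NUM_PARTITIONS

def lookup_or_compute(uid1: int, uid2: int, compute) -> float:
    p = partition_of(uid1, uid2)
    key = (min(uid1, uid2), max(uid1, uid2))
    with locks[p]:
        hit = partitions[p].get(key)
    if hit is None:
        hit = compute()  # alignment runs outside the lock; a racing duplicate is harmless
        with locks[p]:
            partitions[p][key] = hit
    return hit
```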

DataCutter
A component framework for combined task/data parallelism. Core services:
– Indexing service: multilevel hierarchical indexes based on the R-tree indexing method.
– Filtering service: a distributed C++ component framework.
  – The user defines a sequence of pipelined components (filters and filter groups), supporting pleasingly parallel and generalized-reduction patterns.
  – A user directive tells the preprocessor/runtime system to generate and instantiate copies of filters.
  – Stream-based communication; multiple filter groups can be active simultaneously.
  – Flow control between transparent filter copies: individual filters are replicated, and "transparent" means each copy still sees the illusion of a single stream.
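DataCutter itself is a C++ framework; purely to make the filter-stream model concrete, here is a toy pipeline runner of ours that connects filters with queues and runs each in its own thread:

```python
from queue import Queue
from threading import Thread

class Filter:
    """A pipeline stage: transform one item from the input stream to the output stream."""
    def process(self, item):
        raise NotImplementedError

def run_pipeline(filters, inputs):
    """Wire filters together with queues (the 'streams') and run them concurrently."""
    streams = [Queue() for _ in range(len(filters) + 1)]
    END = object()  # sentinel marking end-of-stream

    def worker(f, src, dst):
        while (item := src.get()) is not END:
            dst.put(f.process(item))
        dst.put(END)  # propagate end-of-stream downstream

    threads = [Thread(target=worker, args=(f, streams[i], streams[i + 1]))
               for i, f in enumerate(filters)]
    for t in threads:
        t.start()
    for item in inputs:
        streams[0].put(item)
    streams[0].put(END)
    out = []
    while (item := streams[-1].get()) is not END:
        out.append(item)
    for t in threads:
        t.join()
    return out
```

Real DataCutter filters are additionally replicated across hosts with flow control between copies; this sketch shows only the single-copy pipelined case.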

Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v1. Pipeline: Hash (UniqueID) -> Cache & Compute -> CLUSTALW.
– Hash filter: stores/computes the sequence-to-unique-ID mapping; the hash is partitioned (declustered).
– Cache filter: a partitioned (declustered) cache that computes a pairwise alignment if it does not exist in the cache. Owner-computes placement can cause computational imbalance.
– CLUSTALW filter: performs guide tree generation and progressive alignment.

Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v2. Pipeline: Hash (UniqueID) -> Cache -> Pairwise Align -> CLUSTALW.
– DC-ClustalW-v1 plus a separate Pairwise Alignment filter: cache misses are computed in the Pairwise Align filter, which balances the computation.
– Handles multiple queries via multiple copies of the CLUSTALW filter.

Deployment on Distributed Memory Machine
DataCutter version of ClustalW – v2: Multiple Query Processing.
Filters: QueryManager (QM), ClustalW (CW), Hash (H), Cache (C), Pairwise Alignment (P).
Placement: QM on Host-0; one copy of H, C, and P on each of Host-1 through Host-n; one CW copy on each of Host-n+1 through Host-2n.

Experimental Setup
1. Pentium III 650MHz, 768MB memory; 1,000 random sequences from GPCR, average length 450 amino acids per sequence.
2. 24-processor Sun Fire 6800, 750MHz, 24GB memory; 350 MSA queries from GPCR, ranging from 2 to over 200 sequences per query.
3. 16 Pentium III 933MHz nodes, each with 512MB memory and 3x100GB IDE disks; 64 queries, each consisting of 40 unique protein sequences from GPCR, average length 450 amino acids per sequence.

Experiment 1 – Execution Time of CLUSTAL W
Setup 1: Pentium III 650MHz, 768MB memory; 1,000 random sequences from GPCR, average length 450 amino acids per sequence.

Experiment 2 – SMP Results
Setup 2: 24-processor Sun Fire 6800, 750MHz, 24GB memory; 350 MSA queries from GPCR, from 2 to over 200 sequences per query.

Experiment 3 – Distributed Memory, DataCutter version of ClustalW – v1
Setup 3: 16 Pentium III 933MHz nodes, 512MB memory, 3x100GB IDE disks; 64 queries, each consisting of 40 unique protein sequences from GPCR, average length 450 amino acids per sequence.


Experiment 3 – Distributed Memory, DataCutter version of ClustalW – v2
One ClustalW filter: intra-query parallelization. Setup 3: 16 Pentium III 933MHz nodes, 512MB memory, 3x100GB IDE disks; 64 queries, each consisting of 40 unique protein sequences from GPCR, average length 450 amino acids per sequence.

Experiment 3 – Distributed Memory, DataCutter version of ClustalW – v2
Multiple ClustalW filters: inter-query parallelization. Setup 3: 16 Pentium III 933MHz nodes (512MB memory, 3x100GB IDE disks), with 8 nodes each running a copy of Hash, Cache, and PairAlign, and 8 nodes running ClustalW; 64 queries, each consisting of 40 unique protein sequences from GPCR, average length 450 amino acids per sequence.

Conclusion
– Caching intermediate results turns a computationally intensive application into a data-intensive one.
– Implementations were presented for SMP machines and, using DataCutter, for distributed memory machines.