Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism
Peter Krusche and Alexander Tiskin
Department of Computer Science, University of Warwick
May 9, 2006

Outline

1. Introduction
   • LLCS computation
   • The BSP model
2. Problem definition and algorithms
   • Standard algorithm
   • Parallel algorithm
3. Experiments
   • Experiment setup
   • Predictions
   • Speedup

Motivation

Computing the (length of the) longest common subsequence is representative of a class of dynamic programming algorithms. Hence, we want to:
• Examine the suitability of high-level BSP programming for such problems
• Compare different BSP libraries on different systems
• See what happens when there is good sequential performance
• Examine performance predictability

Related Work

• Sequential dynamic programming algorithm (Hirschberg, 1975)
• Crochemore, Iliopoulos, Pinzon, Reid: A fast and practical bit-vector algorithm for the longest common subsequence problem (2001)
• Alves, Cáceres, Dehne: Parallel dynamic programming for solving the string editing problem on a CGM/BSP (2002)
• Garcia, Myoupo, Semé: A coarse-grained multicomputer algorithm for the longest common subsequence problem (2003)

Our Work

• Combination of bit-parallel algorithms and fast BSP-style communication
• A BSP performance model and predictions
• Comparison using different libraries on different systems
• Estimation of the block size parameter before the computation for better speedup

The BSP Model

• p identical processor/memory pairs (computing nodes)
• Computation speed f on every node
• Arbitrary interconnection network with latency l and bandwidth gap g

BSP Programs

• SPMD execution, organised in supersteps
• Communication may be delayed until the end of the superstep
• Time/cost formula: T = f·W + g·H + l·S, where W is the local work, H the communicated data, and S the number of supersteps
• Bytes are used as the base unit for communication size
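To make the formula concrete, here is a tiny helper (not from the talk) that evaluates T = f·W + g·H + l·S; the parameter values in the example call are hypothetical and purely illustrative.

```cpp
#include <cstdio>

// Evaluate the BSP cost formula T = f*W + g*H + l*S.
//   f: time per local operation      W: local work (operations)
//   g: time per byte communicated    H: communicated data (bytes)
//   l: cost per superstep            S: number of supersteps
double bsp_cost(double f, double W, double g, double H, double l, double S) {
    return f * W + g * H + l * S;
}

int main() {
    // Hypothetical machine and program parameters, for illustration only.
    double T = bsp_cost(/*f=*/1e-8, /*W=*/1e9, /*g=*/5e-9, /*H=*/1e7, /*l=*/1e-4, /*S=*/100);
    std::printf("predicted running time: %.2f s\n", T);
}
```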

Problem Definition

Let X = x_1 x_2 ... x_m and Y = y_1 y_2 ... y_n be two strings over a finite alphabet Σ.
• A subsequence U of a string X can be obtained by deleting zero or more elements from X, i.e. U = x_{i_1} x_{i_2} ... x_{i_k} with i_q < i_{q+1} for all q, 1 ≤ q < k.
• For strings X and Y, LCS(X, Y) is any string which is a subsequence of both X and Y and has maximum possible length.
• The length of such a sequence is denoted LLCS(X, Y).

Sequential Algorithm

• Dynamic programming matrix L_{0..m, 0..n}
• L_{i,j} = LLCS(x_1 x_2 ... x_i, y_1 y_2 ... y_j)
• L_{i,j} = L_{i-1,j-1} + 1 if x_i = y_j, and max(L_{i-1,j}, L_{i,j-1}) otherwise
• The values in this matrix can be computed in O(mn) time and space
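For concreteness, a minimal C++ version of this textbook recurrence (illustrative only, not the code used in the experiments):

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Length of the longest common subsequence of x and y, computed with the
// standard O(mn) dynamic programming recurrence.
std::size_t llcs(const std::string& x, const std::string& y) {
    const std::size_t m = x.size(), n = y.size();
    std::vector<std::vector<std::size_t>> L(m + 1, std::vector<std::size_t>(n + 1, 0));
    for (std::size_t i = 1; i <= m; ++i)
        for (std::size_t j = 1; j <= n; ++j)
            L[i][j] = (x[i - 1] == y[j - 1]) ? L[i - 1][j - 1] + 1
                                             : std::max(L[i - 1][j], L[i][j - 1]);
    return L[m][n];
}

int main() {
    std::cout << llcs("ABCBDAB", "BDCABA") << "\n"; // prints 4
}
```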

Parallel Algorithm

• Based on a simple parallel algorithm for grid DAG computation
• The dynamic programming matrix L is partitioned into a grid of rectangular blocks of size (m/G) × (n/G) (G: grid size)
• Blocks on an anti-diagonal wavefront can be processed in parallel (see the scheduling sketch below)
• Assumptions:
  ▫ Strings of equal length, m = n
  ▫ The ratio α = G/p is an integer
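The following toy C++ sketch only illustrates the wavefront schedule: it prints which block of the G × G grid is processed by which processor in which superstep, assuming the block-cyclic assignment of block-columns described on the next slide; p and α are example values, not taken from the talk.

```cpp
#include <cstdio>

// Block (i, j) depends on blocks (i-1, j) and (i, j-1), so all blocks with
// i + j == d are independent and can form one parallel superstep.
// Block-column j is assigned to processor j mod p (block-cyclic).
int main() {
    const int p = 4;          // number of processors (example value)
    const int alpha = 2;      // grid size parameter, G = alpha * p (example value)
    const int G = alpha * p;
    for (int d = 0; d <= 2 * (G - 1); ++d) {   // one superstep per anti-diagonal
        std::printf("superstep %2d:", d);
        for (int i = 0; i < G; ++i) {
            int j = d - i;
            if (j < 0 || j >= G) continue;      // block outside the grid
            std::printf("  block(%d,%d)->P%d", i, j, j % p);
        }
        std::printf("\n");
    }
}
```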

Parallel Cost Model

• Input/output data distribution is block-cyclic
  ► Data for entire block-columns can be kept locally
• Running time: depends on f, g, l and the grid size parameter α (a rough cost estimate is sketched below)
  ► The parameter α can be used to tune performance
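One plausible way to estimate such a running time under the stated assumptions (m = n, G = αp, one superstep per block anti-diagonal, one boundary row of n/G entries received per block) is the following; the exact expression on the slide may differ in detail.

```latex
\begin{align*}
S &\approx 2G - 1 = 2\alpha p - 1
   && \text{(supersteps: one per block anti-diagonal)}\\
W &\approx \left(\frac{G^2}{p} + 2G\right)\left(\frac{n}{G}\right)^{2}
   = \frac{n^2}{p}\left(1 + \frac{2}{\alpha}\right)
   && \text{(local work per processor, in matrix entries)}\\
H &\approx \frac{G^2}{p}\cdot\frac{n}{G} = \alpha n
   && \text{(data received per processor, in matrix entries)}\\
T &\approx f\,\frac{n^2}{p}\left(1 + \frac{2}{\alpha}\right) + g\,\alpha n + l\,(2\alpha p - 1)
\end{align*}
```

Read this way, increasing α shrinks the wavefront fill-in overhead (the 2/α term) at the price of more communication and more supersteps, which is why α acts as a tuning parameter.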

Bit-Parallel Algorithms

• Bit-parallel computation processes ω entries of L in parallel (ω: machine word size)
• This leads to a substantial speedup in the sequential computation phase and slightly lower communication cost per superstep
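As an illustration, here is a single-word C++ sketch of an LLCS kernel in the style of the bit-vector algorithm of Crochemore et al. cited earlier; it assumes |Y| ≤ 64 so that one 64-bit word covers a whole row, and it is not the implementation used in the talk.

```cpp
#include <bit>        // std::popcount (requires C++20)
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <string>

// LLCS via a bit-vector row update in the style of Crochemore et al. (2001).
// One machine word per row; assumes y.size() <= 64.
std::size_t llcs_bitparallel(const std::string& x, const std::string& y) {
    const std::size_t n = y.size();
    std::uint64_t match[256] = {0};
    for (std::size_t j = 0; j < n; ++j)                 // bit j of match[c] is set
        match[static_cast<unsigned char>(y[j])] |= std::uint64_t{1} << j;  // iff y[j] == c

    std::uint64_t V = ~std::uint64_t{0};                // all ones
    for (unsigned char c : x) {
        std::uint64_t U = V & match[c];
        V = (V + U) | (V - U);                          // carries implement the DP row update
    }
    // LLCS = number of zero bits among the n low-order bits of V.
    std::uint64_t lowmask = (n == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << n) - 1);
    return static_cast<std::size_t>(std::popcount(~V & lowmask));
}

int main() {
    std::cout << llcs_bitparallel("ABCBDAB", "BDCABA") << "\n"; // prints 4
}
```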

Systems Used

Measurements were taken on parallel machines at the Centre for Scientific Computing:
• aracari: IBM cluster, 64 × 2-way SMP Pentium 3 1.4 GHz, 128 GB of memory (interconnect: Myrinet 2000, MPI: mpich-gm)
• argus: Linux cluster, 31 × 2-way SMP Pentium 4 Xeon 2.6 GHz, 62 GB of memory (interconnect: 100 Mbit Ethernet, MPI: mpich-p4)
• skua: SGI Altix shared-memory machine, 56 × Itanium processors, 112 GB of memory (MPI: SGI native)

BSP Libraries Used

• The Oxford BSP Toolset on top of MPI (oxtool/)
• PUB on top of MPI (except on the SGI) (wwwcs.uni-paderborn.de/~bsp/)
• A simple BSPlib implementation based on MPI(-2)

Input and Parameters

• Input strings are generated randomly and have equal length
• Predictability was examined for string lengths between 8192 and 65536 and grid size parameter α between 1 and 5
• The values of l and g were measured by timing random permutations (a benchmarking sketch follows below)
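The benchmark code itself is not shown in the talk; below is a minimal MPI sketch of one way l and g could be estimated, timing all-to-all exchanges of two sizes (a stand-in for the random-permutation traffic) and fitting T(h) ≈ l + g·h.

```cpp
#include <mpi.h>
#include <cstdio>
#include <vector>

// Time an all-to-all exchange of 'count' doubles per processor pair and
// return the average time per repetition. The exchange approximates an
// h-relation with h = count * (p - 1) * sizeof(double) bytes per processor.
double time_exchange(int count, int reps) {
    int p;
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    std::vector<double> send(static_cast<std::size_t>(count) * p, 1.0);
    std::vector<double> recv(static_cast<std::size_t>(count) * p, 0.0);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int r = 0; r < reps; ++r)
        MPI_Alltoall(send.data(), count, MPI_DOUBLE,
                     recv.data(), count, MPI_DOUBLE, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();
    return (t1 - t0) / reps;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    // Two measurement points give a crude linear fit T(h) = l + g*h.
    int small = 16, large = 16384;
    double t_small = time_exchange(small, 50);
    double t_large = time_exchange(large, 50);
    if (rank == 0) {
        double h_small = static_cast<double>(small) * (p - 1) * sizeof(double);
        double h_large = static_cast<double>(large) * (p - 1) * sizeof(double);
        double g = (t_large - t_small) / (h_large - h_small); // seconds per byte
        double l = t_small - g * h_small;                     // seconds per superstep
        std::printf("g ~ %.3g s/byte, l ~ %.3g s\n", g, l);
    }
    MPI_Finalize();
}
```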

Experimental Values of f and f'

Simple algorithm (f):
• skua: 130 M op/s
• argus: 61 M op/s
• aracari: 86 M op/s

Bit-parallel algorithm (f'):
• skua: 4.5 G op/s
• argus: 2.9 G op/s
• aracari: 1.8 G op/s

Predictions

Good results on distributed-memory systems (aracari, MPI, 32 processors).

Predictions

Slightly worse results on shared memory (skua, MPI, p = 32).

Problems when Predicting Performance

• Results for PUB are less accurate on shared memory
• Setup costs are only covered by the parameter l
  ► difficult to measure
  ► problems on the shared-memory machine when the communication size is small
• PUB shows a performance break-in when the communication size reaches a certain value
• A busy communication network can create 'spikes'

Predictions for the Bit-Parallel Version

• Good results on distributed-memory systems
• Results on the SGI have a larger prediction error because the local computations use block sizes for which f' is not stable

Speedup Results (LLCS, aracari)

Speedup for the Bit-Parallel Version

• Speedup is slightly lower than for the standard version
• However, overall running times for the same problem sizes are shorter
• Parallel speedup can be expected for larger problem sizes

Speedup for the Bit-Parallel Version (argus, p = 10; skua, p = 32)

Result Summary

Summary and Outlook

Summary:
▫ High-level BSP programming is efficient for the dynamic programming problem we considered
▫ Implementations benefit from a low-latency library (the Oxford BSP Toolset, PUB)
▫ Very good predictability

Outlook:
▫ Different modelling of bandwidth would allow better predictions
▫ Lower latency is possible by using subgroup synchronization
▫ Extraction of an actual LCS is possible, using a post-processing step or a different algorithm