Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism
Peter Krusche and Alexander Tiskin
Department of Computer Science, University of Warwick
9 May 2006
Outline
1. Introduction: LLCS computation; the BSP model
2. Problem definition and algorithms: standard algorithm; parallel algorithm
3. Experiments: experiment setup; predictions; speedup
Motivation
Computing the (length of the) longest common subsequence is representative of a class of dynamic programming algorithms. Hence, we want to:
▫ Examine the suitability of high-level BSP programming for such problems
▫ Compare different BSP libraries on different systems
▫ See what happens when sequential performance is good
▫ Examine performance predictability
Related Work
▫ Sequential dynamic programming algorithm (Hirschberg, 1975)
▫ Crochemore, Iliopoulos, Pinzon, Reid: A fast and practical bit-vector algorithm for the longest common subsequence problem (2001)
▫ Alves, Cáceres, Dehne: Parallel dynamic programming for solving the string editing problem on a CGM/BSP (2002)
▫ Garcia, Myoupo, Semé: A coarse-grained multicomputer algorithm for the longest common subsequence problem (2003)
Our Work
▫ Combination of bit-parallel algorithms and fast BSP-style communication
▫ A BSP performance model and predictions
▫ Comparison using different libraries on different systems
▫ Estimation of the block size parameter before the computation, for better speedup
The BSP Model
▫ p identical processor/memory pairs (computing nodes)
▫ Computation speed f on every node
▫ Arbitrary interconnection network with latency l and bandwidth gap g
BSP Programs
▫ SPMD execution, organized in supersteps
▫ Communication may be delayed until the end of the superstep
▫ Time/cost formula: T = f·W + g·H + l·S, where W is the local work, H the communicated data volume, and S the number of supersteps
▫ Bytes are used as the base unit of communication size
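As a toy illustration, the BSP cost formula can be evaluated directly; the machine parameters and program profile below are hypothetical values, not measurements from the talk:

```python
def bsp_time(f, g, l, W, H, S):
    """Predicted BSP running time: local work + communication + synchronization."""
    return f * W + g * H + l * S

# Hypothetical parameters: f = 1 ns/op, g = 4 ns/byte, l = 10 us,
# with 10^8 operations, 10^6 bytes communicated, 50 supersteps.
t = bsp_time(f=1e-9, g=4e-9, l=1e-5, W=10**8, H=10**6, S=50)
# t = 0.1 + 0.004 + 0.0005 seconds; local work dominates here.
```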
Problem Definition
▫ Let X = x_1 x_2 … x_m and Y = y_1 y_2 … y_n be two strings over a finite alphabet.
▫ A subsequence U of a string X is any string that can be obtained by deleting zero or more elements from X, i.e. U = x_{i_1} x_{i_2} … x_{i_k} with i_q < i_{q+1} for all q with 1 ≤ q < k.
▫ A longest common subsequence LCS(X, Y) is any string which is a subsequence of both X and Y and has maximum possible length; the length of such sequences is denoted LLCS(X, Y).
Sequential Algorithm
▫ Dynamic programming matrix L_{0..m, 0..n} with L_{i,j} = LLCS(x_1 x_2 … x_i, y_1 y_2 … y_j)
▫ The values in this matrix can be computed in O(mn) time and space.
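The recurrence behind the matrix L can be sketched as follows (a minimal Python illustration, not the authors' implementation; it keeps only one row of L at a time, so the space used drops from O(mn) to O(n)):

```python
def llcs(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y in O(mn) time.

    L[i][j] = L[i-1][j-1] + 1          if x[i] == y[j]
            = max(L[i-1][j], L[i][j-1]) otherwise
    """
    prev = [0] * (len(y) + 1)          # row i-1 of L
    for xi in x:
        cur = [0] * (len(y) + 1)       # row i of L
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[len(y)]
```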
Parallel Algorithm
▫ Based on a simple parallel algorithm for grid DAG computation
▫ The dynamic programming matrix L is partitioned into a grid of rectangular blocks of size (m/G)×(n/G) (G: grid size)
▫ Blocks on the same wavefront (antidiagonal) can be processed in parallel
▫ Assumptions: strings of equal length m = n; the ratio a = G/p is an integer
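The wavefront ordering over the block grid can be sketched as follows (schematic Python; in the actual algorithm each block is an (m/G)×(n/G) dynamic programming sweep, and the blocks of one wavefront are distributed over the p processors):

```python
def wavefronts(G: int):
    """Enumerate the blocks of a G x G grid in wavefront (antidiagonal) order.

    Block (i, j) depends only on blocks (i-1, j), (i, j-1) and (i-1, j-1),
    which all lie on earlier antidiagonals, so the blocks within one
    wavefront have no mutual dependencies and may run in parallel."""
    return [[(i, d - i) for i in range(G) if 0 <= d - i < G]
            for d in range(2 * G - 1)]
```

For G = 2 this yields [[(0, 0)], [(0, 1), (1, 0)], [(1, 1)]]: three wavefronts, of which the middle one exposes two-way parallelism.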
Parallel Cost Model
▫ Input/output data distribution is block-cyclic ► data for block-columns can be kept locally
▫ Running time: given by the BSP cost formula T = f·W + g·H + l·S applied to the wavefront schedule
▫ The parameter a = G/p can be used to tune performance
Bit-Parallel Algorithms
▫ Bit-parallel computation processes w entries of L in parallel (w: machine word size)
▫ This leads to substantial speedup in the sequential computation phase and slightly lower communication cost per superstep.
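One standard formulation of such a bit-vector LLCS computation (in the spirit of Crochemore et al. 2001; a Python sketch in which unbounded integers stand in for the w-bit machine words, so it is not word-size-limited like the real implementation):

```python
def bit_llcs(x: str, y: str) -> int:
    """LLCS via bit-parallelism: one arithmetic/logic update per column
    character touches all m row entries of L at once."""
    m = len(x)
    mask = (1 << m) - 1
    # Match masks: bit i of match[c] is set iff x[i] == c.
    match = {}
    for i, c in enumerate(x):
        match[c] = match.get(c, 0) | (1 << i)
    v = mask                      # all ones; each gained LCS unit clears a bit
    for c in y:
        mc = match.get(c, 0)
        v = ((v + (v & mc)) | (v & ~mc)) & mask
    return bin(v ^ mask).count("1")   # number of zero bits in v
```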
Systems Used
Measurements on parallel machines at the Centre for Scientific Computing:
▫ aracari: IBM cluster, 64 × 2-way SMP Pentium 3, 1.4 GHz, 128 GB of memory (interconnect: Myrinet 2000; MPI: mpich-gm)
▫ argus: Linux cluster, 31 × 2-way SMP Pentium 4 Xeon, 2.6 GHz, 62 GB of memory (interconnect: 100 Mbit Ethernet; MPI: mpich-p4)
▫ skua: SGI Altix shared-memory machine, 56 × Itanium 2, 1.6 GHz, 112 GB of memory (MPI: SGI native)
BSP Libraries Used
▫ The Oxford BSP Toolset on top of MPI (www.bsp-worldwide.org/implmnts/oxtool/)
▫ PUB on top of MPI, except on the SGI (wwwcs.uni-paderborn.de/~bsp/)
▫ A simple BSPlib implementation based on MPI(-2)
Input and Parameters
▫ Input strings generated randomly, of equal length
▫ Predictability examined for string lengths between 8192 and 65536 and grid size parameter a between 1 and 5
▫ Values of l and g measured by timing random permutations
Experimental Values of f and f′

Simple algorithm (f):
▫ skua: 0.008 µs/op (130 M op/s)
▫ argus: 0.016 µs/op (61 M op/s)
▫ aracari: 0.012 µs/op (86 M op/s)

Bit-parallel algorithm (f′):
▫ skua: 0.00022 µs/op (4.5 G op/s)
▫ argus: 0.00034 µs/op (2.9 G op/s)
▫ aracari: 0.00055 µs/op (1.8 G op/s)
Predictions
Good results on distributed-memory systems (aracari/MPI, 32 processors)
Predictions
Slightly worse results on shared memory (skua, MPI, p = 32)
Problems when Predicting Performance
▫ Results for PUB are less accurate on shared memory
▫ Setup costs are covered only by the parameter l ► difficult to measure ► problems on the shared-memory machine when the communication size is small
▫ PUB shows a performance break (step change) once the communication size reaches a certain value
▫ A busy communication network can create 'spikes'
Predictions for the Bit-Parallel Version
▫ Good results on distributed-memory systems
▫ Results on the SGI have a larger prediction error, because the local computations use block sizes for which f′ is not stable
Speedup Results (LLCS, aracari)
Speedup for the Bit-Parallel Version
▫ Speedup slightly lower than for the standard version
▫ However, overall running times for the same problem sizes are shorter
▫ Parallel speedup can be expected for larger problem sizes
Speedup for the Bit-Parallel Version (argus, p = 10; skua, p = 32)
Result Summary
Summary and Outlook
Summary
▫ High-level BSP programming is efficient for the dynamic programming problem we considered.
▫ Implementations benefit from a low-latency BSP library (the Oxford BSP Toolset, PUB).
▫ Very good predictability.
Outlook
▫ Different modeling of the bandwidth would allow better predictions.
▫ Lower latency is possible by using subgroup synchronization.
▫ Extraction of an actual LCS is possible, using a post-processing step or a different algorithm.