Efficient Longest Common Subsequence Computation using Bulk-Synchronous Parallelism
Peter Krusche and Alexander Tiskin
Department of Computer Science, University of Warwick
9 May 2006
Outline
1. Introduction: LLCS computation; the BSP model
2. Problem definition and algorithms: standard algorithm; parallel algorithm
3. Experiments: experiment setup; predictions; speedup
Motivation
Computing the (length of the) longest common subsequence is representative of a class of dynamic programming algorithms. Hence, we want to:
▫ Examine the suitability of high-level BSP programming for such problems
▫ Compare different BSP libraries on different systems
▫ See what happens when sequential performance is good
▫ Examine performance predictability
Related Work
▫ Sequential dynamic programming algorithm (Hirschberg, 1975)
▫ Crochemore, Iliopoulos, Pinzon, Reid: A fast and practical bit-vector algorithm for the longest common subsequence problem (2001)
▫ Alves, Cáceres, Dehne: Parallel dynamic programming for solving the string editing problem on a CGM/BSP (2002)
▫ Garcia, Myoupo, Semé: A coarse-grained multicomputer algorithm for the longest common subsequence problem (2003)
Our Work
▫ Combination of bit-parallel algorithms and fast BSP-style communication
▫ A BSP performance model and predictions
▫ Comparison using different libraries on different systems
▫ Estimation of the block size parameter before the computation, for better speedup
The BSP Model
▫ p identical processor/memory pairs (computing nodes)
▫ Computation speed f on every node
▫ Arbitrary interconnection network with latency l and bandwidth gap g
BSP Programs
▫ SPMD execution, organized in supersteps
▫ Communication may be delayed until the end of the superstep
▫ Time/cost formula: T = f·W + g·H + l·S, where W is the local work, H the communicated data volume, and S the number of supersteps
▫ Bytes are used as the base unit of communication size
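As a toy illustration, the BSP cost formula can be evaluated directly; the machine parameters and program profile below are hypothetical values, not measurements from the talk:

```python
def bsp_time(f, g, l, W, H, S):
    """Predicted BSP running time: local work + communication + synchronization."""
    return f * W + g * H + l * S

# Hypothetical parameters: f = 1 ns/op, g = 4 ns/byte, l = 10 us,
# with 10^8 operations, 10^6 bytes communicated, 50 supersteps.
t = bsp_time(f=1e-9, g=4e-9, l=1e-5, W=10**8, H=10**6, S=50)
# t = 0.1 + 0.004 + 0.0005 seconds; local work dominates here.
```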
Problem Definition
▫ Let X = x_1 x_2 … x_m and Y = y_1 y_2 … y_n be two strings over a finite alphabet.
▫ A subsequence U of a string X is any string that can be obtained by deleting zero or more elements from X, i.e. U = x_{i_1} x_{i_2} … x_{i_k} with i_q < i_{q+1} for all q with 1 ≤ q < k.
▫ A longest common subsequence LCS(X, Y) is any string which is a subsequence of both X and Y and has maximum possible length; the length of such sequences is denoted LLCS(X, Y).
Sequential Algorithm
▫ Dynamic programming matrix L_{0..m, 0..n} with L_{i,j} = LLCS(x_1 x_2 … x_i, y_1 y_2 … y_j)
▫ The values in this matrix can be computed in O(mn) time and space.
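The recurrence behind the matrix L can be sketched as follows (a minimal Python illustration, not the authors' implementation; it keeps only one row of L at a time, so the space used drops from O(mn) to O(n)):

```python
def llcs(x: str, y: str) -> int:
    """Length of the longest common subsequence of x and y in O(mn) time.

    L[i][j] = L[i-1][j-1] + 1          if x[i] == y[j]
            = max(L[i-1][j], L[i][j-1]) otherwise
    """
    prev = [0] * (len(y) + 1)          # row i-1 of L
    for xi in x:
        cur = [0] * (len(y) + 1)       # row i of L
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                cur[j] = prev[j - 1] + 1
            else:
                cur[j] = max(prev[j], cur[j - 1])
        prev = cur
    return prev[len(y)]
```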
Parallel Algorithm
▫ Based on a simple parallel algorithm for grid DAG computation
▫ The dynamic programming matrix L is partitioned into a grid of rectangular blocks of size (m/G)×(n/G) (G: grid size)
▫ Blocks on the same wavefront (antidiagonal) can be processed in parallel
▫ Assumptions: strings of equal length m = n; the ratio a = G/p is an integer
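The wavefront ordering over the block grid can be sketched as follows (schematic Python; in the actual algorithm each block is an (m/G)×(n/G) dynamic programming sweep, and the blocks of one wavefront are distributed over the p processors):

```python
def wavefronts(G: int):
    """Enumerate the blocks of a G x G grid in wavefront (antidiagonal) order.

    Block (i, j) depends only on blocks (i-1, j), (i, j-1) and (i-1, j-1),
    which all lie on earlier antidiagonals, so the blocks within one
    wavefront have no mutual dependencies and may run in parallel."""
    return [[(i, d - i) for i in range(G) if 0 <= d - i < G]
            for d in range(2 * G - 1)]
```

For G = 2 this yields [[(0, 0)], [(0, 1), (1, 0)], [(1, 1)]]: three wavefronts, of which the middle one exposes two-way parallelism.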
Parallel Cost Model
▫ Input/output data distribution is block-cyclic ► data for block-columns can be kept locally
▫ Running time: given by the BSP cost formula T = f·W + g·H + l·S applied to the wavefront schedule
▫ The parameter a = G/p can be used to tune performance
Bit-Parallel Algorithms
▫ Bit-parallel computation processes w entries of L in parallel (w: machine word size)
▫ This leads to substantial speedup in the sequential computation phase and slightly lower communication cost per superstep.
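One standard formulation of such a bit-vector LLCS computation (in the spirit of Crochemore et al. 2001; a Python sketch in which unbounded integers stand in for the w-bit machine words, so it is not word-size-limited like the real implementation):

```python
def bit_llcs(x: str, y: str) -> int:
    """LLCS via bit-parallelism: one arithmetic/logic update per column
    character touches all m row entries of L at once."""
    m = len(x)
    mask = (1 << m) - 1
    # Match masks: bit i of match[c] is set iff x[i] == c.
    match = {}
    for i, c in enumerate(x):
        match[c] = match.get(c, 0) | (1 << i)
    v = mask                      # all ones; each gained LCS unit clears a bit
    for c in y:
        mc = match.get(c, 0)
        v = ((v + (v & mc)) | (v & ~mc)) & mask
    return bin(v ^ mask).count("1")   # number of zero bits in v
```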
Systems Used
Measurements on parallel machines at the Centre for Scientific Computing:
▫ aracari: IBM cluster, 64 × 2-way SMP Pentium 3, 1.4 GHz, 128 GB of memory (interconnect: Myrinet 2000; MPI: mpich-gm)
▫ argus: Linux cluster, 31 × 2-way SMP Pentium 4 Xeon, 2.6 GHz, 62 GB of memory (interconnect: 100 Mbit Ethernet; MPI: mpich-p4)
▫ skua: SGI Altix shared-memory machine, 56 × Itanium 2, 1.6 GHz, 112 GB of memory (MPI: SGI native)
BSP Libraries Used
▫ The Oxford BSP Toolset on top of MPI (www.bsp-worldwide.org/implmnts/oxtool/)
▫ PUB on top of MPI, except on the SGI (wwwcs.uni-paderborn.de/~bsp/)
▫ A simple BSPlib implementation based on MPI(-2)
Input and Parameters
▫ Input strings generated randomly, of equal length
▫ Predictability examined for string lengths between 8192 and 65536 and grid size parameter a between 1 and 5
▫ Values of l and g measured by timing random permutations
Experimental Values of f and f′

Simple algorithm (f):
▫ skua: 0.008 µs/op (130 M op/s)
▫ argus: 0.016 µs/op (61 M op/s)
▫ aracari: 0.012 µs/op (86 M op/s)

Bit-parallel algorithm (f′):
▫ skua: 0.00022 µs/op (4.5 G op/s)
▫ argus: 0.00034 µs/op (2.9 G op/s)
▫ aracari: 0.00055 µs/op (1.8 G op/s)
Predictions
Good results on distributed-memory systems (aracari/MPI, 32 processors)
Predictions
Slightly worse results on shared memory (skua, MPI, p = 32)
Problems when Predicting Performance
▫ Results for PUB are less accurate on shared memory
▫ Setup costs are covered only by the parameter l ► difficult to measure ► problems on the shared-memory machine when the communication size is small
▫ PUB shows a performance break (step change) once the communication size reaches a certain value
▫ A busy communication network can create 'spikes'
Predictions for the Bit-Parallel Version
▫ Good results on distributed-memory systems
▫ Results on the SGI have a larger prediction error, because the local computations use block sizes for which f′ is not stable
Speedup Results (LLCS, aracari)
Speedup for the Bit-Parallel Version
▫ Speedup slightly lower than for the standard version
▫ However, overall running times for the same problem sizes are shorter
▫ Parallel speedup can be expected for larger problem sizes
Speedup for the Bit-Parallel Version (argus, p = 10; skua, p = 32)
Result Summary
Summary and Outlook
Summary
▫ High-level BSP programming is efficient for the dynamic programming problem we considered.
▫ Implementations benefit from a low-latency BSP library (the Oxford BSP Toolset, PUB).
▫ Very good predictability.
Outlook
▫ Different modeling of the bandwidth would allow better predictions.
▫ Lower latency is possible by using subgroup synchronization.
▫ Extraction of an actual LCS is possible, using a post-processing step or a different algorithm.