Processor-oblivious parallel algorithms and scheduling: illustration on parallel prefix
Jean-Louis Roch, Daouda Traore
INRIA-CNRS Moais team - LIG Grenoble, France
Contents
I. What is a processor-oblivious parallel algorithm?
II. Work-stealing scheduling of parallel algorithms
III. Processor-oblivious parallel prefix computation
Workshop “Scheduling Algorithms for New Emerging Applications” - CIRM Luminy - May 29th-June 2nd, 2006

The problem
Problem: compute f(a).
Dynamic architecture: non-fixed number of resources, variable speeds, e.g. a grid, but not only: also an SMP server in multi-user mode.
Candidate algorithms: sequential algorithm, parallel for P=2, parallel for P=100, ..., parallel for P=max.
Target architectures: multi-user SMP server, grid, heterogeneous network.
Which algorithm to choose?

Processor-oblivious algorithms
Dynamic architecture: non-fixed number of resources, variable speeds, e.g. a grid, but not only: also an SMP server in multi-user mode
=> motivates a «processor-oblivious» parallel algorithm that:
+ is independent from the underlying architecture: no reference to p, nor to $\Pi_i(t)$ = speed of processor i at time t, nor ...
+ on a given architecture, has performance guarantees: behaves as well as an optimal (off-line, non-oblivious) one
Problem: often, the larger the parallel degree, the larger the number of operations to perform!

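Prefix computation
Prefix problem:
- input: $a_0, a_1, \dots, a_n$
- output: $\pi_0, \pi_1, \dots, \pi_n$ with $\pi_i = a_0 * a_1 * \dots * a_i$
Sequential algorithm:
    pi[0] = a[0]; for (i = 1; i <= n; i++) pi[i] = pi[i-1] * a[i];
  performs $W_1 = W_\infty = n$ operations
Fine-grain optimal parallel algorithm [Ladner-Fischer]:
- combine the adjacent pairs $a_0 * a_1$, $a_2 * a_3$, ..., $a_{n-1} * a_n$ in parallel
- a recursive prefix of size n/2 on these products yields $\pi_1, \pi_3, \dots, \pi_n$
- one more parallel step of * yields $\pi_2, \pi_4, \dots, \pi_{n-1}$
Critical time $W_\infty = 2 \log n$, but performs $W_1 = 2n$ operations: twice as expensive as the sequential algorithm.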
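The work/depth trade-off above can be checked on a small instance. The following is a minimal sketch, not the exact Ladner-Fischer circuit: it implements the pairwise-recursive scheme described on the slide and counts operations. The function names `sequential_prefix` and `recursive_prefix` are illustrative assumptions, and `depth` counts two parallel steps per recursion level, so it is an upper bound.

```python
from operator import mul

def sequential_prefix(a, op=mul):
    """Return (prefixes, number of applications of op)."""
    out = [a[0]]
    for x in a[1:]:
        out.append(op(out[-1], x))
    return out, len(a) - 1

def recursive_prefix(a, op=mul):
    """Pairwise-recursive prefix. Return (prefixes, work, depth)."""
    n = len(a)
    if n == 1:
        return [a[0]], 0, 0
    # Step 1: combine adjacent pairs; all products are independent.
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    work = len(pairs)
    # Step 2: recursive prefix on about n/2 elements.
    sub, sub_work, sub_depth = recursive_prefix(pairs, op)
    work += sub_work
    out = [None] * n
    out[0] = a[0]
    for j, s in enumerate(sub):
        out[2 * j + 1] = s          # sub[j] = a[0] * ... * a[2j+1]
    # Step 3: fill the remaining positions; again independent products.
    for i in range(2, n, 2):
        out[i] = op(out[i - 1], a[i])
        work += 1
    return out, work, sub_depth + 2

if __name__ == "__main__":
    data = list(range(1, 9))
    print(sequential_prefix(data)[1], recursive_prefix(data)[1:])
```

For 8 elements the recursive scheme performs 11 operations against 7 for the sequential loop, with a depth of 6 parallel steps: the extra work approaches a factor of two as n grows.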

Prefix computation: an example where parallelism always costs
- Any parallel algorithm with critical time $W_\infty$ runs on p processors in time: (bound not preserved in this transcript)
- strict lower bound: block algorithm + pipeline [Nicolau & al. 1996]
- Question: how to design a generic parallel algorithm, independent from the architecture, that achieves optimal performance on any given architecture?
-> design a malleable algorithm whose scheduling adapts the number of operations performed to the architecture
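To see why a parallel prefix tuned for p processors pays extra operations, here is a minimal sketch of a plain static block algorithm. It is not the pipelined schedule of [Nicolau & al. 1996]; the function name `block_prefix` and the assumption 1 <= p <= len(a) are illustrative.

```python
from operator import add

def block_prefix(a, p, op=add):
    """Static block prefix on p blocks. Return (prefixes, operation count).

    Assumes 1 <= p <= len(a), so that every block is non-empty.
    """
    n = len(a)
    bounds = [n * j // p for j in range(p + 1)]
    out = list(a)
    ops = 0
    # Phase 1: local prefix inside each block (blocks are independent).
    for j in range(p):
        for i in range(bounds[j] + 1, bounds[j + 1]):
            out[i] = op(out[i - 1], out[i])
            ops += 1
    # Phase 2: propagate the prefix of everything to the left of each block.
    # The carry is the last, already final, element of the previous block.
    for j in range(1, p):
        carry = out[bounds[j] - 1]
        for i in range(bounds[j], bounds[j + 1]):
            out[i] = op(carry, out[i])
            ops += 1
    return out, ops

if __name__ == "__main__":
    values = list(range(1, 13))
    print(block_prefix(values, 1)[1], block_prefix(values, 3)[1])
```

On 12 elements the count grows from 11 operations with one block to 17 with three blocks; with blocks of equal size it is 2n - p - n/p, so it tends to twice the sequential work as p grows.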

Architecture model
- Heterogeneous processors with changing speed [Bender-Rabin02] => $\Pi_i(t)$ = instantaneous speed of processor i at time t, in operations per second
- Average speed per processor for a computation with duration T: $\Pi_{ave}$ (formula not preserved in this transcript)
- Lower bound for the time of prefix computation: (formula not preserved in this transcript)
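As an illustration of the speed model, the sketch below computes an average speed per processor from a table of sampled speeds. It assumes discrete time steps and takes the average as the total speed summed over all processors and steps, divided by p times T; this is one natural reading of the slide, not a formula recovered from it. The name `average_speed` and the sample values are hypothetical.

```python
def average_speed(speeds):
    """speeds[i][t] = speed of processor i during time step t (ops per step).

    Assumed definition: total capacity over all processors and steps,
    divided by (number of processors * number of steps).
    """
    p = len(speeds)
    T = len(speeds[0])
    return sum(sum(row) for row in speeds) / (p * T)

if __name__ == "__main__":
    # Processor 0 runs at constant speed; processor 1 is slowed down by other users.
    table = [[2, 2, 2, 2],
             [1, 0, 1, 2]]
    print(average_speed(table))
```

Under this reading, p * T * average_speed is the number of operations the machine could execute during the computation, which is the quantity a processor-oblivious algorithm has to use well.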

Work-stealing (1/2)
- «Work» $W_1$ = total number of operations performed
- «Depth» $W_\infty$ = number of operations on a critical path (parallel time on $\infty$ resources)
Work-stealing = “greedy” schedule, but distributed and randomized:
- each processor manages locally the tasks it creates
- when idle, a processor steals the oldest ready task on a remote, non-idle victim processor (randomly chosen)

Work-stealing (2/2)
- «Work» $W_1$ = total number of operations performed
- «Depth» $W_\infty$ = number of operations on a critical path (parallel time on $\infty$ resources)
Interests:
-> suited to heterogeneous architectures with slight modification [Bender-Rabin02]
-> if $W_\infty$ is small enough: near-optimal processor-oblivious schedule, with good probability, on p processors with average speed $\Pi_{ave}$
NB: #successful steals = #task migrations < p $W_\infty$ [Blumofe 98, Narlikar 01, Bender 02]
Implementation: work-first principle [Cilk series-parallel, Kaapi dataflow]
-> move the scheduling overhead onto the steal operations (infrequent case)
-> general case: “local parallelism” implemented by sequential function call
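The stealing rule can be illustrated with a toy, round-based simulation. This is a sketch under simplifying assumptions (unit-speed processors, one action per round, a binary task tree), not the Cilk or Kaapi runtime; `work_stealing_sim` and its parameters are hypothetical names. Each owner pops its newest task, and an idle processor steals the oldest task of a randomly chosen victim.

```python
import random
from collections import deque

def work_stealing_sim(n_leaves, p, seed=0):
    """Simulate p workers on a binary task tree with n_leaves unit leaves.

    Return (rounds, executed_tasks, successful_steals).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].append(n_leaves)            # root task: "handle n_leaves leaves"
    rounds = executed = steals = 0
    while any(deques):
        rounds += 1
        busy = [bool(d) for d in deques]  # who has work at the start of the round
        for w in range(p):
            if busy[w] and deques[w]:
                size = deques[w].pop()    # owner takes its newest task
                executed += 1
                if size > 1:              # split into two child tasks
                    half = size // 2
                    deques[w].append(size - half)
                    deques[w].append(half)
            elif not busy[w]:
                victim = rng.randrange(p)
                if victim != w and deques[victim]:
                    # thief takes the oldest task of the victim
                    deques[w].append(deques[victim].popleft())
                    steals += 1
    return rounds, executed, steals

if __name__ == "__main__":
    print(work_stealing_sim(64, 1))
    print(work_stealing_sim(64, 4))
```

Stealing the oldest task tends to move large subtrees, which keeps steals infrequent; the number of executed tasks is the same whatever p is, only the number of rounds and steals changes.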

How to get both optimal work $W_1$ and small $W_\infty$?
General approach: mix a sequential algorithm with optimal work $W_1$ and a fine-grain parallel algorithm with minimal critical time $W_\infty$
- Folk technique: parallel, then sequential
  - parallel algorithm until a certain «grain»; then use the sequential one
  - drawback: $W_\infty$ increases ;o) ...and also the number of steals
- Work-preserving speed-up technique [Bini-Pan94]: sequential, then parallel. Cascading [Jaja92]: careful interplay of both algorithms to build one with both $W_\infty$ small and $W_1 = O(W_{seq})$
  - use the work-optimal sequential algorithm to reduce the size
  - then use the time-optimal parallel algorithm to decrease the time
  - drawback: sequential at coarse grain and parallel at fine grain ;o(

Alternative: concurrently sequential and parallel
Based on the work-first principle:
- always execute a sequential algorithm, to reduce parallelism overhead
- => use the parallel algorithm only if a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation
Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm
Self-adaptive granularity based on work-stealing
Figure labels: SeqCompute, Extract_par, LastPartComputation, SeqCompute

Adaptive prefix on 3 processors (animated figure, 6 frames). Rows: Main Seq. (sequential), Work-stealer 1 and Work-stealer 2 (parallel). Input: $\pi_0$ and $a_1, \dots, a_{12}$.
1. Main Seq. starts the sequential prefix and computes $\pi_1$; a steal request arrives.
2. Work-stealer 1 holds $a_5, \dots, a_{12}$ and computes the local products $a_5 * \dots * a_i$; Main Seq. keeps $a_1, \dots, a_4$; a second steal request arrives.
3. Work-stealer 2 holds $a_9, \dots, a_{12}$ and computes the local products $a_9 * \dots * a_i$; Work-stealer 1 keeps $a_5, \dots, a_8$. Having reached $\pi_4$, Main Seq. preempts Work-stealer 1.
4. Main Seq. obtains $\pi_8$, then preempts Work-stealer 2 and obtains $\pi_{11}$.
5. Main Seq. finishes with $a_{12}$, giving $\pi_{12}$, while the work-stealers complete the intermediate prefixes.
6. Implicit critical path on the sequential process.
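The frames above can be replayed with a small round-based simulation. This is a minimal sketch, not the authors' implementation: it assumes unit-speed processors, lets thieves steal only from the sequential process (always the last half of its remaining interval), and charges one round per steal. All names (`adaptive_prefix`, `blocks`, `todo`) are illustrative.

```python
from operator import add

def adaptive_prefix(a, p, op=add):
    """Processor 0 runs the sequential prefix; processors 1..p-1 steal.

    Return (prefixes, total_ops, rounds).
    """
    n = len(a)
    pi = [None] * n        # final prefixes
    loc = [None] * n       # local prefixes computed inside stolen blocks
    pi[0] = a[0]
    lo, hi = 1, n          # interval [lo, hi) owned by the sequential process
    blocks = {}            # start -> [start, end, next] for each stolen block
    todo = []              # pending finalisations: [next, stop, base]
    tasks = [None] * p     # current task of each thief (slot 0 unused)
    ops = rounds = 0
    while lo < n or todo or any(tasks):
        rounds += 1
        if lo < hi:                        # one step of the sequential algorithm
            pi[lo] = op(pi[lo - 1], a[lo])
            ops += 1
            lo += 1
        elif lo < n:                       # reached a stolen block: preempt it
            s, e, k = blocks.pop(lo)
            if k > s:                      # jump over the thief's progress
                pi[k - 1] = op(pi[s - 1], loc[k - 1])
                ops += 1
                if k - 1 > s:              # loc[s..k-2] still need the base
                    todo.append([s, k - 1, pi[s - 1]])
            lo, hi = k, e
        for t in range(1, p):
            task = tasks[t]
            if task and task[0] == "local" and blocks.get(task[1][0]) is not task[1]:
                task = tasks[t] = None     # our block was preempted
            if task is None:
                if todo:
                    tasks[t] = ("final", todo.pop())
                elif hi - lo >= 2:         # steal the last half of the remaining work
                    mid = (lo + hi) // 2
                    blk = blocks[mid] = [mid, hi, mid]
                    hi = mid
                    tasks[t] = ("local", blk)
            elif task[0] == "local":
                blk = task[1]
                k = blk[2]
                if k == blk[0]:
                    loc[k] = a[k]          # first element: a copy, no operation
                else:
                    loc[k] = op(loc[k - 1], a[k])
                    ops += 1
                blk[2] = k + 1
                if blk[2] == blk[1]:
                    tasks[t] = None        # block finished; it waits for preemption
            else:
                job = task[1]
                pi[job[0]] = op(job[2], loc[job[0]])
                ops += 1
                job[0] += 1
                if job[0] == job[1]:
                    tasks[t] = None
    return pi, ops, rounds

if __name__ == "__main__":
    data = list(range(1, 1001))
    for procs in (1, 3):
        result, ops, rounds = adaptive_prefix(data, procs)
        print(procs, ops, rounds)
```

With one processor the simulation degenerates to the sequential loop and performs exactly n-1 operations. With more processors each stolen element costs at most one local operation plus one finalisation, so the total stays below twice the sequential work, and the sequential process is never interrupted except to preempt a thief.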

Analysis of the algorithm
Execution time: (formula not preserved in this transcript; the slide labels part of it “Lower bound”)
Sketch of the proof:
- Dynamic coupling of two algorithms that complete simultaneously:
  - Sequential: (optimal) number of operations S on one processor
  - Parallel: minimal time, but performs X operations on the other processors
    - dynamic splitting is always possible down to the finest grain, BUT local execution is sequential
    - critical path is small (e.g. log X)
    - each non-constant-time task can potentially be split (variable speeds)
- The algorithmic scheme ensures $T_s = T_p + O(\log X)$
  => makes it possible to bound the total number X of operations performed and the overhead of parallelism = (S + X) - #ops_optimal

Adaptive prefix: experiments 1
Single-user context: processor-oblivious prefix achieves near-optimal performance:
- close to the lower bound both on 1 processor and on p processors
- less sensitive to system overhead: even better than the theoretically “optimal” off-line parallel algorithm on p processors
Figure: prefix sum of $8 \cdot 10^6$ doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), single-user context. Axes: time (s) versus #processors. Curves: pure sequential, optimal off-line on p procs, oblivious.

Adaptive prefix: experiments 2
Multi-user context, with additional external load: (9-p) additional external dummy processes are executed concurrently.
- Processor-oblivious prefix computation is always the fastest
- 15% benefit over a parallel algorithm for p processors with off-line schedule
Figure: prefix sum of $8 \cdot 10^6$ doubles on an 8-processor SMP (IA64 1.5 GHz / Linux), multi-user context with external load (9-p external processes). Axes: time (s) versus #processors. Curves: off-line parallel algorithm for p processors, oblivious.

Conclusion
The interplay of an on-line parallel algorithm directed by a work-stealing schedule is useful for the design of processor-oblivious algorithms.
Application to prefix computation:
- theoretically reaches the lower bound on heterogeneous processors with changing speeds
- practically, achieves near-optimal performance on multi-user SMPs
Generic adaptive scheme to implement parallel algorithms with provable performance
- work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint]

Thank you!
Interactive Distributed Simulation [B. Raffin & E. Boyer]
- 5 cameras
- 6 PCs
- 3D-reconstruction + simulation + rendering
-> Adaptive scheme to maximize 3D-reconstruction precision within fixed timestamp

The prefix race: sequential/parallel, fixed/adaptive
Figure: competitors are Adaptive 8 proc., Parallel 8 proc., Parallel 7 proc., Parallel 6 proc., Parallel 5 proc., Parallel 4 proc., Parallel 3 proc., Parallel 2 proc., Sequential.
On each of the 10 executions, adaptive completes first.

Adaptive prefix: some experiments
Single-user context: Adaptive is equivalent to:
- sequential on 1 proc
- optimal parallel-2 proc. on 2 processors
- ...
- optimal parallel-8 proc. on 8 processors
Multi-user context (external load): Adaptive is the fastest, with a 15% benefit over a static-grain algorithm
Figures: prefix of 10000 elements on an 8-processor SMP (IA64 / Linux). Axes: time (s) versus #processors. Curves: parallel, adaptive.

With * = sum of doubles ( r[i] = r[i-1] + x[i] )
Panels: single user; processors with variable speeds
Remark for n = 4,096,000 doubles:
- “pure” sequential: 0.20 s
- minimal “grain” = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to lower bound)
Finest “grain” limited to 1 page = 16384 bytes = 2048 doubles
