Download presentation

Presentation is loading. Please wait.

Published byDerek Parks Modified about 1 year ago

1
SIAM Parallel Processing’ Feb 22 Mini Symposium Adaptive Algorithms for Scientific computing 9h45 Adaptive algorithms - Theory and applications Jean-Louis Roch &al. AHA Team INRIA-CNRS Grenoble, France 10h15Hybrids in exact linear algebra Dave Saunders U. Delaware, USA 10h45Adaptive programming with hierarchical multiprocessor tasks Thomas Rauber, Gudula Runger, U. Bayreuth, Germany 11h15 Cache-Oblivious algorithms Michael Bender, Stony Brook U., USA Adaptive, hybrids, oblivious : what do those terms mean ? Taxonomy of autonomic computing [Ganek & Corbi 2003] : – Self-configuring / self-healing / self-optimising / self-protecting Objective: towards an analysis based on the algorithm performance

2
Adaptive algorithms Theory and applications Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram IMAG-INRIA Workgroup on “Adaptive and Hybrid Algorithms” Grenoble, France Contents I. Some criteria to analyze adaptive algorithms II. Work-stealing and adaptive parallel algorithms III.Adaptive parallel prefix computation

3
Why adaptive algorithms and how? Input data vary Resources availability is versatile Adaptation to improve performances Scheduling partitioning load-balancing work-stealing Measures on resources Measures on data Calibration tuning parameters block size/ cache choice of instructions, … priority managing Choices in the algorithm sequential / parallel(s) approximated / exact in memory / out of core … An algorithm is « hybrid » iff there is a choice at a high level between at least two algorithms, each of them could solve the same problem

4
Modeling an hybrid algorithm Several algorithms to solve a same problem f : –Eg : algo_f 1, algo_f 2 (block size), … algo_f k : –each algo_f k being recursive Adaptation to choose algo_f j for each call to f algo_f i ( n, … ) { …. f ( n - 1, … ) ; …. f ( n / 2, … ) ; … }. E.g. “practical” hybrids: Atlas, Goto, FFPack FFTW cache-oblivious B-tree any parallel program with scheduling support: Cilk, Athapascan/Kaapi, Nesl,TLib…

5
How to manage overhead due to choices ? Classification 1/2 : –Simple hybrid iff O(1) choices [eg block size in Atlas, …] –Baroque hybrid iff an unbounded number of choices [eg recursive splitting factors in FFTW] choices are either dynamic or pre-computed based on input properties.

6
Choices may or may not be based on architecture parameters. Classification 2/2. : an hybrid is –Oblivious: control flow does not depend neither on static properties of the resources nor on the input [eg cache-oblivious algorithm [ Bender ] –Tuned : strategic choices are based on static parameters [eg block size w.r.t cache, granularity, ] Engineered tuned orself tuned [eg ATLAS and GOTO libraries, FFTW, …] [eg [LinBox/FFLAS] [ Saunders&al] –Adaptive : self-configuration of the algorithm, dynamlc Based on input properties or resource circumstances discovered at run-time [eg : idle processors, data properties, …] [eg TLib Rauber&Rünger]

7
Examples BLAS libraries –Atlas: simple tuned (self-tuned) –Goto : simple engineered (engineered tuned) –LinBox / FFLAS : simple self-tuned,adaptive [Saunders&al] FFTW –Halving factor : baroque tuned –Stopping criterion : simple tuned Parallel algorithm and scheduling : –Choice of parallel degree : eg Tlib [Rauber&Rünger] –Work-stealing schedile : baroque hybrid

8
Adaptive algorithms Theory and applications Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier,Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram INRIA-CNRS Project on“Adaptive and Hybrid Algorithms” Grenoble, France Contents I. Some criteria to analyze for adaptive algorithms II. Work-stealing and adaptive parallel algorithms III.Adaptive parallel prefix computation

9
Work-stealing (1/2) «Depth » W = #ops on a critical path (parallel time on resources) Workstealing = “ greedy ” schedule but distributed and randomized Each processor manages locally the tasks it creates When idle, a processor steals the oldest ready task on a remote -non idle- victim processor (randomly chosen) « Work » W 1 = #total operations performed

10
Work-stealing (2/2) «Depth » W = #ops on a critical path (parallel time on resources) « Work » W 1 = #total operations performed Interests : -> suited to heterogeneous architectures with slight modification [Bender-Rabin02] -> with good probability, near-optimal schedule on p processors with average speeds ave T p < W 1 /(p ave ) + O ( W / ave ) NB : #succeeded steals = #task migrations < p W [Blumofe 98, Narlikar 01, Bender 02] Implementation: work-first principle [Cilk, Kaapi] Local parallelism is implemented by sequential function call Restrictions to ensure validity of the default sequential schedule - serie-parallel/Cilk - reference order/Kaapi

11
Work-stealing and adaptability Work-stealing ensures allocation of processors to tasks transparently to the application with provable performances Support to addition of new resources Support to resilience of resources and fault-tolerance (crash faults, network, … ) Checkpoint/restart mechanisms with provable performances [Porch, Kaapi, … ] “ Baroque hybrid ” adaptation: there is an -implicit- dynamic choice between two algorithms a sequential (local) algorithm : depth-first (default choice) A parallel algorithm : breadth-first Choice is performed at runtime, depending on resource idleness Well suited to applications where a fine grain parallel algorithm is also a good sequential algorithm [Cilk]: Parallel Divide&Conquer computations Tree searching, Branch&X … -> suited when both sequential and parallel algorithms perform (almost) the same number of operations

12
Solution: to mix both a sequential and a parallel algorithm Basic technique : Parallel algorithm until a certain « grain »; then use the sequential one Problem : W increases also, the number of migration … and the inefficiency ;o( Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja92] Careful interplay of both algorithms to build one with both W small and W 1 = O( W seq ) Divide the sequential algorithm into block Each block is computed with the (non-optimal) parallel algorithm Drawback : sequential at coarse grain and parallel at fine grain ;o( Adaptive granularity : dual approach : Parallelism is extracted at run-time from any sequential task But often parallelism has a cost !

13
Self-adaptive grain algorithm Based on the Work-first principle : Executes always a sequential algorithm to reduce parallelism overhead => use parallel algorithm only if a processor becomes idle by extracting parallelism from a sequential computation Hypothesis : two algorithms : - 1 sequential : SeqCompute - 1 parallel : LastPartComputation : at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm –Examples : - iterated product [Vernizzi 05]- gzip / compression [Kerfali 04] - MPEG-4 / H264 [Bernard 06]- prefix computation [Traore 06] SeqCompute Extract_par LastPartComputation SeqCompute

14
Adaptive algorithms Theory and applications Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier,Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram INRIA-CNRS Project on“Adaptive and Hybrid Algorithms” Grenoble, France Contents I. Some criteria to analyze for adaptive algorithms II. Work-stealing and adaptive parallel algorithms III.Adaptive parallel prefix computation

15
Sequential algorithm : for (i= 0 ; i <= n; i++ ) [ i ] = [ i – 1 ] * a [ i ] ; Parallel algorithm [Ladner-Fischer] : Prefix computation : an example where parallelism always costs 1 = a 0 * a 1 2 =a 0 * a 1 * a 2 … n =a 0 * a 1 * … * a n W =2. log n but W 1 = 2.n Twice more expensive than the sequential … a 0 a 1 a 2 a 3 a 4 … a n-1 a n **** Prefix of size n/2 1 3 … n 2 4 … n-1 *** W 1 = W = n

16
Adaptive prefix computation – Any (parallel) prefix performs at least W 1 2.n - W ops – Strict-lower bound on p identical processors: T p 2n/(p+1) block algorithm + pipeline [Nicolau&al. 2000] Application of adaptive scheme : – One process performs the main “sequential” computation – Other work-stealer processes computes parallel « segmented » prefix –Near-optimal performance on processors with changing speeds : T p < 2n/((p+1). ave ) + O ( log n / ave ) lower bound

17
0 a 1 a 2 a 3 a 4 a 5 a 6 a 7 a 8 a 9 a 10 a 11 a 12 Work- stealer 1 Main Seq. Work- stealer 2 Adaptive Prefix on 3 processors 11 Steal request

18
Adaptive Prefix on 3 processors 0 a 1 a 2 a 3 a 4 Work- stealer 1 Main Seq. 11 Work- stealer 2 a 5 a 6 a 7 a 8 a 9 a 10 a 11 a 12 77 33 Steal request 22 66 i =a 5 *…*a i

19
Adaptive Prefix on 3 processors 0 a 1 a 2 a 3 a 4 Work- stealer 1 Main Seq. 11 Work- stealer 2 a 5 a 6 a 7 a 8 77 33 44 22 66 i =a 5 *…*a i a 9 a 10 a 11 a 12 88 44 Preempt 10 i =a 9 *…*a i 88 88

20
Adaptive Prefix on 3 processors 0 a 1 a 2 a 3 a 4 8 Work- stealer 1 Main Seq. 11 Work- stealer 2 a 5 a 6 a 7 a 8 77 33 44 22 66 i =a 5 *…*a i a 9 a 10 a 11 a 12 88 55 10 i =a 9 *…*a i 99 66 11 88 Preempt 11 88

21
Adaptive Prefix on 3 processors 0 a 1 a 2 a 3 a 4 8 11 a 12 Work- stealer 1 Main Seq. 11 Work- stealer 2 a 5 a 6 a 7 a 8 77 33 44 22 66 i =a 5 *…*a i a 9 a 10 a 11 a 12 88 55 10 i =a 9 *…*a i 99 66 11 12 10 77 11 88

22
Adaptive Prefix on 3 processors 0 a 1 a 2 a 3 a 4 8 11 a 12 Work- stealer 1 Main Seq. 11 Work- stealer 2 a 5 a 6 a 7 a 8 77 33 44 22 66 i =a 5 *…*a i a 9 a 10 a 11 a 12 88 55 10 i =a 9 *…*a i 99 66 11 12 10 77 11 88 Implicit critical path on the sequential process

23
Adaptive prefix : some experiments Single user context Adaptive is equivalent to : - sequential on 1 proc - optimal parallel-2 proc. on 2 processors - … - optimal parallel-8 proc. on 8 processors Multi-user context Adaptive is the fastest 15% benefit over a static grain algorithm Multi-user context Adaptive is the fastest 15% benefit over a static grain algorithm External charge Parallel Adaptive Parallel Adaptive Prefix of elements on a SMP 8 procs (IA64 / linux) #processors Time (s) #processors Join work with Daouda Traore

24
The Prefix race: sequential/parallel fixed/ adaptive Adaptative 8 proc. Parallel 8 proc. Parallel 7 proc. Parallel 6 proc. Parallel 5 proc. Parallel 4 proc. Parallel 3 proc. Parallel 2 proc. Sequential On each of the 10 executions, adaptive completes first

25
Conclusion Adaptive : what choices and how to choose ? Illustration : Adaptive parallel prefix based on work-stealing - self-tuned baroque hybrid : O(p log n ) choices - achieves near-optimal performance processor oblivious Generic adaptive scheme to implement parallel algorithms with provable performance

26
Mini Symposium Adaptive Algorithms for Scientific computing 9h45 Adaptive algorithms - Theory and applications Jean-Louis Roch &al. AHA Team INRIA-CNRS Grenoble, France 10h15Hybrids in exact linear algebra Dave Saunders, U. Delaware, USA 10h45 Adaptive programming with hierarchical multiprocessor tasks Thomas Rauber, U. Bayreuth, Germany 11h15 Cache-Obloivious algorithms Michael Bender, Stony Brook U., USA Adaptive, hybrids, oblivious : what do those terms mean ? Taxonomy of autonomic computing [Ganek & Corbi 2003] : – Self-configuring / self-healing / self-optimising / self-protecting Objective: towards an analysis based on the algorithm performance

27
Questions ?

Similar presentations

© 2016 SlidePlayer.com Inc.

All rights reserved.

Ads by Google