SIAM Parallel Processing’2006 - Feb 22. Mini Symposium: Adaptive Algorithms for Scientific Computing

9h45  Adaptive algorithms - Theory and applications. Jean-Louis Roch & al., AHA Team, INRIA-CNRS, Grenoble, France
10h15 Hybrids in exact linear algebra. Dave Saunders, U. Delaware, USA
10h45 Adaptive programming with hierarchical multiprocessor tasks. Thomas Rauber, Gudula Rünger, U. Bayreuth, Germany
11h15 Cache-oblivious algorithms. Michael Bender, Stony Brook U., USA

Adaptive, hybrid, oblivious: what do those terms mean?
Taxonomy of autonomic computing [Ganek & Corbi 2003]:
– Self-configuring / self-healing / self-optimising / self-protecting
Objective: towards an analysis based on the algorithm performance

Adaptive algorithms: Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
IMAG-INRIA Workgroup on “Adaptive and Hybrid Algorithms”, Grenoble, France

Contents
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

Why adaptive algorithms and how?

Why:
– Input data vary
– Resource availability is versatile
Adaptation to improve performance, based on measures on resources and measures on data.

How:
– Scheduling: partitioning, load-balancing, work-stealing
– Calibration: tuning parameters (block size / cache, choice of instructions, …), priority managing
– Choices in the algorithm: sequential / parallel(s), approximated / exact, in memory / out of core, …

An algorithm is « hybrid » iff there is a choice at a high level between at least two algorithms, each of which could solve the same problem.

Modeling a hybrid algorithm

Several algorithms to solve the same problem f:
– E.g.: algo_f1, algo_f2 (block size), …, algo_fk
– each algo_fk being recursive
Adaptation: choose algo_fj for each call to f.

   algo_fi ( n, … ) {
      ….
      f ( n - 1, … ) ;
      ….
      f ( n / 2, … ) ;
      …
   }

E.g. “practical” hybrids:
– Atlas, Goto, FFPack
– FFTW
– cache-oblivious B-tree
– any parallel program with scheduling support: Cilk, Athapascan/Kaapi, Nesl, TLib, …
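The recursive scheme above can be sketched in Python. This is only an illustration under assumptions that are not in the slides: the problem f is taken to be sorting, and `algo_f1` (insertion sort), `algo_f2` (merge sort) and the threshold `BLOCK` are hypothetical names and values. The point it shows is that every recursive call goes back through `f`, which is where the choice is made.

```python
BLOCK = 16  # hypothetical tuning parameter (e.g. chosen with respect to cache size)


def algo_f1(a):
    """Insertion sort: a reasonable choice on small inputs."""
    a = list(a)
    for i in range(1, len(a)):
        x, j = a[i], i - 1
        while j >= 0 and a[j] > x:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = x
    return a


def algo_f2(a):
    """Merge sort whose recursive calls go back through f (the choice point)."""
    mid = len(a) // 2
    left, right = f(a[:mid]), f(a[mid:])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out + left[i:] + right[j:]


def f(a):
    """Hybrid: at each call, choose between two algorithms for the same problem."""
    return algo_f1(a) if len(a) <= BLOCK else algo_f2(a)
```

Here the choice depends only on the input size and on one static parameter; a dynamic choice would replace the test in `f` by something observed at run-time.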

How to manage overhead due to choices?

Classification 1/2:
– Simple hybrid iff O(1) choices [e.g. block size in Atlas, …]
– Baroque hybrid iff an unbounded number of choices [e.g. recursive splitting factors in FFTW]
Choices are either dynamic or pre-computed based on input properties.

Choices may or may not be based on architecture parameters.

Classification 2/2: a hybrid is
– Oblivious: control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]]
– Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity, …]
   Engineered tuned or self-tuned [e.g. ATLAS and GOTO libraries, FFTW, …] [e.g. LinBox/FFLAS [Saunders & al.]]
– Adaptive: dynamic self-configuration of the algorithm, based on input properties or resource circumstances discovered at run-time [e.g. idle processors, data properties, …] [e.g. TLib [Rauber & Rünger]]

Examples

BLAS libraries
– Atlas: simple tuned (self-tuned)
– Goto: simple engineered (engineered tuned)
– LinBox / FFLAS: simple self-tuned, adaptive [Saunders & al.]
FFTW
– Halving factor: baroque tuned
– Stopping criterion: simple tuned
Parallel algorithm and scheduling:
– Choice of parallel degree: e.g. TLib [Rauber & Rünger]
– Work-stealing schedule: baroque hybrid

Work-stealing (1/2)

« Work » $W_1$ = total number of operations performed
« Depth » $W_\infty$ = number of operations on a critical path (parallel time on $\infty$ resources)

Work-stealing = “greedy” schedule, but distributed and randomized:
– Each processor manages locally the tasks it creates
– When idle, a processor steals the oldest ready task from a remote, non-idle victim processor (randomly chosen)

Work-stealing (2/2)

« Work » $W_1$ = total number of operations performed
« Depth » $W_\infty$ = number of operations on a critical path (parallel time on $\infty$ resources)

Interests:
-> suited to heterogeneous architectures with slight modification [Bender-Rabin02]
-> with good probability, near-optimal schedule on p processors with average speed $\Pi_{ave}$:
   $T_p < \frac{W_1}{p \, \Pi_{ave}} + O\left(\frac{W_\infty}{\Pi_{ave}}\right)$
   NB: #successful steals = #task migrations < $p \, W_\infty$ [Blumofe 98, Narlikar 01, Bender 02]

Implementation: work-first principle [Cilk, Kaapi]
– Local parallelism is implemented by sequential function call
– Restrictions to ensure validity of the default sequential schedule:
   - series-parallel / Cilk
   - reference order / Kaapi
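The stealing rule can be illustrated with a toy, round-based simulation in Python. It is a sketch under stated assumptions, not how Cilk or Kaapi are implemented: the function name `work_stealing_sum`, the choice of a divide-and-conquer sum as workload, and the "one action per processor per round" timing model are all hypothetical. Each processor pops its newest task (depth-first, as in the sequential order), while an idle processor steals the oldest task of a randomly chosen non-idle victim.

```python
import random
from collections import deque


def work_stealing_sum(values, p=4, seed=0):
    """Simulate p processors summing `values` by divide and conquer.

    Returns (total, rounds, successful_steals).
    """
    rng = random.Random(seed)
    deques = [deque() for _ in range(p)]
    deques[0].append((0, len(values)))  # the root task starts on processor 0
    total = rounds = steals = 0
    while any(deques):
        rounds += 1
        idle = [not d for d in deques]  # who is idle at the start of the round
        for me in range(p):
            if idle[me]:
                victims = [v for v in range(p) if v != me and deques[v]]
                if victims:
                    victim = rng.choice(victims)
                    # steal the OLDEST ready task of the victim
                    deques[me].append(deques[victim].popleft())
                    steals += 1
                continue
            if not deques[me]:
                continue  # its only task was stolen earlier in this round
            lo, hi = deques[me].pop()  # NEWEST local task: depth-first order
            if hi - lo <= 1:
                total += sum(values[lo:hi])
            else:
                mid = (lo + hi) // 2
                deques[me].append((mid, hi))
                deques[me].append((lo, mid))
    return total, rounds, steals
```

With `p=1` no steal ever happens and the execution is exactly the sequential depth-first traversal, which is the work-first idea: parallelism only costs something when a processor is actually idle.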

Work-stealing and adaptability

Work-stealing allocates processors to tasks transparently to the application, with provable performance:
– Supports the addition of new resources
– Supports resilience of resources and fault-tolerance (crash faults, network, …)
   Checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]

“Baroque hybrid” adaptation: there is an implicit dynamic choice between two algorithms:
– a sequential (local) algorithm: depth-first (default choice)
– a parallel algorithm: breadth-first
The choice is performed at runtime, depending on resource idleness.

Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]:
– Parallel Divide&Conquer computations
– Tree searching, Branch&X, …
-> suited when both sequential and parallel algorithms perform (almost) the same number of operations

Solution: mix a sequential and a parallel algorithm

Basic technique:
– Parallel algorithm until a certain « grain »; then use the sequential one
– Problem: $W_\infty$ increases, and so do the number of migrations … and the inefficiency ;o(

Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja92]:
Careful interplay of both algorithms to build one with both $W_\infty$ small and $W_1 = O(W_{seq})$
– Divide the sequential algorithm into blocks
– Each block is computed with the (non-optimal) parallel algorithm
– Drawback: sequential at coarse grain and parallel at fine grain ;o(

Adaptive granularity: the dual approach
– Parallelism is extracted at run-time from any sequential task

But often parallelism has a cost!

Self-adaptive grain algorithm

Based on the work-first principle: always execute a sequential algorithm to reduce parallelism overhead
=> use the parallel algorithm only if a processor becomes idle, by extracting parallelism from a sequential computation

Hypothesis: two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm

– Examples:
   - iterated product [Vernizzi 05]
   - gzip / compression [Kerfali 04]
   - MPEG-4 / H264 [Bernard 06]
   - prefix computation [Traore 06]

[Figure: a SeqCompute task; Extract_par splits off a LastPartComputation task while SeqCompute continues]
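A minimal sketch of this scheme in Python, as a round-based simulation. The names `Work`, `extract_par` and `adaptive_map_sum`, the "steal the last half" policy and the timing model (one action per worker per round) are assumptions made for illustration; only the roles of SeqCompute and of the extraction of the last part come from the slide.

```python
class Work:
    """Remaining part [begin, end) of a sequential loop."""

    def __init__(self, begin, end):
        self.begin, self.end = begin, end

    def remaining(self):
        return self.end - self.begin

    def extract_par(self):
        """Split off the last half of what remains (done on behalf of a thief)."""
        mid = self.begin + (self.remaining() + 1) // 2
        stolen = Work(mid, self.end)
        self.end = mid
        return stolen


def adaptive_map_sum(f, n, p=3):
    """Sum f(0) + ... + f(n-1) with p simulated workers.

    Returns (total, number_of_extractions).
    """
    works = [Work(0, n)] + [Work(0, 0) for _ in range(p - 1)]
    total = extractions = 0
    while any(w.remaining() for w in works):
        for w in works:
            if w.remaining() == 0:
                # Idle worker: extract parallelism from the largest remaining part.
                victim = max(works, key=Work.remaining)
                if victim.remaining() >= 2:
                    stolen = victim.extract_par()
                    w.begin, w.end = stolen.begin, stolen.end
                    extractions += 1
                continue  # stealing uses up this worker's round
            total += f(w.begin)  # SeqCompute: one sequential step
            w.begin += 1
    return total, extractions
```

With `p=1` the loop is the plain sequential algorithm and `extract_par` is never called; the grain is never fixed in advance, it results from the steals that actually occur.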

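Prefix computation: an example where parallelism always costs

Problem: given $a_0, a_1, \dots, a_n$, compute
   $\pi_1 = a_0 * a_1$, $\pi_2 = a_0 * a_1 * a_2$, …, $\pi_n = a_0 * a_1 * \dots * a_n$

Sequential algorithm:
   π[0] = a[0];
   for (i = 1; i <= n; i++) π[i] = π[i - 1] * a[i];
   $W_1 = W_\infty = n$

Parallel algorithm [Ladner-Fischer]:
– multiply the neighbouring pairs of $a_0, a_1, a_2, a_3, \dots, a_{n-1}, a_n$ in parallel
– compute recursively a prefix of size n/2 on these products, which gives $\pi_1, \pi_3, \dots, \pi_n$
– multiply in parallel to obtain the remaining $\pi_2, \pi_4, \dots, \pi_{n-1}$
   $W_\infty = 2 \log n$ but $W_1 = 2n$: twice as expensive as the sequential algorithm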
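The cost gap between the two algorithms can be checked by counting operations. The sketch below is a plain Python rendering of the sequential loop and of the recursive "pairs, prefix of size n/2, fill in" scheme; the function names `prefix_seq` and `prefix_par` are hypothetical, the recursion is executed sequentially, and the depth it reports is only an upper bound obtained by counting two parallel steps per level. The operation is assumed associative, not necessarily commutative.

```python
import operator


def prefix_seq(a, op=operator.mul):
    """Sequential prefix of a non-empty list; returns (prefixes, work)."""
    out, work = [a[0]], 0
    for x in a[1:]:
        out.append(op(out[-1], x))
        work += 1
    return out, work


def prefix_par(a, op=operator.mul):
    """Recursive prefix of a non-empty list; returns (prefixes, work, depth bound)."""
    n = len(a)
    if n == 1:
        return [a[0]], 0, 0
    # 1) combine neighbouring pairs (independent operations: one parallel step)
    pairs = [op(a[i], a[i + 1]) for i in range(0, n - 1, 2)]
    # 2) prefix of about half the size on the pair products
    sub, work, depth = prefix_par(pairs, op)
    # 3) fill in the missing positions (independent operations: one parallel step)
    out = [a[0]]
    for i in range(1, n):
        if i % 2 == 1:
            out.append(sub[i // 2])                # already computed in step 2
        else:
            out.append(op(sub[i // 2 - 1], a[i]))  # one extra operation
    work += len(pairs) + (n - 1) // 2
    return out, work, depth + 2
```

For 16 elements the sequential version performs 15 operations, while the recursive one performs 26 for a depth bound of 8: roughly twice the work, in exchange for logarithmic depth.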

Adaptive prefix computation

– Any (parallel) prefix performs $W_1 \ge 2n - W_\infty$ operations
– Strict lower bound on p identical processors: $T_p \ge \frac{2n}{p+1}$
   block algorithm + pipeline [Nicolau & al. 2000]

Application of the adaptive scheme:
– One process performs the main “sequential” computation
– The other work-stealer processes compute parallel « segmented » prefixes
– Near-optimal performance on processors with changing speeds:
   $T_p < \frac{2n}{(p+1)\,\Pi_{ave}} + O\left(\frac{\log n}{\Pi_{ave}}\right)$
   where the first term is the lower bound
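An informal way to see where the $2n/(p+1)$ term comes from (a heuristic reading under simplifying assumptions, not the proof from the cited work): suppose one processor runs the sequential algorithm at one operation per output, while each of the other $p-1$ processors needs about two operations per output (a local prefix, then a final correction). In $T$ time units at unit speed they produce at most $T + (p-1)T/2$ outputs, so producing $n$ outputs requires

```latex
T + (p-1)\,\frac{T}{2} \;\ge\; n
\quad\Longrightarrow\quad
T \;\ge\; \frac{2n}{p+1}.
```

For $p = 1$ this reduces to $T \ge n$, the sequential cost.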


Adaptive Prefix on 3 processors

[Figure, shown in successive steps: prefix of $a_1, \dots, a_{12}$ with one main sequential process (Main Seq.) and two work-stealers (Work-stealer 1, Work-stealer 2).
– Steal request: a work-stealer takes $a_5, \dots, a_{12}$ from Main Seq. and computes the local prefixes $a_5 * \dots * a_i$.
– Steal request: the other work-stealer takes $a_9, \dots, a_{12}$ and computes the local prefixes $a_9 * \dots * a_i$.
– Preempt: when Main Seq. reaches a stolen segment, it preempts the work-stealer; it obtains $\pi_8$, then $\pi_{11}$, and finishes with $a_{12}$.
– Implicit critical path on the sequential process.]
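The steps of the figure can be replayed with a small round-based simulation in Python. This is a simplified reading of the scheme, with several assumptions of its own: the name `adaptive_prefix`, the "steal the last half of the largest remaining part" policy, the shared queue of finalization jobs and the timing model (one action per process per round) are hypothetical and not taken from the original implementation. The input is assumed non-empty and the operation associative.

```python
import operator
from collections import deque


def adaptive_prefix(a, p=3, op=operator.mul):
    """Simulate 1 main sequential process and p-1 work-stealers.

    Returns (prefixes, operations, steals).
    """
    n = len(a)
    out, local = [None] * n, [None] * n
    out[0] = a[0]
    main = {"start": 0, "pos": 1, "end": n}  # part owned by the main process
    by_start = {}                 # stolen segments, indexed by their first element
    current = [None] * (p - 1)    # segment each work-stealer is computing
    finalize = deque()            # jobs [lo, hi, base]: out[j] = base * local[j]
    ops = steals = 0
    while main["pos"] < n or finalize:
        # Main "sequential" process.
        if main["pos"] < main["end"]:
            i = main["pos"]
            out[i] = op(out[i - 1], a[i])
            ops += 1
            main["pos"] += 1
        elif main["end"] < n:     # reached a stolen segment: preempt it
            seg = by_start.pop(main["end"])
            current = [None if c is seg else c for c in current]
            s, k = seg["start"], seg["pos"]
            if k > s:             # jump over the local prefixes already computed
                out[k - 1] = op(out[s - 1], local[k - 1])
                ops += 1
                if k - 1 > s:
                    finalize.append([s, k - 1, out[s - 1]])
            main = {"start": s, "pos": k, "end": seg["end"]}
        # Work-stealers.
        for t in range(p - 1):
            seg = current[t]
            if finalize:          # turn one local prefix into a final one
                job = finalize[0]
                out[job[0]] = op(job[2], local[job[0]])
                ops += 1
                job[0] += 1
                if job[0] == job[1]:
                    finalize.popleft()
            elif seg is not None:  # segmented prefix on the stolen part
                i = seg["pos"]
                if i == seg["start"]:
                    local[i] = a[i]
                else:
                    local[i] = op(local[i - 1], a[i])
                    ops += 1
                seg["pos"] += 1
                if seg["pos"] == seg["end"]:
                    current[t] = None
            else:                 # idle: steal the last half of the largest part
                candidates = [main] + [c for c in current if c]
                victim = max(candidates, key=lambda v: v["end"] - v["pos"])
                if victim["end"] - victim["pos"] >= 2:
                    mid = (victim["pos"] + victim["end"] + 1) // 2
                    new = {"start": mid, "pos": mid, "end": victim["end"]}
                    victim["end"] = mid
                    by_start[mid] = new
                    current[t] = new
                    steals += 1
    return out, ops, steals
```

With `p=1` the simulation degenerates to the sequential loop (n - 1 operations, no steal). With more processes, the main process never waits: when it reaches a stolen segment it spends one operation to jump over what the work-stealer has already computed, and the leftover corrections are done by the work-stealers.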

Adaptive prefix: some experiments

Prefix of 10000 elements on an SMP with 8 processors (IA64 / Linux)

Single-user context: adaptive is equivalent to:
- sequential on 1 processor
- optimal parallel-2 proc. on 2 processors
- …
- optimal parallel-8 proc. on 8 processors

Multi-user context: adaptive is the fastest, with a 15% benefit over a static grain algorithm

[Figures: time (s) versus #processors for the parallel and adaptive versions; one plot is labelled “External charge” (external load)]

Joint work with Daouda Traore

The Prefix race: sequential / parallel fixed / adaptive

[Figure: race between Adaptive 8 proc., Parallel 8 proc., Parallel 7 proc., Parallel 6 proc., Parallel 5 proc., Parallel 4 proc., Parallel 3 proc., Parallel 2 proc. and Sequential]

On each of the 10 executions, adaptive completes first.

Conclusion

Adaptive: what choices and how to choose?

Illustration: adaptive parallel prefix based on work-stealing
- self-tuned baroque hybrid: O(p log n) choices
- achieves near-optimal performance, processor oblivious

Generic adaptive scheme to implement parallel algorithms with provable performance

Questions ?
