Mini Symposium: Adaptive Algorithms for Scientific Computing. 9h45 Adaptive algorithms - Theory and applications, Jean-Louis Roch et al., AHA Team, INRIA-CNRS.



3 Mini Symposium: Adaptive Algorithms for Scientific Computing
9h45 Adaptive algorithms - Theory and applications, Jean-Louis Roch et al., AHA Team, INRIA-CNRS, Grenoble, France
10h15 Hybrids in exact linear algebra, Dave Saunders, U. Delaware, USA
10h45 Adaptive programming with hierarchical multiprocessor tasks, Thomas Rauber, Gudula Rünger, U. Bayreuth, Germany
11h15 Cache-oblivious algorithms, Michael Bender, Stony Brook U., USA
Adaptive, hybrid, oblivious: what do these terms mean?
Taxonomy of autonomic computing [Ganek & Corbi 2003]: self-configuring / self-healing / self-optimising / self-protecting
Objective: towards an analysis based on algorithm performance

4 Adaptive algorithms: Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
IMAG-INRIA workgroup on "Adaptive and Hybrid Algorithms", Grenoble, France
Contents:
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

5 Why adaptive algorithms, and how?
Input data vary and resource availability is versatile; adaptation improves performance.
- Scheduling: partitioning, load-balancing, work-stealing
- Measures on resources and measures on data
- Calibration: tuning parameters (block size / cache), choice of instructions, priority managing
- Choices in the algorithm: sequential / parallel, approximated / exact, in-memory / out-of-core, ...
An algorithm is « hybrid » iff there is a high-level choice between at least two algorithms, each of which can solve the same problem.

6 Modeling a hybrid algorithm
Several algorithms solve the same problem f, e.g. algo_f_1, algo_f_2 (block size), ..., algo_f_k, each algo_f_i being recursive:
algo_f_i ( n, ... ) { ... f ( n - 1, ... ); ... f ( n / 2, ... ); ... }
Adaptation chooses algo_f_j for each call to f.
E.g. "practical" hybrids: Atlas, Goto, FFPACK, FFTW, cache-oblivious B-trees, and any parallel program with scheduling support: Cilk, Athapascan/Kaapi, NESL, TLib, ...
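A minimal runnable sketch of this model (the variant names and the size-threshold `choose()` policy are illustrative assumptions, not the AHA implementation): every recursive call goes back through the dispatcher `f`, so different variants can alternate along one recursion.

```python
# Hybrid-algorithm model: two recursive variants of the same problem f
# (here f(n) simply counts n, so correctness is easy to check), with a
# dispatcher that picks a variant at every recursive call.

def algo_f1(n):
    # fine-grain variant: peel off one element per call
    if n <= 1:
        return n
    return f(n - 1) + 1

def algo_f2(n):
    # divide-and-conquer variant: split the problem in halves
    if n <= 1:
        return n
    return f(n // 2) + f(n - n // 2)

def choose(n):
    # adaptation point: a hypothetical size threshold; real hybrids
    # calibrate this choice from measures on data and resources
    return algo_f2 if n > 8 else algo_f1

def f(n):
    return choose(n)(n)
```

Whatever sequence of choices the dispatcher makes, every variant solves the same problem, which is exactly the "hybrid" property defined on the previous slide.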

7 How to manage the overhead due to choices? Classification 1/2:
- Simple hybrid iff O(1) choices [e.g. block size in Atlas, ...]
- Baroque hybrid iff an unbounded number of choices [e.g. recursive splitting factors in FFTW]
Choices are either dynamic or pre-computed based on input properties.

8 Choices may or may not be based on architecture parameters. Classification 2/2: a hybrid is
- Oblivious: the control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]]
- Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity]; either engineered-tuned [e.g. ATLAS and GOTO libraries, FFTW] or self-tuned [e.g. LinBox/FFLAS [Saunders et al.]]
- Adaptive: self-configuration of the algorithm, dynamic, based on input properties or resource circumstances discovered at run-time [e.g. idle processors, data properties; e.g. TLib [Rauber & Rünger]]

9 Examples
BLAS libraries:
- Atlas: simple tuned (self-tuned)
- Goto: simple engineered (engineered-tuned)
- LinBox / FFLAS: simple self-tuned, adaptive [Saunders et al.]
FFTW:
- Halving factor: baroque tuned
- Stopping criterion: simple tuned
Parallel algorithms and scheduling:
- Choice of parallel degree: e.g. TLib [Rauber & Rünger]
- Work-stealing schedule: baroque hybrid

10 Adaptive algorithms: Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS project on "Adaptive and Hybrid Algorithms", Grenoble, France
Contents:
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

11 Work-stealing (1/2)
« Work » W1 = total number of operations performed
« Depth » W∞ = number of operations on a critical path (parallel time on ∞ resources)
Work-stealing = "greedy" schedule, but distributed and randomized:
- each processor manages locally the tasks it creates;
- when idle, a processor steals the oldest ready task from a remote (non-idle) victim processor, chosen at random.

12 Work-stealing (2/2)
« Work » W1 = total number of operations performed; « Depth » W∞ = number of operations on a critical path (parallel time on ∞ resources)
Interests:
- suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02];
- with good probability, near-optimal schedule on p processors with average speed Π_ave:
  T_p < W1 / (p · Π_ave) + O(W∞ / Π_ave)
NB: #successful steals = #task migrations < p · W∞ [Blumofe 98, Narlikar 01, Bender 02]
Implementation: work-first principle [Cilk, Kaapi]. Local parallelism is implemented by sequential function calls; restrictions ensure validity of the default sequential schedule: series-parallel (Cilk), reference order (Kaapi).
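Plugging numbers into the two bounds above makes them concrete. This is only an illustration: the constant hidden in the O(·) term is taken as 1, which is an assumption, not the value proved in the cited papers.

```python
# Numeric illustration of the work-stealing bounds from the slide:
#   T_p < W1/(p * pi_ave) + O(Winf / pi_ave)      (execution time)
#   #task migrations < p * Winf                   (steal count)

def ws_time_bound(W1, Winf, p, pi_ave, c=1.0):
    # upper bound on parallel time; c stands in for the O(.) constant
    return W1 / (p * pi_ave) + c * Winf / pi_ave

def max_migrations(p, Winf):
    # the number of successful steals (= task migrations) is < p * Winf
    return p * Winf

# e.g. 10^8 unit operations, critical path of 27, 8 unit-speed processors
bound = ws_time_bound(W1=1e8, Winf=27, p=8, pi_ave=1.0)
steals = max_migrations(p=8, Winf=27)
```

With W1 = 10^8 and W∞ = 27 the work term dominates completely: the bound is 12,500,027 time units against a sequential time of 10^8, and at most 216 migrations occur, which is why fine-grain recursive parallelism suits work-stealing.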

13 Work-stealing and adaptability
Work-stealing allocates processors to tasks transparently to the application, with provable performance:
- supports the addition of new resources;
- supports resilience of resources and fault-tolerance (crash faults, network, ...): checkpoint/restart mechanisms with provable performance [Porch, Kaapi, ...].
"Baroque hybrid" adaptation: there is an implicit dynamic choice between two algorithms:
- a sequential (local) algorithm: depth-first (the default choice);
- a parallel algorithm: breadth-first.
The choice is performed at runtime, depending on resource idleness.
Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]: parallel divide-&-conquer computations, tree searching, branch-&-X, ...
-> suited when both the sequential and the parallel algorithm perform (almost) the same number of operations.

14 But often parallelism has a cost!
Solution: mix a sequential and a parallel algorithm.
Basic technique: run the parallel algorithm down to a certain « grain », then use the sequential one. Problem: W∞ increases, and with it the number of migrations ... and the inefficiency.
Work-preserving speed-up [Bini-Pan 94] = cascading [JaJa 92]: a careful interplay of both algorithms builds one with both W∞ small and W1 = O(W_seq): divide the sequential algorithm into blocks, and compute each block with the (non-optimal) parallel algorithm. Drawback: sequential at coarse grain and parallel at fine grain.
Adaptive granularity, the dual approach: parallelism is extracted at run-time from any sequential task.

15 Self-adaptive grain algorithm
Based on the work-first principle: always execute a sequential algorithm, to reduce parallelism overhead => use the parallel algorithm only if a processor becomes idle, by extracting parallelism from the sequential computation.
Hypothesis, two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation — at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
Examples: iterated product [Vernizzi 05], gzip / compression [Kerfali 04], MPEG-4 / H264 [Bernard 06], prefix computation [Traore 06]
[figure: SeqCompute runs; Extract_par splits off a LastPartComputation; SeqCompute continues on the rest]
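The SeqCompute / LastPartComputation scheme can be sketched on a toy reduction. The "steal" below is simulated by a parameter rather than driven by a real scheduler, and the function names follow the slide; everything else is an assumption for illustration.

```python
# Toy sketch of the self-adaptive grain scheme: the owner runs a plain
# sequential algorithm; when a (simulated) steal occurs, the last part
# of the remaining work is extracted and computed separately.

def seq_compute(a, lo, hi):
    # SeqCompute: plain sequential reduction over a[lo:hi]
    s = 0
    for i in range(lo, hi):
        s += a[i]
    return s

def extract_par(lo, hi):
    # LastPartComputation: hand over the second half of what remains
    mid = (lo + hi) // 2
    return (lo, mid), (mid, hi)

def adaptive_sum(a, steal_at=None):
    lo, hi = 0, len(a)
    if steal_at is not None and lo < steal_at < hi:
        # a processor became idle after index steal_at: split the
        # remaining range [steal_at, hi) between owner and thief
        (lo1, hi1), (lo2, hi2) = extract_par(steal_at, hi)
        return (seq_compute(a, lo, steal_at)
                + seq_compute(a, lo1, hi1)   # owner continues here
                + seq_compute(a, lo2, hi2))  # "thief" part
    return seq_compute(a, lo, hi)
```

The key design point of the scheme survives even in this toy: when no steal happens, the execution is exactly the sequential algorithm, so the parallelism overhead is paid only on steals.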

16 Adaptive algorithms: Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS project on "Adaptive and Hybrid Algorithms", Grenoble, France
Contents:
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

17 Prefix computation: an example where parallelism always costs
Compute π_1 = a_0 * a_1, π_2 = a_0 * a_1 * a_2, ..., π_n = a_0 * a_1 * ... * a_n.
Sequential algorithm: for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i]; — W1 = W∞ = n
Parallel algorithm [Ladner-Fischer]: multiply adjacent pairs, recurse on the prefix of size n/2, then fix up the remaining positions.
W∞ = 2·log n, but W1 = 2·n: twice as expensive as the sequential algorithm.
[figure: a_0 a_1 a_2 ... a_n combined pairwise by *, a prefix of size n/2 on the pair products, then fix-up products yielding π_1 ... π_n]
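Both algorithms, with an explicit multiplication counter, make the W1 = n versus W1 ≈ 2n trade-off visible. This is a sketch of a Ladner-Fischer-style scheme written sequentially (the recursion marks what could run in parallel); the exact count it produces is 2n − log n − 2 for powers of two.

```python
# Sequential prefix (n-1 products for indices 1..n-1) vs. a
# Ladner-Fischer-style recursion, counting '*' operations.

def prefix_seq(a):
    out = a[:]
    ops = 0
    for i in range(1, len(out)):
        out[i] = out[i - 1] * out[i]
        ops += 1
    return out, ops

def prefix_lf(a):
    # pair adjacent elements (n/2 ops), recurse on the pair products,
    # then fix up even positions (~n/2 ops): W1(n) ~ 2n, W_inf ~ 2 log n
    n = len(a)
    if n == 1:
        return a[:], 0
    pairs = [a[2 * i] * a[2 * i + 1] for i in range(n // 2)]
    ops = n // 2
    evens, sub_ops = prefix_lf(pairs)  # prefixes landing at odd indices
    ops += sub_ops
    out = [None] * n
    out[0] = a[0]
    for i in range(n // 2):
        out[2 * i + 1] = evens[i]
        if 2 * i + 2 < n:
            out[2 * i + 2] = evens[i] * a[2 * i + 2]
            ops += 1
    return out, ops
```

On 8 inputs the sequential version uses 7 multiplications where the parallel one uses 11, and the gap approaches the factor of 2 stated on the slide as n grows.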

18 Adaptive prefix computation
- Any (parallel) prefix performs at least W1 ≥ 2n − W∞ operations.
- Strict lower bound on p identical processors: T_p ≥ 2n/(p+1), reached by a block algorithm + pipeline [Nicolau et al. 2000].
Application of the adaptive scheme:
- one process performs the main "sequential" computation;
- the other, work-stealer, processes compute parallel « segmented » prefixes.
Near-optimal performance on processors with changing speeds:
  T_p < 2n/((p+1) · Π_ave) + O(log n / Π_ave), close to the lower bound.
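The lower bound follows in two lines from the slide's operation-count inequality. This is a reconstruction sketch, assuming p identical unit-speed processors and using that the critical path satisfies W∞ ≤ T_p:

```latex
% Any parallel prefix performs $W_1 \ge 2n - W_\infty$ operations, and
% $p$ processors execute at most $p\,T_p$ operations in time $T_p$, so
\[
  p\,T_p \;\ge\; W_1 \;\ge\; 2n - W_\infty \;\ge\; 2n - T_p
  \quad\Longrightarrow\quad
  (p+1)\,T_p \;\ge\; 2n
  \quad\Longrightarrow\quad
  T_p \;\ge\; \frac{2n}{p+1}.
\]
```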

19 Scheme of the proof
Dynamic coupling of two algorithms that complete simultaneously:
- Sequential: performs the (optimal) number of operations S.
- Parallel: performs X operations; dynamic splitting is always possible down to the finest grain, but execution is locally sequential. Scheduled by work-stealing on p−1 processors: the critical path is small (log X), and every non-constant-time task can be split (variable speeds).
Analysis: the algorithmic scheme ensures T_s = T_p + O(log X) => this bounds the total number X of operations performed, and hence the overhead of parallelism = (S + X) − #ops_optimal. Compare against the lower bound on the number of operations.

20 Adaptive Prefix on 3 processors [figure, step 1: the main sequential process starts computing π_0, π_1, ... over a_0 .. a_12; work-stealer 1 issues a steal request]

21 Adaptive Prefix on 3 processors [figure, step 2: the main process continues on a_0 .. a_4; work-stealer 1 computes partial products π_i = a_5 * ... * a_i on the stolen part a_5 .. a_12; work-stealer 2 issues a steal request]

22 Adaptive Prefix on 3 processors [figure, step 3: work-stealer 2 computes partial products π_i = a_9 * ... * a_i on a_9 .. a_12; the main process preempts work-stealer 1 to merge its partial results into π_8]

23 Adaptive Prefix on 3 processors [figure, step 4: merging continues; the main process preempts work-stealer 2 after producing π_11]

24 Adaptive Prefix on 3 processors [figure, step 5: the remaining prefixes up to π_12 are completed from the partial products]

25 Adaptive Prefix on 3 processors [figure, step 6: final state — note the implicit critical path on the sequential process]

26 Adaptive prefix: some experiments
Prefix of 10,000 elements on an 8-processor SMP (IA64 / Linux); joint work with Daouda Traore.
Single-user context: adaptive is equivalent to the sequential algorithm on 1 processor, to the optimal 2-processor parallel algorithm on 2 processors, ..., to the optimal 8-processor parallel algorithm on 8 processors.
Multi-user context (external load): adaptive is the fastest, with a 15% benefit over a static-grain algorithm.
[figure: time (s) vs. #processors, curves for the parallel and adaptive versions]

27 The prefix race: sequential / fixed parallel / adaptive
Competitors: adaptive on 8 procs; parallel on 8, 7, 6, 5, 4, 3 and 2 procs; sequential.
On each of the 10 executions, the adaptive version completes first.

28 With * = double sum (r[i] = r[i-1] + x[i])
Single user; processors with variable speeds.
Remark, for n = 4,096,000 doubles:
- "pure" sequential: 0.20 s
- minimal "grain" = 100 doubles: 0.26 s on 1 proc and 0.175 s on 2 procs (close to the lower bound)
Finest "grain" limited to 1 page = 16384 bytes = 2048 doubles.

29 E.g. triangular system solving A·x = b
Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)
1/ x_1 = b_1 / a_11
2/ For k = 2..n: b_k = b_k − a_k1 · x_1
This leaves a triangular system of dimension n−1 (from the system of dimension n).
[figure: lower-triangular matrix A with zero upper block, unknown x, right-hand side b]

30 E.g. triangular system solving A·x = b
Sequential algorithm: T1 = n²/2; T∞ = n (fine grain)
Using parallel matrix inversion: T1 = n³; T∞ = log² n (fine grain), with
A = [ A11 0 ; A21 A22 ], A⁻¹ = [ A11⁻¹ 0 ; S A22⁻¹ ], S = −A22⁻¹ · A21 · A11⁻¹, and x = A⁻¹ · b
Self-adaptive granularity algorithm: T1 = n²; T∞ = √n · log n
[figure: ExtractPar with a self-adaptive scalar product, a self-adaptive sequential algorithm, self-adaptive matrix inversion, choice of block size h = √m]
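The sequential recursion of slide 29 (solve x_1, update b, recurse on the trailing system of dimension n−1) unrolls into plain forward substitution; a small sketch with an operation counter shows the T1 = n²/2 behavior. The counter tallies multiply-subtract updates, i.e. n(n−1)/2 of them.

```python
# Forward substitution for a lower-triangular system L x = b, following
# the slide's recursion: x_k = b_k / L[k][k], then b_i -= L[i][k] * x_k
# for the remaining rows. Note: b is modified in place.

def tri_solve(L, b):
    n = len(b)
    x = [0.0] * n
    ops = 0
    for k in range(n):
        x[k] = b[k] / L[k][k]
        for i in range(k + 1, n):
            b[i] -= L[i][k] * x[k]   # eliminate x_k from row i
            ops += 1                  # one multiply-subtract per row
    return x, ops
```

The dependence of x_k on all previous updates is exactly why T∞ = n at this grain, motivating the block-inversion and self-adaptive variants above.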

31 Conclusion
Adaptive: what choices, and how to choose?
Illustration: adaptive parallel prefix based on work-stealing,
- a self-tuned, baroque hybrid: O(p · log n) choices;
- achieves near-optimal, processor-oblivious performance.
A generic adaptive scheme to implement parallel algorithms with provable performance.

32 Mini Symposium: Adaptive Algorithms for Scientific Computing
9h45 Adaptive algorithms - Theory and applications, Jean-Louis Roch et al., AHA Team, INRIA-CNRS, Grenoble, France
10h15 Hybrids in exact linear algebra, Dave Saunders, U. Delaware, USA
10h45 Adaptive programming with hierarchical multiprocessor tasks, Thomas Rauber, U. Bayreuth, Germany
11h15 Cache-oblivious algorithms, Michael Bender, Stony Brook U., USA
Adaptive, hybrid, oblivious: what do these terms mean?
Taxonomy of autonomic computing [Ganek & Corbi 2003]: self-configuring / self-healing / self-optimising / self-protecting
Objective: towards an analysis based on algorithm performance

33 Questions ?

34 Some examples (1/2)
Adaptive algorithms used empirically and theoretically:
- Atlas [2001], dense linear algebra library: instruction set and instruction schedule; self-calibration of the block size at install time on the machine.
- FFTW (1998, ...): FFT(n) combines p FFT(q) and q FFT(p); for any recursive call FFT(n), the best value for p is pre-computed — the optimal split for vector size n is pre-computed on the machine.
- Cache-oblivious B-trees: recursive block splitting to minimize the number of page faults; self-adaptation to the memory hierarchy.
- Work-stealing (Cilk (1998, ...), Kaapi (2000, ...)): recursive parallelism; choice between a sequential depth-first schedule and a breadth-first schedule. « Work-first principle »: optimize the local sequential execution and put the overhead on the rare steals from idle processors. Implicitly adaptive.

35 Some examples (2/2)
- Moldable tasks: bi-criteria scheduling with guarantees [Trystram et al. 2004]; alternating recursive combination of approximation algorithms for each criterion; self-adaptation with guaranteed performance for each criterion.
- « Cache-oblivious » algorithms [Bender et al. 2004]: recursive block splitting that minimizes page faults; self-adaptation to the memory hierarchy (B-tree).
- « Processor-oblivious » algorithms [Roch et al. 2005]: recursive combination of two algorithms, one sequential and one parallel; self-adaptation to resource idleness.

36 Best case: the parallel algorithm is efficient
W∞ is small and W1 = W_seq: the parallel algorithm is an optimal sequential one.
Examples: parallel divide-&-conquer algorithms.
Implementation: work-first principle — no overhead for local execution of tasks.
Examples: Cilk: the THE protocol; Kaapi: compare-&-swap only.

37 Experimentation: knary benchmark
SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan. Distributed architecture: iCluster, Athapascan.

#procs | Speed-up
     8 |  7.83
    16 | 15.6
    32 | 30.9
    64 | 59.2
   100 | 90.1

T_s = 2397 s; T_1 = 2435 s

38 How to choose/adapt granularity?
In « theory », fine granularity: maximal parallelism. Drawback: overhead of task management.
In « practice », coarse granularity: splitting into p = #resources. Drawback: heterogeneous, dynamic architectures, where Π_i(t) is the speed of processor i at time t.
[figure: task graph with F(2,a), G(a,b), H(a), H(b), O(b,7) — a high potential degree of parallelism]

39 How to obtain an efficient fine-grain algorithm?
Hypothesis for efficiency of work-stealing: the parallel algorithm is « work-optimal » and T∞ is very small (recursive parallelism).
Problem: fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm: overhead due to parallelism creation and synchronization, but also arithmetic overhead.

40 Self-adaptive grain algorithms
Recursive computations with local sequential execution. Special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm.
Hypothesis, two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
Examples: iterated product [Vernizzi], gzip / compression [Kerfali], MPEG-4 / H264 [Bernard], prefix computation [Traore]

41 Adaptive Prefix versus optimal on identical processors

42 Illustration: adaptive parallel prefix
Adaptive parallel computing on non-uniform and shared resources; example: adaptive prefix computation.

43 Indeed parallelism often costs... e.g. prefix computation
Compute P_1 = a_0 * a_1, P_2 = a_0 * a_1 * a_2, ..., P_n = a_0 * a_1 * ... * a_n.
Sequential algorithm: for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i]; — W1 = n
Parallel algorithm [Ladner-Fischer]: W∞ = 2·log n, but W1 = 2·n — twice as expensive as the sequential algorithm.
[figure: a_0 a_1 ... a_n combined pairwise by *, Prefix(n/2) on the pair products, then fix-up products yielding P_1 ... P_n]

