
1 Laboratoire Informatique de Grenoble. Parallel algorithms and scheduling: adaptive parallel programming and applications. Bruno Raffin, Jean-Louis Roch, Denis Trystram. MOAIS project [INRIA / CNRS / INPG / UJF] moais.imag.fr

2 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation to support fault tolerance by work-stealing. Conclusion.

3 Why adaptive algorithms, and how? Input data vary and resource availability is versatile, so adaptation is needed to improve performance. Means of adaptation: scheduling (partitioning, load-balancing, work-stealing); measures on resources and on data; calibration (tuning parameters: block size, cache, choice of instructions, priority management, …); choices in the algorithm (sequential / parallel, approximate / exact, in-memory / out-of-core, …). An algorithm is « hybrid » iff there is a high-level choice between at least two algorithms, each of which can solve the same problem.

4 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation to support fault tolerance by work-stealing. Conclusion.

5 Modeling a hybrid algorithm. Several algorithms solve the same problem f, e.g. algo_f_1, algo_f_2(block size), …, algo_f_k, each algo_f_i being recursive:

algo_f_i ( n, … ) { … f( n-1, … ); … f( n/2, … ); … }

Adaptation consists in choosing algo_f_j for each call to f. Practical hybrids: ATLAS, GOTO, FFLAS-FFPACK, FFTW, cache-oblivious B-trees, and any parallel program with scheduling support: Cilk, Athapascan/Kaapi, NESL, TLib, …
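A minimal C++ sketch of this model (hypothetical names: f_seq, f_split and the threshold are illustrations, not part of the slide): a hybrid f makes a high-level choice, at every call, between two variants that solve the same problem.

    #include <cstddef>

    // Two algorithms for the same problem f (here: summing n values).
    double f(const double* a, std::size_t n);  // the hybrid, defined below

    // Variant 1: iterative, no recursion.
    double f_seq(const double* a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += a[i];
        return s;
    }

    // Variant 2: recursive halving; its f(n/2)-style calls go back through
    // the hybrid f, so the choice is re-evaluated at every level.
    double f_split(const double* a, std::size_t n) {
        std::size_t h = n / 2;
        return f(a, h) + f(a + h, n - h);
    }

    // The hybrid: a high-level choice between at least two algorithms.
    double f(const double* a, std::size_t n) {
        const std::size_t threshold = 1024;  // tuning parameter, to be calibrated
        return (n <= threshold) ? f_seq(a, n) : f_split(a, n);
    }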

6 How to manage the overhead due to choices? Classification 1/2: an algorithm is a simple hybrid iff it makes O(1) choices [e.g. the block size in ATLAS], and a baroque hybrid iff it makes an unbounded number of choices [e.g. the recursive splitting factors in FFTW]. Choices are either dynamic or pre-computed based on input properties.

7 Classification 2/2: choices may or may not be based on architecture parameters. A hybrid is: Oblivious: the control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]]. Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity]; either engineered-tuned [e.g. the ATLAS and GOTO libraries, FFTW] or self-tuned [e.g. LinBox/FFLAS [Saunders et al.]]. Adaptive: the algorithm self-configures dynamically, based on input properties or resource circumstances discovered at run-time [e.g. idle processors, data properties; TLib [Rauber & Rünger]].

8 Examples. BLAS libraries: ATLAS: simple, self-tuned; GOTO: simple, engineered-tuned; LinBox/FFLAS: simple, self-tuned and adaptive [Saunders et al.]. FFTW: halving factor: baroque tuned; stopping criterion: simple tuned.

9 Adaptation in parallel algorithms. Dynamic architecture: a non-fixed number of resources with variable speeds, e.g. a grid, but not only: an SMP server in multi-user mode. Problem: compute f(a). Which algorithm to choose: sequential, parallel with P=2, P=100, P=max? On a multi-user SMP server, a grid, a heterogeneous network?

10 Processor-oblivious algorithms. A dynamic architecture (non-fixed number of resources, variable speeds; e.g. a grid, but also an SMP server in multi-user mode) motivates « processor-oblivious » parallel algorithms that: (+) are independent from the underlying architecture: no reference to p, nor to Π_i(t) = speed of processor i at time t, nor …; (+) on a given architecture, have performance guarantees: they behave as well as an optimal (off-line, non-oblivious) algorithm.

11 How to adapt the application? By minimizing communications, e.g. amortizing synchronizations in the simulation [Beaumont, Daoudi, Maillard, Manneback, Roch - PMAA 2004]: adaptive granularity. By controlling latency (interactivity constraints): FlowVR [Allard, Menier, Raffin]: overhead. By managing node failures and resilience [checkpoint/restart, checkers]: FlowCert [Jafar, Krings, Leprevost, Roch, Varrette]. By adapting granularity: malleable tasks [Trystram, Mounié]; dataflow cactus-stack: Athapascan/Kaapi [Gautier]; recursive parallelism by « work-stealing » [Blumofe-Leiserson 98, Cilk, Athapascan, …]; self-adaptive grain algorithms with dynamic extraction of parallelism [Daoudi, Gautier, Revire, Roch - J. TSI 2004].

12 Parallelism and efficiency. « Work »: sequential time T_1 = #operations. « Depth »: parallel time on infinitely many resources T_∞ = #operations on a critical path. Problem: how to adapt the potential parallelism to the resources? Greedy scheduling gives T_p ≤ T_1/p + T_∞ [Graham 69]. Parallelism is difficult to exploit in general at coarse grain, but easy if T_∞ is small (fine grain); yet fine grain is expensive in general, with only a small overhead at coarse grain. A scheduling policy must be both efficient (close to optimal) and cheap to realise => the goal is to keep T_∞ small while keeping coarse-grain control.
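In the standard work/depth notation used here (a reminder, not specific to these slides), greedy scheduling gives, for any p:

\[
T_p \;\le\; \frac{T_1}{p} + T_\infty,
\qquad
\max\!\Big(\frac{T_1}{p},\, T_\infty\Big) \;\le\; T_p^{\mathrm{opt}},
\]

so any greedy schedule is within a factor of 2 of the optimal makespan.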

13 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation to support fault tolerance by work-stealing. Conclusion.

14 Parallel algorithms and scheduling: adaptive parallel programming and applications. Bruno Raffin, Jean-Louis Roch, Denis Trystram. INRIA-CNRS MOAIS team - LIG, Grenoble, France. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. Conclusion.

15 TO BE COMPLETED (Denis)

16 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation to support fault tolerance by work-stealing. Conclusion.

17 Parallel algorithms and scheduling: adaptive parallel programming and applications. Bruno Raffin, Jean-Louis Roch, Denis Trystram. INRIA-CNRS MOAIS team - LIG, Grenoble, France. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. Conclusion.

18 Work-stealing (1/2). « Work » W_1 = total number of operations performed. « Depth » W_∞ = #operations on a critical path (parallel time on infinitely many resources). Work-stealing = a greedy schedule, but distributed and randomized: each processor locally manages the tasks it creates; when idle, a processor steals the oldest ready task from a remote, non-idle victim processor chosen at random.

19 Work-stealing (2/2). With « work » W_1 and « depth » W_∞ as above, interests: suited to heterogeneous architectures, with a slight modification [Bender-Rabin 02]; if W_∞ is small enough, a near-optimal processor-oblivious schedule is obtained with good probability on p processors with average speed Π_ave. NB: #successful steals = #task migrations = O(p·W_∞) [Blumofe 98, Narlikar 01, Bender 02]. Implementation: the work-first principle [Cilk: series-parallel, Kaapi: dataflow] moves the scheduling overhead onto the steal operations (the infrequent case); in the general case, local parallelism is implemented by a sequential function call.
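A minimal sketch of the data structure behind this (hypothetical and simplified; real runtimes such as Cilk or Kaapi use per-processor deques with a lock-free owner path, not a mutex): the owner treats its queue depth-first, like a call stack, while a thief steals breadth-first, taking the oldest ready task.

    #include <deque>
    #include <functional>
    #include <mutex>
    #include <optional>

    // One ready-task queue per processor.
    class WorkQueue {
        std::deque<std::function<void()>> tasks_;
        std::mutex m_;  // simplification: the work-first principle demands
                        // that the owner's push/pop cost almost nothing
    public:
        void push(std::function<void()> t) {            // owner: task creation
            std::lock_guard<std::mutex> g(m_);
            tasks_.push_back(std::move(t));
        }
        std::optional<std::function<void()>> pop() {    // owner: youngest task first
            std::lock_guard<std::mutex> g(m_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.back());
            tasks_.pop_back();
            return t;
        }
        std::optional<std::function<void()>> steal() {  // thief: oldest ready task
            std::lock_guard<std::mutex> g(m_);
            if (tasks_.empty()) return std::nullopt;
            auto t = std::move(tasks_.front());
            tasks_.pop_front();
            return t;
        }
    };

Stealing the oldest task matters: it is usually the largest remaining subcomputation, which is why few steals (O(p·W_∞)) suffice.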

20 Work-stealing scheduling of a parallel recursive fine-grain algorithm: an idle processor steals the oldest ready task. Interests: #successful steals = O(p·W_∞); suited to heterogeneous architectures [Bender-Rabin 03, …]. Hypotheses for efficient parallel executions: the parallel algorithm is « work-optimal »; T_∞ is very small (recursive parallelism); a « sequential » execution of the parallel algorithm is valid, e.g. search trees, Branch&Bound, … Implementation: the work-first principle [Multilisp, Cilk, …]: the overhead of task creation is paid only upon a steal request (sequential degeneration of the parallel algorithm); cactus-stack management.

21 Implementation of work-stealing. Hypothesis: a sequential schedule is valid. Example: f1() { … fork f2; … } — f1 runs on processor P and pushes f2 on the stack; an idle processor P' steals f2 while P continues the non-preemptive execution of its ready task. Advantage: a « statically » fine grain, but dynamic control. Drawback: possible overhead of the parallel algorithm [e.g. prefix computations].

22

23 Cost model: with high probability, on p identical processors: execution time = T_1/p + O(T_∞); number of steal requests = O(p·T_∞).

24

25 Experimentation: knary benchmark. SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan; distributed architecture: iCluster, Athapascan. Speed-ups are near-linear, e.g. 30.9 on 32 procs and about 59 on 64 procs; T_s = 2397 s, T_1 = 2435 s.

26 How to obtain an efficient fine-grain algorithm? Hypotheses for the efficiency of work-stealing: the parallel algorithm is « work-optimal » and T_∞ is very small (recursive parallelism). Problem: fine-grain (small T_∞) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm: overhead due to parallelism creation and synchronization, but also arithmetic overhead.

27 Work-stealing and adaptability. Work-stealing allocates processors to tasks transparently to the application, with provable performance. It supports the addition of new resources, and the resilience of resources and fault tolerance (crash faults, network, …): checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]. It is a baroque hybrid adaptation: there is an implicit dynamic choice between two algorithms, a sequential (local) depth-first algorithm (the default choice) and a parallel breadth-first one; the choice is performed at runtime, depending on resource idleness. Well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]: parallel divide&conquer computations, tree searching, Branch&X, …: suited when both the sequential and the parallel algorithm perform (almost) the same number of operations.

28 Self-adaptive grain algorithm. Principle: save the parallelism overhead by favoring a sequential algorithm: use the parallel algorithm only if a processor becomes idle, by extracting parallelism from the sequential computation. Hypothesis: two algorithms, one sequential (SeqCompute) and one parallel (LastPartComputation), such that at any time it is possible to extract parallelism from the remaining computations of the sequential algorithm; a sketch follows.
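A minimal C++ sketch of this coupling (hypothetical code; the synchronization between owner and thief is deliberately simplified, and a real runtime triggers the extraction only on a steal request):

    #include <atomic>
    #include <cstddef>

    // Descriptor of the remaining work [first, last) of the sequential algorithm.
    struct DescWork {
        std::atomic<std::size_t> first;
        std::size_t last;
    };

    // SeqCompute: consumes the work one unit at a time, from the front.
    template <class F>
    void SeqCompute(DescWork& dw, F f) {
        for (std::size_t i = dw.first++; i < dw.last; i = dw.first++)
            f(i);
    }

    // LastPartComputation: an idle processor extracts the second half of
    // whatever remains, leaving the front part to the sequential algorithm.
    // NB: a real implementation must synchronize this update of dw.last
    // with the concurrent reads in SeqCompute.
    inline bool ExtractPar(DescWork& dw, DescWork& stolen) {
        std::size_t f = dw.first.load();
        if (f >= dw.last) return false;          // nothing left to extract
        std::size_t mid = f + (dw.last - f) / 2;
        stolen.first = mid;                      // thief takes [mid, last)
        stolen.last = dw.last;
        dw.last = mid;                           // victim keeps [f, mid)
        return true;
    }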

29 Generic self-adaptive grain algorithm

30 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation for fault tolerance. Conclusion.

31 How to get both optimal work W_1 and small W_∞? General approach: mix a sequential algorithm with optimal work W_1 and a fine-grain parallel algorithm with minimal critical time W_∞. Folk technique (parallel, then sequential): use the parallel algorithm down to a certain « grain », then switch to the sequential one; drawback: W_1 increases ;o) … and so does the number of steals (see the sketch below). Work-preserving speed-up technique [Bini-Pan 94] (sequential, then parallel), also called cascading [Jaja 92]: a careful interplay of both algorithms builds one with both W_∞ small and W_1 = O(W_seq): use the work-optimal sequential algorithm to reduce the size, then the time-optimal parallel algorithm to decrease the time; drawback: sequential at coarse grain and parallel at fine grain ;o(
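For contrast, the folk technique sketched in C++ (hypothetical; a reduction stands in for any problem): the static grain is exactly what the adaptive scheme removes.

    #include <cstddef>
    #include <future>
    #include <numeric>

    // Parallel divide-and-conquer down to a fixed grain, then sequential.
    // Too small a grain -> task-creation overhead; too large -> idle processors.
    double sum(const double* a, std::size_t n, std::size_t grain) {
        if (n <= grain)
            return std::accumulate(a, a + n, 0.0);   // sequential at fine grain
        std::size_t h = n / 2;
        auto right = std::async(std::launch::async,  // parallel at coarse grain
                                sum, a + h, n - h, grain);
        return sum(a, h, grain) + right.get();
    }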

32 Illustration: f(i), i = 1..100. SeqComp(w) on CPU A computes f(1); the remaining work W = 2..100 exposes a LastPart(w) for extraction.

33 Illustration: f(i), i = 1..100. SeqComp(w) on CPU A has computed f(1); f(2); the remaining work is W = 3..100.

34 Illustration: f(i), i = 1..100. CPU B becomes idle and takes LastPart(w), while SeqComp(w) continues on CPU A after f(1); f(2).

35 Illustration: f(i), i = 1..100. The remaining work is split in half: SeqComp(w) on CPU A keeps W = 3..51, still exposing its own LastPart(w); the extracted part W = 52..100 goes to CPU B, which also exposes a LastPart(w).

36 Illustration: f(i), i = 1..100. CPU A continues SeqComp(w) on W = 3..51; CPU B starts on its part W = 52..100.

37 Illustration: f(i), i = 1..100. SeqComp(w) on CPU B computes f(52): each CPU now runs the sequential algorithm on its own interval, each still exposing a LastPart(w).

38 Iterated product: sequential, parallel, adaptive [Davide Vernizzi]. Sequential version: input: an array of n values; output: c = f(a[0]) * f(a[1]) * … * f(a[n-1]). C/C++ code: for (i = 0; i < n; i++) res = res * f(a[i]);

39 Variant: sum of pages. Input: a set of n pages, each page being an array of values. Output: one page where each element is the sum of the elements with the same index in all the input pages. C/C++ code: for (i = 0; i < n; i++) for (j = 0; j < pageSize; j++) res[j] += page[i][j];

40 Demonstration on ensibull. Script (go-tout.sh):

#!/bin/sh
./spg /tmp/data &
./ppg /tmp/data 1 --a1 -thread.poolsize 3 &
./apg /tmp/data 1 --a1 -thread.poolsize 3 &

Result of ./go-tout.sh: page size 4096, memory allocated. ADAPTIVE (3 procs): res = …e+07, time = … s, 54 threads created. PARALLEL (3 procs): res = …e+07, time = … s, #fork = … SEQUENTIAL (1 proc): res = …e+07, time = … s.

41 Adaptive algorithm (1/3). Hypothesis: non-preemptive, work-stealing scheduling. Adaptive sequential coupling (the template arguments in <> were swallowed by the HTML; the types and the Merge task name are assumptions restored from context):

void Adaptative (a1::Shared_w<Page> *resLocal, DescWork dw) {
  // cout << "Adaptative" << endl;
  a1::Shared<Page> resLPC;        // result of the (possibly stolen) last part
  a1::Fork<LPC>() (resLPC, dw);   // expose LastPartComputation to thieves
  Page resSeq (pageSize);
  AdaptSeq (dw, &resSeq);         // sequential algorithm on the local part
  a1::Fork<Merge>() (resLPC, *resLocal, resSeq);  // merge the partial results
}

42 Adaptive algorithm (2/3). Sequential side:

void AdaptSeq (DescWork dw, Page *resSeq) {
  DescLocalWork w;
  Page resLoc (pageSize);
  double k;
  while (!dw.desc->extractSeq(&w)) {     // grab the next sequential unit of work
    for (int i = 0; i < pageSize; i++)
      … ;                                // accumulate work unit w into resLoc (body elided)
  }
}

43 Adaptive algorithm (3/3). Extraction side = the parallel algorithm (the template arguments of the three a1::Fork calls were lost; comments indicate each task's role):

struct LPC {
  void operator () (a1::Shared_w<Page> resLPC, DescWork dw) {
    DescWork dw2;
    dw2.Allocate();
    dw2.desc->l.initialize();
    if (dw.desc->extractPar(&dw2)) {     // split off the last part of the remaining work
      a1::Shared<Page> res2;
      a1::Fork<…>() (res2, dw2.desc->i, dw2.desc->j);   // adaptive computation of the extracted interval
      a1::Shared<Page> resLPCold;
      a1::Fork<…>() (resLPCold, dw);                    // recursively re-expose the rest
      a1::Fork<…>() (resLPCold, res2, resLPC);          // merge the partial results
    }
  }
};

44 Application: gzip parallelization. Gzip is widely used (web) and costly, although of linear complexity; its source code is about 10,000 lines of C with complex data structures. Principle: LZ77 + Huffman tree. Why gzip? The problem is P-complete, but a practical parallelization is possible, similar to the iterated product. Drawback: every (known) parallelization induces an overhead, i.e. a loss of compression ratio.

45 How to parallelize gzip? Option 1: on-the-fly compression of the input file with a dynamic partition into blocks: « easy » parallelization, 100% compatible with gzip/gunzip; problems: loss of compression ratio, grain dependent on the machine, overhead. Option 2: parallel compression with a static partition into blocks, producing compressed blocks.
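A minimal sketch of the static block-partition option (an assumption for illustration: it uses zlib's compress() on each block, which yields raw zlib streams rather than a real gzip container, and exhibits the compression-ratio loss mentioned above):

    #include <vector>
    #include <zlib.h>

    // Compress each fixed-size block independently; each block is a candidate
    // for an independent parallel task. Blocks compress slightly worse than
    // one single stream would: the ratio loss grows as the blocks shrink.
    std::vector<std::vector<Bytef>>
    compress_blocks(const Bytef* data, uLong size, uLong blockSize) {
        std::vector<std::vector<Bytef>> out;
        for (uLong off = 0; off < size; off += blockSize) {
            uLong n = (size - off < blockSize) ? size - off : blockSize;
            uLongf destLen = compressBound(n);      // worst-case compressed size
            std::vector<Bytef> dst(destLen);
            if (compress(dst.data(), &destLen, data + off, n) == Z_OK) {
                dst.resize(destLen);
                out.push_back(std::move(dst));
            }
        }
        return out;
    }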

46 Adaptive-grain gzip parallelization: on-the-fly compression of the input file with a dynamic partition into blocks, coupling SeqComp with LastPartComputation; the compressed output blocks are concatenated (cat) into the output compressed file.

47 Performance on an SMP (4 x 200 MHz Pentium).

48 Performance in a distributed setting. Sequential: one 200 MHz Pentium. SMP: 4 x 200 MHz Pentium. Distributed architecture (Myrinet): 4 x 200 MHz + 2 x 333 MHz Pentium. Distributed search in 2 directories of the same size, each on a remote disk (NFS).

49 Overhead in compressed file size:
input   | gzip     | adaptive 2 procs | adaptive 8 procs | adaptive 16 procs
5.2 MB  | 1.023 MB | 1.027 MB         | 1.05 MB          | 1.08 MB
9.4 MB  | 6.60 MB  | 6.62 MB          | 6.73 MB          | 6.79 MB
10 MB   | 1.12 MB  | 1.13 MB          | 1.14 MB          | 1.17 MB
Gain in time:
5.2 MB  | 3.35 s | 0.96 s | 0.55 s
9.4 MB  | 7.67 s | 6.73 s | 6.79 s
10 MB   | 6.79 s | 1.71 s | 0.88 s

50 Performance on the 4 x 200 MHz Pentium SMP.

51 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation for fault tolerance. Conclusion.

52 Solution: mix a sequential and a parallel algorithm. Basic technique: use the parallel algorithm down to a certain « grain », then the sequential one; problem: W_1 increases, and so do the number of migrations … and the inefficiency ;o(. Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja 92]: a careful interplay of both algorithms builds one with both W_∞ small and W_1 = O(W_seq): divide the sequential algorithm into blocks, and compute each block with the (non-optimal) parallel algorithm; drawback: sequential at coarse grain and parallel at fine grain ;o(. Adaptive granularity is the dual approach: parallelism is extracted at run-time from any sequential task. But often parallelism has a cost!

53 Prefix computation. Prefix problem: input: a_0, a_1, …, a_n; output: π_0, π_1, …, π_n with π_i = a_0 * a_1 * … * a_i. Sequential algorithm: π[0] = a[0]; for (i = 1; i <= n; i++) π[i] = π[i-1] * a[i]; it performs W_1 = W_∞ = n operations. Fine-grain optimal parallel algorithm [Ladner-Fischer]: multiply the pairs a_1*a_2, a_3*a_4, …, a_{n-1}*a_n, solve a prefix problem of size n/2 on the products, then fill in the remaining prefixes; critical time W_∞ = 2 log n, but it performs W_1 = 2n operations: twice as expensive as the sequential one.
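The two extremes side by side, as a minimal C++ sketch (plain recursive code without a scheduler; for brevity the recursive version assumes n is a power of two):

    #include <cstddef>
    #include <vector>

    // Sequential prefix: n multiplications, but critical path of length n.
    std::vector<double> prefix_seq(const std::vector<double>& a) {
        std::vector<double> pi(a.size());
        pi[0] = a[0];
        for (std::size_t i = 1; i < a.size(); ++i)
            pi[i] = pi[i - 1] * a[i];
        return pi;
    }

    // Ladner-Fischer scheme: pairwise products, a half-size prefix on them,
    // then the remaining positions. About 2n multiplications, depth ~2 log n;
    // both loops are point-wise independent and can be forked by a scheduler.
    std::vector<double> prefix_lf(const std::vector<double>& a) {
        std::size_t n = a.size();
        if (n == 1) return {a[0]};
        std::vector<double> b(n / 2);
        for (std::size_t i = 0; i < n / 2; ++i)      // parallelizable loop
            b[i] = a[2 * i] * a[2 * i + 1];
        std::vector<double> pb = prefix_lf(b);       // half-size prefix
        std::vector<double> pi(n);
        for (std::size_t i = 0; i < n; ++i)          // parallelizable loop
            pi[i] = (i % 2) ? pb[i / 2]
                            : (i ? pb[i / 2 - 1] * a[i] : a[0]);
        return pi;
    }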

54 Prefix computation: an example where parallelism always costs. Any parallel prefix algorithm with critical time W_∞ runs on p processors in time at least 2n/(p+1): a strict lower bound, reached by a block algorithm + pipeline [Nicolau et al. 1996]. Question: how to design a generic parallel algorithm, independent from the architecture, that achieves optimal performance on any given architecture? => design a malleable algorithm where the scheduling fits the number of operations performed to the architecture.

55 Architecture model. Heterogeneous processors with changing speeds [Bender-Rabin 02]: Π_i(t) = instantaneous speed of processor i at time t, in #operations per second. Average speed per processor for a computation of duration T: Π_ave = (1 / (p·T)) · ∫_0^T Σ_i Π_i(t) dt. Lower bound for the time of prefix computation: T ≥ 2n / ((p+1)·Π_ave).
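In display form (a reconstruction from the definitions above, following the model of [Bender-Rabin 02]):

\[
\Pi_{\mathrm{ave}} \;=\; \frac{1}{p\,T}\int_0^{T}\sum_{i=1}^{p}\Pi_i(t)\,dt,
\qquad
T_{\mathrm{prefix}} \;\ge\; \frac{2n}{(p+1)\,\Pi_{\mathrm{ave}}}.
\]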

56 Alternative: concurrently sequential and parallel. Based on the work-first principle: always execute a sequential algorithm, to reduce the parallelism overhead; use the parallel algorithm only if a processor becomes idle (i.e. steals), by extracting parallelism from the sequential computation. Hypothesis: two algorithms, one sequential (SeqCompute) and one parallel (LastPartComputation), such that at any time it is possible to extract parallelism from the remaining computations of the sequential algorithm: self-adaptive granularity based on work-stealing.

57 Adaptive prefix on 3 processors. The main process runs the sequential algorithm on a_1 … a_12 and has computed π_1; work-stealer 1 issues a steal request.

58 Adaptive prefix on 3 processors. The main process continues sequentially on a_1 … a_4 (π_2 done); work-stealer 1 has taken a_5 … a_12 and computes the partial products α_i = a_5 * … * a_i (α_6 done); work-stealer 2 issues a steal request.

59 Adaptive prefix on 3 processors. Work-stealer 2 has taken a_9 … a_12 from work-stealer 1 and computes β_i = a_9 * … * a_i (β_10 done); the main process, having finished π_4, preempts work-stealer 1 and obtains π_8 = π_4 * α_8.

60 Adaptive prefix on 3 processors. Holding π_8, the main process preempts work-stealer 2 (β_11 done); the prefixes π_5 … π_7 of the first stolen block are finalized from π_4 and the α_i.

61 Adaptive prefix on 3 processors. The main process combines π_8 with the β_i to produce π_9 … π_11 and finishes with a_12; the work-stealers finalize the prefixes of their own blocks.

62 Adaptive prefix on 3 processors. End of the run: the preemptions induce an implicit critical path on the sequential process.

63 Analysis of the algorithm. Sketch of the proof of the execution-time bound: the dynamic coupling makes the two algorithms complete simultaneously; the sequential one performs the (optimal) number of operations S on one processor, while the parallel one runs in minimal time but performs X extra operations on the other processors; dynamic splitting is always possible down to the finest grain, but remains locally sequential; the critical path is small (e.g. log X); each non-constant-time task can potentially be split (variable speeds); the algorithmic scheme ensures T_s = T_p + O(log X), which bounds the total number X of operations performed, hence the overhead of parallelism = (S + X) - #ops_optimal, and meets the lower bound.
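Putting the pieces together, the guarantee has the following form (a reconstruction consistent with the lower bound of slide 55 and the O(log X) coupling overhead; the exact constants are in the original analysis):

\[
T_p \;\le\; \frac{2n}{(p+1)\,\Pi_{\mathrm{ave}}}
       \;+\; O\!\left(\frac{\log n}{\Pi_{\mathrm{ave}}}\right),
\]

which is asymptotically optimal among prefix algorithms on processors of average speed \(\Pi_{\mathrm{ave}}\).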

64 Adaptive prefix: experiments 1. Single-user context: prefix sum of doubles on an 8-processor SMP (IA64, 1.5 GHz, Linux); time (s) vs #processors, comparing the pure sequential algorithm, the optimal off-line parallel algorithm on p processors, and the oblivious algorithm. The processor-oblivious prefix achieves near-optimal performance: close to the lower bound both on 1 processor and on p processors, and less sensitive to system overhead: even better than the theoretically optimal off-line parallel algorithm on p processors.

65 Adaptive prefix: experiments 2. Multi-user context with additional external load: (9-p) additional external dummy processes are executed concurrently. Prefix sum of doubles on an 8-processor SMP (IA64, 1.5 GHz, Linux); time (s) vs #processors, comparing the off-line parallel algorithm for p processors and the oblivious algorithm. The processor-oblivious prefix computation is always the fastest, with a 15% benefit over a parallel algorithm for p processors with an off-line schedule.

66 The prefix race: sequential / fixed parallel / adaptive. Competitors: adaptive on 8 procs; parallel on 8, 7, 6, 5, 4, 3 and 2 procs; sequential. On each of the 10 executions, the adaptive version completes first.

67 Conclusion. Coupling an on-line parallel algorithm with a work-stealing schedule is useful for designing processor-oblivious algorithms. Applied to prefix computation: theoretically, it reaches the lower bound on heterogeneous processors with changing speeds; practically, it achieves near-optimal performance on multi-user SMPs. This yields a generic adaptive scheme to implement parallel algorithms with provable performance. Work in progress: parallel 3D reconstruction [oct-tree scheme with deadline constraint].

68 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation to support fault tolerance by work-stealing. Conclusion.

69 TO BE COMPLETED (Bruno / Luciano)

70 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation to support fault tolerance by work-stealing. Conclusion.

71 TO BE COMPLETED (Bruno / Jean-Denis)

72 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation [Bruno/Luciano]. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation to support fault tolerance by work-stealing. Conclusion.

73 The dataflow graph is a portable representation of the distributed execution: it can be used to adapt to resource resilience by computing a coherent global state.

74

75

76

77

78 SEL: a protocol based on logging the dataflow graph.

79

80

81

82

83 Conclusion on fault tolerance. Kaapi tolerates both the addition and the failure of machines. TIC protocol: checkpoint period to be tuned; failure detection: error signals + heartbeat; http://kaapi.gforge.inria.fr. In Athapascan/Kaapi, the TIC protocol checkpoints the stack (local or global state), while the SEL protocol logs the full dataflow graph (FullDag).

84 Parallel algorithms and scheduling: adaptive parallel programming and applications. Contents: I. Motivations for adaptation and examples. II. Off-line scheduling and adaptation: moldable/malleable [Denis?]. III. On-line work-stealing scheduling and parallel adaptation. IV. A first example: iterated product; application to gzip. V. Processor-oblivious parallel prefix computation. VI. Adaptation to time constraints: oct-tree computation. VII. Bi-criteria latency/bandwidth [Bruno/Jean-Denis]. VIII. Adaptation for fault tolerance. Conclusion.

85 Thank you! Interactive distributed simulation [B. Raffin & E. Boyer]: 5 cameras, 6 PCs; 3D reconstruction + simulation + rendering -> an adaptive scheme to maximize the 3D-reconstruction precision within a fixed time frame.

86 Laboratoire Informatique de Grenoble: Multi-Programming and Scheduling Design for Applications of Interactive Simulation. LIG evaluation committee - 23/01/2006. Louvre, Musée de l'Homme: sculpture (head); artist: anonymous; origin: Rapa Nui [Easter Island]; date: between the 11th and the 15th century; dimensions: 1.70 m high.

87 Who are the Moais today? Permanent staff (7): INRIA (2) + INPG (3) + UJF (2): Vincent Danjean [MdC UJF], Thierry Gautier [CR INRIA], Guillaume Huard [MdC UJF], Grégory Mounié [MdC INPG], Bruno Raffin [CR INRIA], Jean-Louis Roch [MdC INPG], Denis Trystram [Prof INPG]. Visiting professor (1): Axel Krings [CNRS/RAGTIME, Univ. Idaho, 1/09/2004 - 31/08/05]; post-doc (1): Luciano Suares [INRIA, 2006]. ITA: administration (1.25): Barbara Amouroux [INRIA, 50%], Annie-Claude Vial-Dallais [INPG, 50%], Evelyne Feres [UJF, 50%]; engineer (1): Joelle Prévost [INPG, 50%], IE [CNRS, 50%]. PhD students (14): 6 contracts + 2 joint (co-tutelle): Julien Bernard (BDI ST), Florent Blanchot (Cifre ST), Guillaume Dunoyer (DCN, MOAIS/POPART/E-MOTION), Lionel Eyraud (MESR), Feryal-Kamila Moulai (Bull), Samir Jafar (ATER), Clément Menier (MESR, MOAIS/MOVI), Jonathan Pecero-Sanchez (Egide Mexico), Laurent Pigeon (Cifre IFP), Krzysztof Rzadca (U. Warsaw, Poland), Daouda Traore (Egide/Mali), Sébastien Varrette (U. Luxembourg), Thomas Arcila (Cifre Bull), Eric Saule (MESR). Also: Jérémie Allard (MESR, 12/2005), Luiz-Angelo Estefanel (Egide/Brasil, 11/2005), Hamid-Reza Hamidi (Egide/Iran, 10/2005), Jaroslaw Zola (U. Czestochowa, Poland, 12/2005). +4

88 Objective. Programming, on virtual networks (clusters and lightweight grids), of applications whose multi-criteria performance depends on the number of resources, with provable performance. Adaptability: static, to the platform (the devices evolve gradually); dynamic, to the execution context (data and resource usage). Target applications: interactive and distributed simulations (virtual reality, compute-intensive applications: process engineering, optimization, bio-computing), coupling computation, acquisition and visualization, e.g. the Grimage platform.

89 MOAIS approach: algorithm + scheduling + programming. A large degree of parallelism w.r.t. the architecture; both local preemptive (system) and non-preemptive (application) scheduling, to obtain globally provable performance; dataflow, distributed, recursive/multithreaded. Research themes: scheduling (off-line, on-line, multi-criteria); adaptive (poly-)algorithms with various levels of malleability (parallelism, precision, refresh rate, …); coupling: efficient description of synchronizations (KAAPI software: fine-grain dynamic scheduling); interactivity (FlowVR software: coarse-grain scheduling).

90 Scientific challenges. To schedule with provable performance for a multi-criteria objective on complex architectures, e.g. time/memory, makespan/min-sum, time/work: approximation, game theory, … To design distributed adaptive algorithms: efficiency and scalability via local decisions with global performance; dynamic local choices yielding good performance with good probability. To implement on emerging platforms, with challenging applications (partners): Grimage (MOVI), H.264/MPEG-4 encoding on mpScNet [ST], QAP/Nugent on Grid5000 [PRISM, GILCO, LIFL].

91 Objective. Programming of applications where performance is a matter of resources: take advantage of more, and adapt to fewer (e.g. a global computing platform, P2P). The application code is independent from the resources and adaptive. Target applications: interactive simulation (MOAIS: interaction, simulation, rendering), virtual observatory. Performance is related to #resources: simulation (precision = size/order) to #procs and memory space; rendering (image walls, sounds, …) to #video-projectors, #speakers, …; interaction (acquisition peripherals) to #cameras, #haptic sensors, …

92 GRIMAGE platform. 2003: 11 PCs, 8 projectors and 4 cameras; first demo: 12/2003. 2005: 30 PCs, 16 projectors and 20 cameras. A display wall: surface 2 x 2.70 m, resolution 4106 x 3172 pixels, very bright (usable in daylight); commodity components. [B. Raffin]

93 Video [J Allard, C Menier]

