Mini Symposium: Adaptive Algorithms for Scientific Computing. 9h45 Adaptive algorithms - Theory and applications, Jean-Louis Roch et al., AHA Team, INRIA-CNRS.


Mini Symposium: Adaptive Algorithms for Scientific Computing
9h45 Adaptive algorithms - Theory and applications, Jean-Louis Roch et al., AHA Team, INRIA-CNRS, Grenoble, France
10h15 Hybrids in exact linear algebra, Dave Saunders, U. Delaware, USA
10h45 Adaptive programming with hierarchical multiprocessor tasks, Thomas Rauber, Gudula Rünger, U. Bayreuth, Germany
11h15 Cache-oblivious algorithms, Michael Bender, Stony Brook U., USA
Adaptive, hybrid, oblivious: what do these terms mean?
Taxonomy of autonomic computing [Ganek & Corbi 2003]: self-configuring / self-healing / self-optimising / self-protecting
Objective: towards an analysis based on the algorithm performance

Adaptive algorithms - Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
IMAG-INRIA Workgroup on "Adaptive and Hybrid Algorithms", Grenoble, France
Contents:
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

Why adaptive algorithms, and how? Input data vary and resource availability is versatile, so the algorithm adapts to improve performance. Measures on resources and on data drive: scheduling (partitioning, load-balancing, work-stealing); calibration (tuning parameters such as block size for the cache, choice of instructions, priority management); and choices in the algorithm (sequential vs. parallel, approximate vs. exact, in-memory vs. out-of-core). An algorithm is « hybrid » iff there is a high-level choice between at least two algorithms, each of which can solve the same problem.

Modeling a hybrid algorithm. Several algorithms solve the same problem f, e.g. algo_f1, algo_f2 (block size), …, algo_fk, each of them possibly recursive:
algo_fi ( n, … ) { … f ( n - 1, … ) ; … f ( n / 2, … ) ; … }
Adaptation chooses algo_fj for each call to f. E.g. "practical" hybrids: Atlas, Goto, FFPACK, FFTW, cache-oblivious B-trees, and any parallel program with scheduling support: Cilk, Athapascan/Kaapi, NESL, TLib…
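A minimal sketch of this model, with illustrative names and a toy problem (none of this is the AHA code): two variants compute the same result, every recursive call re-enters the generic f, and the variant is re-chosen at each level.

    // Two variants of the same problem (here, trivially, computing n):
    // the hybrid f re-chooses a variant at every recursive call.
    #include <cstddef>

    static const std::size_t THRESHOLD = 1024;   // a tuning parameter

    double f(std::size_t n);                     // all recursion goes through f

    // Variant 1: peels one element per call, like f(n - 1, ...).
    double algo_f1(std::size_t n) { return 1.0 + f(n - 1); }

    // Variant 2: divide and conquer, like f(n / 2, ...).
    double algo_f2(std::size_t n) { return f(n / 2) + f(n - n / 2); }

    double f(std::size_t n) {
        if (n == 0) return 0.0;
        return (n < THRESHOLD) ? algo_f1(n)      // O(1) work per choice
                               : algo_f2(n);
    }

With O(1) such choices overall, a hybrid is "simple"; with an unbounded number of choices (one per recursive call, as here), it is "baroque" in the classification of the next slide.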

How to manage the overhead due to choices? Classification 1/2:
– Simple hybrid iff O(1) choices [e.g. block size in Atlas, …]
– Baroque hybrid iff an unbounded number of choices [e.g. recursive splitting factors in FFTW]
Choices are either dynamic or pre-computed based on input properties.

Choices may or may not be based on architecture parameters. Classification 2/2: a hybrid is
– Oblivious: the control flow depends neither on static properties of the resources nor on the input [e.g. cache-oblivious algorithms [Bender]]
– Tuned: strategic choices are based on static parameters [e.g. block size w.r.t. cache, granularity]; tuning is either engineered or self-tuned [e.g. ATLAS and GOTO libraries, FFTW, LinBox/FFLAS [Saunders&al]]
– Adaptive: self-configuration of the algorithm, dynamic, based on input properties or resource circumstances discovered at run-time [e.g. idle processors, data properties, …] [e.g. TLib [Rauber&Rünger]]

Examples. BLAS libraries:
– Atlas: simple, self-tuned
– Goto: simple, engineered-tuned
– LinBox / FFLAS: simple, self-tuned, adaptive [Saunders&al]
FFTW:
– Halving factor: baroque, tuned
– Stopping criterion: simple, tuned
Parallel algorithms and scheduling:
– Choice of parallel degree: e.g. TLib [Rauber&Rünger]
– Work-stealing schedule: baroque hybrid

Adaptive algorithms - Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS Project on "Adaptive and Hybrid Algorithms", Grenoble, France
Contents:
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

Work-stealing (1/2)
« Work » W1 = total number of operations performed
« Depth » W∞ = number of operations on a critical path (parallel time on an unbounded number of resources)
Work-stealing = "greedy" schedule, but distributed and randomized: each processor manages locally the tasks it creates; when idle, a processor steals the oldest ready task from a remote, non-idle victim processor chosen at random.

Work-stealing (2/2)
« Work » W1 = total number of operations performed; « Depth » W∞ = number of operations on a critical path (parallel time on an unbounded number of resources).
Interests:
-> suited to heterogeneous architectures, with a slight modification [Bender-Rabin02]
-> with good probability, near-optimal schedule on p processors with average speed Πave:
Tp < W1 / (p·Πave) + O( W∞ / Πave )
NB: #successful steals = #task migrations < p·W∞ [Blumofe 98, Narlikar 01, Bender 02]
Implementation: the work-first principle [Cilk, Kaapi]. Local parallelism is implemented by sequential function calls, with restrictions to ensure validity of the default sequential schedule: series-parallel in Cilk, reference order in Kaapi.
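A toy, single-threaded simulation of this stealing discipline (illustrative only, not a real runtime such as Cilk or Kaapi): each processor owns a deque, the owner works LIFO on its youngest task, and an idle processor steals FIFO the oldest task of a random victim.

    // Toy round-based simulation of randomized work-stealing.
    // Tasks are opaque work units; no real threads or synchronization.
    #include <deque>
    #include <vector>
    #include <random>
    #include <cstdio>

    int main() {
        const int P = 4, N = 20;
        std::vector<std::deque<int>> dq(P);
        for (int t = 0; t < N; ++t) dq[0].push_back(t);  // all work starts on proc 0
        std::mt19937 rng(42);
        std::uniform_int_distribution<int> pick(0, P - 1);
        int done = 0, steals = 0;
        while (done < N) {
            for (int p = 0; p < P; ++p) {
                if (!dq[p].empty()) {            // busy: execute youngest local task
                    dq[p].pop_back();
                    ++done;
                } else {                         // idle: steal oldest task of a victim
                    int v = pick(rng);
                    if (v != p && !dq[v].empty()) {
                        dq[p].push_back(dq[v].front());
                        dq[v].pop_front();       // one task migration
                        ++steals;
                    }
                }
            }
        }
        std::printf("%d tasks done, %d steals\n", done, steals);
        return 0;
    }

Stealing the oldest ready task matters: in a recursive computation the oldest task carries the largest remaining subtree, so few steals move much work, which is what keeps the number of migrations below p·W∞.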

Work-stealing and adaptability. Work-stealing allocates processors to tasks transparently to the application, with provable performance; it supports the addition of new resources, as well as resilience of resources and fault-tolerance (crash faults, network, …) via checkpoint/restart mechanisms with provable performance [Porch, Kaapi, …]. It is a "baroque hybrid" adaptation: there is an implicit dynamic choice between two algorithms, a sequential (local) depth-first algorithm (the default choice) and a parallel breadth-first one; the choice is performed at runtime, depending on resource idleness. This is well suited to applications where a fine-grain parallel algorithm is also a good sequential algorithm [Cilk]: parallel divide & conquer computations, tree searching, Branch&X, … -> i.e. when both the sequential and the parallel algorithm perform (almost) the same number of operations.

Solution: mix a sequential and a parallel algorithm.
Basic technique: run the parallel algorithm down to a certain « grain », then use the sequential one. Problem: W∞ increases, and so do the number of migrations… and the inefficiency ;o(
Work-preserving speed-up [Bini-Pan 94] = cascading [Jaja92]: a careful interplay of both algorithms builds one with both W∞ small and W1 = O(Wseq). Divide the sequential algorithm into blocks; each block is computed with the (non-work-optimal) parallel algorithm (see the sketch below). Drawback: sequential at coarse grain and parallel at fine grain ;o(
Adaptive granularity is the dual approach: parallelism is extracted at run-time from any sequential task. But parallelism often has a cost!
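A sketch of the cascade on the prefix problem (shapes assumed for illustration; the inner call stands in for the parallel algorithm): the outer loop chains blocks sequentially through a carry, while each block is the fine-grain, parallelizable part.

    // Cascading for prefix: a sequential chain of blocks, each block
    // computed by the (here stubbed, would-be parallel) inner algorithm,
    // so total work stays O(n) while each block exposes parallelism.
    #include <algorithm>
    #include <cstddef>

    static void block_prefix(double* x, std::size_t len, double carry) {
        // stand-in for the parallel prefix algorithm on one block
        x[0] *= carry;
        for (std::size_t i = 1; i < len; ++i) x[i] *= x[i - 1];
    }

    void cascade_prefix(double* x, std::size_t n, std::size_t block) {
        double carry = 1.0;                              // neutral element of *
        for (std::size_t i = 0; i < n; i += block) {     // coarse grain: sequential
            std::size_t len = std::min(block, n - i);
            block_prefix(x + i, len, carry);             // fine grain: parallelizable
            carry = x[i + len - 1];                      // carry into the next block
        }
    }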

Self-adaptive grain algorithm, based on the work-first principle: always execute a sequential algorithm, to reduce parallelism overhead; use the parallel algorithm only if a processor becomes idle, by extracting parallelism from the sequential computation. Hypothesis, two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation: at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
Examples: iterated product [Vernizzi 05], gzip / compression [Kerfali 04], MPEG-4 / H264 [Bernard 06], prefix computation [Traore 06].
[Figure: SeqCompute runs; Extract_par hands the LastPartComputation to a thief, which itself runs SeqCompute on it]
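A minimal sketch of this coupling (illustrative names; owner/thief synchronization is deliberately omitted here; a real implementation guards the extraction with a lock-free protocol, cf. the compare&swap remark on the Kaapi slide below):

    // Owner runs SeqCompute on [first, last); a thief extracts the
    // last part of the remaining interval. Single-threaded sketch.
    #include <cstddef>

    struct Range { std::size_t first, last; };   // remaining work [first, last)

    // SeqCompute: strictly sequential, one element at a time.
    void seq_compute(Range& r, double* x) {
        while (r.first < r.last) {
            x[r.first] += 1.0;                   // stand-in for the real work
            ++r.first;
        }
    }

    // LastPartComputation: the thief takes the second half of what remains.
    bool extract_last_part(Range& victim, Range& stolen) {
        if (victim.last - victim.first < 2) return false;  // finest grain reached
        std::size_t mid = victim.first + (victim.last - victim.first) / 2;
        stolen = { mid, victim.last };           // thief gets [mid, last)
        victim.last = mid;                       // victim keeps [first, mid)
        return true;
    }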

Adaptive algorithms - Theory and applications
Van Dat Cung, Jean-Guillaume Dumas, Thierry Gautier, Guillaume Huard, Bruno Raffin, Jean-Louis Roch, Denis Trystram
INRIA-CNRS Project on "Adaptive and Hybrid Algorithms", Grenoble, France
Contents:
I. Some criteria to analyze adaptive algorithms
II. Work-stealing and adaptive parallel algorithms
III. Adaptive parallel prefix computation

Prefix computation: an example where parallelism always costs. Compute π1 = a0*a1, π2 = a0*a1*a2, …, πn = a0*a1*…*an.
Sequential algorithm: pi[0] = a[0]; for (i = 1; i <= n; i++) pi[i] = pi[i-1] * a[i]; it performs W1 = W∞ = n operations.
Parallel algorithm [Ladner-Fischer]: compute the pairwise products a0*a1, a2*a3, …, an-1*an; a recursive prefix of size n/2 then yields π1, π3, …, πn; one more product per even index yields π2, π4, …, πn-1.
W∞ = 2·log n, but W1 = 2·n: twice as expensive as the sequential algorithm.
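The same recursion as code, a sketch executed sequentially here (the two loops and the recursive call are the parts that would run in parallel; indices are 0-based C++ arrays rather than the slide's a0…an):

    // Ladner-Fischer-style prefix, in place: a[i] becomes a[0]*...*a[i].
    // Work ~ 2n multiplications, depth ~ 2 log n if the loops run in parallel.
    #include <vector>
    #include <cstddef>

    void prefix_lf(std::vector<double>& a) {
        std::size_t n = a.size();
        if (n < 2) return;
        std::vector<double> b(n / 2);
        for (std::size_t i = 0; i < n / 2; ++i)        // n/2 pairwise products
            b[i] = a[2 * i] * a[2 * i + 1];
        prefix_lf(b);                                  // prefix of half size:
                                                       // b[i] = prefix at odd index 2i+1
        for (std::size_t i = 1; i < n; ++i)            // completion step
            a[i] = (i % 2 == 1) ? b[i / 2]             // odd index: ready in b
                                : b[i / 2 - 1] * a[i]; // even index: one more product
    }

Counting multiplications gives W1(n) = n + W1(n/2) ≈ 2n, versus n for the sequential loop: exactly the factor-2 overhead that the adaptive algorithm below works around.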

Adaptive prefix computation.
– Any (parallel) prefix performs at least W1 ≥ 2·n − W∞ operations.
– Strict lower bound on p identical processors: Tp ≥ 2n/(p+1), reached by a block algorithm + pipeline [Nicolau&al. 2000].
Application of the adaptive scheme:
– one process performs the main "sequential" computation;
– the other, work-stealer processes compute parallel « segmented » prefixes.
Near-optimal performance on processors with changing speeds:
Tp < 2n / ((p+1)·Πave) + O( log n / Πave ), to be compared with the lower bound.
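The lower bound follows in one line from the work bound (a sketch: p·Tp bounds the total work, and the critical path W∞ is at most Tp):

\[
p\,T_p \;\ge\; W_1 \;\ge\; 2n - W_\infty \;\ge\; 2n - T_p
\quad\Longrightarrow\quad
T_p \;\ge\; \frac{2n}{p+1}
\]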

Scheme of the proof: dynamic coupling of two algorithms that complete simultaneously.
– Sequential: performs the (optimal) number of operations S.
– Parallel: performs X operations; dynamic splitting is always possible down to the finest grain, but execution is locally sequential. It is scheduled by work-stealing on p−1 processors; the critical path is small (O(log X)), and each non-constant-time task can be split (to cope with variable speeds).
Analysis: the algorithmic scheme ensures Ts = Tp + O(log X); this bounds the total number X of operations performed, hence the overhead of parallelism = (S + X) − #ops_optimal, which is then compared to the lower bound on the number of operations.

Adaptive Prefix on 3 processors [animation over inputs a1…a12, 6 frames]: the main sequential process computes π1, π2, … in order. On a steal request, work-stealer 1 takes the last part of the input and computes the partial products πi' = a5*…*ai; a second steal request gives work-stealer 2 the segment a9…a12, computing πi'' = a9*…*ai. When the main process reaches a5 it preempts work-stealer 1 and combines π4 with the stolen partial products (e.g. π8 = π4 * π8'); it then preempts work-stealer 2 likewise (π11 = π8 * π11'') and finishes with π12. The implicit critical path stays on the sequential process.

Adaptive prefix: some experiments (joint work with Daouda Traore). Prefix over elements on an 8-processor SMP (IA64 / Linux):
– Single-user context: adaptive is equivalent to the sequential algorithm on 1 proc, to the optimal 2-proc parallel algorithm on 2 processors, …, to the optimal 8-proc parallel algorithm on 8 processors.
– Multi-user context (external charge): adaptive is the fastest, with a 15% benefit over a static-grain algorithm.
[Two plots: time (s) vs #processors, comparing the parallel and adaptive versions with and without external charge]

The prefix race: sequential / parallel fixed / adaptive. Competitors: adaptive on 8 procs; parallel on 8, 7, 6, 5, 4, 3 and 2 procs; sequential. On each of the 10 executions, adaptive completes first.

With * = double sum ( r[i] = r[i-1] + x[i] ), single user, processors with variable speeds. Remark, for n = … doubles:
– "pure" sequential: 0.20 s
– minimal "grain" = 100 doubles: 0.26 s on 1 proc and on 2 procs (close to the lower bound)
Finest "grain" limited to 1 page = 16384 bytes = 2048 doubles.

E.g. triangular system solving A.x = b, with A lower triangular (zero upper part).
Sequential algorithm: T1 = n²/2; T∞ = n (fine grain).
1/ x1 = b1 / a11
2/ for k = 2..n : bk = bk − ak1·x1
This reduces the system of dimension n to a system of dimension n−1.

E.g. triangular system solving A.x = b (continued).
Sequential algorithm: T1 = n²/2; T∞ = n (fine grain).
Using parallel matrix inversion, with A = [ A11 0 ; A21 A22 ] : A⁻¹ = [ A11⁻¹ 0 ; S A22⁻¹ ] where S = −A22⁻¹·A21·A11⁻¹, and x = A⁻¹·b : T1 = n³; T∞ = log² n (fine grain).
Self-adaptive granularity algorithm: T1 = n²; T∞ = √n·log n. [Figure: ExtractPar combines a self-adaptive sequential algorithm, self-adaptive scalar products and self-adaptive matrix inversion, with a choice of block height h = √m]
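For reference, a sketch of the sequential algorithm above as code (forward substitution; the slide's 1-based x1, bk, ak1 become 0-based indices):

    // Solve A x = b for lower-triangular A (a[i][j], j <= i), by repeating
    // the two steps of the slide: solve for x[k], then update b.
    // ~n^2/2 products in total.
    #include <vector>
    #include <cstddef>

    std::vector<double> solve_lower(const std::vector<std::vector<double>>& a,
                                    std::vector<double> b) {
        std::size_t n = b.size();
        std::vector<double> x(n);
        for (std::size_t k = 0; k < n; ++k) {
            x[k] = b[k] / a[k][k];               // step 1/: eliminate unknown k
            for (std::size_t i = k + 1; i < n; ++i)
                b[i] -= a[i][k] * x[k];          // step 2/: update the right-hand
        }                                        // side: a system of dimension n-1
        return x;
    }

The inner update is a scaled vector operation; presumably it is this loop that the self-adaptive version splits (the "self-adaptive scalar product") when a processor becomes idle.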

Conclusion. Adaptive: what choices, and how to choose? Illustration: adaptive parallel prefix based on work-stealing:
- a self-tuned, baroque hybrid: O(p·log n) choices
- achieves near-optimal performance, processor-oblivious
A generic adaptive scheme to implement parallel algorithms with provable performance.

Mini Symposium: Adaptive Algorithms for Scientific Computing
9h45 Adaptive algorithms - Theory and applications, Jean-Louis Roch et al., AHA Team, INRIA-CNRS, Grenoble, France
10h15 Hybrids in exact linear algebra, Dave Saunders, U. Delaware, USA
10h45 Adaptive programming with hierarchical multiprocessor tasks, Thomas Rauber, U. Bayreuth, Germany
11h15 Cache-oblivious algorithms, Michael Bender, Stony Brook U., USA
Adaptive, hybrid, oblivious: what do these terms mean?
Taxonomy of autonomic computing [Ganek & Corbi 2003]: self-configuring / self-healing / self-optimising / self-protecting
Objective: towards an analysis based on the algorithm performance

Questions ?

Some examples (1/2). Adaptive algorithms used empirically and theoretically:
– Atlas [2001], dense linear algebra library: instruction set and instruction schedule; self-calibration of the block size at installation on the machine.
– FFTW (1998, …): FFT(n) splits into p FFT(q) and q FFT(p); for any recursive call FFT(n), the best splitting for the vector size n is pre-computed on the machine.
– Cache-oblivious B-trees: recursive block splitting to minimize #page faults; self-adaptation to the memory hierarchy.
– Work-stealing (Cilk (1998, …), Athapascan/Kaapi (2000, …)): recursive parallelism; choice between a sequential depth-first schedule and a breadth-first schedule. « Work-first principle »: optimize the local sequential execution and put the overhead on the rare steals from idle processors. Implicitly adaptive.

Some examples (2/2)
– Moldable tasks: bi-criteria scheduling with guarantees [Trystram&al 2004]; alternating recursive combination of approximation algorithms for each criterion; self-adaptation with guaranteed performance for each criterion.
– « Cache-oblivious » algorithms [Bender&al 2004]: recursive block splitting that minimizes page faults; self-adaptation to the memory hierarchy (B-tree).
– « Processor-oblivious » algorithms [Roch&al 2005]: recursive combination of 2 algorithms, one sequential and one parallel; self-adaptation to resource idleness.

Best case: the parallel algorithm is efficient, i.e. W∞ is small and W1 = Wseq: the parallel algorithm is an optimal sequential one. Examples: parallel divide & conquer algorithms. Implementation: the work-first principle, with no overhead for the local execution of tasks. Examples: Cilk (the THE protocol), Kaapi (compare&swap only).

Experimentation: the knary benchmark.
SMP architecture: Origin 3800 (32 procs), Cilk / Athapascan.
Distributed architecture: iCluster, Athapascan.
#procs   Speed-up
8        7,…
16       …,6
32       30,9
64       59,…
128      …,1
Ts = 2397 s ; T1 = 2435 s

In « practice »: coarse granularity, splitting into p = #resources. Drawback: heterogeneous and dynamic architectures, where πi(t), the speed of processor i, varies with the time t.
In « theory »: fine granularity, maximal parallelism. Drawback: overhead of task management.
How to choose/adapt the granularity?
[Figure: task graph F(2,a), G(a,b), H(a), H(b), O(b,7) with a high potential degree of parallelism]

How to obtain an efficient fine-grain algorithm? Hypothesis for the efficiency of work-stealing: the parallel algorithm is « work-optimal » and T∞ is very small (recursive parallelism). Problem: fine-grain (T∞ small) parallel algorithms may involve a large overhead with respect to an efficient sequential algorithm: overhead due to parallelism creation and synchronization, but also arithmetic overhead.

Self-adaptive grain algorithms. Recursive computations with local sequential execution; special case: recursive extraction of parallelism when a resource becomes idle, but local execution of a sequential algorithm. Hypothesis, two algorithms:
- 1 sequential: SeqCompute
- 1 parallel: LastPartComputation => at any time, it is possible to extract parallelism from the remaining computations of the sequential algorithm.
Examples: iterated product [Vernizzi], gzip / compression [Kerfali], MPEG-4 / H264 [Bernard], prefix computation [Traore].

Adaptive Prefix versus optimal on identical processors

Illustration: adaptive parallel prefix. Adaptive parallel computing on non-uniform and shared resources; example of adaptive prefix computation.

Indeed parallelism often costs… e.g. prefix computation: P1 = a0*a1, P2 = a0*a1*a2, …, Pn = a0*a1*…*an.
Sequential algorithm: P[0] = a[0]; for (i = 1; i <= n; i++) P[i] = P[i-1] * a[i]; W1 = n.
Parallel algorithm [Ladner-Fischer]: pairwise products, a recursive Prefix(n/2) giving P1, P3, …, Pn, then one product per even index giving P2, P4, …, Pn-1.
W∞ = 2·log n, but W1 = 2·n: twice as expensive as the sequential algorithm.