
Slide 1: Algorithms for Independent Task Placement and Their Tuning in the Context of Demand-Driven Parallel Ray Tracing
Tomas Plachetka, University of Bratislava / Paderborn / Bristol
Bratislava, January 2005

Slide 2: Overview
- Demand-driven process farm
- Abstraction from the assignment mechanism, examples of trade-offs
- Formal problem definition
- Analysis of chunking and factoring strategies
- Experiments with chunking and factoring, with manual tuning of parameters
- Tool for the prediction of efficiency for process farms (more precisely, "post-prediction")
- If the parameters are tuned, then the choice of assignment strategy does not significantly influence the efficiency of the parallel computation, in the context of parallel ray tracing on contemporary machines with an "everyday" input, "everyday" quality settings, etc.

Slide 3: Demand-Driven Process Farm
[Figure: eye (camera) and output image; the MASTER collects results, the LOADBALANCER receives job requests and assigns jobs to WORKERS 1-4.]
How many tasks should the LOADBALANCER process assign in one job? (Badouel et al. [1994] suggest 9 pixels in one job, Freisleben et al. [1998] suggest 4096 pixels in one job... Where does this difference come from?)

Slide 4: Abstraction from the assignment mechanism
[Figure: two equivalent mechanisms. Left: a message passing network (LB and WORKERS 1-4, send/receive). Right: shared memory (WORKERS 1-4, central queue, locking).]
The concrete assignment mechanism is not important. Assigning one job costs constant time L, regardless of how many tasks are in the job assigned to a worker.
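A minimal sketch of the shared-memory variant, assuming a central counter of unassigned tasks protected by a pthread mutex (the names TaskQueue and take_job are illustrative, not from the presentation). The point it illustrates: handing out one job costs a constant amount of locking and bookkeeping, independent of how many tasks the job contains, which is exactly the constant per-job latency L of the model.

    #include <pthread.h>
    #include <stdio.h>

    /* Central pool of W independent tasks, shared by all workers. */
    typedef struct {
        long next_task;              /* index of the first unassigned task */
        long n_tasks;                /* W                                   */
        pthread_mutex_t lock;
    } TaskQueue;

    /* Hands out the next job of up to chunk_size tasks.  Returns the number
     * of tasks in the job (0 when no work is left) and stores the index of
     * the job's first task in *first.  The cost of one call (locking and
     * bookkeeping) does not depend on chunk_size. */
    long take_job(TaskQueue *q, long chunk_size, long *first)
    {
        long n;
        pthread_mutex_lock(&q->lock);
        *first = q->next_task;
        n = q->n_tasks - q->next_task;     /* tasks still unassigned */
        if (n > chunk_size)
            n = chunk_size;
        q->next_task += n;
        pthread_mutex_unlock(&q->lock);
        return n;
    }

    int main(void)
    {
        TaskQueue q = { 0, 720L * 576, PTHREAD_MUTEX_INITIALIZER };
        long first, n = take_job(&q, 360, &first);
        printf("job: %ld tasks starting at task %ld\n", n, first);
        return 0;
    }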

Slide 5: Trade-offs for chunking (which assigns fixed-size jobs)
[Figure: the two extremes of job size, shown for 2 workers.]
- Largest jobs: problem is imbalance.
- Smallest jobs (1 job = 1 task): problem is many messages.
How large should the jobs be (so that the parallel time is minimal)? The job "shape" is irrelevant; only the size is important.

Slide 6: Problem definition
Given:
- N: nr. of worker processes, all equally fast
- W: nr. of tasks, independent of each other (not even spatially coherent)
- L: latency, i.e. penalty for assigning 1 job to a worker (a constant time which does not depend on the number of tasks in the job or on anything else)
Unknown: task time complexities.
Goal: minimise the makespan (the parallel time required for assigning and processing all tasks). The LOADBALANCER must decide how many tasks to pack into a job immediately after receiving a job request. (This is not quite online; note that W is constant!)
Probabilistic model: µ is the average task time complexity, σ is the standard deviation of the task complexities. Goal: minimise the expected makespan.
Deterministic model: T_max is the maximal task time complexity, T_min is the minimal task time complexity. Goal: minimise the maximal makespan (for the worst possible task arrangement).
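A compact way to write the goal in the deterministic model, as a sketch using only the quantities defined above (J_n denotes the set of jobs assigned to worker n, |j| the number of tasks in job j, and t_k the a priori unknown time of task k; the notation is mine, not the slide's):

    \text{makespan} \;=\; \max_{n=1,\dots,N} \sum_{j \in J_n} \Big( L + \sum_{k \in j} t_k \Big),
    \qquad t_k \in [T_{\min}, T_{\max}],
    \qquad \sum_{n=1}^{N} \sum_{j \in J_n} |j| = W .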

Slide 7: Chunking strategy (fixed-size chunks)

    LB_CHUNKING(float T_max, int W, int N, float L)
    {
        int work = W;
        int K = ???;                  /* the chunk size; how should it be chosen? */
        while (work > 0) {
            wait for a job request;
            if (K > work)
                K = work;             /* the last job may be smaller */
            assign job of size K to the idle WORKER;
            work = work - K;
        }
    }

Slide 8: Chunking, analysis
- N: nr. of workers
- W: nr. of tasks
- L: latency
- T_max: max. task complexity
Unknown: K_opt (chunk size)
The time diagram below depicts the structure of the worst case (maximal makespan): one of the workers always gets the tasks of time complexity T_max. (The last extra round is the result of integer arithmetic.)
[Time diagram: chunks of size K assigned to N workers in rounds; each chunk on the critical worker costs L + K*T_max.]
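A minimal simulation of one adversarial arrangement consistent with the slide's description: worker 0 always draws tasks of complexity T_max and, as an additional assumption of mine, all other workers draw tasks of complexity T_min; each job of k tasks costs L + k times the per-task time. This is only an illustration of the model, not code from the presentation.

    #include <stdio.h>

    double chunking_worst_case(long W, int N, double L,
                               double Tmin, double Tmax, long K)
    {
        double finish[N];                       /* finish time of each worker so far */
        long remaining = W;
        for (int n = 0; n < N; n++)
            finish[n] = 0.0;

        while (remaining > 0) {
            int idle = 0;                       /* worker that becomes idle first */
            for (int n = 1; n < N; n++)
                if (finish[n] < finish[idle])
                    idle = n;
            long k = (remaining < K) ? remaining : K;
            double per_task = (idle == 0) ? Tmax : Tmin;  /* worker 0 is the unlucky one */
            finish[idle] += L + k * per_task;
            remaining -= k;
        }

        double makespan = 0.0;
        for (int n = 0; n < N; n++)
            if (finish[n] > makespan)
                makespan = finish[n];
        return makespan;
    }

    int main(void)
    {
        /* The experiment's parameters (slides 16-18) and chunk size K = 360. */
        printf("worst-case makespan: %f\n",
               chunking_worst_case(720L * 576, 90, 0.007, 0.00075, 0.00226, 360));
        return 0;
    }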

Slide 9: Chunking, probabilistic model
Chunking (Kruskal and Weiss):
- [formula] for large W and K, and K >> log N.
- [formula] for K << W/N and small sqrt(K)/N.
- [formula] for K << W/N and large sqrt(K)/N.
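The formulas themselves were figures and did not survive this transcript. For orientation only, the commonly cited Kruskal-Weiss estimate of the expected makespan with fixed chunks of size K and i.i.d. task times with mean µ and standard deviation σ has the form below; it is quoted from the loop-scheduling literature, so treat it as an assumption rather than the slide's exact content:

    E[T(K)] \;\approx\; \frac{W}{N}\,\mu \;+\; \frac{W}{N K}\,L \;+\; \sigma \sqrt{2 K \ln N}
    \qquad \text{for large } W \text{ and } K \gg \log N .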

Slide 10: Factoring strategy, example
Parameterisation:
- N: nr. of workers
- W: nr. of tasks
- T: max. ratio of tasks' complexities (T = T_max/T_min)
Unknown: the job size K used for the next round.
[Worked example (figure): N = 2, T = 3; the job times shown (t s, 1 s, 2 s) are related by the bound <= T*t.]

Slide 11: Factoring, analysis
w_i denotes the number of yet unassigned tasks after round i. Obviously, the larger the assigned job sizes K_i are, the smaller the assignment overhead is. Hence, we want:
Left-hand side: [formula]
Right-hand side, simplified (the assignment latency is only counted once): [formula]
Note that this simplification ignores the assignment latency. Solving the simplified equation above yields (we denote T = T_max/T_min): [formula]
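One plausible reconstruction of the missing balance condition, stated explicitly as an assumption of mine: the job of size K_i assigned in round i should, even if every one of its tasks costs T_max, take no longer than an equal 1/N share of the remaining w_i - N*K_i tasks takes in the best case (all tasks cost T_min). Ignoring the latency, this reads

    K_i \, T_{\max} \;\le\; \frac{w_i - N K_i}{N}\, T_{\min}
    \quad\Longrightarrow\quad
    K_i \;\le\; \frac{w_i}{N\,(1 + T)}, \qquad T = \frac{T_{\max}}{T_{\min}} .

Note that for T = 1 (all tasks equally expensive) this reduces to K_i = w_i/(2N), i.e. the constant division factor 1/(2N) mentioned on slide 14.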

Slide 12: Factoring (simplified), analysis
The work remaining after round i yields: [formula]
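Continuing the reconstruction sketched under slide 11 (still an assumption about the missing formula): if each round assigns N jobs of size K_i = w_i/(N(1+T)), then

    w_{i+1} \;=\; w_i - N K_i \;=\; w_i\,\frac{T}{1+T}
    \qquad\Longrightarrow\qquad
    w_i \;=\; W \left( \frac{T}{1+T} \right)^{i} ,

so the remaining work, and with it the job size, decreases geometrically from round to round.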

Slide 13: Factoring (simplified), analysis
Makespan, upper bound: [formula]
Makespan, lower bound: [formula]
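The bounds themselves are missing from the transcript; a rough sketch of their shape under the schedule reconstructed above (R denotes the number of rounds before the jobs shrink to single tasks):

    \frac{W}{N}\,T_{\min} \;\le\; \text{makespan} \;\le\; R\,L + \frac{W}{N}\,T_{\max},
    \qquad R \approx \frac{\ln(W/N)}{\ln\!\big((1+T)/T\big)} ,

i.e. the assignment overhead enters only through a logarithmic number of rounds, in contrast to the W/(NK) jobs per worker of chunking.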

Slide 14: Factoring, probabilistic model
Factoring (Flynn-Hummel):
- w_i is the work remaining at the beginning of round i
- 1/(N*x_i) is the division factor
- K_i = w_i/(N*x_i) is the chunk size for round i
In their experiments [1991, 1995], the authors did not attempt to estimate the coefficient of variation σ/µ. They used the constant division factor 1/(2N) (this means x_i = 2).
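A minimal sketch of the chunk-size schedule this constant factor produces (my illustration, not the authors' code): each round assigns N chunks of size ceil(w_i/(2N)) until the work is exhausted.

    #include <stdio.h>

    /* Chunk sizes K_i of factoring with constant x_i = 2:
     * each round assigns N chunks of size ceil(w / (2N)). */
    void factoring_schedule(long W, int N)
    {
        long w = W;
        for (int round = 0; w > 0; round++) {
            long K = (w + 2L * N - 1) / (2L * N);   /* ceil(w / (2N)) */
            long assigned = K * N;
            if (assigned > w)
                assigned = w;                        /* last, partial round */
            printf("round %d: K = %ld (remaining work w = %ld)\n", round, K, w);
            w -= assigned;
        }
    }

    int main(void)
    {
        factoring_schedule(720L * 576, 90);          /* W and N = 90 from the experiments */
        return 0;
    }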

Slide 15: Experiments: data

Slide 16: Experiments: setting
(Machine: hpcLine in PC2. Application: parallel ray tracing with an "everyday" setting.)
Given:
- N = 1…128: nr. of worker processes, all equally fast
- W = 720*576: nr. of tasks (pixels), independent of each other
- L = 0.007: latency, i.e. penalty for assigning 1 job to a worker (a constant time which does not depend on the number of tasks in the job or on anything else)
Estimated from measured data (on atomic jobs of 360 pixels):
- T_max = 0.00226: maximal average time on one pixel
- T_min = 0.00075: minimal average time on one pixel
- Ratio between the times of the atomic jobs: T = ca. 3
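As a quick consistency check (not stated explicitly on the slide), the task count and the complexity ratio follow directly from the listed values:

    W = 720 \cdot 576 = 414\,720 \ \text{tasks (pixels)},
    \qquad
    T = \frac{T_{\max}}{T_{\min}} = \frac{0.00226}{0.00075} \approx 3 .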

Slide 17: Experiments: empirical optimal chunk size (90 workers)

Slide 18: Experiments: chunking efficiency (K = 360)

Slide 19: Experiments: factoring efficiency (atomic_job_size = 360)

Slide 20: Tuning of assignment strategies: estimation of the future
The optimal chunk size for the chunking and factoring strategies, K = f(W, N, L, T_min, T_max), depends on parameters which are unknown in advance. (However, all these parameters are known when the computation finishes; this is what we call "post-prediction"!)
Suggestion: the unknown parameters T_min and T_max can be initially estimated, and this estimate is continually adjusted according to measured run-time statistics. (The estimation of the remaining time needed for copying files in Windows uses a similar approach.)
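A minimal sketch of such run-time adjustment, under my assumption (not spelled out on the slide) that T_min and T_max are tracked as the running minimum and maximum of the measured per-task times of completed jobs:

    /* Running estimates of T_min and T_max, refined as jobs complete. */
    typedef struct {
        double t_min;                 /* current estimate of the cheapest task       */
        double t_max;                 /* current estimate of the most expensive task */
    } Estimate;

    void estimate_init(Estimate *e, double initial_guess)
    {
        e->t_min = initial_guess;     /* initial guess, e.g. from a previous frame */
        e->t_max = initial_guess;
    }

    /* Called by the LOADBALANCER when a worker reports that a job of
     * `tasks` tasks took `seconds` seconds of pure processing time. */
    void estimate_update(Estimate *e, double seconds, long tasks)
    {
        double per_task = seconds / (double)tasks;   /* average over this job */
        if (per_task < e->t_min) e->t_min = per_task;
        if (per_task > e->t_max) e->t_max = per_task;
    }

    /* The next chunk size is then recomputed from the current estimates,
     * K = f(W_remaining, N, L, e->t_min, e->t_max), before every assignment. */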

Slide 21: Conclusions
- Farming yields an almost linear speedup (efficiency 95% with 128 workers) for parallel ray tracing (POV||Ray) on a fairly complex "everyday" scene.
- The trivial chunking algorithm with an optimal chunk size does not perform worse than the theoretically better factoring algorithm with an optimal chunk size; for the particular machine, particular nr. of processors, particular input, particular quality settings, particular room temperature etc. used during the experiments.
- The efficiency of chunking/factoring can be predicted (or at least "post-predicted") for a particular machine, particular nr. of processors, particular input, particular room temperature etc.
- In experiments with process farming, the parameters W (nr. of tasks), N (nr. of workers), L (latency), T_min and T_max (min/max tasks' or jobs' time complexities) must be reported. Reporting only some of these parameters is insufficient for drawing conclusions from experiments with process farming (e.g. chunking).
- The parameters specific to chunking/factoring can (and must) be tuned automatically at run-time.

