A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture Yun Zhang, Michael J. Voss University of Toronto Guansong Zhang, Raul Silvera IBM Toronto.

1 A Multi-Level Adaptive Loop Scheduler for Power 5 Architecture. Yun Zhang, Michael J. Voss (University of Toronto); Guansong Zhang, Raul Silvera (IBM Toronto Lab)

2 Agenda
- Background
- Motivation
- Previous Work
- Adaptive Schedulers
- IBM Power 5 Architecture
- A Multi-Level Hierarchical Scheduler
- Evaluation
- Future Work

3 Simultaneous Multi-Threading Architecture
- Several threads per physical processor
- Threads share:
  - Caches
  - Registers
  - Functional units

4 Power 5 SMT

5 OpenMP
- A standard API for shared memory programming
- Adds directives for parallel regions
Standard loop schedulers:
- Static
- Dynamic
- Guided
- Runtime

6 OpenMP API
An example of a parallel loop in C (similar in Fortran):

    #pragma omp parallel for shared(a, b) private(i, j) schedule(runtime)
    for (i = 0; i < 100; i++) {
        for (j = 0; j < 100; j++) {
            a[i][j] = a[i][j] + b[i][j];
        }
    }

7 Motivation
OpenMP applications:
- Designed for SMP systems
- Not aware of SMT (Hyper-Threading) technology
- Understanding and controlling the performance of OpenMP applications on SMT processors is not trivial
Important performance issues on SMP systems with SMT nodes:
- Inter-thread data locality
- Instruction mix
- SMT-related load balance

8 Scaling (SPEC & NAS)
1 thread per processor vs. 1-2 threads per processor
4 Intel Xeon processors with Hyper-Threading

9 Why do they scale poorly?
- Inter-thread data locality: cache misses
- Instruction mix: sharing of functional units; the benefit gained this way may outweigh the cost of extra cache misses
- SMT-related load balance: we should balance workloads well among:
  - processors
  - threads running on the same physical processor

10 Previous Work: Runtime Adaptive Scheduler
Hierarchical scheduling:
- Upper-level scheduler
- Lower-level scheduler
Select the scheduler and the number of threads to run at runtime:
- One thread per physical processor
- Two threads per physical processor

11 Two-Level Hierarchical Scheduler

12 Traditional Scheduling
(Diagrams: static scheduling assigns contiguous blocks of the i-j iteration space to threads T0...Tn; dynamic scheduling hands out chunks to threads Ti, Tk, ... on demand)
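The two traditional policies can be sketched in C as follows. This is a minimal illustration with invented function names (`static_owner`, `dynamic_next_chunk`), not the actual runtime code: static scheduling preassigns contiguous blocks, while dynamic scheduling hands out chunks from a shared counter.

```c
#include <assert.h>

/* Static: iteration i is preassigned to a thread by contiguous blocks. */
int static_owner(int i, int n_iters, int n_threads) {
    int chunk = (n_iters + n_threads - 1) / n_threads;  /* ceiling division */
    return i / chunk;
}

/* Dynamic: each idle thread grabs the next chunk from a shared counter
   (which a real runtime would update atomically). Returns 0 when the
   loop is exhausted, otherwise fills [lo, hi). */
int dynamic_next_chunk(int *next, int n_iters, int chunk, int *lo, int *hi) {
    if (*next >= n_iters) return 0;  /* no work left */
    *lo = *next;
    *hi = (*next + chunk < n_iters) ? *next + chunk : n_iters;
    *next = *hi;
    return 1;
}
```

The trade-off the slides describe falls directly out of these shapes: static costs nothing at runtime but cannot react to imbalance, while dynamic balances load at the price of contention on the shared counter.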

13 Hierarchical Scheduling
(Diagram: the iteration space is first divided statically among processors P0...Pi; within each processor, dynamic scheduling distributes iterations between its two threads, e.g. T00/T01 on P0 and Ti0/Ti1 on Pi)

14 Why can we benefit from runtime scheduler selection?
Many parallel loops in OpenMP applications are executed again and again.

Example: number of calls vs. fraction of execution time

    Benchmark   < 10 calls   10-40 calls   > 40 calls
    ammp        0%                         84.20%
    apsi        0%                         82.55%
    art         100%         0%
    equake      0.05%        0%            98.23%
    mgrid       0%           0.11%         95.95%
    swim        0.09%        0%            99.25%
    wupwise     0.12%        0%            99.49%
    BT          0%                         100%
    CG          0.92%        3.5%          92.57%
    EP          100%         0%
    MG          12.73%       12.87%        71.91%
    SP          1.02%        0%            92.71%

    for (k = 1; k < 100; k++) {
        ....
        calculate();
        ....
    }

    void calculate() {
        #pragma omp parallel for schedule(runtime)
        for (i = 1; i < 100; i++) {
            ......;  // calculation
        }
    }

15 Adaptive Schedulers
Region-Based Scheduler (RBS):
- Selects loop schedulers at runtime
- All parallel loops in one parallel region must use the same scheduler, which may not be the best for each loop
Loop-Based Scheduler (LBS):
- Higher runtime overhead
- A more accurate loop scheduler for each parallel loop

16 Sample from NAS2004

    !$omp parallel default(shared) private(i,j,k)
    !$omp do schedule(runtime)
          do j = 1, lastrow-firstrow+1
             do k = rowstr(j), rowstr(j+1)-1
                colidx(k) = colidx(k) - firstcol + 1
             enddo
          enddo
    !$omp end do nowait
    !$omp do schedule(runtime)
          do i = 1, na+1
             x(i) = 1.0D0
          enddo
    !$omp end do nowait
    !$omp do schedule(runtime)
          do j = 1, lastcol-firstcol+1
             q(j) = 0.0d0
             z(j) = 0.0d0
             r(j) = 0.0d0
             p(j) = 0.0d0
          enddo
    !$omp end do nowait
    !$omp end parallel

The loop-based scheduler picks a scheduler for each of the three loops; the region-based scheduler picks one scheduler that applies to all three loops.

17 Runtime Loop Scheduler Selection
Phase 1: try the upper-level schedulers, running with 4 threads (T0-T3, one per processor P0-P3): Static scheduler

18 Runtime Loop Scheduler Selection
Phase 1 (continued): Dynamic scheduler

19 Runtime Loop Scheduler Selection
Phase 1 (continued): Affinity scheduler

20 Runtime Loop Scheduler Selection
Phase 2: a decision has been made on the upper-level scheduler (Affinity); now try the lower-level schedulers, running with 8 threads (T0-T7, two per processor): Affinity-Static

21 Sample from NAS2004
(The same three parallel loops as on slide 16, now annotated with the schedulers selected at runtime: Static-Static with 8 threads, and TSS with 4 threads)

22 Hardware Counter Scheduler
Motivation:
- The RBS and LBS have runtime overhead; they will work even better if we reduce that overhead as much as possible
Algorithm:
- Try different schedulers on the parallel loops of a subset of the benchmarks, using training data
- Use the loop characteristics (cache misses, number of floating point operations, number of micro-ops, load imbalance) together with the best scheduler for that loop as input
- Feed the above data to classification software (we use C4.5) to build a decision tree
- Apply this decision tree to each loop at runtime: feed the runtime-collected hardware counter data as input, and get the result, a scheduler, as output
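Applying the C4.5 tree at runtime amounts to a chain of threshold tests on the collected counters. The sketch below hand-codes the outermost branches of the tree fragment shown on the backup "Decision Tree" slide; the function name is invented, and the deeper subtrees are collapsed to a single default, so this is an illustration of the mechanism rather than the full trained tree.

```c
#include <string.h>

/* Choose a scheduler from hardware-counter data. Thresholds are taken
   from the C4.5 tree fragment on the backup slide; only the outermost
   branches are reproduced, the omitted subtrees default to static-static. */
const char *pick_scheduler(double uops, double cachemiss, double ls,
                           double imbalance, double floatpoint) {
    if (uops <= 3.62885e+08) {
        if (cachemiss <= 111979) {
            if (uops > 748339) return "static-4";
            return (ls <= 167693) ? "static-4" : "static-static";
        }
        return "static-static";  /* deeper subtree omitted in this sketch */
    }
    if (ls > 7.22489e+08) return "static-4";
    if (imbalance <= 32236) return "static-4";
    if (floatpoint <= 5.34465e+07) return "static-4";
    if (floatpoint <= 1.20539e+08) return "tss-4";
    return (floatpoint <= 1.45588e+08) ? "static-4" : "tss-4";
}
```

Because the tree is just nested comparisons, evaluating it costs a handful of branches per loop, which is why a single profiled run per loop suffices to pick a scheduler.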

23 4 Intel Xeon Processors with Hyperthreading

24 4 Intel Xeon Processors with Hyperthreading

25 IBM Power 5
- Technology: 130nm
- Dual processor core
- 8-way superscalar
- Simultaneous Multi-Threaded (SMT) core
  - Up to 2 virtual processors
  - 24% area growth per core for SMT
  - Natural extension to the Power 4 design

26 Single Thread
Single-thread mode has an advantage when executing unit-limited applications:
- Floating point or fixed point intensive workloads
The extra resources necessary for SMT provide a higher performance benefit when dedicated to a single thread
Data locality on one SMT core is better with a single thread for some applications

27 Power 5 Multi-Chip Module (MCM), or "Multi-Chipped Monster"
- 4 processor chips (2 processors per chip)
- 4 L3 cache chips

28 Power 5 64-way Plane Topology
- Each MCM has 4 inter-connected processor chips
- Each processor chip has two processors on chip
- Each processor has SMT technology, so two threads can execute on it simultaneously

29 Multi-Level Scheduler
- 1st-level scheduler: splits the loop iterations among modules 1...n
- 2nd-level scheduler: splits each module's iterations among processors 1...m
- 3rd-level scheduler: splits each processor's iterations among threads 1...k
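The three levels above can be sketched as follows, with a guided-style split at the module level and even (static) splits below. The `range_t` type and function names are invented for the example, and the real runtime's bookkeeping is more involved; this only shows how an iteration space flows down the hierarchy.

```c
typedef struct { int lo, hi; } range_t;  /* half-open range [lo, hi) */

/* 1st level (guided-style): a module grabs remaining/(2*n_modules)
   iterations, so chunks shrink as the loop drains. */
int module_grab(int *next, int n_iters, int n_modules, range_t *r) {
    int remaining = n_iters - *next;
    if (remaining <= 0) return 0;
    int chunk = remaining / (2 * n_modules);
    if (chunk < 1) chunk = 1;
    r->lo = *next;
    r->hi = *next + chunk;
    *next = r->hi;
    return 1;
}

/* 2nd/3rd level (static): split a parent range evenly among children,
   i.e. processors within a module, then threads within a processor. */
range_t static_split(range_t parent, int child, int n_children) {
    int len = parent.hi - parent.lo;
    int chunk = (len + n_children - 1) / n_children;
    range_t r;
    r.lo = parent.lo + child * chunk;
    if (r.lo > parent.hi) r.lo = parent.hi;
    r.hi = (r.lo + chunk < parent.hi) ? r.lo + chunk : parent.hi;
    return r;
}
```

For example, a module's grab can be split once among its processors and again among each processor's two SMT threads, giving every thread a contiguous slice of the module's chunk.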

30 OpenMP Implementation
Outlining technique:
- New subroutines are created from the body of each parallel construct
- Runtime routines receive the address of the outlined procedure as a parameter

31 Source code:

    #pragma omp parallel for shared(a,b) private(i)
    for (i = 0; i < 100; i++) {
        a = a + b;
    }

Outlined functions:

    long main {
        _xlsmpParallelDoSetup_TPO(...)
    }

    void main@OL@1 (...) {
        do {
            loop body;
        } while (loop end condition holds);
        return;
    }

Runtime library:

    1. Initialize work items and work shares
    2. Call _xlsmp_DynamicChunkCall(...)

    while (still iterations left, go get some iterations for this thread) {
        ...
        call main@OL@1(...);
        ...
    }

32 The same outline as on slide 31, except that the runtime library's dispatch loop is now driven by the hierarchical scheduler:

    while (hier_sched(...)) {
        ...
        call main@OL@1(...);
        ...
    }

33 How a thread gets iterations:
1. Look up its parent's iteration list to see if any iterations are available; if yes, get some iterations from the 2nd-level scheduler and return
2. Otherwise, look one level up, grab the lock for its group, and request more iterations from the upper level using the upper-level loop scheduler (a recursive function call), until it gets some iterations or the whole loop ends
(Diagram: a scheduler tree with Guided at the root, Static at the module level for M0 and M1, and Cyclic at the processor level for P0/P1, above threads T0-T7)
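The two steps above can be sketched as a recursive walk up the scheduler tree. The `node_t` structure and function name are invented for illustration; a real runtime would take the group lock before asking the parent, and each level would apply its own scheduling policy rather than the single fixed-chunk policy used here.

```c
#include <stddef.h>

/* One node in the scheduler tree (module, processor, or thread group).
   lo/hi is the half-open range of iterations this node currently holds;
   chunk is how much it hands down per request. */
typedef struct node {
    struct node *parent;
    int lo, hi;
    int chunk;
} node_t;

int get_iterations(node_t *n, int *lo, int *hi) {
    if (n->lo < n->hi) {                 /* step 1: local work available */
        *lo = n->lo;
        *hi = (n->lo + n->chunk < n->hi) ? n->lo + n->chunk : n->hi;
        n->lo = *hi;
        return 1;
    }
    if (n->parent == NULL) return 0;     /* whole loop has finished */
    /* step 2: refill from the level above (group lock omitted here),
       then retry locally with the refilled pool */
    if (!get_iterations(n->parent, &n->lo, &n->hi))
        return 0;
    return get_iterations(n, lo, hi);
}
```

The recursion bottoms out at the root: when the root has no iterations left, the request fails all the way down and the thread knows the loop is done.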

34 Hierarchical Scheduler
Guided as the 1st-level scheduler:
- Balances workloads among processors
- Reduces runtime overhead
Static cyclic as the 2nd-level scheduler:
- Improves cache locality
- Reduces runtime overhead
(Diagrams: standard static scheduling gives T0 and T1 each one contiguous half of the iteration space; static cyclic scheduling interleaves T0 and T1 across the iteration space)
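The difference between the two static policies reduces to how iteration i maps to a thread; a minimal sketch with invented function names:

```c
#include <assert.h>

/* Standard static: contiguous blocks, thread 0 gets the first block. */
int block_owner(int i, int n_iters, int n_threads) {
    int chunk = (n_iters + n_threads - 1) / n_threads;
    return i / chunk;
}

/* Static cyclic: iterations are interleaved, so the two SMT threads of
   one core keep working on neighbouring iterations, which tends to help
   locality in the caches they share. */
int cyclic_owner(int i, int n_threads) {
    return i % n_threads;
}
```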

35 Evaluation
IBM Power 5 system:
- 4 Power 5 1904 MHz SMT processors
- 31,872 MB of memory
Operating system: AIX 5.3
Compilers: IBM XL C/C++ and XL Fortran
Benchmark: SPEC OMP2001

36 Scalability of IBM Power 5 SMT Processors, 1 through 8 threads

37 Evaluation on Power 5: Execution Time Normalized to the Default (Static) Scheduler

38 Conclusion
- Standard schedulers are not aware of SMT technology
- OpenMP parallel applications running on the Power 5 architecture with SMT have the same problem
- Adaptive hierarchical schedulers take SMT-specific characteristics into account, which can make the OpenMP API (software) and SMT technology (hardware) work better together
- The multi-level hierarchical scheduler designed for the IBM Power 5 achieves an average improvement of 3% over the default loop scheduler on SPEC OMP2001
  - Large improvements of 7% and 11% on some benchmarks
  - Improves on average over all other standard OpenMP loop schedulers by at least 2%

39 Future Work
- Evaluate the multi-level hierarchical scheduler on a larger system with 32 SMT processors (with MCMs)
- Explore performance on auto-parallelized benchmarks (SPEC CPU FP)
- Examine mechanisms for determining the best scheduler configuration at compile time
- Explore the use of helper threads on Power 5 for cache prefetching

40 Thank You~

41 (A cache miss comparison chart will be shown here.)
If we find a way to calculate the overall L2 load/store misses in general, that will be shown; if not, we will show the overhead of this optimization from the tprof data.

42 Schedulers' Speedup on 4 Threads

43 Schedulers' Speedup on 8 Threads

44 Decision Tree
- Only one decision tree is built, offline, before executing the program
- That decision tree is applied to loops at runtime without changing the tree
- A decision on which scheduler to use is made with only one run of each loop, which greatly reduces runtime scheduling overhead

    uops <= 3.62885e+08 :
    |   cachemiss <= 111979 :
    |   |   uops > 748339 : static-4
    |   |   uops <= 748339 :
    |   |   |   l/s <= 167693 : static-4
    |   |   |   l/s > 167693 : static-static
    |   cachemiss > 111979 :
    |   |   floatpoint <= 1.52397e+07 :
    |   |   |   cachemiss <= 384690 :
    |   |   |   |   uops <= 2.06431e+07 : static-static
    |   |   |   |   uops > 2.06431e+07 :
    |   |   |   |   |   imbalance <= 1330 : afs-static
    |   |   |   |   |   imbalance > 1330 :
    |   |   |   |   |   |   cachemiss <= 301582 : afs-4
    |   |   |   |   |   |   cachemiss > 301582 : guided-static
    .......
    uops > 3.62885e+08 :
    |   l/s > 7.22489e+08 : static-4
    |   l/s <= 7.22489e+08 :
    |   |   imbalance <= 32236 : static-4
    |   |   imbalance > 32236 :
    |   |   |   floatpoint <= 5.34465e+07 : static-4
    |   |   |   floatpoint > 5.34465e+07 :
    |   |   |   |   floatpoint <= 1.20539e+08 : tss-4
    |   |   |   |   floatpoint > 1.20539e+08 :
    |   |   |   |   |   floatpoint <= 1.45588e+08 : static-4
    |   |   |   |   |   floatpoint > 1.45588e+08 : tss-4

END hardware-counter scheduling

45 (A load imbalance comparison chart will be shown here; still generating.)

