Cache-Aware Partitioning of Multi-Dimensional Iteration Spaces


1 Cache-Aware Partitioning of Multi-Dimensional Iteration Spaces
Arun Kejariwal (Yahoo! Inc., Santa Clara, CA)
Alexandru Nicolau, Utpal Banerjee, Alexander V. Veidenbaum (UC Irvine, CA)
Constantine D. Polychronopoulos (UI Urbana, IL)
Presenter: Olga Golovanevsky

2 Outline
Motivation
Motivating examples
The Techniques
Results
Conclusion

3 Motivation
Multi-cores becoming ubiquitous
  Examples: Intel’s Sandy Bridge, IBM’s Cell and POWER, Sun’s UltraSPARC T* family
  Number of cores is expected to increase
Large-scale hardware parallelism available
Software challenges
  Thread-level application parallelization
  How to map threads on different cores
  Load balancing
  Data affinity

4 Application Parallelization
Loops account for most of application run-time
Loop classification
  DOALL: No loop-carried dependence
    Amenable to auto-parallelization
    Execute iterations in parallel on different threads
  Non-DOALL
    Thread synchronization support needed for parallelization

5 Parallel Execution of DOALL Loops
Auto-parallelized
Directive-driven parallelization
  Example: OpenMP pragmas
Issue with parallel execution
  Load balancing
  How to partition the iteration space for best performance?
  Naïve way: Partition the iteration space uniformly amongst the different threads
    Doesn’t yield the best performance!

6 Iteration Space Partitioning
Why is it non-trivial?
  Non-rectangular geometry of iteration space
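To see why a non-rectangular iteration space defeats uniform partitioning, consider a hypothetical triangular nest in which the inner loop runs from 1 to i for each outer index i. The nest and the helper names below are illustrative assumptions, not taken from the slides. A minimal Python sketch that counts the inner iterations each thread receives when the outer loop is split into equal-length chunks:

```python
def uniform_chunks(n_outer, n_threads):
    """Split outer indices 1..n_outer into contiguous chunks of near-equal length."""
    base, extra = divmod(n_outer, n_threads)
    chunks, start = [], 1
    for t in range(n_threads):
        length = base + (1 if t < extra else 0)
        chunks.append(range(start, start + length))
        start += length
    return chunks


def triangular_work(chunk):
    """Inner loop runs j = 1..i, so outer iteration i costs i inner iterations."""
    return sum(i for i in chunk)


# Hypothetical problem size: 12 outer iterations shared by 3 threads.
chunks = uniform_chunks(12, 3)
work = [triangular_work(c) for c in chunks]
print(work)  # [10, 26, 42]
```

Every thread gets four outer iterations, yet the last thread executes more than four times the work of the first, so the slowest thread dominates the run time.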

7 Iteration Space Partitioning (contd.)
Why is it non-trivial?
  Use of indirect referencing
  Non-uniform cache miss profile
[Figure: Variation in L1 cache misses; panels: 462.libquantum, 435.gromacs]
    do k=nj0,nj1
      jnr = jjnr(k)+1
      j3 = 3*jnr-2
      faction(j3) = faction(j3)-tx11
      faction(j3+1) = faction(j3+1)-ty11
      faction(j3+2) = faction(j3+2)-tz11
    end do

8 Iteration Space Partitioning (contd.)
Why is it non-trivial?
  Non-perfect multi-way loops
    Outermost loop may have multiple loops at the same nesting level
    Conditional execution of inner loops
    do k=2, nk-1
      do j=1, ny          ! first loop
        do i=1, nx
          read A[k,j,i]
        end do
      do j=2, ny-1        ! second loop
        do i=2, nx-1
          write A[k,j,i]
[Figure labels: T1 k=1, T2 k=4]

9 Iteration Space Partitioning (contd.)
Why is it non-trivial?
  Presence of conditionals in the loop body
  Non-uniform workload distribution
[Figure: Variation in Inst Retired; panels: 403.gcc, 434.zeusmp]
    do 90 k=ks,ke
      do 80 j=js,je
        do 70 i=is,ie
          if ((rin .lt. rsq) .and. (rout .le. rsq)) then
          endif
          if ((rin .lt. rsq) .and. (rout .gt. rsq)) then
    70  continue
    80  continue
    90  continue

10 How to partition?
Guiding factors
  Partition the outermost loop
    Minimizes scheduling overhead
  Geometry-aware
    Model the iteration space as a convex polytope
    Loop bounds are affine functions of outer loop indices
  Cache-Aware
    Account for non-uniform cache miss profile across the iteration space
    Account for non-uniform workload distribution across the iteration space

11 Algorithm
High-level steps
  Obtain the cache miss profile
  Obtain the workload distribution
  Compute the total volume of iteration space
    Weighted by cache misses and instructions retired
  Given n threads
    Compute n-1 breakpoints along the axis corresponding to the outermost loop wherein
      Each breakpoint delimits a set
      Each set has equal weighted volume
  Map each set onto a different thread
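The breakpoint step can be sketched in Python as a prefix-sum search. This is a discrete approximation under stated assumptions, not the authors' polytope-volume computation: `weights[i]` is a hypothetical per-iteration cost for outer index i (how cache misses and instructions retired are combined into one weight is not given in the slides), and the function names are illustrative.

```python
from itertools import accumulate


def weighted_breakpoints(weights, n_threads):
    """Return n_threads - 1 cut positions along the outermost loop axis.

    A cut c means outer iterations [previous cut, c) form one set. Each cut
    is placed at the first position where the running weight reaches the
    next multiple of total / n_threads.
    """
    total = sum(weights)
    prefix = list(accumulate(weights))
    cuts, idx = [], 0
    for t in range(1, n_threads):
        target = total * t / n_threads
        while idx < len(prefix) and prefix[idx] < target:
            idx += 1
        cuts.append(min(idx + 1, len(weights)))
    return cuts


def partition(weights, n_threads):
    """Map each contiguous set of outer iterations to one thread."""
    bounds = [0] + weighted_breakpoints(weights, n_threads) + [len(weights)]
    return [list(range(bounds[k], bounds[k + 1])) for k in range(n_threads)]


# Hypothetical profile: cost grows linearly with the outer index.
weights = list(range(1, 13))
sets = partition(weights, 3)
loads = [sum(weights[i] for i in s) for s in sets]
print(weighted_breakpoints(weights, 3), loads)  # [7, 10] [28, 27, 23]
```

With this profile the sets have unequal lengths (7, 3 and 2 outer iterations) but nearly equal weighted volume, which is the property the algorithm targets.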

12 Experimental Setup
Use in-built hardware performance counters
  MEM_LOAD_RETIRED.L1D_MISS
    Obtain cache miss profile
  INST_RETIRED.ANY
    Obtain instructions retired profile

13 Kernel Set

14 Results (contd.)
Compute two metrics:
  Speedup
    $\text{Speedup} = \dfrac{t_{co} - t_{ca}}{t_{ca}} \times 100$
    $t_{co}$ = cache-oblivious, $t_{ca}$ = cache-aware
  Deviation
    Difference between proposed technique and worst case
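As a worked instance of the speedup metric, the following Python sketch evaluates the formula for a pair of run times. The function name and the sample timings are hypothetical and are not results from the presentation.

```python
def speedup_percent(t_co, t_ca):
    """Percentage speedup of cache-aware (t_ca) over cache-oblivious (t_co) partitioning."""
    if t_ca <= 0:
        raise ValueError("t_ca must be positive")
    return 100.0 * (t_co - t_ca) / t_ca


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
print(speedup_percent(12.0, 10.0))  # 20.0
```

A positive value means the cache-aware partitioning ran faster; zero means the two partitionings took the same time.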

15 Thank You!

16 Results (contd.)
Performance variation with different partitioning planes
  3 threads

17 Results
Performance variation with different partitioning planes
  A kernel from 178.galgel
  Nested non-perfect multiway DOALL loop
  2 threads

