Published by Howard Jarman. Modified over 2 years ago.

1
Autotuning at Illinois María Jesús Garzarán University of Illinois

2
Outline 1.Why Autotuning? 2.What is Autotuning? 3.Research Problems

3
Why autotuning?
In the era of parallelism, applications and software must maintain high efficiency as machines evolve.
– Otherwise, there is no reason for new machines.
Problem: High efficiency requires laborious tuning.
– Cost increases.
– Performance is low if there are not enough resources for tuning.
We would like to automate tuning.

4
Compilers One way is compilers, but compilers have limitations. – Lack semantic information → fewer choices – Must target all applications – Must be reasonably fast

5
Compiler vs. Manual Tuning Discrete Fourier Transform

6
Compiler vs. Manual Tuning: Matrix-Matrix Multiplication
[Chart: MFLOPS versus matrix size; series: Intel MKL, icc -O3 -xT, icc -O3; annotation: 20x.]

7
Compiler vs. Manual Tuning: Matrix-Matrix Multiplication
loop 1: c[i*N+j] += a[i*N+k]*b[k*N+j]
loop 2: c[i][j] += a[i][k]*b[k][j]
loop 3: C += a[i][k]*b[k][j]

8
Compilers can and should improve, but we will need other strategies (at least in the short term).

9
Outline 1.Why Autotuning? 2.What is Autotuning? 3.Research Problems

10
What is Autotuning?
An emerging strategy: empirical search
– Goal: Automatically generate highly efficient code for each target machine (and input set).
– Programmers develop metaprograms (programs that generate programs) that search the space of possible algorithms and implementations.
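The idea of empirical search can be sketched in a few lines: generate several versions, run each on a training input, and keep the fastest. This is only a minimal illustration, not the mechanism of any system named in these slides; the candidate routine `make_blocked_sum`, its block-size parameter, and `empirical_search` are hypothetical names chosen for the example.

```python
import timeit

def make_blocked_sum(block):
    """Return one candidate version: sums a list in chunks of `block` items."""
    def version(data):
        total = 0
        for start in range(0, len(data), block):
            total += sum(data[start:start + block])
        return total
    return version

def empirical_search(candidates, training_input, repeats=3):
    """Time every candidate on the training input and return the fastest.

    `candidates` maps a parameter value to a callable (both illustrative).
    """
    best_param, best_time = None, float("inf")
    for param, version in candidates.items():
        # Best of several runs reduces measurement noise.
        elapsed = min(timeit.repeat(lambda: version(training_input),
                                    number=1, repeat=repeats))
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param, best_time

# The search space here is just three block sizes (an assumption).
space = {b: make_blocked_sum(b) for b in (16, 256, 4096)}
best_block, seconds = empirical_search(space, list(range(20_000)))
```

Which block size wins depends on the machine and the input, which is the reason the selection is measured rather than fixed in advance.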

11
Autotuning with empirical search
[Diagram components: Metaprogram (description of the space of versions); Generator of the versions; High-level code; Source-to-source optimizer; Native compiler; Object code; Execution on input data (training); Performance; Selected code.]

12
Autotuning More laborious than conventional programming, but – Longer lifetime → cost reduction – Can accumulate experience → better results – Can afford to search more extensively → better results

13
Examples of Existing Autotuning Systems
– ATLAS: Whaley, Petitet, Dongarra (Tennessee)
– BeBOP: Demmel, Yelick, Im, Vuduc (Berkeley)
– Datamining: Jian, Garzarán, Snir (Illinois)
– FFTW: Frigo (MIT)
– Illinois Sorting: Li, Garzarán, Padua (Illinois)
– Matrix-matrix multiplication for GPU: Jiang, Snir (Illinois)
– PHiPAC: Bilmes, Asanovic, Vuduc, Iyer, Demmel, Chin, Lan (Berkeley)
– Space Pruning for GPU: Ryoo, Rodrigues, Stone, Baghsorkhi, Ueng, Stratton, Hwu (Illinois)
– SPIRAL: Moura, Pueschel (CMU), Johnson (Drexel), Garzarán, Padua (Illinois)
– SPIKETune: Wong, Kuck (Intel), Sameh (Purdue), Padua (Illinois)

14
Outline 1.Why Autotuning? 2.What is Autotuning? 3.Research Problems

15
Autotuning with empirical search: open questions
– What to do when performance depends on the input?
– How to specify the search space?
– What is performance (execution time, power)?
– How to drive the search?
[Diagram components: Metaprogram (description of the version space); Generator of the versions; High-level code; Source-to-source optimizer; Native compiler; Object code; Execution on input data (training); Performance; Selected code.]

16
Research Issues 1.What to do when performance depends on input 2.Modeling/Search 3.Description of the space 4.What to tune 5.What to tune for Very promising, but much to learn

17
Issue 1: Performance depends on input
When performance depends on the input, we must generate dynamically adapting routines.
– Illustrated with the generation of sorting routines
[CGO04] Li, Garzarán, Padua. A Dynamically Tuned Sorting Library. In Proc. of the Int. Symp. on Code Generation and Optimization, 2004.
[CGO05] Li, Garzarán, Padua. Optimizing Sorting with Genetic Algorithms. In Proc. of the Int. Symp. on Code Generation and Optimization, 2005.

18
Issue 1: Sorting Different algorithms to perform sorting – Radix sort – Quick sort – Merge sort No single algorithm is the best for all inputs and platforms

19
Our Contribution Design of hybrid algorithms and use of genetic search to find sorting routines that automatically adapt to the target machine and the input characteristics. Result: – Generation of the fastest sorting routines for sequential and parallel execution

20
Sorting Performance (keys per cycle)
[Charts: Intel Xeon and AMD Athlon MP; series: CC-Radix, Merge Sort, Quicksort; x-axis: standard deviation.]
Same input, different performance.

21
Sorting Performance (keys per cycle)
[Charts: Intel Xeon and AMD Athlon MP; series: CC-Radix, Merge Sort, Quicksort; x-axis: standard deviation.]

22
Hybrid sorting for dynamic adaptation
[Diagram: a sorting genome, drawn as a tree with the node types Divide by digit, Divide with pivot, Divide into block, and Select with entropy (branches: < theta, ≥ theta).]

23
Example of hybrid sorting
[Animated diagram built up over slides 23 to 30. Labels: Input; Divide with pivot; Pivot; Bucket 1; Bucket 2; Select with entropy (< theta, ≥ theta); Select operations based on entropy; Divide by digit; Divide into block; Sorted.]

31
Learning: Algorithm Selection
[Diagram labels: Training inputs; Target machine; Learning mechanism; Mapping input data ➔ best algorithm; Used at runtime.]
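A minimal sketch of entropy-based algorithm selection at runtime follows. It is not the classifier from [CGO04] or [CGO05]: the threshold `theta`, its default value, the direction of the test, and the use of Python's built-in `sorted` as a stand-in for the comparison-based branch are all assumptions made for illustration. In the approach described above, the mapping from input characteristics to the best algorithm would be learned from training inputs on the target machine.

```python
import math
from collections import Counter

def digit_entropy(keys, shift=0, bits=8):
    """Shannon entropy (in bits) of one `bits`-wide digit of non-negative integer keys."""
    mask = (1 << bits) - 1
    counts = Counter((k >> shift) & mask for k in keys)
    n = len(keys)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def radix_sort(keys, bits=8):
    """Least-significant-digit radix sort for non-negative integers ("divide by digit")."""
    if not keys:
        return []
    out, shift, mask = list(keys), 0, (1 << bits) - 1
    largest = max(keys)
    while largest >> shift:
        buckets = [[] for _ in range(1 << bits)]
        for k in out:
            buckets[(k >> shift) & mask].append(k)
        out = [k for bucket in buckets for k in bucket]  # stable concatenation
        shift += bits
    return out

def adaptive_sort(keys, theta=4.0):
    """Select an algorithm from the measured entropy (theta is a hypothetical threshold)."""
    if keys and digit_entropy(keys) >= theta:
        return radix_sort(keys)
    return sorted(keys)  # stand-in for the comparison-based branch
```

Both branches return the same sorted result; only the cost differs, so a poorly chosen `theta` affects performance, not correctness.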

32
Results: Sequential Sorting
[Chart: IBM Power3; series: Classifier Sort, IBM ESSL, C++ STL; annotation: 26%.]

33
Results: Parallel Sorting Intel Quad Core

34
Research Issues 1.Performance depends on input 2.Modeling/Search 3.Description of the space 4.What to tune 5.What to tune for

35
Issue 2: Modeling/Search
When the search space is too big, we must use models or better search mechanisms. Illustrated with:
1. An analytical model and hybrid approach for ATLAS
[PLDI03] Yotov, Li, Ren, Cibulskis, DeJong, Garzarán, Padua, Pingali, Stodghill, and Wu. A Comparison of Empirical and Model-driven Optimization. In PLDI, 2003.
[Proc of IEEE] Yotov, Li, Ren, Garzarán, Padua, Pingali, and Stodghill. Is Search Really Necessary to Generate High-Performance BLAS? In Proc. of the IEEE, 2005.
[LCPC05] Epshteyn, Garzarán, DeJong, Padua, Ren, Li, Yotov, and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005.
2. Genetic search for sorting [CGO04, CGO05]

36
ATLAS Modeling
ATLAS = Automatically Tuned Linear Algebra Software, developed by R. Clint Whaley, Antoine Petitet, and Jack Dongarra at the University of Tennessee.
ATLAS uses empirical search to automatically generate highly tuned Basic Linear Algebra Subprograms (BLAS) libraries.
– Uses search to adapt to the target machine

37
Our Contribution
Development of methods to speed up the search process.
– Analytical models that replace the search
– Hybrid models that combine models with empirical search
[LCPC05] Epshteyn, Garzarán, DeJong, Padua, Ren, Li, Yotov, and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005.
The result
– Same performance
– Faster generation

38
ATLAS Infrastructure
[Diagram: Detect Hardware Parameters (L1Size, NR, MulAdd, Latency) → ATLAS Search Engine (MMSearch) (NB; MU, NU, KU; xFetch; MulAdd; Latency) → ATLAS MM Code Generator (MMCase) → MiniMMM Source → Compile, Execute, Measure → MFLOPS, fed back to the search engine.]

39
Modeling for Optimization Parameters
Our modeling engine: optimization parameters
– NB: hierarchy of models (later)
– MU, NU:
– KU: maximize subject to the L1 instruction cache
– Latency, MulAdd: from hardware parameters
– xFetch: set to 2
[Diagram: the ATLAS infrastructure with a Model box; labels: Detect Hardware Parameters (L1Size, L1I$Size, NR, MulAdd, Latency); ATLAS Search Engine (MMSearch); ATLAS MM Code Generator (MMCase) (NB; MU, NU, KU; xFetch; MulAdd; Latency); MiniMMM Source.]

40
Modeling for Tile Size (NB)
Models of increasing complexity
– $3 \cdot NB^2 \le C$: the whole working set fits in L1
– $NB^2 + NB + 1 \le C$: fully associative, optimal replacement, line size of 1 word
– or line size > 1 word
– or LRU replacement
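The first two models can be evaluated directly. The sketch below assumes that C is the L1 data cache capacity measured in matrix elements (the slide does not state the unit), and the function names are hypothetical. The models for larger line sizes and LRU replacement are not reproduced because their formulas are not given in the text.

```python
import math

def nb_whole_working_set(capacity):
    """Largest NB with 3 * NB**2 <= capacity (three NB x NB tiles fit in L1)."""
    return math.isqrt(capacity // 3)

def nb_tile_row_element(capacity):
    """Largest NB with NB**2 + NB + 1 <= capacity."""
    nb = math.isqrt(capacity)
    while nb > 0 and nb * nb + nb + 1 > capacity:
        nb -= 1
    return nb

# Example (assumed numbers): a 32 KB L1 cache holding 8-byte doubles.
capacity = 32 * 1024 // 8          # 4096 elements
conservative = nb_whole_working_set(capacity)
refined = nb_tile_row_element(capacity)
```

For this example capacity the conservative model gives NB = 36 and the refined model gives NB = 63, which shows how much tile size the simpler model leaves unused.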

41
MMM Performance
[Charts: SGI R12000, Sun UltraSparc III, Intel Pentium III; series: BLAS, COMPILER, ATLAS, MODEL; y-axis: MFLOPS.]

42
Models/Search
Models reduce search time to 0. However, search is still necessary when a model does not exist.

43
Genetic search for sorting
Genetic operators are used to derive new offspring:
– Mutation (add or remove subtrees, change parameters)
– Crossover
[Diagram: a sorting genome, drawn as a tree with the node types Divide by digit, Divide with pivot, Divide into block, and Select with entropy (branches: < theta, ≥ theta).]
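The two genetic operators can be sketched on a small tree-shaped genome. The dictionary representation, the parameter values, and the function names below are assumptions made for illustration, not the encoding used in [CGO05]; the mutation shown covers only the "change parameters" case, not adding or removing subtrees.

```python
import copy
import random

# A genome is a tree: {"op": name, "param": int, "children": [subtrees]}.
# Node names echo the slide; the parameter values are made up.

def leaf(op, param):
    """Build a node without children."""
    return {"op": op, "param": param, "children": []}

def nodes(tree):
    """Return every node of the tree in preorder."""
    found = [tree]
    for child in tree["children"]:
        found.extend(nodes(child))
    return found

def mutate(tree, rng):
    """Return a copy in which one randomly chosen node gets a new parameter."""
    child = copy.deepcopy(tree)
    node = rng.choice(nodes(child))
    node["param"] = rng.choice([2, 4, 8, 16, 32])
    return child

def crossover(a, b, rng):
    """Return a copy of `a` with one subtree replaced by a subtree taken from `b`."""
    child = copy.deepcopy(a)
    target = rng.choice(nodes(child))
    donor = copy.deepcopy(rng.choice(nodes(b)))
    target.clear()
    target.update(donor)
    return child

rng = random.Random(1)  # fixed seed so runs are repeatable
parent_a = {"op": "divide_by_digit", "param": 8,
            "children": [leaf("divide_with_pivot", 2)]}
parent_b = leaf("divide_into_block", 16)
offspring = [mutate(parent_a, rng), crossover(parent_a, parent_b, rng)]
```

In a full genetic search each offspring would be compiled, timed on training inputs, and kept or discarded according to its measured performance.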

44
Issue 2: Modeling/Search
We need tools to guide models and search:
P-Ray: Characterization of hardware
[LCPC08] Duchateau, Sidelnik, Garzarán, Padua. P-RAY: A Suite of Micro benchmarks for Multi-core Architectures. In LCPC, 2008.

45
Characterize Hardware
P-Ray: Development of benchmarks to measure hardware characteristics of multicore platforms
[Diagram: the ATLAS infrastructure; labels: Detect Hardware Parameters (L1Size, L1I$Size, NR, MulAdd, Latency); ATLAS Search Engine (MMSearch); ATLAS MM Code Generator (MMCase) (NB; MU, NU, KU; xFetch; MulAdd; Latency); MiniMMM Source.]

46
Our Contribution
P-Ray: Tool to measure
– Block size
– Cache mapping
– Processor mapping
– Effective bandwidth
The result
– Correct results for 3 different platforms (Intel Xeon Harpertown, Sun UltraSparc T1 Niagara, Intel Core 2 Quad Kentsfield)

47
P-Ray: Processor Mapping
[Diagram: 8-core Intel Harpertown on two chips (Chip 1, Chip 2); each L2 cache is drawn next to a pair of cores: (Core 1, Core 3), (Core 5, Core 7), (Core 2, Core 4), (Core 6, Core 8).]

48
Research Issues 1.Performance depends on input 2.Modeling/Search 3.Description of the space 4.What to automate 5.What to tune for

49
Issue 3: Description of the Space
The ATLAS generator is written in C.
We need more effective notations to implement a generator (describe the search space).
Two possibilities:
– Domain-Specific Languages
– General-Purpose Languages

50
Issue 3: Description of the Space
Illustrated with:
1. SPIRAL (Domain-Specific Language)
[Proc. of IEEE05] Püschel, Moura, Johnson, Padua, Veloso, Singer, Xiong, Franchetti, Gacic, Voronenko, Chen, Johnson, and Rizzolo. Spiral: Code Generation for DSP Transforms. Proc. of the IEEE, 2005. http://www.spiral.net
2. Metalanguage (General-Purpose Language)
[LCPC05] Donadio, Brodman, Roeder, Yotov, Barthou, Cohen, Garzarán, Padua, and Pingali. A Language for the Compact Representation of Multiple Program Versions. In LCPC, 2005.

51
SPIRAL SPIRAL, generator of signal processing algorithms (DFT, DCT, WHT, filters, …) SPIRAL uses empirical search to generate routines that adapt to the target machine: – Sequential, parallel, SIMD, …

52
SPIRAL Contribution Declarative domain-specific language and rewriting rules to specify the search space. The result – Generation of routines that run faster than IPP (manually tuned) – Intel has started to use SPIRAL to generate parts of the IPP library

53
SPIRAL
Search is based on breakdown and rewriting rules, written in SPL, the SPIRAL metalanguage.

54
SPIRAL Program Generation
– Transform: parameterized matrix
– Rule: a breakdown strategy (Cooley-Tukey), product of sparse matrices
– SPL formula
– Ruletree: [diagram: a CT node with subtrees (a) and (b)]
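For reference, the Cooley-Tukey breakdown rule is commonly written in the tensor-product notation below. This is the standard textbook form, given here as an aid to reading the slide rather than as a transcription of it. Here $n = km$, $I_m$ is the $m \times m$ identity, $T^{n}_{m}$ is a diagonal matrix of twiddle factors, and $L^{n}_{k}$ is a stride permutation.

```latex
% Cooley-Tukey rule: a DFT of size n = k m as a product of sparse matrices
\mathrm{DFT}_{n} \;=\; (\mathrm{DFT}_{k} \otimes I_{m}) \; T^{n}_{m} \; (I_{k} \otimes \mathrm{DFT}_{m}) \; L^{n}_{k},
\qquad n = k\,m

% Smallest nontrivial instance, n = 4 with k = m = 2
\mathrm{DFT}_{4} \;=\; (\mathrm{DFT}_{2} \otimes I_{2}) \; T^{4}_{2} \; (I_{2} \otimes \mathrm{DFT}_{2}) \; L^{4}_{2}
```

Applying the rule recursively to the smaller transforms on the right-hand side yields a ruletree; different choices of $k$ and $m$ at each node give different formulas for the same transform, and these are the alternatives that the search explores.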

55
SPIRAL Program Generation

56
SPIRAL
Why is search important?
– Different formulas (algorithms) have different execution times
– They differ in the memory access pattern
– They have different ILP

57
SPIRAL Performance Results

58
Metaprogramming General-purpose programming of autotuned libraries and applications. A metaprogram contains a compact description of the space of program versions and how to proceed with the search.

59
Metaprogram example

Search strategy:
    %try s in {2,4,8}

Program shape for each value:
    for j=1 to 128 by %s
        %for k=j to j+s-1
            a(%k) = …

Generated versions:
    for j=1 to 128 by 2
        a(j) = …
        a(j+1) = …

    for j=1 to 128 by 4
        a(j) = …
        a(j+1) = …
        a(j+2) = …
        a(j+3) = …

    for j=1 to 128 by 8
        a(j) = …
        a(j+1) = …
        a(j+2) = …
        a(j+3) = …
        a(j+4) = …
        a(j+5) = …
        a(j+6) = …
        a(j+7) = …
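The expansion performed by the metaprogram can be imitated with a small text generator. This is a sketch of the idea only, not the metalanguage of [LCPC05]: the functions `unrolled_version` and `expand_try` are hypothetical, and the loop body is left as a placeholder exactly as on the slide.

```python
def unrolled_version(s, n=128):
    """Return the text of the loop unrolled by a factor of s (slide-style pseudocode)."""
    lines = [f"for j=1 to {n} by {s}"]
    for k in range(s):
        index = "j" if k == 0 else f"j+{k}"
        lines.append(f"    a({index}) = ...")
    return "\n".join(lines)

def expand_try(values=(2, 4, 8)):
    """Expand the %try directive: one program version per candidate value."""
    return {s: unrolled_version(s) for s in values}

versions = expand_try()
```

Each entry of `versions` is one point in the search space; an autotuner would compile and time every entry and keep the fastest.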

60
Research Issues 1.Performance depends on input 2.Modeling/Search 3.Description of the space 4.What to tune 5.What to tune for

61
Issue 4: What to tune 1.Kernels (MMM, FFT, sorting, …) 2.Codelets 3.Primitives

62
Codelets A class of (short) code sequences that appear often in an application domain The set of codelets should cover much of the execution domain Applications are decomposed into codelets Codelets are autotuned

63
Codelets Need a database of codelets – Each codelet in the database contains a set of compiler optimizations Application is decomposed in codelets that are matched against the codelets in the database – Application codelets are optimized using the set of optimizations of the matched codelet in the database Collaboration with David Kuck and David Wong, INTEL

64
Primitive Operations
– Same as codelets, but not identified automatically by the compiler
– The user is expected to write the application using primitives
– The primitive operations are tuned for each target platform

65
Example of Primitive Operations
HTA: Hierarchically Tiled Arrays
[PPoPP06] Bikshandi, Guo, Hoeflinger, Almasi, Fraguela, Garzarán, Padua, and von Praun. Programming for Parallelism and Locality with Hierarchically Tiled Arrays. In PPoPP, 2006.
[PPoPP08] Guo, Bikshandi, Fraguela, Garzarán, and Padua. Programming with Tiles. In PPoPP, 2008.

66
Hierarchically Tiled Arrays (HTAs)
HTA is a data type where tiles are explicit.
HTAs are manipulated with data parallel primitives.
– HTA programs look like sequential programs in which parallelism is encapsulated in the data parallel primitives
Result
– Programs that run as fast as MPI (tested with the NAS benchmarks)
– Fewer lines of code
– Portable codes

67
FFT using HTA parallel primitives Can be autotuned

68
Data Parallel Primitives Challenge: Can we extend data parallel primitive operations to other complex data types, such as sets, trees, graphs?

69
Research Issues 1.Performance depends on input 2.Modeling/Search 3.Description of options/space search 4.What to tune 5.What to tune for

70
Issue 5: What to tune for 1.Execution Time (All the previous systems) 2.Power (Preliminary data in next slides) 3.Space 4.Reliability

71
Power in SPIRAL
Processors allow software control of operating frequency and voltage, e.g. the Intel Pentium M 770 has 6 settings:
– 2.13 GHz at 1.340 V (max performance)
– 800 MHz at 0.988 V (min power/energy)

72
Experimental Setup
Intel Pentium M model 770
–,,,,,
Measurements
– HW: Agilent 34134A current probe and Agilent 34401A DMM
– SW: SPIRAL-controlled automatic runtime and energy measurement routine
Optimization space
– Voltage-frequency scaling

73
Dynamic voltage-frequency scaling
Use of voltage scaling instructions
– CPU-bound region → run at high frequency
– Memory-bound region → run at low frequency
Minimum impact on execution time and significant reduction in energy consumption
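The policy can be sketched as a rule that maps a measured cache miss ratio to a voltage-frequency setting. The two settings are the ones quoted for the Intel Pentium M 770 in these slides; the miss-ratio threshold and the function names are hypothetical, and the power estimate uses the common first-order model in which dynamic power is proportional to f·V², which is an assumption that does not come from the slides.

```python
# (frequency in Hz, voltage in V); values quoted in the slides for the Pentium M 770.
HIGH = (2.13e9, 1.340)
LOW = (0.80e9, 0.988)

def relative_dynamic_power(setting, reference=HIGH):
    """Dynamic power of `setting` relative to `reference`, assuming P ~ f * V**2."""
    f, v = setting
    f0, v0 = reference
    return (f * v * v) / (f0 * v0 * v0)

def choose_setting(miss_ratio, threshold=0.1):
    """Memory-bound regions (high miss ratio) run at LOW, the rest at HIGH.

    The threshold is a hypothetical tuning parameter.
    """
    return LOW if miss_ratio >= threshold else HIGH

def schedule(miss_ratios, threshold=0.1):
    """Pick one setting per profiled region."""
    return [choose_setting(m, threshold) for m in miss_ratios]

plan = schedule([0.01, 0.02, 0.45, 0.50, 0.03])
saving = 1.0 - relative_dynamic_power(LOW)
```

Under this model the low setting draws roughly a fifth of the dynamic power of the high setting, which is why running memory-bound regions slowly can save energy with little effect on execution time.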

74
Dynamic voltage-frequency scaling: memory profile
[Plot: cache miss ratio versus time for WHT-2^19 (out-of-cache), with a zoomed region. Each point shows the cache miss ratio every 100 seconds.]

75
Dynamic voltage-frequency scaling: memory profile
[Plot: cache miss ratio versus time for WHT-2^19 (out-of-cache), with regions marked low frequency and high frequency. Each point shows the cache miss ratio every 100 seconds.]

76
Dynamic voltage-frequency scaling: results
[Plot: energy (Joules) versus execution time (seconds) for WHT-2^19.]

77
Dynamic voltage-frequency scaling: results
[Plot: energy (Joules) versus execution time (seconds) for WHT-2^19, with the dynamic voltage scaling points highlighted. Annotations: same execution time, 10% less energy; same energy, less execution time.]

78
Compiler Optimizations (Future work)
Applying dependence analysis to group together iterations with similar cache miss ratios increases the benefit of dynamic voltage scaling.
[Plots: cache miss ratio versus iterations.]

79
Research Agenda 1.Performance depends on input 2.Modeling/Search 3.Description of the space 4.What to automate 5.What to tune for
