
1
Autotuning at Illinois. María Jesús Garzarán, University of Illinois.

2
Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems

3
Why autotuning? In the era of parallelism, applications and software must maintain high efficiency as machines evolve. – Otherwise, there is no reason to buy new machines. Problem: high efficiency requires laborious tuning. – Tuning increases cost. – Performance is low if not enough resources are devoted to it. We would like to automate tuning.

4
Compilers One approach is to rely on compilers, but compilers have limitations: – They lack semantic information → fewer choices – They must target all applications – They must be reasonably fast

5
Compiler vs. Manual Tuning: Discrete Fourier Transform [Figure: performance of compiler-generated vs. manually tuned DFT code.]

6
Compiler vs. Manual Tuning: Matrix-Matrix Multiplication [Figure: MFLOPS vs. matrix size; Intel MKL is up to 20x faster than code compiled with icc -O3 -xT or icc -O3.]

7
Compiler vs. Manual Tuning: Matrix-Matrix Multiplication
loop 1: c[i*N+j] += a[i*N+k]*b[k*N+j]
loop 2: c[i][j] += a[i][k]*b[k][j]
loop 3: C += a[i][k]*b[k][j]
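The gap between the loop variants above comes largely from restructuring that a library writer applies by hand. As a rough illustration (not ATLAS's actual generated code), a cache-blocked version of the triple loop looks like this; the tile size NB is exactly the kind of parameter an autotuner searches over, and 8 here is only a placeholder value:

```c
#include <stddef.h>

/* Naive triple loop from the slide: C += A*B, N x N, row-major. */
void mmm_naive(int n, const double *a, const double *b, double *c) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Cache-blocked version: NB is an illustrative tile size; an
 * autotuner would pick it per machine (see the NB models later). */
enum { NB = 8 };
void mmm_blocked(int n, const double *a, const double *b, double *c) {
    for (int ii = 0; ii < n; ii += NB)
        for (int kk = 0; kk < n; kk += NB)
            for (int jj = 0; jj < n; jj += NB)
                /* the min(...) bounds handle n not divisible by NB */
                for (int i = ii; i < ii + NB && i < n; i++)
                    for (int k = kk; k < kk + NB && k < n; k++)
                        for (int j = jj; j < jj + NB && j < n; j++)
                            c[i*n + j] += a[i*n + k] * b[k*n + j];
}
```

Both versions compute the same result; blocking only reorders the iterations so each tile of A, B, and C stays in cache while it is reused.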

8
Compilers can and should improve, but we will need other strategies (at least in the short term).

9
Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems

10
What is Autotuning? An emerging strategy: empirical search. – Goal: automatically generate highly efficient code for each target machine (and input set). – Programmers develop metaprograms (programs that generate programs) that search the space of possible algorithms/implementations.
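The core of empirical search can be sketched in a few lines (the function and kernel names below are illustrative stand-ins, not from any of the systems discussed): every candidate version is executed, timed, and the fastest one is kept.

```c
#include <time.h>

typedef void (*version_fn)(void);

/* Empirical search skeleton: time every candidate implementation and
 * return the index of the fastest one.  In a real autotuner the
 * candidates come from a metaprogram; here they are fixed stand-ins. */
int select_fastest(version_fn *versions, int nversions, int reps) {
    int best = 0;
    double best_time = -1.0;
    for (int v = 0; v < nversions; v++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            versions[v]();
        double elapsed = (double)(clock() - start) / CLOCKS_PER_SEC;
        if (best_time < 0.0 || elapsed < best_time) {
            best_time = elapsed;
            best = v;
        }
    }
    return best;
}

/* Two dummy candidate versions for demonstration only. */
static volatile long sink;
void cheap_version(void)  { sink += 1; }
void costly_version(void) { for (long i = 0; i < 20000000; i++) sink += i; }
```

The selected index would then be compiled into (or loaded by) the deployed library, which is why the search cost is paid once per machine rather than per run.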

11
Autotuning with empirical search [Diagram: a metaprogram (a description of the space of versions) plus high-level code drive a generator of versions; each generated high-level version passes through a source-to-source optimizer and the native compiler to produce object code; the object code is executed on training input data, and the measured performance guides the selection of the final code.]

12
Autotuning More laborious than conventional programming, but – Longer lifetime → cost reduction – Can accumulate experience → better results – Can afford to search more extensively → better results

13
Examples of Existing Autotuning Systems
– ATLAS: Whaley, Petitet, Dongarra (Tennessee)
– BeBop: Demmel, Yelick, Im, Vuduc (Berkeley)
– Datamining: Jian, Garzarán, Snir (Illinois)
– FFTW: Frigo (MIT)
– Illinois Sorting: Li, Garzarán, Padua (Illinois)
– Matrix-matrix multiplication for GPU: Jiang, Snir (Illinois)
– PHiPAC: Bilmes, Asanovic, Vuduc, Iyer, Demmel, Chin, Lan (Berkeley)
– Space Pruning for GPU: Ryoo, Rodrigues, Stone, Baghsorkhi, Ueng, Stratton, Hwu (Illinois)
– SPIRAL: Moura, Pueschel (CMU), Johnson (Drexel), Garzarán, Padua (Illinois)
– SPIKETune: Wong, Kuck (Intel), Sameh (Purdue), Padua (Illinois)

14
Outline 1. Why Autotuning? 2. What is Autotuning? 3. Research Problems

15
Autotuning with empirical search [The same generator/optimizer/compiler/execution diagram, annotated with research questions: What to do when performance depends on the input? How to specify the search space? What is performance (execution time, power)? How to drive the search?]

16
Research Issues 1. What to do when performance depends on the input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for. Very promising, but much to learn.

17
Issue 1: Performance depends on input When performance depends on the input, we must generate dynamically adapting routines. – Illustrated with the generation of sorting routines: [CGO04] Li, Garzarán, Padua. A Dynamically Tuned Sorting Library. In Proc. of the Int. Symp. on Code Generation and Optimization, 2004. [CGO05] Li, Garzarán, Padua. Optimizing Sorting with Genetic Algorithms. In Proc. of the Int. Symp. on Code Generation and Optimization, 2005.

18
Issue 1: Sorting Different algorithms to perform sorting – Radix sort – Quick sort – Merge sort No single algorithm is the best for all inputs and platforms

19
Our Contribution Design of hybrid algorithms and use of genetic search to find sorting routines that automatically adapt to the target machine and the input characteristics. Result: – Generation of the fastest sorting routines for sequential and parallel execution
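To make input-dependent selection concrete, here is a toy version of the idea (the real library selects among radix sort, quicksort, and merge sort using learned thresholds and input features such as entropy; the size feature and the cutoff of 32 below are illustrative only):

```c
#include <stdlib.h>

static int cmp_int(const void *x, const void *y) {
    int a = *(const int *)x, b = *(const int *)y;
    return (a > b) - (a < b);
}

/* Insertion sort: typically fastest on very small inputs. */
static void insertion_sort(int *a, int n) {
    for (int i = 1; i < n; i++) {
        int v = a[i], j = i - 1;
        while (j >= 0 && a[j] > v) { a[j + 1] = a[j]; j--; }
        a[j + 1] = v;
    }
}

/* Toy adaptive sort: pick the algorithm from a cheap input feature.
 * The real library uses richer features (e.g., entropy of the key
 * distribution) and a learned, machine-specific cutoff. */
void adaptive_sort(int *a, int n) {
    if (n < 32)
        insertion_sort(a, n);
    else
        qsort(a, n, sizeof(int), cmp_int);
}
```

The decision is taken at runtime, per call, which is what distinguishes this issue from the one-time install-time search of the previous slides.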

20
Sorting Performance (keys per cycle) [Figure: keys per cycle vs. standard deviation of the input for CC-Radix, Merge Sort, and Quicksort on Intel Xeon and AMD Athlon MP; the same input yields different relative performance on the two machines.]


22
Sorting Genome [Diagram: a sorting routine encoded as a tree of primitives: divide with pivot, divide by digit, divide into blocks, and select with entropy (branching on < theta / ≥ theta).] Hybrid sorting for dynamic adaptation.

23
Example of hybrid sorting [Animation collapsed to its final frame: the input is divided with a pivot into Bucket 1 and Bucket 2; for each bucket, the next operation (divide by digit, divide into blocks, ...) is selected based on the entropy of its data, branching on < theta / ≥ theta, and the process repeats until every bucket is sorted.]

31
Learning: Algorithm Selection [Diagram: training inputs are run on the target machine; a machine-learning mechanism derives a mapping from input data to the best algorithm, and that mapping is used at runtime.]

32
Results: Sequential Sorting [Figure: on IBM Power3, the classifier-based sort is up to 26% faster than IBM ESSL and the C++ STL sort.]

33
Results: Parallel Sorting [Figure: parallel sorting performance on an Intel quad-core machine.]

34
Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for

35
Issue 2: Modeling/Search When the search space is too big, we must use models or better search mechanisms. Illustrated with: 1. An analytical model and a hybrid approach for ATLAS: [PLDI03] Yotov, Li, Ren, Cibulskis, DeJong, Garzarán, Padua, Pingali, Stodghill, and Wu. A Comparison of Empirical and Model-driven Optimization. In PLDI, 2003. [ProcIEEE05] Yotov, Li, Ren, Garzarán, Padua, Pingali, and Stodghill. Is Search Really Necessary to Generate High-Performance BLAS? In Proc. of the IEEE, 2005. [LCPC05] Epshteyn, Garzarán, DeJong, Padua, Ren, Li, Yotov, and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005. 2. Genetic search for sorting [CGO04, CGO05].

36
ATLAS Modeling ATLAS (Automatically Tuned Linear Algebra Software) was developed by R. Clint Whaley, Antoine Petitet, and Jack Dongarra at the University of Tennessee. ATLAS uses empirical search to automatically generate highly tuned Basic Linear Algebra Subprograms (BLAS). – It uses search to adapt to the target machine.

37
Our Contribution Development of methods to speed up the search process: – analytical models that replace the search – hybrid models that combine models with empirical search. [LCPC05] Epshteyn, Garzarán, DeJong, Padua, Ren, Li, Yotov, and Pingali. Analytic Models and Empirical Search: A Hybrid Approach to Code Optimization. In LCPC, 2005. The result: – same performance – faster generation.

38
ATLAS Infrastructure [Diagram: a hardware-parameter detection stage feeds NR, MulAdd, Latency, and L1Size to the ATLAS search engine (MMSearch); the search engine passes NB, MU/NU/KU, xFetch, MulAdd, and Latency to the ATLAS matrix-multiply code generator (MMCase), which emits miniMMM source code; the source is compiled, executed, and measured in MFLOPS, and the measurements feed back into the search.]

39
Modeling for Optimization Parameters [In the model-based variant, our modeling engine replaces the search engine: the hardware parameters, including the L1 instruction cache size, feed a model that directly computes the code-generator parameters.] Optimization parameters: – NB: hierarchy of models (next slide) – MU, NU: chosen so that the register tile fits in the available registers – KU: maximized subject to the L1 instruction cache size – Latency, MulAdd: taken from the hardware parameters – xFetch: set to 2.

40
Modeling for Tile Size (NB) Models of increasing complexity, for an L1 cache of capacity C:
– 3·NB² ≤ C: the whole working set fits in L1
– NB² + NB + 1 ≤ C: one tile, one row, and one element fit (fully associative cache, optimal replacement, line size of 1 word)
– refined for line size > 1 word
– refined for LRU replacement

41
MMM Performance [Figure: MFLOPS of vendor BLAS, native compiler, ATLAS (search), and the model-driven version on SGI R12000, Sun UltraSparc III, and Intel Pentium III.]

42
Models/Search Models reduce search time to zero. However, search is still necessary when a model does not exist.

43
Genetic search for sorting [Diagram: the sorting genome, a tree of primitives (divide with pivot, divide by digit, divide into blocks, select with entropy, branching on < theta / ≥ theta).] Genetic operators are used to derive new offspring: – mutation (add or remove subtrees, change parameters) – crossover.

44
Issue 2: Modeling/Search We need tools to support models and search: P-Ray, a characterization of the hardware. [LCPC08] Duchateau, Sidelnik, Garzarán, Padua. P-Ray: A Suite of Microbenchmarks for Multi-core Architectures. In LCPC, 2008.

45
Characterize Hardware P-Ray: development of microbenchmarks to measure hardware characteristics of multicore platforms. [Diagram: P-Ray supplies the hardware parameters (NR, MulAdd, Latency, L1Size, ...) that drive an autotuning infrastructure such as ATLAS.]

46
Our Contribution P-Ray: a tool to measure – cache block size – cache mapping – processor mapping – effective bandwidth. The result: – correct results on 3 different platforms (Intel Xeon Harpertown, Sun UltraSparc T1 Niagara, Intel Core 2 Quad Kentsfield).

47
P-Ray: Processor Mapping [Diagram: an 8-core Intel Harpertown system with two chips; on each chip, pairs of cores share an L2 cache (cores 1 and 3, 5 and 7 on chip 1; cores 2 and 4, 6 and 8 on chip 2).]

48
Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for

49
Issue 3: Description of the Space The ATLAS generator is written in C. We need more effective notations to implement a generator, i.e., to describe the search space. Two possibilities: – Domain-Specific Languages – General-Purpose Languages.

50
Issue 3: Description of the Space Illustrated with: 1. SPIRAL (Domain-Specific Language): [ProcIEEE05] Püschel, Moura, Johnson, Padua, Veloso, Singer, Xiong, Franchetti, Gacic, Voronenko, Chen, Johnson, and Rizzolo. SPIRAL: Code Generation for DSP Transforms. Proc. of the IEEE, 2005. http://www.spiral.net 2. Metalanguage (General-Purpose Language): [LCPC05] Donadio, Brodman, Roeder, Yotov, Barthou, Cohen, Garzarán, Padua, and Pingali. A Language for the Compact Representation of Multiple Program Versions. In LCPC, 2005.

51
SPIRAL SPIRAL is a generator of signal processing algorithms (DFT, DCT, WHT, filters, …). SPIRAL uses empirical search to generate routines that adapt to the target machine: – sequential, parallel, SIMD, …

52
SPIRAL Contribution Declarative domain-specific language and rewriting rules to specify the search space. The result – Generation of routines that run faster than IPP (manually tuned) – Intel has started to use SPIRAL to generate parts of the IPP library

53
SPIRAL Search is based on breakdown and rewriting rules, expressed in SPL, SPIRAL's metalanguage.

54
SPIRAL Program Generation [Diagram: (a) a transform is expanded by a breakdown strategy (e.g., Cooley-Tukey) into an SPL formula, a product of sparse, parameterized matrices; (b) the sequence of rule applications is recorded as a ruletree.]

55
SPIRAL Program Generation

56
SPIRAL Why is search important? – Different formulas (algorithms) have different execution times: they differ in their memory access patterns and in their ILP.

57
SPIRAL Performance Results

58
Metaprogramming General-purpose programming of autotuned libraries and applications. A metaprogram contains a compact description of the space of program versions and how to proceed with the search.

59
Metaprogram example Search strategy:
%try s in {2,4,8}
for j=1 to 128 by %s
  %for k=j to j+s-1
    a(%k) = …
Program shape for each value of s:
s=2: for j=1 to 128 by 2; a(j) = … a(j+1) = …
s=4: for j=1 to 128 by 4; a(j) = … a(j+1) = … a(j+2) = … a(j+3) = …
s=8: for j=1 to 128 by 8; a(j) = … a(j+1) = … a(j+2) = … a(j+3) = … a(j+4) = … a(j+5) = … a(j+6) = … a(j+7) = …
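A metaprogram of this kind is itself an ordinary program that prints programs. As a rough sketch (not the actual metalanguage implementation), a generator that expands the %try directive above into one unrolled version per candidate value of s might look like this; it assumes buf is large enough for the emitted text:

```c
#include <stdio.h>

/* Emit into buf the unrolled loop body for one candidate unroll
 * factor s, mirroring the "%try s in {2,4,8}" example; returns the
 * number of characters written.  A search driver would compile and
 * time each emitted version. */
int emit_version(char *buf, size_t cap, int s) {
    int off = snprintf(buf, cap, "for j=1 to 128 by %d\n", s);
    for (int k = 0; k < s; k++) {
        if (k == 0)
            off += snprintf(buf + off, cap - off, "  a(j) = ...\n");
        else
            off += snprintf(buf + off, cap - off, "  a(j+%d) = ...\n", k);
    }
    return off;
}
```

Calling emit_version once per value in {2, 4, 8} reproduces the three program shapes shown on the slide.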

60
Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for

61
Issue 4: What to tune 1. Kernels (MMM, FFT, sorting, …) 2. Codelets 3. Primitives

62
Codelets A class of (short) code sequences that appear often in an application domain The set of codelets should cover much of the execution domain Applications are decomposed into codelets Codelets are autotuned

63
Codelets We need a database of codelets. – Each codelet in the database carries a set of compiler optimizations. The application is decomposed into codelets that are matched against the codelets in the database. – Application codelets are optimized using the set of optimizations of the matched database codelet. Collaboration with David Kuck and David Wong, Intel.

64
Primitive Operations Same as codelets, but not identified automatically by the compiler. The user is expected to write the application using primitives. The primitive operations are tuned for each target platform.

65
Example of Primitive Operations HTA: Hierarchically Tiled Arrays. [PPoPP06] Bikshandi, Guo, Hoeflinger, Almasi, Fraguela, Garzarán, Padua, and von Praun. Programming for Parallelism and Locality with Hierarchically Tiled Arrays. In PPoPP, 2006. [PPoPP08] Guo, Bikshandi, Fraguela, Garzarán, and Padua. Programming with Tiles. In PPoPP, 2008.

66
Hierarchically Tiled Arrays (HTAs) HTA is a data type in which tiles are explicit. HTAs are manipulated with data-parallel primitives. – HTA programs look like sequential programs in which parallelism is encapsulated in the data-parallel primitives. Result: – programs that run as fast as MPI (tested with the NAS benchmarks) – fewer lines of code – portable code.
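The flavor of the approach can be sketched in a few lines of C (the type and function names are illustrative, not the real HTA API): an array whose tiling is explicit, plus a data-parallel map primitive applied tile by tile.

```c
#include <stddef.h>

/* A 1-D array with explicit tiling.  Because tiles are disjoint,
 * each iteration of the map loop below is independent and could be
 * assigned to a different processor. */
typedef struct {
    double *data;   /* flat storage, ntiles * tile_len elements */
    int ntiles;
    int tile_len;
} hta1d;

/* Data-parallel primitive: apply op to every tile. */
void hta_map(const hta1d *h, void (*op)(double *tile, int len)) {
    for (int t = 0; t < h->ntiles; t++)
        op(h->data + (size_t)t * h->tile_len, h->tile_len);
}

/* Example element-wise operation applied to each tile. */
void scale_by_two(double *tile, int len) {
    for (int i = 0; i < len; i++)
        tile[i] *= 2.0;
}
```

A program written against hta_map reads sequentially, yet the runtime is free to execute the per-tile calls in parallel, which is exactly the encapsulation the slide describes.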

67
FFT written using HTA data-parallel primitives; the primitives can be autotuned.

68
Data Parallel Primitives Challenge: Can we extend data parallel primitive operations to other complex data types, such as sets, trees, graphs?

69
Research Issues 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for

70
Issue 5: What to tune for 1. Execution time (all the previous systems) 2. Power (preliminary data in the next slides) 3. Space 4. Reliability

71
Power in SPIRAL Processors allow software control of the operating frequency and voltage; e.g., the Intel Pentium M 770 has 6 settings: – 2.13 GHz at 1.340 V (max performance) – 800 MHz at 0.988 V (min power/energy).

72
Experimental Setup Intel Pentium M model 770. Measurements: – HW: Agilent 34134A current probe and Agilent 34401A digital multimeter – SW: SPIRAL-controlled automatic runtime and energy measurement routines. Optimization space: voltage-frequency scaling.

73
Dynamic voltage-frequency scaling Use of voltage-scaling instructions: – CPU-bound regions → run at high frequency – memory-bound regions → run at low frequency. Result: minimal impact on execution time and a significant reduction in energy consumption.
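The policy on this slide can be phrased as a tiny decision function. The two frequency/voltage pairs below are the Pentium M 770 endpoints quoted two slides back; the miss-ratio cutoff of 0.05 is an assumed value for illustration, not a measured threshold:

```c
typedef struct { int mhz; double volts; } vf_setting;

static const vf_setting VF_MAX = { 2130, 1.340 }; /* max performance  */
static const vf_setting VF_MIN = {  800, 0.988 }; /* min power/energy */

/* DVFS policy sketch: CPU-bound phases (low cache miss ratio) run at
 * the highest frequency; memory-bound phases (high miss ratio) run at
 * the lowest, since the core mostly waits on memory anyway.  The 0.05
 * threshold is illustrative only. */
vf_setting choose_setting(double cache_miss_ratio) {
    return cache_miss_ratio > 0.05 ? VF_MIN : VF_MAX;
}
```

In the SPIRAL experiments the miss ratio is sampled per program phase, so this function would be evaluated at each phase boundary.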

74
Dynamic voltage-frequency scaling: memory profile [Figure: cache miss ratio over time for WHT 2^19 (out-of-cache); each point shows the cache miss ratio for one sampling interval, with a zoomed-in view of the alternating phases.]

75
Dynamic voltage-frequency scaling: memory profile [Figure: the same WHT 2^19 miss-ratio profile, annotated with the frequency chosen for each phase: low frequency for the memory-bound (high miss ratio) phases, high frequency for the CPU-bound phases.]

76
Dynamic voltage-frequency scaling: results [Figure: energy (Joules) versus execution time (seconds) for WHT 2^19 across the frequency/voltage settings.]

77
Dynamic voltage-frequency scaling: results [Figure: energy (Joules) versus execution time (seconds) for WHT 2^19; relative to fixed settings, dynamic voltage scaling achieves either the same execution time with about 10% less energy, or the same energy with a shorter execution time.]

78
Compiler Optimizations (Future Work) [Figure: cache miss ratio across iterations, before and after reordering.] Applying dependence analysis to group together iterations with similar cache miss ratios increases the benefit of dynamic voltage scaling.

79
Research Agenda 1. Performance depends on input 2. Modeling/Search 3. Description of the space 4. What to tune 5. What to tune for
