Cache Optimization for Coarse Grain Task Parallel Processing Using Inter-Array Padding
K. Ishizaka, M. Obata, H. Kasahara
Waseda University, Tokyo, Japan


Research Background
– Wide use of SMP machines, from chip multiprocessors to high performance computers
– Increasing numbers of processors: the gap between peak and effective performance is growing
– Automatic parallelizing compilers are important to raise effective performance further
– Multilevel parallel processing: beyond the limits of loop parallelization
– Locality optimization: copes with the speed gap between processor and memory, and improves the cost performance and usability of SMP machines

Multigrain Parallelization
Improvement of effective performance and scalability. In addition to loop parallelism:
– Coarse grain task parallelism: subroutines, loops, basic blocks
– Near fine grain parallelism: statements
These levels are exploited together by the OSCAR multigrain parallelizing compiler.

Generation of Coarse Grain Tasks
The program is decomposed into macro-tasks (MTs):
– Block of Pseudo Assignments (BPA): basic block (BB)
– Repetition Block (RB): natural loop
– Subroutine Block (SB): subroutine
(Figure: hierarchical decomposition into a 1st, 2nd and 3rd layer. Each RB or SB is decomposed again into BPA/RB/SB macro-tasks in the next layer; BPAs are parallelized at near fine grain, RBs at loop level or at near fine grain of the loop body, and each layer is parallelized at coarse grain.)
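As a concrete illustration, here is a minimal hypothetical FORTRAN fragment (not taken from the paper) annotated with the macro-task kind each part would become:

C     Hypothetical fragment annotated with macro-task kinds.
      PROGRAM MTDEMO
      INTEGER N
      PARAMETER (N = 100)
      REAL X(N), Y(N), A
      INTEGER I
C     straight-line assignments form a BPA (block of pseudo
C     assignments, i.e. a basic block)
      A = 2.0
      Y(1) = 1.0
C     a natural loop forms an RB (repetition block)
      DO 10 I = 1, N
         X(I) = A * REAL(I)
   10 CONTINUE
C     a subroutine call forms an SB (subroutine block)
      CALL SCALE(X, Y, N)
      WRITE (*,*) 'Y(N) =', Y(N)
      END

      SUBROUTINE SCALE(X, Y, N)
      INTEGER N, I
      REAL X(N), Y(N)
      DO 20 I = 1, N
         Y(I) = 0.5 * X(I)
   20 CONTINUE
      END

The RB and SB here would be decomposed again into macro-tasks of their own in the next layer.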

Earliest Executable Condition Analysis
The earliest executable condition (EEC) gives the conditions on which a macro-task may begin its execution earliest. Example: the EEC of MT6 is
– MT2 takes a branch that guarantees MT4 will be executed (the condition for determination of MT execution), OR
– MT3 completes execution (the condition for data access)
(Figure: the macro-flow graph and the macro-task graph derived from it.)

Cache Optimization among Macro-Tasks
To use the cache effectively for data passing among macro-tasks, macro-tasks that access the same shared data are assigned to the same processor consecutively.
(Figure: the macro-flow graph, with control-flow edges, and the macro-task graph, with data-dependency edges.)

Loop Align Decomposition (LAD)
If loops access more data than the cache size:
– Divide the loops into smaller loops, considering the data dependencies among the loops (a minimal code sketch follows this slide)
– Define Data Localizable Groups (DLGs): a DLG is a group of macro-tasks to be assigned to the same processor so that the shared data is passed through the cache
(Figure: (a) before loop decomposition, (b) after loop decomposition. Loops 2, 3 and 7 are each divided into 4 smaller loops; the gray macro-tasks are generated by LAD, and the colored bands show the DLGs dlg0 to dlg3.)
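A minimal sketch of the idea, using hypothetical loop bounds and array names (not the paper's code): a producer loop and a consumer loop over the same index range are each divided into four aligned sub-loops, and each matching pair of sub-loops, one DLG, runs back to back so the shared array is still in cache when it is read:

C     Sketch of loop align decomposition and consecutive execution
C     (hypothetical bounds and arrays; each array is about 1MB).
      PROGRAM LADDEMO
      INTEGER M, N
      PARAMETER (M = 512, N = 512)
      REAL A(M,N), B(M,N)
      INTEGER I, J, K
      DO 10 K = 0, 3
C        one quarter of the producer loop: writes N/4 columns of A
         DO 20 J = K*(N/4)+1, (K+1)*(N/4)
         DO 20 I = 1, M
            A(I,J) = REAL(I+J)
   20    CONTINUE
C        the matching quarter of the consumer loop, executed
C        consecutively, reads the columns of A written above
C        while they are still cached
         DO 30 J = K*(N/4)+1, (K+1)*(N/4)
         DO 30 I = 1, M
            B(I,J) = 2.0*A(I,J)
   30    CONTINUE
   10 CONTINUE
      WRITE (*,*) 'B(M,N) =', B(M,N)
      END

Without the division, the consumer loop would start only after all of A (larger than the cache in the paper's setting) had been written, so every read would miss.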

Consecutive Execution
Macro-tasks in a DLG are assigned to the same processor as consecutively as possible, considering the EEC.
(Figure: the original execution order on a single processor vs. the scheduling result on a single processor; executing dlg0 to dlg3 consecutively extends the time over which the cache is used effectively.)

Data Layout on a Cache for a Part of SPEC95 Swim

(a) An example of the target loops for data localization:

      DO 200 J=1,N
      DO 200 I=1,M
      UNEW(I+1,J) = UOLD(I+1,J)+
     1 TDTS8*(Z(I+1,J+1)+Z(I+1,J))*(CV(I+1,J+1)+CV(I,J+1)+CV(I,J)
     2 +CV(I+1,J))-TDTSDX*(H(I+1,J)-H(I,J))
      VNEW(I,J+1) = VOLD(I,J+1)-TDTS8*(Z(I+1,J+1)+Z(I,J+1))
     1 *(CU(I+1,J+1)+CU(I,J+1)+CU(I,J)+CU(I+1,J))
     2 -TDTSDY*(H(I,J+1)-H(I,J))
      PNEW(I,J) = POLD(I,J)-TDTSDX*(CU(I+1,J)-CU(I,J))
     1 -TDTSDY*(CV(I,J+1)-CV(I,J))
  200 CONTINUE

      DO 300 J=1,N
      DO 300 I=1,M
      UOLD(I,J) = U(I,J)+ALPHA*(UNEW(I,J)-2.*U(I,J)+UOLD(I,J))
      VOLD(I,J) = V(I,J)+ALPHA*(VNEW(I,J)-2.*V(I,J)+VOLD(I,J))
      POLD(I,J) = P(I,J)+ALPHA*(PNEW(I,J)-2.*P(I,J)+POLD(I,J))
  300 CONTINUE

      DO 210 J=1,N
      UNEW(1,J) = UNEW(M+1,J)
      VNEW(M+1,J+1) = VNEW(1,J+1)
      PNEW(M+1,J) = PNEW(1,J)
  210 CONTINUE

(b) Figure: image of the alignment on the 4MB cache (0 to 4MB) of the arrays accessed by loop1, loop2 and loop3; cache line conflicts occur among arrays that sit at the same vertical location on the cache image.
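Why the vertical location matters: a 4MB direct-mapped cache places a byte at cache offset address mod 4MB, so arrays whose base addresses differ by a multiple of 4MB land on the same cache lines. A minimal worked sketch (the base addresses are hypothetical, assuming the 1MB arrays are laid out contiguously):

C     Worked example: cache offsets in a 4MB direct-mapped cache.
C     The base addresses below are hypothetical.
      PROGRAM CONFLCT
      INTEGER CACHE, ADDR(3), K
C     cache location of a byte = MOD(byte address, cache size)
      PARAMETER (CACHE = 4*1024*1024)
C     three arrays whose bases are 4MB apart: 0MB, 4MB, 8MB
      DATA ADDR /0, 4194304, 8388608/
      DO 10 K = 1, 3
         WRITE (*,*) 'array', K, ': cache offset =',
     1      MOD(ADDR(K), CACHE)
   10 CONTINUE
      END

All three offsets print as 0: the three arrays collide on the same cache lines even though the cache could hold four 1MB arrays side by side.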

Partial Arrays Accessed inside DLG0 by Data Localization
(The target loops are the same as on the previous slide. Figure: after loop division, only partial arrays are accessed inside DLG0, so they occupy a limited part of the 4MB cache; cache line conflicts still occur among the partial arrays that sit at the same vertical location.)

Padding to Remove Conflict Misses
Example: SPEC95 swim, with 13 arrays of 1MB each, on a 4MB direct-mapped cache.
– Without optimization, the arrays are allocated over the whole cache and cause many conflict misses.
– Loop division alone leaves the partial arrays accessed inside DLG0 allocated to a limited part of the cache, so they still conflict.
– Inter-array padding, applied by increasing the array sizes, shifts the arrays so that the partial arrays accessed inside a DLG no longer collide (a minimal code sketch follows).
(Figure: the 13 arrays laid out in 4MB rows, U P UN V / VN VO PN UO / PO Z CV CU / H, with the parts accessed inside DLG0 highlighted.)
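A minimal sketch of inter-array padding in FORTRAN (hypothetical array and pad sizes, not the paper's exact layout): pads declared between consecutive arrays shift each array's base address so that the parts used together no longer map to the same lines of a 4MB direct-mapped cache.

C     Inter-array padding sketch (hypothetical sizes).  Each
C     array is about 1MB (513 x 513 REALs); the growing pads
C     shift each array's base so partial arrays used together
C     map to different parts of a 4MB direct-mapped cache.
      PROGRAM PADDEMO
      INTEGER N1, N2, NPAD
      PARAMETER (N1 = 513, N2 = 513)
C     one pad unit = 1024 REALs = 4KB
      PARAMETER (NPAD = 1024)
      REAL U(N1,N2), PAD1(NPAD)
      REAL V(N1,N2), PAD2(2*NPAD)
      REAL P(N1,N2)
      COMMON /WORK/ U, PAD1, V, PAD2, P
      U(1,1) = 1.0
      V(1,1) = 2.0
      P(1,1) = U(1,1) + V(1,1)
      WRITE (*,*) 'P(1,1) =', P(1,1)
      END

In the paper the pad sizes are chosen from the arrays' mapping onto the 4MB cache; the fixed 4KB units here are purely illustrative.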

Page Replacement Policy
The operating system maps virtual addresses to physical addresses:
– The compiler works on virtual addresses
– The cache is indexed by physical addresses (e.g. the L2 of the UltraSPARC II)
Cache performance, and the effectiveness of the compiler's data layout transformation, therefore depend on the OS policy. Sequential virtual addresses are:
– sequential in physical address under Hashed VA (Solaris 8) and Page Coloring (AIX 4.3)
– not sequential in physical address under Bin Hopping
(A toy model of the difference follows.)
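A toy model (illustrative only; the real Solaris and AIX policies are more involved): with a 4MB physically indexed cache and 8KB pages there are 512 page "colors", and the color of the physical page chosen for a virtual page determines where its data lands in the cache.

C     Toy page-placement model (illustrative only).
      PROGRAM COLORS
      INTEGER NCOLOR, IVPN, NFAULT
C     4MB cache / 8KB pages = 512 page colors
      PARAMETER (NCOLOR = 512)
      NFAULT = 0
      DO 10 IVPN = 0, 7
C        page coloring / hashed VA: the color follows the virtual
C        page number, so sequential virtual pages get sequential
C        cache locations and the compiler's layout is preserved
         WRITE (*,*) 'vpn', IVPN,
     1      ' coloring:', MOD(IVPN, NCOLOR),
     2      ' bin hopping:', MOD(NFAULT, NCOLOR)
C        bin hopping hands out the next bin at each fault; faults
C        from other processes advance the counter in between (the
C        MOD term simulates that), so this process's sequential
C        pages get scattered colors
         NFAULT = NFAULT + 1 + MOD(IVPN*13, 7)
   10 CONTINUE
      END

Under bin hopping the compiler's virtual-address layout thus gives no guarantee about physical cache placement.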

SMP Machines for Performance Evaluation
Sun Ultra 80
– Four 450MHz UltraSPARC IIs
– 4MB direct-mapped L2 cache
– Solaris 8 (Hashed VA and Bin Hopping)
– Sun Forte 6 update 2 compiler
IBM RS/6000 44p-270
– Four 375MHz POWER3s
– 4MB 4-way set-associative L2 cache (LRU)
– AIX 4.3 (Page Coloring and Bin Hopping)
– XL Fortran compiler 7.1
(The default page replacement policies are Hashed VA on Solaris 8 and Page Coloring on AIX 4.3.)

Performance of Data Localization with Inter-Array Padding on Ultra 80
4 processors, 4MB direct-mapped L2; speedup is measured against 1 processor.
(Bar chart for tomcatv, swim, hydro2d and turb3d under Hashed VA and Bin Hopping. Execution times: 19s, 11s, 31s, 56s under Hashed VA and 47s, 35s, 44s, 60s under Bin Hopping. Against Sun Forte 6 update 2 on Hashed VA, the speedups are x5.1 (tomcatv), x5.4 (swim), x2.5 (hydro2d) and x1.2 (turb3d).)

Performance of Data Localization with Inter-Array Padding on RS/6000 44p-270
4 processors, 4MB 4-way set-associative (LRU) L2; speedup is measured against 1 processor.
(Bar chart for tomcatv, swim, hydro2d and turb3d under Page Coloring and Bin Hopping. Execution times: 22s, 8.0s, 17s, 26s under Page Coloring and 25s, 12s, 16s, 25s under Bin Hopping. Against IBM XLF 7.1 on Page Coloring, the speedups are x2.0 (tomcatv), x5.3 (swim), x3.3 (hydro2d) and x1.02 (turb3d).)

Related Works
– Loop fusion and cache partitioning: N. Manjikian and T. Abdelrahman, Univ. of Toronto
– Padding heuristics: G. Rivera and C.-W. Tseng, Univ. of Maryland
– Compiler-directed page coloring: E. Bugnion, M. Lam et al., Stanford Univ.

Conclusions
Cache optimization among coarse grain tasks using inter-array padding with data localization:
– Data localization, composed of loop division and consecutive execution, improves data locality across loops
– Inter-array padding decreases cache misses
Evaluation with SPEC95 tomcatv, swim, hydro2d and turb3d:
– On a Sun Ultra 80 4-processor workstation: 5.1 times speedup for tomcatv and 5.4 times speedup for swim compared with Sun Forte 6 update 2 on Hashed VA
– On an IBM RS/6000 44p-270 4-processor workstation: 5.3 times speedup for swim and 3.3 times speedup for hydro2d compared with IBM XLF 7.1 on Page Coloring