Far Fetched Prefetching?
Tomofumi Yuki, INRIA Rennes
Antoine Morvan, ENS Cachan Bretagne
Steven Derrien, University of Rennes 1

Memory Optimizations
- The Memory Wall: memory improves more slowly than processors, with very little improvement in latency
- Goal: making sure processors have data to consume
  - Software: tiling, prefetching, array contraction
  - Hardware: caches, prefetching
- Important for both speed and power: the farther away the memory, the more power an access takes

Prefetching
- Anticipate future memory accesses and start the data transfer in advance
  - Both HW and SW versions exist
- Hides the latency of memory accesses
  - Cannot help if bandwidth is the issue
- Inaccurate prefetching is harmful
  - Consumes unnecessary bandwidth/energy
  - Puts pressure on the caches

This talk
- A failure experience from our attempt to improve prefetching
- We target cases where the trip count is small
  - Prefetch instructions must then be placed in previous iterations of the outer loop
- We use polyhedral techniques to find where to insert the prefetches

Outline
- Introduction
- Software Prefetching
  - When it doesn't work
  - Improving prefetch placement
- Code Generation
- Simple Example
- Summary and Next Steps

Software Prefetching [Mowry 96]
- Shift the iterations by the prefetching distance
- Simple yet effective method for regular computations

  Original:
      for (i=0; i<N; i++)
          ... = foo(A[i], ...);

  With prefetch distance = 4:
      /* Prologue */
      for (i=-4; i<0; i++)
          prefetch(A[i+4]);
      /* Steady state */
      for (i=0; i<N-4; i++) {
          prefetch(A[i+4]);
          ... = foo(A[i], ...);
      }
      /* Epilogue */
      for (i=N-4; i<N; i++)
          ... = foo(A[i], ...);
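For concreteness, the transformed loop can be written as compilable C along these lines (our sketch, not from the slides; foo is the slide's placeholder computation, and GCC/Clang's __builtin_prefetch stands in for the generic prefetch instruction):

    #include <stddef.h>

    double foo(double x);  /* placeholder computation from the slide */

    /* Mowry-style software prefetching with distance D; assumes N >= D. */
    void compute(const double *A, double *out, ptrdiff_t N) {
        const ptrdiff_t D = 4;  /* prefetch distance from the slide */
        ptrdiff_t i;

        /* Prologue: prefetch the data for the first D iterations. */
        for (i = -D; i < 0; i++)
            __builtin_prefetch(&A[i + D], 0, 3);

        /* Steady state: compute iteration i while prefetching for i+D. */
        for (i = 0; i < N - D; i++) {
            __builtin_prefetch(&A[i + D], 0, 3);
            out[i] = foo(A[i]);
        }

        /* Epilogue: the last D iterations; their data is already in flight. */
        for (i = N - D; i < N; i++)
            out[i] = foo(A[i]);
    }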

When Prefetching Works, When It Doesn't, and Why [Lee, Kim, and Vuduc, HiPEAC 2013]
- The prefetching distance is difficult to determine statically
  - They suggest tuning; not in the scope of our work
- Interference with HW prefetchers
- Cannot handle short streams of accesses
  - A limitation of both software and hardware prefetching
  - This is the case we try to handle

Problem with Short Streams
- With N=5 and prefetch distance = 4, the transformed code becomes:

      /* Prologue: 4 prefetches issued back-to-back */
      for (i=-4; i<0; i++)
          prefetch(A[i+4]);
      /* Steady state: a single iteration */
      for (i=0; i<1; i++) {
          prefetch(A[i+4]);
          ... = foo(A[i], ...);
      }
      /* Epilogue: 4 of the 5 iterations */
      for (i=1; i<5; i++)
          ... = foo(A[i], ...);

- Most prefetches are issued "too early": there is not enough computation to hide the latency
- Simply translating iterations in the innermost loop is not sufficient

2D Illustration
- A large number of useless prefetches
  [figure: 2D iteration space, axes i and j]

Lexicographic Shift
- The main idea to improve prefetch placement
  [figure: 2D iteration space, axes i and j]

Outline
- Introduction
- Software Prefetching
- Code Generation
  - Polyhedral representation
  - Avoiding redundant prefetch
- Simple Example
- Summary and Next Steps

Prefetch Code Generation
- Main problem: placement
  - The prefetching distance is a manually given parameter in our work
  - Lexicographically shifting the iteration spaces
  - Avoiding prefetching the same array element twice
  - Avoiding prefetching the same cache line multiple times
- We use the polyhedral representation to handle these difficulties

Polyhedral Representation
- Represent the iteration space (of polyhedral programs) as a set of points (polyhedra)
- Simply a different view of the program: manipulate it in the convenient polyhedral representation, then generate loops back

      for (i=1; i<=N; i++)
          for (j=1; j<=i; j++)
              S0;

  Domain(S0) = [i,j] : 1≤j≤i≤N
  [figure: triangular set of points bounded by j=1, i=j, and i=N]

Transforming Iteration Spaces
- Expressed as affine transformations; shifting by a constant is even simpler
- Example: shift the iterations by 4 along j, i.e., (i,j → i,j-4)

  Domain(S0)  = [i,j] : 1≤j≤i≤N
  Domain(S0') = [i,j] : 1≤i≤N && -3≤j≤i-4
  [figure: the triangular domain and its shifted copy]
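In loop form, the shifted domain corresponds to code along these lines (our sketch; S0 stands for the abstract statement from the slide and recovers its original coordinates through j+4):

    void S0(int i, int j);  /* the abstract statement from the slide */

    /* Original nest:
     *     for (i = 1; i <= N; i++)
     *         for (j = 1; j <= i; j++)
     *             S0(i, j);
     * After (i,j -> i,j-4), the loops scan Domain(S0'). */
    void shifted(int N) {
        for (int i = 1; i <= N; i++)
            for (int j = -3; j <= i - 4; j++)
                S0(i, j + 4);
    }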

Lex. Shift as Affine Transform
- A piecewise affine transformation:
  - (i,j → i,j-1)   if j>1 or i=1
  - (i,j → i-1,i-1) if j=1 and i>1
- Apply it n times for prefetch distance n

  Domain(S0)  = [i,j] : 1≤j≤i≤N
  Domain(S0') = [i,j] : <complicated>
  [figure: the domain and its lexicographically shifted copy]
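A minimal sketch of this map as plain C (our illustration; the actual tool composes the piecewise affine function symbolically). It computes the iteration at which the prefetch for (i,j) must be issued, assuming the triangular domain 1≤j≤i≤N from the slide:

    /* One lexicographic step backwards: stay within the row when
     * possible, otherwise jump to the last point of the previous row. */
    static void lex_prev(int *i, int *j) {
        if (*j > 1 || *i == 1) {
            *j -= 1;          /* (i,j) -> (i, j-1) */
        } else {
            *j = *i - 1;      /* (i,1) -> (i-1, i-1) */
            *i -= 1;
        }
    }

    /* Issue point of the prefetch for iteration (i,j), distance d.
     * Points with *pi < 1 fall into the prologue. */
    static void prefetch_point(int i, int j, int d, int *pi, int *pj) {
        *pi = i;
        *pj = j;
        while (d-- > 0)
            lex_prev(pi, pj);
    }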

Avoiding Redundant Prefetch: Same Element
- Given:
  - the target array A
  - the set of statement instances that read from A
  - the array access function of each read access
- Find the set of first readers: the statement instances that first read each element of A
  - i.e., the lexicographic minimum among the instances accessing the same element

Avoiding Redundant Prefetch: Same Cache Line
- Let an element of array A be ¼ of a cache line; the first loop below then prefetches the same line 4 times
- We apply unrolling to avoid the redundancy

  Before:
      for (i ...) {
          prefetch(A[i+4]);
          ... = foo(A[i], ...);
      }

  After (unrolled by 4, one prefetch per line):
      for (i ...) {
          prefetch(A[i+4]);
          ... = foo(A[i],   ...);
          ... = foo(A[i+1], ...);
          ... = foo(A[i+2], ...);
          ... = foo(A[i+3], ...);
      }
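The same idea in compilable form (our sketch: the 4-elements-per-line ratio is the slide's assumption, and n is taken to be a multiple of 4 to keep the sketch short):

    #include <stddef.h>

    double foo(double x);  /* placeholder computation from the slides */

    /* One prefetch per cache line instead of one per element. */
    void compute_unrolled(const double *A, double *out, size_t n) {
        const size_t ELEMS_PER_LINE = 4;  /* slide's assumption */
        for (size_t i = 0; i + ELEMS_PER_LINE <= n; i += ELEMS_PER_LINE) {
            __builtin_prefetch(&A[i + ELEMS_PER_LINE], 0, 3);
            out[i]     = foo(A[i]);
            out[i + 1] = foo(A[i + 1]);
            out[i + 2] = foo(A[i + 2]);
            out[i + 3] = foo(A[i + 3]);
        }
    }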

Outline
- Introduction
- Software Prefetching
- Code Generation
- Simple Example
  - Simulation results
- Summary and Next Steps

Simple Example
- A contrived example that had better work
- Expectation: when M is small and N is large, the lexicographic shift should do better
- Variants compared:
  - unroll only
  - Mowry prefetching (shift in the innermost loop)
  - proposed (lexicographic shift)

      for (i=0; i<N; i+=1)
          for (j=0; j<M; j+=1)
              sum = sum + A[i][j];
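One way to realize the lexicographic shift on this rectangular nest is to linearize the 2D iteration space and issue, at each point, the prefetch for the iteration d steps ahead in lexicographic order (our own sketch for illustration; the polyhedrally generated code is more involved):

    /* A is the N-by-M array in row-major order, flattened; element k
     * corresponds to iteration (k/M, k%M).  With d > M, prefetches are
     * issued from previous rows, which the innermost-loop shift cannot do. */
    double lex_shift_sum(const double *A, long N, long M, long d) {
        double sum = 0.0;
        long total = N * M;
        for (long k = -d; k < total; k++) {
            if (k + d < total)
                __builtin_prefetch(&A[k + d], 0, 3);
            if (k >= 0)
                sum += A[k];
        }
        return sum;
    }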

Search for Simulators
- We need simulators to experiment with:
  - memory latency
  - number of outstanding prefetches
  - line size, and so on
- Tried many simulators:
  - XEEMU (Institute of Informatics, Hungary): Intel XScale
  - SoCLib (Lip6 / ANR project): SoC
  - gem5 (Univ. of Wisconsin, HP, Intel, ARM): Alpha/ARM/SPARC/x86
  - VEX (HP Labs): VLIW

Simulation with VEX (HP Labs)
- Miss penalty: 72 cycles
- We see what we expect ("effective" = number of useful prefetches)

                cycles   misses   prefetches   effective   speedup
  original      658k     2015     -
  unroll        610k                                       1.08
  mowry         551k     1020     2000         992         1.19
  lex. shift    480k     25       2001         1985        1.37

- But for more "complex" examples, the benefit diminishes (<3%)

Lex. Shift was Overkill
- Precision in placement does not necessarily translate into benefit
- Computationally very expensive
  - Raising the piecewise affine function to the power of the prefetch distance takes a lot of memory and time
- High control overhead
  - Code size more than doubles compared to translation in the innermost loop

Summary
- Prefetching doesn't work with short streams: the epilogue dominates
- Can we refine prefetch placement?
- Lexicographic shift:
  - increases the overlap of prefetch and computation
  - can be done with the polyhedral representation
  - but it didn't work: overkill

Another Possibility
- Translation in outer dimensions
- Keeps things simple by selecting one appropriate dimension to shift
- Works well for rectangular spaces
  [figure: 2D iteration space, axes i and j]
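For the rectangular example above, translating along the outer i dimension amounts to prefetching row i+1 while summing row i (our sketch; it assumes one row of computation is enough to hide the latency):

    /* A is the N-by-M array in row-major order, flattened. */
    double outer_shift_sum(const double *A, long N, long M) {
        double sum = 0.0;
        for (long i = 0; i < N; i++) {
            for (long j = 0; j < M; j++) {
                if (i + 1 < N)
                    __builtin_prefetch(&A[(i + 1) * M + j], 0, 3);
                sum += A[i * M + j];
            }
        }
        return sum;
    }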

Coarser Granularity
- The lexicographic shift at the statement-instance level seems too fine-grained
- Can it work with tiles? Prefetch for the next tile as you execute the current one
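As a sketch of what tile-granularity prefetching could look like in 1D (our own speculation on the question above; the tile size T and line size are illustrative parameters):

    #include <stddef.h>

    double foo(double x);  /* placeholder computation */

    /* While computing tile t, issue one prefetch per cache line of
     * tile t+1. */
    double tiled_sum(const double *A, size_t n) {
        const size_t T = 256;             /* illustrative tile size */
        const size_t ELEMS_PER_LINE = 8;  /* illustrative line size */
        double sum = 0.0;
        for (size_t t = 0; t < n; t += T) {
            size_t next_end = (t + 2 * T < n) ? t + 2 * T : n;
            for (size_t p = t + T; p < next_end; p += ELEMS_PER_LINE)
                __builtin_prefetch(&A[p], 0, 3);
            size_t end = (t + T < n) ? t + T : n;
            for (size_t k = t; k < end; k++)
                sum += foo(A[k]);
        }
        return sum;
    }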

Thank you