
High Performance Embedded Computing © 2007 Elsevier Lecture 11: Memory Optimizations Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte. Based on slides and textbook from Wayne Wolf.

© 2006 Elsevier Topics
- List scheduling
- Loop transformations
- Global optimizations
- Buffers, data transfers, and storage management
- Cache and scratch-pad optimizations
- Main memory optimizations

© 2006 Elsevier List Scheduling
Given a data flow graph (DFG), how do you assign (schedule) operations to particular slots? Schedule based on a priority function (see the sketch below):
- Compute the longest path to a leaf for each node
- Schedule the ready nodes with the longest paths first
- Keep track of readiness (based on result latency)
Goal: maximize ILP; fill as many issue slots with useful work as possible (minimize NOPs).
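A minimal sketch of the idea in C, assuming a DFG stored as an adjacency list. The node layout, field names, and single-issue machine model are illustrative assumptions, not from the lecture: priority is the longest latency-weighted path to a leaf, and each cycle the highest-priority ready node is issued.

    #include <stdio.h>

    #define MAX_NODES 16
    #define MAX_SUCCS 4

    /* Hypothetical DFG node layout (illustrative, not from the slides). */
    typedef struct {
        int succ[MAX_SUCCS];  /* indices of dependent (successor) nodes   */
        int nsucc;
        int latency;          /* cycles until this node's result is ready */
        int priority;         /* longest path to a leaf (computed below)  */
        int npred;            /* unscheduled predecessors remaining       */
        int ready_at;         /* earliest cycle this node may issue       */
        int done;
    } Node;

    /* Priority = latency-weighted longest path from node i to a leaf. */
    static int longest_path(Node *g, int i) {
        if (g[i].priority >= 0) return g[i].priority;   /* memoized */
        int best = 0;
        for (int s = 0; s < g[i].nsucc; s++) {
            int p = longest_path(g, g[i].succ[s]);
            if (p > best) best = p;
        }
        return g[i].priority = g[i].latency + best;
    }

    /* Single-issue list scheduler: each cycle, issue the ready node with
       the highest priority; a NOP is emitted when nothing is ready. */
    void list_schedule(Node *g, int n) {
        for (int i = 0; i < n; i++) {                   /* initialize */
            g[i].priority = -1; g[i].npred = 0;
            g[i].ready_at = 0;  g[i].done = 0;
        }
        for (int i = 0; i < n; i++)                     /* count predecessors */
            for (int s = 0; s < g[i].nsucc; s++)
                g[g[i].succ[s]].npred++;
        for (int i = 0; i < n; i++) longest_path(g, i);

        int remaining = n;
        for (int cycle = 0; remaining > 0; cycle++) {
            int pick = -1;
            for (int i = 0; i < n; i++)
                if (!g[i].done && g[i].npred == 0 && g[i].ready_at <= cycle &&
                    (pick < 0 || g[i].priority > g[pick].priority))
                    pick = i;
            if (pick < 0) { printf("cycle %d: NOP\n", cycle); continue; }
            printf("cycle %d: issue node %d\n", cycle, pick);
            g[pick].done = 1; remaining--;
            for (int s = 0; s < g[pick].nsucc; s++) {   /* wake successors */
                Node *d = &g[g[pick].succ[s]];
                d->npred--;
                if (cycle + g[pick].latency > d->ready_at)
                    d->ready_at = cycle + g[pick].latency;
            }
        }
    }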

© 2006 Elsevier List Scheduling Example
- Heuristic: no guarantee of optimality
- For parallel pipelines, account for structural hazards
Schedule trace (candidate list with priorities in parentheses; "nr" = not ready):
Pick 3: 3(4), 1(4), 2(3), 4(2), 5(0), 6(0)
Pick 1: 1(4), 2(3), 4(2), 5(0), 6(0)
Pick 2: 2(3), 4(2), 5(0), 6(0)
Pick 4: 4(2), 5(0), 6(0)
Pick 6 (5 nr): 5(0), 6(0)
Pick 5: 5(0)

© 2006 Elsevier Local vs. Global Scheduling
Single-entry, single-exit scope (e.g., a basic block):
- Limited opportunity, since a basic block typically holds only 4-5 instructions
Expand the scope:
- Unroll loop bodies, inline small functions
- Construct superblocks and hyperblocks, which are single-entry, multiple-exit sequences of blocks
Code motion across control-flow divergence:
- Speculative; consider safety (exceptions) and state
- Predication is useful to nullify wrong-path instructions

© 2006 Elsevier Memory-oriented optimizations
- Memory is a key bottleneck in many embedded systems.
- Memory usage can be optimized at any level of the memory hierarchy.
- Optimizations can target data or instructions.
- Global memory analysis can be particularly useful: it is important to size buffers between subsystems to avoid buffer overflow and wasted memory.

© 2006 Elsevier Loop transformations
- Data dependencies may be within a loop iteration or carried between iterations.
- Ideal loops are fully parallelizable.
- A loop nest has loops enclosed by other loops.
- A perfect loop nest has no conditional statements (see the example below).
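For concreteness, a hedged illustration (array names and sizes are made up): the first nest below is perfect, while the second has a statement between the loops and a conditional in the body, making it imperfect.

    #define N 64

    /* Perfect loop nest: every statement is inside the innermost loop. */
    void add_matrices(int c[N][N], const int a[N][N], const int b[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                c[i][j] = a[i][j] + b[i][j];
    }

    /* Imperfect loop nest: a statement between the two loops and a
       conditional inside the body. */
    void row_sums(int sum[N], const int a[N][N]) {
        for (int i = 0; i < N; i++) {
            sum[i] = 0;                /* sits outside the inner loop */
            for (int j = 0; j < N; j++)
                if (a[i][j] > 0)       /* conditional in the loop body */
                    sum[i] += a[i][j];
        }
    }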

© 2006 Elsevier Types of loop transformations
- Loop permutation changes the order of loops.
- Index rewriting changes the form of the loop indexes.
- Loop unrolling copies the loop body (see the sketch after this list).
- Loop splitting creates separate loops for operations in the loop body.
- Loop fusion (loop merging) combines loop bodies.
- Loop padding adds data elements to an array to change how the array maps into memory.
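As a hedged illustration of two of these transformations (the functions and array names are invented for this sketch), unrolling by a factor of two and loop splitting look like this:

    #define N 64   /* assume N is even for the unrolled version */

    /* Loop unrolling by 2: two copies of the body per iteration. */
    void scale_unrolled(int x[N], int k) {
        for (int i = 0; i < N; i += 2) {
            x[i]     *= k;
            x[i + 1] *= k;
        }
    }

    /* Loop splitting: one loop with two independent operations ... */
    void combined(int x[N], int y[N], const int a[N]) {
        for (int i = 0; i < N; i++) {
            x[i] = a[i] + 1;
            y[i] = a[i] * 2;
        }
    }

    /* ... becomes two loops, each touching fewer arrays per iteration. */
    void split(int x[N], int y[N], const int a[N]) {
        for (int i = 0; i < N; i++)
            x[i] = a[i] + 1;
        for (int i = 0; i < N; i++)
            y[i] = a[i] * 2;
    }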

© 2006 Elsevier Polytope model
- Commonly used to represent data dependencies in loop nests.
- Loop transformations can be modeled as matrix operations; each column represents iteration bounds.
[Figure: iteration space plotted on i and j axes]
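As a standard illustration (not reproduced from the slide's figure), loop interchange of a doubly nested loop corresponds to multiplying each iteration vector by a permutation matrix:

\[
\begin{pmatrix} i' \\ j' \end{pmatrix} =
\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}
\begin{pmatrix} i \\ j \end{pmatrix}
\]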

© 2006 Elsevier Loop permutation
- Changes the order of loop indices.
- Can reduce the time needed to access matrix elements: 2-D arrays in C are stored in row-major order, so access the data row by row.
- Example: matrix-vector multiplication (a sketch follows).
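A hedged sketch of that example (the slide's actual code was a figure; this version assumes y is zero-initialized by the caller): with j as the outer loop, each access to a[i][j] jumps a whole row ahead of the previous one; permuting the loops walks a sequentially through memory.

    #define N 256

    /* Column-order traversal of a row-major array: consecutive accesses
       to a[i][j] are N elements apart in memory. */
    void matvec_ji(double y[N], const double a[N][N], const double x[N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                y[i] += a[i][j] * x[j];
    }

    /* After loop permutation, a[i][j] is accessed row by row,
       i.e., sequentially in memory. */
    void matvec_ij(double y[N], const double a[N][N], const double x[N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                y[i] += a[i][j] * x[j];
    }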

© 2006 Elsevier Loop fusion combines loop bodies.

Original loops:

    for (i = 0; i < N; i++)
        x[i] = a[i] * b[i];
    for (i = 0; i < N; i++)
        y[i] = a[i] * c[i];

After loop fusion:

    for (i = 0; i < N; i++) {
        x[i] = a[i] * b[i];
        y[i] = a[i] * c[i];
    }

How might this help improve performance?

© 2006 Elsevier Buffer management [Pan01]
- In embedded systems, buffers are often used to communicate between subsystems.
- Excessive dynamic memory management wastes cycles and energy with no functional improvement.
- Many embedded programs use arrays that are statically allocated.
- Several loop transformations have been developed to make buffer management more efficient.

Before (two separate loop nests):

    for (i = 0; i < N; ++i)
        for (j = 0; j < N - L; ++j)
            b[i][j] = 0;
    for (i = 0; i < N; ++i)
        for (j = 0; j < N - L; ++j)
            for (k = 0; k < L; ++k)
                b[i][j] += a[i][j + k];

After (fused, so the definition and uses of b[i][j] are closer together):

    for (i = 0; i < N; ++i)
        for (j = 0; j < N - L; ++j) {
            b[i][j] = 0;
            for (k = 0; k < L; ++k)
                b[i][j] += a[i][j + k];
        }

© 2006 Elsevier Buffer management [Pan01]
- Loop analysis can help make data reuse more explicit.
- Buffers are declared in the program but need not exist in the final implementation.

    int a_buf[L];
    int b_buf;
    for (i = 0; i < N; ++i) {
        /* initialize a_buf */
        for (j = 0; j < N - L; ++j) {
            b_buf = 0;
            a_buf[(j + L - 1) % L] = a[i][j + L - 1];
            for (k = 0; k < L; ++k)
                b_buf += a_buf[(j + k) % L];
            b[i][j] = b_buf;
        }
    }

© 2006 Elsevier Cache optimizations [Pan97]
Strategies:
- Move data to reduce the number of conflicts.
- Move data to take advantage of prefetching.
Need:
- A load map.
- Information on access frequencies.

© 2006 Elsevier Cache conflicts
- Assume a direct-mapped cache of size C = 2^m with a cache line size of M words.
- Memory address A maps to cache line k = (A mod C)/M.
- If N is a multiple of C, then a[i], b[i], and c[i] all map to the same cache line.
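A small sketch of this mapping (the sizes and base addresses are illustrative assumptions): three arrays of N words placed back to back, with N a multiple of C, collide on every index.

    #include <stdio.h>

    #define C 2048  /* cache size in words (a power of two) */
    #define M 4     /* line size in words */

    /* Cache line that word address A maps to in a direct-mapped cache. */
    unsigned cache_line(unsigned A) {
        return (A % C) / M;
    }

    int main(void) {
        unsigned N = 4096;                 /* N is a multiple of C        */
        unsigned a = 0, b = N, c = 2 * N;  /* word addresses of a, b, c   */
        /* a[i], b[i], c[i] differ by multiples of C, so they collide. */
        for (unsigned i = 0; i < 3; i++)
            printf("i=%u: a->%u b->%u c->%u\n", i,
                   cache_line(a + i), cache_line(b + i), cache_line(c + i));
        return 0;
    }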

© 2006 Elsevier Reducing cache conflicts
- Could increase the cache size. Why might this be a bad idea?
- Instead, add L dummy words between adjacent arrays.
- Let f(x) denote the cache line to which program variable x is mapped. For L = M, with a[i] starting at address 0, we have f(b[i]) = f(a[i]) + 1 and f(c[i]) = f(a[i]) + 2 (modulo the number of lines), so the three references now map to distinct lines.
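A hedged sketch of the padding idea, with the same illustrative sizes as above and assuming the linker places the globals contiguously in declaration order (which the C standard does not guarantee):

    #define C 2048   /* cache words */
    #define M 4      /* line size in words */
    #define N 4096   /* multiple of C, so unpadded arrays would collide */

    /* Inserting M dummy words between arrays shifts each array by one
       cache line relative to its neighbor. */
    double a[N];
    double pad1[M];
    double b[N];
    double pad2[M];
    double c[N];

    void vadd(void) {
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];   /* three distinct cache lines per i */
    }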

© 2006 Elsevier Scalar variable placement
Place scalar variables to improve locality and reduce cache conflicts:
- Build a closeness graph that indicates the desirability of keeping sets of variables close in memory (M adjacent words are read on a single miss).
- Group variables into M-word clusters.
- Build a cluster interference graph, which indicates which clusters map to the same cache line.
- Use the interference graph to optimize placement, trying to avoid interference.

© 2006 Elsevier Constructing the closeness graph
Generate an access sequence:
- Create a node for each memory access in the code
- A directed edge between nodes indicates successive accesses
- Edges in loops are weighted by the number of loop iterations
Use the access sequence to construct the closeness graph:
- Connect nodes within distance M of each other (e.g., x to b is distance 4, counting both x and b)
- Link weights give the number of times control flows between the nodes
- Requires O(Mn^2) time for n nodes

© 2006 Elsevier Group variables into M-word clusters
Determine which variables to place on the same line:
- Put variables that are frequently accessed close together in time on the same line. Why?
- Form clusters to maximize the total weight of edges within all the clusters
- The greedy AssignClusters algorithm has complexity O(Mn^2)
- In the previous example, M = 3 and n = 9

© 2006 Elsevier Build the cluster interference graph
Identify clusters that should not map to the same line:
- Convert the variable access sequence to a cluster access sequence
- An edge weight corresponds to the number of times accesses to the two clusters alternate along the execution path
- Cluster pairs with high weights should not be mapped to the same line

© 2006 Elsevier Assigning memory locations to clusters
Find an assignment of the clusters in a CIG to memory locations such that MemAssignCost(CIG) is minimized.
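The cost expression itself appeared as a figure; a plausible reconstruction from the CIG definition above (my formulation, hedged, not verbatim from [Pan97]): the cost of an assignment is the total interference weight between cluster pairs that land on the same cache line,

\[
\mathrm{MemAssignCost}(CIG) = \sum_{\substack{(c_i, c_j) \in E \\ f(c_i) = f(c_j)}} w(c_i, c_j)
\]

where E and w are the CIG's edges and weights, and f(c) is the cache line to which cluster c is assigned.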

© 2006 Elsevier Array placement
- Focus on arrays accessed in innermost loops. Why?
- Arrays are placed to avoid conflicting accesses with other arrays.
- Don't worry about clustering, but still construct the interference graph; its edges are dictated by the array bounds.

© 2006 Elsevier Avoid conflicting memory locations
Given addresses X and Y, and a cache with k lines each holding M words, when do X and Y map to the same cache line?
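The formula was shown as a figure; reconstructed from the definitions above (hedged): X and Y conflict exactly when their line indices agree,

\[
\left\lfloor \frac{X}{M} \right\rfloor \bmod k \;=\; \left\lfloor \frac{Y}{M} \right\rfloor \bmod k
\]

equivalently, when ⌊X/M⌋ and ⌊Y/M⌋ differ by a multiple of k.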

© 2006 Elsevier Array assignment algorithm [Pan97] © 1997 IEEE

Results from data placement [Pan97] © 2006 Elsevier
- Data cache hit rates improve by 48% on average
- Average speedup of 34%
- Results are for a 256-byte cache on kernel benchmarks

© 2006 Elsevier On-Chip vs. Off-Chip Data Placement
[Pan00] explores how to partition data between on-chip and off-chip memory to optimize performance.

© 2006 Elsevier On-Chip vs. Off-Chip Data Placement
- Allocate static variables at compile time.
- Map all scalar values and constants to the scratch pad.
- Map all arrays too large for the scratch pad into DRAM.
- Only arrays with intersecting lifetimes will have conflicts.
- Calculate several parameters:
  - VAC(u): variable access count, how many times u is accessed.
  - IAC(u): interference access count, how many times other variables are accessed during u's lifetime.
  - IF(u): total interference count = VAC(u) + IAC(u).

© 2006 Elsevier On-Chip vs. Off-Chip Data Placement
Also need to calculate LCF(u), the loop conflict factor, where:
- p is the number of loops that access u
- k(u) is the number of accesses to u
- K(v) is the number of accesses to variables other than u
And TCF(u), the total conflict factor (see the formulas below).
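The formulas appeared as figures on the slide; a hedged reconstruction consistent with the definitions above and with [Pan00] (the subscript i, restricting each count to loop i, is my notation):

\[
\mathrm{LCF}(u) = \sum_{i=1}^{p} \bigl( k_i(u) + K_i(v) \bigr), \qquad
\mathrm{TCF}(u) = \mathrm{LCF}(u) + \mathrm{IF}(u)
\]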

© 2006 Elsevier Scratch pad allocation formulation
AD(c): the access density of c (see the reconstruction below).
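Again the formula itself was a figure; a hedged reconstruction: access density divides the total conflict factor by the space the variable occupies, so the allocator favors variables with high conflict per word of scratch pad,

\[
\mathrm{AD}(c) = \frac{\mathrm{TCF}(c)}{\mathrm{size}(c)}
\]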

© 2006 Elsevier Scratch pad allocation algorithm [Pan00] © 2000 ACM Press

© 2006 Elsevier Scratch pad allocation performance [Pan00] © 2000 ACM Press

© 2006 Elsevier Main memory-oriented optimizations
Memory chips provide several useful modes:
- Burst mode accesses sequential locations: provide a start address and a length; fewer addresses are sent, and the transfer rate increases.
- Paged mode allows only part of the address to be transmitted: the address is split into a page number and an offset, and the page number is stored in a register to quickly access values in the same page.
Access times depend on the address(es) being accessed, as the worked example below illustrates.
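Illustrative arithmetic with assumed timings (4 cycles per address transfer, 1 cycle per data word; these numbers are made up for the sketch). Reading 16 sequential words one address at a time versus as a single 16-word burst:

\[
t_{\text{single}} = 16 \times (4 + 1) = 80 \text{ cycles}, \qquad
t_{\text{burst}} = 4 + 16 = 20 \text{ cycles}
\]

a 4x reduction, purely from sending one address instead of sixteen.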

Banked memories © 2006 Elsevier
Banked memories allow multiple memory banks to be accessed in parallel.