Optimizing Memory Accesses for Spatial Computation Mihai Budiu, Seth Goldstein CGO 2003.

Presentation transcript:

Optimizing Memory Accesses for Spatial Computation Mihai Budiu, Seth Goldstein CGO 2003

2 Optimizing Memory Accesses for Spatial Computation [Figure labels: Program, Compiler]

3 Why at CGO? [Figure labels: C, Predicated IR, Optimized IR, This work]

4 Optimizing Memory Accesses for Spatial Computation This paper describes compiler representations and algorithms to: increase memory access parallelism; remove redundant memory accesses. [Figure: memory operations =*q, *p=, =a[i], =*p laid out along a Time axis]

5 Intermediate Representation Traditionally: CFG. Our proposal: SSA + predication; uniform for scalars and memory; explicitly encode may-depend; summarize control-flow; executable. [Figure legend: def-use edges, may-dep. edges]

6 Contributions Predicated SSA optimizations for memory –Boolean manipulation instead of CFG dependences –Powerful term-rewriting optimizations for memory –Simple to implement and reason about Expose memory parallelism in loops –New loop pipelining techniques –New parallelization method: loop decoupling

7 Outline Introduction Program representation Redundant memory operation removal Pipelining memory accesses in loops Conclusions

8 Executable SSA if (x) y = x*2; else y++; Program representation is a graph: nodes = operations, edges = values. [Figure: dataflow graph of this code; node labels: *, +, !, 2, 1, x, y]
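As a rough illustration of "nodes = operations, edges = values", the C sketch below evaluates both arms of the slide's branch and lets the predicate pick one. The function names are hypothetical, and the conditional expression is only an assumed stand-in for the merge in the slide's graph.

```c
#include <assert.h>
#include <stdbool.h>

/* Source form from the slide: control flow picks one assignment. */
int with_branch(int x, int y) {
    if (x) y = x * 2;
    else   y++;
    return y;
}

/* Dataflow form: each local is the value carried by one edge of the graph.
 * Both arms are computed; the predicate steers a mux (assumed stand-in
 * for the merge node in the slide's figure). */
int as_dataflow(int x, int y) {
    bool pred_x = (x != 0);          /* predicate node             */
    int  mul    = x * 2;             /* "*" node with constant 2   */
    int  add    = y + 1;             /* "+" node with constant 1   */
    int  result = pred_x ? mul : add;
    assert(result == with_branch(x, y));  /* sanity check of the sketch */
    return result;
}
```

The sketch ignores signed overflow, so it should only be tried with small inputs.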

9 Predication Source: …=*p; if (x) …=*q; else *r = …; Predicated: (1) …=*p; (x) …=*q; (!x) *r = …; Predicates encode control-flow. Hyperblock ⇒ branch-free code. Caveat: all optimizations on hyperblock scope.
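The predicated form on this slide can be mimicked in plain C. The sketch below is an assumption-laden model, not the compiler's IR: `pred_load` and `pred_store` are hypothetical helpers that take effect only when their predicate is true, and `forms_agree` simply compares the branchy and branch-free versions.

```c
#include <assert.h>
#include <stdbool.h>

/* Control-flow form: a load, then a branch choosing a load or a store. */
int branchy(bool x, const int *p, const int *q, int *r, int v) {
    int acc = *p;
    if (x) acc += *q;
    else   *r = v;
    return acc;
}

/* Hypothetical predicated operations: no effect unless pred is true. */
static int pred_load(bool pred, const int *addr) {
    assert(!pred || addr);           /* an executed load needs an address */
    return pred ? *addr : 0;
}

static void pred_store(bool pred, int *addr, int v) {
    if (pred) *addr = v;
}

/* Hyperblock form: predicates (1), (x), (!x) guard the three operations. */
int predicated(bool x, const int *p, const int *q, int *r, int v) {
    int acc = pred_load(true, p);
    acc += pred_load(x, q);
    pred_store(!x, r, v);
    return acc;
}

/* Runs both forms on identical inputs and reports whether they agree. */
bool forms_agree(bool x) {
    int p = 1, q = 2, r1 = 0, r2 = 0;
    int a = branchy(x, &p, &q, &r1, 9);
    int b = predicated(x, &p, &q, &r2, 9);
    return a == b && r1 == r2;
}
```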

10 Read-write Sets Memory *p=…; if (x) …=*q; else *r = …; Entry Exit

11 Token Edges Memory *p=…; if (x) …=*q; else *r = …; Entry Exit

12 Tokens ≈ SSA for Memory *p=…; if (x) …=*q; else *r = …; [Figure: two token graphs for this code, each starting at Entry]

13 Meaning of Token Edges Token edge between *p=… and …=*q: maybe dependent, no intervening memory operation. No token edge between *p=… and …=*q: independent. Token graph is maintained transitively reduced: focuses the optimizer; linear space complexity in practice.
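To make "transitively reduced" concrete, here is a generic textbook-style sketch of transitive reduction on a small acyclic graph stored as an adjacency matrix. It is not the compiler's implementation; `NOPS`, `transitive_reduce`, and `reduce_example` are names invented for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NOPS 4   /* hypothetical number of memory operations */

/* Drops every edge i->j that is implied by a longer path i->k->...->j.
 * Assumes edge[][] describes an acyclic graph. */
void transitive_reduce(bool edge[NOPS][NOPS]) {
    bool reach[NOPS][NOPS];

    for (int i = 0; i < NOPS; i++)
        for (int j = 0; j < NOPS; j++)
            reach[i][j] = edge[i][j];

    /* Warshall's algorithm: reach[i][j] becomes true if any path exists. */
    for (int k = 0; k < NOPS; k++)
        for (int i = 0; i < NOPS; i++)
            for (int j = 0; j < NOPS; j++)
                if (reach[i][k] && reach[k][j]) reach[i][j] = true;

    for (int i = 0; i < NOPS; i++) {
        assert(!reach[i][i]);        /* acyclic input expected */
        for (int j = 0; j < NOPS; j++) {
            if (!edge[i][j]) continue;
            for (int k = 0; k < NOPS; k++) {
                if (k != j && reach[i][k] && reach[k][j]) {
                    edge[i][j] = false;   /* implied by path through k */
                    break;
                }
            }
        }
    }
}

/* Chain 0->1->2->3 plus redundant shortcuts 0->2 and 0->3. */
bool reduce_example(void) {
    bool e[NOPS][NOPS] = {{false}};
    e[0][1] = e[1][2] = e[2][3] = true;
    e[0][2] = e[0][3] = true;
    transitive_reduce(e);
    return e[0][1] && e[1][2] && e[2][3] && !e[0][2] && !e[0][3];
}
```

After the reduction only the chain edges remain, yet every operation reachable before is still reachable.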

14 Outline Introduction Program Representation Redundant memory operation removal –Dead code elimination –Load || load –Store ⇒ load –Store ⇒ store –Useless token removal –... Pipelining memory accesses in loops Evaluation Conclusions

15 Dead Code Elimination *p=… (false)

16 ≈ PRE Before: …=*p (p1) and …=*p (p2). After: a single …=*p (p1 ∨ p2). This corresponds in the CFG to lifting the load to a basic block dominating the original loads.
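The merge on this slide can be checked by brute force over the four predicate combinations. The sketch below is a model under stated assumptions: no store to the same address sits between the two loads, and `pred_load`, `load_count`, and `merge_is_safe` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>

static int load_count;   /* number of loads actually executed in this model */

/* Hypothetical predicated load: touches memory only when pred is true. */
static int pred_load(bool pred, const int *addr) {
    if (!pred) return 0;
    assert(addr);
    load_count++;
    return *addr;
}

/* Before: two loads of the same address under predicates p1 and p2. */
void two_loads(bool p1, bool p2, const int *p, int *v1, int *v2) {
    *v1 = pred_load(p1, p);
    *v2 = pred_load(p2, p);
}

/* After: one load under (p1 || p2) feeds both consumers. */
void merged_load(bool p1, bool p2, const int *p, int *v1, int *v2) {
    int v = pred_load(p1 || p2, p);
    *v1 = v;
    *v2 = v;
}

/* True if every consumer whose predicate holds sees the same value and the
 * merged form never executes more loads than the original. */
bool merge_is_safe(void) {
    int cell = 42;
    for (int m = 0; m < 4; m++) {
        bool p1 = (m & 1) != 0, p2 = (m & 2) != 0;
        int a1, a2, b1, b2;

        load_count = 0;
        two_loads(p1, p2, &cell, &a1, &a2);
        int before = load_count;

        load_count = 0;
        merged_load(p1, p2, &cell, &b1, &b2);
        int after = load_count;

        if (p1 && a1 != b1) return false;
        if (p2 && a2 != b2) return false;
        if (after > before) return false;
    }
    return true;
}
```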

17 Forwarding Data (St ⇒ Ld) Before: *p=… (p1) followed by …=*p (p2). After: *p=… (p1) and …=*p (p2 ∧ ¬p1). Load is executed only if store is not.

18 Forwarding Data (2) Before: *p=… (p1) followed by …=*p (p2). After: *p=… (p1) and …=*p (false). When p2 ⇒ p1 the load becomes dead, i.e., when store dominates load in CFG.
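The two forwarding slides reduce to one Boolean rewrite: the load keeps predicate p2 ∧ ¬p1 and otherwise takes the stored value directly. A minimal C model follows, assuming the store and load address the same location with nothing in between; all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Before: store under p1, then load of the same address under p2. */
int store_then_load(bool p1, bool p2, int *p, int v) {
    if (p1) *p = v;
    return p2 ? *p : 0;
}

/* After: the load runs only under (p2 && !p1); otherwise the stored value
 * is forwarded. *loads counts how many loads really executed. */
int forwarded(bool p1, bool p2, int *p, int v, int *loads) {
    assert(p && loads);
    if (p1) *p = v;
    bool load_pred = p2 && !p1;
    int loaded = 0;
    if (load_pred) { loaded = *p; (*loads)++; }
    int value = p1 ? v : loaded;     /* mux: forwarded or loaded value */
    return p2 ? value : 0;
}

/* Exhaustive check over all predicate combinations. */
bool forwarding_is_safe(void) {
    for (int m = 0; m < 4; m++) {
        bool p1 = (m & 1) != 0, p2 = (m & 2) != 0;
        int cell_a = 7, cell_b = 7, loads = 0;
        int a = store_then_load(p1, p2, &cell_a, 99);
        int b = forwarded(p1, p2, &cell_b, 99, &loads);
        if (a != b || cell_a != cell_b) return false;
        if (p1 && loads != 0) return false;  /* store ran, so no load ran */
    }
    return true;
}
```

If p2 implies p1, `load_pred` is false for every input, which is the dead-load case of slide 18.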

19 Store-store (1) Before: *p=… (p1) followed by *p=… (p2). After: *p=… (p1 ∧ ¬p2) followed by *p=… (p2). When p1 ⇒ p2 the first store becomes dead, i.e., when second store post-dominates first in CFG.

20 Store-store (2) Before: *p=… (p1) followed by *p=… (p2). After: *p=… (p1 ∧ ¬p2) and *p=… (p2). Token edge eliminated, but transitive closure of tokens preserved.
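The store-store rewrite can be modeled the same way. This sketch assumes two stores to one address with no load of that address between them; the function names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Before: two stores to the same address; returns how many executed. */
int stores_before(bool p1, bool p2, int *p, int v1, int v2) {
    assert(p);
    int n = 0;
    if (p1) { *p = v1; n++; }
    if (p2) { *p = v2; n++; }
    return n;
}

/* After: the first store is guarded by (p1 && !p2). */
int stores_after(bool p1, bool p2, int *p, int v1, int v2) {
    assert(p);
    int n = 0;
    if (p1 && !p2) { *p = v1; n++; }
    if (p2)        { *p = v2; n++; }
    return n;
}

/* Final memory must match, and at most one store may execute afterwards. */
bool store_store_is_safe(void) {
    for (int m = 0; m < 4; m++) {
        bool p1 = (m & 1) != 0, p2 = (m & 2) != 0;
        int cell_a = 0, cell_b = 0;
        int before = stores_before(p1, p2, &cell_a, 11, 22);
        int after  = stores_after(p1, p2, &cell_b, 11, 22);
        if (cell_a != cell_b) return false;
        if (after > before || after > 1) return false;
    }
    return true;
}
```

When p1 implies p2, the guard `p1 && !p2` is never true, so the first store is dead, as slide 19 states.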

21 Key Observation The control-dependence tests and transformations (i.e., dominance, post-dominance) are carried by simple predicate Boolean manipulations.

22 Implementation Is Clean
Optimization: LOC
Useless dependence removal: 160
Immutable loads: 70
Dead-code elimination (incl. memory op): 66
Load-after-load and store-after-store removal: 153
Redundant load and store removal: 94
Transitive reduction of token edges: 61
Loop-invariant scalar & load discovery: 74

23 Operations Removed (static data) [Chart: percent of operations removed; benchmark suites: Mediabench, SpecInt95]

24 Operations Removed (dynamic data) [Chart: percent of operations removed; benchmark suites: Mediabench, SpecInt95]

25 Outline Introduction Program Representation Redundant memory operation removal Pipelining memory accesses in loops Conclusions

26 Loop Pipelining …=*in++; *out++ =… 1 loop ⇒ 2 loops, which can slip with respect to each other. "in" slips ahead of "out" ⇒ pipelining of the loop body.

27 One Token Loop Per Object extern int a[ ]; void g(int* p) { int i; for (i=0; i < N; i++) a[i] += *p; } [Figure: one token loop for a[ ] with operations =*a and *a=, and one for "other" with operation =*p]

28 Inter-iteration Dependences [Figure: token loops for "a" and "other" with operations =*p, =*a, *a=; labels: all accesses prior to current iteration, all accesses after current iteration]

29 Monotone Addresses *a++= a[1] must receive token from a[0], but these are independent! [Figure labels: generator, collector]

30 Loop Decoupling: Motivation for (i=0; i < N; i++) { a[i] = …; … = a[i+3]; } [Figure: token loop for a with operations a[i]= and =a[i+3]; label: independent]

31 Loop Decoupling for (i=0; i < N; i++) { a[i] = …; … = a[i+3]; } Slip control: the token generator tk(3) emits 3 tokens instantly. It allows the a0 loop (a[i]=) to slip at most 3 iterations ahead of the a3 loop (=a[i+3]).
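The slip bound can be simulated sequentially. In the sketch below, a counter that starts at 3 plays the role of tk(3): the store loop takes a token per iteration, and the load loop returns one. This is a software model under assumptions, not the hardware mechanism; `N`, `SLIP`, `produce`, and the greedy schedule are all invented for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define N    8   /* hypothetical trip count                     */
#define SLIP 3   /* distance between a[i] and a[i+3] on the slide */

static int produce(int i) { return 100 + i; }   /* hypothetical stored value */

/* Original loop: store a[i], then load a[i+3], in lock step. */
void sequential(int a[N + SLIP], int out[N]) {
    for (int i = 0; i < N; i++) {
        a[i] = produce(i);
        out[i] = a[i + SLIP];
    }
}

/* Decoupled loops: the store loop runs as far ahead as tokens allow. */
void decoupled(int a[N + SLIP], int out[N]) {
    int tokens = SLIP;      /* the generator's initial tokens          */
    int s = 0, l = 0;       /* next store iteration, next load iteration */
    while (s < N || l < N) {
        if (s < N && tokens > 0) {
            a[s] = produce(s);          /* store loop consumes a token  */
            s++; tokens--;
        } else {
            assert(l < N);
            out[l] = a[l + SLIP];       /* load loop returns a token    */
            l++; tokens++;
        }
    }
}

/* Compares memory and loaded values of the two schedules. */
bool decoupling_matches(void) {
    int a1[N + SLIP], a2[N + SLIP], o1[N], o2[N];
    for (int i = 0; i < N + SLIP; i++) a1[i] = a2[i] = i;
    sequential(a1, o1);
    decoupled(a2, o2);
    return memcmp(a1, a2, sizeof a1) == 0 && memcmp(o1, o2, sizeof o1) == 0;
}
```

The invariant is `tokens == SLIP + l - s`, so store iteration `s` can run only after load iteration `s - 3` has read `a[s]`.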

32 Performance Impact of Memory Optimizations [Chart: speed-up vs. no memory optimizations; benchmark suites: Mediabench, SpecInt95]

33 Conclusions Tokens = compact representation of memory dependences Explicit dependences enable easy & powerful optimizations Simple predicate manipulation replaces control-flow transforms Fine-grain dependence information enables loop pipelining Token generators + loop decoupling = dynamic slip control

34 Backup Slides Compilation speed Compiler structure Tokens in hardware Cycle-free condition How performance is evaluated Sources of performance Aren't these optimizations well known? Computing predicates

35 Compilation Speed On average 3.5x slower than gcc -O3. Max 10x slower. We do intra-procedural pointer analysis, but no scheduling or register allocation.

36 Compiler Structure Suif CC C/FORTRAN low Suif IR Pointer analysis Live var. analysis CFG construction Unreachable code Build hyperblocks Ctrl dominance Path predicates high Suif IR inlining unrolling call-graph Pegasus (Predicated SSA) call-graph C circuit simulation Verilog CSE Dead-code PRE Induction variables Strength reduction Loop-invariant lift Reassociation Memory optimization Constant propagation Constant folding Unreachable code

37 Tokens in Hardware Tokens are actual operation inputs and outputs. Operation waits for token to execute. Output token released as soon as side-effect certain. [Figure labels: Load, add, data, pred, token, Memory, LSQ]

38 Cycle-free Condition Before: …=*p (p1) and …=*p (p2). After: …=*p (p1 ∨ p2). Requires a reachability computation to test. Using memoization, complexity is amortized constant.

39 How Performance Is Evaluated C Unlimited ILP LSQ limited BW (2 words/c) L1 8K L2 1/4M Mem

40 Sources of Performance Removal of redundant operations More freedom in scheduling Pipelining loops

41 Aren't These Opts. Well Known? Compilers: gcc -O3, Pentium; Sun Workshop CC -xO5, Sparc; DEC cc -O4, Alpha; MIPSpro cc -O4, SGI; SGI ORC -O4, Itanium; IBM cc -O3, AIX; our compiler. void f(unsigned *p, unsigned a[], int i) { if (p) a[i] += *p; else a[i] = 1; a[i] <<= a[i+1]; } Only ones to remove accesses to a[i]
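For reference, here is what removing the redundant accesses to a[i] could look like when done by hand. This is a sketch, not the compiler's actual output: it assumes the addend on the slide is `*p`, and `f_reduced` and `f_versions_agree` are hypothetical names. The intermediate store to a[i] goes away because the final store always follows it, and the second load of a[i] is replaced by the value just computed.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* The slide's function (addend assumed to be *p): two loads of a[i] on the
 * pointer path and two stores to a[i] on every path. */
void f(unsigned *p, unsigned a[], int i) {
    if (p) a[i] += *p;
    else   a[i] = 1;
    a[i] <<= a[i + 1];
}

/* Hand-reduced sketch: at most one load of a[i] and exactly one store. */
void f_reduced(unsigned *p, unsigned a[], int i) {
    unsigned t;
    assert(a[i + 1] < sizeof(unsigned) * CHAR_BIT);  /* shift must be defined */
    if (p) t = a[i] + *p;
    else   t = 1;
    a[i] = t << a[i + 1];
}

/* Compares both versions with and without a pointer argument. */
bool f_versions_agree(bool with_p) {
    unsigned x[2] = {5, 3}, y[2] = {5, 3};
    unsigned addend = 2;
    f(with_p ? &addend : NULL, x, 0);
    f_reduced(with_p ? &addend : NULL, y, 0);
    return x[0] == y[0] && x[1] == y[1];
}
```

The rewrite stays valid even if `p` points into `a`, since `*p` is read before any store in both versions.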

42 Computing Predicates Correct for irreducible graphs Correct even when speculatively computed Can be eagerly computed

43 Spatial Computation