Temporal Distribution Based Software Cache Partition To Reduce I-Cache Misses
Xiaomi An, Jiqiang Song, Wendong Wang
SimpLight Nanoelectronics Ltd
2008/03/24

SimpLight Confidential, Patent pending

Outline
• Traditional code layout optimizations
• Code layout optimizations in Open64 compiler
• Temporal distribution based software cache partition to reduce I-cache misses
• Future work

Traditional code layout optimizations
• Code layout is an optimization that changes how code is organized in memory.
• Main benefits of code layout:
  - Improve branch prediction through the placement of basic blocks
  - Reduce I-cache misses by changing the code's mapping onto the cache (mainly compulsory misses and conflict misses)
  - Fit code into a complex memory hierarchy (e.g. scratch-pad memory and cache)

Traditional code layout optimizations (continued)
• Representation of temporal relationship:
  - Control flow graph with edge frequency
  - Weighted call graph
  - Temporal relation graph
• Consideration of cache architecture:
  - Linearize code without considering the cache architecture (Pettis and Hansen)
  - Distribute temporally interleaved code onto different cache lines (Hashemi, Gloy, etc.)

Code layout optimizations in Open64 compiler
• Profile-based basic block reordering and procedure splitting in CG
  - Based on the control flow graph with edge frequency
  - Pettis and Hansen based algorithm
• Procedure reordering in IPA
  - Based on the weighted call graph with call-edge frequency
  - A variant of the Pettis and Hansen algorithm
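The chain-building idea behind Pettis and Hansen style basic block reordering can be sketched as follows. This is a simplified illustration, not the Open64 CG implementation: the function name `order_basic_blocks`, the edge dictionary, and the example block names are all hypothetical, and the final ordering of chains (which the original algorithm chooses by connectivity) is reduced here to first-seen order.

```python
def order_basic_blocks(edges):
    """Greedy bottom-up chaining over a profiled control flow graph.

    edges maps (src, dst) -> execution frequency (hypothetical profile data).
    Returns a list of block names in layout order.
    """
    # Every block starts as its own single-element chain.
    chain_of = {}
    for src, dst in edges:
        for block in (src, dst):
            chain_of.setdefault(block, [block])

    # Visit edges from hottest to coldest. Merge two chains only when the
    # edge runs from the tail of one chain to the head of another, so the
    # hot branch becomes a fall-through.
    for (src, dst), _freq in sorted(edges.items(), key=lambda kv: -kv[1]):
        a, b = chain_of[src], chain_of[dst]
        if a is not b and a[-1] == src and b[0] == dst:
            a.extend(b)
            for block in b:
                chain_of[block] = a

    # Simplification: emit chains in the order their first block was seen.
    seen, layout = set(), []
    for block in chain_of:
        chain = chain_of[block]
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout


example_edges = {
    ("entry", "hot"): 90,
    ("entry", "cold"): 10,
    ("hot", "exit"): 90,
    ("cold", "exit"): 10,
}
layout = order_basic_blocks(example_edges)
```

In the example, the hot path entry, hot, exit ends up contiguous and the cold block is pushed to the end, which is the effect the slide attributes to profile-based reordering.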

Software cache partition
• What is software cache partition? Through code layout optimization, different code blocks are mapped to different regions of the I-cache.
• Benefits of software cache partition:
  - Reduce cache misses
  - Remove interference between multiple programs and avoid additional hardware support (embedded systems)
  - Soft implementation of scratch-pad memory on top of the I-cache
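How a link address decides which I-cache region a code block occupies can be illustrated with a small sketch. It assumes a direct-mapped cache; the line size, line count, addresses, and the helper name `cache_lines` are hypothetical values chosen for the example, not parameters from the slides.

```python
def cache_lines(addr, size, line_size=32, num_lines=256):
    """Return the set of cache line indices touched by a code block.

    Assumes a direct-mapped cache where
    index = (address // line_size) % num_lines.
    """
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return {line % num_lines for line in range(first, last + 1)}


# Hypothetical 8 KiB cache (256 lines of 32 bytes) and 1 KiB code blocks.
region_a = cache_lines(0x0000, 1024)   # lines 0..31
region_b = cache_lines(0x2400, 1024)   # lines 32..63, a separate region
conflict = cache_lines(0x2000, 1024)   # wraps around onto lines 0..31
```

Placing the second block at 0x2400 keeps it out of the first block's region, while placing it at 0x2000 (exactly one cache size away) makes the two blocks evict each other. A layout pass achieves the partition purely by choosing such addresses.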

Benefits of software cache partition (1)
• Remove interference between multiple programs and avoid additional hardware support
  Figure: the I-cache divided between a video app and an audio app.
  The I-cache is partitioned according to the performance demands and code locality of the video application and the audio application.

Benefits of software cache partition (2)
• Soft implementation of scratch-pad memory on top of the I-cache
  Figure: the I-cache divided between code with real-time requirements and other code.
  The I-cache is partitioned to guarantee that code with real-time requirements is not replaced once it has been brought into the cache.

Benefits of software cache partition (3)
• Reduce I-cache misses
  Runtime trace of code blocks: ABCDEF (UV)^5 ABCDEF (PQ)^5 ABCDEF (XY)^5 ABCDEF, where (UV)^5 means UV repeated five times.
  Figure: two mappings of the blocks onto six cache lines (blocks joined by "/" share a line):
  - Layout 1 (24 misses): A/E, B/F, C, D, U/P/X, V/Q/Y
  - Layout 2 (18 misses): A, B, C, D, E/U/P/X, F/V/Q/Y
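The two miss counts can be checked with a short simulation. This is a sketch under stated assumptions: the cache is direct-mapped, each code block fills exactly one cache line, and the names `build_trace`, `count_misses`, `SHARED_HOT`, and `EXCLUSIVE_HOT` are introduced here for illustration only.

```python
def build_trace():
    """Trace from the slide: ABCDEF (UV)^5 ABCDEF (PQ)^5 ABCDEF (XY)^5 ABCDEF."""
    trace = []
    for pair in ("UV", "PQ", "XY"):
        trace += list("ABCDEF") + list(pair) * 5
    return trace + list("ABCDEF")


def count_misses(trace, layout):
    """Count misses for a layout given as one string of block names per line.

    Assumes a direct-mapped cache in which every block occupies one line.
    """
    line_of = {block: i for i, blocks in enumerate(layout) for block in blocks}
    resident = [None] * len(layout)   # block currently held by each line
    misses = 0
    for block in trace:
        line = line_of[block]
        if resident[line] != block:
            misses += 1
            resident[line] = block
    return misses


# Hot blocks A, B share lines with hot blocks E, F.
SHARED_HOT = ["AE", "BF", "C", "D", "UPX", "VQY"]
# Hot blocks A..D keep lines to themselves; E, F share with the bursty blocks.
EXCLUSIVE_HOT = ["A", "B", "C", "D", "EUPX", "FVQY"]

trace = build_trace()
misses_shared = count_misses(trace, SHARED_HOT)        # 24
misses_exclusive = count_misses(trace, EXCLUSIVE_HOT)  # 18
```

The difference comes from the repeated ABCDEF passes: when A/E and B/F share lines, every pass after the first costs four misses instead of two.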

Temporal distribution based layout of code blocks in the partitioned cache
• Selection of good candidates for holding cache lines exclusively: hot, dense, and temporal distribution
  Figure (mapping into I-cache): code blocks are labeled "Hot, dense and good regularity", "Hot and good locality", and "Cold"; annotation: "Share cache lines".

Temporal distribution
• Temporal locality and temporal regularity
  Trace: ABCDEF (UV)^5 ABCDEF (PQ)^5 ABCDEF (XY)^5 ABCDEF
  - A, B, C, D, E, F have good temporal regularity, since they are uniformly distributed along the trace.
  - U, V, P, Q, X, Y have good temporal locality, since their reference distribution is heavily skewed.
  Figure (our mapping): A, B, C, D are placed in their own cache lines, while U/V, P/Q, X/Y, and E/F share cache lines. Total: 18 misses.

Qualification of temporal distribution
• Variance of reuse distance
• Weighted temporal distribution
• Temporal distribution
[Formulas not preserved in this transcript.]
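The slide names the variance of reuse distance, but no formula appears in this transcript. As a rough illustration only, the sketch below takes reuse distance to be the gap in trace positions between consecutive references to the same block. That definition, and the helper names `reuse_gaps` and `summarize`, are assumptions made for the example, not the authors' metric.

```python
from statistics import mean, pvariance


def build_trace():
    """Trace from the slides: ABCDEF (UV)^5 ABCDEF (PQ)^5 ABCDEF (XY)^5 ABCDEF."""
    trace = []
    for pair in ("UV", "PQ", "XY"):
        trace += list("ABCDEF") + list(pair) * 5
    return trace + list("ABCDEF")


def reuse_gaps(trace):
    """Map each block to the gaps between its consecutive references."""
    last_seen, gaps = {}, {}
    for pos, block in enumerate(trace):
        if block in last_seen:
            gaps.setdefault(block, []).append(pos - last_seen[block])
        last_seen[block] = pos
    return gaps


def summarize(trace):
    """Per block: (reference count, mean gap, population variance of gaps)."""
    return {
        block: (len(g) + 1, mean(g), pvariance(g))
        for block, g in reuse_gaps(trace).items()
    }


gaps = reuse_gaps(build_trace())
stats = summarize(build_trace())
```

On this toy trace, A is referenced every 16 positions across the whole run, while U is referenced every 2 positions inside a single short burst. Both have zero gap variance here, so under this assumed definition it is the mean gap and the span of the references that separate the evenly spread blocks from the bursty ones. The weighting used on the slide may combine these quantities differently.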

Iterative partition and layout

    Func Partition(RB, IRB)
        Sort nodes in RB by instruction density   // highest instruction density first
        RB_SIZE  = Calc_rb_size(RB)
        IRB_SIZE = Calc_irb_size(IRB)
        While (RB_SIZE + IRB_SIZE > CACHE_SIZE) {
            Adjust(RB, IRB)
            RB_SIZE  = Calc_rb_size(RB)
            IRB_SIZE = Calc_irb_size(IRB)
        }
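The loop above can be made concrete with a minimal Python sketch. The slides do not define RB, IRB, the size calculations, or Adjust, so everything below is an assumption made for illustration: RB is treated as the set of blocks that hold cache lines exclusively, IRB as the set of blocks that overlay one shared region, instruction density as executed instructions per byte, and Adjust as demoting the least dense RB block.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    size: int       # code size in bytes
    executed: int   # dynamic instruction count (hypothetical profile value)

    @property
    def density(self):
        return self.executed / self.size


def calc_rb_size(rb):
    # Assumption: RB blocks hold lines exclusively, so their sizes add up.
    return sum(b.size for b in rb)


def calc_irb_size(irb):
    # Assumption: IRB blocks overlay one shared region, sized by the largest.
    return max((b.size for b in irb), default=0)


def adjust(rb, irb):
    # Assumption: demote the least dense RB block into the shared region.
    irb.append(rb.pop())


def partition(rb, irb, cache_size):
    rb = sorted(rb, key=lambda b: b.density, reverse=True)  # densest first
    irb = list(irb)
    # Stop when the partition fits, or when nothing is left to demote.
    while rb and calc_rb_size(rb) + calc_irb_size(irb) > cache_size:
        adjust(rb, irb)
    return rb, irb


hot = [Block("A", 64, 6400), Block("B", 64, 3200),
       Block("C", 64, 640), Block("D", 64, 64)]
shared = [Block("X", 128, 10)]
rb, irb = partition(hot, shared, cache_size=256)
```

With a 256-byte cache in this example, D and then C are demoted, leaving A and B with exclusive lines. The `rb` guard in the loop condition is an addition so the sketch terminates even when the shared region alone exceeds the cache.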

Experiments and results (1)
Figure: Cumulative effect of optimizations on I-cache miss reduction. Vertical axis: 0% to 90%. Benchmarks: H264 enc, H264 dec, AVSM dec, MPEG4 dec, G729.A. Series: BB reorder; BB reorder + layout; BB reorder + pu split + layout.

Experiments and results (2)
Figure: Reduction of I-cache misses by TD (temporal distribution), PH (Pettis and Hansen), and TRG (temporal relation graph). Vertical axis: 0% to 90%. Benchmarks: H264 enc, H264 dec, AVSM dec, MPEG4 dec, G729.A. Series: TD, PH, TRG.

Experiments and results (3)
Figure: H264 codec I-cache miss reduction by TD, PH and TRG with various inputs. Vertical axis: 0% to 80%. Inputs: HE:stenfan, HE:akiyo, HE:football, HD:stenfan, HD:akiyo, HD:football. Series: TD, PH, TRG.

Future work
• Improve the current iterative partition algorithm
• Incorporate more cache configurations into the layout algorithm, e.g. cache line size, L2 cache …
• Develop an effective software cache partition method for multi-threaded programs on our memory hierarchy

Thank You!