Code Transformations to Improve Memory Parallelism. Vijay S. Pai and Sarita Adve. MICRO-32, 1999.

Presentation transcript:

Motivation and Solutions The memory system is the bottleneck in ILP-based systems –Solution: overlap multiple read misses (the dominant source of memory stalls) within the same instruction window, while preserving cache locality A single instruction window rarely contains enough independent load misses –Solution: read miss clustering, enabled by code transformations such as unroll-and-jam Code transformations should be automated –Solution: map the memory parallelism problem to floating-point pipelining (D. Callahan et al. Estimating Interlock and Improving Balance for Pipelined Architectures. Journal of Parallel and Distributed Computing, Aug. 1988)

Unroll-and-jam
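The slide's figure is not reproduced in the transcript, so here is a minimal sketch of the transformation in C. The array, its size, and the unroll degree of 4 are illustrative assumptions, not values from the paper.

```c
#include <stddef.h>

#define N 1024
static double A[N][N];

/* Before: a single read-miss stream. With 64-byte lines and doubles,
 * a miss occurs only every 8 inner iterations, so at most one or two
 * misses sit in the instruction window at any time. */
double sum_base(void) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += A[i][j];
    return sum;
}

/* After unroll-and-jam: the outer loop is unrolled by 4 and the
 * resulting copies of the inner loop are fused ("jammed"). Each inner
 * iteration now touches four different rows, i.e., four independent
 * cache lines, so up to four read misses cluster in the same
 * instruction window while each row keeps its unit-stride locality. */
double sum_uaj(void) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    for (size_t i = 0; i < N; i += 4)
        for (size_t j = 0; j < N; j++) {
            s0 += A[i][j];
            s1 += A[i + 1][j];
            s2 += A[i + 2][j];
            s3 += A[i + 3][j];
        }
    return s0 + s1 + s2 + s3;
}
```

The separate partial sums also break the serial dependence through a single accumulator, a standard companion optimization when unrolling reductions.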

Apply code transformations in a compiler –Automatic unroll-and-jam transformation –Locality analysis to determine leading references (M. E. Wolf and M. S. Lam. A Data Locality Optimizing Algorithm. PLDI 1991) –Dependence analysis of the factors that limit memory parallelism: cache-line dependences, address dependences, and instruction-window constraints (illustrated in the sketch below) Experimental methodology –Environment: Rice Simulator for ILP Multiprocessors (RSIM) –Workload: Latbench and five scientific applications –Miss clustering incorporated by hand Results –9–39% reduction in multiprocessor execution time –11–48% reduction in uniprocessor execution time
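The three dependence constraints listed above are easiest to see in code. The loop below is a hypothetical example written for this summary, not code from the paper; all names are made up.

```c
#include <stddef.h>

enum { M = 1 << 16 };
static double a[M], b[M + 1], c[M];
static int idx[M]; /* assumed to hold indices in [0, M) */

void kernel(void) {
    for (size_t i = 0; i < M; i++) {
        /* Cache-line dependence: b[i] and b[i + 1] usually share a
         * cache line, so only the leading reference can miss; jamming
         * more copies of this access adds no memory parallelism. */
        double x = b[i] + b[i + 1];

        /* Address dependence: the address of c[idx[i]] depends on the
         * load of idx[i], so the two potential misses serialize and
         * cannot be overlapped by reordering alone. */
        double y = c[idx[i]];

        /* Window constraint: misses can overlap only if they fall in
         * the same instruction window, which caps the useful
         * unroll-and-jam degree. */
        a[i] = x + y;
    }
}
```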

Strengths –Good performance results Weaknesses –The validity of the transformations is not fully established

Questions to discuss: –What hardware support is needed to overlap multiple read misses? –Why use unroll-and-jam instead of the strip-mine-and-interchange transformation? (see the sketch below) –What do you think of the future work? (V. S. Pai and S. Adve. Improving Software Prefetching with Transformations to Increase Memory Parallelism.)
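For the second question, it may help to spell out the alternative. The sketch below applies strip-mine-and-interchange to the same illustrative reduction used earlier; the trade-off noted in the comment is one plausible reading of the paper's choice, not a quotation from it.

```c
#include <stddef.h>

#define N 1024
static double A[N][N];

/* Strip-mine-and-interchange: strip-mine i into tiles of 4, then
 * interchange the tile loop with j. Each innermost iteration touches
 * four different rows, as with unroll-and-jam, but the extra innermost
 * loop adds branch and index-update instructions that occupy
 * instruction-window slots which could otherwise hold more overlapping
 * misses. Unroll-and-jam produces straight-line bodies and avoids that
 * overhead. */
double sum_smi(void) {
    double sum = 0.0;
    for (size_t i = 0; i < N; i += 4)
        for (size_t j = 0; j < N; j++)
            for (size_t ii = i; ii < i + 4; ii++)
                sum += A[ii][j];
    return sum;
}
```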