Performance Potentials of Compiler-directed Data Speculation. Authors: Youfeng Wu, Li-Ling Chen, Roy Ju, Jesse Fang. Programming Systems Research Lab, Intel Labs.

Introduction Compiler-directed data speculation has been implemented on Itanium systems to allow the compiler to move a load across a store even when the two operations are potentially aliased. A memory load often sits on a critical execution path and needs to be scheduled earlier to improve performance; when a load may cause a cache miss, moving it early to hide the miss latency is even more important. Data speculation is a run-time disambiguation mechanism.
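As a minimal C illustration (our own, not the paper's) of why such motion is normally illegal: if p and q may point to the same location, hoisting the load above the store would read a stale value, so a conventional compiler must keep the load after the store.

    #include <stdio.h>

    /* May-alias example: if p == q, moving the load above the store
       would return the stale value of *q, so without data speculation
       support the compiler must keep the load after the store. */
    int store_then_load(int *p, int *q, int value) {
        *p = value;      /* store through p                           */
        int x = *q;      /* load from q: potentially aliases *p       */
        return x;        /* on the critical path, waits for the store */
    }

    int main(void) {
        int a = 0;
        printf("%d\n", store_then_load(&a, &a, 42));  /* prints 42 */
        return 0;
    }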

Introduction Data speculation (code examples; a simplified software model of the ALAT follows below).

Data speculation:
  original:
    store reg *p
    reg1 = ld *q
  with data speculation:
    reg1 = ld.a *q
    store reg *p
    reg1 = ld.c *q

Data speculation of a load and its uses:
  original:
    store reg *p
    if (cond) reg1 = ld *q
  with data speculation:
    reg1 = ld.sa *q
    store reg *p
    if (cond) reg1 = ld.c *q

Hardware support: Advanced Load Address Table (ALAT)
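To make the ALAT's role concrete, here is a hedged, highly simplified software model of its behavior (our own sketch; real Itanium hardware is set-associative and tracks physical addresses and access sizes, and ld.c re-executes the load while chk.a branches to recovery code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define ALAT_ENTRIES 32

    typedef struct {
        bool      valid;
        int       dest_reg;   /* register written by the advanced load */
        uintptr_t addr;       /* address the advanced load read from   */
    } AlatEntry;

    static AlatEntry alat[ALAT_ENTRIES];

    /* ld.a / ld.sa: allocate an entry for the advanced (speculative) load */
    static void alat_advanced_load(int dest_reg, uintptr_t addr) {
        AlatEntry *e = &alat[dest_reg % ALAT_ENTRIES];
        e->valid = true;
        e->dest_reg = dest_reg;
        e->addr = addr;
    }

    /* store: invalidate any entry whose address matches the store */
    static void alat_store(uintptr_t addr) {
        for (int i = 0; i < ALAT_ENTRIES; i++)
            if (alat[i].valid && alat[i].addr == addr)
                alat[i].valid = false;
    }

    /* ld.c / chk.a: speculation succeeded iff the entry survived;
       otherwise the load is re-executed or recovery code runs. */
    static bool alat_check(int dest_reg) {
        AlatEntry *e = &alat[dest_reg % ALAT_ENTRIES];
        return e->valid && e->dest_reg == dest_reg;
    }

    int main(void) {
        int x = 1, y = 2;
        alat_advanced_load(1, (uintptr_t)&x);   /* reg1 = ld.a *q         */
        alat_store((uintptr_t)&y);              /* store to a different address */
        printf("no conflict: %s\n", alat_check(1) ? "keep result" : "reload");

        alat_advanced_load(1, (uintptr_t)&x);
        alat_store((uintptr_t)&x);              /* conflicting store      */
        printf("conflict:    %s\n", alat_check(1) ? "keep result" : "reload");
        return 0;
    }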

Related Work
Software-only approach: address comparison and conditional branch instructions allow general code movement across ambiguous stores; predicated instructions were later introduced to help parallelize the comparison code (a sketch of this approach follows below).
Hardware approach: the Memory Conflict Buffer (MCB) scheme extends run-time disambiguation with architectural support that eliminates the explicit address comparison instructions. The MCB scheme adds two new instructions: preload and check.
Since these earlier works, compilation technology, benchmarks, hardware implementations, and the usages of data speculation have all changed.
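A hedged C sketch (our own, simplified to same-size, non-overlapping accesses) of the software-only approach: the load is hoisted above the ambiguous store and guarded by an explicit address comparison that repairs the result on a conflict; MCB-style preload/check hardware removes the need for this comparison.

    #include <stdio.h>

    /* Software-only run-time disambiguation: hoist the load, then guard
       it with an explicit address comparison and a conditional reload. */
    int hoisted_load(int *p, int *q, int value) {
        int x = *q;       /* load hoisted above the ambiguous store    */
        *p = value;       /* store through p                           */
        if (p == q)       /* explicit comparison inserted by compiler  */
            x = *q;       /* conflict detected: re-execute the load    */
        return x;
    }

    int main(void) {
        int a = 0, b = 7;
        printf("%d %d\n", hoisted_load(&a, &b, 42),   /* no alias: 7  */
                          hoisted_load(&a, &a, 42));  /* alias:    42 */
        return 0;
    }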

Methodology Data speculation is primarily used to improve performance through two benefits: reducing scheduled cycle counts by moving loads away from the critical path, and hiding data cache miss latencies by moving loads farther ahead of their uses. The study quantifies these with total dependency height (TDH), schedule gain, and schedule & prefetch gain (a plausible reading of these metrics is sketched below).
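One plausible way to read the gain metrics (our own formulation; the paper defines the exact measurements) is as relative cycle reductions:

    schedule gain            = (cycles without data speculation - cycles with data speculation)
                               / cycles without data speculation
    schedule & prefetch gain = the same ratio, additionally crediting the cache-miss
                               cycles hidden by issuing the advanced load earlier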

Methodology The experiments vary three parameters: data speculation benefits {reducing schedule cycle counts, prefetching for cache misses}, scheduling methods {TDH, real scheduler}, and disambiguation modes {no-disam, static-disam}. Combining them gives 2 × 2 × 2 = 8 scenarios.

Experimental Results Eight scenarios measure the performance potentials for the SPEC2000 C benchmarks.
Without memory disambiguation and with an ideal scheduler, the potential gain from data speculation is about 8.8% (5.4% from critical path reduction and 3.4% from hiding cache miss latency).
With state-of-the-art memory disambiguation and an ideal scheduler, the potential gain is around 4.4% (3.2% from critical path reduction and 1.2% from hiding miss latency).
With a real scheduler and assuming all memory pairs are aliased, the potential gain is around 3.2% (1.8% from critical path reduction and 1.4% from hiding miss latency).
With sophisticated alias analysis and a realistic scheduler, the potential gain is around 1.9% (1.4% from critical path reduction and 0.5% from hiding miss latency).

Experimental Results Interesting statistics: only about 2% to 2.5% of dependency edges are maybe dependencies that are also critical to performance; those are the edges where data speculation can be useful. The majority of loads can be moved no more than 20 cycles earlier by data speculation, which limits its effectiveness at hiding long-latency events (e.g., L3 cache misses).

Future Work
Compare the potential gain with the gain measured on real machines.
Identify the compiler limitations responsible for the performance gap between the ideal scheduler and a real scheduler.
Measure the conflict probabilities of the critical maybe-dependency edges.

Thank you!