Yingmin Li Ting Yan Qi Zhao

Presentation transcript:

OOO vs. EPIC. Yingmin Li, Ting Yan, Qi Zhao

Outline: “Advantages” of EPIC. Critique. Conclusion.

EPIC: Main Idea. “Smart compiler, dumb machine.” Finding parallelism: processor → compiler. Software/hardware synergy. Processor design: avoid complexity and difficulty. ILP, SMT & CMP.

EPIC: Predication. In OOO: dynamic branch prediction. Larger basic blocks. Control dependence → data dependence. Eliminates mispredictions and their penalties.

EPIC: Speculation. In OOO: dynamic hardware. Data speculation and control speculation. Bigger window. Reduces the impact of memory latencies.

EPIC: Large Register Set. In OOO: register renaming. Easier to design than register renaming. “Real” registers benefit some applications: encryption algorithms, numerical algorithms. Avoids the loss of invisible registers on interruptions in OOO.

EPIC: Unique Features. Register Stack Engine (RSE): deals with call/return costs; appears as an unlimited stack of physical registers. Rotating register file: software pipelining; multiple loops at the same time.

Function Call. Register saving/restoring: processor? Compiler? Register file: expensive, always idle.

Predication. Computation of the branch condition is on the critical path. Increases the ICache footprint. Only half of the functional units are used effectively if both “then” and “else” are scheduled. Hard to implement out-of-order execution with full predication.

Predication To compute if (a) x = t+1:
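A minimal sketch of what if-conversion does to a statement of this shape, assuming integer values and a 0/1 predicate. The function names are hypothetical and not taken from the slides; the point is only that the predicated form always computes t+1 and then selects, so the compare sits on the data path of x.

```python
def branchy(a, x, t):
    """Original control-dependent form: the add runs only if the branch is taken."""
    if a:
        x = t + 1
    return x


def predicated(a, x, t):
    """If-converted form (illustrative): no branch, the predicate feeds a select."""
    p = int(a != 0)      # compare writes a 0/1 predicate
    tmp = t + 1          # executed unconditionally, occupying a functional unit
    # Arithmetic select: x now has a data dependence on p, tmp, and the old x.
    return p * tmp + (1 - p) * x
```

Both forms return the same value for every input; the predicated one trades the branch for an extra always-executed operation, which is the functional-unit cost the previous slide points at.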

Control Speculation. Why not just use a prefetch, which will not cause an unexpected exception? Techniques that exploit control speculation, such as superblocks, increase code length.

Control prediction

Data Speculation. Moving a load above a possibly conflicting store. An advanced load and a checking load (IA64). A run-time predictor.
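The advanced-load / check-load pairing can be sketched as a toy model, assuming a table that remembers the address of each advanced load and is cleared by any store to the same address (the role the ALAT plays in IA-64). The class and method names below are hypothetical, and the model ignores sizes, partial overlaps, and table capacity.

```python
class ToyAlat:
    """Toy model of an advanced-load tracking table (illustrative only)."""

    def __init__(self):
        self.entries = {}  # destination register name -> load address

    def advanced_load(self, mem, reg, addr):
        # Load hoisted above a possibly conflicting store; remember its address.
        self.entries[reg] = addr
        return mem[addr]

    def store(self, mem, addr, value):
        mem[addr] = value
        # A store to a tracked address invalidates the matching entries.
        for reg in [r for r, a in self.entries.items() if a == addr]:
            del self.entries[reg]

    def check_load(self, mem, reg, addr, speculative_value):
        # Entry still present: the early value is valid and no reload is needed.
        if reg in self.entries:
            return speculative_value, False
        # Entry gone: a store conflicted, so redo the load at its original place.
        return mem[addr], True


def run(store_addr):
    mem = {100: 1, 200: 2}
    alat = ToyAlat()
    early = alat.advanced_load(mem, "r1", 100)
    alat.store(mem, store_addr, 9)
    return alat.check_load(mem, "r1", 100, early)
```

With this model, `run(200)` keeps the speculative value and `run(100)` falls back to a reload, which is the recovery cost the critique slides later count against IA64.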

Data speculation

Software Pipelining. For high-performance technical computing: high trip-count loops. For commercial applications: low trip-count loops.
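Why trip count matters can be sketched with a three-stage software-pipelined loop, assuming a hypothetical load / multiply / store split with one stage per cycle. The stage guards stand in for the stage predicates a rotating-register kernel would use; the names and the stage count are assumptions, not from the slides.

```python
STAGES = 3  # assumed split: load, multiply, store


def pipelined_scale(src, k):
    """Scale src by k with overlapped iterations; returns (result, cycles)."""
    n = len(src)
    dst = [None] * n
    loaded = None    # value in flight between load and multiply
    product = None   # value in flight between multiply and store
    cycles = 0
    for c in range(n + STAGES - 1):
        # Stages run in reverse order so each one consumes the value
        # produced by the earlier stage in the previous cycle.
        if 0 <= c - 2 < n:
            dst[c - 2] = product      # store stage of iteration c-2
        if 0 <= c - 1 < n:
            product = loaded * k      # multiply stage of iteration c-1
        if c < n:
            loaded = src[c]           # load stage of iteration c
        cycles += 1
    return dst, cycles


def fill_drain_fraction(trip_count, stages=STAGES):
    """Share of cycles spent filling or draining the pipeline."""
    return (stages - 1) / (trip_count + stages - 1)
```

Under these assumptions a 4-iteration loop spends 2 of its 6 cycles filling and draining, while a 1000-iteration loop spends 2 of 1002, which is one way to read the contrast between high and low trip-count loops.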

EPIC: at least not a breakthrough. Design objective of EPIC: moving hardware complexity to the compiler.

EPIC: at least not a breakthrough. The failure of EPIC: the compilation techniques used for EPIC apply almost equally well to OOO. The hardware simplicity is not evident enough to offset EPIC’s overhead. Without dynamic information, the compiler essentially cannot do some things well enough.

The tragedy of cycle time. Why is there no obvious improvement in cycle time? Mechanisms like the RSE increase die complexity. Compare and dependent branch in one cycle. Predicated execution depends on the existence of many functional units.

Dynamic path length: hey, IA64, you wasted too much here. Speculation. Half of the predicated instructions are discarded. Restricted bundling. One base register. No sign-extended loads. No integer multiply or divide in the general registers.

CPI. No dynamic prediction. Longer code (more GRs, predicate registers, template bits, restricted bundling, recovery code) is a burden on instruction fetch. Recovery code may cause ICache pollution or even a page fault.