Sampoorani, Sivakumar and Joshua


A Critical Look At IA-64: Massive Resources, Massive ILP, But Can It Deliver? Martin Hopkins, IBM Research, 2/7/00. Presented by Sampoorani, Sivakumar and Joshua.

Design decisions common to modern processors:
- Pipelining
- Micro-ops
- Large ROB
- Single-path execution
- Dynamic scheduling

At what cost? Hardware for:
- Accurate branch prediction
- Dependency checking
- Register renaming
- Alias detection

Performance of IA-64
Execution time = cycle time * IC * CPI
No improvement reported in frequency. Possible reasons?
- Reducing CPI at the cost of cycle time (compares and branches in the same cycle)
- Predicated execution => more FUs => more complexity + longer wires => limit on frequency => more power

Dynamic Path Length (IC)
Longer than on other architectures. Reasons?
- Speculation: check operations and recovery code
- Predication
- No sign-extended loads
- No integer multiply or divide

Dynamic Path Length (IC)
Loads and stores: only post-execution update of the base register.
  ldsz.ldtype.ldhint r1 = [r3]        // no base update form
  ldsz.ldtype.ldhint r1 = [r3], r2    // register base update
  ldsz.ldtype.ldhint r1 = [r3], imm   // immediate base update

CPI: Cache Effects
Larger code footprint:
- Recovery code
- 128-bit bundle = 3 instructions, with restrictions on placing instructions
- Branch target must be at the beginning of a bundle
Recovery code pollutes the I-cache and/or triggers page faults.
Speculative loads pollute the D-cache.

Stalls possible. Example:
  load ra =
  load rb = ;;   // end of bundle
  add rx = ra
  load ry = [rb] ;;
If the load into ra causes a cache miss, the machine stalls. Superscalar out-of-order processors can execute non-dependent instructions in parallel with the cache miss.

Comparing Complexities
Support for speculative execution:
- Superscalar processors: reorder buffer, register renaming hardware
- EPIC: the compiler must expose parallelism and speculation; the hardware just does what the compiler says

IA-64: Exposing Speculative Execution
- Control speculation (moving loads above branches)
- Data speculation (moving loads above stores)

Control Speculation
Hardware for deferring exceptions, exposed to software:
- NaT ("Not a Thing", or poison) bits
- On an exception, set the NaT bit associated with the register
- Perform an explicit check before using the register
Increase in machine state:
- 2 NaT registers
- Instructions to modify, test, and retrieve NaT values

Data Speculation
Explicit memory-alias-detection table: the ALAT (Advanced Load Address Table).
- Loads place their entries in the ALAT
- Stores remove an entry if the addresses match
Hardware cost:
- The ALAT is 32-entry, 2-way set associative
- Recovery code requires that operands be maintained until the store is seen
- Increased register requirements (128 int + 128 FP)

Data Speculation: Hardware Costs
Increased register pressure implies more state to be saved across function calls. To avoid this:
- Register stacking (cf. SPARC register windows): r0-r31 are global registers, the others are dynamically mapped
- CFM (Current Frame Marker)
- Register Stack Engine, which must also handle register stack overflows
- Additional complexity due to rotating registers

Hardware Costs (summary)
- Reorder buffer
- Register rename mechanism
- NaT bits and their associated instructions
- ALAT
- Increased number of registers
- Register Stack Engine
- Additional complexities due to rotating registers, page faults, ...

Runtime Information
Information about the behavior of programs can't be predicted at compile time. Profiling helps, but it is costly.
Superscalar machines dynamically select which instructions to execute, relying on information known at run time.

EPIC
Depends mostly on the compiler; run-time information is not used as much. Consider the following code sequence:
  cmp p1, p2 = ..                        /* set predicate registers */
  (p1) br.cond low_probability_path ;;   /* if (p1) goto ... */
  ld ra = [rb] ;;
  add rc = ra, rd ;;
  use of (rc)
4 bundles; the load is not hoisted over the branch (which is not usually taken).

As Scheduled by the IA-64 Compiler
Optimize for the most probable path:
  ld.s ra = [rb] ;;
  add rc = ra, rd
  cmp p1, p2 = ...
  (p1) br.cond low_probability_path ;;
  chk.s rc, recovery_code
  use of (rc)
3 bundles.

When the Low-Probability Path Is Taken
Superscalar processor: executes the load as early as possible, cancels it if found to be mis-speculated, and changes its assumptions dynamically.
EPIC: the load has to complete, since the dependent add is in the next bundle; this may take hundreds of cycles if the pointer is random. Heavy penalty if the compiler gets the probabilities wrong.

Dependence on Profiling
RISC and CISC find profiling useful, but not essential; IA-64 is much more dependent on it.
Difficulties involved with profiling:
- Additional responsibility for the programmer
- Creating a representative test suite
- Using it in demanding, diverse development environments

Code Bloat
Estimated contributions to IA-64 code size (percent increases, as listed on the slide):
  RISC instructions                                      50
  3 instructions per 128 bits                            33
  Avg of 2 instructions per bundle                       33
  Branch target at beginning of bundle                   10
  Check ops / recovery code                              20
  No base+disp addressing                                15
  No sign-extended loads / predication / optimizations   30
Compounded, IA-64 code should be about 4.8 times the size of x86 code.

Some things that may reduce code size:
- Post-increment loads can eliminate an add in a loop, e.g. when accessing an array in strides
- Combining a compare and a logical op
- r1 + r2 + 1
- Rotating register files for s/w pipelining
All of the above amount to <5% difference, so net code bloat is about 4 times (excluding optimization overhead). Code bloat => more memory bandwidth required.

Performance Comparison (800 MHz Itanium)
SPECint:
- <68% of the Alpha 21264 (1 GHz, 20% less power)
- <60% of the P4 (2 GHz)
SPECfp:
- >20% vs the Alpha 21264
- >8% vs the P4
Power: a major hurdle.

Conclusion
The IA-64 gamble: that power is not going to be a critical limitation in the future. This allows the use of massive resources.