Lecture on High Performance Processor Architecture (CS05162)

1 Lecture on High Performance Processor Architecture (CS05162)
Limits on Instruction-Level Parallelism. An Hong, Fall 2007. University of Science and Technology of China, Department of Computer Science and Technology.

2 Limits to ILP
Conflicting studies of the amount of ILP, depending on:
Benchmarks (vectorized Fortran FP vs. integer C programs)
Hardware sophistication
Compiler sophistication
How much ILP is available using existing mechanisms with increasing HW budgets?
Do we need to invent new HW/SW mechanisms to stay on the processor performance curve?
DLP: Intel MMX, SSE, SSE2; stream processors
TLP: IBM Power5 (SMT/CMP)
PCA: RAW, Smart Memory, TRIPS, etc.

3 Overcoming Limits
Advances in compiler technology, combined with significantly new and different hardware techniques, may be able to overcome the limitations assumed in these studies.
However, it is unlikely that such advances, when coupled with realistic hardware, will overcome these limits in the near future.

4 Limits to ILP
Initial HW model here; MIPS compilers. Assumptions for the ideal/perfect machine to start:
1. Register renaming: infinite virtual registers => all register WAW and WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted (returns, case statements). 2 & 3 => no control dependences; perfect speculation and an unbounded buffer of instructions available.
4. Memory-address alias analysis: addresses are known, and a load can be moved before a store provided the addresses are not equal. 1 & 4 eliminate all but RAW dependences.
Also: perfect caches; 1-cycle latency for all instructions (including FP multiply and divide); unlimited instructions issued per clock cycle.
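To make assumption 4 concrete, here is a minimal C sketch (the function names and values are invented for this note, not from the lecture): without alias information, the store through dst may feed a later load through src, so the load cannot be hoisted above the store; the restrict qualifier supplies the "addresses known to differ" guarantee that the ideal machine assumes.

```c
#include <stdio.h>

/* Without alias information, each load of src[i] must stay behind the
 * preceding store to dst[i-1], serializing the loop. */
static void scale_may_alias(float *dst, const float *src, int n, float k)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}

/* 'restrict' promises dst and src never overlap, so loads can be moved
 * ahead of earlier stores and iterations can overlap. */
static void scale_no_alias(float *restrict dst, const float *restrict src,
                           int n, float k)
{
    for (int i = 0; i < n; i++)
        dst[i] = k * src[i];
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8];
    scale_may_alias(b, a, 8, 2.0f);
    scale_no_alias(b, a, 8, 2.0f);
    printf("%f\n", b[7]);
    return 0;
}
```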

5 Limits to ILP: HW Model Comparison (ideal model vs. Power 5)
Instructions issued per clock: Infinite vs. 4
Instruction window size: Infinite vs. 200
Renaming registers: Infinite vs. 48 integer + 40 Fl. Pt.
Branch prediction: Perfect vs. 2% to 6% misprediction (tournament branch predictor)
Cache: Perfect vs. 64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis: Perfect vs. ??

6 Upper Limit to ILP: Ideal Machine
[Figure: instructions per clock achieved by the ideal machine on the FP and integer benchmarks.]

7 Limits to ILP: HW Model Comparison (new model / ideal model / Power 5)
Instructions issued per clock: Infinite / Infinite / 4
Instruction window size: Infinite, 2K, 512, 128, 32 / Infinite / 200
Renaming registers: Infinite / Infinite / 48 integer + 40 Fl. Pt.
Branch prediction: Perfect / Perfect / 2% to 6% misprediction (tournament)
Cache: Perfect / Perfect / 64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis: Perfect / Perfect / ??

8 More Realistic HW: Window Impact
Change from an infinite window to windows of 2048, 512, 128, and 32 entries.
[Figure: IPC vs. window size for the FP and integer benchmarks.]

9 Limits to ILP: HW Model Comparison (new model / ideal model / Power 5)
Instructions issued per clock: 64 / Infinite / 4
Instruction window size: 2048 / Infinite / 200
Renaming registers: Infinite / Infinite / 48 integer + 40 Fl. Pt.
Branch prediction: Perfect vs. 8K tournament vs. 512-entry 2-bit vs. profile vs. none / Perfect / 2% to 6% misprediction (tournament)
Cache: Perfect / Perfect / 64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis: Perfect / Perfect / ??

10 More Realistic HW: Branch Impact
Change from an infinite window to a 2048-entry window with a maximum issue of 64 instructions per clock cycle.
[Figure: IPC for the FP and integer benchmarks with perfect prediction, a tournament predictor, a 512-entry BHT, profile-based prediction, and no prediction.]

11 Misprediction Rates
[Figure: misprediction rates for the branch predictors compared on the previous slide.]

12 Limits to ILP: HW Model Comparison (new model / ideal model / Power 5)
Instructions issued per clock: 64 / Infinite / 4
Instruction window size: 2048 / Infinite / 200
Renaming registers: Infinite vs. 256, 128, 64, 32, none / Infinite / 48 integer + 40 Fl. Pt.
Branch prediction: 8K 2-bit / Perfect / tournament branch predictor
Cache: Perfect / Perfect / 64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis: Perfect / Perfect / ??

13 More Realistic HW: Renaming Register Impact (N int + N fp)
Change to a 2048-entry instruction window (per the previous slide's model), 64-instruction issue, and 8K two-level branch prediction.
[Figure: IPC with infinite, 256, 128, 64, 32, and no renaming registers.]

14 Limits to ILP: HW Model Comparison (new model / ideal model / Power 5)
Instructions issued per clock: 64 / Infinite / 4
Instruction window size: 2048 / Infinite / 200
Renaming registers: 256 Int + 256 FP / Infinite / 48 integer + 40 Fl. Pt.
Branch prediction: 8K 2-bit / Perfect / tournament
Cache: Perfect / Perfect / 64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis: Perfect vs. stack vs. inspect vs. none / Perfect / ??

15 More Realistic HW: Memory Address Alias Impact
Change to a 2048-entry instruction window, 64-instruction issue, 8K two-level branch prediction, and 256 renaming registers.
FP benchmarks (Fortran, no heap); integer benchmarks reach roughly 4 - 9 IPC.
[Figure: IPC with perfect alias analysis, global/stack-perfect analysis (heap conflicts remain), inspection of assembly code, and no analysis.]

16 Limits to ILP: HW Model Comparison (new model / ideal model / Power 5)
Instructions issued per clock: 64 (no restrictions) / Infinite / 4
Instruction window size: Infinite vs. 256, 128, 64, 32 / Infinite / 200
Renaming registers: 64 Int + 64 FP / Infinite / 48 integer + 40 Fl. Pt.
Branch prediction: 1K 2-bit / Perfect / tournament
Cache: Perfect / Perfect / 64K I, 32K D, 1.92 MB L2, 36 MB L3
Memory alias analysis: HW disambiguation / Perfect / ??

17 Realistic HW: Window Impact
Perfect disambiguation (HW), 1K selective predictor, 16-entry return address predictor, 64 renaming registers, issue as many instructions as the window allows.
[Figure: IPC for the FP and integer benchmarks with window sizes of infinite, 256, 128, 64, 32, 16, 8, and 4.]

18 Analysis of the ILP Limit
What went wrong? Preserving sequential semantics while reordering instructions is hard, especially in hardware.
Limits to reordering:
Branches: the control-flow limit
Loads and stores: the data-flow limit
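As a hedged illustration (the function and values below are invented for this note, not taken from the lecture), the following C fragment contains both limits: the branch outcome decides what work the iteration does, and the store through hist targets an address known only at run time, so neither can be freely reordered without prediction or alias analysis.

```c
#include <stdio.h>

/* Sketch of the two reordering limits:
 *  - control-flow limit: work guarded by the branch cannot start until the
 *    branch resolves (or is correctly predicted);
 *  - data-flow limit: the store to hist[...] uses a run-time address, so
 *    later loads and stores cannot be moved past it without alias analysis. */
static int classify(const int *x, int *hist, int n)
{
    int big = 0;
    for (int i = 0; i < n; i++) {
        if (x[i] > 100)          /* branch: control-flow limit */
            big++;
        hist[x[i] & 15]++;       /* store to a run-time address: data-flow limit */
    }
    return big;
}

int main(void)
{
    int x[6] = {3, 250, 17, 101, 99, 400}, hist[16] = {0};
    printf("big = %d\n", classify(x, hist, 6));
    return 0;
}
```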

19 Analysis of the ILP Limit
The dynamically scheduled superscalar model (fixed-size window): instructions are sequenced and retired in program order.
How ILP is extracted:
Build an instruction window and resolve the control dependences.
Identify and minimize the data dependences among the instructions in the window.
Schedule the instructions to execute in parallel.
ILP can be extracted in software or in hardware.

20 Analysis of the ILP Limit
The dynamically scheduled superscalar model (fixed-size window): instructions are sequenced and retired in program order.
In-order sequencing establishes the correct data dependences between instructions that are required to implement the meaning of the program.

21 Analysis of the ILP Limit
The dynamically scheduled superscalar model (fixed-size window): instructions are sequenced and retired in program order.
Issue 1: Not enough ready-to-execute useful instructions, due to two kinds of interruptions:
At the sequencing end (direct interruptions): instruction cache misses; branch mispredictions.
At the retirement end (indirect interruptions): long execution latencies, e.g., FP divide or a load that misses in the data cache.

22 Analysis of the ILP Limit
The dynamically scheduled superscalar model (fixed-size window): instructions are sequenced and retired in program order.
Issue 2: Not conducive to high processor utilization, because the sequencing order and the global data-driven order rarely match.
Execution should take place in global data-driven order (data-flow order), but the execution order is constrained by the sequencing order (control-flow order).
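A minimal sketch of the mismatch (the values are invented for illustration): the second dependence chain below does not depend on the slow divide, so a data-flow machine could run the two chains concurrently, while a machine that sequences in program order must fetch and buffer the divide chain first.

```c
#include <stdio.h>

int main(void)
{
    double a = 355.0, b = 113.0, c = 2.0, d = 3.0;

    /* chain 1: headed by a long-latency divide */
    double t1 = a / b;
    double t2 = t1 + 1.0;     /* true (RAW) dependence on t1 */

    /* chain 2: independent of chain 1, ready immediately in data-flow order,
     * but placed after chain 1 in control-flow (program) order */
    double t3 = c + d;
    double t4 = t3 * 2.0;     /* true (RAW) dependence on t3 */

    printf("%f %f\n", t2, t4);
    return 0;
}
```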

23 Analysis of the ILP Limit
Dynamically re-order instructions to fill multiple execution units.
Sequential semantics must be preserved => dependency checking is required.
Complexity grows as the product of the number of instructions in flight and the number of execution units.
Work at Sun, IBM, and Compaq indicates that a superscalar width of about 4 is the current cost vs. performance sweet spot.

24 Analysis of the ILP Limit
Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to issue 3 or 4 data memory accesses per cycle, resolve 2 or 3 branches per cycle, rename and access more than 20 registers per cycle, and fetch 12 to 24 instructions per cycle.
The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate. E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!

25 Analysis of the ILP Limit
Short latencies: floating-point divide; branch handling; accesses to the local memory system.
Long latencies: accesses to remote memory systems; synchronization waits of unpredictable duration caused by concurrent operations.
The gains from ILP are limited by the parallelism inherent in a single instruction stream: SPEC CPU 2000 (int, fp), TPC (OLTP, DSS).
The gains from ILP are limited by the partial order imposed by a sequential instruction stream: the von Neumann computation model vs. the dataflow computation model.

26 Limits to ILP
Most ILP techniques for increasing performance also increase power consumption.
Multiple-issue techniques are all energy inefficient:
Issuing multiple instructions incurs overhead in logic that grows faster than the issue rate grows.
There is a growing gap between peak issue rates and sustained performance.
The number of transistors switching = f(peak issue rate), while performance = f(sustained rate); the growing gap between peak and sustained performance means increasing energy per unit of performance.
The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
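A back-of-the-envelope sketch with made-up numbers (not measurements from the lecture): if switching power scales with the peak issue width while delivered performance scales with the sustained IPC, widening the machine raises energy per unit of performance.

```c
#include <stdio.h>

int main(void)
{
    /* hypothetical machines: {peak issue width, sustained IPC} */
    double peak[]      = {4.0, 8.0};
    double sustained[] = {1.5, 2.0};

    for (int i = 0; i < 2; i++)
        /* relative power ~ peak width; relative performance ~ sustained IPC */
        printf("width %.0f: energy per unit of performance ~ %.2f\n",
               peak[i], peak[i] / sustained[i]);
    return 0;
}
```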

27 How to Exceed ILP Limits of this study?

28 Performance beyond single thread ILP
There can be much higher natural parallelism in some applications (e.g., database or scientific codes): explicit thread-level parallelism or data-level parallelism.
Thread: a process with its own instructions and data. A thread may be one process of a parallel program of multiple processes, or it may be an independent program. Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute.
Data-level parallelism: perform the same operation on lots of data items.
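As a hedged sketch of data-level parallelism of the SSE kind mentioned on slide 2 (the function name and values are illustrative, and it assumes an x86 machine with SSE): one packed instruction applies the same add to four floats at once, instead of relying on the core to find four independent scalar adds in its instruction window.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

/* Add four float pairs with a single packed SSE instruction. */
static void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

int main(void)
{
    float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
    add4(a, b, c);
    printf("%.0f %.0f %.0f %.0f\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```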

