
Slide 1 (Uri Weiser; VLSI_04_2010.ppt; Dec 08)
VLSI ARCHITECTURE DESIGN COURSE, Lecture #4-5: The Generic Processor
- Microarchitecture trends
- Performance/power/frequency implications
- Insights
Today's lecture: comprehend the performance, power, and area implications of various microarchitectures.

Slide 2: References of the day
- "Computer Architecture: A Quantitative Approach" (2nd edition), John L. Hennessy and David A. Patterson, Chapters 3-4 (pp. 125-370)
- "Computer Organization and Design", John L. Hennessy and David A. Patterson, Chapters 5-6 and 9 (pp. 268-451, 594-646)
- "Tuning the Pentium Pro Microarchitecture", David Papworth, IEEE Micro, April 1996
- "IA-64 Application Architecture Tutorial", Allan D. Knies, Hot Chips 11, August 1999
- "Billion-Transistor Architectures: There and Back Again", Doug Burger and James Goodman, IEEE Computer, March 2004
- "A VLIW Architecture for a Trace Scheduling Compiler", R. Colwell, R. Nix, J. O'Donnell, D. Papworth, P. Rodman, ACM 1987
- "The VLIW Machine: A Multiprocessor for Compiling Scientific Code", Joseph Fisher, IEEE Computer, July 1984
- "The IBM System/360 Model 91: Machine Philosophy and Instruction-Handling", R. M. Tomasulo et al., IBM Journal of Research and Development 11:1, 1967
Some of the lecture material was prepared by Ronny Ronen.

Slide 3: Computing platform messages
- Balanced design
- Power ~ C * V^2 * f
- System performance: transaction overhead, memory as a scratch pad, scheduling, system efficiency, ...
- CPU: ILP and IPC vs. frequency; external vs. internal frequency; speculation (branch prediction, caches ($), memory disambiguation, instruction and data prefetch, value prediction, ...)
- Multithreading: multithreading on a single core; multi-core systems (caches in multi-core, asymmetry, NUMA, scheduling in multi-core, multi-core vs. multi-threaded machines, ...)

Slide 4: The Generic Processor
A sophisticated organization to "service" instructions, built from three parts:
- Instruction supply: instruction cache, branch prediction, instruction decoder, ...
- Execution engine: instruction scheduler, register files, execution units, ...
- Data supply: data cache, TLBs, ...
Goal: maximum throughput through a balanced design.

Slide 5: Power & Performance
Performance = 1 / Execution Time = (IPC x Frequency) / #-of-instructions-in-task
For a given instruction stream, performance depends on the number of instructions executed per time unit:
- Performance ~ IPC x Frequency
- Sometimes measured in MIPS (Million Instructions Per Second)
Power ~ C x V^2 x Frequency, where C is the overall capacitance: for a given technology, C is roughly proportional to the number of transistors.
Energy efficiency = Performance / Power, measured in MIPS/Watt.
Message: Power = C x V^2 x Frequency
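The relations on this slide can be sketched numerically. A minimal sketch; the IPC, capacitance, voltage, and frequency figures below are illustrative assumptions, not measurements of any real processor:

```python
def perf_mips(ipc, freq_mhz):
    """Performance ~ IPC x frequency (in MIPS, since f is given in MHz)."""
    return ipc * freq_mhz

def dynamic_power_watts(c_farads, v_volts, freq_hz, activity=1.0):
    """Dynamic power ~ activity * C * V^2 * f."""
    return activity * c_farads * v_volts ** 2 * freq_hz

perf = perf_mips(ipc=2.0, freq_mhz=3000)        # 6000 MIPS
power = dynamic_power_watts(10e-9, 1.1, 3e9)    # ~36.3 W
print(perf, round(power, 1), round(perf / power, 1))  # last value: MIPS/Watt
```

Doubling IPC at fixed frequency doubles the MIPS/Watt figure here, which is why the later slides weigh IPC gains against the power cost of the hardware that delivers them.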

Slide 6: Microprocessor Performance Evolution [John DeVale & Bryan Black, 2006]
[Chart: IPC vs. frequency for Intel P4, Intel P-M, YNH, MRM, Power 3, Power 4, Itanium, AMD Athlon, AMD Opteron]
Message: Frequency vs. IPC

Slide 7: Real life: Performance vs. frequency
[Chart: relative performance at reduced external frequency; data points 866 -> 95%, 708 -> 90%, 878 -> 92%, 807 -> 87%]
* Source: Intel Pentium 4 Processor and Intel 850 Performance Brief, April 2002
Message: Internal vs. external frequency

Slide 8: Microarchitecture
Microprocessor core performance/power/area insights:
- Parallelism
- Pipeline stalls/bypasses
- Superpipelining
- Static/dynamic scheduling
- Branch prediction
- Memory hierarchy
- VLIW / EPIC

Slide 9: Parallelism Evolution
Performance, power, area insights? (PE = Processor Element)
[Diagram: instruction streams flowing through processor elements in each configuration]
- Basic configuration
- Pipeline
- VLIW
- Superscalar, in order
- Superscalar, out of order

Slide 10: Static Scheduling: VLIW / EPIC
Performance, power, area insights?
Static scheduling of instructions by the compiler:
- VLIW: Very Long Instruction Word (Multiflow, TI C6x family)
- EPIC: Explicitly Parallel Instruction Computing (IA-64)
+ Shorter pipe, wider machine, global view => potentially huge ILP (wider and simpler than a plain superscalar!)
- Many nops; sensitive to varying latencies (memory accesses)
- Low utilization
- Huge code size
- Highly dependent on the compiler
EPIC overcomes some of these limitations:
- Advance loads (hide memory latency)
- Predicated execution (avoid branches)
- Decoder templates (reduce nops)
But at increased complexity.
[Pipeline diagram: I (integer), F (float), M (memory), B (branch) slots through fetch/decode/execute/writeback stages; st = stall; gray = nop]
Examples: Intel Itanium processors, DSPs

Slide 11: Dynamic Scheduling
Performance, power, area insights?
Scheduling instructions at run time, by the hardware.
Advantages:
- Works on the dynamic instruction flow: can schedule across procedures, modules, ...
- Can see dynamic values (memory addresses)
- Can accommodate varying latencies and cases (e.g., cache miss)
Disadvantages:
- Can schedule within a limited window only
- Must be fast, so cannot be too smart

Slide 12: Out Of Order Execution
In-order execution: instructions are processed in their program order, which limits the potential parallelism.
OOO: instructions are executed based on data flow rather than program order. Usually highly superscalar.
Before (src -> dest):
  (1) load (r10), r21
  (2) mov r21, r31    (2 depends on 1)
  (3) load a, r11
  (4) mov r11, r22    (4 depends on 3)
  (5) mov r22, r23    (5 depends on 4)
After:
  (1) load (r10), r21; (3) load a, r11; (2) mov r21, r31; (4) mov r11, r22; (5) mov r22, r23
[Pipeline diagrams: in-order vs. out-of-order processing, assuming unlimited resources and 2-cycle load latency]
Examples: Intel Pentium II/III/4, Compaq Alpha 21264
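The reordering on this slide can be reproduced mechanically: an instruction may issue as soon as its operands are ready. A minimal sketch, assuming the slide's 2-cycle load latency, 1 cycle for mov, and unlimited issue width:

```python
# Dependences and latencies for the slide's 5-instruction example.
lat = {1: 2, 2: 1, 3: 2, 4: 1, 5: 1}          # loads: 2 cycles, movs: 1
deps = {1: [], 2: [1], 3: [], 4: [3], 5: [4]}

start = {}   # earliest cycle each instruction can begin
ready = {}   # cycle at which each result becomes available
for i in sorted(deps):                         # walk in program order
    start[i] = max([ready[d] for d in deps[i]], default=0)
    ready[i] = start[i] + lat[i]

# Data-flow issue order: by readiness, with program order breaking ties.
order = sorted(deps, key=lambda i: (start[i], i))
print(order)  # [1, 3, 2, 4, 5] -- the "After" order on the slide
```

Instruction (3) has no pending operands, so it overtakes (2), which must wait two cycles for the load in (1); this is exactly the dependence-driven reordering the slide illustrates.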

Slide 13: Out Of Order (cont.)
Performance, power, area insights?
Advantages:
- Helps exploit Instruction Level Parallelism (ILP)
- Helps cover latencies (e.g., cache miss, divide)
- Artificially increases the register file size (i.e., the number of registers)
- Superior/complementary to a compiler scheduler: dynamic instruction window; makes use of more registers than the architectural registers
Disadvantage: complex microarchitecture
- Complex scheduler: large instruction window, speculative execution
- Requires a reordering back-end mechanism (retirement) for: precise interrupt resolution, misprediction/speculation recovery, memory ordering

Slide 14: Branch Prediction
Performance, power, area insights?
Goal: ensure instruction supply by correct prefetching.
In the past, the prefetcher assumed fall-through:
- Loses on unconditional branches (e.g., call)
- Loses on frequently taken branches (e.g., loops)
Dynamic branch prediction:
- Predicts whether a branch is taken/not taken
- Predicts the branch target address
Typical branch prediction rates are ~90%-96%:
=> 4%-10% misprediction
=> 10-25 branches between mispredictions
=> 50-125 instructions between mispredictions
Misprediction cost increases with:
- Pipeline depth
- Machine width (e.g., 3 wide x 10 stages = 30 instructions flushed!)
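The slide's arithmetic can be checked directly. A quick sketch, assuming roughly one branch per five instructions (a common rule of thumb, not stated on the slide):

```python
branch_frac = 1 / 5           # assumption: ~1 branch per 5 instructions

for accuracy in (0.90, 0.96):
    mispred = 1 - accuracy                            # 10% .. 4%
    branches_between = 1 / mispred                    # 10 .. 25 branches
    insts_between = branches_between / branch_frac    # 50 .. 125 instructions
    print(f"{accuracy:.0%}: {branches_between:.0f} branches, "
          f"{insts_between:.0f} instructions between mispredictions")

# Flush cost grows with machine width x pipeline depth:
width, stages = 3, 10
print(width * stages, "in-flight instructions flushed per misprediction")
```

At best, then, the machine mispredicts every ~125 instructions and throws away up to 30 instructions of work each time, which is why deeper and wider pipelines demand ever better predictors.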

Slide 15: Caches
In computer engineering, a cache (pronounced /kæʃ/ "kash" in the US, /keɪʃ/ "kaysh" in Australia/NZ) is a component that transparently stores data so that future requests for that data can be served faster. (Wikipedia)

Slide 16: Memory hierarchy
Performance, power, area insights?
Speed falls and capacity grows as we move away from the CPU:
  Registers            <500 B    0.25 ns
  L1 cache             64 KB     1-2 ns
  L2 cache             8 MB      5 ns
  Main memory (DRAM)   4 GB      100 ns
  Disk/Flash           100 GB    1 ms (disk) / 10 us (flash)
Perf/power: what are the parameters to consider here?
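One standard parameter to consider is the average memory access time (AMAT). A sketch using the latencies above; the hit/miss rates are illustrative assumptions, since the slide gives only capacities and latencies:

```python
l1_ns, l2_ns, mem_ns = 1.5, 5.0, 100.0   # latencies from the table above
l1_miss, l2_miss = 0.05, 0.20            # assumed miss rates

# AMAT = L1 time + L1 miss rate * (L2 time + L2 miss rate * memory time)
amat_ns = l1_ns + l1_miss * (l2_ns + l2_miss * mem_ns)
print(amat_ns)  # 2.75 ns on average, despite a 100 ns DRAM
```

With these numbers the hierarchy delivers an average latency within 2x of the L1, even though only a tiny fraction of the data fits there; that is the whole argument for the pyramid on this slide.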

Slide 17: Environment and motivation
Moore's Law: 2x transistors (cores?) per chip every technology generation; however, current process generations provide almost the same clock rate.
A processor running a single process can compute only as fast as memory:
- A 3 GHz processor can execute an "add" operation in 0.33 ns
- Today's external main memory latency is 50-100 ns
- In a naive implementation, loads/stores can be ~300x slower than other operations

Slide 18: Cache Motivation: the CPU-DRAM gap (latency)
["Moore's Law" chart: processor performance grows ~60%/yr (2x/1.5 yr), DRAM ~9%/yr (2x/10 yrs); the processor-memory performance gap grows ~50%/yr]
Memory latency can be handled by:
- A multi-threaded engine (no cache): every memory access is an off-chip access. BW and power implications?
- Caches: only every cache miss is an off-chip access. BW and power implications?

Slide 19: Memory Hierarchy
Number of CPU cycles to reach each memory domain (latency; C = CPU cycles):
- Registers: 1 C
- Memory: ~300 C
- SSD: 10,000 C
- Disk: 1,000,000 C!
[046267 Computer Architecture 1, U. Weiser]

Slide 20: Cache
A cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations.
[048750 CMP Cache/Mem Arch, Uri W., Evgeny B.]

Slide 21: Memory Hierarchy solution I: single-core environment
A fast memory structure (cache) between the CPU and memory solves the latency issue (C = CPU cycles):
- Registers: 1 C
- Cache: 10 C
- Memory: 300 C
- SSD: 10,000 C; Disk: 1,000,000 C!

Slide 22: Memory Hierarchy solution II: multi-thread environment
Executing many threads hides latency: while one thread waits on a memory access (~300 C), other threads execute.
Performance_1 => BW_1, P_1 (baseline off-chip bandwidth and power)

Slide 23: Memory Hierarchy solution II: multi-thread environment (cont.)
A memory structure ($) between the CPU and memory serves as a bandwidth filter (MR = cache miss rate):
Same performance: Performance_1 => BW_1 * MR, P_1 * MR
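The bandwidth-filter effect is easy to quantify. A sketch with illustrative numbers; the access rate, line size, and per-access energy below are assumptions, not figures from the slide:

```python
accesses_per_s = 1e9          # assumed core demand: 1G memory accesses/s
bytes_per_access = 64         # assumed cache-line-sized off-chip transfers
nj_per_offchip_access = 10.0  # assumed energy per off-chip access (nJ)

for miss_rate in (1.0, 0.10, 0.02):    # no cache, modest cache, good cache
    offchip_per_s = accesses_per_s * miss_rate
    bw_gb_s = offchip_per_s * bytes_per_access / 1e9
    power_w = offchip_per_s * nj_per_offchip_access * 1e-9
    print(f"MR={miss_rate:.0%}: {bw_gb_s:.2f} GB/s off-chip, {power_w:.2f} W")
```

Off-chip bandwidth and access power both scale linearly with MR, so a 2% miss rate cuts both by 50x relative to the cacheless multi-threaded design: same delivered performance, a fraction of BW_1 and P_1.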

Slide 24: Power, Performance, Area: Insights 1
Energy to process one instruction (W_i) increases with the complexity of the processor; e.g., an OOO processor consumes more energy per instruction than an in-order processor => lower Perf/Power.
Energy efficiency = Perf/Power: its value deteriorates as speculation increases and complexity grows.
Area efficiency = Performance/Area:
- Leakage becomes a major issue
- Effectiveness of area: how to get more performance for a given area (secondary to power)

Slide 25: Power, Performance, Area: Insights 2
Performance: Perf ~ IPC * f
Voltage scaling: increase the operating voltage to increase frequency; f = k * V (within a given voltage range).
Power and energy consumption:
- P ~ C * V^2 * f => P ~ a * C * V^3
- E = P * t
Tradeoff:
- Maximum performance
- Minimum energy: 1% perf <=> 1% power
- Maximum performance within constrained power: 1% perf <=> 3% power
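The "1% perf <=> 3% power" figure follows from the models on this slide: with f = k * V, performance tracks V while power tracks V^3. A numeric check (the 1% step is arbitrary):

```python
v0, v1 = 1.00, 1.01               # raise voltage (and hence frequency) by 1%

perf_gain = v1 / v0 - 1           # perf ~ f = k*V, so perf scales with V
power_gain = (v1 / v0) ** 3 - 1   # P ~ C * V^2 * f = k * C * V^3

print(f"+{perf_gain:.1%} performance costs +{power_gain:.1%} power")
```

Since (1.01)^3 is about 1.030, each percent of voltage-scaled performance costs roughly three percent of power, which is exactly why power-constrained designs prefer IPC improvements over frequency pushes.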

Slide 26: Power, Performance, Area: Insights 3
Many things do not scale:
- Wire delays
- Power
- Memory latencies and bandwidth
- Instruction Level Parallelism (ILP), ...
We solve one, and we fertilize the others!
Performance = frequency * IPC:
- Increasing IPC => more work per instruction (prediction, renaming, scheduling, etc.) and more useless work (speculation, replays, ...)
- More frequency => more pipe stages: fewer gate delays per stage, but more gate delays per instruction overall => bigger losses due to flushes, cache misses, and prefetch misses
We may "gain" performance, but with a lot of area and power!

Slide 27: Static Scheduling: VLIW / EPIC (a short architectural case study)
Why "new"? ... CISC = old.
Why reviving? ... OOO complexity.
Advantages: simplicity (pipeline, dependency handling, no dynamic scheduling).
Possible reasons, and questions to ask (technical and business):
- End of life of x86?
- Business? Servers? Controllers?

Slide 28: Static Issuing example: VLIW (Very Long Instruction Word), Multiflow 7/200
A VLIW performs many program steps at once: many operations are grouped together into a Very Long Instruction Word and execute together.
[Diagram: one instruction word feeding two clusters of LD/ST, FADD, FMUL, and IALU units (plus a BRANCH unit), sharing a register file and memory]
Ref: "A VLIW Architecture for a Trace Scheduling Compiler", Colwell, Nix, O'Donnell, Papworth, Rodman

Slide 29: Multiflow 7/200 (cont.): Compiler Basic Concept
The optimizing compiler arranges instructions according to instruction timing.
Example: A = (B+C) * (D+E); F = G*H + X*Y
Assumed latencies: Load 3, FADD 3, FMUL 3, Store 1.
  LD   #B, R1
  LD   #C, R2
  FADD R1, R2, R3
  LD   #D, R4
  LD   #E, R5
  FADD R4, R5, R6
  FMUL R6, R3, R1
  STO  R1, #A
  LD   #G, R7
  LD   #H, R8
  FMUL R7, R8, R9
  LD   #X, R4
  LD   #Y, R5
  FMUL R4, R5, R6
  FADD R6, R9, R1
  STO  R1, #F

Slide 30: Multiflow 7/200 (cont.): Compiler Basic Concept
Example (cont.): A = (B+C) * (D+E); F = G*H + X*Y. Assumed latencies: Load 3, FADD 3, FMUL 3, Store 1.
[Schedule table; columns LD/ST, IALU, FADD, FMUL, BR. The LD/ST unit issues LD #B, #C, #D, #E, #G, #H, #X, #Y, then STO R1,#A and STO R1,#F; the FADD unit issues FADD R1,R2,R3; FADD R4,R5,R6; FADD R9,R6,R1; the FMUL unit issues FMUL R7,R8,R9; FMUL R3,R6,R1; FMUL R4,R5,R6. "-" marks a stalled cycle: it takes time but no space.]
Overall latency: 17 cycles. Very low code efficiency: <25%!
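The 17-cycle result can be reproduced with a tiny list scheduler. A sketch under stated assumptions (one shared LD/ST unit, one FADD and one FMUL unit, all fully pipelined with one issue per unit per cycle, and a result usable `latency` cycles after issue); symbolic value names (s1, m1, p, ...) stand in for the slide's reused registers:

```python
LAT = {"LD": 3, "FADD": 3, "FMUL": 3, "STO": 1}       # slide's latencies
UNIT = {"LD": "MEM", "STO": "MEM", "FADD": "FADD", "FMUL": "FMUL"}

# (op, result, operands) for A = (B+C)*(D+E); F = G*H + X*Y
prog = [
    ("LD", "B", []), ("LD", "C", []), ("LD", "D", []), ("LD", "E", []),
    ("LD", "G", []), ("LD", "H", []), ("LD", "X", []), ("LD", "Y", []),
    ("FADD", "s1", ["B", "C"]), ("FADD", "s2", ["D", "E"]),
    ("FMUL", "m1", ["G", "H"]), ("FMUL", "p", ["s1", "s2"]),
    ("FMUL", "m2", ["X", "Y"]), ("FADD", "s3", ["m1", "m2"]),
    ("STO", "A", ["p"]), ("STO", "F", ["s3"]),
]

ready_at = {}          # value -> first cycle a consumer may issue
issue = {}             # result name -> issue cycle
pending = list(prog)
cycle = 0
while pending:
    cycle += 1
    busy = set()       # units already used this cycle
    for inst in list(pending):
        op, dst, srcs = inst
        if UNIT[op] in busy:
            continue
        if all(ready_at.get(s, 10**9) <= cycle for s in srcs):
            issue[dst] = cycle
            ready_at[dst] = cycle + LAT[op]
            busy.add(UNIT[op])
            pending.remove(inst)

makespan = max(issue[d] + LAT[op] - 1 for op, d, _ in prog)
print(makespan)  # 17 cycles, matching the slide
```

Only 16 operations fill 17 cycles x 5 issue slots = 85 slots, about 19% utilization, consistent with the slide's "<25% code efficiency" point.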

Slide 31: Intel Itanium Processor Block Diagram
[Block diagram; main components:]
- L1 instruction cache and fetch/pre-fetch engine, ITLB, branch prediction, IA-32 decode and control
- Instruction queue (8 bundles), 9 issue ports
- Register stack engine / re-mapping; 128 integer registers, 128 FP registers, branch and predicate registers
- Branch units, integer and MM units, dual-port L1 data cache and DTLB, ALAT
- Floating point units (2x SIMD FMAC)
- Scoreboard, predicates, NaTs, exceptions
- L2 cache, L3 cache, bus controller, ECC

Slide 32: IA-64 Instruction Template
A 128-bit bundle holds three 41-bit instructions plus a 5-bit template.
Instruction types:
- M: Memory
- I: Shifts, MM
- A: ALU
- B: Branch
- F: Floating point
- L+X: Long
Template types:
- Regular: MII, MLX, MMI, MFI, MMF
- Stop: MI_I, M_MI
- Branch: MIB, MMB, MFB, MBB, BBB
- All come in two versions: with and without a stop at the end
Microarchitecture considerations:
- Can run N bundles per clock (Merced: N = 2)
- Limits on the number of memory ports (Merced: 2; future > 2?)
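The bundle layout above can be checked at the bit level. A minimal sketch; the field ordering inside the packed word is illustrative, not the architected IA-64 layout:

```python
SLOTS, SLOT_BITS, TEMPLATE_BITS = 3, 41, 5
bundle_bits = SLOTS * SLOT_BITS + TEMPLATE_BITS
print(bundle_bits)  # 128: three 41-bit slots + 5-bit template fit exactly

def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit int.
    Field order here is an assumption for illustration only."""
    assert template < (1 << TEMPLATE_BITS) and len(slots) == SLOTS
    word = template
    for i, s in enumerate(slots):
        assert s < (1 << SLOT_BITS)
        word |= s << (TEMPLATE_BITS + i * SLOT_BITS)
    return word

b = pack_bundle(0b00001, [1, 2, 3])
print(b.bit_length() <= 128)  # True: the packed bundle fits in 128 bits
```

The 5-bit template is what lets the decoder dispatch all three slots to the right unit types (M, I, F, B, ...) without scanning the instructions themselves: the "decoder templates" that slide 10 credits with reducing nops.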

