1 UW-Madison Computer Sciences Multifacet Group © 2010 — Scalable Cores in Chip Multiprocessors — Thesis Defense, 2 November 2010, Dan Gibson

2 D. Gibson Thesis Defense - 2 Executive Summary (1/2) “Walls & Laws” suggest future CMPs will need Scalable Cores –Scale Up for Performance (e.g., one thread) –Scale Down for per-core energy conservation (e.g., many threads) Area 1: How to build efficient scalable cores. –Forwardflow, one scalable core –Overprovision rather than borrow

3 D. Gibson Thesis Defense - 3 Executive Summary (2/2) Area 2: How to use scalable cores: –Scale at fine granularity: Discover most-efficient configuration –Scale for multi-threaded workloads Scale up for sequential bottlenecks, improve performance Scale down for unimportant executions, improve efficiency –Using DVFS as a scalable core proxy

4 D. Gibson Thesis Defense - 4 Document Outline: 1. Introduction; 2. Extended Motivation (2.1 Scalable Cores, 2.2 Background, 2.3 Related Work); 3. Methods; 4. Serialized Successor Representation; 5. Forwardflow; 6. Scalable Cores for CMPs (6.1 Scalable Forwardflow, 6.2 Overprovisioning vs. Borrowing, 6.3 Power-Awareness, 6.4 Single-Thread Scaling, 6.5 Multi-Thread Scaling, 6.6 DVFS as a Proxy for Scalable Cores); 7. Conclusions/Future Work/Reflections; A/B. Supplements. Slide callouts mark the mostly old material (recap) vs. the mostly new material (talk focus), the TALK outline, and "if there's time and interest" for the supplements.

5 D. Gibson Thesis Defense - 5 '80s–'00s: Single-Core Heyday. Core and chip microarchitecture changed enormously: 386 (1985, 20 MHz), 486 (1989, 50 MHz), P6, PIV (2004, 3000 MHz). Clock frequency increased dramatically. "Hello, Software. I am a single x86 processor." … "Hello, Software. I am still a single x86 processor."

6 D. Gibson Thesis Defense - 6 Hitting the Power Wall. [Chart: power over product generations — Pentium, Pentium MMX, Pentium II, Pentium III, Pentium 4, Pentium D, Core 2, Core i; one example data point represents a range of actual products. Figure borrowed from Yasuko's WiDGET ISCA 2010 talk.] Kneejerk reactions: reduce clock frequency (e.g., 3.0 GHz to 2.4-ish GHz); de-emphasize pipeline depth (e.g., Pentium M). What about performance?

7 Chip Multiprocessors (CMPs) 1.Can’t clock (much) faster… 2.Hard to make uArch faster… Use Die Area for More Cores! D. Gibson Thesis Defense - 7 Hello, Software. I am TWO x86 processors. (And my descendants will have more…) “Fundamental Turn Toward Concurrency” [Sutter2005] Software must now change to see continued performance gains. This Won’t Be Easy.

8 D. Gibson Thesis Defense - 8 Moore's Law in the Multicore Era. "In 1965, Gordon Moore sketched out his prediction of the pace of silicon technology. Decades later, Moore's Law remains true, driven largely by Intel's unparalleled silicon expertise." Copyright © 2005 Intel Corporation. Cost per device falls (or "fell") predictably: –Density rises (devices/mm²) –Device size shrinks. Rock, 65nm [JSSC2009]; Rock16, 16nm [ITRS2007] (if you want 1024 threads).

9 D. Gibson Thesis Defense - 9 Amdahl's Law. Parallel Runtime = (1 − f) + f/N, where f = parallel fraction, N = number of cores. [Chart: normalized runtime vs. parallel fraction f, with N = 8.] Sequential: not good. Partially-parallel: OK. Highly-parallel: very good.
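
A minimal numeric illustration of the runtime expression above (illustrative C++; N = 8 matches the slide's chart, the sample f values are arbitrary):

    #include <cstdio>

    // Normalized runtime under Amdahl's Law: serial part (1 - f) plus
    // parallel part f spread across N cores.
    double amdahl_runtime(double f, int n_cores) {
        return (1.0 - f) + f / n_cores;
    }

    int main() {
        const int N = 8;  // number of cores, as in the slide's chart
        for (double f : {0.0, 0.5, 0.9, 0.99}) {
            std::printf("f = %.2f -> normalized runtime = %.3f (speedup %.2fx)\n",
                        f, amdahl_runtime(f, N), 1.0 / amdahl_runtime(f, N));
        }
        return 0;
    }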

10 D. Gibson Thesis Defense - 10 Utilization Wall (aka SAF). Simultaneously Active Fraction (SAF): fraction of devices in a fixed-area design that can be active at the same time, while still remaining within a fixed power budget. [Venkatesh2009] [Chart: dynamic SAF vs. technology node (…, 65nm, 45nm, 32nm) for LP devices and HP devices [Chakraborty2008].]
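
A toy calculation of the SAF idea as defined above (illustrative C++ only; the budget, density, and per-device power numbers are made-up placeholders, not values from the thesis):

    #include <cstdio>

    // Simultaneously Active Fraction: what fraction of the chip's devices can
    // switch at once without exceeding a fixed power budget.
    int main() {
        double power_budget_w   = 100.0;   // fixed chip power budget (assumed)
        double devices_per_mm2  = 4.0e6;   // device density at some node (assumed)
        double chip_area_mm2    = 300.0;   // fixed die area (assumed)
        double watts_per_device = 2.0e-7;  // switching power per active device (assumed)

        double total_devices = devices_per_mm2 * chip_area_mm2;
        double max_active    = power_budget_w / watts_per_device;
        double saf           = max_active / total_devices;
        if (saf > 1.0) saf = 1.0;  // cannot activate more devices than exist

        std::printf("SAF = %.2f (of %.2e devices)\n", saf, total_devices);
        return 0;
    }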

11 D. Gibson Thesis Defense - 11 Architects Boxed In: Walls and Laws. Power Wall (PW): cannot clock much faster. Utilization Wall (UW): cannot use all devices. Amdahl's Law (AL): single threads need help; not all code is parallel. → Scalable CMPs.

12 Scalable CMPs → Scalable Cores Scale UP for Performance –Use more resources for more performance –(i.e., 2 Strong Oxen) Scale DOWN to Conserve Energy –Exploit TLP with many small cores –(i.e., 1024 Chickens) If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens? -Attributed to Seymour Cray D. Gibson Thesis Defense - 12

13 Scalable Cores in CMPs 1.How to build a Scalable Core? –Should be efficient –Should offer a wide power/perf. Range 2.How to use Scalable Cores? –Optimize single-thread efficiency –Detect and ameliorate bottlenecks D. Gibson Thesis Defense - 13 This Thesis:

14 D. Gibson Thesis Defense - 14 Area 1: Efficient Scalable Cores. Fear leads to Anger, Anger leads to Hate, Hate leads to Suffering; likewise, Naming leads to Association, Association leads to Broadcast, Broadcast leads to Inefficiency. Forwardflow Core Architecture –Raise average I/MLP, not peak –Efficient SRAMs, no CAMs. Serialized Successor Representation (SSR) –Use pointers instead of names. Basis for a Scalable Core Design.

15 D. Gibson Thesis Defense - 15 Area 2: Scalable Cores in CMPs How to scale cores: –Overprovision each core? –Borrow/merge cores? When to scale cores: –For one thread? –For many threads? How to continue: –DVFS as a proxy for a scalable core

16 D. Gibson Thesis Defense - 16 Outline Introduction: Scalable Cores –Motivation (Why scale in the first place?) –Definition Scalable Cores for CMPs –How to scale: Dynamically-Scalable Core (Forwardflow) Overprovision or Borrow Resources? –When to scale: Hardware Scaling Policies For single-thread efficiency For multi-threaded workloads Conclusions/Wrap-Up

17 D. Gibson Thesis Defense - 17 Forwardflow (FF): A Scalable Core. Forwardflow Core = Frontend (L1-I, Decode, ARF) + Distributed Execution Logic/Window (DQ) + L1-D Cache. Scale Down: use a smaller window. Scale Up: use a bigger window.

18 D. Gibson Thesis Defense - 18 Window Scaling vs. Core Scaling. FF: only scales the instruction window –Not width, –Not registers, –etc. How does window scaling scale the core? –By affecting demand –Analogous to Bernoulli's Principle. Power of unscaled components (FE, DQ)? P = α · C_L · V_dd² · f, where α is the "activity factor."
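
A small illustration of how the α·C·V²·f term behaves when only activity changes (illustrative C++; the capacitance, voltage, and activity values are placeholders, not thesis numbers — only the 3 GHz clock comes from the evaluated configurations):

    #include <cstdio>

    // Dynamic power P = alpha * C * Vdd^2 * f.
    // Window scaling leaves Vdd and f alone; it changes demand on the unscaled
    // components (frontend, caches), i.e. their activity factor alpha.
    double dynamic_power(double alpha, double c_farads, double vdd, double freq_hz) {
        return alpha * c_farads * vdd * vdd * freq_hz;
    }

    int main() {
        const double c   = 1.0e-9;  // lumped switched capacitance (placeholder)
        const double vdd = 0.9;     // supply voltage (placeholder)
        const double f   = 3.0e9;   // 3 GHz clock, as in the evaluated configs

        double p_nominal   = dynamic_power(0.10, c, vdd, f);  // nominal activity
        double p_scaled_up = dynamic_power(0.13, c, vdd, f);  // higher demand when the window scales up

        std::printf("nominal: %.2f W, scaled up: %.2f W (+%.0f%%)\n",
                    p_nominal, p_scaled_up, 100.0 * (p_scaled_up / p_nominal - 1.0));
        return 0;
    }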

19 D. Gibson Thesis Defense - 19 FF Dynamic Configuration Space
Config | Sobriquet | Description
F-32 | "Fully scaled down" | 32-entry instruction window, single issue (1/4 of a DQ bank group)
F-64 | | 64-entry instruction window, dual issue (1/2 of a DQ bank group)
F-128 | "Nominal" | 128-entry instruction window, quad issue (one full DQ bank group)
F-256 | | 256-entry instruction window, "quad" issue (2 BGs)
F-512 | | 512-entry instruction window, "quad" issue (4 BGs)
F-1024 | "Fully scaled up" | 1K-entry instruction window, "quad" issue (8 BGs)

20 D. Gibson Thesis Defense - 20 Configuration…
Component | Configuration
Mem. Cons. Model | Sequential Consistency
Coherence Prot. | MOESI Directory (single chip)
Store Issue Policy | Permissions Prefetch at X
Frequency | 3.0 GHz
Technology | 32nm
Window Size | Varied by experiment
Disambig. | NoSQ
Branch Prediction | TAGE + 16-entry RAS + BTB
Frontend | 7 cyc. pred-to-dispatch
L1-I Caches | 32KB 4-way 64b 4-cycle, 2 proc. ports
L1-D Caches | 32KB 4-way 64b 4-cycle LTU, 4 proc. ports, WI/WT, included by L2
L2 Caches | 1MB 8-way 64b 11-cycle, WB/WA, private
L3 Cache | 8MB 16-way 64b 24-cycle, shared
Main Memory | 8GB, 2 DDR2-like controllers (64 GB/s peak BW), 300-cycle latency
Inter-proc. network | 2D mesh, 16B links

21 D. Gibson Thesis Defense - 21 FF Scalable Core Performance. [Chart: normalized runtime of F-32 through F-1024 for h264ref, libquantum, and Gmean.] Mostly compute-bound: not much scaling. Mostly memory-bound: great scaling.

22 D. Gibson Thesis Defense - 22 FF Scalable Core Power. [Chart: normalized power of F-32 through F-1024, broken into FE, DQ/ALU, MEM, and Static components.] W.r.t. nominal F-128 — Scale up (8x window): +28% FE power, +91% DQ/ALU power, +27% MEM power. Scale down (1/4 window): -39% FE power, -54% DQ/ALU power, -32% MEM power.

23 D. Gibson Thesis Defense - 23 FF Recap. Forwardflow Scalable Core (FE + DQ + L1-D). Scale Down: use a smaller window. Scale Up: use a bigger window. More details on Forwardflow (backup slides).

24 D. Gibson Thesis Defense - 24 Outline Introduction: Scalable Cores –Motivation (Why scale in the first place?) –Definition Scalable Cores for CMPs –How to scale: Dynamically-Scalable Core (Forwardflow) Overprovision or Borrow Resources? –When to scale: Hardware Scaling Policies For single-thread efficiency For multi-threaded workloads Conclusions/Wrap-Up

25 D. Gibson Thesis Defense - 25 Overprovisioning vs. Borrowing Scaling Core Performance means Scaling Core Resources –From where can a scaled core acquire more resources? Option 1: Overprovision All Cores –Every core can scale up fully using a core- private resource pool Option 2: Borrow From Other Cores –Cores share resources with neighbors

26 D. Gibson Thesis Defense - 26 What Resources? Forwardflow: –Resources = DQ Bank Groups (i.e., window space, functional units). Simple experiment: –Overprovision: each core has 8 BGs, enough for F-1024. What is the area cost? –Borrow: each core has 4 BGs, enough to scale to F-512; borrows neighbors' BGs to reach F-1024. What is the performance cost?

27 D. Gibson Thesis Defense - 27 Per-Core Overprovisioning (32nm). [Floorplan: FE, L1-D, L2, L3 bank.] Overprovisioned tile: 8.96mm × 4.15mm = 37.2mm². Overprovisioned CMP: 17.9mm × 16.6mm = 298mm². Scale Up: activate more resources.

28 D. Gibson Thesis Defense - 28 Resource Borrowing (32nm). [Floorplan: two adjacent tiles, each with FE, L1-D, L2, L3 bank.] Borrowed tile: 8.31mm × 4.15mm = 34.5mm². Borrowing CMP: 276mm². Scale Up: borrow resources from neighbor.

29 D. Gibson Thesis Defense - 29 Area Cost (32nm). Per-core: 12.3mm² overprovisioned, +27% over borrowed. Per-tile: 37.2mm² vs. 34.5mm² (+8%). Per-CMP: 298mm² vs. 276mm² (+7%).

30 Performance Cost of Borrowing Borrowing Slower? –Maybe Not: Comparable Wire Delay (in this case) –Maybe: Crosses Physical Core Boundary Global vs. Local Wiring? Cross a clock domain? D. Gibson Thesis Defense - 30 Simple Experiment –2-cycle lag crossing core boundary –Slows inter-BG communication –Slows dispatch 32nm

31 D. Gibson Thesis Defense - 31 A Loose Loop. [Chart: normalized runtime of F-1024O (overprovisioned) vs. F-1024B (borrowed, 2-cycle lag).] –Borrowing gives up ~9% runtime w.r.t. overprovisioning –Essentially no performance improvement from scaling up!

32 Overprovisioning vs. Borrowing Overprovisioning CAN be cheap –FF: 7% CMP area –CF: 12.5% area from borrowing [Ipek2007] If Borrowing introduces even small delays, it may no longer be worthwhile to scale at all. –This effect is worse if borrowing occurs at smaller design points. D. Gibson Thesis Defense - 32

33 D. Gibson Thesis Defense - 33 Outline Introduction: Scalable Cores –Motivation (Why scale in the first place?) –Definition Scalable Cores for CMPs –How to scale: Dynamically-Scalable Core (Forwardflow) Overprovision or Borrow Resources? –When to scale: Hardware Scaling Policies For single-thread efficiency For multi-threaded workloads Conclusions/Wrap-Up

34 D. Gibson Thesis Defense - 34 What to do for f = 0.00. What is important? –Performance: just scale up (done) –Efficiency: pick the most efficient configuration? How to find the right configuration? Can we do better? [Chart: normalized runtime vs. parallel fraction f.]

35 D. Gibson Thesis Defense - 35 What about local efficiency? (i.e., phases) Applications may exhibit phases at "micro-scale" –Not all phases are equal.
# Sum an array — great for big windows (scale up?)
l_array: load [R1+ 0] -> R2
         add  R2 R3   -> R3
         add  R1 64   -> R1
         brnz l_array
...
# Sum a list — big window makes no difference (scale down?)
l_list:  load [R1+ 8] -> R2
         add  R2 R3   -> R3
         load [R1+ 0] -> R1
         brnz l_list

36 D. Gibson Thesis Defense - 36 Prior Art (some of it). [Chart: E·D², normalized to the best static design, for POS and PAMRS.] POS (Positional Adaptation) [Huang03]: –Maps code regions to configurations –Static profiling, measure efficiency. PAMRS (Power-Aware uArch Resource Scaling) [Iyer01]: –Detect "hot spots" –Measure all configurations' efficiency, pick best. Want: efficiency of POS, but dynamic response of PAMRS.
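
A schematic of the "measure every configuration, keep the best" step attributed to PAMRS above (illustrative C++; the configuration list, the sample numbers, and the measurement hook are placeholders, not thesis data):

    #include <cstdio>
    #include <string>
    #include <vector>

    // One measured interval for a candidate configuration.
    struct Sample {
        std::string config;   // e.g. "F-32" ... "F-1024"
        double energy_j;      // energy over the interval
        double delay_s;       // runtime of the interval
    };

    // Efficiency metric used throughout these slides: E*D^2 (lower is better).
    double ed2(const Sample& s) { return s.energy_j * s.delay_s * s.delay_s; }

    int main() {
        // Hypothetical measurements of one hot spot under each configuration.
        std::vector<Sample> samples = {
            {"F-32", 0.8, 2.0}, {"F-128", 1.0, 1.0}, {"F-1024", 1.6, 0.8}};

        const Sample* best = &samples.front();
        for (const Sample& s : samples)
            if (ed2(s) < ed2(*best)) best = &s;   // pick the most efficient config

        std::printf("best config for this hot spot: %s (E*D^2 = %.2f)\n",
                    best->config.c_str(), ed2(*best));
        return 0;
    }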

37 MLP-based Window Size Estimation Play to the cards of the uArch: –FF: Pursue/measure MLP –Something else: Something else Find the smallest window that will expose as much MLP as the largest window Hardware: –Poison bits –Register names –Load miss bit –Counter –LFSR D. Gibson Thesis Defense - 37 Results Explain window size estimation in detail with a gory example

38 D. Gibson Thesis Defense - 38 FG Scaling Results. [Chart: normalized E·D² for POS, PAMRS, and MLP.] MLP: –No profiling needed –Safe: hurts efficiency by >10% for only 1 benchmark –Compare to: POS, 8 benchmarks; PAMRS, 20 benchmarks. [Chart callouts: fewer of these compared to PAMRS; fewer of these compared to POS.]

39 D. Gibson Thesis Defense - 39 Recap: What to do for f = 0.00. Profiling (POS) –Can help, might hurt. Dynamic response: seek MLP –Seldom hurts, usually finds the most efficient configuration. [Chart: normalized runtime vs. parallel fraction f.]

40 D. Gibson Thesis Defense - 40 Outline Introduction: Scalable Cores –Motivation (Why scale in the first place?) –Definition Scalable Cores for CMPs –How to scale: Dynamically-Scalable Core (Forwardflow) Overprovision or Borrow Resources? –When to scale: Hardware Scaling Policies For single-thread efficiency For multi-threaded workloads Conclusions/Wrap-Up

41 D. Gibson Thesis Defense - 41 What to do for f > 0. Two opportunities: 1. Sequential bottlenecks: detect, fix (i.e., scale up) → better performance. 2. Useless executions: detect, fix (i.e., scale down) → better efficiency. [Chart: normalized runtime vs. parallel fraction f.]

42 D. Gibson Thesis Defense - 42 What if the OS Knows? OS knows about bottlenecks –Can scale up a core. OS knows about useless work –Can scale down, or, –Can shut off unneeded cores (e.g., OPMS). Result: Amdahl's Law in the Multicore Era [Hill/Marty08]. [Chart: normalized runtime vs. parallel fraction f.]

43 If the OS doesn’t know Maybe programmer knows? (Prog) D. Gibson Thesis Defense - 43 Dunce SLE-like lock detector to identify critical sections? (Crit) Hardware spin detection? [Wells2006] (Spin) Holding a lock… except when spinning? (CSpin) Every thread spinning except one? (ASpin) (limit study, pretend global communication is OK)

44 D. Gibson Thesis Defense - 44 Amdahl Microbenchmark. Benchmarks × Policies × Configs + opacity → unclear behavior, so use a simple microbenchmark: a sequential phase of length (1 − f) followed by a parallel phase of length f. [Chart: normalized runtime vs. parallel fraction f; "Real HW" callout.]

45 D. Gibson Thesis Defense - 45 Prog (Programmer-Guided Scaling). Programmer marks regions with sc_hint(fast) / sc_hint(slow). [Chart: normalized runtime vs. parallel fraction f for Prog, F-128, and F-1024; configuration track F-32 through F-1024.]

46 D. Gibson Thesis Defense - 46 Crit (SLE-Style Lock Detector for Scale-Up). [Chart: normalized runtime vs. parallel fraction f for Crit, Prog, F-128, F-1024; configuration track F-32 through F-1024.] WTH? The culprit: Barrier::Arrive() { l.Lock(); … l.Unlock(); }, where Lock::Lock() { CAS(myValue); … } and Lock::Unlock() { CAS(myValue); … }.

47 Crit: What goes wrong Intuition Mismatch: –Lock Detector Implementer's expectations don’t match pthread library implementer's expectations. 1.Critical Section != Sequential Bottleneck 2.Lock+Unlock != CAS+Temporal Silent Store More general lesson –SW is really flexible. Programmers do strange things. HW designer: Be careful, SW may not be doing what you think D. Gibson Thesis Defense - 47

48 D. Gibson Thesis Defense - 48 Spin (Spin Detector for Scale-Down). [Chart: configuration track F-32 through F-1024.] Spinning: scale down. Seldom/never spins: performs like CSpin (next).

49 D. Gibson Thesis Defense - 49 CSpin (Lock Detector for Scale-Up, Spin Detector for Scale-Down). [Chart: normalized runtime vs. parallel fraction f for CSpin, Crit, Prog, F-128, F-1024; configuration track F-32 through F-1024.] Failure case: the lock detector thinks a lock is held, but the thread is also spinning.

50 D. Gibson Thesis Defense - 50 ASpin (Spin, but Scale Up if all others Scaled Down). [Chart: normalized runtime vs. parallel fraction f for ASpin, CSpin, Crit, Prog, F-128, F-1024; configuration track F-32 through F-1024.] All spinning: scale up — better late than never.

51 D. Gibson Thesis Defense - 51 Amdahl Efficiency. [Chart: normalized E·D² vs. parallel fraction f for F-128, Prog, Spin, ASpin.] 1. Hope of SW parallelism for efficiency seems sound. 2. "Programmer" can help — but psychology? Difficulty for non-toy programs? 3a. Spin-detection helps, by scaling down. 3b. Can scale up when others spin ("others"?).

52 D. Gibson Thesis Defense - 52 Real Workloads? [Chart: normalized E·D² for F-128, Spin, CSpin, ASpin, F-1024.] Workload behavior: f ≈ 0.90+ by design — graduate students spend a lot of time making this so. No Prog scaling policy. Apache: spin detection helps; synchronization heavy. JBB: synchronization heavy. OLTP: (just) Spin hurts a little, ASpin helps; synchronization heavy. Zeus: (just) Spin hurts a little, ASpin helps; synchronization heavy.

53 D. Gibson Thesis Defense - 53 Outline Introduction: Scalable Cores –Motivation (Why scale in the first place?) –Definition Scalable Cores for CMPs –How to scale: Dynamically-Scalable Core (Forwardflow) Overprovision or Borrow Resources? –When to scale: Hardware Scaling Policies For single-thread efficiency For multi-threaded workloads –How to continue: DVFS/Models for Future Software Evaluations Conclusions/Wrap-Up

54 Conclusions (1/2) How to scale cores: –Forwardflow: An Energy-Proportional Scalable Window Core Architecture Scale up for performance Scale down for energy conservation –Overprovision Resources when cheap Borrow only when necessary Avoid loose loops D. Gibson Thesis Defense - 54

55 D. Gibson Thesis Defense - 55 Conclusions (2/2) When to scale cores: –For single-thread efficiency: Seek efficient operation intrinsically (FF: MLP) Profiling can help, if possible. –For threaded workloads: Scale up for sequential bottlenecks –If you can find them Scale down for useless work How to emulate scalable cores –Proxy with DVFS, with caveats

56 D. Gibson Thesis Defense - 56 Other Contributions Side Projects with Collaborators –Deconstructing Scalable Cores, Coming Soon –“Diamonds are an Architect’s Best Friend”, ISCA 2009 –To CMP or Not to CMP, TR & ANCS Poster Parallel Programming at Wisconsin –CS 838, CS 758 Various Infrastructure Work –Ruby, Tourmaline, Lapis, GEM5

57 D. Gibson Thesis Defense - 57 Fun Facts About This Thesis Simulator: –C++: 135kl (101kl), Python: 16.7kl –1188 Revs, 17,476 Builds ~15 builds per day since 5 July 2007 Forwardflow used to be Fiberflow –Watch out, Metamucil Est. Simulation Time: –2.9B CPU*Seconds = 95 Cluster*Days (just in support of data in this thesis)

58 D. Gibson Thesis Defense - 58 Questions/Pointers Overp./Borrowing FG Uniproc. Scaling Multiproc. Scaling DVFS vs. W. Scaling SSR All about FF Estimating Power LBUS/RBUS Scalable Scheduling Seeking MLP Other Scalable Cores Related Work Backward Ptrs. In the Document Always in motion is the future. DVFS vs. Scaling

59 D. Gibson Thesis Defense - 59 DVFS Instead of Simulation So far: –“Benchmark” = 1ms – 10ms target time –Scaling “in the micro” i.e., Much faster than software What about longer runs? –“Benchmark” = minutes+ –Scaling “in the macro” i.e., At the scale of systems No real hardware scalable core –Use DVFS instead, as a proxy. You must unlearn what you have learned.

60 D. Gibson Thesis Defense - 60 DVFS Effects. [Figure: core and memory hierarchy (FE, L1-D, L2, L3, DRAM) with the DVFS domain highlighted; voltage range 1V–0V; 3 GHz → 3.6 GHz.] +Freq: compute operations are faster. +Freq: memory seems slower. +Freq, +Volt: dynamic power higher (~cubic). +P_dyn: higher temperature leads to higher static power.
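
A quick illustration of why the dynamic-power increase is roughly cubic when frequency and voltage move together (illustrative C++; the 3.0 → 3.6 GHz step comes from the slide, the nominal voltage and the voltage-tracks-frequency assumption are placeholders):

    #include <cstdio>

    // P_dyn is proportional to V^2 * f.  Under classic DVFS, V scales roughly
    // with f, so boosting frequency by k raises dynamic power by about k^3.
    int main() {
        const double f0 = 3.0e9, f1 = 3.6e9;   // 3.0 GHz -> 3.6 GHz, as on the slide
        const double v0 = 0.9;                  // nominal supply (placeholder)
        const double k  = f1 / f0;              // 1.2x frequency
        const double v1 = v0 * k;               // assume voltage tracks frequency

        double rel_power = (v1 * v1 * f1) / (v0 * v0 * f0);  // relative dynamic power
        std::printf("freq x%.2f -> dynamic power x%.2f (~k^3 = %.2f)\n",
                    k, rel_power, k * k * k);
        return 0;
    }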

61 D. Gibson Thesis Defense - 61 HW Scaling Effects. [Figure: FE, L1-D, L2, L3, DRAM; scale up, clock stays at 3 GHz.] +Window: compute operations are not (much) faster. +Window: memory seems faster. +Window: dynamic power higher (~log). How do they compare quantitatively?

62 D. Gibson Thesis Defense - 62 DVFS/HW Scaling Performance. [Chart: runtime normalized to F-128 for DVFS and window-scaled configurations.] More CPU-bound: prefer DVFS. More memory-bound: prefer window scaling. ∃ a DVFS/scaling config pair with comparable performance.

63 D. Gibson Thesis Defense - 63 DVFS/HW Scaling Power. [Chart: power normalized to F-128, broken into FE, DQ/ALU, MEM, Static, comparing a DVFS-boosted F-128 against F-256.] DVFS: +~38% chip power, +~70% DVFS-domain dynamic power, +~20% temp-induced leakage. FF scaling: +~10% chip power, +~2% temp-induced leakage.

64 D. Gibson Thesis Defense - 64 DVFS Proxying Scalable Cores. Performance: OK with caveats –CPU-bound workloads: DVFS overestimates scalable core performance –Memory-bound workloads: DVFS underestimates scalable core performance. Power: Not OK. –DVFS follows the E·D² curve –FF/Scalable Core should be better than the E·D² curve. –Use a model instead.

65 D. Gibson Thesis Defense - 65 SSR: Per-Value Distributed Linked List –Starts at producer –Visits each successor –NULL pointer at last successor. Amenable to simple hardware –Serializes wakeup. Example sequence:
ld   R4 4  R1
add  R1 R3 R3
sub  R4 16 R4
st   R3    R8
breq R4 R3
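
A software sketch of the per-value successor list described above (illustrative C++, in the spirit of a simulator model; the type and field names are invented here, and the pointer is simplified to a DQ entry index even though real SSR pointers also name the operand slot):

    #include <vector>

    // SSR in software terms: for each value, a distributed singly-linked list.
    // The producer's destination slot holds the head pointer; each consumer's
    // operand slot points at the next consumer; the last consumer holds NULL (-1).
    struct Pointer { int entry = -1; };           // -1 == NULL == last successor

    struct DQEntry {
        Pointer dest_next;                         // head of this value's successor list
        Pointer src1_next;                         // continuation of the list it consumes
    };

    // Append a newly decoded consumer to the chain whose current tail is `tail_ptr`.
    void append_successor(std::vector<DQEntry>& dq, Pointer& tail_ptr, int new_entry) {
        tail_ptr.entry = new_entry;                // old tail now points at the newcomer
        dq[new_entry].src1_next.entry = -1;        // newcomer becomes the NULL-terminated tail
    }

    int main() {
        std::vector<DQEntry> dq(4);
        // Value produced by entry 0, consumed (in program order) by entries 1 and 3.
        append_successor(dq, dq[0].dest_next, 1);  // producer -> first successor
        append_successor(dq, dq[1].src1_next, 3);  // first successor -> last successor
        return 0;
    }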

66 D. Gibson Thesis Defense - 66 Effect of Serialized Wakeup. Compared to idealized window –Low mean performance loss from serialized wakeup (+2% runtime) –Occasionally noticeable (e.g., bzip2, 50%+)

67 D. Gibson Thesis Defense - 67 SSR Compiler Optimization. long: compiler cannot identify dynamic repeated regs. split: compiler can identify dynamic repeated regs, but cannot identify the critical path. crit: compiler knows both dynamic repeated regs and the critical path.

68 D. Gibson Thesis Defense - 68 Power-Awareness. How much energy is used by a computation? –Measure (e.g., with a multimeter) –Detailed Simulation (e.g., SPICE) –Simple Simulation (e.g., WATTCH) –Simple Model (e.g., 10W/core). Event-count model: E ≈ Σ_i N_i · E_i, where N_i = number of activations of element i and E_i = energy per activation of element i.
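
A minimal sketch of the activation-count energy model above (illustrative C++; the element list and per-activation energies are placeholders, not thesis values):

    #include <cstdio>
    #include <vector>

    // E is approximately the sum over elements i of N_i * E_i.
    struct Element {
        const char* name;
        long long   activations;   // N_i, e.g. from performance counters
        double      energy_per_nj; // E_i in nanojoules (placeholder values below)
    };

    int main() {
        std::vector<Element> elems = {
            {"fetch",      1'000'000, 0.05},
            {"dq_wakeup",    800'000, 0.02},
            {"l1d_access",   400'000, 0.10},
        };
        double total_nj = 0.0;
        for (const Element& e : elems)
            total_nj += e.activations * e.energy_per_nj;   // N_i * E_i
        std::printf("estimated energy: %.1f uJ\n", total_nj / 1000.0);
        return 0;
    }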

69 D. Gibson Thesis Defense - 69 Measuring Energy Online. Events: "easy" to measure. Activations: "hard" to measure, but correlated with events. [Iyer01]: MAC in hardware. [Joseph01]: HW perf. counters, works for Pentium-era. This work: scalable core — use the core's own resources to do the computation.

70 D. Gibson Thesis Defense - 70 DVFS Won't Cut It. Near saturation in voltage scaling. Subthreshold DVFS is never energy-efficient [Zhai04]. Need a microarchitectural alternative. [Chart callouts: ~80%, ~33%. Figure borrowed from David's "Two Cores" talk.]

71 D. Gibson Thesis Defense - 71 Scalable Interconnect. Logically: a ring. Scale Down: a ring with fewer elements. Not straightforward. Overprovisioning won't work well: the wrap-around link is ugly. Needs to support 1-, 2-, 4-, 8-BG operation.

72 D. Gibson Thesis Defense - 72 Two Unidirectional Busses (gasp!) [Figure: bus organization at F-1024 and at a scaled-down configuration.]

73 D. Gibson Thesis Defense - 73 Window Estimation Example. Code: three iterations of the array-sum loop (load [R1+0] -> R2; add R2 R3 -> R3; add R1 64 -> R1; brnz l_array). Walkthrough: the first load misses — start profiling and poison R2; the dependent add poisons R3; add R1 64 is an antidote for R1 (independent of the miss). Each later load is an independent miss: set an ELMR bit at its distance from the profiled miss and poison its destination. At the end, MSb(ELMR) = 16 → window size 16 needed.
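
A software sketch of that estimation loop (illustrative C++; the ELMR name follows the slide, but the exact poison/antidote rules here are a simplified reconstruction, not the thesis's hardware definition, and this short 3-iteration trace yields a smaller MSb than the slide's example):

    #include <bitset>
    #include <cstdio>
    #include <set>
    #include <string>
    #include <vector>

    // One dynamic instruction: destination, sources, and whether it is a load miss.
    struct Inst {
        std::string dest;
        std::vector<std::string> srcs;
        bool load_miss;
    };

    int main() {
        // Three unrolled iterations of the array-sum loop from the slide.
        std::vector<Inst> trace;
        for (int i = 0; i < 3; ++i) {
            trace.push_back({"R2", {"R1"}, true});        // load [R1+0] -> R2 (misses)
            trace.push_back({"R3", {"R2", "R3"}, false}); // add  R2 R3  -> R3
            trace.push_back({"R1", {"R1"}, false});       // add  R1 64  -> R1
            trace.push_back({"",   {"R1"}, false});       // brnz l_array
        }

        std::bitset<64> elmr;                // eligible-load-miss register, per slide
        std::set<std::string> poisoned;      // registers tainted by the profiled miss
        bool profiling = false;
        int distance = 0;                    // instructions since profiling began

        for (const Inst& in : trace) {
            bool dep = false;
            for (const auto& s : in.srcs) dep |= poisoned.count(s) > 0;

            if (in.load_miss && !profiling) {            // first miss: start profiling
                profiling = true;
                poisoned.insert(in.dest);
            } else if (profiling && in.load_miss && !dep) {
                elmr.set(distance);                      // independent miss at this distance
                poisoned.insert(in.dest);
            } else if (profiling && !in.dest.empty()) {
                if (dep) poisoned.insert(in.dest);       // poison propagates
                else     poisoned.erase(in.dest);        // overwrite acts as an antidote
            }
            if (profiling) ++distance;
        }

        // The most-significant set bit indicates how big a window is needed to
        // expose the farthest independent miss.
        int msb = -1;
        for (int i = 63; i >= 0; --i) if (elmr.test(i)) { msb = i; break; }
        std::printf("MSb(ELMR) = %d -> window of about that many entries needed\n", msb);
        return 0;
    }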

74 D. Gibson Thesis Defense - 74 Adding Hysteresis (1/2). [Charts: chosen configuration over time (F-32 through F-1024) for libquantum and astar.] 1. Many reconfigs. 2. Too small most of the time. Must anticipate, not react.

75 D. Gibson Thesis Defense - 75 Adding Hysteresis (2/2). Scale down only "occasionally" –On full squash. Intuition: –Assume a big window is not useful –Show, occasionally, that a big window IS useful. [Chart: chosen configuration over time for astar, F-32 through F-1024.]

76 D. Gibson Thesis Defense - 76 Leakage Trends. Leakage starts to dominate. SOI & DG technology helps (ca. 2010/2013). Tradeoffs possible: –Low-leak devices (slower access time). [Charts: "1MB Cache: Dynamic & Leakage Power" [HP2008, ITRS2007], power (mW); "Leakage Power by Circuit Variant" [ITRS2007], normalized power; series include DG and LSP devices.]

77 D. Gibson Thesis Defense - 77 Forwardflow Overview Design Philosophy: –Avoid ‘broadcast’ accesses (e.g., no CAMs) Avoid ‘search’ operations (via pointers) –Prefer short wires, tolerate long wires –Decouple frontend from backend details Abstract backend as a pipeline

78 D. Gibson Thesis Defense - 78 Forwardflow – Scalable Core Design. Use pointers to explicitly define data movement –Every operand has a Next Use Pointer –Pointers specify where data moves (in log(N) space) –Pointers are agnostic of: implementation, structure sizes, distance –No search operation. Example (pointer chains drawn in the slide figure):
ld   R4 4  R1
add  R1 R3 R3
sub  R4 16 R4
st   R3    R8
breq R4 R3

79 D. Gibson Thesis Defense - 79 Forwardflow – Dataflow Queue. Table of in-flight instructions; combination Scheduler, ROB, and PRF –Manages OOO dependencies –Performs scheduling –Holds data values for all operands. Each operand maintains a next-use pointer (hence the log(N)). Implemented as banked RAMs → scalable. [Figure: Dataflow Queue (Op1, Op2, Dest): 1: ld R4, 4 -> R1; 2: add R1, R3 -> R3; 3: sub R4, 16 -> R4; 4: st R3, R8; 5: breq R4, R5.] (Backup: Bird's-Eye View of FF; Detailed View of FF.)

80 D. Gibson Thesis Defense - 80 Forwardflow – DQ +/-'s. + Explicit, persistent dependencies. + No searching of any kind. - Multi-cycle wakeup per value* (*average number of successors is small [Ramirez04, Sassone07]). [Figure: the Dataflow Queue example (Op1, Op2, Dest).]

81 D. Gibson Thesis Defense - 81 DQ: Banks, Groups, and ALUs. [Figure: logical organization vs. physical organization.] DQ Bank Group – fundamental unit of scaling.

82 D. Gibson Thesis Defense - 82 Forwardflow: Pipeline Tour. RCT: identifies successors. ARF: provides architected values. DQ: chases pointers. [Pipeline figure: PRED → FETCH → DECODE → DISPATCH → EXECUTE → COMMIT, over I$, RCT, ARF, DQ, D$; scalable, decoupled backend.]

83 D. Gibson Thesis Defense - 83 RCT: Summarizing Pointers. Want to dispatch: breq R4 R5. Need to know: –Where to get R4? Result of DQ entry 3 –Where to get R5? From the ARF. The Register Consumer Table summarizes where the most-recent version of each register can be found. [Figure: the Dataflow Queue example (entries 1–4), with entry 5 about to be dispatched.]

84 D. Gibson Thesis Defense - 84 RCT: Summarizing Pointers. [Figure: the Dataflow Queue with entry 5 = breq R4 R5 being dispatched.] Register Consumer Table (RCT), REF / WR columns: R1: 2-S1 / 1-D; R2: — / —; R3: 4-S1 / 2-D; R4: — / 3-D; R5: — / —. Dispatching breq R4 R5 as 5-S1: R4 comes from DQ entry 3-D; R5 comes from the ARF.

85 D. Gibson Thesis Defense - 85 Wakeup/Issue: Walking Pointers. Follow the Dest pointer when a new result is produced –Continue following pointers to subsequent successors –At each successor, read the 'other' value & try to issue. NULL pointer → last successor. [Figure: the Dataflow Queue example.]
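
A sketch of that pointer walk in simulator-style C++ (illustrative; the struct shapes are invented here, and the pointer is simplified so the chain always continues through the S1 slot):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Simplified DQ entry: each operand slot holds a value (if known) and a
    // pointer to the next consumer of the same value (-1 terminates the chain).
    struct Slot  { bool ready = false; uint64_t value = 0; int next = -1; };
    struct Entry { const char* op; Slot src1, src2, dest; };

    // Serialized wakeup: when entry `producer` finishes, walk its destination
    // chain, deliver the value to each successor, and issue any that become ready.
    void wakeup(std::vector<Entry>& dq, int producer, uint64_t result) {
        int cur = dq[producer].dest.next;            // first successor
        while (cur != -1) {                          // NULL pointer = last successor
            Slot& s = dq[cur].src1;
            s.ready = true;
            s.value = result;                        // deliver the value
            if (dq[cur].src2.ready)                  // read the 'other' value & try to issue
                std::printf("issue DQ entry %d (%s)\n", cur, dq[cur].op);
            cur = s.next;                            // follow pointer to the next successor
        }
    }

    int main() {
        std::vector<Entry> dq(3);
        dq[0] = {"ld"};  dq[1] = {"add"};  dq[2] = {"breq"};
        dq[0].dest.next = 1;     // ld's result is consumed by add...
        dq[1].src1.next = 2;     // ...then by breq (last successor)
        dq[1].src2.ready = true; // the adds' other operands are already available
        dq[2].src2.ready = true;
        wakeup(dq, 0, 42);       // ld completes with value 42
        return 0;
    }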

86 D. Gibson Thesis Defense - 86 DQ: Fields and Banks. Independent fields → independent RAMs –i.e., accessed independently, independent ports, etc. Multi-issue ≠ multi-port –Multi-issue → multi-bank –Dispatch, commit access contiguous DQ regions: bank on low-order bits for dispatch/commit BW. Port contention + wire delay = more banks –Dispatch, commit share a port: bank on a high-order bit to reduce contention.

87 D. Gibson Thesis Defense - 87 DQ: Banks, Groups, and ALUs. [Figure: logical organization vs. physical organization.] DQ Bank Group – fundamental unit of scaling.

88 D. Gibson Thesis Defense - 88 Related Work Scalable Schedulers –Direct Instruction Wakeup [Ramirez04]: Scheduler has a pointer to the first successor Secondary table for matrix of successors –Hybrid Wakeup [Huang02]: Scheduler has a pointer to the first successor Each entry has a broadcast bit for multiple successors –Half Price [Kim02]: Slice the scheduler in half Second operand often unneeded

89 D. Gibson Thesis Defense - 89 Related Work Dataflow & Distributed Machines –Tagged-Token [Arvind90] Values (tokens) flow to successors –TRIPS [Sankaralingam03]: Discrete Execution Tiles: X, RF, $, etc. EDGE ISA –Clustered Designs [e.g. Palacharla97] Independent execution queues

90 D. Gibson Thesis Defense - 90 RW: Scaling, etc. CoreFusion [Ipek07] –Fuse individual core structures into bigger cores Power aware microarchitecture resource scaling [Iyer01] –Varies RUU & Width Positional Adaptation [Huang03] –Adaptively Applies Low-Power Techniques: Instruction Filtering, Sequential Cache, Reduced ALUs

91 D. Gibson Thesis Defense - 91 RW: Scalable Cores CoreFusion [Ipek07] –Fuse individual core structures into bigger cores Composable Lightweight Processors [Kim07] –Many very small cores operate collectively, ala TRIPS WiDGET [Watanabe10] –Scale window via smart steering

92 D. Gibson Thesis Defense - 92 RW: Seeking MLP Big Windows [Many] Runahead Execution [Dundas97][Mutlu06] –"Just keep executing" WIB [Lebeck02] –Defer, re-schedule later Continual Flow [Srinivasan04] –& Friends [Hilton09][Chaudhry09] –Defer, re-dispatch later

93 D. Gibson Thesis Defense - 93 Operand Networks. [Chart: CDF of pointer span for astar, sjeng, jbb.] SPAN = 5 — Observation: ~85% of pointers designate near successors. Intuition: most of these pointers yield IB traffic, some IBG-N, none IBG-D. SPAN = 16 — Observation: nearly all pointers (>95%) designate successors 16 or fewer entries away. Intuition: there will be very little IBG-D traffic.

94 D. Gibson Thesis Defense - 94 Is It Correct? Impossible to tell –Experiments do not prove, they support or refute What support has been observed of the hypothesis “This is correct”? –Reasonable agreement with published observations (e.g. consumer fanouts) –Few timing-first functional violations –Predictable uBenchmark behavior Linked list: No parallelism Streaming: Much parallelism

95 D. Gibson Thesis Defense - 95 CoreFusion. Borrow everything –Merges multiple discrete elements in multiple discrete cores into larger components –Troublesome for N > 2. [Figure: two cores' BPRED, I$, Decode, Sched., and PRF being fused.]

96 D. Gibson Thesis Defense - 96 "Vanilla" CMOS. [Figure: planar CMOS device cross-section (P-, N, N+ regions).]

97 D. Gibson Thesis Defense - 97 Double-Gate, Tri-Gate, Multigate

98 D. Gibson Thesis Defense - 98 ITRS-HP vs. ITRS-LSP Device. [Figure: device cross-section (P-, N, N+ regions).] LSP: ~2x thicker gate oxides. LSP: ~2x longer gates. LSP: ~4x Vth.

99 D. Gibson Thesis Defense - 99 OoO Scaling. [Figure: rename comparators at decode width 2 vs. decode width 4; a two-way fully bypassed datapath ("four-way fully bypassed is beyond my PowerPoint skill").] Number of comparators ~ O(N²). Bypassing complexity ~ O(N²).

100 D. Gibson Thesis Defense - 100 OoO Scaling. ROB complexity: O(N), O(I^~3/2). PRF complexity: O(ROB), O(I^~3/2). Scheduler complexity: –CAM: O(N·log(N)) (size of reg tag increases as log(N)) –Matrix: O(N²) (in fairness, the constant in front is small).

101 D. Gibson Thesis Defense - 101 Flavors of "Off"
Mode | Dynamic Power | Static Power | Response Lag Time
Active (Not Off) | U% | 100% | 0 cycles
Drowsy (Vdd Scaled) | 1-5% | 40% | 1-2 cycles
Clock-Gated | 1-5% | 100% | ~0 cycles
Vdd-Gated | <1% | | 100s of cycles
Freq. Scaled | F% | 100% | ~0 cycles

102 D. Gibson Thesis Defense - 102 Forwardflow – Resolving Branches. [Figure: the Dataflow Queue with entries beyond the breq (entries 6–7) on the predicted path.] On branch prediction: –Checkpoint RCT –Checkpoint pointer valid bits. Checkpoint restore: –Restores RCT –Invalidates bad pointers.

103 D. Gibson Thesis Defense - 103 A Day in the Life of a Forwardflow Instruction: Decode. Decoding add R1 R3 R3 as DQ entry 8. Register Consumer History before: R1 = 7-D, R4 = 4-S1, R2/R3 empty. R1's producer (7-D) gets a next-use pointer to 8-S1; R3 is read from the ARF (R3 = 0); the RCT now records R3's most-recent version as 8-D.

104 D. Gibson Thesis Defense - 104 A Day in the Life of a Forwardflow Instruction: Dispatch. The add is written into DQ entry 8: metadata and available operands are written (here R3's ARF value, 0; figure callout: "Implicit -- not actually written"). [Figure: Dataflow Queue showing entries 7 (ld R4 4 R1) and 8 (add R1 0 R3).]

105 D. Gibson Thesis Defense - 105 A Day in the Life of a Forwardflow Instruction: Wakeup. DQ entry 7's result is 0! Read entry 7's destination pointer (DestPtr.Read(7)), write the result (DestVal.Write(7,0)), and follow the pointer to the next use, 8-S1, delivering the value 0. [Figure: Dataflow Queue entries 7–10.]

106 D. Gibson Thesis Defense - 106 A Day in the Life of a Forwardflow Instruction: Issue (…and Execute). At successor 8-S1: write the arriving value (S1Val.Write(8,0)), read the other operand and metadata (S2Val.Read(8), Meta.Read(8)), read the next-use pointer (S1Ptr.Read(8)), and issue the add from DQ entry 8 with operands 0 and 0.

107 D. Gibson Thesis Defense - 107 A Day in the Life of a Forwardflow Instruction: Writeback. The add's result (R3 = 0) writes back at 8-D: DestPtr.Read(8) fetches the destination's next-use pointer, DestVal.Write(8,0) stores the value, and the walk continues to the next successor's S1 slot.

108 D. Gibson Thesis Defense - 108 A Day in the Life of a Forwardflow Instruction: Commit. Commit logic reads the add's metadata and destination value (Meta.Read(8), DestVal.Read(8)) and writes the architected register file: ARF.Write(R3, 0).

109 D. Gibson Thesis Defense - 109 DQ Q&A. [Figure: a Dataflow Queue with nine in-flight entries (two iterations of the ld/add/sub/st/breq sequence) and the corresponding Register Consumer History before and after.]

110 D. Gibson Thesis Defense - 110 Forwardflow – Wakeup. DQ entry 1's result is 7! Read the destination pointer (DestPtr.Read(1)), write the value (DestVal.Write(1,7)), and follow the pointer to the next use, 2-S1, delivering 7. [Figure: Dataflow Queue entries 1–5.]

111 D. Gibson Thesis Defense - 111 Forwardflow – Selection. At successor 2-S1: write the arriving value (S1Val.Write(2,7)), read the other operand and metadata (S2Val.Read(2), Meta.Read(2)), read the next-use pointer (S1Ptr.Read(2)), and issue the add from DQ entry 2.

112 D. Gibson Thesis Defense Forwardflow – Building Pointer Chains: Decode Decode must determine, for each operand, where the operand’s value will originate –Vanilla-OOO: Register Renaming –Forwardflow-OOO: Register Consumer Table RCT records last instruction to reference a particular architectural register –RAM-based table, analogous to renamer
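
A rough sketch of the RCT bookkeeping at decode (illustrative C++; the table layout and slot encoding follow the slides' spirit but are assumptions, and a single-source instruction form is used for brevity):

    #include <array>
    #include <cstdio>

    // Where the most recent version of an architectural register was last seen:
    // either a DQ slot (entry number plus slot kind) or nowhere (read the ARF).
    enum class SlotKind { None, Dest, Src1, Src2 };
    struct RctEntry { int dq_entry = -1; SlotKind kind = SlotKind::None; };

    constexpr int kNumArchRegs = 32;
    std::array<RctEntry, kNumArchRegs> rct;   // Register Consumer Table

    // Decode one instruction "dst <- f(src)": link the new consumer into the
    // value's pointer chain (conceptually, by writing a next-use pointer at the
    // slot named by rct[src]), then record this instruction as the latest
    // reference to src and the latest writer of dst.
    void decode(int dq_entry, int src_reg, int dst_reg) {
        const RctEntry& producer = rct[src_reg];
        if (producer.kind == SlotKind::None)
            std::printf("DQ %d: src R%d read from ARF\n", dq_entry, src_reg);
        else
            std::printf("DQ %d: append pointer at DQ %d -> %d-S1\n",
                        dq_entry, producer.dq_entry, dq_entry);

        rct[src_reg] = {dq_entry, SlotKind::Src1};  // last reference to src
        rct[dst_reg] = {dq_entry, SlotKind::Dest};  // last writer of dst
    }

    int main() {
        decode(7, /*src=*/4, /*dst=*/1);   // ld  ... R4 -> R1 (R4 not in RCT: from ARF)
        decode(8, /*src=*/1, /*dst=*/3);   // add R1 ... -> R3 (links 7's slot -> 8-S1)
        return 0;
    }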

113 D. Gibson Thesis Defense - 113 Decode Example. [Figure: Dataflow Queue entries 5–7 (ld R4 4; add R4 R1 R4; ld R4 16 R1) with entries 8–9 empty; Register Consumer History: R1 = 7-D, R4 = 7-S1, R2/R3 empty.]

114 D. Gibson Thesis Defense - 114 Decode Example: add R1 R3 R3. [Figure: Register Consumer History before and after decoding the add as DQ entry 8: R1's producer (7-D) gets a next-use pointer to 8-S1, R3 is read from the ARF (R3 = 0), and the RCT records R3's latest version as 8-D.]

115 D. Gibson Thesis Defense - 115 Forwardflow – Dispatch. Dispatch into DQ: –Writes metadata and available operands –Appends instruction to forward pointer chains. [Figure: Dataflow Queue entries 5–8, with the add dispatched into entry 8 (add R1 0 R3).]

