1 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch On Power and Multi-Processors Finishing up power issues and how those issues have led us to multi-core processors. Introduce multi-processor systems.

2 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch Stuff coming up soon – 3/24: HW4 due (date later than on the schedule) – 3/26: MS2 – 3/27: MS2 meetings (sign-up posted on 3/24) – 4/2: Quiz (date later than on the schedule)

3 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch Power from last time Power is perhaps the performance limiter – Can’t remove enough heat to keep performance increasing Even for things with plugs, energy is an issue – $$$$ for energy, $$$$ to remove heat. For things without plugs, energy is huge – Cost of batteries – Time between charging

4 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch What we did last time (1/4) Power is important – It has become a (perhaps the) limiting factor in performance

5 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch When last we met (2/4) Power is important to computer architects. – It limits performance. With cooling techniques we can increase our power budget (and thus our performance). – But these techniques get very expensive very very quickly. – Both to build and operate active cooling devices. Costs power (and thus $) to cool. Active cooling systems cost $ to build.

6 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch When we last met (3/4) Energy is important to computer architects – Energy is what we really pay for (10¢ per kWh) – Energy is what limits batteries (usually listed as mAh) AA batteries tend to have a capacity listed in mAh or (assuming 1.5V) in Wh* The iPad battery is rated at about 25Wh. – Some devices limited by energy they can scavenge. * Watch those units. Assuming it takes 5Wh of energy to charge a 3Wh battery, how much does it cost for the energy to charge that battery? What about an iPad?
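A quick worked answer to the question above, using the 10¢/kWh figure on this slide (and, for the iPad, assuming the same 5-to-3 charging overhead): 5 Wh = 0.005 kWh, so one charge costs about 0.005 kWh * 10¢/kWh = 0.05¢. Scaling the 25 Wh iPad battery by the same overhead gives roughly 42 Wh ≈ 0.042 kWh, or about 0.4¢ per charge.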

7 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch Power vs. Energy (not covered last time) What uses power in a chip?

8 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch When last we met (4/4) How do performance and power relate?* – Power is approximately proportional to V²f. – Performance is approximately proportional to f. – The required voltage is approximately proportional to the desired frequency.** If you accept all of these assumptions, we get that an X increase in performance would require an X³ increase in power. – This is a really important fact *These are all pretty rough rules of thumb. Consider the second one and discuss its shortcomings. **This one in particular tends to hold only over fairly small (10-20%?) changes in V.
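Spelling the chain out (a rough sketch using only the three rules of thumb above): performance ∝ f, the required V ∝ f, and power ∝ V²f ∝ f²*f = f³, so power ∝ performance³. For example, a 1.25x frequency-driven speedup would cost roughly 1.25*1.25*1.25 ≈ 2x the power.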

9 Dynamic (Capacitive) Power Dissipation Data dependent – a function of switching activity What uses power in a chip?

10 Capacitive Power Dissipation Power ~ ½ C V² A f – Capacitance: function of wire length, transistor size – Supply Voltage: has been dropping with successive fab generations – Clock frequency: increasing… – Activity factor: how often, on average, do wires switch? What uses power in a chip?
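As a small concrete illustration of the formula above, here is a minimal C sketch; the capacitance, voltage, activity factor, and frequency plugged in are made-up illustrative values, not measurements of any real chip.

```c
#include <stdio.h>

/* Dynamic (switching) power: P ~= 1/2 * C * V^2 * A * f */
static double dynamic_power(double cap_farads, double vdd_volts,
                            double activity, double freq_hz)
{
    return 0.5 * cap_farads * vdd_volts * vdd_volts * activity * freq_hz;
}

int main(void)
{
    /* Hypothetical values: 1 nF of switched capacitance, 1.5 V supply,
       10% average activity factor, 1 GHz clock. */
    double p = dynamic_power(1e-9, 1.5, 0.1, 1e9);
    printf("Dynamic power ~= %.2f W\n", p);   /* ~0.11 W for these toy numbers */
    return 0;
}
```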

11 Static Power: Leakage Currents Subthreshold currents grow exponentially with increases in temperature and decreases in threshold voltage But threshold voltage scaling is key to circuit performance! Gate leakage primarily dependent on gate oxide thickness, biases Both types of leakage heavily dependent on stacking and input pattern What uses power in a chip?

12 © Brehob -- Portions © Brooks, Dutta, Mudge & Wenisch What we didn't quite do last time For things without plugs, energy is huge – Cost of batteries – Time between charging (somewhat extreme) Example: – iPad has ~42Wh (11,560mAh), which is huge – Can still run the battery down in hours – Costs $99 for a new one (2 years of use or so)
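For scale, a rough sketch (the daily-charge habit and the wall-energy figure are assumptions, not from the slides): at one full charge per day, roughly 70 Wh drawn from the wall per charge (42 Wh battery plus charging losses), and the 10¢/kWh rate from earlier, two years of charging is about 730 * 0.07 kWh ≈ 51 kWh, i.e. roughly $5 of electricity. So the $99 replacement battery, not the energy, dominates the cost.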

13 © Brehob -- Portions © Brooks, Wenisch E vs. EDP vs. ED²P Currently have a processor design: 80W, 1 BIPS, 1.5V, 1GHz Want to reduce power, willing to lose some performance Cache Optimization: – IPC decreases by 10%, reduces power by 20% => Final Processor: 900 MIPS, 64W – Relative E = MIPS/W (higher is better) = (900/64)/(1000/80) = 1.125x Energy is better, but is this a "better" processor? What uses power in a chip?

14 © Brehob -- Portions © Brooks, Wenisch Not necessarily 80W, 1 BIPS, 1.5V, 1GHz Cache Optimization: – IPC decreases by 10%, reduces power by 20% => Final Processor: 900 MIPS, 64W – Relative E = MIPS/W (higher is better) = 1.125x – Relative EDP = MIPS²/W = 1.01x – Relative ED²P = MIPS³/W = .911x What if we just adjust frequency/voltage on the processor? How to reduce power by 20%? P = CV²f, and f ∝ V, so P ∝ V³ => Drop voltage by 7% (and also freq) => .93*.93*.93 = .8x So for equal power (64W) – Cache Optimization = 900 MIPS – Simple Voltage/Frequency Scaling = 930 MIPS What uses power in a chip?
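To make the comparison concrete, here is a small C sketch of the arithmetic on this slide; the 1 BIPS / 80 W baseline and the 10% / 20% / 7% deltas are the slide's numbers, everything else is just the formulas written out.

```c
#include <stdio.h>

int main(void)
{
    double base_mips = 1000.0, base_watts = 80.0;   /* baseline: 1 BIPS at 80 W */

    /* Option 1: cache optimization (slide's numbers: IPC -10%, power -20%) */
    double c_mips  = base_mips * 0.9;
    double c_watts = base_watts * 0.8;

    /* Option 2: drop V and f by 7% each; P ~ C*V^2*f and f ~ V, so power scales as 0.93^3 */
    double s = 0.93;
    double v_mips  = base_mips * s;
    double v_watts = base_watts * s * s * s;

    printf("cache opt : %4.0f MIPS at %4.1f W\n", c_mips, c_watts);
    printf("V/f scale : %4.0f MIPS at %4.1f W\n", v_mips, v_watts);

    /* Relative efficiency metrics vs. the baseline (higher is better):
       E = MIPS/W, EDP = MIPS^2/W                                       */
    double rel_e   = (c_mips / c_watts) / (base_mips / base_watts);
    double rel_edp = (c_mips * c_mips / c_watts) /
                     (base_mips * base_mips / base_watts);
    printf("cache opt: relative E = %.3f, relative EDP = %.3f\n", rel_e, rel_edp);
    return 0;
}
```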

15 © Brehob -- Portions © Brooks, Wenisch So? What we've concluded is that if we want to increase performance by a factor of X, we might be looking at a factor of X³ power! But if you are paying attention, that's just for voltage scaling! What about other techniques? Performance & Power

16 © Brehob -- Portions © Brooks, Wenisch Other techniques? Well, we could try to improve the amount of ILP we take advantage of. That probably involves making a "wider" processor (more superscalar). What are the costs associated with doing that? – How much bigger do things get? What do we expect the performance gains to be? How about circuit techniques? Historically the "threshold voltage" has dropped as circuits get smaller. – So power drops. This has (mostly) stopped being true. – And it's actually what got us in trouble to begin with! Performance & Power

17 © Brehob -- Portions © Brooks, Wenisch So we are hosed… I mean if voltage scaling doesn't work, circuit shrinking doesn't help (much), and ILP techniques don't clearly work… What's left? How about we drop performance to 80% of what it was and have 2 processors? How much power does that take? How much performance could we get? Pros/Cons? What if I wanted 8 processors? – How much performance drop needed per processor? Performance & Power
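A rough sketch of the arithmetic (using the power ∝ performance³ rule from earlier and assuming perfectly parallel work): running each core at 80% of the original voltage/frequency costs about .8*.8*.8 ≈ .51x the original power per core, so two such cores burn ≈ 1.02x the original power while ideally delivering ≈ 1.6x the throughput. For 8 processors in the same power budget, each core gets 1/8 of the power, so per-core performance drops to roughly (1/8)^(1/3) = .5x, for an ideal total of ≈ 4x the throughput.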

18 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Multiprocessors Keeping it all working together

19 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Why multi-processors? • Multi-processors have been around for a long time. ◦ Originally used to get the best performance for certain highly-parallel tasks. • But as noted, we now use them to get solid performance per unit of energy. So that's it? • Not so much. ◦ We need to make it possible/reasonable/easy to use. ◦ Nothing comes for free. ▪ If we take a task and break it up so it runs on a number of processors, there is going to be a price.

20 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Thread-Level Parallelism Thread-level parallelism (TLP) • Collection of asynchronous tasks: not started and stopped together • Data shared loosely, dynamically Example: database/web server (each query is a thread) • accts is shared, can't register allocate even if it were scalar • id and amt are private variables, register allocated to r1, r2

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt) {
    accts[id].bal -= amt;
    spew_cash();
}

0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
6:

21 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Shared-Memory Multiprocessors [Figure: processors P1–P4 connected to a shared Memory System] Shared memory • Multiple execution contexts sharing a single address space ◦ Multiple programs (MIMD) ◦ Or more frequently: multiple copies of one program (SPMD) • Implicit (automatic) communication via loads and stores

22 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch What's the other option? Basically the only other option is "message passing" • We communicate via explicit messages. • So instead of just changing a variable, we'd need to call a function to pass a specific message. Message passing systems are easy to build and pretty efficient. • But harder to code. Shared memory programming is basically the same as multi-threaded programming on one processor • And (many) programmers already know how to do that.

23 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch So Why Shared Memory? Pluses • For applications, looks like a multitasking uniprocessor • For the OS, only evolutionary extensions required • Easy to do communication without the OS being involved • Software can worry about correctness first, then performance Minuses • Proper synchronization is complex • Communication is implicit, so harder to optimize • Hardware designers must implement it Result • Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever • And the first with multi-billion-dollar markets

24 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Shared-Memory Multiprocessors There are lots of ways to connect processors together [Figure: processors P1–P4, each with a Cache and network Interface, connected through an Interconnection Network to memories M1–M4]

25 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Paired vs. Separate Processor/Memory? Separate processor/memory • Uniform memory access (UMA): equal latency to all memory + Simple software, doesn't matter where you put data – Lower peak performance • Bus-based UMAs common: symmetric multi-processors (SMP) Paired processor/memory • Non-uniform memory access (NUMA): faster to local memory – More complex software: where you put data matters + Higher peak performance: assuming proper data placement [Figures: a bus-based system with CPU($) blocks and separate memories, vs. a NUMA system with a router (R) per CPU($)/Mem pair]

26 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Shared vs. Point-to-Point Networks Shared network: e.g., bus (left) + Low latency – Low bandwidth: doesn't scale beyond ~16 processors + Shared property simplifies cache coherence protocols (later) Point-to-point network: e.g., mesh or ring (right) – Longer latency: may need multiple "hops" to communicate + Higher bandwidth: scales to 1000s of processors – Cache coherence protocols are complex [Figures: left, CPU($)/Mem nodes on a shared bus; right, CPU($)/Mem nodes each with a router (R) connected point-to-point]

27 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Organizing Point-To-Point Networks Network topology: organization of network • Tradeoff: performance (connectivity, latency, bandwidth) vs. cost Router chips • Networks that require separate router chips are indirect • Networks that use processor/memory/router packages are direct + Fewer components, "Glueless MP" Point-to-point network examples • Indirect tree (left) • Direct mesh or ring (right) [Figures: left, CPU($)/Mem nodes hanging off a tree of router (R) chips; right, CPU($)/Mem/R nodes connected directly in a mesh or ring]

28 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Implementation #1: Snooping Bus MP Two basic implementations Bus-based systems • Typically small: 2–8 (maybe 16) processors • Typically processors split from memories (UMA) ◦ Sometimes multiple processors on a single chip (CMP) ◦ Symmetric multiprocessors (SMPs) ◦ Common, I use one every day [Figure: several CPU($) blocks sharing a bus with Mem]

29 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Implementation #2: Scalable MP General point-to-point network-based systems • Typically processor/memory/router blocks (NUMA) ◦ Glueless MP: no need for additional "glue" chips • Can be arbitrarily large: 1000s of processors ◦ Massively parallel processors (MPPs) • In reality only government (DoD) has MPPs… ◦ Companies have much smaller systems: 32–64 processors ◦ Scalable multi-processors [Figure: CPU($)/Mem/router (R) nodes connected point-to-point]

30 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Issues for Shared Memory Systems Two in particular • Cache coherence • Memory consistency model Closely related to each other

31 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch An Example Execution Two $100 withdrawals from account #241 at two ATMs • Each transaction maps to a thread on a different processor • Track accts[241].bal (address is in r3) Processor 0: 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 3: sub r4,r2,r4 4: st r4,0(r3) 5: call spew_cash Processor 1: 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 3: sub r4,r2,r4 4: st r4,0(r3) 5: call spew_cash [Table to fill in: columns CPU0 / Mem / CPU1 tracking the value of accts[241].bal]

32 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch No-Cache, No-Problem Scenario I: processors have no caches • No problem Processor 0: 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 3: sub r4,r2,r4 4: st r4,0(r3) 5: call spew_cash Processor 1: 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 3: sub r4,r2,r4 4: st r4,0(r3) 5: call spew_cash

33 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Cache Incoherence Scenario II: processors have write-back caches • Potentially 3 copies of accts[241].bal: memory, p0$, p1$ • Can get incoherent (inconsistent) Processor 0: 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 3: sub r4,r2,r4 4: st r4,0(r3) 5: call spew_cash Processor 1: 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 3: sub r4,r2,r4 4: st r4,0(r3) 5: call spew_cash [Table trace: Mem holds 500 while P0's cached copy goes V:500 and then D:400]
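One interleaving that goes wrong (a sketch, assuming a starting balance of 500 and two $100 withdrawals, matching the trace the slide animates): memory starts with bal = 500; P0 loads 500 into its cache (V:500), subtracts, and stores 400 in its write-back cache (D:400) without updating memory; P1 then loads and still sees 500, subtracts, and ends up with its own D:400. Whichever dirty copy is written back last leaves bal = 400, while the correct result of two withdrawals is 300.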

34 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Hardware Cache Coherence Coherence controller: • Examines bus traffic (addresses and data) • Executes coherence protocol ◦ What to do with the local copy when you see different things happening on the bus [Figure: CPU connected to D$ data and D$ tags through a coherence controller (CC) that sits on the bus]

35 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Snooping Cache-Coherence Protocols Bus provides serialization point Each cache controller "snoops" all bus transactions • Takes action to ensure coherence ◦ invalidate ◦ update ◦ supply value • Action depends on the state of the block and the protocol

36 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Snooping Design Choices [Figure: a cache (State/Tag/Data per block) sits between processor ld/st requests and snooped bus transactions] Controller updates state of blocks in response to processor and snoop events and generates bus xactions Often have duplicate cache tags Snoopy protocol • set of states • state-transition diagram • actions Basic Choices • write-through vs. write-back • invalidate vs. update

37 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch The Simple Invalidate Snooping Protocol Write-through, no-write-allocate cache Actions: PrRd, PrWr, BusRd, BusWr [State diagram, reconstructed as transitions: Invalid –PrRd / BusRd→ Valid; Invalid –PrWr / BusWr→ Invalid; Valid –PrRd / --→ Valid; Valid –PrWr / BusWr→ Valid; Valid –BusWr→ Invalid]
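A minimal C sketch of this two-state protocol's controller logic for a single cache line; the type and function names below are made up for illustration, and the code just encodes the transitions listed above.

```c
#include <stdio.h>

typedef enum { INVALID, VALID } line_state_t;
typedef enum { PR_RD, PR_WR, BUS_WR, BUS_RD } event_t;   /* processor and snooped bus events */

/* Next-state function for one cache line; *bus_rd / *bus_wr report the
   bus transaction (if any) this cache must issue in response.          */
static line_state_t next_state(line_state_t s, event_t e, int *bus_rd, int *bus_wr)
{
    *bus_rd = *bus_wr = 0;
    switch (e) {
    case PR_RD:                      /* read hit is silent; read miss fetches the line  */
        if (s == INVALID) *bus_rd = 1;
        return VALID;
    case PR_WR:                      /* write-through: always put the write on the bus  */
        *bus_wr = 1;                 /* no-write-allocate: a write miss stays INVALID    */
        return s;
    case BUS_WR:                     /* another cache wrote this line: invalidate ours  */
        return INVALID;
    case BUS_RD:                     /* other caches reading the line doesn't affect us */
    default:
        return s;
    }
}

int main(void)
{
    int rd, wr;
    line_state_t s = INVALID;
    s = next_state(s, PR_RD, &rd, &wr);    /* miss: issues BusRd, line becomes Valid   */
    printf("after PrRd : state=%d busrd=%d buswr=%d\n", s, rd, wr);
    s = next_state(s, BUS_WR, &rd, &wr);   /* snooped write by another cache: Invalid  */
    printf("after BusWr: state=%d\n", s);
    return 0;
}
```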

38 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch Example [Blank trace table to fill in: a time column and, for each of Processor 1 and Processor 2, a Processor Transaction column and a Cache State column, plus a Bus column; the sequence is Read A, Write A, Read A, Write A] Actions: PrRd, PrWr, BusRd, BusWr
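One way the blank table might be filled in (a sketch, assuming Processor 1 issues the first Read A / Write A pair and Processor 2 the second, under the write-through invalidate protocol from the previous slide): P1 Read A: PrRd miss, BusRd, P1 becomes Valid. P1 Write A: PrWr, BusWr, P1 stays Valid. P2 Read A: PrRd miss, BusRd, P2 becomes Valid (P1 still Valid). P2 Write A: PrWr, BusWr; P2 stays Valid and the snooped BusWr invalidates P1's copy, so P1 goes to Invalid.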

39 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch More Generally: MOESI [Sweazey & Smith, ISCA 1986] M - Modified (dirty) O - Owned (dirty but shared) WHY? E - Exclusive (clean, unshared) only copy, not dirty S - Shared I - Invalid Variants • MSI • MESI • MOSI • MOESI [Venn diagram: validity covers M, O, E, S; exclusiveness covers M, E; ownership covers M, O]

40 EECS 470 © Brehob -- Portions © Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar. Wenisch MESI example M - Modified (dirty) E - Exclusive (clean, unshared) only copy, not dirty S - Shared I - Invalid [Blank trace table to fill in, as on slide 38: Processor 1 and Processor 2 transaction/cache-state columns plus a Bus column, for the sequence Read A, Write A, Read A, Write A] Actions: PrRd, PrWr, BRL – Bus Read Line (BusRd), BWL – Bus Write Line (BusWr), BRIL – Bus Read and Invalidate, BIL – Bus Invalidate Line
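One possible filled-in trace (a sketch, again assuming Processor 1 issues the first Read A / Write A pair and Processor 2 the second, with standard MESI transitions): P1 Read A: BRL; no other copy exists, so P1 goes to E. P1 Write A: no bus transaction needed, E upgrades silently to M. P2 Read A: BRL; P1 snoops it, supplies/writes back the dirty line and drops to S, while P2 loads the line in S. P2 Write A: BIL (or BRIL); P2 goes to M and P1's copy is invalidated (I).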

