Coarse-Grained Coherence Mikko H. Lipasti Associate Professor Electrical and Computer Engineering University of Wisconsin – Madison Joint work with:Jason.


1 Coarse-Grained Coherence
Mikko H. Lipasti, Associate Professor, Electrical and Computer Engineering, University of Wisconsin – Madison
Joint work with: Jason Cantin, IBM (Ph.D. ’06); Natalie Enright Jerger; Prof. Jim Smith; Prof. Li-Shiuan Peh (Princeton)
http://www.ece.wisc.edu/~pharm

2 Motivation
- Multiprocessors are commonplace
  - Historically, glass-house servers
  - Now laptops, soon cell phones
- Most common multiprocessor: symmetric processors with coherent caches
  - Logical extension of time-shared uniprocessors
  - Easy to program and reason about
  - Not so easy to build
Aug 30, 2007, Mikko Lipasti, University of Wisconsin

3 Coherence Granularity
- Track each individual word
  - Too much overhead
- Track larger blocks
  - 32B – 128B common
  - Less overhead, exploits spatial locality
  - Large blocks cause false sharing
- Solution: use multiple granularities
  - Small blocks: manage local read/write permissions
  - Large blocks: track global behavior
[Figure: processors P0 through P7]

4 Coarse-Grained Coherence
- Initially
  - Identify non-shared regions
  - Decouple obtaining coherence permission from data transfer
  - Filter snoops to reduce broadcast bandwidth
- Later
  - Enable aggressive prefetching
  - Optimize DRAM accesses
  - Customize protocol and interconnect to match

5 Coarse-Grained Coherence
- Optimizations lead to
  - Reduced memory miss latency
  - Reduced cache-to-cache miss latency
  - Reduced snoop bandwidth
  - Fewer exposed cache misses
  - Elimination of unnecessary DRAM reads
  - Power savings on bus, interconnect, caches, and in DRAM
  - World peace and end to global warming

6 Coarse-Grained Coherence Tracking
- Memory is divided into coarse-grained regions
  - Aligned, power-of-two multiple of cache line size
  - Can range from two lines to a physical page
- A cache-like structure, the Region Coherence Array (RCA), is added to each processor to monitor coherence at the granularity of regions

7 Region Coherence Arrays
- Each entry has an address tag, a state, and a count of lines cached by the processor
- The region state indicates whether the processor and/or other processors are sharing or modifying lines in the region
- Customize policy/protocol/interconnect to exploit region state
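
The RCA entry described above (tag, state, line count) can be sketched roughly as follows. This is only an illustration, not the talk's hardware design: the 512-byte region size, the state strings, the unbounded dictionary, and all class and method names are assumptions. Only the 64B line size and the three fields per entry come from the slides.

```python
from dataclasses import dataclass

LINE_BYTES = 64          # 64B lines, as in the simulated system
LINES_PER_REGION = 8     # assumption: slides only say "two lines to a physical page"
REGION_BYTES = LINE_BYTES * LINES_PER_REGION


def region_tag(addr: int) -> int:
    """Regions are aligned and power-of-two sized, so the tag is a simple division."""
    return addr // REGION_BYTES


@dataclass
class RegionEntry:
    tag: int
    state: str = "Invalid"   # hypothetical state name
    line_count: int = 0      # lines from this region cached by this processor


class RegionCoherenceArray:
    """Toy RCA: an unbounded dict stands in for the set-associative array."""

    def __init__(self):
        self.entries = {}

    def lookup(self, addr: int):
        return self.entries.get(region_tag(addr))

    def line_allocated(self, addr: int, state: str) -> RegionEntry:
        tag = region_tag(addr)
        entry = self.entries.setdefault(tag, RegionEntry(tag))
        entry.state = state
        entry.line_count += 1
        return entry

    def line_evicted(self, addr: int) -> RegionEntry:
        entry = self.entries[region_tag(addr)]
        entry.line_count -= 1
        return entry
```

With these assumed sizes, addresses 0x1000 and 0x1040 fall in the same region, so allocating both lines leaves one entry with a line count of 2.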

8 Talk Outline
- Covered: Motivation; Overview of Coarse-Grained Coherence
- Techniques to come: Broadcast Snoop Reduction [ISCA 2005]; Stealth Prefetching [ASPLOS 2006]; Power-Efficient DRAM Speculation; Hybrid Circuit Switching; Virtual Proximity; Circuit-switched snooping
- Then: Research Group Overview

9 Unnecessary Broadcasts

10 Broadcast Snoop Reduction
- Identify requests that don’t need a broadcast
- Send data requests directly to memory without broadcasting
  - Reduces broadcast traffic
  - Reduces memory latency
- Avoid sending non-data requests externally
- Example
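
One way to picture the "does this request need a broadcast?" decision is the sketch below. It is a simplification under stated assumptions: the four external states and the function name are invented for illustration, and the rule for externally clean regions (a read that accepts a shared copy may skip the snoop) is an assumption consistent with the "Region Externally Clean/Dirty" snoop response bits, not a rule quoted from the slides.

```python
from enum import Enum


class External(Enum):
    """What this processor believes about other processors' copies (hypothetical names)."""
    UNKNOWN = "unknown"   # no RCA entry for the region yet
    INVALID = "invalid"   # no other processor caches lines from the region
    CLEAN = "clean"       # others may hold unmodified copies
    DIRTY = "dirty"       # others may hold modified copies


def needs_broadcast(external: External, wants_exclusive: bool) -> bool:
    """Return True if the request must be broadcast to the other processors."""
    if external is External.INVALID:
        # Region is not shared: any request can go straight to memory.
        return False
    if external is External.CLEAN and not wants_exclusive:
        # Assumed rule: memory is up to date, so a shared read can skip the snoop.
        return False
    # Unknown or externally dirty regions, and exclusive requests to shared regions.
    return True
```

A load to a region this processor holds exclusively returns False here, which corresponds to the "direct request" case in the backup example slides.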

11 Simulator Evaluation
- PHARMsim: near-RTL, but written in C
- Execution-driven simulator built on top of SimOS-PPC
- Four 4-way superscalar out-of-order processors
- Two-level hierarchy with split L1, unified 1MB L2 caches, and 64B lines
- Separate address/data networks, similar to Sun Fireplane

12 Workloads
- Scientific: Ocean, Raytrace, Barnes
- Multiprogrammed: SPECint2000_rate, SPECint95_rate
- Commercial (database, web): TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000

13 Broadcasts Avoided

14 Execution Time

15 Summary
- Eliminates nearly all unnecessary broadcasts
- Reduces snoop activity by 65%
  - Fewer broadcasts
  - Fewer lookups
- Provides modest speedup

16 Talk Outline
- Covered: Motivation; Overview of Coarse-Grained Coherence; Broadcast Snoop Reduction [ISCA 2005]
- Techniques to come: Stealth Prefetching [ASPLOS 2006]; Power-Efficient DRAM Speculation; Hybrid Circuit Switching; Virtual Proximity; Circuit-switched snooping
- Then: Research Group Overview

17 Prefetching in Multiprocessors
- Prefetching
  - Anticipate future reference, fetch into cache
- Many prefetching heuristics possible
  - Current systems: next-block, stride
  - Proposed: skip pointer, content-based
  - Some/many prefetched blocks are not used
- Multiprocessor complications
  - Premature or unnecessary prefetches
  - Permission thrashing if blocks are shared
  - Separate study [ISPASS 2006]

18 Stealth Prefetching
- Lines from non-shared regions can be prefetched stealthily and efficiently
  - Without disturbing other processors
  - Without downgrades, invalidations
  - Without preventing them from obtaining exclusive copies
  - Without broadcasting prefetch requests
  - Fetched from DRAM with low overhead
- Example

19 Stealth Prefetching
- After a threshold number of L2 misses (2), the rest of the lines from a region are prefetched
- These lines are buffered close to the processor for later use (Stealth Data Prefetch Buffer)
- After accessing the RCA, requests may obtain data from the buffer as they would from memory
  - To access data, the region must be in a valid state and a broadcast must be unnecessary for coherent access
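
The threshold rule above can be sketched as a small per-region counter. The sizes follow the backup example slides (8-byte lines, 32-byte regions, 2-line threshold); the class, its fields, and the `non_shared` flag are hypothetical names introduced for illustration, and the real mechanism lives in the RCA rather than in a separate object.

```python
LINE_BYTES = 8            # backup example: 8-byte lines
LINES_PER_REGION = 4      # backup example: 32-byte regions
MISS_THRESHOLD = 2        # L2 misses before the rest of the region is prefetched


class RegionPrefetchState:
    """Hypothetical per-region bookkeeping for stealth prefetching."""

    def __init__(self):
        self.miss_count = 0
        self.present_mask = 0   # bit i set => line i of the region already fetched

    def on_l2_miss(self, addr: int, non_shared: bool = True) -> int:
        """Record a demand miss; return a bit mask of lines to prefetch (0 = none)."""
        line = (addr // LINE_BYTES) % LINES_PER_REGION
        self.present_mask |= 1 << line
        self.miss_count += 1
        # Only non-shared regions are prefetched, and only when the threshold is hit.
        if not non_shared or self.miss_count != MISS_THRESHOLD:
            return 0
        all_lines = (1 << LINES_PER_REGION) - 1
        prefetch = all_lines & ~self.present_mask
        self.present_mask = all_lines
        return prefetch
```

Replaying the backup example, misses on 0x20 and then 0x28 return 0 and then 0b1100, matching the "Prefetch: 1100" mask that accompanies the second request there.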

20 L2 Misses Prefetched

21 Speedup

22 Summary
Stealth Prefetching can prefetch data:
- Stealthily
  - Only non-shared data prefetched
  - Prefetch requests not broadcast
- Aggressively
  - Large regions prefetched at once, 80-90% timely
- Efficiently
  - Piggybacked onto a demand request
  - Fetched from DRAM in open-page mode

23 Talk Outline
- Covered: Motivation; Overview of Coarse-Grained Coherence; Broadcast Snoop Reduction [ISCA 2005]; Stealth Prefetching [ASPLOS 2006]
- Techniques to come: Power-Efficient DRAM Speculation; Hybrid Circuit Switching; Virtual Proximity; Circuit-switched snooping
- Then: Research Group Overview

24 Power-Efficient DRAM Speculation
- Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before the snoop response
  - Trading DRAM bandwidth for latency
  - Wasting power
  - Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily
[Figure: timeline of Broadcast Req, Snoop Tags, Send Resp, DRAM Read, Xmit Block]

25 DRAM Operations

26 Power-Efficient DRAM Speculation
- Direct memory requests are non-speculative
- Lines from externally-dirty regions are likely to be sourced from another processor’s cache
  - Region state can serve as a prediction
  - Need not access DRAM speculatively
- Initial requests to a region (state unknown) have a lower but significant probability of obtaining data from other processors’ caches
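
Using region state as a prediction might look like the following sketch. The function, its parameters, and the returned labels are hypothetical. The slides do not say how requests to regions of unknown state are handled, so that choice is exposed as a parameter rather than decided here.

```python
def dram_read_policy(needs_broadcast: bool,
                     region_known: bool,
                     externally_dirty: bool,
                     speculate_on_unknown: bool = True) -> str:
    """Decide when the memory controller should start the DRAM read.

    Returns one of three illustrative labels:
      "read_now"        direct request, non-speculative
      "speculate"       start DRAM in parallel with the snoop
      "wait_for_snoop"  delay DRAM until the snoop response arrives
    """
    if not needs_broadcast:
        return "read_now"            # direct memory requests are non-speculative
    if not region_known:
        # Assumption: policy for first-touch requests is left configurable.
        return "speculate" if speculate_on_unknown else "wait_for_snoop"
    if externally_dirty:
        return "wait_for_snoop"      # data likely comes from another cache
    return "speculate"
```

Setting `speculate_on_unknown=False` models a more power-conservative controller, at the cost of added latency whenever memory does turn out to supply the data.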

27 Useless DRAM Reads

28 Useful DRAM Reads

29 DRAM Reads Performed/Delayed

30 Summary
Power-Efficient DRAM Speculation:
- Can reduce DRAM reads by 20%, with less than 1% degradation in performance
  - 7% slowdown with nonspeculative DRAM
- Nearly doubles the interval between DRAM requests, allowing modules to stay in low-power modes longer

31 Talk Outline
- Covered: Motivation; Overview of Coarse-Grained Coherence; Broadcast Snoop Reduction [ISCA 2005]; Stealth Prefetching [ASPLOS 2006]; Power-Efficient DRAM Speculation
- Techniques to come: Hybrid Circuit Switching; Virtual Proximity; Circuit-switched snooping
- Then: Research Group Overview

32 Chip Multiprocessor Interconnect Options
- Buses: don’t scale
- Crossbars: too expensive
- Rings: too slow
- Packet-switched mesh
  - Attractive for all the same 1990s DSM reasons
  - Scalable
  - Low latency
  - High link utilization

33 CMP Interconnection Networks
- But…
- Cables/traces are now on-chip wires
  - Fast, cheap, plentiful
  - Short: 1 cycle per hop
- Router latency adds up
  - 3-4 cycles per hop
- Store-and-forward
  - Lots of activity/power
- Is this the right answer?

34 Circuit-Switched Interconnects
- Communication patterns
  - Spatial locality to memory
  - Pairwise communication
- Circuit-switched links
  - Avoid switching/routing
  - Reduce latency
  - Save power?
- Poor utilization! Maybe OK

35 Router Design
- Switches consist of
  - Configurable crossbar
  - Configuration memory
- 4-stage router pipeline exposes only 1 cycle if circuit-switched (CS)
- Can also act as packet-switched network
- Design details in [CA Letters ’07]

36 Protocol Optimization
- Initial 3-hop miss establishes CS path
- Subsequent miss requests
  - Sent directly on CS path to predicted owner
  - Also in parallel to home node
  - Predicted owner sources data early
  - Directory acks update to sharing list
- Benefits
  - Reduced 3-hop latency
  - Less activity, less power

37 Hybrid Circuit Switching (1)
- Hybrid Circuit Switching improves performance by up to 7%

38 Hybrid Circuit Switching (2)
- Positive interaction in co-designed interconnect and protocol
- More circuit reuse => greater latency benefit

39 Summary
Hybrid Circuit Switching:
- Routing overhead eliminated
- Still enables high bandwidth when needed
- Co-designed protocol
  - Optimizes cache-to-cache transfers
- Substantial performance benefits
- To do: power analysis

40 Talk Outline
- Covered: Motivation; Overview of Coarse-Grained Coherence; Broadcast Snoop Reduction [ISCA 2005]; Stealth Prefetching [ASPLOS 2006]; Power-Efficient DRAM Speculation; Hybrid Circuit Switching
- Techniques to come: Virtual Proximity; Circuit-switched snooping
- Then: Research Group Overview

41 Server Consolidation on CMPs
- CMP as consolidation platform
  - Simplify system administration
  - Save power, cost, and physical infrastructure
- Study combinations of individual workloads in full-system environment
  - Micro-coded hypervisor schedules VMs
  - See “An Evaluation of Server Consolidation Workloads for Multi-Core Designs” in IISWC 2007 for additional details
- Nugget: shared LLC a big win

42 Virtual Proximity
- Interactions between VM scheduling, placement, and interconnect
- Goal: placement-agnostic scheduling
  - Best workload balance
- Evaluate 3 scheduling policies
  - Gang, Affinity, and Load Balanced
- HCS provides virtual proximity

43 Scheduling Algorithms
- Gang Scheduling
  - Co-schedules all threads of a VM
  - No idle-cycle stealing
- Affinity Scheduling
  - VMs assigned to neighboring cores
  - Can steal idle cycles across VMs sharing a core
- Load Balanced Scheduling
  - Ready threads assigned to any core
  - Any/all VMs can steal idle cycles
  - Over time, VM fragments across chip

44 Load balancing wins with fast interconnect; affinity scheduling wins with slow interconnect; HCS creates virtual proximity

45 Virtual Proximity Performance
- HCS is able to provide virtual proximity

46 As physical distance (hop count) increases, HCS provides significantly lower latency

47 Summary
Virtual Proximity [in submission]
- Enables placement-agnostic hypervisor scheduler
- Results:
  - Up to 17% better than affinity scheduling
  - Idle cycle reduction: 84% over gang and 41% over affinity
  - Low-latency interconnect mitigates increase in L2 cache conflicts from load balancing
  - L2 misses up by 10% but execution time reduced by 11%
  - A flexible, distributed address mapping combined with HCS outperforms a localized affinity-based memory mapping by an average of 7%

48 Talk Outline
- Covered: Motivation; Overview of Coarse-Grained Coherence; Broadcast Snoop Reduction [ISCA 2005]; Stealth Prefetching [ASPLOS 2006]; Power-Efficient DRAM Speculation; Hybrid Circuit Switching; Virtual Proximity
- Techniques to come: Circuit-switched snooping
- Then: Research Group Overview

49 Circuit-Switched Snooping (1)
- Scalable, efficient broadcasting on unordered network
  - Remove latency overhead of directory indirection
- Extend point-to-point circuit-switched links to trees
  - Low-latency multicast via circuit-switched tree
- Help provide performance isolation, as requests do not share the same communication medium

50 Circuit-Switched Snooping (2)
- Extend Coarse-Grain Coherence Tracking (CGCT)
  - Remove unnecessary broadcasts
  - Convert broadcasts to multicasts
- Effective in server consolidation workloads
  - Very few coherence requests to globally shared data

51 Snooping Interconnect
- Switches consist of
  - Configurable crossbar
  - Configuration memory
- Circuits span two or more nodes, based on RCA
- Snooping occurs across circuits
  - All sharers in region join circuit
- Each link can physically accommodate multiple circuits

52 Circuit-Switched Snooping
- Use RCA to identify subsets of nodes that share data
- Create shared circuits among these nodes
- Design challenges
  - Multi-drop, bidirectional circuits
  - Memory ordering
- Results: very much in progress

53 Talk Outline
- Covered: Motivation; Overview of Coarse-Grained Coherence; Broadcast Snoop Reduction [ISCA 2005]; Stealth Prefetching [ASPLOS 2006]; Power-Efficient DRAM Speculation; Hybrid Circuit Switching; Virtual Proximity; Circuit-switched snooping
- Next: Research Group Overview

54 Research Group Overview
- Faculty: Mikko Lipasti, since 1999
- Current MS/PhD students: Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease
- Graduates, current employment:
  - Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri
  - IBM: Trey Cain, Jason Cantin, Brian Mestan
  - AMD: Kevin Lepak
  - Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka

55 Current Focus Areas
- Multiprocessors
  - Coherence protocol optimization
  - Interconnection network design
  - Fairness issues in hierarchical systems
- Microprocessor design
  - Complexity-effective microarchitecture
  - Scalable dynamic scheduling hardware
  - Speculation reduction for power savings
  - Transparent clock gating
  - Domain-specific ISA extensions
- Software
  - Java Virtual Machine run-time optimization
  - Workload development and characterization

56 Funding
- National Science Foundation
- Intel Research Council
- IBM Faculty Partnership Awards
- IBM Shared University Research equipment
- Schneider ECE Faculty Fellowship
- UW Graduate School

57 Questions? http://www.ece.wisc.edu/~pharm

58 Backup Slides

59 Region Coherence Arrays
- The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region

60 Region Coherence Arrays
- On cache misses, the region state is read to determine if a broadcast is necessary
- On external snoops, the region state is read to provide a region snoop response
  - Piggybacked onto the conventional response
  - Used to update other processors’ region state
- The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region

61 Coarse-Grain Coherence Tracking
[Animated example: P0 and P1 with caches $0, $1 and RCAs; memories M0, M1; network. Region Coherence Array added; two lines per region]
- P1 stores 10000 (binary): MISS
- RFO: P1, 10000 (binary); snoop performed
- Response sent: Owned, Region Owned
- Data transfer; the line ends Modified in one cache and Invalid in the other, with region state DD
- Region not exclusive anymore
- Hits in P0 cache

62 Overhead
- Storage for RCA
- Two bits in snoop response for region snoop response
  - Region Externally Clean/Dirty

63 Overhead
- RCA maintains inclusion over caches
  - RCA must respond correctly to external requests if lines are cached
  - When regions are evicted from the RCA, their lines are evicted from the cache
  - Replacement algorithm uses line count to favor regions with no lines cached
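
The line-count-aware replacement mentioned above could be sketched like this. The tuple layout and the least-recently-used tie-break are assumptions; the slides only state that regions with no lines cached are favored as victims.

```python
def choose_victim(entries):
    """Pick the region to evict from one RCA set.

    entries: list of (tag, line_count, last_used) tuples (hypothetical layout).
    Regions with no cached lines are preferred, since evicting them forces no
    cache-line evictions; ties are broken by oldest last_used (assumed LRU).
    """
    tag, _, _ = min(entries, key=lambda e: (e[1] > 0, e[2]))
    return tag
```

For a set holding `(0xA, 3, 1)`, `(0xB, 0, 7)`, and `(0xC, 0, 5)`, the sketch evicts 0xC: it has no lines cached and was used less recently than 0xB.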

64 Snoop Traffic – Peak

65 Snoop Traffic – Average

66 Snoop Traffic
- Peak snoop traffic is halved
- Average snoop traffic reduced by nearly two thirds
- The system is more scalable, and may effectively support more processors

67 Tag Lookups Filtered
- Coarse-Grain Coherence Tracking can be used to filter external snoops
  - Send external requests to RCA first
  - If region valid and line count nonzero, send external request to cache
- Reduces power consumption in the cache tag arrays
- Increases broadcast snoop latency
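
The filtering rule above is small enough to state as code. This is a minimal sketch with a dictionary standing in for the RCA; the field names `valid` and `line_count` are hypothetical.

```python
def forward_snoop_to_cache(rca: dict, region_tag: int) -> bool:
    """Return True if an external snoop must look up the cache tag array.

    rca maps region tags to {"valid": bool, "line_count": int} (illustrative
    layout). The cache is consulted only when the region is valid and at
    least one of its lines is cached locally.
    """
    entry = rca.get(region_tag)
    return entry is not None and entry["valid"] and entry["line_count"] > 0
```

Every snoop for which this returns False skips the tag lookup, which is where the power saving comes from; the extra RCA access ahead of the cache is the source of the added snoop latency.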

68 Tag Lookups Filtered

69 Line Evictions for Inclusion

70 L2 Miss Ratio Increase

71 Stealth Prefetching
- Lines from a region may be prefetched again after a threshold number of L2 misses (currently 2)
- A bit mask of the lines cached since the last prefetch is used to avoid prefetching useless data

72 Stealth Prefetching
- Prefetched lines are managed by a simple protocol

73 Prefetch Timeliness

74 Data Traffic

75 Period Between DRAM Requests

76 Switch design

77 Value-Aware Techniques
- Coherence misses in multiprocessors
  - Store Value Locality [Lepak ’03]
- Ensuring consistency
  - Value-based checks [Cain ’04]
- Reducing speculation
  - Operand significance
  - Create (nearly) nonspeculative execution schedule
- Java Virtual Machine runtime optimization [Su]
  - Speculative optimizations [VEE ’07]

78 Complexity-Effective Techniques
- Scalable dynamic scheduling hardware
  - Half-price architecture [Kim ’03]
  - Macro-op scheduling [Kim ’03]
  - Operand significance [Gunadi]
- Scalable snoop-based coherence
  - Coarse-grained coherence [Cantin ’06]
  - Circuit-switched coherence [Enright]

79 Power-Efficient Techniques
- Power-efficient techniques
  - Reduced speculation [Gunadi]
  - Clock gating [E. Hill]
  - Transparent pipelines need fine-grained stalls
  - Redistribute coarse-grained stall cycles
- Circuit-switched coherence [Enright]
  - Reduce overhead of CMP cache coherence
  - Improve latency, power

80 Cache Coherence Problem
[Figure: P0, P1, and Memory; labels: Load A, A = 0 (two copies), Store A <= 1, value 1, Load A]

81 Cache Coherence Problem
[Figure: P0, P1, and Memory; labels: Load A, A = 0 (two copies), Store A <= 1, Memory = 1, Load A, A = 1]

82 Snoopy Cache Coherence
- All cache misses broadcast on shared bus
  - Processors and memory snoop and respond
- Cache block permissions enforced
  - Multiple readers allowed (shared state)
  - Only a single writer (exclusive state)
  - Must upgrade block before writing to it
  - Other copies invalidated
- Read/write-shared blocks bounce from cache to cache
  - Migratory sharing
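
The permission rules above (many readers or one writer, upgrade before write, invalidate other copies) can be illustrated with a toy model. It tracks permissions only, not data or dirty write-backs, and the class name, the two states "S" and "E", and the return strings are assumptions for the sketch rather than any specific protocol from the talk.

```python
class SnoopyBus:
    """Toy invalidation-based snooping: per-cache map of addr -> "S" or "E"."""

    def __init__(self, n_caches: int):
        self.caches = [dict() for _ in range(n_caches)]

    def read(self, cpu: int, addr: int) -> str:
        if addr in self.caches[cpu]:
            return "hit"
        # Miss: broadcast on the bus; every other holder ends up in shared state.
        others = [c for i, c in enumerate(self.caches) if i != cpu and addr in c]
        for c in others:
            c[addr] = "S"
        self.caches[cpu][addr] = "S" if others else "E"
        return "miss"

    def write(self, cpu: int, addr: int) -> str:
        mine = self.caches[cpu].get(addr)
        if mine == "E":
            return "hit"              # already the single writer
        # Upgrade or write miss: invalidate all other copies first.
        for i, c in enumerate(self.caches):
            if i != cpu:
                c.pop(addr, None)
        self.caches[cpu][addr] = "E"
        return "upgrade" if mine == "S" else "miss"
```

Alternating reads and writes from two processors to the same address makes the block bounce between the caches, which is the migratory sharing pattern noted on the slide.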

83 Example: Conventional Snooping
[Animated example: P0 and P1 with caches $0, $1 (columns: Tag, State); memories M0, M1; network]
- P0 loads 10000 (binary): MISS
- Read: P0, 10000 (binary); snoop performed
- Response sent: Invalid
- Data transfer; the line becomes Exclusive

84 Coarse-Grain Coherence Tracking
[Animated example: P0 and P1 with caches $0, $1 and RCAs; memories M0, M1; network. Region Coherence Array added; two lines per region]
- P0 loads 10000 (binary): MISS
- Read: P0, 10000 (binary); snoop performed
- Response sent: Invalid, Region Not Shared
- Data transfer; line state Exclusive, region state DI
- P0 has exclusive access to region

85 Coarse-Grain Coherence Tracking
[Animated example, continued: Region Coherence Array added; two lines per region]
- P0 loads 11000 (binary): MISS, Region Hit
- Direct request sent: Read: P0, 11000 (binary)
- Data transfer; line state Exclusive
- Exclusive region state, broadcast unnecessary

86 Impact on Execution Time

87 Stealth Prefetching
[Animated example: P0 and P1 with caches, RCAs, and SDPBs; memories M0, M1; network. Assume 8-byte lines, 32-byte regions, 2-line threshold]
- P0 loads 0x28: MISS, RCA Hit
- Direct request sent: Read: P0, 0x28, with Prefetch: 1100 (binary)
- Data transfer; prefetch data for lines 0110 and 0111 placed in the SDPB

88 Stealth Prefetching
[Animated example, continued. Assume 8-byte lines, 32-byte regions, 2-line threshold]
- P0 loads 0x30: MISS, SDPB Hit
- SDPB data transfer
- Return data

89 Communication Latencies

                                  CC-NUMA               CMP
Local Cache Access                12                    12
Remote Cache-to-Cache Transfer    12 + 21 * H * 3       12 + 4 * H * 3
Local Memory Access               150                   150
Remote Memory Access              150 + 21 * H * 2      150 + 4 * H * 2
(H = hop count)

- Remote cache access is 2-5x faster in CMPs than NUMA machines
- Lower communication latencies allow for more flexible thread placement
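
The latency expressions in the table can be evaluated directly to see how the CC-NUMA to CMP gap changes with hop count. The constants are taken from the table; the function and constant names are illustrative only.

```python
CACHE_LATENCY = 12      # cycles, local cache access
MEMORY_LATENCY = 150    # cycles, local memory access
NUMA_HOP = 21           # per-hop cost in the CC-NUMA column
CMP_HOP = 4             # per-hop cost in the CMP column


def remote_cache_to_cache(hops: int, per_hop: int) -> int:
    """Remote cache-to-cache transfer: three network traversals."""
    return CACHE_LATENCY + per_hop * hops * 3


def remote_memory(hops: int, per_hop: int) -> int:
    """Remote memory access: two network traversals."""
    return MEMORY_LATENCY + per_hop * hops * 2


def cmp_speedup(hops: int) -> float:
    """How many times faster a remote cache-to-cache transfer is on the CMP."""
    return remote_cache_to_cache(hops, NUMA_HOP) / remote_cache_to_cache(hops, CMP_HOP)
```

For H = 1 the formulas give 75 cycles on CC-NUMA versus 24 on the CMP, and the ratio grows with hop count toward 21/4 as the fixed 12-cycle term becomes negligible.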

90 Configuration

Simulation Parameters
  Cores             16 single-threaded, light-weight, in-order
  Interconnect      2-D packet-switched mesh, 3-cycle router pipeline (baseline)
                    Hybrid circuit-switched mesh, 4 circuits
  L1 Cache          Split I/D, 16KB each (2 cycles)
  L2 Cache          Private, 128 KB (6 cycles)
  L3 Cache          Shared, 16 MB (16 1MB banks), 12 cycles
  Memory Latency    150 cycles

Workload Mixes
  Mix 1             TPC-W (4) + TPC-H (4)
  Mix 2             TPC-W (4) + SPECjbb (4)
  Mix 3             TPC-H (4) + SPECjbb (4)

91 Effect of Memory Placement
- Load balancing with HCS outperforms local placement
- Virtual proximity to memory home node

