1
Coarse-Grained Coherence. Mikko H. Lipasti, Associate Professor, Electrical and Computer Engineering, University of Wisconsin-Madison. Joint work with: Jason Cantin, IBM (Ph.D. '06); Natalie Enright Jerger; Prof. Jim Smith; Prof. Li-Shiuan Peh (Princeton). http://www.ece.wisc.edu/~pharm
2
Motivation. Multiprocessors are commonplace: historically glass-house servers, now laptops, soon cell phones. The most common multiprocessor is symmetric processors w/ coherent caches, a logical extension of time-shared uniprocessors: easy to program and reason about, but not so easy to build. Aug 30, 2007. Mikko Lipasti, University of Wisconsin
3
Coherence Granularity. Tracking each individual word incurs too much overhead. Tracking larger blocks (32B-128B is common) has less overhead and exploits spatial locality, but large blocks cause false sharing. Solution: use multiple granularities. Small blocks manage local read/write permissions; large blocks track global behavior.
4
Coarse-Grained Coherence. Initially: identify non-shared regions, decouple obtaining coherence permission from data transfer, and filter snoops to reduce broadcast bandwidth. Later: enable aggressive prefetching, optimize DRAM accesses, and customize the protocol and interconnect to match.
5
Coarse-Grained Coherence. The optimizations lead to reduced memory miss latency, reduced cache-to-cache miss latency, reduced snoop bandwidth, fewer exposed cache misses, elimination of unnecessary DRAM reads, and power savings on the bus, interconnect, caches, and in DRAM. (World peace and an end to global warming.)
6
Coarse-Grained Coherence Tracking. Memory is divided into coarse-grained regions: aligned, power-of-two multiples of the cache line size, ranging from two lines to a physical page. A cache-like structure, the Region Coherence Array (RCA), is added to each processor to monitor coherence at the granularity of regions.
7
Region Coherence Arrays. Each entry has an address tag, a state, and a count of lines cached by the processor. The region state indicates whether this processor and/or other processors are sharing or modifying lines in the region. Policy, protocol, and interconnect can be customized to exploit the region state.
8
Talk Outline: Motivation; Overview of Coarse-Grained Coherence; Techniques (Broadcast Snoop Reduction [ISCA 2005], Stealth Prefetching [ASPLOS 2006], Power-Efficient DRAM Speculation, Hybrid Circuit Switching, Virtual Proximity, Circuit-Switched Snooping); Research Group Overview.
9
Unnecessary Broadcasts
10
Broadcast Snoop Reduction. Identify requests that don't need a broadcast. Send data requests directly to memory without broadcasting, reducing both broadcast traffic and memory latency. Avoid sending non-data requests externally at all.
11
Simulator Evaluation. PHARMsim: near-RTL, but written in C. An execution-driven simulator built on top of SimOS-PPC. Four 4-way superscalar out-of-order processors; a two-level hierarchy with split L1 caches, a unified 1MB L2 cache, and 64B lines. Separate address/data networks, similar to the Sun Fireplane.
12
Workloads. Scientific: Ocean, Raytrace, Barnes. Multiprogrammed: SPECint2000_rate, SPECint95_rate. Commercial (database, web): TPC-W, TPC-B, TPC-H, SPECweb99, SPECjbb2000.
13
Broadcasts Avoided
14
Execution Time
15
Summary. Eliminates nearly all unnecessary broadcasts and reduces snoop activity by 65% (fewer broadcasts, fewer lookups). Provides a modest speedup.
16
Talk Outline: Motivation; Overview of Coarse-Grained Coherence; Techniques (Broadcast Snoop Reduction [ISCA 2005], Stealth Prefetching [ASPLOS 2006], Power-Efficient DRAM Speculation, Hybrid Circuit Switching, Virtual Proximity, Circuit-Switched Snooping); Research Group Overview.
17
Prefetching in Multiprocessors. Prefetching: anticipate a future reference and fetch it into the cache. Many prefetching heuristics are possible; current systems use next-block and stride, while skip-pointer and content-based schemes have been proposed. Some/many prefetched blocks are not used. Multiprocessor complications: premature or unnecessary prefetches, and permission thrashing if blocks are shared. Separate study [ISPASS 2006].
18
Stealth Prefetching. Lines from non-shared regions can be prefetched stealthily and efficiently: without disturbing other processors, without downgrades or invalidations, without preventing them from obtaining exclusive copies, and without broadcasting prefetch requests. The lines are fetched from DRAM with low overhead.
19
Stealth Prefetching. After a threshold number of L2 misses to a region (two), the rest of the lines from that region are prefetched. These lines are buffered close to the processor for later use in a Stealth Data Prefetch Buffer. After accessing the RCA, requests may obtain data from the buffer as they would from memory. To access the data, the region must be in a valid state and a broadcast must be unnecessary for coherent access.
20
L2 Misses Prefetched
21
Speedup
22
Summary. Stealth Prefetching can prefetch data stealthily (only non-shared data is prefetched, and prefetch requests are not broadcast), aggressively (large regions are prefetched at once, 80-90% timely), and efficiently (requests are piggybacked onto a demand request and fetched from DRAM in open-page mode).
23
Talk Outline: Motivation; Overview of Coarse-Grained Coherence; Techniques (Broadcast Snoop Reduction [ISCA 2005], Stealth Prefetching [ASPLOS 2006], Power-Efficient DRAM Speculation, Hybrid Circuit Switching, Virtual Proximity, Circuit-Switched Snooping); Research Group Overview.
24
Power-Efficient DRAM Speculation. Modern systems overlap the DRAM access with the snoop, speculatively accessing DRAM before the snoop response arrives: trading DRAM bandwidth for latency, and wasting power. Approximately 25% of DRAM requests are reads that speculatively access DRAM unnecessarily. (Timeline: Broadcast Req, Snoop Tags, Send Resp, overlapped with DRAM Read, Xmit Block.)
25
DRAM Operations
26
Power-Efficient DRAM Speculation. Direct memory requests are non-speculative. Lines from externally-dirty regions are likely to be sourced from another processor's cache, so the region state can serve as a prediction: such requests need not access DRAM speculatively. Initial requests to a region (state unknown) have a lower but still significant probability of obtaining data from other processors' caches.
27
Useless DRAM Reads
28
Useful DRAM Reads
29
DRAM Reads Performed/Delayed
30
Summary. Power-Efficient DRAM Speculation can reduce DRAM reads by 20% with less than 1% performance degradation (versus a 7% slowdown with fully non-speculative DRAM). It nearly doubles the interval between DRAM requests, allowing modules to stay in low-power modes longer.
31
Talk Outline: Motivation; Overview of Coarse-Grained Coherence; Techniques (Broadcast Snoop Reduction [ISCA 2005], Stealth Prefetching [ASPLOS 2006], Power-Efficient DRAM Speculation, Hybrid Circuit Switching, Virtual Proximity, Circuit-Switched Snooping); Research Group Overview.
32
Chip Multiprocessor Interconnect Options. Buses don't scale; crossbars are too expensive; rings are too slow. A packet-switched mesh is attractive for all the same 1990s DSM reasons: scalable, low latency, high link utilization.
33
CMP Interconnection Networks. But cables/traces are now on-chip wires: fast, cheap, plentiful, and short (1 cycle per hop). Router latency adds up: 3-4 cycles per hop, store-and-forward, lots of activity/power. Is this the right answer?
34
Circuit-Switched Interconnects. Communication patterns show spatial locality to memory and pairwise communication. Circuit-switched links avoid switching/routing, reduce latency, and may save power. Utilization is poor, but that may be acceptable.
35
Router Design. Switches consist of a configurable crossbar and configuration memory. The 4-stage router pipeline exposes only 1 cycle on a circuit-switched path, and the network can also act as a packet-switched network. Design details in [CA Letters '07].
36
Protocol Optimization. An initial 3-hop miss establishes the circuit-switched path. Subsequent miss requests are sent directly on that path to the predicted owner, and in parallel to the home node. The predicted owner sources the data early; the directory acknowledges the update to the sharing list. Benefits: reduced 3-hop latency, less activity, less power.
37
Hybrid Circuit Switching (1). Hybrid Circuit Switching improves performance by up to 7%.
38
Hybrid Circuit Switching (2). Positive interaction in the co-designed interconnect & protocol: more circuit reuse means greater latency benefit.
39
Summary. Hybrid Circuit Switching: routing overhead is eliminated while still enabling high bandwidth when needed. The co-designed protocol optimizes cache-to-cache transfers. Substantial performance benefits; power analysis remains to do.
40
Talk Outline: Motivation; Overview of Coarse-Grained Coherence; Techniques (Broadcast Snoop Reduction [ISCA 2005], Stealth Prefetching [ASPLOS 2006], Power-Efficient DRAM Speculation, Hybrid Circuit Switching, Virtual Proximity, Circuit-Switched Snooping); Research Group Overview.
41
Server Consolidation on CMPs. The CMP as a consolidation platform: simplify system administration; save power, cost, and physical infrastructure. We study combinations of individual workloads in a full-system environment; a micro-coded hypervisor schedules VMs. See "An Evaluation of Server Consolidation Workloads for Multi-Core Designs" (IISWC 2007) for additional details. Nugget: a shared LLC is a big win.
42
Virtual Proximity. Interactions between VM scheduling, placement, and the interconnect. Goal: placement-agnostic scheduling for the best workload balance. We evaluate three scheduling policies: gang, affinity, and load-balanced. HCS provides virtual proximity.
43
Scheduling Algorithms. Gang scheduling co-schedules all threads of a VM, with no idle-cycle stealing. Affinity scheduling assigns VMs to neighboring cores and can steal idle cycles across VMs sharing a core. Load-balanced scheduling assigns ready threads to any core, so any/all VMs can steal idle cycles, but over time a VM fragments across the chip.
44
Load balancing wins with a fast interconnect; affinity scheduling wins with a slow interconnect. HCS creates virtual proximity.
45
Virtual Proximity Performance. HCS is able to provide virtual proximity.
46
As physical distance (hop count) increases, HCS provides significantly lower latency.
47
Summary. Virtual Proximity [in submission] enables a placement-agnostic hypervisor scheduler. Results: up to 17% better than affinity scheduling; idle-cycle reduction of 84% over gang and 41% over affinity scheduling. The low-latency interconnect mitigates the increase in L2 cache conflicts from load balancing: L2 misses are up by 10%, but execution time is reduced by 11%. A flexible, distributed address mapping combined with HCS outperforms a localized, affinity-based memory mapping by an average of 7%.
48
Talk Outline: Motivation; Overview of Coarse-Grained Coherence; Techniques (Broadcast Snoop Reduction [ISCA 2005], Stealth Prefetching [ASPLOS 2006], Power-Efficient DRAM Speculation, Hybrid Circuit Switching, Virtual Proximity, Circuit-Switched Snooping); Research Group Overview.
49
Circuit-Switched Snooping (1). Scalable, efficient broadcasting on an unordered network removes the latency overhead of directory indirection. Point-to-point circuit-switched links are extended to trees, giving low-latency multicast via circuit-switched trees. This also helps provide performance isolation, since requests do not share the same communication medium.
50
Circuit-Switched Snooping (2). Extends Coarse-Grained Coherence Tracking (CGCT): remove unnecessary broadcasts and convert the remaining broadcasts to multicasts. Effective in server consolidation workloads, which issue very few coherence requests to globally shared data.
51
Snooping Interconnect. Switches consist of a configurable crossbar and configuration memory. Circuits span two or more nodes, based on the RCA. Snooping occurs across circuits: all sharers in a region join the circuit. Each link can physically accommodate multiple circuits.
52
Circuit-Switched Snooping. Use the RCA to identify subsets of nodes that share data, and create shared circuits among those nodes. Design challenges: multi-drop, bidirectional circuits; memory ordering. Results: very much in progress.
53
Talk Outline: Motivation; Overview of Coarse-Grained Coherence; Techniques (Broadcast Snoop Reduction [ISCA 2005], Stealth Prefetching [ASPLOS 2006], Power-Efficient DRAM Speculation, Hybrid Circuit Switching, Virtual Proximity, Circuit-Switched Snooping); Research Group Overview.
54
Research Group Overview. Faculty: Mikko Lipasti, since 1999. Current MS/PhD students: Gordie Bell (also IBM), Natalie Enright Jerger, Erika Gunadi, Atif Hashmi, Eric Hill, Lixin Su, Dana Vantrease. Graduates and their current employment: Intel: Ilhyun Kim, Morris Marden, Craig Saldanha, Madhu Seshadri; IBM: Trey Cain, Jason Cantin, Brian Mestan; AMD: Kevin Lepak; Sun Microsystems: Matt Ramsay, Razvan Cheveresan, Pranay Koka.
55
Current Focus Areas. Multiprocessors: coherence protocol optimization, interconnection network design, fairness issues in hierarchical systems. Microprocessor design: complexity-effective microarchitecture, scalable dynamic scheduling hardware, speculation reduction for power savings, transparent clock gating, domain-specific ISA extensions. Software: Java Virtual Machine run-time optimization, workload development and characterization.
56
Funding: National Science Foundation, Intel Research Council, IBM Faculty Partnership Awards, IBM Shared University Research equipment, Schneider ECE Faculty Fellowship, UW Graduate School.
57
Questions? http://www.ece.wisc.edu/~pharm
58
Backup Slides
59
Region Coherence Arrays. The regions are kept coherent with a protocol, which summarizes the local and global state of lines in the region.
60
Region Coherence Arrays. On cache misses, the region state is read to determine whether a broadcast is necessary. On external snoops, the region state is read to provide a region snoop response, which is piggybacked onto the conventional response and used to update other processors' region state. The regions are kept coherent with a protocol that summarizes the local and global state of lines in the region.
61
[Figure: Coarse-Grain Coherence Tracking animated example. A Region Coherence Array is added, two lines per region. P1 stores to address 10000 (binary), misses, and broadcasts a read-for-ownership; the snoop hits in P0's cache, so the region is no longer exclusive. The region snoop response marks the region Owned, and the data is transferred, with the line becoming Modified in P1's cache.]
62
Overhead. Storage for the RCA, plus two bits in the snoop response for the region snoop response (Region Externally Clean/Dirty).
63
Overhead. The RCA maintains inclusion over the caches: it must respond correctly to external requests if any lines are cached. When regions are evicted from the RCA, their lines are evicted from the cache. The replacement algorithm uses the line count to favor regions with no lines cached.
64
Snoop Traffic – Peak
65
Snoop Traffic – Average
66
Snoop Traffic. Peak snoop traffic is halved, and average snoop traffic is reduced by nearly two thirds. The system is more scalable and may effectively support more processors.
67
Tag Lookups Filtered. Coarse-Grain Coherence Tracking can be used to filter external snoops: send external requests to the RCA first, and only if the region is valid with a nonzero line count forward the request to the cache. This reduces power consumption in the cache tag arrays, at the cost of increased broadcast snoop latency.
68
Tag Lookups Filtered
69
Line Evictions for Inclusion
70
L2 Miss Ratio Increase
71
Stealth Prefetching. Lines from a region may be prefetched again after a threshold number of L2 misses (currently two). A bitmask of the lines cached since the last prefetch is used to avoid prefetching useless data.
72
Stealth Prefetching. Prefetched lines are managed by a simple protocol.
73
Prefetch Timeliness
74
Data Traffic
75
Period Between DRAM Requests
76
Switch Design
77
Value-Aware Techniques. Coherence misses in multiprocessors: store value locality [Lepak '03]. Ensuring consistency: value-based checks [Cain '04]. Reducing speculation: operand significance creates a (nearly) non-speculative execution schedule. Java Virtual Machine runtime optimization [Su]: speculative optimizations [VEE '07].
78
Complexity-Effective Techniques. Scalable dynamic scheduling hardware: half-price architecture [Kim '03], macro-op scheduling [Kim '03], operand significance [Gunadi]. Scalable snoop-based coherence: coarse-grained coherence [Cantin '06], circuit-switched coherence [Enright].
79
Power-Efficient Techniques. Reduced speculation [Gunadi]. Clock gating [E. Hill]: transparent pipelines need fine-grained stalls, so coarse-grained stall cycles are redistributed. Circuit-switched coherence [Enright]: reduce the overhead of CMP cache coherence, improving latency and power.
80
[Figure: Cache Coherence Problem. P0 and P1 both load A and cache the value 0 from memory; P0 then stores A = 1.]
81
[Figure: Cache Coherence Problem, continued. P1 loads A again; coherence must ensure it observes the updated value 1 rather than its stale cached copy.]
82
Snoopy Cache Coherence. All cache misses are broadcast on a shared bus; processors and memory snoop and respond. Cache block permissions are enforced: multiple readers are allowed (shared state), but only a single writer (exclusive state). A block must be upgraded before writing to it, invalidating the other copies. Read/write-shared blocks bounce from cache to cache (migratory sharing).
83
[Figure: Conventional snooping example. P0 loads address 10000 (binary) and misses; the read is broadcast, all nodes snoop and send responses, the data is transferred from memory, and the line becomes Exclusive in P0's cache.]
84
[Figure: Coarse-Grain Coherence Tracking example. A Region Coherence Array is added, two lines per region. P0 loads address 10000 (binary) and misses; the broadcast snoop finds the region unshared ("Invalid, Region Not Shared"), the data is transferred, and P0 gains exclusive access to the region.]
85
[Figure: Coarse-Grain Coherence Tracking, continued. P0 loads address 11000 (binary) and misses, but the RCA hits with the region in an exclusive state, so a broadcast is unnecessary and the request is sent directly to memory.]
86
Impact on Execution Time
87
[Figure: Stealth Prefetching example (8-byte lines, 32-byte regions, 2-line threshold). P0 loads 0x28; the miss hits in the RCA, the request is sent directly to memory, and the remaining lines of the region are prefetched into the Stealth Data Prefetch Buffer (SDPB).]
88
[Figure: Stealth Prefetching, continued (8-byte lines, 32-byte regions, 2-line threshold). P0 loads 0x30; the miss hits in the SDPB and the data is returned without any external request.]
89
Communication Latencies (cycles; H = hop count):
Local cache access: CC-NUMA 12, CMP 12.
Remote cache-to-cache transfer: CC-NUMA 12 + 21 * H * 3, CMP 12 + 4 * H * 3.
Local memory access: CC-NUMA 150, CMP 150.
Remote memory access: CC-NUMA 150 + 21 * H * 2, CMP 150 + 4 * H * 2.
Remote cache access is 2-5x faster in CMPs than in NUMA machines. Lower communication latencies allow more flexible thread placement.
90
Configuration. Simulation parameters:
Cores: 16 single-threaded, light-weight, in-order.
Interconnect: 2-D packet-switched mesh with a 3-cycle router pipeline (baseline); hybrid circuit-switched mesh with 4 circuits.
L1 cache: split I/D, 16KB each (2 cycles).
L2 cache: private, 128KB (6 cycles).
L3 cache: shared, 16MB (16 1MB banks), 12 cycles.
Memory latency: 150 cycles.
Workload mixes: Mix 1: TPC-W (4) + TPC-H (4); Mix 2: TPC-W (4) + SPECjbb (4); Mix 3: TPC-H (4) + SPECjbb (4).
91
Effect of Memory Placement. Load balancing with HCS outperforms local placement: virtual proximity to the memory home node.