
1 Time-based Snoop Filtering in Chip Multiprocessors. Iman Faraji, Amirkabir University of Technology, Tehran, Iran. Amirali Baniasadi, University of Victoria, Victoria, Canada.

2 This work: reducing redundant snoops in chip multiprocessors.
Our Goal: improve the energy efficiency of write-through (WT) based CMPs.
Our Motivation: there are long time intervals where snooping fails, wasting energy and bandwidth.
Our Solution: detect such intervals and avoid snoops.
Key Results: memory energy reduced 18%, snoop traffic reduced 93%, performance 3.8%.

3 Conventional Snooping
[Diagram: CPUs with private D$ connected over the interconnect to the memory controller; numbered arrows mark the steps of a snooped read transaction.]
Redundant snoops (all peers miss): ~70%.
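
To make the flow above concrete, here is a minimal, hypothetical C++ sketch (ours, not the authors' simulator) of how a read miss is serviced under conventional snooping: the requester probes every peer D$ and falls back to memory only when every snoop fails, which the slide reports happens for roughly 70% of snoops.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// One private data cache, reduced to the set of block addresses it holds.
struct Cache {
    std::unordered_set<uint64_t> lines;
    bool holds(uint64_t block) const { return lines.count(block) != 0; }
};

enum class Source { PeerCache, Memory };

// On a read miss in core `requester`, broadcast a snoop to every other cache.
// If no peer holds the block, the snoop was redundant and memory supplies it.
Source serviceReadMiss(const std::vector<Cache>& caches, int requester, uint64_t block) {
    for (std::size_t c = 0; c < caches.size(); ++c) {
        if (static_cast<int>(c) == requester) continue;
        if (caches[c].holds(block)) return Source::PeerCache;
    }
    return Source::Memory;  // all snoops failed: energy and bandwidth spent for nothing
}
```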

4 WB vs. WT
Write-through configuration: high memory traffic, simple coherency mechanism.
Write-back configuration: low memory traffic, sophisticated coherency mechanism.
[Chart: relative memory energy consumption.]

5 Previous Work: Snoop Filters
A good snoop filter is (1) fast and simple, and (2) accurate and effective.
It eliminates redundant snoop requests, both local and global. Local: one core fails to provide the data. Global: all cores fail.
Examples:
RegionScout: detects memory regions that are not shared (Moshovos).
Selective Snoop Request: predicts the supplier (Atoofian & Baniasadi).
Serial Snooping: requests nodes one by one (Saldanha & Lipasti).

6 Our Work: Time-based Snoop Filtering
Motivation: there are long intervals where snooping fails consecutively. But how long, and how often?

7 Our Work (Cont.)

8 Global Read Miss (GRM): occurs whenever snooping by all processors fails.
Local Read Miss (LRM): a redundant snoop in which a single processor's snoop fails.
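
A small sketch of how these two miss types could be classified from per-core snoop outcomes; the struct and function names are ours, only the GRM/LRM definitions come from the slide.

```cpp
#include <cstddef>
#include <vector>

// Outcome of one snooped read miss: coreHit[i] is true if core i could supply the data.
struct SnoopOutcome {
    std::vector<bool> coreHit;
};

// LRM for core i: that single processor's snoop failed (a redundant snoop at that node).
bool isLocalReadMiss(const SnoopOutcome& s, int i) {
    return !s.coreHit[static_cast<std::size_t>(i)];
}

// GRM: snooping by all processors failed, so the data must come from memory.
bool isGlobalReadMiss(const SnoopOutcome& s) {
    for (bool hit : s.coreHit)
        if (hit) return false;
    return true;
}
```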

9 Distribution
[Plots: (a) LRM distribution for different processors; (b) GRM distribution.]
Periods of data scarcity are usually long.

10 Time-based Global Miss predictor (TGM)
TGM Goals: (1) detect GRM intervals; (2) shut down snooping in all processors but one (the surviving node).
TGM Types:
TGM-First: the first processor whose snoop failed survives.
TGM-Last: the last processor whose snoop failed survives.
(A sketch of this policy follows below.)
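
The following is a hedged reading of the TGM policy as C++ pseudologic, not the authors' hardware: after some number of consecutive GRMs (the threshold is an assumed knob; the slide gives no value), snooping is disabled everywhere except the surviving node picked by the First/Last policy.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

enum class TgmPolicy { First, Last };

struct TGM {
    int consecutiveGRMs = 0;
    int threshold;                      // assumed tuning knob; not specified on the slide
    TgmPolicy policy;
    int survivor = -1;                  // the one node that keeps snooping
    std::vector<bool> snoopEnabled;

    TGM(int cores, int thresh, TgmPolicy p)
        : threshold(thresh), policy(p), snoopEnabled(cores, true) {}

    // failedOrder: core ids in the order their snoops failed for this global read miss.
    void onGlobalReadMiss(const std::vector<int>& failedOrder) {
        ++consecutiveGRMs;
        survivor = (policy == TgmPolicy::First) ? failedOrder.front() : failedOrder.back();
        if (consecutiveGRMs >= threshold) {             // GRM interval detected
            for (std::size_t c = 0; c < snoopEnabled.size(); ++c)
                snoopEnabled[c] = (static_cast<int>(c) == survivor);  // all but survivor stop
        }
    }

    // A snoop hit anywhere ends the interval: resume snooping on every core.
    void onSnoopHit() {
        consecutiveGRMs = 0;
        std::fill(snoopEnabled.begin(), snoopEnabled.end(), true);
    }
};
```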

11 TGM implementation
[Diagram: TGM-enhanced CMP.]

12 TGM
[Plots: (a) coverage; (b) accuracy.]

13 Time-based Local Miss predictor (TLM)
Goal: detect LRMs.
How?
1. Count consecutive snoop misses in a node.
2. Disable snooping when the count exceeds a threshold.
3. Restart snooping after a number of cycles.
(A sketch of these steps follows below.)
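
A minimal per-node sketch of the three steps above. The thresholds are assumptions; the slide names only the RSN and ReStarT counters that appear on the next slide.

```cpp
// Per-processor TLM predictor: counts consecutive redundant (missed) snoops and
// temporarily turns snooping off for this node once the streak gets long enough.
struct TLM {
    int rsn = 0;            // Redundant SNoop counter: consecutive snoop misses at this node
    int rst = 0;            // ReStarT counter: cycles left before snooping is re-enabled
    int rsnThreshold;       // step 2: disable snooping once RSN exceeds this (assumed knob)
    int restartCycles;      // step 3: back-off length in cycles (assumed knob)
    bool snoopEnabled = true;

    TLM(int threshold, int restart) : rsnThreshold(threshold), restartCycles(restart) {}

    // Step 1: record the outcome of a snoop at this node.
    void onSnoopResult(bool localHit) {
        if (localHit) { rsn = 0; return; }     // useful snoop: reset the streak
        if (++rsn > rsnThreshold) {            // step 2: too many redundant snoops in a row
            snoopEnabled = false;
            rst = restartCycles;
        }
    }

    // Step 3: after the back-off period, restart snooping and the RSN count.
    void tickCycle() {
        if (!snoopEnabled && --rst <= 0) {
            snoopEnabled = true;
            rsn = 0;
        }
    }
};
```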

14 TLM implementation
[Diagram: TLM-enhanced CMP. Each processor contains a Processing Unit (PU), a first-level cache, and a predictor holding a Redundant SNoop (RSN) counter and a ReStarT (RST) counter.]

15 TLM features
[Plots: (a) coverage; (b) accuracy.]

16 Methodology
Our simulator: SESC. Benchmarks: SPLASH-2. Energy evaluation: CACTI 6.5. System: quad-core CMP.
SPLASH-2 benchmarks and input parameters:
Barnes: 16K particles
Cholesky: tk29.O
FFT: 1024K complex data points
Ocean: 258x258 ocean
Volrend: head
Water-Nsquared: 512 molecules
Water-Spatial: 512 molecules
System parameters:
Processor: frequency 5 GHz; technology 68 nm; branch predictor: 16K-entry bimodal and gshare; fetch/issue/commit: 4/4/5; branch penalty: 17 cycles; RAS: 32 entries; BTB: 2K entries, 2-way.
Interconnection network: data interconnect: crossbar; interconnect width: 64 B.
Memory: IL1: 64 KB, 2-way; DL1: 64 KB, 4-way, write-through, access time 1 cycle, block size 64, cache line size 32; L2: 512 KB, 8-way, write-through, access time 11 cycles, block size 64; main memory: 1 GB, access time 70 cycles, page size 4 Kbit.

17 Relative Snoop Traffic Reduction: TGM-F: 58%, TGM-L: 57%, TLM: 77%.

18 Relative Memory Energy: TGM-F: 8%, TGM-L: 8.5%, TLM: 11%.

19 Relative Memory Delay: TGM-F: 1.1%, TGM-L: 2.1%, TLM: 1.7%.

20 Relative Performance: TGM-F: no change, TGM-L: 0.4%, TLM: 0.3%.

21 Summary
We showed:
- Long data scarcity periods (DSPs) exist during workload runtime.
- During DSPs, redundant snoops happen frequently and consecutively.
Our solutions:
- TGM: uses snoop behavior across all processors to detect and filter redundant snoops, shutting down snooping on as many processors as possible.
- TLM: filters redundant snoops within a single node, counting recent redundant snoops to detect data scarcity periods and filter upcoming redundant snoops.
Simulation results:
- Snoop reduction: TGM-F: 58%, TGM-L: 57%, TLM: 77%
- Memory energy: TGM-F: 8%, TGM-L: 8.5%, TLM: 11%
- Memory delay: TGM-F: 1.1%, TGM-L: 2.1%, TLM: 1.7%
- Performance: TGM-F: no change, TGM-L: 0.4%, TLM: 0.3%

22 Thanks for your attention

23 Backup Slides

24 Discussion
How do the characteristics of the benchmarks affect the memory energy/delay reductions achieved by our solution?
1. True detection of redundant snoops.
2. Share of redundant snoops.

25 Memory Energy and Memory Delay
Memory energy = the energy consumed to provide the requested data.
Memory delay = the time required to provide the requested data.
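
As a purely illustrative sketch of tallying the delay metric defined above (ours, not the authors' evaluation), using the access latencies from the Methodology slide (L1: 1 cycle, L2: 11 cycles, main memory: 70 cycles):

```cpp
// Illustration only: accumulate "memory delay" (time to provide the requested data)
// by charging the latency of whichever level finally supplied the block.
// A real evaluation would also account for snoop/interconnect time and would
// accumulate per-access energy numbers (e.g. from CACTI) the same way.
enum class ServicedBy { L1, L2, MainMemory };

long long totalMemoryDelayCycles = 0;

void recordAccess(ServicedBy src) {
    switch (src) {
        case ServicedBy::L1:         totalMemoryDelayCycles += 1;  break;
        case ServicedBy::L2:         totalMemoryDelayCycles += 11; break;
        case ServicedBy::MainMemory: totalMemoryDelayCycles += 70; break;
    }
}
```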

26 Volrend Benchmark
While running, Volrend rarely sends snoop requests.
This application renders a three-dimensional volume. It renders several frames from changing viewpoints; consecutive frames in a rotation sequence often vary only slightly in viewpoint, giving high temporal locality.
Volrend also distributes its load very well, giving high spatial locality.

