
Time-based Snoop Filtering in Chip Multiprocessors
Iman Faraji and Amirali Baniasadi
Amirkabir University of Technology, Tehran, Iran; University of Victoria, Victoria, Canada

This work: reducing redundant snoops in chip multiprocessors.
Our goal: improving the energy efficiency of write-through (WT) based CMPs.
Our motivation: there are long time intervals during which snooping fails, wasting energy and bandwidth.
Our solution: detect such intervals and avoid snoops.
Key results: memory energy: 18%; snoop traffic: 93%; performance: 3.8%.

Conventional Snooping
[Figure: CPUs with private data caches (D$) connected through the interconnect to the controller.]
Redundant snoops (misses): ~70%.

WB vs. WT
Write-through configuration: high memory traffic, simple coherency mechanism.
Write-back configuration: low memory traffic, sophisticated coherency mechanism.
[Figure: relative memory energy consumption.]

Previous Work: Snoop Filters
A good snoop filter is:
1. Fast and simple
2. Accurate and effective
It eliminates redundant snoop requests, both local and global. Local: one core fails to provide the data. Global: all cores fail.
Examples:
- RegionScout: detects memory regions that are not shared (Moshovos)
- Selective Snoop Request: predicts the supplier (Atoofian & Baniasadi)
- Serial Snooping: requests nodes one by one (Saldanha & Lipasti)

Our Work: Time-based Snoop Filtering
Motivation: there are long intervals where snooping fails consecutively. But how long, and how often?

Our Work (Cont.)

Global Read Miss (GRM): occurs whenever the snoops by all processors fail.
Local Read Miss (LRM): a redundant snoop in which a single processor fails.
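To make the two terms concrete, the following is a minimal sketch (not taken from the paper) of how per-core snoop outcomes map to LRMs and a GRM; the function name and the hit array are illustrative assumptions.

```c
#include <stdbool.h>

#define NUM_CORES 4  /* quad-core CMP, as in the evaluation */

/* hit[i] is true if core i's snoop found the requested block in its cache.
 * On return, lrm[i] marks cores whose snoop was redundant (a Local Read
 * Miss); the return value is true when every core missed (a Global Read
 * Miss). */
bool classify_read_miss(const bool hit[NUM_CORES], bool lrm[NUM_CORES]) {
    bool grm = true;
    for (int i = 0; i < NUM_CORES; i++) {
        lrm[i] = !hit[i];        /* this core failed to supply the data */
        if (hit[i]) grm = false; /* at least one core can supply it     */
    }
    return grm;
}
```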

Distribution
[Figure: (a) LRM distribution for different processors; (b) GRM distribution.]
Periods of data scarcity are usually long.

Time-based Global Miss Predictor (TGM)
TGM types:
1. TGM-First: the first processor that failed snooping survives.
2. TGM-Last: the last processor that failed snooping survives.
TGM goals:
1. Detect GRM intervals.
2. Shut down snooping in all processors but one (the surviving node).
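A hedged sketch of the TGM policy is shown below; the data structures and the re-enable condition (restoring snooping once the surviving node's snoop hits) are assumptions for illustration, since the slides only state the goals and the two surviving-node policies.

```c
#include <stdbool.h>

#define NUM_CORES 4

typedef enum { TGM_FIRST, TGM_LAST } tgm_policy_t;

typedef struct {
    tgm_policy_t policy;
    bool snoop_enabled[NUM_CORES];
} tgm_state_t;

/* On a detected GRM interval, keep snooping enabled only on the surviving
 * node: the first core whose snoop failed (TGM-First) or the last one
 * (TGM-Last). */
void tgm_on_global_read_miss(tgm_state_t *s, int first_failed, int last_failed) {
    int survivor = (s->policy == TGM_FIRST) ? first_failed : last_failed;
    for (int i = 0; i < NUM_CORES; i++)
        s->snoop_enabled[i] = (i == survivor);
}

/* Assumption: when the surviving node's snoop eventually succeeds, the GRM
 * interval is over and snooping is re-enabled on every core. */
void tgm_on_survivor_hit(tgm_state_t *s) {
    for (int i = 0; i < NUM_CORES; i++)
        s->snoop_enabled[i] = true;
}
```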

TGM Implementation
[Figure: the TGM-enhanced CMP.]

TGM
[Figure: (a) coverage; (b) accuracy.]

Time-based Local Miss Predictor (TLM)
Goal: detect LRMs.
How?
1. Count consecutive snoop misses in a node.
2. Disable snooping when the count exceeds a threshold.
3. Restart snooping after a number of cycles.
(A sketch of this counter logic follows the implementation slide below.)

TLM Implementation
[Figure: the TLM-enhanced CMP. Each processor contains a Processing Unit (PU), a first-level cache, and a predictor with a Redundant SNoop (RSN) counter and a ReStarT (RST) counter.]
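The counter logic can be sketched as follows; this is a minimal illustration built from the three steps above and the RSN/RST counters named in the diagram, with the threshold, restart interval, and reset policy left as assumed parameters.

```c
#include <stdbool.h>

/* Per-core TLM predictor state.  The threshold and restart interval are
 * assumed to be configurable; their values are not given in the slides. */
typedef struct {
    unsigned rsn;            /* Redundant SNoop (RSN) counter                */
    unsigned rst;            /* ReStarT (RST) counter, counts down in cycles */
    bool     snoop_enabled;
    unsigned rsn_threshold;  /* consecutive misses before disabling snoops   */
    unsigned rst_interval;   /* cycles to wait before restarting snooping    */
} tlm_state_t;

/* Called after each snoop issued by this core. */
void tlm_on_snoop_result(tlm_state_t *t, bool redundant) {
    if (!redundant) {
        t->rsn = 0;                        /* a useful snoop resets the run */
        return;
    }
    if (++t->rsn > t->rsn_threshold) {     /* long run of redundant snoops  */
        t->snoop_enabled = false;          /* stop snooping for a while     */
        t->rst = t->rst_interval;
    }
}

/* Called every cycle while this core's snooping is disabled. */
void tlm_each_cycle(tlm_state_t *t) {
    if (!t->snoop_enabled && t->rst > 0 && --t->rst == 0) {
        t->snoop_enabled = true;           /* restart snooping */
        t->rsn = 0;
    }
}
```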

TLM Features
[Figure: (a) coverage; (b) accuracy.]

Methodology
Simulator: SESC. Benchmarks: SPLASH-2. Energy evaluation: CACTI 6.5. System: quad-core CMP.

SPLASH-2 benchmarks and input parameters:
- Barnes: 16K particles
- Cholesky: tk29.O
- FFT: 1024K complex data points
- Ocean: 258x258 ocean
- Volrend: head
- Water-Nsquared: 512 molecules
- Water-Spatial: 512 molecules

System parameters:
- Processor: frequency: 5 GHz; technology: 68 nm; branch predictor: 16K-entry bimodal and gshare; fetch/issue/commit: 4/4/5; branch penalty: 17 cycles; RAS: 32 entries; BTB: 2K entries, 2-way.
- Interconnection network: crossbar data interconnect; interconnect width: 64 B.
- Memory: IL1: 64 KB, 2-way; DL1: 64 KB, 4-way, write-through, 1-cycle access, block size 64, cache line size 32; L2: 512 KB, 8-way, write-through, 11-cycle access, block size 64; main memory: 1 GB, 70-cycle access, page size 4 Kbit.

Relative Snoop Traffic Reduction: TGM-F: 58%; TGM-L: 57%; TLM: 77%.

Relative Memory Energy: TGM-F: 8%; TGM-L: 8.5%; TLM: 11%.

Relative Memory Delay: TGM-F: 1.1%; TGM-L: 2.1%; TLM: 1.7%.

Relative Performance: TGM-F: no change; TGM-L: 0.4%; TLM: 0.3%.

Summary
We showed:
- Long data scarcity periods (DSPs) exist during workload runtime.
- During DSPs, redundant snoops happen frequently and consecutively.
Our solutions:
- TGM: uses snoop behavior on all processors to detect and filter redundant snoops; shuts down snooping on as many processors as possible.
- TLM: filters redundant snoops within a single node; counts recent redundant snoops to detect data scarcity periods and filter upcoming redundant snoops.
Simulation results:
- Snoop reduction: TGM-F: 58%; TGM-L: 57%; TLM: 77%
- Memory energy: TGM-F: 8%; TGM-L: 8.5%; TLM: 11%
- Memory delay: TGM-F: 1.1%; TGM-L: 2.1%; TLM: 1.7%
- Performance: TGM-F: no change; TGM-L: 0.4%; TLM: 0.3%

Thanks for your attention.

Backup Slides

Discussion
How do benchmark characteristics affect the memory energy/delay savings achieved by our solution?
1. True detection of redundant snoops
2. Share of redundant snoops

Memory Energy and Delay
Memory energy = the energy consumed to provide the requested data.
Memory delay = the time required to provide the requested data.

Volrend Benchmark
Volrend rarely sends snoop requests while running.
This application renders a three-dimensional volume. It renders several frames from changing viewpoints; consecutive frames in rotation sequences often vary only slightly in viewpoint, which gives high temporal locality. Volrend also distributes load very well, which gives high spatial locality.