Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter.

Presentation transcript:

Counting Stream Registers: An Efficient and Effective Snoop Filter Architecture Aanjhan Ranganathan (ETH Zurich), Ali Galip Bayrak (EPFL), Theo Kluter (BFH), Philip Brisk (UC Riverside), Edoardo Charbon (TU Delft), Paolo Ienne (EPFL)

Multicore Embedded Systems
- A growing number of embedded systems are multiprocessor based.
- They require low energy with little compromise on performance.
- The memory subsystem (caches, shared bus, main memory) accounts for a significant share of energy consumption.

Symmetric Multiprocessor System
[Figure: CPU 1 to CPU n, each with its own instruction cache (I$) and data cache (D$), attached to a shared memory]

Cache Coherency Problem
[Figure: CPU 1 to CPU n, each with its own instruction cache (I$) and data cache (D$), attached to a shared memory]

Snoopy Hardware Coherence Protocols
[Figure: CPU 1 to CPU n, each with its own instruction cache (I$) and data cache (D$), attached to a shared memory]
- Snoop misses consume excessive energy.

Snoop Filters
[Figure: CPU 1 to CPU n, each with its own instruction cache (I$) and data cache (D$), attached to a shared memory, with a snoop filter (SF) added]
- A snoop filter lookup costs less energy than a cache lookup.

Snoop Filters in Prior Art
- Include, Exclude and Hybrid JETTY
  - Expensive for an embedded system in terms of area.
  - Energy consumed by the JETTY itself is significant.
- Stream Registers
  - Present in IBM's BlueGene supercomputer.
  - Inclusive filter.
  - Use a base and mask register pair to track the cache lines.

Stream Registers
[Figure: stream register with Base, Mask and Valid fields]
- Addresses matching 10XX result in a snoop filter hit.
- There is no general mechanism to remove an address from an SR without compromising correctness.
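The base and mask behaviour described on this slide can be illustrated with a minimal Python sketch. It assumes the usual stream register update rule, in which every bit where a new address differs from the base becomes a "don't care" bit in the mask; the class and method names are hypothetical and are not taken from the presentation.

```python
class StreamRegister:
    """Sketch of a base/mask pair that conservatively tracks cache-line addresses."""

    def __init__(self, addr_bits=32):
        self.full_mask = (1 << addr_bits) - 1
        self.valid = False
        self.base = 0
        self.mask = self.full_mask  # 1 = bit must equal base, 0 = don't care

    def insert(self, line_addr):
        if not self.valid:
            # First address: remember it exactly.
            self.base, self.mask, self.valid = line_addr, self.full_mask, True
        else:
            # Assumed update rule: bits that differ from base become don't-care.
            self.mask &= ~(self.base ^ line_addr) & self.full_mask

    def may_contain(self, line_addr):
        """True means the snoop must be forwarded to the cache (filter hit)."""
        if not self.valid:
            return False
        return (line_addr & self.mask) == (self.base & self.mask)


sr = StreamRegister(addr_bits=4)
sr.insert(0b1000)
sr.insert(0b1011)
hits = [a for a in range(16) if sr.may_contain(a)]
print(hits)  # [8, 9, 10, 11], i.e. every address of the form 10XX
```

The mask can only lose bits in this sketch, so the set of addresses reported as hits never shrinks, which is the limitation the slide points out.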

Drawbacks of Stream Register based Snoop Filters
- No efficient way to update the registers when a line is removed from the cache
  - Filtering performance degrades over time
  - Additional logic units have been introduced, but they are not efficient (e.g., cache wrap detection)

Our Contribution
- Counting Stream Registers
  - Eliminates cache wrap detection logic
  - Counter to track cache lines
  - More robust to workload variability
  - Better or similar energy savings compared to SRs

Counting Stream Registers
[Figure: counting stream registers with Base, Mask and Counter fields]
- Invalidated cache lines can be tracked by decrementing the counter.
- Removes the need for extra logic such as cache wrap detection, active register history, etc.
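The counter idea can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' hardware design: it assumes the counter is incremented on every line fill, decremented on every eviction or invalidation, and that a register whose counter reaches zero is treated as empty and can be re-initialised by the next fill. All names are hypothetical.

```python
class CountingStreamRegister:
    """Sketch of a base/mask pair plus a counter of lines still cached."""

    def __init__(self, addr_bits=32):
        self.full_mask = (1 << addr_bits) - 1
        self.base = 0
        self.mask = self.full_mask
        self.count = 0  # tracked lines currently present in the cache

    def line_filled(self, line_addr):
        if self.count == 0:
            # Empty register: start tracking from scratch (assumed reset rule).
            self.base, self.mask = line_addr, self.full_mask
        else:
            self.mask &= ~(self.base ^ line_addr) & self.full_mask
        self.count += 1

    def line_removed(self):
        # Caller is assumed to report only lines that were tracked here.
        if self.count > 0:
            self.count -= 1

    def may_contain(self, line_addr):
        if self.count == 0:
            return False  # nothing tracked, so the snoop can be filtered
        return (line_addr & self.mask) == (self.base & self.mask)


csr = CountingStreamRegister(addr_bits=4)
csr.line_filled(0b1000)
csr.line_filled(0b1011)
before = csr.may_contain(0b1001)   # 10XX still matches
csr.line_removed()
csr.line_removed()
after = csr.may_contain(0b1001)    # counter is zero, register is empty again
```

In this sketch the counter replaces the valid bit: once all tracked lines have left the cache, the widened mask is discarded without any separate cache wrap detection.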

Snoop Filter Architecture
[Figure: snoop filter architecture, with the following labels]
- Index to the direct-mapped snoop filter table
- Set of cache lines grouped into a page
- Used for comparison with the base register
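One way to read these labels is as three fields of a snooped address. The Python sketch below shows such a split; the field order and the widths (`page_bits`, `index_bits`) are assumptions chosen for illustration, since the transcript does not preserve the figure.

```python
def split_address(addr, page_bits=10, index_bits=3):
    """Split an address into (compare_bits, table_index, page_offset).

    Assumed layout, from most to least significant:
      compare_bits - compared with the base register of the selected entry
      table_index  - selects one entry of the direct-mapped snoop filter table
      page_offset  - position within a page, i.e. a group of cache lines
    """
    page_offset = addr & ((1 << page_bits) - 1)
    table_index = (addr >> page_bits) & ((1 << index_bits) - 1)
    compare_bits = addr >> (page_bits + index_bits)
    return compare_bits, table_index, page_offset


compare_bits, table_index, page_offset = split_address(0x12345678)
```

With a direct-mapped table, each address selects exactly one entry, so a lookup needs a single base/mask comparison rather than a search over all registers.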

Experimental Analysis
- Virtex 2 FPGA running OpenRISC soft cores
  - Configurable number of processors, associativity and size of the data and instruction caches, cache type and coherence protocol
- EEMBC MultiBench benchmarks
- CACTI 5.3 energy model
  - Total memory subsystem energy accounts for main memory read/write energy, data and instruction cache read/write energy, leakage and snoop energy

Cache Design Space Exploration

Results: Filtering Percentage
- CSR achieves a higher filtering percentage for a smaller number of registers.

Analysis: RGB2CMYK Benchmark

Discussion: Energy Consumption
- For most benchmarks, snoop energy was around 8-10% of the total memory subsystem energy without snoop filters.
- CSR filters are more effective for certain benchmarks (H.264, image rotation)
  - Better filtering performance with a smaller number of stream registers.
- Small reduction in overall energy
  - Platform limited to 32 MB of off-chip SDRAM
  - No complex data sharing and a limited number of multiple producers of the same data
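The small overall reduction follows from the snoop share itself, as a back-of-the-envelope Python calculation shows. The function below is an illustrative bound and not the CACTI-based model used in the evaluation; it optimistically treats all snoop energy as removable, and `filter_overhead_share` is a hypothetical parameter for the energy spent in the filter.

```python
def saving_upper_bound(snoop_share, filtered_fraction, filter_overhead_share=0.0):
    """Upper bound on the fraction of memory-subsystem energy a snoop filter saves.

    snoop_share           - snoop energy as a fraction of total energy (slide: 0.08 to 0.10)
    filtered_fraction     - fraction of snoop energy removed by the filter (assumed input)
    filter_overhead_share - energy spent in the filter itself, as a fraction of total
    """
    return snoop_share * filtered_fraction - filter_overhead_share


# Even a perfect, zero-cost filter cannot save more than the snoop share.
best_case = saving_upper_bound(snoop_share=0.10, filtered_fraction=1.0)
```

With the snoop share at 8-10%, the best case is bounded by that same 8-10%, and any realistic filtering rate and filter overhead push the saving lower.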

Summary
- Introduced a snoop filter architecture based on counting stream registers
  - Lower hardware complexity and the ability to track cache line invalidations
- Experimental evaluation shows a better filtering percentage than stream registers, with less performance variation across different workloads.