© 2005 Babak Falsafi Temporal Memory Streaming Babak Falsafi Team: Mike Ferdman, Brian Gold, Nikos Hardavellas, Jangwoo Kim, Stephen Somogyi, Tom Wenisch.


Slide 1: Temporal Memory Streaming
Babak Falsafi
Team: Mike Ferdman, Brian Gold, Nikos Hardavellas, Jangwoo Kim, Stephen Somogyi, Tom Wenisch
Collaborators: Anastassia Ailamaki & Andreas Moshovos
STEMS, Computer Architecture Lab, Carnegie Mellon

Slide 2: The Memory Wall
The logic/DRAM speed gap continues to increase!
[Figure: processor vs. DRAM performance over time, from the VAX/1980 era onward]

Slide 3: Current Approach
Cache hierarchies:
- Trade off capacity for speed
- Exploit "reuse"
But, in modern servers:
- Only 50% utilization of one processor [Ailamaki, VLDB'99]
- Much bigger problem in MPs
What is wrong? Demand fetch/replacement of data.
[Figure: four CPUs over L1 64K (1 clk), L2 2M (10 clk), L3 64M (100 clk), DRAM 100G (1000 clk)]

Slide 4: Prior Work (SW-Transparent)
- Prefetching [Joseph 97] [Roth 96] [Nesbit 04] [Gracia Pérez 04] → simple patterns or low accuracy
- Large execution windows / runahead [Mutlu 03] → fetch dependent addresses serially
- Coherence optimizations [Stenström 93] [Lai 00] [Huh 04] → limited applicability (e.g., migratory)
Need solutions for arbitrary access patterns.

Slide 5: Spatio-Temporal Memory Streaming
Our solution: Spatio-Temporal Memory Streaming (STEMS)
Observation: data are spatially/temporally correlated, in arbitrary yet repetitive patterns.
Approach: memory streaming
- Extract spatial/temporal patterns
- Stream data to/from the CPU: manage resources for multiple blocks, break dependence chains
- In HW, SW, or both
[Figure: blocks streamed from DRAM or other CPUs into the L1, with replacement]

Slide 6: Contribution #1: Temporal Shared-Memory Streaming
Recent coherence miss sequences recur:
- 50% of misses closely follow a previous sequence
- Large opportunity to exploit MLP
Temporal streaming engine:
- Ordered streams allow practical HW
- Performance improvement: 7%-230% in scientific apps, 6%-21% in commercial Web & OLTP apps

Slide 7: Contribution #2: Last-Touch Correlated Data Streaming
Last-touch prefetchers:
- Cache block dead time >> live time
- Fetch on a predicted "last touch"
- But designs are impractical (> 200MB on-chip)
Last-touch correlated data streaming:
- Miss order ~ last-touch order
- Stream table entries from off-chip
- Eliminates 75% of all L1 misses with ~200KB

Slide 8: Outline
- STEMS Overview
- Example Temporal Streaming:
  1. Temporal Shared-Memory Streaming
  2. Last-Touch Correlated Data Streaming
- Summary

Slide 9: Temporal Shared-Memory Streaming [ISCA'05]
- Record sequences of memory accesses
- Transfer data sequences ahead of requests
[Figure: baseline system answers Miss A with Fill A, Miss B with Fill B; streaming system answers Miss A with Fill A, B, C, ...]
- Accelerates arbitrary access patterns
- Parallelizes the critical path of pointer chasing

Slide 10: Relationship Between Misses
Intuition: miss sequences repeat, because code sequences repeat (observed for uniprocessors in [Chilimbi'02]).
Temporal address correlation: the same miss addresses repeat in the same order.
A correlated miss sequence is a stream.
Example miss sequence: Q W A B C D E R T A B C D E Y ...
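The correlation idea can be sketched in a few lines: treat the miss log as history, and when an address recurs, predict that the addresses that followed it last time will follow again. Everything here (the function name, the single-letter stand-in addresses, the fixed stream length) is illustrative, not the paper's hardware.

```python
def predict_stream(miss_log, addr, length=4):
    """Return the addresses that followed the most recent prior
    occurrence of `addr` in the miss log (the predicted stream)."""
    for i in range(len(miss_log) - 1, -1, -1):
        if miss_log[i] == addr:
            return miss_log[i + 1 : i + 1 + length]
    return []  # addr never seen before: no prediction

# After history Q W A B C D E R T, a new miss on A predicts B, C, D, E.
```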

Slide 11: Relationship Between Streams
Intuition: streams exhibit temporal locality, because the working set exhibits temporal locality. For shared data, repetition is often across nodes.
Temporal stream locality: recent streams are likely to recur.
Example: Node 1 misses Q W A B C D E R; Node 2 later misses T A B C D E Y.
Address correlation + stream locality = temporal correlation.

Slide 12: Memory Level Parallelism
Streams create MLP for dependent misses; this is not possible with larger windows / runahead.
Temporal streaming breaks dependence chains:
- Baseline: the CPU must wait to follow pointers (A, then B, then C)
- Temporal streaming: the CPU fetches A, B, C in parallel
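A back-of-envelope latency model makes the dependence-chain point concrete; the 100-cycle latency is made up for illustration and is not from the slides.

```python
MEM_LAT = 100  # illustrative miss latency in cycles (assumed, not from the deck)

def serial_cycles(n):
    """Pointer chasing: the address of miss i+1 is in the data of miss i,
    so the n misses serialize end to end."""
    return n * MEM_LAT

def streamed_cycles(n):
    """With a recorded stream, all n addresses are known up front, so the
    misses overlap; ignoring bandwidth, total time is about one latency."""
    return MEM_LAT
```

For a three-pointer chain this gives 300 vs. 100 cycles; the gap grows linearly with chain length, which is the MLP opportunity the slide describes.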

Slides 13-16: Temporal Streaming in Action
1. Record: Node 1 records its coherence miss sequence A, B, C, D.
2. Locate: Node 2 misses on A; the directory fills A and locates Node 1's recorded stream at A.
3. Stream: Node 1 streams the subsequent addresses B, C, D to Node 2.
4. Fetch: Node 2 fetches B, C, D ahead of use; its later accesses to B and C hit.

Slide 17: Temporal Streaming Engine: Record
Coherence Miss Order Buffer (CMOB):
- ~1.5MB circular buffer per node, held in local memory
- Addresses only
- Coalesced accesses
[Figure: each fill (..., E) appends its miss address to the CMOB in local memory]
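A toy model of the record step might look like the following. The class name and pointer scheme are simplifications I introduce for illustration: appends return an absolute position (the "pointer" the directory would keep), and streaming reads the entries that followed that position, skipping any that the circular buffer has since overwritten.

```python
class CMOB:
    """Toy coherence-miss order buffer: a bounded circular log of miss
    addresses (the real CMOB is a ~1.5MB buffer in local memory)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = [None] * capacity
        self.seq = 0  # absolute append count, never wraps

    def append(self, addr):
        """Log a miss; return the pointer the directory would record."""
        self.buf[self.seq % self.capacity] = addr
        ptr = self.seq
        self.seq += 1
        return ptr

    def stream_from(self, ptr, length=3):
        """Entries logged after `ptr`, if not yet overwritten."""
        out = []
        for s in range(ptr + 1, min(self.seq, ptr + 1 + length)):
            if self.seq - s <= self.capacity:  # still resident in the ring
                out.append(self.buf[s % self.capacity])
        return out
```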

Slide 18: Temporal Streaming Engine: Locate
Annotate the directory, which already has coherence info for every block:
- CMOB append → send pointer to directory
- Coherence miss → forward stream request
[Figure: directory entries record sharing state plus a CMOB pointer, e.g. A: shared, CMOB[23]; B: modified, CMOB[401]]

Slide 19: Temporal Streaming Engine: Stream
Fetch data to match the use rate:
- Addresses wait in a FIFO stream queue
- Data is fetched into a streamed value buffer (~32 entries)
[Figure: stream {A, B, C, ...} from node i enters the stream queue; fetched data for A waits in the streamed value buffer beside the L1]
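The consumer side can be sketched as a FIFO of predicted addresses feeding a small buffer of fetched data: each hit in the buffer frees a slot and triggers the next fetch, so the fetch rate tracks the CPU's use rate. Sizes, names, and the `data(...)` stand-in for a real memory fetch are all illustrative.

```python
from collections import deque

class StreamBuffer:
    """Toy consumer side of the TSE: a FIFO stream queue of predicted
    addresses plus a small streamed value buffer (SVB) of fetched data."""

    def __init__(self, svb_size=4):
        self.queue = deque()   # predicted addresses, in stream order
        self.svb = {}          # addr -> data; ~32 entries in the real design
        self.svb_size = svb_size

    def push_stream(self, addrs):
        self.queue.extend(addrs)
        self._refill()

    def _refill(self):
        # Fetch from the head of the queue while SVB slots are free.
        while self.queue and len(self.svb) < self.svb_size:
            a = self.queue.popleft()
            self.svb[a] = f"data({a})"  # stand-in for a memory fetch

    def lookup(self, addr):
        """On a CPU miss: an SVB hit consumes the entry and triggers the
        next fetch, matching fetch rate to use rate."""
        data = self.svb.pop(addr, None)
        self._refill()
        return data
```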

Slide 20: Practical HW Mechanisms
Streams recorded/followed in order:
- FIFO stream queues
- ~32-entry streamed value buffer
- Coalesced cache-block-size CMOB appends
Predicts many misses from one request:
- More lookahead
- Allows off-chip stream storage
- Leverages existing directory lookup

Slide 21: Methodology: Infrastructure
SimFlex [SIGMETRICS'04]:
- Statistical sampling → uArch simulation in minutes
- Full-system MP simulation (boots Linux & Solaris)
- Uni, CMP, and DSM timing models
- Real server software (e.g., DB2 & Oracle)
- Software publicly available for download

Slide 22: Methodology: Benchmarks & Parameters
Model parameters:
- 16 4GHz SPARC CPUs
- 8-wide OoO; 8-stage pipeline
- 256-entry ROB/LSQ
- 64K L1, 8MB L2
- TSO with speculation
Benchmark applications:
- Scientific: em3d, moldyn, ocean
- OLTP (TPC-C WH): IBM DB2 7.2, Oracle 10g
- SPECweb99 with 16K connections: Apache 2.0, Zeus 4.3

Slide 23: TSE Coverage Comparison
[Figure: coverage chart] TSE outperforms stride and GHB prefetchers for coherence misses.

Slide 24: Stream Lengths
- Commercial: short streams; low base MLP
- Scientific: long streams; high base MLP
Temporal streaming addresses both cases.

Slide 25: TSE Performance Impact
[Figure: time breakdown and speedup with 95% CI for em3d, moldyn, ocean, Apache, DB2, Oracle, Zeus]
- TSE eliminates 25%-95% of coherent read stalls
- 6% to 230% performance improvement

Slide 26: TSE Conclusions
Temporal streaming:
- Intuition: recent coherence miss sequences recur
- Impact: eliminates a large fraction of coherence misses
Temporal Streaming Engine:
- Intuition: in-order streams enable practical HW
- Impact: 7%-230% performance improvement in scientific apps, 6%-21% in commercial Web & OLTP apps

Slide 27: Outline
- Big Picture
- Example Streaming Techniques:
  1. Temporal Shared-Memory Streaming
  2. Last-Touch Correlated Data Streaming
- Summary

Slide 28: Enhancing Lookahead
Observation [Mendelson, Wood & Hill]: few live sets
- A block is used until its last "hit"
- Data reuse → high hit rate
- ~80% of frames are dead!
Exploit for lookahead:
- Predict the last "touch" prior to "death"
- Evict, predict, and fetch the next line
[Figure: live vs. dead sets at times T1 and T2]
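The live/dead distinction is easy to measure on a trace: a frame is live from a block's fill to its last touch, and dead from that last touch until eviction. A minimal sketch, with an event format I invent for illustration:

```python
def dead_fraction(events):
    """events: list of (time, block, kind), kind in {"touch", "evict"}.
    A frame is live from first touch to last touch, then dead until
    the block is evicted. Returns the dead fraction of frame time."""
    fill_time, last_touch = {}, {}
    live = dead = 0
    for t, b, kind in events:
        if kind == "touch":
            fill_time.setdefault(b, t)  # first touch fills the frame
            last_touch[b] = t
        else:  # evict
            live += last_touch[b] - fill_time[b]
            dead += t - last_touch[b]
            del fill_time[b], last_touch[b]
    return dead / (live + dead)
```

On the slide's numbers, around 80% of frame time is dead time, which is exactly the lookahead window a last-touch predictor can exploit.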

Slide 29: How Much Lookahead?
[Figure: distribution of frame dead times in cycles, compared against L2 and DRAM latencies]
Predicting last touches would eliminate all of this latency!

Slide 30: Dead-Block Prediction [ISCA'00 & '01]
Per-block trace of memory accesses to a block; predicts repetitive last-touch events.
Accesses to a block frame:
- PC0: load/store A0 (hit)
- PC1: load/store A1 (miss)  <- first touch
- PC3: load/store A1 (hit)
- PC3: load/store A1 (hit)   <- last touch
- PC5: load/store A3 (miss)
Trace = A1 → (PC1, PC3, PC3)

Slide 31: Dead-Block Correlating Prefetcher (DBCP)
History & correlation tables:
- History table (HT) ~ L1 tag array; tracks the latest per-block trace (e.g., A1: PC1, PC3, PC3)
- Correlation table ~ memory footprint; maps (A1, PC1, PC3, PC3) → evict A1, fetch A3
- Encoding: truncated addition
- Two-bit saturating counter per entry
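A toy version of the two tables can be sketched as follows. The per-block trace is folded into a signature by truncated addition of PCs; when a (block, signature) pair matches a previously learned last touch with a confident 2-bit counter, the predictor returns the recorded successor block to prefetch. Table sizes, the 16-bit signature width, and the learn-on-evict policy are simplifications I assume for illustration.

```python
MASK = (1 << 16) - 1  # illustrative truncated-addition signature width

class DBCP:
    """Toy dead-block correlating prefetcher: history table of running
    per-block signatures, correlation table of learned last touches."""

    def __init__(self):
        self.history = {}  # block -> running signature (truncated PC sum)
        self.table = {}    # (block, sig) -> [next_block, 2-bit counter]

    def access(self, block, pc):
        """On each access, fold the PC into the block's signature; if it
        matches a confident last-touch entry, return the block to prefetch."""
        sig = self.history[block] = (self.history.get(block, 0) + pc) & MASK
        entry = self.table.get((block, sig))
        if entry is not None and entry[1] >= 2:  # confident match
            return entry[0]
        return None

    def evict(self, block, next_block):
        """On eviction, learn: the signature at eviction was the last
        touch, and `next_block` is what the frame held next."""
        sig = self.history.pop(block, 0)
        entry = self.table.setdefault((block, sig), [next_block, 0])
        entry[0] = next_block
        entry[1] = min(entry[1] + 1, 3)  # saturate at 3
```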

Slide 32: DBCP Coverage with Unlimited Table Storage
[Figure: coverage chart]
- High average L1 miss coverage
- Low misprediction rate (2-bit counters)

Slide 33: Impractical On-Chip Storage Size
[Figure: table storage requirements] Needs over 150MB to achieve full potential!

Slide 34: Our Observation: Signatures are Temporally Correlated
Signatures need not reside on chip:
1. Last-touch sequences recur, much as cache miss sequences recur [Chilimbi'02], often due to large structure traversals.
2. Last-touch order ~ cache miss order, off by at most the L1 cache capacity.
Key implications:
- Can record last touches in miss order
- Store & stream signatures from off-chip

Slide 35: Last-Touch Correlated Data Streaming (LT-CORDS)
Streaming signatures on chip:
- Keep all signatures, in sequences, in off-chip DRAM
- Retain sequence "heads" on chip; a "head" signals a stream fetch
Small (~200KB) on-chip stream cache:
- Tolerates order mismatch
- Provides lookahead for stream startup
DBCP coverage with moderate on-chip storage!
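The head/stream split can be sketched as follows: full signature sequences stay off-chip, only each sequence's head is resident, and matching a head streams the rest of that sequence into a small on-chip stream cache where later last-touch lookups can hit. The class, the dict-of-sequences storage, and integer signatures are stand-ins I assume for illustration.

```python
class LTCords:
    """Toy LT-CORDS front end: sequence heads on chip, full last-touch
    signature sequences off-chip, plus a small on-chip stream cache."""

    def __init__(self, sequences):
        self.offchip = sequences        # head signature -> rest of sequence
        self.heads = set(sequences)     # resident on chip (~10KB in the deck)
        self.stream_cache = set()       # streamed-in signatures (~200KB)

    def observe(self, sig):
        """Return True if `sig` is predicted to be a last touch, i.e. it
        matches a head or a signature already streamed on chip."""
        if sig in self.heads:
            # Head match: stream the remainder of the sequence on chip.
            self.stream_cache.update(self.offchip[sig])
            return True
        return sig in self.stream_cache
```

Usage: before the head of a recorded sequence is seen, its later signatures miss; once the head fires, the rest of the sequence predicts correctly.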

Slide 36: DBCP Mechanisms
[Figure: core, L1, L2, DRAM, history table; ~160MB of signatures]
All signatures in a random-access on-chip table.

© 2005 Babak Falsafi 37 Only a subset needed at a time“Head” as cue for the “stream” Signatures stored off-chip What LT-CORDS Does CoreL1 L2 DRAM HT … and only in order

Slide 38: LT-CORDS Mechanisms
[Figure: core, L1, L2, DRAM; history table, ~10K of heads, and a ~200K stream cache (SC) on chip]
On-chip storage is independent of application footprint.

Slide 39: Methodology
- SimpleScalar CPU model with Alpha ISA; SPEC CPU2000 & Olden benchmarks
- 8-wide out-of-order processor
- 2-cycle L1, 16-cycle L2, 180-cycle DRAM
- FU latencies similar to Alpha EV8
- 64KB 2-way L1D, 1MB 8-way L2
- LT-CORDS with 214KB on-chip storage
- Applications with significant memory stalls

Slide 40: LT-CORDS vs. DBCP Coverage
[Figure: coverage chart] LT-CORDS reaches the coverage of infinite-storage DBCP.

Slide 41: LT-CORDS Speedup
[Figure: speedup chart] LT-CORDS hides a large fraction of memory latency.

Slide 42: LT-CORDS Conclusions
Intuition: signatures are temporally correlated
- Cache miss & last-touch sequences recur
- Miss order ~ last-touch order
Impact: eliminates 75% of all misses
- Retains DBCP coverage, lookahead, and accuracy
- On-chip storage independent of footprint
- 2x speedup over the best prior work

Slide 43: For More Information
Visit our website: