Cooperative Cache Scrubbing Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^ PACT 2014 * ^

Slides:



Advertisements
Similar presentations
Numbers Treasure Hunt Following each question, click on the answer. If correct, the next page will load with a graphic first – these can be used to check.
Advertisements

1
Feichter_DPG-SYKL03_Bild-01. Feichter_DPG-SYKL03_Bild-02.
1 Vorlesung Informatik 2 Algorithmen und Datenstrukturen (Parallel Algorithms) Robin Pomplun.
© 2008 Pearson Addison Wesley. All rights reserved Chapter Seven Costs.
Copyright © 2003 Pearson Education, Inc. Slide 1 Computer Systems Organization & Architecture Chapters 8-12 John D. Carpinelli.
Chapter 1 The Study of Body Function Image PowerPoint
Copyright © 2011, Elsevier Inc. All rights reserved. Chapter 6 Author: Julia Richards and R. Scott Hawley.
Author: Julia Richards and R. Scott Hawley
1 Copyright © 2013 Elsevier Inc. All rights reserved. Appendix 01.
1 Copyright © 2013 Elsevier Inc. All rights reserved. Chapter 3 CPUs.
Properties Use, share, or modify this drill on mathematic properties. There is too much material for a single class, so you’ll have to select for your.
ALGEBRA Number Walls
Objectives: Generate and describe sequences. Vocabulary:
UNITED NATIONS Shipment Details Report – January 2006.
Business Transaction Management Software for Application Coordination 1 Business Processes and Coordination. Introduction to the Business.
1 RA I Sub-Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Casablanca, Morocco, 20 – 22 December 2005 Status of observing programmes in RA I.
Properties of Real Numbers CommutativeAssociativeDistributive Identity + × Inverse + ×
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt BlendsDigraphsShort.
FACTORING ax2 + bx + c Think “unfoil” Work down, Show all steps.
Year 6 mental test 10 second questions
PUBLIC KEY CRYPTOSYSTEMS Symmetric Cryptosystems 6/05/2014 | pag. 2.
1 Click here to End Presentation Software: Installation and Updates Internet Download CD release NACIS Updates.
Photo Slideshow Instructions (delete before presenting or this page will show when slideshow loops) 1.Set PowerPoint to work in Outline. View/Normal click.
REVIEW: Arthropod ID. 1. Name the subphylum. 2. Name the subphylum. 3. Name the order.
Break Time Remaining 10:00.
Advance Nano Device Lab. Fundamentals of Modern VLSI Devices 2 nd Edition Yuan Taur and Tak H.Ning 0 Ch9. Memory Devices.
PP Test Review Sections 6-1 to 6-6
EU market situation for eggs and poultry Management Committee 20 October 2011.
Bright Futures Guidelines Priorities and Screening Tables
Bellwork Do the following problem on a ½ sheet of paper and turn in.
CS 6143 COMPUTER ARCHITECTURE II SPRING 2014 ACM Principles and Practice of Parallel Programming, PPoPP, 2006 Panel Presentations Parallel Processing is.
2 |SharePoint Saturday New York City
Exarte Bezoek aan de Mediacampus Bachelor in de grafische en digitale media April 2014.
VOORBLAD.
15. Oktober Oktober Oktober 2012.
Copyright © 2012, Elsevier Inc. All rights Reserved. 1 Chapter 7 Modeling Structure with Blocks.
1 RA III - Regional Training Seminar on CLIMAT&CLIMAT TEMP Reporting Buenos Aires, Argentina, 25 – 27 October 2006 Status of observing programmes in RA.
Factor P 16 8(8-5ab) 4(d² + 4) 3rs(2r – s) 15cd(1 + 2cd) 8(4a² + 3b²)
Basel-ICU-Journal Challenge18/20/ Basel-ICU-Journal Challenge8/20/2014.
1..
CONTROL VISION Set-up. Step 1 Step 2 Step 3 Step 5 Step 4.
© 2012 National Heart Foundation of Australia. Slide 2.
Adding Up In Chunks.
Understanding Generalist Practice, 5e, Kirst-Ashman/Hull
1 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt 10 pt 15 pt 20 pt 25 pt 5 pt Synthetic.
Model and Relationships 6 M 1 M M M M M M M M M M M M M M M M
25 seconds left…...
Subtraction: Adding UP
: 3 00.
5 minutes.
1 hi at no doifpi me be go we of at be do go hi if me no of pi we Inorder Traversal Inorder traversal. n Visit the left subtree. n Visit the node. n Visit.
Analyzing Genes and Genomes
©Brooks/Cole, 2001 Chapter 12 Derived Types-- Enumerated, Structure and Union.
Essential Cell Biology
Converting a Fraction to %
Clock will move after 1 minute
Intracellular Compartments and Transport
PSSA Preparation.
Essential Cell Biology
Immunobiology: The Immune System in Health & Disease Sixth Edition
Physics for Scientists & Engineers, 3rd Edition
Energy Generation in Mitochondria and Chlorplasts
Select a time to count down from the clock above
Murach’s OS/390 and z/OS JCLChapter 16, Slide 1 © 2002, Mike Murach & Associates, Inc.
Exploring Multi-Threaded Java Application Performance on Multicore Hardware Ghent University, Belgium OOPSLA 2012 presentation – October 24 th 2012 Jennifer.
Presentation transcript:

Cooperative Cache Scrubbing Jennifer B. Sartor, Wim Heirman, Steve Blackburn*, Lieven Eeckhout, Kathryn S. McKinley^ PACT 2014 * ^

Multicore Challenge Chip memory (DRAM) p. 2 P $ P $ P $ P $ Managed language runtime environment Application Operating System Objects rapidly allocated and short-lived LLC

Problem: Allocation Wall Chip memory (DRAM) p. 3 P $ P $ P $ P $ Managed language runtime environment Application Operating System DEAD Objects rapidly allocated and short-lived LLC

Problem: Bandwidth & Power Wall Chip memory (DRAM) p. 4 P $ P $ P $ P $ Managed language runtime environment Application Operating System DEAD Objects rapidly allocated and short-lived Zero initialization LLC

Cooperative Cache Scrubbing Chip LLC memory (DRAM) p. 5 P $ P $ P $ P $ Managed language runtime environment Application Operating System Objects rapidly allocated and short-lived Zero initialization DEAD write read LLC

Generational Garbage Collection Young objects die quickly Nursery Traced for live objects Copy to mature space Reclaimed ‘en masse’ NurseryMature LLC 8MB p. 6 DEAD

Dead Lines in LLC (8MB) p. 7

Dead Data Written Back? Chip LLC memory (DRAM) p. 8 P $ P $ P $ P $ Managed language runtime environment Application Operating System DEAD

Useless Write Backs (8MB LLC) p. 9

Cooperative Cache Scrubbing Communicate managed language’s semantic information to hardware Caches ‘Scrub’ dead lines  Invalidate  Unset dirty bit Zero lines without fetch  Result  Better cache management  Avoid traffic to DRAM  Save DRAM energy p. 10 writes reads

Dead Data Written in Cache? Young objects die quickly Nursery Traced for live objects Copy to mature space Reclaimed ‘en masse’ NurseryMature LLC DEAD p

Dead Lines Written in LLC (8MB) p. 12

SW-HW Cooperative Scrubbing Software Identify cache line-aligned dead/zero region Generational Immix collector (stop-the-world)  After nursery collection, call scrub instruction on each line in entire range  Call zero instructions to zero region (32KB) Hardware p. 13

SW-HW Cooperative Scrubbing Software Hardware Scrubbing (LLC)  clinvalidate: invalidates cache line  clundirty: clears dirty bit  clclean: clears dirty bit, moves line to LRU Zeroing (L2)  clzero: zero cache line without fetch Modifications to MESI cache coherence protocol  Back-propagation from LLC to L1/L2 cache levels  Local coherence transitions (no off-chip) p. 14 PowerPC’s dcbi, ARM PowerPC’s dcbz

MESI Coherence Transitions p. 15 M E I S clclean/- clinvalidate/- clclean/- clinvalidate/- clclean/-

MESI Coherence Transitions p. 16 M E I S clzero/- clzero/BusInvalidate BusInvalidate external: from another LLC

Methodology Sniper simulator 4 cores, 8MB shared L3 (LLC), McPAT Extensions for JVM  Works with JIT compiler  Emulate system calls (futex & nanosleep)  JVM-simulator communication with new instruction Jikes RVM and DaCapo benchmarks Generational Immix garbage collector 4 application, 4 GC threads 2x minimum heap Replay compilation, 2 nd invocation p. 17

DRAM Writes (8MB nursery) p. 18

DRAM Writes (8MB nursery) p. 19

DRAM Writes (8MB nursery) p. 20

DRAM Reads (8MB nursery) p. 21

DRAM Reads (8MB nursery) p. 22

DRAM Reads (8MB nursery) p. 23

DRAM Reads (8MB nursery) p. 24

DRAM Reads (8MB nursery) p. 25

Dynamic DRAM Energy (8MB nursery) p. 26

Dynamic DRAM Energy (8MB nursery) p. 27

Total DRAM Energy p %

Total DRAM Energy p %

Total DRAM Traffic p x

clclean+clzero Improvements p. 31

Related Work Cooperative cache management ESKIMO by Isen & John, Micro 09  Useless reads and writes to DRAM by sequential C programs  Reduce energy  Require large map in hardware, extra cache bits Wang et al., PACT 02/ ISCA 03; Sartor et al., 05  C & Fortran static analysis to give cache hints to evict or keep data Zero initialization [Yang et al., OOPSLA 11] Studied costs in time, cache and traffic Use non-temporal writes to DRAM, increase bandwidth p. 32

Conclusions Software-hardware cooperative cache scrubbing Leverages region allocation semantics Changes to MESI coherence protocol New multicore architectural simulation methodology Reductions  59% traffic  14% DRAM energy  4.6% execution time p DEAD

p. 34

Execution Time (8MB nursery) p. 35

Changes to MESI coherence protocol Stateclinvalidateclundirty/clcl ean clzeroBusInvalidate Minvalidate L1/L2 (no WB)  I invalidate L1/L2 (no WB)  E (clclean LRU) ⁄invalidate L1/L2 (no WB)  I Einvalidate L1/L2  I invalidate L1/L2 (clclean LRU)  Minvalidate L1/L2  I Sinvalidate L1/L2  I invalidate L1/L2 (clclean LRU) BusInvalidate  M invalidate L1/L2  I I⁄⁄BusInvalidate  M ⁄ p. 36

Total DRAM Energy (8MB nursery) p. 37

Execution Time Across Nurseries p. 38

Execution Time p. 39

Dynamic DRAM Energy 8MB Nursery p. 40