Flexible Reference-Counting-Based Hardware Acceleration for Garbage Collection
José A. Joao*  Onur Mutlu‡  Yale N. Patt*
*HPS Research Group, The University of Texas at Austin
‡Computer Architecture Laboratory, Carnegie Mellon University
Motivation: Garbage Collection
- Garbage collection (GC) is a key feature of managed languages
  - Automatically frees memory blocks that are no longer used
  - Eliminates memory-management bugs and improves security
- GC identifies dead (unreachable) objects and makes their blocks available to the memory allocator
- GC has significant overheads:
  - Processor cycles
  - Cache pollution
  - Pauses/delays in the application

Software Garbage Collectors
- Tracing collectors
  - Recursively follow every pointer, starting with global, stack, and register variables, scanning each object for pointers
  - Explicit collections that visit all live objects
- Reference counting
  - Tracks the number of references to each object
  - Immediate reclamation
  - Expensive, and cannot collect cyclic data structures
- State of the art: generational collectors
  - Young objects are more likely to die than old objects
  - Generations: nursery (new) and mature (older) regions
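The reference-counting scheme above can be sketched in a few lines. This is a hypothetical illustration (the class and function names are ours, not the paper's): counts are adjusted on every pointer store, an object is reclaimed the moment its count reaches zero, and a cycle shows why a tracing backstop is still needed.

```python
class Obj:
    """A heap object with a reference count and named pointer fields."""
    def __init__(self, heap):
        self.rc = 0
        self.fields = {}
        self.heap = heap
        heap.add(self)

def store(obj, field, target):
    """Overwrite a pointer field, adjusting counts of old and new targets."""
    old = obj.fields.get(field)
    obj.fields[field] = target
    if target is not None:
        target.rc += 1          # new target gains a reference
    if old is not None:
        dec(old)                # old target loses one

def dec(obj):
    obj.rc -= 1
    if obj.rc == 0:             # immediate reclamation
        for child in obj.fields.values():
            if child is not None:
                dec(child)      # transitively release children
        obj.heap.discard(obj)

heap = set()
a, b = Obj(heap), Obj(heap)
a.rc += 1                       # a root (stack/register) reference to a
store(a, "next", b)             # b.rc becomes 1
dec(a)                          # root dropped: a dies, and transitively b
assert a not in heap and b not in heap

# The cycle limitation: mutually referencing objects never reach rc == 0,
# so pure reference counting leaks them; a tracing collector (in the
# paper's setting, the software GC) must reclaim them.
c, d = Obj(heap), Obj(heap)
c.rc += 1                       # root reference to c
store(c, "next", d)
store(d, "next", c)
dec(c)                          # root dropped, but c.rc stays above 0 via d
assert c in heap and d in heap  # leaked by reference counting alone
```

This also motivates HAMM's design point: hardware reference counting can be simple and incomplete, because software GC still runs to collect what it misses.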

Overhead of Garbage Collection
(figure)

Hardware Garbage Collectors
- Hardware GC in general-purpose processors?
  - Ties one GC algorithm into the ISA and the microarchitecture
  - High cost due to major changes to the processor and/or memory system
  - Misses opportunities at the software level, e.g. locality improvement
- A rigid trade-off: reduced flexibility for higher performance on specific applications
- Transistors are available:
  - Build accelerators for commonly used functionality
  - How much hardware and how much software for GC?

Our Goal
- Architectural and hardware acceleration support for GC
  - Reduce the overhead of software GC
  - Keep the flexibility of software GC
  - Work with any existing software GC algorithm

Basic Idea
- Hardware performs simple but incomplete garbage collection until the heap is full
- Software GC then runs and collects the remaining dead objects
- The overall overhead of GC is reduced

Hardware-Assisted Automatic Memory Management (HAMM)
- Hardware-software cooperative acceleration for GC:
  - Reference count tracking, to find dead objects without running software GC
  - Memory block reuse handling, to provide available blocks to the software allocator
  - Reduces the frequency and overhead of software GC
- Key characteristics
  - The software memory allocator remains in control
  - Software GC still runs and makes the high-level decisions
  - This keeps HAMM simple: it does not have to track all objects

ISA Extensions for HAMM
- Memory allocation: REALLOCMEM, ALLOCMEM
- Pointer tracking (store pointer): MOVPTR, MOVPTROVR; PUSHPTR, POPPTR, POPPTROVR
- Garbage collection

Overview of HAMM
(diagram) Each core's load/store unit sends RC updates to a per-core L1 Reference Count Coalescing Buffer (RCCB) and block addresses to a per-core L1 Available Block Table (ABT). The cores on a chip share an L2 RCCB and an L2 ABT, across one or more CPU chips. Reference counts for live objects reside in main memory, and the ABT tracks reusable blocks.
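The RCCB's job is to combine reference-count updates so that short-lived increment/decrement pairs cancel before ever reaching memory. A minimal sketch of that coalescing idea, with illustrative names and an arbitrary eviction policy that are our assumptions, not the paper's exact design:

```python
class CoalescingBuffer:
    """Buffers per-object RC deltas; only net changes propagate downward."""
    def __init__(self, capacity, next_level):
        self.capacity = capacity
        self.deltas = {}               # object address -> pending net delta
        self.next_level = next_level   # an L2 RCCB dict or in-memory counts

    def update(self, addr, delta):
        new = self.deltas.get(addr, 0) + delta
        if new == 0:
            self.deltas.pop(addr, None)   # inc/dec cancelled: no traffic
            return
        self.deltas[addr] = new
        if len(self.deltas) > self.capacity:
            self.evict()

    def evict(self):
        # Oldest-entry victim choice is an assumption, not the real policy
        addr, delta = next(iter(self.deltas.items()))
        del self.deltas[addr]
        self.next_level[addr] = self.next_level.get(addr, 0) + delta

    def flush(self):
        # Analogous to FLUSHRC: push all pending deltas before software GC
        for addr, delta in self.deltas.items():
            self.next_level[addr] = self.next_level.get(addr, 0) + delta
        self.deltas.clear()

memory_rc = {}
rccb = CoalescingBuffer(capacity=4, next_level=memory_rc)
for _ in range(3):
    rccb.update(0x100, +1)     # three increments buffered as one entry
rccb.update(0x100, -1)         # partially cancels: net delta is +2
rccb.update(0x200, +1)
rccb.update(0x200, -1)         # fully cancels: never reaches memory
rccb.flush()
assert memory_rc == {0x100: 2}
```

The fully cancelled pair for `0x200` is the common case this hierarchy exploits: most young objects gain and lose their references before the buffer entry is ever evicted.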

Modified Allocator

    addr ← REALLOCMEM size
    if (addr == 0) then
        // ABT does not have a free block → regular software allocator
        addr ← bump_pointer
        bump_pointer ← bump_pointer + size
        ...
    else
        // use the address provided by the ABT
    end if
    // Initialize the block starting at addr
    ALLOCMEM object_addr, size
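The allocator pseudocode above can be sketched as ordinary software, modeling REALLOCMEM as a lookup in an available-block table and ALLOCMEM's initialization as a comment. The class name, heap layout, and size-class organization are our assumptions, not the paper's:

```python
class Allocator:
    """Bump-pointer allocator with a hardware-filled available-block table."""
    def __init__(self, heap_base, heap_size):
        self.bump_pointer = heap_base
        self.heap_end = heap_base + heap_size
        self.abt = {}              # size -> list of reusable block addresses

    def realloc_mem(self, size):
        """Model of REALLOCMEM: a reusable block, or 0 on an ABT miss."""
        blocks = self.abt.get(size)
        return blocks.pop() if blocks else 0

    def alloc(self, size):
        addr = self.realloc_mem(size)
        if addr == 0:                          # ABT miss: bump-pointer path
            if self.bump_pointer + size > self.heap_end:
                raise MemoryError("heap full: trigger software GC")
            addr = self.bump_pointer
            self.bump_pointer += size
        # ALLOCMEM would initialize the block starting at addr here
        return addr

alloc = Allocator(heap_base=0x1000, heap_size=4096)
a = alloc.alloc(64)                        # fresh allocation: bump-pointer path
alloc.abt.setdefault(64, []).append(a)     # hardware later finds the block dead
b = alloc.alloc(64)                        # reused block, pointer not advanced
assert b == a and alloc.bump_pointer == 0x1000 + 64
```

The key property is in the last assertion: reuse satisfies allocations without advancing the bump pointer, which is what delays heap exhaustion and thus defers software GC.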

An Example of HAMM

The slide animates reference-count (RC) updates for an object A as pointer-tracking instructions execute (the diagram of core, RCCB, ABT, and main memory is omitted):

1. ALLOCMEM A, size    → incRC A  (A: 1)
2. PUSHPTR A           → incRC A  (A: 2)
3. MOVPTR R3, A        → incRC A  (A: 3)
4. MOV R3, 0x50        → decRC A  (A: 2)
5. MOV addr1, 0x50     → decRC A  (A: 1)
6. MOV addr2, 0x020    → decRC A  (A: 0; A is dead, and its block is placed in the ABT)
7. REALLOCMEM R2, size → the ABT supplies A's block for reuse

The decrements are first coalesced in the RCCB and only update the in-memory count on eviction; the ABT can prefetch the reusable block's address toward the requesting core.
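The slide's trace can be replayed as a small simulation; the addresses and the ABT hand-off below are illustrative, not the exact hardware behavior:

```python
rc = {}      # object address -> reference count
abt = []     # blocks available for reuse

def inc_rc(obj):
    rc[obj] = rc.get(obj, 0) + 1

def dec_rc(obj):
    rc[obj] -= 1
    if rc[obj] == 0:
        abt.append(obj)   # dead: hand the block to the allocator

A = 0x1000                # illustrative address for object A
trace = []
inc_rc(A); trace.append(rc[A])   # ALLOCMEM A       -> A: 1
inc_rc(A); trace.append(rc[A])   # PUSHPTR A        -> A: 2 (copy on stack)
inc_rc(A); trace.append(rc[A])   # MOVPTR R3, A     -> A: 3 (copy in R3)
dec_rc(A); trace.append(rc[A])   # MOV R3, 0x50     -> A: 2 (R3 overwritten)
dec_rc(A); trace.append(rc[A])   # stack slot overwritten -> A: 1
dec_rc(A); trace.append(rc[A])   # last reference overwritten -> A: 0

assert trace == [1, 2, 3, 2, 1, 0]
assert abt == [A]   # REALLOCMEM can now return this block for reuse
```

Note how only pointer-tracking instructions (and overwrites of tracked pointers) touch the count; ordinary data stores never do.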

ISA Extensions for HAMM (continued)
- Memory allocation: REALLOCMEM, ALLOCMEM
- Pointer tracking (store pointer): MOVPTR, MOVPTROVR; PUSHPTR, POPPTR, POPPTROVR
- Garbage collection: FLUSHRC

Methodology
- Benchmarks: DaCapo suite on the Jikes Research Virtual Machine with its best-performing GC, GenMS
- Simics + a cycle-accurate x86 simulator:
  - 64 KB, 2-way, 2-cycle I-cache
  - 16 KB perceptron branch predictor, minimum 20-cycle branch misprediction penalty
  - 4-wide, 128-entry instruction window
  - 64 KB, 4-way, 2-cycle, 64 B-line L1 D-cache
  - 4 MB, 8-way, 16-cycle, 64 B-line unified L2 cache
  - 150-cycle minimum memory latency
- Different methodologies for the two components:
  - GC time: estimated from the actual garbage collection work over the whole benchmark
  - Application: cycle-accurate simulation of 200M-instruction slices with the microarchitectural modifications

GC Time Reduction
(figure: GC time reduction per benchmark; 29% on average)

Application Performance
(figure) Application performance is essentially unchanged; since GC time is reduced by 29%, HAMM is a net win.

Why Does HAMM Work?
- HAMM reduces GC time because it:
  - Eliminates 52% of nursery collections and 50% of full-heap collections
  - Enables memory block reuse for 69% of all new objects in the nursery and 38% of allocations into the older generation
  - Reduces GC work by 21% for nursery and 49% for full-heap collections
- HAMM does not slow down the application significantly:
  - Maximum L1 cache miss increase: 4%; maximum L2 cache miss increase: 3.5%
  - HAMM itself is responsible for only 1.4% of all L2 misses

Conclusion
- Garbage collection is very useful, but it is also a significant source of overhead
  - Improvements to pure software GC or pure hardware GC are limited
- We propose HAMM, a cooperative hardware-software technique:
  - Simplified hardware-assisted reference counting and memory block reuse
  - Reduces GC time by 29%
  - Does not significantly affect application performance
  - Reasonable cost (67 KB on a 4-core chip) for an architectural accelerator of an important functionality
- HAMM can encourage developers to use managed languages

Thank You! Questions?