Hoard: A Scalable Memory Allocator for Multithreaded Applications -- Berger et al. -- ASPLOS 2000 -- Emery Berger, Kathryn McKinley*, Robert Blumofe, Paul Wilson


Hoard: A Scalable Memory Allocator for Multithreaded Applications (Berger et al., ASPLOS 2000)
Emery Berger, Kathryn McKinley*, Robert Blumofe, Paul Wilson
Department of Computer Sciences; * Department of Computer Science

Motivation
- Parallel multithreaded programs are becoming prevalent: web servers, search engines, database managers, etc.
- They run on SMPs for high performance and are often embarrassingly parallel.
- Memory allocation is a bottleneck: it prevents scaling with the number of processors.

Assessment Criteria for Multiprocessor Allocators
- Speed: competitive with uniprocessor allocators on one processor.
- Scalability: performance grows linearly with the number of processors.
- Fragmentation (= maximum memory allocated / maximum memory in use): competitive with uniprocessor allocators, in both the worst case and the average case. For example, if a program's live data peaks at 10 MB but the allocator's footprint peaks at 12 MB, fragmentation is 1.2.

Uniprocessor Allocators on Multiprocessors
- Fragmentation: excellent. Very low for most programs [Wilson & Johnstone].
- Speed and scalability: poor. There is heap contention because a single lock protects the heap (a minimal sketch of this bottleneck follows).
- They can also exacerbate false sharing: different processors can end up sharing cache lines.
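To make the single-lock bottleneck concrete, here is a minimal C++ sketch of the serialization described above: every malloc and free funnels through one global lock, so all threads contend on it. The wrapper names (locked_malloc, locked_free) are illustrative, not taken from the paper or any real allocator.

    // Sketch only: a uniprocessor-style allocator wrapped in a single lock.
    #include <cstddef>
    #include <cstdlib>
    #include <mutex>

    static std::mutex heapLock;  // the one lock every thread must take

    void* locked_malloc(std::size_t n) {
        std::lock_guard<std::mutex> g(heapLock);  // all P threads serialize here
        return std::malloc(n);
    }

    void locked_free(void* p) {
        std::lock_guard<std::mutex> g(heapLock);  // ...and here
        std::free(p);
    }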

Allocator-Induced False Sharing
- Allocators themselves cause false sharing: objects allocated by different processors can end up on the same cache line, so the line is spread across a number of processors.
- Practically all allocators do this.
- (Slide diagram: processor 1 calls x1 = malloc(s), processor 2 calls x2 = malloc(s), and the two objects land on one cache line, which then thrashes between the processors.)
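Below is a minimal sketch of the thrashing pattern in the diagram, assuming a serial heap hands the two threads adjacent chunks on one cache line. The thread bodies, iteration count, and object size are illustrative, not from the paper.

    // Sketch only: two threads write disjoint objects that may share a cache line.
    #include <cstdlib>
    #include <thread>

    int main() {
        // Consecutive mallocs from a serial heap often return adjacent chunks,
        // so *a and *b can sit on the same cache line even though they are
        // logically private to their threads.
        int* a = static_cast<int*>(std::malloc(sizeof(int)));
        int* b = static_cast<int*>(std::malloc(sizeof(int)));
        *a = 0;
        *b = 0;

        auto writer = [](int* p) {
            for (long i = 0; i < 100000000L; ++i) {
                *p += 1;  // private data, but the shared cache line ping-pongs
            }
        };
        std::thread t1(writer, a);
        std::thread t2(writer, b);
        t1.join();
        t2.join();

        std::free(a);
        std::free(b);
        return 0;
    }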

Existing Multiprocessor Allocators
- Speed: a single concurrent heap (e.g., a concurrent B-tree) is too expensive: too many locks/atomic updates, and an O(log n) cost per memory operation. Fast allocators therefore use multiple heaps.
- Scalability: limited by allocator-induced false sharing and other bottlenecks.
- Fragmentation: a P-fold increase, or even unbounded growth.

Multiprocessor Allocator I: Pure Private Heaps
- One heap per processor: malloc gets memory from the calling processor's heap (or from the system); free puts memory on the calling processor's heap.
- Avoids heap contention.
- Examples: STL, ad hoc allocators (e.g., Cilk 4.1).
- (Slide diagram: processors 1 and 2 each malloc and free blocks; a block allocated by heap 1 but freed by processor 2 ends up free on heap 2.)

How to Break Pure Private Heaps: Fragmentation
- With pure private heaps, memory consumption can grow without bound!
- Producer-consumer: processor 1 allocates, processor 2 frees. Every freed block lands on heap 2, which processor 1 never reuses, so processor 1 keeps drawing fresh memory from the system even though the amount of live data stays constant (see the sketch below).
- (Slide diagram: processor 1 repeatedly calls x = malloc(s); processor 2 repeatedly calls free(x).)
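Here is a hedged sketch of that producer-consumer pattern, written with standard C++ threads and a simple locked queue (both illustrative choices, not from the paper). Under a pure private heap, every free in the consumer goes to heap 2, so heap 1 never gets its memory back and the footprint grows without bound.

    // Sketch only: thread 1 only mallocs, thread 2 only frees.
    #include <cstdlib>
    #include <mutex>
    #include <queue>
    #include <thread>

    static std::queue<void*> q;
    static std::mutex m;
    static const int kBlocks = 1000000;

    void producer() {                       // runs on processor 1
        for (int i = 0; i < kBlocks; ++i) {
            void* p = std::malloc(64);      // always allocates from heap 1
            std::lock_guard<std::mutex> g(m);
            q.push(p);
        }
    }

    void consumer() {                       // runs on processor 2
        int freed = 0;
        while (freed < kBlocks) {
            void* p = nullptr;
            {
                std::lock_guard<std::mutex> g(m);
                if (!q.empty()) { p = q.front(); q.pop(); }
            }
            if (p) {
                std::free(p);               // with a pure private heap, this
                ++freed;                    // memory lands on heap 2 forever
            }
        }
    }

    int main() {
        std::thread t1(producer);
        std::thread t2(consumer);
        t1.join();
        t2.join();
        return 0;
    }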

Multiprocessor Allocator II: Private Heaps with Ownership
- free puts memory back on the originating processor's heap (a minimal sketch of the idea follows).
- Avoids unbounded memory consumption.
- Examples: ptmalloc [Gloger], LKmalloc [Larson & Krishnan].
- (Slide diagram: processors 1 and 2 each allocate a block; when a block is freed, even by the other processor, it returns to the heap it came from.)
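A minimal sketch of the ownership idea: each block records which heap it came from, so a cross-thread free can send it back home. The header layout and the pushOntoFreeList helper are assumptions made for illustration, not ptmalloc's or LKmalloc's actual data structures.

    // Sketch only: blocks carry an owner pointer so free returns them to the
    // heap that allocated them rather than the freeing thread's heap.
    struct Heap;  // per-processor heap; details elided

    struct BlockHeader {
        Heap* owner;  // heap that allocated this block
    };

    void owned_free(void* p) {
        // Assumed layout: the header sits immediately before the payload.
        BlockHeader* h = reinterpret_cast<BlockHeader*>(p) - 1;
        (void)h;
        // pushOntoFreeList(h->owner, h);  // hypothetical helper: memory goes home
    }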

How to Break Private Heaps with Ownership: Fragmentation
- With private heaps with ownership, memory consumption can blow up by a factor of P.
- Round-robin producer-consumer: processor i allocates, processor i+1 frees. Freed memory piles up on each of the P heaps but can only be reused by its owner, so the total footprint can reach P times the program's maximum live memory.
- This really happens (NDS).
- (Slide diagram: processors 1, 2, and 3 each allocate a block that the next processor frees.)

So What Do We Do Now?

The Hoard Multiprocessor Memory Allocator
- Manages memory in page-sized superblocks of same-sized objects.
  - Avoids false sharing by not carving up cache lines across processors.
  - Avoids heap contention: local heaps allocate and free small blocks from their own set of superblocks.
- Adds a global heap that serves as a repository of superblocks.
- When the fraction of free memory on a local heap exceeds the empty fraction, Hoard moves a superblock to the global heap.
  - Avoids blowup in memory consumption (a sketch of this check follows).
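The following is a minimal C++ sketch of the emptiness check described above, assuming the invariant from the paper: a local heap releases a superblock to the global heap once its in-use memory falls below (1 - f) of its allocated memory and it holds more than K superblocks of slack. The type names, the constant values, and the moveEmptiestSuperblock helper are illustrative assumptions, not Hoard's actual code.

    // Sketch only: the check a Hoard-style allocator runs on every free.
    #include <cstddef>

    struct Heap {
        std::size_t inUse = 0;      // u_i: bytes currently allocated from this heap
        std::size_t allocated = 0;  // a_i: bytes held by this heap's superblocks
        // ... per-size-class superblock lists elided ...
    };

    constexpr double kEmptyFraction = 1.0 / 3.0;   // f; the example slide uses 1/3
    constexpr std::size_t kSlack = 4;              // K superblocks of slack (assumed)
    constexpr std::size_t kSuperblockSize = 8192;  // S (illustrative value)

    // Called after a block has been returned to its superblock on this heap.
    void maybeReleaseSuperblock(Heap& heap, Heap& globalHeap) {
        bool mostlyEmpty = heap.inUse < (1.0 - kEmptyFraction) * heap.allocated;
        bool enoughSlack = heap.inUse + kSlack * kSuperblockSize < heap.allocated;
        if (mostlyEmpty && enoughSlack) {
            // Move one at-least-f-empty superblock to the global heap so other
            // processors can reuse it; this is what bounds blowup.
            // moveEmptiestSuperblock(heap, globalHeap);  // hypothetical helper
        }
        (void)globalHeap;
    }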

Hoard Example
- Hoard: one heap per processor plus a global heap.
- malloc gets memory from a superblock on the calling processor's heap; free returns memory to its superblock.
- If the heap becomes "too empty", Hoard moves a superblock to the global heap.
- (Slide diagram: processor 1 performs some mallocs and frees, including x1 = malloc(s) and free(x7); with an empty fraction of 1/3, once enough memory is free a superblock migrates to the global heap.)

Summary of Analytical Results
- Worst-case memory consumption: O(n log(M/m) + P), instead of O(P n log(M/m)), where n = memory required, M = largest object size, m = smallest object size, and P = number of processors.
- The best possible bound is O(n log(M/m)) [Robson], so Hoard adds only an additive O(P) term.
- Provably low synchronization in most cases.

Experiments
- Run on a dedicated 14-processor Sun Enterprise: 300 MHz UltraSparc processors, 1 GB of RAM, Solaris 2.7.
- All programs compiled with g++.
- Allocators compared: Hoard, Solaris (the system allocator), ptmalloc (GNU libc; private heaps with ownership), and mtmalloc (Sun's "MT-hot" allocator).

Performance: threadtest
- speedup(x, P) = runtime(Solaris allocator on one processor) / runtime(allocator x on P processors)

Performance: Larson
- Server-style benchmark with sharing.

Performance: false sharing
- Each thread reads and writes heap data.

Fragmentation Results
- On most standard uniprocessor benchmarks, Hoard's fragmentation was low:
  - p2c (Pascal-to-C): 1.20
  - espresso: 1.47
  - LRUsim: 1.05
  - Ghostscript: 1.15
  - Within 20% of Lea's allocator.
- On the multiprocessor benchmarks and other codes, fragmentation was between 1.02 and 1.24 for all but one anomalous benchmark (shbench: 3.17).

Hoard Conclusions
- Speed: excellent. As fast as a uniprocessor allocator on one processor; amortized O(1) cost; 1 lock acquisition for malloc, 2 for free.
- Scalability: excellent. Scales linearly with the number of processors and avoids false sharing.
- Fragmentation: very good. The worst case is provably close to ideal, and the observed fragmentation is low.

Hoard Heap Details
- A "segregated size class" allocator: size classes are logarithmically spaced.
- Each superblock holds objects of one size class; empty superblocks are "recycled".
- Superblock lists are approximately radix-sorted by fullness: allocation draws from mostly-full superblocks, and mostly-empty superblocks can be removed quickly (a sketch of this layout follows).
- (Slide diagram: per-size-class bins pointing to superblock lists ordered from emptiest to fullest.)
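To close, here is a sketch of the per-heap bookkeeping the diagram implies: one bin per size class, each holding superblock lists grouped from emptiest to fullest. The structure names and the particular counts of size classes and fullness groups are assumptions for illustration, not Hoard's actual layout.

    // Sketch only: size-class bins of superblock lists ordered by fullness.
    #include <array>
    #include <cstddef>

    struct Superblock {
        Superblock* next = nullptr;      // intrusive list link
        std::size_t objectSize = 0;      // every object in a superblock has this size
        std::size_t objectsInUse = 0;
        std::size_t objectCapacity = 0;
    };

    constexpr int kNumSizeClasses = 32;  // logarithmically spaced sizes (assumed count)
    constexpr int kFullnessGroups = 4;   // emptiest ... fullest (assumed count)

    struct ProcessorHeap {
        // bins[sizeClass][group] is the head of a superblock list. Allocation
        // scans from the fullest non-full group downward, so mostly-full
        // superblocks are reused first and mostly-empty ones stay easy to find
        // and hand back to the global heap.
        std::array<std::array<Superblock*, kFullnessGroups>, kNumSizeClasses> bins{};
    };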