A non-blocking approach to GPU dynamic memory management. Joy, NVIDIA.

Presentation transcript:

A non-blocking approach to GPU dynamic memory management. Joy, NVIDIA

Outline
- Introduction to the buddy memory system
- Our parallel implementation
- Performance comparison
- Discussion

Fixed-size memory (memory pool)
The fastest and simplest memory system.
- Free list (item = address): each item of the free list records an available address to allocate. The free list can be implemented with a queue, stack, list, or any other data structure.
- Allocate: take one item from the free list.
- Free: return the address to the free list.
- Performance: constant time for both allocation and free.
Free list: 0x0000, 0x0100, 0x0200, 0x0300, ...
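The slide above can be sketched as a small host-side C++ model (the class and its names are hypothetical, for illustration only; the real allocator runs on the GPU):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Minimal sketch of a fixed-size pool: the free list is a stack of
// block addresses; alloc pops one address, free pushes it back.
class FixedPool {
public:
    FixedPool(std::uintptr_t base, std::size_t block, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i)
            free_.push_back(base + i * block);   // pre-populate the free list
    }
    bool alloc(std::uintptr_t* out) {            // O(1): take one item
        if (free_.empty()) return false;
        *out = free_.back();
        free_.pop_back();
        return true;
    }
    void release(std::uintptr_t addr) {          // O(1): return the address
        free_.push_back(addr);
    }
private:
    std::vector<std::uintptr_t> free_;
};
```

Both operations touch only the end of the list, which is what makes the pool constant-time.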

Multi-list memory
For non-fixed-size allocation, a natural extension of the fixed-size pool is a multi-list memory system.
- Free lists: multiple fixed-size free lists with different sizes (e.g. growing by a factor of two).
- Allocate: find the first free list whose size is at least the request size by an arithmetic operation, e.g. ceil(log2(size)); take one element from that list.
- Free: find the correct free list and return the address to it.
- Performance: constant time for both allocation and free, since the suitable free list can be found with an arithmetic operation instead of a linear search.
- Drawback: wastes memory.
Free lists: size = 256, 512, 1024, 2048, ...
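The bin-selection arithmetic mentioned above can be sketched as follows (a host-side illustration; the function name is hypothetical):

```cpp
#include <cassert>
#include <cstddef>

// Returns the index of the first free list whose block size is at
// least the request size, relative to the smallest bin; equivalent to
// ceil(log2(size / min_block)) for size >= min_block.
std::size_t bin_index(std::size_t size, std::size_t min_block) {
    std::size_t idx = 0, block = min_block;
    while (block < size) {   // double the bin size until it fits
        block <<= 1;
        ++idx;
    }
    return idx;
}
```

For example, with a 256-byte minimum bin, a 257-byte request lands in the 512-byte list (index 1).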

Buddy memory
To avoid the memory waste of the multi-list system, it is natural to allocate memory from the next upper layer (twice the size) when a free list is empty, instead of pre-allocating memory for every free list.
- Free lists: multiple fixed-size free lists, with sizes growing in powers of 2.
- Allocate: find the first free list whose size is at least the request size and take one element from it. If that free list is empty, create a pair of blocks from the upper list.
- Free: find the correct free list (using records) and return the address to it. If the buddy is also in the free list, free the merged block to the upper layer.
- Performance: constant time for both allocation and free.
Free lists: size = 256, 512, 1024, 2048, 4096, ...

Buddy memory
- Good internal de-fragmentation.
- The buddy address can be calculated as address XOR size.
- Constant-time operation: O(h), where h = log2(max size / min size) is a constant.
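The XOR rule above follows because two buddies of the same power-of-two size, split from one parent, differ only in the address bit equal to the block size. A minimal sketch:

```cpp
#include <cassert>
#include <cstdint>

// Buddy address rule from the slide: buddy = address XOR size.
// Works for power-of-two sizes and size-aligned addresses.
std::uintptr_t buddy_of(std::uintptr_t addr, std::size_t size) {
    return addr ^ size;
}
```

The rule is its own inverse: applying it to the buddy returns the original block, so merge checks need no extra bookkeeping.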

Memory layers
Only one class for a single layer is implemented; the other layers are instances of the same class with different sizes.
- Lower layer: the memory layer with 1/2 the size of the current layer.
- Current layer: the layer serving the allocation request.
- Upper layer: the memory layer with 2x the size of the current layer.
(Diagram: free lists of size 256, 512, 1024, 2048, 4096, ...)

Pair creation
If the current free list is empty, the layer allocates memory from the upper layer. Since the upper block is twice the size, splitting it creates a pair of available blocks in the current free list. Consequently, if N threads simultaneously request memory from a layer whose free list is empty, only N/2 threads need to allocate from the upper layer.
(Diagram: one block from the upper layer becomes two blocks in the current layer.)
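The split described above can be sketched in one line (host-side illustration; the function name is hypothetical):

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

// Splitting one upper-layer block (size 2s) yields two buddy blocks
// of size s in the current layer: the original address and the
// address offset by the current block size.
std::pair<std::uintptr_t, std::uintptr_t>
split_from_upper(std::uintptr_t upper_addr, std::size_t current_size) {
    return { upper_addr, upper_addr + current_size };
}
```

Note that the two halves are buddies of each other under the XOR rule, so they merge back cleanly on free.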

Free queue
The free list is implemented as a queue whose head can run past its tail:
- Head < Tail: memory available (allocate directly from this free list).
- Head = Tail: empty free list.
- Head > Tail: under-available (pair creation from the upper layer is required).
These states determine which threads must call pair_creation() on the upper layer.
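The three states above can be sketched as a simple classification (host-side illustration; the enum and function names are hypothetical):

```cpp
#include <cassert>

// The three free-queue states from the slide: the sign of head - tail
// tells a thread whether its slot is backed by a free block or whether
// it must trigger pair creation from the upper layer.
enum class QueueState { Available, Empty, UnderAvailable };

QueueState classify(long head, long tail) {
    if (head < tail)  return QueueState::Available;       // items waiting
    if (head == tail) return QueueState::Empty;           // nothing queued
    return QueueState::UnderAvailable;                    // needs pair creation
}
```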

Parallel strategy (alloc)
Each allocation requestor creates a socket to listen for an address. The socket is implemented on the free queue: atomicAdd(&head, 1) reserves a socket. The output address can come either from the current free list or from pair creation in the upper free list.
(Diagram: threads with allocation requests advance the head past the tail; slots beyond the tail need pair creation from the upper layer.)
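The socket reservation can be modelled on the host with std::atomic standing in for CUDA's atomicAdd (an assumption for illustration; the struct and method names are hypothetical):

```cpp
#include <atomic>
#include <cassert>

// Each requesting thread atomically claims a unique queue slot
// (its "socket"); slots at or past the tail are not yet backed by a
// free block, so those threads must wait for pair creation.
struct FreeQueue {
    std::atomic<long> head{0};
    long tail{0};

    long reserve_socket() {
        return head.fetch_add(1);   // host analogue of atomicAdd(&head, 1)
    }
    bool needs_pair_creation(long slot) const { return slot >= tail; }
};
```

Because fetch_add returns the pre-increment value, every thread gets a distinct slot with no locking, which is the non-blocking core of the allocation path.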

Odd/even pair creation
The under-available threads perform pair creations in an odd/even loop until the new tail >= the new head, which avoids the overhead of fully simultaneous pair creation.

Parallel strategy (free)
- Store the freed address in the free list.
- Calculate the buddy address: XOR(addr, size).
- Check whether the buddy is already in the free list, using the hand-shake algorithm for a fast lookup.
- If YES, mark both elements in the free list as N/A, then free the merged memory block to the upper layer.

Hand shake
The freed memory block records its index in the free list, and the free list records the freed memory's address. This gives a fast check of whether a buddy's address is in the free list:
- Calculate the buddy's memory address (XOR).
- Read the index stored at that address.
- Check whether the address at this index in the free list equals the buddy's address.
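A minimal host-side sketch of the hand-shake check, using a map to stand in for the per-block header (an assumption; on the GPU the index lives in the block itself):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

// The block header stores its free-list index, and the free list
// stores the block's address; the buddy is "in the list" only when
// the two records point at each other.
struct HandShake {
    std::vector<std::uintptr_t> list;                     // index -> address
    std::unordered_map<std::uintptr_t, std::size_t> hdr;  // address -> index (models the header)

    void push(std::uintptr_t addr) {
        hdr[addr] = list.size();
        list.push_back(addr);
    }
    bool in_list(std::uintptr_t addr) const {
        auto it = hdr.find(addr);
        if (it == hdr.end()) return false;                // no header record
        std::size_t i = it->second;
        return i < list.size() && list[i] == addr;        // mutual agreement
    }
};
```

The mutual check makes stale header bytes harmless: a random index only counts if the free list entry at that index points back to the buddy's address.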

Performance (gridDim = 512, blockDim = 512, K20; the CUDA 5.0 column values were lost in transcription)

Test                                              CUDA 5.0   This work   Speedup
256 bytes alloc/free, single time                 ... ms     10.8 ms     25.8x
256 bytes alloc                                   ... ms     10.48 ms    682x
256 bytes free                                    ... ms     7.27 ms     780x
Random sizes, alloc/free x35, size < lower 2 layers  ... ms  65.8 ms     81.7x
Random sizes, alloc/free x35, full range          ... ms     370.5 ms    11.2x

Discussion
- Warp-level group allocation
- Dynamically expanding free queue

Backup Slides

Slow atomicCAS() loop

// Lock-free pop of the free-list head ("Node" is the list element type):
// retry until the compare-and-swap succeeds, i.e. no other thread
// changed head between the read and the swap.
Node* ret = head;
Node* now;
do {
    now = ret;
    ret = (Node*)atomicCAS((unsigned long long*)&head,
                           (unsigned long long)now,
                           (unsigned long long)now->next);
} while (ret != now);