Cache-Conscious Data Placement
Amy M. Henning
CS 612, April 7, 2005


What is Cache-Conscious Data Placement? Software-based technique to improve data cache performance by relocating variables in the cache and virtual memory space. Goals: Increase cache locality. Reduce cache conflicts for data accesses.

Referenced Research “Cache-Conscious Data Placement” -Calder et al. ‘98 Use smart heuristics: Profiling program and reordering data. Assume programs have similar behavior even with varying inputs. Focus is on reducing cache conflict misses. “Cache-Conscious Structure Layout” -Chilimbi et al. ‘99 Use of Layout Tools: Structure Reorganizer Memory Allocator Focuses on pointer-based codes. Focuses on improving temporal and spatial locality of cache.

Effects of variable placement Conflict Misses: referenced blocks mapping to the same set exceed the cache's associativity. Solution: place objects with high temporal locality into different cache blocks. Capacity Misses: the working set doesn't fit in the cache. Solution: move infrequently referenced variables out of cache blocks and replace them with more frequently referenced ones. Compulsory Misses: a block is referenced for the first time. Solution: group variables with high temporal locality into the same cache block for more effective prefetches.
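The conflict-miss case can be made concrete with a small sketch. The parameters match the 8K direct-mapped, 32-byte-line cache evaluated later; the code itself (function and macro names) is illustrative:

```c
#include <stdint.h>

/* Set-index arithmetic for a direct-mapped cache (8 KB, 32-byte
   lines). Two objects suffer conflict misses when their blocks map
   to the same set and alternate in the access stream. */
#define LINE_SIZE 32u
#define NUM_SETS  (8192u / LINE_SIZE)   /* 256 sets */

unsigned set_index(uintptr_t addr) {
    return (unsigned)((addr / LINE_SIZE) % NUM_SETS);
}

int may_conflict(uintptr_t a, uintptr_t b) {
    return set_index(a) == set_index(b);
}
```

Addresses exactly one cache size apart (here 8 KB) land in the same set, which is why placement, not just distance, determines conflicts.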

Terminology Data Placement: the process of assigning addresses to data objects, used to control the contents and location of cache blocks. Objects: any region of memory that the program views as a single contiguous space. 1. Stack – referenced as one large contiguous object. 2. Global – each global is treated as a single object. 3. Heap – dynamically managed at runtime. 4. Constant – references are treated as loads of constant data.

Framework Profiler: gathers information on data structures; produces two profiles, the Name profile and the Temporal Relationship Graph. Data Placement Optimizer: uses the profiled information; reorders the global data segments; determines new starting locations for the global segments and the stack. Run-time Support: provides custom allocation of heap objects and guides their placement.

Profiling: Naming Strategy Assigns names to all variables. Naming has a profound effect on profiling quality and on the effectiveness of placement. Essential for binding the two runs: profiling and data placement/optimization. Provides the following for each object: name, number of times referenced, size, and lifetime.

Profiling: Naming Strategy Implementation Names do not change between runs, and computing them incurs minimal run-time overhead. Stack and global variables use their initial address as their name. Heap variables combine the address of the call site to malloc() with a few return addresses from the stack. Problem: concurrently live variables can possess the same name!
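A hypothetical sketch of the heap-object naming (the function names and hash constant are invented, not from the paper): the call-site address is mixed with a few return addresses, so the same allocation site reached through different callers gets distinct names.

```c
#include <stdint.h>

/* Illustrative name computation for heap objects: hash together the
   malloc() call-site address and the top few return addresses. */
static uint64_t mix(uint64_t h, uint64_t v) {
    return h ^ (v + 0x9e3779b97f4a7c15ULL + (h << 6) + (h >> 2));
}

uint64_t heap_object_name(uint64_t call_site,
                          const uint64_t *ret_addrs, int depth) {
    uint64_t name = call_site;
    for (int i = 0; i < depth; i++)
        name = mix(name, ret_addrs[i]);
    return name;
}
```

The slide's caveat still applies: two concurrently live objects allocated through the same call path receive the same name.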

Profiling: Temporal Relationships Conflict Cost Metric: used to determine the ordering for object placement. Estimates the cache misses caused by placing a group of overlapping objects into the same cache line. Temporal Relationship Graph (TRGplace Graph): each edge relates two objects and represents their degree of temporal locality. Edge weight: the estimated number of misses that would occur if the two objects mapped to the same cache set but were in different blocks.

Profiling: Temporal Relationship Graph Implementation Keeps a queue (Q) of the most frequently accessed data objects (obj). Entry: (obj, X), where X is the conflict weight of the edge. Procedure: 1. Search Q for the current obj. 2. If found, increment each obj's X from the front of Q to the obj's location. 3. Remove the obj and place it at the front of Q. For large objects, "chunks" are used instead of whole objects in order to keep track of temporal information.
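The queue procedure can be sketched as follows (queue length and type names are illustrative, and per-object counters stand in for the per-edge weights of the full TRG):

```c
#include <string.h>

/* Sketch of the TRG profiling queue: recently accessed objects in
   MRU order. On a repeated access, every object between the front
   and the accessed object's position gains conflict weight, and the
   accessed object moves to the front. */
#define QLEN 4

typedef struct { int obj; long weight; } Entry;

Entry q[QLEN];
int qsize = 0;

void trg_access(int obj) {
    int i;
    for (i = 0; i < qsize; i++)
        if (q[i].obj == obj) break;
    if (i < qsize) {                       /* found: update weights */
        Entry hit = q[i];
        for (int j = 0; j < i; j++)
            q[j].weight++;                 /* they conflict with obj */
        memmove(&q[1], &q[0], (size_t)i * sizeof(Entry));
        q[0] = hit;                        /* move to front (MRU)    */
    } else {                               /* not found: insert MRU  */
        if (qsize == QLEN) qsize--;        /* evict LRU entry        */
        memmove(&q[1], &q[0], (size_t)qsize * sizeof(Entry));
        q[0].obj = obj;
        q[0].weight = 0;
        qsize++;
    }
}
```

For example, the access sequence 10, 20, 10 leaves object 20 with weight 1: it sat between the front of the queue and object 10's position on the repeated access.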

Data Placement Algorithm Designed to eliminate cache conflicts and increase cache line utilization. Input: temporal relationship graph. Output: placement map.
Phase 0: split objects into popular and unpopular sets.
Phase 1: preprocess heap objects and assign bin tags.
Phase 2: place the stack in relation to the constants.
Phase 3: combine popular objects into compound nodes.
Phase 4: create TRG edges between compound nodes.
Phase 5: place small objects together for cache line reuse.
Phase 6: place global and heap objects to minimize conflicts.
Phase 7: place global variables, emphasizing cache line reuse.
Phase 8: finish placing variables and write the placement map.

Allocation of Heap Objects Implemented at run time using a customized malloc routine. Data placement guides objects with similar temporal use and locality into allocation bins. Each name has an associated tag, and free-lists with matching tags are used to allocate the objects. Popular heap objects are given a cache start offset. The goal is to place objects with temporal locality near each other in memory.

Methodology Hardware: DEC Alpha processor. Benchmarks: SPEC95 C/Fortran/C++ programs. Instrumentation tool: ATOM, used to gather the Name and TRG profiles; an interface that allows elements of the program executable to be queried and manipulated.

Data Cache Performance Improvement in terms of data cache miss rates, for an 8K direct-mapped cache with 32-byte lines. Globals had the largest problem and the largest CCDP improvement; the heap had the least improvement.

Frequency of Objects Breakdown of the frequency of references to objects in terms of their size in bytes: percentage of static global and heap objects (percentage of dynamic references, average percentage of references per object).

Behavior of Heap Objects Shows the challenge CCDP faces on heap objects: objects with large miss rates are sparse, and heap objects tend to be small and short-lived.

Contributions First general framework for data layout optimization. Shows that data cache misses arise from interactions between all segments of the program address space. Their data placement algorithm shows measurable improvement.

Motivation: Chilimbi et al. Application workloads: performance is dominated by memory references, and is limited by techniques that focus on memory latency rather than on the cause, poor reference locality. Changes in data structures: from arrays to a mix of pointer-based structures.

Pointer Structures Key property: locational transparency. Elements in the structure can be placed at different memory locations without changing the semantics of the program. Placement techniques can exploit this to improve cache performance by increasing a data structure's spatial and temporal locality and by reducing cache conflicts.

Placement Techniques Two general data placement techniques: Clustering places structure elements likely to be accessed simultaneously in the same cache block. Coloring places heavily and infrequently accessed elements in non-conflicting regions. The goal is to improve a pointer structure's cache performance.

CCDP Technique: Clustering Improves spatial and temporal locality and provides implicit prefetching. Subtree clustering: packing subtrees into a single cache block. This scheme is far more efficient than allocation-order clustering.

CCDP Technique: Coloring Uses non-conflicting regions of the cache to map elements that are concurrently accessed. Frequently accessed structure elements are mapped to the first region, ensuring that heavily accessed elements do not conflict among themselves and are not replaced.

CCDP Technique: Coloring 2-color scheme for a 2-way set-associative cache: C cache sets, p of which are partitioned off for the frequently accessed elements.
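A minimal sketch of the color mapping; the values of C, p, and the line size here are assumed example parameters, not figures from the paper:

```c
#include <stdint.h>

/* 2-color sketch: of the C cache sets, the first p are reserved for
   frequently accessed elements; everything else maps to the rest.
   C = 256 sets, p = 64, 32-byte lines are example values. */
#define LINE   32u
#define C_SETS 256u
#define P_HOT  64u

unsigned cache_set(uintptr_t addr) {
    return (unsigned)((addr / LINE) % C_SETS);
}

int color_of(uintptr_t addr) {
    return cache_set(addr) < P_HOT ? 0 : 1;  /* 0 = hot, 1 = cold */
}
```

Placing hot elements only at addresses with color 0 guarantees they can evict each other only within the reserved p sets, never the cold data.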

Considerations for techniques Both techniques require detailed knowledge of the program's code and data structures, familiarity with the architecture, and considerable programmer effort. Solution: two strategies can be applied to the CCDP techniques to reduce the level of programming effort: Cache-Conscious Reorganization and Cache-Conscious Allocation.

Strategy: Data Reorganization Addresses the problem of layouts that interact poorly with the program's data access patterns. Eliminates profiling by using tree structures, which possess known topological properties. Tool: ccmorph, a semantics-preserving cache-conscious tree reorganizer that applies both the clustering and coloring techniques.

ccmorph Appropriate for read-mostly data structures that are built early in the computation and heavily referenced. Operates on tree-like structures with homogeneous elements and no external pointers into the middle of the structure. Copies the structure into a contiguous block of memory, partitions the tree into subtrees, and colors the structure so the first p elements map to the reserved region.
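The clustering step can be sketched as a copy-based reorganizer in the spirit of ccmorph (all names are invented, and a "cache block" is assumed to hold exactly three nodes, so each parent and its two children are packed contiguously while grandchildren start new blocks):

```c
#include <stddef.h>

/* Sketch of subtree clustering: copy a binary tree into a contiguous
   arena so that each parent and its two children occupy one 3-node
   block. */
typedef struct Node { int key; struct Node *l, *r; } Node;

static Node *arena;
static int next_slot;

Node *cluster(const Node *n) {
    if (!n) return NULL;
    Node *b = &arena[next_slot];     /* reserve one 3-node block   */
    next_slot += 3;
    b[0].key = n->key;
    if (n->l) {                      /* left child shares the block */
        b[1].key = n->l->key;
        b[1].l = cluster(n->l->l);   /* grandchildren: new blocks   */
        b[1].r = cluster(n->l->r);
    }
    b[0].l = n->l ? &b[1] : NULL;
    if (n->r) {                      /* right child shares the block */
        b[2].key = n->r->key;
        b[2].l = cluster(n->r->l);
        b[2].r = cluster(n->r->r);
    }
    b[0].r = n->r ? &b[2] : NULL;
    return b;
}

/* Build a small 5-node tree and cluster it into the caller's buffer. */
Node *demo(Node *buf) {
    static Node n4 = {4, 0, 0}, n5 = {5, 0, 0}, n3 = {3, 0, 0};
    static Node n2 = {2, &n4, &n5};
    static Node n1 = {1, &n2, &n3};
    arena = buf;
    next_slot = 0;
    return cluster(&n1);
}
```

After clustering, a root-to-child traversal touches adjacent memory, which is the source of the implicit prefetching mentioned earlier.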

Strategy: Heap Allocation A complementary approach to reorganization, applied when elements are allocated. Must have low overhead, since it is invoked more frequently. Has only a local view of the structure, but is safe. Tool: ccmalloc, which allocates a new data item into the same cache block as an existing item.

ccmalloc Focuses only on L2 cache blocks; overhead is inversely proportional to the size of a cache block. If the cache block is full, one of several strategies chooses where to allocate the new data item: closest, new-block, or first-fit.
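A toy sketch of the interface, not the real tool's implementation (the bump allocator, heap size, and 64-byte L2 line are assumptions): the caller passes a pointer to an existing item, and the allocator returns memory from the same cache block when it can, otherwise falling back to the new-block strategy.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy ccmalloc-style allocator: a bump allocator that co-locates a
   new item with an existing one when the current allocation point
   lies in the same 64-byte block. */
#define BLOCK 64u

static _Alignas(64) char heap[4096];
static size_t brk_off = 0;

void *cc_malloc(size_t size, const void *near) {
    uintptr_t cur = (uintptr_t)&heap[brk_off];
    if (near && cur / BLOCK == (uintptr_t)near / BLOCK
             && brk_off + size <= sizeof heap) {
        void *p = &heap[brk_off];          /* same block: co-locate */
        brk_off += size;
        return p;
    }
    /* "new-block" fallback: start the item at a fresh block. */
    brk_off = (brk_off + BLOCK - 1) / BLOCK * BLOCK;
    if (brk_off + size > sizeof heap) return NULL;
    void *p = &heap[brk_off];
    brk_off += size;
    return p;
}
```

A caller linking list nodes would write `node->next = cc_malloc(sizeof *node, node);` so that adjacent list elements tend to share an L2 block.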

Methodology Hardware: Sun Ultraserver E, UltraSPARC processors, 2 GB memory; L1 and L2 cache line sizes differ. Benchmarks: Tree microbenchmarks, which perform random searches on different types of balanced trees; macrobenchmarks, real-world applications; Olden benchmarks, pointer-based applications.

Tree Microbenchmark Measures the performance of ccmorph. Uses 2M keys and 40 MB of memory. No clustering is done for the L1 cache due to its size. B-trees reserve extra space in tree nodes to handle insertion, and hence do not manage the cache as well as the C-tree.

Macrobenchmarks Radiance: 3D modeling of a space; depth-first structure; no ccmalloc used. VIS (Verification Interacting with Synthesis) of finite-state systems: uses binary decision diagrams; no ccmorph used. Results: Radiance, 42% speedup from clustering plus coloring; VIS, 27% speedup from cache-conscious heap allocation.

Olden Benchmarks Cycle-by-cycle uniprocessor simulation with RSIM, modeling a MIPS R10000 processor. Comparison of semi-automated CCDP techniques against other latency-reducing schemes.

Olden Benchmarks ccmorph outperformed hardware/software prefetching by 3-138%; ccmalloc-new-block also outperformed prefetching.

Contributions Treated cache-conscious data placement as if memory access costs were not uniform. Cache-conscious data placement techniques that improve a pointer structure's cache performance. Strategies and tools for applying these techniques that are semi-automatic and don't require profiling.

Suggested Future Work Further reduce the effort required of the programmer to detect data structure characteristics, through more profiling or static program analysis. Focus more on the frequent item sets. Incorporate pattern recognition with data placement. Both papers suggested that compilers and run-time systems can help close the processor-memory performance gap.