The Memory Behavior of Data Structures
Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley
Department of Computer Sciences, The University of Texas at Austin

Slide 2: Memory hierarchy trends
- Growing latency to main memory
- Growing cache complexity
  – More cache levels
  – New mechanisms and optimizations
- Growing application complexity
  – Lots of abstraction
- Hard to predict how an application will perform on a specific system

Slide 3: Application understanding is hard
- Observations can generate gigabytes of data
  – Aggregation is necessary
- Current metrics are too lossy
  – Different application behaviors → similar miss rate
- New metrics are needed: richer, but still concise
- Our approach: data structure decomposition

Slide 4: Why decompose by data structure?
- An irregular application is built from multiple regular data structures
  – e.g., while (tmp) tmp = tmp->next; (sketched below)
- Data structures are high-level
  – Results are easy to visualize
  – Can be correlated back to application source code
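To make the pointer-chasing example on this slide concrete, here is a minimal sketch; the struct name, field names, and payload are illustrative and not taken from any particular benchmark. The point is that a single "regular" data structure (a linked list) produces a very characteristic memory access pattern.

```c
/* Illustrative linked-list node; any payload would do. */
struct node {
    int payload;
    struct node *next;
};

/* The slide's traversal, fleshed out: each iteration dereferences one
 * pointer, so cache behavior depends entirely on where the allocator
 * placed successive nodes. */
long sum_list(struct node *tmp) {
    long sum = 0;
    while (tmp) {
        sum += tmp->payload;
        tmp = tmp->next;
    }
    return sum;
}
```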

Slide 5: Outline
- Data structure decomposition using DTrack
  – Automatic instrumentation + timing simulation
- Methodology
  – Tools, configurations simulated, benchmarks studied
- Results
  – Data structures causing the most misses
  – Different types of access patterns
  – Case study: data structure criticality

Slide 6: Conventional simulation methodology
- The simulated application shares resources with the simulator (disk, file system, network) but is not aware of it
[Diagram: Application and Simulator both run on the Host Processor and share its Resources]

Slide 7: A different perspective
- The application can communicate with the simulator
- Leave the core application oblivious; automatically add simulator-aware instrumentation
[Diagram: Application communicating with the Simulator, which mediates access to Resources]

Slide 8: DTrack
[Toolflow diagram: Application Sources → Source Translator → Instrumented Sources → Compiler → Application Executable → Simulator → Detailed Statistics]
- DTrack: our protocol for application-simulator communication

Slide 9: DTrack's protocol
1. Application stores a mapping at a predetermined shared location
   – (start address, end address) → variable name
2. Application signals the simulator somehow
   – We enhance the ISA with a new opcode; other techniques are possible
3. Simulator detects the signal and reads the shared location
The simulator now knows the variable names of address regions.
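A rough sketch of the simulator side of this protocol, written in C. It assumes the simulator's decode/commit logic flags the marker opcode and has already extracted the (address, size, name) triple from the shared location in the simulated address space; none of these names or structures come from sim-alpha itself.

```c
#include <stdint.h>
#include <stdio.h>

struct range_entry {            /* (start, end) -> variable name */
    uint64_t start, end;
    char     name[64];
};

#define MAX_RANGES 4096
static struct range_entry ranges[MAX_RANGES];
static int num_ranges;

/* Called by the simulator when it sees the marker opcode, after it has
 * read the registration triple out of the shared location. */
void on_marker_op(uint64_t addr, uint64_t size, const char *name)
{
    if (num_ranges < MAX_RANGES) {
        struct range_entry *r = &ranges[num_ranges++];
        r->start = addr;
        r->end   = addr + size;
        snprintf(r->name, sizeof r->name, "%s", name);
    }
}

/* Attribute a memory access (e.g., a cache miss) to a data structure. */
const char *lookup_data_structure(uint64_t addr)
{
    for (int i = 0; i < num_ranges; i++)
        if (addr >= ranges[i].start && addr < ranges[i].end)
            return ranges[i].name;
    return "<unknown>";
}
```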

Slide 10: DTrack instrumentation
Global variables: instrumented just after initialization

Before:
  int Time;
  int main() { … }

After:
  int Time;
  int main() {
    print(FILE, "Time", Time, sizeof(Time));
    …
    asm("mop");
  }

Slide 11: DTrack instrumentation
Heap variables: instrumented just after allocation

Before:
  x = malloc(4);

After:
  x = malloc(4);
  DTRACK_PTR = x;
  DTRACK_NAME = "x";
  DTRACK_SIZE = 4;
  asm("mop");
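One way to picture what the source translator emits at every allocation site is the macro below. This is only a sketch: the DTRACK_* globals and the marker instruction follow the slide, but the macro itself, the use of volatile, and the "nop" stand-in for the overloaded marker opcode are illustrative assumptions, not DTrack's actual output.

```c
#include <stdlib.h>

/* Shared location the simulator reads when it sees the marker op.
 * 'volatile' keeps the compiler from optimizing the stores away. */
volatile void       *DTRACK_PTR;
volatile const char *DTRACK_NAME;
volatile size_t      DTRACK_SIZE;

/* Illustrative rewrite of an allocation site: allocate, publish the
 * (address, name, size) triple, then execute the marker instruction. */
#define DTRACK_MALLOC(var, size, name)                                \
    do {                                                              \
        (var) = malloc((size));                                       \
        DTRACK_PTR  = (var);                                          \
        DTRACK_NAME = (name);                                         \
        DTRACK_SIZE = (size);                                         \
        asm volatile("nop"); /* stands in for the marker opcode */    \
    } while (0)
```

With this, the translator would replace `x = malloc(4);` with `DTRACK_MALLOC(x, 4, "x");`, matching the before/after shown on the slide.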

Slide 12: Design decisions
- Source-based rather than binary-based translation
- Local variables: no instrumentation
  – Instrumenting every call/return is too much overhead
  – They don't cause many cache misses anyway
  – Dynamic allocation on the stack: handle alloca just like malloc
- Signalling opcode: overload an existing one
  – Avoids modifying the compiler and allows running natively

Slide 13: Instrumentation can perturb application behavior
- Minimizing perturbance
  – Global variables are easy: one-time cost
  – Heap variables are harder: DTRACK_PTR, etc. always hit in the cache
- Measuring perturbance
  – Communicate specific start and end points in the application to the simulator
  – Compare instruction counts between them, with and without instrumentation (illustrated below)
- Result: instruction count changes by less than 4%, even with frequent malloc
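A small illustration of the perturbance metric as described on this slide: the relative change in committed instructions between the two marked points, measured once uninstrumented and once instrumented. The counts below are made up purely to show the arithmetic.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    long long base  = 1843000000LL;  /* committed instrs, uninstrumented (made-up) */
    long long instr = 1901000000LL;  /* committed instrs, instrumented (made-up)   */

    double perturbance = 100.0 * llabs(instr - base) / (double)base;
    printf("instruction-count perturbance = %.2f%%\n", perturbance);  /* ~3.1%, under the 4% bound */
    return 0;
}
```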

Slide 14: Outline
- Data structure decomposition using DTrack
  – Automatic instrumentation + timing simulation
- Methodology
  – Tools, configurations simulated, benchmarks studied
- Results
  – Data structures causing the most misses
  – Different types of access patterns
  – Case study: data structure criticality

Slide 15: Methodology
- Source translator: C-Breeze
- Compiler: Alpha GEM cc
- Simulator: sim-alpha
  – Validated pipeline model
- Simulated machine: Alpha
  – 4-way issue, 64KB 3-cycle DL1
- Benchmarks: 12 C applications from the SPEC CPU2000 suite

Slide 16: Major data structures by DL1 misses
[Chart: for each benchmark, the share of DL1 misses attributed to its major data structures; y-axis: % DL1 misses]

Slide 17: Large variety in access patterns
Code + data profile = access pattern. Representative access patterns (sketched in code below):
- twolf: t1 = b[c[i]→cblock]; t2 = t1→tile→term; t3 = n[t2→net]; …; i = rand()
- mcf: node[i], i = i+1; node→child, node→parent, node→sibling, node→siblingp; node = DFS(node)
- art: f1[i], bu[i], i = i+1
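A hedged sketch of two of these access shapes in C. The struct layouts and function names are invented for illustration; only the shapes themselves, a unit-stride array sweep (art-style) versus a pointer-chasing DFS over child/parent/sibling links (mcf-style), follow the slide.

```c
#include <stddef.h>

/* art-style streaming: sequential sweeps over large arrays. */
double stream_sum(const double *f1, const double *bu, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i = i + 1)   /* i = i+1: unit stride, prefetch-friendly */
        s += f1[i] + bu[i];
    return s;
}

/* mcf-style pointer chasing. The pointer field names follow the slide;
 * the struct itself and the payload are invented. */
struct node {
    struct node *child, *parent, *sibling, *siblingp;
    long value;
};

/* One step of an iterative preorder DFS, roughly "node = DFS(node)":
 * descend if possible, otherwise move across, otherwise climb back up
 * until a sibling exists. */
struct node *dfs_next(struct node *node) {
    if (node->child)
        return node->child;
    while (node && !node->sibling)
        node = node->parent;
    return node ? node->sibling : NULL;
}
```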

Slide 18: Most misses ≣ most pipeline stalls?
- Process (sketched in code below)
  – Detect stall cycles, i.e. cycles in which no instructions were committed
  – Assign blame to the data structure of the oldest instruction in the pipeline
- Result
  – Stall-cycle ranks track miss-count ranks
  – Exceptions: tds in 179.art, search in 186.crafty
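A minimal sketch of the blame-assignment rule described above, assuming a simulator with a reorder buffer whose head entry is the oldest in-flight instruction, plus the lookup_data_structure() range lookup from the earlier sketch. The ds_index() helper is assumed; none of this mirrors sim-alpha's actual internals.

```c
#include <stdint.h>

#define MAX_DS 256
extern const char *lookup_data_structure(uint64_t addr);  /* from the earlier sketch */
extern int ds_index(const char *name);                    /* name -> dense id (assumed helper) */

static uint64_t stall_cycles_by_ds[MAX_DS];

struct rob_entry { int valid; int is_mem; uint64_t effective_addr; };

/* Called once per simulated cycle, after the commit stage has run. */
void account_stall(int committed_this_cycle,
                   const struct rob_entry *oldest /* head of the ROB */)
{
    if (committed_this_cycle > 0 || !oldest || !oldest->valid)
        return;                                  /* not a stall cycle */

    /* Blame the data structure touched by the oldest in-flight instruction. */
    const char *ds = oldest->is_mem
                   ? lookup_data_structure(oldest->effective_addr)
                   : "<non-memory>";
    stall_cycles_by_ds[ds_index(ds)]++;
}
```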

Slide 19: Summary
- A toolchain for mapping addresses to high-level data structures
  – Communicates information from the application to the simulator
- Reveals new patterns about applications
  – Applications show a wide variety of miss distributions
  – Within an application, data structures have a variety of access patterns
  – Misses are not correlated with accesses or footprint…
  – …but they correlate well with data structure criticality