What we need to be able to count to tune programs


What we need to be able to count to tune programs
Mustafa M. Tikir, Bryan R. Buck, Jeffrey K. Hollingsworth

Cache Scope
- Measurement library
  - Uses the hardware monitors of the Itanium 2
  - Uses Dyninst to instrument the program: inserts calls to initialize, start measurement, and replace allocation calls
- Hardware monitors on the Itanium 2
  - Event Address Registers for L1D cache misses and FP loads
  - Interrupt on counter overflow
  - perfmon randomization feature used to access the registers
- Objects grouped into "stat buckets" (a conceptual sketch follows below)
  - Each variable is assigned its own stat bucket
  - The user can name buckets explicitly for heap allocations
- Results sorted by latency (rather than miss count)
  - View by stat bucket or by function
  - Filter by function or by stat bucket
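The following is a minimal sketch of the "stat bucket" idea: each variable or named heap allocation owns a bucket, and each sampled L1D miss is charged to the bucket covering the missed address. All names here (stat_bucket, bucket_register, record_sample) are invented for illustration; this is not the real Cache Scope interface.

```c
/* Hypothetical sketch of per-variable "stat buckets".  A sampled miss
 * (data address + latency, as delivered by the Event Address Registers
 * on counter overflow) is attributed to the bucket covering the address. */
#include <stdint.h>
#include <stdio.h>

#define MAX_BUCKETS 256

typedef struct {
    char      name[64];      /* variable name or user-supplied heap label */
    uintptr_t start, end;    /* address range covered by this bucket      */
    uint64_t  misses;        /* sampled L1D misses attributed here        */
    uint64_t  total_latency; /* sum of sampled miss latencies (cycles)    */
} stat_bucket;

static stat_bucket buckets[MAX_BUCKETS];
static int n_buckets;

/* Register a bucket, e.g. for a global variable at startup or for a
 * heap object when the replaced allocation call runs. */
void bucket_register(const char *name, void *start, size_t len) {
    stat_bucket *b = &buckets[n_buckets++];
    snprintf(b->name, sizeof b->name, "%s", name);
    b->start = (uintptr_t)start;
    b->end   = b->start + len;
}

/* Called from the sampling handler with the miss address and latency. */
void record_sample(uintptr_t miss_addr, uint64_t latency) {
    for (int i = 0; i < n_buckets; i++) {
        if (miss_addr >= buckets[i].start && miss_addr < buckets[i].end) {
            buckets[i].misses++;
            buckets[i].total_latency += latency;
            return;
        }
    }
}
```

Reports would then be sorted by total_latency per bucket rather than by miss count, matching the slide's emphasis on latency.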

Dynamic Page Migration
- User-level dynamic page migration
  - Profiling and page migration happen during the same run
- Application profiling
  - Gathers data from the Sun Fire Link hardware monitors
  - Samples interconnect transactions: transaction type + physical address + processor ID
  - Identifies the preferred location of each memory page: the memory local to the processor that accesses it most
- Page placement (see the sketch below)
  - The kernel moves pages to their preferred locations
  - At fixed time intervals, using the madvise system call
  - Pages are frozen for a while if recently migrated, which eliminates ping-ponging of memory pages
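Below is a hedged sketch of the per-interval migration decision. The original system used the Solaris madvise call on a Sun Fire platform; this sketch substitutes Linux's move_pages(2) from libnuma as a stand-in, and the per-page access counters and freeze window are hypothetical placeholders for the interconnect-sample profile described above.

```c
/* Sketch of the fixed-interval migration decision, using Linux
 * move_pages(2) in place of the Solaris madvise-based migration the
 * original system used (link with -lnuma).  per_node_accesses[] would be
 * filled from the sampled interconnect transactions; the freeze window
 * keeps a just-migrated page in place to avoid ping-ponging. */
#define _GNU_SOURCE
#include <numaif.h>
#include <stdint.h>
#include <time.h>

#define NUM_NODES   4
#define FREEZE_SECS 10   /* hypothetical freeze interval */

typedef struct {
    void    *addr;                          /* page-aligned virtual address */
    uint64_t per_node_accesses[NUM_NODES];  /* sampled accesses per node    */
    time_t   last_migrated;                 /* 0 if never migrated          */
} page_profile;

/* Preferred location: the node that issued the most sampled accesses. */
static int preferred_node(const page_profile *p) {
    int best = 0;
    for (int n = 1; n < NUM_NODES; n++)
        if (p->per_node_accesses[n] > p->per_node_accesses[best])
            best = n;
    return best;
}

/* Called at each fixed time interval for every profiled page. */
void maybe_migrate(page_profile *p) {
    time_t now = time(NULL);
    if (p->last_migrated && now - p->last_migrated < FREEZE_SECS)
        return;                              /* recently moved: leave frozen */

    int   node      = preferred_node(p);
    void *pages[1]  = { p->addr };
    int   nodes[1]  = { node };
    int   status[1];
    if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE) == 0
        && status[0] == node)
        p->last_migrated = now;
}
```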

NUMA-Aware Java Heaps
- NUMA-aware generations (see the sketch below)
  - Each generation is divided into segments, one per locality group on the system
  - Each segment is local to its locality group
  - NUMA-aware young and old generations are used for initial object allocation
- Dynamic object migration
  - Data from the hardware monitors is related back to heap objects
  - Identifies the preferred location of each object
- Evaluation using a hybrid execution simulator
  - Built on the underlying memory management libraries
  - Driven by representative parallel workloads from actual runs of server applications
  - Workloads generated from hardware-monitor data that samples memory accesses
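As a rough illustration of the segmented generation, the sketch below (plain C, not the actual JVM heap code the slide describes) keeps one bump-pointer segment per locality group and allocates from the segment local to the node the allocating thread is running on. The Linux calls sched_getcpu and numa_node_of_cpu are assumed for finding that node; everything else is invented for illustration.

```c
/* Conceptual sketch of a NUMA-aware generation: one segment per locality
 * group, allocation served from the segment local to the allocating
 * thread's node.  Illustrative only; link with -lnuma. */
#define _GNU_SOURCE
#include <numa.h>     /* numa_node_of_cpu() */
#include <sched.h>    /* sched_getcpu()     */
#include <stddef.h>

#define MAX_NODES 8

typedef struct {
    char  *base;     /* start of this node-local segment       */
    size_t used;     /* bump-pointer offset within the segment */
    size_t size;     /* segment capacity                       */
} heap_segment;

static heap_segment young_gen[MAX_NODES];   /* one segment per locality group */

/* Allocate from the segment local to the calling thread's current node. */
void *numa_aware_alloc(size_t bytes) {
    int node = numa_node_of_cpu(sched_getcpu());
    heap_segment *seg = &young_gen[node];
    if (seg->used + bytes > seg->size)
        return NULL;               /* a real heap would trigger GC or fall back */
    void *obj = seg->base + seg->used;
    seg->used += bytes;            /* simple bump-pointer allocation */
    return obj;
}
```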

What Else Could Monitors Do?
- Previous uses of hardware monitors
  - Information about the hardware components in the processor
  - Aimed at hardware designers
  - Hand-tuning of systems
- Our use of hardware monitors
  - Data-centric measurement of program behavior
  - Automatic tuning for memory access locality
- To be more beneficial for automatic tuning, monitors should provide
  - Information on the cause of events
  - Cache eviction information
  - More specialized information, such as Address Translation Counters for dynamic page migration

Cache Eviction Information
- Gives insight into interactions among data
  - Particularly useful for data layout optimizations
- The physical address of an evicted line is available to the hardware
  - It can be calculated from the tag of the evicted cache line (see the sketch below)
  - Information in the OS can map the physical address back to a virtual address
[Slide figure: CPU with L1 cache, L2 cache, and main memory; performance monitors record the miss count, the address of the last miss, and the tag of the last eviction, and an interrupt lets software translate the virtual address of the miss and the tag of the evicted data between virtual and physical addresses]
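The claim that the physical address "can be calculated from the tag" is simple arithmetic once the set index and cache geometry are known. A minimal sketch, assuming a physically indexed, physically tagged cache with an illustrative geometry (64-byte lines, 1024 sets):

```c
/* Reconstructing the physical address of an evicted line from its tag and
 * set index.  The geometry below is just an example. */
#include <stdint.h>

#define LINE_BITS 6    /* 64-byte cache lines */
#define SET_BITS  10   /* 1024 sets           */

/* physical address = tag | set index | line offset (offset unknown, use 0) */
static inline uint64_t evicted_line_paddr(uint64_t tag, uint64_t set_index) {
    return (tag << (SET_BITS + LINE_BITS)) | (set_index << LINE_BITS);
}
```

The OS can then use its page tables (or a reverse map) to translate that physical address back to the virtual address of the evicted data, as the slide's figure indicates.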

Address Translation Counters (ATC)
- Count access frequencies to pages by a processor
- A counter C_E for each TLB entry E (see the sketch below)
  - C_E <- 0 when the entry is loaded on a TLB miss, on TLB entry invalidation, on a context switch, or on a cache coherency control operation
  - C_E <- C_E + 1 on each virtual-to-physical address translation through the entry
[Slide figure: a TLB entry E with valid and dirty bits, physical address, and virtual address fields, extended with an ATC counter C_E]
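A small sketch of the counter semantics, written as software simulating what the proposed hardware would do; the struct layout and function names are illustrative, not a real TLB.

```c
/* Semantics of an Address Translation Counter attached to a TLB entry:
 * C_E is cleared whenever the entry is (re)loaded or invalidated, and
 * incremented on every translation through the entry. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     valid, dirty;
    uint64_t vpn;        /* virtual page number                           */
    uint64_t pfn;        /* physical frame number                         */
    uint64_t atc;        /* C_E: accesses to this page through this entry */
} tlb_entry;

/* TLB miss fill (also applies on invalidation, context switch, or
 * coherency operation): C_E <- 0 */
void tlb_load(tlb_entry *e, uint64_t vpn, uint64_t pfn) {
    e->valid = true;
    e->vpn   = vpn;
    e->pfn   = pfn;
    e->atc   = 0;
}

/* Every virtual-to-physical translation through E: C_E <- C_E + 1 */
uint64_t tlb_translate(tlb_entry *e, uint64_t vaddr, uint64_t page_size) {
    e->atc++;
    return e->pfn * page_size + vaddr % page_size;
}
```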

Gathering Information from ATC
- Sampling the TLB contents (sketched below)
  - Via system calls, at fixed time intervals
  - Returns the list of valid TLB entries: virtual address + ATC value
- Low-overhead traps by the OS
  - Taken on TLB entry eviction or invalidation
  - Deliver processor ID + virtual address + ATC value
- Additional fields in page table entries
  - Page table updated at context switch
  - A counter per processor for each page
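A hedged sketch of the first option follows: sample the TLB at fixed intervals through a system call and fold the (virtual address, ATC value) pairs into per-page, per-processor access counts. The read_tlb_entries interface is hypothetical, standing in for the OS/hardware support the slide proposes.

```c
/* Sketch of "sample the TLB via a system call at fixed intervals".
 * read_tlb_entries() is a hypothetical interface: it would fill buf with
 * up to max valid entries of the given CPU's TLB as (virtual address,
 * ATC value) pairs and return the number written. */
#include <stdint.h>
#include <unistd.h>

typedef struct {
    uint64_t vaddr;   /* virtual page address of the entry            */
    uint64_t atc;     /* accesses counted since the entry was loaded  */
} tlb_sample;

extern int read_tlb_entries(int cpu, tlb_sample *buf, int max);  /* hypothetical */

/* Fixed-interval sampling loop: accumulate per-page access counts per CPU,
 * which would later drive the page-placement decisions described earlier. */
void sample_loop(int ncpus, unsigned interval_usec,
                 void (*accumulate)(int cpu, uint64_t vaddr, uint64_t count)) {
    tlb_sample buf[128];
    for (;;) {
        for (int cpu = 0; cpu < ncpus; cpu++) {
            int n = read_tlb_entries(cpu, buf, 128);
            for (int i = 0; i < n; i++)
                accumulate(cpu, buf[i].vaddr, buf[i].atc);
        }
        usleep(interval_usec);
    }
}
```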

Conclusions
- Current hardware monitors are good at counting events (the result of a problem)
- We would like to
  - Count the causes of events (why the problem happened)
  - Gather more specialized information
- Future hardware monitors
  - Should support automatic tuning of programs
  - Must be sufficiently simple to get implemented
- Designing future monitors requires collaboration among application developers, system software designers, and processor architects

References
- Data Centric Cache Measurement on the Intel Itanium 2 Processor. Buck and Hollingsworth, SC'04.
- Using Hardware Counters to Automatically Improve Memory Performance. Tikir and Hollingsworth, SC'04.
- NUMA-Aware Java Heaps for Server Applications. Tikir and Hollingsworth, IPDPS'05.
- Data Centric Cache Measurement Using Hardware And Software Instrumentation. Bryan R. Buck, PhD thesis.
- Using Hardware Monitors to Generate Parallel Workloads. Tikir and Hollingsworth, under review for EuroPar'05.