Panthera: Holistic Memory Management for Big Data Processing over Hybrid Memories
Chenxi Wang, Huimin Cui, Ting Cao, John Zigman, Haris Volos, Onur Mutlu, Fang Lv, Xiaobing Feng, Guoqing Harry Xu

Presentation transcript:

Panthera: Holistic Memory Management for Big Data Processing over Hybrid Memories
Chenxi Wang, Huimin Cui, Ting Cao, John Zigman, Haris Volos, Onur Mutlu, Fang Lv, Xiaobing Feng, Guoqing Harry Xu
My name is Chenxi Wang, a postdoc in the CS department at UCLA. The title of my talk is: Panthera, holistic memory management for big data processing over hybrid memories. This work was done with Huimin Cui, Ting Cao, John Zigman, Haris Volos, Onur Mutlu, Fang Lv, Xiaobing Feng, and Harry Xu.

Big Data Workloads
Written in managed languages (e.g., Spark MLlib)
Current memory: DRAM
- Up to 40% of total server energy consumption
- Capacity density starts to hit its limit
Many big data workloads, such as Spark and Hadoop, are written in managed languages, and they require ever larger memory capacities. At the same time, we want to reduce the energy consumption of servers. However, DRAM accounts for up to 40% of total energy consumption, and this share continues to grow. Also, DRAM capacity density is starting to hit its fundamental limit.

Non-Volatile Memory (NVM)
Byte-addressable memory technology
Pros:
- Higher memory capacity density
- Lower price
- Negligible background energy consumption
Cons:
- Increased read/write latency
- Reduced bandwidth
Non-volatile memory is an emerging byte-addressable memory technology. Compared to DRAM, NVM has higher capacity density, lower price, and negligible background energy consumption. However, its performance is worse than DRAM's: it has higher memory access latency and lower bandwidth.

Hybrid Memory: DRAM + Non-Volatile Memory (NVM)
[Diagram: current architecture (cache / DRAM / SSD-HD) vs. hybrid architecture, with data divided by temperature — hot in cache, warm in DRAM, cold in NVM, backed by SSD/HD]
To combine the good performance of DRAM with the lower energy consumption of NVM, integrating DRAM and NVM into a hybrid memory system is a promising way to satisfy big data applications' requirements. Because NVM is slower than DRAM, the key challenge in using this new memory architecture is how to place data appropriately, so that we reduce energy consumption at minimal performance cost.

Hybrid Memory Management for Big Data: Opportunities & Challenges
Next, I will introduce the opportunities and challenges of hybrid memory management for big data applications.

Current Solution for Hybrid Memory Management
Divide the Java heap into a DRAM area and an NVM area*
Profile and migrate frequently accessed objects to DRAM*
[Diagram: Java heap with young and old generations split across DRAM and NVM; frequently accessed objects migrated to DRAM]
A common approach to managing hybrid memory in a managed runtime is to divide the Java heap into DRAM and NVM areas, then profile the access frequency of objects and migrate the frequently accessed ones from NVM to DRAM. If the goal is NVM durability, as in Write-Rationing GC, the runtime finds write-intensive objects and migrates them from NVM to DRAM, because NVM's write endurance is much worse than DRAM's. However, this approach is not suitable for big data applications running in the cloud: they allocate billions of objects, and online profiling at object granularity causes significant performance overhead.
Significant online profiling overhead!
[*] Write-rationing garbage collection for hybrid memories, Akram et al., PLDI'18

Data Characteristics in Big Data Systems
Application-level memory subsystems manage data at coarse granularity
Spark: distributed data collections (RDDs)
Different RDDs have different and clear access patterns:
- Temporary RDDs
- Frequently used RDDs
- Fault-tolerance RDDs
Big data systems such as Spark have application-level memory subsystems that perform memory management at coarse granularity. For example, Spark manages its data and computation at RDD granularity; an RDD is a distributed data collection defined by Spark. Different RDDs have different and clear access patterns: some are used only for fault tolerance, while others are frequently reused for computation, as the sketch below illustrates.
[Diagram: Spark memory management — execution memory, storage memory, off-heap memory]
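A minimal Spark sketch of those three access patterns (the job logic, names, and paths are illustrative assumptions, not from the paper):

    import org.apache.spark.{SparkConf, SparkContext}

    object AccessPatterns {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("patterns"))

        // Temporary RDD: consumed once to build the next stage.
        val lines = sc.textFile("hdfs://input/edges")

        // Frequently used RDD: cached and re-read every iteration (hot).
        val edges = lines.map(_.split("\t")).map(a => (a(0), a(1))).cache()

        // Fault-tolerance RDD: checkpointed only for recovery (cold).
        sc.setCheckpointDir("hdfs://ckpt")
        edges.checkpoint()

        for (_ <- 1 to 10) {
          val degrees = edges.groupByKey().mapValues(_.size) // temporary
          println(degrees.count())  // forces a re-read of edges each pass
        }
        sc.stop()
      }
    }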

Working with Big Data Characteristics
Use the characteristics of RDDs to do coarse-grained data division
Objects within one RDD share the same access pattern and lifetime => saves a lot of profiling overhead!
All the objects within one RDD have the same access pattern and lifetime, so we can use RDD characteristics to divide data at coarse granularity, which saves a lot of online profiling overhead. The problem, however, is that the runtime cannot see RDDs; it sees only a bunch of low-level Java objects. We therefore need to bridge the gap between the high-level data structures and the low-level Java objects.

Design
Next, I introduce our solution, Panthera.

Panthera: Holistic Memory Management System for Big Data Systems
- Map Java heap spaces to physical NVM/DRAM
- Data profiling: static inference + coarse-grained dynamic analysis
Based on OpenJDK, we developed Panthera, a holistic hybrid memory management system for big data applications running on a managed runtime, such as Spark. Panthera manages hybrid memory holistically. First, it obtains physical DRAM and NVM from the OS and maps Java heap spaces onto them. Second, it uses application-level semantics to divide objects into coarse-grained categories and migrates objects during GC. For data profiling, Panthera provides both static inference and coarse-grained dynamic analysis. Next, I will introduce the design details.

Static Inference of RDD Memory Tags
Data-intensive applications manage data in a coarse-grained manner (i.e., RDDs); data access patterns and lifetimes can be statically observed
Infer the access frequency of each RDD by def-use analysis:
- Mark hot RDDs with a DRAM tag
- Mark cold RDDs with an NVM tag

    var links = ctx.textFile..persist()            // DRAM tag
    for (i <- 1 to iters) {
      .....
      var contribs = links.join(..)..persist()     // NVM tag
    }
    (PageRank)

As we said, data-intensive applications manage data in a coarse-grained manner, and their data access patterns and lifetimes can be statically observed. So Panthera uses a static def-use analysis to infer the access frequency of RDDs. The code above is from Spark PageRank: the RDD links is defined outside the loop and frequently reused within it, so it is marked with a DRAM tag. The RDD contribs is defined in the loop, and each iteration generates new data for it; if the programmer persists contribs only for fault tolerance, it is marked with an NVM tag. Finally, we mark RDDs with different tags based on their access frequency; a fuller version of this fragment appears below.
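A fuller, hedged reconstruction of the PageRank fragment (parseNeighbors and the rank arithmetic are illustrative assumptions; only links, contribs, and their tags come from the slide):

    // `links` is defined once and read in every iteration -> DRAM tag.
    val links = ctx.textFile("hdfs://input/links")
      .map(parseNeighbors)          // parseNeighbors: hypothetical helper
      .persist()
    var ranks = links.mapValues(_ => 1.0)

    for (i <- 1 to iters) {
      // `contribs` is defined and consumed within one iteration and
      // persisted only for fault tolerance -> NVM tag.
      val contribs = links.join(ranks)
        .flatMap { case (_, (urls, rank)) =>
          urls.map(url => (url, rank / urls.size))
        }
        .persist()
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }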

Pass DRAM/NVM Tags via GC
It is not practical for a static analysis to find and mark every object within an RDD
GC traces all live objects from the roots
=> Utilize GC to propagate DRAM/NVM tags from RDD root objects to all reachable data objects
Each RDD consists of a bunch of low-level Java objects, and all the objects within an RDD should be placed into DRAM or NVM according to the RDD's tag, but it is not practical for a static analysis to mark them all. Garbage collection provides the opportunity to bridge the gap between application-level data structures and runtime objects: during each GC, the collector traces all live objects from the roots. We only need to tag the root object of an RDD, and GC then propagates the tag to every data object reachable from it. In this way, Panthera divides objects according to RDD semantics, and the static analysis and tag-propagation mechanisms incur zero online profiling overhead, as sketched below.
Zero online profiling overhead!
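A minimal sketch of tag propagation during a GC trace (illustrative Scala, not OpenJDK's actual collector; HeapObject, Tag, and the worklist are assumptions):

    sealed trait Tag
    case object DramTag extends Tag
    case object NvmTag  extends Tag

    class HeapObject(var tag: Option[Tag], val refs: Array[HeapObject])

    // Standard worklist trace from the roots; a tagged RDD root pushes
    // its tag down to every reachable object that is not yet tagged.
    def trace(roots: Seq[HeapObject]): Unit = {
      val work    = scala.collection.mutable.Stack[HeapObject]()
      val visited = scala.collection.mutable.Set[HeapObject]()
      roots.foreach(work.push)
      while (work.nonEmpty) {
        val obj = work.pop()
        if (visited.add(obj)) {
          for (child <- obj.refs) {
            if (child.tag.isEmpty) child.tag = obj.tag // inherit tag
            work.push(child)
          }
        }
      }
    }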

Dynamic Profiling of RDD Method Invocations
Low-overhead dynamic profiling mechanism:
- Monitor the number of method invocations on each RDD root object
- Update the tags of RDD root objects and propagate them to other objects during major GC
Although static analysis is good enough for big data applications, Panthera also provides a low-overhead dynamic profiling mechanism in case the static analysis is inaccurate. Panthera counts the method invocations on each RDD root object (e.g., RDD transformation operations) and writes this value into the object header. During each major GC, it updates the tags initially assigned by the static analysis and migrates objects according to their new tags, as in the sketch below.
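Continuing the sketch above, an illustrative re-tagging step at major GC (the threshold, field names, and reset policy are assumptions, not Panthera's actual constants):

    class RddRoot(tag0: Option[Tag], refs0: Array[HeapObject])
        extends HeapObject(tag0, refs0) {
      var invocations: Int = 0   // Panthera keeps this in the object header
    }

    // Called by instrumented RDD operations on the root object.
    def onRddMethodInvoked(root: RddRoot): Unit =
      root.invocations += 1

    // At major GC: re-tag the root, then let trace() (above) propagate
    // the new tag; the collector migrates objects whose tag changed.
    def retagAtMajorGC(root: RddRoot, hotThreshold: Int = 8): Unit = {
      root.tag = Some(if (root.invocations >= hotThreshold) DramTag else NvmTag)
      root.invocations = 0       // start a fresh profiling epoch
    }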

Data Placement in Panthera
Based on DRAM/NVM tags
[Diagram: Java heap with young and old generations mapped across DRAM and NVM]
Data placement is done based on the DRAM and NVM tags. Objects are allocated according to their tags, and whenever a tag changes, GC migrates the object according to its new tag.

Runtime Optimizations
Utilize application-level semantics for runtime optimizations:
- Eager promotion of RDD data objects
- Big-array optimization: alignment padding
With application-level semantics available, Panthera performs several runtime optimizations. For example, in eager promotion, all tagged objects belong to persisted RDDs and are long-lived, so Panthera promotes them to the old generation directly instead of waiting for them to survive several minor GCs (see the sketch below). For more details about these optimizations, please read the paper.
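An illustrative version of the eager-promotion decision at minor GC, reusing HeapObject from the earlier sketch (the tenuring threshold and age bookkeeping are assumptions; real promotion also copies the object and fixes references):

    def shouldPromoteEagerly(obj: HeapObject, survivedGCs: Int,
                             tenureThreshold: Int = 15): Boolean =
      obj.tag.isDefined ||            // tagged => persisted, long-lived RDD
      survivedGCs >= tenureThreshold  // otherwise the usual age-based rule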

Evaluation
Next, we introduce the evaluation.

Our Hybrid Memory Emulator Supports Big Data Apps
[Diagram: big data application pinned to the host CPU; local DRAM serves as DRAM, while remote DRAM across the QuickPath Interconnect (QPI), bandwidth-throttled via the thermal register, serves as emulated NVM]

                       DRAM   Emulated NVM
    Latency (ns)        120        300
    Bandwidth (GB/s)     30         10

Existing NVM simulators cannot support a commercial runtime such as OpenJDK well, so, based on a NUMA architecture, we developed a hybrid memory emulator that can run big data applications. We pin the application to one CPU; when it accesses the other CPU's DRAM, QPI makes the memory access latency much longer. In addition, inspired by Quartz, we use the thermal register to limit the bandwidth of remote memory. In the end, compared to DRAM, the emulated NVM has 2.5x the latency and one third of the bandwidth, which is consistent with projected NVM characteristics.
* Quartz: A Lightweight Performance Emulator for Persistent Memory Software, Volos et al., Middleware'15

Experiment Setup
Baseline: DRAM only
Comparisons:
- Panthera
- Unmanaged
  - Young generation: mapped to DRAM
  - Old generation: DRAM and NVM interleaved at a specific ratio (e.g., 1/3 DRAM)
  - At least as good as Write-Rationing GC*
Applications running on DRAM only serve as the baseline, and we compare Panthera against the Unmanaged version. For Unmanaged, the young generation is placed in DRAM, as Write-Rationing GC does, and the old generation is mapped to DRAM and NVM interleaved at a given DRAM ratio; we use different DRAM ratios in our experiments. The performance of the Unmanaged version is at least as good as Write-Rationing GC.
[*] Write-rationing garbage collection for hybrid memories, Akram et al., PLDI'18

Overall Results – Performance Overhead
64 GB heap, DRAM/total memory = 1/3: Panthera has only 4% performance overhead
[Chart: runtime normalized to the DRAM-only baseline; on average, Unmanaged is 1.21 and Panthera is 1.04]
This slide shows the overall performance overhead. All results are normalized to the DRAM-only baseline; the blue bars are Unmanaged and the orange bars are Panthera. Under this configuration (64 GB heap, 1/3 DRAM ratio), Panthera has only 4% performance overhead, while the Unmanaged version has 21%.

Overall Results – Energy Consumption
64 GB heap, DRAM/total memory = 1/3: Panthera saves 32% of energy consumption
[Chart: energy normalized to the DRAM-only baseline; average value shown: 0.73]
This slide shows the overall energy results. All results are normalized to the DRAM-only baseline; the blue bars are Unmanaged and the orange bars are Panthera. With the same configuration, Panthera saves 32% of energy consumption. We get even larger energy benefits with larger memory sizes: for example, with a 120 GB Java heap, Panthera saves 50% of energy consumption.

GC Performance
64 GB heap, DRAM/total memory = 1/3: Panthera has 1% GC performance overhead; Unmanaged has 59%
[Chart: GC time normalized to the DRAM-only baseline]
This slide shows the GC performance results. The Unmanaged version has 59% GC overhead relative to DRAM-only, because GC is highly parallel and is heavily bound by both NVM latency and bandwidth. Panthera removes both bottlenecks by doing two things: first, it moves frequently accessed, long-lived data from NVM to DRAM; second, with application-level semantics in hand, it performs runtime optimizations that eliminate unnecessary memory accesses. As a result, Panthera has only 1% GC performance overhead.

Mutator (Computation) Performance
64 GB heap, DRAM/total memory = 1/3: Panthera has 3% mutator performance overhead; Unmanaged has 6%
[Chart: mutator time normalized to the DRAM-only baseline]
This slide shows the mutator results. Although the mutators are also highly parallel, they do much more computation than GC and do not generate as much memory bandwidth pressure. Under the same configuration (64 GB heap, 1/3 DRAM ratio), Unmanaged has 6% mutator performance overhead relative to the baseline; after applying the optimizations, Panthera has only 3%.

Conclusions
Panthera: a holistic memory management system for hybrid memory
- Designed for big data applications
- Uses application-level semantics for coarse-grained data placement
- Reduces energy consumption significantly at a small time cost

Thanks! Q & A