NUMA Optimization of Java VM


Jaeyoung Jang (Prof. Jaewook Lee's lab), Sungkyunkwan University
Kick-off @ Yeongjong-do Sky Resort, 2016. 4. 22.

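Year 3 Development Goals and Plan
Goal: detailed design of scalability support for the runtime system on a monolithic operating system
Details:
- Research on improving the memory locality of the Java Virtual Machine (JVM) in NUMA environments
  - Implement and evaluate the Year 2 detailed design; if the KNL release is delayed, evaluate in an emulation environment
- Algorithm design for improving the scalability of the JVM garbage collector (GC)
  - Measure the GC scalability of the runtime system and analyze bottlenecks
  - Detailed design of runtime algorithms for improving GC scalability
Schedule (April through December), under the overall goal of detailed design of runtime-system scalability support on a monolithic operating system:
- Implement the detailed design for JVM memory locality in NUMA environments
- Evaluate the performance of the JVM memory locality improvements in NUMA environments
- Measure GC scalability and analyze bottlenecks
- Detailed design of the GC scalability improvement algorithm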

Research Background and Problem
- Future manycore platforms will be based on near-far memory systems.
  - e.g., Intel's KNL: 3D-stacked DRAM (near) + off-package DDR4 (far)
  - e.g., DDR4 DRAM (near) + Intel's 3D XPoint (far; NVM-based memory)
  - This is a novel kind of NUMA platform, different from conventional multi-socket NUMA.
- NUMA optimization is key to scalable performance on these platforms.
  - Especially for emerging memory-intensive workloads such as in-memory databases and massively parallel processing engines (e.g., Spark)

NUMA-aware JVM Optimization (1): Emulation Testbed
- Testbed: built using background traffic generators*
- A memory-intensive program slows down by about 5x (5.27x to 5.49x)
- Consistent with a recent report on Intel Knights Landing: over 5x difference between in-package MCDRAM and DDR4 for STREAM**
* [PACT '15] Mark Oskin et al., A Software-managed Approach to Die-stacked DRAM
** Reported at the ISC '15 IXPUG Workshop

NUMA-aware JVM Optimization (2): Workloads
- Memory-intensive workloads
  - Apache Spark: big data processing applications (PageRank, TeraSort)
  - DaCapo: large-heap benchmarks (h2, tradebeans, tradesoap)
- Performance potential: all-near vs. interleaved (1:8) memory allocation
  - 43% difference in execution time on average

Workload     Heap (GB)   LLC MPKI   # GCs (Full)
PageRank     8           1.94       89 (1)
TeraSort                 1.09       100 (1)
tradebeans   2           1.10       175 (9)
tradesoap                3.32       289 (15)
h2                       0.93       54 (6)

NUMA-aware JVM Optimization (3): Baseline Design
- Allocate objects in near memory first
  - Reduces allocation overhead
  - Benefits from near-memory accesses
- Additional capacity in far memory
  - Reduces young-generation GC (YGC) overhead
  - Utilizes redundant space in far memory
- Tenuring based on objects' hotness
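The near-first placement rule of the baseline design can be illustrated with a small sketch. It is not the presented JVM implementation: the class `NearFirstAllocator`, its byte counters, and the `Tier` enum are hypothetical names introduced here, and the model only tracks free capacity per tier rather than real memory regions.

```java
// Sketch only: models near-first placement with byte counters. A real JVM
// would bump-allocate inside memory regions bound to each tier.
final class NearFirstAllocator {
    enum Tier { NEAR, FAR, NONE }

    private long nearFree; // remaining bytes of near memory (hypothetical budget)
    private long farFree;  // remaining bytes of far memory (hypothetical budget)

    NearFirstAllocator(long nearBytes, long farBytes) {
        this.nearFree = nearBytes;
        this.farFree = farBytes;
    }

    /** Places a request in near memory if it fits, otherwise in far memory. */
    Tier allocate(long bytes) {
        if (bytes <= 0) {
            throw new IllegalArgumentException("bytes must be positive");
        }
        if (bytes <= nearFree) {
            nearFree -= bytes;
            return Tier.NEAR;
        }
        if (bytes <= farFree) {
            farFree -= bytes;
            return Tier.FAR;
        }
        return Tier.NONE; // neither tier has room; a collector would run here
    }

    long nearFree() { return nearFree; }

    long farFree() { return farFree; }
}
```

With 100 bytes of near memory and 200 bytes of far memory, a 60-byte request lands in near memory, a second 60-byte request spills to far memory, and a later 40-byte request still fits in the near-memory remainder.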

NUMA-aware JVM Optimization (4): Two Main Challenges
Two design issues for hotness-aware tenuring (migration):
- How to measure the hotness of an object?
  - Current: a counter in the object header counts the number of accesses (before-cache)
  - Problem: before-cache accesses != after-cache accesses
  - Directions:
    - Use before-cache accesses as a proxy metric for after-cache accesses (needs validation)
    - Use other metrics, such as an object's reuse distance
- How to keep the overhead of access counting manageable?
  - Current: every object access is counted exhaustively
  - Problem: too slow
  - Directions:
    - Reduce the overhead by applying sampling
    - Classify by the hotness of the allocation site (new X) instead of the individual object
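The two overhead-reduction directions (sampling, and tracking allocation sites rather than objects) can be combined in a rough sketch. Everything here is an assumption made for illustration: the class `SiteHotnessProfiler`, the fixed sampling period, the string site identifiers, and the hotness threshold are not from the slides, and the counts are before-cache accesses, so the proxy-metric caveat still applies.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch only: periodic sampling of accesses, aggregated per allocation site.
// Not thread-safe; a real profiler would use per-thread state.
final class SiteHotnessProfiler {
    private final int samplePeriod;   // record 1 of every samplePeriod accesses
    private final long hotThreshold;  // estimated accesses needed to call a site hot
    private final Map<String, Long> sampledCounts = new HashMap<>();
    private long accessesSeen;

    SiteHotnessProfiler(int samplePeriod, long hotThreshold) {
        if (samplePeriod < 1) {
            throw new IllegalArgumentException("samplePeriod must be >= 1");
        }
        this.samplePeriod = samplePeriod;
        this.hotThreshold = hotThreshold;
    }

    /** Called on each object access; only every samplePeriod-th access is recorded. */
    void onAccess(String allocationSite) {
        accessesSeen++;
        if (accessesSeen % samplePeriod == 0) {
            sampledCounts.merge(allocationSite, 1L, Long::sum);
        }
    }

    /** Scales the sampled count back up to an estimate of total accesses. */
    long estimatedAccesses(String allocationSite) {
        return sampledCounts.getOrDefault(allocationSite, 0L) * samplePeriod;
    }

    /** A site is treated as hot once its estimate reaches the threshold. */
    boolean isHot(String allocationSite) {
        return estimatedAccesses(allocationSite) >= hotThreshold;
    }
}
```

A tenuring policy could then keep survivors from hot sites in near memory and move survivors from cold sites to far memory. The trade-off is that a rarely accessed site may never be sampled at all, so its estimate stays at zero.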

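NUMA-aware JVM Optimization (5): Multi-Generational Heap Layout
Three questions about configuring the heap layout in a near-far memory system:
- Q1: How should a limited amount of near memory be assigned to each generation?
- Q2: How does the amount of available near memory relate to generation space sizing?
- Q3: When a single generation spans both near and far memory, what is the best way to lay it out?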

NUMA-aware JVM Optimization (6): Multi-Generational Heap Layout
Q1: How should a limited amount of near memory be assigned to each generation?
- In most cases it is beneficial to assign all available near memory to a single generation rather than distributing it over multiple generations.
- The favored generation is application-specific.
Figure panels: PageRank (favoring Young), h2 (favoring Old)

NUMA-aware JVM Optimization (7): Multi-Generational Heap Layout
Q2: How does the amount of available near memory relate to generation space sizing?
- Unless near memory is too small or too large, it is optimal to size the favored generation to match the available near memory (shown by the red arrows).
- If the available near memory is too small or too large, coupling the size of the favored generation to it is not a good strategy because of GC overhead (shown by the blue arrows).
Figure: PageRank
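The sizing rule for the favored generation can be written down as a small function. The slides do not say what size to use when near memory is too small or too large, so the clamping below is an assumption, as are the class name `FavoredGenerationSizer` and the `minGenBytes` and `maxGenBytes` bounds, which would have to be found empirically per application.

```java
// Sketch only: couples the favored generation's size to the available near
// memory, but falls back to hypothetical bounds at the extremes.
final class FavoredGenerationSizer {
    private FavoredGenerationSizer() { }

    /**
     * Returns nearBytes when it lies within [minGenBytes, maxGenBytes];
     * otherwise returns the nearest bound (assumed fallback, not from the slides).
     */
    static long favoredGenerationBytes(long nearBytes, long minGenBytes, long maxGenBytes) {
        if (minGenBytes > maxGenBytes) {
            throw new IllegalArgumentException("minGenBytes must not exceed maxGenBytes");
        }
        return Math.max(minGenBytes, Math.min(nearBytes, maxGenBytes));
    }

    /** True when the generation size is simply coupled to the near memory size. */
    static boolean isCoupled(long nearBytes, long minGenBytes, long maxGenBytes) {
        return favoredGenerationBytes(nearBytes, minGenBytes, maxGenBytes) == nearBytes;
    }
}
```

For example, with bounds of 512 MB and 4096 MB, 2048 MB of near memory gives a 2048 MB favored generation, while 128 MB of near memory is decoupled and the generation stays at the 512 MB lower bound.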

NUMA-aware JVM Optimization (8): Multi-Generational Heap Layout
Q3: When a single generation spans both near and far memory, what is the best way to lay it out?
- In most cases, placing the near memory at the top of the heterogeneous generational space (first-chunk) achieves slightly better performance than interleaving near and far memory (interleaving).
Figure: execution time difference between the first-chunk policy and the interleaving policy, plotted against the Young:Old ratio
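The difference between the two placement policies can be made concrete by mapping each page of a generation to a memory tier. This is a sketch under stated assumptions: the class `GenerationPlacement` is hypothetical, "top of the space" is taken to mean the lowest page indices, and the interleaving formula is just one way to spread near pages evenly, not necessarily the one used in the evaluation.

```java
// Sketch only: decides which tier backs each page of a generation that
// spans near and far memory, under two placement policies.
final class GenerationPlacement {
    enum Tier { NEAR, FAR }

    private GenerationPlacement() { }

    /** First-chunk: the first nearPages pages are near, all later pages are far. */
    static Tier firstChunk(long page, long nearPages, long totalPages) {
        check(page, nearPages, totalPages);
        return page < nearPages ? Tier.NEAR : Tier.FAR;
    }

    /** Interleaving: near pages are spread evenly across the whole range. */
    static Tier interleaved(long page, long nearPages, long totalPages) {
        check(page, nearPages, totalPages);
        // A page is near exactly when the running quota of near pages
        // increases across it; this yields nearPages near pages in total.
        long quotaBefore = page * nearPages / totalPages;
        long quotaAfter = (page + 1) * nearPages / totalPages;
        return quotaAfter > quotaBefore ? Tier.NEAR : Tier.FAR;
    }

    private static void check(long page, long nearPages, long totalPages) {
        if (totalPages <= 0 || nearPages < 0 || nearPages > totalPages
                || page < 0 || page >= totalPages) {
            throw new IllegalArgumentException("invalid page layout arguments");
        }
    }
}
```

With 3 near pages out of 9, first-chunk marks pages 0 to 2 as near, while this interleaving marks pages 2, 5, and 8. Both policies use the same amount of near memory, so any performance gap comes from where in the generation the near pages sit.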

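Summary: Status of Implementation Deliverables and Future Plans
NUMA-optimized JVM
☐ Refine and improve the heap allocation heuristic design (in progress)
☐ Implement the heap allocation heuristic (in progress)
☐ Evaluate the performance of the NUMA-optimized JVM (planned)
☐ Analyze the garbage collector (GC) code and optimize its algorithms (planned)
Year 3 deliverables (targets)
- Papers: top-tier conference (0.5 papers), SCI journal or strong conference (1 paper)
- Detailed design document: detailed design for improving GC scalability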