A Comprehensive Study of Java HPC on Intel Many-core Architecture
Yang Yu, Tianyang Lei, Haibo Chen, Binyu Zang
Fudan University, China / Shanghai Jiao Tong University, China
Institute of Parallel and Distributed Systems

HPC and Many-core Architectures
High-performance computing (HPC) continually evolves
□ Spreads across all practical fields
□ Massive parallel processing
□ Strong computing power
HPC stimulates new processor architectures
□ More cores on a single chip
□ GPUs, Xeon Phi, etc.

Java on HPC
□ Easy and portable programmability
□ Built-in multithreading mechanism
□ Strong community and corporate support
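Java's built-in multithreading needs nothing beyond java.lang.Thread. The sketch below is my illustration, not from the slides: it splits a reduction across worker threads in Java-7-compatible style, matching the OpenJDK 7 runtime used in this work.

// Illustrative sketch (not from the slides): data-parallel work split
// across java.lang.Thread workers, written in Java-7-compatible style
// to match the OpenJDK 7u6 runtime ported in this work.
public class ParallelSum {
    public static void main(String[] args) throws InterruptedException {
        final int nThreads = Runtime.getRuntime().availableProcessors();
        final double[] data = new double[1 << 20];
        java.util.Arrays.fill(data, 1.0);

        final double[] partial = new double[nThreads];
        Thread[] workers = new Thread[nThreads];
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    int chunk = data.length / nThreads;
                    int lo = id * chunk;
                    int hi = (id == nThreads - 1) ? data.length : lo + chunk;
                    double s = 0.0;
                    for (int i = lo; i < hi; i++) s += data[i];
                    partial[id] = s;  // one slot per thread: no shared-state contention
                }
            });
            workers[t].start();
        }
        double sum = 0.0;
        for (int t = 0; t < nThreads; t++) {
            workers[t].join();
            sum += partial[t];
        }
        System.out.println("sum = " + sum);
    }
}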

Gap between Java HPC and Many-core
Prior work focuses on running Java on GPUs
□ JCUDA, Aparapi, JOCL, etc.
□ Convert Java bytecodes into CUDA/OpenCL
Deficiencies
□ The managed runtime itself does not run on the many-core device
□ Cannot utilize Java's good features
No official support for Java on Intel's MIC

Agenda
□ Bridge the gap
□ Experiments
□ Observations
□ Semi-automatic vectorization

Intel Xeon Phi Coprocessor
Intel® Knights Corner (KNC)
□ More than 60 in-order coprocessor cores, ~1 GHz
□ Based on the x86 ISA, extended with new 512-bit wide SIMD vector instructions and registers
Each coprocessor core
□ Supports 4 hardware threads
□ 32 KB L1 data & instruction caches, 512 KB L2 cache
No traditional LLC
□ Interconnected L2 caches
□ Memory controllers
□ Bidirectional ring bus
[Figure: architecture overview of an Intel® MIC Architecture core]

Java Platform
OpenJDK
□ A free and open-source implementation of the Java Platform, Standard Edition (Java SE)
□ Consists of HotSpot (the virtual machine), the Java class library, the javac compiler, etc.
Execution engine – HotSpot VM
□ Executes Java bytecodes in class files
□ Class loader, Java interpreter, just-in-time (JIT) compiler, garbage collector, etc.
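A quick way to see the interpreter-then-JIT pipeline at work (my illustration, not from the slides): run the class below with the standard HotSpot flag -XX:+PrintCompilation and watch HotSpot report when the hot method gets compiled.

// Run with:  java -XX:+PrintCompilation HotLoop
// HotSpot first interprets hot(); once the loop has executed enough
// times, the JIT compiles it, and PrintCompilation logs that moment.
public class HotLoop {
    static double hot(double[] a) {
        double s = 0.0;
        for (int i = 0; i < a.length; i++) s += a[i] * a[i];
        return s;
    }
    public static void main(String[] args) {
        double[] a = new double[4096];
        java.util.Arrays.fill(a, 0.5);
        double sink = 0.0;
        for (int iter = 0; iter < 20000; iter++) sink += hot(a);
        System.out.println(sink); // keep the result live so the loop is not dead-code eliminated
    }
}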

Challenges
Lack of dependent libraries for cross-building
□ Libraries related to graphics, fonts, etc.
The μOS on Xeon Phi is oversimplified
□ Lacks necessary tools for developing and debugging
Incompatibility between HotSpot's assembly library and the Xeon Phi ISA
□ Floating-point related, SSE and AVX
□ mfence, clflush, etc.

Porting OpenJDK to Xeon Phi
Lack of dependent libraries for cross-building
□ A "headless" build of OpenJDK – no graphics support
The μOS on Xeon Phi is oversimplified
□ Cross-compile missing tools from source packages
Incompatibility between HotSpot's assembly library and the Xeon Phi ISA
□ 512-bit vector instructions & legacy x87 instructions
□ Fine-grained modification based on the semantics in HotSpot
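One small sanity check on a headless build (my illustration, using only the standard java.awt API): a graphics-free runtime reports itself as headless through GraphicsEnvironment.

// Quick check on the ported, headless JDK: standard API, no
// Xeon-Phi-specific calls. On a graphics-free build this prints "true".
public class HeadlessCheck {
    public static void main(String[] args) {
        System.out.println(java.awt.GraphicsEnvironment.isHeadless());
    }
}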

Agenda
□ Bridge the gap
□ Experiments
□ Observations
□ Semi-automatic vectorization

Environment

Parameter              | Intel Xeon Phi(TM) Coprocessor 5110P | Intel(R) Xeon(R) CPU E
Chips                  | 1                             | 1
Physical cores         | 60                            | 6
Threads per core       | 4                             | 2
Frequency              | ~1 GHz                        | 2.00 GHz
Data caches            | 32 KB L1, 512 KB L2 per core  | 32 KB L1d, 32 KB L1i, 256 KB L2 per core; 15 MB L3 (shared)
Memory capacity        | 7697 MB                       | 32 GB
Memory technology      | GDDR5                         | DDR3
Peak memory bandwidth  | 320 GB/s                      | 42.6 GB/s
Vector length          | 512 bits                      | 256 bits (Intel AVX)
Memory access latency  | 340 cycles                    | 140 cycles

Experiment Setup
Java environment and benchmarks
□ OpenJDK 7u6 (build b24)
□ Thread version 1.0 of the Java Grande benchmark suite
→ Crypt, Series, SOR, SparseMatmult, LUFact
Single-threaded execution
□ Java and C versions
□ C compiled with -no-vec, -no-opt-prefetch, -no-fma
Multi-threaded execution
□ Application threads pinned evenly onto each physical core
→ 1, 20, 40, 60*, 120, 180 and 240 threads on Xeon Phi
→ 1, 2, 4, 6*, 9 and 12 threads on CPU
□ Average of 5 iterative runs for each benchmark-thread pair
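A hedged sketch of the measurement discipline above: each benchmark-thread pair is timed over five runs and averaged. The runOnce() helper and the placeholder Runnable are my stand-ins, not Java Grande code, and core pinning happens at the OS level, outside Java.

// Sketch only: RUNS mirrors "average of 5 iterative runs"; the
// benchmark body is a placeholder, not a Java Grande kernel.
public class Harness {
    static final int RUNS = 5;

    static double runOnce(Runnable benchmark) {
        long t0 = System.nanoTime();
        benchmark.run();
        return (System.nanoTime() - t0) / 1e9; // elapsed seconds
    }

    public static void main(String[] args) {
        Runnable benchmark = new Runnable() {  // stand-in for Crypt, SOR, LUFact, ...
            public void run() {
                double s = 0.0;
                for (int i = 0; i < 50_000_000; i++) s += Math.sqrt(i);
                if (s < 0) System.out.println(s); // keep the work from being optimized away
            }
        };
        double total = 0.0;
        for (int r = 0; r < RUNS; r++) total += runOnce(benchmark);
        System.out.printf("mean time over %d runs: %.3f s%n", RUNS, total / RUNS);
    }
}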

Benchmark Characteristics
[Table of benchmark characteristics in the original slides]

Bridge the gap ExperimentsObservations Semi-automatic vectorization Agenda 14

Single-threaded Performance – CPU vs. MIC
Why the CPU wins single-threaded:
□ Memory latency: 140 vs. 340 cycles
□ Instruction decoder: 4 decoder units vs. a two-cycle decode unit
□ Execution engine: out-of-order vs. in-order
□ Clock frequency: 2.0 GHz vs. ~1 GHz
[Charts: Java and C single-threaded performance on CPU vs. MIC]

Single-threaded Performance – CPU vs. MIC (cont.)
□ On-chip caches are critical to performance
□ JVM memory management: TLAB, garbage collector
□ Porting overhead

Scalability of Multi-threads
□ Much better scalability for all programs is observed on Xeon Phi than on the CPU
□ On Xeon Phi, throughput increases for all programs up to 120 threads
□ SparseMatmult keeps scaling up to 240 threads on Xeon Phi
□ Crypt stops scaling once more than two threads run per core
[Scalability charts for CPU and MIC]

Throughputs
[Throughput charts in the original slides]

Optimizing Solutions
□ Enable 512-bit vectorization
□ Software prefetching in the JIT
□ Optimization for the in-order execution mode

Agenda
□ Bridge the gap
□ Experiments
□ Observations
□ Semi-automatic vectorization

Auto-vectorization in HotSpot
[Diagram of HotSpot's auto-vectorization pipeline on the x86 platform]

Restrictions
[Restrictions of HotSpot auto-vectorization, shown in the original slides]

Semi-automatic Vectorization
Front-end scheme in javac
□ Annotation before the innermost loop
□ New "vector bytecodes"
Implementation in HotSpot
□ Parse the "vector bytecodes"
□ Generate 512-bit vector instructions
□ Meet 64-byte alignment
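The slides do not show the annotation's name or syntax, so the sketch below is a hypothetical reconstruction: the programmer marks a daxpy-style innermost loop (the kind that dominates LUFact), a modified javac emits the new vector bytecodes for it, and HotSpot lowers them to 512-bit instructions with 64-byte-aligned accesses. Standard javac only accepts annotations on declarations, so the loop-level marker is shown as a comment.

// Hypothetical sketch: "@Vectorize" is an assumed name; the real
// annotation and its placement are defined by the modified javac.
public class Daxpy {
    static void daxpy(int n, double a, double[] x, double[] y) {
        // @Vectorize  <-- assumed loop-level marker for the modified javac;
        // the compiler would emit "vector bytecodes" here, which HotSpot
        // turns into 512-bit vector instructions (8 doubles per operation),
        // keeping the x[] and y[] accesses 64-byte aligned.
        for (int i = 0; i < n; i++) {
            y[i] = y[i] + a * x[i];
        }
    }
}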

Speedup of Throughput
[Chart: throughput of LUFact with a varying number of threads]

Throughput Comparison – CPU & MIC
□ Performance gain from vectorization for LUFact: >3x
[Comparison charts in the original slides]

Conclusions
First porting of OpenJDK to the Intel Xeon Phi coprocessor
□ A complete Java runtime environment built on a modern many-core architecture
A comprehensive study of the performance of Java HPC benchmarks on Xeon Phi
□ Single-threaded and multi-threaded runs
□ Throughput and scalability
A semi-automatic vectorization scheme in the HotSpot VM
□ Up to 3.4x speedup for LUFact on Xeon Phi compared to the CPU

Thanks
Questions?