The Java profiler based on byte code analysis and instrumentation for many-core hardware accelerators
Marcin Pietroń 1,2, Michał Karwatowski 1,2, Kazimierz Wiatr 1,2
1 AGH University of Science and Technology, al. Mickiewicza 30, Kraków
2 ACK Cyfronet AGH, ul. Nawojki 11, Kraków
RUC Kraków

Agenda
- GPU acceleration
- Code analysis and instrumentation
- Experiments
- Results
- Conclusion and future work

GPU as modern hardware accelerators
- Computing power (over 1 TFLOPS)
- Availability
- High parallelism (SIMT architecture)
- High-level programming tools (CUDA, OpenCL)

GPU hardware accelerators
A number of algorithms from different domains have been implemented on GPUs:
- Linear algebra (e.g. cuBLAS, CULA)
- Deep learning, neural networks, machine learning algorithms (e.g. SVM)
- Computational intelligence (e.g. genetic and memetic algorithms)
- Data and text mining

Code analysis
- Implementation should be preceded by appropriate analysis
- The analysis can be automated
- Static analysis finds hidden parallelism (Banerjee test, Range test, Omega test) and opportunities for data reuse and distribution (see the sketch below)
- Profiling serves as the dynamic analysis
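The deck names these dependence tests without showing them; as an illustration, here is a minimal sketch of the classic GCD dependence test, a simpler relative of the Banerjee and Omega tests listed above (the class and method names are hypothetical, not taken from the profiler described here). For two accesses a[x*i + y] and a[u*i + v] inside the same loop, a cross-iteration dependence is possible only if gcd(x, u) divides v - y.

```java
// Minimal sketch of the GCD dependence test: the equation x*i1 + y = u*i2 + v
// has integer solutions only if gcd(x, u) divides (v - y). If it does not,
// the two array accesses can never touch the same element across iterations.
public final class GcdTest {
    public static boolean mayDepend(int x, int y, int u, int v) {
        int g = gcd(Math.abs(x), Math.abs(u));
        if (g == 0) {
            return y == v; // both subscripts are constants
        }
        return (v - y) % g == 0;
    }

    private static int gcd(int a, int b) {
        return b == 0 ? a : gcd(b, a % b);
    }
}
```

For example, a[2*i] and a[2*i + 1] can never alias (gcd(2, 2) = 2 does not divide 1), so a loop containing only that pair of accesses carries no dependence between them.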

Byte code analysis and instrumentation
- Byte code is analyzed just in time
- Appropriate instrumentation for profiling and static analysis (one possible mechanism is sketched below)
- Results of the analysis and profiling can be used to guide the implementation
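The slides do not show how the monitoring code is injected; one possible mechanism is sketched below using the ASM bytecode library: a class visitor that inserts a call to a counting method before every int-array load and store. The profiler/Runtime owner class and its countArrayRead/countArrayWrite methods are hypothetical stand-ins, not the actual profiler runtime.

```java
import org.objectweb.asm.ClassReader;
import org.objectweb.asm.ClassVisitor;
import org.objectweb.asm.ClassWriter;
import org.objectweb.asm.MethodVisitor;
import org.objectweb.asm.Opcodes;

// Rewrites a class so that every int-array read (iaload) and write (iastore)
// first calls a static counter in a hypothetical profiler runtime class.
public class ArrayAccessInstrumenter extends ClassVisitor {

    public ArrayAccessInstrumenter(ClassVisitor next) {
        super(Opcodes.ASM9, next);
    }

    @Override
    public MethodVisitor visitMethod(int access, String name, String desc,
                                     String signature, String[] exceptions) {
        MethodVisitor mv = super.visitMethod(access, name, desc, signature, exceptions);
        return new MethodVisitor(Opcodes.ASM9, mv) {
            @Override
            public void visitInsn(int opcode) {
                if (opcode == Opcodes.IALOAD) {
                    // Count the array read before it happens.
                    super.visitMethodInsn(Opcodes.INVOKESTATIC,
                            "profiler/Runtime", "countArrayRead", "()V", false);
                } else if (opcode == Opcodes.IASTORE) {
                    // Count the array write before it happens.
                    super.visitMethodInsn(Opcodes.INVOKESTATIC,
                            "profiler/Runtime", "countArrayWrite", "()V", false);
                }
                super.visitInsn(opcode);
            }
        };
    }

    // Typical usage: read the original bytes, pass them through the visitor
    // chain, and obtain the instrumented class bytes.
    public static byte[] instrument(byte[] classBytes) {
        ClassReader reader = new ClassReader(classBytes);
        ClassWriter writer = new ClassWriter(reader, ClassWriter.COMPUTE_MAXS);
        reader.accept(new ArrayAccessInstrumenter(writer), 0);
        return writer.toByteArray();
    }
}
```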

System architecture
[slide contains only the architecture diagram, which is not preserved in the transcript]

Byte code instrumentation
- instrumenting array data read instructions
- instrumenting array data write instructions
- instrumenting array data read and write instructions to count the number of accesses and their standard deviation (a sketch of the counting runtime follows)
- instrumenting single-variable reads and writes to count the number of accesses
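The slides do not show the counting runtime itself; below is a minimal sketch of a per-array statistics record, assuming Welford's online algorithm for the standard deviation of accessed indices. The class and method names are hypothetical, not taken from the profiler described here.

```java
// Hypothetical per-array record kept by the profiler runtime: it counts
// accesses and maintains a running standard deviation of the accessed
// indices using Welford's online algorithm.
public final class AccessStats {
    private long n;          // number of recorded accesses
    private double mean;     // running mean of accessed indices
    private double m2;       // running sum of squared deviations

    public synchronized void record(int index) {
        n++;
        double delta = index - mean;
        mean += delta / n;
        m2 += delta * (index - mean);
    }

    public synchronized long count() {
        return n;
    }

    public synchronized double stdDev() {
        // Sample standard deviation; 0 until at least two accesses are seen.
        return n > 1 ? Math.sqrt(m2 / (n - 1)) : 0.0;
    }
}
```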

Byte code instrumentation

Original loop:

```java
for (int i = 1; i < 100; i++) {
    test_1[i] = 100;
    test_2[i] = test_1[i-1] + 10;
}
```

Instrumented loop (the shadow array test_1_mon records the iteration in which each element of test_1 was written, so the dependence distances observed at the read of test_1[i-1] can be collected in dist_vectors):

```java
for (int i = 1; i < 100; i++) {
    test_1[i] = 100;
    test_1_mon[i] = i;
    test_2[i] = test_1[i-1] + 10;
    if (test_1_mon[i-1] < i) {
        dist_vectors[i-1] = i - test_1_mon[i];
    }
}
```

Corresponding byte code (fragment):

```
27: iconst_1
28: istore 6
30: iload 6
32: bipush 100
34: if_icmpge 93
37: aload_1
38: iload 6
40: bipush 100
42: iastore
43: aload_3
44: iload 6
46: iload 6
48: iastore
...
70: if_icmpge 87
73: aload 5
75: iload 6
77: iconst_1
78: isub
79: iload 6
81: aload_3
82: iload 6
84: iaload
85: isub
86: iastore
87: iinc 6, 1
90: goto 30
```

GPU implementation rules
- Data reused across iterations (i.e. shared between threads) should be transferred to shared memory
- Data reused only within a single iteration should be kept in local memory (registers)
- Data that is reused, read-only, and accessed irregularly should be allocated in texture memory

GPU implementation rules
- Constant values common to all threads should be written to constant memory
- Data accessed only once, but without coalescing, should be staged as a group in a coalesced manner into shared memory and then read from there for further computation

JCuda generation
- Implementation can be done manually or in a partly automated way
- The rules above generate parallel code patterns (see the sketch below)
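The slides do not include generated code; as a hedged illustration, here is a minimal JCuda host-side sketch of the kind of code such patterns produce: it loads a precompiled kernel from a PTX file and launches it over a 1-D grid. The file name kernel.ptx, the kernel name scale, and the parameter layout are assumptions, not output of the generator described here.

```java
import jcuda.Pointer;
import jcuda.Sizeof;
import jcuda.driver.*;
import static jcuda.driver.JCudaDriver.*;

public class LaunchSketch {
    public static void main(String[] args) {
        // Standard JCuda driver-API setup.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load a precompiled kernel (hypothetical file and kernel name).
        CUmodule module = new CUmodule();
        cuModuleLoad(module, "kernel.ptx");
        CUfunction kernel = new CUfunction();
        cuModuleGetFunction(kernel, module, "scale");

        // Copy input data to the device.
        int n = 1 << 20;
        float[] host = new float[n];
        CUdeviceptr devData = new CUdeviceptr();
        cuMemAlloc(devData, (long) n * Sizeof.FLOAT);
        cuMemcpyHtoD(devData, Pointer.to(host), (long) n * Sizeof.FLOAT);

        // Launch over a 1-D grid: one thread per element.
        int blockSize = 256;
        int gridSize = (n + blockSize - 1) / blockSize;
        Pointer params = Pointer.to(Pointer.to(new int[]{n}), Pointer.to(devData));
        cuLaunchKernel(kernel,
                gridSize, 1, 1,      // grid dimensions
                blockSize, 1, 1,     // block dimensions
                0, null,             // shared memory size, stream
                params, null);       // kernel arguments
        cuCtxSynchronize();

        // Copy the result back and release device memory.
        cuMemcpyDtoH(Pointer.to(host), devData, (long) n * Sizeof.FLOAT);
        cuMemFree(devData);
    }
}
```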

Experimental results

size of matrix | GPU time [ms] | CPU time (MKL BLAS) [ms]
256×… | … | …
(remaining rows and all timing values were not preserved in the transcript)

Conclusions and future work
- Preceding the implementation with source code analysis helps adapt an algorithm to the GPU
- Automated generation of parallel GPU code saves a lot of time
- Because the approach works on byte code, it is portable
- Code-generation optimizations (memory access patterns) must be developed further in our system

Questions