Computing Platform Benchmark By Boonyarit Changaival King Mongkut’s University of Technology Thonburi (KMUTT)



Introducing Myself!  A summer student at CERN this year  Worked on the ALICE O2 project  GPU benchmarking for the ITS Cluster Finder  Carrying this summer project on as a Master's thesis  Computing Platform Benchmark, with two advisors  Prof. Tiranee Achalakul, KMUTT  Mr. Sylvain Chapeland, ALICE O2, CERN  Studying the platforms through various implementations (CUDA, C, OpenCL) of ALICE applications

Outline  ALICE Upgrade  Why Platform Benchmarking?  Survey Discussion  Example Applications  Evaluation Method  Initial Result  Conclusion

ALICE Upgrade  Expected to be installed in 2018  What's new?  Improved read-out rate  Peaking at 1 TB/s  Improved impact parameter resolution  Improved tracking efficiency through increased granularity  Improved computing system  Data processed online

Upgraded System Architecture

 First Level Processor (FLP)  Connected to the read-out receivers at the detector  Groups and aggregates the data of each particle collision inside the ring (reducing the data volume)  Event Processing Node (EPN)  Performs calculation and reconstruction for the physics experiments  Receives processed data from the FLPs

Why benchmarking?  To find out which platform produces the highest throughput for ALICE applications  Each platform gets its own implementation for an optimum result  The end result will be used to suggest the most suitable platform for each ALICE application type

Targeted Accelerators  Graphics Processing Unit (GPU)  High performance per cost and good energy efficiency  Widely accepted and used to accelerate scientific applications  Many Integrated Core (MIC)  Fewer cores than a GPU, but each is more powerful  Highly portable code (compared to CUDA & OpenCL)  Accelerated Processing Unit (APU)  CPU + GPU on the same chip  The GPU can access CPU memory directly  Low energy consumption

Project Objectives  To study the potential performance of each accelerator for ALICE applications  To study the factor(s) that affect application performance on each accelerator  To study the performance of OpenCL on all targeted accelerators  To study the tradeoffs between the accelerators

Questions  The results should answer these questions:  What is the performance overhead of OpenCL and CUDA? Is it worth the portability tradeoff?  Which accelerator produces the best result with an OpenCL implementation?  Which accelerators should be suggested for integration into the upgraded ALICE system?

Survey Discussion  Several related works have been done  “A CPU, GPU, FPGA System for X-ray Image Processing using High-speed Scientific Cameras” (Binotto et al., 2013)  “Accelerating Geospatial Applications on Hybrid Architectures” (Lai et al., 2013)  “MIC Acceleration of Short-Range Molecular Dynamics Simulations” (Wu et al., 2010)  Face detection, ocean surface simulation, dwarf benchmarks, and the like

Survey Discussion  Yet, they do not map directly onto the ALICE applications  Different data formats  Different algorithms and problem specifications  To optimize the result, it is better to work with the real problem definitions

Application Categories  Categorized into 3 categories  Data intensive  Computing intensive  Communication intensive  Communication-intensive applications are not present in ALICE  Only data-intensive and computing-intensive applications will be the focus

Data Intensive  High dependency between the elements in the data  Data needs to be accessed and updated multiple times  Example  ITS Cluster Finder  Puts particle hits into groups (clusters)  Calculates the “Center of Gravity” (CoG) of each cluster  Discards the individual coordinates and uses only the CoG to represent the cluster

Computing Intensive  Most of the work is computation  Little to no dependency between elements  Sometimes an embarrassingly parallel approach can be used  Example  TPC Track Identification  Uses a Hough Transform to identify tracks  A true computing-intensive application  Highly parallelizable

Design of Experiment  Responses  Throughput  Scalability  Control factors: platform type and language  GPU (CUDA and OpenCL)  MIC (C and OpenCL)  APU (OpenCL)  Blocking factor: application category  Data intensive and computing intensive

Design of Experiment  Experiment plan  Throughput analysis  Scalability analysis  Vary the number of threads  Plot the throughput against the number of threads  The trend of the graph determines the scalability

Evaluation  Throughput  Set a baseline performance  Using the CPU result  Compute the speedup over the baseline  Determine the most suitable accelerator from the highest throughput  Scalability  Fixed input size with a varied number of threads  Varied input size with a fixed number of threads  The throughput should rise as the number of threads increases  The peak performance should be maintained as the input size increases

Initial Result  ITS Cluster Finder on a Tesla K20Xm

Initial Result  The OpenCL implementation of the ITS Cluster Finder was completed  It showed results similar to CUDA  The APU and MIC are not yet tested  The next step is to improve it with the pipeline method

Discussion  The high dependency makes it hard to work efficiently on the GPU  GPUs provide very little in-kernel synchronization  Not among the GPU's specialties: only load, compare, and store  Data-intensive applications should perform better on the MIC (speculation)  Data intensive can then be separated into two sub-categories  With dependency and without dependency

Expected Milestones  January 2015  Optimize the CUDA and OpenCL implementations of the Cluster Finder  Test the C implementation of the Cluster Finder on the MIC  Study the TPC Track Identification problem definition and design  February 2015  Complete all implementations of TPC Track Identification  Acquire more example applications for implementation

Conclusion  The ALICE Upgrade calls for a high performance computing system  To cope with the higher read-out rate  Online processing  Accelerators are to be integrated to increase the throughput  A benchmark is run to suggest the most suitable platform  Using ALICE applications as the benchmark  GPU, MIC, and APU