Development of a track trigger based on parallel architectures
Felice Pantaleo, PH-CMG-CO (University of Hamburg)

Presentation transcript:

Development of a track trigger based on parallel architectures
Felice Pantaleo, PH-CMG-CO (University of Hamburg)
Supervisors: B. Hegner (CERN), V. Innocente (CERN), A. Meyer (DESY), A. Pfeiffer (CERN), A. Schmidt (University of Hamburg)

Outline
Track Trigger
Parallel Computer Architectures
Trigger framework
Conclusion and Outlook

Tracking at CMS
Particles produced in the collisions leave traces (hits) as they fly through the detector
The innermost detector of CMS is called the Tracker
Tracking: the art of associating each hit with the particle that left it
The collection of all the hits left by the same particle in the tracker, along with some additional information (e.g. momentum, charge), defines a track
Pile-up: the number of p-p collisions per bunch crossing

Detector structure
(Figure: detector structure with the time scales of the data flow: 25 ns, 1 µs, 1 ms, seconds, hours)

Event Selection Flow
Input rate: 10^9 Ev/s
Low Level Trigger: accepts 0.1% (rejects 99.9%)
High Level Trigger: accepts 0.01% of the remaining events
Output rate: 10^2 Ev/s

Future plans for the LHC: HL-LHC
High Luminosity LHC
– Luminosity increased to ~5 × 10^34 cm^-2 s^-1
– Pile-up increased to 140
HL-LHC
– Huge amount of information
– The current approach does not scale with the pile-up
– Coping with this amount of data is possible only if tracking information is available at trigger level
– Many hardware implementations are in development

Meanwhile in HPC…
Use several platforms containing GPUs to solve one single problem
Programming challenges:
– Algorithm parallelization
– Performing the computation on GPUs
– Execution in a distributed system where each platform has its own memory
– Network communication
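
To make the distributed picture concrete, here is a minimal, hedged MPI + CUDA sketch (mine, not from the talk): each MPI rank binds to one GPU on its node, works on its own slice of the data in its own memory, and partial results are combined over the network. The one-rank-per-GPU mapping and the final reduction are illustrative assumptions.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Bind this rank to a GPU (assumes at most one rank per GPU per node).
    int ngpus = 0;
    cudaGetDeviceCount(&ngpus);
    cudaSetDevice(rank % ngpus);

    // Each rank would process its own chunk of tracker data on its GPU
    // (kernel omitted); a reduction then combines per-rank results
    // over the network, since the platforms share no memory.
    float local = 0.0f, total = 0.0f;
    MPI_Reduce(&local, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}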

CPU and GPU architectures
GPU:
– An SMX* executes kernels (i.e. functions) using hundreds of threads concurrently
– SIMT (Single-Instruction, Multiple-Thread)
– Thread-level parallelism
– Instructions pipelined and issued in order
– No branch prediction; branch predication instead
– Cost ranging from a few hundred to a thousand euros depending on features (e.g. NVIDIA GTX cards)
CPU:
– Large caches (turning slow memory accesses into quick cache accesses)
– SIMD
– Branch prediction
– Data forwarding
– Powerful ALU
– Pipelining
*SMX = streaming multiprocessor
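
As a concrete illustration of the SIMT model (a minimal sketch, not from the talk), the CUDA kernel below is executed by hundreds of threads at once: every thread runs the same code on a different element, selected by its global thread index. Buffer size and launch configuration are arbitrary.

#include <cuda_runtime.h>

// Every thread executes the same kernel body on its own element:
// SIMT thread-level parallelism in its simplest form.
__global__ void scaleHits(const float *in, float *out, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard the tail block
        out[i] = a * in[i];
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    // Enough 256-thread blocks to cover all n elements; the hardware
    // distributes the blocks across the streaming multiprocessors.
    int threads = 256, blocks = (n + threads - 1) / threads;
    scaleHits<<<blocks, threads>>>(d_in, d_out, n, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}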

Exploiting GPUs in the trigger
x86 CPUs are not direct competitors of GPUs in embedded applications:
– Latency stability
– Power efficiency
– Performance

Parallel Track Trigger framework
Tracker data partitioning:
– The information produced by the whole tracker cannot be processed by one GPU
Data needs to be transferred between network interfaces and multiple GPUs
Data crunching must be fast
The execution kernel has to be already waiting to be fed, to avoid launch overhead

Partitioning
Tracks are approximately straight when seen from a longitudinal perspective, i.e. in the (z, R) plane
The number of tracks is approximately uniform in eta
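
For illustration (my sketch, not code from the talk), pseudorapidity can be computed from a hit's cylindrical coordinates and used to assign the hit to an eta bin; the bin count and eta range below are assumed values.

#include <cmath>

// Pseudorapidity from cylindrical hit coordinates (R, z):
// eta = -ln(tan(theta / 2)), with theta = atan2(R, z).
__host__ __device__ inline float etaFromRZ(float R, float z) {
    float theta = atan2f(R, z);
    return -logf(tanf(0.5f * theta));
}

// Map a hit into one of NBINS uniform eta bins covering |eta| < ETA_MAX.
// NBINS and ETA_MAX are illustrative, not values from the talk.
__host__ __device__ inline int etaBin(float R, float z) {
    const int   NBINS   = 64;
    const float ETA_MAX = 2.5f;   // roughly the CMS tracker acceptance
    float eta = etaFromRZ(R, z);
    if (eta <= -ETA_MAX) return 0;
    if (eta >=  ETA_MAX) return NBINS - 1;
    return (int)((eta + ETA_MAX) / (2.0f * ETA_MAX) * NBINS);
}

Because tracks are roughly uniform in eta, uniform eta bins also balance the workload across the bins.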

Partitioning (ctd.)
Eta bins could have been treated independently, but:
– Pile-up and the longitudinal impact parameter (the displacement of the collision point along the z-axis) limit this hypothesis
– The area on the next layer that needs to be evaluated when searching for hits is not obvious

Partitioning (ctd.)
Simulation for different longitudinal impact parameters
Lists of segments on subsequent layers are evaluated beforehand
Each streaming multiprocessor on a GPU is in charge of one list

Data movement without GPUDirect
– Copy to the main memory managed by the CPU (kernel space)
– Copy to user-space pinned memory
– Copy to GPU memory
– GPU pattern recognition has to be launched by the CPU
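
A hedged sketch of the last two hops of this path (buffer size and the incoming data are stand-ins; the NIC-to-kernel-space copy happens outside user code): the user-space buffer is allocated as pinned, page-locked memory, so the host-to-device copy can run asynchronously before the CPU launches the kernel.

#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 20;   // illustrative buffer size
    float *h_pinned, *d_buf;

    // Pinned user-space memory: required for truly asynchronous copies.
    cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);
    cudaMalloc((void **)&d_buf, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // In the real system the data copied out of kernel space would land
    // here; memset stands in for that step.
    memset(h_pinned, 0, bytes);

    // Copy pinned host memory to GPU memory; afterwards the CPU would
    // launch the pattern-recognition kernel on the same stream.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_buf);
    cudaFreeHost(h_pinned);
    return 0;
}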

Data movement with GPUDirect
GPUDirect accelerates communication with network and storage devices
GPUDirect supports RDMA, allowing latencies of ~1 µs and link bandwidths of ~7 GB/s
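
The sketch below shows the core of the RDMA path under stated assumptions (an InfiniBand NIC, the ibverbs API, and a driver stack that allows GPU memory registration, e.g. NVIDIA's peer-memory module); it is illustrative, not the talk's implementation. The key point is that a cudaMalloc'ed pointer is registered with the NIC like ordinary host memory, so incoming data is DMA'ed straight into the GPU, skipping the host-side copies.

#include <cstdio>
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

int main() {
    // Open the first InfiniBand device and allocate a protection domain.
    int ndev = 0;
    struct ibv_device **devs = ibv_get_device_list(&ndev);
    if (!devs || ndev == 0) { fprintf(stderr, "no IB device\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    // The receive buffer lives directly in GPU memory.
    void *d_buf = nullptr;
    const size_t bytes = 1 << 20;   // illustrative buffer size
    cudaMalloc(&d_buf, bytes);

    // With GPUDirect RDMA the GPU pointer is registered with the NIC
    // exactly like host memory; registration fails if peer-memory
    // support is missing from the driver stack.
    struct ibv_mr *mr = ibv_reg_mr(pd, d_buf, bytes,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) { fprintf(stderr, "GPU memory registration failed\n"); return 1; }

    // ... post receive work requests on a queue pair as usual ...

    ibv_dereg_mr(mr);
    cudaFree(d_buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}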

Always-hungry kernel
The GPU pattern recognition function runs in a while(true) loop
– in order to remove the overhead of the CPU launching a function on the GPU for every batch of data
The GPU keeps polling, checking for new data to crunch
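
A minimal persistent-kernel sketch of this idea (mine, not the talk's code): the kernel is launched once and spins on a flag in mapped, zero-copy host memory; the CPU raises the flag when new data is ready and the GPU processes it without a fresh kernel launch. The flag protocol, buffer size, and the trivial "processing" step are all illustrative.

#include <cuda_runtime.h>

// Assumed flag protocol: 0 = idle, 1 = new data ready, -1 = shut down.
__global__ void hungryKernel(volatile int *flag, float *data, int n) {
    while (true) {
        if (threadIdx.x == 0)
            while (*flag == 0) { }           // poll for work or shutdown
        __syncthreads();
        if (*flag == -1) return;             // CPU requested shutdown
        for (int i = threadIdx.x; i < n; i += blockDim.x)
            data[i] *= 2.0f;                 // stand-in for pattern recognition
        __syncthreads();
        __threadfence_system();              // make writes visible to the host
        if (threadIdx.x == 0) *flag = 0;     // signal "buffer consumed"
    }
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    volatile int *h_flag; int *d_flag;
    cudaHostAlloc((void **)&h_flag, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&d_flag, (void *)h_flag, 0);
    *h_flag = 0;

    const int n = 1024;
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // Launch once; the kernel then waits, already "hungry" for data.
    hungryKernel<<<1, 256>>>((volatile int *)d_flag, d_data, n);

    // Feed one batch (a real system would first copy fresh detector data).
    *h_flag = 1;                             // data ready
    while (*h_flag != 0) { }                 // wait until the GPU consumed it

    *h_flag = -1;                            // ask the kernel to exit
    cudaDeviceSynchronize();
    cudaFree(d_data);
    cudaFreeHost((void *)h_flag);
    return 0;
}

The design trade-off: the persistent loop keeps part of the GPU permanently occupied, but removes the per-launch overhead, which is what matters at trigger rates.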

Conclusion and outlook
GPUs seem to represent a good opportunity, not only for analysis and simulation applications, but also for more “hardware” jobs
Fast test and deployment phases
Possibility to change the trigger on the fly and to run multiple triggers at the same time
Hardware development is driven by the computer-graphics industry
The trigger framework is being tested with an external data sender
The data format is under evaluation
Replacing custom electronics with affordable, fully programmable processors to provide the maximum possible flexibility is a reality not so far in the future
Fast parallel pattern recognition algorithms to run on each GPU streaming multiprocessor are being evaluated