L1 Event Reconstruction in the STS. I. Kisel, GSI / KIP. CBM Collaboration Meeting, Dubna, October 16, 2008.

Presentation transcript:

L1 Event Reconstruction in the STS
I. Kisel, GSI / KIP
CBM Collaboration Meeting, Dubna, October 16, 2008

Slide 2/15: Many-core HPC
High performance computing (HPC):
- Highest clock rate is reached.
- Performance/power optimization.
- Heterogeneous systems of many (>8) cores.
- Similar programming languages (Ct and CUDA).
- We need a uniform approach to all CPU/GPU families.
On-line event selection:
- Mathematical and computational optimization.
- SIMDization of the algorithm (from scalars to vectors; a sketch follows below).
- MIMDization (multi-threads, multi-cores).
- Optimize the STS geometry (strips, sector navigation).
- Smooth magnetic field.
Hardware families considered: Gaming (STI: Cell), GP GPU (Nvidia: Tesla), GP CPU (Intel: Larrabee), CPU/GPU (AMD: Fusion), and further, still open, options (?).
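As a rough illustration of the "from scalars to vectors" step (hypothetical code, not the actual L1 fitter): the same arithmetic that is applied to one track is applied to four tracks packed into one SSE register, with the track parameters stored structure-of-arrays so that the four values sit next to each other in memory.

    #include <xmmintrin.h>   // SSE intrinsics

    // Scalar version: propagate one track coordinate by one step in z.
    inline float propagate(float x, float tx, float dz) { return x + tx * dz; }

    // SIMD version: the same operation on four tracks at once.
    // x, tx, dz each hold the values of four different tracks.
    inline __m128 propagate4(__m128 x, __m128 tx, __m128 dz) {
        return _mm_add_ps(x, _mm_mul_ps(tx, dz));
    }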

Slide 3/15: NVIDIA GeForce GTX 280
CUDA (Compute Unified Device Architecture)
- NVIDIA GT200 GeForce GTX 280.
- 933 GFlops single precision (240 FPUs).
- Finally double precision support, but only ~90 GFlops (an 8-core Xeon gives ~80 GFlops).
Currently under investigation: tracking, Linpack, image processing.
Sebastian Kalcher

Slide 4/15: Intel Larrabee: 32 Cores
L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
- it will use the x86 instruction set with Larrabee-specific extensions;
- it will feature cache coherency across all its cores;
- it will include very little specialized graphics hardware.
The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
- LRB's x86 cores will be based on the much simpler Pentium design;
- each core contains a 512-bit vector processing unit, able to process 16 single-precision floating point numbers at a time;
- LRB includes one fixed-function graphics hardware unit;
- LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
- LRB includes explicit cache control instructions;
- each core supports 4-way simultaneous multithreading, with 4 copies of each processor register.

Slide 5/15: Intel Ct Language
Extend C++ for throughput-oriented computing ("Ct: Throughput Programming in C++", tutorial, Intel):
- Ct adds new data types (parallel vectors) and operators to C++.
- Library-like interface, fully ANSI/ISO-compliant.
- Ct abstracts away architectural details: vector ISA width, core count, memory model, cache sizes.
- Ct forward-scales software written today: the Ct platform-level API, Virtual Intel Platform (VIP), is designed to be dynamically retargetable to SSE, SSEx, LRB, etc.
- Ct is fully deterministic: no data races.
- Nested data parallelism and deterministic task parallelism differentiate Ct when parallelizing irregular data and algorithms.
Dot product using C loops:
    for (int i = 0; i < n; i++) { dst += src1[i] * src2[i]; }
Dot product using Ct (the basic type in Ct is a TVEC; the vector operations subsume the loop):
    TVEC Dst, Src1(src1, n), Src2(src2, n);
    Dst = addReduce(Src1 * Src2);   // element-wise multiply, then a reduction (a global sum)
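For contrast, a minimal sketch of the same dot product written with explicit SSE intrinsics (a hypothetical illustration; this is exactly the architectural detail, vector width included, that Ct hides behind TVEC and addReduce). It assumes n is a multiple of 4:

    #include <xmmintrin.h>  // SSE intrinsics

    // Dot product of two float arrays, 4 elements per SSE register.
    float dot_sse(const float* src1, const float* src2, int n) {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n; i += 4) {
            __m128 a = _mm_loadu_ps(src1 + i);
            __m128 b = _mm_loadu_ps(src2 + i);
            acc = _mm_add_ps(acc, _mm_mul_ps(a, b));   // element-wise multiply, partial sums
        }
        float lanes[4];
        _mm_storeu_ps(lanes, acc);                     // horizontal reduction of the 4 lanes
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }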

Slide 6/15: Ct vs. CUDA (Matthias Bach)

Slide 7/15: Multi/Many-Core Investigations
- CA: Game of Life
- L1/HLT CA Track Finder
- SIMD KF Track Fitter
- LINPACK
- MIMDization (multi-threads, multi-cores)
GSI, KIP, CERN, Intel

Slide 8/15: SIMD KF Track Fit on Multicore Systems: Scalability
Using Intel Threading Building Blocks: linear scaling on multiple cores.
[Plot: real fit time per track (μs) vs. number of threads]
Håvard Bjerke
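A minimal sketch of how such a fit loop can be spread over cores with TBB (hypothetical code using the current lambda-based TBB interface; the actual SIMD KF fitter additionally packs four tracks per SSE vector, which is not shown here). Since the tracks are independent, the loop scales with the number of cores:

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <vector>

    struct Track { float par[6]; float cov[21]; };   // illustrative track state

    // Placeholder for the per-track Kalman-filter fit (not shown here).
    void fit_track(Track& t) { (void)t; }

    // Fit all tracks in parallel over the available cores.
    void fit_all(std::vector<Track>& tracks) {
        tbb::parallel_for(tbb::blocked_range<std::size_t>(0, tracks.size()),
            [&](const tbb::blocked_range<std::size_t>& r) {
                for (std::size_t i = r.begin(); i != r.end(); ++i)
                    fit_track(tracks[i]);
            });
    }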

Slide 9/15: Parallelization of the L1 CA Track Finder
1. Create tracklets
2. Collect tracks
(A conceptual sketch of the two phases follows below.)
GSI, KIP, CERN, Intel, ITEP, Uni-Kiev
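A highly simplified, hypothetical sketch of these two phases (the real L1 CA finder works with a fitted track model, χ² cuts and a station-by-station organization; all names and cuts below are illustrative only):

    #include <vector>

    struct Hit     { float x, y, z; int station; };
    struct Segment {                  // a tracklet built from two hits
        int hitLeft, hitRight;        // indices into the hit array
        int counter;                  // CA state: length of the best chain ending here
        std::vector<int> neighbours;  // segments that chain into this one
    };

    // Phase 1: create tracklets from hit pairs on neighbouring stations.
    // A real finder extrapolates with the track model and applies geometric cuts.
    std::vector<Segment> createTracklets(const std::vector<Hit>& hits) {
        std::vector<Segment> segs;
        for (int i = 0; i < (int)hits.size(); ++i)
            for (int j = 0; j < (int)hits.size(); ++j)
                if (hits[j].station == hits[i].station + 1)   // ...plus cuts on slopes etc.
                    segs.push_back({i, j, 1, {}});
        return segs;
    }

    // Link segments that share a hit and may belong to the same track.
    void linkNeighbours(std::vector<Segment>& segs) {
        for (int a = 0; a < (int)segs.size(); ++a)
            for (int b = 0; b < (int)segs.size(); ++b)
                if (segs[a].hitLeft == segs[b].hitRight)      // ...plus a collinearity cut
                    segs[a].neighbours.push_back(b);
    }

    // Phase 2, CA evolution: each segment repeatedly takes
    // counter = 1 + max(counter of its neighbours). Only local information
    // is used, so all segments can be updated independently and in parallel.
    void evolve(std::vector<Segment>& segs) {
        bool changed = true;
        while (changed) {
            changed = false;
            for (Segment& s : segs)
                for (int n : s.neighbours)
                    if (segs[n].counter + 1 > s.counter) {
                        s.counter = segs[n].counter + 1;
                        changed = true;
                    }
        }
    }

    // Phase 2, track collection: start from the segment with the largest counter,
    // follow neighbours with decreasing counters, mark the used hits, and repeat.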

Slide 10/15: L1 Standalone Package for Event Selection (Igor Kulakov)

Slide 11/15: KFParticle: Primary Vertex Finder (Ruben Moor)
The algorithm is implemented and has passed first tests.

Slide 12/15: L1 Standalone Package for Event Selection (Igor Kulakov, Iouri Vassiliev)
Track finding efficiency:
    Reference set   97.1%
    All set         91.9%
    Extra set       81.9%
    Clone            3.5%
    Ghost            3.2%
    Tracks/event     691
Efficiency of D+ selection: 48.9%

Slide 13/15: Magnetic Field: Smooth in the Acceptance
We need a smooth magnetic field in the acceptance:
1. Approximate the field with a polynomial in the plane of each station.
2. Approximate the field between stations with a parabolic function through each group of 3 stations (a sketch follows below).
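A hedged sketch of what this two-step approximation can look like in code (the actual L1 field parametrization uses higher-order polynomials; the coefficients, orders and names below are purely illustrative):

    // Step 1: in the plane of each station, a field component is stored as a
    // low-order polynomial in (x, y), here a simple quadratic.
    struct FieldSlice {
        float z;       // z position of the station plane
        float c[6];    // c0 + c1*x + c2*y + c3*x*x + c4*x*y + c5*y*y
        float By(float x, float y) const {
            return c[0] + c[1]*x + c[2]*y + c[3]*x*x + c[4]*x*y + c[5]*y*y;
        }
    };

    // Step 2: between stations, the field along z is the parabola through the
    // values at three neighbouring stations (Lagrange form).
    float ByAlongZ(const FieldSlice& s0, const FieldSlice& s1, const FieldSlice& s2,
                   float x, float y, float z) {
        const float f0 = s0.By(x, y), f1 = s1.By(x, y), f2 = s2.By(x, y);
        const float z0 = s0.z, z1 = s1.z, z2 = s2.z;
        return f0 * (z - z1) * (z - z2) / ((z0 - z1) * (z0 - z2))
             + f1 * (z - z0) * (z - z2) / ((z1 - z0) * (z1 - z2))
             + f2 * (z - z0) * (z - z1) / ((z2 - z0) * (z2 - z1));
    }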

Slide 14/15: CA on the STS Geometry with Overlapping Sensors (Irina Rostovtseva)
UrQMD MC, central Au+Au collisions at 25 AGeV.
Efficiency and the fraction of killed tracks remain acceptable up to ΔZ = Z_hit - Z_station < ~0.2 cm.

Slide 15/15: Summary and Plans
- Learn the Ct (Intel) and CUDA (Nvidia) programming languages.
- Develop the L1 standalone package for event selection.
- Parallelize the CA track finder.
- Investigate large multi-core systems (CPU and GPU).
Parallel hardware -> parallel languages -> parallel algorithms.