Slide 1/24: Event Reconstruction in STS
I. Kisel, GSI
CBM-RF-JINR Meeting, Dubna, May 21, 2009
Slide 2/24: Many-core HPC
- Heterogeneous systems of many cores
- Uniform approach to all CPU/GPU families
- Similar programming languages (CUDA, Ct, OpenCL)
- Parallelization of the algorithm (vectors, multi-threads, many-cores)
- On-line event selection
- Mathematical and computational optimization
- Optimization of the detector
[Diagram: hardware families to be covered by one approach, each marked with a question mark, possibly unified via OpenCL: gaming (STI Cell), GP CPU (Intel Larrabee), CPU (Intel XX-cores), FPGA (Xilinx Virtex), CPU/GPU (AMD Fusion), GP GPU (Nvidia Tesla)]
Slide 3/24: Current and Expected Eras of Intel Processor Architectures
From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.
- Future programming is three-dimensional: cores, HW threads, SIMD width
- The amount of data is doubling every 18-24 months
- Massive data streams
- The RMS (Recognition, Mining, Synthesis) workload in real time
- Supercomputer-level performance in ordinary servers and PCs
- Applications such as real-time decision-making analysis
Slide 4/24: Cores and HW Threads
[Diagram: evolution of CPU architecture: one process per CPU (19XX); two threads per process per CPU (2000); the multi-core, multi-threaded CPU of 2009; and the CPU of your laptop in 2015. Each thread alternates execution (exe) and read/write (r/w) phases.]
- Cores and HW threads are seen by an operating system as CPUs: > cat /proc/cpuinfo (see the snippet below)
- At most half of the threads are executing at any given time.
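As an aside on how this looks from software, a minimal C++ sketch (using the C++11 thread library, which post-dates this 2009 talk) that asks the OS for the same logical-CPU count that cat /proc/cpuinfo enumerates:

    #include <iostream>
    #include <thread>

    int main() {
        // Each core and each HW thread appears as one logical CPU; this is
        // the same count that "cat /proc/cpuinfo" lists as processors.
        unsigned n = std::thread::hardware_concurrency();
        std::cout << "Logical CPUs visible to the OS: " << n << "\n";
        return 0;
    }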
Slide 5/24: SIMD Width
SIMD = Single Instruction, Multiple Data
- SIMD uses vector registers
- SIMD exploits data-level parallelism
[Diagram: scalar vs. vector registers, one slot per value]
- Scalar double precision (64 bits): 1 value per operation
- Vector (SIMD) double precision (128 bits): 2 values, speed-up 2 or slow-down 1/2
- Vector (SIMD) single precision (128 bits): 4 values, speed-up 4 or slow-down 1/4
- Intel AVX (2010) vector single precision (256 bits): 8 values, speed-up 8 or slow-down 1/8
- Intel LRB (2010) vector single precision (512 bits): 16 values, speed-up 16 or slow-down 1/16
Faster or slower?
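To make the vector-register picture concrete, a minimal sketch of the 128-bit single-precision case using SSE intrinsics (standard <xmmintrin.h> calls; the numbers are made up for illustration):

    #include <xmmintrin.h>  // SSE: 128-bit registers, 4 floats per operation
    #include <cstdio>

    int main() {
        alignas(16) float a[4] = {1.f, 2.f, 3.f, 4.f};
        alignas(16) float b[4] = {10.f, 20.f, 30.f, 40.f};
        alignas(16) float c[4];

        __m128 va = _mm_load_ps(a);      // load 4 packed singles
        __m128 vb = _mm_load_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  // one instruction adds all 4 lanes
        _mm_store_ps(c, vc);

        std::printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);
        return 0;
    }

The same addition would take four scalar instructions; with AVX or LRB the lane count grows to 8 or 16.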
Slide 6/24: SIMD KF Track Fit on Intel Multicore Systems: Scalability
H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering
- Real-time performance on different CPU architectures: speed-up of 100 with 32 threads
- Speed-up of 3.7 on the Xeon 5140 (Woodcrest)
- Real-time performance on different Intel CPU platforms (a measurement sketch follows below)
[Plot: fit time per track (s/track, log scale, 0.01 to 10.00) vs. number of threads (up to 32) for the scalar double, vector double, and vector single versions on Woodcrest (2 cores), Clovertown (4), Dunnington (6), and 2x Cell SPE (16)]
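A hedged sketch of how such a thread-scaling measurement can be organized: partition the tracks over N threads and time the whole fit. fit_track below is a hypothetical stand-in for the real SIMD KF fit, and the track array is dummy data:

    #include <chrono>
    #include <cmath>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // Hypothetical stand-in for the SIMD Kalman-filter track fit.
    static void fit_track(float& x) { x = std::sqrt(x * x + 1.f); }

    int main() {
        std::vector<float> tracks(1000000, 1.f);
        for (unsigned nThreads = 1; nThreads <= 32; nThreads *= 2) {
            auto t0 = std::chrono::steady_clock::now();
            std::vector<std::thread> pool;
            size_t chunk = tracks.size() / nThreads;
            for (unsigned t = 0; t < nThreads; ++t) {
                size_t lo = t * chunk;
                size_t hi = (t + 1 == nThreads) ? tracks.size() : lo + chunk;
                pool.emplace_back([&, lo, hi] {
                    for (size_t i = lo; i < hi; ++i) fit_track(tracks[i]);
                });
            }
            for (auto& th : pool) th.join();
            std::chrono::duration<double, std::micro> dt =
                std::chrono::steady_clock::now() - t0;
            std::printf("%2u threads: %.4f us/track\n",
                        nThreads, dt.count() / tracks.size());
        }
        return 0;
    }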
Slide 7/24: Intel Larrabee: 32 Cores
L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.
LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways. It will:
- use the x86 instruction set with Larrabee-specific extensions;
- feature cache coherency across all its cores;
- include very little specialized graphics hardware.
LRB vs. CPU: The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
- LRB's 32 x86 cores will be based on the much simpler Pentium design;
- each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
- each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
- LRB includes explicit cache control instructions;
- LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
- LRB includes one fixed-function graphics hardware unit.
Slide 8/24: General Purpose Graphics Processing Units (GPGPU)
- Substantial evolution of graphics hardware over the past years
- Remarkable programmability and flexibility
- Reasonably cheap
- New branch of research: GPGPU
Slide 9/24: NVIDIA Hardware
S. Kalcher, M. Bach
- Streaming multiprocessors
- No-overhead thread switching
- FPUs instead of cache/control
- Complex memory hierarchy
- SIMT: Single Instruction, Multiple Threads
GT200:
- 30 multiprocessors, 8 SP FPUs per MP, 240 SP units in total
- 30 DP units
- 16 000 registers per MP
- 16 kB shared memory per MP
- >= 1 GB main memory
- 1.4 GHz clock
- 933 GFlops SP
Slide 10/24: SIMD/SIMT Kalman Filter on the CSC-Scout Cluster
M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth
Cluster: 18 x (2 x (Quad-Xeon, 3.0 GHz, 2x6 MB L2), 16 GB) + 27 x Tesla S1070 (4 x (GT200, 4 GB))
[Chart: performance comparison, CPU 1600 vs. GPU 9100]
Slide 11/24: CPU/GPU Programming Frameworks
Cg, OpenGL Shading Language, DirectX:
- Designed to write shaders
- Require the problem to be expressed graphically
AMD Brook:
- Pure stream computing
- Not hardware-specific
AMD CAL (Compute Abstraction Layer):
- Generic usage of hardware at the assembler level
NVIDIA CUDA (Compute Unified Device Architecture):
- Defines the hardware platform
- Generic programming
- Extension to the C language
- Explicit memory management
- Programming on the thread level
Intel Ct (C for throughput):
- Extension to the C language
- Intel CPU/GPU specific
- SIMD exploitation for automatic parallelism
OpenCL (Open Computing Language):
- Open standard for generic programming (a minimal sketch follows below)
- Extension to the C language
- Supposed to work on any hardware
- Specific hardware capabilities used via extensions
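Since OpenCL recurs throughout this talk as the portability candidate, here is a minimal, hedged host-side sketch of its model (C-language extension, explicit memory management, same code for CPU and GPU). The kernel name "scale" and the data are invented for illustration, and error checking is omitted:

    #include <CL/cl.h>
    #include <cstdio>

    static const char* src =
        "__kernel void scale(__global float* x, float a) {"
        "  size_t i = get_global_id(0);"
        "  x[i] = a * x[i];"
        "}";

    int main() {
        float data[8] = {1, 2, 3, 4, 5, 6, 7, 8};

        cl_platform_id platform; cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

        cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);

        // The kernel is compiled at run time for whatever device was found.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
        clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
        cl_kernel k = clCreateKernel(prog, "scale", nullptr);

        // Explicit memory management: copy host data into a device buffer.
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    sizeof(data), data, nullptr);
        float a = 2.0f;
        clSetKernelArg(k, 0, sizeof(buf), &buf);
        clSetKernelArg(k, 1, sizeof(a), &a);

        size_t global = 8;  // one work-item per array element
        clEnqueueNDRangeKernel(q, k, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(data), data, 0, nullptr, nullptr);

        for (float v : data) std::printf("%g ", v);
        std::printf("\n");
        return 0;
    }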
Slide 12/24: Cellular Automaton Track Finder
[Figure: stages of the cellular automaton track finder (numeric labels 500, 200, 10, 10 from the original diagram)]
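The slide presents the method graphically. As a rough illustration only (a toy sketch, not the CBM L1 code), the cellular-automaton idea: hits on neighbouring stations are joined into segments ("cells"), each cell's counter grows with the length of the chain of collinear neighbours behind it, and long chains become track candidates:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Hit  { int station; float y; };
    struct Cell { int from, to, counter; };

    int main() {
        // A straight toy track (y = station) plus one noise hit.
        std::vector<Hit> hits = {{0, 0.f}, {1, 1.f}, {2, 2.f}, {3, 3.f}, {2, 5.f}};

        // 1. Build cells between hits on adjacent stations.
        std::vector<Cell> cells;
        for (int i = 0; i < (int)hits.size(); ++i)
            for (int j = 0; j < (int)hits.size(); ++j)
                if (hits[j].station == hits[i].station + 1)
                    cells.push_back({i, j, 1});

        // 2. CA step: a cell's counter = 1 + max counter of its upstream
        //    neighbours (shared hit, roughly collinear). Iterate to a fixed point.
        auto slope = [&](const Cell& c) { return hits[c.to].y - hits[c.from].y; };
        for (bool changed = true; changed; ) {
            changed = false;
            for (auto& c : cells)
                for (const auto& n : cells)
                    if (n.to == c.from && std::fabs(slope(n) - slope(c)) < 0.1f &&
                        n.counter + 1 > c.counter) {
                        c.counter = n.counter + 1;
                        changed = true;
                    }
        }

        // 3. Long chains (high counters) are track candidates.
        for (const auto& c : cells)
            std::printf("cell %d->%d  counter %d\n", c.from, c.to, c.counter);
        return 0;
    }

Here the noise hit never joins a collinear chain, so its cells keep counter 1 while the true track's last cell reaches counter 3.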
Slide 13/24: L1 CA Track Finder: Efficiency
I. Rostovtseva

Track category                    Efficiency, %
Reference set (>1 GeV/c)          95.2
All set (≥4 hits, >100 MeV/c)     89.8
Extra set (<1 GeV/c)              78.6
Clone                              2.8
Ghost                              6.6

MC tracks/ev found: 672
Speed: 0.8 s/ev

Open questions:
- Fluctuated magnetic field?
- Too large STS acceptance?
- Too large distance between STS stations?
Slide 14/24: L1 CA Track Finder: Changes
I. Kulakov
Slide 15/24: L1 CA Track Finder: Timing
I. Kulakov
old = old version (from CBMRoot DEC08); new = new parallelized version.
Statistics: 100 central events. Processor: Pentium D, 3.0 GHz, 2 MB.

Time             old   new, 1 thread   new, 2 threads   new, 3 threads
CPU Time [ms]    575   278             321              335
Real Time [ms]   576   286             233              238

R [cm]            10    9    8    7    6    5    4    3    2    1   0.5    -
CPU time [ms]    320  285  254  220  192  171  149  132  123  113  106   96
Real time [ms]   233  213  193  175  154  144  129  120  108  100   94   85
Ref set          0.97 over most of the scan, 0.96 in the last columns
All set          0.92 over most of the scan, 0.91 in the last columns
Extra set        0.81-0.82, 0.80 in the last column
Clone            0.04 throughout
Ghost            0.04 throughout
Tracks/event     686-688, dropping to 684 and 682 in the last columns
Slide 16/24: On-line = Off-line Reconstruction?
- Off-line and on-line reconstruction will, and should, be parallelized
- Both versions will run on similar many-core systems, or even on the same PC farm
- Both versions will (probably) use the same parallel language(s), such as OpenCL
- Can we use the same code, but with some physics cuts applied when running on-line, like L1 CA?
- If the final code is fast, can we think about a global on-line event reconstruction and selection?

Status table (rows other than STS were still empty on the slide):

                       Intel SIMD   Intel MIMD   Intel Ct   NVIDIA CUDA   OpenCL
STS                        +            +           +            +          -
MuCh
RICH
TRD
Your Reco
Open Charm Analysis
Your Analysis
Slide 17/24: Summary
- Think parallel!
- Parallel programming is the key to the full potential of the Tera-scale platforms
- Data parallelism vs. parallelism of the algorithm
- Stream processing: no branches (see the sketch after this list)
- Avoid direct access to main memory; no maps, no look-up tables
- Use the SIMD unit in the nearest future (many-cores, TF/s, ...)
- Use single-precision floating point where possible
- In critical parts, use double precision if necessary
- Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, ...)
- New parallel languages appear: OpenCL, Ct, CUDA
- A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!
- Should we start buying them for testing the algorithms now?
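A small illustration of the "no branches" bullet, assuming a simple clamping step as the example; compilers can often if-convert the branchy form too, but the explicit select form maps directly onto SIMD lanes:

    #include <vector>

    // Branchy form: the if makes each element's control flow data-dependent.
    void clampBranchy(std::vector<float>& v, float vmax) {
        for (auto& x : v)
            if (x > vmax) x = vmax;
    }

    // Stream/branchless form: every element executes the same instructions,
    // so the loop vectorizes as a masked select per SIMD lane.
    void clampStream(std::vector<float>& v, float vmax) {
        for (auto& x : v)
            x = (x > vmax) ? vmax : x;  // compiles to min/select, no branch
    }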
Slide 18/24: Back-up Slides (1-5)
Slide 19/24: Back-up Slide (1/5)
Slide 20/24: Back-up Slide (2/5)
Slide 21/24: Back-up Slide (3/5)
Slide 22/24: Back-up Slide (4/5). Note: SIMD is out of consideration (I.K.)
Slide 23/24: Back-up Slide (5/5)
Slide 24/24: Tracking Workshop
You are cordially invited to the Tracking Workshop, 15-17 June 2009, at GSI.