Status of the L1 STS Tracking. I. Kisel, GSI / KIP. CBM Collaboration Meeting, GSI, March 12, 2009.


1 Status of the L1 STS Tracking. I. Kisel, GSI / KIP. CBM Collaboration Meeting, GSI, March 12, 2009.

2 L1 CA Track Finder Efficiency (I. Rostovtseva)

Track category                  Efficiency, %
Reference set (>1 GeV/c)        95.2
All set (≥4 hits, >100 MeV/c)   89.8
Extra set (<1 GeV/c)            78.6
Clone                           2.8
Ghost                           6.6

MC tracks/ev found: 672
Speed: 0.8 s/ev

Open questions:
- Fluctuated magnetic field?
- Too large STS acceptance?
- Too large distance between STS stations?

3 Many-core HPC

High performance computing (HPC):
- Highest clock rate is reached
- Performance/power optimization
- Heterogeneous systems of many (>8) cores
- Similar programming languages (OpenCL, Ct and CUDA)
- We need a uniform approach to all CPU/GPU families

On-line event selection:
- Mathematical and computational optimization
- SIMDization of the algorithm (from scalars to vectors)
- MIMDization (multi-threads, many-cores)

Candidate platforms: Gaming (STI: Cell), GP CPU (Intel: Larrabee), GP GPU (Nvidia: Tesla), CPU (Intel: XX-cores), FPGA (Xilinx), CPU/GPU (AMD: Fusion). Which one?

4 Current and Expected Eras of Intel Processor Architectures

From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.

- Future programming is three-dimensional: cores, threads and SIMD width
- The amount of data is doubling every 18-24 months
- Massive data streams
- The RMS (Recognition, Mining, Synthesis) workload in real time
- Supercomputer-level performance in ordinary servers and PCs
- Applications like real-time decision-making analysis

5 Cores and Threads

- CPU architecture in 19XX: 1 process per CPU
- CPU architecture in 2000: 2 threads per process per CPU; both threads execute and read/write the shared process memory
- CPU architecture in 2009: many cores, many threads
- CPU of your laptop in 2015: ?

6 SIMD Width

SIMD = Single Instruction Multiple Data. SIMD uses vector registers and exploits data-level parallelism.

- Scalar double precision (64 bits): 1 value per instruction
- Vector (SIMD) double precision (128 bits): 2 values (2x or 1/2?)
- Vector (SIMD) single precision (128 bits): 4 values (4x or 1/4?)
- Intel AVX (2010), vector single precision (256 bits): 8 values (8x or 1/8?)
- Intel LRB (2010), vector single precision (512 bits): 16 values (16x or 1/16?)

Faster or slower?

7 SIMD KF Track Fit on Intel Multicore Systems: Scalability

H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering

- Speed-up 3.7 on the Xeon 5140 (Woodcrest) at 2.4 GHz using icc 9.1
- Real-time performance on the quad-core Xeon 5345 (Clovertown) at 2.4 GHz: speed-up 30 with 16 threads
- Real-time performance on different Intel CPU platforms

8 Intel Larrabee: 32 Cores

L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
- use the x86 instruction set with Larrabee-specific extensions;
- feature cache coherency across all its cores;
- include very little specialized graphics hardware.

LRB vs. CPU: The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
- LRB's 32 x86 cores will be based on the much simpler Pentium design;
- each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
- each core contains a 512-bit vector processing unit, able to process 16 single-precision floating point numbers at a time;
- LRB includes explicit cache control instructions;
- LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
- LRB includes one fixed-function graphics hardware unit.

9 General Purpose Graphics Processing Units (GPGPU)

- Substantial evolution of graphics hardware over the past years
- Remarkable programmability and flexibility
- Reasonably cheap
- New branch of research: GPGPU

10 NVIDIA Hardware (S. Kalcher, M. Bach)

- Streaming multiprocessors
- No overhead thread switching
- FPUs instead of cache/control
- Complex memory hierarchy
- SIMT: Single Instruction Multiple Threads

GT200:
- 30 multiprocessors, 8 SP FPUs per MP: 240 SP units
- 30 DP units
- 16 000 registers per MP
- 16 kB shared memory per MP
- >= 1 GB main memory
- 1.4 GHz clock
- 933 GFlops SP

11 SIMD/SIMT Kalman Filter on the CSC-Scout Cluster

M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth

Cluster: 18x(2x(Quad-Xeon, 3.0 GHz, 2x6 MB L2), 16 GB) + 27x Tesla S1070 (4x(GT200, 4 GB))

CPU: 1600; GPU: 9100

12 CPU/GPU Programming Frameworks

Cg, OpenGL Shading Language, Direct X:
- Designed to write shaders
- Require the problem to be expressed graphically

AMD Brook:
- Pure stream computing
- Not hardware-specific

AMD CAL (Compute Abstraction Layer):
- Generic usage of hardware on assembler level

NVIDIA CUDA (Compute Unified Device Architecture):
- Defines the hardware platform
- Generic programming
- Extension to the C language
- Explicit memory management
- Programming on thread level

Intel Ct (C for throughput):
- Extension to the C language
- Intel CPU/GPU specific
- SIMD exploitation for automatic parallelism

OpenCL (Open Computing Language):
- Open standard for generic programming
- Extension to the C language
- Supposed to work on any hardware
- Usage of specific hardware capabilities by extensions

13 On-line = Off-line Reconstruction?

- Off-line and on-line reconstructions will and should be parallelized
- Both versions will be run on similar many-core systems or even on the same PC farm
- Both versions will (probably) use the same parallel language(s), such as OpenCL
- Can we use the same code, but with some physics cuts applied when running on-line, like L1 CA?
- If the final code is fast, can we think about a global on-line event reconstruction and selection?

Status matrix (reconstruction/analysis task vs. parallelization approach):

                      Intel SIMD  Intel MIMD  Intel Ct  NVIDIA CUDA  OpenCL
STS                   +           +           +         +            -
MuCh
RICH
TRD
Your Reco
Open Charm Analysis
Your Analysis

14 Summary

Think parallel!
- Parallel programming is the key to the full potential of the Tera-scale platforms
- Data parallelism vs. parallelism of the algorithm
- Stream processing: no branches
- Avoid accessing main memory directly; no maps, no look-up tables
- Use the SIMD unit in the nearest future (many-cores, TF/s, ...)
- Use single-precision floating point where possible; in critical parts use double precision if necessary
- Keep portability of the code on heterogeneous systems (Intel, AMD, Cell, GPGPU, ...)
- New parallel languages appear: OpenCL, Ct, CUDA
- A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!
- Should we start buying them for testing?

Candidate platforms, each with the question "OpenCL?": Gaming (STI: Cell), GP CPU (Intel: Larrabee), GP GPU (Nvidia: Tesla), CPU (Intel: XXX-cores), FPGA (Xilinx), CPU/GPU (AMD: Fusion).


