A parallel High Level Trigger benchmark (using multithreading and/or SSE), 26.02.2008, Håvard Bjerke.

Presentation transcript:

A parallel High Level Trigger benchmark (using multithreading and/or SSE). Håvard Bjerke

Reconstruction Challenges
1. Track finding
   - High frequency of collisions (LHC: 40 MHz)
   - A lot of irrelevant particle noise, which needs to be filtered out to concentrate on the most important data
2. Track fitting
   - Measurements are imprecise
   - Estimate the real trajectories
3. Find vertices

3 High-Level Trigger code
- Tracing particles through a magnetic field
- Each track is calculated independently
  - Embarrassingly parallel
- Optimization
  - Step 1: use vectors instead of scalars, which allows exploitation of SIMD instructions
  - Step 2: add multithreading, which allows exploitation of multiple cores

4 Explicit vectorisation
- Operator overloading allows seamless change of data types, even between primitives (e.g. float) and classes
- Two classes:
  - P4_F32vec4 (packed single): operator + = _mm_add_ps, rcp = _mm_rcp_ps
  - P4_F64vec2 (packed double, added later): operator + = _mm_add_pd; rcp: there is no _mm_rcp_pd, so use 1. / a
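A minimal sketch of what such a wrapper class can look like, assuming the interface named on the slide (the constructors and exact layout here are illustrative, not necessarily those of the benchmark's class):

```cpp
#include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_add_ps, _mm_rcp_ps

class P4_F32vec4 {
public:
    __m128 v;                                    // four packed single-precision floats

    P4_F32vec4() : v(_mm_setzero_ps()) {}
    P4_F32vec4(__m128 x) : v(x) {}
    P4_F32vec4(float f) : v(_mm_set1_ps(f)) {}   // broadcast a scalar into all four lanes

    friend P4_F32vec4 operator+(const P4_F32vec4 &a, const P4_F32vec4 &b) {
        return P4_F32vec4(_mm_add_ps(a.v, b.v)); // one instruction adds four floats
    }
    friend P4_F32vec4 rcp(const P4_F32vec4 &a) {
        return P4_F32vec4(_mm_rcp_ps(a.v));      // fast approximate reciprocal (packed single only)
    }
};

// The packed-double counterpart (P4_F64vec2) wraps __m128d and _mm_add_pd;
// since there is no _mm_rcp_pd, its rcp can be written as 1. / a.
```

Because the class exposes the same operators as float, the same templated fit code can be instantiated with float for scalar runs and with P4_F32vec4 for the SIMD runs.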

5 SIMD implementation
- Three modes (all within 128-bit SSE registers):
  - Scalar (SISD) double, "x87": 1 double precision calculation per instruction
  - Packed double: 2 double precision calculations per instruction
  - Packed single: 4 single precision calculations per instruction
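As an illustration (not code from the benchmark), the same addition expressed in the three modes looks like this; each function body compiles to a single add instruction producing 1, 2 or 4 results respectively:

```cpp
#include <xmmintrin.h>   // packed single: __m128, _mm_add_ps
#include <emmintrin.h>   // packed double: __m128d, _mm_add_pd

// Scalar: one double-precision result per add instruction
double add_scalar(double a, double b) { return a + b; }

// Packed double: two double-precision results per 128-bit add
__m128d add_packed_double(__m128d a, __m128d b) { return _mm_add_pd(a, b); }

// Packed single: four single-precision results per 128-bit add
__m128 add_packed_single(__m128 a, __m128 b) { return _mm_add_ps(a, b); }
```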

6 Performance measurements (2.4 GHz, compiled with ICC 9.1)

7 Performance counters (measured using pfmon)

8 Intel Threading Building Blocks
- Multithreading without explicitly controlling threads
- Useful functions for multithreading: parallel_for, parallel_reduce, parallel_do
- Open source C++ library
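The benchmark's parallel_for usage is shown on the next slide; as a hedged illustration of the other primitives, a reduction over per-track values could look like this with the classic TBB body interface (SumChi2 and the chi2 array are invented for the example, not taken from the benchmark):

```cpp
#include <tbb/parallel_reduce.h>
#include <tbb/blocked_range.h>

struct SumChi2 {
    const float *chi2;   // one chi2 value per fitted track
    float sum;

    SumChi2(const float *c) : chi2(c), sum(0.f) {}
    SumChi2(SumChi2 &other, tbb::split) : chi2(other.chi2), sum(0.f) {}  // body copy for a stolen subrange

    void operator()(const tbb::blocked_range<int> &r) {
        for (int i = r.begin(); i != r.end(); ++i) sum += chi2[i];       // partial sum over this subrange
    }
    void join(const SumChi2 &rhs) { sum += rhs.sum; }                    // merge partial sums
};

float TotalChi2(const float *chi2, int n_tracks) {
    SumChi2 body(chi2);
    tbb::parallel_reduce(tbb::blocked_range<int>(0, n_tracks), body);
    return body.sum;
}
```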

9 Multithreading
- parallel_for
  - #tasks = #tracks / grain_size
  - #threads <= #tasks

Serial loop over the n_tracks / 4 SIMD track vectors:

    for (int i = 0; i < n_tracks / 4; i++) {
        Fit(track_vector[i], ...);
    }

The same loop dispatched as tasks with TBB; the n_tracks / 4 iterations are split into tasks of grain_size iterations each (4 tasks in the slide's illustration):

    parallel_for(blocked_range<int>(0, n_tracks / 4, grain_size),
                 ApplyFit(track_vectors, ...));
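A self-contained sketch of how the ApplyFit body and the dispatch could look; TrackVector, Fit and the field layout are placeholders, not the benchmark's actual types:

```cpp
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Placeholder: one SIMD vector holding the parameters of 4 tracks side by side.
struct TrackVector { float x[4], y[4], tx[4], ty[4], qp[4]; };

// Placeholder for the Kalman-filter fit of 4 tracks at once (packed single).
void Fit(TrackVector &tv) { /* ... SIMD fit ... */ }

struct ApplyFit {
    TrackVector *tracks;
    explicit ApplyFit(TrackVector *t) : tracks(t) {}

    void operator()(const tbb::blocked_range<int> &r) const {
        for (int i = r.begin(); i != r.end(); ++i)
            Fit(tracks[i]);                       // each task fits about grain_size track vectors
    }
};

void FitAllTracks(TrackVector *track_vectors, int n_tracks, int grain_size) {
    // n_tracks / 4 SIMD vectors in total; TBB splits them into tasks of ~grain_size
    // vectors each and schedules the tasks over the available worker threads.
    tbb::parallel_for(tbb::blocked_range<int>(0, n_tracks / 4, grain_size),
                      ApplyFit(track_vectors));
}
```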

10 Measurements: chart of real fit time per track (µs) versus #tasks (note the logarithmic scale)

11 Lessons
- Explicit vectorisation is sometimes necessary
  - Can't always trust the compiler to vectorise for you
  - Custom vector types (classes) with operator overloading are an efficient and portable method
- Difficult to achieve multithreading just with custom data types and operator overloading
  - Thread dispatching adds some overhead: an operation should "take at least 10'000 – 100'000 instructions to execute"
  - But possible...

12 Ct
- A new paradigm for parallel programming in C and C++
- Does not extend the syntax of C or C++
- Adds new data types whose operators implicitly exploit SIMD instructions and multiple cores
- Regular and irregular data structures
- Hides parallelisation choices from the user

13 Ct example: a data-parallel operation on a large vector is mapped automatically to parallel hardware (SSE within a core, and multiple cores if the vector is large enough)

14 Conclusion
- Track fitting with the Kalman filter is embarrassingly parallel and scales well over multiple cores
- A lot of time (= money) can be saved by properly optimizing parallel reconstruction code
- Example vectorisation speedup from scalar double to packed single: 3.7
- Example multithreading speedup on 8 cores: 7.2
- Total with both optimizations: 3.7 µs / 0.12 µs ≈ 30 (ratio of real fit times per track)
- A proportional increase in speedup can be expected with future architectures

Backup Håvard Bjerke

16 Measurements on Itanium: chart of real fit time per track (µs) versus #tasks

17 Total speedup with both optimizations: 3.7 µs / 0.12 µs ≈ 30 (real fit time per track)

18 Observations
- Cell (16-way) and Clovertown with 8 cores have the highest throughput
- Woodcrest with 4 cores has the best per-core performance
- GCC has doubled the vectorised code performance compared to version 3.4.6