1 Evaluation of parallel particle swarm optimization algorithms within the CUDA™ architecture
Luca Mussi, Fabio Daolio, Stefano Cagnoni, Information Sciences, 181(20), 2011.
Presenter: Guan-Yu Chen (Stu. No.: MA0G0202)
Advisor: Shu-Chen Cheng

2 Outline
1. Particle swarm optimization (PSO)
2. PSO parallelization
3. The CUDA™ architecture
4. Parallel PSO within the CUDA™ architecture
5. Results
6. Final remarks

3 1. Particle swarm optimization (1/3)
Introduced by Kennedy & Eberhart (1995).
– Velocity function.
– Fitness function.

4 1. Particle swarm optimization (2/3)
Velocity function: symbols used in the update rule.
– V: the velocity of a particle.
– X: the position of a particle.
– X_gbest: the best-fitness point ever found by the whole swarm.
– X_lbest: the best-fitness position reached by the particle itself.
– C1, C2: two positive constants.
– R1, R2: two random numbers uniformly drawn between 0 and 1.
– w: inertia weight.
– t: time (generation) index.
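
In the standard inertia-weight PSO formulation (assumed here to match the paper's notation), these symbols combine as:

V(t) = w V(t-1) + C_1 R_1 (X_{lbest} - X(t-1)) + C_2 R_2 (X_{gbest} - X(t-1))
X(t) = X(t-1) + V(t)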

5 1. Particle swarm optimization (3/3)
Fitness function → problem-dependent, defined by the user for the task at hand.
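
As a concrete example, the Sphere benchmark that appears later in the results uses the minimization fitness:

f(X) = \sum_{i=1}^{D} x_i^2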

6 2. PSO parallelization
– Master-slave paradigm.
– Island model (coarse-grained).
– Cellular model (fine-grained).
– Synchronous or asynchronous update schemes.

7 3. The CUDA™ architecture (1/5)
CUDA™ (nVIDIA™, Nov. 2006): a handy tool to develop scientific programs oriented to massively parallel computation.
Hierarchy: kernels → grid → thread blocks → threads.
– How many thread blocks for the problem?
– How many threads per thread block?
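
A minimal sketch of this hierarchy in CUDA C; the kernel and the launch sizes below are hypothetical, chosen only for illustration:

#include <cuda_runtime.h>

// Each launch creates a grid of thread blocks; every block runs the same
// number of threads, and each thread handles one array element here.
__global__ void scaleKernel(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;                          // one element per thread
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;                                  // threads per block
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;   // blocks in the grid
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);  // grid of blocks of threads
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}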

8 3. The CUDA™ architecture (2/5)
Each Streaming Multiprocessor (SM) contains:
– 8 scalar processing cores,
– a number of fast 32-bit registers,
– a parallel data cache shared between all cores,
– a read-only constant cache,
– a read-only texture cache.

9 3. The CUDA™ architecture (3/5)
SIMT (Single Instruction, Multiple Thread)
– Each SM creates, manages, schedules, and executes threads in groups (warps) of 32 parallel threads.
– The main difference from a SIMD (Single Instruction, Multiple Data) architecture is that SIMT instructions specify the execution and branching behavior of a single thread.
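
A small illustration of that branching behavior: under SIMT, threads of the same warp that take different sides of a branch are executed one path after the other (hypothetical kernel, for illustration only):

__global__ void divergenceExample(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (threadIdx.x % 2 == 0)       // even and odd lanes of the same warp
            data[i] = data[i] * 2.0f;   // diverge here: the two paths are
        else                            // serialized within the warp
            data[i] = data[i] + 1.0f;
    }
}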

10 3. The CUDA™ architecture (4/5)
Each kernel should reflect the following structure:
a) load data from global/texture memory;
b) process data;
c) store results back to global memory.
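
A hedged sketch of a kernel following this three-step structure; the SAXPY-style operation is only an example:

__global__ void saxpyKernel(const float *x, const float *y, float *out,
                            float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float xi = x[i];          // a) load data from global memory
        float yi = y[i];
        float r  = a * xi + yi;   // b) process data
        out[i]   = r;             // c) store results back to global memory
    }
}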

11 3. The CUDA™ architecture (5/5)
The most important specific programming guidelines:
a) minimize data transfers between the host and the graphics card;
b) minimize the use of global memory: shared memory should be preferred;
c) ensure global memory accesses are coalesced whenever possible;
d) avoid different execution paths within the same warp.
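
A sketch illustrating guidelines (b) and (c): a hypothetical block-wise sum reduction that stages data in shared memory through coalesced global loads instead of touching global memory repeatedly:

__global__ void blockSumKernel(const float *in, float *blockSums, int n)
{
    extern __shared__ float cache[];                 // the SM's parallel data cache
    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    cache[tid] = (i < n) ? in[i] : 0.0f;             // coalesced: consecutive threads
    __syncthreads();                                 // read consecutive addresses

    // Tree reduction in shared memory; assumes blockDim.x is a power of two.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        blockSums[blockIdx.x] = cache[0];            // one global write per block
}

It would be launched with the shared-memory size as the third launch parameter, e.g. blockSumKernel<<<gridSize, 256, 256 * sizeof(float)>>>(d_in, d_blockSums, n).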

12 4. Parallel PSO within the CUDA™ architecture
The main obstacle to PSO parallelization is the dependence between particles' updates.
SyncPSO
– X_gbest and X_lbest are updated only at the end of each generation.
RingPSO
– Relaxes the synchronization constraint.
– Allows the computation load to be distributed over all available SMs.
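
A compact sketch of the single-kernel SyncPSO idea: one thread block per swarm, one thread per particle, the whole generation loop on the GPU with __syncthreads() between phases. The Sphere fitness, cuRAND, the fixed sizes, and all names are assumptions for illustration, not the authors' implementation:

#include <curand_kernel.h>

#define D  8    // problem dimensions (compile-time for simplicity)
#define NP 32   // particles per swarm = threads per block

__global__ void syncPsoKernel(float *gbestFitOut, int nGen,
                              float w, float c1, float c2,
                              unsigned long long seed)
{
    __shared__ float gbestPos[D];     // best position found by the swarm
    __shared__ float gbestFit;        // and its fitness
    __shared__ float fitSh[NP];       // personal-best fitness of every particle
    __shared__ int   bestIdx;         // particle that improved the swarm best

    int p = threadIdx.x;              // one thread per particle
    curandState rng;
    curand_init(seed, blockIdx.x * NP + p, 0, &rng);

    float x[D], v[D], pbest[D], pbestFit = 1e30f;
    for (int d = 0; d < D; ++d) {                      // random init in [-100, 100]
        x[d] = 200.0f * curand_uniform(&rng) - 100.0f;
        v[d] = 0.0f;
        pbest[d] = x[d];
    }
    if (p == 0) gbestFit = 1e30f;
    __syncthreads();

    for (int g = 0; g < nGen; ++g) {
        float fit = 0.0f;                              // Sphere fitness (example)
        for (int d = 0; d < D; ++d) fit += x[d] * x[d];

        if (fit < pbestFit) {                          // update personal best
            pbestFit = fit;
            for (int d = 0; d < D; ++d) pbest[d] = x[d];
        }
        fitSh[p] = pbestFit;
        __syncthreads();

        if (p == 0) {                                  // thread 0 scans the swarm
            bestIdx = -1;
            for (int q = 0; q < NP; ++q)
                if (fitSh[q] < gbestFit) { gbestFit = fitSh[q]; bestIdx = q; }
        }
        __syncthreads();
        if (p == bestIdx)                              // winner publishes its position
            for (int d = 0; d < D; ++d) gbestPos[d] = pbest[d];
        __syncthreads();                               // end-of-generation barrier

        for (int d = 0; d < D; ++d) {                  // velocity and position update
            float r1 = curand_uniform(&rng), r2 = curand_uniform(&rng);
            v[d] = w * v[d] + c1 * r1 * (pbest[d] - x[d])
                            + c2 * r2 * (gbestPos[d] - x[d]);
            x[d] += v[d];
        }
    }
    if (p == 0) gbestFitOut[blockIdx.x] = gbestFit;    // one result per swarm
}

A host would launch it with one block per swarm, e.g. syncPsoKernel<<<numSwarms, NP>>>(d_results, nGen, w, c1, c2, seed).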

13 Basic parallel PSO design (1/2)

14 Basic parallel PSO design (2/2)

15 Multi-kernel parallel PSO algorithm (1/3)
Flat index of a particle's coordinate in global memory:
posID = (swarmID * n + particleID) * D + dimensionID
(n = particles per swarm, D = problem dimensions)
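
In CUDA terms this flat index can be derived from the built-in thread coordinates; the particular mapping below (blockIdx.y = swarm, blockIdx.x = particle, threadIdx.x = dimension) is an assumption for illustration:

__global__ void positionAccessExample(float *positions, int n, int D)
{
    int swarmID     = blockIdx.y;     // which swarm
    int particleID  = blockIdx.x;     // which particle in the swarm
    int dimensionID = threadIdx.x;    // which coordinate

    // Flat index from the slide: all swarms/particles/dimensions are laid out
    // contiguously in one global array.
    int posID = (swarmID * n + particleID) * D + dimensionID;

    // Example access: clamp the coordinate to a search range.
    positions[posID] = fminf(fmaxf(positions[posID], -100.0f), 100.0f);
}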

16 Multi-kernel parallel PSO algorithm (2/3)
PositionUpdateKernel (1st kernel)
– Updates the particles' positions, scheduling a number of thread blocks equal to the number of particles.
FitnessKernel (2nd kernel)
– Computes the fitness of each particle.
BestUpdateKernel (3rd kernel)
– Updates X_gbest and X_lbest.
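
A hedged host-side sketch of the resulting generation loop; the kernel names come from the slide, while the signatures, launch shapes, and empty kernel bodies below are placeholders only:

__global__ void PositionUpdateKernel(float *pos, float *vel, const float *best) { /* body omitted */ }
__global__ void FitnessKernel(const float *pos, float *fit)                     { /* body omitted */ }
__global__ void BestUpdateKernel(const float *fit, const float *pos, float *best) { /* body omitted */ }

void runMultiKernelPSO(float *pos, float *vel, float *fit, float *best,
                       int nSwarms, int nParticles, int D, int nGenerations)
{
    dim3 posGrid(nParticles, nSwarms);   // one block per particle, as stated above
    dim3 posBlock(D);                    // one thread per dimension
    dim3 bestGrid(nSwarms);              // assumed: one block per swarm
    dim3 bestBlock(nParticles);          // assumed: one thread per particle

    for (int g = 0; g < nGenerations; ++g) {
        PositionUpdateKernel<<<posGrid, posBlock>>>(pos, vel, best);
        FitnessKernel<<<posGrid, posBlock>>>(pos, fit);
        BestUpdateKernel<<<bestGrid, bestBlock>>>(fit, pos, best);
        // No explicit synchronization is needed between launches: kernels issued
        // to the same stream execute in order on the device.
    }
    cudaDeviceSynchronize();             // wait for the last generation to finish
}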

17 Multi-kernel parallel PSO algorithm (3/3)

18 5. Results
Parameter settings: w = … and C1 = C2 = … (constant inertia weight and acceleration coefficients).

19 SyncPSO (1/2)
Hardware: Asus GeForce EN8800GT GPU; Intel Core2 Duo™ CPU @ 1.86 GHz.
a) 100 consecutive runs of a single swarm of 32, 64, and 128 particles on the 5-dimensional Rastrigin function, versus the number of generations.
b) Time to run 10,000 generations of one swarm with 32, 64, and 128 particles, as the dimension of the generalized Rastrigin function grows (up to nine dimensions).

20 SyncPSO (2/2)

21 RingPSO (1/5)
Hardware: nVIDIA™ Quadro FX 5800; Zotac GeForce GTX260 AMP² edition; Asus GeForce EN8800GT; SPSO run on a 64-bit Intel(R) Core(TM) i7 CPU @ 2.67 GHz.
Compared implementations:
1) the sequential SPSO version, modified to implement the ring topology;
2) the 'basic' three-kernel version of RingPSO;
3) RingPSO implemented with only two kernels (one kernel that fuses BestUpdateKernel and PositionUpdateKernel, plus FitnessKernel).

22 RingPSO (2/5): Sphere function, search domain [−100, 100]^D

23 RingPSO (3/5): Rastrigin function, search domain [−5.12, 5.12]^D

24 RingPSO (4/5): Rosenbrock function, search domain [−30, 30]^D

25 RingPSO (5/5)

26 6. Final remarks (1/2)
– The number of particles SyncPSO can simulate is usually more than enough for any practical application.
– However, SyncPSO's usage of computation resources is very inefficient when only one or a few swarms need to be simulated.
– SyncPSO also becomes inefficient when the problem size grows above a certain threshold.

27 6. Final remarks (2/2)
– For the multi-kernel version, the drawbacks of accessing global memory are more than compensated by the advantages of parallelization.
– The speed-up of the multi-kernel version increases with problem size.
– Both versions are far better than the most recent results published on the same task.

28 The End~ Thanks for your attention!!