CSS 700: MASS CUDA Parallel-Computing Library for Multi-Agent Spatial Simulation
Fall Quarter 2014
Nathaniel Hart
UW Bothell Computing & Software Systems

Project Milestones
M1: The ability to execute simulations that fit on a single device. The initial focus is on creating the command & control logic, with hooks to add multi-device functionality later.
M2: The ability to execute simulations that fit on multiple devices. This will require extensible border exchange logic.
M3: The ability to execute simulations that exceed the memory of a host's available devices. This requires a great deal of the partition and border exchange logic.

Quarter Goals (M1)

Date Range: 9/15/14 – 10/15/14
Activity: Specify the MASS CUDA architecture and the single-GPU implementation
Deliverable: Specifications for algorithms and design choices for the phase I implementation

Date Range: 10/16/14 – 12/15/14
Activities: Implement part I of MASS Agents; design multi-GPU communication algorithms
Deliverables:
- A minimally viable version of MASS CUDA that can run a simulation whose size does not exceed the available memory on a single GPU
- A version of the MASS CUDA library suitable for use in CSS 534
- Performance statistics for this initial implementation
- Specifications for algorithms and design choices for the phase II implementation

Project Status

Status Details
- Still need to lock down the final ghost space solution for Agents
- Debugging is now possible on Hercules
- Need to talk to Jason to figure out how to remote in via a GUI

Blocking Issues
- Hercules 2 is still not operational; depends on Chris Fox to unblock
- Places will not instantiate properly

Virtual Functions
[Figure slide; image and source not captured in the transcript]

Virtual Functions (continued)
[Figure slide; image and source not captured in the transcript]
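The two figure slides above presumably illustrate the underlying issue: a C++ object with virtual functions carries a hidden pointer to its class's virtual function table, and that table only exists in the memory space where the object was constructed. Below is a minimal sketch of the failure mode (not from the slides; the Place and PlaceImpl names are hypothetical stand-ins for the MASS CUDA classes): an object built on the host is bitwise-copied to the device, so its vtable pointer still refers to host memory, and any virtual call made from a kernel is undefined behavior.

// Minimal sketch of the virtual-function-table problem (hypothetical
// Place/PlaceImpl classes, not the actual MASS CUDA types). The object is
// constructed on the host, so its hidden vtable pointer refers to host
// memory; after a bitwise cudaMemcpy, a virtual call from a kernel
// dereferences that stale pointer, which is undefined behavior.
#include <cuda_runtime.h>

class Place {
public:
    __host__ __device__ virtual int callMethod() { return 0; }
};

class PlaceImpl : public Place {
public:
    __host__ __device__ int callMethod() override { return 42; }
};

__global__ void usePlace(Place *p, int *out) {
    *out = p->callMethod();   // vtable pointer is invalid on the device
}

int main() {
    PlaceImpl hostPlace;                     // vtable set up for host code
    Place *devPlace;
    int   *devOut;
    cudaMalloc(&devPlace, sizeof(PlaceImpl));
    cudaMalloc(&devOut, sizeof(int));

    // Data members arrive intact, but the vtable pointer does not.
    cudaMemcpy(devPlace, &hostPlace, sizeof(PlaceImpl), cudaMemcpyHostToDevice);

    usePlace<<<1, 1>>>(devPlace, devOut);    // undefined behavior at runtime
    cudaDeviceSynchronize();
    return 0;
}

This is why the workaround on the next slide constructs each object inside a kernel: when new PlaceImpl() executes on the device, the vtable pointer is initialized in device memory, and virtual dispatch works for all later kernel calls.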

The Workaround

__global__ void createPlaces(Places **places, int qty) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < qty) {
        // virtual function table created on GPU
        places[idx] = new PlaceImpl();
    }
}
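For context, here is a hedged sketch of how the host side might drive this kernel (not taken from the MASS CUDA source; the class bodies, heap-size value, and launch geometry are illustrative assumptions). The host only allocates the array of device pointers, while the objects themselves, and therefore their vtables, are created by device-side new inside the kernel.

// Hedged host-side usage sketch (not from the slides). It reuses the kernel
// shown above; the Places/PlaceImpl definitions here are minimal stand-ins
// so the example is self-contained, and the heap size and launch geometry
// are illustrative assumptions.
#include <cuda_runtime.h>

class Places {
public:
    __device__ virtual void callMethod() {}
};

class PlaceImpl : public Places {
public:
    __device__ void callMethod() override {}
};

__global__ void createPlaces(Places **places, int qty) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < qty) {
        places[idx] = new PlaceImpl();       // vtable built in device memory
    }
}

int main() {
    const int qty = 1024;

    // Device-side new allocates from the device heap, not from cudaMalloc'd
    // memory, so enlarge the heap before the first launch that uses it.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);

    Places **devPlaces;
    cudaMalloc(&devPlaces, qty * sizeof(Places *));

    int blockSize = 256;
    int gridSize  = (qty + blockSize - 1) / blockSize;
    createPlaces<<<gridSize, blockSize>>>(devPlaces, qty);
    cudaDeviceSynchronize();
    return 0;
}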

Old Architecture
[Architecture diagram; not captured in the transcript]

New Architecture
[Architecture diagram; not captured in the transcript]

The New Problem
Instances created on the GPU cannot be copied back to the host and used for analysis. Just as the virtual function table breaks when objects are copied from the host to the GPU, it breaks when they are copied from the GPU back to the host.

The New Workaround
[Diagram slide; content not captured in the transcript]

The New Workaround (continued)
[Diagram slide; content not captured in the transcript]
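The two workaround slides are diagrams only, so the transcript does not capture the design. One plausible reading, consistent with TODO item 1 on the next slide ("separation of behavior and state"), is to keep the polymorphic behavior objects device-resident and move only plain-old-data state across the bus. The sketch below illustrates that pattern under those assumptions; it is not the actual MASS CUDA implementation.

// Hedged sketch of separating behavior from state (an assumption based on
// TODO item 1, not the actual MASS CUDA design). State is a POD struct that
// can be copied in either direction for host-side analysis; the object that
// owns the vtable lives only on the device and is never copied.
#include <cuda_runtime.h>

struct PlaceState {                  // POD: safe to memcpy host <-> device
    int    index;
    double value;
};

class PlaceBehavior {                // device-only: owns the vtable
public:
    __device__ virtual void update(PlaceState &s) { s.value += 1.0; }
};

__global__ void callAll(PlaceBehavior **behaviors, PlaceState *states, int qty) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < qty) {
        behaviors[idx]->update(states[idx]);   // virtual dispatch stays on the GPU
    }
}

// After the kernel, only the state array crosses the bus for analysis:
//   cudaMemcpy(hostStates, devStates, qty * sizeof(PlaceState),
//              cudaMemcpyDeviceToHost);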

Big TODOs
1. Refactor existing code to allow for place instantiation and the separation of behavior and state
2. Implement all control kernel functions
3. Write more tests

Questions?