MASS CUDA Performance Analysis and Improvement

Ahmed Musazay
Faculty Advisor: Dr. Munehiro Fukuda

MASS: Multi-Agent Spatial Simulation
- Allows non-computing specialists to parallelize simulations
- Built around the concepts of Place and Agent objects
- Three versions: C++, Java, CUDA
- Provides a high-level abstraction for non-computing specialists

CUDA: C/C++ extension by NVIDIA
- A heterogeneous parallel programming interface
- Host: the CPU; Device: the GPU
- Functions executing on the GPU are called kernel functions
- Kernel launches take configuration parameters for the number of threads
- Fast, but difficult to use and hard to tune for performance
- Aim: use that performance while keeping the high-level abstraction of MASS
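
To make the host/device split and the launch configuration concrete, here is a minimal sketch. The names scaleKernel and scaleOnDevice are illustrative assumptions and are not part of MASS; the block size of 256 threads is an arbitrary choice.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Kernel function: runs on the device (GPU), one thread per array element.
__global__ void scaleKernel(double* data, int n, double factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may contain spare threads
        data[i] *= factor;
    }
}

// Host (CPU) code: copies data to the device, launches the kernel, and
// copies the result back. Returns false if any CUDA call fails.
// Illustrative helper, not a MASS function.
bool scaleOnDevice(std::vector<double>& values, double factor) {
    int n = static_cast<int>(values.size());
    if (n == 0) return true;
    size_t bytes = values.size() * sizeof(double);

    double* d_data = nullptr;
    if (cudaMalloc(&d_data, bytes) != cudaSuccess) return false;
    cudaMemcpy(d_data, values.data(), bytes, cudaMemcpyHostToDevice);

    // Launch configuration: enough blocks of 256 threads to cover n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    assert(blocks > 0);
    scaleKernel<<<blocks, threadsPerBlock>>>(d_data, n, factor);
    cudaError_t launchErr = cudaGetLastError();

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t copyErr = cudaMemcpy(values.data(), d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return launchErr == cudaSuccess && copyErr == cudaSuccess;
}
```

The two values inside `<<<...>>>` are the configuration parameters: the number of blocks and the number of threads per block. Choosing them well is part of what makes CUDA hard to tune.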

MASS-CUDA
- Current version written by Nathaniel Hart for his Master’s thesis
- Ported the C++ version of MASS to CUDA
- Object-oriented: allows users to extend Place and Agent objects
- Designed with the intention of using multiple GPU cards

Problem
- Performance issues
- Difficult to tune performance
Goal of project:
- Understand the MASS library and how it works
- Write unit tests to find where performance issues occur
- Propose solutions that can be implemented to increase the performance of MASS-CUDA

Heat2D
- Based on Fourier’s heat equation
- Simulation describing the spread of heat in a given region over a period of time
- Place objects: Metal
- Ran at four different sizes: 250x250, 500x500, 1000x1000, 2000x2000
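
For reference, the equation and one common way to discretize it are written out here. The slides do not state which scheme Heat2D uses, so the explicit finite-difference update is an assumption; u is the temperature, alpha the thermal diffusivity, and the grid spacing is taken to be equal in x and y.

```latex
% Fourier's heat equation in two dimensions
\frac{\partial u}{\partial t}
  = \alpha \left( \frac{\partial^2 u}{\partial x^2}
                + \frac{\partial^2 u}{\partial y^2} \right)

% Explicit (forward-time, centered-space) update for cell (i, j) at step n
u_{i,j}^{n+1} = u_{i,j}^{n}
  + r \left( u_{i+1,j}^{n} + u_{i-1,j}^{n}
           + u_{i,j+1}^{n} + u_{i,j-1}^{n} - 4\,u_{i,j}^{n} \right),
\qquad r = \frac{\alpha \, \Delta t}{\Delta x^{2}}
```

Each cell only needs its four neighbors from the previous step, which is why the simulation maps naturally onto one GPU thread per cell. This explicit scheme is stable for r <= 1/4.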

Test Case: Running Heat2D - Primitive Array
- Heat2D simulation using an array of doubles
- No objects created to hold the data, unlike in MASS
- Simulation functions written as kernel functions
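
A minimal sketch of what such a primitive-array version could look like follows. It is not the code used for the measurements: the names heatStep and runHeat2D, the fixed-temperature edges, the 16x16 block size, and the coefficient r are all assumptions made for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// One explicit heat step on a width x height grid stored as a flat,
// row-major array of doubles. Edge cells are held fixed (assumed boundary).
__global__ void heatStep(const double* in, double* out, int width, int height, double r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int i = y * width + x;
    if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
        out[i] = in[i];
        return;
    }
    out[i] = in[i] + r * (in[i - 1] + in[i + 1] + in[i - width] + in[i + width] - 4.0 * in[i]);
}

// Runs `steps` iterations on the device and returns the final grid.
// Returns an empty vector if a CUDA call fails. Illustrative helper only.
std::vector<double> runHeat2D(const std::vector<double>& initial, int width, int height,
                              double r, int steps) {
    assert(width > 0 && height > 0);
    assert(initial.size() == static_cast<size_t>(width) * height);
    size_t bytes = initial.size() * sizeof(double);

    double* d_a = nullptr;
    double* d_b = nullptr;
    if (cudaMalloc(&d_a, bytes) != cudaSuccess) return {};
    if (cudaMalloc(&d_b, bytes) != cudaSuccess) { cudaFree(d_a); return {}; }
    cudaMemcpy(d_a, initial.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    for (int s = 0; s < steps; ++s) {
        heatStep<<<grid, block>>>(d_a, d_b, width, height, r);
        std::swap(d_a, d_b);  // the newest grid is always in d_a
    }
    cudaError_t launchErr = cudaGetLastError();

    std::vector<double> result(initial.size());
    cudaError_t copyErr = cudaMemcpy(result.data(), d_a, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return {};
    return result;
}
```

Both buffers stay on the device for the whole run; the host only copies data in once at the start and out once at the end.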

Results

Proposed Solution
- Store all data in MASS as user-defined primitive-type arrays
- Map each index to a unique element
Pros:
- Fast accesses
- Can run larger simulations, since less heap memory overhead is required
Cons:
- Reduced user programmability
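
The index mapping could be sketched as below. The slides do not say which mapping is intended, so row-major order is an assumption, and toLinearIndex and toCoords are hypothetical helper names rather than MASS functions.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Maps N-dimensional coordinates to a unique row-major linear index.
// The last coordinate varies fastest, so neighbors along it are adjacent
// in memory. Callable from both host and device code.
__host__ __device__ inline int toLinearIndex(const int* coords, const int* sizes, int nDims) {
    int index = 0;
    for (int d = 0; d < nDims; ++d) {
        assert(coords[d] >= 0 && coords[d] < sizes[d]);
        index = index * sizes[d] + coords[d];
    }
    return index;
}

// Inverse mapping: recovers the coordinates from a linear index.
__host__ __device__ inline void toCoords(int index, const int* sizes, int nDims, int* coords) {
    for (int d = nDims - 1; d >= 0; --d) {
        coords[d] = index % sizes[d];
        index /= sizes[d];
    }
}
```

With such a mapping, each simulation attribute can live in its own primitive array, and element k of every array belongs to the same place.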

Test Case: Running Heat2D - Place objects
- Ran the simulation with the same objects used in MASS, without using library function calls
- Metal and MetalState derived from library classes, containing the same memory and internal functions
- Simulation functions rewritten in CUDA as kernel functions
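
The sketch below shows the general shape of an object-based version. The Metal and MetalState structs here are simplified, hypothetical stand-ins and do not reproduce the actual MASS classes; the two-kernel structure, the fixed edges, and the coefficient r are likewise assumptions.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for a per-place state record (not the MASS class).
struct MetalState {
    double temp;      // current temperature
    double nextTemp;  // value computed for the next step
};

// Hypothetical stand-in for a Place-derived object wrapping one state record.
struct Metal {
    MetalState* state;
    __device__ void computeNext(double n, double s, double e, double w, double r) {
        state->nextTemp = state->temp + r * (n + s + e + w - 4.0 * state->temp);
    }
    __device__ void commit() { state->temp = state->nextTemp; }
};

// Phase 1: each place reads its four neighbors and stores nextTemp.
__global__ void computeKernel(MetalState* states, int width, int height, double r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    int i = y * width + x;
    Metal place{&states[i]};
    if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
        place.state->nextTemp = place.state->temp;  // edges held fixed (assumption)
        return;
    }
    place.computeNext(states[i - width].temp, states[i + width].temp,
                      states[i + 1].temp, states[i - 1].temp, r);
}

// Phase 2: separate launch, so no place changes while neighbors still read it.
__global__ void commitKernel(MetalState* states, int count) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        Metal place{&states[i]};
        place.commit();
    }
}

// Illustrative host driver; returns the final temperatures (empty on error).
std::vector<double> runHeat2DObjects(const std::vector<double>& initial, int width,
                                     int height, double r, int steps) {
    assert(width > 0 && height > 0);
    assert(initial.size() == static_cast<size_t>(width) * height);
    int count = width * height;
    std::vector<MetalState> host(count);
    for (int i = 0; i < count; ++i) host[i] = MetalState{initial[i], initial[i]};

    MetalState* d_states = nullptr;
    size_t bytes = count * sizeof(MetalState);
    if (cudaMalloc(&d_states, bytes) != cudaSuccess) return {};
    cudaMemcpy(d_states, host.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    for (int s = 0; s < steps; ++s) {
        computeKernel<<<grid, block>>>(d_states, width, height, r);
        commitKernel<<<(count + 255) / 256, 256>>>(d_states, count);
    }
    cudaError_t launchErr = cudaGetLastError();
    cudaError_t copyErr = cudaMemcpy(host.data(), d_states, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_states);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return {};

    std::vector<double> result(count);
    for (int i = 0; i < count; ++i) result[i] = host[i].temp;
    return result;
}
```

Compared with a plain array of doubles, each access here goes through a struct, and neighboring temperatures sit 16 bytes apart instead of 8, so more memory traffic is needed per useful value.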

Results

Proposed Solution
- Remove unnecessary functionality that may be slowing the library down:
  - Excessive memory transfers between host and device
  - Partitioning logic
Pros:
- Can add a single library feature at a time, making sure each one meets the performance standard
- More computation spent on the actual simulation rather than on management
Cons:
- Scalability of the library will be missing early in development

Test Case: Running Heat2D - Coalesced Accesses
- Ran the simulation using primitive values, but taking advantage of coalesced memory accesses
- Kernel functions take array parameters in their native dimension (2D array)
- Allocation with cudaMallocPitch(), cudaMalloc3D()
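
A minimal sketch of a pitched 2D version follows, using cudaMallocPitch() and cudaMemcpy2D() from the CUDA runtime. The kernel and helper names (heatStepPitched, rowPtr, runHeat2DPitched), the fixed edges, and the coefficient r are assumptions for illustration, not the code that was measured.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// Start of row y in a pitched allocation (pitch is the row stride in bytes).
__device__ inline const double* rowPtr(const double* base, size_t pitch, int y) {
    return reinterpret_cast<const double*>(reinterpret_cast<const char*>(base) + y * pitch);
}
__device__ inline double* rowPtr(double* base, size_t pitch, int y) {
    return reinterpret_cast<double*>(reinterpret_cast<char*>(base) + y * pitch);
}

// Threads with consecutive threadIdx.x touch consecutive doubles of one row,
// and the pitch keeps the start of every row aligned.
__global__ void heatStepPitched(const double* in, double* out, size_t pitch,
                                int width, int height, double r) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;
    const double* row = rowPtr(in, pitch, y);
    double* outRow = rowPtr(out, pitch, y);
    if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
        outRow[x] = row[x];  // edges held fixed (assumption)
        return;
    }
    const double* up = rowPtr(in, pitch, y - 1);
    const double* down = rowPtr(in, pitch, y + 1);
    outRow[x] = row[x] + r * (row[x - 1] + row[x + 1] + up[x] + down[x] - 4.0 * row[x]);
}

// Illustrative host driver; returns the final grid (empty on error).
std::vector<double> runHeat2DPitched(const std::vector<double>& initial, int width,
                                     int height, double r, int steps) {
    assert(width > 0 && height > 0);
    assert(initial.size() == static_cast<size_t>(width) * height);
    size_t rowBytes = width * sizeof(double);
    double* d_a = nullptr;
    double* d_b = nullptr;
    size_t pitchA = 0, pitchB = 0;
    if (cudaMallocPitch((void**)&d_a, &pitchA, rowBytes, height) != cudaSuccess) return {};
    if (cudaMallocPitch((void**)&d_b, &pitchB, rowBytes, height) != cudaSuccess ||
        pitchA != pitchB) {  // the kernel assumes one pitch for both buffers
        cudaFree(d_a);
        cudaFree(d_b);
        return {};
    }
    cudaMemcpy2D(d_a, pitchA, initial.data(), rowBytes, rowBytes, height,
                 cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    for (int s = 0; s < steps; ++s) {
        heatStepPitched<<<grid, block>>>(d_a, d_b, pitchA, width, height, r);
        std::swap(d_a, d_b);  // the newest grid is always in d_a
    }
    cudaError_t launchErr = cudaGetLastError();

    std::vector<double> result(initial.size());
    cudaError_t copyErr = cudaMemcpy2D(result.data(), rowBytes, d_a, pitchA, rowBytes,
                                       height, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    if (launchErr != cudaSuccess || copyErr != cudaSuccess) return {};
    return result;
}
```

The host grid stays tightly packed; only the device copy is padded, and cudaMemcpy2D() translates between the two row strides.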

Results

Proposed Solution
- Let MASS run the simulation in its native dimension (1D, 2D, 3D)
Pros:
- Faster memory accesses, increasing performance
Cons:
- Extra overhead of determining which dimension to run a function in
- Can natively run only up to 3 dimensions
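
One way the dimension-dependent launch setup could look is sketched here. LaunchConfig, ceilDiv, and makeLaunchConfig are hypothetical names, and the block shapes (256 threads in each case) are arbitrary choices rather than tuned values.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Grid and block shapes for one kernel launch (hypothetical helper type).
struct LaunchConfig {
    dim3 grid;
    dim3 block;
};

inline unsigned int ceilDiv(unsigned int a, unsigned int b) {
    return (a + b - 1) / b;
}

// Picks a launch shape that matches the simulation's native dimension.
// sizes[0] is the fastest-varying extent (mapped to x). CUDA grids and
// blocks have at most three dimensions, hence the limit on nDims.
inline LaunchConfig makeLaunchConfig(const unsigned int* sizes, int nDims) {
    assert(nDims >= 1 && nDims <= 3);
    LaunchConfig cfg;  // dim3 members default to (1, 1, 1)
    switch (nDims) {
    case 1:
        cfg.block = dim3(256);
        cfg.grid = dim3(ceilDiv(sizes[0], 256));
        break;
    case 2:
        cfg.block = dim3(16, 16);
        cfg.grid = dim3(ceilDiv(sizes[0], 16), ceilDiv(sizes[1], 16));
        break;
    default:
        cfg.block = dim3(8, 8, 4);
        cfg.grid = dim3(ceilDiv(sizes[0], 8), ceilDiv(sizes[1], 8), ceilDiv(sizes[2], 4));
        break;
    }
    return cfg;
}
```

A kernel would then be launched as `kernel<<<cfg.grid, cfg.block>>>(...)`. This sketch does not check device limits on grid size, which a library version would need to do.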

Conclusion
- Remove unused features and implement one feature at a time
- Coalesced memory accesses: use native array dimensions
- Use primitive arrays
- Consider: shared memory

Final Words
Relevant courses:
- CSS 430 Operating Systems
- CSS 422 Hardware and Computer Organization
Special thanks to:
- Dr. Fukuda
- Nathaniel Hart
