Towards Acceleration of Fault Simulation Using Graphics Processing Units
Kanupriya Gulati, Sunil P. Khatri
Department of ECE, Texas A&M University, College Station

Outline Introduction Technical Specifications of the GPU CUDA Programming Model Approach Experimental Setup and Results Conclusions

Introduction Fault Simulation (FS) is crucial in the VLSI design flow. Given a digital design and a set of vectors V, FS evaluates the number of stuck-at faults (F_sim) tested by applying V. The ratio F_sim / F_total is a measure of fault coverage. Current designs have millions of logic gates, and the number of faulty variations is proportional to design size. Each of these variations needs to be simulated for the V vectors. Therefore, it is important to explore ways to accelerate FS. The ideal FS approach should be fast, scalable and cost effective.

Introduction We accelerate FS using graphics processing units (GPUs), by exploiting fault- and pattern-parallel approaches. A GPU is essentially a commodity stream processor: highly parallel, very fast, and operating in a SIMD (Single-Instruction, Multiple-Data) paradigm. GPUs, owing to their massively parallel architecture, have been used to accelerate image/stream processing, data compression, and numerical algorithms (LU decomposition, FFT, etc.).

Introduction We implemented our approach on the NVIDIA GeForce 8800 GTX GPU. By careful engineering, we maximally harness the GPU's raw computational power and huge memory bandwidth. We used the Compute Unified Device Architecture (CUDA) framework, a C-like GPU programming and interfacing tool. When using a single 8800 GTX GPU card, ~35X speedup is obtained compared to a commercial FS tool; this accounts for CPU processing and data transfer times as well. Our runtimes are also projected for the NVIDIA Tesla server, which can house up to 8 GPU devices; ~238X speedup is possible compared to the commercial engine.

Outline Introduction Technical Specifications of the GPU CUDA Programming Model Approach Experimental Setup and Results Conclusions

GPU – A Massively Parallel Processor [Figure: GPU architecture overview] Source: "NVIDIA CUDA Programming Guide" version 1.1

GeForce 8800 GTX Technical Specs 367 GFLOPS peak performance for certain applications, many times that of current high-end microprocessors. Up to 265 GFLOPS sustained performance. Massively parallel: 128 SIMD processor cores, partitioned into 16 Multiprocessors (MPs). Massively threaded: sustains 1000s of threads per application. 768 MB device memory. 1.4 GHz clock frequency (CPUs at ~4 GHz). 86.4 GB/sec memory bandwidth (CPU front-side bus at ~8 GB/sec). A 1U Tesla server from NVIDIA can house up to 8 GPUs.

Outline Introduction Technical Specifications of the GPU CUDA Programming Model Approach Experimental Setup and Results Conclusions

CUDA Programming Model The GPU is viewed as a compute device that: is a coprocessor to the CPU or host, has its own DRAM (device memory), and runs many threads in parallel. [Figure: the host (CPU) connects over PCIe to the device (GPU), which has its own device memory and runs kernel threads (instances of the kernel).]

CUDA Programming Model Data-parallel portions of an application are executed on the device, in parallel, on many threads. Kernel: code routine executed on the GPU. Thread: instance of a kernel. Differences between GPU and CPU threads: GPU threads are extremely lightweight, with very little creation overhead. The GPU needs 1000s of threads to achieve full parallelism, which allows memory access latencies to be hidden. Multi-core CPUs require fewer threads, but the available parallelism is lower.

Thread Batching: Grids and Blocks A kernel is executed as a grid of thread blocks (aka blocks). All threads within a block share a portion of data memory. A thread block is a batch of threads that can cooperate with each other by: synchronizing their execution for hazard-free common memory accesses, and efficiently sharing data through a low-latency shared memory. Two threads from two different blocks cannot cooperate. [Figure: the host launches Kernel 1 and Kernel 2 on the device; each kernel runs as a grid of blocks, and each block contains a 2D arrangement of threads.] Source: "NVIDIA CUDA Programming Guide" version 1.1
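As an illustration (not from the original slides), the following minimal CUDA sketch shows a kernel launched as a grid of blocks, where the threads of one block cooperate through shared memory and a barrier; the kernel name, array names and sizes are hypothetical.

#include <cuda_runtime.h>

__global__ void blockSumKernel(const int *in, int *blockSums) {
    __shared__ int partial[256];               // low-latency, per-block shared memory
    int tid = threadIdx.x;
    partial[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                           // synchronizes threads of this block only

    // Tree reduction within the block (threads of other blocks cannot take part).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }
    if (tid == 0) blockSums[blockIdx.x] = partial[0];
}

int main() {
    const int numBlocks = 64, threadsPerBlock = 256, n = numBlocks * threadsPerBlock;
    int *dIn, *dSums;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dSums, numBlocks * sizeof(int));
    // ... fill dIn from the host with cudaMemcpy ...
    blockSumKernel<<<numBlocks, threadsPerBlock>>>(dIn, dSums);  // grid of 64 blocks, 256 threads each
    cudaDeviceSynchronize();
    cudaFree(dIn); cudaFree(dSums);
    return 0;
}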

Block and Thread IDs Threads and blocks have IDs, so each thread can identify what data it will operate on. Block ID: 1D or 2D. Thread ID: 1D, 2D, or 3D. This simplifies memory addressing when processing multidimensional data: image processing, solving PDEs on volumes, and other problems with underlying 1D, 2D or 3D geometry. [Figure: a grid of 2D blocks, each block containing a 2D array of threads with (x, y) thread IDs.] Source: "NVIDIA CUDA Programming Guide" version 1.1
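To make the ID-to-data mapping concrete, here is a small illustrative sketch (not from the slides) in which each thread uses its block and thread IDs to locate the pixel it owns in a 2D image; the kernel and variable names are hypothetical.

__global__ void invertImage(unsigned char *img, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column, from 2D block/thread IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

// Launched with a 2D grid of 2D blocks, e.g.:
//   dim3 block(16, 16);
//   dim3 grid((width + 15) / 16, (height + 15) / 16);
//   invertImage<<<grid, block>>>(dImg, width, height);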

Device Memory Space Overview Each thread has: R/W per-thread registers, R/W per-thread local memory, R/W per-block shared memory, R/W per-grid global memory, read-only per-grid constant memory, and read-only per-grid texture memory. The host can R/W the global, constant and texture memories. [Figure: memory hierarchy of a grid on the device: per-thread registers and local memory, per-block shared memory, and grid-wide global, constant and texture memories accessible from the host.] Source: "NVIDIA CUDA Programming Guide" version 1.1

Device Memory Space Usage Register usage per thread should be minimized (there is a maximum number of registers per MP). Shared memory is organized in banks; avoid bank conflicts. Global memory is the main means of communicating R/W data between host and device; its contents are visible to all threads, and coalesced accesses are recommended. Texture and constant memories are cached, initialized by the host, and visible to all threads. A short declaration sketch follows. [Figure: same device memory hierarchy as on the previous slide.] Source: "NVIDIA CUDA Programming Guide" version 1.1
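The following hedged sketch (not from the slides) shows where data can live in these spaces; the symbol names and sizes are illustrative only.

__constant__ float coeffs[16];               // per-grid constant memory, cached, set by the host

__global__ void scaleKernel(const float *gIn, float *gOut, int n) {
    __shared__ float tile[256];               // per-block shared memory, organized in banks
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // scalars like i live in registers
    if (i < n) {
        tile[threadIdx.x] = gIn[i];           // read from global memory (coalesced if contiguous)
        __syncthreads();
        gOut[i] = tile[threadIdx.x] * coeffs[0];     // write result back to global memory
    }
}

// Host side (sketch):
//   cudaMemcpyToSymbol(coeffs, hostCoeffs, 16 * sizeof(float));   // initialize constant memory
//   scaleKernel<<<blocks, 256>>>(dIn, dOut, n);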

Outline Introduction Technical Specifications of the GPU CUDA Programming Model Approach Experimental Setup and Results Conclusions

Approach We implement a lookup-table (LUT) based FS. All gates' LUTs are stored in texture memory (cached); the LUTs of all library gates fit in the texture cache, which avoids cache misses during lookup. An individual k-input gate LUT requires 2^k entries. Each gate's LUT entries are located at a fixed offset in the texture memory. The gate output is obtained by accessing the memory at "gate offset + input value". Example: the output of an AND2 gate when its inputs are '0' and '1' is read at the AND2 offset plus the value encoding the input pair (0, 1). A hedged indexing sketch follows.
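As an illustration of the indexing (an assumption-laden sketch, not the authors' exact code): the concatenated LUTs are bound to a 1D texture, and a gate is evaluated by fetching at its type's offset plus the integer formed by its input values. The texture name, offsets and input encoding are hypothetical.

texture<int, 1, cudaReadModeElementType> lutTex;   // all gate LUTs concatenated in texture memory
// (the host binds the LUT array to lutTex with cudaBindTexture before kernel launch)

__device__ int evalGate(int gateOffset, int inputCode) {
    // inputCode packs the gate's input values; e.g. for AND2 with inputs a = 0, b = 1,
    // inputCode = (b << 1) | a = 2, so the fetched entry is LUT[gateOffset + 2] = 0.
    return tex1Dfetch(lutTex, gateOffset + inputCode);
}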

Approach In practice we evaluate two vectors for the same gate in a single thread, so 1/2/3/4-input gates require 4/16/64/256 LUT entries respectively. Our library consists of an INV and 2/3/4-input AND, NAND, NOR and OR gates; hence the total memory required for all LUTs is 1348 words. This fits in the texture memory cache (8 KB per MP). We exploit both fault and pattern parallelism.

Approach – Fault Parallelism All gates at a fixed topological level are evaluated in parallel. [Figure: a levelized circuit from primary inputs to primary outputs; all gates at topological level L are processed fault-parallel, sweeping across the logic levels.]

Approach – Pattern Parallelism Simulations of any gate, for different patterns, are done in parallel, in 2 phases. Phase 1: good circuit simulation; results are returned to the CPU. Phase 2: faulty circuit simulation; the CPU does not schedule a stuck-at-v fault in a pattern which has v as the good circuit value, for all the faults which lie in its TFI. Fault injection is also performed in parallel. [Figure: for vectors 1 … N, the good circuit value and the faulty circuit value of a gate are computed pattern-parallel.]
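A hedged host-side sketch of the two phases (the helper functions, fault record and loop structure are assumptions, not the authors' code): the good circuit is simulated level by level and copied back, and only excitable fault/pattern pairs are scheduled for the faulty phase.

#include <vector>
#include <cuda_runtime.h>

struct Fault { int node, pattern, v; };                  // hypothetical fault record
void launchGoodSimLevel(int level);                      // hypothetical: good-sim kernel launch for one level
int  goodValue(const int *hGood, const Fault &f);        // hypothetical: good value at the fault site/pattern
void launchFaultySim(const Fault &f);                    // hypothetical: faulty-sim kernel launch for one fault

void runTwoPhases(int numLevels, int numGates, int *dGood, int *hGood,
                  const std::vector<Fault> &faults) {
    // Phase 1: good circuit simulation, level by level; results returned to the CPU.
    for (int level = 0; level < numLevels; ++level)
        launchGoodSimLevel(level);
    cudaMemcpy(hGood, dGood, numGates * sizeof(int), cudaMemcpyDeviceToHost);

    // Phase 2: faulty circuit simulation. A stuck-at-v fault is not scheduled for a
    // pattern whose good circuit value at the fault site is already v (not excited).
    for (const Fault &f : faults)
        if (goodValue(hGood, f) != f.v)
            launchFaultySim(f);
}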

Approach – Logic Simulation

typedef struct __align__(16) {
    int offset;       // Gate type's offset
    int a, b, c, d;   // Input values
    int m0, m1;       // Mask variables
} threadData;
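A minimal kernel sketch of the logic simulation step, reusing the threadData structure above and the lutTex texture from the earlier sketch (this is an illustration under assumptions, not the authors' exact kernel; the 2-bit input packing and array names are hypothetical):

__global__ void logicSimKernel(const threadData *gates, int *out, int numGates) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;          // one thread per gate evaluation (2 vectors)
    if (i < numGates) {
        threadData g = gates[i];
        int idx = g.a | (g.b << 2) | (g.c << 4) | (g.d << 6);   // pack the (2-bit) input values
        out[i] = tex1Dfetch(lutTex, g.offset + idx);             // LUT lookup at gate offset + inputs
    }
}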

Approach – Fault Injection

typedef struct __align__(16) {
    int offset;       // Gate type's offset
    int a, b, c, d;   // Input values
    int m0, m1;       // Mask variables
} threadData;

m0   m1   Meaning
-    1    Stuck-at-1 mask
1    0    No fault injection
0    0    Stuck-at-0 mask
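One consistent reading of the mask table above (a hedged sketch, not necessarily the authors' exact formula) is that the simulated value is ANDed with m0 and ORed with m1: m0 = 0 forces a stuck-at-0, m1 = 1 forces a stuck-at-1, and m0 = 1, m1 = 0 leaves the good value unchanged. Since the values are packed per vector, the bitwise operation applies the mask to each vector.

// Hedged sketch: apply the fault-injection masks to a gate's simulated value.
__device__ int injectFault(int simValue, int m0, int m1) {
    return (simValue & m0) | m1;   // m0 = 0 -> stuck-at-0, m1 = 1 -> stuck-at-1, (1, 0) -> no injection
}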

Approach – Fault Detection

typedef struct __align__(16) {
    int offset;                   // Gate type's offset
    int a, b, c, d;               // Input values
    int Good_Circuit_threadID;    // Good circuit simulation thread ID
} threadData_Detect;
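A hedged sketch of the detection step at the last topological level, using the threadData_Detect record above: each faulty evaluation is compared with the good-circuit result produced by the thread whose ID is stored in the record. The array names are illustrative.

__global__ void detectKernel(const threadData_Detect *td, const int *goodValues,
                             const int *faultyValues, int *detect, int numThreads) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numThreads) {
        int good = goodValues[td[i].Good_Circuit_threadID];   // good circuit value for this gate/vector
        detect[i] = (faultyValues[i] != good);                // fault detected if the values differ
    }
}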

Approach - Recap The CPU schedules the good and faulty gate evaluations. Different threads perform, in parallel (for 2 vectors of a gate): gate evaluation (logic simulation) for good or faulty vectors, fault injection, and fault detection (for gates at the last topological level only). We maximize GPU performance by ensuring that no data dependency exists between threads issued in parallel, and that the same instructions are executed by all threads, but on different data; this conforms to the SIMD architecture of GPUs.

Maximizing Performance We adapt to specific G80 memory constraints. The LUT is stored in texture memory; the key advantages are: texture memory is cached, the total LUT size easily fits into the available cache size of 8 KB/MP, there are no memory coalescing requirements, and efficient built-in texture fetching routines are available in CUDA. A non-zero time is taken to load the texture memory, but this cost is easily amortized. Global memory writes for level-i gates (and reads for level-(i+1) gates) are performed in a coalesced fashion (see the sketch below).
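A hedged sketch of the coalesced level-by-level storage (layout and names are illustrative, not the authors' code): gates of one topological level occupy a contiguous slice of the global value array, so consecutive threads write consecutive addresses.

__global__ void writeLevelOutputs(int *values, int levelBase, const int *levelResults, int levelSize) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;      // gate index within the current level
    if (g < levelSize)
        values[levelBase + g] = levelResults[g];        // thread k writes address levelBase + k (coalesced)
}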

Outline Introduction Technical Specifications of the GPU CUDA Programming Model Approach Experimental Setup and Results Conclusions

Experimental Setup FS runtimes on the 8800 GTX are compared to a commercial fault simulator for 30 IWLS and ITC benchmarks. 32 K patterns were simulated for all 30 circuits. CPU times were obtained on a 1.5 GHz UltraSPARC-IV+ processor with 1.5 GB RAM running Solaris 9. OUR time includes: data transfer time between the GPU and CPU in both directions (CPU → GPU: 32 K patterns and LUT data; GPU → CPU: 32 K good circuit evaluations for all gates and the Detect array), processing time on the GPU, and the time spent by the CPU to issue good/faulty gate evaluation calls.

Results [Table: per-circuit results with columns Circuit, #Gates, #Faults, OURS (s), COMM. (s), Speed Up, for the s-series and b-series benchmark circuits, ending with an Avg (30 Ckts.) row; the individual values were not preserved in this transcript.] Computation results have been verified. On average, over 30 benchmarks, ~35X speedup is obtained.

Results (1U Tesla Server) [Table: projected per-circuit results with columns Circuit, #Gates, #Faults, PROJ. (s), COMM. (s), Speed Up, for the same benchmark circuits, ending with an Avg (30 Ckts.) row; the individual values were not preserved in this transcript.] The NVIDIA Tesla 1U Server can house up to 8 GPUs. Runtimes are obtained by scaling the GPU processing times only; transfer times and CPU processing times are included without scaling. On average, ~240X speedup is obtained.

Outline Introduction Technical Specifications of the GPU CUDA Programming Model Approach Experimental Setup and Results Conclusions

Conclusions We have accelerated FS using GPUs, implementing a pattern- and fault-parallel technique. By careful engineering, we maximally harness the GPU's raw computational power and huge memory bandwidth. When using a single 8800 GTX GPU, ~35X speedup is obtained compared to the commercial FS engine. When projected for a 1U NVIDIA Tesla Server, ~238X speedup is possible over the commercial engine. Future work includes exploring parallel fault simulation on the GPU.

Thank You