Efficient Parallel CKY Parsing on GPUs
Youngmin Yi (University of Seoul), Chao-Yue Lai (UC Berkeley), Slav Petrov (Google Research), Kurt Keutzer (UC Berkeley)

Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions

Why Faster Parsers?
- Parsing is the backbone of most NLP applications:
  - Machine translation
  - Question answering
  - Information extraction
- High-accuracy parsing takes time
- What if we want to parse the web?

Great Speedups: GPUs
- GPUs: manycore
  - Hundreds of processing cores, massive parallelism
  - Allow general-purpose computing
- Computer vision (130x speedup): Catanzaro, B. et al. Efficient, high-quality image contour detection. In ICCV '09.
- Speech recognition (10.5x speedup): Chong, J. et al. Scalable HMM based inference engine in large vocabulary continuous speech recognition. In ICME '09.
- We want to bring GPUs to the NLP community

CKY Parsing
(Figure: CKY chart for the sentence "I love you .", with one cell per span, indexed (0,0) through (3,3))
- Constituency parsing with a weighted CFG
- Uses dynamic programming to iteratively build parse trees with larger spans from smaller spans
- Runs in O(|G|·n³)
  - n: number of words in a sentence, 20 on average
  - |G|: grammar constant, proportional to the number of rules. High-accuracy grammars have 1,000,000 rules! |G| has more impact on speed than n
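To make the dynamic program concrete, the following is a minimal sequential sketch of the binary relaxation loops in C++ host code; the names Rule, score, and cky_binary are illustrative and do not come from the Berkeley Parser implementation. Scores are assumed to be Viterbi log probabilities, so rules combine by addition and alternatives by max.

#include <vector>

struct Rule { int parent, left, right; float score; };  // A -> B C (log prob)

// score[i][j][A]: best score of symbol A over the span of words i..j.
// The loops over spans, rules, and split points are the O(|G| n^3)
// bottleneck that the rest of the talk parallelizes on the GPU.
void cky_binary(const std::vector<Rule>& rules, float*** score, int n) {
  for (int len = 2; len <= n; ++len) {                 // span length
    for (int i = 0; i + len - 1 < n; ++i) {            // span start
      int j = i + len - 1;                             // span end
      for (const Rule& r : rules) {                    // all binary rules
        for (int k = i; k < j; ++k) {                  // split point
          float s = r.score + score[i][k][r.left] + score[k + 1][j][r.right];
          if (s > score[i][j][r.parent]) score[i][j][r.parent] = s;
        }
      }
    }
  }
}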

Outline
- Motivation
- CUDA Programming Model
  - Computational Model
  - Memory Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions

CUDA Computational Model
- Two levels of hierarchy: thread blocks and threads
- Thread blocks (blocks)
  - Independent execution units
  - Max. threads per block: 512 or 1024
- Threads in a block
  - Not independent; work best when used like vectorized units
  - Communicate via "shared memory"
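As a toy illustration of this two-level hierarchy, the hypothetical launch below creates a grid of blocks with a fixed number of threads each; the grid and block sizes and the empty kernel are placeholders, not the configuration used in the paper.

#include <cuda_runtime.h>

// A grid of independent thread blocks; threads within a block cooperate.
__global__ void relax_rules() {
  int symbolId = blockIdx.x;     // e.g., one block per grammar symbol
  int ruleId   = threadIdx.x;    // e.g., one thread per rule of that symbol
  (void)symbolId; (void)ruleId;  // per-rule work would go here
}

int main() {
  dim3 grid(1024);   // number of thread blocks (hypothetical)
  dim3 block(512);   // threads per block (max 512 or 1024, per the slide)
  relax_rules<<<grid, block>>>();
  cudaDeviceSynchronize();
  return 0;
}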

CUDA Memory Model
- Global memory
  - Off-chip, slow but large
- Shared memory
  - On-chip, fast but small
  - Shared among threads in a thread block
- Texture memory
  - Fast memory written from the CPU
  - Works best with read-only data
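The toy kernel below touches all three memory spaces named above: a global-memory argument, a __shared__ staging buffer visible to the block, and a read-only value fetched through a texture object. It assumes a block size of at most 256 threads and is purely illustrative.

__global__ void memory_spaces_demo(const float* g_in,        // global memory
                                   cudaTextureObject_t tex,  // texture (read-only)
                                   float* g_out, int n) {
  __shared__ float s_buf[256];                    // shared memory, per block
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  s_buf[threadIdx.x] = (i < n) ? g_in[i] : 0.0f;  // global -> shared
  __syncthreads();                                // every thread reaches this
  if (i < n)
    g_out[i] = s_buf[threadIdx.x] + tex1Dfetch<float>(tex, i);  // texture fetch
}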

CUDA Programming Principles
- Mapping computations to blocks and threads
  - Load balancing among threads in a block saves time
- Efficient usage of the different types of memory
  - Reduce global memory accesses

Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
  - Mapping: Thread-Based vs. Block-Based
  - Sequential Spans vs. Parallel Spans
  - Atomic Operations vs. Parallel Reduction
  - Reducing Global Memory Accesses
- Experimental Results
- Conclusions

Parallelism in CKY Parsing
- Bottleneck: binary relaxation
- Parallelism in spans, symbols, and rules
(Figure: the CKY chart annotated with the three sources of parallelism: spans (chart cells), symbols (S1, S2, ..., S100), and the binary rules under each symbol, e.g. S1 → S1 S2, S1 → S2 S30, ..., S2 → S2 S4, ...)

Mapping
- A symbol → a thread?
  - Load imbalance: different symbols have very different numbers of rules
(Figure: symbols S1, S2, ..., S100, each with its own list of binary rules, e.g. S1 → S1 S2, S1 → S2 S30, S1 → S44 S53, ..., S100 → S2 S26, S100 → S22 S3)

Thread-Based Mapping
- A rule → a thread
  - Flattens out the symbol dimension
- (+) 850k rules: great parallelism
- (+) Load balanced
- (-) A block may handle rules with different parent symbols
  - Harder to get maximum scores for symbols
(Figure: the flat list of rules, S1 → S1 S2, S1 → S2 S30, ..., S2 → S2 S4, ..., assigned to threads regardless of parent symbol)
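A sketch of the thread-based mapping for one chart cell, assuming a flat Rule array and row-major score arrays for the child cells (all names are hypothetical). Each thread computes the best score of its own rule; turning those per-rule maxima into per-symbol maxima is exactly the difficulty noted above, addressed later with atomic operations or parallel reduction.

#include <math.h>

struct Rule { int parent, left, right; float score; };  // A -> B C (log prob)

// One thread per rule. leftScores/rightScores hold, for each split point of
// the current span, the scores of all symbols in the left/right child cell.
__global__ void relax_thread_based(const Rule* rules, int numRules,
                                   const float* leftScores,
                                   const float* rightScores,
                                   float* ruleBest,
                                   int numSplits, int numSymbols) {
  int r = blockIdx.x * blockDim.x + threadIdx.x;
  if (r >= numRules) return;
  Rule rule = rules[r];
  float best = -INFINITY;
  for (int k = 0; k < numSplits; ++k) {
    float s = rule.score + leftScores[k * numSymbols + rule.left]
                         + rightScores[k * numSymbols + rule.right];
    best = fmaxf(best, s);
  }
  ruleBest[r] = best;  // per-rule result; per-symbol max still to be computed
}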

Block-Based Mapping
- A symbol → a block; a rule → a thread
- (+) All the threads in the same block have the same parent
- (-) What if the number of rules of a symbol exceeds the limit on threads per block?
  - Split such symbols into virtual symbols
(Figure: each symbol S1, S2, ..., S100 mapped to its own block, with that symbol's rules mapped to the block's threads)
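A sketch of the block-based mapping, assuming the rules are grouped by (virtual) parent symbol and delimited by a CSR-style ruleStart offset array (a hypothetical layout). Because every thread in the block shares one parent, the block can combine its results locally; the serial max by thread 0 below is only for brevity, and the parallel-reduction slide shows the O(log N) alternative.

#include <math.h>

struct Rule { int parent, left, right; float score; };  // A -> B C (log prob)

// Block b handles the rules of (virtual) symbol b, one rule per thread.
// ruleStart[b]..ruleStart[b+1] delimit symbol b's rules in the grouped array.
__global__ void relax_block_based(const Rule* rules, const int* ruleStart,
                                  const float* leftScores,
                                  const float* rightScores,
                                  float* parentScores,
                                  int numSplits, int numSymbols) {
  __shared__ float s_best[1024];            // assumes blockDim.x <= 1024
  int sym = blockIdx.x;
  int r = ruleStart[sym] + threadIdx.x;
  float best = -INFINITY;
  if (r < ruleStart[sym + 1]) {
    Rule rule = rules[r];
    for (int k = 0; k < numSplits; ++k) {
      float s = rule.score + leftScores[k * numSymbols + rule.left]
                           + rightScores[k * numSymbols + rule.right];
      best = fmaxf(best, s);
    }
  }
  s_best[threadIdx.x] = best;
  __syncthreads();
  if (threadIdx.x == 0) {                   // serial max, for brevity only
    float m = -INFINITY;
    for (int t = 0; t < blockDim.x; ++t) m = fmaxf(m, s_best[t]);
    parentScores[sym] = fmaxf(parentScores[sym], m);
  }
}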

Sequential Spans
(Figure: the symbol/rule-to-block/thread mapping applied to one chart cell at a time; the cells (spans) of the current length are processed sequentially)

Parallel Spans
(Figure: the same symbol/rule mapping applied to all cells of the current span length at once, so the spans themselves are processed in parallel)

Atomic Operations
- Multiple threads update the scores of the same parent symbol
  - The updates must be scheduled so they don't happen simultaneously, to ensure correctness
- Atomic operations
  - Guarantee that a memory location is accessed by only one thread at any time
  - Serialize operations if necessary
(Figure: rules S1 → S1 S2, S1 → S2 S30, S1 → S44 S53 all updating scores[S1], and S2 → S2 S4 updating scores[S2])
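CUDA's built-in atomicMax works on integers, so a max update on floating-point scores is commonly emulated with atomicCAS; the helper below is that standard pattern, offered as a sketch rather than the authors' actual implementation.

// Atomic max for float: retry with atomicCAS until our value is no longer
// larger than the value currently stored. Returns the previous value.
__device__ float atomicMaxFloat(float* addr, float value) {
  int* addr_as_int = (int*)addr;
  int old = *addr_as_int;
  while (value > __int_as_float(old)) {
    int assumed = old;
    old = atomicCAS(addr_as_int, assumed, __float_as_int(value));
  }
  return __int_as_float(old);
}

// Inside a relaxation kernel, each thread would fold its rule's best score
// into the shared per-symbol score, e.g.:
//   atomicMaxFloat(&scores[ruleParent], best);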

Parallel Reduction
- Binary tree reduction
  - An efficient O(log N) runtime
- All the threads in a block must have the same parent symbol
  - An option only for block-based mapping
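The block-wide maximum can be computed with the textbook shared-memory tree reduction below, assuming a power-of-two block size and a shared array s_best holding one candidate score per thread; this is the standard pattern, not necessarily the exact variant used in the paper.

// Binary-tree max reduction within one block, O(log N) steps.
// s_best[] holds one value per thread (e.g., each rule-thread's best score
// for the block's shared parent symbol); assumes blockDim.x is a power of two.
__device__ void blockMaxReduce(float* s_best) {
  for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
    if (threadIdx.x < stride)
      s_best[threadIdx.x] = fmaxf(s_best[threadIdx.x],
                                  s_best[threadIdx.x + stride]);
    __syncthreads();
  }
  // s_best[0] now holds the block-wide maximum; thread 0 can write it to the
  // parent symbol's score in global memory.
}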

Reducing Global Memory Accesses
- Shared memory: frequently accessed data
  - Scores of parent symbols
- Texture memory: read-only data
  - Grammar information such as rule scores
  - Scores of symbols with smaller spans
- Changing the layout of scores
  - Minimize the overhead of copying data to texture memory
(Figure: the chart scores laid out contiguously by span length (span = 1, 2, 3, 4), so the completed spans can be copied to texture memory in one piece)
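For completeness, here is one way read-only data such as rule scores could be routed through the texture path. The sketch uses the modern texture-object API; the original work on GTX285/GTX480-era CUDA would have used texture references, so treat this only as an illustration of the idea.

#include <cuda_runtime.h>

// Expose a device array of read-only rule scores through the texture cache;
// kernels then read it with tex1Dfetch<float>(tex, i).
cudaTextureObject_t makeRuleScoreTexture(const float* d_ruleScores,
                                         size_t numRules) {
  cudaResourceDesc resDesc = {};
  resDesc.resType = cudaResourceTypeLinear;
  resDesc.res.linear.devPtr = const_cast<float*>(d_ruleScores);
  resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
  resDesc.res.linear.sizeInBytes = numRules * sizeof(float);

  cudaTextureDesc texDesc = {};
  texDesc.readMode = cudaReadModeElementType;

  cudaTextureObject_t tex = 0;
  cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);
  return tex;
}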

Setup
- Two GPU architectures
  - NVIDIA GTX285 (Tesla)
  - NVIDIA GTX480 (Fermi)
  - GTX480 beats GTX285 in number of cores, cache support, and memory size
- Benchmark
  - 1000 sentences from section 22 of the WSJ portion of the Penn Treebank
- Speedups
  - Measured against a serial C implementation of the Berkeley Parser

GTX285 (Tesla)
- No cache memory supported
- Lower memory bandwidth
Speedups over the serial baseline (PSpan: parallel spans, SSpan: sequential spans, reduce: parallel reduction, tex: texture memory):
  serial                  1.0
  thread-atomic-PSpan     6.4
  block-atomic-PSpan      8.1
  block-atomic-SSpan     11.1
  block-atomic-SSpan-tex 11.9
  block-reduce-PSpan     10.1
  block-reduce-SSpan     14.2
  block-reduce-SSpan-tex 17.4

GTX480 (Fermi)
- Cache memory supported
- Higher memory bandwidth
Speedups over the serial baseline (PSpan: parallel spans, SSpan: sequential spans, reduce: parallel reduction, tex: texture memory):
  serial                  1.0
  thread-atomic-PSpan    13.2
  block-atomic-PSpan     14.1
  block-atomic-SSpan     15.2
  block-atomic-SSpan-tex 13.9
  block-reduce-PSpan     25.8
  block-reduce-SSpan     23.4
  block-reduce-SSpan-tex 22.2

Conclusions
- We explored the design space for parallelizing CKY parsing on GPUs
  - Different mappings and synchronization methods
  - Utilizing different types of memory
- We compared two GPU architectures
  - 26x on GTX480, 17x on GTX285
- We expect scalable performance gains as the number of processing cores increases in future GPUs