
1 Efficient Parallel CKY Parsing on GPUs
Youngmin Yi (University of Seoul), Chao-Yue Lai (UC Berkeley), Slav Petrov (Google Research), Kurt Keutzer (UC Berkeley)
[Title-slide diagram: parallel applications bridging parallel hardware and parallel software, connecting the IT industry (Silicon Valley) and users]

2 Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions

3 Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions

4 Why Faster Parsers?
- Parsing is the backbone of most NLP applications:
  - Machine translation
  - Question answering
  - Information extraction
- High-accuracy parsing takes time. What if we want to parse the web?

5 Great Speedups: GPUs
- GPUs are manycore processors: hundreds of processing cores, massive parallelism
- They now allow general-purpose computing:
  - Computer vision (130x speedup): Catanzaro, B. et al. 2009. Efficient, high-quality image contour detection. In ICCV '09.
  - Speech recognition (10.5x speedup): Chong, J. et al. 2009. Scalable HMM based inference engine in large vocabulary continuous speech recognition. In ICME '09.
- We want to bring GPUs to the NLP community

6 CKY Parsing
- Constituency parsing with a weighted CFG
- Dynamic programming iteratively builds parse trees with larger spans from smaller spans
- Runs in O(|G|·n³)
  - n: number of words in a sentence (20 on average)
  - |G|: grammar constant, proportional to the number of rules; high-accuracy grammars have about 1,000,000 rules, so |G| affects speed more than n
[Figure: CKY chart for the sentence "I love you .", with cells (0,0) through (3,3) holding spans such as "I", "love you", and "I love you ."]
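For reference, the dynamic program behind CKY parsing can be written as the textbook Viterbi recurrence below (a standard formulation, not copied from the talk), where s(i, j, A) is the best score of symbol A over words i..j and w(A → B C) is the rule weight:

```latex
\[
  s(i, j, A) \;=\; \max_{\substack{A \rightarrow B\,C \,\in\, G \\ i < k < j}}
    \; w(A \rightarrow B\,C) \cdot s(i, k, B) \cdot s(k, j, C)
\]
% Maximizing over all |G| binary rules and O(n) split points for each of the
% O(n^2) chart cells gives the O(|G| \cdot n^3) running time quoted above.
```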

7 Outline
- Motivation
- CUDA Programming Model
  - Computational Model
  - Memory Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions

8 CUDA Computational Model
- Two levels of hierarchy: thread blocks and threads
- Thread blocks (blocks)
  - Independent execution units
  - Maximum threads per block: 512 (GTX285) or 1,024 (GTX480)
- Threads in a block
  - Not independent; they work best when used like lanes of a vector unit
  - Communicate via "shared memory"
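A minimal, self-contained CUDA sketch (illustrative only, not from the talk) of this two-level hierarchy: independent blocks, and threads within a block cooperating through shared memory and barriers.

```cuda
#include <cstdio>

__global__ void hierarchyDemo() {
    __shared__ int threadsSeen;              // visible only within this block
    if (threadIdx.x == 0) threadsSeen = 0;
    __syncthreads();                         // barrier across all threads of the block
    atomicAdd(&threadsSeen, 1);              // threads in a block cooperate
    __syncthreads();
    if (threadIdx.x == 0)                    // device printf needs a Fermi-class or newer GPU
        printf("block %d ran %d threads\n", blockIdx.x, threadsSeen);
}

int main() {
    hierarchyDemo<<<4, 256>>>();             // grid of 4 independent blocks, 256 threads each
    cudaDeviceSynchronize();
    return 0;
}
```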

9 CUDA Memory Model
- Global memory: off-chip, slow but large
- Shared memory: on-chip, fast but small; shared among the threads of a thread block
- Texture memory: fast memory written from the CPU; works best with read-only data
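A compact sketch (hypothetical names, not the parser's code) showing where each memory space appears in a kernel; here the read-only grammar data is passed as a plain const pointer, whereas the talk binds such data to texture memory.

```cuda
__global__ void memorySpacesDemo(const float *ruleScores,  // read-only grammar data (texture memory in the talk)
                                 float *cellScores,        // chart scores in global memory
                                 int n)
{
    __shared__ float tile[256];                            // on-chip, per-block scratch space
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = ruleScores[i];                 // stage frequently used data on chip
        cellScores[i] += tile[threadIdx.x];                // write the result back to global memory
    }
}
```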

10 CUDA Programming Principles
- Map computations onto blocks and threads; balancing the load among the threads in a block saves time
- Use the different types of memory efficiently; reduce global memory accesses

11 Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
  - Mapping: Thread-Based vs. Block-Based; Sequential Spans vs. Parallel Spans
  - Atomic Operations vs. Parallel Reduction
  - Reducing Global Memory Accesses
- Experimental Results
- Conclusions

12 Parallelism in CKY Parsing
- The bottleneck is binary relaxation
- Parallelism is available over spans, symbols, and rules
[Figure: the CKY chart (spans), the symbols S1, S2, ..., S100, and each symbol's binary rules, such as S1 → S1 S2, S1 → S2 S30, S2 → S2 S4, S100 → S2 S26]
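For concreteness, a simplified serial sketch of the binary-relaxation bottleneck (assumed chart layout and rule struct, not the parser's actual code): every chart cell, split point, and binary rule is visited, and the best score per parent symbol is kept.

```cuda
#include <vector>
#include <algorithm>

struct Rule { int parent, left, right; float logProb; };   // binary rule: parent -> left right

// chart[i][j][S] holds the best log score of symbol S over words i..j-1.
void binaryRelaxSerial(std::vector<std::vector<std::vector<float>>> &chart,
                       const std::vector<Rule> &rules, int n)
{
    for (int length = 2; length <= n; ++length)        // span length (spans)
        for (int i = 0; i + length <= n; ++i) {        // span start
            int j = i + length;
            for (int k = i + 1; k < j; ++k)            // split point
                for (const Rule &r : rules)            // ~1,000,000 rules (symbols x rules)
                    chart[i][j][r.parent] = std::max(
                        chart[i][j][r.parent],
                        r.logProb + chart[i][k][r.left] + chart[k][j][r.right]);
        }
}
```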

13 Mapping
- A symbol → a thread?
  - Load imbalance: symbols differ widely in how many rules they have
[Figure: symbols S1, S2, ..., S100, each with its own set of binary rules]

14 Thread-Based Mapping
- A rule → a thread; the symbol dimension is flattened out
- (+) About 850,000 rules provide abundant parallelism
- (+) Load balanced
- (-) A block may handle rules with different parent symbols, which makes it harder to compute the maximum score for each symbol
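A minimal sketch of thread-based mapping under assumed data layouts (hypothetical names, not the authors' code): one thread scores exactly one rule; combining these per-rule scores into per-symbol maxima is handled separately (atomics or reductions, on the later slides).

```cuda
struct Rule { int parent, left, right; float logProb; };   // binary rule: parent -> left right

// One chart cell and one split point; each thread handles exactly one grammar rule.
__global__ void scoreRulesThreadMapped(const Rule *rules, int numRules,
                                       const float *leftScores,   // child scores over (i, k)
                                       const float *rightScores,  // child scores over (k, j)
                                       float *ruleScores)         // output: one score per rule
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;                // flattened rule index
    if (r >= numRules) return;
    Rule rule = rules[r];
    ruleScores[r] = rule.logProb + leftScores[rule.left] + rightScores[rule.right];
}
```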

15 Block-Based Mapping
- A symbol → a block; a rule → a thread
- (+) All the threads in a block share the same parent symbol
- (-) What if the number of rules for a symbol exceeds the thread-per-block limit?
  - Split such symbols into virtual symbols (see the sketch below)
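A hypothetical host-side sketch of the virtual-symbol split (assumed CSR-style rule layout grouped by parent, not the authors' code): any symbol with more rules than the per-block thread limit is broken into chunks, each handled by one block.

```cuda
#include <vector>
#include <algorithm>

// ruleOffsets[s] .. ruleOffsets[s+1] is the range of rules whose parent is symbol s.
struct VirtualSymbol { int symbol; int ruleBegin; int ruleEnd; };

std::vector<VirtualSymbol> makeVirtualSymbols(const std::vector<int> &ruleOffsets,
                                              int maxThreadsPerBlock)
{
    std::vector<VirtualSymbol> out;
    for (int s = 0; s + 1 < (int)ruleOffsets.size(); ++s)
        for (int begin = ruleOffsets[s]; begin < ruleOffsets[s + 1]; begin += maxThreadsPerBlock) {
            int end = std::min(begin + maxThreadsPerBlock, ruleOffsets[s + 1]);
            out.push_back({s, begin, end});        // one thread block per virtual symbol
        }
    return out;
}
```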

16 Sequential Spans
[Figure: the symbols and their rules as before; the chart cells are processed one after another, with only symbols and rules handled in parallel]

17 Parallel Spans
[Figure: the symbols and their rules as before; all chart cells of the same span length are processed in parallel, in addition to the parallelism over symbols and rules]
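A self-contained host-side sketch (hypothetical kernels and launch sizes, not the authors' code) contrasting the two strategies: with sequential spans each kernel launch handles one chart cell, while with parallel spans the grid's y dimension covers every cell of the current span length.

```cuda
__global__ void relaxOneCell(int i, int length) {
    // ... binary relaxation for chart cell (i, i + length) ...
}

__global__ void relaxAllCells(int length) {
    int i = blockIdx.y;   // one chart cell of this span length per grid row
    // ... binary relaxation for chart cell (i, i + length) ...
}

void sequentialSpans(int n, int blocksPerCell, int threadsPerBlock) {
    for (int length = 2; length <= n; ++length)
        for (int i = 0; i + length <= n; ++i)
            relaxOneCell<<<blocksPerCell, threadsPerBlock>>>(i, length);   // one launch per cell
}

void parallelSpans(int n, int blocksPerCell, int threadsPerBlock) {
    for (int length = 2; length <= n; ++length) {
        dim3 grid(blocksPerCell, n - length + 1);                          // all cells of this length at once
        relaxAllCells<<<grid, threadsPerBlock>>>(length);
    }
}
```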

18 Atomic Operations
- Multiple threads update the score of the same parent symbol
- The updates must be scheduled so that they do not happen simultaneously, to ensure correctness
- Atomic operations
  - Guarantee that a memory location is accessed by only one thread at any time
  - Serialize the operations when necessary
[Figure: rules S1 → S1 S2, S1 → S2 S30, and S1 → S44 S53 all updating scores[S1]; S2 → S2 S4 updating scores[S2]]
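A sketch of the atomic-update variant under the same assumed layout as the thread-mapping sketch above (not the authors' code). CUDA has no native atomicMax for float, so one common workaround, used here, is an atomicCAS loop; scores are assumed to be Viterbi log probabilities.

```cuda
struct Rule { int parent, left, right; float logProb; };

// Atomic max for float via compare-and-swap; retries until the stored value
// is at least as large as ours.
__device__ void atomicMaxFloat(float *addr, float value) {
    int *addrAsInt = (int *)addr;
    int old = *addrAsInt;
    while (value > __int_as_float(old)) {
        int assumed = old;
        old = atomicCAS(addrAsInt, assumed, __float_as_int(value));
        if (old == assumed) break;             // our value was stored successfully
    }
}

// One thread per rule for one chart cell and split point; threads sharing a
// parent symbol serialize their updates through the atomic.
__global__ void relaxThreadMappedAtomic(const Rule *rules, int numRules,
                                        const float *leftScores,
                                        const float *rightScores,
                                        float *parentScores)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= numRules) return;
    Rule rule = rules[r];
    float score = rule.logProb + leftScores[rule.left] + rightScores[rule.right];
    atomicMaxFloat(&parentScores[rule.parent], score);
}
```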

19 Parallel Reduction
- Binary-tree reduction: an efficient O(log N) runtime
- Requires all the threads in a block to have the same parent symbol, so it is an option only for block-based mapping
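A sketch of block-based mapping with an in-block binary-tree max reduction (assumed rule layout grouped by parent symbol; the virtual-symbol split is ignored for brevity, so each symbol is assumed to have at most BLOCK_SIZE rules). Not the authors' code.

```cuda
#include <math.h>

#define BLOCK_SIZE 256   // threads per block; a power of two

struct Rule { int parent, left, right; float logProb; };

// One block per parent symbol, one thread per rule of that symbol.
// ruleOffsets[s] .. ruleOffsets[s+1] is the rule range of symbol s.
__global__ void relaxBlockMappedReduce(const Rule *rulesByParent, const int *ruleOffsets,
                                       const float *leftScores, const float *rightScores,
                                       float *parentScores)
{
    __shared__ float best[BLOCK_SIZE];

    int symbol = blockIdx.x;
    int r = ruleOffsets[symbol] + threadIdx.x;

    float score = -INFINITY;                       // identity element for max
    if (r < ruleOffsets[symbol + 1]) {
        Rule rule = rulesByParent[r];
        score = rule.logProb + leftScores[rule.left] + rightScores[rule.right];
    }
    best[threadIdx.x] = score;
    __syncthreads();

    // Binary-tree reduction: O(log BLOCK_SIZE) steps to the block-wide maximum.
    for (int stride = BLOCK_SIZE / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            best[threadIdx.x] = fmaxf(best[threadIdx.x], best[threadIdx.x + stride]);
        __syncthreads();
    }

    if (threadIdx.x == 0)                          // thread 0 writes the symbol's result
        parentScores[symbol] = fmaxf(parentScores[symbol], best[0]);
}
```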

20 Reducing Global Memory Accesses
- Shared memory for frequently accessed data: the scores of parent symbols
- Texture memory for read-only data: grammar information such as rule scores, and the scores of symbols with smaller spans
- Changing the layout of the scores minimizes the overhead of copying data to texture memory
[Figure: the CKY chart for "I love you .", with cells grouped by span length (span = 1, 2, 3, 4)]
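One way such a layout change might look (a hypothetical indexing scheme, not the parser's actual layout): ordering scores by span length first keeps all cells of a given span length contiguous, so they can be copied to texture memory in a single transfer.

```cuda
// Linear index of chart scores ordered as [span length][cell][symbol].
// cellsPerSpan is taken as the maximum number of cells per span length,
// trading some wasted space for a simple, contiguous-per-span layout.
__host__ __device__ inline size_t scoreIndex(int spanLength, int cell, int symbol,
                                             int cellsPerSpan, int numSymbols)
{
    return ((size_t)spanLength * cellsPerSpan + cell) * numSymbols + symbol;
}
```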

21 Outline
- Motivation
- CUDA Programming Model
- Parallel CKY Parsing on GPUs
- Experimental Results
- Conclusions

22 Setup
- Two GPU architectures:
  - NVIDIA GTX285 (Tesla)
  - NVIDIA GTX480 (Fermi)
  - The GTX480 surpasses the GTX285 in core count, cache support, and memory size
- Benchmark: 1,000 sentences from section 22 of the WSJ portion of the Penn Treebank
- Speedups are measured against a serial C implementation of the Berkeley Parser

23 GTX285 (Tesla)
- No cache memory
- Lower memory bandwidth
Speedups over the serial baseline (1.0x); PSpan = parallel spans, SSpan = sequential spans, reduce = parallel reduction, tex = texture memory:
- thread-atomic-PSpan: 6.4x
- block-atomic-PSpan: 8.1x
- block-atomic-SSpan: 11.1x
- block-atomic-SSpan-tex: 11.9x
- block-reduce-PSpan: 10.1x
- block-reduce-SSpan: 14.2x
- block-reduce-SSpan-tex: 17.4x

24 GTX480 (Fermi)
- Cache memory supported
- Higher memory bandwidth
Speedups over the serial baseline (1.0x); PSpan = parallel spans, SSpan = sequential spans, reduce = parallel reduction, tex = texture memory:
- thread-atomic-PSpan: 13.2x
- block-atomic-PSpan: 14.1x
- block-atomic-SSpan: 15.2x
- block-atomic-SSpan-tex: 13.9x
- block-reduce-PSpan: 25.8x
- block-reduce-SSpan: 23.4x
- block-reduce-SSpan-tex: 22.2x

25 Conclusions
- We explored the design space for parallelizing CKY parsing on GPUs:
  - Different mappings and synchronization methods
  - Utilizing the different types of memory
- We compared two GPU architectures: 26x speedup on the GTX480, 17x on the GTX285
- We expect the performance gain to scale as the number of processing cores in future GPUs increases


