
1 Efficient Parallel CKY Parsing on GPUs Youngmin Yi (University of Seoul) Chao-Yue Lai (UC Berkeley) Slav Petrov (Google Research) Kurt Keutzer (UC Berkeley)

2 CKY Parsing Find the most likely parse tree for a given sentence Parse trees can be used in many NLP applications –Machine translation –Question answering –Information extraction Dynamic Programming in O(|G|·n³) –n is the number of words in a sentence –|G| is the size of the grammar [Figure: CKY chart for the sentence "I love you .", with cells indexed by spans (0,0) through (3,3)]
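A minimal serial sketch of this dynamic program (binary rules only, Viterbi max-product) to make the O(|G|·n³) structure concrete. The array names mirror the slides (parent[], lchild[], rchild[], scores); the constants and exact layout are illustrative, not the paper's code.

    #define MAX_WORDS   64      /* hypothetical limits for this sketch */
    #define NUM_SYMBOLS 4096

    void ckyViterbi(int n, int numRules,
                    const int* parent, const int* lchild, const int* rchild,
                    const float* ruleProb,
                    float score[][MAX_WORDS + 1][NUM_SYMBOLS]) {
        for (int len = 2; len <= n; len++) {                  // span length (must stay serial)
            for (int start = 0; start + len <= n; start++) {  // span start
                int stop = start + len;
                for (int r = 0; r < numRules; r++) {          // O(|G|) rules A -> B C
                    int A = parent[r], B = lchild[r], C = rchild[r];
                    for (int split = start + 1; split < stop; split++) {
                        float p = ruleProb[r] * score[start][split][B]
                                              * score[split][stop][C];
                        if (p > score[start][stop][A])
                            score[start][stop][A] = p;        // keep the best score for A
                    }
                }
            }
        }
    }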

3 Why Faster Parsers? O(|G|·n³) –n is on average about 20 –|G| is much larger: grammars with high accuracy have >1,000,000 rules We need faster parsers for real-time NL processing with high accuracy!

4 GPUs Manycore era –Due to the “Power Wall”, CPU clock frequencies are unlikely to keep increasing –Instead, the number of processing cores will continue to increase GPU (Graphics Processing Unit) –A currently available manycore architecture –480 processing cores in the GTX480

5 Overall Structure Hierarchical parallel platform –Several Streaming Processors (SPs) grouped into a Streaming Multiprocessor (SM) …
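A small host-side sketch (standard CUDA runtime calls) for inspecting this hierarchy on the installed device:

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);          // properties of device 0
        printf("SMs: %d, warp size: %d, max threads/block: %d\n",
               prop.multiProcessorCount, prop.warpSize, prop.maxThreadsPerBlock);
        return 0;
    }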

6 Memory Types Different types of memory –Global memory: large, off-chip, accessible by all threads, high latency –Shared memory: small, on-chip, private to a thread block, low latency –Texture and constant memory: read-only, cached
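A brief sketch of how each memory type is declared and used in CUDA C (illustrative names; the texture must be bound on the host with cudaBindTexture, and the kernel is assumed to be launched with 256 threads per block):

    __constant__ float cRuleProb[1024];                     // constant memory: small, cached, read-only

    texture<float, 1, cudaReadModeElementType> texScores;   // texture memory: cached, read-only

    __global__ void memoryDemo(const float* gIn, float* gOut, int n) {
        __shared__ float sBuf[256];                         // shared memory: on-chip, per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        sBuf[threadIdx.x] = (i < n) ? gIn[i] : 0.0f;        // gIn/gOut live in global memory
        __syncthreads();
        if (i < n)
            gOut[i] = sBuf[threadIdx.x] * cRuleProb[0] + tex1Dfetch(texScores, i);
    }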

7 CUDA CUDA (Compute Unified Device Architecture) –Parallel programming framework for GPUs Programming model, language, compilers, APIs –Allows general purpose computing on GPUs

8 Thread and Thread Block in CUDA Thread blocks (Blocks) –Independent execution units Threads –Maximum threads per block: 512 or 1024 Warps –Group of threads executed together: 32 Kernel –Configured as #blocks, #threads

9 Programming Model in CUDA Fork-join programming model, host+device program –Serial or modestly parallel parts in host C code –Highly parallel parts in device kernel C code Serial code (host) ... Parallel code in kernel (device): KernelA<<< nBlk, nTid >>>(args); Serial code (host) ... Parallel code = kernel (device): KernelB<<< nBlk, nTid >>>(args); © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009 ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
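A self-contained sketch of this fork-join pattern (hypothetical kernel; error checking omitted):

    #include <cuda_runtime.h>

    __global__ void scaleKernel(float* data, float factor, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;      // global thread index
        if (i < n) data[i] *= factor;                       // highly parallel device part
    }

    int main(void) {
        const int n = 1 << 20;
        float* d_data;
        cudaMalloc(&d_data, n * sizeof(float));             // serial host code
        cudaMemset(d_data, 0, n * sizeof(float));

        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        scaleKernel<<<blocks, threads>>>(d_data, 2.0f, n);  // fork: launch parallel kernel
        cudaDeviceSynchronize();                            // join: back to serial host code

        cudaFree(d_data);
        return 0;
    }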

10 SIMT model in CUDA __global__ void Kernel1(..) { if( threadIdx.x < a)... else... } SIMT (Single Instruction Multiple Thread) –Not SIMD (Single Instruction Multiple Data) because… Threads can actually execute different locations of the program –Not SPMD (Single Program Multiple Data) because… Threads with different execution paths cannot execute in parallel __global__ void Kernel2(..) { int tx = threadIdx.x; for(i=0; i<LoopCount[tx]; i++)... }

11 Parallelisms in CKY Parsing Dynamic Programming –Iterations must be executed serially But, in each iteration –About a million rules (with thousands of symbols) need to be evaluated for each span [Figure: CKY chart with spans (0,0) through (3,3); within one iteration, unary and binary rule relaxations can be parallelized over rules and over spans]

12 Thread-Mapping Map a symbol to a thread? –Not good for load balancing –Remember SIMT! Map a rule to a thread? –850K rules → good concurrency –Thread blocks are just groups of the same # of threads …
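A sketch of the rule-to-thread mapping for one span (illustrative names and array layout, not the paper's exact code; combining the per-parent maxima is the synchronization problem addressed on the following slides):

    // Flat index into a hypothetical scores[start][stop][symbol] array.
    __device__ __forceinline__ int cell(int start, int stop, int sym, int n, int numSym) {
        return (start * (n + 1) + stop) * numSym + sym;
    }

    // One thread per binary rule B,C -> each thread scans the split points of span (start, stop).
    __global__ void binaryRelaxThreadMap(const int* lchild, const int* rchild,
                                         const float* ruleProb, int numRules,
                                         const float* score, float* ruleBest,
                                         int start, int stop, int n, int numSym) {
        int r = blockIdx.x * blockDim.x + threadIdx.x;     // one rule per thread
        if (r >= numRules) return;
        int B = lchild[r], C = rchild[r];
        float best = 0.0f;
        for (int split = start + 1; split < stop; split++) {
            float p = ruleProb[r] * score[cell(start, split, B, n, numSym)]
                                  * score[cell(split, stop, C, n, numSym)];
            best = fmaxf(best, p);
        }
        ruleBest[r] = best;   // per-parent maxima are combined later (slides 16-18)
    }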

13 Block-Mapping Map each symbol to a thread block –and map the rules to threads in the thread block that corresponds to the parent symbol –(+) All the threads in the same thread block have the same parent –(-) What if the #rules of a symbol exceeds the #thread limit? …

14 Block-Mapping [Figure: a symbol i whose rules exceed the per-block thread limit (e.g., rules 0 … 1023 and beyond) is split into virtual symbols j and j+1, each handled by its own thread block]
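A sketch of block-mapping with virtual symbols, reusing the cell() index helper from the thread-mapping sketch above. The per-block rule ranges ruleStart[]/ruleEnd[] are hypothetical and would be computed from the grammar offline.

    // Thread block b handles (virtual) symbol b; its rules occupy the contiguous
    // range [ruleStart[b], ruleEnd[b]), and thread t handles rule ruleStart[b] + t.
    // Symbols with more rules than the block size are split into virtual symbols offline.
    __global__ void binaryRelaxBlockMap(const int* ruleStart, const int* ruleEnd,
                                        const int* lchild, const int* rchild,
                                        const float* ruleProb, const float* score,
                                        float* ruleBest, int start, int stop,
                                        int n, int numSym) {
        int r = ruleStart[blockIdx.x] + threadIdx.x;
        if (r >= ruleEnd[blockIdx.x]) return;      // trailing threads of a virtual symbol idle
        int B = lchild[r], C = rchild[r];
        float best = 0.0f;
        for (int split = start + 1; split < stop; split++) {
            float p = ruleProb[r] * score[cell(start, split, B, n, numSym)]
                                  * score[cell(split, stop, C, n, numSym)];
            best = fmaxf(best, p);
        }
        ruleBest[r] = best;   // all threads in this block share the same parent symbol
    }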

15 Span-Mapping It is easy to add another, orthogonal level of parallelism –Simply add another dimension to the grid of thread blocks [Figure: 2D grid layout: blockIdx.x selects the (virtual) symbol (sym0, sym1, …), blockIdx.y selects the span index (0 … n-len+1)]
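A host-side launch sketch of this 2D grid for one span length len (hypothetical variable names):

    int numSpans = n - len + 1;                  // all spans of length len in parallel
    dim3 grid(numVirtualSymbols, numSpans);      // grid.x: (virtual) symbol, grid.y: span
    dim3 block(threadsPerBlock);
    // launch: binaryRelaxBlockMap<<<grid, block>>>( ... per-iteration arguments ... );
    // inside the kernel, each block recovers its span as:
    //   int start = blockIdx.y;  int stop = start + len;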

16 Synchronization The many threads that share the same parent symbol must update that symbol's score correctly, so that the final reduced value is the maximum over all contributing rules

17 Atomic Operations atomicMax(&max, value); –CUDA API –Much more efficient on shared memory than on global memory [Figure: cost of atomic operations on shared memory vs. global memory]
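A sketch contrasting the two: each block combines its rules' scores with fast shared-memory atomics and issues only one global-memory atomic per block. Integer scores are used here for simplicity because the built-in atomicMax operates on integers; the actual parser scores are floating point.

    #include <limits.h>

    __global__ void maxWithAtomics(const int* ruleScore, int numRules, int* globalMax) {
        __shared__ int sMax;                      // one shared cell per block
        if (threadIdx.x == 0) sMax = INT_MIN;
        __syncthreads();

        int r = blockIdx.x * blockDim.x + threadIdx.x;
        if (r < numRules)
            atomicMax(&sMax, ruleScore[r]);       // shared-memory atomic: cheap
        __syncthreads();

        if (threadIdx.x == 0)
            atomicMax(globalMax, sMax);           // global-memory atomic: one per block
    }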

18 Parallel Reduction After log₂(N) steps (N is the #threads in a block), the reduced value is obtained –All the threads work for the same symbol –An option only for block-mapping –Threads synchronize between steps with __syncthreads()
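A standard tree-reduction sketch (illustrative names; assumes blockDim.x is a power of two and the kernel is launched with blockDim.x * sizeof(float) bytes of dynamic shared memory):

    // Block-wide max reduction in shared memory: log2(blockDim.x) steps,
    // with __syncthreads() between steps. All threads here share one parent symbol.
    __global__ void maxWithReduction(const float* ruleScore, int numRules, float* blockMax) {
        extern __shared__ float sdata[];                     // blockDim.x floats
        int tid = threadIdx.x;
        int r = blockIdx.x * blockDim.x + tid;
        sdata[tid] = (r < numRules) ? ruleScore[r] : 0.0f;   // 0 is the identity for max of probabilities
        __syncthreads();

        for (int s = blockDim.x / 2; s > 0; s >>= 1) {       // halve the active threads each step
            if (tid < s)
                sdata[tid] = fmaxf(sdata[tid], sdata[tid + s]);
            __syncthreads();
        }
        if (tid == 0)
            blockMax[blockIdx.x] = sdata[0];                 // reduced maximum for this block
    }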

19 Reducing Global Memory Using Texture Memory Grammar information –parent[], lchild[], rchild[] –Read-only throughout the whole program Scores updated in the previous iterations of dynamic programming –scores[][][] –Read-only Locate such read-only data in texture memory! But, in the case of scores[][][], we also need to place the scores newly updated in the current iteration into texture memory –Placing an array in texture memory = cudaBindTexture() –The execution time of this API is proportional to the array size –(-) scores[start][stop][S] is a huge array… A binary rule S_j → S_r S_s reads scores[w_p][w_d][S_r] and scores[w_d+1][w_q][S_s]
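A sketch of the legacy texture reference API this refers to (the CUDA API of that era; it has since been deprecated in favor of texture objects). Names are illustrative.

    // Read-only grammar arrays and previously computed scores bound to texture memory.
    texture<int,   1, cudaReadModeElementType> texParent;   // parent[]
    texture<float, 1, cudaReadModeElementType> texScores;   // scores from earlier iterations

    void bindReadOnlyData(const int* d_parent, int numRules,
                          const float* d_scores, size_t scoreBytes) {
        cudaBindTexture(0, texParent, d_parent, numRules * sizeof(int));
        cudaBindTexture(0, texScores, d_scores, scoreBytes);   // cost grows with scoreBytes
    }

    // In a kernel, reads then go through the texture cache:
    //   int A   = tex1Dfetch(texParent, r);
    //   float p = tex1Dfetch(texScores, cell(start, split, B, n, numSym));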

20 Reducing Global Memory Using Texture Memory (Cont’d) Change the layout –scores[start][stop][S] → scores[len][start][S] –We only need to update the part of scores[][][] where len = the current iteration [Figure: CKY chart for "I love you ." grouped by span length, len = 1 … 4; each iteration writes only one len-slice]
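A small sketch of the indexing implied by the new layout. The flat layout shown here (each len-slice padded to n starts) is an assumption for illustration, not necessarily the paper's exact layout; the point is that the cells written at the current span length become one contiguous region.

    // Flat index for scores[len][start][symbol].
    __host__ __device__ size_t scoreIdx(int len, int start, int sym, int n, int numSym) {
        return ((size_t)(len - 1) * n + start) * numSym + sym;
    }

    // All cells written at the current span length `len` are now contiguous:
    //   first element: scoreIdx(len, 0, 0, n, numSym)
    //   element count: (size_t)n * numSym
    // so only this region, not the whole scores array, has to be handled when
    // refreshing the texture binding each iteration.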

21 Experimental Results GTX285 –No cache memory supported –Low memory bandwidth
Speedup by configuration:
  thread-atom        6.4
  block-atom         8.1
  block-pr          10.1
  block-atom-SS     11.1
  block-pr-SS       14.2
  block-atom-SS-tex 11.9
  block-pr-SS-tex   17.4

22 Experimental Results GTX480 –Cache memory supported –Higher memory bandwidth
Speedup by configuration:
  thread-atom       13.2
  block-atom        14.1
  block-pr          25.8
  block-atom-SS     15.2
  block-pr-SS       23.4
  block-atom-SS-tex 13.9
  block-pr-SS-tex   22.2

23 Conclusions We explored the design space for parallelizing CKY parsing on a GPU –Different mappings and synchronization methods –Utilizing different types of memory We compared each version on two GPUs –26X on GTX480, 17X on GTX285 We expect scalable performance gains as the number of processing cores increases in future GPUs

