Massive Parallel LDPC Decoding on GPU
Gabriel Falcão, Leonel Sousa, Vitor Silva
Univ. of Coimbra and T. Univ. of Lisbon, Portugal
Salt Lake City, Feb 21st 2008, PPoPP08
MOTIVATION
LDPC decoding: intensive computation; irregular accesses to memory.
LDPC decoding using dedicated VLSI hardware: low area and low power consumption; high throughputs (Mbps) and low latency; fixed-point arithmetic.
LDPC decoding on GPUs: GPU processing horsepower available; CUDA programming interface; medium to high throughputs (Mbps); floating-point arithmetic.
A software-based, flexible solution!
OUTLINE
Motivation; LDPC codes; Bit Node processing (BN); Check Node processing (CN); GPUs; CUDA interface; Experimental results; Conclusions and future work.
LDPC CODES
Advantages: linear block codes; perform close to the Shannon capacity limit; high throughputs (Mbps); very low Bit Error Rate (BER).
Disadvantages: good performance implies large H matrices; computationally intensive operations; large amounts of hardware; dedicated VLSI solutions are expensive.
Bottom line: why not use the horsepower available on GPUs instead of developing expensive VLSI?
LDPC CODES
The parity-check matrix H defines the LDPC code.
The Tanner graph represents the connections between BNs and CNs.
[Figure: Tanner graph with check nodes (CN1, ...) and bit nodes (BN1, ...)]
LDPC DECODER
BNs and CNs exchange messages (i.e., probabilities), allowing a reliable decision on each bit value.
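CHECK NODE PROCESSING - CN
1. Calculates the message r_mn going from CN_m to BN_n, using the messages q_im, q_jm, q_km that CN_m receives from the other connected bit nodes BN_i, BN_j, BN_k.
[Figure: BN_i, BN_j, BN_k send q_im, q_jm, q_km to CN_m; CN_m sends r_mn to BN_n]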
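The update rule itself is not reproduced in this transcript. As a hedged sketch, the standard probability-domain SPA check-node update has the form below. The neighbour-set notation N(m), the iteration index (i), and the split into r_mn(0) and r_mn(1) are assumptions and may differ from the slide.

```latex
\begin{align*}
r_{mn}^{(i)}(0) &= \frac{1}{2} + \frac{1}{2} \prod_{n' \in N(m) \setminus n} \left( 1 - 2\, q_{n'm}^{(i-1)}(1) \right) \\
r_{mn}^{(i)}(1) &= 1 - r_{mn}^{(i)}(0)
\end{align*}
```

Here N(m) is the set of BNs connected to CN_m, so in the slide's example the product runs over BN_i, BN_j and BN_k.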
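BIT NODE PROCESSING - BN
2. Calculates the message q_nm sent from BN_n to CN_m, including the channel information P_n and the messages r_in, r_jn, r_kn that BN_n receives from the other connected check nodes CN_i, CN_j, CN_k.
3. Then computes the a posteriori pseudo-probabilities and performs hard decoding.
[Figure: CN_i, CN_j, CN_k send r_in, r_jn, r_kn to BN_n; BN_n combines them with P_n and sends q_nm to CN_m]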
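The equations for steps 2 and 3 are not reproduced in this transcript. A sketch of the standard probability-domain SPA rules follows, under these assumptions: P_n is taken to be the channel probability that bit n equals 1, M(n) is the set of CNs connected to BN_n, and k_nm and k_n are normalization constants chosen so that the (0) and (1) values sum to 1. None of this notation is confirmed by the slide.

```latex
\begin{align*}
q_{nm}^{(i)}(0) &= k_{nm}\,(1 - P_n) \prod_{m' \in M(n) \setminus m} r_{m'n}^{(i)}(0), &
q_{nm}^{(i)}(1) &= k_{nm}\,P_n \prod_{m' \in M(n) \setminus m} r_{m'n}^{(i)}(1) \\
Q_n^{(i)}(0) &= k_n\,(1 - P_n) \prod_{m \in M(n)} r_{mn}^{(i)}(0), &
Q_n^{(i)}(1) &= k_n\,P_n \prod_{m \in M(n)} r_{mn}^{(i)}(1) \\
\hat{c}_n &= \begin{cases} 1 & \text{if } Q_n^{(i)}(1) > 0.5 \\ 0 & \text{otherwise} \end{cases}
\end{align*}
```

The only difference between the message q_nm and the pseudo-posterior Q_n is that the message excludes the contribution of its destination CN_m.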
INTENSIVE COMPUTING
"If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" -- Seymour Cray
GRAPHICS PROCESSING UNITS (GPUs)
Raw compute power increasing rapidly; many-core architecture; can be programmed outside the graphics framework.
Exposing parallelism; multi-threaded architecture using CUDA.
Interest in general-purpose processing (GPP) on GPUs; hard programming; needs an efficient interface.
The GPU wins when arithmetic intensity is maximized… the GPU loses with memory accesses!
SUM PRODUCT ALGORITHM (SPA)
Kernel 1: computes the messages sent from CN_m to BN_n (the probability of BN_n being 0 or 1).
Kernel 2: computes the messages sent from BN_n to CN_m.
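The two kernels can be sketched as a sequential reference in Python. This is an illustration of the standard probability-domain SPA, not the authors' CUDA code. The function names, the dense list-of-lists layout for H, the interpretation of P[n] as the channel probability that bit n is 1, and the small (7,4) Hamming matrix used as input are all assumptions made for the example.

```python
from math import prod

def check_node_update(H, q1):
    """Kernel 1 sketch: r1[m][n] = probability, sent from CN m to BN n, that bit n is 1."""
    M, N = len(H), len(H[0])
    r1 = [[0.0] * N for _ in range(M)]
    for m in range(M):
        cols = [n for n in range(N) if H[m][n]]
        for n in cols:
            # Product over every other BN connected to CN m (n itself is excluded).
            others = prod(1.0 - 2.0 * q1[m][k] for k in cols if k != n)
            r1[m][n] = 0.5 - 0.5 * others  # equals 1 - r_mn(0)
    return r1

def bit_node_update(H, r1, P):
    """Kernel 2 sketch: messages q1[m][n] from BN n to CN m, plus hard decisions."""
    M, N = len(H), len(H[0])
    q1 = [[0.0] * N for _ in range(M)]
    bits = [0] * N
    for n in range(N):
        rows = [m for m in range(M) if H[m][n]]
        for m in rows:
            # Exclude the destination CN m from the product.
            a1 = P[n] * prod(r1[k][n] for k in rows if k != m)
            a0 = (1.0 - P[n]) * prod(1.0 - r1[k][n] for k in rows if k != m)
            q1[m][n] = a1 / (a0 + a1)  # normalize so q(0) + q(1) = 1
        # A posteriori pseudo-probabilities use all connected CNs.
        post1 = P[n] * prod(r1[k][n] for k in rows)
        post0 = (1.0 - P[n]) * prod(1.0 - r1[k][n] for k in rows)
        bits[n] = 1 if post1 > post0 else 0
    return q1, bits

def decode(H, P, iterations=25):
    """Run a fixed number of SPA iterations and return the hard-decoded bits."""
    q1 = [[P[n] if h else 0.0 for n, h in enumerate(row)] for row in H]
    bits = [1 if p > 0.5 else 0 for p in P]
    for _ in range(iterations):
        r1 = check_node_update(H, q1)
        q1, bits = bit_node_update(H, r1, P)
    return bits

# Hypothetical input: (7,4) Hamming parity-check matrix, all-zero codeword sent,
# with bit 2 received unreliably (P[2] > 0.5).
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]
P = [0.1, 0.1, 0.6, 0.1, 0.1, 0.1, 0.1]
decoded = decode(H, P, iterations=5)
```

The sketch does not guard against a zero denominator in the normalization, which can occur when channel probabilities are exactly 0 or 1. In a GPU version each (m, n) edge would typically be handled by its own thread; the nested loops here only make the data dependencies explicit.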
COMPACT DATA STRUCTURES - H MATRIX
H is mapped into the compact H_BN and H_CN data structures:
    for all CN_m do:  (rows in H)
        for all BN_n do:  (columns in H)
            if H_mn == 1 then
                p_next = j : H_mj == 1,  // with n+1 <= j < n+N, indices taken mod N
                H_BN = p_next
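A minimal Python sketch of the row scan above follows. It assumes that each entry of H_BN stores the column index of the next 1 found circularly in the same row, and that H_CN is built the same way by scanning columns; the slide does not state exactly what is stored or how H_CN is generated, so both points are assumptions. The helper names `circular_next` and `build_compact` are hypothetical.

```python
def circular_next(lines):
    """For every 1 in each line, record the position of the next 1 in that line,
    searching circularly (wrapping around the end of the line)."""
    out = []
    for line in lines:
        size = len(line)
        for pos in range(size):
            if line[pos] == 1:
                # Visit pos+1, pos+2, ... modulo size; the last step reaches pos itself,
                # so a line with a single 1 points back to its own position.
                for step in range(1, size + 1):
                    j = (pos + step) % size
                    if line[j] == 1:
                        out.append(j)
                        break
    return out

def build_compact(H):
    """Return (h_bn, h_cn): circular next-pointers along rows and along columns."""
    h_bn = circular_next(H)                    # scan rows of H
    columns = [list(col) for col in zip(*H)]   # transpose to scan columns
    h_cn = circular_next(columns)
    return h_bn, h_cn

# Hypothetical example: (7,4) Hamming parity-check matrix.
H = [[1, 1, 0, 1, 1, 0, 0],
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]
h_bn, h_cn = build_compact(H)
```

Because only the positions of the 1s are kept, the structures grow with the number of edges in the Tanner graph rather than with the full size of H, and the circular pointers let a thread start at any edge of a row or column and still visit all of its neighbours.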
COMPUTING KERNELS ON THE GPU
A novel SPA multi-thread computing approach.
The SPA is performed iteratively by several kernels on the GPU.
Flow control and execution management of the kernels are performed by the CUDA programming interface.
CUDA INTERFACE FOR GPGPU
C-based programming interface for NVIDIA's 8x series and next generation.
CUDA enables efficient use of their massive parallelism.
Multi-threading hides latency problems; allows transparent programming.
Slow global memory and fast shared memory access; avoid non-coalesced memory accesses.
Significant speedups, depending on the algorithm.
Hard challenge: irregular memory access patterns!
MULTI-THREAD COMPUTING APPROACH
Multi-thread strategy and architecture.
A circular addressing mechanism allows increased parallelism.
EXPERIMENTAL RESULTS
Matrix size     25 iterations      50 iterations      100 iterations
                CPU      GPU       CPU      GPU       CPU      GPU
512x1024        3.5      0.2       6.9      0.4       13.9     0.8
2448x4896       16.7     0.8       33.3     1.6       66.5     3.1
2000x4000       21.0     1.1       41.9     2.2       84.0     4.2
Main conclusions (… obtained from the matrices we considered, using CUDA):
Much faster processing than on top-notch CPUs.
Supports floating-point operations.
Achieves medium to large throughputs.
BUT MOST DEFINITELY NOT AS GREAT AS WE HOPED!
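The CPU/GPU ratios implied by the table can be checked with a short Python calculation. The values are copied from the table as printed; because they are rounded to one decimal place, the resulting speedups are only approximate, and the dictionary layout is simply a convenience for this sketch.

```python
# (CPU, GPU) pairs from the results table, keyed by matrix size and iteration count.
timings = {
    "512x1024":  {25: (3.5, 0.2),  50: (6.9, 0.4),  100: (13.9, 0.8)},
    "2448x4896": {25: (16.7, 0.8), 50: (33.3, 1.6), 100: (66.5, 3.1)},
    "2000x4000": {25: (21.0, 1.1), 50: (41.9, 2.2), 100: (84.0, 4.2)},
}

# Speedup = CPU value divided by GPU value for each table cell.
speedups = {
    (size, iters): cpu / gpu
    for size, by_iters in timings.items()
    for iters, (cpu, gpu) in by_iters.items()
}

lowest = min(speedups.values())
highest = max(speedups.values())
best_case = max(speedups, key=speedups.get)
```

With these rounded entries the ratios fall between roughly 17 and 21.5, the largest being the 2448x4896 matrix at 100 iterations.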
CONCLUSIONS AND FUTURE WORK
GPGPU approach for LDPC decoding.
New compact data structures to represent the H matrix.
Multi-thread algorithm for LDPC decoding.
Significant speedups achieved with the CUDA programming interface: up to 22x.
GPUs allow a software-based, scalable and low-cost solution.
Trading task parallelism for data parallelism.
Adoption/generalization of the proposed approach (algorithms and data structures) for irregular processing in graphs.
CONCLUSIONS
Gabriel Falcão, firstname.lastname@example.org
University of Coimbra
Technical University of Lisbon
Portugal