JPEG Compression Algorithm in CUDA
Group Members: Pranit Patel, Manisha Tatikonda, Jeff Wong, Jarek Marczewski
Date: April 14, 2009


Outline
- Motivation
- JPEG Algorithm
- Design Approach in CUDA
- Benchmark
- Conclusion

Motivation
- Growth of digital imaging applications
- Effective algorithm for video compression applications
- Loss of information must be minimal
- JPEG is a lossy compression algorithm that reduces file size with little visible effect on image quality
- It exploits the fact that the human eye perceives small changes in brightness more readily than small changes in color

JPEG Algorithm
Step 1: Divide the sample image into 8x8 blocks
Step 2: Apply DCT
- DCT is applied to each block
- It replaces the actual color values of the block with a matrix of frequency coefficients that describes the block as a whole; the first coefficient is proportional to the block average
- This step does not compress the file
In general:
- Simple color space model: [R, G, B] per pixel
- JPEG uses the [Y, Cb, Cr] model
- Y = brightness, Cb = color blueness, Cr = color redness
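The color model and Step 1 can be sketched on the CPU as follows. This is a minimal illustration, not the group's code: the conversion constants are the full-range JFIF ones (the slides do not say which variant was used), and the function names `rgb_to_ycbcr` and `split_into_blocks` are made up for this example.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to (Y, Cb, Cr).

    Assumption: full-range JFIF constants; the slides do not name a variant.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_into_blocks(channel, n=8):
    """Split a 2D list into n x n blocks, listed in row-major block order.

    Assumes the height and width are multiples of n.
    """
    height, width = len(channel), len(channel[0])
    blocks = []
    for by in range(0, height, n):
        for bx in range(0, width, n):
            blocks.append([row[bx:bx + n] for row in channel[by:by + n]])
    return blocks
```

A pure white pixel maps to maximum brightness with neutral chroma (Cb and Cr both at 128), which is why most of the visible detail ends up in the Y channel.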

JPEG Algorithm
Step 3: Quantization
- First compression step
- Each DCT coefficient is divided by its corresponding constant in the quantization table and rounded to the nearest integer
- As a result, smaller, unimportant coefficients are replaced by zeros and larger coefficients lose precision. It is this rounding that causes the loss in image quality.
Step 4: Apply Huffman encoding
- Huffman encoding is applied to the quantized DCT coefficients to reduce the image size further
Step 5: Decoder
- The JPEG decoder consists of: Huffman decoding, de-quantization, IDCT

DCT and IDCT

Discrete Cosine Transform
- Separable transform algorithm (1D, then 2D)
- The 2D DCT is performed in a two-pass approach: one pass in the horizontal direction and one in the vertical direction
[Figure: DCT, 1st pass and 2nd pass]
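The two-pass idea can be illustrated with a small CPU reference. This sketch assumes the orthonormal DCT-II scaling (the slides do not state a normalization), and `dct_1d` and `dct_2d` are illustrative names rather than functions from the project.

```python
import math


def dct_1d(v):
    """Orthonormal 1D DCT-II of a sequence (scaling is an assumption)."""
    n = len(v)
    out = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        s = sum(v[x] * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                for x in range(n))
        out.append(scale * s)
    return out


def dct_2d(block):
    """Two-pass 2D DCT: a 1D DCT on every row, then on every column."""
    rows = [dct_1d(row) for row in block]              # first pass: horizontal
    cols = [dct_1d(list(col)) for col in zip(*rows)]   # second pass: vertical
    return [list(r) for r in zip(*cols)]               # transpose back to row-major
```

For a flat block every coefficient except the first comes out as zero, which matches the intuition that the first coefficient carries the block average.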

Discrete Cosine Transform
- Translate the DCT into matrix multiplication
- Pre-calculated cosine values are stored as a constant array
- The inverse DCT is calculated in the same way, only with ...
[Figure: 8x8 pixel matrix P (P00 to P77) x 8x8 cosine matrix C (C00 to C77)]
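In matrix form, the transform of an 8x8 block P can be written as C · P · Cᵀ, with C holding the pre-calculated cosine values. The sketch below assumes orthonormal scaling, and it inverts the transform with the transposed matrix, which is a standard property of an orthonormal DCT rather than something the slide text confirms. The helper names are illustrative.

```python
import math

N = 8

# Pre-calculated cosine matrix: C[u][x] = a(u) * cos((2x + 1) * u * pi / (2N)).
# Orthonormal scaling a(u) is an assumption, not taken from the slides.
C = [[(math.sqrt(1.0 / N) if u == 0 else math.sqrt(2.0 / N))
      * math.cos((2 * x + 1) * u * math.pi / (2 * N))
      for x in range(N)] for u in range(N)]


def matmul(a, b):
    """Plain N x N matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
            for i in range(N)]


def transpose(m):
    return [list(row) for row in zip(*m)]


def dct_matrix(p):
    """Forward 2D DCT as two matrix products: C * P * C^T."""
    return matmul(matmul(C, p), transpose(C))


def idct_matrix(d):
    """Inverse 2D DCT using the transposed cosine matrix: C^T * D * C."""
    return matmul(matmul(transpose(C), d), C)
```

Because the cosine table never changes, it only has to be computed once, which is what makes a constant array a natural home for it.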

DCT CUDA Implementation
- Each thread within each block performs the same number of calculations
- Each thread multiplies and accumulates eight elements
[Figure: 8x8 pixel matrix P (P00 to P77) x 8x8 cosine matrix C (C00 to C77); example thread: Thread.x = 2, Thread.y = 3]
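The per-thread work can be emulated on the CPU to make the "eight elements" concrete. In this sketch each emulated thread produces one output element of the 8x8 product; mapping Thread.x to the output column and Thread.y to the output row is an assumption, and `thread_mac` and `block_matmul` are hypothetical names.

```python
def thread_mac(p, c, tx, ty):
    """Work of one emulated thread: row ty of p times column tx of c."""
    acc = 0.0
    for k in range(8):              # exactly eight multiply-accumulates
        acc += p[ty][k] * c[k][tx]
    return acc


def block_matmul(p, c):
    """Emulate a 64-thread block by visiting every (tx, ty) pair in turn."""
    return [[thread_mac(p, c, tx, ty) for tx in range(8)] for ty in range(8)]
```

Every thread runs the same eight-step loop with no branches, so on a GPU the 64 threads of a block would stay in lockstep.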

DCT Grid and Block
Two approaches:
- Each thread block processes 1 macroblock (64 threads)
- Each thread block processes 8 macroblocks (512 threads)
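One way to picture the two layouts is to work out which pixel a given thread handles. The sketch below assumes a linear thread index, row-major macroblock numbering, and an image width that is a multiple of 8; none of these details are stated on the slide, and `pixel_for_thread` is a made-up helper.

```python
def pixel_for_thread(block_idx, thread_idx, image_width, mbs_per_block):
    """Map (thread block index, linear thread index) to a (row, col) pixel.

    mbs_per_block is 1 for the 64-thread layout and 8 for the 512-thread one.
    """
    mb_local, t = divmod(thread_idx, 64)         # macroblock inside the thread block
    mb = block_idx * mbs_per_block + mb_local    # global macroblock index
    mbs_per_row = image_width // 8
    mb_row, mb_col = divmod(mb, mbs_per_row)
    row, col = divmod(t, 8)                      # position inside the macroblock
    return mb_row * 8 + row, mb_col * 8 + col
```

Both layouts cover exactly the same pixels; the 512-thread version simply needs one eighth as many thread blocks.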

DCT and IDCT GPU results
[Chart: results for 512x512, 1024x768, and 2048x2048 images]

DCT Results

IDCT Results

Quantization

- Break the image into 8x8 blocks
- An 8x8 quantization matrix is applied to each block
- Each element of the block is divided by the corresponding quantization value, rounded to the nearest integer, and multiplied by that value again
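A CPU-side sketch of this step is shown below. The table is the example luminance table from the JPEG standard, used here only as a plausible stand-in since the slides do not say which table the project used; `quantize` and `dequantize` are illustrative names.

```python
# Example luminance quantization table from the JPEG standard (Annex K).
# Assumption: the slides do not say which table the project actually used.
LUMA_TABLE = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def quantize(block, table):
    """Divide each coefficient by its table entry and round to an integer.

    Note: Python's round() sends exact .5 ties to the nearest even integer.
    """
    return [[int(round(block[i][j] / table[i][j])) for j in range(8)]
            for i in range(8)]


def dequantize(qblock, table):
    """Multiply back by the table entry; the rounding loss is not recovered."""
    return [[qblock[i][j] * table[i][j] for j in range(8)] for i in range(8)]
```

A coefficient of 100 in the top-left position becomes 6 and comes back as 96, while a small high-frequency coefficient such as 5 in the bottom-right position becomes 0 and stays 0.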

Quantization CUDA Programming
- Method 1: Exact implementation as on the CPU
- Method 2: Shared memory to copy the 8x8 image block
- Method 3: Load divided values into shared memory

Quantization CUDA Results

Quantization CPU vs GPU Results

Tabulated Results for Quantization
- Method 2 and Method 3 have similar performance on small image sizes
- Method 3 might perform better on images bigger than 2048x2048
- Quantization is ~70x faster with Method 1, and much more as resolution increases
- Quantization is ~180x faster with Methods 2 and 3, and much more as resolution increases
Resolution   Method 1  Method 2  Method 3  CPU    xCPU-1    xCPU-2    xCPU-3
512x512      0.102     0.039     -         7.37   72.2549   188.9744  -
1024x768     0.274     0.085     -         22     80.29197  258.8235  -
2048x2048    1.39      0.379     0.36      110    79.13669  290.2375  305.5556

Huffman Encode/Decode

Huffman Encoding Basics
- Utilizes the frequency of each symbol
- Lossless compression
- Uses a VARIABLE-length code for each symbol
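The textbook construction behind these points can be sketched as follows. This is a generic Huffman builder, not the project's encoder, and `huffman_code` and `encode` are illustrative names; bits are kept as a string of '0' and '1' characters purely for readability.

```python
import heapq
from collections import Counter


def huffman_code(data):
    """Return {symbol: bitstring} built from the symbol frequencies in data."""
    freq = Counter(data)
    if len(freq) == 1:                        # degenerate case: one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie breaker, {symbol: partial code}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)       # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(data, code):
    """Concatenate the variable-length codes of all symbols."""
    return "".join(code[s] for s in data)
```

Frequent symbols end up with short codes: in the string "aaaabbc" the symbol "a" gets a 1-bit code and the other two get 2-bit codes.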

Challenges
- Encoding is a very, very serial process
- Variable symbol length is a problem
- Encoding: we don't know where a symbol needs to be written until all preceding symbols are encoded
- Decoding: we don't know where symbols start

ENCODING

DECODING
- Decoding: we don't know where symbols start
- Redundant calculation is needed
- Uses a decoding table rather than a tree
- Decode, then shift by n bits
STEP 1: Divide the bitstream into overlapping 65-byte segments. Run 8 threads on each segment, each with a different starting position.

DECODING STEP 2: Determine which threads are valid, throw away others
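The two decoding steps can be emulated sequentially to show the idea. The sketch below is an interpretation, not the project's kernel: bits are a Python string, the "threads" are plain loops, the code must be a complete prefix code with a maximum length of 8 bits, and the lookahead past each segment plays the role of the overlap byte. All function names are hypothetical.

```python
def build_table(code, max_len):
    """Decoding table: every max_len-bit pattern -> (symbol, code length)."""
    table = {}
    for sym, bits in code.items():
        pad = max_len - len(bits)
        for i in range(1 << pad):
            suffix = format(i, "b").zfill(pad) if pad else ""
            table[bits + suffix] = (sym, len(bits))
    return table


def decode_from(bits, table, max_len, start, stop):
    """Decode from bit `start` until the read position reaches `stop`."""
    out, pos = [], start
    while pos < stop:
        window = bits[pos:pos + max_len].ljust(max_len, "0")
        sym, length = table[window]           # table lookup instead of a tree walk
        out.append(sym)
        pos += length                         # shift by n bits
    return out, pos


def speculative_decode(bits, table, max_len, seg_bits=512):
    assert max_len <= 8 <= seg_bits
    total = len(bits)
    # Step 1: decode every segment from 8 candidate start offsets.
    results = {}
    for seg_start in range(0, total, seg_bits):
        stop = min(seg_start + seg_bits, total)
        for offset in range(8):
            start = seg_start + offset
            results[start] = decode_from(bits, table, max_len, start, stop)
    # Step 2: keep only the results whose start matches where the previous
    # valid result ended; all other candidates are thrown away.
    out, pos = [], 0
    while pos < total:
        symbols, pos = results[pos]
        out.extend(symbols)
    return out
```

Seven of every eight candidate decodes are discarded, which is the redundant calculation the slides mention, and the final chaining in step 2 is still a serial walk over the segments.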

DECODING - challenges
- Each segment takes a fixed number of encoded bits but produces variable-length decoded output
- 64 bits can result in 64 bytes of output
- Memory explosion
- Input memory addresses do not advance in a fixed pattern the way output addresses do
- Memory collisions
- The decoding table doesn't fit into one address line
- Combining threads is serial
NOTE: to simplify the algorithm, the max symbol length was assumed to be 8 bits (it didn't help much).

Huffman Results
Encoding
- Step one is very fast: ~100x speedup
- Step two: algorithm is wrong, no results
Decoding
- 3 times slower than the classic CPU method
- Using shared memory for the encoding table resolved only some conflicts (5x slower -> 4x slower)
- Conflicts on the input bitstream
- Either conflicts on input or output data
- Moving 65-byte chunks to shared memory and 'sharing' them between 8 threads didn't help much (4x slower -> 3x slower)
ENCODING should be left to CPU

Conclusion & Results

Results
CPU              512x512      1024x768     2048x2048
DCT              3.38         11.05        57.12
Quantization     5.74         17.16        75.97
IDCT             3.34         10.49        56.5
GPU              512x512      1024x768     2048x2048
DCT              0.191        0.47         2.7
Quantization     0.039        0.085        0.379
IDCT             0.171        0.436        2.145
Performance Gain 512x512      1024x768     2048x2048
DCT              17.69633508  23.5106383   21.15555556
Quantization     147.1794872  201.8823529  200.4485488
IDCT             19.53216374  24.05963303  26.34032634

Performance Gain
- DCT and IDCT are the major consumers of computation time.
- Computation increases with resolution.
- Total processing time for the 2K image is 5.224 ms on the GPU and 189.59 ms on the CPU => speedup of 36x
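The 36x figure can be reproduced from the 2048x2048 rows of the results table. The snippet below only redoes that arithmetic; the variable names are illustrative, and the timings are taken to be in milliseconds, as the GPU total on the slide suggests.

```python
# 2048x2048 timings from the results table (assumed to be milliseconds).
cpu_ms = {"DCT": 57.12, "Quantization": 75.97, "IDCT": 56.5}
gpu_ms = {"DCT": 2.7, "Quantization": 0.379, "IDCT": 2.145}

cpu_total = sum(cpu_ms.values())          # total CPU time for the three stages
gpu_total = sum(gpu_ms.values())          # total GPU time for the three stages
speedup = cpu_total / gpu_total           # overall gain
per_stage = {name: cpu_ms[name] / gpu_ms[name] for name in cpu_ms}

print(f"CPU {cpu_total:.2f} ms, GPU {gpu_total:.3f} ms, speedup {speedup:.1f}x")
```

The per-stage ratios show that quantization gains far more than the two transforms, so the overall speedup is held down by the DCT and IDCT times.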

GPU Performance
- DCT and IDCT still take up most of the computation cycles, but these are reduced by a factor of ~100.
- 2K-resolution processing time is 7 ms on the GPU, compared to ~900 ms on the CPU.

Conclusion
- The CUDA implementation of transform and quantization is much faster than the CPU (36x faster)
- The Huffman algorithm does not parallelize well; final results show it 3x slower than the CPU
- GPU architecture is well optimized for image- and video-related processing
- High-performance applications: interframe, HD resolution, real-time video compression/decompression

Conclusion – Image Quality Resolution – 1024x768 CPU GPU

Conclusion – Image Quality Resolution – 2048x2048

Conclusion – Image Quality Resolution – 512x512 CPU GPU
