**cujpeg: A Simple JPEG Encoder With CUDA Technology**

Dongyue Mou and Zeng Xing

**Outline**

- JPEG Algorithm
- Traditional Encoder
- What's new in cujpeg

**JPEG Algorithm**

JPEG is a commonly used method for image compression. The JPEG encoding algorithm consists of 7 steps:

1. Divide the image into 8x8 blocks
2. [R,G,B] to [Y,Cb,Cr] conversion
3. Downsampling (optional)
4. FDCT (Forward Discrete Cosine Transform)
5. Quantization
6. Serialization in zig-zag style
7. Entropy encoding (Run Length Coding & Huffman coding)

**JPEG Algorithm: Example**

[Figure: an example image, divided into 8x8 blocks]
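The block division can be sketched on the host in plain C. This is only an illustration: the function names, the single-channel row-major layout, and the edge-replication rule for images whose size is not a multiple of 8 are assumptions, not taken from cujpeg.

```c
#include <stddef.h>

#define BLOCK_SIZE 8

/* Copy the 8x8 block at block coordinates (bx, by) out of a single-channel
 * image stored row by row. Pixels past the right or bottom edge repeat the
 * nearest edge pixel (an assumed padding rule for this sketch). */
void extract_block(const unsigned char *image, int width, int height,
                   int bx, int by,
                   unsigned char block[BLOCK_SIZE * BLOCK_SIZE])
{
    for (int y = 0; y < BLOCK_SIZE; y++) {
        int src_y = by * BLOCK_SIZE + y;
        if (src_y >= height) src_y = height - 1;
        for (int x = 0; x < BLOCK_SIZE; x++) {
            int src_x = bx * BLOCK_SIZE + x;
            if (src_x >= width) src_x = width - 1;
            block[y * BLOCK_SIZE + x] = image[(size_t)src_y * width + src_x];
        }
    }
}

/* Number of 8-pixel blocks needed to cover `extent` pixels, rounded up. */
int blocks_along(int extent)
{
    return (extent + BLOCK_SIZE - 1) / BLOCK_SIZE;
}
```

A caller would loop `by` over `blocks_along(height)` and `bx` over `blocks_along(width)` to visit every block.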

**RGB vs. YCC**

- To the human eye, a loss of color precision is less noticeable than a loss of contour precision (which is based on luminance).
- Color space conversion makes use of this.
- Simple color space model: [R,G,B] per pixel
- JPEG uses the [Y, Cb, Cr] model:
  - Y = brightness
  - Cb = color blueness
  - Cr = color redness

**Convert RGB to YCC**

- 8x8 pixels, 1 pixel = 3 components
- MCU with sampling factor (1, 1, 1)
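A per-pixel conversion might look like the following sketch. It uses the full-range coefficients from the JFIF convention, with Cb and Cr centered on 128. These constants, the `YCbCr` struct, and the function name are assumptions for illustration, and the cujpeg kernel may scale or offset its values differently.

```c
/* One pixel in the [Y, Cb, Cr] model, nominally in the range [0, 255]. */
typedef struct {
    float y;   /* brightness */
    float cb;  /* blueness   */
    float cr;  /* redness    */
} YCbCr;

/* Convert one 8-bit RGB pixel with the JFIF (full-range BT.601)
 * coefficients. Results can land slightly outside [0, 255] and would be
 * rounded and clamped before being stored as bytes. */
YCbCr rgb_to_ycbcr(unsigned char r, unsigned char g, unsigned char b)
{
    YCbCr out;
    out.y  =  0.299f    * r + 0.587f    * g + 0.114f    * b;
    out.cb = -0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f;
    out.cr =  0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f;
    return out;
}
```

Gray pixels (r = g = b) map to Cb = Cr = 128, which is why the chroma planes of most images carry little detail.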

**Downsampling**

- Y is taken for every pixel, and Cb, Cr are taken for a block of 2x2 pixels
- 4 blocks, 16x16 pixels
- MCU (minimum coded unit): the smallest group of data units that is coded
- Data size is reduced to a half immediately
- MCU with sampling factor (2, 1, 1)
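As a rough illustration of the chroma step, the sketch below reduces a 16x16 chroma plane to one 8x8 block by averaging each 2x2 group. Averaging is an assumption (taking one sample per group would also fit the description), and the function name is hypothetical. With 4 Y blocks plus 1 Cb and 1 Cr block, an MCU holds 6 blocks instead of 12, which is the halving mentioned above.

```c
/* Average each 2x2 neighbourhood of a 16x16 chroma plane (row-major)
 * into one sample of an 8x8 plane. */
void downsample_2x2(const float src[16 * 16], float dst[8 * 8])
{
    for (int y = 0; y < 8; y++) {
        for (int x = 0; x < 8; x++) {
            /* top-left sample of the 2x2 group covered by dst(x, y) */
            const float *p = &src[(2 * y) * 16 + 2 * x];
            dst[y * 8 + x] = 0.25f * (p[0] + p[1] + p[16] + p[17]);
        }
    }
}
```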

**Apply FDCT**

- 2-D DCT: the bottleneck; computed directly, the complexity of the algorithm is O(n^4)
- 1-D DCT: a 2-D transform is equivalent to a 1-D transform applied in each direction
- The kernel uses 1-D transforms
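The row-column decomposition can be sketched with a deliberately naive reference implementation. This is not the fast kernel used in cujpeg: the function names, the small cosine lookup table, and the orthonormal scaling are choices made for this illustration. Each 1-D pass costs O(n^2) and there are 2n of them, so the whole block costs O(n^3) rather than O(n^4).

```c
/* cos(m * pi / 16) for m = 0..8; other angles follow by symmetry. */
static const float COS16[9] = {
    1.0f, 0.98078528f, 0.92387953f, 0.83146961f, 0.70710678f,
    0.55557023f, 0.38268343f, 0.19509032f, 0.0f
};

static float cos_pi_16(int m)
{
    m %= 32;                            /* period is 2*pi = 32 steps     */
    if (m > 16) m = 32 - m;             /* cos(2*pi - x) = cos(x)        */
    if (m > 8) return -COS16[16 - m];   /* cos(pi - x)   = -cos(x)       */
    return COS16[m];
}

/* Naive 8-point DCT-II, in place, on 8 values spaced `step` floats apart. */
void dct_1d(float *v, int step)
{
    float in[8];
    for (int n = 0; n < 8; n++) in[n] = v[n * step];
    for (int k = 0; k < 8; k++) {
        float sum = 0.0f;
        for (int n = 0; n < 8; n++)
            sum += in[n] * cos_pi_16((2 * n + 1) * k);
        /* orthonormal scaling: sqrt(1/8) for k = 0, sqrt(2/8) otherwise */
        v[k * step] = (k == 0 ? 0.35355339f : 0.5f) * sum;
    }
}

/* 2-D DCT of an 8x8 block: 8 row transforms, then 8 column transforms. */
void dct_2d(float block[64])
{
    for (int row = 0; row < 8; row++) dct_1d(block + row * 8, 1);
    for (int col = 0; col < 8; col++) dct_1d(block + col, 8);
}
```

The only difference between the row and column passes is the stride: 1 for rows, 8 for columns.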

**Apply FDCT**

- Shift operation: sample values are shifted from [0, 255] to [-128, 127]
- [Figure: DCT result, with the meaning of each position in the DCT result matrix]

**Quantization**

Each entry of the DCT result is divided by the corresponding entry of the quantization matrix (adjustable according to quality) and rounded, giving the quantization result.
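A minimal sketch of this step follows. Two things here are assumptions rather than cujpeg's code: the rounding is symmetric around zero, and the quality scaling follows the convention used by libjpeg (a base table scaled by a factor derived from a quality of 1 to 100). The function names are hypothetical.

```c
/* Scale one base quantization entry for a quality setting in [1, 100],
 * following the libjpeg convention (quality 50 keeps the base table). */
int scale_quant_entry(int base, int quality)
{
    if (quality < 1) quality = 1;
    if (quality > 100) quality = 100;
    int scale = quality < 50 ? 5000 / quality : 200 - 2 * quality;
    int v = (base * scale + 50) / 100;
    if (v < 1) v = 1;       /* a divisor of 0 is not allowed   */
    if (v > 255) v = 255;   /* keep entries within 8 bits      */
    return v;
}

/* Divide each DCT coefficient by its table entry and round to nearest. */
void quantize(const float dct[64], const unsigned char qtable[64], int out[64])
{
    for (int i = 0; i < 64; i++) {
        float q = dct[i] / (float)qtable[i];
        out[i] = (int)(q < 0.0f ? q - 0.5f : q + 0.5f);
    }
}
```

On a GPU it is common to multiply by a precomputed reciprocal table instead of dividing, which is what the `myDivQLine` array in the cujpeg kernel suggests.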

**Zigzag reordering / Run Length Coding**

The quantization result is read out in zigzag order and run-length coded as pairs of the form [number of zeros before me, my value].
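The two steps can be combined in one pass, as in the sketch below. It is simplified in the same way as the slide: all 64 values go through the run-length coder, whereas a full JPEG encoder codes the first (DC) value separately and limits runs to 15 zeros. The `RunPair` type and function name are hypothetical; the table is the standard JPEG zigzag order.

```c
/* ZIGZAG[i] is the row-major index of the i-th coefficient in zigzag order. */
static const int ZIGZAG[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

typedef struct {
    int run;    /* number of zeros before this value */
    int value;  /* the nonzero value itself          */
} RunPair;

/* Emit [run, value] pairs in zigzag order and return how many were written.
 * Trailing zeros produce no pair; the caller appends EOB after the pairs. */
int zigzag_rle(const int block[64], RunPair out[64])
{
    int count = 0, run = 0;
    for (int i = 0; i < 64; i++) {
        int v = block[ZIGZAG[i]];
        if (v == 0) { run++; continue; }
        out[count].run = run;
        out[count].value = v;
        count++;
        run = 0;
    }
    return count;
}
```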

**Huffman encoding**

| Values | G | Real saved values |
|---|---|---|
| -1, 1 | 1 | 0, 1 |
| -3, -2, 2, 3 | 2 | 00, 01, 10, 11 |
| -7, -6, -5, -4, 4, 5, 6, 7 | 3 | 000, 001, 010, 011, 100, 101, 110, 111 |
| ... | 4, 5, ..., 15 | ... |

- RLC result: [0, -3] [0, 12] [0, 3] ... EOB
- After the group number is added: [0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] EOB
- First Huffman coding (i.e. for [0, 2, 00b]): [0, 2, 00b] => [100b, 00b] (looked up in e.g. the AC chrominance table)
- Total input: 512 bits, output: 113 bits
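The group number and the saved bits follow directly from the value, as this sketch shows. The function names are hypothetical. The Huffman lookup on the [run, group] pair is left out because it depends on the code table in use.

```c
/* Group number G: how many bits are needed for |value|.
 * Gives 1 for -1 and 1, 2 for -3..-2 and 2..3, 3 for -7..-4 and 4..7. */
int group_number(int value)
{
    int mag = value < 0 ? -value : value;
    int bits = 0;
    while (mag) { bits++; mag >>= 1; }
    return bits;
}

/* The G bits that are actually saved: the value itself when positive,
 * value + 2^G - 1 when negative (so -3 becomes 00b and 3 becomes 11b). */
unsigned saved_bits(int value)
{
    int g = group_number(value);
    return (unsigned)(value >= 0 ? value : value + (1 << g) - 1);
}
```

For the values above, `-3` gives group 2 with bits `00b`, and `12` gives group 4 with bits `1100b`.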


**Traditional Encoder**

All steps run on the CPU:

Image -> Load image -> Color conversion -> DCT -> Quantization -> Zigzag reorder -> Encoding -> .jpg

**Algorithm Analysis**

- 1x full 2D DCT scan: O(N^4)
- 8x row 1D DCT scan + 8x column 1D DCT scan: O(N^3)
- 8 threads can work in parallel

**DCT In Place**

```cuda
// 8-point 1-D DCT, computed in place on 8 values spaced `Step` floats apart.
// C_a ... C_f and C_norm are the DCT constants, defined elsewhere.
__device__ void vectorDCTInPlace(float *Vect0, int Step)
{
    float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
    float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
    float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
    float *Vect7 = Vect6 + Step;

    // Butterfly stage: sums and differences of mirrored inputs.
    float X07P = (*Vect0) + (*Vect7);
    float X16P = (*Vect1) + (*Vect6);
    float X25P = (*Vect2) + (*Vect5);
    float X34P = (*Vect3) + (*Vect4);

    float X07M = (*Vect0) - (*Vect7);
    float X61M = (*Vect6) - (*Vect1);
    float X25M = (*Vect2) - (*Vect5);
    float X43M = (*Vect4) - (*Vect3);

    float X07P34PP = X07P + X34P;
    float X07P34PM = X07P - X34P;
    float X16P25PP = X16P + X25P;
    float X16P25PM = X16P - X25P;

    // Even-indexed outputs.
    (*Vect0) = C_norm * (X07P34PP + X16P25PP);
    (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
    (*Vect4) = C_norm * (X07P34PP - X16P25PP);
    (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

    // Odd-indexed outputs.
    (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
    (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
    (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
    (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
}

// Sequential 2-D DCT of one 8x8 block: 8 row passes, then 8 column passes.
__device__ void blockDCTInPlace(float *block)
{
    for (int row = 0; row < 64; row += 8)
        vectorDCTInPlace(block + row, 1);   // row elements are adjacent
    for (int col = 0; col < 8; col++)
        vectorDCTInPlace(block + col, 8);   // column elements are 8 floats apart
}

// Parallel 2-D DCT: 8 threads per block, each doing one row, then one column.
__device__ void parallelDCTInPlace(float *block)
{
    int col = threadIdx.x % 8;
    int row = col * 8;
    __syncthreads();
    vectorDCTInPlace(block + row, 1);
    __syncthreads();                        // all rows must finish before the columns start
    vectorDCTInPlace(block + col, 8);
}
```

**Allocation**

| | Desktop PC | Graphic Card |
|---|---|---|
| Processor | CPU: 1 P4 core, 3.0 GHz | GPU: 16 cores at 575 MHz; SP/core at 1.35 GHz |
| RAM | 2 GB | 768 MB |

**Binding**

- Huffman encoding (bound to the CPU): many conditions/branches, intensive bit operations, less computation
- Color conversion, DCT, quantization (bound to the GPU): intensive computation, fewer conditions/branches

**Binding**

- Hardware: 16 KB shared memory. Problem: 1 MCU contains 768 bytes of data. Result: at most 21 MCUs per CUDA block.
- Hardware: 512 threads. Problem: 1 MCU contains 3 blocks, and 1 block needs 8 threads. Result: 1 MCU needs 24 threads.
- 1 CUDA block = 21 MCUs x 24 threads = 504 threads
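The limits above can be checked with a few lines of arithmetic. The sketch assumes each MCU is held as 3 blocks of 64 `float` values (768 bytes with a 4-byte `float`), which is consistent with the 16128 bytes of shared memory per block reported for 21 MCUs. The constant and function names are hypothetical.

```c
enum {
    SHARED_MEM_BYTES  = 16 * 1024, /* shared memory per CUDA block      */
    MAX_THREADS       = 512,       /* thread limit per CUDA block       */
    BLOCKS_PER_MCU    = 3,         /* Y, Cb, Cr with sampling (1, 1, 1) */
    VALUES_PER_BLOCK  = 64,        /* 8x8 coefficients                  */
    THREADS_PER_BLOCK = 8          /* one thread per row / column       */
};

int mcu_bytes(void)
{
    return BLOCKS_PER_MCU * VALUES_PER_BLOCK * (int)sizeof(float);
}

int threads_per_mcu(void)
{
    return BLOCKS_PER_MCU * THREADS_PER_BLOCK;
}

/* The tighter of the two limits decides how many MCUs fit. */
int mcus_per_cuda_block(void)
{
    int by_memory  = SHARED_MEM_BYTES / mcu_bytes();
    int by_threads = MAX_THREADS / threads_per_mcu();
    return by_memory < by_threads ? by_memory : by_threads;
}
```

Both limits happen to give 21 MCUs, so the configuration uses 21 x 24 = 504 threads and 21 x 768 = 16128 bytes.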

**cujpeg Encoder**

The pipeline is split between CPU and GPU:

Image -> Load image -> Color conversion -> DCT -> Quantization -> Zigzag reorder -> Encoding -> .jpg

**cujpeg Encoder**

Data flow: Image -> Load image (CPU, host memory) -> Texture memory -> Color conversion (GPU) -> Shared memory -> In-place DCT -> Quantize / Reorder -> Result in global memory -> Host memory -> Encoding (CPU) -> .jpg

Load image (host memory to texture memory):

```cuda
cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride,
                    imageWidth, imageHeight, cudaMemcpyHostToDevice);
cudaBindTextureToArray(TexSrc, textureCache, channel);
cudaMalloc((void **)(&ResultDevice), ResultSize);
```

Color conversion (texture memory to shared memory; the coefficients and index offsets shown as `...` are not legible on the slide):

```cuda
int b = tex2D(TexSrc, TexPosX++, TexPosY);
int g = tex2D(TexSrc, TexPosX++, TexPosY);
int r = tex2D(TexSrc, TexPosX+=6, TexPosY);
float y  = 0.299*r ...*g ...*b;
float cb = ...*r ...*g ...*b + 0.5;
float cr = 0.500*r ...f*g ...*b + 0.5;
myDCTLine[Offset + i] = y;
myDCTLine[Offset ... i] = cb;
myDCTLine[Offset ... i] = cr;
```

Quantize and reorder (shared memory to global memory):

```cuda
for (int i = 0; i < BLOCK_WIDTH; i++)
    myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);
```

Copy the result back (global memory to host memory):

```cuda
cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);
```

**Scheduling**

For each MCU:

- 24 threads: convert 2 pixels each
- 8 threads: convert the remaining 2 pixels each
- 24 threads: do 1x row vector DCT
- 24 threads: do 1x column vector DCT
- 24 threads: quantize 8x scalar values

Data flow: RGB Data -> (x24) -> YCC Block [Y, Cb, Cr] -> (x24) -> DCT Block [Y, Cb, Cr] -> (x24) -> Quantized/Reordered Data

**GPU Occupancy**

| Parameter | Value |
|---|---|
| Threads Per Block | 504 |
| Registers Per Thread | 16 |
| Shared Memory Per Block (bytes) | 16128 |
| Active Threads per Multiprocessor | |
| Active Warps per Multiprocessor | |
| Active Thread Blocks per Multiprocessor | 1 |
| Occupancy of each Multiprocessor | 67% |
| Maximum Simultaneous Blocks per GPU | |
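The 67% figure can be reproduced under two assumptions that are not stated on the slide: a warp holds 32 threads, and a multiprocessor of this GPU generation can keep at most 24 warps active. The function names are hypothetical.

```c
/* Warps needed for one thread block; a partial warp still occupies a slot. */
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + 31) / 32;
}

/* Occupancy in percent, rounded to the nearest integer: active warps
 * divided by the maximum number of warps a multiprocessor can hold. */
int occupancy_percent(int threads_per_block, int blocks_per_sm,
                      int max_warps_per_sm)
{
    int active = warps_per_block(threads_per_block) * blocks_per_sm;
    return (100 * active + max_warps_per_sm / 2) / max_warps_per_sm;
}
```

With 504 threads per block and 1 active block, that is 16 of 24 warps, or about 67%.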

**Benchmark**

| | 512x512 | 1024x1024 | 2048x2048 | 4096x4096 |
|---|---|---|---|---|
| cujpeg | 0.321s | 0.376s | 0.560s | 1.171s |
| libjpeg | 0.121s | 0.237s | 0.804s | 3.971s |

(Q = 80, Sample = 1:1:1)

**Benchmark**

Time Consumption (4096x4096)

| | Load | Transfer | Compute | Encode | Total |
|---|---|---|---|---|---|
| Quality = 100 | 0.132s | 0.348s | 0.043s | 0.837s | 1.523s |
| Quality = 80 | 0.121s | 0.324s | | 0.480s | 1.123s |
| Quality = 50 | 0.130s | 0.353s | 0.044s | 0.468s | 1.167s |

**Benchmark**

- Each thread has 240 operations
- 24 threads process 1 MCU
- A 4096x4096 image includes 262144 MCUs
- Total ops: 262144 * 24 * 240 = 1509949440 flops
- Speed: (Total ops) / 0.043 s = 35.12 Gflops
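The throughput estimate can be re-derived as follows, assuming one MCU covers 8x8 pixels (sampling 1:1:1) and taking the 0.043 s compute time from the measurements. The function names are hypothetical.

```c
/* Number of 8x8-pixel MCUs in an image whose sides are multiples of 8. */
long long mcu_count(int width, int height)
{
    return (long long)(width / 8) * (height / 8);
}

/* Total operations: MCUs x threads per MCU x operations per thread. */
long long total_ops(long long mcus, int threads_per_mcu, int ops_per_thread)
{
    return mcus * threads_per_mcu * ops_per_thread;
}

/* Throughput in Gflops for a given compute time in seconds. */
double gflops(long long ops, double seconds)
{
    return (double)ops / seconds / 1e9;
}
```

For a 4096x4096 image this gives 262144 MCUs and 1509949440 operations, which at 0.043 s is about 35.12 Gflops.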

**Conclusion**

- CUDA can clearly accelerate JPEG compression.
- The overall performance depends on the system speed.
- More bandwidth
- Better encoding routine
- Support for downsampling

