A Simple JPEG Encoder With CUDA Technology Dongyue Mou and Zeng Xing cujpeg.


1 A Simple JPEG Encoder With CUDA Technology Dongyue Mou and Zeng Xing cujpeg

2 Outline: JPEG Algorithm, Traditional Encoder, What's new in cujpeg, Benchmark, Conclusion

3 Outline: JPEG Algorithm, Traditional Encoder, What's new in cujpeg, Benchmark, Conclusion

4 JPEG Algorithm
JPEG is a commonly used method for image compression. The JPEG encoding algorithm consists of 7 steps:
1. Divide image into 8x8 blocks
2. [R,G,B] to [Y,Cb,Cr] conversion
3. Downsampling (optional)
4. FDCT (Forward Discrete Cosine Transform)
5. Quantization
6. Serialization in zig-zag style
7. Entropy encoding (Run Length Coding & Huffman coding)
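As a small illustration of step 1, the sketch below counts how many 8x8 blocks are needed to cover an image. The function names are hypothetical and not taken from cujpeg, and the sketch assumes that partial blocks at the right and bottom edges are counted as full blocks.

```cpp
#include <cassert>

// Number of 8x8 blocks needed to cover `pixels` samples along one axis.
// Assumption: a partial block at the edge counts as a full block; how the
// missing samples are filled is left open here.
int blocksAlong(int pixels) {
    assert(pixels > 0);
    return (pixels + 7) / 8;
}

// Total number of 8x8 blocks per color component for a width x height image.
long long blocksPerComponent(int width, int height) {
    return static_cast<long long>(blocksAlong(width)) * blocksAlong(height);
}
```

For a 4096x4096 image this gives 512 x 512 = 262144 blocks per component.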

5 This is an example JPEG Algorithm -- Example

6 This is an example Divide into 8x8 blocks

7 This is an example Divide into 8x8 blocks

8 RGB vs. YCC
To the human eye, a loss of precision in color matters less than a loss of precision in contours (which are carried by luminance). Color space conversion makes use of this!
Simple color space model: [R,G,B] per pixel
JPEG uses the [Y, Cb, Cr] model:
Y = Brightness
Cb = Color blueness
Cr = Color redness

9 Convert RGB to YCC
8x8 pixels, 1 pixel = 3 components
MCU with sampling factor (1, 1, 1)
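The conversion of a single pixel can be sketched on the host as follows. This is a plain C++ reference using the JFIF coefficients, not the cujpeg kernel; the struct and function names are hypothetical. It assumes the 128 offset on Cb and Cr is left out, so that all three components come out centered on zero and ready for the DCT.

```cpp
#include <cassert>

struct YCbCr { float y, cb, cr; };

// JFIF-style conversion of one 8-bit RGB pixel.
// Assumption: the result is already level-shifted for the DCT, so y lies in
// [-128, 127] and cb, cr are centered on 0 (no +128 offset is added).
YCbCr rgbToYCbCr(int r, int g, int b) {
    assert(r >= 0 && r <= 255 && g >= 0 && g <= 255 && b >= 0 && b <= 255);
    YCbCr p;
    p.y  =  0.299f    * r + 0.587f    * g + 0.114f    * b - 128.0f;
    p.cb = -0.168736f * r - 0.331264f * g + 0.5f      * b;
    p.cr =  0.5f      * r - 0.418688f * g - 0.081312f * b;
    return p;
}
```

A gray pixel (r = g = b) yields cb and cr close to zero, because the Cb and Cr coefficients each sum to zero.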

10 Downsampling
MCU (minimum coded unit): the smallest group of data units that is coded.
Y is taken for every pixel; Cb and Cr are taken once per block of 2x2 pixels.
A 16x16 pixel MCU then holds 4 Y blocks plus 1 Cb and 1 Cr block (6 blocks instead of 12), so the data size is immediately reduced to a half.
MCU with sampling factor (2, 1, 1)
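One way to take a single chroma sample per 2x2 pixel group is to average the four values. The slide does not say whether averaging or plain decimation is meant, so averaging is an assumption here, and `downsample2x2` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Average each 2x2 group of a chroma plane (row-major, width x height, both
// even) into one sample. The output plane is (width/2) x (height/2).
// Assumption: averaging is used; the slides do not name the filter.
std::vector<float> downsample2x2(const std::vector<float>& plane, int width, int height) {
    assert(width > 0 && height > 0 && width % 2 == 0 && height % 2 == 0);
    assert(plane.size() == static_cast<std::size_t>(width) * height);
    std::vector<float> out(static_cast<std::size_t>(width / 2) * (height / 2));
    for (int y = 0; y < height; y += 2) {
        for (int x = 0; x < width; x += 2) {
            float sum = plane[y * width + x] + plane[y * width + x + 1]
                      + plane[(y + 1) * width + x] + plane[(y + 1) * width + x + 1];
            out[(y / 2) * (width / 2) + x / 2] = sum * 0.25f;
        }
    }
    return out;
}
```

Applied to the Cb and Cr planes of a 16x16 region, this turns four 8x8 chroma blocks per component into one.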

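11 Apply FDCT
2D IDCT: [formula not included in the transcript]
1D IDCT: [formula not included in the transcript]
The 2-D transform is equivalent to the 1-D transform applied in each direction.
The kernel uses 1-D transforms.
The FDCT is the bottleneck: evaluated directly, the 2-D transform has complexity O(n^4).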

12 Apply FDCT
Shift operation: from [0, 255] to [-128, 127]
DCT result
Meaning of each position in the DCT result matrix
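The level shift and the separable transform can be sketched on the host as below. This is a deliberately naive reference that evaluates the orthonormal 8-point DCT-II directly; it is not the fast kernel used by cujpeg, and `dct1d` and `fdct8x8` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>

// Orthonormal 8-point DCT-II of the values v[0], v[step], ..., v[7*step],
// written back in place. Naive O(n^2) evaluation, for reference only.
void dct1d(float* v, int step) {
    assert(step == 1 || step == 8);
    const double pi = 3.14159265358979323846;
    float out[8];
    for (int k = 0; k < 8; ++k) {
        double sum = 0.0;
        for (int n = 0; n < 8; ++n)
            sum += v[n * step] * std::cos((2 * n + 1) * k * pi / 16.0);
        double scale = (k == 0) ? std::sqrt(1.0 / 8.0) : 0.5;
        out[k] = static_cast<float>(scale * sum);
    }
    for (int k = 0; k < 8; ++k) v[k * step] = out[k];
}

// Level-shift an 8x8 block of 8-bit samples from [0, 255] to [-128, 127],
// then apply the 1-D DCT to every row and to every column.
void fdct8x8(const unsigned char* pixels, float* block) {
    for (int i = 0; i < 64; ++i) block[i] = pixels[i] - 128.0f;
    for (int row = 0; row < 8; ++row) dct1d(block + row * 8, 1);  // rows
    for (int col = 0; col < 8; ++col) dct1d(block + col, 8);      // columns
}
```

For a uniform block, only the top-left (DC) coefficient is non-zero: a block of constant value 200 gives 8 * (200 - 128) = 576 there.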

13 Quantization
DCT result
Quantization matrix (adjustable according to quality)
Quantization result
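A minimal sketch of this step: each DCT coefficient is divided by the matching entry of the quantization matrix and rounded. The slides do not say how cujpeg derives the matrix from the quality setting, so `scaleQuantStep` below assumes the scaling convention used by libjpeg; both function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Scale one base quantization step for a quality in [1, 100].
// Assumption: libjpeg-style scaling; cujpeg may do this differently.
int scaleQuantStep(int baseStep, int quality) {
    assert(quality >= 1 && quality <= 100);
    int scale = (quality < 50) ? 5000 / quality : 200 - 2 * quality;
    int step = (baseStep * scale + 50) / 100;
    if (step < 1) step = 1;
    if (step > 255) step = 255;
    return step;
}

// Divide each DCT coefficient by its quantization step and round to nearest.
void quantize8x8(const float* dct, const int* qtable, int* out) {
    for (int i = 0; i < 64; ++i) {
        assert(qtable[i] >= 1);
        out[i] = static_cast<int>(std::lround(dct[i] / qtable[i]));
    }
}
```

Under this convention a base step of 16 stays 16 at quality 50, shrinks to 6 at quality 80, and becomes 1 at quality 100.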

14 Zigzag reordering / Run Length Coding
Quantization result
Each entry: [number of zeros before me, my value]
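The two steps on this slide can be sketched as follows. The zig-zag order is generated by walking the anti-diagonals of the block, and the run-length pass produces the [zeros before me, my value] pairs. This is a simplified illustration with hypothetical names: it treats all 64 coefficients alike and does not handle the separate DC coding or the long zero runs of a full JPEG encoder.

```cpp
#include <cassert>
#include <utility>
#include <vector>

typedef std::pair<int, int> RunValue;  // (zeros before me, my value)

// Fill order[i] with the row-major index of the i-th coefficient in
// zig-zag order, by walking the 15 anti-diagonals of the 8x8 block.
void buildZigzag(int order[64]) {
    int idx = 0;
    for (int s = 0; s < 15; ++s) {
        int lo = (s < 8) ? 0 : s - 7;  // smallest valid row on diagonal s
        int hi = (s < 8) ? s : 7;      // largest valid row on diagonal s
        if (s % 2 == 1) {
            for (int r = lo; r <= hi; ++r) order[idx++] = r * 8 + (s - r);
        } else {
            for (int r = hi; r >= lo; --r) order[idx++] = r * 8 + (s - r);
        }
    }
    assert(idx == 64);
}

// Run-length code one quantized block. Trailing zeros are dropped; the
// caller is expected to emit EOB after the last pair.
std::vector<RunValue> runLength(const int* quantized) {
    int order[64];
    buildZigzag(order);
    int last = 63;
    while (last >= 0 && quantized[order[last]] == 0) --last;
    std::vector<RunValue> pairs;
    int zeros = 0;
    for (int i = 0; i <= last; ++i) {
        int v = quantized[order[i]];
        if (v == 0) { ++zeros; continue; }
        pairs.push_back(std::make_pair(zeros, v));
        zeros = 0;
    }
    return pairs;
}
```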

15 Huffman encoding
RLC result: [0, -3] [0, 12] [0, 3] ...... EOB
After group number added: [0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] ...... EOB
First Huffman coding (i.e. for [0, 2, 00b]): [0, 2, 00b] => [100b, 00b] (look up e.g. the AC chrominance table)
Total input: 512 bits, output: 113 bits

Values                        G     Real saved values
0                             0     (none)
-1, 1                         1     0, 1
-3, -2, 2, 3                  2     00, 01, 10, 11
-7, -6, -5, -4, 4, 5, 6, 7    3     000, 001, 010, 011, 100, 101, 110, 111
...                           ...   ...
-32767..32767                 15    ...
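The group number G and the bits saved after the Huffman code follow directly from the value, as in the table above. The sketch below is a host-side illustration with hypothetical function names: a positive value is stored as its G low-order bits, and a negative value as the G low-order bits of (value - 1).

```cpp
#include <cassert>
#include <cstdlib>

// Group number: the number of bits needed to represent |value|.
int groupNumber(int value) {
    assert(value >= -32767 && value <= 32767);
    int magnitude = std::abs(value);
    int group = 0;
    while (magnitude != 0) { ++group; magnitude >>= 1; }
    return group;
}

// The `group` low-order bits that are saved after the Huffman code.
// Negative values are stored as (value - 1), so that within each group the
// bit patterns run from all zeros (most negative) to all ones (most positive).
unsigned extraBits(int value) {
    int group = groupNumber(value);
    int v = (value < 0) ? value - 1 : value;
    return static_cast<unsigned>(v) & ((1u << group) - 1u);
}
```

For example, -3 falls into group 2 with bits 00, and 12 falls into group 4 with bits 1100.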

16 Outline: JPEG Algorithm, Traditional Encoder, What's new in cujpeg, Benchmark, Conclusion

17 Traditional Encoder
CPU: Load image -> Color conversion -> DCT -> Quantization -> Zigzag Reorder -> Encoding -> Image.jpg

18 Outline: JPEG Algorithm, Traditional Encoder, What's new in cujpeg, Benchmark, Conclusion

19 Algorithm Analysis
1x full 2D DCT scan: O(N^4)
8x row 1D DCT scan + 8x column 1D DCT scan: O(N^3)
8 threads can work in parallel
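The gap between the two approaches can be made concrete by counting multiply-adds. The sketch assumes naive evaluation in both cases (no fast butterfly), and the function names are hypothetical.

```cpp
#include <cassert>

// Multiply-adds for evaluating the 2-D DCT of an n x n block directly:
// each of the n*n outputs sums over all n*n inputs.
long long directOps(int n) {
    assert(n > 0);
    return 1LL * n * n * n * n;
}

// Multiply-adds for the separable form: n row transforms plus n column
// transforms, each a naive 1-D DCT costing n*n multiply-adds.
long long separableOps(int n) {
    assert(n > 0);
    return 2LL * n * n * n;
}
```

For n = 8 this is 4096 multiply-adds for the direct form against 1024 for the separable form, and the 8 row (or column) transforms are independent of each other.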

20 Algorithm Analysis

21 DCT In Place

// 8-point DCT of Vect0[0], Vect0[Step], ..., Vect0[7*Step], computed in place.
// The constants C_a ... C_f and C_norm are defined elsewhere (not shown on the slide).
__device__ void vectorDCTInPlace(float *Vect0, int Step)
{
    float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
    float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
    float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
    float *Vect7 = Vect6 + Step;

    // Sums and differences of mirrored sample pairs.
    float X07P = (*Vect0) + (*Vect7);
    float X16P = (*Vect1) + (*Vect6);
    float X25P = (*Vect2) + (*Vect5);
    float X34P = (*Vect3) + (*Vect4);
    float X07M = (*Vect0) - (*Vect7);
    float X61M = (*Vect6) - (*Vect1);
    float X25M = (*Vect2) - (*Vect5);
    float X43M = (*Vect4) - (*Vect3);

    float X07P34PP = X07P + X34P;
    float X07P34PM = X07P - X34P;
    float X16P25PP = X16P + X25P;
    float X16P25PM = X16P - X25P;

    // Even-indexed outputs.
    (*Vect0) = C_norm * (X07P34PP + X16P25PP);
    (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
    (*Vect4) = C_norm * (X07P34PP - X16P25PP);
    (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

    // Odd-indexed outputs.
    (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
    (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
    (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
    (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
}

// Sequential version: one thread transforms a whole 8x8 block.
__device__ void blockDCTInPlace(float *block)
{
    for (int row = 0; row < 64; row += 8)
        vectorDCTInPlace(block + row, 1);   // rows: samples are adjacent
    for (int col = 0; col < 8; col++)
        vectorDCTInPlace(block + col, 8);   // columns: samples are 8 floats apart
}

// Parallel version: 8 threads share one block; each handles one row, then one column.
__device__ void parallelDCTInPlace(float *block)
{
    int col = threadIdx.x % 8;
    int row = col * 8;
    __syncthreads();
    vectorDCTInPlace(block + row, 1);
    __syncthreads();                        // all rows must be done before the column pass
    vectorDCTInPlace(block + col, 8);
    __syncthreads();
}
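The slide does not define C_a ... C_f and C_norm. The host-side sketch below assumes they are the factors sqrt(2) * cos(k * pi / 16) and 1 / sqrt(8) of the orthonormal 8-point DCT-II, and checks a plain C++ copy of the butterfly against the textbook formula. `vectorDCTHost` and `maxButterflyError` are hypothetical names introduced for this check.

```cpp
#include <cassert>
#include <cmath>

// Assumed constant values (not given on the slide).
const double kPi = 3.14159265358979323846;
const float C_a = static_cast<float>(std::sqrt(2.0) * std::cos(1 * kPi / 16));
const float C_b = static_cast<float>(std::sqrt(2.0) * std::cos(2 * kPi / 16));
const float C_c = static_cast<float>(std::sqrt(2.0) * std::cos(3 * kPi / 16));
const float C_d = static_cast<float>(std::sqrt(2.0) * std::cos(5 * kPi / 16));
const float C_e = static_cast<float>(std::sqrt(2.0) * std::cos(6 * kPi / 16));
const float C_f = static_cast<float>(std::sqrt(2.0) * std::cos(7 * kPi / 16));
const float C_norm = static_cast<float>(1.0 / std::sqrt(8.0));

// Host copy of the butterfly from the slide, using the assumed constants.
void vectorDCTHost(float* v, int step) {
    assert(step >= 1);
    float* p[8];
    for (int i = 0; i < 8; ++i) p[i] = v + i * step;
    float X07P = *p[0] + *p[7], X16P = *p[1] + *p[6];
    float X25P = *p[2] + *p[5], X34P = *p[3] + *p[4];
    float X07M = *p[0] - *p[7], X61M = *p[6] - *p[1];
    float X25M = *p[2] - *p[5], X43M = *p[4] - *p[3];
    float X07P34PP = X07P + X34P, X07P34PM = X07P - X34P;
    float X16P25PP = X16P + X25P, X16P25PM = X16P - X25P;
    *p[0] = C_norm * (X07P34PP + X16P25PP);
    *p[2] = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
    *p[4] = C_norm * (X07P34PP - X16P25PP);
    *p[6] = C_norm * (C_e * X07P34PM - C_b * X16P25PM);
    *p[1] = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
    *p[3] = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
    *p[5] = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
    *p[7] = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
}

// Largest absolute difference between the butterfly and the directly
// evaluated orthonormal DCT-II for one 8-sample vector.
float maxButterflyError(const float* in) {
    float fast[8];
    for (int i = 0; i < 8; ++i) fast[i] = in[i];
    vectorDCTHost(fast, 1);
    float worst = 0.0f;
    for (int k = 0; k < 8; ++k) {
        double sum = 0.0;
        for (int n = 0; n < 8; ++n)
            sum += in[n] * std::cos((2 * n + 1) * k * kPi / 16);
        double ref = ((k == 0) ? std::sqrt(0.125) : 0.5) * sum;
        float err = std::fabs(fast[k] - static_cast<float>(ref));
        if (err > worst) worst = err;
    }
    return worst;
}
```

With these constants the butterfly needs far fewer multiplications than the naive sum while agreeing with it up to floating-point rounding.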

22 Allocation
Desktop PC
- CPU: 1 P4 core, 3.0GHz
- RAM: 2GB
Graphics card
- GPU: 16 cores at 575MHz, 8 SPs per core at 1.35GHz
- RAM: 768MB

23 Binding
Huffman encoding:
- many conditions/branches
- intensive bit operations
- little computation
Color conversion, DCT, quantization:
- intensive computation
- few conditions/branches

24 Binding
Hardware: 16KB shared memory
Problem: 1 MCU contains 768 bytes of data (3 blocks x 64 floats x 4 bytes)
Result: at most 21 MCUs per CUDA block

Hardware: 512 threads
Problem: 1 MCU contains 3 blocks, 1 block needs 8 threads
Result: 1 MCU needs 24 threads

1 CUDA block = 504 threads
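The block size follows from taking the tighter of the two limits. The sketch below redoes that arithmetic on the host, assuming each MCU holds three 8x8 blocks of 4-byte floats (768 bytes); `mcusPerBlock` is a hypothetical helper name.

```cpp
#include <cassert>

// Largest number of MCUs one CUDA block can hold under both per-block limits.
int mcusPerBlock(int sharedBytes, int bytesPerMcu, int maxThreads, int threadsPerMcu) {
    assert(sharedBytes > 0 && maxThreads > 0);
    assert(bytesPerMcu > 0 && threadsPerMcu > 0);
    int byMemory = sharedBytes / bytesPerMcu;    // limit from shared memory
    int byThreads = maxThreads / threadsPerMcu;  // limit from the thread count
    return (byMemory < byThreads) ? byMemory : byThreads;
}
```

With 16384 bytes of shared memory, 768 bytes per MCU, 512 threads and 24 threads per MCU, both limits give 21 MCUs, which means 21 * 24 = 504 threads and 21 * 768 = 16128 bytes of shared memory per block.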

25 cujpeg Encoder
CPU: Load image
GPU: Color conversion -> DCT -> Quantization -> Zigzag Reorder
CPU: Encoding -> Image.jpg

26 cujpeg Encoder
Host Memory (CPU): load image; encoding -> Image.jpg
Texture Memory (GPU): source image
Shared Memory (GPU): color conversion, in-place DCT, quantize, reorder
Global Memory (GPU): quantization/reorder result

Load image (host side):

    cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
    cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride, imageWidth, imageHeight, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(TexSrc, textureCache, channel);
    cudaMalloc((void **)(&ResultDevice), ResultSize);

Color conversion (kernel):

    // Fetch the B, G and R components of one pixel from the texture.
    int b = tex2D(TexSrc, TexPosX++, TexPosY);
    int g = tex2D(TexSrc, TexPosX++, TexPosY);
    int r = tex2D(TexSrc, TexPosX+=6, TexPosY);

    // RGB -> YCbCr, with Y shifted down by 128.
    float y  =  0.299f    * r + 0.587f    * g + 0.114f    * b - 128.0f + 0.5f;
    float cb = -0.168736f * r - 0.331264f * g + 0.500f    * b + 0.5f;
    float cr =  0.500f    * r - 0.418688f * g - 0.081312f * b + 0.5f;

    // Store the three components in their own 64-entry blocks.
    myDCTLine[Offset + i]       = y;
    myDCTLine[Offset + 64 + i]  = cb;
    myDCTLine[Offset + 128 + i] = cr;

    for (int i=0; i ... [the rest of this loop is cut off in the transcript]

27 Scheduling
For each MCU:
- 24 threads: convert 2 pixels each
- 8 threads: convert the remaining 2 pixels each
- 24 threads: do 1x row vector DCT, do 1x column vector DCT, quantize 8 scalar values each
Data flow (Y, Cb, Cr; x24 threads): RGB data -> YCC block -> DCT block -> quantized/reordered data
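The two conversion phases cover all 64 pixels of an MCU: 24 threads x 2 pixels = 48, then 8 threads x 2 pixels = 16. The slide does not say which thread takes which pixel, so the mapping below is purely hypothetical and only illustrates that such a split covers every pixel exactly once.

```cpp
#include <cassert>

// First of the two pixels (0..63, row-major inside the MCU) converted by
// `thread` in `phase`. Hypothetical mapping, not taken from cujpeg:
//   phase 0: all 24 threads take 2 pixels each (pixels 0..47)
//   phase 1: threads 0..7 take 2 pixels each (pixels 48..63)
// Returns -1 if the thread is idle in that phase.
int firstPixel(int thread, int phase) {
    assert(thread >= 0 && thread < 24);
    assert(phase == 0 || phase == 1);
    if (phase == 0) return 2 * thread;
    return (thread < 8) ? 48 + 2 * thread : -1;
}
```

In the second phase 16 of the 24 threads are idle, which is the price of fitting 64 pixels onto a thread count chosen for the DCT (3 blocks x 8 threads).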

28 Outline: JPEG Algorithm, Traditional Encoder, What's new in cujpeg, Benchmark, Conclusion

29 GPU Occupancy
Threads Per Block                         504
Registers Per Thread                      16
Shared Memory Per Block (bytes)           16128
Active Threads per Multiprocessor         504
Active Warps per Multiprocessor           16
Active Thread Blocks per Multiprocessor   1
Occupancy of each Multiprocessor          67%
Maximum Simultaneous Blocks per GPU       16
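The warp count and the occupancy figure can be reproduced with a few lines of arithmetic. The sketch assumes a warp size of 32 and a limit of 24 resident warps per multiprocessor, which matches GPUs of that generation but is not stated on the slide; the function names are hypothetical.

```cpp
#include <cassert>

// Warps needed for one block: threads are scheduled in groups of 32.
int warpsPerBlock(int threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return (threadsPerBlock + 31) / 32;
}

// Occupancy in percent, rounded to the nearest integer.
// maxWarpsPerSM is an assumption supplied by the caller (24 in the example).
int occupancyPercent(int activeWarps, int maxWarpsPerSM) {
    assert(maxWarpsPerSM > 0);
    assert(activeWarps >= 0 && activeWarps <= maxWarpsPerSM);
    return (100 * activeWarps + maxWarpsPerSM / 2) / maxWarpsPerSM;
}
```

A block of 504 threads occupies 16 warps, and 16 of 24 warps is 67%. Only one block fits per multiprocessor because a single block already uses 16128 bytes of shared memory.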

30 Benchmark
           512x512   1024x1024   2048x2048   4096x4096
cujpeg     0.321s    0.376s      0.560s      1.171s
libjpeg    0.121s    0.237s      0.804s      3.971s
(Q = 80, Sample = 1:1:1)

31 Benchmark
Time Consumption (4096x4096)
                 Load     Transfer   Compute   Encode   Total
Quality = 100    0.132s   0.348s     0.043s    0.837s   1.523s
Quality = 80     0.121s   0.324s     0.043s    0.480s   1.123s
Quality = 50     0.130s   0.353s     0.044s    0.468s   1.167s

32 Benchmark
Time Consumption (4096x4096)
                 Load     Transfer   Compute   Encode   Total
Quality = 100    0.132s   0.348s     0.043s    0.837s   1.523s
Quality = 80     0.121s   0.324s     0.043s    0.480s   1.123s
Quality = 50     0.130s   0.353s     0.044s    0.468s   1.167s

Each thread has 240 operations
24 threads process 1 MCU
A 4096x4096 image includes 262144 MCUs
Total ops: 262144 * 24 * 240 = 1509949440 flops
Speed: (Total ops) / 0.043s = 35.12 Gflops
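The throughput estimate can be recomputed as follows. The sketch uses 240 operations per thread, the figure that reproduces the slide's total of 1509949440; the function names are hypothetical.

```cpp
#include <cassert>

// Total operation count: MCUs x threads per MCU x operations per thread.
long long totalOps(long long mcus, int threadsPerMcu, int opsPerThread) {
    assert(mcus > 0 && threadsPerMcu > 0 && opsPerThread > 0);
    return mcus * threadsPerMcu * opsPerThread;
}

// Throughput in Gflops for `ops` operations executed in `seconds`.
double gflops(long long ops, double seconds) {
    assert(seconds > 0.0);
    return static_cast<double>(ops) / seconds / 1e9;
}
```

With 262144 MCUs, 24 threads per MCU and a compute time of 0.043s, this gives about 35.1 Gflops for the GPU compute stage alone; load, transfer and encode time are not included.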

33 Outline: JPEG Algorithm, Traditional Encoder, What's new in cujpeg, Benchmark, Conclusion

34 CUDA can clearly accelerate JPEG compression.
The overall performance depends on the system speed:
- more bandwidth
- better encoding routine
- support for downsampling

