1
**cujpeg: A Simple JPEG Encoder With CUDA Technology**

Dongyue Mou and Zeng Xing

2
**Outline**

- JPEG Algorithm
- Traditional Encoder
- What's new in cujpeg
- Benchmark
- Conclusion

4
**JPEG Algorithm**

JPEG is a commonly used method for image compression. The JPEG encoding algorithm consists of 7 steps:

1. Divide the image into 8x8 blocks
2. [R,G,B] to [Y,Cb,Cr] conversion
3. Downsampling (optional)
4. FDCT (Forward Discrete Cosine Transform)
5. Quantization
6. Serialization in zig-zag style
7. Entropy encoding (Run Length Coding & Huffman coding)

5
**JPEG Algorithm: Example**

[Figure: example image, "This is an example"]

6
**Divide into 8x8 blocks**

[Figure: the example image divided into 8x8 blocks]
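
The block division can be sketched in plain C as a block count. This is a minimal sketch, assuming partial blocks at the right and bottom edges are padded up to a full 8x8 block; the function name `blocks_in_image` is hypothetical.

```c
/* Number of 8x8 blocks needed to cover a width x height image.
 * Partial blocks at the right and bottom edges are rounded up. */
long blocks_in_image(long width, long height) {
    return ((width + 7) / 8) * ((height + 7) / 8);
}
```

For example, `blocks_in_image(4096, 4096)` gives the number of 8x8 blocks per color component of the largest benchmark image.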

8
**RGB vs. YCC**

- To the human eye, a loss of color precision is less noticeable than a loss of contour precision (which is based on luminance).
- Color space conversion makes use of this.
- Simple color space model: [R,G,B] per pixel
- JPEG uses the [Y, Cb, Cr] model:
  - Y = brightness
  - Cb = color blueness
  - Cr = color redness

9
**Convert RGB to YCC**

- Block of 8x8 pixels
- 1 pixel = 3 components
- MCU with sampling factor (1, 1, 1)
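
The per-pixel conversion can be sketched in plain C as follows. This is a host-side illustration, not the cujpeg kernel: it assumes 8-bit samples and the standard JFIF full-range coefficients, with an offset of 128 for Cb and Cr (the kernel on a later slide appears to add 0.5 instead, which suggests normalized values). The names `YccPixel`, `clamp_u8` and `rgb_to_ycc` are hypothetical.

```c
#include <stdint.h>

/* One pixel in the YCbCr color space, each component in [0, 255]. */
typedef struct { uint8_t y, cb, cr; } YccPixel;

/* Clamp to [0, 255] and round to the nearest integer. */
static uint8_t clamp_u8(float v) {
    if (v < 0.0f) return 0;
    if (v > 255.0f) return 255;
    return (uint8_t)(v + 0.5f);
}

/* JFIF full-range RGB -> YCbCr conversion for 8-bit samples. */
YccPixel rgb_to_ycc(uint8_t r, uint8_t g, uint8_t b) {
    YccPixel p;
    p.y  = clamp_u8( 0.299f    * r + 0.587f    * g + 0.114f    * b);
    p.cb = clamp_u8(-0.168736f * r - 0.331264f * g + 0.5f      * b + 128.0f);
    p.cr = clamp_u8( 0.5f      * r - 0.418688f * g - 0.081312f * b + 128.0f);
    return p;
}
```

Gray pixels (r = g = b) map to Cb = Cr = 128, so all color information ends up in the two chroma components.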

10
**Downsampling**

- Y is taken for every pixel; Cb and Cr are taken once per block of 2x2 pixels.
- MCU (minimum coded unit): the smallest group of data units that is coded.
- MCU with sampling factor (2, 1, 1): 4 Y blocks covering 16x16 pixels, plus one Cb block and one Cr block.
- The data size is immediately reduced to half.
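
A minimal C sketch of 2x2 chroma downsampling, assuming each 2x2 group is replaced by its rounded average (the slide does not say whether cujpeg averages or picks one sample). All function names are hypothetical. The two helper functions count the samples of one 16x16 MCU with and without downsampling.

```c
#include <stddef.h>
#include <stdint.h>

/* Average each 2x2 group of chroma samples into one sample.
 * src is w x h (both even), dst is (w/2) x (h/2), both row-major. */
void downsample_2x2(const uint8_t *src, size_t w, size_t h, uint8_t *dst) {
    for (size_t y = 0; y < h; y += 2) {
        for (size_t x = 0; x < w; x += 2) {
            unsigned sum = src[y * w + x] + src[y * w + x + 1]
                         + src[(y + 1) * w + x] + src[(y + 1) * w + x + 1];
            dst[(y / 2) * (w / 2) + x / 2] = (uint8_t)((sum + 2) / 4);
        }
    }
}

/* Samples in a 16x16 pixel area without downsampling: Y, Cb, Cr per pixel. */
size_t mcu_samples_full(void) { return 3 * 16 * 16; }

/* Samples in a 16x16 MCU with sampling factor (2, 1, 1):
 * 16x16 Y samples plus 8x8 Cb and 8x8 Cr samples. */
size_t mcu_samples_downsampled(void) { return 16 * 16 + 2 * 8 * 8; }
```

The counts are 768 and 384 samples, which is the halving stated on the slide.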

11
**Apply FDCT**

- 2D DCT: the bottleneck; computed directly, the complexity of the algorithm is O(n^4).
- 1D DCT: the 2-D transform is equivalent to a 1-D transform applied in each direction.
- The kernel uses 1-D transforms.
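
The equivalence between the direct 2-D transform and two passes of 1-D transforms can be checked with a naive reference implementation. The sketch below is plain C and deliberately unoptimized; it is not the fast kernel used by cujpeg. All names are hypothetical, and the cosine values are tabulated so that the example does not depend on the math library.

```c
/* cos(k * pi / 16) for k = 0..8. */
static const double COS16[9] = {
    1.0, 0.9807852804032304, 0.9238795325112867, 0.8314696123025452,
    0.7071067811865476, 0.5555702330196022, 0.3826834323650898,
    0.19509032201612825, 0.0
};

/* cos(m * pi / 16) for any non-negative integer m, using symmetry. */
static double cos16(int m) {
    m %= 32;                                     /* period 2*pi */
    if (m > 16) m = 32 - m;                      /* cos(2*pi - t) = cos(t) */
    return (m > 8) ? -COS16[16 - m] : COS16[m];  /* cos(pi - t) = -cos(t) */
}

/* Normalization: 1/sqrt(2) for the DC term, 1 otherwise. */
static double alpha(int u) { return u == 0 ? COS16[4] : 1.0; }

/* Direct 2-D DCT of a row-major 8x8 block: four nested loops, O(n^4). */
void dct2d_direct(const double in[64], double out[64]) {
    for (int u = 0; u < 8; u++)
        for (int v = 0; v < 8; v++) {
            double s = 0.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    s += in[x * 8 + y] * cos16((2 * x + 1) * u)
                                       * cos16((2 * y + 1) * v);
            out[u * 8 + v] = 0.25 * alpha(u) * alpha(v) * s;
        }
}

/* Naive 1-D DCT of 8 samples spaced `step` elements apart. */
static void dct1d(const double *in, double *out, int step) {
    for (int u = 0; u < 8; u++) {
        double s = 0.0;
        for (int x = 0; x < 8; x++)
            s += in[x * step] * cos16((2 * x + 1) * u);
        out[u * step] = 0.5 * alpha(u) * s;
    }
}

/* Separable 2-D DCT: 8 row transforms, then 8 column transforms, O(n^3). */
void dct2d_separable(const double in[64], double out[64]) {
    double tmp[64];
    for (int r = 0; r < 8; r++) dct1d(in + r * 8, tmp + r * 8, 1);
    for (int c = 0; c < 8; c++) dct1d(tmp + c, out + c, 8);
}
```

Both functions produce the same 64 coefficients up to rounding error; only the amount of work differs.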

12
**Apply FDCT**

- Shift operation: from [0, 255] to [-128, 127]
- [Figure: DCT result]
- [Figure: meaning of each position in the DCT result matrix]
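
The level shift and the meaning of the first position of the result matrix can be sketched in plain C. With the usual JPEG normalization, the coefficient at position (0, 0), the DC term, is the sum of all 64 shifted samples divided by 8; this is a property of the standard transform rather than something stated on the slide. The function names are hypothetical.

```c
#include <stdint.h>

/* Shift unsigned 8-bit samples from [0, 255] to the signed range [-128, 127]. */
void level_shift(const uint8_t in[64], float out[64]) {
    for (int i = 0; i < 64; i++)
        out[i] = (float)in[i] - 128.0f;
}

/* DC coefficient F(0,0) of the 8x8 DCT: sum of all shifted samples / 8. */
float dc_coefficient(const float shifted[64]) {
    float sum = 0.0f;
    for (int i = 0; i < 64; i++)
        sum += shifted[i];
    return sum / 8.0f;
}
```

A uniform mid-gray block (all samples 128) therefore has a DC term of 0 after the shift.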

13
**Quantization**

[Figure: DCT result, quantization matrix (adjustable according to quality), and quantization result]
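
Quantization divides each DCT coefficient by the matching entry of the quantization matrix and rounds the result. A minimal C sketch, assuming rounding to the nearest integer with halves rounded away from zero; the names `quantize` and `quantize_block` are hypothetical.

```c
/* Quantize one DCT coefficient: divide by the table entry q and
 * round to the nearest integer (halves away from zero). */
int quantize(float coeff, int q) {
    float v = coeff / (float)q;
    return (int)(v < 0.0f ? v - 0.5f : v + 0.5f);
}

/* Quantize a whole 8x8 block with a 64-entry quantization table. */
void quantize_block(const float dct[64], const int qtable[64], int out[64]) {
    for (int i = 0; i < 64; i++)
        out[i] = quantize(dct[i], qtable[i]);
}
```

Larger table entries map more coefficients to zero, which is how the quality setting trades image detail for size.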

14
**Zigzag reordering / Run Length Coding**

[Figure: quantization result, traversed in zigzag order]

Each nonzero value is stored as a pair: [number of zeros before me, my value]
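
The reordering and the pair format can be sketched in plain C. The table is the standard JPEG zigzag order; the names `ZIGZAG`, `RlePair` and `zigzag_rle` are hypothetical. As a simplification, the sketch treats all 64 values alike, whereas a real encoder codes the first (DC) value separately and limits a run to 15 zeros.

```c
/* ZIGZAG[i] is the row-major index of the i-th value in zigzag order. */
static const int ZIGZAG[64] = {
     0,  1,  8, 16,  9,  2,  3, 10,
    17, 24, 32, 25, 18, 11,  4,  5,
    12, 19, 26, 33, 40, 48, 41, 34,
    27, 20, 13,  6,  7, 14, 21, 28,
    35, 42, 49, 56, 57, 50, 43, 36,
    29, 22, 15, 23, 30, 37, 44, 51,
    58, 59, 52, 45, 38, 31, 39, 46,
    53, 60, 61, 54, 47, 55, 62, 63
};

/* One run-length pair: number of zeros before the value, and the value. */
typedef struct { int run; int value; } RlePair;

/* Walk the block in zigzag order and emit one pair per nonzero value.
 * Returns the number of pairs; the trailing zeros are implied by EOB. */
int zigzag_rle(const int block[64], RlePair out[64]) {
    int n = 0, run = 0;
    for (int i = 0; i < 64; i++) {
        int v = block[ZIGZAG[i]];
        if (v == 0) { run++; continue; }
        out[n].run = run;
        out[n].value = v;
        n++;
        run = 0;
    }
    return n;
}
```

Because quantization leaves most high-frequency values at zero, the zigzag order groups them at the end, where a single EOB marker replaces them.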

15
**Huffman encoding**

| Values | Group (G) | Real saved values |
|---|---|---|
| -1, 1 | 1 | 0, 1 |
| -3, -2, 2, 3 | 2 | 00, 01, 10, 11 |
| -7, -6, -5, -4, 4, 5, 6, 7 | 3 | 000, 001, 010, 011, 100, 101, 110, 111 |
| ... | 4, 5, ..., 15 | ... |

- RLC result: [0, -3] [0, 12] [0, 3] ... EOB
- After the group number is added: [0, 2, 00b] [0, 4, 1100b] [0, 2, 11b] EOB
- First Huffman coding step (e.g. for [0, 2, 00b]): [0, 2, 00b] => [100b, 00b] (look up the pair [0, 2] in a Huffman table, e.g. the AC chrominance table)
- Total input: 512 bits; output: 113 bits
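
The group number and the saved bits follow a simple rule that can be sketched in plain C: the group is the number of bits needed for the absolute value, a positive value is saved as is, and a negative value v is saved as v + 2^G - 1. The function names are hypothetical.

```c
/* Group number of a value: the number of bits needed for |v|. */
int group_of(int v) {
    int a = v < 0 ? -v : v;
    int g = 0;
    while (a) { g++; a >>= 1; }
    return g;
}

/* The bits saved after the group number, as an unsigned integer:
 * v itself if v is positive, v + 2^G - 1 if v is negative. */
unsigned saved_bits(int v) {
    int g = group_of(v);
    return (unsigned)(v >= 0 ? v : v + (1 << g) - 1);
}
```

With this rule, -3 is saved as 00b in group 2, 12 as 1100b in group 4, and 3 as 11b in group 2.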

17
**Traditional Encoder**

All stages run on the CPU:

Image -> Load image -> Color conversion -> DCT -> Quantization -> Zigzag reorder -> Encoding -> .jpg

19
**Algorithm Analysis**

- 1x full 2D DCT scan: O(N^4)
- 8x row 1D DCT scan + 8x column 1D DCT scan: O(N^3)
- 8 threads can work in parallel
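
The two complexity figures can be made concrete by counting multiply-accumulate steps for naive transforms. A small C sketch with hypothetical function names; a fast 1-D algorithm needs fewer operations than counted here.

```c
/* Direct n x n 2-D DCT: n^2 outputs, each a sum over n^2 inputs. */
long direct_2d_ops(long n) { return n * n * n * n; }

/* Row/column method: 2n 1-D transforms, each with n outputs
 * that are sums over n inputs. */
long separable_ops(long n) { return 2 * n * n * n; }
```

For N = 8 this gives 4096 steps for the direct scan against 1024 for the row and column scans, and the 8 rows (or 8 columns) are independent of each other, so 8 threads can work on them at the same time.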

20
**Algorithm Analysis**

[Figure]

21
**DCT In Place**

    /* In-place 8-point 1-D DCT over Vect0[0], Vect0[Step], ..., Vect0[7 * Step].
       C_a ... C_f and C_norm are constants defined elsewhere. */
    __device__ void vectorDCTInPlace(float *Vect0, int Step)
    {
        float *Vect1 = Vect0 + Step, *Vect2 = Vect1 + Step;
        float *Vect3 = Vect2 + Step, *Vect4 = Vect3 + Step;
        float *Vect5 = Vect4 + Step, *Vect6 = Vect5 + Step;
        float *Vect7 = Vect6 + Step;

        /* Sums and differences of mirrored input pairs */
        float X07P = (*Vect0) + (*Vect7);
        float X16P = (*Vect1) + (*Vect6);
        float X25P = (*Vect2) + (*Vect5);
        float X34P = (*Vect3) + (*Vect4);
        float X07M = (*Vect0) - (*Vect7);
        float X61M = (*Vect6) - (*Vect1);
        float X25M = (*Vect2) - (*Vect5);
        float X43M = (*Vect4) - (*Vect3);

        float X07P34PP = X07P + X34P;
        float X07P34PM = X07P - X34P;
        float X16P25PP = X16P + X25P;
        float X16P25PM = X16P - X25P;

        /* Even-indexed outputs */
        (*Vect0) = C_norm * (X07P34PP + X16P25PP);
        (*Vect2) = C_norm * (C_b * X07P34PM + C_e * X16P25PM);
        (*Vect4) = C_norm * (X07P34PP - X16P25PP);
        (*Vect6) = C_norm * (C_e * X07P34PM - C_b * X16P25PM);

        /* Odd-indexed outputs */
        (*Vect1) = C_norm * (C_a * X07M - C_c * X61M + C_d * X25M - C_f * X43M);
        (*Vect3) = C_norm * (C_c * X07M + C_f * X61M - C_a * X25M + C_d * X43M);
        (*Vect5) = C_norm * (C_d * X07M + C_a * X61M + C_f * X25M - C_c * X43M);
        (*Vect7) = C_norm * (C_f * X07M + C_d * X61M + C_c * X25M + C_a * X43M);
    }

    /* Sequential 2-D DCT of one 8x8 block: 8 row passes, then 8 column passes. */
    __device__ void blockDCTInPlace(float *block)
    {
        for (int row = 0; row < 64; row += 8)
            vectorDCTInPlace(block + row, 1);   /* row: consecutive elements */
        for (int col = 0; col < 8; col++)
            vectorDCTInPlace(block + col, 8);   /* column: stride of one row */
    }

    /* Parallel 2-D DCT: each of 8 threads transforms one row, then one column. */
    __device__ void parallelDCTInPlace(float *block)
    {
        int col = threadIdx.x % 8;
        int row = col * 8;
        __syncthreads();
        vectorDCTInPlace(block + row, 1);
        __syncthreads();                        /* all rows must be finished before the columns */
        vectorDCTInPlace(block + col, 8);
    }

22
**Allocation**

- Desktop PC
  - CPU: 1 P4 core, 3.0 GHz
  - RAM: 2 GB
- Graphics card
  - GPU: 16 cores, 575 MHz
  - SP/core, 1.35 GHz
  - RAM: 768 MB

23
**Binding**

- Huffman encoding
  - many conditions/branches
  - intensive bit operations
  - less computing
- Color conversion, DCT, quantization
  - intensive computing
  - fewer conditions/branches

24
**Binding**

- Shared memory
  - Hardware: 16 KB shared memory
  - Problem: 1 MCU contains 768 bytes of data (3 blocks x 64 values x 4 bytes)
  - Result: at most 21 MCUs per CUDA block
- Threads
  - Hardware: 512 threads
  - Problem: 1 MCU contains 3 blocks, and 1 block needs 8 threads
  - Result: 1 MCU needs 24 threads
- 1 CUDA block = 21 MCUs x 24 threads = 504 threads
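
The two limits can be recomputed with a few lines of plain C. The sketch assumes each value of an MCU is stored as a 4-byte `float`, so that one MCU occupies 3 x 64 x 4 = 768 bytes; this matches the 16128 bytes of shared memory per block (21 x 768) reported on the occupancy slide. All names are hypothetical.

```c
enum {
    SHARED_MEM_BYTES      = 16 * 1024, /* shared memory available to a CUDA block */
    MAX_THREADS_PER_BLOCK = 512,       /* hardware limit */
    BLOCKS_PER_MCU        = 3,         /* Y, Cb, Cr at sampling factor (1, 1, 1) */
    VALUES_PER_BLOCK      = 64,        /* 8x8 */
    THREADS_PER_BLOCK_8X8 = 8          /* one thread per row / column */
};

/* Bytes of shared memory needed by one MCU (assumes float storage). */
int mcu_bytes(void) {
    return BLOCKS_PER_MCU * VALUES_PER_BLOCK * (int)sizeof(float);
}

/* Threads needed by one MCU. */
int threads_per_mcu(void) {
    return BLOCKS_PER_MCU * THREADS_PER_BLOCK_8X8;
}

/* MCUs per CUDA block: the tighter of the memory and thread limits. */
int mcus_per_cuda_block(void) {
    int by_memory  = SHARED_MEM_BYTES / mcu_bytes();
    int by_threads = MAX_THREADS_PER_BLOCK / threads_per_mcu();
    return by_memory < by_threads ? by_memory : by_threads;
}
```

Both limits happen to allow 21 MCUs, which gives 21 x 24 = 504 threads per CUDA block.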

25
**cujpeg Encoder**

- CPU: Image -> Load image
- GPU: Color conversion -> DCT -> Quantization -> Zigzag reorder
- CPU: Encoding -> .jpg

26
**cujpeg Encoder**

Data flow:

- CPU: Image -> Load image -> Host memory
- Host memory -> Texture memory (GPU)
- GPU: Texture memory -> Color conversion -> Shared memory -> In-place DCT -> Quantize/Reorder -> Global memory (result)
- Global memory -> Host memory
- CPU: Encoding -> .jpg

Load image (host side; copies the image into texture memory and allocates the result buffer):

    cudaMallocArray(&textureCache, &channel, scanlineSize, imgHeight);
    cudaMemcpy2DToArray(textureCache, 0, 0, image, imageStride,
                        imageWidth, imageHeight, cudaMemcpyHostToDevice);
    cudaBindTextureToArray(TexSrc, textureCache, channel);
    cudaMalloc((void **)(&ResultDevice), ResultSize);

Color conversion (device side; coefficients and offsets elided as …):

    int b = tex2D(TexSrc, TexPosX++, TexPosY);
    int g = tex2D(TexSrc, TexPosX++, TexPosY);
    int r = tex2D(TexSrc, TexPosX += 6, TexPosY);
    float y  = 0.299*r … *g … *b;
    float cb = … *r … *g … *b + 0.5;
    float cr = 0.500*r … *g … *b + 0.5;
    myDCTLine[Offset + i]   = y;
    myDCTLine[Offset … i] = cb;
    myDCTLine[Offset … i] = cr;

Quantize and reorder (device side):

    for (int i = 0; i < BLOCK_WIDTH; i++)
        myDestBlock[myZLine[i]] = (int)(myDCTLine[i] * myDivQLine[i] + 0.5f);

Result (host side; copies the quantized, reordered data back for encoding):

    cudaMemcpy(ResultHost, ResultDevice, ResultSize, cudaMemcpyDeviceToHost);

27
**Scheduling**

For each MCU: 24 threads

1. Convert 2 pixels (all 24 threads)
2. Convert the remaining 2 pixels (8 threads)
3. Do 1x row vector DCT
4. Do 1x column vector DCT
5. Quantize 8 scalar values

Data: RGB data -> (x24) YCC block (Y, Cb, Cr) -> (x24) DCT block (Y, Cb, Cr) -> (x24) quantized/reordered data
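
One possible way to assign work to the 24 threads of an MCU is sketched below in plain C. The mapping is an assumption for illustration, not taken from the cujpeg source: consecutive groups of 24 threads form an MCU, consecutive groups of 8 threads form one 8x8 block, and the position within the group selects the row or column, in line with `threadIdx.x % 8` in the DCT kernel. The names `ThreadTask` and `task_for_thread` are hypothetical.

```c
/* Work assignment of one thread inside a CUDA block. */
typedef struct {
    int mcu;        /* index of the MCU inside the CUDA block, 0..20 */
    int component;  /* 0 = Y, 1 = Cb, 2 = Cr */
    int line;       /* row (first pass) or column (second pass), 0..7 */
} ThreadTask;

/* Map a thread index (0..503) to its MCU, component block and line. */
ThreadTask task_for_thread(int tid) {
    ThreadTask t;
    t.mcu       = tid / 24;
    t.component = (tid % 24) / 8;
    t.line      = tid % 8;
    return t;
}
```

With 24 threads per MCU, each thread handles exactly one row and one column of one component block and quantizes 8 of the 192 values.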

29
**GPU Occupancy**

| Parameter | Value |
|---|---|
| Threads per block | 504 |
| Registers per thread | 16 |
| Shared memory per block (bytes) | 16128 |
| Active threads per multiprocessor | |
| Active warps per multiprocessor | |
| Active thread blocks per multiprocessor | 1 |
| Occupancy of each multiprocessor | 67% |
| Maximum simultaneous blocks per GPU | |

30
**Benchmark**

| | 512x512 | 1024x1024 | 2048x2048 | 4096x4096 |
|---|---|---|---|---|
| cujpeg | 0.321s | 0.376s | 0.560s | 1.171s |
| libjpeg | 0.121s | 0.237s | 0.804s | 3.971s |

(Q = 80, Sample = 1:1:1)
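
The relative speed can be read off the measurements as a ratio of run times. A one-function C sketch with the hypothetical name `speedup`:

```c
/* Speedup of cujpeg over libjpeg; values above 1 mean cujpeg is faster. */
double speedup(double libjpeg_seconds, double cujpeg_seconds) {
    return libjpeg_seconds / cujpeg_seconds;
}
```

With the measured times, cujpeg is slower than libjpeg at 512x512 (ratio about 0.38), overtakes it at 2048x2048 (about 1.4) and is about 3.4 times faster at 4096x4096.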

31
**Benchmark**

Time consumption (4096x4096):

| | Load | Transfer | Compute | Encode | Total |
|---|---|---|---|---|---|
| Quality = 100 | 0.132s | 0.348s | 0.043s | 0.837s | 1.523s |
| Quality = 80 | 0.121s | 0.324s | | 0.480s | 1.123s |
| Quality = 50 | 0.130s | 0.353s | 0.044s | 0.468s | 1.167s |

32
**Benchmark**

Compute throughput for the 4096x4096 image (compute time 0.043s):

- Each thread has 240 operations
- 24 threads process 1 MCU
- A 4096x4096 image includes 262144 MCUs
- Total ops: 262144 * 24 * 240 = 1509949440 flops
- Speed: (total ops) / 0.043 = 35.12 Gflops
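
The throughput estimate can be reproduced with a short C function. It assumes one MCU per 8x8 pixel block (sampling 1:1:1) and 240 operations per thread, the combination that reproduces the quoted 35.12 Gflops; the name `compute_gflops` is hypothetical.

```c
/* Estimated compute throughput in Gflops for an image of the given size,
 * assuming one MCU per 8x8 pixel block. */
double compute_gflops(long width, long height, int threads_per_mcu,
                      int ops_per_thread, double seconds) {
    long mcus = (width / 8) * (height / 8);
    double total_ops = (double)mcus * threads_per_mcu * ops_per_thread;
    return total_ops / seconds / 1e9;
}
```

Calling `compute_gflops(4096, 4096, 24, 240, 0.043)` uses 262144 MCUs and about 1.51e9 operations, which gives roughly 35.1 Gflops.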

34
**Conclusion**

- CUDA can clearly accelerate JPEG compression.
- The overall performance depends on the system speed.
- More bandwidth
- Better encoding routine
- Support for downsampling
