Jpeg Encoder Accelerator Advanced Embedded Systems Architecture EE-382N-4 Fall 2009 Anup P. Joshi Chandra Bhushan Prakash Karthick Santhanam Pratap Ramanathan.


1 Jpeg Encoder Accelerator Advanced Embedded Systems Architecture EE-382N-4 Fall 2009 Anup P. Joshi Chandra Bhushan Prakash Karthick Santhanam Pratap Ramanathan

2 OVERVIEW
- JPEG Encoding Process
- JPEG Encoder Accelerator
- Existing Architecture
- Proposed Architectures
- Implementation
- Results
- Conclusion

3 JPEG Overview
- A raw image takes a large number of bytes to store
- JPEG is a standardized image compression mechanism
- Exploits known limitations of the human eye: small changes in color are perceived less accurately than small changes in brightness
- Lossy method, but achieves much greater compression than GIF, BMP, etc.
- Stores 24-bit-per-pixel color data instead of 8-bit-per-pixel data (as in GIF)
- 24 bits per pixel gives about 16 million colors, compared with 256 or fewer
- Disadvantage: repeated compression and decompression progressively degrades image quality

4 Encoding scheme:

5 Step 1: Image → Pixel
[Figure labels: source image; division into 8x8 blocks; one 8x8 block]

6 Step 2: Color Space Transform
- The color of each pixel is a 3-D vector (R, G, B)
- There is significant correlation between these components
- A color space transform produces a new vector: luminance Y, and blue and red chrominance Cb and Cr
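A minimal Python sketch of the color space transform on one pixel. The slides do not say which conversion matrix the encoder core uses; the coefficients below are the full-range JFIF ones, assumed here for illustration, and `rgb_to_ycbcr` is a hypothetical helper name.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert one 8-bit RGB pixel to 8-bit (Y, Cb, Cr).

    Assumption: full-range JFIF coefficients; the hardware core may
    use a fixed-point approximation of a similar matrix.
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128
    cr = 0.5 * r - 0.418688 * g - 0.081312 * b + 128
    # Round to the nearest integer and clamp into the 8-bit range.
    return tuple(min(255, max(0, round(v))) for v in (y, cb, cr))


if __name__ == "__main__":
    for pixel in [(255, 255, 255), (128, 128, 128), (255, 0, 0)]:
        print(pixel, "->", rgb_to_ycbcr(*pixel))
```

Gray pixels map to Cb = Cr = 128, which shows how the transform removes the correlation between R, G and B: all of the brightness information ends up in Y.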

7 Step 3: DCT
- Use the sequential DCT to transform each block into a set of 64 values (DCT coefficients)
- One DC coefficient: a measure of the average value of the block
- 63 AC coefficients; those corresponding to high frequencies tend to be zero or near zero for most natural images
Step 4: Quantizer
- Each of the 64 coefficients is quantized using the corresponding one of 64 values from a quantization table
- Facilitates greater compression, but is lossy (drives most coefficients to zero)
Step 5: Encoder
- ‘Huffman’ encoder: the most popular choice
- The previous block's quantized DC coefficient is used to predict the current DC coefficient, and the difference is encoded
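Steps 3 to 5 can be sketched in plain Python as follows. This is an illustrative software model, not the hardware core's implementation: the function names are hypothetical, the input block is assumed to be already level-shifted, and the flat quantization table used in the example is a placeholder rather than a table from the JPEG standard. Only the DC prediction part of the Huffman stage is shown.

```python
import math

N = 8  # JPEG works on 8x8 blocks


def dct_8x8(block):
    """2-D DCT-II of an 8x8 block given as 8 rows of 8 numbers."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / 16)
                          * math.cos((2 * y + 1) * v * math.pi / 16))
            out[u][v] = 0.25 * c(u) * c(v) * s
    return out


def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round (the lossy step)."""
    return [[round(coeffs[u][v] / table[u][v]) for v in range(N)]
            for u in range(N)]


def dc_differences(dc_values):
    """Replace each block's DC value by its difference from the previous one."""
    prev = 0  # predictor starts at zero for the first block
    diffs = []
    for dc in dc_values:
        diffs.append(dc - prev)
        prev = dc
    return diffs


if __name__ == "__main__":
    flat_block = [[16] * N for _ in range(N)]   # constant block: no AC energy
    flat_table = [[16] * N for _ in range(N)]   # placeholder quantization table
    q = quantize(dct_8x8(flat_block), flat_table)
    print(q[0][0])                      # DC coefficient after quantization
    print(dc_differences([8, 10, 7]))   # differences that would be encoded
```

For a constant block the DC coefficient is 8 times the sample value and every AC coefficient is zero, which is the extreme case of the "AC coefficients tend to be zero" observation on the slide.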

8 Accelerator considerations
Hardware vs. software
- A pure software implementation is generally slower than a hardware-based one
- A dedicated hardware functional unit (accelerator) is faster
Enhanced architectural options:
1. Pipelining the JPEG encoder (already done)
2. Going for a different architecture/microarchitecture
3. Pipelining individual blocks in the encoder
We chose the 2nd option due to constraints in the design (more in the following slides)

9 Existing Pipelined Encoder - Open Source
- Design files acquired from Opencores.org
- Pipelined encoder: Verilog source files
- Existing architecture for encoding:

10 Existing Implementation Details
- The input to the encoder (data_in) is a 24-bit data bus with 8 bits each for the red, green and blue components of a pixel
- Follows the sequential DCT-based mode: inputs start with the top-left 8x8 block of the image, starting with the top-left pixel, going to the right, then down to the second row, etc.
- Input data for the 1st 8x8 block of pixels is sent over 64 consecutive clocks
- After the data for the first block is sent, a delay of 33 clock cycles is incurred by the encoding process (Huffman) before the next block can be sent
  - Huffman encodes values based on the previous block’s output, which introduces a dependency and a delay
  - A candidate for improvement
- Output: JPEG_bitstream, 32 bits wide, produced by the Huffman encoder
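The input ordering described above can be sketched as a small Python generator. Two things are assumptions rather than facts from the slides: that the 8x8 blocks themselves are visited left to right and top to bottom, and that the image width and height are multiples of 8. `block_order` is a hypothetical name.

```python
def block_order(width, height, n=8):
    """Yield (row, col) pixel coordinates in the order the core expects.

    Within a block: top-left pixel first, left to right, then the next row.
    Across blocks: raster order (assumed; the slides only say "etc.").
    """
    if width % n or height % n:
        raise ValueError("sketch assumes dimensions are multiples of the block size")
    for block_row in range(0, height, n):
        for block_col in range(0, width, n):
            for row in range(block_row, block_row + n):
                for col in range(block_col, block_col + n):
                    yield (row, col)


if __name__ == "__main__":
    order = list(block_order(16, 8))
    print(order[:3], "...", order[63], order[64])
```

Each run of 64 consecutive coordinates corresponds to the 64 consecutive clocks on which one block is presented on data_in. As an illustration only, a 96x96 image (a hypothetical size) would produce 9216 pixels, that is 144 blocks.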

11 Experimented architectures #1:
- Insert a buffer between the quantizer and the Huffman encoder so that the Huffman input does not change for 97 cycles
- But the quantizer output changes every 64 cycles
- Hence data is lost

12 Architecture #2:
- Split the image bitstream equally across 2 parallel paths with replicated functional units
- Equivalent to using 2 encoders, although the delay within each encoder still remains
- Gross overuse of silicon area, with additional overhead on software too

13 Architecture #3: Two Huffman blocks
- Eliminates the bottleneck: removes the delay between feeding two blocks of data
- The individual Huffman blocks are driven alternately:
  - 1st Huffman block for every odd 8x8 pixel block
  - 2nd Huffman block for every even 8x8 pixel block
- Negligible loss in compression: each of the two Huffman blocks has its own separate first input set
[Diagram labels: 64 cycles of accumulation; 97 cycles in each Huffman block; some cycles for synchronization]
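Why alternating between two Huffman blocks removes the stall can be checked with a toy scheduling model in Python. The model is an assumption, not a description of the RTL: a new quantized block is taken to arrive every 64 cycles, each Huffman unit is taken to be busy for 97 cycles per block, and synchronization cycles are ignored. `first_conflict` is a hypothetical helper.

```python
def first_conflict(num_blocks, num_units, arrival_period=64, busy_cycles=97):
    """Return the index of the first block that finds its Huffman unit busy.

    Blocks are dispatched round-robin (block k goes to unit k % num_units).
    Returns None if no block ever has to wait.
    """
    free_at = [0] * num_units  # cycle at which each unit becomes idle
    for k in range(num_blocks):
        arrival = k * arrival_period
        unit = k % num_units
        if arrival < free_at[unit]:
            return k  # this block would overwrite data still being encoded
        free_at[unit] = arrival + busy_cycles
    return None


if __name__ == "__main__":
    print("one Huffman unit :", first_conflict(144, 1))
    print("two Huffman units:", first_conflict(144, 2))
```

With one unit the second block already arrives while the first is still being encoded, which is the data-loss problem of the buffer experiment. With two units each one sees a new block only every 128 cycles, which exceeds the 97 cycles it needs.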

14 Implementation details
- Transform the source image into the required R, G, B bit stream for each pixel
- Process it in the design (hardware)
- Generate the encoded bit stream for every pixel
- Reconstruct the image from the output of the hardware implementation

15 Conversion of image to R,G,B bitstream
In Matlab:
- Generated the bit information using the imread() function
- Generates a text file ‘bits.txt’ containing 24-bit data for every pixel
- Properly formatted and supplied to the design via the test bench
[Figure labels: source image in TIFF format (file size: 28 KB); supplied to the testbench]
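The authors did this step in Matlab; the Python sketch below only illustrates the idea of turning pixels into 24-bit words for a test bench. The layout of ‘bits.txt’ is not given in the slides, so the format here (one 24-character binary word per line, red in the top byte, then green, then blue) is an assumption, and `pixels_to_lines` is a hypothetical helper.

```python
def pixels_to_lines(pixels):
    """Format (r, g, b) pixels as 24-bit binary strings, one per pixel.

    Assumed layout: bits 23..16 = red, 15..8 = green, 7..0 = blue.
    The pixels must already be in the order the core expects.
    """
    lines = []
    for r, g, b in pixels:
        if not all(0 <= c <= 255 for c in (r, g, b)):
            raise ValueError("each color component must fit in 8 bits")
        word = (r << 16) | (g << 8) | b
        lines.append(f"{word:024b}")
    return lines


if __name__ == "__main__":
    sample = [(255, 0, 1), (18, 52, 86)]
    print("\n".join(pixels_to_lines(sample)))
```

Joining the returned lines with newlines and writing them to a file gives a text file that a Verilog test bench could read one word per clock.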

16 Simulation results (existing architecture):
- The ‘enable’ signal should be brought high when the data from the first pixel of the image is ready
- The enable signal needs to stay high while the data is being input to the core
- Each 8x8 block of data needs to be input to the core on 64 consecutive clock cycles
- It takes an additional 33 clocks to produce the JPEG bitstream for the 64 pixels of one input block
- Overall clock consumption (for this example): 143,120,000 / 10,000 = 14312 clocks

17 Simulation results (new architecture):
- Alternates between the 2 Huffman encoder blocks
- Introduced 2 data_ready signals, one for each of the two JPEG bitstreams coming out of the 2 Huffman encoder blocks
- Overhead in synchronizing the two Huffman encoders: only eight cycles
- Overall clock consumption: 107,120,000 / 10,000 = 10712 clocks
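The two clock totals can be cross-checked against the per-block figures with a short Python calculation. The block count of 144 comes from the performance comparison slide. Reading the "eight" as a per-block gap that replaces the 33-clock delay is an interpretation, and the leftover fixed overhead it produces is an inference (plausibly pipeline fill and flush), not a number stated in the slides.

```python
TIME_UNITS_PER_CLOCK = 10_000   # divisor used on the slides
BLOCKS = 144                    # 8x8 blocks in the test image
INPUT_CLOCKS = 64               # clocks needed to feed one block


def total_clocks(sim_time_units):
    """Convert simulated time to a clock count, as done on the slides."""
    return sim_time_units // TIME_UNITS_PER_CLOCK


def fixed_overhead(total, gap_per_block):
    """Clocks left over after accounting for (64 + gap) clocks per block."""
    return total - BLOCKS * (INPUT_CLOCKS + gap_per_block)


existing = total_clocks(143_120_000)
new = total_clocks(107_120_000)

if __name__ == "__main__":
    print(existing, fixed_overhead(existing, 33))
    print(new, fixed_overhead(new, 8))
```

Both designs leave the same remainder under this model, which supports the reading that the new architecture replaces the 33-clock gap per block with an 8-clock one.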

18 Synthesis results:

19 Reconstructing the image
- Ideal reconstruction: implement a decoder
  - Functionally complex (excessive design time)
- Alternative way to verify functionality: software (Matlab)
- Reconstruct the image from the generated bitstream, giving us the much-anticipated “JPEG image”
- Image reconstruction performed in Matlab
- Verify against the input image (quality and compression)

20 Image reconstruction (software):
[Diagram labels: JPEG Bitstream_odd; JPEG Bitstream_even; Reconstruct; Merge; JPEG format]

21 Original Image vs. JPEG Encoded Image
- Original image: size 28 KB, TIFF format
- JPEG encoded image: size 3 KB, JPEG format

22 Performance Comparison of architectures
Existing:
- Frequency: ~68 MHz
- For the test image, total clocks consumed = 14312
- Total area = 1 374 028.8 sq. μm (based on Design Vision synthesis)
New:
- Frequency: ~68 MHz
- For the test image, total clocks consumed = 10712
- Total area = 1 634 796.8 sq. μm (based on Design Vision synthesis)
Result summary:
- Overall savings in clock cycles (acceleration): 14312 - 10712 = 3600
- Savings per 8x8 block = 3600 / 144 = 25 clock cycles
- Overall increase in area (in terms of NAND1 gates) = (1 634 796.8 / 1.8772) - (1 374 028.8 / 1.8772) = 138 913.275
- Change in power consumption: ???
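The result summary can be reproduced, and extended with a speedup ratio and a relative area increase, using the figures on this slide. The last two metrics are not on the slide; they are derived here from the same numbers, and the names in the code are hypothetical.

```python
EXISTING_CLOCKS = 14312
NEW_CLOCKS = 10712
BLOCKS = 144
EXISTING_AREA_UM2 = 1_374_028.8
NEW_AREA_UM2 = 1_634_796.8
NAND_AREA_UM2 = 1.8772  # area of one NAND gate, as used on the slide


def summarize():
    """Compute the trade-off figures from the slide's raw numbers."""
    saved = EXISTING_CLOCKS - NEW_CLOCKS
    extra_area = NEW_AREA_UM2 - EXISTING_AREA_UM2
    return {
        "clocks_saved": saved,
        "clocks_saved_per_block": saved / BLOCKS,
        "speedup": EXISTING_CLOCKS / NEW_CLOCKS,
        "extra_nand_gates": extra_area / NAND_AREA_UM2,
        "area_increase_pct": 100 * extra_area / EXISTING_AREA_UM2,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.3f}")
```

At an unchanged clock frequency this works out to a speedup of about 1.34x for roughly 19% more area.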

23 Design trade-offs
- The existing implementation had a lot of dependency between functional blocks
- Redesigning/pipelining the internal blocks is cumbersome
- Adopted a revised “architectural” solution that uses multiple functional units
- Improves the speed of encoding
- Costs more area and higher instantaneous power

24 A second chance?
- Possibly look at pipelining individual blocks
- Redesign the Huffman block to reduce the internal dependency
- Reconstruct the image using a JPEG decoder
- Accelerate the decoding process as well
- Besides: starting early

25 Questions??

26 Back up
- Mapping onto an FPGA wasn’t successful: the design had too many cells and ran out of space

27 Breakdown of work performed:
Anup Joshi and Chandra Prakash
- Architecture with 2 encoders
- Architecture with buffer
- Synchronizing 2 Huffman blocks in proposed architecture
- Synthesis of encoder
Karthick Santhanam and Pratap Ramanathan
- Analysis of open source code
- Architecture with 2 Huffman blocks
- Matlab code for generating input bit stream
- Matlab code for combining bit stream outputs

28 Architecture #2:
- But after the design we realized that the Huffman encoder was the bottleneck
- No point in making the quantizer’s output wait at the ‘already slow’ stage
- Lesson learnt: identify the bottlenecks first; do not waste time

29 Lossy, quantization factor 10
- JPEG format, size 1 KB

