Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard Tien-Wei Hsieh Youn-Long Lin Tien-Wei Hsieh Youn-Long Lin Department of Computer.

Similar presentations


Presentation on theme: "A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard Tien-Wei Hsieh Youn-Long Lin Tien-Wei Hsieh Youn-Long Lin Department of Computer."— Presentation transcript:

1 A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard Tien-Wei Hsieh Youn-Long Lin Tien-Wei Hsieh Youn-Long Lin Department of Computer Science National Tsing Hua University TAIWAN

2 2004/6/16 Tien-Wei Hsieh 2 Abstract Proposition –16-bit parallel context generator –Stripe-skipping method –3-stage pipelined arithmetic encoder –Renormalization strategy with forwarding method Contribution –We reduce the cycle count by 17% compared with the best-known design –We have achieved 5% within the optimum

3 2004/6/16 Tien-Wei Hsieh 3 Outline Introduction Previous work Proposed architecture Experimental results Conclusion

4 2004/6/16 Tien-Wei Hsieh 4 Outline Introduction Previous work Proposed architecture Experimental results Conclusion

5 2004/6/16 Tien-Wei Hsieh 5 Pre-process Discrete Wavelet Transform (DWT) Quantization Block Coding (Tier-1 Coding) Block Coding (Tier-1 Coding) Bit-stream Organization (Tier-2 Coding) Bit-stream Organization (Tier-2 Coding) Original Image Data Compressed Image Data EBCOT (Embedded Block Coding with Optimized Truncation) JPEG2000 Image Coding

6 2004/6/16 Tien-Wei Hsieh 6 EBCOT Tier-1 Time Consuming Platform: Pentium 4 2.8GHz, 736MB RAM, Microsoft Windows XP, VC ++ Reference software: JPEG 2000 jasper Test pattern: 512x512 gray image, 1 tile, 5/3 DWT 3 decomposition levels, code-block size 64x % % % %

7 2004/6/16 Tien-Wei Hsieh 7 EBCOT Tier-1 Block Coding (Context-based adaptive binary arithmetic coding) LL LH HL HH HL HH Sub-bitstream N Context Formation (CF) Arithmetic Encoder (AE) Sub-bitstream 3 Sub-bitstream 2 Sub-bitstream 1 contextdecision Block Coding Code- blockN Code- block3 Code- block2 Code- block1 From DWT & Quantization To Tier-2

8 2004/6/16 Tien-Wei Hsieh 8 Bit-Plane Division of Code-block 1 Sign bit MSB LSB Magnitude bits insignificant significant Pixel Bit-plane

9 2004/6/169 Scanning Each Bit-plane 3 Times 4 bits in a column N stripes in a bit-plane M columns in a stripe (pass > stripe > column > bit) Code-block size is 4N x M

10 Coding a Bit-plane Insignificant bits with significant neighbors Significant bits Bits not coding in previous two passes Pass 1Pass 2Pass Significant bit

11 2004/6/16 Tien-Wei Hsieh 11 Outline Introduction Previous work Proposed architecture Experimental results Conclusion

12 12 Previous work Context Formation – –Normal mode NTU: Skipping methods, Sample Skipping (SS) and Group-of- Column Skipping (GOCS) – –Reduce 60% cycle count compared with straightforward method NCTU: Memory-saving algorithm – –Reduce 4K bits memory space if the code-block size is 64x64 – –Pass-parallel mode TKU: Pass-parallel context modeling – –No cycle wasted – –0.1 ~ 0.2 dB image quality degradation Arithmetic Encoder –MQ coder (JBIG) Osaka University: 4-stage pipelined architecture

13 2004/6/16 Tien-Wei Hsieh 13 Outline Introduction Previous work Proposed architecture Experimental results Conclusion

14 2004/6/16 Tien-Wei Hsieh 14 Proposed EBCOT Tier-1 Coder Address Generator Code Block Memory State Memory Context Formation (CF) Compress & PISO Arithmetic Encoder (AE) Pixel_in Byte_out 16 bits 40 (CX, D)s (CX, D) 16 bits

15 2004/6/16 Tien-Wei Hsieh 15 Context Formation Unit Address Generator Code Block Memory State Memory Context Formation (CF) Compress & PISO Arithmetic Encoder (AE) Pixel_in Byte_out 16 bits 40 (CX, D)s (CX, D) 16 bits

16 2004/6/16 Tien-Wei Hsieh 16 Data Dependency for 16 Bits (delay)

17 2004/6/16 16-Way Parallel Architecture

18 18 Memory Scheme Stripe N Stripe N+1 Memory B Memory C Memory A Memory B Memory C ORDERORDER Memory A Memory B Memory C

19 2004/6/1619 Stripe Skipping 3 registers record coding condition of all stripes in 3 passes –A stripe is skipped in Pass1 if all bits in the stripe are significant –A stripe is skipped in Pass2 if all bits in the stripe are insignificant –A stripe is skipped in Pass3 if all bits in the stripe have been coded in Pass1 or 2

20 2004/6/16 Tien-Wei Hsieh 20 Arithmetic Encoder Address Generator Code Block Memory State Memory Context Formation (CF) Compress & PISO Arithmetic Encoder (AE) Pixel_in Byte_out 16 bits 40 (CX, D)s (CX, D) 16 bits

21 21 Feedback Loop in AE Flow Chart Probability Estimation Table (PET) Context Table A Calculation A C C Calculation Index Updating Table Reading MPS Updating Byte Context Decision Renormalization NMPS, NLPS SWITCH Qe index mps

22 22 Modified Probability Estimation Table (MPET) Context Table A Calculation A C C Calculation Index Updating Bit Shifting MPS Updating Context Decision Renormalization Proposed Pipelined AE Byte Stage 1 Stage 2 Stage 3 Table Reading

23 23 Fast Renormalization CT > A_shift ? C = C << A_shift CT = CT – Ashift C = C << CT A_shift = A_shift – CT BYTEOUT2 DONE NOYES BYTEOUT Twice ? YES NO CT > A_shift ? C = C << A_shift CT = CT – Ashift C = C << CT A_shift = A_shift – CT CT = 0 BYTEOUT DONE NOYES

24 2004/6/16 Tien-Wei Hsieh 24 Compress & Parallel In Serial Out Address Generator Code Block Memory State Memory Context Formation (CF) Compress & PISO Arithmetic Encoder (AE) Pixel_in Byte_out 16 bits 40 (CX, D)s (CX, D) 16 bits

25 2004/6/16 Tien-Wei Hsieh 25 Interaction Between CF and AE Clock CF_stall AE_stall CF generates 4, 2, 0, 0, 0, 2, 1, 2, 0, and 0 (CX, D) pairs respectively For example

26 2004/6/16 Tien-Wei Hsieh 26 Overlapping CF and AE Clock CF_stall AE_stall Clock CF_stall AE_stall

27 2004/6/16 Tien-Wei Hsieh 27 AHB Interface Register Block Context Formation Arithmetic Encoder Slave Controller Slave Transaction Master Interface Master Controller Tier-1 Encoder Memory Block AHB Slave Interface Master Interface IP Core

28 2004/6/16 Tien-Wei Hsieh 28 Outline Introduction Previous work Proposed architecture Experimental results Conclusion

29 2004/6/16 Tien-Wei Hsieh 29 Objective of Experiment The objective of our experiment is to prove –Low power –High performance –AHB-compliant Test pattern – –512x512 gray images (airplane, baboon, lena, peppers) – –1 title – –5/3 DWT – –3 decomposition levels –Code-block size 64x64

30 2004/6/16 Tien-Wei Hsieh 30 IP Qualification & Code Coverage IP qualification (nLint) –Compliant with RMM guidelines Code coverage () Code coverage (Verification Navigator ) Design for testability (TetraMAX) Our design General expectancy Statement coverage 97.9%95% Branch coverage 95.8%95% Toggle coverage 100%95% Path coverage 69.2%50% Total faults Test patterns Test coverage 77, %

31 2004/6/16 Tien-Wei Hsieh 31 Synthesis and power analysis Our design Technology library TSMC.35 Area (gate count) 25, kb memory Max. Frequency (MHz) Power (mW) Synthesis tool : Design Compiler (under WCCOM) Power analysis tool : PrimePower

32 2004/6/16 Tien-Wei Hsieh 32 Composition of coding cycle (unit: 1,000,000) Simulation tool : ModelSim SE/PE 5.7e (1.75) (1.62) (2.1) (1.58) (CX, D)s

33 2004/6/16 Tien-Wei Hsieh 33 Cycle reduction (unit: 1,000,000) Peppers Lena Baboon Airplane Peppers Lena Baboon Airplane 2% reduction by stripe-skipping 9% reduction by proposed renormalization

34 2004/6/16 Tien-Wei Hsieh 34 Comparison ( Unit: 1,000,000 cycles )

35 2004/6/16 Tien-Wei Hsieh 35 Platform Architecture – Altera Excalibur EPXA10DDR FPGA Embedded Stripe AHB2 AHB1 ARM922T Processor Interrupt Controller WatchDog Timer AHB1 to AHB2 Bridge DPRAM 128KB SRAM 256KB SDRAM Controller SDRAM 128MB PLD to Stripe Bridge Stripe to PLD Bridge UART Controller External Bus Interface FLASH 32MB PLD Master 1 PLD Master 2 PLD Slave 3 PLD Slave 4 PLD Slave 2 PLD Master 3 PLD Slave 1

36 2004/6/16 Tien-Wei Hsieh 36 Platform-based SOC Design Flow ■ ADS ■ SOPC Builder ■ Quartus II ■ ESS, ADS and Modelsim SE System spec. Profiling & HW/SW partition Software spec. HW spec. for each component FPGA design Stripe configuration Software coding Librar y BUS interface HDL coding Device HDL coding Interface.v Accelerator.v Component.v Stripe. v User defined firmware Integration (SOPC Builder) System PTF SOPC generation Pin assignment & Hardware compilation (Quartus II) *.c files Compilation (ADS) Software image System building (Quartus II) System image SOPC platform Prototyping Excalibur. h Stripe.h Component SDK HW/SW co-simulation RTL codes Hardware image

37 37 FPGA Prototyping Result Platform: Altera Excalibur TM EPXA10DDR, 25MHz Pure softwareProposed accelerator (Second)

38 2004/6/16 Tien-Wei Hsieh 38 Outline Introduction Previous work Proposed architecture Experimental results Conclusion

39 2004/6/16 Tien-Wei Hsieh 39 Summary Proposition –16-bit parallel context generator –Stripe-skipping method –3-stage pipelined arithmetic encoder –Renormalization strategy with forwarding method Contribution –We reduce the cycle count by 17% compared with the best-known design –We have achieved 5% within the optimum

40 2004/6/16 Tien-Wei Hsieh 40 Future Work Ping-pong method for the “compress & PISO” to reduce the 5% coding cycles ASIC Integration

41 2004/6/16 Tien-Wei Hsieh 41 Demo on the SoC platform Pure software (100 sec.) –Configure the FPGA –Load the original image to SDRAM –Execute the JPEG2000 encoder –Get the compressed image from SDRAM –Record the time consuming Proposed accelerator (50 sec.) –Configure the FPGA –Load the original image to SDRAM –Execute the JPEG2000 encoder –Get the compressed image from SDRAM –Compare images and time consuming

42 Thank you!!

43 2004/6/16 Tien-Wei Hsieh 43 Convert pixel data into bit-plane data

44 Load code-block data in memory 4-bit 2-bit 4-bit 2-bit(64x64 code-block size)

45 2004/6/16 Tien-Wei Hsieh UC1 SC1 SC2 SC3 SC4 MR4 ZC4 MR3 ZC3 MR2 ZC2 MR1 UC2 RLC ZC1 Pass[1] Pass[0] Pass1 incurs 4 Zero Coding Contexts and 4 Sign Coding Contexts at most Pass2 incurs 4 Magnitude Refinement Contexts at most Pass3 incurs 1 Run-length Coding Context, 2 Uniform Coding Contexts, 3 Zero Coding Contexts and 4 Sign Coding Contexts at most Hence, there will be 10 output registers at least Context formation for a column Pass is represented by 2-bit 00 => Pass1 01 => Pass2 10 => Pass3


Download ppt "A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard Tien-Wei Hsieh Youn-Long Lin Tien-Wei Hsieh Youn-Long Lin Department of Computer."

Similar presentations


Ads by Google