Presentation is loading. Please wait.

Presentation is loading. Please wait.

A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard

Similar presentations


Presentation on theme: "A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard"— Presentation transcript:

1 A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard
Tien-Wei Hsieh Youn-Long Lin Department of Computer Science National Tsing Hua University TAIWAN

2 Abstract Proposition Contribution 16-bit parallel context generator
Stripe-skipping method 3-stage pipelined arithmetic encoder Renormalization strategy with forwarding method Contribution We reduce the cycle count by 17% compared with the best-known design We have achieved 5% within the optimum 2004/6/16 Tien-Wei Hsieh

3 Outline Introduction Previous work Proposed architecture
Experimental results Conclusion 2004/6/16 Tien-Wei Hsieh

4 Outline Introduction Previous work Proposed architecture
Experimental results Conclusion 2004/6/16 Tien-Wei Hsieh

5 EBCOT (Embedded Block Coding with Optimized Truncation)
JPEG2000 Image Coding Pre-process Original Image Data EBCOT (Embedded Block Coding with Optimized Truncation) Discrete Wavelet Transform (DWT) Quantization Block Coding (Tier-1 Coding) Bit-stream Organization (Tier-2 Coding) Compressed Image Data 2004/6/16 Tien-Wei Hsieh

6 EBCOT Tier-1 Time Consuming
% % 1.725 % % Platform: Pentium 4 2.8GHz, 736MB RAM, Microsoft Windows XP, VC ++ Reference software: JPEG 2000 jasper Test pattern: 512x512 gray image, 1 tile, 5/3 DWT 3 decomposition levels, code-block size 64x64 2004/6/16 Tien-Wei Hsieh

7 EBCOT Tier-1 Block Coding
(Context-based adaptive binary arithmetic coding) From DWT & Quantization LL LH HL HH Code- blockN Code- block3 Code- block2 Code- block1 Block Coding Context Formation (CF) context decision Sub-bitstream N To Tier-2 Sub-bitstream 3 Arithmetic Encoder (AE) Sub-bitstream 2 Sub-bitstream 1 2004/6/16 Tien-Wei Hsieh

8 Bit-Plane Division of Code-block
Sign bit 1 MSB 1 Pixel Magnitude bits LSB insignificant Bit-plane significant 2004/6/16 Tien-Wei Hsieh

9 Scanning Each Bit-plane 3 Times
(pass > stripe > column > bit) M columns in a stripe 4 bits in a column N stripes in a bit-plane Code-block size is 4N x M 2004/6/16

10 Coding a Bit-plane 1 Significant bit Pass 1 Pass 2 Pass 3 1 1 1
1 Significant bit Pass 1 Pass 2 Pass 3 1 1 1 Insignificant bits with significant neighbors Bits not coding in previous two passes Significant bits

11 Outline Introduction Previous work Proposed architecture
Experimental results Conclusion 2004/6/16 Tien-Wei Hsieh

12 Previous work Context Formation Arithmetic Encoder Normal mode
NTU: Skipping methods, Sample Skipping (SS) and Group-of-Column Skipping (GOCS) Reduce 60% cycle count compared with straightforward method NCTU: Memory-saving algorithm Reduce 4K bits memory space if the code-block size is 64x64 Pass-parallel mode TKU: Pass-parallel context modeling No cycle wasted 0.1 ~ 0.2 dB image quality degradation Arithmetic Encoder MQ coder (JBIG) Osaka University: 4-stage pipelined architecture

13 Outline Introduction Previous work Proposed architecture
Experimental results Conclusion 2004/6/16 Tien-Wei Hsieh

14 Proposed EBCOT Tier-1 Coder
Address Generator State Memory 16 bits Context Formation (CF) Code Block Memory 16 bits Pixel_in 40 (CX, D)s Arithmetic Encoder (AE) (CX, D) Compress & PISO Byte_out 2004/6/16 Tien-Wei Hsieh

15 Context Formation Unit
Address Generator State Memory 16 bits Context Formation (CF) Code Block Memory 16 bits Pixel_in 40 (CX, D)s Arithmetic Encoder (AE) (CX, D) Compress & PISO Byte_out 2004/6/16 Tien-Wei Hsieh

16 Data Dependency for 16 Bits
2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 (delay) 2004/6/16 Tien-Wei Hsieh

17 16-Way Parallel Architecture
2 2 1 5 9 13 3 3 4 4 2 6 10 14 3 7 11 15 5 5 6 6 4 8 12 16 7 7 8 8 9 9 10 10 11 11 12 12 13 13 14 14 2004/6/16 15 15 16 16

18 Memory Scheme Memory B Stripe N Memory C Memory A Stripe N+1 Memory B
D E Memory B Memory C

19 Stripe Skipping 3 registers record coding condition of all stripes in 3 passes A stripe is skipped in Pass1 if all bits in the stripe are significant A stripe is skipped in Pass2 if all bits in the stripe are insignificant A stripe is skipped in Pass3 if all bits in the stripe have been coded in Pass1 or 2 2004/6/16

20 Arithmetic Encoder Address State Memory Generator 16 bits Context
Formation (CF) Code Block Memory 16 bits Pixel_in 40 (CX, D)s Arithmetic Encoder (AE) (CX, D) Compress & PISO Byte_out 2004/6/16 Tien-Wei Hsieh

21 Feedback Loop in AE Flow Chart
Context Decision Context Table Table Reading index mps Probability Estimation Table (PET) Qe A A Calculation Qe NMPS, NLPS SWITCH C C Calculation Renormalization Index Updating MPS Updating Byte

22 Modified Probability Estimation Table (MPET)
Proposed Pipelined AE Context Decision Table Reading Modified Probability Estimation Table (MPET) Stage 1 A Calculation Index Updating MPS Updating Bit Shifting Stage 2 A Context Table C Calculation Stage 3 Renormalization C Byte

23 Fast Renormalization YES CT > A_shift ? NO BYTEOUT
C = C << A_shift CT = CT – Ashift C = C << CT A_shift = A_shift – CT CT = 0 DONE YES CT > A_shift ? NO BYTEOUT2 YES C = C << A_shift CT = CT – Ashift C = C << CT A_shift = A_shift – CT Twice ? NO BYTEOUT DONE

24 Compress & Parallel In Serial Out
Address Generator State Memory 16 bits Context Formation (CF) Code Block Memory 16 bits Pixel_in 40 (CX, D)s Arithmetic Encoder (AE) (CX, D) Compress & PISO Byte_out 2004/6/16 Tien-Wei Hsieh

25 Interaction Between CF and AE
For example CF generates 4, 2, 0, 0, 0, 2, 1, 2, 0, and 0 (CX, D) pairs respectively Clock CF_stall 4 2 2 1 2 AE_stall 2004/6/16 Tien-Wei Hsieh

26 Overlapping CF and AE Clock CF_stall 4 2 2 1 2 AE_stall Clock CF_stall
2 1 2 AE_stall Clock CF_stall 4 2 2 1 2 AE_stall 2004/6/16 Tien-Wei Hsieh

27 AHB Interface AHB Slave Interface Slave Transaction IP Core
Tier-1 Encoder Register Block Slave Controller Context Formation Memory Block Master Controller Arithmetic Encoder Master Interface Master Interface 2004/6/16 Tien-Wei Hsieh

28 Outline Introduction Previous work Proposed architecture
Experimental results Conclusion 2004/6/16 Tien-Wei Hsieh

29 Objective of Experiment
The objective of our experiment is to prove Low power High performance AHB-compliant Test pattern 512x512 gray images (airplane, baboon, lena, peppers) 1 title 5/3 DWT 3 decomposition levels Code-block size 64x64 2004/6/16 Tien-Wei Hsieh

30 IP Qualification & Code Coverage
IP qualification (nLint) Compliant with RMM guidelines Code coverage (Verification Navigator ) Design for testability (TetraMAX) Our design General expectancy Statement coverage 97.9% 95% Branch coverage 95.8% Toggle coverage 100% Path coverage 69.2% 50% Total faults Test patterns Test coverage 77,200 439 99.99% 2004/6/16 Tien-Wei Hsieh

31 Synthesis and power analysis
Our design Technology library TSMC .35 Area (gate count) 25,706 + 45kb memory Max. Frequency (MHz) 43.48 Power (mW) 26.68 Synthesis tool : Design Compiler (under WCCOM) Power analysis tool : PrimePower 2004/6/16 Tien-Wei Hsieh

32 Composition of coding cycle
1.46 0.17 0.12 (1.75) 1.35 0.16 0.11 (CX, D)s (1.62) 1.75 0.22 0.13 (2.1) 1.32 0.14 0.12 (1.58) (unit: 1,000,000) Simulation tool : ModelSim SE/PE 5.7e 2004/6/16 Tien-Wei Hsieh

33 Cycle reduction 0 0.5 1 1.5 2 2.5 (unit: 1,000,000)
0.078 Peppers Lena Baboon Airplane 0.077 2% reduction by stripe-skipping 0.069 0.084 0.006 0.005 9% reduction by proposed renormalization 0.007 0.005 (unit: 1,000,000) 2004/6/16 Tien-Wei Hsieh

34 Comparison 1.54 1.43 1.83 1.41 1.88 1.74 1.76 1.32 2.1 1.75 1.46 1.35 ( Unit: 1,000,000 cycles ) 2004/6/16 Tien-Wei Hsieh

35 Platform Architecture – Altera Excalibur EPXA10DDR
FPGA Embedded Stripe AHB2 AHB1 ARM922T Processor Interrupt Controller WatchDog Timer AHB1 to AHB2 Bridge DPRAM 128KB SRAM 256KB SDRAM 128MB PLD to Stripe Bridge Stripe to PLD UART External Bus Interface FLASH 32MB PLD Master 1 2 Slave 3 4 這是我們實驗所用的 SOC platform – Altera Excalibur EPXA10DDR 是個 AMBA-based platform,可以構成一個 standalone 的系統 主要分為兩部份: embedded stripe 和 reconfigurable FPGA Embedded stripe 主要為系統部份,在其上的 bus 依照連接的 component 而分別設計成 AHB1 及 AHB2 AHB1 連接的是系統核心的部份,主要為 ARM922T processor, Interrupt Controller 及 timer 而 AHB2 則是連接連低速的 peripherals ,像是 UART 、EBI 及 FPGA AHB1 及 AHB2 藉由 AHB1-AHB2 bridge 連接,並共享所有 memory 在 on-chip memory 方面有 SRAM 256KB,DPRAM 128KB 而 off-chip memory 方面則有 SDRAM 128MB Embedded stripe 藉由 PLD-Stripe bridge 及 Stripe-PLD bridge 和 FPGA 部份相連,我們的 Hardware accelerator 設計就置於此 整個系統可以 standalone,執行的資料 download 至 FLASH 系統啟動後,即將資料載入 memory 中執行 2004/6/16 Tien-Wei Hsieh

36 Platform-based SOC Design Flow
■ ADS ■ SOPC Builder ■ Quartus II ■ ESS, ADS and Modelsim SE System spec. User defined firmware Profiling & HW/SW partition Software spec. HW spec. for each component Software coding Stripe.h Stripe configuration FPGA design *.c files Excalibur.h Library Compilation (ADS) Component SDK Software image System PTF Device HDL coding BUS interface HDL coding HW/SW co-simulation SOPC generation Accelerator.v Interface.v System building (Quartus II) Stripe.v Component.v RTL codes Integration (SOPC Builder) System image Hardware image Pin assignment & Hardware compilation (Quartus II) 2004/6/16 SOPC platform Prototyping Tien-Wei Hsieh

37 FPGA Prototyping Result
Pure software Proposed accelerator (Second) 100.98 80.18 81.47 85.34 26.99 31.2 27.6 27.78 Platform: Altera ExcaliburTM EPXA10DDR, 25MHz

38 Outline Introduction Previous work Proposed architecture
Experimental results Conclusion 2004/6/16 Tien-Wei Hsieh

39 Summary Proposition Contribution 16-bit parallel context generator
Stripe-skipping method 3-stage pipelined arithmetic encoder Renormalization strategy with forwarding method Contribution We reduce the cycle count by 17% compared with the best-known design We have achieved 5% within the optimum 2004/6/16 Tien-Wei Hsieh

40 Future Work Ping-pong method for the “compress & PISO” to reduce the 5% coding cycles ASIC Integration 2004/6/16 Tien-Wei Hsieh

41 Demo on the SoC platform
Pure software (100 sec.) Configure the FPGA Load the original image to SDRAM Execute the JPEG2000 encoder Get the compressed image from SDRAM Record the time consuming Proposed accelerator (50 sec.) Compare images and time consuming 2004/6/16 Tien-Wei Hsieh

42 Thank you!!

43 Convert pixel data into bit-plane data
1 1 2004/6/16 Tien-Wei Hsieh

44 Load code-block data in memory
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 Memory A Stripe0 COL GRP ROW STP External Address: Stripe1 Stripe2 Memory B Memory C 4-bit bit bit bit (64x64 code-block size)

45 Context formation for a column
Pass[1] ZC1 Pass1 incurs 4 Zero Coding Contexts and 4 Sign Coding Contexts at most Pass2 incurs 4 Magnitude Refinement Contexts at most Pass3 incurs 1 Run-length Coding Context, 2 Uniform Coding Contexts, 3 Zero Coding Contexts and 4 Sign Coding Contexts at most Hence, there will be 10 output registers at least RLC 1 UC1 Pass[0] UC2 MR1 1 SC1 Pass[0] ZC2 MR2 1 SC2 Pass[0] ZC3 Pass is represented by 2-bit 00 => Pass1 01 => Pass2 10 => Pass3 MR3 1 SC3 Pass[0] ZC4 MR4 1 SC4 2004/6/16 Tien-Wei Hsieh


Download ppt "A Hardware Accelerator IP for EBCOT Tier-1 Coding in JPEG2000 Standard"

Similar presentations


Ads by Google