Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder


1 Analysis, Fast Algorithm, and VLSI Architecture Design for H.264/AVC Intra Frame Coder
Yu-Wen Huang, Bing-Yu Hsieh, Tung-Chien Chen, and Liang-Gee Chen
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2005

2 Outline
Introduction
H.264/AVC Intra Coding
Computation Reduction
Hardware Architecture

3 Introduction
[Figure: H.264/AVC encoder block diagram. The input video signal is split into macroblocks of 16x16 pixels. Blocks shown: Coder Control, Transform/Scaling/Quantization, Scaling & Inverse Transform, De-blocking Filter, Intra-frame Prediction, Motion Compensation, Motion Estimation, Intra/Inter selection, and Entropy Coding, with the decoder embedded in the coder. Signals: control data, quantized transform coefficients, motion data, output video signal. Annotation: multiple reference frames and variable block sizes.]

4 Introduction
Source → Prediction → Transform → Quantization → Entropy Coding → Compressed Data (bits per pixel)
Prediction: 4×4/16×16 luma, 8×8 chroma
Transform: 4×4 DCT
Quantization: scalar nonuniform Q (lossy)
Entropy coding: CAVLC, CABAC (lossless)

5 Introduction
H.264/AVC I-Frame Coder (CAVLC) vs. JPEG 2000 (DWT 5/3)
Computational complexity
Block-based coding vs. frame-based coding (DWT 5/3)
Hardware-friendly vs. memory-wasting

6 Introduction
Comparison between different image coding standards at 0.225 bpp
[Figure: JPEG, JPEG 2000 (DWT 5/3), and H.264 I-Frame (CAVLC)]

7 Introduction
Two solutions for platform-based design of H.264/AVC intra frame coder
Fast algorithm for software implementation
Reduces complexity by 45%
PSNR drop of 0.3 dB
Hardware accelerator
Max. clock rate 55 MHz
31 fps for 4:2:0 SDTV (all intra frames)

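8 H.264/AVC Intra Coding
Intra Prediction
I4MB (4×4): directional modes (arrows labeled 1, 3, 4, 5, 6, 7, 8 in the figure) + DC
I16MB (16×16): directional modes + DC + Plane
[Figure: prediction directions relative to the current block]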

9 H.264/AVC Intra Coding
Mode Decision
Low complexity mode
Distortion: SATD (original pels – predictors)
Rate: bits of mode information
High complexity mode
Distortion: MSE (original pels – reconstructed pels)
Rate: mode information + residual
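The low-complexity cost above can be sketched as follows. This is only an illustration under stated assumptions: the division of SATD by 2 and the Lagrangian-style weighting `lam * mode_bits` follow common reference-encoder practice and are not given on the slide, and the function names are hypothetical.

```python
def hadamard4x4(block):
    """Unnormalized 2-D 4x4 Hadamard transform of a list-of-lists block."""
    h = [[1, 1, 1, 1],
         [1, 1, -1, -1],
         [1, -1, -1, 1],
         [1, -1, 1, -1]]
    # tmp = H * block
    tmp = [[sum(h[i][k] * block[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
    # out = tmp * H^T
    return [[sum(tmp[i][k] * h[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences (original - predictor)."""
    diff = [[orig[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    coeffs = hadamard4x4(diff)
    # Assumption: halve the sum as a normalization; the slide does not specify one.
    return sum(abs(c) for row in coeffs for c in row) // 2


def low_complexity_cost(orig, pred, mode_bits, lam):
    """Distortion (SATD) plus weighted rate of the mode information.

    Assumption: `lam` is a hypothetical weighting factor for the mode bits.
    """
    return satd4x4(orig, pred) + lam * mode_bits
```

The encoder would evaluate this cost for each candidate predictor and keep the mode with the smallest value.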

10 H.264/AVC Intra Coding
Transform and Quantization
4×4 integer transform
Hadamard transform
DCT-based integer transform

11 H.264/AVC Intra Coding
Entropy Coding
Context-Based Adaptive Binary Arithmetic Coding (CABAC)
Context-Based Adaptive Variable Length Coding (CAVLC)

12 H.264/AVC Intra Coding
Run-time percentage
720×480, 4:2:0, 30 fps
10829 MIPS

13 Computation Reduction
Intra Prediction
Table look-up
Cost generation
Sub-sampling

14 Computation Reduction
Fast Intra Prediction
The smaller the mode number, the more likely the mode is to occur.
Global statistics cannot reflect the correlation of local modes, so local statistics of neighboring blocks are applied.

15 Computation Reduction
Fast Intra Prediction Skip unlikely candidates

16 Computation Reduction
Rate-distortion under different numbers of local-searched I4MB modes without insertion of full-search blocks
[Figure: rate-distortion curves labeled All modes, 6, 4, 2, 1, and DC]

17 Computation Reduction
Fast Intra Prediction
Prevention of error propagation
Periodic insertion of full-search 4x4 blocks
Adaptive threshold on the distortion for a MB
If min SATD of P > THMinSATD, then search all modes.
THMinSATD = 2.0 × (min SATD of F)
Pattern of full-search (F) and partial-search (P) 4x4 blocks in a MB:
F P F P
P P P P
F P F P
P P P P
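A minimal sketch of the adaptive-threshold rule, assuming the caller supplies the minimum SATD of a relevant full-search (F) block and a list of candidate modes for the partial-search (P) block. The slide does not say which F block is used or how candidates are chosen, so `min_satd_f`, `candidates`, and the function names are hypothetical.

```python
TH_FACTOR = 2.0  # factor stated on the slide


def best_mode(satd_of_mode, modes):
    """Return (mode, satd) with the smallest SATD among `modes`."""
    return min(((m, satd_of_mode(m)) for m in modes), key=lambda t: t[1])


def search_block(satd_of_mode, is_full_search, candidates, min_satd_f,
                 all_modes=range(9)):
    """Mode search for one 4x4 block.

    F blocks always search every mode. P blocks search only `candidates`
    and fall back to a full search when their best SATD exceeds
    TH_FACTOR * min_satd_f.
    """
    if is_full_search:
        return best_mode(satd_of_mode, all_modes)
    mode, satd = best_mode(satd_of_mode, candidates)
    if min_satd_f is not None and satd > TH_FACTOR * min_satd_f:
        return best_mode(satd_of_mode, all_modes)
    return mode, satd
```

The fallback limits error propagation: a P block whose partial search is much worse than a nearby full search is treated as a sign that the local statistics were misleading.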

18 Computation Reduction
Subsampling Patterns

19 Computation Reduction
Saved Computation and PSNR Drop
PSNR drop < 0.3 dB
Global: subsampling + partial search using global statistics
Local: subsampling + partial search
Proposed: subsampling + partial search + periodic insertion of full search + adaptive SATD threshold

20 Hardware Architecture
Assumptions
A RISC can execute one instruction per cycle, except multiplication, which requires two.
A processing element (PE) can generate the predictor of one pixel per cycle.

21 Hardware Architecture
Solutions
[Table: luma and chroma; produce all modes per cycle vs. produce one mode per cycle; 30 fps; number of modes; average cycles per predictor]

22 Hardware Architecture
Comparisons in different degrees of parallelism

23 Hardware Architecture
[Figure: DRAM and registers holding the neighboring pixels A–M]

24 Hardware Architecture
Four-Parallel Reconfigurable Intra Prediction Generator
[Figure: datapath built from 8-bit and 9-bit adders]

25 Hardware Architecture
[Figure: neighboring pixels A–M feeding the Intra Prediction Generator]

26 Hardware Architecture
I16MB DC Prediction Mode (T = top neighbors, L = left neighbors)
Cycle 1: PE0: T0+T4+T8+T12 | PE1: T1+T5+T9+T13 | PE2: T2+T6+T10+T14 | PE3: T3+T7+T11+T15
Cycle 2: PE0: +L0+L4+L8 | PE1: +L1+L5+L9 | PE2: +L2+L6+L10 | PE3: +L3+L7+L11
Cycle 3: PE0: +L12 | PE1: +L13 | PE2: +L14 | PE3: +L15
Cycle 4: the four partial sums are added together (three additions)
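The four-cycle schedule can be modeled in software as below. This is a behavioral sketch, not the hardware description: the final rounding `(total + 16) >> 5` is the H.264 DC rule when both neighbors are available and is not shown on the slide, and the function name is hypothetical.

```python
def i16mb_dc_four_pe(top, left):
    """Model the 4-PE accumulation of the I16MB DC predictor.

    `top` and `left` each hold the 16 reconstructed neighboring pixels.
    """
    acc = [0, 0, 0, 0]
    # Cycle 1: PE k sums top pixels k, k+4, k+8, k+12.
    for k in range(4):
        acc[k] = top[k] + top[k + 4] + top[k + 8] + top[k + 12]
    # Cycle 2: PE k adds left pixels k, k+4, k+8.
    for k in range(4):
        acc[k] += left[k] + left[k + 4] + left[k + 8]
    # Cycle 3: PE k adds left pixel k+12.
    for k in range(4):
        acc[k] += left[k + 12]
    # Cycle 4: combine the four partial sums.
    total = acc[0] + acc[1] + acc[2] + acc[3]
    # Assumption: standard rounding for 32 neighbors (not on the slide).
    return (total + 16) >> 5
```

Every one of the 32 neighbors is added exactly once, so the result matches a direct sum over both neighbor arrays.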

27 Hardware Architecture
I16MB Plane Prediction Mode
$\mathrm{Pred}[y, x] = \mathrm{Clip1}\big((a + b\,(x - 7) + c\,(y - 7) + 16) \gg 5\big)$
$a = 16\,(p[-1, 15] + p[15, -1])$
$b = (5H + 32) \gg 6$
$c = (5V + 32) \gg 6$
$H = \sum_{x'=0}^{7} (x'+1)\,(p[-1, 8+x'] - p[-1, 6-x'])$
$V = \sum_{y'=0}^{7} (y'+1)\,(p[8+y', -1] - p[6-y', -1])$
[Figure: accumulators A0, A1, A2, A3 start from Pred[0,0], Pred[0,4], Pred[0,8], Pred[0,12]]
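The plane-mode formulas can be checked with a short reference model. This sketch assumes 8-bit samples, the slide's p[row, col] indexing (row -1 is the top neighbor row, column -1 is the left neighbor column), and the rounding offset of 16 from the H.264 specification. The argument layout (`top`, `left`, `corner`) and the function names are hypothetical.

```python
def clip1(v):
    """Clip to the 8-bit sample range (assumes 8-bit video)."""
    return max(0, min(255, v))


def i16mb_plane(top, left, corner):
    """Return the 16x16 plane predictor as pred[y][x].

    top[c]  = p[-1, c], left[r] = p[r, -1], corner = p[-1, -1].
    """
    def t(c):
        return corner if c < 0 else top[c]

    def l(r):
        return corner if r < 0 else left[r]

    # Horizontal and vertical gradients from the neighbor row and column.
    h = sum((i + 1) * (t(8 + i) - t(6 - i)) for i in range(8))
    v = sum((i + 1) * (l(8 + i) - l(6 - i)) for i in range(8))
    a = 16 * (top[15] + left[15])
    b = (5 * h + 32) >> 6
    c = (5 * v + 32) >> 6
    return [[clip1((a + b * (x - 7) + c * (y - 7) + 16) >> 5)
             for x in range(16)]
            for y in range(16)]
```

Because the predictor is linear in x and y, hardware can step from a seed value by adding b or c per pixel instead of multiplying, which is one plausible reading of the four seed positions in the figure.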

28 Hardware Architecture

29 Hardware Architecture

30 Hardware Architecture
Transform (implemented by shifters and adders): DCT, iDCT, Hadamard
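To illustrate why shifters and adders suffice, here is a sketch of the H.264 4×4 forward core transform written as a butterfly. It models only the multiplier-free arithmetic, not the paper's actual datapath, and the function names are hypothetical.

```python
def fwd_1d(x0, x1, x2, x3):
    """1-D forward core transform using only adds, subtracts, and 1-bit shifts.

    Equivalent to multiplying by the rows
    [1, 1, 1, 1], [2, 1, -1, -2], [1, -1, -1, 1], [1, -2, 2, -1].
    """
    s03, d03 = x0 + x3, x0 - x3
    s12, d12 = x1 + x2, x1 - x2
    return (s03 + s12,
            (d03 << 1) + d12,
            s03 - s12,
            d03 - (d12 << 1))


def fwd_4x4(block):
    """2-D transform: apply the 1-D butterfly to every row, then every column."""
    rows = [fwd_1d(*r) for r in block]
    cols = [fwd_1d(*c) for c in zip(*rows)]
    # cols is indexed [column][row]; transpose back to [row][column].
    return [list(r) for r in zip(*cols)]
```

The Hadamard transform uses the same butterfly structure with the two shifts removed, so one reconfigurable adder network can in principle serve both.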

31 Hardware Architecture
