1 A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER Kefalas Nikolaos, Theodoridis George VLSI Design Lab. Electrical & Computer.

1 A HIGH THROUGHPUT PIPELINED ARCHITECTURE FOR H.264/AVC DEBLOCKING FILTER
Kefalas Nikolaos, Theodoridis George
VLSI Design Lab., Electrical & Computer Eng. Department, University of Patras, Greece

2 Outline
1. Deblocking filter algorithm
2. Filtering order
3. Memory organization
4. Pipelined architecture
5. Synthesis results and comparisons
6. Conclusions and future work

3 Deblocking Filter Algorithm (1/3)
• The deblocking filter is used in H.264/AVC to reduce blocking artifacts
  – It improves subjective and objective quality and typically reduces the bit rate by 5-10%
• It is applied on a macroblock (MB) basis after the macroblock reconstruction stage completes
• It includes a large number of data-dependent branches
  – Each 4x4 pixel area is processed up to four times
• It accounts for over one-third of the total decoding time

4 Deblocking Filter Algorithm (2/3)
• Each MB is processed in 4x4 blocks
• The vertical edges are filtered first, from left to right
  – From edge V0 to edge V3
• The horizontal edges are then filtered from top to bottom
  – From edge H0 to edge H3
• Eight pixels from two adjacent 4x4 sub-blocks are filtered at the same time
  – The same process is repeated for the chroma components
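The edge order on this slide can be sketched as a small enumeration. This is only an illustration of the order for one 16x16 luma macroblock (vertical edges V0 to V3, then horizontal edges H0 to H3); the function name `luma_edge_order` and the tuple layout are assumptions made for the example, not part of the presentation.

```python
def luma_edge_order():
    """Yield (direction, edge_index, sub_edge_index) for one 16x16 luma MB.

    Vertical edges V0..V3 come first, left to right, followed by the
    horizontal edges H0..H3, top to bottom. Each edge is 16 pixels long,
    so it spans four 4-pixel sub-edges.
    """
    for direction in ("V", "H"):
        for edge in range(4):
            for sub_edge in range(4):
                yield direction, edge, sub_edge


order = list(luma_edge_order())
print(len(order))            # number of luma sub-edges per macroblock
print(order[0], order[16])   # first vertical and first horizontal sub-edge
```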

5 Deblocking Filter Algorithm (3/3)
• Each sub-edge shares a boundary strength (BS) value
• The BS, together with two thresholds α and β, decides the filtering strength of each sub-edge
  – A filter-samples flag is calculated
• Three filter types are used
  – Strong filter (4- or 5-tap filter)
  – Weak filter
  – No filtering
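A minimal sketch of the decision described above, following the filter-samples condition of the H.264/AVC standard. The names `filter_samples_flag` and `filter_type` are illustrative; p1, p0 are the two samples nearest the edge on one side and q0, q1 the two nearest on the other side.

```python
def filter_samples_flag(bs, p1, p0, q0, q1, alpha, beta):
    """True when the samples across one line of a sub-edge should be filtered.

    A large step across the edge (>= alpha) or a large gradient on either
    side (>= beta) is treated as real image content and left untouched.
    """
    return (
        bs != 0
        and abs(p0 - q0) < alpha
        and abs(p1 - p0) < beta
        and abs(q1 - q0) < beta
    )


def filter_type(bs, flag):
    """Map the BS value and the flag to one of the three filter types."""
    if not flag:
        return "none"
    return "strong" if bs == 4 else "weak"


flag = filter_samples_flag(4, 60, 60, 64, 64, alpha=20, beta=10)
print(flag, filter_type(4, flag))
```

The thresholds α and β come from lookup tables indexed by the quantization parameter, so the same pixel step may be filtered at a coarse quantizer and preserved at a fine one.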

7 Filtering Order
• During filtering, all four sub-edges of each sub-block are filtered, and almost all pixels are involved and updated
• A suitable filtering order is needed to:
  – Reduce the size of the on-chip memory for buffering intermediate data
  – Increase data reuse
  – Reduce external memory accesses
  – Simplify control and steering logic
  – Avoid pipeline stalls due to data and resource hazards

8 Proposed Filtering Order
• The vertical sub-edges are filtered in raster-scan order, followed by the horizontal ones
• The filtering direction does not change until all vertical edges of luma and chroma have been filtered
• The proposed order complies with the standard

10 Memory Organization (1/2)
• Four single-port memories are employed (sizes in bits)
  – Current-A (CM-A): 96x32
  – Current-B (CM-B): 96x32
  – Left_mem (LM): 32x32
  – Upper_mem (UM): 2xFWx32 + 2x(FW/16)x32
• Transpose buffers TR-P and TR-Q (4x32)
  – Typical systolic array
• All internal buses are 32 bits wide
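To get a feel for the sizes listed above, the expressions can be evaluated for a concrete frame width. This sketch assumes FW is the luma frame width in pixels and a multiple of 16, which the slide does not state explicitly; the function name `on_chip_memory_bits` is hypothetical.

```python
def on_chip_memory_bits(frame_width):
    """Return the size in bits of each buffer listed on the slide.

    frame_width (FW) is ASSUMED to be the luma frame width in pixels;
    the slide only gives the symbolic expression.
    """
    sizes = {
        "CM-A": 96 * 32,
        "CM-B": 96 * 32,
        "LM": 32 * 32,
        "UM": 2 * frame_width * 32 + 2 * (frame_width // 16) * 32,
        "TR-P": 4 * 32,
        "TR-Q": 4 * 32,
    }
    sizes["total"] = sum(sizes.values())
    return sizes


sizes = on_chip_memory_bits(1920)
print(sizes["UM"], sizes["total"])
```

Under this assumption the upper memory dominates the total, since it is the only buffer that scales with the frame width.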

11 Memory Organization (2/2)

13 Algorithm Features
• Computationally intensive operations of the deblocking filter algorithm
  – LUT operations: retrieve the values α(IndexA), β(IndexB), c1(IndexA, BS)
  – BS calculation
  – Weak filter, BS 1 to 3: filtering, δ calculation, and clipping operations
  – Strong filter, BS 4
• The introduced pipeline exploits specific algorithmic features
  – BS is the same for all micro-edges of a luma sub-edge
  – The luma BS is reused for the chroma components
  – For the 4:2:0 format, BS changes every 2 micro-edges in the chroma components
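The δ calculation and clipping of the weak filter can be sketched as follows, using the equations of the H.264/AVC standard for 8-bit samples. The parameter `tc` is the clipping bound; treating it as derived from the table value the slide calls c1 is an assumption, and the helper names are illustrative.

```python
def clip3(lo, hi, x):
    """Clamp x to the inclusive range [lo, hi]."""
    return max(lo, min(hi, x))


def weak_filter_p0_q0(p1, p0, q0, q1, tc):
    """Weak filter (BS 1 to 3) for the two samples nearest the edge.

    delta is computed from four samples, clipped to [-tc, tc], then added
    to p0 and subtracted from q0. Results are clamped to 8-bit range.
    """
    delta = clip3(-tc, tc, (((q0 - p0) << 2) + (p1 - q1) + 4) >> 3)
    return clip3(0, 255, p0 + delta), clip3(0, 255, q0 - delta)


print(weak_filter_p0_q0(60, 60, 80, 80, tc=2))   # delta limited by tc
print(weak_filter_p0_q0(60, 60, 80, 80, tc=10))  # delta not limited
```

Python's `>>` on negative integers is an arithmetic shift, which matches the shift semantics the standard assumes for this expression.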

14 Proposed Pipeline Organization

15 Pipeline Operation
• Each sub-block needs 4 cycles to be processed
• The BS unit spends 4 cycles (BS calculation and LUT operations)
  – BS and LUT operations do not depend on pixel values
• BS calculation and LUT operations are overlapped with the filtering operations for the luma component
• Four initialization cycles are needed to calculate the BS and the α, β, c1 values for the first luma sub-block

16 BS=4 Filtering
• Filter equations are modified to improve delay and area
• BS=4: 13 adders instead of 28
• Total components: Adders = 31
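For reference, the unmodified strong-filter equations from the H.264/AVC standard (luma, p side) are sketched below. This is not the authors' modified formulation with the reduced adder count, which the transcript does not preserve; it only shows the baseline that such a modification starts from. The function name is illustrative.

```python
def strong_filter_p_side(p3, p2, p1, p0, q0, q1, alpha, beta):
    """Standard BS=4 luma filter for the p side of an edge.

    Returns the filtered (p0, p1, p2). The q side is symmetric, with the
    roles of p and q swapped.
    """
    ap = abs(p2 - p0)
    if ap < beta and abs(p0 - q0) < ((alpha >> 2) + 2):
        # Smooth region: 5-tap and 4-tap filters modify three samples.
        new_p0 = (p2 + 2 * p1 + 2 * p0 + 2 * q0 + q1 + 4) >> 3
        new_p1 = (p2 + p1 + p0 + q0 + 2) >> 2
        new_p2 = (2 * p3 + 3 * p2 + p1 + p0 + q0 + 4) >> 3
    else:
        # Otherwise only p0 is modified, with a shorter filter.
        new_p0 = (2 * p1 + p0 + q1 + 2) >> 2
        new_p1, new_p2 = p1, p2
    return new_p0, new_p1, new_p2


print(strong_filter_p_side(60, 60, 60, 60, 64, 64, alpha=20, beta=10))
```

Counting the additions in these expressions as written gives a sense of why sharing partial sums between the outputs can reduce the adder count.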

17 Pipeline Benefits
• LUT operations and BS calculation are not squeezed into a single pipeline stage
  – The BS unit has 4 cycles
• The filtering operations are spread over three pipeline stages
• The BS values are reused for filtering the chroma components
• The original filtering equations are modified (improving performance and area)
• The proposed ordering eases control logic and memory addressing, avoiding any potential increase in the critical path

18 Edge Filter Process
Block cycle: 0 1 2 3 4
Filtered sub-edge: 0 1 2 3 4
PIN: L0 B0 B1 B2 L1
QIN: B0 B1 B2 B3 B4
TR_P-W: B0 B1 B2
TR_P-R: B0 B1
TR_Q-W: B3
TR_Q-R:
CM_A-R: B0 B1 B2 B3 B4
CM_B-W: B0 B1
LM-R: L0 L1
LM-W:
UPM-W:
Ext_M-W: L0

19 Vertical Edge Filter Process
• Total cycles = 4 x 27 = 108
  – If a two-port memory were used, the total would be 4 x 24 = 96 cycles, which is the optimum
Block cycle:
Filtered sub-edge:
PIN: L0 B0 B1 B2 L1 B4 B5 … L3 B12 B13 … L1 B22
QIN: B0 B1 B2 B3 B4 B5 B6 B12 B13 B14 B22 B23
TR_P-W: B0 B1 B2 B4 … B10 L3 B12 … B20 L1 B22
TR_P-R: B0 B1 B2 B9 B10 L3 B20 B22
TR_Q-W: B3 … B11 … B21 B23
TR_Q-R: 3 B11 B19 B21 B23
CM_A-R: B0 B1 B2 B3 B4 B5 B6 … B12 B13 B14 … B22 B23
CM_B-W: B0 B1 B2 B3 B9 B10 B11 B19 B20 B21 B22 B23
LM-R: L0 L1 … L3 … L1
LM-W:
UPM-W: L3 L1
Ext_M-W: L0 L1

20 Processing Cycles
• Vertical edges: 108 cycles
• Horizontal edges: 108 cycles
• Initialization: 10 cycles
  – 6 cycles to fetch coding information and initialize control
  – 4 cycles for the first BS calculation
• Normal operation: 226 cycles
• For the last row (edges 27, 31, 35, 41, 45): 5 x 4 = 20 extra cycles
  – Resource hazard (bus conflict)
• For the last MB in the frame, 12 extra cycles are needed (edges 39, 43, 47)
  – Resource hazard (bus conflict)
• Worst-case total: 258 cycles
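The cycle budget above can be checked with a few lines of arithmetic. The figures are taken from the slides (including the 4 x 27 slots for the vertical edges); the variable names are illustrative only.

```python
# Cycle budget for one macroblock, using the figures quoted on the slides.
CYCLES_PER_SUB_EDGE = 4

vertical = CYCLES_PER_SUB_EDGE * 27       # 108 cycles for the vertical edges
horizontal = 108                          # quoted directly on the slide
initialization = 6 + 4                    # control setup + first BS calculation

normal = vertical + horizontal + initialization
last_row_extra = 5 * CYCLES_PER_SUB_EDGE  # five edges delayed by bus conflicts
last_mb_extra = 12                        # extra cycles for the last MB in a frame

worst_case = normal + last_row_extra + last_mb_extra
print(normal, normal + last_row_extra, worst_case)  # 226 246 258
```

The intermediate value of 246 cycles corresponds to a macroblock in the bottom frame row that is not the last macroblock of the frame.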

22 Experimental Setup
• Synthesis setup
  – Synopsys Design Compiler
  – TSMC 0.18 μm
• FPGA proven
  – Stand-alone, compared against the JM reference software
  – It has also been verified as part of an H.264 hardware encoder
  – It achieves 280 MHz on a Virtex-5, speed grade 3

23 Synthesis Results and Comparisons
Designs compared: [5] (2008), [6] (2008), [7] (2009), [8] (2006), Proposed
Pipeline stages: 5, 5, 4, 5, 5
Filtering order: Hybrid Impr. Sequential
Local RAMs (bits): 1P 1 2x96x32 1P 96x32, 2P 1 32x32 1P 32x32 1P 96x32, 1P 32x32 1P 96x32, 2P 32x32 1P 2x96x32, 1P 32x32
Upper neighbour RAM (bits): 1P 2FWx32, N/A, 1P 2FW (note 2) x32, 1P 1.5FWx32, 1P 2FWx32
Coding information RAM (bits): N/A 2(FW/16)x32 (note 7)
Transpose buffers (4x32 bits): 7, 1, 5, 2, 2
Technology (μm): 0.18
Gate count (10^3 gates):
Kernel processing (cycles/MB): 204210/ / /246 6
Max frequency (MHz): (1.8x up to 4x)
Throughput (10^3 MB/s): (1.5x up to 3.8x)
Fps, Full HD (1920x1080):
Fps, Ultra HD (3840x2160):
Notes:
1: 1P = single-port, 2P = two-port
2: FW = frame width
3: Filtering cycles only
4: Filtering cycles only
5: It takes 246 cycles to filter a MB at the right frame boundary
6: It takes 246 cycles to filter a MB at the bottom frame row
7: The 2x(FW/16)x32 bits are stored in the upper memory

24 Conclusions
• A novel high-speed pipelined architecture for the H.264/AVC deblocking filter is proposed
• It operates at 400 MHz and occupies 19.2 Kgates in 0.18 μm CMOS technology
• It achieves 216 and 54 fps for Full HD and Ultra HD frames, respectively
• Only single-port memories are employed
• No external memory accesses are needed during filtering
  – Parameters and neighbors are stored internally
  – Only fully filtered data are written to external memories
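The quoted frame rates follow from the clock frequency and the cycle count per macroblock. The sketch below assumes every macroblock takes the normal-operation count of 226 cycles (ignoring the extra cycles at the bottom row) and that the frame height is rounded up to a whole number of 16x16 macroblocks; both are assumptions made for this estimate, and the function name is hypothetical.

```python
import math


def frames_per_second(clock_hz, cycles_per_mb, width, height):
    """Estimate whole frames per second for a deblocking filter.

    The frame is covered by 16x16 macroblocks, rounding partial
    macroblocks up, and each macroblock costs cycles_per_mb cycles.
    """
    mbs = math.ceil(width / 16) * math.ceil(height / 16)
    return clock_hz // (cycles_per_mb * mbs)


print(frames_per_second(400_000_000, 226, 1920, 1080))  # 216
print(frames_per_second(400_000_000, 226, 3840, 2160))  # 54
```

Under these assumptions the estimate reproduces the 216 and 54 fps figures stated in the conclusions.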

25 Questions?

26 Hardware Architecture (Pipeline organization) 5/ Threshold Calculation

27 BS=4 Filtering

28 Deblocking Filter Algorithm 3/3
• Each sub-edge between two adjacent 4x4 luma sub-blocks shares a Boundary Strength (BS)
• The BS value, along with two threshold variables α and β, decides the filtering strength of the sub-edge

29 Hardware Architecture (Pipeline organization) 5/ BS 1, 2, 3 filter

30 Deblocking Filter Algorithm 4/4
• Boundary strength across horizontal edges
  – The boundary strength is calculated for each sub-edge of the luma component
  – It is reused for the chroma components in a 2:1 ratio for the 4:2:0 format

