Low Power and Low Area Transform–Quant & Inverse Quant–Inverse Transform Hardware Design for H.264 Encoder
Outline I. H.264 TQ & IQIT II. DESIGNED HARDWARE III. RESULTS
H.264 TQ & IQIT: Each residual macroblock is transformed and quantized. Previous standards such as MPEG-1, MPEG-2, MPEG-4 and H.263 used the 8x8 Discrete Cosine Transform (DCT) as the basic transform. The "baseline" profile of H.264 uses three transforms, depending on the type of residual data: 1) a transform for the 4x4 array of luma DC coefficients in intra macroblocks (predicted in 16x16 mode), 2) a transform for the 2x2 array of chroma DC coefficients (in any macroblock), 3) a transform for all other 4x4 blocks in the residual data.
Work accomplished... (T, Q, IQ, IT)... Future work (MC, top level, ...)
Data within a macroblock are transmitted in the order shown in the figure. If the macroblock is coded in 16x16 Intra mode, then the block labelled "-1" is transmitted first, containing the DC coefficient of each 4x4 luma block. Next, the luma residual blocks 0-15 are transmitted in the order shown (with the DC coefficient set to zero in a 16x16 Intra macroblock). Blocks 16 and 17 contain a 2x2 array of DC coefficients from the Cb and Cr chroma components respectively. Finally, chroma residual blocks 18-25 (with zero DC coefficients) are sent. The entire process of transform and quantization can be carried out using 16-bit integer arithmetic.
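The transmission order described on this slide can be sketched in a few lines of Python. The function name `block_order` and its `intra16x16` flag are hypothetical, chosen only for illustration; the block labels (-1, 0-15, 16-17, 18-25) are the ones given in the text.

```python
def block_order(intra16x16):
    """Return the block labels of one macroblock in transmission order."""
    order = []
    if intra16x16:
        order.append(-1)           # 4x4 array of luma DC coefficients goes first
    order.extend(range(0, 16))     # luma residual blocks 0-15
    order.extend([16, 17])         # 2x2 chroma DC blocks (Cb, then Cr)
    order.extend(range(18, 26))    # chroma residual blocks 18-25
    return order


intra_order = block_order(True)
other_order = block_order(False)
```

In this sketch an Intra 16x16 macroblock carries one more block than any other macroblock, because of the extra luma DC block.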
4x4 Integer Transform & Inverse Transform: It is an integer transform. The core part of the transform is multiply-free: it requires only additions and shifts. A scaling multiplication (part of the complete transform) is integrated into the quantizer, reducing the total number of multiplications.
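To illustrate the "additions and shifts only" claim, here is a minimal Python sketch of the forward core transform as a butterfly. The matrix `CF` is the forward core matrix as commonly published for H.264; it does not appear in this transcript and is used here only to cross-check the butterfly. The function names and the sample block `X` are illustrative assumptions, not the presenters' hardware datapath.

```python
# Forward core matrix as commonly published for H.264 (assumption: not in the slides).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def forward_1d(x0, x1, x2, x3):
    """One 4-point butterfly using only additions, subtractions and shifts."""
    s0, s1 = x0 + x3, x1 + x2
    d0, d1 = x0 - x3, x1 - x2
    return (s0 + s1, (d0 << 1) + d1, s0 - s1, d0 - (d1 << 1))


def forward_core_4x4(block):
    """Apply the butterfly to every row, then to every column."""
    rows = [forward_1d(*r) for r in block]
    cols = [forward_1d(*c) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transpose(m):
    return [list(r) for r in zip(*m)]


# Sample residual block (illustrative values).
X = [[5, 11, 8, 10],
     [9, 8, 4, 12],
     [1, 10, 11, 4],
     [19, 6, 15, 7]]

W = forward_core_4x4(X)
W_ref = matmul(matmul(CF, X), transpose(CF))  # same result via multiplications
```

The butterfly and the matrix product give the same coefficients; the butterfly simply avoids the multipliers.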
Quantization: >> indicates a binary shift right. In the reference model software, f is 2^qbits/3 for Intra blocks or 2^qbits/6 for Inter blocks. For QP > 5, the factors MF remain unchanged but the divisor 2^qbits increases by a factor of 2 for each increment of 6 in QP.
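The quantization equation itself is not reproduced in this transcript. The sketch below assumes the commonly published form |Z| = (|W| * MF + f) >> qbits with qbits = 15 + floor(QP/6), and the commonly published MF table; both are assumptions brought in from outside the slides, and the function names are hypothetical.

```python
# MF values per QP % 6 for three position classes (assumption: the commonly
# published H.264 table, not taken from this transcript).
# Columns: both indices even, both indices odd, all other positions.
MF_TABLE = [
    (13107, 5243, 8066),
    (11916, 4660, 7490),
    (10082, 4194, 6554),
    (9362, 3647, 5825),
    (8192, 3355, 5243),
    (7282, 2893, 4559),
]


def mf(qp, i, j):
    even_even, odd_odd, other = MF_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def quantize(w, qp, intra=True):
    """Quantize a 4x4 block of core-transform coefficients."""
    qbits = 15 + qp // 6                      # divisor 2^qbits doubles every 6 QP steps
    f = (1 << qbits) // (3 if intra else 6)   # rounding offset from the text
    z = [[0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            mag = (abs(w[i][j]) * mf(qp, i, j) + f) >> qbits
            z[i][j] = mag if w[i][j] >= 0 else -mag   # sign handled separately
    return z


# Sample coefficients (illustrative values).
W = [[140, -1, -6, 7],
     [-19, -39, 7, -92],
     [22, 17, 8, 31],
     [-27, -32, -59, -21]]

Z = quantize(W, qp=10, intra=True)
```

Working on magnitudes and restoring the sign afterwards is one way to sidestep the signed-arithmetic issues that a shift of a negative value would otherwise introduce.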
Inverse Quantization: The rescaled output increases by a factor of 2 for every increment of 6 in QP. The rescaling includes a further constant scaling factor of 64 to avoid rounding errors. The values at the output of the inverse transform are divided by 64 to remove the scaling factor.
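A minimal Python sketch of rescaling followed by the inverse core transform may help here. It assumes the commonly published rescaling rule W' = Z * V * 2^floor(QP/6) and V table, neither of which appears in this transcript. The division by 64 is written as a rounded shift `(x + 32) >> 6`, which is also an assumption. All function names are hypothetical.

```python
# V values per QP % 6 (assumption: the commonly published H.264 table).
# Columns: both indices even, both indices odd, all other positions.
V_TABLE = [
    (10, 16, 13),
    (11, 18, 14),
    (13, 20, 16),
    (14, 23, 18),
    (16, 25, 20),
    (18, 29, 23),
]


def v(qp, i, j):
    even_even, odd_odd, other = V_TABLE[qp % 6]
    if i % 2 == 0 and j % 2 == 0:
        return even_even
    if i % 2 == 1 and j % 2 == 1:
        return odd_odd
    return other


def rescale(z, qp):
    """Rescale quantized coefficients; output doubles every 6 QP steps."""
    scale = 1 << (qp // 6)
    return [[z[i][j] * v(qp, i, j) * scale for j in range(4)] for i in range(4)]


def inverse_1d(w0, w1, w2, w3):
    """4-point inverse butterfly using additions and right shifts only."""
    e0, e1 = w0 + w2, w0 - w2
    e2, e3 = (w1 >> 1) - w3, w1 + (w3 >> 1)
    return (e0 + e3, e1 + e2, e1 - e2, e0 - e3)


def inverse_core_4x4(w):
    """Rows, then columns, then remove the factor of 64 with rounding."""
    rows = [inverse_1d(*r) for r in w]
    cols = [inverse_1d(*c) for c in zip(*rows)]
    return [[(x + 32) >> 6 for x in r] for r in zip(*cols)]


# Sample quantized block (illustrative values).
Z = [[17, 0, -1, 0],
     [-1, -2, 0, -5],
     [3, 1, 1, 2],
     [-2, -1, -5, -1]]

W_rescaled = rescale(Z, qp=10)
X_rec = inverse_core_4x4(W_rescaled)
```

Because the V factors already carry the factor of 64, the intermediate values stay integer throughout and only the final shift discards precision.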
4x4 luma DC coefficient Transform & Quantization: 16x16 Intra mode only. An inverse Hadamard transform is applied, followed by rescaling (note that the order is not reversed as might be expected). Rescaling uses one formula if QP is greater than or equal to 12 and a different one if QP is less than 12.
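The two rescaling formulas are not reproduced in this transcript. The sketch below assumes the commonly published forms: for QP >= 12, W' = W * V(0,0) * 2^(floor(QP/6) - 2); for QP < 12, W' = (W * V(0,0) + 2^(1 - floor(QP/6))) >> (2 - floor(QP/6)). The Hadamard matrix `H`, the `V00` values and the function names are likewise assumptions introduced for illustration.

```python
# 4x4 Hadamard matrix (symmetric) and V(0,0) per QP % 6; both are assumptions
# taken from the commonly published description, not from the slides.
H = [[1, 1, 1, 1],
     [1, 1, -1, -1],
     [1, -1, -1, 1],
     [1, -1, 1, -1]]
V00 = [10, 11, 13, 14, 16, 18]


def matmul4(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def inverse_luma_dc(z_dc, qp):
    """Inverse Hadamard first, then rescaling, in that order."""
    w = matmul4(matmul4(H, z_dc), H)
    v = V00[qp % 6]
    k = qp // 6
    if qp >= 12:
        return [[c * v * (1 << (k - 2)) for c in row] for row in w]
    # QP < 12: k is 0 or 1, so both shift amounts below are non-negative.
    return [[(c * v + (1 << (1 - k))) >> (2 - k) for c in row] for row in w]


# A block with a single non-zero DC level (illustrative values).
z_dc = [[1, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]

out_qp12 = inverse_luma_dc(z_dc, 12)
out_qp6 = inverse_luma_dc(z_dc, 6)
out_qp0 = inverse_luma_dc(z_dc, 0)
```

The low-QP branch exists because the scale factor would otherwise be a fraction; a rounded right shift keeps the arithmetic in integers.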
2x2 chroma DC coefficient Transform & Quantization: The inverse transform is identical to the forward transform. During decoding, the inverse transform is applied before rescaling. Rescaling uses one formula if QP is greater than or equal to 6 and a different one if QP is less than 6. The rescaled coefficients are replaced in their respective 4x4 blocks of chroma coefficients. Expanded form of the 2x2 transform of the inputs z0..z3:
[ (z0+z2) + (z1+z3), (z0+z2) - (z1+z3);
  (z0-z2) + (z1-z3), (z0-z2) - (z1-z3) ]
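The expanded 2x2 form above maps directly onto four additions and four subtractions, as this Python sketch shows. The rescaling formulas are not in the transcript; the code assumes the commonly published forms W' = W * V(0,0) * 2^(floor(QP/6) - 1) for QP >= 6 and W' = (W * V(0,0)) >> 1 for QP < 6. The `V00` values and function names are assumptions for illustration.

```python
# V(0,0) per QP % 6 (assumption: commonly published H.264 values).
V00 = [10, 11, 13, 14, 16, 18]


def chroma_dc_transform(z):
    """2x2 transform of [[z0, z1], [z2, z3]]; used in both directions."""
    (z0, z1), (z2, z3) = z
    a, b = z0 + z2, z1 + z3
    c, d = z0 - z2, z1 - z3
    return [[a + b, a - b], [c + d, c - d]]


def rescale_chroma_dc(w, qp):
    """Rescale after the inverse transform; two cases split at QP = 6."""
    v = V00[qp % 6]
    if qp >= 6:
        return [[x * v * (1 << (qp // 6 - 1)) for x in row] for row in w]
    return [[(x * v) >> 1 for x in row] for row in w]


z = [[1, 2], [3, 4]]                 # illustrative values
w = chroma_dc_transform(z)
twice = chroma_dc_transform(w)       # applying it again returns 4 * z
w_qp6 = rescale_chroma_dc(w, 6)
w_qp0 = rescale_chroma_dc(w, 0)
```

Applying the transform twice returns the input scaled by 4, which is why the same adder structure can serve as both forward and inverse transform.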
Problems encountered: Signed arithmetic. Initially designed for 100 MHz. The dual-purpose datapath adds extra MUX delays. Hardware is specified in the standard to avoid rounding errors. An error in the book "H.264 and MPEG-4 Video Compression"! An unpredicted and unbelievable routing error!
RESULTS: The designed hardware supports up to H.264 level 2.2 (SDTV @ 15 fps). A dual-purpose datapath is designed. Transform and Quantization of a 4x4 block is completed in 36 clock cycles. Inverse Quantization of a 4x4 block takes 18 clock cycles. Inverse Transform of a 4x4 block is done in 36 clock cycles. It takes nearly 2400 cycles to complete an intra 16x16 predicted macroblock. Working at 80 MHz, the designed hardware can process up to about 33,000 macroblocks per second.
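The throughput figure follows from simple arithmetic, sketched here in Python. The clock rate and cycle count are from the slide; the 720x480 frame size used for the SDTV @ 15 fps comparison is an assumption, since the slide does not state the resolution.

```python
CLOCK_HZ = 80_000_000          # 80 MHz, from the slide
CYCLES_PER_MACROBLOCK = 2400   # "nearly 2400 cycles", from the slide

# Macroblocks the datapath can complete per second.
mb_per_second = CLOCK_HZ // CYCLES_PER_MACROBLOCK

# Assumed SDTV frame: 720x480 pixels, 16x16 macroblocks, 15 frames per second.
FRAME_WIDTH, FRAME_HEIGHT, FPS = 720, 480, 15
mb_per_frame = (FRAME_WIDTH // 16) * (FRAME_HEIGHT // 16)
mb_required = mb_per_frame * FPS

headroom = mb_per_second / mb_required
```

Under these assumptions the datapath processes about 33,333 macroblocks per second against a requirement of 20,250, so it has roughly 1.6x headroom for the stated target.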
Synthesis Results
Synthesis is done with LeonardoSpectrum. Clock frequency is 80 MHz.
Number of ports : 68
Number of nets : 212
Number of instances : 30
Number of references to this view : 0
Total accumulated area :
  Number of Dffs or Latches : 493
  Number of Function Generators : 2688
  Number of MUX CARRYs : 148
  Number of MUXF5 : 608
  Number of MUXF6 : 184
  Number of accumulated instances : 3847
Number of global buffers used: 0
Device Utilization for 2V8000ff1152
Resource               Used    Avail   Utilization
--------------------------------------------------
IOs                      68      824       8.25%
Global Buffers            0       16       0.00%
Function Generators    2688    93184       2.88%
CLB Slices             1344    46592       2.88%
Dffs or Latches         493    95656       0.52%
Block RAMs                0      168       0.00%
Block Multipliers         1      168       0.60%
FPGA & ASIC: The design can be used for either FPGA or ASIC. Only one multiplier is used (the 2V8000ff1152 has 168 block multipliers). A clock frequency of 80 MHz is achieved on the FPGA. To reach 80 MHz, many pipeline stages were added. The designed hardware may work at a clock frequency of up to 200 MHz in ASIC. Removing pipelining registers will decrease the area and power consumption.