Overview of the H.264/AVC Video Coding Standard — Presentation transcript
1. Overview of the H.264/AVC Video Coding Standard
T. Wiegand, G. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 13, No. 7, July 2003.
CMPT 820: Multimedia Systems
2. Outline
Overview
Network Abstraction Layer (NAL)
Video Coding Layer (VCL)
Profiles and Applications
Feature Highlights
Conclusions
3. Evolution of Video Compression Standards
ITU-T and MPEG standards:
H.261: video telephony
MPEG-1: Video-CD
H.262/MPEG-2: digital TV/DVD
H.263: video conferencing
MPEG-4 Visual: object-based coding
H.264/MPEG-4 AVC
MPEG-2: an enabling technology for digital television systems; initially designed to extend MPEG-1 and support interlaced video coding. Used for transmission of SD/HD TV signals and for storage of SD video on DVDs.
H.263/+/++: conversational/communication applications. Supports diverse networks and adapts to their characteristics (e.g., PSTN, mobile, LAN/Internet), taking loss and error-robustness requirements into account.
MPEG-4 Visual: object-based video coding; provides shape coding.
4. H.264/AVC Coding Standard
Various applications:
Broadcast: cable, satellite, terrestrial, and DSL
Storage: DVDs (HD DVD and Blu-ray)
Video conferencing: over different networks
Multimedia streaming: live and on-demand
Multimedia Messaging Services (MMS)
Challenge: how to handle all these applications and networks. Required: flexibility and customizability.
H.264/AVC considers a variety of applications over heterogeneous networks, such as broadcast over a unidirectional communication channel, optical/magnetic media that supports sequential playout as well as random access (jumping to arbitrary points), low bit rate video conferencing, possibly high bit rate video streaming, and digital video mail.
These applications often impose different requirements (e.g., high/low bit rates), while the networks have different characteristics, e.g., low/high latency (satellites) and low/high loss rates (wired/wireless networks).
5. Structure of H.264/AVC Codec
Layered design:
Network Abstraction Layer (NAL): formats video and metadata for a variety of networks
Video Coding Layer (VCL): represents video in an efficient way
Scope of the H.264 standard
To address the heterogeneity of applications and networks, H.264/AVC adopts a layered design, where the NAL handles transport adaptation and the VCL handles compression.
6. Outline
Overview
Network Abstraction Layer (NAL)
Video Coding Layer (VCL)
Profiles and Applications
Feature Highlights
Conclusions
7. Network Abstraction Layer
Provides network friendliness for sending video data over various network transports, such as:
RTP/IP for Internet applications
MPEG-2 streams for broadcast services
ISO file formats for storage applications
We present a few NAL concepts.
The main objective of the NAL is to enable simple and effective customization of sending video data over different network systems.
8. NAL Units
Packets consisting of video data, with a short packet header (one byte).
Support two types of transports:
stream-oriented: no free unit boundaries, so a 3-byte start code prefix marks each unit
packet-oriented: the start code prefix would be a waste, since the transport already frames each unit
Can be classified into:
VCL units: data for video pictures
non-VCL units: metadata and additional info
NAL units are packets that carry video data. A NAL unit has a very short packet header, and most of its capacity is used for carrying payload. We next present non-VCL units.
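The one-byte header and 3-byte start code described above can be sketched in a few lines. This is an illustrative sketch, not a full parser: it handles only the plain 0x000001 prefix (real streams may also use a 4-byte form and emulation-prevention bytes), and the function names are my own.

```python
def parse_nal_header(first_byte):
    """Split the one-byte NAL unit header into its three fields."""
    forbidden_zero_bit = (first_byte >> 7) & 0x1   # must be 0 in a valid stream
    nal_ref_idc = (first_byte >> 5) & 0x3          # 0 means not used as a reference
    nal_unit_type = first_byte & 0x1F              # e.g. 5 = IDR slice, 7 = SPS, 8 = PPS
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type

def find_start_codes(stream):
    """Return offsets just past each 3-byte 0x000001 start code prefix."""
    offsets = []
    i = 0
    while i + 2 < len(stream):
        if stream[i] == 0 and stream[i + 1] == 0 and stream[i + 2] == 1:
            offsets.append(i + 3)   # NAL unit payload begins here
            i += 3
        else:
            i += 1
    return offsets
```

Note how little framing overhead there is: in the packet-oriented case, `find_start_codes` is unnecessary because the transport itself delimits each NAL unit.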
9. Non-VCL NAL Units
Two types of non-VCL NAL units:
Parameter sets: headers shared by a large number of VCL NAL units. A VCL NAL unit has a pointer to its picture parameter set, and a picture parameter set points to its sequence parameter set.
Supplemental enhancement information (SEI): optional info for higher-quality reconstruction and/or better application usability.
Sent over in-band or out-of-band channels.
Parameter set units reduce traffic by exploiting the header redundancy among VCL NAL units. There are two types of parameter sets: a picture parameter set contains headers for one or a few pictures, while a sequence parameter set contains (less frequently changing) headers for even more pictures.
SEI units are not mandatory and are used to improve playout quality. For example, rate-distortion info can be carried in SEI units. SEI is also quite extensible, and applications can define their own custom SEI messages.
Out-of-band transmission often enables more flexibility, such as sending parameter sets with a stronger FEC code, possible retransmissions, and higher routing priority.
10. Access Units
A set of NAL units; decoding an access unit results in one picture.
Structure:
Delimiter: for seeking in a stream
SEI: timing and other info
Primary coded picture: VCL units
Redundant coded pictures: for error recovery
Now we know the building blocks, but how do we put NAL units together to represent one picture?
The delimiter supports seeking in the byte-stream format. SEI carries timing information and other supplemental data that may enhance application usability. A set of VCL NAL units together composes the primary coded picture, and several redundant slices may be included for error recovery.
Redundant coded pictures are usually not decoded; only when decoders fail to decode the primary coded picture will they try to decode a redundant coded picture.
11. Video Sequences and IDR Frames
Sequence: an independently decodable NAL unit stream
does not need NAL units from other sequences
has one sequence parameter set
starts with an instantaneous decoding refresh (IDR) access unit
IDR frames: random access points
intra-coded frames
no picture following an IDR frame will require reference to pictures prior to the IDR frame
decoders mark buffered reference pictures unusable once they see an IDR frame
Intuitively, aggregating several access units (pictures) gives us a sequence; but a video sequence carries additional requirements in H.264/AVC.
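The effect of an IDR on the decoded picture buffer can be illustrated with a toy simulation. This is a hedged sketch of the idea only (real decoders manage the DPB with explicit memory-management commands); the function and variable names are hypothetical.

```python
def track_reference_pictures(pictures):
    """Show how the reference picture buffer reacts to IDR frames.

    pictures is a sequence of (picture_id, is_idr) pairs. On an IDR, all
    buffered reference pictures are marked unusable (flushed), so no later
    picture can reference anything coded before the IDR.
    """
    dpb = []       # decoded picture buffer (reference pictures only)
    history = []   # references available when each picture is decoded
    for pic_id, is_idr in pictures:
        if is_idr:
            dpb.clear()            # IDR: flush all buffered references
        history.append(list(dpb))
        dpb.append(pic_id)         # assume every picture becomes a reference
    return history
```

After the second IDR (picture 3 below), picture 4 can only see picture 3, which is exactly what makes an IDR a safe random access point.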
12. Outline
Overview
Network Abstraction Layer (NAL)
Video Coding Layer (VCL)
Profiles and Applications
Feature Highlights
Conclusions
13. Video Coding Layer (VCL)
(Like other) hybrid video coding: H.264/AVC represents pictures in macroblocks.
Motion compensation: temporal redundancy
Transform: spatial redundancy
Small improvements add up to huge gains by combining many coding tools together.
Pictures/Frames/Fields
Fields: the top/bottom field contains the even/odd rows
Interlaced: the two fields were captured at different times
Progressive: otherwise
Like prior video coding standards, H.264 is a hybrid coder, which uses motion compensation to exploit temporal redundancy and a transform to exploit spatial redundancy.
These coding tools are not totally new. In fact, many of them were proposed years ago but were thought impractical due to computational complexity; they have become acceptable today because of advances in computer hardware.
Each picture contains a frame, and each frame has two fields.
14. Macroblocks and Slices
Fixed-size MBs: 16x16 for luma and 8x8 for chroma.
Slice: a set of MBs that can be decoded without use of other slices.
I slice: intra-prediction (I-MBs)
P slice: possibly one inter-prediction signal (I- and P-MBs)
B slice: up to two inter-prediction signals (I-, P-, and B-MBs)
SP slice: efficient switching among streams
SI slice: used in conjunction with SP slices
The only reason we may require adjacent slices is the deblocking filter.
The main feature of SP frames is that identical SP frames can be reconstructed even when different reference frames are used for their prediction. SI frames are used in conjunction with SP frames. This property makes them useful for applications such as random access, adapting to network conditions, and error recovery/resilience.
We next talk about the ordering of MBs.
15. Flexible Macroblock Ordering (FMO)
MBs in a slice are in raster order; slice groups allow more flexible orderings.
Each slice group contains one or several slices.
Possible usages:
Region-of-interest (ROI): several foreground object groups and one left-over background group
Checker-board for video conferencing: useful for error concealment
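The checker-board usage above amounts to a map from macroblock position to slice group. A minimal sketch of that idea, assuming two slice groups; this mirrors the spirit of the dispersed slice-group map, not the exact formula in the standard, and the function names are my own.

```python
def checkerboard_slice_group(mb_x, mb_y):
    """Assign each macroblock to one of two slice groups, checker-board style."""
    return (mb_x + mb_y) % 2

def slice_group_map(width_mbs, height_mbs):
    """Build the full MB-to-slice-group map for a picture."""
    return [[checkerboard_slice_group(x, y) for x in range(width_mbs)]
            for y in range(height_mbs)]
```

If one slice group is lost, every missing MB has all four neighbors in the other group, which is what makes concealment by interpolation effective.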
16. Adaptive Field Coding
Two fields of a frame can be coded as:
a single frame (frame mode)
two separate fields (field mode)
a single frame with adaptive mode (mixed mode)
Picture-adaptive frame/field (PAFF): the frame/field decision is made at the frame level; 16%-20% bit rate reduction over frame-only coding.
Macroblock-adaptive frame/field (MBAFF): the frame/field decision is made at the MB level; 14%-16% bit rate reduction over PAFF; suitable for interlaced, high-motion content.
Why different modes? In interlaced frames, two adjacent rows may have low correlation, because the fields were captured at different time instants. For regions with more movement, field mode is better; for regions with less movement, frame mode is better. This is why we want MBAFF.
[Ref] Chan Tsang, Project Presentation: Fast Macroblock Adaptive Coding in H.264
17. Intra-frame Prediction
In the spatial domain, samples to the left of and/or above a MB are used to predict the samples within it.
Types of intra-frame prediction:
Intra_4x4: detailed luma blocks
Intra_16x16: smooth luma blocks
Chroma_8x8: similar to Intra_16x16, as chroma components are smooth
I_PCM: bypass prediction/transform and send raw samples; useful for anomalous pictures, lossless coding, and predictable bit rate
Unlike MPEG-4, intra-frame prediction in H.264 is done in the spatial domain.
There are four types of intra-frame prediction: 4x4, 16x16, chroma, and I_PCM.
Why I_PCM: (1) anomalous content, for which sending raw samples can be the more efficient coding, (2) lossless coding, and (3) predictable bit rate.
18. Intra_4x4 Prediction
Samples in a 4x4 block are predicted using 13 neighboring samples.
9 prediction modes: 1 DC and 8 directional.
Sample D is reused if E-H are not available.
DC: use one value to predict the whole block.
For example, in mode 0 (vertical) the samples above are copied down into the 4x4 block.
E-H may be unavailable because of (1) decoding order or (2) lying outside the subject slice.
For the first macroblock coded in a picture, we use DC prediction.
19. Sample Intra_4x4 Prediction
Interpolation is used in some modes.
[Ref] Foreman sequence
20. Intra_16x16 Prediction
4 modes: vertical, horizontal, DC, and plane.
Uses 33 neighboring samples.
Good for smooth/flat areas.
21. Inter-Prediction in P Slices
Two-level segmentation of MBs:
Luma: MBs are divided into at most 4 partitions (as small as 8x8); 8x8 partitions are further divided into at most 4 sub-partitions.
Chroma: half size horizontally and vertically.
Maximum of 16 motion vectors per MB.
Why different partition sizes? Motion vectors are expensive.
Large partitions: smooth areas. Small partitions: detailed areas.
Note: motion vectors are coded using predictive coding.
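The "predictive coding" of motion vectors mentioned above typically means coding each MV as a difference from a predictor formed from neighboring blocks. A hedged sketch of the common component-wise median predictor, assuming all three neighbors exist (the standard has special cases when they do not); the names are my own.

```python
def median_mv_predictor(mv_left, mv_top, mv_topright):
    """Predict an MV as the component-wise median of three neighbor MVs.

    Only the (usually small) difference between the actual MV and this
    predictor needs to be entropy-coded.
    """
    def median3(a, b, c):
        return sorted([a, b, c])[1]

    return (median3(mv_left[0], mv_top[0], mv_topright[0]),
            median3(mv_left[1], mv_top[1], mv_topright[1]))
```

With smooth motion the neighbors agree, the predictor is accurate, and the residual MV costs very few bits, which is why small partitions remain affordable in detailed areas.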
23. Inter-Prediction Accuracy
1/4-pixel accuracy for luma, 1/8-pixel for chroma.
Half-pixel samples: 6-tap FIR filter.
Quarter-pixel samples: average of neighboring integer/half-pixel samples.
Chroma predictions: bilinear interpolation.
[Figure: integer-, half-, and quarter-sample positions for luma interpolation]
This defines the granularity of motion vectors. FIR stands for finite impulse response. Some half-sample positions can be computed in two equivalent ways (filtering horizontally first or vertically first). Actual computations are done with additions, bit shifts, and integer arithmetic.
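The half-pixel interpolation can be sketched for one sample. The tap weights (1, -5, 20, 20, -5, 1) and the round-and-shift are the standard ones; the variable names for the six integer samples are my own, and this sketch handles a single horizontal position rather than a full block.

```python
def halfpel_6tap(e, f, g, h, i, j):
    """Half-sample luma value between samples g and h from 6 integer samples.

    Applies the 6-tap FIR (1, -5, 20, 20, -5, 1), rounds by adding 16
    before the shift by 5 (i.e. dividing by 32), and clips to 8 bits.
    """
    val = (e - 5 * f + 20 * g + 20 * h - 5 * i + j + 16) >> 5
    return max(0, min(255, val))
```

Note that only integer additions, multiplications by small constants, and a shift are needed, matching the slide's remark about integer arithmetic.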
24. Multiframe Inter-Prediction in P Slices
More than one prior reference picture can be used (by different MBs).
Encoders and decoders buffer the same reference pictures for inter-prediction.
A reference index is used when coding MVs; MVs for regions smaller than 8x8 use the same index for all MVs in the 8x8 region.
P_skip mode:
no residual signal, MVs, or reference index is sent
buffered frame 0 is used as the reference picture
the neighbors' MVs are used
Large areas with no change, or with constant motion like slow panning, can be represented with very few bits.
Prior standards often couple the playout order with the reference order; e.g., a B frame uses the two adjacent I or P frames for inter-prediction. This is no longer true in H.264/AVC: several reference frames can be used, organized as a buffered frame list.
Full-sample, half-sample, and quarter-sample predictions represent different degrees of low-pass filtering, chosen automatically by motion estimation.
P_skip is a mode for a P-predicted macroblock (16x16).
25. Multiframe Inter-Prediction in B Slices
Weighted average of 2 predictions.
B slices can be used as references.
Two reference picture lists are used.
One of four prediction methods for each partition:
list 0
list 1
bi-predictive
direct prediction: inferred from prior MBs
A MB can also be coded in B_skip mode (similar to P_skip).
26. 4x4 Integer Transform
Why a smaller, integer transform?
Uses only additions and shifts; an exact inverse transform is possible, so there is no decoding mismatch.
Not too much residue left to code after prediction.
Less noise around edges (ringing or mosquito noise).
Fewer computations and a shorter data type (16-bit).
An approximation to the 4x4 DCT: Y = H X H^T (scaling is folded into quantization), where
H = [  1  1  1  1
       2  1 -1 -2
       1 -1 -1  1
       1 -2  2 -1 ]
Ringing is still there; it is just smaller and harder to see.
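The core transform Y = H X H^T can be verified in a few lines. A reference sketch using plain matrix multiplication for clarity (a real implementation uses the butterfly add/shift structure, and folds the normalizing scale factors into quantization); the helper names are my own.

```python
# The 4x4 integer transform matrix approximating the DCT.
H = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul4(a, b):
    """4x4 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def forward_transform(block):
    """Y = H * X * H^T on a 4x4 residual block (exact integer arithmetic)."""
    h_t = [[H[j][i] for j in range(4)] for i in range(4)]
    return matmul4(matmul4(H, block), h_t)
```

A constant (flat) block produces a single nonzero DC coefficient, which is exactly the energy compaction that makes the second-level DC transform of the next slide worthwhile.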
27. 2nd Transform and Quantization Parameter
2nd transform: the Intra_16x16 and chroma modes are for smooth areas, so their DC coefficients are transformed again to cover the whole MB.
The quantization step (QS) is adjusted by an exponential function of the quantization parameter (QP) to cover a wider range of step sizes:
QP increases by 6 => QS doubles
QP increases by 1 => QS increases by about 12% => bit rate decreases by about 12%
The 2nd transform is needed to take advantage of the spatial redundancy in flat areas.
In addition to covering a wider range, the exponential quantization step also simplifies rate control.
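The QP-to-QS relation above is easy to state as a formula. A sketch assuming the commonly cited base step of 0.625 at QP 0 (the exact per-QP values come from a table in the standard; the function name and default are my own):

```python
def quant_step(qp, qs0=0.625):
    """Quantization step as an exponential function of QP.

    The step doubles every 6 QP increments, so each single QP step
    multiplies it by 2**(1/6), roughly a 12% increase.
    """
    return qs0 * 2 ** (qp / 6)
```

Because one QP step always changes the rate by roughly the same percentage, a rate controller can nudge QP up or down without consulting a table, which is the simplification the slide mentions.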
28. Entropy Coding
Non-transform-coefficient elements: a single infinite-extent codeword table (Exp-Golomb codes).
Transform coefficients:
Context-Adaptive Variable Length Coding (CAVLC): several VLC tables are switched depending on previously transmitted data; better than a single VLC table.
Context-Adaptive Binary Arithmetic Coding (CABAC): more flexible symbol probabilities than CAVLC; 5%-15% rate reduction; efficient, multiplication-free implementation.
Previous standards usually have separate VLC tables for different syntax elements, because the elements tend to have different probability characteristics. In H.264/AVC a simple and efficient infinite-extent codeword table is shared by all elements; for each element, only a mapping onto this single codebook is required.
Arithmetic coding assigns a non-integer number of bits to each symbol.
29. In-loop Deblocking Filter
Operates within the coding loop; filtered frames are used as reference frames, which improves coding efficiency.
Adaptive deblocking must determine:
blocking artifacts vs. object edges
strong vs. weak deblocking
Intuitions:
A large difference near a block edge is likely a blocking artifact.
If the difference is too large to be explained by the QP difference, it is likely a real edge.
In-loop filtering is better than a post-filter because it improves the quality of the references as well.
E.g., filter p0 and q0 if |p0 - q0| < alpha(QP), |p1 - p0| < beta(QP), and |q1 - q0| < beta(QP).
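The filtering decision above can be sketched directly. A hedged sketch: alpha and beta are QP-dependent thresholds taken from tables in the standard, so here they are simply passed in; p1, p0 sit on one side of the block edge and q0, q1 on the other, and the function name is my own.

```python
def should_filter_edge(p1, p0, q0, q1, alpha, beta):
    """Decide whether to smooth samples p0/q0 across a block edge.

    A moderate step across the edge (below alpha, which grows with QP)
    with smooth samples on each side (below beta) is treated as a
    blocking artifact; a larger step is assumed to be a real object edge
    and is left untouched.
    """
    return (abs(p0 - q0) < alpha and
            abs(p1 - p0) < beta and
            abs(q1 - q0) < beta)
```

At high QP, alpha and beta are large and more edges get filtered, matching the intuition that coarse quantization produces stronger blocking artifacts.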
30. Hypothetical Reference Decoder (HRD)
A standard receiver buffer model: encoders must produce bit streams that are decodable by the HRD.
Two buffers:
Coded picture buffer (CPB): models bit arrival and removal times
Decoded picture buffer (DPB): models frame decode and output times in the reference frame lists
So, if a designer mimics the behavior of the HRD, the resulting decoder will work.
31. Outline
Overview
Network Abstraction Layer (NAL)
Video Coding Layer (VCL)
Profiles and Applications
Feature Highlights
Conclusions
32. Profiles and Applications
A profile defines a set of coding tools and algorithms, and serves as a conformance point for interoperability.
3 profiles for different applications:
Baseline: video conferencing (low-latency, real-time)
Main: broadcast, media storage, digital cinema
Extended: streaming over IP (wired/wireless)
15 levels constrain picture size, decoding rate (MB/s), bit rate, and buffer size.
[Ref] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video Coding with H.264/AVC: Tools, Performance and Complexity," IEEE Circuits and Systems Magazine, 4(1), May 2004.
33. Outline
Overview
Network Abstraction Layer (NAL)
Video Coding Layer (VCL)
Profiles and Applications
Feature Highlights
Conclusions
34. Feature Highlights -- Prediction
Variable block-size motion compensation
Quarter-sample-accurate motion compensation
Motion vectors over picture boundaries
Multiple reference pictures
Weighted bi-directional prediction
Decoupling of referencing order from display order
Decoupling of prediction mode from reference capability (B frames can be used as references)
Improved skip/direct modes
Intra prediction in the spatial domain
In-loop deblocking filter
37. Outline
Overview
Network Abstraction Layer (NAL)
Video Coding Layer (VCL)
Profiles and Applications
Feature Highlights
Conclusions
38. Conclusions
Key improvements:
Enhanced prediction (intra- and inter-)
Small-block-size, exact-match transform
Adaptive in-loop deblocking filter
Enhanced entropy coding methods
[Ref] G. Sullivan and T. Wiegand, "Video Compression — From Concepts to the H.264/AVC Standard," Proc. of the IEEE, 93(1), Jan 2005.