Published by Joleen Francis. Modified over 9 years ago.
1
A Highly Parallel Framework for HEVC Coding Unit Partitioning Tree Decision on Many-core Processors Chenggang Yan, Yongdong Zhang, Jizheng Xu, Feng Dai, Liang Li, Qionghai Dai, and Feng Wu. IEEE SIGNAL PROCESSING LETTERS, VOL. 21, NO. 5, MAY 2014
2
Outline: Introduction; Related Work; Proposed Method; Experimental Results; Conclusion
3
Introduction (1/3) In HEVC, each frame is divided into non-overlapping CTUs, which can be recursively split into smaller CUs. For a CTU, the CU partitioning tree (CUPT) controls how the CTU is coded using CUs with variable block sizes and coding modes. The price to be paid for higher coding efficiency is higher computational complexity.
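The recursive split described on this slide can be illustrated with a minimal sketch. It is not the HM implementation: the function names, the 64x64 CTU size, the 8x8 minimum CU size, and the toy cost table are all assumptions chosen for illustration. A CU is kept whole unless the summed cost of its four quadrants is strictly lower.

```python
def best_partition(x, y, size, rd_cost, min_size=8):
    """Return (best_cost, tree) for the CU at (x, y) with the given size.

    rd_cost is a hypothetical callback giving the RD cost of coding the
    block without splitting. tree is either a leaf (x, y, size) or a
    list of four sub-trees.
    """
    cost_here = rd_cost(x, y, size)
    if size <= min_size:
        return cost_here, (x, y, size)
    half = size // 2
    # Visit the four quadrants in raster order.
    children = [
        best_partition(x + dx, y + dy, half, rd_cost, min_size)
        for dy in (0, half)
        for dx in (0, half)
    ]
    split_cost = sum(cost for cost, _ in children)
    if split_cost < cost_here:
        return split_cost, [tree for _, tree in children]
    return cost_here, (x, y, size)


def toy_rd_cost(x, y, size):
    # Made-up costs: 64x64 is expensive, 32x32 is cheap, smaller blocks cost 20.
    return {64: 100, 32: 10}.get(size, 20)


cost, tree = best_partition(0, 0, 64, toy_rd_cost)
print(cost, tree)
```

With these toy costs the CTU is split once into four 32x32 CUs. Every branch of the quad-tree is still evaluated, which is the full-search cost that the rest of the presentation aims to parallelize.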
4
Introduction (2/3) To speed up the decision process of CUPT, many researchers have tried to reduce the search space by avoiding searching the full branches of the quad-tree [10]. In order to guarantee the coding efficiency, many branches of the quad-tree can’t be skipped, and the speedup is no more than two times. Many researchers only consider the RD-based intra mode selection, while inter mode selection is much more time-consuming.
[10] L. Shen, Z. Liu, and X. Zhang et al., “An effective CU size decision method for HEVC encoders,” IEEE Trans. Multimedia, vol. 15, pp. 465–470, Jan. 2013.
5
Introduction (3/3) Many-core processors are good candidates for speeding up compression algorithms. Efficient parallelization of CUPT decision (CUPTD) on many-core processors is challenging, because CUPTD has complicated data dependencies. If CUPTD isn’t extensively parallelizable, cores will be left unused and performance might suffer.
6
Related Work (1/3) HEVC CU Partition Tree Decision (CUPTD)
7
Related Work (2/3) For RD-based intra prediction: Instead of applying intra coding at the PU level, HEVC conducts intra prediction sequentially at the TU level, which always utilizes the nearest neighboring reference samples from the already reconstructed TUs. To enhance coding efficiency, HEVC provides as many as 35 prediction modes. Just as in H.264/AVC, the left, above, and above-right neighboring reconstructed samples are used for intra prediction.
8
Related Work (3/3) For RD-based inter prediction: The best motion vector predictor is selected from a given advanced motion vector prediction candidate list (AMVPCL). The AMVPCL is composed of both spatial and temporal candidates. Spatial candidates need the motion information of the neighboring left, lower-left, upper, upper-left, and upper-right PUs. Because of RD-based intra/inter prediction, the search of the current CU branch may have data dependencies on its neighboring left, lower-left, upper, upper-left, and upper-right CU branches.
9
Proposed Method A (1/2) Problem Formulation
10
Proposed Method A (2/2)
11
Proposed Method B (1/3) CTU-Level Parallelism: The best RD costs of the current CTU’s neighboring left, upper, upper-left, and upper-right CTUs are computed first. The current CTU therefore has data dependencies on its neighboring left, upper, upper-left, and upper-right CTUs. We use the same DAG-based order as described in our previous work [14] to parallelize CTUs.
[14] C. Yan et al., “Highly parallel framework for HEVC motion estimation on many-core platform,” in Data Compression Conf., Snowbird, UT, 2013, pp. 63–72.
12
Proposed Method B (2/3) Generate a DAG to capture the dependency relationships of CTUs. The DAG consists of a set of vertices V and edges E. Each data dependency between CTUs is an edge. Once a CTU has been processed, it is removed from the DAG.
13
Proposed Method B (3/3)
14
Proposed Method B(1/)
Step 1: Initialize DQ and CM. DQ is a waiting queue. CM records, for each CTU, the number of related CTUs that it still depends on.
Step 2: When some values in CM become zero, get the corresponding coordinates and push them into DQ.
Step 3: Get coordinates from DQ and process the corresponding CTUs in parallel on the many-core platform.
Step 4: Update CM. When the CTU with coordinate (i, j) has been processed, the values at coordinates (i+1, j), (i+1, j-1), (i, j+1), and (i+1, j+1) in CM are each decremented by one.
Step 5: Repeat steps 2 to 4 until all CTUs in the frame have been processed.
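The five steps above can be sketched as a small single-threaded simulation. This is an illustration under stated assumptions, not the authors' code: `schedule_ctus` is a hypothetical name, (i, j) is assumed to mean (row, column), which is the reading under which the four updated coordinates are exactly the CTUs that depend on (i, j), and each "wave" stands for the batch of CTUs that would be dispatched to cores at the same time.

```python
from collections import deque


def schedule_ctus(rows, cols):
    """Return a list of waves; each wave lists CTUs that can run in parallel."""

    def in_frame(i, j):
        return 0 <= i < rows and 0 <= j < cols

    # Step 1: CM holds, for each CTU, how many of its left, upper-left,
    # upper, and upper-right neighbors exist and are still unprocessed.
    cm = {}
    for i in range(rows):
        for j in range(cols):
            neighbors = [(i, j - 1), (i - 1, j - 1), (i - 1, j), (i - 1, j + 1)]
            cm[(i, j)] = sum(in_frame(a, b) for a, b in neighbors)

    # Step 2: CTUs whose count is zero are ready and go into DQ.
    dq = deque(pos for pos, count in cm.items() if count == 0)

    waves = []
    while dq:  # Step 5: repeat until the frame is finished.
        # Step 3: everything currently in DQ could be processed in parallel.
        batch = list(dq)
        dq.clear()
        waves.append(batch)
        # Step 4: update CM for the CTUs that depend on each processed CTU.
        for i, j in batch:
            for pos in [(i + 1, j), (i + 1, j - 1), (i, j + 1), (i + 1, j + 1)]:
                if pos in cm:
                    cm[pos] -= 1
                    if cm[pos] == 0:
                        dq.append(pos)
    return waves


for t, wave in enumerate(schedule_ctus(3, 4)):
    print(t, wave)
```

For a frame with at least two CTU columns, the CTU at (i, j) lands in wave j + 2*i, the familiar wavefront pattern. A real implementation would hand each batch to worker cores and update CM as individual CTUs finish rather than waiting for a whole wave.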
15
Proposed Method C (1/3)
16
Proposed Method C (2/3) CICUs: The CICU’s left boundary overlaps the CTU’s left boundary. The CICU’s upper boundary overlaps the CTU’s upper boundary.
17
Proposed Method C (3/3) PICUs: PICUs do not meet the requirements of CICUs. The PICU’s left boundary overlaps the CTU’s left boundary, or the neighboring left largest-size CU has been computed. The PICU’s upper boundary overlaps the CTU’s upper boundary, or the neighboring upper and upper-right largest-size CUs have been computed.
18
Experimental Results: To compare our proposed method with serial execution, we adopt an encoder migrated from the HEVC reference software HM7.0 without any optimization. The experimental platform of this letter is Tile64, a member of the TILERA many-core family that contains 64 processing cores [17].
[17] S. Bell et al., “TILE64-Processor: A 64-core SoC with mesh,” in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2008, pp. 88–598.
19
Experimental Results
20
Experimental Results
21
Conclusion: We propose an efficient parallel framework for HEVC CUPTD on many-core processors. Experiments conducted on the Tile64 platform demonstrate that our method reduces encoding time compared with the default encoding scheme in HM7.0.