1 Parallel H.264 Decoding on an Embedded Multicore Processor. Arnaldo Azevedo (1), Cor Meenderinck (1), Ben Juurlink (1), Andrei Terechko (2), Jan Hoogerbrugge (2), Mauricio Alvarez (3), Alex Ramirez (3,4). (1) Delft University of Technology, Netherlands; (2) NXP, Netherlands; (3) Barcelona Supercomputing Center, Spain; (4) Universitat Politecnica de Catalunya, Spain. HiPEAC (The 4th International Conference on High Performance and Embedded Architectures and Compilers) 2009
3 Introduction. Industry shift to multicores. Increasing demand for higher media quality/resolution. Efficient and scalable exploitation of multicore architectures for video coding. H.264 is widely used and computationally demanding. Decoding is part of encoding and is more challenging.
5 H.264 Parallelization. Frame-level: motion compensation introduces inter-frame dependencies, so frame-level parallelism is very limited. Slice-level: slice-level parallelism is uncertain and increases bitrate. [Figure: a frame divided into Slice 1, Slice 2, and Slice 3; frame sequence I0 P3 B1 B2 P9 B4 B5 P6]
6 H.264 Parallelization: Macroblock-level. 2D-Wave exploits MB-level parallelism. [Figure: the current MB and its neighbouring MBs, labelled Intra and DF]
7 H.264 Parallelization: Macroblock-level. 2D-Wave exploits MB-level parallelism. Full HD: up to 60 MBs in parallel. [Figure: the current MB and its neighbouring MBs, labelled Intra and DF]
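The 60-MB figure can be checked with a small sketch. It assumes that a Full HD frame is coded as 1920x1088 pixels (120x68 macroblocks of 16x16 pixels), that each MB waits for its left and up-right neighbours, and that every MB takes one time slot. The function name is illustrative and not part of the presented decoder.

```python
from collections import Counter

def wavefront_parallelism(width_mbs, height_mbs):
    """Return (peak MBs per time slot, number of time slots) for a 2D wave.

    Assumption: MB (x, y) becomes ready one slot after its left neighbour
    (x-1, y) and its up-right neighbour (x+1, y-1), so its earliest slot
    is x + 2*y.
    """
    slots = Counter(x + 2 * y
                    for y in range(height_mbs)
                    for x in range(width_mbs))
    return max(slots.values()), len(slots)

# Assumed Full HD frame size in macroblocks: 120 wide, 68 high.
peak, steps = wavefront_parallelism(120, 68)
print(peak, steps)
```

Under these assumptions the peak is 60 MBs, which matches the slide, but the peak is only reached during a small part of the 254 time slots; the wave ramps up at the start of a frame and ramps down at the end.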
8 H.264 Parallelization: overview of current strategies. Frame-level: very limited parallelism. Slice-level: uncertain parallelism, increases bitrate. MB-level: reasonable parallelism. None of these is sufficient to leverage a many-core!
10 3D-Wave maximum parallelism For full HD: Maximum available parallelism ranges from MBs! Note: This requires >200 frames in flight.
11 3D-Wave Implementation. 3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedia processors. The TM3270 was designed for SD video processing; it is a VLIW-based media processor with SIMD support. An in-house simulator can simulate up to 64 cores. 2D-Wave was already implemented, using tail submit (proposed by Hoogerbrugge and Terechko): after decoding an MB, check the right and down-left MBs, execute one of them if ready, and send the other to the task queue (TQ). Reference: Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
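A minimal single-threaded sketch of the tail-submit idea follows. It is an illustration under stated assumptions, not the NXP implementation: each MB keeps a counter of unfinished left and up-right neighbours, the worker continues directly with one MB that became ready, and any other ready MB goes to the task queue. The function name and data layout are hypothetical.

```python
from collections import deque

def decode_frame_tail_submit(width, height):
    """Return the order in which one worker would visit the MBs of a frame."""
    if width < 2:
        raise ValueError("this sketch assumes at least two MB columns")

    # deps[(x, y)]: unfinished neighbours that MB (x, y) still waits for.
    # Assumption: left and up-right transitively cover up and up-left.
    deps = {}
    for y in range(height):
        for x in range(width):
            left = 1 if x > 0 else 0
            up_right = 1 if (y > 0 and x + 1 < width) else 0
            deps[(x, y)] = left + up_right

    task_queue = deque([(0, 0)])  # only the first MB is ready at the start
    order = []
    while task_queue:
        mb = task_queue.popleft()
        while mb is not None:
            order.append(mb)  # placeholder for the actual MB decoding
            x, y = mb
            ready = []
            # Tail submit: check the right and the down-left MB.
            for nxt in ((x + 1, y), (x - 1, y + 1)):
                if nxt in deps:
                    deps[nxt] -= 1
                    if deps[nxt] == 0:
                        ready.append(nxt)
            # Continue with one ready MB; send the other to the queue.
            mb = ready[0] if ready else None
            task_queue.extend(ready[1:])
    return order

order = decode_frame_tail_submit(4, 3)
print(len(order))
```

In a real multicore decoder, several workers would fetch from the task queue concurrently and the counters would need atomic updates; the sketch only shows the control flow.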
15 3D-Wave Implementation: Inter-frame dependencies. mb_decode checks inter-frame dependencies. On failure, it inserts the MB in the Kick-Off List of the reference MB (Ref MB). [Figure: Frame 0 and Frame 1; Kick-Off List of Ref MB: F1;MB(1,3), NULL]
16 3D-Wave Implementation: Inter-frame dependencies. The decoding process continues normally. [Figure: Frame 0 and Frame 1; Kick-Off List of Ref MB: F1;MB(1,3), NULL]
17 3D-Wave Implementation: Inter-frame dependencies. After decoding the Ref MB, mb_decode checks its Kick-Off List and submits the subscribed tasks. [Figure: Frame 0 and Frame 1; Kick-Off List of Ref MB: F1;MB(1,3), NULL]
18 3D-Wave Implementation: Inter-frame dependencies. The decoding process carries on. [Figure: Frame 0 and Frame 1; Kick-Off List of Ref MB: NULL]
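The Kick-Off List mechanism of slides 15 to 18 can be sketched as follows. Only the name mb_decode comes from the slides; the class, the tuple encoding of MBs, and the position of the reference MB are assumptions made for illustration, and the actual decoding work is left out.

```python
from collections import defaultdict, deque

class KickOffDecoder:
    """Toy model of inter-frame dependency handling with Kick-Off Lists."""

    def __init__(self):
        self.decoded = set()               # MBs that are fully decoded
        self.kick_off = defaultdict(list)  # reference MB -> waiting tasks
        self.task_queue = deque()          # tasks that are ready to run

    def mb_decode(self, mb, ref_mb=None):
        # Inter-frame check: if the reference MB is not decoded yet,
        # subscribe this MB to the reference MB's Kick-Off List and stop.
        if ref_mb is not None and ref_mb not in self.decoded:
            self.kick_off[ref_mb].append((mb, ref_mb))
            return False
        # ... the actual macroblock decoding would happen here ...
        self.decoded.add(mb)
        # Submit every task that subscribed to this MB.
        self.task_queue.extend(self.kick_off.pop(mb, []))
        return True

    def run(self):
        while self.task_queue:
            mb, ref_mb = self.task_queue.popleft()
            self.mb_decode(mb, ref_mb)

decoder = KickOffDecoder()
waiting = ("F1", 1, 3)  # frame 1, MB(1,3), as in the slide
ref = ("F0", 2, 2)      # hypothetical reference MB in frame 0
print(decoder.mb_decode(waiting, ref_mb=ref))  # reference not ready yet
print(decoder.mb_decode(ref))                  # kicks off the waiting MB
decoder.run()
print(waiting in decoder.decoded)
```

The point of the design is that a blocked MB does not occupy a core or spin: it is parked on the list of the MB it waits for and is resubmitted exactly once, when that MB finishes.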
19 3D-Wave Implementation: Frame Scheduling. 3D-Wave can have many frames in flight, but a practical implementation requires only a few frames in flight. A policy was developed to limit the number of frames in flight. The implementation uses the Kick-Off List: it subscribes the first MB of the next frame to a specific MB in the current frame, and the position of that MB defines the number of frames in flight.
20 3D-Wave Implementation: Frame Priority. Frame latency is an important factor in video decoding. 3D-Wave interleaves the processing of all frames in flight, so Frame Priority is necessary to limit frame latency in 3D-Wave. The implementation splits the Task Queue (TQ) into a high-priority and a low-priority task queue, sends the tasks of the frame next-in-line to the high-priority TQ, and executes tasks from the high-priority TQ first, taking tasks from the low-priority TQ only when the high-priority TQ is empty.
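The two-queue policy can be sketched as below. This is a minimal model under assumptions: the class and method names are hypothetical, frames are identified by integers, and tasks already placed in the low-priority queue are not moved when the next-in-line frame changes.

```python
from collections import deque

class PriorityTaskQueues:
    """Task queue split into a high-priority and a low-priority queue."""

    def __init__(self):
        self.high = deque()
        self.low = deque()
        self.next_in_line = 0  # frame that should be completed first

    def submit(self, frame, task):
        # Tasks of the frame next-in-line go to the high-priority queue.
        if frame == self.next_in_line:
            self.high.append((frame, task))
        else:
            self.low.append((frame, task))

    def fetch(self):
        # Serve the high-priority queue first; otherwise use low priority.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None

    def frame_done(self):
        # Assumption: the caller signals when the next-in-line frame ends.
        self.next_in_line += 1

queues = PriorityTaskQueues()
queues.submit(1, "MB(0,0)")  # later frame, low priority
queues.submit(0, "MB(5,7)")  # next-in-line frame, high priority
print(queues.fetch())
print(queues.fetch())
```

Although the frame 1 task was submitted first, the frame 0 task is fetched first, which is how the policy keeps the latency of the oldest frame low.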
23 Experimental Results. The experiments use the highly optimized NXP H.264 decoder, which includes machine-dependent optimizations (e.g. SIMD operations) and machine-independent optimizations (e.g. code restructuring). The experiments use all 4 videos from HD-VideoBench. Reference: Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.
24 Experimental Results: Methodology. Entropy decoding results of the entire sequence are buffered. The sequence contains only I and P frames with one slice. All frames are scheduled to execute at once. The Reference Frame Buffer keeps all the frames of the sequence. Presented results are for 25 frames (1 second) of Rush_Hour in Full High Definition (FHD). On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second, respectively.
25 Experimental Results: Scalability. Efficiency is more than 80% for 64 cores. The start-up and ramp-down times of the short sequence limit efficiency. With 64 cores, decoding is 16x faster than real time for FHD.
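The 16x figure is consistent with the other numbers in the slides, as the following back-of-the-envelope check shows. It assumes that real time means 25 frames per second (the slides equate 25 frames with 1 second), takes the single-core FHD rate of 8 frames per second from the methodology slide, and uses 80% as a lower bound on efficiency; the variable names are illustrative.

```python
single_core_fhd_fps = 8   # single-core 2D-Wave rate from the slides
cores = 64
efficiency = 0.80         # lower bound; the slides state "more than 80%"
real_time_fps = 25        # assumption: 25 frames correspond to 1 second

speedup = cores * efficiency                  # parallel speedup estimate
parallel_fps = single_core_fhd_fps * speedup  # estimated FHD frames/second
times_real_time = parallel_fps / real_time_fps

print(f"speedup ~{speedup:.1f}x, ~{parallel_fps:.0f} fps, "
      f"~{times_real_time:.1f}x real time")
```

The estimate lands slightly above 16x real time, in line with the reported result.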
26 Experimental Results: Frame Scheduling. FHD Rush_Hour decoding on 16 cores. [Figure: different colors represent different frames] Frame Scheduling limits the number of frames in flight. Performance loss is < 5% for at most 6 frames in flight.
27 Experimental Results: Frame Scheduling and Priority. FHD Rush_Hour decoding on 16 cores. Frame Priority reduces frame latency to the same level as 2D-Wave (10 ms). Latency of the 1st frame: 58.5 ms; with Frame Scheduling: 15.1 ms; with Frame Scheduling and Priority: 9.2 ms. Frame Priority does not reduce performance significantly (< 1%).
28 Experimental Results: Bandwidth Requirements. The bandwidth required for 64 cores is approximately 21 GB/s. 3D-Wave is 20% more bandwidth efficient than 2D-Wave. Scheduling and Priority reduce locality and increase bandwidth.
29 Conclusions. 3D-Wave scales with high efficiency to a large number of cores. 3D-Wave allows efficient use of many-core architectures for video processing. Frame priority reduces latency to its minimum.
30 References
Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: “Parallel Scalability of H.264,” First Workshop on Programmability Issues for Multi-Core Computers.
Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization.
Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers.
Alvarez, M., Ramirez, A., Valero, M., Azevedo, A., Meenderinck, C.H., Juurlink, B.H.H.: “Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture,” The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April.
Azevedo, A., Juurlink, B.H.H., Meenderinck, C.H., Terechko, A., Hoogerbrugge, J., Alvarez, M., Ramirez, A., Valero, M.: “A Highly Scalable Parallel Implementation of H.264,” Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.