
Parallel H.264 Decoding on an Embedded Multicore Processor


1 Parallel H.264 Decoding on an Embedded Multicore Processor
Arnaldo Azevedo (1), Cor Meenderinck (1), Ben Juurlink (1), Andrei Terechko (2), Jan Hoogerbrugge (2), Mauricio Alvarez (3), Alex Ramirez (3,4)
(1) Delft University of Technology, Netherlands
(2) NXP, Netherlands
(3) Barcelona Supercomputing Center, Spain
(4) Universitat Politecnica de Catalunya, Spain
HiPEAC 2009 (The 4th International Conference on High Performance and Embedded Architectures and Compilers)

2 Outline
- Introduction
- 3D-Wave
- 3D-Wave Implementation
- Experimental Results
- Conclusions

3 Introduction
- Industry shift to multicores
- Increasing demand for higher media quality/resolution
- Efficient and scalable exploitation of multicore architectures for video coding
- H.264 is widely used and computationally demanding
- Decoding is part of encoding and more challenging

4 Parallel H.264 Decoding: The H.264 Decoder
[Figure: the H.264 decoding process. Blocks shown: Encoded Bitstream, Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames, Deblocking. Group labels: Parser, Reconstructor, Data-Parallel Processing.]

5 H.264 Parallelization
- Frame-level
  - Motion Compensation introduces inter-frame dependencies
  - Frame-level parallelism is very limited
- Slice-level
  - Slice-level parallelism is uncertain and increases bitrate
[Figure: a frame divided into Slice 1, Slice 2, and Slice 3; frame labels I0, P3, B1, B2, P9, B4, B5, P6]

6 H.264 Parallelization
- MacroBlock-level
  - 2D-Wave: exploits MB-level parallelism
[Figure: the current MB and the neighbouring MBs it depends on for intra prediction (Intra) and the deblocking filter (DF)]

7 H.264 Parallelization
- MacroBlock-level
  - 2D-Wave: exploits MB-level parallelism
  - Full HD: up to 60 MBs in parallel
[Figure: the current MB and the neighbouring MBs it depends on for intra prediction (Intra) and the deblocking filter (DF)]
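The figure of 60 MBs for Full HD can be reproduced with a small counting sketch. It assumes the usual 2D-Wave model, which the slide does not spell out: an MB waits for its left and upper-right neighbours, so MB (x, y) can start no earlier than step x + 2*y. The function name and the 120x68 MB frame size (1920x1088 pixels) are assumptions made for this illustration.

```python
from collections import Counter


def max_wavefront_parallelism(mb_cols: int, mb_rows: int) -> int:
    """Peak number of macroblocks that share one 2D-Wave step."""
    # Assumed dependency model: MB (x, y) waits for (x-1, y) and (x+1, y-1),
    # so the earliest step in which it can be decoded is x + 2*y.
    steps = Counter(x + 2 * y for y in range(mb_rows) for x in range(mb_cols))
    return max(steps.values())


# Full HD coded as 1920x1088 pixels is 120x68 macroblocks of 16x16 pixels.
print(max_wavefront_parallelism(1920 // 16, 1088 // 16))  # 60
```

The peak is only reached in the middle of a frame; at the start and end of each frame far fewer MBs are available, which is one reason the 2D-Wave alone does not keep a many-core busy.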

8 H.264 Parallelization: overview of current strategies
- Frame-level: very limited parallelism
- Slice-level: uncertain parallelism, increases bitrate
- MB-level: reasonable parallelism
None of these is sufficient to leverage a many-core!

9 3D-Wave
[Figure: motion compensation dependencies between frame 0 (I), frame 1 (P), and frame 2 (P)]

10 3D-Wave: maximum parallelism
- For Full HD: maximum available parallelism ranges from [...] MBs!
- Note: this requires more than 200 frames in flight.

11 3D-Wave Implementation
- 3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedia processors
  - TM3270 was designed for SD video processing
  - VLIW-based media processor with SIMD support
- In-house simulator capable of simulating up to 64 cores
- 2D-Wave was already implemented
- Tail submit (proposed by Hoogerbrugge and Terechko [13])
  - Checks the right and down-left MBs
  - Executes one of them if ready and sends the other to the task queue (TQ)
[13] Hoogerbrugge, J., Terechko, A.: "A Multithreaded Multicore System for Embedded Media Processing," Transactions on High-Performance Embedded Architectures and Compilers, 2008.
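Tail submit can be illustrated with a single-threaded sketch. This is not the NXP implementation: the function name, the `pending` counters, and the `decode_mb` callback are hypothetical, and the dependency model (each MB waits for its left and upper-right neighbours, frame at least two MBs wide) is an assumption.

```python
from collections import deque


def decode_frame_tail_submit(cols, rows, decode_mb):
    """Decode one frame in 2D-Wave order using tail submit (cols >= 2)."""
    # pending[y][x]: how many MBs (left, upper-right) that (x, y) still waits for.
    pending = [[(x > 0) + (y > 0 and x + 1 < cols) for x in range(cols)]
               for y in range(rows)]
    task_queue = deque([(0, 0)])  # only the first MB has no dependencies
    while task_queue:
        mb = task_queue.popleft()
        while mb is not None:
            x, y = mb
            decode_mb(x, y)
            ready = []
            # After decoding, check the right and down-left MBs.
            for nx, ny in ((x + 1, y), (x - 1, y + 1)):
                if 0 <= nx < cols and ny < rows:
                    pending[ny][nx] -= 1
                    if pending[ny][nx] == 0:
                        ready.append((nx, ny))
            # Tail submit: continue directly with one ready MB and
            # send the other one (if any) to the task queue.
            mb = ready.pop(0) if ready else None
            task_queue.extend(ready)


order = []
decode_frame_tail_submit(4, 3, lambda x, y: order.append((x, y)))
print(len(order))  # 12
```

Continuing directly with a ready MB saves a round trip through the task queue, which is the point of the technique; in a real multicore the queued MB would be picked up by another core.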

12 3D-Wave Implementation: Reference Frame Buffer Structure
[Figure: Reference Frame Buffer Structure. Labels: Frame 0 to Frame 5, Sync info, Decoder]

13 3D-Wave Implementation: Reference Frame Buffer Structure
[Figure: Parallel Reference Frame Buffer Structure. Labels: Frame 0 to Frame 4, Sync info, Decoder]

14 3D-Wave Implementation: Reference Frame Buffer Structure
[Figure: Parallel Reference Frame Buffer Structure. Labels: Frame 0 to Frame 4, Sync info, Decoder]

15 3D-Wave Implementation: Inter-frame dependencies
- mb_decode checks inter-frame dependencies
- On failure, it inserts the MB in the Kick-Off List of the Ref MB
[Figure: Frame 0 and Frame 1; the Kick-Off List of the Ref MB reads F1;MB(1,3) -> NULL]

16 3D-Wave Implementation: Inter-frame dependencies
- Decoding process continues normally
[Figure: Frame 0 and Frame 1; the Kick-Off List of the Ref MB reads F1;MB(1,3) -> NULL]

17 3D-Wave Implementation: Inter-frame dependencies
- mb_decode checks the Kick-Off List and submits subscribed tasks
[Figure: Frame 0 and Frame 1; the Kick-Off List of the Ref MB reads F1;MB(1,3) -> NULL]

18 3D-Wave Implementation: Inter-frame dependencies
- And the decoding process carries on
[Figure: Frame 0 and Frame 1; the Kick-Off List of the Ref MB reads NULL]
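The four steps above can be sketched as follows. Only the name mb_decode and the subscribed MB F1;MB(1,3) come from the slides; the `MB` class, the position of the reference MB, and the single reference per MB are simplifying assumptions for this illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(eq=False)
class MB:
    frame: int
    x: int
    y: int
    decoded: bool = False
    kick_off_list: list = field(default_factory=list)  # MBs waiting for this one


def mb_decode(mb, ref_mb, task_queue):
    """Try to decode mb; return False if it had to subscribe to ref_mb."""
    if ref_mb is not None and not ref_mb.decoded:
        # Dependency check failed: subscribe to the reference MB and give up.
        ref_mb.kick_off_list.append(mb)
        return False
    mb.decoded = True  # placeholder for the actual reconstruction work
    # Submit every task that subscribed to this MB while it was pending.
    task_queue.extend(mb.kick_off_list)
    mb.kick_off_list.clear()
    return True


task_queue = deque()
ref = MB(frame=0, x=2, y=3)  # hypothetical position of the Ref MB
cur = MB(frame=1, x=1, y=3)  # F1;MB(1,3) from the slides
first_try = mb_decode(cur, ref, task_queue)               # subscribes to ref
ref_done = mb_decode(ref, None, task_queue)               # resubmits cur
retry = mb_decode(task_queue.popleft(), ref, task_queue)  # now succeeds
print(first_try, ref_done, retry)  # False True True
```

A real decoder would need to protect the check and the subscription with synchronization so that a reference MB cannot finish between the two; the sketch is single-threaded and ignores this.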

19 3D-Wave Implementation: Frame Scheduling
- 3D-Wave can have many frames in flight
- A practical implementation requires few frames in flight
- A policy was developed to limit the number of frames in flight
- Implementation
  - Uses the Kick-Off List
  - Subscribes the first MB of the next frame to a specific MB in the current frame
  - The position of that MB defines the number of frames in flight
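A minimal sketch of this policy, assuming the Kick-Off Lists are kept in a dictionary keyed by (frame, x, y). The function names and the trigger position (5, 2) are hypothetical and chosen only for illustration.

```python
from collections import deque


def subscribe_next_frame(kick_off, frame, trigger_mb):
    """Subscribe MB (0, 0) of frame + 1 to trigger_mb of the current frame."""
    x, y = trigger_mb
    kick_off.setdefault((frame, x, y), []).append((frame + 1, 0, 0))


def on_mb_decoded(kick_off, task_queue, frame, x, y):
    """Submit every task subscribed to the MB that was just decoded."""
    task_queue.extend(kick_off.pop((frame, x, y), []))


kick_off, task_queue = {}, deque()
subscribe_next_frame(kick_off, frame=0, trigger_mb=(5, 2))
on_mb_decoded(kick_off, task_queue, 0, 4, 2)  # not the trigger MB
before = list(task_queue)
on_mb_decoded(kick_off, task_queue, 0, 5, 2)  # trigger MB: frame 1 may start
after = list(task_queue)
print(before, after)  # [] [(1, 0, 0)]
```

Choosing a trigger MB later in the frame delays the start of the next frame and so lowers the number of frames in flight; choosing an earlier one allows more overlap.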

20 3D-Wave Implementation: Frame Priority
- Frame latency is an important factor in video decoding
- 3D-Wave interleaves the processing of all frames in flight
- Frame Priority is necessary to limit frame latency in 3D-Wave
- Implementation
  - Splits the Task Queue (TQ) into high- and low-priority task queues
  - Sends the tasks of the frame next in line to the high-priority task queue
  - Checks the high-priority TQ for tasks first and executes from the low-priority TQ otherwise
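The split task queue might look like the sketch below. The class and method names are hypothetical, and the sketch leaves out how tasks already queued at low priority are handled once their frame becomes next in line, which the slides do not describe.

```python
from collections import deque


class PriorityTaskQueues:
    """Two task queues: one for the frame next in line, one for the rest."""

    def __init__(self, next_frame=0):
        self.high = deque()
        self.low = deque()
        self.next_frame = next_frame  # frame that must be finished first

    def submit(self, frame, mb):
        # Tasks of the frame next in line go to the high-priority queue.
        queue = self.high if frame == self.next_frame else self.low
        queue.append((frame, mb))

    def fetch(self):
        # Serve the high-priority queue first, the low-priority one otherwise.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None


tq = PriorityTaskQueues(next_frame=0)
tq.submit(1, (0, 0))  # later frame: low priority
tq.submit(0, (3, 1))  # frame next in line: high priority
print(tq.fetch(), tq.fetch(), tq.fetch())  # (0, (3, 1)) (1, (0, 0)) None
```

Cores still take low-priority work whenever the high-priority queue is empty, so the priority affects latency without idling cores.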


23 Experimental Results
- Uses the highly optimized NXP H.264 decoder
  - Machine-dependent optimizations (e.g. SIMD operations)
  - Machine-independent optimizations (e.g. code restructuring)
- The experiments use all 4 videos from HD-VideoBench [10]
[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: "HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications," IEEE International Symposium on Workload Characterization, 2007.

24 Experimental Results: Methodology
- Entropy decoding results of the entire sequence are buffered
- Sequence contains only I and P frames with one slice
- All frames are scheduled to execute at once
- Reference Frame Buffer keeps all the frames of the sequence
- Presented results are for 25 frames (1 second) of Rush_Hour in Full High Definition (FHD)
- On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second

25 Experimental Results: Scalability
- Efficiency of more than 80% for 64 cores
- Start-up and ramp-down times of the short sequence limit efficiency
- 64 cores are 16x faster than real time for FHD
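The 16x figure is consistent with the other reported numbers, as the following back-of-the-envelope check suggests. It assumes that the efficiency is measured against the single-core 2D-Wave rate of 8 FHD frames per second and that real time means 25 frames per second; 80% is used as a lower bound, so the result is approximate.

```python
single_core_fhd_fps = 8  # single-core 2D-Wave rate for FHD (methodology slide)
cores = 64
efficiency = 0.80        # "more than 80%", used here as a lower bound
real_time_fps = 25       # 25 frames correspond to 1 second

speedup = efficiency * cores                # about 51x over one core
throughput = single_core_fhd_fps * speedup  # about 410 frames per second
times_real_time = throughput / real_time_fps

print(round(speedup, 1), round(throughput, 1), round(times_real_time, 1))
```

Under these assumptions the 64-core configuration comes out at roughly 16 times real time, in line with the slide.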

26 Experimental Results: Frame Scheduling
- FHD Rush_Hour decoding on 16 cores
- Different colors represent different frames
- Frame Scheduling limits the number of frames in flight
- Performance loss is < 5% for at most 6 frames in flight

27 Experimental Results: Frame Scheduling and Priority
- Frame Priority reduces frame latency to the same level as 2D-Wave (10 ms)
- Latency of the 1st frame: 58.5 ms, reduced to 15.1 ms with Frame Scheduling and to 9.2 ms with Frame Scheduling and Priority
- Does not reduce performance significantly (< 1%)
- FHD Rush_Hour decoding on 16 cores

28 Experimental Results: Bandwidth Requirements
- Bandwidth required for 64 cores is approximately 21 GB/s
- 3D-Wave is 20% more bandwidth efficient than 2D-Wave
- Scheduling and Priority reduce locality and increase bandwidth

29 Conclusions
- 3D-Wave scales with high efficiency to a large number of cores
- 3D-Wave allows efficient use of many-core architectures for video processing
- Frame Priority reduces latency to its minimum

30 References
[3] Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: "Parallel Scalability of H.264," First Workshop on Programmability Issues for Multi-Core Computers.
[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: "HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications," IEEE International Symposium on Workload Characterization, 2007.
[13] Hoogerbrugge, J., Terechko, A.: "A Multithreaded Multicore System for Embedded Media Processing," Transactions on High-Performance Embedded Architectures and Compilers, 2008.
Alvarez, M., Ramirez, A., Valero, M., Azevedo, A., Meenderinck, C.H., Juurlink, B.H.H.: "Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture," The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April.
Azevedo, A., Juurlink, B.H.H., Meenderinck, C.H., Terechko, A., Hoogerbrugge, J., Alvarez, M., Ramirez, A., Valero, M.: "A Highly Scalable Parallel Implementation of H.264," Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.

