Parallel H.264 Decoding on an Embedded Multicore Processor


1 Parallel H.264 Decoding on an Embedded Multicore Processor
Arnaldo Azevedo (1), Cor Meenderinck (1), Ben Juurlink (1), Andrei Terechko (2), Jan Hoogerbrugge (2), Mauricio Alvarez (3), Alex Ramirez (3, 4)
(1) Delft University of Technology, Netherlands
(2) NXP, Netherlands
(3) Barcelona Supercomputing Center, Spain
(4) Universitat Politecnica de Catalunya, Spain
HiPEAC 2009 (The 4th International Conference on High Performance and Embedded Architectures and Compilers)

2 Outline
Introduction
3D-Wave
3D-Wave Implementation
Experimental Results
Conclusions

3 Introduction
Industry shift to multicores
Increasing demand for higher media quality/resolution
Efficient and scalable exploitation of multicore architectures for video coding
H.264 is widely used and computationally demanding
Decoding is part of encoding and more challenging

4 Parallel H.264 Decoding: The H.264 Decoder
[Figure: the H.264 decoding process. Labels: Encoded Bitstream, Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Reference Frames, Deblocking; grouped into Parser and Reconstructor (Data-Parallel Processing).]

5 H.264 Parallelization
Frame-level
  Motion Compensation introduces inter-frame dependencies
  Frame-level parallelism is very limited
Slice-level
  Slice-level parallelism is uncertain and increases bitrate
[Figure: frame sequence I0, P3, P6, P9 with B frames B1, B2, B4, B5; a frame divided into Slice 1, Slice 2, Slice 3]

6 H.264 Parallelization: MacroBlock-level
2D-Wave: exploits MB-level parallelism
[Figure legend: Current MB, Intra, DF]

7 H.264 Parallelization: MacroBlock-level
2D-Wave: exploits MB-level parallelism
Full HD: up to 60 MBs in parallel
[Figure legend: Current MB, Intra, DF]
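
The figure of 60 parallel MBs for Full HD can be reproduced with a small calculation. The sketch below is not taken from the slides: it assumes that every MB takes one time slot and that an MB only has to wait for its left and up-right neighbours, so MB (x, y) can start in slot x + 2y. The function name `wavefront_profile` is made up for this illustration.

```python
from collections import Counter

def wavefront_profile(width_mbs, height_mbs):
    """Count how many MBs become decodable in each 2D-Wave time slot."""
    # Assumption: MB (x, y) is ready in slot x + 2*y, one slot after its
    # left neighbour (x-1, y) and its up-right neighbour (x+1, y-1).
    slots = Counter(x + 2 * y
                    for y in range(height_mbs)
                    for x in range(width_mbs))
    return [slots[t] for t in range(max(slots) + 1)]

# Full HD is coded as 1920x1088 samples, i.e. 120x68 macroblocks of 16x16.
profile = wavefront_profile(1920 // 16, 1088 // 16)
print(max(profile))   # peak number of MBs that can be decoded in parallel
print(len(profile))   # number of time slots needed for one frame
```

Under these assumptions the peak is 60 MBs, reached only in the middle of the frame; the profile ramps up from a single MB and back down again, which is why the average parallelism is lower than the peak.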

8 H.264 Parallelization: overview of current strategies
Frame-level: very limited parallelism
Slice-level: uncertain parallelism, increases bitrate
MB-level: reasonable parallelism
None of these is sufficient to leverage a many-core!
Presenter notes: inter-frame dependencies; explain I, B, and P frame types; fig. 2 keeps part of fig. 1 (MC prediction in bold); show decoding order in figure; only B frames can be processed in parallel

9 3D-Wave: motion compensation
[Figure: motion compensation dependencies across frame 0 (I), frame 1 (P), and frame 2 (P)]

10 3D-Wave: maximum parallelism
For full HD: maximum available parallelism ranges from [values missing in the transcript] MBs!
Note: This requires >200 frames in flight.

11 3D-Wave Implementation
3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedia cores
  TM3270 was designed for SD video processing
  VLIW-based media processor with SIMD support
In-house simulator capable of simulating up to 64 cores
2D-Wave was already implemented
Tail submit (proposed by Hoogerbrugge and Terechko [13])
  Checks the right and down-left MBs
  Executes one of them if ready, sends the other to the task queue (TQ)
[13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
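
The tail-submit idea can be sketched in a few lines. This is an illustration under stated assumptions, not the NXP implementation: it runs a single worker, uses the simplified dependency rule that an MB waits for its left and up-right neighbours, assumes a grid at least two MBs wide, and arbitrarily prefers the right MB when both candidates are ready. The names `decode_order`, `pending`, and `task_queue` are hypothetical.

```python
from collections import deque

def decode_order(width, height):
    """Single-worker decode order of a width x height MB grid with tail submit.

    Assumes width >= 2 and simplified 2D-Wave dependencies: each MB waits
    for its left and its up-right neighbour.
    """
    # Number of unfinished dependencies per MB (booleans add up to 0, 1 or 2).
    pending = {(x, y): (x > 0) + (y > 0 and x + 1 < width)
               for y in range(height) for x in range(width)}
    task_queue = deque([(0, 0)])   # TQ: only the top-left MB starts ready
    order = []
    while task_queue:
        mb = task_queue.popleft()
        while mb is not None:
            order.append(mb)       # stands in for the actual MB decoding
            x, y = mb
            ready = []
            for nxt in ((x + 1, y), (x - 1, y + 1)):   # right, down-left
                if nxt in pending:
                    pending[nxt] -= 1
                    if pending[nxt] == 0:
                        ready.append(nxt)
            # Tail submit: continue directly with one ready MB and
            # hand any other ready MB to the task queue.
            mb = ready[0] if ready else None
            task_queue.extend(ready[1:])
    return order

order = decode_order(4, 3)
print(order)
```

The point of the design is that a worker which has just finished an MB keeps going with a dependent MB without a round trip through the task queue; only the second ready MB, if any, is published for other workers.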

12 3D-Wave Implementation: Reference Frame Buffer Structure
[Figure: Reference Frame Buffer structure. Labels: Decoder, Sync info, Frame 5]

13 3D-Wave Implementation: Reference Frame Buffer Structure
[Figure: parallel Reference Frame Buffer structure. One Decoder; Frame 0 to Frame 4, each with its own Sync info]

14 3D-Wave Implementation: Reference Frame Buffer Structure
[Figure: parallel Reference Frame Buffer structure. Three Decoders; Frame 0 to Frame 4, each with its own Sync info]

15 3D-Wave Implementation: Inter-frame dependencies
mb_decode checks inter-frame dependencies
On failure, it inserts the MB in the Kick-Off List of the Ref MB
[Figure: Frame 0, Frame 1; Ref MB with Kick-Off List: F1;MB(1,3) -> NULL]

16 3D-Wave Implementation: Inter-frame dependencies
Decoding process continues normally
[Figure: Frame 0, Frame 1; Ref MB with Kick-Off List: F1;MB(1,3) -> NULL]

17 3D-Wave Implementation: Inter-frame dependencies
mb_decode checks the Kick-Off List and submits subscribed tasks
[Figure: Frame 0, Frame 1; Ref MB with Kick-Off List: F1;MB(1,3) -> NULL]

18 3D-Wave Implementation: Inter-frame dependencies
And the decoding process carries on
[Figure: Frame 0, Frame 1; Ref MB with Kick-Off List: NULL]
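
The four steps above can be sketched as a small single-threaded model. Only the name mb_decode and the Kick-Off List concept come from the slides; the `Decoder` class, its fields, and the tuple encoding of MBs are assumptions made for this illustration, and a real multicore implementation would also need synchronisation around the list.

```python
from collections import defaultdict, deque

class Decoder:
    """Toy model of inter-frame dependency handling with Kick-Off Lists."""

    def __init__(self):
        self.decoded = set()                # MBs that are fully decoded
        self.kick_off = defaultdict(list)   # ref MB -> tasks waiting for it
        self.task_queue = deque()           # tasks ready to be executed

    def mb_decode(self, mb, ref_mb=None):
        # Step 1: check the inter-frame dependency; on failure, subscribe
        # this MB to the Kick-Off List of the reference MB and give up.
        if ref_mb is not None and ref_mb not in self.decoded:
            self.kick_off[ref_mb].append((mb, ref_mb))
            return False
        # Step 2: decode normally (the real work is omitted here).
        self.decoded.add(mb)
        # Step 3: check this MB's own Kick-Off List and submit subscribers.
        self.task_queue.extend(self.kick_off.pop(mb, []))
        return True

dec = Decoder()
waiting_mb = ("F1", 1, 3)      # frame 1, MB(1,3), as in the slide figure
ref = ("F0", 2, 3)             # hypothetical reference MB in frame 0
first_try = dec.mb_decode(waiting_mb, ref_mb=ref)   # ref not decoded yet
dec.mb_decode(ref)                                  # releases the subscriber
mb, ref_mb = dec.task_queue.popleft()
second_try = dec.mb_decode(mb, ref_mb=ref_mb)       # now succeeds
print(first_try, second_try)
```

Subscribing the blocked MB to the MB it waits for avoids polling: the waiting task costs nothing until the reference MB is done, at which point it is resubmitted exactly once.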

19 3D-Wave Implementation: Frame Scheduling
3D-Wave can have many frames in flight
Practical implementation requires few frames in flight
A policy was developed to limit the number of frames in flight
Implementation
  uses the Kick-Off List
  subscribes the first MB of the next frame to a specific MB in the current frame
  position of the MB defines number of frames in flight

20 3D-Wave Implementation: Frame Priority
Frame latency is an important factor in video decoding
3D-Wave interleaves the processing of all frames in flight
Frame Priority is necessary to limit frame latency in 3D-Wave
Implementation
  splits the Task Queue (TQ) into high and low priority task queues
  sends the tasks of the frame next-in-line to the high priority task queue
  checks if there are tasks in the high priority TQ, executes from the low priority TQ otherwise
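
A minimal sketch of the two-queue policy described on this slide follows. The class `PriorityTaskQueues` and its methods are hypothetical names; the sketch also assumes that the caller tracks which frame is next-in-line, and it does not model what happens to low priority tasks when that frame changes.

```python
from collections import deque

class PriorityTaskQueues:
    """Task queue split into a high and a low priority queue."""

    def __init__(self, next_in_line=0):
        self.high = deque()
        self.low = deque()
        self.next_in_line = next_in_line   # frame that must finish first

    def submit(self, frame, mb):
        # Tasks of the frame next-in-line go to the high priority queue.
        queue = self.high if frame == self.next_in_line else self.low
        queue.append((frame, mb))

    def fetch(self):
        # Workers check the high priority queue first and take work from
        # the low priority queue only when it is empty.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None

tq = PriorityTaskQueues(next_in_line=0)
tq.submit(1, (0, 0))    # frame 1: low priority
tq.submit(0, (5, 2))    # frame 0: high priority
tq.submit(2, (0, 0))    # frame 2: low priority
fetched = [tq.fetch(), tq.fetch(), tq.fetch(), tq.fetch()]
print(fetched)
```

Work from later frames still keeps otherwise idle cores busy, but it can no longer delay the frame that has to be displayed next.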

23 Experimental Results
Uses the highly optimized NXP H.264 decoder
  Machine-dependent optimizations (e.g. SIMD operations)
  Machine-independent optimizations (e.g. code restructuring)
The experiments use all 4 videos from HD-VideoBench [10].
[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.

24 Experimental Results: Methodology
Entropy Decoding results of the entire sequence are buffered
Sequence contains only I and P frames with one slice
All frames are scheduled to execute at once
Reference Frame Buffer keeps all the frames of the sequence
Presented results are for 25 frames (1 second) of Rush_Hour in Full High Definition (FHD)
On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second, respectively.

25 Experimental Results: Scalability
Efficiency of more than 80% for 64 cores
Start-up and ramp-down times of the short sequence limit efficiency
64 cores are 16x faster than real time for FHD
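
The 16x figure is consistent with the numbers given on the methodology slide, as the back-of-the-envelope check below shows. It is only a consistency check: it assumes real time means 25 frames per second, treats the "more than 80%" efficiency as exactly 80%, and uses the single-core 2D-Wave rate of 8 FHD frames per second as the baseline.

```python
single_core_fhd_fps = 8    # single core, FHD, from the methodology slide
cores = 64
efficiency = 0.80          # slide states "more than 80%", so a lower bound
real_time_fps = 25         # assumption: 25 frames correspond to 1 second

speedup = cores * efficiency                   # about 51x over one core
fps_on_64_cores = single_core_fhd_fps * speedup
times_real_time = fps_on_64_cores / real_time_fps
print(round(times_real_time, 1))
```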

26 Experimental Results: Frame Scheduling
FHD Rush_Hour decoding on 16 cores
Different colors represent different frames
Frame Scheduling limits the number of frames in flight
Performance loss is < 5% for at most 6 frames in flight

27 Experimental Results: Frame Scheduling and Priority
FHD Rush_Hour decoding on 16 cores
Frame Priority reduces frame latency to the same level as 2D-Wave (10 ms)
The latency of the 1st frame: 58.5 ms -> Frame Scheduling (15.1 ms) -> Frame Scheduling and Priority (9.2 ms)
Does not reduce performance significantly (< 1%)

28 Experimental Results: Bandwidth Requirements
Bandwidth required for 64 cores is approximately 21 GB/s
3D-Wave is 20% more bandwidth efficient than 2D-Wave
Scheduling and Priority reduce locality and increase bandwidth

29 Conclusions
3D-Wave scales with high efficiency to a large number of cores
3D-Wave allows efficient use of many-core architectures for video processing
Frame priority reduces latency to its minimum

30 References
[3] Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: “Parallel Scalability of H.264,” First Workshop on Programmability Issues for Multi-Core Computers 2008.
[10] Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.
[13] Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
M. Alvarez, A. Ramirez, M. Valero, A. Azevedo, C.H. Meenderinck, B.H.H. Juurlink, “Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture,” The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April 2009.
A. Azevedo, B.H.H. Juurlink, C.H. Meenderinck, A. Terechko, J. Hoogerbrugge, M. Alvarez, A. Ramirez, M. Valero, “A Highly Scalable Parallel Implementation of H.264,” Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.

