Presentation on theme: "Parallel H.264 Decoding on an Embedded Multicore Processor"— Presentation transcript:
1. Parallel H.264 Decoding on an Embedded Multicore Processor
Arnaldo Azevedo (1), Cor Meenderinck (1), Ben Juurlink (1), Andrei Terechko (2), Jan Hoogerbrugge (2), Mauricio Alvarez (3), Alex Ramirez (3,4)
1 - Delft University of Technology, Netherlands
2 - NXP, Netherlands
3 - Barcelona Supercomputing Center, Spain
4 - Universitat Politecnica de Catalunya, Spain
HiPEAC (The 4th International Conference on High Performance and Embedded Architectures and Compilers) 2009
3. Introduction
- Industry shift to multicores
- Increasing demand for higher media quality/resolution
- Efficient and scalable exploitation of multicore architectures for video coding
- H.264 is widely used and computationally demanding
- Decoding is part of encoding and more challenging
4. Parallel H.264 Decoding: The H.264 Decoder
[Figure: block diagram of the H.264 decoding process. Block labels: Encoded Bitstream, Stream Parsing, Entropy Decoder, Inverse Quantization, Inverse DCT, Spatial Prediction, Motion Compensation, Deblocking, Reference Frames. Group labels: Parser; Reconstructor (Data-Parallel Processing).]
5. H.264 Parallelization: Frame-level and Slice-level
Frame-level:
- Motion Compensation introduces inter-frame dependencies
- Frame-level parallelism is very limited
Slice-level:
- Slice-level parallelism is uncertain and increases bitrate
[Figure labels: frames I0, P3, P6, P9, B1, B4, B2, B5; Slice 1, Slice 2, Slice 3]
7. H.264 Parallelization: MacroBlock-level
- 2D-Wave: exploits MB-level parallelism
- Full HD: up to 60 MBs in parallel
[Figure labels: Current MB; Intra (intra prediction); DF (deblocking filter)]
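The 60-MB figure for Full HD can be reproduced with a short sketch. It rests on assumptions that are not stated on the slide: 16x16 macroblocks, a coded frame of 1920x1088 samples (120 x 68 MBs), one time slot per MB, and each MB (x, y) depending on its left and upper-right neighbours, so that it can start in slot x + 2y. The function name is hypothetical.

```python
from collections import Counter

def wavefront_parallelism(width_mbs, height_mbs):
    """Return (peak MBs per slot, number of slots) for an idealised 2D wave.

    Assumption: MB (x, y) becomes ready in time slot x + 2*y, which follows
    from left and upper-right dependencies and a uniform decode time.
    """
    slots = Counter(x + 2 * y
                    for y in range(height_mbs)
                    for x in range(width_mbs))
    return max(slots.values()), max(slots) + 1

# Full HD, assuming 1920x1088 coded samples and 16x16 MBs -> 120 x 68 MBs.
peak, n_slots = wavefront_parallelism(1920 // 16, 1088 // 16)
print(peak, n_slots)  # 60 254
```

Under these assumptions the peak is reached only in the middle of the frame; the wave ramps up at the start and down at the end, which is one reason MB-level parallelism alone does not keep many cores busy.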
8. H.264 Parallelization: overview of current strategies
- Frame-level: very limited parallelism
- Slice-level: uncertain parallelism; increases bitrate
- MB-level: reasonable parallelism
- None of these is sufficient to leverage a many-core!
Slide notes (inter-frame dependencies): explain I, B, and P frame types; fig. 2 maintains part of fig. 1 (MC prediction in bold); show decoding order in figure; only B frames can be processed in parallel.
10. 3D-Wave: maximum parallelism
- For full HD: maximum available parallelism ranges from [values missing in the transcript] MBs!
- Note: this requires >200 frames in flight.
11. 3D-Wave Implementation
- 3D-Wave was implemented on an NXP multicore consisting of TM3270 TriMedia processors
  - The TM3270 was designed for SD video processing
  - VLIW-based media processor with SIMD support
- In-house simulator capable of simulating up to 64 cores
- 2D-Wave was already implemented
- Tail submit (proposed by Hoogerbrugge and Terechko):
  - Checks the right and down-left MBs
  - Executes one of them if ready; sends the other to the Task Queue (TQ)
Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
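The tail-submit idea can be illustrated with a minimal single-threaded sketch. This is not the NXP implementation: the dependency counters, the function names, and the driver loop are assumptions made for illustration, and a real multicore version would need atomic counters and a thread-safe queue.

```python
from collections import deque

def tail_submit(x, y, deps_left, width, height, task_queue):
    """After MB (x, y) is decoded, release its right and down-left neighbours.

    deps_left maps (x, y) -> number of intra-frame dependencies still pending.
    Returns the MB this core should decode next, or None if nothing is ready.
    """
    ready = []
    for nx, ny in ((x + 1, y), (x - 1, y + 1)):      # right, down-left
        if 0 <= nx < width and ny < height:
            deps_left[(nx, ny)] -= 1
            if deps_left[(nx, ny)] == 0:
                ready.append((nx, ny))
    if not ready:
        return None
    task_queue.extend(ready[1:])   # at most one extra MB goes to the TQ
    return ready[0]                # continue directly, skipping a TQ round trip

def decode_frame(width, height):
    """Decode one frame on a single simulated core (assumes width >= 2)."""
    # Each MB waits for its left and upper-right neighbour, where they exist.
    deps_left = {(x, y): (x > 0) + (y > 0 and x + 1 < width)
                 for y in range(height) for x in range(width)}
    task_queue = deque([(0, 0)])
    order = []
    while task_queue:
        mb = task_queue.popleft()
        while mb is not None:
            order.append(mb)       # stand-in for the actual decoding work
            mb = tail_submit(*mb, deps_left, width, height, task_queue)
    return order

order = decode_frame(4, 3)
```

The point of the design is that a core which has just finished an MB picks up a ready neighbour immediately, so only the second ready neighbour pays the cost of going through the task queue.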
15. 3D-Wave Implementation: Inter-frame dependencies
- mb_decode checks inter-frame dependencies
- On failure, it inserts the MB in the Kick-Off List of the Ref MB
[Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3) -> NULL]
16. 3D-Wave Implementation: Inter-frame dependencies
- Decoding process continues normally
[Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3) -> NULL]
17. 3D-Wave Implementation: Inter-frame dependencies
- mb_decode checks the Kick-Off List and submits subscribed tasks
[Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: F1;MB(1,3) -> NULL]
18. 3D-Wave Implementation: Inter-frame dependencies
- And the decoding process carries on
[Figure: Frame 0 and Frame 1; Kick-Off List of the Ref MB: NULL]
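The four steps above (check, subscribe on failure, continue, kick off subscribers) can be sketched as follows. The class, the task tuple layout, and the reference MB coordinates are hypothetical; only the name mb_decode and the subscribed entry F1;MB(1,3) come from the slides. The sketch is single-threaded, so it omits the locking a real multicore implementation would need.

```python
from collections import defaultdict, deque

class KickOffLists:
    """Per-MB lists of tasks waiting on a reference MB (illustrative only)."""

    def __init__(self):
        self.decoded = set()               # finished MBs as (frame, x, y)
        self.waiting = defaultdict(list)   # reference MB -> subscribed tasks

    def mb_decode(self, task, task_queue):
        """Decode the MB if its reference is ready, otherwise subscribe it."""
        mb, ref_mb = task
        if ref_mb is not None and ref_mb not in self.decoded:
            self.waiting[ref_mb].append(task)   # park on the Kick-Off List
            return False
        self.decoded.add(mb)                    # stand-in for real decoding
        # Resubmit every task that was waiting for this MB.
        task_queue.extend(self.waiting.pop(mb, []))
        return True

kol = KickOffLists()
tq = deque()
ref = (0, 2, 3)   # frame 0, MB (2, 3): hypothetical reference MB
dep = (1, 1, 3)   # frame 1, MB (1, 3), matching the F1;MB(1,3) entry

first = kol.mb_decode((dep, ref), tq)      # reference not ready: subscribed
second = kol.mb_decode((ref, None), tq)    # decoding ref resubmits dep
third = kol.mb_decode(tq.popleft(), tq)    # dep can now be decoded
print(first, second, third)  # False True True
```

Subscribing instead of polling means a blocked MB consumes no core time while it waits for its reference.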
19. 3D-Wave Implementation: Frame Scheduling
- 3D-Wave can have a large number of frames in flight
- A practical implementation requires only a few frames in flight
- A policy was developed to limit the number of frames in flight
- Implementation:
  - uses the Kick-Off List
  - subscribes the first MB of the next frame to a specific MB in the current frame
  - the position of that MB defines the number of frames in flight
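How the position of the trigger MB bounds the number of frames in flight can be seen in a toy model. The model is an assumption, not the policy evaluated in the talk: it supposes unlimited cores, one time slot per MB, the idealised 2D-wave timing in which MB (x, y) is decoded in slot x + 2y, and no stalls from motion-compensation dependencies. The function name is hypothetical.

```python
def frames_in_flight(trigger_x, trigger_y, width_mbs, height_mbs):
    """Estimate concurrently active frames for a given trigger MB position.

    Toy model: a frame occupies width + 2*(height - 1) slots, and the next
    frame starts in the slot right after the trigger MB has been decoded.
    """
    frame_slots = width_mbs + 2 * (height_mbs - 1)
    start_offset = trigger_x + 2 * trigger_y + 1
    return -(-frame_slots // start_offset)   # ceiling division

# 120 x 68 MBs corresponds to Full HD with 16x16 MBs (an assumption).
last_mb = frames_in_flight(119, 67, 120, 68)   # trigger on the last MB
mid_frame = frames_in_flight(0, 34, 120, 68)   # trigger half-way down
print(last_mb, mid_frame)  # 1 4
```

In this model, moving the trigger MB towards the start of the frame lets more frames overlap, and moving it towards the end serialises the frames.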
20. 3D-Wave Implementation: Frame Priority
- Frame latency is an important factor in video decoding
- 3D-Wave interleaves the processing of all frames in flight
- Frame Priority is necessary to limit frame latency in 3D-Wave
- Implementation:
  - splits the Task Queue (TQ) into high-priority and low-priority task queues
  - sends the tasks of the frame next in line to the high-priority task queue
  - a core first checks for tasks in the high-priority TQ and executes from the low-priority TQ otherwise
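A minimal sketch of the split task queue follows. The class and method names are hypothetical, and the promotion step in frame_done is an assumption about what happens when the next-in-line frame changes; the slides do not describe it.

```python
from collections import deque

class PriorityTaskQueues:
    """Two-level task queue: the frame next in line gets high priority."""

    def __init__(self):
        self.high = deque()
        self.low = deque()
        self.next_frame = 0   # index of the frame next in line

    def submit(self, frame, mb):
        """Route a task to the high or low queue depending on its frame."""
        queue = self.high if frame == self.next_frame else self.low
        queue.append((frame, mb))

    def fetch(self):
        """Take a high-priority task if one exists, else a low-priority one."""
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None

    def frame_done(self):
        """Assumed behaviour: promote queued tasks of the new next frame."""
        self.next_frame += 1
        keep = deque()
        for task in self.low:
            (self.high if task[0] == self.next_frame else keep).append(task)
        self.low = keep

tqs = PriorityTaskQueues()
tqs.submit(1, (0, 0))      # frame 1 is not next in line: low priority
tqs.submit(0, (5, 5))      # frame 0 is next in line: high priority
fetched = [tqs.fetch(), tqs.fetch(), tqs.fetch()]
print(fetched)  # [(0, (5, 5)), (1, (0, 0)), None]
```

Low-priority tasks still run whenever the high-priority queue is empty, so cores are not left idle, which is consistent with the small performance cost reported later in the talk.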
23. Experimental Results
- Use the highly optimized NXP H.264 decoder
  - Machine-dependent optimizations (e.g. SIMD operations)
  - Machine-independent optimizations (e.g. code restructuring)
- The experiments use all 4 videos from HD-VideoBench
Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.
24. Experimental Results: Methodology
- Entropy Decoding results of the entire sequence are buffered
- Sequence contains only I and P frames with one slice
- All frames are scheduled to execute at once
- Reference Frame Buffer keeps all the frames of the sequence
- Presented results are for 25 frames (1 second) of Rush_Hour in Full High Definition (FHD)
- On a single core, 2D-Wave can decode 39 SD, 18 HD, and 8 FHD frames per second
25. Experimental Results: Scalability
- Efficiency of more than 80% for 64 cores
- Start-up and ramp-down times of the short sequence limit efficiency
- 64 cores are 16x faster than real-time for FHD
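The efficiency and real-time figures can be cross-checked with simple arithmetic. The check assumes that efficiency means speedup divided by core count, that the baseline is the single-core FHD rate of 8 frames per second quoted in the methodology, and that real time is 25 frames per second (25 frames being 1 second of video). The talk may have used a different baseline.

```python
cores = 64
efficiency = 0.80        # lower bound quoted for 64 cores
single_core_fps = 8      # assumed baseline: single-core FHD rate
real_time_fps = 25       # 25 frames correspond to 1 second

speedup = efficiency * cores               # efficiency = speedup / cores
fps = speedup * single_core_fps            # implied FHD throughput
times_real_time = fps / real_time_fps      # multiple of real-time speed

print(round(speedup, 1), round(fps, 1), round(times_real_time, 1))
# 51.2 409.6 16.4
```

Under these assumptions an efficiency of 80% on 64 cores corresponds to roughly 16 times real-time for FHD, which agrees with the figure on the slide.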
26. Experimental Results: Frame Scheduling
- FHD Rush_Hour decoding on 16 cores
- Different colors represent different frames
- Frame Scheduling limits the number of frames in flight
- Performance loss is < 5% for at most 6 frames in flight
27. Experimental Results: Frame Scheduling and Priority
- FHD Rush_Hour decoding on 16 cores
- Frame Priority reduces frame latency to the same as 2D-Wave (10 ms)
- The latency of the 1st frame: 58.5 ms; with Frame Scheduling: 15.1 ms; with Frame Scheduling and Priority: 9.2 ms
- Does not reduce performance significantly (< 1%)
28. Experimental Results: Bandwidth Requirements
- Bandwidth required for 64 cores is approximately 21 GB/s
- 3D-Wave is 20% more bandwidth efficient than 2D-Wave
- Scheduling and Priority reduce locality and increase bandwidth
29. Conclusions
- 3D-Wave scales with high efficiency to a large number of cores
- 3D-Wave allows efficient use of many-core architectures for video processing
- Frame priority reduces latency to its minimum
30. References
Meenderinck, C., Azevedo, A., Alvarez, M., Juurlink, B., Ramirez, A.: “Parallel Scalability of H.264,” First Workshop on Programmability Issues for Multi-Core Computers 2008.
Alvarez, M., Salami, E., Ramirez, A., Valero, M.: “HD-VideoBench: A Benchmark for Evaluating High Definition Digital Video Applications,” IEEE International Symposium on Workload Characterization 2007.
Hoogerbrugge, J., Terechko, A.: “A Multithreaded Multicore System for Embedded Media Processing,” Transactions on High-Performance Embedded Architectures and Compilers 2008.
Alvarez, M., Ramirez, A., Valero, M., Azevedo, A., Meenderinck, C.H., Juurlink, B.H.H.: “Performance Evaluation of Macroblock-level Parallelization of H.264 Decoding on a cc-NUMA Multiprocessor Architecture,” The 4CCC: 4th Colombian Computing Conference, Bucaramanga, Colombia, April 2009.
Azevedo, A., Juurlink, B.H.H., Meenderinck, C.H., Terechko, A., Hoogerbrugge, J., Alvarez, M., Ramirez, A., Valero, M.: “A Highly Scalable Parallel Implementation of H.264,” Transactions on High-Performance Embedded Architectures and Compilers (HiPEAC), September 2009.