Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula


1 A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture
Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula
Arizona State University
CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009

2 Outline
Introduction and Motivation
Opportunities for Parallelization in H.264
Implementation
Performance Optimizations
Experimental Results
Conclusion

3 Motivation
Multicore architectures
Scalability: more cores = more performance
H.264
Standard for video applications, including High Definition (HD)
Computationally expensive
Cell Broadband Engine (CBE)
Common and inexpensive thanks to the PS3
Low-power, high-performance design gives a glimpse of future embedded architectures

4 IBM Cell Broadband Engine Architecture
3.2 GHz
9 cores, 10 threads
>200 GFLOPS (single precision)
>20 GFLOPS (double precision)
Up to 25 GB/s memory bandwidth
Up to 75 GB/s I/O bandwidth
>300 GB/s interconnect bus
SPE: Synergistic Processor Element
SPU: Synergistic Processor Unit
SXU: SPU Core
LS: Local Storage
SMF: Synergistic Memory Flow Control
EIB: Element Interconnect Bus
PPE: PowerPC Processor Element
PPU: PowerPC Processor Unit
PXU: Power Processor Unit
MIC: Memory Interface Controller
BIC: Bus Interface Controller
L1: Memory cache internal to the CPU
L2: Memory cache external to the CPU

5 H.264 Advanced Video Coding
H.264 is a video compression standard
Version 1 completed May 2003
ITU-T Video Coding Experts Group (H.264)
ISO/IEC Moving Picture Experts Group (MPEG-4 AVC)
Macroblock (MB) based codec, closely related to MPEG-2
Growing demand for HD and wireless video
50% bit rate reduction over the previous standard
Computational complexity approximately 2.4x that of MPEG-2

6 H.264: Decoder

7 Reference Code: FFmpeg (H.264 Decoder)
Open source video and audio converter
Handles a multitude of formats
Codecs other than the H.264 decoder removed
About 250K lines of code after paring to H.264 only
About 200 functions ported to the SPU in our implementation

8 H.264 Frame Level Relationships
I Frame: Independently encoded; intra prediction
P Frame: Predicted from a preceding frame; intra and inter prediction
B Frame: Predicted from both preceding and following frames

9 H.264 Opportunities for Parallelism: GOP and Frame Level
I, P, B frames
Picture sequence: IBBPBBP
Independent Groups of Pictures (GOP)
Independent frames within a GOP

10 H.264 Opportunities for Parallelism: Slice and MB Level
Slices: Independently encoded groups of MBs within a frame
Intra dependencies:

11 Data Partitioning Scheme
Our scheme: One row of MBs issued to each SPU
Possible intra MB dependencies:
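The row-per-SPU scheme can be illustrated with a small timing model. This is only a sketch under stated assumptions, not the authors' scheduler: rows are handed out round-robin, every MB takes one time unit, and each MB waits for its left neighbor and for the top-right neighbor in the row above (the standard H.264 intra neighbors are left, top-left, top, and top-right, and top-right is the last of these to finish). The function names are hypothetical.

```python
def wavefront_finish_times(width_mbs, height_mbs, num_spus):
    """Finish time of every MB when rows are decoded as a wavefront.

    Assumptions (not from the slides): unit decode time per MB and
    round-robin assignment of rows to SPUs.
    """
    finish = [[0] * width_mbs for _ in range(height_mbs)]
    for r in range(height_mbs):
        for c in range(width_mbs):
            ready = 0
            if c > 0:
                # Left neighbor in the same row.
                ready = max(ready, finish[r][c - 1])
            if r > 0:
                # Top-right neighbor (clamped at the right frame edge).
                ready = max(ready, finish[r - 1][min(c + 1, width_mbs - 1)])
            if c == 0 and r >= num_spus:
                # The SPU must first finish the previous row it was given.
                ready = max(ready, finish[r - num_spus][width_mbs - 1])
            finish[r][c] = ready + 1
    return finish


def makespan(width_mbs, height_mbs, num_spus):
    """Total time units needed to decode one frame in this model."""
    return wavefront_finish_times(width_mbs, height_mbs, num_spus)[-1][-1]


one_spu = makespan(4, 2, 1)   # rows decoded back to back
two_spus = makespan(4, 2, 2)  # second row trails the first by two MBs
```

In this model a 4x2 MB frame takes 8 time units on one SPU and 6 on two, because the second row can start as soon as the first two MBs of the row above are done.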

12 Functional Partitioning
CBE Architecture:

13 FFmpeg main MB decoding loop
Intra Inter

14 Scalable Implementation

15 FFmpeg Data Structure Modification
Single-threaded code: monolithic data structure
Entire structure needed to decode a single MB, but the majority is static from one MB to the next
SPU only requires the applicable subset for one row of MBs
Only MB-specific data replicated in the SPU LS
Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
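As a rough illustration of why the per-row subset scales with W, the sketch below separates data that is static across MBs from MB-specific data replicated for one row. The class names and the 512-byte per-MB size are hypothetical placeholders, not figures from the paper; the 16x16 macroblock size and the 256 KB SPE local store are standard.

```python
from dataclasses import dataclass

MB_SIZE = 16                    # H.264 macroblocks cover 16x16 luma samples
LOCAL_STORE_BYTES = 256 * 1024  # size of one SPE local store


@dataclass(frozen=True)
class SharedContext:
    """Data that stays the same from one MB to the next (kept once)."""
    width_mbs: int
    height_mbs: int


@dataclass
class RowWorkItem:
    """MB-specific data replicated in the local store for one row of MBs."""
    row: int
    width_mbs: int
    per_mb_bytes: int  # hypothetical size of the per-MB state

    def footprint(self) -> int:
        # Grows linearly with W, the frame width in macroblocks.
        return self.per_mb_bytes * self.width_mbs


def frame_size_in_mbs(width_px, height_px):
    """Frame dimensions in macroblocks, rounding partial MBs up."""
    return (-(-width_px // MB_SIZE), -(-height_px // MB_SIZE))


w, h = frame_size_in_mbs(1920, 1080)
ctx = SharedContext(width_mbs=w, height_mbs=h)
item = RowWorkItem(row=0, width_mbs=ctx.width_mbs, per_mb_bytes=512)
```

For 1080p video W is 120, so with the placeholder 512 bytes per MB one row would occupy 61,440 bytes, which leaves room in the local store for code and buffers.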

16 SPU LS(Local Store) Limitations

17 Code Overlay
A code segment contains one or more functions
Each memory region is assigned one or more segments
At run time, a region contains exactly one segment
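The three overlay rules can be modeled in a few lines. This is a toy simulator, not the Cell SDK overlay manager: the layout format, the class name, and the function names f1 and f2 are all hypothetical. It counts how often a segment has to be loaded into its region for a given call trace.

```python
class OverlaySim:
    """Toy model: each region holds exactly one of its segments at a time."""

    def __init__(self, layout):
        # layout: {region: {segment: [function, ...]}} (hypothetical format)
        self.where = {}
        for region, segments in layout.items():
            for segment, functions in segments.items():
                for fn in functions:
                    self.where[fn] = (region, segment)
        self.resident = {}  # region -> segment currently loaded
        self.loads = 0

    def call(self, fn):
        region, segment = self.where[fn]
        if self.resident.get(region) != segment:
            # Segment is not resident: it replaces whatever the region held.
            self.resident[region] = segment
            self.loads += 1


def count_loads(layout, trace):
    """Number of segment loads needed to execute the calls in trace."""
    sim = OverlaySim(layout)
    for fn in trace:
        sim.call(fn)
    return sim.loads


trace = ["f1", "f2"] * 3
# Both segments compete for one region, so every call swaps code in.
shared = {"R0": {"s1": ["f1"], "s2": ["f2"]}}
# Each segment has its own region, so each is loaded only once.
split = {"R0": {"s1": ["f1"]}, "R1": {"s2": ["f2"]}}
```

With this trace the shared layout needs 6 loads and the split layout needs 2, at the cost of reserving memory for a second region.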

18 Designing an Overlay Scheme
Start with one flat region
1. Identify key functions and assign them to new regions
Profiling indicates f21() is most important, with 50 calls
However, f11() is present 80 times in the trace, so f11() is a key function
2. Create new regions based on profiling data until memory is exhausted
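One plausible reading of the f21()/f11() example is that a function's presence in the trace includes every return into it, since an overlaid caller may have to be reloaded when its callee returns. The sketch below counts both quantities from a call/return event list; the event format and the tiny trace are hypothetical, not the profiler output used by the authors.

```python
from collections import Counter


def count_calls_and_presence(events):
    """Count plain calls and trace presence for each function.

    events: list of ("call", name) and ("ret",) tuples (hypothetical format).
    Presence counts each entry into a function plus each return into it.
    """
    calls = Counter()
    presence = Counter()
    stack = []
    for ev in events:
        if ev[0] == "call":
            name = ev[1]
            calls[name] += 1
            presence[name] += 1
            stack.append(name)
        else:
            stack.pop()
            if stack:
                # Control returns to the caller, which must be resident again.
                presence[stack[-1]] += 1
    return calls, presence


# f11 is called once and calls f21 three times.
events = [("call", "f11")]
for _ in range(3):
    events += [("call", "f21"), ("ret",)]
events.append(("ret",))

calls, presence = count_calls_and_presence(events)
```

Here f21 has more calls (3 versus 1), yet f11 appears more often in the trace (4 versus 3), which mirrors the situation on the slide where call counts alone would miss a key function.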

19 Designing an Overlay Scheme

20 Overlay Performance

21 Additional Performance Optimizations

22 Experimental Results
Source videos: Microsoft's WMV HD demonstration page [13]
The source videos were transcoded into H.264 1920x1080 (1080p) format
5 different bitrates: 2.5, 4, 8, 12, and 16 Mbps
CAVLC and CABAC entropy coding
Transcoded using the x264 H.264 encoder integrated into ffmpeg
The videos were encoded using the x264 presets: baseline, normal, and hq
Decoder performance was measured on Sony's PlayStation 3 with a 3.2 GHz Cell processor (Sony limits access to six of the CBE's eight SPUs), running Fedora 9 Linux
[13] Microsoft Corporation. WMV HD Content Showcase.

23 The white band at the bottom is the PPU (entropy decoder) contribution
Motion vector decoding and deblocking are the most expensive components
Figure 14: Breakdown of decoder performance by component using a single SPU.

24 Decoder Performance
Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs.
[4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. "Analysis and Parallelization of H.264 Decoder on Cell Broadband Engine Architecture." In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.

25 Our implementation achieved a "best case" average frame rate of 34.94 fps on 2.5 Mbps modified-normal CAVLC-encoded video streams on six SPUs
and a "worst case", entropy-decoder-limited average frame rate of 15.43 fps on 16 Mbps hq CABAC-encoded video streams.

26 Conclusion
Demonstrated a scalable H.264 decoder for a multicore processor
23% frame rate advantage over prior work [4] on similar videos using the same number of cores
Careful engineering is required to efficiently manage data structures and scratchpad memory

