Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula

A Scalable Parallel H.264 Decoder on the Cell Broadband Engine Architecture. Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula. Arizona State University. CODES+ISSS (The International Conference on Hardware-Software Codesign and System Synthesis) 2009.

Outline: Introduction and Motivation; Opportunities for Parallelization in H.264; Implementation; Performance Optimizations; Experimental Results; Conclusion.

Motivation. Multicore architectures: scalability (more cores = more performance). H.264: standard for video applications including High Definition (HD); computationally expensive. Cell Broadband Engine (CBE): common and inexpensive thanks to the PS3; its low-power, high-performance design gives a glimpse of future embedded architectures.

IBM Cell Broadband Engine Architecture. Specifications: 3.2 GHz; 9 cores, 10 threads; >200 GFLOPS (single precision); >20 GFLOPS (double precision); up to 25 GB/s memory bandwidth; up to 75 GB/s I/O bandwidth; >300 GB/s interconnect bus. Abbreviations: SPE: Synergistic Processor Element; SPU: Synergistic Processor Unit; SXU: SPU Core; LS: Local Storage; SMF: Synergistic Memory Flow Control; EIB: Element Interconnect Bus; PPE: PowerPC Processor Element; PPU: PowerPC Processor Unit; PXU: Power Processor Unit; MIC: Memory Interface Controller; BIC: Bus Interface Controller; L1: memory cache internal to the CPU; L2: memory cache external to the CPU. Source: http://domino.research.ibm.com/comm/research.nsf/pages/r.arch.innovation.htm

H.264 Advanced Video Coding. H.264 is a video compression standard; Version 1 was completed in May 2003. It is published by both the ITU-T Video Coding Experts Group (as H.264) and the ISO/IEC Moving Picture Experts Group (as MPEG-4 AVC). It is a macroblock (MB) based codec closely related to MPEG-2. Growing demand for HD and wireless video. 50% bit rate reduction over the previous standard. Computational complexity is approximately 2.4x that of MPEG-2.

H.264: Decoder

Reference Code: FFmpeg (H.264 Decoder). Open source video and audio converter that handles a multitude of formats. Codecs other than the H.264 decoder were removed, leaving about 250K lines of code. About 200 functions were ported to the SPU in our implementation. http://www.ffmpeg.org/

H.264 Frame Level Relationships. I frame: independently encoded; intra prediction. P frame: predicted from a preceding frame; intra and inter prediction. B frame: predicted from both preceding and following frames.

H.264 Opportunities for Parallelism: GOP and Frame Level. I, P, and B frames; example picture sequence: IBBPBBP. Independent Groups of Pictures (GOPs). Independent frames within a GOP.
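To make the frame-level dependencies concrete, here is a minimal sketch that reorders the IBBPBBP sequence from display order into one valid decode order. It assumes (this is not stated on the slide) that each B frame references only the nearest I or P frame on either side and is not itself used as a reference; `decode_order` is a hypothetical helper name.

```python
def decode_order(display):
    """Reorder frame types from display order into one valid decode order.

    Assumption: a B frame depends only on the nearest I/P frame before
    and after it, so it can be decoded once the following anchor is done.
    Returns a list of (display_index, frame_type) tuples.
    """
    out = []
    pending_b = []  # B frames waiting for their following anchor frame
    for idx, ftype in enumerate(display):
        if ftype == "B":
            pending_b.append((idx, ftype))
        else:
            out.append((idx, ftype))  # anchor (I or P) goes first
            out.extend(pending_b)     # then the B frames that needed it
            pending_b = []
    out.extend(pending_b)  # trailing B frames with no following anchor
    return out


order = decode_order("IBBPBBP")
print("".join(ftype for _, ftype in order))  # IPBBPBB
```

Under the stated assumption, the B frames released together by one anchor do not depend on each other, which is the kind of frame-level independence the slide points to.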

H.264 Opportunities for Parallelism: Slice and MB Level. Slices: independently encoded groups of MBs within a frame. Intra dependencies: (figure)

Data Partitioning Scheme. Our scheme: one row of MBs is issued to each SPU. Possible intra MB dependencies: (figure)
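The row-per-SPU scheme can be illustrated with a small simulation. The dependency figure is not reproduced in this transcript, so the sketch assumes the neighbour pattern usual for H.264 intra prediction (left, top-left, top, top-right). It also assumes rows are handed out round-robin and that every MB takes one time step; both are simplifications, and `ready` and `simulate` are hypothetical names.

```python
def ready(row, progress, width):
    """True if the next MB in `row` may be decoded.

    progress[r] is the number of MBs already decoded in row r. The left
    neighbour is implied by in-order decoding within a row; the row above
    must be finished up to and including the top-right neighbour.
    """
    if row == 0:
        return True
    col = progress[row]
    needed = min(col + 2, width)  # MBs 0..col+1 of the row above
    return progress[row - 1] >= needed


def simulate(width, height, num_spus):
    """Count time steps to decode a frame of width x height MBs."""
    progress = [0] * height
    current = list(range(num_spus))  # row each SPU is working on
    steps = 0
    while any(row < height for row in current):
        snapshot = progress[:]  # all SPUs decide on the same state
        for spu, row in enumerate(current):
            if row < height and ready(row, snapshot, width):
                progress[row] += 1
                if progress[row] == width:
                    current[spu] = row + num_spus  # take the next row
        steps += 1
    return steps


# A 1080p frame is 120 x 68 MBs; compare one SPU against six.
print(simulate(120, 68, 1), simulate(120, 68, 6))
```

In this model each row starts two steps behind the row above it, so the SPUs form a staggered wavefront across the frame.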

Functional Partitioning. CBE architecture: (figure)

FFmpeg main MB decoding loop, with intra and inter paths (figure)

Scalable Implementation

FFmpeg Data Structure Modification. The single-threaded code uses a monolithic data structure. The entire structure is needed to decode a single MB, but the majority of it is static from one MB to the next. An SPU requires only the applicable subset for one row of MBs, so only MB-specific data is replicated in the SPU LS. Figure 10: Data structure modifications reducing memory requirements in the local store. W is the width of the video frame in macroblocks.
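A rough calculation shows why per-row data has to be kept small. The sketch below estimates the decoded pixel data for one row of W macroblocks and compares it with the local store. It assumes 8-bit 4:2:0 video and a 256 KB local store per SPE; neither figure appears on the slides, and the estimate covers pixel data only, not the decoder's MB-specific metadata.

```python
import math

MB_SIZE = 16                     # macroblock edge length in luma pixels
LOCAL_STORE_BYTES = 256 * 1024   # assumed SPE local store size


def mb_pixel_bytes():
    """Bytes of pixel data per MB: 16x16 luma plus two 8x8 chroma blocks
    (assumes 8-bit samples and 4:2:0 chroma subsampling)."""
    return MB_SIZE * MB_SIZE + 2 * (MB_SIZE // 2) ** 2


def row_pixel_bytes(frame_width_px):
    """Return (W, bytes) for one row of MBs, W = frame width in MBs."""
    w = math.ceil(frame_width_px / MB_SIZE)
    return w, w * mb_pixel_bytes()


w, row_bytes = row_pixel_bytes(1920)
fraction = row_bytes / LOCAL_STORE_BYTES
print(w, row_bytes, round(fraction, 3))  # 120 46080 0.176
```

Under these assumptions, one row of 1080p pixel data alone takes roughly a sixth of the local store, which also has to hold code, stack, and buffers.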

SPU LS(Local Store) Limitations

Code Overlay. A code segment contains one or more functions. A memory region is assigned one or more segments. At run time, a region contains exactly one segment.

Designing an Overlay Scheme. Start with one flat region. 1. Identify key functions and assign them to new regions. Profiling indicates f21() is the most important function with 50 calls; however, f11() is present 80 times in the trace, so f11() is a key function. 2. Create new regions based on profiling data until memory is exhausted.
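The effect of moving a key function into its own region can be sketched by counting segment loads over a call trace. The trace, the segment names, and the helper `count_segment_loads` are all hypothetical; only the function names f11() and f21() come from the slide, and f22() is added for illustration.

```python
def count_segment_loads(trace, segment_of, region_of):
    """Count how often a segment must be loaded into its region.

    trace      -- sequence of function names in call order
    segment_of -- maps each function to its code segment
    region_of  -- maps each segment to its memory region
    A region holds exactly one segment at a time, so calling a function
    whose segment is not resident costs one load.
    """
    resident = {}  # region -> segment currently loaded there
    loads = 0
    for func in trace:
        seg = segment_of[func]
        region = region_of[seg]
        if resident.get(region) != seg:
            resident[region] = seg
            loads += 1
    return loads


# Hypothetical trace in which f11() is interleaved with other functions.
trace = ["f11", "f21", "f11", "f22", "f11", "f21"]
segment_of = {"f11": "A", "f21": "B", "f22": "C"}

flat = {"A": 0, "B": 0, "C": 0}   # one flat region for everything
split = {"A": 1, "B": 0, "C": 0}  # f11's segment gets its own region

print(count_segment_loads(trace, segment_of, flat))   # 6
print(count_segment_loads(trace, segment_of, split))  # 4
```

In this toy trace f11() stays resident once it has its own region, so only the remaining segments keep swapping. That is the reason to weigh presence in the trace and not just the call count.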

Designing an Overlay Scheme

Overlay Performance

Additional Performance Optimizations

Experimental Results. Source videos were taken from Microsoft’s WMV HD demonstration page [13]. The source videos were transcoded into H.264 1920x1080 (1080p) format at 5 different bitrates (2.5, 4, 8, 12, and 16 Mbps), with both CAVLC and CABAC entropy coding, using the x264 H.264 encoder integrated into ffmpeg. The videos were encoded using the x264 presets baseline, normal, and hq. Decoder performance is measured on Sony’s PlayStation 3 with a 3.2 GHz Cell processor (limited by Sony to six of the CBE’s eight SPUs) running Fedora 9 Linux. [13] Microsoft Corporation. WMV HD Content Showcase. http://www.microsoft.com/windows/windowsmedia/musicandvideo/hdvideo/contentshowcase.aspx

Figure 14: Breakdown of decoder performance by component using a single SPU. The white band at the bottom is the PPU (entropy decoder) contribution. Motion vector decoding and deblocking are the most expensive components.

Decoder Performance. Compared with [4], our implementation achieves an average of 25.23 fps, a 23% improvement, when decoding similarly encoded video streams on four SPUs. [4] H. Baik, K.-H. Sihn, Y. il Kim, S. Bae, N. Han, and H. J. Song. “Analysis and Parallelization of H.264 decoder on Cell Broadband Engine Architecture.” In Signal Processing and Information Technology, pages 791–795. Samsung Electron. Co., Ltd., Suwon, Korea, 2007.
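As a quick check on the comparison, the short calculation below derives the frame rate implied for the prior work and the per-frame time budget at 25.23 fps. It assumes the 23% figure is relative to the prior work's average frame rate on comparable streams; the slides do not state that baseline directly, so the result is an inference and not a reported number.

```python
def implied_baseline(fps, improvement):
    """Frame rate that `fps` improves on by the fraction `improvement`."""
    return fps / (1.0 + improvement)


def frame_time_ms(fps):
    """Average time available per frame, in milliseconds."""
    return 1000.0 / fps


baseline = implied_baseline(25.23, 0.23)
print(round(baseline, 2))              # 20.51
print(round(frame_time_ms(25.23), 1))  # 39.6
```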

Our implementation achieved a “best case” average frame rate of 34.94 fps on 2.5 Mbps modified-normal CAVLC encoded video streams on six SPUs, and a “worst case”, entropy-decoder-limited average frame rate of 15.43 fps on 16 Mbps hq CABAC encoded video streams.

Conclusion. Demonstrated a scalable H.264 decoder for a multicore processor. 23% frame rate advantage over prior work [4] on similar videos using the same number of cores. Careful engineering is required to efficiently manage data structures and scratchpad memory.