Advanced Science and Technology Letters Vol.28 (CIA 2013), pp.72-76

An OpenCL-based Implementation of H.264 Encoder

Young Chun Kwon 1, Chanho Park 2, Daeseok Oh 2, WooSuk Jang 3, and Nakhoon Baek 1,*
1 School of Computer Sci. and Eng., Kyungpook National Univ., Daegu, Korea
2 Kumoh National Institute of Technology, Gumi, Korea
3 Yeungnam University, Gyeongsan, Korea

* corresponding author: Nakhoon Baek
† This investigation was financially supported by Semiconductor Industry Collaborative Project between Kyungpook National University and Samsung Electronics Co. Ltd.
ISSN: ASTL Copyright © 2013 SERSC

Abstract. We present an accelerated implementation of a high-speed video stream encoder for the H.264 digital video codec standard. Building on GPU-based parallel processing techniques, we use OpenCL-based GPU kernel programs. We achieve a high level of CPU-GPU interoperability by letting the CPU perform all input/output operations and overall stream control, while the GPU performs the core encoding operations. Our final results show speed-ups of 10% to 50% in comparison with widely used H.264 encoders.

Keywords: Video codec, H.264, OpenCL, parallel processing, implementation.

1 Introduction

H.264 is one of the high-quality digital video codec standards [1], and it is now one of the most widely used video codecs. The overall encoder processing of the H.264 standard requires a huge amount of computation to guarantee the final video quality. Although most previous H.264 encoder implementations focus on CPU-based acceleration, a few parallelized H.264 encoders have appeared recently.

In this paper, we present a new acceleration method for H.264 encoders, based on OpenCL (Open Computing Language). OpenCL [2] is currently one of the most widely used de facto standards in the field of general-purpose GPU (GPGPU) computing. Most previous research concentrates on accelerating CPU computation or on introducing intuitive parallel processing architectures. In contrast, we focus on the overall parallelization of the encoding operations, and achieve speed-ups of 10% to 50%.

Section 2 reviews previous CPU-based and GPGPU-based encoding schemes for the H.264 encoder. Section 3 describes our overall parallelized architecture and the details of its implementation. Section 4 shows the actual implementation results. Section 5 presents our conclusion and future work.

2 Previous Works

GPGPU-based methods have been widely used not only for encoder processing but also in various other areas, including image processing and large-scale data processing. For encoder processing specifically, CUDA [3] and OpenCL have been used in various video encoding schemes to enhance their internal algorithms. H.264 shows relatively good encoding performance, and there are both CUDA-based acceleration examples [4] and OpenCL-based acceleration examples [5]. In this paper, we focus on the OpenCL standard. CUDA was released by a single manufacturer, NVIDIA, and thus cannot easily be used on general devices from other hardware vendors.

3 Design and Implementation

There is a previous work [6] on the implementation of the H.264 encoding pipeline with OpenCL, and our work is based on that implementation. We analyzed its overall architecture and located the key operations to be parallelized. Using OpenCL, we accelerated those key operations to achieve better-optimized performance.

Fig. 1. Overall architecture of our parallelized H.264 encoder.

Figure 1 shows the overall architecture of our proposed algorithms. To concentrate on the encoding operations, we used the input/output features of the existing DirectShow filters [7] for the file input/output operations. When the target video files are read from disk, the DirectShow filters construct the raw image data in the host memory. These image data are then sent to the GPU, and the OpenCL-based GPU kernel program performs the core encoding operations. The encoded results are copied back to the host memory, and the DirectShow filter converts them into the final encoded video files on disk.

At the early stages of our parallelized program design, our analysis showed that the overall computation is concentrated in the per-screen operations, so we first tried to optimize these per-screen operations. However, this optimization immediately caused a rapid increase in data transfer between the GPU and host memory, and this bottleneck limited the overall performance of our architecture. We therefore changed our strategy and optimized another processing step instead: the look-ahead operations.

The look-ahead operations examine the future frames in the video stream and decide their frame types and slice types in advance, according to the H.264 standard specification. The whole look-ahead processing is implemented as a single thread, and this thread uses the OpenCL-based GPU kernel to determine each slice type. The OpenCL kernel internally performs four steps: down-scale, intra-prediction, motion search, and mode select. At the down-scale stage, the H.264 macroblocks are converted into lower-resolution images. The intra-prediction stage then calculates the intra-cost estimation values. At the motion search stage, the motion vectors are obtained from the scaled-down images and the intra-cost values. At the mode select stage, each slice type is selected based on the cost values in the intermediate results.

Our accelerated look-ahead processing, based on the OpenCL-based GPU kernels, improves the overall processing speed of the system. More detailed results are shown in the following section.

4 Results

Our system is implemented on a Windows PC with an Intel i7 CPU and a GeForce GTX560 Ti GPU. To compare the execution speeds of a set of H.264 encoders, we use the same parameter settings wherever possible.
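
As a concrete illustration of the four look-ahead stages described in Section 3 (down-scale, intra-prediction, motion search, mode select), the following CPU-side Python sketch walks one frame pair through them. It is not the OpenCL kernel used in this work, whose internals are not given here: the 2×2 averaging, the SAD-based costs, the DC-style intra estimate, the 4×4 block size, the search radius, the `intra_ratio` threshold, and all function names are assumptions chosen for illustration. For brevity it decides one type per frame rather than per slice.

```python
"""CPU reference sketch of a four-stage look-ahead pass (illustrative only)."""

BLOCK = 4  # block size on the down-scaled frame (assumed, not from the paper)


def downscale(frame):
    """Stage 1: halve the resolution by averaging each 2x2 pixel group."""
    h, w = len(frame) // 2, len(frame[0]) // 2
    return [[(frame[2 * y][2 * x] + frame[2 * y][2 * x + 1]
              + frame[2 * y + 1][2 * x] + frame[2 * y + 1][2 * x + 1]) // 4
             for x in range(w)] for y in range(h)]


def block(frame, by, bx):
    """Return the BLOCK x BLOCK pixels whose top-left corner is (by, bx)."""
    return [row[bx:bx + BLOCK] for row in frame[by:by + BLOCK]]


def sad(a, b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))


def intra_cost(cur, by, bx):
    """Stage 2 (assumed form): SAD of the block against its own mean (DC)."""
    blk = block(cur, by, bx)
    dc = sum(map(sum, blk)) // (BLOCK * BLOCK)
    return sum(abs(p - dc) for row in blk for p in row)


def motion_search(cur, ref, by, bx, radius=2):
    """Stage 3: full search in a small window; returns (best SAD, vector)."""
    h, w = len(ref), len(ref[0])
    target = block(cur, by, bx)
    best = (float("inf"), (0, 0))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= h - BLOCK and 0 <= x <= w - BLOCK:
                best = min(best, (sad(target, block(ref, y, x)), (dy, dx)))
    return best


def lookahead(cur_frame, ref_frame, intra_ratio=0.5):
    """Stage 4: compare per-block costs and pick 'I' or 'P' for the frame."""
    cur, ref = downscale(cur_frame), downscale(ref_frame)
    intra_blocks = total = 0
    for by in range(0, len(cur) - BLOCK + 1, BLOCK):
        for bx in range(0, len(cur[0]) - BLOCK + 1, BLOCK):
            inter, _mv = motion_search(cur, ref, by, bx)
            intra_blocks += intra_cost(cur, by, bx) < inter
            total += 1
    return "I" if intra_blocks > intra_ratio * total else "P"
```

For example, `lookahead(frame, frame)` yields "P", because every block matches the reference with zero inter cost. In a GPU version, the two block loops would presumably be mapped to independent work-items, which is what makes this step a natural candidate for parallelization.
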
Figure 2 shows the final benchmark results for a set of encoders: our implementation, x264 [6], MeGUI [8], and StaxRip [9]. The other encoders (x264, MeGUI, and StaxRip) were selected from widely used H.264 encoders. As shown in Figure 2(a), for the example video file of 154MB with 1920×1080 resolution, the encoded files have almost the same size, about 34MB. In terms of encoding time, our implementation shows the best result: an improvement of about 50% over StaxRip, and a speed-up of at least 10% even over the previous x264 implementation.

Fig. 2. Experimental results from various implementations, for a 154MB video file with 1920×1080 resolution: (a) final file sizes, (b) elapsed encoding time.

5 Conclusion

We aimed to reduce H.264 digital video encoding time with GPGPU techniques. We started by analyzing the pros and cons of existing H.264 encoders, and located the computation-heavy areas in a real implementation. We then applied parallel-processing techniques to those areas, while preserving the already-optimized areas of the existing implementation. Finally, our OpenCL-based GPU kernel implementation achieves 10% to 50% speed-ups in comparison with widely used H.264 encoders.

Acknowledgments. This investigation was financially supported by Semiconductor Industry Collaborative Project between Kyungpook National University and Samsung Electronics Co. Ltd.

References

1. Wiegand, T., et al.: Overview of the H.264/AVC Video Coding Standard, IEEE Trans. on Circuits Systems for Video Technology, vol.13, no.7, pp (2003)
2. Munshi, A.: The OpenCL Specification, version , Khronos OpenCL Working Group (2012)
3. NVIDIA: CUDA C Programming Guide, version 2.5, NVIDIA (2012)
4. Wu, N., Wen, M., Su, H., Ren, J., Zhang, C.: A Parallel H.264 Encoder with CUDA: Mapping and Evaluation, IEEE 18th Int'l Conf. on Parallel and Distributed System (2012)

5. Marth, E., Marcus, G.: Parallelization of the x264 encoder using OpenCL, ACM SIGGRAPH 2010 Posters, Article No. 72 (2012)
6. Merritt, L., Vanam, R.: Improved Rate Control and Motion Estimation for H.264 Encoder, IEEE Int'l Conf. on Image Processing 2007, vol. 5, pp (2007)
7. Polinger, A.: Developing Microsoft Media Foundation Applications, Microsoft Press (2011)
8. MeGUI
9. StaxRip