Implementation of a De-blocking Filter and Optimization in PLX


Implementation of a De-blocking Filter and Optimization in PLX
Ashwin Alapati, Anandnayan Jayaraman

Outline Motivation Algorithm Transformation Proposed Architecture PLX Implementation Conclusion and Results

Motivation
- What is de-blocking?
- Types
  - In-loop de-blocking
  - Post-processing de-blocking
- Computationally intensive

Algorithm
1. Read the input image.
2. Pick a macro-block (16 x 16).
3. Identify blocking artifacts in the horizontal direction and apply adaptive filtering.
4. Identify blocking artifacts in the vertical direction and apply adaptive filtering.
5. Write the output image.
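The traversal on this slide can be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not the authors' code: the names `deblock_image`, `mb_pass_fn`, and `placeholder_pass` are hypothetical, the image is assumed to be a single 8-bit plane stored row by row, and partial macro-blocks at the right and bottom edges are simply skipped.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB_SIZE 16  /* macro-block edge length, as given on the slide */

/* Hypothetical callback type: processes one macro-block in place.
   (x0, y0) is the top-left pixel of the macro-block. */
typedef void (*mb_pass_fn)(uint8_t *img, size_t stride, size_t x0, size_t y0);

/* Placeholder pass (an assumption for illustration): it only increments the
   top-left pixel so the traversal can be observed. A real pass would detect
   and filter blocking artifacts inside the macro-block. */
void placeholder_pass(uint8_t *img, size_t stride, size_t x0, size_t y0)
{
    img[y0 * stride + x0]++;
}

/* Visit every complete 16x16 macro-block and run the horizontal pass, then
   the vertical pass, in the order listed on the slide.
   Returns the number of macro-blocks visited. */
size_t deblock_image(uint8_t *img, size_t width, size_t height,
                     mb_pass_fn horizontal_pass, mb_pass_fn vertical_pass)
{
    assert(img != NULL);
    size_t count = 0;
    for (size_t y0 = 0; y0 + MB_SIZE <= height; y0 += MB_SIZE) {
        for (size_t x0 = 0; x0 + MB_SIZE <= width; x0 += MB_SIZE) {
            horizontal_pass(img, width, x0, y0);
            vertical_pass(img, width, x0, y0);
            count++;
        }
    }
    return count;
}
```

Passing the two passes as function pointers keeps the traversal independent of how detection and filtering are implemented, which is convenient when one stage is later moved to another implementation.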

Block Boundary Detection
- Determine block boundaries
- Strength determination
Adaptive Filtering
- FIR filtering with varying coefficients
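One way to picture strength determination followed by FIR filtering with varying coefficients is the sketch below. Everything specific in it is an assumption made for illustration: the slides give no thresholds or coefficients, so `T_FLAT`, `T_EDGE`, the three strength levels, the filter taps, and the function name `deblock_edge` are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical thresholds; the slides do not give values. */
enum { T_FLAT = 2, T_EDGE = 40 };

/* Filter one set of samples across a block boundary: s = {p1, p0 | q0, q1},
   where the boundary lies between p0 and q0. Modifies s[1] and s[2] in place.
   Returns the chosen strength:
   0 = untouched, 1 = weak smoothing, 2 = strong smoothing. */
int deblock_edge(uint8_t s[4])
{
    assert(s != NULL);
    int p1 = s[0], p0 = s[1], q0 = s[2], q1 = s[3];

    /* Boundary detection: no step means nothing to fix; a very large
       step is treated as a real image edge and preserved. */
    int step = abs(p0 - q0);
    if (step == 0 || step > T_EDGE)
        return 0;

    /* Strength determination: flat regions on both sides make the
       blocking artifact more visible, so smooth more strongly. */
    int flat = abs(p1 - p0) <= T_FLAT && abs(q1 - q0) <= T_FLAT;

    if (flat) {
        /* Strong: 3-tap FIR with taps (1, 2, 1) / 4, with rounding. */
        s[1] = (uint8_t)((p1 + 2 * p0 + q0 + 2) >> 2);
        s[2] = (uint8_t)((p0 + 2 * q0 + q1 + 2) >> 2);
        return 2;
    }
    /* Weak: 2-tap FIR with taps (3, 1) / 4, with rounding. */
    s[1] = (uint8_t)((3 * p0 + q0 + 2) >> 2);
    s[2] = (uint8_t)((p0 + 3 * q0 + 2) >> 2);
    return 1;
}
```

The point of the sketch is only the shape of the computation: the coefficients are selected per boundary from the measured strength rather than fixed for the whole image.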

Algorithm Transformation: Concepts Used
- Retiming: reduces the critical path
- Unfolding: reduces the iteration period toward the iteration bound
- PLX sub-word parallelism: exploits data parallelism
- Parallel execution by loop vectorization
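Sub-word parallelism and loop vectorization can be illustrated in portable C by packing four 8-bit pixels into one 32-bit word and processing them together. The sketch below is not PLX code and not taken from the slides: the averaging operation, the function names, and the requirement that `n` be a multiple of 4 are assumptions chosen to keep the example short. It emulates with ordinary integer operations what a sub-word-parallel instruction would do in a single step.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Floor average of four packed 8-bit lanes.
   Uses a + b = 2*(a & b) + (a ^ b); masking with 0xFE in each lane stops
   the shifted bits from leaking into the neighbouring lane. */
uint32_t avg4_u8(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Scalar reference: one pixel per loop iteration. */
void avg_scalar(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (uint8_t)((a[i] + b[i]) >> 1);
}

/* Packed version: the loop is unfolded by 4 and each iteration handles
   four pixels in one 32-bit word. n must be a multiple of 4 here. */
void avg_packed(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t wa, wb, wo;
        memcpy(&wa, a + i, 4);   /* load four pixels without alignment issues */
        memcpy(&wb, b + i, 4);
        wo = avg4_u8(wa, wb);
        memcpy(out + i, &wo, 4); /* store four results */
    }
}
```

Because the operation is lane-wise, the result does not depend on byte order, and the packed loop runs a quarter as many iterations as the scalar one.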

Architecture
[Block diagram: address generation + memory, input/output mux, horizontal block boundary detection, horizontal filtering, vertical block boundary detection, vertical filtering, post processing]

Input to the Architecture

Results

Profiling Results

Operation                        % of total execution time
Vertical Boundary Detection      35.285
Horizontal Boundary Detection    24.428
Vertical Filtering               12.324
Horizontal Filtering              8.532
Misc (Image I/O)                 19.431
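As a rough what-if on these figures, Amdahl's law bounds the whole-program gain from accelerating only the two boundary-detection stages. The assumptions here are not stated on the slides: that the percentages describe the C implementation, that only boundary detection is accelerated, and that it is accelerated by a factor s.

```latex
\[
  f = 0.35285 + 0.24428 = 0.59713, \qquad
  S(s) = \frac{1}{(1 - f) + f/s}, \qquad
  \lim_{s \to \infty} S(s) = \frac{1}{1 - f} = \frac{1}{0.40287} \approx 2.48
\]
```

Under those assumptions the remaining 40.287% (filtering and image I/O) caps the overall speedup at about 2.48x, however fast boundary detection becomes.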

Issues in PLX
- Getting input values: used C to dump the BMP pixel values into a file
- Memory access: used sequential addressing of the data

Results of PLX Implementation
- PLX implementation: 1548 cycles
- C code profiling: 4843 cycles
- Approximate speedup: 3.1x
- Around 20% faster in terms of time
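The quoted speedup follows directly from the two cycle counts, assuming both counts cover the same workload (the slides do not say so explicitly):

```latex
\[
  \text{speedup} = \frac{\text{C cycles}}{\text{PLX cycles}}
                 = \frac{4843}{1548} \approx 3.13
\]
```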

Work Done
- Selected the algorithm
- Developed the architecture
- Implemented the algorithm in C
- Profiling
- Implemented the algorithm in PLX
- Performance evaluation

Future Work
- Further optimize the PLX code
- Use PLX for filtering as well

Thank you! Questions?