Implementation of a De-blocking Filter and Optimization in PLX


Implementation of a De-blocking Filter and Optimization in PLX
Ashwin Alapati, Anandnayan Jayaraman

Outline Motivation Algorithm Transformation Proposed Architecture PLX Implementation Conclusion and Results

Motivation
- What is De-blocking?
- Types
  - In-Loop De-blocking
  - Post Processing De-blocking
- Computationally Intensive!!

Algorithm
1. Input Image
2. Pick a Macro-Block (16 x 16)
3. Identify Blocking Artifacts in Horizontal Direction and apply Adaptive Filtering
4. Identify Blocking Artifacts in Vertical Direction and apply Adaptive Filtering
5. Output Image
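The flow above can be sketched as a loop over macro-blocks. This is a minimal C sketch, not the authors' code: the names `deblock_image` and `mb_pass_fn`, the row-major 8-bit grayscale layout, and the choice to run both passes on a block before moving to the next one are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB_SIZE 16 /* macro-block edge length, taken from the slide */

/* Hypothetical callback: detects and filters artifacts in one macro-block,
 * in place. (x0, y0) is the top-left pixel of the block. */
typedef void (*mb_pass_fn)(uint8_t *img, size_t stride, size_t x0, size_t y0);

/* Visits every 16x16 macro-block in raster order and applies the horizontal
 * pass followed by the vertical pass. A NULL pass is skipped.
 * Returns the number of macro-blocks visited. */
size_t deblock_image(uint8_t *img, size_t width, size_t height,
                     mb_pass_fn horizontal_pass, mb_pass_fn vertical_pass)
{
    /* Assumption: image dimensions are multiples of the macro-block size. */
    assert(img != NULL && width % MB_SIZE == 0 && height % MB_SIZE == 0);

    size_t visited = 0;
    for (size_t y0 = 0; y0 < height; y0 += MB_SIZE) {
        for (size_t x0 = 0; x0 < width; x0 += MB_SIZE) {
            if (horizontal_pass != NULL)
                horizontal_pass(img, width, x0, y0);
            if (vertical_pass != NULL)
                vertical_pass(img, width, x0, y0);
            ++visited;
        }
    }
    return visited;
}
```

Passing the two passes as callbacks keeps the traversal separate from the detection and filtering logic, which is convenient when only one of them is later moved to another target.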

Block Boundary Detection
- Determine Block Boundaries
- Strength Determination

Adaptive Filtering
- FIR filtering with varying coefficients
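One way to read the two stages above is sketched below for a single row of four pixels straddling a block boundary (`p1 p0 | q0 q1`). The thresholds, the three strength levels, and the tap weights are all hypothetical; the slides do not state the values the authors used. The sketch only shows the general idea of choosing FIR coefficients from a measured boundary strength.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical thresholds; the slides do not give the actual values. */
enum { EDGE_THRESHOLD = 40, FLAT_THRESHOLD = 4 };

/* Strength of the boundary between p0 and q0:
 * 0 = leave untouched, 1 = light smoothing, 2 = strong smoothing. */
int boundary_strength(uint8_t p1, uint8_t p0, uint8_t q0, uint8_t q1)
{
    int step = abs((int)p0 - (int)q0);
    if (step == 0 || step >= EDGE_THRESHOLD)
        return 0; /* no visible seam, or a step large enough to be a real edge */

    int flat = abs((int)p1 - (int)p0) <= FLAT_THRESHOLD &&
               abs((int)q0 - (int)q1) <= FLAT_THRESHOLD;
    return flat ? 2 : 1; /* flat neighbourhoods make seams more visible */
}

/* Filters the two pixels next to the boundary of px = {p1, p0, q0, q1} with a
 * 3-tap FIR whose centre weight depends on the strength:
 * strength 2 uses (1, 2, 1) / 4, strength 1 uses (1, 6, 1) / 8. */
void filter_boundary(uint8_t px[4])
{
    assert(px != NULL);
    int s = boundary_strength(px[0], px[1], px[2], px[3]);
    if (s == 0)
        return;

    int centre = (s == 2) ? 2 : 6;
    int div = centre + 2;
    int p0 = px[1], q0 = px[2]; /* keep the unfiltered values for both outputs */

    px[1] = (uint8_t)((px[0] + centre * p0 + q0 + div / 2) / div);
    px[2] = (uint8_t)((p0 + centre * q0 + px[3] + div / 2) / div);
}
```

For example, the row `{100, 100, 110, 110}` is classified as strength 2 and becomes `{100, 103, 108, 110}`, while `{10, 10, 200, 200}` is treated as a real edge and left unchanged.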

Algorithm Transformation
Concepts Used
- Retiming → Reducing Critical Path
- Unfolding → Reduce Iteration Bound
- PLX Sub-word Parallelism
- Exploit Parallelism: Parallel Execution by Loop Vectorization
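Sub-word parallelism and loop vectorization can be illustrated in portable C by packing four 8-bit pixels into one 32-bit word and operating on all four lanes at once. The sketch below emulates a packed average, an operation that a subword-parallel ISA would typically provide as a single instruction. The function names are hypothetical, and this is not PLX code; retiming and unfolding are not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Floor average of four packed 8-bit lanes, computed without letting bits
 * cross lane boundaries: avg = (a & b) + ((a ^ b) >> 1) per lane. */
uint32_t packed_avg_u8(uint32_t a, uint32_t b)
{
    return (a & b) + (((a ^ b) & 0xFEFEFEFEu) >> 1);
}

/* Vectorized loop: averages two pixel rows four pixels per iteration.
 * Assumption: n is a multiple of 4. */
void average_rows(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        uint32_t wa, wb, wr;
        memcpy(&wa, a + i, sizeof wa); /* memcpy avoids unaligned-access issues */
        memcpy(&wb, b + i, sizeof wb);
        wr = packed_avg_u8(wa, wb);
        memcpy(dst + i, &wr, sizeof wr);
    }
}
```

Because each lane is processed independently, the result does not depend on byte order, and the loop performs one quarter of the iterations a byte-at-a-time loop would need.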

Architecture
[Block diagram: Address Generation + Memory, IN/OUT Mux, Horizontal Block Boundary Detection, Horizontal Filtering, Vertical Block Boundary Detection, Vertical Filtering, Post Processing]

Input to the Architecture

Results

Profiling Results
Operation                        % of Total Execution Time
Vertical Boundary Detection      35.285
Horizontal Boundary Detection    24.428
Vertical Filtering               12.324
Horizontal Filtering             8.532
Misc (Image IO)                  19.431

Issues in PLX
- Getting Input Values: Used C to dump the BMP values into a file
- Memory Access: Used a sequential way of addressing the data
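The input workaround described above might look roughly like the following C helper, which writes already-decoded pixel values to a text file in raster order so they can be read back sequentially. The function name `dump_pixels` and the one-value-per-line format are assumptions; the slides do not describe the file format that was actually used, and BMP header parsing is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Writes n pixel values in raster order, one decimal value per line.
 * Returns the number of values successfully written. */
size_t dump_pixels(FILE *out, const uint8_t *pixels, size_t n)
{
    assert(out != NULL && pixels != NULL);

    size_t written = 0;
    for (size_t i = 0; i < n; ++i) {
        if (fprintf(out, "%u\n", (unsigned)pixels[i]) < 0)
            break; /* stop at the first write error */
        ++written;
    }
    return written;
}
```

A flat, sequential dump like this matches the sequential addressing mentioned on the slide, since the consumer never needs to compute a two-dimensional address.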

Results of PLX Implementation
- PLX implementation: 1548 cycles
- C code profiling: 4843 cycles
- Approximate speedup is 3.1X
- Around 20% faster in terms of time
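The quoted speedup follows directly from the two cycle counts, assuming both were measured on the same workload:

```latex
\text{speedup} = \frac{\text{C cycles}}{\text{PLX cycles}} = \frac{4843}{1548} \approx 3.13,
\qquad
\text{cycles saved} = 1 - \frac{1548}{4843} \approx 68\%
```

The separate "around 20% faster in terms of time" figure cannot be derived from these cycle counts alone; the slides do not say how it was measured.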

Work Done
- Selecting the Algorithm
- Developed Architecture
- Implemented algorithm in C
- Profiling
- Implemented algorithm in PLX
- Performance Evaluation

Future Work
- Try optimizing the PLX code
- Use PLX for filtering as well

Thank You !! Questions ???