Sum of Absolute Differences Hardware Accelerator

Presentation transcript:

Sum of Absolute Differences Hardware Accelerator Mark Lodermeier 9:13 PM

Outline
Overview of Motion Estimation and MPEG-4 Part 10 (AVC)
Approach
Tasks Performed
To Do

Motion Estimation
Used for video compression: block matching between successive frames.
Searches for the best-matching block, which yields the motion vectors.
Used to create a model of the current frame from reference frame(s) taken from either previous or future frames.
Motion vectors are found by determining the minimum SAD over the candidate displacements (r,s): MV = argmin over (r,s) of SAD(x,y,r,s).
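
The argmin above can be sketched as a brute-force search. This is a minimal illustration, not the accelerator's implementation: the function names, the frame layout (lists of rows indexed `frame[y][x]`), and the convention that (x, y) is the block's top-left corner are all assumptions made for the example.

```python
def sad(cur, ref, x, y, r, s, n):
    """Sum of absolute differences between the n x n block of `cur` at (x, y)
    and the block of `ref` displaced by (r, s). Frames are lists of rows."""
    return sum(
        abs(cur[y + j][x + i] - ref[y + s + j][x + r + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, x, y, n, lo, hi):
    """Try every displacement (r, s) in [lo, hi] x [lo, hi] and return
    (motion_vector, min_sad). Candidates that leave the frame are skipped."""
    height, width = len(ref), len(ref[0])
    best_cost, best_mv = None, None
    for s in range(lo, hi + 1):
        for r in range(lo, hi + 1):
            inside = 0 <= x + r and x + r + n <= width and 0 <= y + s and y + s + n <= height
            if not inside:
                continue
            cost = sad(cur, ref, x, y, r, s, n)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (r, s)
    return best_mv, best_cost
```

Ties are broken in favour of the first candidate visited, which is an arbitrary choice for this sketch.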

Motion Estimation
The Full Search algorithm produces the best results but is computationally expensive.
Motion estimation accounts for 50-70% of the computational complexity in MPEG-4 video encoding/decoding.
For a small search range of [-8, +7] in each direction with 16x16 macroblocks, 16*16 pixel comparisons are performed at each of 16*16 search positions, giving 65,536 additions of absolute differences for a single 16x16 block.
Real-time video of a 480x640
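
The per-block count can be checked with a few lines of arithmetic. The helper names are hypothetical, and the per-frame helper simply assumes the frame is tiled by whole macroblocks; it is an illustration, not a figure taken from the slides.

```python
def sad_ops_per_block(block=16, lo=-8, hi=7):
    """Absolute-difference additions for one block under full search."""
    positions = (hi - lo + 1) ** 2      # 16 * 16 = 256 candidate displacements
    return block * block * positions    # 256 pixels compared at each position


def sad_ops_per_frame(width, height, block=16, lo=-8, hi=7):
    """Same count for a whole frame, assuming it divides into whole blocks."""
    blocks = (width // block) * (height // block)
    return blocks * sad_ops_per_block(block, lo, hi)


print(sad_ops_per_block())          # 65536
print(sad_ops_per_frame(640, 480))  # 1200 macroblocks * 65536 = 78643200
```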

MPEG-4 Part 10 (AVC): Variable Block Sizes
Each 16x16 macroblock can be split in half into two 16x8 or two 8x16 blocks, or into four 8x8 sub-blocks.
Each 8x8 sub-block can in turn be split in half into two 8x4 or two 4x8 blocks, or into four 4x4 blocks.

MPEG-4 Part 10 (AVC)
[Figure: the previous 16x16 macroblock split into smaller blocks]

MPEG-4 Variable Block Sizes
Purpose:
Many small blocks require a large number of bits to encode.
Few large blocks may produce poor quality.
Variable block sizes can give higher efficiency at the same quality.
Challenges:
Generating motion vectors for all block sizes increases the computational cost of an already intensive algorithm.
The correct block size must be chosen among many options to balance bandwidth and quality.

Approach
How can all MVs for variable-sized blocks be generated efficiently?
Take full advantage of the parallel nature of both motion estimation and the generation of variable-sized blocks.
Maintain high processor utilization.

Tasks Performed
Implemented a 1-D systolic array in VHDL.
16 Processing Elements (PEs), each with an Absolute Difference unit, a 9-to-2 compressor, and a 3-to-1 compressor.

Absolute Difference Unit

Absolute Difference
Math behind the absolute difference unit:
Only the condition A > B needs to be checked.
For n-bit operands, B + B_not = 2^n - 1, so B_not = 2^n - 1 - B.
The two outputs of the absolute difference unit therefore sum to 2^n - 1 + |A - B|.
A correction term of m must be added to get rid of the 2^n - 1 terms, where m is the number of absolute difference units used.
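
The identity can be checked in software. The sketch below assumes the unit passes the larger operand through unchanged and inverts the smaller one, and that after the correction term is added the leftover m * 2^n is simply discarded; the slides do not spell out either detail, and the function names are hypothetical.

```python
def abs_diff_unit(a, b, n=8):
    """Model of one absolute difference unit for n-bit operands.
    Returns two values whose sum is (2**n - 1) + |a - b|."""
    mask = (1 << n) - 1
    if a > b:                  # the only comparison needed
        return a, ~b & mask    # ~b & mask == 2**n - 1 - b
    return b, ~a & mask


def sad_with_correction(pairs, n=8):
    """Sum the outputs of m units, then add the correction term m."""
    m = len(pairs)
    total = sum(x + y for x, y in (abs_diff_unit(a, b, n) for a, b in pairs))
    total += m                 # m*(2**n - 1) + SAD  becomes  m*2**n + SAD
    return total - (m << n)    # assumed: the m*2**n part is dropped


print(sad_with_correction([(10, 3), (0, 255), (7, 7), (200, 100)]))  # 362
```

No subtractor and no carry-in are needed inside the unit itself; the single added constant m replaces the "+1" that two's-complement negation would otherwise require per unit.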

Single Processing Element
[Figure: block diagram of one PE. Labels: inputs A, B, C; Abs Diff Unit; 9 to 2 Adder Reduction Tree; 3 to 1 Reduction Tree; Latch; Correction term - 4.]

Systolic Array
[Figure: 1-D systolic array. Labels: PE0, PE1, PE2, ..., PE15; inputs A, B, C; delay elements D; control; "4x4 SAD and MV" outputs feeding shift registers.]
Back-End Adder Array for Variable Blocks
Outputs:
16 4x4 SAD values and MVs
8 4x8 SAD values and MVs
8 8x4 SAD values and MVs
4 8x8 SAD values and MVs
2 8x16 SAD values and MVs
2 16x8 SAD values and MVs
1 16x16 SAD value and MV

Back-End Adder Array Diagram
[Figure: adder array over sub-blocks 0 to 15. Stages: 16 4x4 SADs and MVs; 8 8x4 SADs and MVs; 8 4x8 SADs and MVs; 4 8x8 SADs and MVs; 2 8x16 and 2 16x8 SADs and MVs; 16x16 SAD and MV. Legend: each dot represents Min, Latch.]
Macroblock with 16 4x4 sub-blocks:
12 13 14 15
 8  9 10 11
 4  5  6  7
 0  1  2  3
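
For a single search position, the adder array's job can be sketched as follows. The sub-block index follows the numbering in the diagram (index = 4 * row + column); reading "8x4" as width 8 by height 4 is an assumption, as are the function name and the output ordering within each size.

```python
def variable_block_sads(s4):
    """Given the 16 4x4 SADs of one macroblock at one search position,
    return the SADs of all 41 variable-size blocks, keyed by block size."""
    if len(s4) != 16:
        raise ValueError("expected 16 4x4 SAD values")

    def at(row, col):
        return s4[4 * row + col]

    out = {"4x4": list(s4)}
    # Two horizontally adjacent 4x4 blocks (assumed to form an 8-wide, 4-high block).
    out["8x4"] = [at(r, c) + at(r, c + 1) for r in range(4) for c in (0, 2)]
    # Two vertically adjacent 4x4 blocks (assumed to form a 4-wide, 8-high block).
    out["4x8"] = [at(r, c) + at(r + 1, c) for r in (0, 2) for c in range(4)]
    # Four 4x4 blocks per 8x8 quadrant; order is (rows 0-1, cols 0-1), (rows 0-1, cols 2-3), ...
    out["8x8"] = [
        sum(at(r + dr, c + dc) for dr in (0, 1) for dc in (0, 1))
        for r in (0, 2)
        for c in (0, 2)
    ]
    q = out["8x8"]
    out["16x8"] = [q[0] + q[1], q[2] + q[3]]   # two quadrants side by side
    out["8x16"] = [q[0] + q[2], q[1] + q[3]]   # two quadrants stacked
    out["16x16"] = [sum(q)]
    return out
```

Every larger SAD is built only from already-computed smaller ones, so the pixel data is touched once, by the PEs.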

Another way to represent the generation of motion vectors for the variable-sized blocks
[Figure: two p x p matrices, each holding the SAD values for one 4x4 block in all search positions, combine into one p x p matrix holding the SAD values for one 4x8 block in all search positions.]
First, the p x p matrices of the specified 4x4 blocks are added element by element to generate the matrix for the 4x8 block (where p is the total search range).
To find the corresponding motion vector, search for the minimum value in that matrix.
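
A small sketch of this matrix view, with made-up numbers and p = 4. The mapping from matrix position to displacement (column gives the x offset, row gives the y offset, both starting at `lo`) and the function names are assumptions for illustration.

```python
def combine(m_a, m_b):
    """Element-wise sum of two p x p SAD matrices."""
    return [[a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(m_a, m_b)]


def best_mv(matrix, lo):
    """Return ((dx, dy), min_sad) for a p x p SAD matrix whose entry
    [row][col] corresponds to displacement (col + lo, row + lo)."""
    value, row, col = min(
        (value, row, col)
        for row, line in enumerate(matrix)
        for col, value in enumerate(line)
    )
    return (col + lo, row + lo), value


sad_a = [[9, 8, 7, 6], [5, 4, 3, 9], [8, 2, 7, 6], [5, 9, 9, 9]]  # one 4x4 block
sad_b = [[1, 1, 1, 1], [9, 9, 0, 1], [1, 9, 1, 1], [1, 1, 1, 1]]  # its neighbour
print(best_mv(sad_a, -2))                  # ((-1, 0), 2)
print(best_mv(combine(sad_a, sad_b), -2))  # ((0, -1), 3)
```

The example shows why each block size needs its own minimum search: the best displacement for the combined block need not match the best displacement of either constituent.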

The Schedule for the Accelerator
The top box is the current frame data and the bottom box is the reference frame data.
Every 4 clock cycles a PE produces a 4x4 SAD value.
It takes 16 clock cycles to fill the systolic array and reach 100% processor utilization.
After 64 clock cycles PE0 has completed a 16x16 macroblock search for one position; the remaining PEs then finish in turn on the following cycles.
So a full search over [-8, +7] in the x and y directions takes a total of 64 x 16 = 1024 cycles.
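
The cycle count can be reproduced with a short calculation. It assumes each PE consumes 4 pixel pairs per clock (inferred from one 4x4 SAD every 4 cycles) and that the 256 search positions are shared evenly among the 16 PEs; like the slide, it ignores the initial cycles spent filling the array. The function name and parameters are hypothetical.

```python
def full_search_cycles(block=16, pixels_per_cycle=4, num_pes=16, lo=-8, hi=7):
    """Approximate cycles for a full search of one macroblock."""
    cycles_per_position = (block * block) // pixels_per_cycle  # 256 / 4 = 64
    positions = (hi - lo + 1) ** 2                             # 16 * 16 = 256
    passes = -(-positions // num_pes)                          # ceil(256 / 16) = 16
    return cycles_per_position * passes


print(full_search_cycles())  # 1024
```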

Questions?