Dynamically Parameterized Architectures for Power Aware Video Coding: Motion Estimation and DCT Wayne Burleson Prashant Jain

Dynamically Parameterized Architectures for Power Aware Video Coding: Motion Estimation and DCT
Wayne Burleson, Prashant Jain, Subramanian Venkatraman
Dept. of Electrical and Computer Engineering, University of Massachusetts Amherst
This work was partially supported by NSF.

Outline
Introduction
Video Content Variation
Dynamic Parameterization to achieve Power-Aware Video Coding
Motion Estimation & DCT
On-Going Work

Introduction
Video content and processing are non-uniform in space and time.
Video processing can degrade gracefully in power-constrained environments by exploiting perceptual tolerance (MPEG-4).
High-level algorithm changes affect power efficiency the most.

Recent Work
Configurable FPGA-based architectures [Villasenor ‘95].
Heterogeneous architecture with programmable processors [Kneip ‘98].
Heterogeneous configurable architecture with on-chip low-power FPGA [Zhang ‘00].
Drawbacks of FPGAs: slow, high power dissipation.

Adaptive System-On-a-Chip (aSOC)
Partially predefined configuration architecture.
Heterogeneous tiles with statically scheduled interconnection switches.
Tiles can be reconfigured internally as well as from an external source.
[Figure labels: uP; DSP; RISC; RAM; ME/DCT Core; SRAM; Switch; Memory; FPGA.]
Ref. J. Liang et al., "aSOC: A Scalable, Single-Chip Communications Architecture", in Proceedings of the IEEE International Conference on Parallel Architectures and Compilation Techniques, 2000.

Content Variation across sequences

Content Variation in Time
[Plot: horizontal component of the motion vectors.]

Content Variation in Space
[Figure labels: "Background: not much variation"; "High variation".]

Dynamic Parameterization
Functional parameters vary the output of a computation.
Architectural parameters allow trade-offs in area, performance, power and reliability.
Parameters can be bound at varying stages.
[Figure: binding-time axis with the labels Standard Time, IP Time, Run-Time, Config. Time, Compile/Boot Time and Design Time, on a scale of Years … Months … Secs … msecs … µsecs.]

Dynamic Parameter Adjustment
Predictor inputs: system requirements and constraints; signal statistics from the input signals; algorithm statistics from the post-processing of the input signals.
Predictor outputs: architectural and functional parameters.
[Figure labels: Signals; Signal Processing System (Algorithm, Architecture); Predictor; Archi. Para.; Function. Para.; Signal Stats.; Algo. & Archi. Stats.; Precision, Quality, Compress.; Area, Speed, Power; Area, Latency, Power.]

Functional Parameter Adjustment: Algorithms
[Figure labels: Full Search; Logarithmic.]

Algorithm      Compression   Frames encoded/sec (fps)
Full Search    70:1          0.2
Logarithmic    50:1          2.76

Functional Parameter Adjustment: Search Space
A larger search space improves the chances of a good match, and a good match gives high compression.
Increasing the search space is effective only up to a point.
A larger search space also increases computations.
[Plot: bpp for a specific sequence.]

Power versus Search Area
Memories are major contributors to power dissipation.
The algorithms presented reduce memory accesses and computations.
Our novel architecture reconfigures to different algorithms with reduced memory accesses and computations, thus saving power.

Power Consumption in Video Coding
[Chart: computation (%) for ME, DCT, IDCT, VLC, etc.]
Ref. Peter Kuhn, “Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation”

Functional Parameter: Full Search
Selects the most representative block from an exhaustive set of candidate blocks within a search window.
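As a rough illustration of the exhaustive search described above, the following minimal sketch tests every displacement in a small window and keeps the best match. The function names, the 4x4 block size, the +/-2 search range and the synthetic frames are assumptions made for this example and are not taken from the presentation; the matching cost is the sum of absolute differences.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    return sum(
        abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
        for y in range(n)
        for x in range(n)
    )


def full_search(cur, ref, bx, by, n=4, p=2):
    """Test every displacement in [-p, p] x [-p, p]; return (best_sad, dx, dy)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            # Skip candidates that fall outside the reference frame.
            if not (0 <= bx + dx and bx + dx + n <= w
                    and 0 <= by + dy and by + dy + n <= h):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


# Synthetic frames (hypothetical data): the current frame is the reference
# shifted right by one pixel, so the block should match at dx = -1, dy = 0.
ref = [[(3 * x + 7 * y) % 31 for x in range(12)] for y in range(12)]
cur = [[ref[y][max(x - 1, 0)] for x in range(12)] for y in range(12)]
result = full_search(cur, ref, bx=4, by=4)
print(result)  # (0, -1, 0)
```

The cost of this approach grows with the square of the search range, which is why the search window is treated as a tunable parameter.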

Functional Parameter: Spiral Search
Performs a spiral search for the matching block.
The algorithm is data-dependent at run time.

Functional Parameter: 3-Step Search
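The slide carries only a title, so the following is a hedged sketch of the classic three-step search as it is commonly described, not necessarily the exact variant used by the authors: nine points are tested around the current centre, the centre moves to the best one, and the step size is halved (4, 2, 1). The function names, block size and the synthetic ramp image are assumptions for this example.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """SAD for displacement (dx, dy); infinite if the candidate leaves the frame."""
    h, w = len(ref), len(ref[0])
    if not (0 <= bx + dx and bx + dx + n <= w
            and 0 <= by + dy and by + dy + n <= h):
        return float("inf")
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def three_step_search(cur, ref, bx, by, n=8, step=4):
    """Return (best_sad, dx, dy, candidates_evaluated)."""
    cx, cy = 0, 0
    best = sad(cur, ref, bx, by, 0, 0, n)
    evaluated = 1
    while step >= 1:
        best_dx, best_dy = cx, cy
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                if ox == 0 and oy == 0:
                    continue  # centre was already evaluated
                cost = sad(cur, ref, bx, by, cx + ox, cy + oy, n)
                evaluated += 1
                if cost < best:
                    best, best_dx, best_dy = cost, cx + ox, cy + oy
        cx, cy = best_dx, best_dy  # recentre on the best candidate
        step //= 2                 # halve the step: 4, 2, 1
    return best, cx, cy, evaluated


# Hypothetical smooth ramp image; the current frame is displaced by (4, -2).
def ramp(x, y):
    return 1000 * x + y

ref = [[ramp(x, y) for x in range(32)] for y in range(32)]
cur = [[ramp(x + 4, y - 2) for x in range(32)] for y in range(32)]
result = three_step_search(cur, ref, bx=12, by=12)
print(result)  # (0, 4, -2, 25)
```

Only 25 candidates are evaluated here, against 225 for a full search over the same +/-7 range. On real content the coarse first step can lock onto a local minimum, so the result is not guaranteed to equal the full-search optimum.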

Functional Parameter: Pel Subsampling
[Figure: 16x16 pixel array shown with 4:1 subsampling and with 2:1 subsampling.]
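A minimal sketch of how pel subsampling reduces the work per candidate block is given below. The sampling patterns are assumptions, since the original figure is not available: 2:1 is modelled as a checkerboard and 4:1 as every other pixel in both dimensions. The function names are hypothetical.

```python
def keep_pixel(x, y, ratio):
    """Decide whether pixel (x, y) takes part in the SAD for a given ratio."""
    if ratio == 1:
        return True
    if ratio == 2:
        return (x + y) % 2 == 0          # checkerboard (assumed pattern)
    if ratio == 4:
        return x % 2 == 0 and y % 2 == 0  # every other row and column (assumed)
    raise ValueError("unsupported subsampling ratio")


def sad_subsampled(cur, ref, ratio=1):
    """Return (sad, pixels_visited) for two equally sized blocks."""
    total, visited = 0, 0
    for y, (cur_row, ref_row) in enumerate(zip(cur, ref)):
        for x, (c, p) in enumerate(zip(cur_row, ref_row)):
            if keep_pixel(x, y, ratio):
                total += abs(c - p)
                visited += 1
    return total, visited


# Two flat 16x16 blocks (hypothetical data) that differ by 3 at every pixel.
cur = [[10] * 16 for _ in range(16)]
ref = [[7] * 16 for _ in range(16)]
for ratio in (1, 2, 4):
    print(ratio, sad_subsampled(cur, ref, ratio))
# 1 (768, 256)
# 2 (384, 128)
# 4 (192, 64)
```

The number of absolute-difference operations, and the matching memory reads, fall in proportion to the subsampling ratio, at the cost of a less reliable match.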

Functional Parameter: Half-Pel ME
Current and previous block data can be filtered to half-pel resolution.
[Figure: integer-pel samples A, B, C, D and interpolated half-pel samples a, b, c.]
a = (A+B+C+D)/4, b = (B+D)/2, c = (C+D)/2
Ref. Peter Kuhn, “Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation”

I/O Re-use
Adjacent candidate blocks differ by a single row of pixels, so the previous rows of pixels can be reused.
Previous rows are stored in FIFOs.
[Figure labels: Current Block; Candidate Blocks.]
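The row reuse described above can be sketched as follows, assuming candidates are scanned vertically so that each new candidate needs one new row. The FIFO is modelled with a bounded `collections.deque`; the function name and the test frame are hypothetical.

```python
from collections import deque


def scan_column(ref, bx, n, y_first, y_last):
    """Build the n x n candidate blocks whose top rows run from y_first to
    y_last, fetching each reference row only once."""
    fifo = deque(maxlen=n)  # holds the n most recently fetched rows
    rows_fetched = 0
    blocks = []
    for row in range(y_first, y_last + n):
        fifo.append(ref[row][bx:bx + n])  # one new row from frame memory
        rows_fetched += 1
        if len(fifo) == n:
            # The FIFO now contains exactly the rows of one candidate block.
            blocks.append((row - n + 1, [list(r) for r in fifo]))
    return blocks, rows_fetched


# Hypothetical 8x8 reference frame with distinct pixel values.
ref = [[10 * y + x for x in range(8)] for y in range(8)]
blocks, rows_fetched = scan_column(ref, bx=2, n=4, y_first=0, y_last=4)
print(len(blocks), rows_fetched)  # 5 8
```

Five candidates are produced from 8 row fetches, where fetching every block independently would take 5 x 4 = 20.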

Matching Criteria
The matching criterion used is the Sum of Absolute Differences (SAD).
Ref. Peter Kuhn, “Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation”
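The slide's own formula is not available in this transcript. As a hedged reminder, SAD is usually written as below, where the symbols are assumptions chosen for this note: c is the current block, p is the previous (reference) frame, N is the block size (16 for a 16x16 block) and (d_x, d_y) is the candidate displacement.

```latex
\mathrm{SAD}(d_x, d_y) \;=\; \sum_{i=0}^{N-1} \sum_{j=0}^{N-1}
  \bigl|\, c(i, j) - p(i + d_x,\; j + d_y) \,\bigr|
```

The motion vector is the displacement that minimises this sum over the search window.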

Proposed Architecture for Dynamically Parameterized ME
[Figure labels: 16x16 PE Array; Address Generator Unit; SRAM External to PE Array; Memory Block; Summing Block; PE RAM Addresses; PE Control.]
307,200 bytes/frame storage.

Architecture: Processing Element (PE)
[Figure labels: |c-p|; Local Control; Sum of Absolute Differences; Half-Pel; FIFO; Current Pixel &  256 bytes.]

Discrete Cosine Transform
Integral part of any still-image or video compression system.
Compute-intensive, second only to motion estimation.
Amenable to VLSI implementation through the “Decomposition” property and “Distributed Arithmetic”.

Decomposition Property
1D DCT in matrix notation.
A 2D DCT can be computed as two passes of 1D DCTs (rows, then columns).
Ref. W.H. Chen et al., “A Fast Computational Algorithm for the Discrete Cosine Transform”, IEEE Trans. Commun.,
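To make the decomposition property concrete, the sketch below checks numerically that a row pass followed by a column pass of 1D DCTs gives the same result as a direct 2D DCT. It assumes the orthonormal DCT-II and plain Python lists; the helper names and the test block are hypothetical, and this is not the fast algorithm from the cited reference.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II matrix C: the 1D DCT of a vector x is C times x."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2_row_column(block):
    """2D DCT as two 1D passes: Y = C (X C^T)."""
    c = dct_matrix(len(block))
    row_pass = matmul(block, transpose(c))  # 1D DCT of every row
    return matmul(c, row_pass)              # 1D DCT of every column


def dct2_direct(block):
    """2D DCT from the double-sum definition, for comparison."""
    n = len(block)
    c = dct_matrix(n)
    return [[sum(c[u][y] * c[v][x] * block[y][x]
                 for y in range(n) for x in range(n))
             for v in range(n)] for u in range(n)]


# Hypothetical 8x8 test block.
block = [[(x * y + 3 * x) % 17 for x in range(8)] for y in range(8)]
a = dct2_row_column(block)
b = dct2_direct(block)
max_diff = max(abs(a[u][v] - b[u][v]) for u in range(8) for v in range(8))
print(max_diff < 1e-9)  # True
```

For an N x N block the two-pass form needs on the order of 2N^3 multiply-accumulates instead of N^4, which is one reason the property suits hardware implementation.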

Distributed Arithmetic
Bit-serial arithmetic using a Read Accumulate Computation (RAC) unit.
Inner product computation of coefficient vector A and input vector X.
Facilitates variable-precision processing.
[Figure: bits of the inputs X0 to X3 feed a 4-to-16 address decoder that selects a precomputed coefficient sum (A0, A1, A1+A0, A2, ..., A3+A2+A1, A3+A2+A1+A0); the selected value is added to the doubled (x2) running result.]
Ref. T. Xanthopoulos et al., “A Low-Power DCT Core Using Adaptive Bitwidth and Arithmetic Activity Exploiting Signal Correlations and Quantization”, IEEE JSSC 2000
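The following minimal sketch illustrates the idea behind distributed arithmetic: a 16-entry table of coefficient sums replaces the multipliers, and the inner product is accumulated one input bit-plane per cycle. It assumes unsigned inputs and MSB-first processing; signed two's-complement handling is omitted, and the function names and the `cycles` parameter are hypothetical rather than taken from the cited design.

```python
def build_lut(coeffs):
    """Table entry `addr` holds the sum of the coefficients whose bit is set."""
    return [sum(a for i, a in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << len(coeffs))]


def da_inner_product(coeffs, xs, bits, cycles=None):
    """Bit-serial inner product of coeffs and unsigned xs (each < 2**bits).

    Processing only the top `cycles` bit-planes gives a coarser result in
    fewer cycles, which is how variable precision arises.
    """
    if cycles is None:
        cycles = bits
    lut = build_lut(coeffs)
    acc = 0
    for b in range(bits - 1, bits - 1 - cycles, -1):  # MSB first
        addr = 0
        for i, x in enumerate(xs):
            addr |= ((x >> b) & 1) << i  # one bit from every input
        acc = 2 * acc + lut[addr]        # shift-and-add accumulate
    return acc << (bits - cycles)        # restore the weight of skipped bits


coeffs = [3, -1, 4, 2]
xs = [5, 9, 12, 7]
print(da_inner_product(coeffs, xs, bits=4))            # 68
print(da_inner_product(coeffs, xs, bits=4, cycles=2))  # 60
```

The full-precision result matches the direct sum 3*5 - 1*9 + 4*12 + 2*7 = 68, while stopping after two cycles yields the inner product of the inputs truncated to their two most significant bits.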

Exploiting Content Variation
Most Significant Bit Rejection (MSBR): RAC operation is disabled in the presence of spatial correlation.
Row Column Classification (RCC): overall arithmetic activity is reduced by imposing an upper bound on RAC cycles.
Replication of Arithmetic Units (RAU): the RAC units are replicated, trading off power against performance.

Energy Efficiency Comparison Among DCT/IDCT

Chip                      Sw-Cap/sample
Matsui et al.             375 pF
Bhattacharya et al.       479 pF
Kuroda et al.             417 pF
T. Xanthopoulos et al.    128 pF

Ref. T. Xanthopoulos et al., “A Low-Power DCT Core Using Adaptive Bitwidth and Arithmetic Activity Exploiting Signal Correlations and Quantization”, IEEE JSSC 2000

Architecture of DCT Core
Ref. T. Xanthopoulos et al., “A Low-Power DCT Core Using Adaptive Bitwidth and Arithmetic Activity Exploiting Signal Correlations and Quantization”, IEEE JSSC 2000

On-Going Work
Implementations at the RTL, netlist and physical levels.
Power estimation at each of these levels.
Techniques for statistically tracking content variation.
Full prototyping based on actual video workloads using a logic emulator from IKOS Systems.
Extensions to other parameterized multimedia computations (e.g. 3D graphics, natural and synthetic audio).

Conclusions
Content variation and dynamic parameterization can be used to achieve power-aware video coding.
The proposed Motion Estimation and DCT architectures are to be implemented to achieve this.
