VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder


VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder
“Human beings are great programmers, Computers are poor actors”
Serene Banerjee, Hamid R. Sheikh, Lizy K. John, Brian L. Evans, Alan C. Bovik
Department of Electrical and Computer Engineering, The University of Texas at Austin
VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder, November 1st, 2000, serene@ece.utexas.edu

Baseline H.263 Video Encoding
- I (intra frame): the Discrete Cosine Transform (DCT) is used to reduce spatial redundancy within a frame.
- P (predicted frame): motion-compensated prediction (MCP) is used to reduce temporal redundancy. The DCT is used to reduce spatial redundancy in the prediction error.
[Figure: frame sequence of I and P frames]

Baseline H.263 Encoder
[Figure: encoder block diagram. Blocks: coding control, 2-D DCT, quantizer (Q), inverse quantizer (Q^-1), IDCT, ME, MCP, VLC. Signals: video input, control info, quantizer index for transform coefficient, motion vectors.]
VLC: Variable Length Coding
ME: Motion Estimation

H.263 Encoder
Goals (baseline H.263 encoder only):
- Evaluate performance of compiled C code on Very Long Instruction Word (VLIW) Digital Signal Processors (DSPs) and superscalar processors
- Hand optimize the H.263 video encoder on a VLIW DSP
University of British Columbia (UBC) H.263 Version 2 (H.263+) video codec:
- By Prof. Faouzi Kossentini’s group: http://spmg.ece.ubc.ca
- 23000 lines (720 kbytes) of C code targeted for PCs
- Baseline H.263 and many optional H.263+ modes
- Primarily for research purposes

TMS320C6701 Processor
- Up to eight 32-bit instructions are executed in one instruction cycle, in order
- Two 32-bit data paths, with 16 32-bit registers and 16 16-bit data memory banks
[Figure: TMS320C6701 CPU core. Program fetch, instruction dispatch, instruction decode; control registers, control logic, test/emulation, interrupts; A register file with units L1, S1, M1, D1; B register file with units L2, S2, M2, D2.]

TMS320C6701 EVM
TMS320C6701 processor:
- 11 to 17 pipeline stages, depending on instruction
External memory:
- 256 kB of 133 MHz synchronous burst static random-access memory (SBSRAM)
- 8 MB of 100 MHz synchronous dynamic RAM (SDRAM) in two 16-bit RAM banks
- 100 MHz clock speed due to SDRAM
Development environment:
- Code Composer: interactive real-time debugging
- Simulator: does not report pipeline stalls

SimpleScalar Simulator
- A superscalar processor reorders sequential instructions based on data dependencies for parallel (out-of-order) execution
- SimpleScalar is a configurable superscalar simulator: http://www.simplescalar.org
- Six pipeline stages for out-of-order simulation
[Figure: simulated pipeline. Stages: Fetch, Dispatch, Scheduler, Execute, Writeback, Commit. Memory components: instruction cache, data cache, data TLB, virtual memory.]
TLB: Translation lookaside buffer

Comparison of Processors

Encoder Profile for VLIW DSP (with level two C optimization only)
1476 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
[Figure: per-routine profile chart; labeled routine: SAD]
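The profile singles out SAD (sum of absolute differences), the matching cost evaluated by full-search motion estimation. A minimal portable-C sketch of that computation follows. It is not the UBC codec's code: the function names, the `MotionVector` type, and the `range` parameter are hypothetical, and the search window size is an assumption since the slides do not state it.

```c
#include <limits.h>
#include <stdlib.h>

#define MB_SIZE 16  /* an H.263 macroblock covers 16 x 16 luminance pixels */

/* Sum of absolute differences between a macroblock of the current frame
   and a candidate block of the reference frame. Both pointers address the
   top-left pixel of their block; stride is the frame width in pixels. */
static unsigned sad16x16(const unsigned char *cur,
                         const unsigned char *ref, int stride)
{
    unsigned sum = 0;
    for (int y = 0; y < MB_SIZE; y++) {
        for (int x = 0; x < MB_SIZE; x++)
            sum += (unsigned)abs(cur[x] - ref[x]);
        cur += stride;
        ref += stride;
    }
    return sum;
}

/* Hypothetical result type: best displacement and its SAD. */
typedef struct { int dx, dy; unsigned sad; } MotionVector;

/* Full search: evaluate every integer displacement within +/-range that
   keeps the candidate block inside the reference frame, and keep the
   displacement with the smallest SAD. */
static MotionVector full_search(const unsigned char *cur_frame,
                                const unsigned char *ref_frame,
                                int width, int height,
                                int mb_x, int mb_y, int range)
{
    MotionVector best = {0, 0, UINT_MAX};
    const unsigned char *cur = cur_frame + mb_y * width + mb_x;

    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int rx = mb_x + dx, ry = mb_y + dy;
            if (rx < 0 || ry < 0 ||
                rx + MB_SIZE > width || ry + MB_SIZE > height)
                continue;  /* candidate falls outside the frame */
            unsigned s = sad16x16(cur, ref_frame + ry * width + rx, width);
            if (s < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = s;
            }
        }
    }
    return best;
}
```

Each candidate costs 256 absolute differences, and the number of candidates grows with the square of the search range, which is consistent with SAD standing out in the encoder profile.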

Encoder Profile for SuperScalar (1-way with level two C optimization)
196 Mcycles/frame for 128 x 96 resolution with full-search motion estimation

H.263 Encoder Comparison (with level 2 C optimization only)
- Frame resolution: 128 x 96 (Sub-QCIF)
- Full-search motion estimation
- Clock speed: 100 MHz

VLIW DSP Memory Optimizations
Internal program memory holds:
- Computationally intensive routines
- Commonly used runtime support functions from TI libraries (memcpy, memcmp and memset)
Internal data memory holds:
- Macroblocks and search area for motion estimation
- Macroblocks for DCT, quantization, coding, reconstruction
- Local data for computationally intensive routines
- Stack
Speedup: 29 times over level two optimization

VLIW DSP Code Optimizations
Compiler intrinsics gave little improvement
Wrote assembly routines:
- Parallel assembly: SAD, Clip_MB (clips overflowing values)
- Linear assembly: Interpolate, FillMBData (pack copy of pixel data into macroblock structures)
Rewriting the C code:
- Unroll loops and pipeline computations
- Use 32-bit packed data I/O to slower external RAM
- Avoid pipeline stalls due to memory bank conflicts
Speedup: 4 times over level two C optimization
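Two of the C-level rewrites listed above, loop unrolling and 32-bit packed data access, can be illustrated on one 16-pixel SAD row. This is a portable-C sketch only, not the authors' C6701 assembly or their rewritten C: the names `abs_diff_u8` and `sad_row16_packed` are hypothetical, and `memcpy` stands in for a single aligned 32-bit load.

```c
#include <stdint.h>
#include <string.h>

/* Absolute difference of two 8-bit pixel values held in unsigned ints. */
static unsigned abs_diff_u8(unsigned a, unsigned b)
{
    return a > b ? a - b : b - a;
}

/* SAD over one 16-pixel row. Each iteration fetches four packed pixels
   per operand as one 32-bit word, so the row needs 4 word loads per
   operand instead of 16 byte loads. The inner work is unrolled by four,
   one term per byte lane. Comparing the same lane of both words makes
   the result independent of byte order. */
static unsigned sad_row16_packed(const uint8_t *cur, const uint8_t *ref)
{
    unsigned sum = 0;
    for (int i = 0; i < 16; i += 4) {
        uint32_t c, r;
        memcpy(&c, cur + i, sizeof c);  /* one 32-bit load, alignment-safe */
        memcpy(&r, ref + i, sizeof r);

        sum += abs_diff_u8(c & 0xFF, r & 0xFF);
        sum += abs_diff_u8((c >> 8) & 0xFF, (r >> 8) & 0xFF);
        sum += abs_diff_u8((c >> 16) & 0xFF, (r >> 16) & 0xFF);
        sum += abs_diff_u8(c >> 24, r >> 24);
    }
    return sum;
}
```

The four independent terms per iteration give a VLIW compiler work it can schedule in parallel across functional units, and fewer, wider loads reduce traffic to slower external RAM.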

VLIW DSP Optimizations (assembly routines only)

VLIW DSP Encoder Profile (after all C6701 optimizations)
24 Mcycles/frame for 128 x 96 resolution with full-search motion estimation
[Figure: per-routine profile chart; labeled routine: SAD]

Superscalar Encoder Profile (256-way SimpleScalar processor)
28 Mcycles/frame for 128 x 96 resolution with full-search motion estimation

Subroutine Comparisons

H.263 Encoder Comparison
- Frame resolution: 128 x 96 (Sub-QCIF)
- Full-search motion estimation
- Clock speed: 100 MHz

Conclusions
With level 2 optimization only:
- One-way superscalar is 7.5x faster than VLIW DSP
- Four-way to one-way issue speedup is 2.88x
- 256-way to four-way speedup is 2.4x
- Variable length coding is much faster on superscalar
VLIW DSP hand optimization produces 61x speedup vs. level two C optimization:
- Placement of often-used data and code on-chip
- Hand-coded SAD, interpolation, and reconstruction
- 14% faster than 256-way superscalar version
http://www.ece.utexas.edu/~sheikh/h263
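The headline ratios follow from the cycle counts reported on the profile slides (1476, 196, 28 and 24 Mcycles/frame). The sketch below reproduces that arithmetic; the helper names are hypothetical, and the frame-rate helper is a derived illustration at the stated 100 MHz clock, not a figure given in the slides.

```c
/* Speedup of an optimized configuration over a baseline, both given in
   Mcycles per frame. */
static double speedup(double baseline_mcycles, double optimized_mcycles)
{
    return baseline_mcycles / optimized_mcycles;
}

/* Fraction of cycles saved by the faster configuration relative to the
   slower one, e.g. 0.14 means 14% fewer cycles per frame. */
static double fraction_fewer_cycles(double faster_mcycles,
                                    double slower_mcycles)
{
    return (slower_mcycles - faster_mcycles) / slower_mcycles;
}

/* Frames per second at a given clock rate (MHz) and cost (Mcycles/frame). */
static double frames_per_second(double clock_mhz, double mcycles_per_frame)
{
    return clock_mhz / mcycles_per_frame;
}
```

With the slide values, `speedup(1476, 196)` is about 7.5 (one-way superscalar vs. compiled VLIW code), `speedup(1476, 24)` is 61.5 (hand optimization on the VLIW DSP), and `fraction_fewer_cycles(24, 28)` is about 0.14 (optimized VLIW DSP vs. 256-way superscalar). Under the 100 MHz assumption, `frames_per_second(100, 24)` comes to roughly 4.2 frames per second for the optimized encoder.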