Presentation is loading. Please wait.

Presentation is loading. Please wait.

VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder

Similar presentations


Presentation on theme: "VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder"— Presentation transcript:

1 VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder
“Human beings are great programmers, Computers are poor actors” VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder Serene Banerjee Hamid R. Sheikh Lizy K. John Brian L. Evans Alan C. Bovik Department of Electrical and Computer Engineering The University of Texas at Austin VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder November 1st, 2000

2 Baseline H.263 Video Encoding
I: Intra frame: Discrete Cosine Transform (DCT) is used to reduce spatial redundancy within a frame. P: Predicted frame: Motion compensated prediction (MCP) used to reduce temporal redundancy. DCT is used to reduce spatial redundancy in the prediction error. I P Frame VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

3 Baseline H.263 Encoder 2-D DCT Coding control Video input Q Q-1 IDCT +
Quantizer index for transform coefficient VLC: Variable Length Coding 2-D DCT Coding control Video input Q Q-1 IDCT + - ME Control info Motion vectors VLC MCP ME: Motion Estimation VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

4 H.263 Encoder Goals: baseline H.263 encoder only
Evaluate performance of compiled C code on Very Long Instruction Word (VLIW) Digital Signal Processors (DSPs) and superscalar processors Hand optimize H.263 video encoder on VLIW DSP University of British Columbia (UBC) H.263 Version 2 (H.263+) video codec By Prof. Faouzi Kossentini’s group: 23000 lines (720 kbytes) of C code targeted for PCs Baseline H.263 and many optional H.263+ modes Primarily for research purposes VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

5 TMS320C6701 Processor Up to 8 32-bit instructions are executed in one instruction cycle in an in-order way 2 32-bit data paths, with bit registers and bit data memory banks Program Fetch Control Registers Instruction Dispatch Instruction Decode Control Logic A Register File B Register File Test/ Emulation Interrupts control L1 S1 M1 D1 L2 S2 M2 D2 TMS320C6701 CPU Core VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

6 TMS320C6701 EVM TMS320C6701 processor External memory
stages of pipeline, depending on instruction External memory 256 kB of 133 MHz synchronous burst static random-access memory (SBSRAM) 8 MB of 100 MHz synchronous dynamic RAM (SDRAM) in two 16-bit RAM banks 100 MHz clock speed due to SDRAM Development environment Code Composer: Interactive real-time debugging Simulator: Does not report pipeline stalls VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

7 SimpleScalar Simulator
Superscalar processor reorders sequential instructions based on data dependencies for parallel (out-of-order) execution SimpleScalar is configurable superscalar simulator: Fetch Dispatch Scheduler Execute Writeback Memory Memory TLB: Translation lookahead buffer Instruction cache Virtual memory Data cache Data-TLB Commit Six pipeline stages for out-of-order simulation VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

8 Comparison of Processors
VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

9 Encoder Profile for VLIW DSP (with level two C optimization only)
1476 Mcycles/frame for 128 x 96 resolution with full-search motion estimation SAD VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

10 Encoder Profile for SuperScalar (1-way with level two C optimization)
196 Mcycles/frame for 128 x 96 resolution with full-search motion estimation VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

11 H.263 Encoder Comparison (with level 2 C optimization only)
Frame resolution: 128 x 96 (Sub-QCIF) Full search motion estimation Clock speed: 100 MHz VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

12 VLIW DSP Memory Optimizations
Internal program memory holds Computationally intensive routines Commonly used runtime support functions from TI libraries (memcpy, memcmp and memset) Internal data memory holds Macroblocks and search area for motion estimation Macroblocks for DCT, quantization, coding, reconstruction Local data for computationally intensive routines Stack Speedup: 29 times over level two optimization VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

13 VLIW DSP Code Optimizations
Compiler intrinsics gave little improvement Wrote assembly routines Parallel assembly: SAD, Clip_MB (clips overflowing values) Linear assembly: Interpolate, FillMBData (pack copy of pixel data into macroblock structures) Rewriting the C code Unroll loops and pipeline computations Use 32-bit packed data I/O to slower external RAM Avoid pipeline stalls due to memory bank conflicts Speedup: 4 times over level two C optimization VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

14 VLIW DSP Optimizations (assembly routines only)
VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

15 VLIW DSP Encoder Profile (after all C6701 optimizations)
24 Mcycles/frame for 128 x 96 resolution with full-search motion estimation SAD VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

16 Superscalar Encoder Profile (256-way SimpleScalar processor)
28 Mcycles/frame for 128 x 96 resolution with full-search motion estimation VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

17 Subroutine Comparisons
VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

18 H.263 Encoder Comparison Frame resolution: 128 x 96 (Sub-QCIF)
Full search motion estimation Clock speed: 100 MHz VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder

19 Conclusions With level 2 optimization only
One-way superscalar is 7.5x faster than VLIW DSP Four-way to one-way issue speedup is 2.88x 256-way to four-way speedup is 2.4x Variable length coding much faster on superscalar VLIW DSP hand optimization produces 61x speedup vs. level two C optimization Placement of often-used data and code on-chip Hand coded SAD, interpolation, and reconstruction 14% faster than 256-way superscalar version VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Video Encoder


Download ppt "VLIW DSP vs. SuperScalar Implementation of a Baseline H.263 Encoder"

Similar presentations


Ads by Google