RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696


1 RICE UNIVERSITY Implementing the Viterbi algorithm on programmable processors Sridhar Rajagopal Elec 696 sridhar@rice.edu

2 RICE UNIVERSITY Motivation
- Viterbi decoding is one of the major bottlenecks in baseband (PHY) processing
- Different protocols use different algorithm parameters, so the decoder needs to be flexible (read: "programmable")
- No architecture developed yet meets the real-time requirements of 3G systems
  - 2 - 8 Mbps range for wideband CDMA
  - 100 Mbps range for wireless LAN

3 RICE UNIVERSITY Today
- Background
- Advanced DSP architectures: TI C6x [15]
- Viterbi algorithm basics [10]
- Viterbi on TI DSPs [10]
- A programmable processor specifically designed for Viterbi [15]

4 RICE UNIVERSITY TI C6x architecture
- VLIW (Very Long Instruction Word) architecture
- Similar to a vector processor, but
  - multiple instructions -> multiple functional units (FUs)
  - the FUs are not all the same
- 32-bit architecture
- 8 functional units
Figure: 4-wide VLIW, with Inst 1 to Inst 4 issued to FU 1 to FU 4.

5 RICE UNIVERSITY

6 8 VelociTI principles
- Parallel fetch, decode and execute
- Pipelined enough to make ADD the critical path
- Instructions based on RISC
- Load-store architecture
- Orthogonal instruction set and register file
- Determinism
- Conditional instructions
- Instruction packing

7 RICE UNIVERSITY 2 * 4 = 8 Functional Units
- .M Multiplication unit
  - 16 bit x 16 bit signed/# packed/#
- .L Arithmetic logic unit
  - Comparisons and logic operations
  - Saturation arithmetic and absolute value
- .S Shifter unit
  - Bit manipulation (set, get, shift, rotate)
  - Branching, addition and packed addition
- .D Data unit
  - Load/store to memory
  - Addition and pointer arithmetic

8 RICE UNIVERSITY How powerful am I?
- 8 instructions per cycle
- Max:
  - 6 adds per cycle
  - 2 multiplies per cycle
  - 2 load/stores per cycle
  - 2 branches per cycle
- The idea is to use instructions in these ratios to get full FU utilization.
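
The per-cycle maxima above give a quick lower bound on how many cycles a loop body needs. The sketch below is only an illustration: the function name `min_cycles` and the instruction-mix arguments are hypothetical, and the bound ignores data dependencies, cross-path limits, and the fact that adds on the .D units compete with loads and stores, so a real schedule may need more cycles.

```python
import math

# Per-cycle issue limits as listed on the slide.
LIMITS = {"total": 8, "add": 6, "mul": 2, "mem": 2, "branch": 2}

def min_cycles(adds, muls, mems, branches):
    """Resource-only lower bound on cycles for one loop iteration.

    Hypothetical helper: it only checks the issue limits and ignores
    dependencies and unit sharing between instruction classes.
    """
    total = adds + muls + mems + branches
    return max(
        math.ceil(total / LIMITS["total"]),
        math.ceil(adds / LIMITS["add"]),
        math.ceil(muls / LIMITS["mul"]),
        math.ceil(mems / LIMITS["mem"]),
        math.ceil(branches / LIMITS["branch"]),
    )

# A memory-heavy mix is bound by the 2 load/stores per cycle, not by the 8 slots.
print(min_cycles(adds=8, muls=0, mems=8, branches=0))
```

A mix that is far from these ratios (for example, many loads and few adds) leaves functional units idle, which is the point the slide makes about utilization.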

9 RICE UNIVERSITY C6x DSP Core

10 RICE UNIVERSITY C6x Datapath

11 RICE UNIVERSITY C6x Resource Constraints
- Instructions using the same FU
  - 1 inst. / FU
- Cross paths
  - only 1 operand from the other reg. file to (L, S, M)
- Loads and stores
  - 2 loads and stores from 2 different reg. files
- Reads and writes
  - max 4 reads from the same register
  - no 2 writes to the same register :)

12 RICE UNIVERSITY Instruction Packing
- Fetch Packet
- Execute Packet
- Avoid NOPs in the instruction code
- Multi-cycle NOPs if absolutely necessary
- LSB "p" bit of each instruction is used for packing
Example: with p bits 1 1 0 1 for A B C D and 0 0 1 0 for E F G H, the execute packets are A || B || C, D || E, F, G || H (8 instructions instead of 32).
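
The p-bit example on this slide can be reproduced with a few lines of code. This is a minimal sketch, assuming the usual C6x convention that a p bit of 1 means the next instruction executes in parallel with the current one; the function name `execute_packets` is hypothetical.

```python
def execute_packets(instructions, p_bits):
    """Split one fetch packet into execute packets using the p bits.

    Assumption: p == 1 chains the next instruction into the same
    execute packet, p == 0 closes the current execute packet.
    """
    packets, current = [], []
    for inst, p in zip(instructions, p_bits):
        current.append(inst)
        if p == 0:
            packets.append(current)
            current = []
    if current:  # a trailing p == 1 would leave an open packet
        packets.append(current)
    return packets

fetch_packet = list("ABCDEFGH")
p_bits = [1, 1, 0, 1, 0, 0, 1, 0]
packets = execute_packets(fetch_packet, p_bits)
print(", ".join(" || ".join(p) for p in packets))
# prints: A || B || C, D || E, F, G || H
```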

13 RICE UNIVERSITY Conditional Instructions
- All instructions can be conditioned on the value in registers A1, A2, B0, B1, B2
- Avoids branch latencies
- If the condition is not met by the end of the first phase of execution, results are not written back to the reg. file
- Conditional loads/stores are squashed before the data phase

14 RICE UNIVERSITY C6x Pipeline
- Fetch (if necessary) - 4 phases
  - Address Generate
  - Address Send
  - Access Ready Wait
  - Fetch Packet Receive
- Decode - 2 phases
  - Instruction dispatch (if necessary)
  - Instruction decode
- Execute - 10 phases
  - Most instructions need only 1 phase

15 RICE UNIVERSITY Some interesting instructions
- Saturation
- Bit-counting (image coding)
- Integer comparison
- Bit manipulation
- Seed generation for reciprocal instructions

16 RICE UNIVERSITY Other details
- 64 KB internal program and data
- DMA - peripherals to memory
- Intrinsics in code for better programming
  - similar to using "VIS" in UltraSPARC
- Software pipelining of loops
- PERFORMANCE:
  - 5-10X
  - higher clock, higher pipeline (2-4X)
  - Additional ALUs

17 RICE UNIVERSITY Additional features in C64x
- SIMD support
- Communication-specific instructions
  - interleaving, Galois field multiply
- Bit count and rotate hardware
- 64 32-bit registers
- Lower resource constraints
- No more NOPs needed ever [no boundaries]

18 RICE UNIVERSITY C64x DSP Core

19 RICE UNIVERSITY Today
- Background
- Advanced DSP architectures: TI C6x [15]
- Viterbi algorithm basics [10]
- Viterbi on TI DSPs [10]
- A programmable processor specifically designed for Viterbi [15]

20 RICE UNIVERSITY Viterbi Decoding
Figure: k bits -> Encoder -> n bits (n > k) -> Decoder -> k bits. Example shown: rate k/n = 1/2 convolutional encoder.

21 RICE UNIVERSITY Error Protection
- States = 2^(FFs) = 2^(Constraint Length - 1)
- Cannot go from any state to any state
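
Both points on this slide can be seen in a small encoder model. The sketch below is illustrative only: it assumes a rate-1/2 code with constraint length K = 3 and generator polynomials 7 and 5 (octal), which are common textbook values and are not stated in the slides. The names `conv_encode` and `next_states` are hypothetical.

```python
K = 3                      # assumed constraint length
GENERATORS = (0b111, 0b101)  # assumed generators (7, 5 octal), rate 1/2

def conv_encode(bits):
    """Encode a bit list; the state holds the K-1 most recent input bits."""
    state, out = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state          # newest bit in the MSB
        for g in GENERATORS:
            out.append(bin(reg & g).count("1") & 1)  # parity of tapped bits
        state = reg >> 1                      # drop the oldest bit
    return out

def next_states(state):
    """States reachable in one step: only two, one per input bit."""
    return [((b << (K - 1)) | state) >> 1 for b in (0, 1)]

print(1 << (K - 1))                # number of states: 4
print(conv_encode([1, 0, 1, 1]))   # [1, 1, 1, 0, 0, 0, 0, 1]
print(next_states(0))              # [0, 2]
```

With 2 flip-flops there are 4 states, and each state has exactly two successors, which is why the trellis does not connect every state to every other state.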

22 RICE UNIVERSITY Trellis for decoding

23 RICE UNIVERSITY Trellis for an input sequence

24 RICE UNIVERSITY Error detection
- Branch metric = "distance" between the received symbol pair and the possible symbol pairs
- Path metric = accumulated error metric
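
As a small illustration of these two definitions, the sketch below uses Hamming distance as the "distance" for hard-decision symbol pairs. The received values and the helper names `hamming_distance` and `path_metric` are made up for the example.

```python
from itertools import product

def hamming_distance(received, expected):
    """Number of bit positions in which two symbol tuples differ."""
    return sum(r ^ e for r, e in zip(received, expected))

# Branch metrics: one received pair against all four possible pairs.
received_pair = (1, 0)
branch_metrics = {
    pair: hamming_distance(received_pair, pair)
    for pair in product((0, 1), repeat=2)
}
print(branch_metrics)  # {(0, 0): 1, (0, 1): 2, (1, 0): 0, (1, 1): 1}

def path_metric(received_pairs, expected_pairs):
    """Path metric: branch metrics accumulated along one trellis path."""
    return sum(hamming_distance(r, e)
               for r, e in zip(received_pairs, expected_pairs))

print(path_metric([(1, 1), (1, 0)], [(1, 1), (1, 1)]))  # 1
```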

25 RICE UNIVERSITY Error-correction

26 RICE UNIVERSITY Stages in Viterbi Decoding
- Calculate branch metrics for all states at every stage
- Update path metrics for all states at every stage
- At the end, trace back through the trellis to get the decoded bits

27 RICE UNIVERSITY Computations
- Branch metrics:
  - Hamming distance: XOR and count 1's
  - Euclidean distance: squared distance
- Path metrics:
  - Add branch metrics to existing path metrics
  - Compare for minimum and select minimum
- Survivor traceback:
  - Linked list / pointer chasing
  - Memory intensive / sequential operations
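
The three computations above fit together as in the following hard-decision decoder. This is a minimal sketch under stated assumptions, not the implementation discussed in the talk: it assumes a rate-1/2, K = 3 code with generators 7 and 5 (octal), an encoder that starts in state 0, and Hamming-distance branch metrics. All function names are hypothetical.

```python
K = 3                       # assumed constraint length
GENS = (0b111, 0b101)       # assumed generators (7, 5 octal)
N_STATES = 1 << (K - 1)
INF = float("inf")

def step(state, bit):
    """One encoder transition: returns (next_state, output symbols)."""
    reg = (bit << (K - 1)) | state
    return reg >> 1, tuple(bin(reg & g).count("1") & 1 for g in GENS)

def encode(bits):
    state, out = 0, []
    for b in bits:
        state, sym = step(state, b)
        out.extend(sym)
    return out

def viterbi_decode(received):
    metrics = [0] + [INF] * (N_STATES - 1)   # start in state 0
    survivors = []                           # per stage: (prev_state, bit) per state
    for i in range(0, len(received), len(GENS)):
        rx = received[i:i + len(GENS)]
        new_metrics = [INF] * N_STATES
        back = [None] * N_STATES
        for s in range(N_STATES):
            if metrics[s] == INF:
                continue                     # state not reachable yet
            for bit in (0, 1):
                nxt, sym = step(s, bit)
                bm = sum(r ^ e for r, e in zip(rx, sym))  # branch metric: XOR, count 1's
                cand = metrics[s] + bm                    # add
                if cand < new_metrics[nxt]:               # compare
                    new_metrics[nxt] = cand               # select
                    back[nxt] = (s, bit)
        metrics = new_metrics
        survivors.append(back)
    # Traceback: follow the stored pointers from the best final state.
    state = min(range(N_STATES), key=lambda s: metrics[s])
    bits = []
    for back in reversed(survivors):
        state, bit = back[state]
        bits.append(bit)
    return bits[::-1]

message = [1, 0, 1, 1, 0, 0]     # last two zeros flush the encoder
coded = encode(message)
coded[3] ^= 1                    # inject a single channel error
print(viterbi_decode(coded) == message)
```

The inner loop is the add-compare-select work that maps well to ALUs, while the traceback loop is the sequential pointer chasing the slide calls memory intensive.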

28 RICE UNIVERSITY Today
- Background
- Advanced DSP architectures: TI C6x [15]
- Viterbi algorithm basics [10]
- Viterbi on TI DSPs [10]
- A programmable processor specifically designed for Viterbi [15]

29 RICE UNIVERSITY Viterbi support in different processors
- C54x
  - Special hardware accelerator
  - ACS unit with 2 ACC and split ALU
  - Viterbi butterfly (2 ACS) in 4 cycles
- C62x
  - nothing special
- C6416
  - Viterbi coprocessor
  - K = 5-9, Rate = 1/2, 1/3, 1/4
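
To make "Viterbi butterfly (2 ACS)" concrete, the sketch below shows one butterfly in plain code: two old states feed two new states, and each new state needs one add-compare-select. It is a generic illustration with hypothetical names and argument order, and it does not model the C54x instructions or their cycle counts.

```python
def acs(cand0, cand1):
    """Compare two candidate path metrics; return (survivor, decision bit)."""
    return (cand0, 0) if cand0 <= cand1 else (cand1, 1)

def butterfly(pm_a, pm_b, bm_a_lo, bm_b_lo, bm_a_hi, bm_b_hi):
    """One butterfly: old states a and b both feed new states lo and hi.

    pm_a, pm_b : path metrics of the two old states
    bm_x_y     : branch metric of the transition from old state x to new state y
    Returns ((lo_metric, lo_decision), (hi_metric, hi_decision)).
    """
    lo = acs(pm_a + bm_a_lo, pm_b + bm_b_lo)   # first ACS
    hi = acs(pm_a + bm_a_hi, pm_b + bm_b_hi)   # second ACS
    return lo, hi

print(butterfly(0, 3, 2, 0, 0, 2))  # ((2, 0), (0, 0))
```

The decision bits are what a traceback stage later reads back, so a decoder stores one bit per state per stage instead of full paths.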

30 RICE UNIVERSITY Viterbi Coprocessor in C6416

31 RICE UNIVERSITY Viterbi Coprocessor in C6416
- SM, SD and HD memory not accessible to the DSP

32 RICE UNIVERSITY Today
- Background
- Advanced DSP architectures: TI C6x [15]
- Viterbi algorithm basics [10]
- Viterbi on TI DSPs [10]
- A programmable processor specifically designed for Viterbi [15]

33 RICE UNIVERSITY Need for VSP architecture
- Large amount of memory access
- Traceback decoding
- Not efficient on a GPP
- The number of program instructions on a GPP is of a higher order than the complexity of the algorithm

34 RICE UNIVERSITY VSP architecture

35 RICE UNIVERSITY Branch Metric Calculation

36 RICE UNIVERSITY Path Metric Calculation

37 RICE UNIVERSITY Traceback Unit

38 RICE UNIVERSITY Traceback with survivor updates
Figure: timeline with the labels "Start filling the trellis", "Start traceback" (after 5 * constraint length), "Symbol decoded", and "Update survivor path for most recent symbol".

39 RICE UNIVERSITY Survivor Path Updates

40 RICE UNIVERSITY Circular updates

41 RICE UNIVERSITY Software Programming
- Small but specialized instruction set
  - LOAD, ACS
- Shorter execution time
- All 3 subprocessors programmed independently
- 10 ns cycle (100 MHz) in 1990 to get 1.5 Mbps

42 RICE UNIVERSITY Conclusions
- The Viterbi algorithm is important to implement in a programmable communication receiver.
- Approaches so far have been co-processor support for DSPs or specialized processors.
- We are yet to design programmable processors that meet real-time requirements for 100 Mbps applications.

