DSP C5000 Chapter 5 Assembly Language

Assembly Language
There are two main types of assembly language syntax: mnemonic and algebraic. Both the C54x and the C55x can use either type.
The mnemonic syntax (displayed in green) can be seen as old-fashioned assembly language. It is the one used in all the following examples. The algebraic syntax (displayed in red) tends to be more natural and intuitive, but is not normally used in industry.
Outline: C54, C55, running C54 code on the C55.

C54x Assembly Language
The C54x instruction set can be divided into four basic types of operations:
  Arithmetic
  Logic
  Load, Store & Move
  Program control
The C54x has a fixed-length instruction word: an instruction must be encoded in one 16-bit word in order to be executed in one cycle.

Instructions and Operands
General syntax of an instruction: Instr Op1,[Op2,[Op3,[…]]]
An instruction is made of an op code (Instr), which specifies the operation to be executed, and one or more operands on which the operation is carried out. For the Instr field, refer to the TI documentation or to the following slides. The syntax of the Op1,[Op2,[Op3,[…]]] field is given in the instruction documentation and specifies which addressing modes can be used for each operand.

Operand Syntax

Arithmetic Instructions
General-purpose arithmetic:
  Addition/subtraction
  Multiply (and accumulate)
  Square
  Divide
Application-specific arithmetic:
  Miscellaneous
  Polynomial evaluation
  Distance computation
  Specific filters
  Butterfly computation (Viterbi)

General Purpose Arithmetic
Addition/subtraction
field4 = field3 ± field1 × 2^field2
The result is stored in field4 if present, otherwise in field3. The shift is applied according to field2 if present. The shift field is detailed on the next slide.
The ADD/SUB instructions add a 16-bit value to, or subtract it from, the contents of the selected accumulator or a 16-bit Xmem operand in dual data-memory addressing mode. The 16-bit value to be added or subtracted is one of the following:
  The content of a single data-memory operand
  The content of a dual data-memory operand
  A 16-bit long-immediate operand
  The shifted value in the source accumulator
If a destination is specified, ADD/SUB stores the result in the destination accumulator. If no destination is specified, ADD/SUB stores the result in the source accumulator.
The shift, when specified, applies to the first operand before the arithmetic operation:
  For a left shift, low-order bits are cleared; high-order bits are sign-extended if SXM = 1 and cleared if SXM = 0.
  For a right shift, high-order bits are sign-extended if SXM = 1 and cleared if SXM = 0.
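The operand extension and shift rules for ADD can be sketched as a small behavioural model. This is an illustration in Python, not TI code: the helper names `shift_operand` and `add_acc` are hypothetical, the accumulator is modelled as a 40-bit two's-complement value, and overflow/saturation flags are ignored.

```python
MASK40 = (1 << 40) - 1  # accumulator width: 8 guard bits + 32 bits

def shift_operand(value16, shift, sxm):
    """Extend a 16-bit operand according to SXM, then shift it.

    A positive shift is a left shift, a negative shift is a right shift.
    """
    value16 &= 0xFFFF
    if sxm and value16 & 0x8000:
        value = value16 - 0x10000   # SXM = 1: sign-extend
    else:
        value = value16             # SXM = 0: zero-extend
    return value << shift if shift >= 0 else value >> -shift

def add_acc(src, value16, shift=0, sxm=1):
    """Model of dst = src + operand * 2**shift, wrapped to 40 bits."""
    return (src + shift_operand(value16, shift, sxm)) & MASK40
```

With SXM = 1 the operand FFFFh is added as -1, whereas with SXM = 0 it is added as 65535.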

Shift Field
Many instructions apply a shift to one operand. This shift is specified in the operand field and can be:
  Immediate, if specified by the keywords: -16 ≤ SHIFT ≤ 15, 0 ≤ SHIFT1 ≤ 15
  Register indirect: -16 ≤ ASM ≤ 15 (Accumulator Shift Mode field of ST1), -16 ≤ TS ≤ 31 (TS is the 6 LSBs of the T register)
A shift is used, for example, to remove non-significant integer bits after a multiplication.

Addition/subtraction (special cases)
Arithmetic with unsigned operand: field2 = field2 ± unsigned(field1)
Direct computation on memory, with:
  SXM=1: field2 = field2 + signed(field1), -32768 ≤ lk ≤ 32767
  SXM=0: field2 = field2 + unsigned(field1), 0 ≤ lk ≤ 65535
SXM means Sign Extension Mode. When turned on (SXM=1), the numbers are treated as signed and can take positive or negative values. When turned off (SXM=0), the numbers are treated as unsigned.

Addition/subtraction (extended precision)
32 bits:
  1: field2 = field2 ± field1
  2: field2 = field1 - field2
  If C16=0, field1 and field2 are considered as 32-bit operands and a 32-bit ± is performed. If C16=1, field1 and field2 are considered as pairs of 16-bit operands and a SIMD computation takes place.
64 bits:
  field2 = field2 ± unsigned(field1) ± carry/borrow
SIMD means Single Instruction Multiple Data: the same instruction is carried out on multiple data.

Addition/subtraction (extended precision, continued)
A 64-bit addition/subtraction is performed as follows (look at a code example):
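The principle of a 64-bit addition built from 32-bit halves can be sketched as follows. This is a Python illustration under stated assumptions, not the code example from the slide: the function name `add64` and the (high, low) tuple layout are hypothetical. The low halves are added as unsigned values and the carry they produce is added to the sum of the high halves.

```python
MASK32 = 0xFFFFFFFF

def add64(a_hi, a_lo, b_hi, b_lo):
    """Add two 64-bit numbers given as (high, low) 32-bit halves."""
    lo = a_lo + b_lo                     # low halves are added as unsigned values
    carry = lo >> 32                     # carry out of the low 32 bits
    hi = (a_hi + b_hi + carry) & MASK32  # high halves plus carry
    return hi, lo & MASK32
```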

Multiply
  Syntaxes 1, 3: field3 = field1 × TREG (note 2)
  Syntaxes 2, 4: field3 = field1 × field2
Multiply and Accumulate/Subtract
  Syntax 1: field3 = field3 ± (field1 × TREG) (note 2)
  Syntax 3: field4 = field3 ± (field1 × TREG) (note 1)
  Syntaxes 2, 4: field4 = field3 ± (field1 × field2) (notes 1, 2)
Note 1: the result is stored in field4 if present, otherwise in field3.
Note 2: [R]: the result is rounded to the 16 MSBs of dst and the 16 LSBs are zeroed.

Multiply and Accumulate (with program memory)
Syntaxes 1, 2: field4 = field3 ± (field1 × field2)
Syntax 2: the content of the data memory location pointed to by the field1 operand (Smem) is also copied to the next data memory address. FIR filter computation needs this data delay together with the scalar product.
Syntax 1 (load T): TREG = field1, and the content of the data memory location pointed to by field1 is copied to the next data memory address.
The delay D moves the values of a straight buffer along by one place. It is used as part of the instructions MACD (multiply by program memory and accumulate with delay) and LTD (load T register and insert delay).

Multiply, Accumulate and Delay
To compute a FIR filter, we need to perform a dot product between two vectors H and X; the data in the X array then have to be delayed for the next sampling instant. In all cases x(n-k) are in data memory. The H array can be either in program memory or in data memory.
Case 1: h(k) are in program memory. The instruction MACD is repeated N times. AR1 has been initialized to point to the oldest data in the X array.
        RPT  #N-1
        MACD *AR1-,coef,A
Case 2: h(k) are in data memory. The H array is addressed through AR2, while the oldest data in the X array is addressed by AR1. This block of two instructions is repeated N times.
        RPTB endLoop-1
        LTD  *AR1-
        MAC  *AR2+,A
endLoop:
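The combined multiply-accumulate and delay can be sketched as a behavioural model. This Python illustration is not TI code: the function name `fir_step` and the list layout (x_line[k] holds x(n-k)) are assumptions, and fixed-point scaling is ignored. The loop walks from the oldest sample to the newest, as AR1 does with *AR1-, and each sample is copied one place up after it has been used.

```python
def fir_step(x_line, h, new_sample):
    """One output sample of an N-tap FIR on a straight (non-circular) buffer.

    x_line[k] holds x(n-k) and h[k] holds h(k). Returns y(n).
    """
    x_line[0] = new_sample
    acc = 0
    # Walk from the oldest sample to the newest.
    for k in range(len(h) - 1, -1, -1):
        acc += h[k] * x_line[k]          # multiply-accumulate
        if k + 1 < len(x_line):
            x_line[k + 1] = x_line[k]    # delay: copy to the next address
    return acc
```

Feeding a unit impulse followed by zeros returns the coefficients h(0), h(1), ... one per call, which is a quick way to check the delay line.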

Multiply (with accumulator)
  1: B = field1 × A(31-16) (note 1)
  2: field1 = TREG × A(31-16) (note 1)
Multiply and Accumulate/Subtract (with accumulator)
  1: B = B ± (field1 × A(31-16)) (notes 1, 2)
  2: field3 = field2 ± (TREG × A(31-16)) (notes 1, 2, 3)
This multiplication is often used after a previous multiplication: it reuses the product already in the accumulator.
Note 1: A(31-16) stands for the 16 MSBs of accumulator A; B stands for accumulator B.
Note 2: [R]: the result is rounded to the 16 MSBs of dst and the 16 LSBs are zeroed.
Note 3: the result is stored in field3 if present, otherwise in field2.

Extended precision multiplication
  1: field3 = unsigned(field1) × unsigned(TREG)
  2: field3 = field3 + (unsigned(field1) × signed(field2))
MPYU is equivalent to MPY syntax 1, but with unsigned operands. MACSU is equivalent to MAC syntax 2, but with the field1 operand unsigned. Extended precision multiplication requires unsigned × unsigned and signed × unsigned operations, as shown on the next slide. These instructions are the building blocks of the extended precision multiply capabilities of the DSP.

Extended precision multiplication
Principle (look at a code example): the partial products are computed with MPYU, MACSU, MACSU and MAC.
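The decomposition behind the MPYU, MACSU, MACSU, MAC sequence can be sketched as follows. This Python illustration is not the code example from the slide: the names `split32` and `mul32` are hypothetical, and the mapping of each partial product to an instruction is given in comments as an assumption. Each 32-bit operand is split into a signed high half and an unsigned low half, which is why unsigned × unsigned and signed × unsigned products are needed.

```python
def split32(v):
    """Split a signed 32-bit integer into (signed high, unsigned low) halves."""
    return v >> 16, v & 0xFFFF

def mul32(x, y):
    """32 x 32 -> 64-bit signed product built from four 16 x 16 products."""
    xh, xl = split32(x)
    yh, yl = split32(y)
    p = xl * yl                  # unsigned x unsigned  (MPYU)
    p += (xh * yl) << 16         # signed x unsigned    (MACSU)
    p += (xl * yh) << 16         # unsigned x signed    (MACSU)
    p += (xh * yh) << 32         # signed x signed      (MAC)
    return p
```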

Square
  1: field2 = field1 × field1
  2: field2 = A(31-16) × A(31-16) (note 1)
  3: field2 = field2 ± (field1 × field1)
Note 1: A(31-16) stands for the 16 MSBs of accumulator A.

Divide
Division is implemented by repeated conditional subtraction. Each step is a single-cycle, 1-bit unsigned divide instruction. The dividend (numerator) is in the LSBs of src and the divisor in Smem; after the operation, the quotient is in the LSBs of src and the remainder in the MSBs of src.
  (src) - (Smem) << 15 --> ALU output
  If ALU output ≥ 0
  Then ((ALU output) << 1) + 1 --> src
  Else (src) << 1 --> src
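The conditional-subtract step and its repetition can be sketched as a behavioural model. This Python illustration is not TI code: the names `subc` and `divide_u16` are hypothetical, operands are assumed to be unsigned 16-bit values with a non-zero divisor, and 16 repetitions are assumed for a 16-bit quotient.

```python
def subc(src, smem):
    """One conditional-subtract step on accumulator value src with divisor smem."""
    alu = src - (smem << 15)
    return (alu << 1) + 1 if alu >= 0 else src << 1

def divide_u16(num, den):
    """Unsigned 16-bit division by 16 repeated conditional subtractions."""
    acc = num & 0xFFFF                 # dividend in the low part of src
    for _ in range(16):
        acc = subc(acc, den)
    quotient = acc & 0xFFFF            # low part of src
    remainder = (acc >> 16) & 0xFFFF   # high part of src
    return quotient, remainder
```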

Division Routine (More examples)
  B = num*den     (tells sign)
  ABS A           Strip sign of denominator
  STH
  ABS A           Strip sign of numerator
  RPT             # iterations
                  1-bit divide
  XC 1,BLT        If result needs to be negative
  NEG A           Invert sign
  STL             Store negative result

Miscellaneous Arithmetic
ABS src,[dst] ; computes the absolute value of src and stores it in dst if specified, in src otherwise (dst=|src|).
NEG src,[dst] ; stores the 2s complement of src in dst if specified, in src otherwise (dst=-src).
MAX(dst), MIN(dst) ; stores in dst the greater (resp. the lower) of accumulators A and B (dst=MAX(A,B), dst=MIN(A,B)).
For this kind of arithmetic operation, algebraic instructions are closer to « natural » language.

EXP/NORM: tools for fixed-point to (block) floating-point conversion (note 1)
These instructions store the high part of the accumulator (A or B) in « Mantissa*2^Exponent » form. Floating point (either block or strict sense) is managed in a fixed-point processor through specific instructions which detect the significant bits and perform the shift.
EXP src ; computes the number of shifts necessary to normalize the high part of the accumulator src and stores it in the T register (T=EXP(src)). In the example shown, T=3 after the operation. Because of the guard bits, T can be negative after the operation.
(Figure: accumulator layout with guard bits, high part and low part; S stands for sign bits and X for significant bits.) …/…
Note 1: see ch13 «Numerical Issues» for an in-depth explanation of floating-point format.

NORM src,[dst] ; the content of the accumulator src is shifted according to the value in the T register and stored in dst if specified, in src otherwise (dst=src<<TS).
(Figure: accumulator after the operation, with guard bits, high part and low part.)
        .bss  Mantissa,2,1
Expo    .set  Mantissa+1
        .text
; A = 1234h
format: LD    #Mantissa,DP
        EXP   A
        NORM  A
        ST
        STH
Assume that accumulator A contains 1234h before the routine « format ». After the routine, the memory location Mantissa contains 48D0h and Expo contains 0012h.
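The numbers of this example can be checked with a behavioural model of the exponent detection and normalization. This is a Python sketch, not TI code: the names `to_signed40`, `exp_c54` and `norm` are hypothetical, and the model assumes the exponent is the count of redundant sign bits of the 40-bit accumulator minus 8 (the guard bits), with 0 returned for a zero accumulator.

```python
MASK40 = (1 << 40) - 1

def to_signed40(acc):
    """Interpret a 40-bit pattern as a two's-complement value."""
    acc &= MASK40
    return acc - (1 << 40) if acc & (1 << 39) else acc

def exp_c54(acc):
    """Redundant sign bits of the 40-bit accumulator, minus the 8 guard bits."""
    v = to_signed40(acc)
    if v == 0:
        return 0
    magnitude = v if v >= 0 else ~v
    leading = 40 - magnitude.bit_length()   # sign bit plus redundant sign bits
    return leading - 1 - 8

def norm(acc, t):
    """Shift the accumulator left by t (right if t is negative)."""
    v = to_signed40(acc)
    shifted = v << t if t >= 0 else v >> -t
    return shifted & MASK40
```

For A = 1234h the model gives an exponent of 18 (12h) and a normalized high part of 48D0h, matching the values stated for Expo and Mantissa. A value that occupies the guard bits gives a negative exponent.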

Rounding and saturation are intended for number representations with finite precision and finite dynamic range:
RND src,[dst] ; the high part of the accumulator src is rounded and stored either in dst if specified or in src (dst=rnd(src)).
SAT src ; if src is greater than 32767, src is set to 32767 (007FFFh). If src is lower than -32768, src is set to -32768 (FF8000h) (SATURATE(src)). These values are given for the guard bits and high part of the accumulator.
(Figure: accumulator before rounding and after rounding.)
Rounding is used to decrease the error in the result. Saturation is used to prevent the unexpected sign changes that occur when the limits are exceeded.

Polynomial Evaluation
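Considering the 3rd-order polynomial:
  y = a3·x^3 + a2·x^2 + a1·x + a0
It can be computed as:
  y = ((a3·x + a2)·x + a1)·x + a0
that is, in an iterative way:
  y1 = a3·x + a2, y2 = y1·x + a1, y = y2·x + a0
A polynomial of any order can therefore be computed with a first-order polynomial computation as the basic building block. …/…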

Polynomial Evaluation
Before using the POLY instruction, the T register must be loaded with the proper value for x.
POLY Smem ; the high part of accumulator A is multiplied by the T register, then added to the high part of B and stored in A. The high part of accumulator B is loaded with the content of Smem (the current coefficient) (POLY(Smem)).
coef    .sect "COEF"
        .word 1234h,3456h
        .word 4567h,5678h
        .bss  y,1
        .text
; A(15-0) = 7FFCh (x)
PoEval: STLM A,T
        STM  #coef,AR1
        LD   *AR1+,16,A
        LD   *AR1+,16,B
        RPT  #2
        POLY *AR1+
        STH  A,*(y)
Assume that the value of x is in the low part of accumulator A on entry to the routine. It is transferred into the T register, then auxiliary register AR1 is used to address the polynomial coefficients in memory, starting from the highest order. The most significant coefficient is loaded into the high part of A and the following one into the high part of B. The three iterations needed for a 3rd-order polynomial evaluation can then be performed, as shown on the previous slide. At the end, the result (which is in the high part of A) is stored in memory location y.
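The way the two accumulators cooperate during the repeated POLY can be sketched as follows. This Python illustration is not TI code: the names `poly_step` and `poly_eval` are hypothetical, plain integers or floats replace the 16-bit fixed-point values, and rounding is ignored. The last step loads a value into B that is never used, which is why the model appends a dummy coefficient.

```python
def poly_step(a, b, t, coef):
    """One POLY step: A = A*T + B, then B = next coefficient."""
    return a * t + b, coef

def poly_eval(coefs, x):
    """Evaluate coefs[0]*x**n + ... + coefs[n] with repeated POLY steps."""
    a, b = coefs[0], coefs[1]        # the two initial loads into A and B
    rest = list(coefs[2:]) + [0]     # the last step loads a value that is never used
    for c in rest:
        a, b = poly_step(a, b, x, c)
    return a
```

For a 3rd-order polynomial the loop runs three times, which corresponds to RPT #2 followed by POLY.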

Distance Computation
ABDST Xmem,Ymem ; computes the L1 norm of the distance between 2 vectors stored in memory (ABDST(Xmem,Ymem)):
  D = Σ |X(i) - Y(i)|
At each cycle this instruction computes the difference between two elements of each vector (which is stored in the high part of accumulator A), while the absolute value of the previous difference is accumulated in accumulator B. This instruction must be iterated N+1 times if the vectors have N elements, in order to accumulate the difference between the last two elements of each vector.
        .bss  X,10
        .bss  Y,10
        .bss  D,1
        .text
dist:   STM   #X,AR2
        STM   #Y,AR3
        RPT   #10
        ABDST *AR2+,*AR3+
        STH   B,*(D)
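The one-step lag between computing a difference and accumulating it explains the N+1 iterations. The following Python sketch is a behavioural illustration, not TI code: the function name `abdst_distance` is hypothetical, both accumulators are assumed to be cleared beforehand, and the extra memory read of the last iteration is simply skipped in the model.

```python
def abdst_distance(x, y):
    """L1 distance computed the pipelined way: N + 1 iterations for N elements."""
    a = 0   # high part of accumulator A: latest difference
    b = 0   # accumulator B: running sum of absolute differences
    for i in range(len(x) + 1):
        b += abs(a)                  # accumulate the previous difference
        if i < len(x):
            a = x[i] - y[i]          # compute the current difference
    return b
```

Stopping after N iterations would leave the last difference in A without adding it to B.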

Distance Computation
SQDST Xmem,Ymem ; computes the squared L2 norm of the distance between 2 vectors stored in memory (SQDST(Xmem,Ymem)):
  D = Σ (X(i) - Y(i))^2
At each cycle this instruction computes the difference between two elements of each vector (which is stored in the high part of accumulator A), while the squared value of the previous difference is accumulated in accumulator B. This instruction must be iterated N+1 times if the vectors have N elements, in order to accumulate the difference between the last two elements of each vector.
        .bss  X,10
        .bss  Y,10
        .bss  D,1
        .text
dist:   STM   #X,AR2
        STM   #Y,AR3
        RPT   #10
        SQDST *AR2+,*AR3+
        STH   B,*(D)

Specific Filter Instructions
Symmetric FIR filters (note 1): an even-length symmetric FIR filter can be computed according to:
  y(n) = Σ[k=0..N-1] h(k)·x(n-k)
which requires N multiplications. Because of the symmetry of h(k), the equation can be rewritten as:
  y(n) = Σ[k=0..N/2-1] h(k)·[x(n-k) + x(n-N+1+k)]
which requires only N/2 multiplications. This optimization is handled by the FIRS instruction. …/…
Note 1: see ch14 «FIR filter implementation» for a full treatment of this topic.

FIRS Xmem,Ymem,pmad ; the high part of accumulator A is multiplied by the content of pmad and accumulated in accumulator B. Xmem and Ymem are added together and stored in the high part of accumulator A (FIRS(Xmem,Ymem,pmad)).
At each step, FIRS does the following computation:
  B = B + A(31-16) × (pmad)
  A(31-16) = Xmem + Ymem
where y(n) is in accumulator B and tmp in accumulator A. In repeat mode pmad is loaded once, then incremented at each iteration.
A more detailed description of how to use the FIRS instruction is given in “TMS320C54x DSP Applications Guide”, SPRU173.
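The saving obtained from coefficient symmetry can be checked numerically. The following Python sketch is an illustration, not TI code: the function names are hypothetical, x[k] is assumed to hold x(n-k), and the filter length is assumed even with h(k) = h(N-1-k). The variable tmp plays the role of the sum held in accumulator A.

```python
def fir_direct(h, x):
    """y(n) = sum of h(k) * x(n-k), with x[k] holding x(n-k): N multiplications."""
    return sum(hk * xk for hk, xk in zip(h, x))

def fir_symmetric(h_half, x):
    """Same result using h(k) = h(N-1-k): N/2 multiplications."""
    n = len(x)
    acc = 0
    for k in range(n // 2):
        tmp = x[k] + x[n - 1 - k]    # sum of the two samples sharing h(k)
        acc += h_half[k] * tmp       # one multiplication per pair
    return acc
```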

LMS Algorithm (note 1): LMS adaptive filtering requires updating the coefficients of the filter according to an error signal e(n) while computing the output of the filter y(n). This involves the following computations:
  y(n) = Σ[k=0..N-1] h(k)·x(n-k)
  h(k) ← h(k) + μ·e(n)·x(n-k)
where μ is the adaptation step. At each step there are two computations: one for the filter tap and one for the update of the coefficient. …/…
Note 1: see ch16 «Adaptive Filter Implementation» for a full treatment of this topic.

LMS Xmem,Ymem ; Xmem is accumulated into the high part of accumulator A with rounding, while Xmem and Ymem are multiplied and accumulated into accumulator B (LMS(Xmem,Ymem)).
At each step, LMS does the following computations:
  A = A + (Xmem << 16) + 2^15
  B = B + Xmem × Ymem
where y(n) is in accumulator B and tmp in accumulator A. In addition, other instructions have to store in accumulator A the error times the adaptation step, and store in Xmem the updated coefficient value (ST||MPY). Here ST || MPY is a store instruction with a parallel multiply. It executes faster than separate ST and MPY instructions would.
A more detailed description of how to use the LMS instruction is given in “TMS320C54x DSP Applications Guide”, SPRU173.
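The two computations performed for each tap, the filter output and the coefficient update, can be sketched in floating point. This Python illustration is a textbook LMS step, not the exact fixed-point sequence built around the LMS instruction: the function name `lms_step`, the desired signal d and the step size mu are assumptions introduced for the example.

```python
def lms_step(h, x, d, mu):
    """One LMS iteration; h and x are lists of equal length, x[k] = x(n-k).

    Returns the output y(n), the error e(n) and the updated coefficients.
    """
    y = sum(hk * xk for hk, xk in zip(h, x))            # filter output
    e = d - y                                           # error signal
    h_new = [hk + mu * e * xk for hk, xk in zip(h, x)]  # coefficient update
    return y, e, h_new
```

Calling `lms_step` repeatedly with the returned coefficients moves the output toward the desired signal when mu is small enough.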

Butterfly Computation
These instructions are only useful in dual 16-bit mode (C16=1).
  1: dst(31-16) = Lmem(31-16) + TREG, dst(15-0) = Lmem(15-0) - TREG
  2: dst(31-16) = Lmem(31-16) - TREG, dst(15-0) = Lmem(15-0) + TREG
  3: dst(31-16) = Lmem(31-16) - TREG
(Figure: butterfly between stages N and N+1 with branch weights d and -d.)
Note 1: see ch22 «Viterbi Algorithm» for an in-depth explanation, and CMPS for other Viterbi-related instructions.

Logic Instructions Overview
  Logic: AND, OR, XOR, CMPL, ANDM, ORM, XORM
  Comparison and bit test: CMPM, CMPR, CMPS, BIT, BITF
  Shift and rotate: SFTL, SFTA, SFTC, ROR, ROL, ROLTC

Logic Instructions: logic operations on accumulators
In the following, [op] stands for the logic operator of the instruction.
  1, 2, 3: field4 = field3 [op] field1 × 2^field2. The result is stored in field4 if present, otherwise in field3; the shift is applied according to field2 if present.
  4: field4 = field4 [op] field1 × 2^field2. field4 is used if present, otherwise field1 is used instead; the shift is applied according to field2 if present.
The shift field is described on the Shift Field slide.
Complement: field2 = complement of field1 (bitwise complement). The result is stored in field2 if present, otherwise in field1.

Logic Instructions: logic with memory
field2 = field1 [op] field2
About ANDM, look also at BITF: BITF performs the same operation as ANDM except that it does not store the result, but sets TC if the result of the AND is not zero.

Logic Instructions: comparison
Comparison (memory): equality test. TC=1 if field1==field2, else TC=0.
Comparison (auxiliary register): versatile comparison. ARx is compared against AR0 according to the condition code CC (field1), and TC is set if the comparison succeeds.

Logic Instructions: compare, select, store (and remember)
Intended for the Viterbi algorithm (see Chapter 22 for an in-depth treatment; see DSADT and DADST for other Viterbi-related instructions).
(Figure: two paths with weights x and y arriving at a node of stage N+1 from stage N.)
Two paths arrive at a node of stage N+1 from stage N. Only one is retained, according to its weight x or y. Assume that src(31-16)=x and src(15-0)=y. Then:
  If src(31-16) > src(15-0) then
    Smem = src(31-16)
    TRN = (TRN) << 1
    TRN(0) = 0
    TC = 0
  Else (src(31-16) ≤ src(15-0))
    Smem = src(15-0)
    TRN = (TRN) << 1
    TRN(0) = 1
    TC = 1
TRN is the transition register.
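The compare, select and store decision can be sketched as a behavioural model. This Python illustration is not TI code: the function name `cmps` is hypothetical, the two halves are assumed to be compared as signed 16-bit values, and TRN is assumed to be 16 bits wide. The function returns the value that would be stored in Smem together with the new TRN and TC.

```python
def cmps(src, trn):
    """Model of compare-select-store: returns (stored value, new TRN, TC)."""
    hi = (src >> 16) & 0xFFFF
    lo = src & 0xFFFF
    # Compare the two halves as signed 16-bit path weights.
    s_hi = hi - 0x10000 if hi & 0x8000 else hi
    s_lo = lo - 0x10000 if lo & 0x8000 else lo
    trn = (trn << 1) & 0xFFFF
    if s_hi > s_lo:
        return hi, trn, 0            # keep x, shift a 0 into TRN
    return lo, trn | 1, 1            # keep y, shift a 1 into TRN
```

After 16 such decisions, TRN holds one bit per decision, which is what a traceback needs to recover the surviving path.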

Logic Instructions: bit test and bit field test
Bit test: BIT and BITT set TC according to the value of one bit in the word specified by the operand in field1. The bit number is specified either by BITC (syntax 1) or by T[3..0] (syntax 2). Bit numbering is in reverse order, with 0 corresponding to the MSB and 15 to the LSB.
Bit field test: TC is set according to the result of (field1 AND field2). For this instruction, look also at ANDM. BITF can be seen as a multi-input AND component where the number of inputs considered equals the number of bits set in the field2 operand.

Logic Instructions: shift and rotate
field3 = field1 × 2^field2
field1 is shifted left or right according to the sign of SHIFT and stored in field3 if present, in field1 otherwise.
SFTL stands for LOGICAL shift: the bits shifted in are equal to 0.
SFTA stands for ARITHMETIC shift: the low-order bits shifted in are 0 when SHIFT is positive. The high-order bits shifted in are equal to the sign bit (if SXM, the Sign Extension Mode bit, is set) when SHIFT is negative.
Shift conditionally (SFTC) applies to signed values: one left shift is done to remove one redundant sign bit (TC is then set); otherwise nothing is done and TC is reset.
For the content of the shift field, see the Shift Field slide.

Logic Instructions: shift and rotate
ROR performs one right rotate through the carry C on src (guard bits=0, src(31)=C, C=src(0)).
ROL performs one left rotate through the carry C on src (guard bits=0, src(0)=C, C=src(31)).
ROLTC performs one left rotate with TC as input and C as output (guard bits=0, src(0)=TC, C=src(31)).
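The three rotates can be sketched as a behavioural model on the 32 low bits of the accumulator. This Python illustration is not TI code: the function names are hypothetical and each function returns the new 32-bit value (guard bits are implicitly zero) together with the new carry.

```python
MASK32 = 0xFFFFFFFF

def ror(src, c):
    """Rotate right through carry: returns (new src, new C)."""
    src &= MASK32
    return (c << 31) | (src >> 1), src & 1

def rol(src, c):
    """Rotate left through carry: returns (new src, new C)."""
    src &= MASK32
    return ((src << 1) & MASK32) | c, src >> 31

def roltc(src, tc):
    """Rotate left with TC shifted in and the MSB shifted out to C."""
    return rol(src, tc)
```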

Load, Store & Move Instructions
Load & Store accumulator
field3 = field1 × 2^field2. The shift is applied according to field2 if present (see the Shift Field slide).
0 ≤ K ≤ 255 and 0 ≤ lk ≤ 65535

Load & Store accumulator
  1: field2 = field1 × 2^16 with rounding, i.e. dst(31-16) = field1 + 0.5
  2: field2 = unsigned(field1), i.e. dst(31-16) = 0 and dst(15-0) = field1
  3: particular case of LD syntax 1 for a Memory Mapped Register

Load & Store accumulator
STL stores src(15-0) and STH stores src(31-16).
field3 = field1 × 2^field2. The shift is applied according to field2 if present (see the Shift Field slide).
The Memory Mapped Register form is the same as STL syntax 1 above, except that field3 is a Memory Mapped Register.

Load & Store other registers
These instructions initialize T, DP or ASM either from memory (Smem) or from an immediate value. #k3, #k5 and #k9 stand for 3-, 5- and 9-bit immediate values respectively.
ARP is only intended for ‘C25 compatibility mode and is not of interest in native ‘C54x software. ARP stands for Auxiliary Register Pointer. In older-generation TI DSPs (before the C5x), the auxiliary register to use was not named directly in the instruction but selected through a 3-bit register (ARP) that had to be updated before the instruction was executed.

Save other registers or write an immediate value to memory.
field1 = field2. Syntax 3 initializes any data memory location with an immediate value. A 16-bit immediate value can also be written into any Memory Mapped Register.

Direct transfer from memory to memory (the tables give the destination space and the source space)

Data space to/from I/O space: field2 = field1 (note 1)
Data space to/from program space:
  1, 3: field2 = field1 (note 1)
  2: the source program memory address is specified by A(15-0)
  4: the destination program memory address is specified by A(15-0)
Note 1: 0 ≤ PA ≤ 65535, 0 ≤ pmad ≤ 65535

Data space to data space
Data space to/from MMR
MMR to MMR (note 1)
MMR is a Memory Mapped Register.
Note 1: MMR1, MMR2: AR0-AR7 and SP only.

Program Control Instructions
† Values for words (W) and cycles assume the use of DARAM for data. ‡ Conditions true § Condition false ¶ Delayed instruction

† Values for words (W) and cycles assume the use of DARAM for data. Add 1 word and 1 cycle when using long-offset indirect addressing or absolute addressing with an Smem.

C55x Assembly Language
  C55 instruction set generalities
  C55 parallelism features and rules
  C55 mnemonic instruction set summary

C55 Instruction set generalities
General syntax of an instruction: Instr Op1,[Op2,[Op3,[…]]]
For the Instr field, refer to the TI documentation or to the following slides. The syntax of the Op1,[Op2,[Op3,[…]]] field is given in the instruction documentation and specifies which addressing modes can be used for each operand. Parallelism is presented in a following section. Terms, symbols and abbreviations are presented in spru374.pdf.
Some instruction terms, symbols and abbreviations:
  [ ]: Optional operands
  40: If the optional 40 keyword is applied to the instruction, the instruction provides the option to locally set M40 to 1 for the execution of the instruction
  ACOVx: Accumulator overflow status bit: ACOV0, ACOV1, ACOV2, ACOV3
  ACw, ACx, ACy, ACz: Accumulator: AC0, AC1, AC2, AC3
  Baddr: Register bit address
  BitIn: Shifted bit in: Test control flag 2 (TC2) or CARRY status bit
  BitOut: Shifted bit out: Test control flag 2 (TC2) or CARRY status bit
  CSR: Computed single-repeat register
  dst: Destination accumulator (ACx), lower 16 bits of auxiliary register (ARx), or temporary register (Tx): AC0, AC1, AC2, AC3, AR0, AR1, AR2, AR3, AR4, AR5, AR6, AR7, T0, T1, T2, T3
  kx: Unsigned constant coded on x bits
  Kx: Signed constant coded on x bits
  lx: Program address label coded on x bits (unsigned offset relative to program counter register)
  Lx: Program address label coded on x bits (signed offset relative to program counter register)
  Pipeline: Limiting execution pipeline phase: D Decode, AD Address, R Read, X Execute
  pmad: Program memory address
  Px: Program or data address label coded on x bits (absolute address)
  R or rnd: If the optional R or rnd keyword is applied to the instruction, rounding is performed in the instruction
  SHFT: Immediate shift value, 0 to 15
  SHIFTW: Immediate shift value, -32 to +31
  src: Source accumulator (ACx), lower 16 bits of auxiliary register (ARx), or temporary register (Tx): AC0, AC1, AC2, AC3, AR0, AR1, AR2, AR3, AR4, AR5, AR6, AR7, T0, T1, T2, T3
  STx: Status register: ST0, ST1, ST2, ST3
  TAx, TAy: Auxiliary register (ARx) or temporary register (Tx): AR0, AR1, AR2, AR3, AR4, AR5, AR6, AR7, T0, T1, T2, T3
  TCx, TCy: Test control flag: TC1, TC2
  TRNx: Transition register: TRN0, TRN1
  Tx, Ty: Temporary register: T0, T1, T2, T3
  U or uns: If the optional U or uns keyword is applied to the input operand, the operand is zero extended
  XAdst: Destination extended register: all 23 bits of stack pointer (XSP), system stack pointer (XSSP), data page pointer (XDP), coefficient data pointer (XCDP), and extended auxiliary register (XARx): XAR0, XAR1, XAR2, XAR3, XAR4, XAR5, XAR6, XAR7
  XARx: All 23 bits of auxiliary register: XAR0, XAR1, XAR2, XAR3, XAR4, XAR5, XAR6, XAR7
  XAsrc: Source extended register: all 23 bits of stack pointer (XSP), system stack pointer (XSSP), data page pointer (XDP), coefficient data pointer (XCDP), and extended auxiliary register (XARx): XAR0, XAR1, XAR2, XAR3, XAR4, XAR5, XAR6, XAR7
  xdst: Accumulator (AC0, AC1, AC2, AC3) or destination extended register: all 23 bits of stack pointer (XSP), system stack pointer (XSSP), data page pointer (XDP), coefficient data pointer (XCDP), and extended auxiliary register (XARx): XAR0, XAR1, XAR2, XAR3, XAR4, XAR5, XAR6, XAR7
  xsrc: Accumulator (AC0, AC1, AC2, AC3) or source extended register: all 23 bits of stack pointer (XSP), system stack pointer (XSSP), data page pointer (XDP), coefficient data pointer (XCDP), and extended auxiliary register (XARx): XAR0, XAR1, XAR2, XAR3, XAR4, XAR5, XAR6, XAR7

C55 Instruction set generalities
Instruction size can be 8, 16, 24, 32, 40 or 48 bits. In the mnemonic syntax, an instruction name is made of a root and, optionally, a prefix and a suffix. There are:
  37 roots (see next slides)
  2 prefixes (see next slides)
  13 suffixes (see next slides)

List of roots for instruction names 1/2

List of roots for instruction names 2/2

List of prefixes for instruction names

List of suffixes for instruction names

Operand modifiers

Addressing-mode operands

Absolute addressing modes

Direct addressing modes

Indirect addressing modes

C55 Parallelism features and rules
The C55x DSP architecture can execute two instructions in parallel within the same execution cycle. The types of parallelism are:
  Built-in parallelism within a single instruction. Some instructions perform two different operations in parallel. Double colons, ::, are used to separate the two operations.
  User-defined parallelism between two instructions. Two instructions may be paralleled by the programmer or the C compiler. The parallel bars, ||, are used to separate the two instructions to be executed in parallel.
Built-in parallelism can be combined with user-defined parallelism.

Rules for Simple and parallel instructions
Simple instructions are not allowed to span multiple lines, with one exception: single instructions that use the double colons, ::, notation to imply parallelism. These instructions may be split after the :: notation. Example of a single instruction on two lines:
        MPYR40 uns(Xmem), uns(Cmem), ACx ::
        MPYR40 uns(Ymem), uns(Cmem), ACy
User-defined parallelism instructions (using the || notation) are allowed to span multiple lines. For example, all of the following forms are legal:
        MOV AC0, AC1 || MOV AC2, AC3

        MOV AC0, AC1 ||
        MOV AC2, AC3

        MOV AC0, AC1
        || MOV AC2, AC3

Built-In vs. User-Defined Parallelism
Parallel instructions execute in a single cycle:
Built-In (Implied): separate, unique instructions (you can look ‘em up) that are built in to the instruction set (actually ONE instruction). Use a “::” to separate the pieces:
built_in:     MAC *AR3+,*CDP+,AC0
           :: MAC ...
User-Defined (Explicit): determined by the user or the compiler. The user places a double “pipe” || to separate the two instructions:
user_defined: MPYM *AR1+,*AR2+,AC
           || AND AR3,T1
Combo: built-in and user-defined can be combined:
combo:        ADD *AR3<<#16, AC0, AC1
           :: MOV HI(AC1 << T2), *AR4
           || MOV #3,AR1
Let’s find out more about user-defined parallelism...

User-Defined Parallelism
        MOV *AR1,AC1
     || ADD @var,AR2
Can these two instructions be used in parallel? First, we need to know which processor resources are being used:
  Operators: “Architecture components that perform tasks.” For example: ALUs, address generators, Dual-MAC (DMAC), etc.
  Registers: Tx, ARx, ACx, BRCx, BSAxx, etc.
  Buses: Memory Read/Write, Constant, etc.
Let’s explore the types of operators first...

Operators
        MOV *AR1,AC1
     || ADD @var,AR2
(Figure: block diagram of the P-unit (PU), A-unit (AU) and D-unit (DU). Each unit has LOAD and STORE operators and a register file. The diagram also shows the DAGEN with its X, Y and C paths, SWAP, STORE w/SHFT, CONTROL, the ALU w/SHIFT, and the ALU, SHIFTER and DMAC.)
Which operators are used in the instruction above?
Each unit has multiple load/store operators (shown later). All other operators are single use.
Let’s review the contents of the register files...

Register Files
Following is the list of registers contained in the register files of the PU, AU and DU, as listed on the slide:
  RPTC, BRC0-1, RSA0-1, REA0-1, RETA, CSR, T, XAR0-7, XSP/SSP, Xreg, XCDP, XDP, BKxx, BSAxx, ST0-3, PDP, AC0-3, TRN0-1
(Figure: PU, AU and DU block diagram with the register files filled in.)
Next, let’s review the buses...

Read/Write Buses
One load/store operator exists for each bus entering or leaving a unit.
  Read buses: BB, CB, DB
  Write buses: EB, FB
(Figure: PU, AU and DU block diagram with the read buses B, C, D, the write buses E, F and the cross-unit buses.)

Examples of Operator/Bus Usage
Instructions        | Operators                         | Buses
MOV @var,BRC1       | P-Unit Load, A-Unit DAGEN X       | D-Bus
MOV                 | A-Unit Store, A-Unit DAGEN X      | E-Bus
MOV                 | D-Unit Load, A-Unit DAGEN X       | C-Bus, D-Bus
MPYM *AR1,T0,AC0    | D-Unit ALU/MAC/SH, A-Unit DAGEN X | D-Bus

User-Defined Parallelism - 3 Basic Rules
Two instructions can be written in parallel if:
  1. There is no hardware resource conflict (operators, buses and registers)
  2. The total size is less than or equal to 6 bytes
  3. For instructions with 2 memory accesses, each must use indirect addressing (soft-dual addressing). Example:
        MOV *AR2,AC0
     || MOV T1,*AR1
If the instructions meet these rules, write the parallel pair and assemble it. If the assembler doesn’t like it, refer to the “advanced” rules at the end of this module.

Parallelism - Exercise 1
(Figure, repeated on each exercise slide: PU, AU and DU block diagram with the LOAD/STORE operators, DAGEN X/Y/C, SWAP, STORE w/SHFT, register files, CONTROL, ALU w/SHIFT, ALU, SHIFTER, DMAC, cross-unit buses, read buses (BB, CB, DB) and write buses (EB, FB).)
        MOV *AR1,AC1            ;bytes:2
     ||                         ;bytes:3

Parallelism - Exercise 1 (Solution)
        MOV *AR1,AC1            ;bytes:2
     ||                         ;bytes:3
Reason: No soft-dual addressing

        MOV dbl(*(AR1+T0)),AC0  ;bytes:3
     || MOV AC2,dbl(*AR2+)      ;bytes:3

        MOV dbl(*(AR1+T0)),AC0  ;bytes:3
     || MOV AC2,dbl(*AR2+)      ;bytes:3
It works.

        CALL AC0                ;bytes:2
     || MOV AC1,AR1             ;bytes:2

        CALL AC0                ;bytes:2
     || MOV AC1,AR1             ;bytes:2
Reason: cross-unit bus conflict

        MAC *AR1+, *CDP+, AC0>>#16
     :: MAC *AR2+, *CDP+, AC1>>#16  ;bytes:4
     || RPTADD CSR,T                ;bytes:2

        MAC *AR1+, *CDP+, AC0>>#16
     :: MAC *AR2+, *CDP+, AC1>>#16  ;bytes:4
     || RPTADD CSR,T                ;bytes:2
It works...

Optimization Suggestions
There are two main types of code you can optimize:
Setup
  Usually contains instructions w/large constants such as AMOV #x,XAR5, which cannot be placed in parallel anyway
  Creating long sequences of 5-6 byte instructions could stall the IBQ (4-byte fetch only) and negate the benefit
  Only runs once, i.e. usually not inside a loop
  Recommendation: don’t spend time on setup code
Inner Loop
  Largest impact for parallel instructions and avoiding pipeline stalls, based on the number of loops
  Usually contains simple/math instructions, which have an increased chance of being placed in parallel
  Usually inside RPTB/LOCAL: easy to locate exactly which code you should focus your efforts on
  Recommendation: FOCUS ALL YOUR TIME HERE

User-Defined Parallelism - Advanced Rules
A. Address modifiers cannot use 16/24-bit constants.
B. If the D-unit shift-and-store operator is used:
(a) the D-unit ALU shifter cannot be used, and
(b) only two source accumulators in total are allowed.
C. Instruction priority:
(a) MMR access takes priority over internal access
MOV AR1,AC0
|| MOV ;takes priority
(b) In a bus conflict, the 2nd value is used in both instructions (the assembler flags a warning)
MOV #3,AC0
|| MOV #4,AC1 ;takes priority
D. The DAGEN tag combination must be allowed.
E. One of the instructions must have a parallel enable bit.

Soft-dual parallelism
Instructions that reference memory operands do not have parallel enable bits. Two such instructions may still be combined with a type of parallelism called soft-dual parallelism. The constraints of soft-dual parallelism are:
Both memory operands must meet the constraints of the dual AR indirect addressing mode (Xmem and Ymem).
Neither instruction can be one of the following:
- An instruction embedding high_byte(Smem) or low_byte(Smem).
- An instruction that reads and writes the same memory location.
The operands available for the dual AR indirect addressing mode are:
*ARn
*ARn+
*ARn–
*(ARn + AR0)
*(ARn + T0)
*(ARn – AR0)
*(ARn – T0)
*ARn(AR0)
*ARn(T0)
*(ARn + T1)
*(ARn – T1)
The instructions that read and write the same memory location are:
BCLR src, Smem
BNOT src, Smem
BSET src, Smem
BTSTCLR k4, Smem, TCx
BTSTNOT k4, Smem, TCx
BTSTSET k4, Smem, TCx

Soft-dual parallelism
With regard to soft-dual parallelism, the AMAR Smem instruction has the same properties as any memory reference instruction.
Soft-Dual Parallelism of MAR Instructions
Although the following modify auxiliary register (MAR) instructions do not reference memory and do not have parallel enable bits, they may be combined together or with any other memory reference instructions (not limited to Xmem/Ymem) to form soft-dual parallelism:
AADD TAx, TAy
AADD k8, TAx
AMOV TAx, TAy
AMOV k8, TAx
ASUB TAx, TAy
ASUB k8, TAx

C55 mnemonic instruction set summary
The C55 mnemonic instruction set can be divided into six basic types of operations:
Arithmetic operations
Bit manipulation operations
Extended auxiliary register (XAR) operations
Logical operations
Move operations
Program-control operations
With each instruction listing, you will find the availability of a parallel enable bit, the word count (size), the cycle time, the pipeline stage in which the instruction is executed, and the unit in which the instruction is executed.

Arithmetic Instructions
The ABDST instruction executes two operations in parallel, one in the D-unit MAC and one in the D-unit ALU:
ACy = ACy + |HI(ACx)|
ACx = (Xmem << #16) - (Ymem << #16)
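The two parallel operations above form a one-stage pipeline: each step accumulates the magnitude of the previous difference while computing the next one. The following Python model is only a sketch of that behaviour, assuming plain integers with no saturation, guard bits or status flags; the function names `abdst` and `abs_distance` are hypothetical.

```python
def abdst(acx, acy, xmem, ymem):
    """Behavioural model of ABDST Xmem, Ymem, ACx, ACy.

    Assumptions: unbounded Python integers, no saturation, no status bits.
    Both results are computed from the OLD value of ACx, as in the parallel pair.
    """
    hi_acx = acx >> 16                      # HI(ACx): the previous difference
    new_acy = acy + abs(hi_acx)             # ACy = ACy + |HI(ACx)|
    new_acx = (xmem << 16) - (ymem << 16)   # ACx = (Xmem << #16) - (Ymem << #16)
    return new_acx, new_acy


def abs_distance(xs, ys):
    """Sum of |x - y| over two equal-length lists, built from abdst steps."""
    acx, acy = 0, 0
    for x, y in zip(xs, ys):
        acx, acy = abdst(acx, acy, x, y)
    # One extra step is needed to accumulate the last difference.
    _, acy = abdst(acx, acy, 0, 0)
    return acy
```

Because the accumulation lags the subtraction by one step, a loop over N pairs needs N + 1 steps before ACy holds the full distance.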

Arithmetic Instructions: ADD
Brief Description
These instructions (17 forms) perform an addition operation:
In the D-unit ALU, if the destination operand is an accumulator (ACx).
In the A-unit ALU, if the destination operand is an auxiliary or temporary register (TAx).
In the D-unit ALU, if the destination operand is the memory (Smem).
In the D-unit shifter, if the instruction has a shift quantity other than the immediate 16-bit shift.
Status Bits
Affected by: CARRY, C54CM, M40, SATA, SATD, SXMD
Affects: ACOVx, ACOVy, CARRY

Some examples for the 17 forms of ADD
ADD AC1, AC0
The content of AC1 is added to the content of AC0 and the result is stored in AC0.
ADD #15, AC0
The content of AC0 is added to an unsigned 4-bit value (15) and the result is stored in AC0.
ADD #2E00h, AC0, AC1
The content of AC0 is added to the signed 16-bit value (2E00h) and the result is stored in AC1.
ADD *AR3+, T0, T1
The content of T0 is added to the content addressed by AR3 and the result is stored in T1. AR3 is incremented by 1.
ADD AC1 << T0, AC0
The content of AC1 shifted by the content of T0 is added to the content of AC0 and the result is stored in AC0.
ADD AC1 << #31, AC0
The content of AC1 shifted left by 31 bits is added to the content of AC0 and the result is stored in AC0.
ADD #FFFFh << #16, AC1, AC0
A signed 16-bit value (FFFFh) shifted left by 16 bits is added to the content of AC1 and the result is stored in AC0.
ADD #FFFFh << #15, AC1, AC0
A signed 16-bit value (FFFFh) shifted left by 15 bits is added to the content of AC1 and the result is stored in AC0.
ADD *AR1 << T0, AC1, AC0
The content addressed by AR1 shifted left by the content of T0 is added to the content of AC1 and the result is stored in AC0.
ADD *AR3 << #16, AC1, AC0
The content addressed by AR3 shifted left by 16 bits is added to the content of AC1 and the result is stored in AC0.
ADD uns(*AR3), CARRY, AC1, AC0
The CARRY status bit and the unsigned content addressed by AR3 are added to the content of AC1 and the result is stored in AC0.
ADD uns(*AR3), AC1, AC0
The unsigned content addressed by AR3 is added to the content of AC1 and the result is stored in AC0.
ADD uns(*AR3) << #31, AC1, AC0
The unsigned content addressed by AR3 shifted left by 31 bits is added to the content of AC1 and the result is stored in AC0.
ADD dbl(*AR3+), AC1, AC0
The content (long word) addressed by AR3 and AR3 + 1 is added to the content of AC1 and the result is stored in AC0. Because this instruction is a long-operand instruction, AR3 is incremented by 2 after the execution.
ADD *AR3, *AR4, AC0
The content addressed by AR3 shifted left by 16 bits is added to the content addressed by AR4 shifted left by 16 bits and the result is stored in AC0.
ADD #FFFFh, *AR3
The content addressed by AR3 is added to a signed 16-bit value and the result is stored back into the location addressed by AR3.
ADDV AC1, AC0
The absolute value of AC1 is added to the content of AC0 and the result is stored in AC0.
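To make the "signed 16-bit value shifted left" wording in the ADD examples concrete, here is a small Python model of two of the forms. It is a sketch only, assuming SXMD = 1 (sign extension on), a 40-bit accumulator that simply wraps, and no saturation or status-bit updates; the helper names `sext`, `add_k16_shifted` and `add_dual_mem` are hypothetical.

```python
MASK40 = (1 << 40) - 1  # accumulators are 40 bits wide


def sext(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        return value - (1 << bits)
    return value


def add_k16_shifted(acx, k16, shift):
    """Model of ADD K16 << #SHFT, ACx, ACy (assumes SXMD = 1, no saturation)."""
    operand = sext(k16, 16) << shift
    return (sext(acx, 40) + operand) & MASK40


def add_dual_mem(xmem, ymem):
    """Model of ADD Xmem, Ymem, ACx: both operands are shifted left by 16 bits."""
    return ((sext(xmem, 16) << 16) + (sext(ymem, 16) << 16)) & MASK40
```

Under these assumptions FFFFh is treated as -1, so `ADD #FFFFh << #16, AC1, AC0` subtracts 1 0000h from AC1 rather than adding FFFF 0000h.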

Arithmetic instructions: compare and select extremum
Instructions [1] and [3] perform two paralleled 16-bit extremum selections in the D-unit ALU. Instructions [2] and [4] perform a single 40-bit extremum selection in the D-unit ALU.
Status Bits
Affected by: C54CM, M40, SATD
Affects: ACOVw, CARRY
Description: MAXDIFF ACx, ACy, ACz, ACw
This instruction performs two paralleled 16-bit extremum selections in the D-unit ALU in one cycle (a dual maximum search). The two operations are executed on 40 bits in the D-unit ALU, which is configured locally in dual 16-bit mode. The 16 lower bits of both the ALU and the accumulators are separated from their higher 24 bits (the 8 guard bits are attached to the higher 16-bit data path).
For each data path (high and low):
ACx and ACy are the source accumulators.
The differences are stored in accumulator ACw. The subtraction computation is equivalent to the dual 16-bit arithmetic operation instruction.
For each of the two computations performed in the ALU, an overflow detection is made. If an overflow is detected on any of the data paths, the destination accumulator overflow status bit (ACOVw) is set. For the operations performed in the ALU low part, overflow is detected at bit position 15. For the operations performed in the ALU high part, overflow is detected at bit position 31.
For all instructions, the carry of the operation performed in the ALU high part is reported in the CARRY status bit. The CARRY status bit is always extracted at bit position 31.
Independently on each data path, if SATD = 1 when an overflow is detected on the data path, a saturation is performed. For the operations performed in the ALU low part, saturation values are 7FFFh (positive) and 8000h (negative). For the operations performed in the ALU high part, saturation values are 00 7FFFh (positive) and FF 8000h (negative).
The extremum is stored in accumulator ACz.
The extremum is searched considering the selected bit width of the accumulators: for the lower 16-bit data path, the sign bit is extracted at bit position 15; for the higher 24-bit data path, the sign bit is extracted at bit position 31.
According to the extremum found, a decision bit is shifted into TRNx from the MSBs to the LSBs: TRN0 tracks the decision for the high part data path and TRN1 tracks the decision for the low part data path. If the extremum value is the ACx high or low part, the decision bit is cleared to 0; otherwise, it is set to 1:
TRN0 = TRN0 >> #1
TRN1 = TRN1 >> #1
ACw(39–16) = ACy(39–16) – ACx(39–16)
ACw(15–0) = ACy(15–0) – ACx(15–0)
if (ACx(31–16) > ACy(31–16))
{ bit(TRN0, 15) = #0 ; ACz(39–16) = ACx(39–16) }
else
{ bit(TRN0, 15) = #1 ; ACz(39–16) = ACy(39–16) }
if (ACx(15–0) > ACy(15–0))
{ bit(TRN1, 15) = #0 ; ACz(15–0) = ACx(15–0) }
else
{ bit(TRN1, 15) = #1 ; ACz(15–0) = ACy(15–0) }
See examples and more details on next slides
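The MAXDIFF pseudo-code can be exercised with a short Python model. This is a simplified sketch, not the full instruction: each accumulator is represented as a hypothetical `(high, low)` pair of plain signed integers, and guard bits, overflow detection, saturation and the CARRY bit are left out.

```python
def maxdiff(acx, acy, trn0, trn1):
    """Simplified model of MAXDIFF ACx, ACy, ACz, ACw.

    acx, acy: (high, low) pairs of signed integers (assumed representation).
    trn0, trn1: 16-bit transition registers.
    Returns (acz, acw, trn0, trn1). Saturation and status bits are not modelled.
    """
    (xh, xl), (yh, yl) = acx, acy

    # Both transition registers shift right by one; bit 15 receives the decision.
    trn0 = (trn0 >> 1) & 0xFFFF
    trn1 = (trn1 >> 1) & 0xFFFF

    # Differences go to ACw.
    acw = (yh - xh, yl - xl)

    # High path: decision bit 0 if ACx holds the maximum, 1 otherwise.
    if xh > yh:
        zh = xh
    else:
        zh = yh
        trn0 |= 0x8000

    # Low path: same rule, recorded in TRN1.
    if xl > yl:
        zl = xl
    else:
        zl = yl
        trn1 |= 0x8000

    return (zh, zl), acw, trn0, trn1
```

For example, with ACx = (5, -2) and ACy = (3, 7) the high path keeps 5 (decision bit 0) and the low path keeps 7 (decision bit 1), so the two halves of ACz can come from different source accumulators.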

Index Search - DMAXDIFF, MAXDIFF
MAXDIFF ACx,ACy,ACz,ACw: Dual 16-bit Search
Splits ACs into two 16-bit registers
Max of hi/low halves placed in ACz, ACy-ACx placed in ACw, TRN0/1 hold index
TRN0 (tracks AC high), TRN1 (tracks AC low)
~N/2 cycles for N taps with index
DMAXDIFF ACx,ACy,ACz,ACw,TRNx: 32-bit Search
ACx/ACy = values, ACz = max, ACw = ACy-ACx, TRNx = index
~N cycles for N taps with index
TRN0/1: if new max, shift "1" into the MSB (bit 15)
MINDIFF and DMINDIFF also supported

Example of MAXDIFF instruction
MAXDIFF AC0, AC1, AC2, AC1
[Table: register contents before and after execution. Surviving values: SATD = 1 before and after; CARRY = 1 before and 0 after; after execution AC1 = FF 8000 DDDE, TRN0 = 0800, TRN1 = 0080, ACOV1 = 1]
The difference is stored in AC1. The content of AC0(39–16) is subtracted from the content of AC1(39–16) and the result is stored in AC1(39–16). Since SATD = 1 and an overflow is detected, AC1(39–16) = FF 8000h (saturation). The content of AC0(15–0) is subtracted from the content of AC1(15–0) and the result is stored in AC1(15–0).
The maximum is stored in AC2. The content of TRN0 and TRN1 is shifted right 1 bit. AC0(31–16) is greater than AC1(31–16), so AC0(39–16) is stored in AC2(39–16) and TRN0(15) is cleared to 0. AC0(15–0) is greater than AC1(15–0), so AC0(15–0) is stored in AC2(15–0) and TRN1(15) is cleared to 0.

Example of DMAXDIFF instruction
DMAXDIFF AC0, AC1, AC2, AC3, TRN1
[Table: register contents before and after execution. Surviving values: SATD = 1 before and after; CARRY = 0 before and after; after execution AC3 = F0 5C00 BBBC, M40 = 1, TRN1 = 0040, ACOV3 = 0]
The difference is stored in AC3. The content of AC0 is subtracted from the content of AC1 and the result is stored in AC3.
The maximum is stored in AC2. The content of TRN1 is shifted right 1 bit. AC0 is greater than AC1, so AC0 is stored in AC2 and TRN1(15) is cleared to 0.

Arithmetic instructions: Conditional Addition/subtraction, Shift, Subtract
ADDSUBCC, ADDSUB2CC
These instructions evaluate the selected TCx status bit(s) and, based on the result of the test, perform an addition, a move, or a subtraction. Evaluation of the condition on the TCx status bit is performed during the Execute phase of the instruction.

ADDSUB2CC Smem, ACx, Tx, TC1, TC2, ACy
TC1  TC2  Operation
0    0    ACy = ACx – (Smem << Tx)
0    1    ACy = ACx – (Smem << #16)
1    0    ACy = ACx + (Smem << Tx)
1    1    ACy = ACx + (Smem << #16)
If TC2 = 1 and TC1 = 1, then ACy = ACx + (Smem << #16): this instruction performs an addition operation between an accumulator ACx and the content of a memory (Smem) location shifted left by 16 bits and stores the result in accumulator ACy.
If TC2 = 0 and TC1 = 1, then ACy = ACx + (Smem << Tx): this instruction performs an addition operation between an accumulator ACx and the content of a memory (Smem) location shifted left by the content of Tx and stores the result in accumulator ACy.
If TC2 = 1 and TC1 = 0, then ACy = ACx – (Smem << #16): this instruction subtracts the content of a memory (Smem) location shifted left by 16 bits from an accumulator ACx and stores the result in accumulator ACy.
If TC2 = 0 and TC1 = 0, then ACy = ACx – (Smem << Tx): this instruction subtracts the content of a memory (Smem) location shifted left by the content of Tx from an accumulator ACx and stores the result in accumulator ACy.
The operation is performed on 40 bits in the D-unit shifter. Input operands are sign extended to 40 bits according to SXMD. The shift operation is identical to the signed shift instruction. Overflow detection and the CARRY status bit depend on M40. When an overflow is detected, the accumulator is saturated according to SATD.

ADDSUBCC Smem, ACx, TCx, ACy
If TCx = 1, then ACy = ACx + (Smem << #16): this instruction performs an addition operation between an accumulator ACx and the content of a memory (Smem) location shifted left by 16 bits and stores the result in accumulator ACy.
If TCx = 0, then ACy = ACx – (Smem << #16): this instruction subtracts the content of a memory (Smem) location shifted left by 16 bits from an accumulator ACx and stores the result in accumulator ACy.
The operation is performed on 40 bits in the D-unit ALU.

Conditional Addition or Subtraction: ADDSUBCC Smem, ACx, TC1, TC2, ACy
TC1  TC2  Operation
0    0    ACy = ACx – (Smem << #16)
0    1    ACy = ACx
1    0    ACy = ACx + (Smem << #16)
1    1    ACy = ACx
If TC2 = 1, then ACy = ACx: this instruction moves the content of ACx to ACy. The 40-bit move operation is performed in the D-unit ALU. During the 40-bit move operation, an overflow is detected according to M40: the destination accumulator overflow status bit (ACOVy) is set and the destination register (ACy) is saturated according to SATD.
If TC2 = 0 and TC1 = 1, then ACy = ACx + (Smem << #16): this instruction performs an addition operation between an accumulator ACx and the content of a memory (Smem) location shifted left by 16 bits and stores the result in accumulator ACy.
If TC2 = 0 and TC1 = 0, then ACy = ACx – (Smem << #16): this instruction subtracts the content of a memory (Smem) location shifted left by 16 bits from an accumulator ACx and stores the result in accumulator ACy.
These instructions are useful for division.
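The three truth tables can be written down directly as a Python model. The sketch below assumes unbounded integers, a non-negative shift count in Tx, and no saturation or status-bit updates; the function names are hypothetical labels for the three forms.

```python
def addsubcc(smem, acx, tcx):
    """ADDSUBCC Smem, ACx, TCx, ACy: add if TCx = 1, subtract if TCx = 0."""
    operand = smem << 16
    return acx + operand if tcx else acx - operand


def addsubcc_move(smem, acx, tc1, tc2):
    """ADDSUBCC Smem, ACx, TC1, TC2, ACy: TC2 = 1 selects a plain move."""
    if tc2:
        return acx
    return addsubcc(smem, acx, tc1)


def addsub2cc(smem, acx, tx, tc1, tc2):
    """ADDSUB2CC Smem, ACx, Tx, TC1, TC2, ACy.

    TC2 selects the shift (#16 if set, Tx otherwise); TC1 selects add or subtract.
    Assumes tx >= 0; right shifts through a negative Tx are not modelled.
    """
    operand = smem << (16 if tc2 else tx)
    return acx + operand if tc1 else acx - operand
```

In all three forms TC1 (or TCx) picks the sign of the operation, while TC2 picks either the shift amount or the move.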

Example for conditional shift SFTCC
SFTCC AC0, TC1
[Table: Before: AC0 FF ...; After: AC0 FF ..., TC1 1]
Because AC0(31) XORed with AC0(30) equals 1, the content of AC0 is not shifted left and TC1 is set to 1.
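The XOR test in this example is the building block of a normalization loop: shift left while the two top bits are equal, stop once they differ. The Python sketch below models only the low 32 bits of the accumulator. The slide states only the "XOR equals 1" case; the other two branches (shift and clear TCx when the bits are equal, set TCx for a zero accumulator) are assumptions about the instruction, and the names `sftcc` and `normalize` are hypothetical.

```python
def sftcc(acx):
    """Model of SFTCC ACx, TCx on ACx(31-0). Returns (new_acx, tcx)."""
    acx &= 0xFFFFFFFF
    if acx == 0:
        return 0, 1                      # assumption: zero sets TCx
    bit31 = (acx >> 31) & 1
    bit30 = (acx >> 30) & 1
    if bit31 ^ bit30:
        return acx, 1                    # one sign bit only: no shift, TCx = 1
    # Assumption: two identical top bits mean a redundant sign bit.
    return (acx << 1) & 0xFFFFFFFF, 0    # shift left by 1, TCx = 0


def normalize(acx):
    """Repeat sftcc until TCx is set; returns (normalized value, shift count)."""
    shifts = 0
    while True:
        acx, tcx = sftcc(acx)
        if tcx:
            return acx, shifts
        shifts += 1
```

The shift count returned by `normalize` plays the role of an exponent, which is why conditional shifts of this kind are used when emulating floating-point numbers.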

Example for SUBC
SUBC *AR1, AC0, AC1
[Table: register contents before and after execution. Surviving values: SXMD = 0 before and after; CARRY = 0 before and 1 after; after execution AR1 = 300, ACOV1 = 1]
The content addressed by AR1 shifted left by 15 bits is subtracted from the content of AC0. The result is greater than 0; therefore, the result is shifted left by 1 bit, 1 is added, and the new result is stored in AC1. The result generated an overflow and a carry.
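SUBC is a single step of restoring division, so repeating it 16 times divides a 16-bit unsigned numerator by a 16-bit denominator. The Python sketch below models that use. It assumes the comparison is "greater than or equal to zero" (needed for exact quotients), treats the other branch as a plain shift left by 1, and ignores overflow, carry and the 40-bit width; `subc` and `divide_u16` are hypothetical names.

```python
def subc(smem, acx):
    """One conditional-subtract step (model of SUBC Smem, ACx, ACy)."""
    diff = acx - (smem << 15)
    if diff >= 0:                 # assumption: >= 0, not strictly > 0
        return (diff << 1) + 1    # keep the difference, shift in a quotient 1
    return acx << 1               # restore: shift in a quotient 0


def divide_u16(numerator, denominator):
    """Unsigned 16-bit division built from 16 subc steps.

    Returns (quotient, remainder): after 16 steps the quotient sits in the
    low 16 bits of the accumulator and the remainder in the bits above.
    """
    acc = numerator
    for _ in range(16):
        acc = subc(denominator, acc)
    return acc & 0xFFFF, acc >> 16
```

On the DSP the 16 steps would typically be issued with a single-instruction repeat, giving a division in roughly 16 cycles.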

Dual 16-Bit Arithmetic
See next slides for more details on ADDSUB and SUBADD. See chapter 22 for application to the Viterbi algorithm.

Viterbi Decoding Procedure
[Figure: Data -> ENC -> G0G1 -> Channel -> ~G0G1 -> DEC -> ~Data]
Know: Received data, and how the original data was encoded
Need: Derive the original data from the received data
Viterbi: "Deriving the most likely path taken through a Viterbi trellis"
Process: Establish path through trellis (using metric/penalties) to allow traceback to determine the original data that determined this path
Procedure:
1. Get current metric
2. Add/sub local distance (T3)
3. Compare and select min/max
4. Note which path was taken (TRNx)
[Figure: Current State (*ptr), +T3 / -T3, Prelim Values (ACxH, ACyH, ACxL, ACyL), Select Max, Next State, TRN0/1]
Notes: Viterbi is a tough subject to teach. First, if an experienced person is in the class, he/she knows more about it than you do and only wants to see the kernel on the next slide (3 cycles per state). If a person is new to Viterbi, even the most basic slide will confuse them and you don't have two weeks to explain it to them. In fact, it takes people months or years to get comfortable with it. So, the intent of these two slides is to give a novice user a "feel" for how it works (not an understanding) and provide the experienced user with the kernel and benchmark. Any more or less information will bore/confuse most people and the expert already knows it.
The actual content of the "current state" is the penalty associated with the "high road" (a one) or the "low road" (a zero). This penalty is derived from the metric, which is affected by the received data (G0G1). The "penalties" (an indication of how likely the data followed the high road or low road) are cumulative. Because there are 4 possible states, the DMAXDIFF instruction chooses the two "best" and then makes a note of which ones were chosen. During traceback, the two possibilities are narrowed down to one: the original data. The next slide shows the kernel for the experts to look at.

Viterbi Decoding Procedure
ADDSUB *AR0+,T3,AC0 ;hi(AC0) <- p0(J), lo(AC0) <- p0(J+N/2)
SUBADD *AR0+,T3,AC1 ;hi(AC1) <- p1(J), lo(AC1) <- p1(J+N/2)
MAXDIFF AC0,AC1,AC2,AC3 ;put "best path" in AC2
Use ABDST/SQDST to determine metric; metric update and traceback not shown
Procedure:
1. Get current metric
2. Add/sub local distance (T3)
3. Compare and select min/max
4. Note which path was taken (TRNx)
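The three-instruction kernel is an add-compare-select (ACS) butterfly. The Python sketch below mirrors it step by step. It rests on assumptions that are not spelled out on the slide: ADDSUB is taken to put metric + distance in the high half and metric - distance in the low half, SUBADD the reverse, and each accumulator is represented as a `(high, low)` pair of plain integers with no saturation. All function names are hypothetical.

```python
def addsub(metric, dist):
    """Assumed ADDSUB behaviour: high = metric + dist, low = metric - dist."""
    return (metric + dist, metric - dist)


def subadd(metric, dist):
    """Assumed SUBADD behaviour: high = metric - dist, low = metric + dist."""
    return (metric - dist, metric + dist)


def select_max(acx, acy, trn0, trn1):
    """Dual max with decision bits, following the MAXDIFF pseudo-code."""
    trn0 = (trn0 >> 1) & 0xFFFF
    trn1 = (trn1 >> 1) & 0xFFFF
    (xh, xl), (yh, yl) = acx, acy
    if xh > yh:
        zh = xh
    else:
        zh = yh
        trn0 |= 0x8000          # high path chose ACy
    if xl > yl:
        zl = xl
    else:
        zl = yl
        trn1 |= 0x8000          # low path chose ACy
    return (zh, zl), trn0, trn1


def butterfly(metric0, metric1, dist, trn0, trn1):
    """One ACS butterfly: two old path metrics in, two new path metrics out."""
    ac0 = addsub(metric0, dist)      # candidates coming from old state 0
    ac1 = subadd(metric1, dist)      # candidates coming from old state 1
    return select_max(ac0, ac1, trn0, trn1)
```

The decision bits collected in TRN0 and TRN1 are what a later traceback pass would read to recover the surviving path.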

Dual Multiply/Accumulate/Subtract 1/2

Dual Multiply/Accumulate/Subtract 2/2

FIR symmetrical and antisymmetrical filters
See chapter 14 on FIR filters for detailed explanation of these instructions and next slide for illustration.

Symmetrical/Antisymmetrical FIR
[Figure: symmetrical and anti-symmetrical coefficient sets a0 ... a7, with a4 ... a7 folded onto a3 ... a0]
These filters may be "folded" and performed with N adds and N/2 MACs
Filters need to be designed as even length
Y(n) = a0(x7 + x0) + a1(x6 + x1) + a2(x5 + x2) + a3(x4 + x3)
FIRSADD Xmem,Ymem,coef,ACx,ACy
Pseudo code: ACy = ACy + (ACx * (*CDP)) || ACx = Xmem + Ymem
FIRSADD: 2 taps/cycle
Anti-symmetrical: use FIRSSUB (e.g. a0(x7-x0))
Notes: If you're using a block FIR, a dual-MAC beats FIRS. On the 54x, FIRS was faster. But on the 55x, having the dual-MAC basically obsoletes the FIRS instruction. So, no lab on this. For migration from the 54x (which has the SAME benchmark, 2 taps/cycle), the advantage of FIRSADD is lower power.
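The folding identity can be checked numerically. The following Python sketch compares a direct FIR sum with the folded form for both the symmetrical and the anti-symmetrical case. It uses plain integers and hypothetical function names; it illustrates the arithmetic only, not the FIRSADD/FIRSSUB pipeline.

```python
def fir_direct(coeffs, samples):
    """Plain dot product: sum of a[i] * x[i] over all taps."""
    return sum(a * x for a, x in zip(coeffs, samples))


def fir_folded_sym(half_coeffs, samples):
    """Symmetrical filter: a[i] * (x[N-1-i] + x[i]) for i in the first half."""
    n = len(samples)
    return sum(a * (samples[n - 1 - i] + samples[i])
               for i, a in enumerate(half_coeffs))


def fir_folded_anti(half_coeffs, samples):
    """Anti-symmetrical filter: a[i] * (x[N-1-i] - x[i])."""
    n = len(samples)
    return sum(a * (samples[n - 1 - i] - samples[i])
               for i, a in enumerate(half_coeffs))
```

With half coefficients [1, 2, 3, 4] the full symmetrical set is [1, 2, 3, 4, 4, 3, 2, 1], and the anti-symmetrical set matching a0(x7 - x0) is [-1, -2, -3, -4, 4, 3, 2, 1]; in both cases the folded form needs only four multiplies instead of eight.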

Implied parallelism instructions

LMS, MAX and MIN instructions
See chapter 16 on adaptive FIR filters for a detailed explanation of the LMS instruction and the next slide for an illustration. See the next slides for more explanation of MIN and MAX.

Adaptive Filtering Using LMS - Concept
A least mean square (LMS) approach is widely used for adaptive filter routines. The technique minimizes an error term by tuning the filter coefficients.
[Figure: the input (plus noise) drives both the modeled system, which produces d, and an FIR model with delay line z-1, samples x0, x1 ... xk and coefficients a0, a1 ... an, which produces y; the difference d - y gives the error e, and the products e*x0, e*x1 ... e*xn update the coefficients]
Provide input to the real system to get desired output: d(n)
Filter input through FIR model to get estimated output: y(n)
Compare results to get error term: e(n)
Correlate the error term with x0, x1 ... to update coefficients: ei = e*xi
LMS instruction = FIR + coefficient update: ACy = ACy + (ai*xi), ACx = rnd(ei + ai)
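The four steps (desired output, estimated output, error, coefficient update) map onto a few lines of Python. This is a floating-point sketch of the textbook LMS update, assuming a step size `two_beta` that stands for the pre-calculated 2*B factor; it does not model the fixed-point rounding of the LMS instruction, and `lms_step` is a hypothetical name.

```python
def lms_step(coeffs, samples, desired, two_beta):
    """One LMS iteration.

    coeffs:   current filter coefficients a[i]
    samples:  current delay-line contents x[i]
    desired:  output d(n) of the real system
    two_beta: adaptation step (assumed to be the pre-calculated 2*B)
    Returns (y, e, new_coeffs).
    """
    # Estimated output of the FIR model.
    y = sum(a * x for a, x in zip(coeffs, samples))
    # Error between the real system and the model.
    e = desired - y
    # Correlate the error with each sample and nudge the coefficients.
    new_coeffs = [a + two_beta * e * x for a, x in zip(coeffs, samples)]
    return y, e, new_coeffs
```

Feeding the function repeatedly with samples from a known system drives the coefficients toward that system's response, as long as the step size is small enough for the loop to stay stable.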

LMS Adaptive Filter Code
...                            ; Pre-calculate 2*B*e(n)
.asg AR3, Coefs                ; AR3 pts to coeff table: a[n]
.asg AR4, Data                 ; AR4 pts to data table: x[n]
MOV @B2e,T3                    ; T3 holds error step amount...
|| MOV #(N-2),BRC0             ; ... while loading BRC0
MPYM *Data,T3,AC0              ; AC0 = error * oldest sample: x(n)...
|| MOV #0,AC1                  ; ... while clearing AC1 (running FIR)
MOV                            ; Overwrite x(n) with x(0)
LMS *Coefs,*Data,AC0,AC1       ; Start FIR calc, update oldest coeff...
|| RPTBLOCAL e1                ; ... and start repeat block
MOV HI(AC0),*Coefs+            ; Store updated coefficient...
|| MPYM *Data+,T3,AC0          ; ... while calculating next update term
e1: LMS *Coefs,*Data,AC0,AC1   ; Calc FIR, update coefficient
MOV HI(AC0),*Coefs             ; Store final coefficient...
|| MOV HI(AC1),*Result+        ; ... while storing FIR output
...
Notes, explaining this code:
2Be is the error step amount and must be pre-calculated.
Set up the pointers for Coeffs and Data, and load 2Be into the T register.
Zero out the B accumulator (AC1 on the C55) and load the block-repeat counter with N-2 (not N-1, since we'll do the first one after RPTBLOCAL).
MPYM multiplies the error in T by the first data value and places the result in the AC0 accumulator (bear in mind we're reducing our result: large data values tell us we can reduce more quickly, small ones less quickly).
LMS: do the filter tap in AC1, do the coefficient update in the high part of AC0.
Now for the loop: LMS does the filter tap in B (AC1) and the coefficient update in AH (high part of AC0). One extra coefficient update will be done (no penalty).
Store the final updated coefficient and the final filter result.

Search - MIN, MAX
Goal: find the max (or min) value in an array
[Figure: array values 1628, 24, 24893, 588 read through *ptr into src and compared with dst (the running max); Y: Carry=0]
MAX src,dst
Operands: src/dst can be AC0-3, AR0-7, T0-3
RPT #(N-2)
MAX AC0,AC1
|| MOV *ptr+,AC0
Benchmark: ~N cycles to find the min/max of N elements
How do you determine WHICH value was the min/max?

Memory comparison, modify auxiliary register, modify data stack pointer

Multiply instructions

Multiply and accumulate instructions

Multiply and subtract, negation

Normalization, Register comparison
These normalization instructions are useful when working with floating point numbers.

Round and saturate, signed shift

Square distance

Subtraction
Similar to the ADD instruction, except for form no. 5

Example for 5th form of SUB
Syntax: SUB src, Smem, dst
SUB AC1, *AR3, AC0
The content of AC1 is subtracted from the content addressed by AR3 and the result is stored in AC0.

Bit manipulation operations

Memory Bit Test/Set/Clear/Complement

Register Bit test/Set/Clear/Complement

Status bit Set/Clear
† When these instructions are decoded to modify status bit CAFRZ (15), CAEN (14), or CACLR (13), the CPU pipeline is flushed and the instruction is executed in 5 cycles regardless of the instruction context.

Extended auxiliary register (XAR) operations

Logical operations

Logical operations

Miscellaneous operations

Move operations

Program-control operations
x/y cycles: x cycles = condition true, y cycles = condition false † These instructions execute in 3 cycles if the addressed instruction is in the instruction buffer unit.

x/y cycles: x cycles = condition true, y cycles = condition false.

x/y cycles: x cycles = condition true, y cycles = condition false

Running C54 code on the C55
Assembling/optimizing a C54x algorithm
System issues (interrupts, peripherals, etc.)

Running C54 Code on the C55
C55 assembler accepts all C54x mnemonics (no translation necessary: C55 instructions are a superset of C54)
C54 source will execute correctly on C55 (bit exact)
User must optimize C54 code to use new C55 features
What % of commonly used C54 instructions assemble as ONE instruction on the C55?
[Figure: pie chart of the percentages]
2 areas of focus: Algorithms, System
Let's look at algorithms first...
Notes: The numbers shown in the pie chart are based on the MOST COMMONLY USED C54x instructions, taken from loads of customer code that were used to gather the benchmarks. If you use strictly the 54x instruction set as is (and do NOT look at commonly used instructions), the percentages look worse. This is documented in the Assembly Language Tools Guide in detail. Also, the documentation shows a range of percentages; I simply picked the middle number and ran with it.

C54 Filter Algorithm Example
Original C54 Code:
fir:   STM #184,BRC
       STM #x,AR3
       STM #y,AR4
       RPTB done-1
       STM #a0,AR2
       MPY *AR2+,*AR3+,A
       RPT #14
       MAC *AR2+,*AR3+,A
       MAR *+AR3(#-15)
       STH A,*AR4+
done:  B next_routine
Resulting C55 Disassembly:
fir:   AMOV #184,BRC0
       AMOV #x,XAR3
       AMOV #y,XAR4
       RPTB done
       AMOV #a0,XAR2
       MPYM *AR2+,*AR3+,AC0
       RPT #14
       MACM *AR2+,*AR3+,AC0
       AMAR *+AR3(#-15)
done:  MOVH AC0,*AR4+
       B next_routine
C54: RPTB kernel 20 bytes, 4072 cycles
C55: kernel (w/o B) 17 bytes, 3701 cycles
The code above assembles AS IS: no errors, no warnings
User must modify code to use C55 dual-MAC (pointers, MAC::MAC)
Does "anything go" with C54 source? Nope...
Notes: The key points here are:
1. This is "typical DSP code" which is being assembled using the C55x assembler. The C54x mnemonics assemble FINE with no errors or warnings AS IS. This helps the user understand that it's not a big deal for generic, default code. (Hey, this is the intro, we're not going to tell them about all the nasties in 15 minutes.)
2. You get a small reduction in code size (for this example, which CANNOT be extrapolated to the user's code) and some increase in performance, because many instructions on the C54x are 2 words and therefore 2 cycles. On the C55x, the same operation often takes 3 bytes and 1 cycle.
3. If you OPTIMIZE the code on the right (i.e. use the dual-MAC), users can cut the cycles in half: about 1700 cycles, as demonstrated in the workshop lab time.
4. The assembler cannot "re-arrange" your code and pointers to make a single MAC into a dual-MAC. This will have to be re-written by the user and then re-assembled.
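The pointer arithmetic in this kernel is easier to follow in a high-level model. The Python sketch below reproduces what the listing computes: 185 outputs (BRC = 184), each a 16-tap sum (one MPY plus 15 repeated MACs), with the data pointer rewound by 15 so that it advances by one sample per output. It assumes plain integer arithmetic, no saturation, no fractional mode, and that STH stores bits 31-16 of the accumulator; `block_fir` and its parameters are hypothetical names.

```python
def block_fir(coeffs, samples, n_out=185, taps=16):
    """Model of the block FIR kernel.

    coeffs:  the `taps` filter coefficients (table a0)
    samples: at least n_out + taps - 1 input samples (table x)
    Returns the list of n_out outputs (table y).
    """
    outputs = []
    data_ptr = 0                                   # AR3
    for _ in range(n_out):                         # RPTB with BRC = n_out - 1
        acc = coeffs[0] * samples[data_ptr]        # MPY *AR2+,*AR3+,A
        data_ptr += 1
        for k in range(1, taps):                   # RPT #14 then MAC: 15 times
            acc += coeffs[k] * samples[data_ptr]
            data_ptr += 1
        data_ptr -= taps - 1                       # MAR *+AR3(#-15): net +1
        outputs.append(acc >> 16)                  # STH A,*AR4+ (bits 31-16)
    return outputs
```

Each output costs 16 multiply-accumulates in this form; a dual-MAC rewrite would compute two outputs per pass over the coefficients, which is where the halving of the cycle count comes from.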

C54 Coding Practices - No-No’s
Cannot use hard-coded addresses or program label offsets
Original C54 Code:
hard_addr: B h
           B loop+10
C55 Assembler Output:
ERROR! at line 39: [E9999] Illegal operand or operand combination
C55 uses variable-length instructions. Location of hard address and/or offset will be different.
Pipe Trick
Original C54 Code:
pipe_trick: STM #100h,AR5
            STLM A,AR5
            ADD *AR5,B
Resulting C55 Disassembly:
pipe_trick: AMOV #100h,AR
            MOV AC0,AR5
            ADD *AR5,AC1
C54 ADD uses old value of AR5 (#100h)
Code assembles fine, but will NOT yield same results
C55 PPU stalls ADD until AC0 -> AR5

Other C55 Assembler Examples
No delayed operations on C55
Original C54 Code:
b_delay: BD next
         STM #x,AR5
Resulting C55 Disassembly:
b_delay: AMOV #x,AR5
         B next
Delay slot instruction (STM) moved ahead of standard branch (same with CALLD, RETD, BCD)
1-to-multiple translation
Original C54 Code:
mv: MVDK *AR3,#800h
Resulting C55 .lst File:
57 ****** PORT MVDK *AR4,#800h
EC31 AMAR *(#(2048)),XCDP
00003a 7E00
00003c 1000
e EF81 MOV *AR4,*CDP
"PORT" used to signify 1-to-multiple translation
Some temporary registers may be used (e.g. XCDP)
Next, let's examine some system issues...

System Issues (1)
The user must make the following modifications to the existing C54 code to achieve proper execution on a C55:
1. Re-Write Vector Table
Vector Table - C54:
rsv: BD reset_ISR
     STM #TOS,SP
Vector Table - C55:
rsv: .ivec reset_ISR
     AMOV #TOS,XSP
.ivec specifies vector address and aligns vector packet on an 8-byte boundary.
C55 vector packet allows one instruction (up to 4 bytes) to be executed
2. Add System Stack (XSSP)
Must also initialize the system stack (XSSP) for CALL/RET instructions to operate properly on C55
SP and SSP must reside on the same 64K (word) page

System Issues (2)
3. Re-write any instructions which access peripherals
C55 does not support sub-bank addressing
C55 peripheral registers are located in I/O
C55 may have a different mix of peripherals
4. Re-write linker.CMD file to use BYTE addresses/lengths
C55 uses a unified memory map (program/data share same map)
C55 linker.CMD file uses BYTE addresses and BYTE lengths
5. Possibly re-write C-callable assembly routines
C55 compiler passes parameters differently than the C54 by default
C55 can support C54 parameter passing rules if specified by the user

User Must Optimize C54 Code
The final step in the migration process is to optimize the code to take advantage of the new features of the C55:
Dual MAC
Additional registers
Additional memory address range
Address unit ALU
Instruction-level parallelism
Nested repeat blocks
Etc.

For More Information
This introduction to C54-to-C55 code migration is NOT a complete list of all issues
Additional details can be found in the following locations:
SPRU280b - TMS320C55x Assembly Language Tools User's Guide
Chapter 6 - Running C54x Code on C55x
Chapter 7 - Migrating a C54x System to a C55x System
CCS On-line Help
Code Generation Tools -> Running C54x Code on C55x
