# MODULE 2. Syllabus: fixed and floating point formats, code improvement, constraints, TMS320C64x CPU, simple programming examples using C/assembly



Syllabus: fixed and floating point formats; code improvement; constraints; TMS320C64x CPU; simple programming examples using C/assembly.

Fixed-point numbers are fast and inexpensive to implement, but they are limited in the range of numbers they can represent and are susceptible to overflow. Fixed-point numbers and their data types are characterized by their word size in bits and by whether they are signed or unsigned.

With unsigned integer notation, a stored 16-bit number can take on any integer value from 0 to 65,535. A signed integer uses two's complement, which allows negative numbers; it ranges from -32,768 to 32,767. With unsigned fraction notation, the 65,536 levels are spread uniformly between 0 and 1. The signed fraction format allows negative numbers, equally spaced between -1 and 1.

Carry and overflow: a carry applies to unsigned numbers — when adding or subtracting, a carry out of the most significant bit means the stored result is incorrect. Overflow applies to signed numbers — when adding or subtracting, an overflow into the sign bit means the stored result is incorrect.

Examples:

```
Overflow (signed)            Carry (unsigned)
   01111                        100
 + 00111                      + 111
 -------                      -----
   10110                       1011
   ^ sign bit set              ^ carry out
     (result incorrect)
```

Data types:
1. short: 16 bits, two's complement, range -2^15 to 2^15 - 1.
2. int (signed int): 32 bits, two's complement, range -2^31 to 2^31 - 1.
3. float: 32 bits, IEEE 754 single precision, range from 2^-126 (1.175494 × 10^-38) to 2^128 (3.402823 × 10^38).
4. double: 64 bits, IEEE 754 double precision, range from 2^-1022 (2.225074 × 10^-308) to 2^1024 (1.797693 × 10^308).

Floating-point representation: the advantage over fixed-point representation is that it can support a much wider range of values, although the floating-point format needs slightly more storage. The speed of floating-point operations is measured in FLOPS (floating-point operations per second).

General format of a floating-point number: X = M · b^e, where M is the value of the significand (mantissa), b is the base, and e is the exponent. The mantissa determines the accuracy of the number; the exponent determines the range of numbers that can be represented.

Floating-point numbers can be represented in:
- Single precision: called "float" in the C language family; a binary format that occupies 32 bits; its significand has a precision of 24 bits.
- Double precision: called "double" in the C language family; a binary format that occupies 64 bits; its significand has a precision of 53 bits.

Single precision (SP):
- Bit 31 is the sign bit
- Bits 23 to 30 are the exponent bits
- Bits 0 to 22 are the fraction bits

```
| 31 | 30 ..... 23 | 22 ............. 0 |
| s  |      e      |         f          |
```

Numbers as small as 10^-38 and as large as 10^38 can be represented.

Double precision (DP): since there are 64 bits, more exponent and fraction bits are available, and a pair of registers is used:
- Bits 0 to 31 of the first register hold fraction bits
- Bits 0 to 19 of the second register hold further fraction bits
- Bits 20 to 30 of the second register are the exponent bits
- Bit 31 of the second register is the sign bit

```
| 31 | 30 .... 20 | 19 ....... 0 |   | 31 ............. 0 |
| s  |     e      |      f       |   |         f          |
```

Numbers as small as 10^-308 and as large as 10^+308 can be represented.

Instructions ending in SP or DP are the single- and double-precision versions. Some floating-point instructions have more latency than fixed-point instructions, e.g. MPY requires one delay slot, MPYSP has three, and MPYDP requires nine. A single-precision value can be loaded into a single register, whereas a double-precision value needs a register pair (A1:A0, A3:A2, ..., B1:B0, B3:B2, ...). The C6711 processor has a single-precision reciprocal instruction, RCPSP, for performing division.

Code improvement: code written in assembly (ASM) is processor-specific, whereas C code can readily be ported from one platform to another; however, optimized ASM code runs faster than C and requires less memory. Before optimizing, make sure the code is functional and yields correct results; after optimizing, the code can be so reorganized and resequenced that it is difficult to follow. If a C-coded algorithm is functional and its execution speed is satisfactory, there is no need to optimize further. If the performance of the code is not adequate, use different compiler options to enable software pipelining, reduce redundant loops, and so on.

If the desired performance is still not achieved, you can use loop unrolling to avoid the overhead of branching; this generally improves execution speed but increases code size. You can also use word-wide optimization, loading 32-bit word (int) data rather than 16-bit half-word (short) data and then processing the lower and upper 16-bit halves independently. If performance is still not satisfactory, rewrite the time-critical section of the code in linear assembly, which can be optimized by the assembler optimizer. The profiler can be used to determine the specific functions that need further optimization.

Optimization steps (if the performance and results of your code are satisfactory after any particular step, you are done):
1. Program in C. Build your project without optimization.
2. Use intrinsic functions when appropriate, as well as the various optimization levels.
3. Use the profiler to identify the functions that may need further optimization, then convert these functions to linear ASM.
4. Optimize the code in ASM.

Compiler options: a C-coded program is first passed through a parser that performs preprocessing and generates an intermediate (.if) file, which becomes the input to an optimizer. The optimizer generates an (.opt) file, which becomes the input to a code generator that performs further optimization and generates the ASM file.

C code → Parser → (.if) → Optimizer → (.opt) → Code generator → ASM

The options for optimization levels:
1. -o0 optimizes the use of registers.
2. -o1 performs local optimization in addition to the optimization done by -o0.
3. -o2 performs global optimization in addition to the optimizations done by -o0 and -o1.
4. -o3 performs file-level optimization in addition to the optimizations done by -o0, -o1, and -o2.
-o2 and -o3 attempt to perform software pipelining.

Intrinsic C functions: similar to run-time support library functions, C intrinsic functions are used to increase the efficiency of code:
1. int _mpy() has the equivalent ASM instruction MPY, which multiplies the 16 LSBs of one number by the 16 LSBs of another.
2. int _mpyh() has the equivalent ASM instruction MPYH, which multiplies the 16 MSBs of one number by the 16 MSBs of another.
3. int _mpylh() has the equivalent ASM instruction MPYLH, which multiplies the 16 LSBs of one number by the 16 MSBs of another.
4. int _mpyhl() has the equivalent ASM instruction MPYHL, which multiplies the 16 MSBs of one number by the 16 LSBs of another.
5. void _nassert(int) generates no code; it tells the compiler that the expression declared with the assert function is true.
6. uint _lo(double) and uint _hi(double) obtain the low and high 32 bits of a double word.

PROCEDURE FOR CODE OPTIMIZATION
1. Use instructions in parallel so that multiple functional units can operate within the same cycle.
2. Eliminate NOPs or delay slots, placing code where the NOPs were.
3. Unroll the loop to avoid the overhead of branching.
4. Use word-wide data to access a 32-bit word (int) in lieu of a 16-bit half-word (short).
5. Use software pipelining.

PROGRAMMING EXAMPLES USING CODE OPTIMIZATION TECHNIQUES

Sum of Products with Word-Wide Data Access for Fixed-Point Implementation Using C Code

```c
//twosum.c Sum of products with separate accumulation of even/odd terms
//with word-wide data for fixed-point implementation
int dotp(short a[], short b[])
{
    int suml, sumh, sum, i;
    suml = 0;
    sumh = 0;
    sum = 0;
    for (i = 0; i < 200; i += 2)
    {
        suml += a[i] * b[i];          //sum of products of even terms
        sumh += a[i + 1] * b[i + 1];  //sum of products of odd terms
    }
    sum = suml + sumh;                //final sum of even and odd terms
    return (sum);
}
```

```c
//dotpintrinsic.c Sum of products with C intrinsic functions using C
for (i = 0; i < 100; i++)
{
    suml = suml + _mpy(a[i], b[i]);
    sumh = sumh + _mpyh(a[i], b[i]);
}
return (suml + sumh);
```

Sum of Products with Word-Wide Access for Fixed-Point Implementation Using Linear ASM Code

```
; Sum of products, separate accumulation of even/odd terms,
; with word-wide data for fixed-point implementation using linear ASM
loop:   LDW   *aptr++, ai        ;32-bit word ai
        LDW   *bptr++, bi        ;32-bit word bi
        MPY   ai, bi, prodl      ;lower 16-bit product
        MPYH  ai, bi, prodh      ;higher 16-bit product
        ADD   prodl, suml, suml  ;accum even terms
        ADD   prodh, sumh, sumh  ;accum odd terms
        SUB   count, 1, count    ;decrement count
[count] B     loop               ;branch to loop
```

```
; dotpnp.asm ASM code with non-parallel instructions for fixed-point
        MVK  .S1  200, A1       ;count into A1
        ZERO .L1  A7            ;init A7 for accum
LOOP    LDH  .D1  *A4++, A2     ;A2 = 16-bit data pointed to by A4
        LDH  .D2  *A8++, A3     ;A3 = 16-bit data pointed to by A8
        NOP  4                  ;4 delay slots for LDH
        MPY  .M1  A2, A3, A6    ;product in A6
        NOP                     ;1 delay slot for MPY
        ADD  .L1  A6, A7, A7    ;accum in A7
        SUB  .S1  A1, 1, A1     ;decrement count
 [A1]   B    .S2  LOOP          ;branch to LOOP
        NOP  5                  ;5 delay slots for B
```

Dot Product with Parallel Instructions for Fixed-Point Implementation Using ASM Code

```
; twosumfix.asm ASM code for two sums of products with word-wide data
; for fixed-point implementation
        MVK  .S1   100, A1      ;count/2 into A1
||      ZERO .L1   A7           ;init A7 for accum of even terms
||      ZERO .L2   B7           ;init B7 for accum of odd terms
LOOP    LDW  .D1   *A4++, A2    ;A2 = 32-bit data pointed to by A4
||      LDW  .D2   *B4++, B2    ;B2 = 32-bit data pointed to by B4
        SUB  .S1   A1, 1, A1    ;decrement count
 [A1]   B    .S1   LOOP         ;branch to LOOP (after ADD)
        NOP  2                  ;delay slots for both LDW and B
        MPY  .M1x  A2, B2, A6   ;lower 16-bit product in A6
||      MPYH .M2x  A2, B2, B6   ;upper 16-bit product in B6
        NOP                     ;1 delay slot for MPY/MPYH
        ADD  .L1   A6, A7, A7   ;accum even terms in A7
||      ADD  .L2   B6, B7, B7   ;accum odd terms in B7
                                ;branch occurs here
```

Trip directive for loop count: the linear assembly directive .trip is used to specify the number of times a loop iterates. If the exact number is known and used, redundant loops are not generated, which can improve both code size and execution time.
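As a sketch of where the directive goes (the symbolic operand names follow the linear ASM example above and are assumptions for illustration), .trip is placed at the loop label:

```
loop:   .trip 100                ;loop iterates exactly 100 times
        LDW   *aptr++, ai
        LDW   *bptr++, bi
        MPY   ai, bi, prodl
        MPYH  ai, bi, prodh
        ADD   prodl, suml, suml
        ADD   prodh, sumh, sumh
        SUB   count, 1, count
[count] B     loop
```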

Software pipelining is a scheme that uses the available resources to obtain efficient pipelined code; the aim is to use all eight functional units within one cycle. Optimization levels -o2 and -o3 enable the code generator to generate (or attempt to generate) software-pipelined code. There are three stages:
1. Prolog (warm-up): contains the instructions needed to build up the loop kernel cycle.
2. Loop kernel: within this loop, all instructions execute in parallel; the entire loop is executed in one cycle.
3. Epilog (cool-off): contains the instructions necessary to complete all iterations.

Procedure for hand-coded software pipelining:
1. Draw the dependency graph.
2. Set up a scheduling table.
3. Obtain code from the scheduling table.

Dependency graph (procedure):
1. Draw the nodes and paths.
2. Write the number of cycles to complete each instruction.
3. Assign the functional unit associated with each node.
4. Separate the data paths so that the maximum number of units is utilized.

[Figure: dependency graph of the dot product — (a) initial stage, (b) final stage]

A node has one or more data paths going into and/or out of it. The number next to each node is the number of cycles required to complete the associated instruction. A parent node contains an instruction that writes to a variable; a child node contains an instruction that reads a variable written by the parent. The LDH instructions are considered the parents of the MPY instruction, since the results of the two load instructions are used by the MPY.

[Figure: dependency graph for the two sums of products — side A (LDW .D1 → ai, MPY .M1x → prodl, ADD .L1 → suml, SUB .S1 → count) and side B (LDW .D2 → bi, MPYH .M2x → prodh, ADD .L2 → sumh, B .S2 → loop); the LDW nodes take 5 cycles, MPY/MPYH take 2, and ADD/SUB/B take 1]

Scheduling table:
1. LDW starts in cycle 1.
2. MPY and MPYH must start five cycles after LDW, because of LDW's four delay slots; therefore MPY/MPYH start in cycle 6.
3. ADD must start two cycles after MPY/MPYH, because of their one delay slot; therefore ADD starts in cycle 8.
4. B has five delay slots and starts in cycle 3, so that branching occurs in cycle 9, after the ADD instructions.
5. SUB must start one cycle before the branch instruction, since the loop count is decremented before branching occurs; therefore SUB starts in cycle 2.

Schedule table before software pipelining:

| Cycle | .D1 | .D2 | .M1 | .M2 | .L1 | .L2 | .S1 | .S2 |
|---|---|---|---|---|---|---|---|---|
| 1, 9, 17, ... | LDW | LDW | | | | | | |
| 2, 10, 18, ... | | | | | | | SUB | |
| 3, 11, ... | | | | | | | | B |
| 4, 12, ... | | | | | | | | |
| 5, 13, ... | | | | | | | | |
| 6, 14, ... | | | MPY | MPYH | | | | |
| 7, 15, ... | | | | | | | | |
| 8, 16, ... | | | | | ADD | ADD | | |

Instructions within the prolog stage (cycles 1-7) are repeated up to and including the loop kernel (cycle 8). Instructions in the epilog stage (cycles 9, 10, ...) complete the functionality of the code.

Schedule table after software pipelining:

Loop kernel: within loop cycle 8, multiple iterations of the loop execute in parallel, i.e. different iterations are processed at the same time. For example:
- the ADDs add data for iteration 1
- MPY/MPYH multiply data for iteration 3
- the LDWs load data for iteration 8
- SUB decrements the counter for iteration 7
- B branches for iteration 6
That is, the values being multiplied were loaded into registers five cycles before the cycle in which they are actually multiplied; before the first multiplication occurs, the fifth load has just completed. This software pipeline is eight iterations deep.

If the loop count is 100 (200 numbers):
- Cycle 1: LDW, LDW (also initialization of the count and of accumulators A7 and B7)
- Cycle 2: LDW, LDW, SUB
- Cycles 3-5: LDW, LDW, SUB, B
- Cycles 6-7: LDW, LDW, MPY, MPYH, SUB, B
- Cycles 8-107: LDW, LDW, MPY, MPYH, ADD, ADD, SUB, B
- Cycle 108: the final ADDs complete
The prolog section is cycles 1-7, the loop kernel (cycle 8) repeats through cycle 107, and the epilog section is cycle 108.

Execution cycles (with software pipelining): fixed point = 7 + (N/2) + 1, e.g. for N = 200, 7 + 100 + 1 = 108; floating point = 9 + (N/2) + 15.

| | Fixed point | Floating point |
|---|---|---|
| No optimization | 2 + (16 × 200) = 3202 | 2 + (18 × 200) = 3602 |
| With parallel instructions | 1 + (8 × 200) = 1601 | 1 + (10 × 200) = 2001 |
| Two sums per iteration | 1 + (8 × 100) = 801 | 1 + (10 × 100) + 7 = 1008 |
| With S/W pipelining | 7 + (200/2) + 1 = 108 | 9 + (200/2) + 15 = 124 |

Memory constraints: internal memory is arranged in banks so that loads and stores can occur simultaneously. Since the banks are single-ported, only one access to each bank can be performed per cycle; two memory accesses per cycle are possible if they do not target the same bank. If multiple accesses are made to the same bank, the pipeline stalls.

Cross path constraints: since there is one cross path on each side of the two data paths, at most two instructions per cycle can use a cross path.

Valid code segment (both available cross paths are utilized):

```
    ADD .L1X A1, B1, A0
||  MPY .M2X A2, B2, B3
```

Not valid (one cross path is used by both instructions):

```
    ADD .L1X A1, B1, A0
||  MPY .M1X A2, B2, A3
```

Load/store constraints: the address register used must be on the same side as the .D unit.

Valid code:

```
    LDW .D1 *A1, A2
||  LDW .D2 *B1, B2
```

Invalid code:

```
    LDW .D1 *A1, A2
||  LDW .D2 *A3, B2
```

Loading and storing cannot use the same register file: a load (or store) using one register file in parallel with another load (or store) must use a different register file.

Valid code:

```
    LDW .D1 *A0, B1
||  STW .D2 A1, *B2
```

Invalid code:

```
    LDW .D1 *A0, A1
||  STW .D2 A2, *B2
```

TMS320C64x: the TMS320C64x is a family of 32-bit Very Long Instruction Word (VLIW) fixed-point DSPs from Texas Instruments. At clock rates of up to 1 GHz, C64x DSPs can process information at rates up to 8000 MIPS. C64x DSPs can do more work each cycle thanks to built-in extensions, and they can execute all C62x object code unmodified (but not vice versa).

Applications for the C64x — the TMS320C64x can be used as a CPU in the following devices:
- Wireless local base stations
- Remote access servers (RAS)
- Digital subscriber loop (DSL) systems
- Cable modems
- Multichannel telephony systems
- Pooled modems

Block diagram

[Figure: CPU core with L1 program cache (direct-mapped, 16 KB total) and L1 data cache (2-way set-associative, 16 KB total), L2 memory (1024 KB), enhanced DMA controller (64-channel), and EMIF A/EMIF B interfaces to ZBT RAM, SDRAM, SBSRAM, FIFO, SRAM, and I/O devices]

C64X CPU

Features of TMS320C6413:
- Based on the second-generation high-performance, advanced VelociTI VLIW architecture
- 500-MHz clock rate; 8000 MIPS; eight 32-bit instructions per cycle
- DSP core with eight highly independent functional units:
  - Six ALUs (32-/40-bit), each supporting single 32-bit, dual 16-bit, or quad 8-bit arithmetic per clock cycle
  - Two multipliers supporting four 16 × 16-bit multiplies (32-bit results) per clock cycle or eight 8 × 8-bit multiplies (16-bit results) per clock cycle
- Load-store architecture with non-aligned support
- 64 32-bit general-purpose registers

Features of TMS320C6413 (continued):
- Instruction set features: byte-addressable (8-/16-/32-/64-bit data), 8-bit overflow protection, bit-field extract/set/clear, normalization, saturation, bit-counting, increased orthogonality
- L1/L2 memory architecture:
  - 128-Kbit (16-KByte) L1P program cache (direct mapped)
  - 128-Kbit (16-KByte) L1D data cache (2-way set-associative)
  - 2-Mbit (256-KByte) L2 unified mapped RAM/cache [C6413] with flexible RAM/cache allocation
- Endianness: little endian or big endian

Features of TMS320C6413 (continued):
- 32-bit external memory interface (EMIF): glueless interface to asynchronous memories (SRAM and EPROM) and synchronous memories (SDRAM, SBSRAM, ZBT SRAM, and FIFO); 512-MByte total addressable external memory space
- Enhanced direct memory access (EDMA) controller (64 independent channels)
- Host-port interface (HPI) [32-/16-bit]
- Two multichannel audio serial ports (McASPs), each with six serial data pins
- Two multichannel buffered serial ports
- Three 32-bit general-purpose timers
- Sixteen general-purpose I/O (GPIO) pins
- Flexible PLL clock generator

New enhancements:
- Register file enhancements
- Data path extensions
- Quad 8-bit and dual 16-bit extensions with data flow enhancements
- Additional functional unit hardware
- Increased orthogonality of the instruction set
- Additional instructions that reduce code size and increase register flexibility

Register file enhancements: the C64x register file has double the number of general-purpose registers of the C62x/C67x cores — 32 32-bit registers per data path, A0-A31 for file A and B0-B31 for file B. In all C6000 devices, registers A4-A7 and B4-B7 can be used for circular addressing.

Packed data processing: the C64x register file supports all the C62x data types and extends them with packed 8-bit types and 64-bit fixed-point data types. Instructions operate directly on packed data to streamline data flow and increase instruction set efficiency. Packed data types store either four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair. Besides being able to perform all the C62x instructions, the C64x also contains many 8-bit and 16-bit extensions to the instruction set. For example, the MPYU4 instruction performs four 8 × 8 unsigned multiplies with a single instruction on a .M unit.

Data path extensions: on the C64x, all eight functional units have access to the register file on the opposite side via a cross path. On the C62x/C67x, only six functional units have this access; the .D units do not have a data cross path. The C64x pipelines data cross path accesses, allowing multiple units per side to read the same cross path source simultaneously; on the C62x/C67x, only one functional unit per data path per execute packet could take an operand from the opposite register file.

The C64x supports double-word loads and stores. There are four 32-bit paths for loading data from memory to the register file: for side A, LD1a is the load path for the 32 LSBs and LD1b for the 32 MSBs; for side B, LD2a is the load path for the 32 LSBs and LD2b for the 32 MSBs. There are likewise four 32-bit paths for storing register values to memory from each register file: ST1a is the write path for the 32 LSBs on side A and ST1b for the 32 MSBs; for side B, ST2a is the write path for the 32 LSBs and ST2b for the 32 MSBs.

The C64x can also access words and double words at any byte boundary using non-aligned loads and stores. As a result, word and double-word data do not always need to be aligned to 32-bit or 64-bit boundaries as on the C62x/C67x.

Additional functional unit hardware: the .L units can perform byte shifts and the .M units can perform bidirectional variable shifts, in addition to the .S units' ability to shift. The .L units can now perform quad 8-bit subtracts with absolute value; this absolute-difference instruction greatly aids motion estimation algorithms. Special communication-specific instructions, such as SHFL, DEAL, and GMPY4, have been added to the .M unit to address common operations in error-correcting codes. Bit-count and rotate hardware on the .M unit extends support for bit-level algorithms such as binary morphology, image metric calculations, and encryption algorithms.

Increased orthogonality: the .D unit can now perform 32-bit logical instructions in addition to the .S and .L units, and it directly supports load and store instructions for double-word data values. The C62x/C67x allows up to four reads of a given register in a given clock cycle; the C64x allows any number of reads of a given register per clock cycle. On the C62x/C67x, one long source and one long result per data path could occur every clock cycle; on the C64x, up to two long sources and two long results can be accessed on each data path every clock cycle.

General-purpose register files: the C64x register file contains 32 32-bit registers (A0-A31 for file A and B0-B31 for file B), which can be used for data, pointers, or conditions. Values larger than 32 bits (40-bit long and 64-bit quantities) are stored in register pairs. Packed data types are four 8-bit values or two 16-bit values in a single 32-bit register, or four 16-bit values in a 64-bit register pair.

[Figure: a 40-bit value in a register pair — bits 0-31 in the even register, bits 32-39 in the odd register, with the remaining odd-register bits zero-filled]

Pipeline: fetch, decode, execute. The C64x pipeline has the following features: 11 phases divided into fetch, decode, and execute stages; the fetch stage has four phases for all instructions, and the decode stage has two phases for all instructions; the execute stage requires a varying number of phases, depending on the type of instruction. The stages of the fixed-point pipeline are:

In the C64x, instructions are fetched from instruction memory in groups of eight instructions called fetch packets (FPs). Each FP can be split into from one to eight execute packets (EPs); each EP contains only instructions that can execute in parallel, and each instruction in an EP executes in an independent functional unit. The C64x pipe is most effective when it is kept as full as possible by organizing instructions accordingly.

Pipeline Stages

Execute pipeline stage E1 (execute stage 1):
- Single-cycle instructions are completed
- For all instructions, conditions are evaluated and operands are read
- For loads/stores, address generation is performed and address modifications are written to the register file
- For branch instructions, the branch fetch packet in the PG phase is affected
- For single-cycle instructions, results are written to the register file

Execute pipeline stage E2 (execute stage 2):
- Multiply instructions are completed
- Load instructions send their address to memory
- Store instructions send their address and data to memory
- The SAT bit in the control status register (CSR) is set if a single-cycle instruction saturated its result
- Single 16 × 16 multiply results are written to the register file
- .M-unit non-multiply instruction results are written to the register file

Execute pipeline stage E3 (execute stage 3):
- Store instructions are completed
- Data memory accesses are performed
- The SAT bit in the control status register (CSR) is set for saturating multiply instructions

Execute pipeline stage E4 (execute stage 4):
- Multiply extension instructions are completed
- Load instructions bring their data to the CPU
- Multiply extension instruction (MPY2, MPY4, DOTPx2, DOTPU4, MPYHIx, MPYLIx, and MVD) results are written to the register file

Execute pipeline stage E5 (execute stage 5):
- Load instructions are completed
- Load instruction data is written to the register file

Pipeline summary:

| Stage | Phase | Symbol | During this phase |
|---|---|---|---|
| Program fetch | Program address generate | PG | The address of the fetch packet is determined |
| | Program address send | PS | The address of the fetch packet is sent to memory |
| | Program wait | PW | A program memory access is performed |
| | Program data receive | PR | The fetch packet is at the CPU boundary |
| Program decode | Dispatch | DP | The next execute packet in the fetch packet is determined and sent to the appropriate functional units to be decoded |
| | Decode | DC | Instructions are decoded in functional units |

Pipeline summary (execute stage):

| Stage | Phase | Symbol | During this phase |
|---|---|---|---|
| Execute | Execute 1 | E1 | For all instruction types, the conditions for the instructions are evaluated and operands are read. For load and store instructions, address generation is performed and address modifications are written to a register file. For branch instructions, the branch fetch packet in the PG phase is affected. For single-cycle instructions, results are written to a register file |
| | Execute 2 | E2 | For load instructions, the address is sent to memory. For store instructions, the address and data are sent to memory. Single-cycle instructions that saturate their results set the SAT bit in the control status register (CSR) if saturation occurs |
| | Execute 3 | E3 | Data memory accesses are performed. Any multiply instruction that saturates its results sets the SAT bit in the control status register (CSR) if saturation occurs |
| | Execute 4 | E4 | For load instructions, data is brought to the CPU boundary. The results of multiply extensions are written to a register file |
| | Execute 5 | E5 | For load instructions, data is written into a register |

Delay slots: the number of delay slots is the number of CPU cycles between the current instruction and the point at which its results can be used by another instruction. Single-cycle instructions: 0 delay slots. 16 × 16 single multiplies and .M-unit non-multiply instructions: 1 delay slot.

Store: 0 delay slots.
- If a load occurs before a store (in parallel or not), the old data is loaded from memory before the new data is stored.
- If a load occurs after a store (in parallel or not), the new data is stored before the data is loaded.
C64x multiply extensions: 3 delay slots. Load: 4 delay slots. Branch: 5 delay slots — the branch target is in the PG phase when the branch condition is determined in E1, and there are 5 slots between PG and E1 before the branch target begins executing useful code.

Memory: the C64x has separate spaces for program and data memory and uses a two-level cache memory scheme.

Internal memory: the C64x has a 32-bit byte-addressable memory with the following features: separate data and program address spaces; large on-chip RAM, up to 7 MB; a two-level cache; a single internal program memory port with an instruction-fetch bandwidth of 256 bits; and two 64-bit internal data memory ports.

Memory map (internal and external memory):
- Level 1 program cache is 128 Kbit, direct mapped
- Level 1 data cache is 128 Kbit, 2-way set-associative
- Shared level 2 program/data memory/cache of 4 Mbit, which can be configured as mapped memory, as cache (up to 256 Kbytes), or as a combination of the two

Memory buses:
- Instruction fetch uses a 32-bit address bus and a 256-bit data bus
- Two 64-bit load buses (LD1 and LD2)
- Two 64-bit store buses (ST1 and ST2)

Peripheral set:
- 2 multichannel buffered audio serial ports
- 2 inter-integrated circuit bus modules (I2Cs)
- 2 multichannel buffered serial ports (McBSPs)
- 3 32-bit general-purpose timers
- 1 user-configurable 16-bit or 32-bit host-port interface (HPI16/HPI32)
- 1 16-pin general-purpose input/output port (GP0) with programmable interrupt/event generation modes
- 1 32-bit glueless external memory interface (EMIFA), capable of interfacing to synchronous and asynchronous memories and peripherals

ZBT RAM Zero Bus Turnaround (ZBT) is a synchronous SRAM architecture optimized for networking and telecommunications applications. It can increase the internal bandwidth of a switch fabric when compared to standard SyncBurst SRAM. The ZBT architecture is optimized for switching and other applications with highly random READs and WRITEs. ZBT SRAMs eliminate all idle cycles when turning the data bus around from a WRITE operation to a READ operation

Interfacing C and assembly language: when an assembly function is called from C, the values passed to it are stored in specific registers. The first ten arguments are passed in registers A4, B4, A6, B6, A8, B8, A10, B10, A12, B12; any additional arguments are passed on the stack. The even registers are used when 32 bits of data (or less) are passed per argument; when a 64-bit value (such as a double-precision floating-point number) is passed, it is stored in an adjoining register pair (e.g. A4:A5, B4:B5, A6:A7, etc.). Upon returning from a called function, only one value may be returned; by convention, it is returned in register A4.

Sum of products example

C code:

```c
int DotP(short* m, short* n, int count)
{
    int i, product, sum = 0;
    for (i = 0; i < count; i++)
    {
        product = m[i] * n[i];
        sum += product;
    }
    return (sum);
}
```

TI TMS C64x code:

```
LOOP:
 [A0]   SUB  .L1   A0, 1, A0
||[!A0] ADD  .S1   A6, A5, A5
||      MPY  .M1X  B4, A4, A6
|| [B0] BDEC .S2   LOOP, B0
        LDH  .D1T1 *A3++, A4
        LDH  .D2T2 *B5++, B4
```

Another code example

MIPS:

```
loop:   LW    R1, 0(R11)
        MUL   R2, R1, R10
        SW    R2, 0(R12)
        ADDI  R12, R12, #-4
        ADDI  R11, R11, #-4
        BGTZ  R12, loop
```

TI TMS C64x:

```
        ADDK .S1  #-4, A11
||      LDW  .D1  A1, 0(A11)
||      MVK  .S2  #-4, B1

        ADDK .S1  #-4, A11
||      LDW  .D1  A1, 0(A11)
||      MUL  .M1  A1, A10, A2
||      ADDK .S2  #-12, B12

loop:   ADDK .S1  #-4, A11
||      LDW  .D1  A1, 0(A11)
||      MUL  .M1  A1, A10, A2
||      STW  .D2x A2, 0(B12)
||      ADD  .L2  B12, B1, B12
||      BGTZ .S2  B12, loop

        ADD  .L2  B12, B1, B12
||      MUL  .M1  A1, A10, A2
||      STW  .D2x A2, 0(B12)

        ADD  .L2  B12, B1, B12
||      STW  .D2x A2, 0(B12)
```

Special purpose instructions:

| Instruction | Description | Example application |
|---|---|---|
| SSHVL, SSHVR | Signed variable shift | GSM |
| SUBABS4 | Quad 8-bit absolute of differences | Motion estimation |
| AVGx | Quad 8-bit, dual 16-bit average | Motion compensation |
| MPYHIx, MPYLIx | Extended-precision 16 × 32 multiplies | Audio |
| XPNDx | Bit expansion | Graphics |
| SWAP4 | Byte swap | Endian swap |
| DEAL | Bit de-interleaving | Cable modem |
| SHFL | Bit interleaving | Convolution encoder |
| GMPY4 | Galois field multiply | Reed-Solomon support |
| BITC4 | Bit counter | Machine vision |

```c
//Factorial.c Finds factorial of n. Calls function factfunc.asm
#include <stdio.h>                     //for print statement

short factfunc(short n);               //prototype for the asm function

void main()
{
    short n = 7;                       //set value
    short result;                      //result from asm function
    result = factfunc(n);              //call assembly function factfunc
    printf("factorial = %d", result);  //print result from asm function
}
```

```
;Factfunc.asm Assembly function called from C to find factorial
        .def  _factfunc     ;asm function called from C
_factfunc:
        MV    A4, A1        ;set up loop count in A1
        SUB   A1, 1, A1     ;decrement loop count
LOOP:   MPY   A4, A1, A4    ;accumulate in A4
        NOP                 ;for 1 delay slot with MPY
        SUB   A1, 1, A1     ;decrement for next multiply
 [A1]   B     LOOP          ;branch to LOOP if A1 != 0
        NOP   5             ;five NOPs for delay slots
        B     B3            ;return to calling routine
        NOP   5             ;five NOPs for delay slots
        .end
```
