1
Module 6: Programmable Components in SoC II
Chanho Lee (이찬호), Soongsil University, School of Information, Communication and Electronic Engineering
2
Contents
1. DSP Processor Introduction
  1.1 Introduction
  1.2 Fast Multipliers
  1.3 Multiple Execution Units
  1.4 Efficient Memory Access
  1.5 High Memory Bandwidth Requirement
  1.6 Data Format
  1.7 Efficient Zero Overhead Looping
  1.8 Streamlined I/O
  1.9 Specialized Instruction Sets
2. Piccolo for ARM
  2.1 Overview of Piccolo
  2.2 Organization
  2.3 Input & Output Buffer
  2.4 Register Bank
  2.5 Process
Copyrightⓒ2003
3
Contents
3. v5TE for ARM
  3.1 Overview of v5TE
  3.2 Multiplication Instruction
  3.3 Addition/Subtraction Instruction
4. TeakLite & Teak
  4.1 Overview of TeakLite & Teak
  4.2 Features
  4.3 CEVA-TeakLite Core Block Diagram
  4.4 CEVA-Teak Core Block Diagram
  4.5 CBU
  4.6 DAAU
  4.7 PCU
  4.8 Memory Organization
  4.9 Power Management Modes
  4.10 CEVA-Teak Performance
4
Contents
5. OMAP
  5.1 Overview of OMAP5910
  5.2 OMAP5910 Block Diagram
  5.3 Features
  5.4 DSP Subsystem
  5.5 Component of DSP Subsystem
  5.6 DSP Module Block Diagram
  5.7 TMS320C55x DSP Core
  5.8 Feature of TMS320C55x
  5.9 C55x Block Diagram
  5.10 IU
  5.11 PU
  5.12 AU
  5.13 DU
5
Contents
1. DSP Processor Introduction
  1.1 Introduction
  1.2 Fast Multipliers
  1.3 Multiple Execution Units
  1.4 Efficient Memory Access
  1.5 High Memory Bandwidth Requirement
  1.6 Data Format
  1.7 Efficient Zero Overhead Looping
  1.8 Streamlined I/O
  1.9 Specialized Instruction Sets
2. Piccolo for ARM
3. v5TE for ARM
4. Teak & TeakLite
5. OMAP
6
1.1 Introduction (1/2)
Communication system
- Human interface: analog signal
- Signal processing: digital signal
- A/D and D/A conversion
When you speak, your voice is picked up by an analog sensor in the cell phone's microphone. An analog-to-digital converter chip converts your voice, which is an analog signal, into digital signals, represented by 1s and 0s. The DSP compresses the digital signals and removes any background noise. In the listener's cell phone, a digital-to-analog converter chip changes the digital signals back to an analog voice signal. Your voice exits the phone through the speaker. [1]
7
1.1 Introduction (2/2) [2]
Digital signal processor
- Type of microprocessor optimized for digital signal processing
- Fast and powerful
- Used in communication, medical, military and industrial products
- Advantages: speed, cost and energy efficiency
- Characteristic features
  - Fast multipliers
  - Multiple execution units
  - Efficient memory access
  - Data format
  - Efficient zero-overhead looping
  - Streamlined I/O
  - Specialized instruction sets
8
1.2 Fast Multipliers
The most common operations in signal processing
- y = Σ x·h: multiplication and accumulation
- Convolution, IIR filtering, Fourier transforms, etc.
Need for fast multiplication and addition operations
- Shift, multiplication and addition in a loop, each requiring one or more cycles
- Special hardware for multiplication had to be developed
- In 1982, the Texas Instruments TMS32010 performed a multiplication in a single clock cycle
All modern DSP processors include at least
- One or more dedicated, single-cycle multipliers
- A combined multiply-and-accumulate (MAC) unit
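The multiply-and-accumulate pattern behind y = Σ x·h can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not of any particular DSP; the function name fir_mac and the sample values are made up for the example, and samples before the start of the input are assumed to be zero.

```python
def fir_mac(x, h):
    """Compute y[n] = sum over k of h[k] * x[n - k] for every n.

    Samples before x[0] are treated as zero (an assumption of this sketch).
    """
    y = []
    for n in range(len(x)):
        acc = 0  # the accumulator of the MAC unit
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]  # one multiply-accumulate per tap
        y.append(acc)
    return y


samples = [1, 2, 3, 4]   # hypothetical input samples
coeffs = [1, 1]          # hypothetical two-tap filter
result = fir_mac(samples, coeffs)
```

Each pass through the inner loop is one MAC, which is the operation a dedicated single-cycle multiplier is meant to speed up.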
9
1.3 Multiple Execution Units
- Need to perform computationally intensive tasks
  - Real-time operation
  - Ex) Filtering signals sampled at kHz rates in real time
- Several independent execution units are required
  - Parallel operation
  - Ex) Arithmetic logic unit (ALU) and shifter in parallel with MAC units
  - Pipelining
  - Single instruction, multiple data (SIMD)
10
1.4 Efficient Memory Access
Executing a MAC in a single cycle means
- Fetching the MAC instruction in a single cycle
- Fetching the data sample in a single cycle
- Fetching the filter coefficient in a single cycle
Good performance requires high memory bandwidth
- Approach: use two or more separate memory banks
  - Each has its own bus
  - Each can be read or written during every cycle
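A rough feel for the bandwidth involved comes from counting fetches. The sketch below assumes three memory accesses per MAC (instruction, data sample, coefficient), as listed above; the filter length and sampling rate are hypothetical numbers chosen only for the example.

```python
def accesses_per_second(taps, sample_rate_hz, fetches_per_mac=3):
    """Memory accesses needed per second for a FIR filter.

    Assumes one MAC per tap per output sample and three fetches per MAC
    (instruction, data sample, coefficient).
    """
    return taps * sample_rate_hz * fetches_per_mac


# Hypothetical case: 64-tap filter on 8 kHz speech samples.
required = accesses_per_second(taps=64, sample_rate_hz=8000)
```

With a single memory and a single bus these accesses would have to be serialized, which is why separate banks with their own buses are used.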
11
1.5 High Memory Bandwidth Requirement
- Dedicated hardware for calculating memory addresses: address generation units
- Memory access is very predictable in DSP
  - Ex) FIR filter: coefficients are accessed sequentially
- Register-indirect addressing with post-increment
  - Increments the address pointer where repetitive computations are performed on a series of data
- Circular addressing
  - Allows the processor to access data sequentially and then automatically wrap around to the beginning address
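Post-increment and circular addressing can be modelled together as a pointer that returns the current address and then advances, wrapping at the end of the buffer. The class name CircularPointer, the base address and the buffer length are assumptions for illustration; a real address generation unit does this in hardware, with no extra instructions.

```python
class CircularPointer:
    """Software model of a post-incrementing pointer with circular wrap."""

    def __init__(self, base, length):
        self.base = base      # first address of the circular buffer
        self.length = length  # number of words in the buffer
        self.offset = 0       # current position inside the buffer

    def post_increment(self):
        """Return the current address, then advance and wrap around."""
        address = self.base + self.offset
        self.offset = (self.offset + 1) % self.length
        return address


pointer = CircularPointer(base=0x100, length=4)  # hypothetical buffer
addresses = [pointer.post_increment() for _ in range(6)]
```

After four accesses the pointer is back at the base address, so a delay line of fixed length can be reused without copying data.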
12
1.6 Data Format
- DSP algorithms generally use floating-point formats
  - More complex hardware
  - Better accuracy than fixed-point processors
- Fixed-point processors
  - Cheaper and lower power
  - 16-bit data words: sufficient for many applications
  - 20-, 24- or 32-bit data words for better accuracy
  - Use the shortest data word width with adequate accuracy, considering cost and energy consumption
- Accumulator registers
  - Wider than other registers
  - Provide extra guard bits to avoid overflow
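The effect of guard bits can be shown by accumulating the same products in a 32-bit and in a 40-bit two's complement accumulator. The widths and operand values are chosen for illustration (a 16 x 16-bit product needs up to 32 bits; the 40-bit width matches the accumulators mentioned later in these slides); the helper names are hypothetical.

```python
def wrap_signed(value, bits):
    """Interpret value modulo 2**bits as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def accumulate(products, acc_bits):
    """Add products into an accumulator that wraps at acc_bits bits."""
    acc = 0
    for product in products:
        acc = wrap_signed(acc + product, acc_bits)
    return acc


product = 0x4000 * 0x4000        # 0.5 * 0.5 in a 16-bit fractional format
products = [product] * 16        # true sum is 2**32
narrow = accumulate(products, 32)  # no guard bits: the sum wraps around
wide = accumulate(products, 40)    # 8 guard bits: the sum is preserved
```

Here the 32-bit accumulator silently wraps to a wrong value while the 40-bit one keeps the exact sum, which is the purpose of the extra bits.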
13
1.7 Efficient Zero Overhead Looping
- DSP algorithms have many loops
- Efficient looping is required
- Special loop: zero-overhead looping
  - No instructions spent on maintaining a loop counter
  - No instructions spent on branching back to the top of the loop
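A simple cycle count shows why this matters for short loop bodies. The costs below are hypothetical (one cycle for the body, two cycles of counter update and branch in a software loop); real figures depend on the processor.

```python
def loop_cycles(iterations, body_cycles, overhead_cycles_per_iteration):
    """Total cycles for a loop, given a per-iteration overhead."""
    return iterations * (body_cycles + overhead_cycles_per_iteration)


ITERATIONS = 256  # hypothetical loop length

# Software loop: assumed 2 overhead cycles (decrement counter, branch back).
software_loop = loop_cycles(ITERATIONS, body_cycles=1,
                            overhead_cycles_per_iteration=2)

# Zero-overhead loop: hardware tracks the count and the loop start address.
hardware_loop = loop_cycles(ITERATIONS, body_cycles=1,
                            overhead_cycles_per_iteration=0)
```

Under these assumptions the software loop spends two thirds of its cycles on loop control, so a single-cycle MAC body would run three times slower without hardware looping.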
14
1.8 Streamlined I/O
- Specialized serial or parallel I/O interfaces
- Streamlined I/O handling mechanisms
  - Ex) Low-overhead interrupts, direct memory access (DMA)
15
1.9 Specialized Instruction Sets (1/2)
Two goals in instruction sets
- To make maximum use of hardware and to increase efficiency
  - Programmers can specify parallel operations in a single instruction
- To minimize memory space by keeping instructions short
  - Use mode bits rather than encoding the option in every instruction
  - Restrict operations to specific registers
  - Restrict operation combinations in the instruction
This makes DSP instructions complicated
16
1.9 Specialized Instruction Sets (2/2)
- DSPs are not usually programmed in high-level languages (C, C++, etc.)
- Program optimization is essential
  - Programmers should optimize code at the assembly level
- The easier the instruction set is to use, the more desirable it is for programmers
17
Contents
1. DSP Processor Introduction
2. Piccolo for ARM
  2.1 Overview of Piccolo
  2.2 Organization
  2.3 Input & Output Buffer
  2.4 Register Bank
  2.5 Process
3. v5TE for ARM
4. Teak & TeakLite
5. OMAP
18
2.1 Overview of Piccolo [3]
- Sophisticated 16-bit signal processor
- Designed to assist the ARM7
  - Cannot be used as a stand-alone processor
- Licensable core
- Up to 70 MHz at 3 V
- Supports zero-overhead single- and multi-instruction hardware loops
- Single instruction cycle
- Applications: digital cellular handsets, modems, pagers, multimedia
19
2.2 Organization (1/2)
20
2.2 Organization (2/2)
- Input buffer
- Output buffer
- Register bank
- 16-bit multiplier
- 32-bit barrel shifter
- 32-bit ALU
21
2.3 Input & Output Buffer
- Move many 16-bit values in a single instruction
  - Transfer in pairs
  - Full usage of the 32-bit bus
- Input buffer: reorder buffer
- Output buffer: first-in, first-out (FIFO) buffer
Operands are loaded and results are stored via the ARM coprocessor interface, so suitable addresses must be generated by the ARM core. The input and output buffers allow these transfers to move many 16-bit values in a single instruction, and values are transferred in pairs, making full use of the ARM's 32-bit bus width. The input buffer stores values until they are called upon by the signal processing code, and they may be accessed out of order from the buffer.
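Transferring 16-bit values in pairs amounts to packing two of them into each 32-bit word. The sketch below illustrates only that packing idea; which half of the word holds the first value, and the padding of an odd-length stream with zero, are assumptions of this example rather than details taken from the Piccolo documentation.

```python
def pack_pair(low, high):
    """Pack two 16-bit values into one 32-bit word (low half first)."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)


def unpack_pair(word):
    """Split a 32-bit word back into its (low, high) 16-bit halves."""
    return word & 0xFFFF, (word >> 16) & 0xFFFF


def pack_stream(values):
    """Pack a list of 16-bit values into 32-bit words, two per word."""
    padded = list(values) + [0] * (len(values) % 2)  # pad odd lengths
    return [pack_pair(padded[i], padded[i + 1])
            for i in range(0, len(padded), 2)]


words = pack_stream([1, 2, 3, 4])
```

Four 16-bit operands then cross the 32-bit bus in two transfers instead of four.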
22
2.4 Register Bank
- Holds operands: 16 bit, 32 bit, 48 bit
- 16 general-purpose registers
  - 12 registers: 32 bits
  - 4 registers: 48 bits, for use as accumulators
23
2.5 Process
- The ARM7 and Piccolo are separate processors
  - RISC-based instruction set
  - Execute programs
- When DSP functionality is required
  -> the ARM7 issues an instruction
  -> the Piccolo processor begins execution at a specified address in memory
24
Contents
- DSP Processor Introduction
- Piccolo for ARM
- v5TE for ARM
- TeakLite
- Teak
- OMAP
25
Contents
1. DSP Processor Introduction
2. Piccolo for ARM
3. v5TE for ARM
  3.1 Overview of v5TE
  3.2 Multiplication Instruction
  3.3 Addition/Subtraction Instruction
4. Teak & TeakLite
5. OMAP
26
3.1 Overview of v5TE [3] (1/2)
- In 1999, ARM v5TE architecture: ARM DSP instruction set extensions
- Enhanced 32-bit arithmetic capabilities in a single general-purpose CPU
  - Improved performance and flexibility
- Included in the E series (ARM9E, ARM10E, ...)
- Up to a 70% increase in speed for audio DSP applications
27
3.1 Overview of v5TE (2/2)
- First implemented on the ARM9E-S synthesizable core
- A very different approach to the problem from that used in the design of Piccolo
- Q flag: sticky overflow flag
  - Remains set until explicitly reset by an MSR instruction
  - A series of instructions may be executed and the Q flag inspected only once (using an MRS instruction)
[ ARM v5TE PSR format ]
28
3.2 Multiplication Instruction
- 16-bit data types
  - A 32-bit ARM register may hold two 16-bit values → efficient access to values
- 16x16 and 32x16 multiplication/accumulation
  - SMLAxy, SMLAWy, SMULWy, SMULxy
  - x, y = 0 (lower half) or 1 (upper half); in assembler mnemonics these are written B (bottom) and T (top), e.g. SMULBT
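The half-word selection in SMULxy and SMLAxy can be modelled in Python as follows. This is a behavioural sketch, not ARM code: the function names and register values are invented, results are returned as signed Python integers rather than register bit patterns, and the overflow indication returned by smla stands in for the Q flag.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def half(register, top):
    """Select the upper (top=1) or lower (top=0) signed 16-bit half."""
    raw = (register >> 16) if top else register
    return to_signed(raw, 16)


def smul(rm, rs, x, y):
    """Model of SMULxy: signed 16 x 16 -> 32-bit product."""
    return half(rm, x) * half(rs, y)


def smla(rm, rs, rn, x, y):
    """Model of SMLAxy: product plus 32-bit accumulator rn.

    Returns (result, overflowed). The result wraps on overflow and the
    second element indicates that the Q flag would be set.
    """
    total = half(rm, x) * half(rs, y) + to_signed(rn, 32)
    overflowed = not (-2 ** 31 <= total <= 2 ** 31 - 1)
    return to_signed(total, 32), overflowed


rm = 0x0003FFFE  # upper half = 3, lower half = -2 (hypothetical values)
rs = 0x00050004  # upper half = 5, lower half = 4
bottom_bottom = smul(rm, rs, 0, 0)
top_top = smul(rm, rs, 1, 1)
accumulated, q_set = smla(rm, rs, 100, 1, 0)
```

Because each operand half is selected by the instruction itself, no separate shift or mask instructions are needed to reach packed 16-bit data.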
29
3.3 Addition/Subtraction Instruction
- 32-bit operations using saturating arithmetic
  - On overflow → the nearest representable value is returned and the Q flag is set
- QADD, QSUB, QDADD, QDSUB
  - QDADD, QDSUB: double one of the operands before the addition or subtraction
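Saturation and the sticky Q flag can be sketched as below. This is a behavioural model in Python with invented names (StatusFlags, qadd, qdadd); it assumes that the doubling step in QDADD saturates as well and also sets the flag, which is how the ARM architecture documentation describes it.

```python
INT32_MAX = 2 ** 31 - 1
INT32_MIN = -2 ** 31


class StatusFlags:
    """Holds the sticky Q flag; it is only cleared explicitly."""

    def __init__(self):
        self.q = False


def saturate(value):
    """Clamp to the signed 32-bit range; report whether clamping occurred."""
    if value > INT32_MAX:
        return INT32_MAX, True
    if value < INT32_MIN:
        return INT32_MIN, True
    return value, False


def qadd(flags, a, b):
    """Model of QADD: saturating add, sets Q on saturation."""
    result, overflowed = saturate(a + b)
    flags.q = flags.q or overflowed
    return result


def qdadd(flags, a, b):
    """Model of QDADD: a + saturate(2 * b), with Q set on either saturation."""
    doubled, overflowed = saturate(2 * b)
    flags.q = flags.q or overflowed
    return qadd(flags, a, doubled)


flags = StatusFlags()
first = qadd(flags, INT32_MAX, 1)   # saturates and sets Q
second = qadd(flags, 1, 2)          # no overflow, Q stays set
```

Since Q stays set after the second addition, a whole block of arithmetic can run before the flag is checked once at the end.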
30
Contents
1. DSP Processor Introduction
2. Piccolo for ARM
3. v5TE for ARM
4. Teak & TeakLite
  4.1 Overview of TeakLite & Teak
  4.2 Features
  4.3 CEVA-TeakLite Core Block Diagram
  4.4 CEVA-Teak Core Block Diagram
  4.5 CBU
  4.6 DAAU
  4.7 PCU
  4.8 Memory Organization
  4.9 Power Management Modes
  4.10 CEVA-Teak Performance
5. OMAP
31
4.1 Overview of TeakLite & Teak [4][5]
- Low power, high performance
- DSP Group 16-bit (data and program) core
- Fixed-point
- Fully synthesizable (soft core)
- Process-independent design
- Applications
  - TeakLite: 2G wireless handsets, internet audio players, magnetic & optical drives, IP phones, modems, etc.
  - Teak: cellular handsets, PDAs and smartphones, VoIP, portable audio, digital still cameras, etc.
- Based on OakDSPCore® (TeakLite)
32
4.2 Features
- CBU - Computation and Bit Unit
  - Computation Unit (CU)
  - Bit Manipulation Unit (BMU)
- DAAU - Data Address Arithmetic Unit
- PCU - Program Control Unit
- IDU - Instruction Decode Unit
- OFU - Operand Fetch Unit
33
4.3 CEVA-TeakLite Core Block Diagram
34
4.4 CEVA-Teak Core Block Diagram
35
4.5 CBU (1/3)
Computation Unit (CU) and Bit Manipulation Unit (BMU)
- Single-cycle multiply-accumulate (MAC) instructions
  - Dual MAC instruction (Teak)
- Single-cycle division step
- Single-cycle exponent evaluation
- Maximum/minimum calculation in a single cycle
- Zero-overhead block repeat
- Codebook search
- Built-in Viterbi accelerator
- Dedicated FFT accelerator (Teak)
- Parallel instruction execution in a single cycle (Teak)
36
4.5 CBU (2/3)
[Figures: CBU of Teak; CBU of TeakLite]
CBU - Computation and Bit Unit: Computation Unit (CU) + Bit Manipulation Unit (BMU)
37
4.5 CBU (3/3)
              TeakLite          Teak              Note
Transfer      2 (16 bit)        4 (16 bit)        Parallel
Multiplier    1 (16 x 16 bit)   2 (16 x 16 bit)   Complement
ALU           36 bit            40 bit            3 input
Accumulator   4 (36 bit)        4 (40 bit)        Independent operating
Shifter       Barrel
Bit Field Operation (BFO): bit operations. A bit field is a group of bits.
- AND, OR, XOR, SHIFT, ROTATE, ...
- Set, reset, change and test
These operations are executed directly on registers and data memory content, with no effect on accumulator content.
38
4.6 DAAU (1/2)
Addressing modes
- Direct (TeakLite) / Indirect (Teak)
- Short/Long Direct
- Short/Long Index
- Short/Long Immediate
- Stack Pointer
- Program Memory Indirect (TeakLite)
- Bit-reverse (Teak)
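Bit-reversed addressing is the access order used by radix-2 FFT algorithms, where the index bits are read in reverse. The sketch below shows only the index calculation; the function name and the 8-point size are chosen for illustration and say nothing about how the Teak DAAU implements the mode.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of index."""
    reversed_index = 0
    for _ in range(bits):
        reversed_index = (reversed_index << 1) | (index & 1)
        index >>= 1
    return reversed_index


# Access order for an 8-point transform (3 index bits).
order = [bit_reverse(i, 3) for i in range(8)]
```

With hardware support the pointer steps through this order directly, so no separate reordering pass over the data is needed.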
39
4.6 DAAU (2/2)
- General-purpose address pointer registers
  - (6+3) x 16-bit (TeakLite)
  - (8) x 16-bit (Teak)
- Alternative bank of registers
- Four 16-bit user-defined registers
- Enables both linear and cyclic pointer modification (Teak)
- Enables four 16-bit data memory transactions in parallel (Teak)
40
4.7 PCU
Program Control Unit
- Zero-overhead looping
  - Block repeat instructions
  - Repeat instruction
- Single-cycle interrupt latency
- Interrupt types
  - Three maskable
  - Non-maskable
  - Trap (software interrupt)
  - Breakpoint
  - Vector (Teak)
41
4.8 Memory Organization
- Program memory space
  - TeakLite: up to 64K-word
  - Teak: linear space, up to 256K-word; total space with paging, up to 4M-word
- Data memory space
  - Up to 64K-word, divided into three sections
  - X & Y spaces: zero-wait-state transactions, memory interface
  - Z space: slow device (peripheral) interface
  - The X, Y and Z spaces are flexible and configurable at a 1K-word resolution
42
4.9 Power Management Modes
- Active mode
- Slow mode: reduces clock speed and current consumption linearly
- Stop mode: leakage current only
43
4.10 CEVA-Teak Performance
- Frequency um worst case
- Power dissipation
  - 0.27 mA/MHz in a typical DSP application
  - 0.45 mA/MHz in a typical DSP application with DMA-intensive transfers
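The mA/MHz figures scale linearly with clock frequency, which is also how slow mode saves current. The sketch below applies the two figures above to a hypothetical 100 MHz clock; that clock value and the divide-by-four slow mode are assumptions for the example only.

```python
def supply_current_ma(ma_per_mhz, clock_mhz):
    """Supply current, assuming it scales linearly with clock frequency."""
    return ma_per_mhz * clock_mhz


CLOCK_MHZ = 100.0  # hypothetical operating frequency

typical = supply_current_ma(0.27, CLOCK_MHZ)     # typical DSP application
dma_heavy = supply_current_ma(0.45, CLOCK_MHZ)   # DMA-intensive transfers
slow_mode = supply_current_ma(0.27, CLOCK_MHZ / 4)  # clock divided by 4
```

Dividing the clock by four in slow mode cuts the current by the same factor under this linear model.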
44
Contents
- DSP Processor Introduction
- Piccolo for ARM
- v5TE for ARM
- TeakLite & Teak
- OMAP
  - Overview of OMAP5910
  - OMAP5910 Block Diagram
  - Features
  - DSP Subsystem
  - Component of DSP Subsystem
  - DSP Module Block Diagram
  - TMS320C55x DSP Core
  - Feature of TMS320C55x
  - C55x Block Diagram
  - IU
  - PU
  - AU
  - DU
45
Contents
1. DSP Processor Introduction
2. Piccolo for ARM
3. v5TE for ARM
4. Teak & TeakLite
5. OMAP
  5.1 Overview of OMAP5910
  5.2 OMAP5910 Block Diagram
  5.3 Features
  5.4 DSP Subsystem
  5.5 Component of DSP Subsystem
  5.6 DSP Module Block Diagram
  5.7 TMS320C55x DSP Core
  5.8 Feature of TMS320C55x
  5.9 C55x Block Diagram
  5.10 IU
  5.11 PU
  5.12 AU
  5.13 DU
46
5.1 Overview of OMAP5910 [6][7]
- Highly integrated hardware and software platform designed for next-generation embedded devices
- Unique dual-core architecture
  - TI-enhanced ARM925 processor (TI925T): command and control
  - TMS320C55x DSP core: low power, high performance
- Applications: mobile communications, video and image processing, advanced speech applications, audio processing, graphics and video acceleration, generalized web access, data processing
The OMAP5910 processor features a unique dual-core architecture that combines the command and control capabilities of the TI-enhanced ARM925 processor (TI925T) with the high-performance and low-power capabilities of the TMS320C55x DSP core.
Power conservation features include:
- Software-programmable idle domains that provide configurable low-power modes
- Automatic power management
- Advanced low-power complementary metal-oxide semiconductor (CMOS) process
Architecture
- Increased parallelism minimizes cycles per task
- Alternate computational hardware use provides a low-power option for many tasks
- Memory accesses minimized
- Automatic low-power mechanisms for peripherals and on-chip memory arrays
- Configurable functional (IDLE) domains provide greater power-down flexibility
47
5.2 OMAP5910 Block Diagram
48
5.3 Features (1/4)
- TI925T MPU subsystem
- DSP subsystem
- DSP MMU
- System DMA controller
- External memory interfaces
- Internal SRAM memory
- External memory traffic controller
- Mailboxes
- Endianism conversion
- Elastic buffering
- JTAG port
- Clock management
- Peripherals
49
5.3 Features (2/4)
- TI925T MPU
  - Instruction cache: 16K bytes
  - Data cache: 8K bytes
  - MMU
  - 17-word write buffer (WB): increases system performance
- DSP (TMS320C55x DSP core)
  - Dual-access RAM (DARAM), single-access RAM (SARAM), ROM
  - Instruction cache
  - Hardware accelerators
  - DMA controller
- DSP MMU
  - Address translation
  - Access permission checks
The write buffer (WB) increases system performance and can buffer up to seventeen 32-bit words of data. The MMU attributes B (B_MMU) and C (C_MMU), which are part of the TLB descriptor, and the CP15 control register W bit (W_CP15) control WB behavior. Clearing W_CP15 and C_CP15 upon reset ensures that all accesses are non-bufferable until the MMU is enabled. To use the write buffer, you must enable the MMU. However, you can enable the two functions simultaneously with a single write to the CP15 control register. The write buffer is always disabled when the MMU is off. Clearing bit 3 in the CP15 control register disables the write buffer. Any writes already in the write buffer complete normally. It is not possible to abort buffered writes externally, because the s_abort external signal is ignored and data is simply discarded. Areas of memory that can generate aborts must be marked as unbufferable in the MMU page tables.
50
5.3 Features (3/4)
- System DMA controller
  - Six ports, nine channels
  - Additional dedicated DMA: LCD controller
  - Transfers: 8-, 16- or 32-bit
  - Simultaneous transfers
  - Low-power design (no clocking when idle)
- Two external memory interfaces
  - External memory interface slow (EMIFS)
  - External memory interface fast (EMIFF)
- Clock management
  - One digital phase-locked loop (DPLL)
  - Three clock management units
  - System power management
51
5.3 Features (4/4)
- Peripherals
  - For the MPU
  - For the DSP
  - Shared peripherals
In the overall block diagram, the difference is that private peripherals have no signals going off-chip, whereas public peripherals are connected to the outside.
52
5.4 DSP Subsystem
53
5.5 Component of DSP Subsystem (1/2)
- DSP module
  - DSP core: TMS320C55x (C55x)
  - Hardware accelerators (HWA)
    - DCT/IDCT
    - Motion estimation
    - Half-pixel interpolation
  - Memories
    - DARAM
    - SARAM
    - PDROM
  - External memory interface (EMIF)
  - 6-channel DMA controller
  - MPUI
  - TIPB
54
5.5 Component of DSP Subsystem (2/2)
- DSP peripherals
  - Three general-purpose 32-bit timers
  - One general-purpose UART
  - 16-signal general-purpose input/output (GPIO)
  - Mailbox for inter-processor communication (between MPU and DSP)
  - Watchdog timer
  - Level 2 interrupt handler
55
5.6 DSP Module Block Diagram
P, B, C, D, E, F: internal data and address buses
56
5.7 TMS320C55x DSP Core (1/2)
- Advanced multiple-bus architecture
- Unified program/data memory architecture
- Dual 17 x 17-bit multipliers
- Add/compare/select (CSSU) unit
- Exponent encoder
- Two address generators
- 8M x 16-bit (16M-byte) memory space
- Repeat operations
- Performance: 288 MIPS at 144 MHz, 320 MIPS at 160 MHz, 400 MIPS at 200 MHz, 600 MIPS at 300 MHz
  - ARM9: 220 MIPS at 200 MHz
- Power: 0.05 mW/MIPS (20 mW)
  - ARM9: 0.8 mW/MHz (160 mW)
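The performance and power figures can be cross-checked with a little arithmetic. The sketch below assumes that the 20 mW figure corresponds to the 400 MIPS operating point and the 160 mW figure to the ARM9 at 200 MHz; those pairings are inferred from the numbers, not stated on the slide.

```python
C55X_POINTS = [(288, 144), (320, 160), (400, 200), (600, 300)]  # (MIPS, MHz)
ARM9_POINT = (220, 200)                                          # (MIPS, MHz)


def mips_per_mhz(mips, mhz):
    """Instructions per clock cycle implied by a MIPS/MHz pair."""
    return mips / mhz


c55x_ratios = [mips_per_mhz(mips, mhz) for mips, mhz in C55X_POINTS]
arm9_ratio = mips_per_mhz(*ARM9_POINT)

# Assumed pairings: 0.05 mW/MIPS at 400 MIPS, 0.8 mW/MHz at 200 MHz.
c55x_power_mw = 0.05 * 400
arm9_power_mw = 0.8 * 200
```

Every C55x operating point works out to two instructions per clock, against about 1.1 for the ARM9 figure, and the power products reproduce the 20 mW and 160 mW values quoted in parentheses.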
57
5.7 TMS320C55x DSP Core (2/2)
- Conditional execution
- Seven-stage pipeline
- Instruction buffer unit (IU)
- Program flow unit (PU)
- Address data flow unit (AU)
- Data computation unit (DU)
58
5.8 Feature of TMS320C55x
- 64 x 8-bit instruction buffer queue
- Two 17 x 17-bit MAC units
- One 40-bit ALU: performs high-precision arithmetic and logical operations
- One 40-bit barrel shifter
- One 16-bit ALU: performs simpler arithmetic in parallel with the main ALU
- Four 40-bit accumulators
- Twelve independent buses
  - Three data read buses
  - Two data write buses
  - Five data address buses
  - One program read bus
  - One program address bus
59
5.9 C55x Block Diagram
60
Summary (1/2)
- DSP processors
  - Fast and powerful performance of digital signal processing operations
  - Specialized instruction set: shift, multiplication, addition
- Piccolo
  - Digital signal processing unit for ARM7
  - Licensable core
- v5TE: signal processing instruction set for ARM-E
- Teak & TeakLite
  - Synthesizable embedded DSP core
  - Process-independent soft core
- OMAP (TI): software platform with MCU and DSP core
  - TI925T MPU & TMS320C55x DSP core
61
Summary (2/2)
Comparison (columns: Piccolo, TeakLite, Teak, OMAP (TMS320C55x))
ALU: One 32bit | One 36bit | One 40bit | One 16bit
Barrel shifter
Multiplier: Two 16bit | Two 17bit
Accumulator: - | Four 36bit | Four 40bit
Performance (MHz): 70 | 135 | 144, 160, 200, 300
Power (mA): 0.27/MHz | 0.45/MHz | 0.05/MIPS
62
Reference
[1]
[2]
[3] Steve Furber, ARM System-on-Chip Architecture, 2nd ed., Addison-Wesley
[4] CEVA-Teak Datasheet
[5] CEVA-TeakLite Datasheet
[6] OMAP5910 Dual-Core Processor (Rev. C); OMAP5910 Dual-Core Processor Silicon Errata (Rev. A)
[7] Grant Martin & Henry Chang, Winning the SoC Revolution: Experiences in Real Design, Kluwer Academic Publishers