Platform-based Design

Prof. Jun-Dong Cho, Sungkyunkwan University

Outline
Why Platform-based Design? The need for S/W configurable platforms
Design Space of Reconfigurable Architectures
Reconfigurable Radio and Multimedia Systems
Network-centric Design: Clock and Power
Reliable Design

SoC and Customizable Platform Based-Design
DSP Reconfigurable Hardware (Fine Grain) ASIC 2 Reconfigurable Hardware (Coarse Grain) ASIC 1

Semiconductor Revolutions
Makimoto’s Wave: “Mainstream silicon application is switching every 10 years”
[Figure: Makimoto’s wave, 1957 to 2007, alternating between standard and custom parts: TTL; LSI, MSI; µproc., memory; ASICs, accel’s; FPGAs; coarse grain. Annotations: hardware, software, reconfigurable; instruction streams, data streams; structured VLSI design; 1st design crisis, 2nd design crisis.]
Makimoto’s 1st wave: TTL parts (NAND gate, NOR gate, flip-flop, etc.) are general purpose; chips for pocket calculators, radio, TV, etc. are application-specific.
Makimoto’s 2nd wave: microprocessor, microcontroller, and RAM memory are general purpose; graphics, multimedia, communication chips, etc. are application-specific.
Makimoto’s 3rd wave: FPGAs (gates and flip-flops) are general purpose; question: will the second half wave go application-specific?

Definition of Platforms?
An architecture that is designed for an application domain

Platform classification
Application platforms: multimedia platforms: Nexperia, TI's OMAP; 3G wireless platform: Infineon's M-gold; Bluetooth platform: Parthus; wireless platform: ARM's PrimeXsys
Processor-centric platforms: Improv Systems, ARC, Tensilica, Triscend
Communication-centric platforms: Sonics, Palmchip
They include a processor, a bus, and application-specific blocks such as a modem or an MPEG decoder.

SoC Platform Adaptation

The Platform-Based Design Concept Cadence
Programmable SW IP Hardware IP Pre-Qualified/Verified Foundation-IP* + Reference Design HW-SW Kernel MEM FPGA CPU Scaleable bus, test, power, IO, clock, timing architectures Application Space Processor(s), RTOS(es) and SW architecture Reconfigurable Hardware Region (FPGA, LPGA, …) *IP can be hardware (digital or analogue) or software. IP can be hard, soft or ‘firm’ (HW), source or object (SW) Foundry-Specific HW Qualification SW architecture characterisation

Platform Architecture
How fast will my user interface software run? How much can I fit onto my microcontroller? Which Bus? PI? AMBA? Dedicated Bus for DSP? Which RTOS do I use? Which scheduling policy do I have to choose ? Do I need a dedicated DSP ? Which microcontroller? ARM? HC11? ARC? Can I buy a QCELP decoding core? Do I need a dedicated HW or can I run this on the Microcontroller ?

Example of a commercial SoC
More CPUs? More SRAM/Flash? Add FPGA?

A Legacy SoC Approach CoreConnect (PPC), AMBA (ARM)…

Networks-on-Silicon, Philips

MP-SOC Cluster
• Tightly coupled design has been dominant
– Assumes largely synchronous, instantaneous and free communication
– Widely practiced, and supported in design flows
• BUT
– Delivering clocks is problematic
– Wire delay is dominant
– Routing area can cost more than gates
– Too many constraints, from too many blocks
• Cannot afford lowest common denominator design

Definition of MP-SOC? Usually heterogeneous multiprocessor:
CPUs, DSPs, etc. Hardwired accelerators. Mixed-signal front end.

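Problems of existing MP-SoCs
▷ Under power constraints, a monolithic processor consumes too much power.
▷ Using several identical (homogeneous) processors gives poor resource utilization, so power grows linearly.
▷ The on-chip interconnect has been designed independently, separately from the cores and caches.
▷ The interconnect depends on logic as well as on wires.
▷ Processors are bound by wire and memory latencies.
▷ Peak performance is reached only in specific application domains.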

4G: Multiple standards
Communications. Networking. Multimedia. Security.
Digital RF supporting multi-band/multimode operation

The triangle, Chicken and Egg?
Hardware and software architectures determine capabilities. Applications guide design decisions. Methodologies allow repeatable, predictable design. architectures applications methodologies

Why Multi-Threaded Cores?
Increasing gap: memory & processor speeds (2x / 2 years)
More parallel processing (lower-power, higher-perf./mm2)
Increasing gap: interconnect & gate delays (multi-clock)
[Figure: NoC connecting DSPs, a GPP, H/W processing elements, H/W-multithreaded RISC cores with I$/D$, and SRAM, with In and Out ports]

MPSoC “Bus” Alternatives
• Fixed Bus [Bergamaschi, DAC, 2000]
– Point to point communication
– Signals between cores are transferred over dedicated wires
• FPGA-like Bus [Cherepacha, FPGA Sym,
– Programmable interconnects
– Employs a static network
• Arbitrated Bus [IDT Inc., 2000]
– Time-shared multiple core connectivity
– Uses an arbiter
• Hierarchical Bus [AMBA, ARM Inc]
– Combine multiple buses using bus
– Separate buses for cores and I/O
• NoC Bus [Dally, DAC, 2000]
– Resources communicate with data packets
– Uses a switch fabric

Future mobile platform?
Mudge et al.: Mobile supercomputing
Speech recognition. Cryptography. Augmented reality. Typical applications ( , etc.).
Requires 16x 2 GHz Pentium 4. Peak power must not exceed 75 mW.
Future mobile application platform? Culture and Education? Personal Entertainment Platform?

Road Map to MP-SoC Trends
mask NRE: Over 1M$; design NRE: 10M$ to 75M$ ASICs replaced by programmable ASSP, FPGA’s number of embedded processors DVD/STB/HDTV, mobile phones: 5 to 8 Image proc, networking, basestation: 8 to 100+ eS/W complexity Set-top box, audio: >1 million lines of code eS/W becoming essential part of SoC’s ?’s Law?

Should the SoC designer work hard?
[Figure: SoC design flow with stages Requirements, Compose the system (SoC Composer), Simulate (performance), Synthesis + P&R, and Tape Out; Verify steps, including timing and area, recur throughout.]
Why is verification important in mobile SoCs? Why has our verification become weak?

More SoC topics …
Platform optimization
Power management
BW allocation
Resource sharing
Task distribution
Efficient communications
Low Power
Verification
Training talent (system architects)

Available Mobile and VLIW Processors
The ARM Family The ARM7 Generation The StrongARM The ARM Thumb Option The ARM Piccolo Option The ARM9 and ARM10 The Motorola M-Core The LSI TinyRisc The Hitachi SuperH Family VLIW Processors The Motorola-Lucent Star*Core The Philips TriMedia The HP/Intel IA-64

Nexperia™ DVP Hardware architecture (source: Th. Claasen, Philips, DAC 2000)

Exploitable Parallelism
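[Figure: exploitable task parallelism versus minimum parallel grain size (in instructions) for instruction-level parallelism, MultiFlex thread-level parallelism, and GP O/S thread-level parallelism. Values shown: grain sizes of 1, 100's, and 10 000's of instructions; parallelism ranges of 1~100, 1~8, and 2~6.]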

NEC MP211: Homogeneous MP core
Asymmetric MP with very coarse grain multitasking
3 ARM9’s utilized as predefined function units
No complex overhead: e.g. no cache coherency, dynamic scheduling/load balancing

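Advantages of MP-SoC
Easy hardware implementation: using widely deployed processor cores shortens hardware development time and lowers cost.
Lower power consumption: the distributed work is handled by multiple processors at a lower clock frequency. A lower clock frequency permits a lower supply voltage, which reduces power consumption.
Scalable: performance and cost can be tuned by increasing or decreasing the number of processor cores.
Boosting real-time performance: each application can run on a different processor, which reduces interference between concurrent applications.
Higher system safety: system software and untrusted applications can be separated by running them on different processors.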
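The power argument above can be made concrete with the standard CMOS switching-power relation P = C·V²·f. The sketch below is illustrative only: the capacitance, voltages, and frequencies are hypothetical values chosen for the example, not MP211 measurements.

```python
def dynamic_power(c_eff, vdd, freq_hz):
    """Classic CMOS switching-power estimate: P = C * V^2 * f (watts)."""
    return c_eff * vdd ** 2 * freq_hz

# Hypothetical numbers, chosen only for illustration.
C_EFF = 1e-9  # effective switched capacitance per core, in farads

# One core running fast at a high supply voltage.
single = dynamic_power(C_EFF, vdd=1.2, freq_hz=400e6)
# Two cores sharing the work at half the clock and a lower supply voltage.
dual = 2 * dynamic_power(C_EFF, vdd=0.9, freq_hz=200e6)

print(f"single core: {single * 1e3:.0f} mW")
print(f"two cores  : {dual * 1e3:.0f} mW")
```

Under these assumed numbers the two slower cores deliver the same aggregate clock rate for noticeably less switching power, because the voltage term enters squared.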

AMP task allocation image
The MP211 consists of three ARM926 processor cores, a DSP, graphic accelerators, 512KB SRAM, a DDR SDRAM interface, and other IPs. Each IP is connected to a 32-bit multi-layer AHB. In the MP211, the ARM926 cores and the DSP run at 192 MHz, while the AHB and most graphic IPs run at 96 MHz. The chip specification is given in [Table 4.2].

Bus and Memory Architecture

MP211 block diagram

Power consumption of H.264+AAC
Power consumption was measured on the MP211 running an H.264 video decoder (QVGA, 15 fps) and an MPEG2 AAC decoder (48K stereo, 128 kbps). The DTV program drew an average of 87 mW (excluding I/O and SDRAM) and 124 mW (including I/O and SDRAM), as shown in [Figure 3.5]. The L0 region denotes baseline power consumption, and the L1 region denotes periods in which IPs are running at high power consumption.

Holistic design of multi-core architectures
Naïve Methodology is inefficient Demonstrated inefficiency for cores and proposed alternatives Single-ISA Heterogeneous Multi-core Architectures for Power[MICRO03] Single-ISA Heterogeneous Multi-core Architectures for Performance[ISCA04] Conjoined-core Chip Multiprocessing [MICRO04] What about interconnects? How much can interconnects impact processor architecture? Need to be co-designed with caches and cores?

Heterogeneous MP Core
▷ A single-ISA heterogeneous multicore architecture outperforms approaches that rely on voltage scaling, clock gating, speculation control, and the like.
▷ Compared with a homogeneous CMP (chip multiprocessor), a heterogeneous CMP (or asymmetric CMP) has many advantages. Many applications want small cores as well as large ones. The best choice also depends on system context, such as running on battery versus wall power. Using cores of different complexity is therefore efficient.
▷ A multi-ISA multicore architecture consists of processors with different ISAs and is designed to handle vector/data-level parallelism and instruction-level parallelism at the same time. In a single-ISA heterogeneous CMP, by contrast, every core executes the same ISA, so an application can be mapped to any core. The number, size, and type of cores and the caches must be decided. For an 8-core processor, the interconnect consumes as much power as one core. Dynamic voltage scaling combined with gating of unused cores improves the energy-delay product by 75%.
▷ Taking a dual processor as an example, heterogeneous processors serving both low and high thread-level workloads improve performance by 63% over homogeneous ones. With 5-8 threads, the average improvement is 29%. By Amdahl's law, the speedup of parallel applications is limited by their serial portion.
▷ The serial portion runs quickly on a large core, and the parallel portion runs on small, low-power cores, maximizing performance per watt. [Annavaram, et al]
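The Amdahl's-law point can be sketched numerically. The function below is a simplified model, not the method of the cited work: it measures speedup relative to one small core and assumes a hypothetical `serial_core_speed` factor for the large core that runs the serial portion. The 90% parallel fraction and the 2x factor are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction, n_cores, serial_core_speed=1.0):
    """Speedup over a single small core.

    serial_core_speed is how many times faster the core that executes the
    serial portion is than a small core (1.0 means a homogeneous design).
    """
    serial_time = (1.0 - parallel_fraction) / serial_core_speed
    parallel_time = parallel_fraction / n_cores
    return 1.0 / (serial_time + parallel_time)

# Illustrative values only: 90% parallel code on 8 small cores.
homogeneous = amdahl_speedup(0.9, 8)
# Same parallel section, but the serial part runs on a core assumed 2x faster.
heterogeneous = amdahl_speedup(0.9, 8, serial_core_speed=2.0)

print(f"homogeneous  : {homogeneous:.2f}x")
print(f"heterogeneous: {heterogeneous:.2f}x")
```

The model ignores the area and power cost of the large core, so it shows only why accelerating the serial portion matters, not whether the trade is worthwhile for a given budget.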

Problems of heterogeneous MP-SoC
Processors are bound by wire and memory latencies Peak performance on only a small class of applications. How well they map to a given design Diversification of workloads Increased hardware complexity Poor resource utilization

Alpha cores scaled to 0.10 um.
EV8 is 80 times bigger but provides only two to three times more single-threaded performance

Heterogeneous MP Core
If two or more cores share L2, the way a lot of present CMPs do, a crossbar provides a high bandwidth connection.
HP solution: under power constraints, a monolithic processor not only consumes too much power but also delivers insufficient performance. With several identical processors, power grows linearly while performance grows only sublinearly.
Heterogeneous Chip Multiprocessors, R. Kumar, D. M. Tullsen (UC San Diego), N. P. Jouppi, P. Ranganathan (HP Labs).
S. Ghiasi, T. Keller, and F. Rawson, “Scheduling for Heterogeneous Processors in Server System”, Proc. Computing Frontiers, ACM Press, 2005, pp
The Gentoo Linux kernel treats all cores as having equal performance; modifying it with heterogeneity-aware scheduling cut power consumption by 40% at a performance loss of 1% to 3.5%.
Open questions:
How does the benefit of heterogeneity change under area and power constraints?
For a platform hosting multiple services, what criteria should guide core selection?
How can the sensitivity of heterogeneous CMP performance to the number of core types be evaluated? Are two different types enough?
How does using different core types compare with using different power levels instead?

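Characteristics of heterogeneous platforms
For an 8-core processor, the interconnect consumes as much power as one core.
Dynamic voltage scaling combined with gating of unused cores improves the energy-delay product by 75%.
Taking a dual processor as an example, heterogeneous processors serving both low and high thread-level workloads improve performance by 63% over homogeneous ones. In the multithreaded case, the average improvement is 29%.
By Amdahl’s law, the speedup of parallel applications is limited by their serial portion.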

Performance of heuristics for equal-area heterogeneous architectures with multithreaded cores.

Exploring the potential from heterogeneity

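CT 3400 Multi-core DSP
Runs the H.264 encoder, decoder, audio codecs, and system control
Consists of eight 32-bit DSP cores, six 32-bit general-purpose processor cores, and a 128-pin programmable I/O subsystem
C programmable; supports H.264 and MPEG4 codecs
Cradle Technologies announced that it doubled the number of processing engines in its latest chip to 24, and picoChip Designs announced that its device with 308 processor engines, optimized for wireless applications, has given it a foothold in both the WiMax and 3G cellular base station markets. Airspan Networks uses picoChip's Picoarray silicon in its WiMax base stations because, it explained, the upgrade from the fixed to the mobile (IEEE e) WiMax standard can be done "in software only".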

H.264 codec onto the cradle CT3400 MDSP

CT 3400 Multi-core DSP: DSP Engine
Each DSP engine contains:
A Single Instruction Multiple Data Arithmetic Logic Unit (SIMD ALU)
A Packed Integer Multiplier Accumulator (PIMAC)
A Floating Point Unit (FPU)
Bi-directional FIFO data buffers
DMA channels
A 128 x 32 register file
A 512 x 20 program memory

CT3600 Multiprocessor DSP Family Members
The CT3616 offers the industry's best price/performance encoding solution at $5.50 per channel (MPEG4 SP L3), more than twice as good as its closest competitor.
It is the industry's first single-chip real-time D1 H.264 Main Profile video encoder based on programmable DSPs.
0.13-micron technology, 16 DSPs, and 8 general-purpose processors quadruple overall performance.
$40 to $90
Benefits
Fast time-to-market, completely programmable
Customer adds enhancements, proprietary value
Lower development costs
Lower chip costs vs. combo DSP/FPGA or combo RISC/FPGA designs

CT 3616 Multi-core DSP

Problems of homogeneous MP-SoC
The hardware must be configurable for efficient execution across broad class of application. Each core consists of an array of homogenous processing execution nodes, a banked Instruction Cache, Data Cache, register file and block control logic. Some of the resources (called polymorphous resources) in the TRIPS architecture can be configured to operate differently depending on the mode (instruction, thread or data parallelism).

HiBRID-SoC Architecture
HiBRID-SoC multi-core system-on-chip architecture
Integrates a powerful on-chip communication structure
A well-balanced memory system to account for the growing amount of data (e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))
Dedicated chips for the MPEG-4 Simple Profile, consists of a very general processing demand
Three programmable cores, each adapted towards a specific class of algorithms
Combination of the cores and their software development environment
An extension of a programmable core with dedicated modules (e.g., Trimedia)
HiBRID-SoC multi-core, developed at the University of Hannover

Multi-Core SoC Architecture
Parallelism levels: instruction level: VLIW (very long instruction word); data level: SIMD (single instruction multiple data); task level (simultaneous multithreading)
HiPAR-DSP: 16-datapath SIMD processor core controlled by VLIW, particularly optimized towards high-throughput two-dimensional DSP-style processing (FFT-intensive applications or filtering)
Stream Processor (SP): 32-bit RISC architecture that is more optimized towards control-dominated tasks, such as bitstream processing or global system control
Macroblock Processor (MP): efficient processing of data blocks (heterogeneous data path structure consisting of a scalar and a vector unit); controlled by dual-issue VLIW, offers flexible subword parallelism, and contains instruction set extensions for typical processing computation steps

Figure 1. HiBRID-SoC multi-core architecture
64-bit AMBA AHB system bus connects all cores
SDRAM memory via a 64-bit SDRAM interface
Two versatile 32-bit host interfaces for access (e.g., to a host PC via PCI and to serial flash memory)

Figure 2. HiPAR-DSP architecture
Highly parallel DSP core with a VLIW-controlled SIMD architecture
Memory concept provides an easy data exchange between the data paths, which is required for many filter and image processing algorithms
DMA unit serves all cache misses and performs data prefetch transfers to the matrix memory
At the targeted clock frequency of 145 MHz, the HiPAR-DSP achieves a performance of 2.3 GMACs
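The 2.3 GMACs figure is consistent with the 16 data paths of the HiPAR-DSP running at 145 MHz. The short check below assumes one multiply-accumulate per data path per cycle, which the slides do not state explicitly.

```python
def peak_macs_per_second(datapaths, clock_hz, macs_per_datapath_per_cycle=1):
    """Peak multiply-accumulate rate of a SIMD array (MAC/s)."""
    return datapaths * clock_hz * macs_per_datapath_per_cycle

# 16 data paths and 145 MHz come from the slides; the 1 MAC/cycle/data path
# default is an assumption made for this estimate.
peak = peak_macs_per_second(datapaths=16, clock_hz=145e6)
print(f"{peak / 1e9:.2f} GMAC/s")
```

This is a peak figure; sustained throughput depends on how well the memory system keeps the data paths fed.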

Stream Processor
The SP has been optimized for high-level programmability and efficient processing of control-driven applications
Harvard architecture with a 32-bit data path consisting of 5 pipeline stages and controlled by 32-bit RISC instructions
Supports conditional execution and forwarding interlocks, and provides full interrupt capability
Converts the 64-bit AMBA bus width to the 32-bit internal

Figure 3. Macroblock processor data paths.
Heterogeneous data path structure consisting of a scalar and a vector data path
The scalar data path operates on 32-bit data words in a 32-entry register file and provides control instructions (jump, branch, and loop)
The vector data path is equipped with a 64-entry register file of 64-bit width
Special function units (SFUs) provide instruction set extensions for common video and multimedia core algorithms
MUL/MAC or ALU units incorporate SIMD-style subword parallelism by processing either two 32-bit, four 16-bit, or eight 8-bit data entities in parallel within a 64-bit register operand
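Subword parallelism of this kind can be modeled in software as a packed add in which carries are not allowed to cross lane boundaries. The sketch below is a behavioral illustration only; the helper names `pack`, `unpack`, and `packed_add` are hypothetical, and wraparound (rather than saturating) arithmetic is an assumption.

```python
def pack(values, lane_bits):
    """Pack unsigned lane values into one integer, lane 0 in the low bits."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * lane_bits)
    return word

def unpack(word, lane_bits, word_bits=64):
    """Split a packed word back into its unsigned lanes."""
    mask = (1 << lane_bits) - 1
    return [(word >> s) & mask for s in range(0, word_bits, lane_bits)]

def packed_add(a, b, lane_bits, word_bits=64):
    """Add two packed words lane by lane, discarding each lane's carry-out."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, word_bits, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        result |= (lane_sum & mask) << shift  # keep lanes independent
    return result

# Four 16-bit lanes in one 64-bit operand; the last lane wraps around.
a = pack([1, 2, 3, 0xFFFF], 16)
b = pack([10, 20, 30, 1], 16)
print(unpack(packed_add(a, b, 16), 16))
```

The same functions handle the two 32-bit and eight 8-bit cases by changing `lane_bits`.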

Figure 3. Chip layout of the HiBRID-SoC.
HiBRID-SoC Implementations
HiBRID-SoC is fabricated in a 0.18 µm, 6LM standard-cell technology; it has 14 million transistors, consumes 3.5 W, occupies 82 mm², and operates at 145 MHz
Table 1. MPEG-4 ASP decoder (full TV resolution) performance on MP and SP, Mbits:

Analyzing On-chip Communication in MPSoC Environment
Mirko Loghi et al., Proceedings of Design, Automation and Test ’04
Analysis and trade-off exploration of on-chip communication architectures.
Compares and analyzes two practical configurations: AMBA AHB (ARM) and STBus (ST Microelectronics).
Models the hardware and software of an MPSoC at a high level of accuracy and with sufficient simulation speed.
Provides realistic performance figures by stimulating the communication system with functional traffic.

Multiprocessor simulation platform
Hardware architecture: homogeneous MPSoC platform.
Configurable number of 32-bit ARM processors.
Processor cores: GPL-licensed ARM Instruction Set Simulator (ISS) SWARM, written in C++.
Private memories for each processor.
A shared memory.
A hardware interrupt module.
32-bit interconnection.
All components are wrapped in SystemC.

Multiprocessor simulation platform
Benchmarks run on top of the RTEMS OS.
Synchronization: OS queues are used to exchange matrices between processors.
Benchmark 1: independent matrix multiplication.
Benchmark 2: pipeline of matrix multiplications.
Benchmark 1 (independent matrix multiplication): each processor performs an independent matrix multiplication. No interprocessor communication is required. Operands are stored in the private memory of each processor.

Multiprocessor simulation platform
Benchmark 2 (pipeline of matrix multiplications): the platform receives a continuous flow of input and output.
Every core follows this pattern:
Copies an input matrix from shared memory to private space
Multiplies the input matrix by a matrix already held in private space
Copies the resulting matrix back to shared space
Interrupt and semaphore slaves are queried to keep all processes synchronized.
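The pipeline pattern can be mimicked on a host machine with threads and FIFO queues. This is only a behavioral sketch, not the benchmark code from the cited paper: Python queues stand in for the shared memory plus interrupt/semaphore synchronization, and the names `stage` and `run_pipeline` are hypothetical.

```python
import queue
import threading

def matmul(a, b):
    """Plain-Python product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def stage(private_matrix, inbox, outbox):
    """One core: fetch an input, multiply by its private matrix, pass it on."""
    while True:
        m = inbox.get()
        if m is None:          # sentinel: forward shutdown to the next stage
            outbox.put(None)
            return
        outbox.put(matmul(m, private_matrix))

def run_pipeline(inputs, private_matrices):
    """Stream every input matrix through one stage per private matrix."""
    queues = [queue.Queue() for _ in range(len(private_matrices) + 1)]
    workers = [threading.Thread(target=stage,
                                args=(p, queues[i], queues[i + 1]))
               for i, p in enumerate(private_matrices)]
    for w in workers:
        w.start()
    for m in inputs:
        queues[0].put(m)
    queues[0].put(None)
    results = []
    while True:
        out = queues[-1].get()
        if out is None:
            break
        results.append(out)
    for w in workers:
        w.join()
    return results

print(run_pipeline([[[1, 2], [3, 4]]],
                   [[[2, 0], [0, 2]], [[0, 1], [1, 0]]]))
```

Each stage blocks on its input queue, so a slow stage throttles the whole pipeline, which is the kind of communication-bound behavior the benchmark is meant to expose.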

Multiprocessor simulation platform
Code development and analysis tools:
Development tool: GNU cross-compiler
The simulator allows flexible profiling by function.
Simulator output: statistics about processor and interconnect performance; VCD waveforms of all bus signals; traces of memory accesses performed by every core.

Features of communication architecture
AMBA-AHB architecture:
Traditional shared bus with pipelining.
Distinct data and address/control buses.
Transfers have a data phase and a control phase.
Supports bursts as streams of single transactions.
“Split/retry transfer” and “early burst termination” are used to handle high-latency slaves.
STBus architecture:
Protocol type 3: simple load/store operations, pipelining and split transactions, out-of-order support.
Flexible topology: from shared bus to full crossbar.
Overlapping transfers: a new burst can be requested while previous ones are still completing, without idle cycles.
Fast arbitration takes two cycles, and the minimum latency is three cycles.

Experimental Result Comparison of performance interconnection
Five interconnections : AMBA-AHB Shared-bus STBus Full crossbar STBus Partial crossbar STBus : ST-32 Partial crossbar STBus : ST-54

Experimental Result Performance comparison

Experimental Result Comparison of performance interconnection
Four benchmarks : Matrix multiplication independent : ASM-IND Matrix multiplication pipeline without OS : ASM-PIP Matrix multiplication with OS : OS-IND Matrix multiplication pipeline with OS : OS-PIP

Experimental Result Comparison of performance interconnection

MPSoC Clock and Power Olivier Franza, Intel
Increased uncertainty with process scaling: process, voltage, temperature variations, noise, coupling
Affects design margin: over-design, power & performance loss
Increased power constraints: increasing leakage, power (density, delivery) limitations
More transistors mean: larger clock distribution networks; higher capacitance (more load and parasitics)
With each new technology: gate delay decreases ~25%; wire delay increases ~100%
Cross-chip communication increases
Clock needs multiple cycles to cover die
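Compounding the two per-generation figures shows how quickly wire delay comes to dominate. The sketch below simply applies the slide's approximate rates (-25% gate delay, +100% wire delay) over several generations; treating them as constant across generations is an assumption made for illustration.

```python
def delay_after(generations, gate_change=-0.25, wire_change=1.00):
    """Relative gate and wire delay after a number of technology generations.

    Both delays start at 1.0; the change rates are fractional per generation.
    """
    gate = (1 + gate_change) ** generations
    wire = (1 + wire_change) ** generations
    return gate, wire

for n in range(4):
    gate, wire = delay_after(n)
    print(f"gen {n}: gate x{gate:.3f}, wire x{wire:.0f}, "
          f"wire/gate ratio x{wire / gate:.1f}")
```

After three generations under these rates the wire-to-gate delay ratio has grown by roughly a factor of nineteen, which is why the clock needs multiple cycles to cross the die.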

Interconnect Delays & Density
Hannu Tenhunen & Dr. Li-Rong Zheng, Royal Institute of Technology

Multiple Clocks due to Interconnect limitation

At reduced performance, larger resource size

Noise in Mixed Signal Systems

Multiple clock domains
Low skew and jitter ALWAYS a must
Clock modeling requires more accuracy: within-die variations, inductance, crosstalk, electromigration, self-heat, …
Floor plan modularity: think adding/removing cores seamlessly!
Hierarchical clock partitioning: reduce global clock and possibly relax its requirements; generate “locally”-used clock “locally”
Implement clock domain deskewing techniques
Bound clock problem into simple, reliable, efficient domains

DEC/Compaq Alpha: more complex core to improve performance, more complex clocks (?)
Source: DEC/Compaq – Gronowski & al., JSSC 1998 – Xanthopoulos & al., ISSCC 2001 – Barroso & al., ISCA 2000

Clock and Power Convergence Intel® Itanium® Montecito
Each core split into 3 clock domains on variable power supply
Each domain controlled by a Digital Frequency Divider (DFD) generating low-skew variable-frequency clocks; fed by a central PLL and aligned through phase detectors
Regional Voltage Detector (RVD): supply voltage monitor
Second level clock buffer (SLCB): digitally controlled delay buffer for active deskewing
Regional Active Deskew (RAD): phase comparators monitoring and adjusting delay difference between SLCBs
Clock Vernier Device (CVD): digitally controlled delay buffer
Clock generation and distribution are essential enablers of microprocessor performance

On-Chip Interconnects: Circuits and Signaling, Wayne Burleson
• Using Vdd programmability • High Vdd to devices on critical path • Low Vdd to devices on non-critical paths • VddOff for inactive paths A – Baseline Fabric B – Fabric with Vdd Configurable Interconnect This work builds on a similar idea for FPGAs described in: Fei Li, Yan Lin and Lei He. Vdd Programmability to Reduce FPGA Interconnect Power, IEEE/ACM International Conference on Computer-Aided Design, Nov. 2004

Why Reconfigurable System?
Includes a GPP and reconfigurable h/w
Goals: power reduction and flexibility
Provides Quality of Service according to a dynamic environment
Flexible architecture as algorithms evolve
Fewer platforms to develop and maintain
[Figure: tasks 1 to N mapped onto reconfigurable hardware]

Energy Efficiency of Reconfigurability
Key levels: system architecture; communication protocol; O/S and applications; partitioning of functions between wireless device and services on the network
The way out is energy efficiency: doing more work with the same amount of energy. Traditionally, designers have been focused on low-power techniques for VLSI design. However, the key to energy efficiency in future mobile multimedia devices will be at the higher levels: energy-efficient system architectures, energy-efficient communication protocols, energy-cognisant operating systems and applications, and a well designed partitioning of functions between wireless device and services on the network.
Mobile computers must remain usable in a variety of environments. They will require a large amount of circuits that can be customized for specific applications to stay versatile and competitive. Reconfigurability is thus an important requirement for mobile systems, since the mobiles must be flexible enough to accommodate a variety of multimedia services and communication capabilities and adapt to various operating conditions in an (energy) efficient way.
Reconfigurability also has another more economic motivation: it will be important to have a fast track from sparkling ideas to the final design. If the design process takes too long, the return on investment will be less. It would further be desirable for a wireless terminal to have architectural reconfigurability whereby its capabilities may be modified by downloading new functions from network servers. Such reconfigurability would also help in field upgrading as new communication protocols or standards are deployed, and in implementing bug fixes [3].
One of the key issues in the design of portable multimedia systems is to find a good balance between flexibility and high-processing power on one side, and area and energy-efficiency of the implementation on the other side.

The need for S/W configurable platforms
Doing More by Doing Less: the ability to handle a variety of standards is required (AM, FM, GSM, UMTS, digital broadcasting standards, analog and digital television and other data links).
A fully software reconfigurable multi-channel broadband sampling receiver for standards in the 100 MHz band

Reconfiguration granularity (Sébastien PILLEMENT - ENSSAT/LASTI)
System-level reconfiguration: Lx, C62 (cluster decomposition)
Functional-level reconfiguration: Pleiades, RaPiD, DART (2001)
Operator-level reconfiguration: Chameleon, PipeRench, MorphoSys (2000)
Gate-level reconfiguration: Napa, GARP, FPGA

The grain size of operations in Reconfigurable System Architectures
Fine grained operations: multiply and addition
Medium grained operations: reconfigurable modules
Coarse grained operations: CPU, host
Fine grained operations are those in the modules that perform functions like multiply and addition.
Medium grained operations are the functions of the modules. The functional tasks are allocated to dedicated (reconfigurable) modules (e.g. display, audio, network interface, security, etc.) [2].
Coarse grained operations are those tasks that are not specific for a module and that can be performed by the CPU module, or even on a remote compute server.
This partitioning is mainly a task of the operating system.

Design Space of Reconfigurable Architectures
Lilian Bossuet, LESTER Lab, Université de Bretagne Sud, Lorient, France
[Figure: taxonomy tree of reconfigurable architectures (R-SoC)]
Top-level classes: fine grain (FPGA), multi granularity (heterogeneous), coarse grain (systolic)
Subclasses shown: processor + coprocessor, tile-based architecture, island topology, hierarchical topology, coarse grain coprocessor, fine grain coprocessor, mesh topology, linear topology
Examples shown: Xilinx Virtex, Xilinx Spartan, Atmel AT40K, Lattice ispXPGA, Altera Stratix, Altera Apex, Altera Cyclone, Chameleon, REMARC, Morphosys, Pleiades, Garp, FIPSOC, Triscend E5, Triscend A7, Xilinx Virtex-II Pro, Altera Excalibur, Atmel FPSIC, aSoC, E-FPFA, RAW, CHESS, MATRIX, KressArray, Systolix Pulsedsp, Systolic Ring, RaPiD, PipeRench, DART, FPFA

Digital Signal Processing With FPGAs
Paul Ekas Jean-Charles Bouzigues

Multiplier Options In FPGAs
# | Multiplier type | Resource | Area usage
1 | Logic multipliers | Logic elements (traditional) | 500 LEs per 18x18 multiplier
2 | Hard multipliers | DSP blocks | 4 18x18 multipliers per DSP block
3 | Soft multipliers | RAM | 1 to 2 embedded memory blocks

Logic Elements
Logic element (LE): smallest unit of logic
Grouped into Logic Array Blocks (LABs) of ten LEs (LE1 to LE10), with shared control signals and a local interconnect
Features: four-input look-up table (LUT); configurable register; dynamic add/subtract control; carry-select chain logic
Logic array structure is maintained from APEX architecture, however the MegaLAB architecture no longer exists. The problem with MegaLAB structures was that anytime a signal propagated across one of these MegaLAB boundaries, a significant performance hit was incurred. A more uniform structure without ‘hard’ boundaries maximizes performance, regardless of placement and aspect ratio.

DSP Block: Optimized Hard MAC
[Figure: DSP block datapath with input register unit, optional pipelining, add/subtract/accumulate stages, output MUX, and output register unit.]
Operating modes:
9 Bit x 9 Bit: 8 multiplies; 2 multiplies with accumulate; 2 sums of 2 multiplies (complex multiplies); 2 sums of 4 multiplies
18 Bit x 18 Bit: 4 multiplies; 2 multiplies with accumulate; 1 sum of 2 multiplies (complex multiply); 1 sum of 4 multiplies
36 Bit x 36 Bit: 1 multiply

Soft Multipliers: Lookup Based Multiplication
Use embedded RAM blocks as look-up tables (LUTs) for generating partial products
Coefficient or sum-of-coefficients values stored in RAM blocks
MSB partial product shifted & added to LSB partial product
Example: multiplication of a 5-bit input with a 13-bit coefficient, C = Coefficient[12:0]
All possible 18-bit results stored in a 32*18 look-up table (M512): 5-bit address in, 18-bit data out
Multiplier table (ADDRESS: MULT_RESULT): 00000: 0; 00001: C; 00010: 2*C; 00011: 3*C; …; 11111: 31*C
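The lookup scheme can be modeled in a few lines. The sketch below builds the 32-entry table for one coefficient and multiplies a 10-bit input by doing two lookups, shifting the MSB partial product and adding it to the LSB one. The function names and the example coefficient are hypothetical; the 10-bit input width is an assumption chosen to show the shift-and-add step.

```python
def build_table(coefficient, address_bits=5):
    """Precompute address * coefficient for every possible address."""
    return [a * coefficient for a in range(1 << address_bits)]

def soft_multiply(x, table, address_bits=5):
    """Multiply an unsigned input of 2 * address_bits bits by the coefficient.

    The input is split into two halves; each half addresses the same table,
    and the MSB partial product is shifted before being added to the LSB one.
    """
    mask = (1 << address_bits) - 1
    lsb_partial = table[x & mask]
    msb_partial = table[(x >> address_bits) & mask]
    return (msb_partial << address_bits) + lsb_partial

C = 0x1ABC                      # example 13-bit coefficient (illustrative)
table = build_table(C)
assert max(table) < (1 << 18)   # every entry fits an 18-bit RAM word
print(soft_multiply(1000, table), 1000 * C)
```

No multiplier hardware is involved at run time: the cost is memory for the table plus one adder per extra input slice.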

Altera FPGA Memory Architectures
Today’s applications need more high performance memory
One size does not fit all: wide choice of modes and widths
More bits for larger memory buffering; more data ports for greater memory bandwidth
Memory types: M512 blocks, M4K blocks, M-RAM, external memory devices
Features listed: rate changing; embedded shift register mode; true dual port RAM; mixed clock mode; operation up to 312 MHz or 300 MHz depending on block type; 512K bits (M-RAM)
External memory devices: DDR SDRAM & SRAM, SDR SDRAM, QDR & QDRII SRAM, ZBT SRAM, DDR FCRAM

Soft Multiplier: Sum of Multiplications
16-bit serial shift registers (sample 16-bit, coefficient 16-bit) feed one bit per sample into the table address
Sum of multiplications table (ADDRESS: MULT_RESULT): 0000: 0; 0001: C0; 0010: C1; 0011: C0+C1; …; 1111: C0+C1+C2+C3
[Figure: two M512 32*18 tables, each with a 4-bit address and an 18-bit output, combined by adders to form the output.]
Example: FIR filter. Memory: 2 M512
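The sum-of-multiplications table is the core of what is usually called distributed arithmetic. A minimal behavioral sketch follows: one bit from each sample forms the table address, and the looked-up sums are weighted by bit position and accumulated. It assumes unsigned samples and uses a single 16-entry table; the function names and test values are hypothetical.

```python
def build_sum_table(coeffs):
    """Entry at address a = sum of the coefficients whose address bit is 1."""
    n = len(coeffs)
    return [sum(c for i, c in enumerate(coeffs) if (addr >> i) & 1)
            for addr in range(1 << n)]

def da_dot_product(samples, table, sample_bits=16):
    """Bit-serial dot product of unsigned samples with the table's coefficients."""
    acc = 0
    for bit in range(sample_bits):
        addr = 0
        for i, s in enumerate(samples):
            addr |= ((s >> bit) & 1) << i   # bit `bit` of sample i -> address bit i
        acc += table[addr] << bit           # weight the partial sum by 2**bit
    return acc

coeffs = [3, 5, 7, 11]                      # illustrative filter taps C0..C3
samples = [1000, 65535, 0, 42]              # illustrative 16-bit samples
table = build_sum_table(coeffs)
print(da_dot_product(samples, table),
      sum(c * s for c, s in zip(coeffs, samples)))
```

A 4-tap filter therefore needs one lookup and one shift-add per sample bit, with no multipliers; longer filters split the taps across several tables and add their outputs.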

Example Direct Sequence Spread Spectrum (DSSS) Modem

DSSS Modem Five Independent Data Channels Spread to 3.84 Mcps
Three-stage FIR interpolation-by-32
Root-raised cosine pulse shaping with 22% excess bandwidth
112 dB SFDR MHz quadrature carriers
MSPS transmitter output with 5 MHz bandwidth & over 78-dB out-of-band rejection
Automatic gain control (AGC) compensating for channel attenuation of up to 30 dB
Costas loop carrier recovery
4x oversampling code synchronization
[Figure: data channels DCH0 to DCH4 pass through the DSSS modulator, channel model, and DSSS demodulator]

DSSS Modulator
[Figure: DSSS modulator block diagram. Data channels DCH0 to DCH4, SCH, and PCH are multiplied by channelization codes (Cch,16,0, Cch,16,1, Cch,16,2, Cch,16,8, Cch,16,9, Cch,16,10), summed into real (Re[]) and imaginary (Im[]) branches with gains gi and gq, spread, filtered, and mixed onto the carrier.]
Length-256 Gold code spreader
FIR3 (one per branch): RRC 25-tap FIR filter, interpolation x4, excess BW 22%
FIR1: LPF, 2-channel 87-tap FIR filter, interpolation x2
FIR2: LPF, 2-channel 47-tap FIR filter, interpolation x4
NCO: frequency resolution 0.03 Hz, SFDR 112 dB, sin(wn) and cos(wn) outputs, carrier phase increment input

DSSS Demodulator
[Figure: DSSS demodulator block diagram.]
Two Altera RRC 31-tap FIR filters (excess BW 22%, fixed rate)
AGC
NCO: frequency resolution 0.03 Hz, SFDR 112 dB, free-running phase increment
I-Q derotate and carrier recovery loop
Gold code correlator with 4x oversampling; peak detector (max_index, pn_lock)
Buffer and Hadamard despreader producing data channel outputs 1…5
Pilot output and pilot monitor
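The role of the Hadamard despreader can be illustrated with length-16 orthogonal codes. The sketch below builds a Hadamard matrix by the Sylvester construction, spreads one symbol per channel, sums the channels, and recovers each symbol by correlation. It is a noiseless baseband model with no Gold-code scrambling; the mapping between the rows used here and the slide's Cch,16,k code indices is not asserted, and all function names are hypothetical.

```python
def hadamard(n):
    """Sylvester-construction Hadamard matrix (n a power of two), entries +1/-1."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h

def spread(symbols_by_code, codes):
    """Chip-by-chip sum of symbol * code over all active channels."""
    n = len(codes[0])
    chips = [0] * n
    for idx, sym in symbols_by_code.items():
        for k in range(n):
            chips[k] += sym * codes[idx][k]
    return chips

def despread(chips, code):
    """Correlate received chips with one code and normalize by its length."""
    return sum(c * k for c, k in zip(chips, code)) // len(code)

codes = hadamard(16)
# Four channels transmit one +/-1 symbol each on different codes.
tx = spread({0: 1, 1: -1, 2: 1, 8: -1}, codes)
print([despread(tx, codes[i]) for i in (0, 1, 2, 8, 3)])
```

Because the rows are mutually orthogonal, correlating with an unused code yields zero, which is what lets several data channels share the same chip stream.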

DSSS Modem Resources
Resource Usage Summary
Design Entity: Logic Elements, M512 RAM, M4K, Mega, DSP Block
Modulator: 9943, 1, 8, 12
Demodulator: 12196, 60
Power Usage Estimates (mW)
Total Standby Internal Power: 75
Total Logic Element Internal Power: 283
Total Clocktree Internal Power: 175
Total DSP Internal Power: 23
Other Internal Power: 92
Total Power: 505

FIR Filter Example* – 16X Cost/Performance Improvement
Device Solution: FIR Performance (MHz), Device Cost****, Cost per FIR MHz
TI C 64 200MHz: 3.125, $24.59, $7.87
TI C 32 600MHz: 18.75, $160, $8.53
Altera 1C3-8 8 230MHz: 28.75, $14, $0.49
Altera 1C12-8 1 170MHz: 170, $84
* FIR 128 Tap, 16 bit data, 14 bit coefficients
** DSPLib Optimized Assembly Libraries from Texas Instruments
*** MegaCore Optimized FIR Compiler from Altera
**** Pricing in quantity of 100 at Arrow 6/25/03
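The "Cost per FIR MHz" column appears to be the device cost divided by the FIR performance. The short Python check below reproduces the listed values under that assumption; the row labels are shortened, and the value computed for the last row is not listed in the table.

```python
# (label, FIR performance in MHz, device cost in USD), copied from the table
rows = [
    ("TI, 200 MHz", 3.125, 24.59),
    ("TI, 600 MHz", 18.75, 160.00),
    ("Altera 1C3-8, 230 MHz", 28.75, 14.00),
    ("Altera 1C12-8, 170 MHz", 170.0, 84.00),
]


def cost_per_fir_mhz(perf_mhz, cost_usd):
    """Assumed definition: dollars paid per MHz of FIR throughput."""
    return cost_usd / perf_mhz


costs = {label: cost_per_fir_mhz(perf, cost) for label, perf, cost in rows}

# Ratio between the 200 MHz TI row and the Altera 1C3-8 row
improvement = costs["TI, 200 MHz"] / costs["Altera 1C3-8, 230 MHz"]

for label, value in costs.items():
    print(f"{label}: ${value:.2f} per FIR MHz")
print(f"improvement: about {improvement:.0f}x")
```

Under this reading the ratio between the first and third rows is about 16, which is consistent with the slide title.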

Reconfigurable video processor for SDRAM access optimization (Henriss, Ernst et al.)

Reconfigurable video platform
· SDRAM memory-centered design
· FPGA-based scheduler merges different streams and random accesses, exploiting the SDRAM bank structure
· supports 2 HDTV streams at 1.48 Gbit/s each, plus DSP and filter unit access
· reaches 700 MByte/s in practical application for a 4-byte SDRAM memory word
· extremely cost-efficient design
· used in a professional video product line

Fine-Grained RSOCs: Triscend A7 CSOC
A7 Family
· 32-bit ARM 7 with 8 kB cache
· 3200 logic cells max. (40K gates)
· Up to 3800 FFs
· Up to 300 programmable I/O pins

Coarse-Grained RSOCs Chameleon Structure (2000)
Design a battery-powered personal mobile computing device that has multimedia functionality and can operate in a dynamic environment.
· Do just enough and not too much for a given task (QoS)
· 32-bit ARC control processor
· Up to bit Datapath Units (DPU = a 32-bit ALU + a 32-bit barrel shifter)
· Up to 24 16x24-bit multipliers
· Up to 48 128x32-bit local memory modules
· Up to 160 programmable I/O pins
· Targeted at 3rd-generation wireless base stations, wireless local loop, SW radio, etc.
As systems change and their environments vary, systems that can be reconfigured in many ways are emerging as the means of realizing effective systems. The Chameleon project was carried out with the aim of designing such a reconfigurable mobile communication system. It has two main goals: first, developing an energy-efficient system; second, supporting a QoS level adequate for the application.
· 32-bit ARC processor
· 32-bit interface
· 64-bit high-performance memory controller
· 108 parallel computation units
· Each algorithm is loaded one at a time
Paul J.M. Havinga, Lodewijk T. Smit, Gerard J.M. Smit, Martinus Bos, Paul M. Heysters

Field Programmable Function Array
The FPFA concept has a number of advantages:
· The FPFA has a highly regular organisation. It requires the design and replication of a single processor tile, and hence the design and verification is rather straightforward.
· We use a general-purpose processor core. The verification of the software might be less trivial; therefore, for less demanding applications we use a general-purpose processor core in combination with an FPFA.
· Its scalability stands in contrast to the dedicated chips designed nowadays. In FPFAs, there is no need for a redesign in order to exploit all the benefits of a next-generation CMOS process or the next generation of a standard.
· The FPFA can do media processing tasks such as compression/decompression efficiently. Multimedia applications can, for example, benefit from such energy-efficient compression by saving (energy-wasting) network bandwidth.

Field Programmable Function Array
Processor tiles
An FPFA processor tile (Figure 1) consists of five identical blocks, which share a control unit and a communication unit. An individual block contains an ALU, two memories and four register banks of four 20-bit wide registers. Because of the locality of reference principle, each ALU has two local memories. Each memory has bit entries.
A crossbar switch makes flexible routing between the ALUs, registers and memories possible. Figure 7 shows the crossbar interconnect between five blocks. This interconnect enables an ALU to write back to any register or memory within a tile.
Five blocks per processor tile seems reasonable. With five blocks there are ten memories available. This is convenient for the FFT algorithm, which has six inputs and four outputs. Also, we now have the ability to use 5×16=80-bit wide numbers, which enable us to use floating-point numbers (although some additional hardware is required).
Some algorithms, like the FIR filter, can benefit substantially from additional ALUs. With five ALUs, a five-tap FIR filter can be implemented efficiently. The fifth ALU can also be used for complex address calculations and other control purposes.
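To see why five ALUs suit a five-tap FIR filter, note that each output sample needs exactly five multiplications, one per tap, followed by a summation. The Python sketch below is a plain software model of that computation, not the FPFA mapping itself; the function name and the integer data are assumptions made for illustration.

```python
def fir5(samples, taps):
    """Five-tap FIR filter: y[n] = sum over k of taps[k] * x[n - k], k = 0..4."""
    if len(taps) != 5:
        raise ValueError("this sketch models exactly five taps")
    delay = [0] * 5                      # delay line holding x[n] .. x[n-4]
    out = []
    for x in samples:
        delay = [x] + delay[:-1]         # shift the newest sample in
        # One multiplication per tap: conceptually one per ALU in the tile.
        products = [t * d for t, d in zip(taps, delay)]
        out.append(sum(products))
    return out


# Feeding an impulse returns the tap values themselves.
impulse_response = fir5([1, 0, 0, 0, 0, 0], [1, 2, 3, 4, 5])
```

The five products in each iteration are independent of one another, which is what makes them candidates for parallel execution on five ALUs.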

DSP System Architecture Options
Figure: DSP system architecture options compared by DSP performance (MMACs/sec): Dedicated Hardware Architecture, Processor + Co-Processor, DSP Processor Array, Stand-Alone DSP Processor.

Optional Coprocessor Mappings
· Processor on FPGA: Nios, ARM (AHB)
· Processor external to FPGA (with memory): TI c6x (EMIF), Mot PPC (MPX), Mot Starcore (MPX, AHB), Intel 2850 (PCI Express), ARM (AHB), …

Mapping of DSP Algorithms on the FPFA
Fast Fourier Transform
The Fourier transform enables the conversion of signals from the time domain to the frequency domain (and vice versa). For digital signal processing, we are particularly interested in the Discrete Fourier Transform (DFT). The Fast Fourier Transform (FFT) can be used to calculate a DFT efficiently. The FFT recursively divides a DFT into smaller DFTs. Eventually only basic DFTs remain. These DFTs have a number of inputs that is equal to the radix of the FFT. This is illustrated in Figure 4 for a radix 2 FFT with N=8 input signals.
The resulting basic DFTs can be calculated by a structure called a butterfly. The butterfly is the basic element of an FFT. Figure 5 depicts the radix 2 butterfly; a and b are complex inputs and A and B are complex outputs. W is a complex constant called the twiddle factor. The radix 2 butterfly consists of a complex multiplication, a complex addition and a complex subtraction. The FFT butterfly depicted in Figure 5 can be written as Equation (2):
$A = a + W \times b \equiv (a_{re} + j\,a_{im}) + \big((W_{re} b_{re} - W_{im} b_{im}) + j\,(W_{re} b_{im} + W_{im} b_{re})\big)$
$B = a - W \times b \equiv (a_{re} + j\,a_{im}) - \big((W_{re} b_{re} - W_{im} b_{im}) + j\,(W_{re} b_{im} + W_{im} b_{re})\big)$ (2)
A hardware algorithm for the radix 2 FFT butterfly has six inputs ($a_{re}, a_{im}, b_{re}, b_{im}, W_{re}, W_{im}$) and four outputs ($A_{re}, A_{im}, B_{re}, B_{im}$). Each input is used two times. Three subtraction, four multiplication and three addition operations are used.
Figure 4: Recursion of a radix 2 FFT with 8 inputs
Figure 5: The radix 2 FFT butterfly
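The butterfly can be written out with real arithmetic only, which makes the operation count visible: four multiplications, three additions and three subtractions. The Python function below is an illustrative sketch; the function and variable names are assumptions, and integer inputs are used only to keep the example exact.

```python
def butterfly(a_re, a_im, b_re, b_im, w_re, w_im):
    """Radix 2 FFT butterfly: returns (A_re, A_im, B_re, B_im) for A = a + W*b, B = a - W*b."""
    # Complex product W*b: four multiplications, one subtraction, one addition.
    p_re = w_re * b_re - w_im * b_im
    p_im = w_re * b_im + w_im * b_re
    # A = a + W*b: two additions.
    A_re = a_re + p_re
    A_im = a_im + p_im
    # B = a - W*b: two subtractions.
    B_re = a_re - p_re
    B_im = a_im - p_im
    return A_re, A_im, B_re, B_im


# a = 1 + 2j, b = 3 - 1j, W = -1j
outputs = butterfly(1, 2, 3, -1, 0, -1)

# Cross-check against Python's built-in complex arithmetic.
a, b, W = complex(1, 2), complex(3, -1), complex(0, -1)
reference = (a + W * b, a - W * b)
```

The six arguments and four return values correspond to the six inputs and four outputs of the hardware algorithm described above.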

Mapping of DSP Algorithms on the FPFA
Five-tap finite-impulse response filter
The introduced FPFA architecture is aimed at fine-grained operations in hand-held multimedia computers. The architecture has a low design complexity, is scalable and can execute various algorithms energy-efficiently, while maintaining a satisfactory level of performance. Several non-trivial algorithms have been mapped successfully onto the FPFA processor tile. Examples from the digital signal-processing domain include linear interpolation, the FIR filter and the FFT.
In contrast to the FPFA, which is aimed at fine-grained 16-bit wide operations, the FPGA is aimed at bit-level logic functions. As a consequence, operations like multiply-add are relatively expensive on an FPGA (i.e. they take a large area and thus consume a large amount of energy).
The ALU of an FPFA processor tile has four input operands, whereas most standard ALUs have two. The extra inputs increase the functionality of the ALU and enable it to compute powerful functions like linear interpolation efficiently.
The FPFA design still has many open issues, which include the control of a processor tile and the interconnect between processor tiles.

MorphoSys (1999)
· Frame buffer & DMA controller: for high parallelism of the RC array; the frame buffer has two sets, which enables overlap of data transfers with RC array execution
· Context Memory: stores the configuration program for the RC array
· TinyRISC processor: 4-stage pipeline (Fetch, Decode, ALU/Mem, Writeback); 32-bit ALU, register file, data cache
· Added instructions: activate the DMA controller to transfer data; provide control signals to the frame buffer and the RC array for executing applications

Reconfigurable cell

RC Array
· Array of reconfigurable cells: 64 cells in a 2-D matrix
· SIMD model: cells in the same row (or column) share a configuration
· Each RC operates on different data
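The row-wise sharing of configuration can be pictured with a small software model: every cell in a row applies the same operation, but to its own data element. The sketch below is purely illustrative; the operation names, the `run_rc_array` function and the 8x8 test data are assumptions and do not reflect the actual MorphoSys context format.

```python
# Hypothetical per-row operations standing in for configuration contexts.
OPS = {
    "add1": lambda x: x + 1,
    "double": lambda x: 2 * x,
    "negate": lambda x: -x,
}


def run_rc_array(data, row_contexts):
    """Apply one shared operation per row; each cell works on its own element."""
    if len(data) != len(row_contexts):
        raise ValueError("one context is needed per row")
    return [[OPS[ctx](x) for x in row]
            for ctx, row in zip(row_contexts, data)]


# 64 cells in an 8 x 8 matrix, each holding a different data value.
data = [[r * 8 + c for c in range(8)] for r in range(8)]
contexts = ["add1", "double"] * 4          # one context per row
result = run_rc_array(data, contexts)
```

Sharing by column would be the same model applied to the transposed matrix.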

TinyRISC (Cont’d)

Implementation & Performance
· 0.35 micron technology, 4 metal layers
· Operation at 100 MHz
· 170 mm2
· Motion estimation: block size 16x16 pixels, image size 352x288 pixels

Lx from STMicroelectronics

DART, Raphael David, IRISA/ENSSAT, with STMicroelectronics and UBO univ.
· Multigrain reconfigurable = DPR + FPGA
· Dynamic reconfiguration
· Low power consumption
· Hierarchical distribution of resources
· SCMD (Single Configuration Multiple Data)
· DART cluster: 11 GOPS/cluster, 1.6 GMACS/cluster
· 0.18u CMOS
Figure labels: GOPS 16, 11 GOPS

Cluster architecture (block diagram): DPR1 to DPR6 and an FPGA connected by a segmented network, together with control, data memory, configuration memory and a DMA controller.

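DPR architecture (block diagram): global bus, loop management, address generators AG1 to AG4 with data memories (Data mem1, …), a multibus network, registers reg1 and reg2, and functional units MUL1, ALU1, MUL2, ALU2.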
