Presentation is loading. Please wait.

Presentation is loading. Please wait.

Scalable Software Hardware Architecture Platform for for Embedded Systems SHAPES at DATE 2007 Pier Stanislao PAOLUCCI chief technical officer – ATMEL Roma.

Similar presentations


Presentation on theme: "Scalable Software Hardware Architecture Platform for for Embedded Systems SHAPES at DATE 2007 Pier Stanislao PAOLUCCI chief technical officer – ATMEL Roma."— Presentation transcript:

1 Scalable Software Hardware Architecture Platform for for Embedded Systems SHAPES at DATE 2007 Pier Stanislao PAOLUCCI chief technical officer – ATMEL Roma & (part-time) permanent staff researcher – INFN Roma for the SHAPES Consortium

2 January, 2007 2 Introduction to SHAPES2 Project Motivation and Final Objective SHAPES Acronym: Scalable Software Hardware Architecture Platform for Embedded Systems Objective: Develop a prototype of Tiled Scalable HW & SW architecture for embedded applications characterized by inherent parallelism Experiment: “Small” Tiles (<10 MGate) connected by “short wires” weaving a packet switching on-chip and off-chip network  The HW architecture should scale on next deep-submicron technologies Challenges: how to program a tiled architecture Benchmarks  multi-loudspeaker multi-source wave field synthesis,  Multi-microphone voice extraction from noise on multi-microphone  Ultrasound scanners  Physical modelling of quantum chromo dynamics

3 January, 2007 3 Introduction to SHAPES3 HW HW Objectives  maintain profitable average selling prices  control NRE by IP reuse HW Solution  appropriate granularity: “Small” Tiles (<10 MGate) connected by “short (first neighbours) wires”  Inside the typical elementary Tile: Fully C programmable VLIW DSP for computing + RISC for control + Distributed Network Processor (a kind of generalized inter-tile DMA controller) for inter-tile communication   multi-tile Silicon area >40mm 2 <90mm 2  management of logic & place & route complexity through IP reuse  multi-level network Intra-tile: multi-layer bus matrix Inter-tile: NoC (intra-chip) + 3DT (inter-chip)  distributed routing fabric connects on-chip and off-chip tiles weaving a packet switching network

4 January, 2007 4 Introduction to SHAPES4 SW  Communication centric, real-time aware programming environment Application description: model based with explicit annotation of real-time constraints Provide automated optimized binding of processes to computing resources and binding of inter-process communication on communication resources + scheduling of processes and their communication Provide automated generation of hardware dependent software support Retargetable compilation managing intra-tile and inter- tile parallelism, bandwidth and latencies Fast simulation

5 January, 2007 5 Introduction to SHAPES5 Consortium Composition and Roles of the Partners System SW ETH Zurich - Distributed Operation Layer: manages application parallelism TIMA Lab and THALES - Hardware dependent Software Layer and RTOS TARGET Compiler Tech. - Retargetable Compilers RWTH Aachen Univ. – Fast Simulation of Heterogeneous Multi Proc. Systems System HW ATMEL Roma - Tile: Evolution of (Diopsis®: mAgicV VLIW DSP TM + RISC) + INFN DNP TM INFN Roma - DNP TM Distributed Network Processor + 3D Toroidal Eng.: Evolution of APE Massive Parallel Processors STMicrolectronics + Univ. of Cagliari and Pisa – Network on Chip: Evolution of Spidergon TM Packet Switching Network on Chip Parallel Application benchmarking Fraunhofer IDMT – multi-loudspeaker Audio Wave Field Synthesis ESAOTE, MedCom, Fraunhofer IGD - Ultrasound scanner INFN - Physical Modelling ATMEL – multi-microphone arrays for voice-extraction

6 January, 2007 6 Introduction to SHAPES6 Deep Sub-micron Architectures… ~ 160 MGate available on a 100 mm2 chip (45nm CMOS, 2008) Increasing GATES/CHIP  Design Complexity Management:  embedded processors use a few million gates only, IP reuse possible and needed; WIRING threatens Moore’s law:  Wiring delay increases on new CMOS silicon generations  The full chip cannot be reached in a single clock cycle  Classic monolithic processor architectures do not scale  Locally Synchronous, Globally Asynchronous needed  Communication Centric SW and HW Architecture needed … PROPOSED SOLUTION: … TILED ARCHITECTURE…BY SIMPLE GEOMETRIC DEMONSTRATION… IF CONSTANT LOGIC COMPLEXITY INSIDE EACH TILE… THEN (LENGTH OF INTRA-TILE WIRES SCALES DOWN AS THE TILE ITSELF… AND SHORT ~ FIRST NEIGHBOURS ON- CHIP AND OFF-CHIP INTER-TILE WIRES) QUEST OF BEST TILE, ON-CHIP AND OFF-CHIP INTERCONNECT. BUT HOW TO PROGRAM? EXPLICIT PARALLEL PROGRAMMING PARADIGM, and CULTURE NEEDED POWER DISSIPATION density approaching prohibitive values if higher clock speed used; much better Oper/Watt at moderate clock + parallelism (the human brain parallel architecture performs an excellent job at 50 HZ!... room for improvement)

7 January, 2007 7 Introduction to SHAPES7 Distributed Network Processor DNP: a generalized DMA controller for inter-tile or intra- tile packet routing DNP BUS Master (to read from intra-tile memories) BUS Slave (to receive commands from RISC & DSP) 3DT X+ (forward/receive inter-tile OFF-CHIP packets) 3DT X- 3DT Y+ 3DT Y- 3DT Z+ 3DT Z- NoC (to forward/receive inter-tile ON-CHIP packets) BUS Master (simultaneous intra-tile memory write) Collective communication

8 January, 2007 8 Introduction to SHAPES8 Different Types of Tiles DSP DNP Multi-Layer BUS NoC RISC POT 3DT DXM DNP Multi-Layer BUS NoC RISC POT 3DT DXM DNP Multi-Layer BUS NoC DSP POT 3DT DXM RDT: RISC + DSP Elementary Tile RET: RISC Elementary Tile DET: DSP Elementary Tile RDT RET DET DXM Mem Bus POT Pads DXM Mem Bus POT Pads

9 January, 2007 9 Introduction to SHAPES9 HW GLOSSARY HW GLOSSARY AT THE CHIP LEVEL MTC: Multiple Tile Chip (composed of multiple Tiles) NOC: Network On Chip (connecting Tiles) 3DT: 3 Dim Toroidal Connection (outside the chip ) FUNDAMENTAL TYPE OF TILE RDT includes:  RISC: (includes on chip memories RDM and RPM) +  DSP(includes on chip memories DDM and DPM)  DNP + DXM (off-chip mem) + POT (e.g. DAC/ADC conv) POSSIBLE TILE VARIANTS (subset of RDT) RET := RDT minus DSP DET:= RDT minus RISC DDT:= DET minus DXM INSIDE THE TILE Multilayer Bus Matrix sustains multiple simultaneous transfers RISC max one per tile DSP one or more per tile DNP: Distributed Network Processor (always one per tile) DDM: on-chip Distributed Data Mem (inside the DSP) DPM: on-chip Distributed Progr Mem (inside the DSP) DXM: Distributed eXternal Mem Interface (max one per tile, outside the RISC and DSP) POT: Peripherals On Tile RDM: Risc (tightly coupled) Data Memory RPM: Risc (tightly coupled) Program Memory RCM: Risc Cache Memory

10 January, 2007 10 WP 1.6 - RISC+ VLIW DSP + DNP Tile10 mAgicV IP Architecture (Fully C programmable Gigaflops VLIW DSP)

11 January, 2007 11 WP 1.6 - RISC+ VLIW DSP + DNP Tile11 Tile Complexity estimated through Synthesis & Place & Route trials mAgicV DSP:  915 Kgates + 1 Mbit Prog Mem + 640 Kbit Data Mem ARM926 & peripherals  <2 equivalent Mgate (including 640 Kbit mem) Tile Complexity   4230 equivalent Kgate + DNP gate count including on chip memories

12 January, 2007 12 WP 1.6 - RISC+ VLIW DSP + DNP Tile12 Silicon Floorplan Trial of RISC + mAgicV VLIW DSP Tile DSP Reg File DSP Data Mem (DDM) DSP Prog Mem (DPM) DSP Logic ARM926 ARM RDM Peripherals AMBA Multilayer

13 January, 2007 13 Introduction to SHAPES13 Spidergon NoC topology It’s a family of regular/symmetric topologies We look for a complexity/performance trade-off Low degree (router cost) Low number of links (wire cost) Symmetry (homogeneous building blocks; simple routing) Low diameter (performance) Good scalability (small network size granularity)

14 January, 2007 14 Introduction to SHAPES14 STNoC key components Network on Chip is a set of on-chip routers (up to layer 3), Network Interfaces (NI) (layer 4) and physical Link NI router IP link

15 January, 2007 15 Introduction to SHAPES15 Industrial Interest for Multi Tile Systems-on- Chip Large silicon area is needed for high Average Selling Price/unit: Multiple tile -> the way to efficient design of chips area > 40mm 2 on future technologies -> Industrial Interest of Semiconductor Manufacturers to avoid a low profit commodities-like industry $ and # of embedded processors / persons increasing faster than conventional processors / persons  # of (phones, games, pdas, cars, home, medical, wearable) vs PC Collision/convergence on architectures is going to happen:  Because of changes on key driving markets  Because full systems can be integrated on a chip  Because of deep submicron technological facts: WIRING, COMPLEXITY, POWER This time, …we are not in 1980, when a simpler solution was achievable through higher clock rate and monolithic architectures…we need multi-processor parallelism Embedded Systems versus Classical Computing

16 January, 2007 16 Introduction to SHAPES16 Background: APENext (2005) 2048 processor system, VLIW processors designed by INFN, manufactured by ATMEL

17 January, 2007 17 Introduction to SHAPES17 HW Background: Istituto Nazionale Fisica Nucleare APE family of Massive Parallel custom Very Long Instruction Word Floating- Point Procs. + 3D first neighbour toroidal communication for Numerical Physics Simulations APE (1984-1988) APE100 (1988-1993) APEmille (1994-1999) apeNEXT (2000-2005) Architecture SIMD SIMD++ # nodes 162048 4096 Topology flexible 1Drigid 3Dflexible 3D Aggregated memory 256 MB8 GB64 GB1 TB # registers (w.size) 64 (x32)128 (x32)512 (x32)512 (x64) LOW Clock frequency 8 MHz25 MHz66 MHz200 MHz Comp. Power/node 64 Mflops50 Mflops528 Mflops1600 Mflops Aggregated Comp. Power 1 GFlops100 GFlops1 TFlops7 TFlops

18 January, 2007 18 Introduction to SHAPES18 SW challenges from Tiled Architectures Facilitate expression of parallelism: e.g. Network of Actors Express real time constraints in a formal manner, feature missing in classical languages. This is a key cultural point!!! Avoid destroying information about available algorithm parallelism Compilation chain must fully aware of key architectural parameters: bandwidth, computational power, pipeline and latencies Exploit memory locality – efficient management of Distributed Memories – get rid of classical caches Manage Long delays between distant tiles Reduce Hot Spots in communications Reduce Tiled RTOS overhead (time and memory footprint) Introduce Hardware dependent Software and Hardware Abstraction Layers Capture scalability in a library of characterized SW/HW components Support for (semi)-automation of iterative design over HW, SW, Appl Monitor quality and real-time constraints on real HW and Simulators Simulation speed of multi-tiled architectures

19 January, 2007 19 Introduction to SHAPES19 SW Architecture Optimised compilation on tiles and comms network Distributed Operation Layer hardware platform specification Simulator trace information Model Compiler component interaction, properties and constraints component source code mapping information HdS Generator HdS source code Compiler component binary HdS binary Link Dispatch OS services binary glue binary Mapping Memory mapping RTOS application specs

20 January, 2007 20 Introduction to SHAPES20 Distributed Operation Layer – Application Specification Two parts: Application structure  @system level  processes  FIFO SW channels between processes  interconnection between processes Behavior of each process  process’ internals.c ….xml schema definition available AB C

21 January, 2007 21 Introduction to SHAPES21 Virtual SHAPES Platform (VSP) Enable early software development Explore different tile configurations Binary compatible with the SHAPES hardware Debugging capability Export performance information Scalability to multiple tiles VSP DOL Applications RTOSHdS HW SHAPES SW and app partners

22 January, 2007 22 Introduction to SHAPES22 VSP-DOL interfacing

23 January, 2007 23 Introduction to SHAPES23 TARGET Compiler Core related requirements TILE OFF- CHIP MEM TILE OFF- CHIP MEM TILE OFF- CHIP MEM TILE OFF- CHIP MEM mAgicV DSP ARM uP uP MEM COMM I/F DSP DATA MEM COMM I/F REG FILE DSP PROG MEM Communication related requirements Conv2 Div2 Sh/Log2 Conv1 Div1 Sh/Log1 RF0 0123 4567 FP/I * * RF1 0123 4567 FP/I * * - + - + + - Mul1 Mul2Mul4Mul3 Cadd1Cadd2 Add1Add2 Min Max2 Min Max1 P6_0 P5_0 P4_ 0 P3_0 P2_0 P6_1 P4_1 P5_1 P2_1 P3_1 Core_bus5 Core_bus7 Core_bus5 Core_bus7 mAgicV PCU PROGRAM MEMORY INSTR. DECODER INSTR. SEQUEN- CER DECOM- PACTION INTERRUPT CON- TROLLER INSTR. DECODER mAgicV core Support of VLIW instruction compaction Phase coupling: reg. allocation  SW pipelining Support of predicated execution Functional unit assignment for clustered VLIWs Communication latency aware scheduling Intra-tile multi- core on-chip debugging Inter-tile communication using DNP

24 January, 2007 24 Introduction to SHAPES24 TIMA - HdS & RTOS - Principles Hardware dependent Software: software directly dependent on the underlying hardware Communication differentiation  Intra-subsystem & inter-subsystem communications Networked operating system: Application HdS API RTOS (RT Linux) COMM HAL ARM ARM Subsystem Application HdS API MonitorCOMM HAL DSP DSP Subsystem HdS SW HW SW HW

25 January, 2007 25 Introduction to SHAPES25 SW Architecture hardware platform specification simulation environment (RWTH) WP 1.4 trace information model compiler (ETHZ, RWTH) WP 1.11, WP 1.4 component interaction, properties and constraints component source code mapping information HdS generator (TIMA) WP 1.10 HdS source code Compiler (TARGET) WP 1.9 component binary HdS binary Link Dispatch (TARGET) WP 1.9 OS services binary glue binary mapping (ETHZ) WP 1.11 Memory mapping RTOS (TIMA, THALES) WP 1.10 application specification

26 January, 2007 26 Introduction to SHAPES26 SHAPES SW Architecture: challenges High-level exploration, mapping, and simulation:  What is the degree of available parallelism? How can it be exposed to the mapping stage? What is suitable model-based specification formalism? What adaptations are necessary in order to expose the inherent parallelism?  Define a common Profiling Trace Interface (PTI) over which information can be exchanged. Hardware-dependent software and operation system:  To use the provided features of the HdS (i.e. platform abstraction) a generic interface API has to be defined. Compiler technology:  Modeling low-latency communication interfaces in the C source code that is the input for the C compiler, for the computational tiles.  Investigate how HdS can be modeled entirely in C source code, to be compiled by the C compiler for the computational tiles.

27 January, 2007 27 Introduction to SHAPES27 Tiled HW Architecture Communication Centric, not Processor Centric Homogeneous SW interface for on-chip and off-chip scalable connection and I/O 3D first-neighbour Toroidal System Eng. (3DT) for Off-Chip communication Virtual tunnelling on packed switching NoC (Network on Chip) and off- chip 3DT Parallelism Aware System SW: Manage memory distribution, capture real time constraints Explicit parallel programming/Network of Actors FPGAFPGA DAC actuator Tile OFF-CHIP MEM Tile OFF-CHIP MEM Tile OFF-CHIP MEM Tile OFF-CHIP MEM DSP DNP Multi-Layer BUS NoC RISC POT 3DT Off-chip communication DXM Tile ADC sensor DAC actuator ADC sensor ADC/DAC

28 January, 2007 28 Introduction to SHAPES28 The tile: DIOPSIS ® + DIOPSIS ® + DNP DNP

29 January, 2007 29 Introduction to SHAPES29 DOL – Input/Output Interfaces DOL Performance analysis Application Specification HW Architecture Specification Mapping constraints Application programmer HW architect Sys. SW designer Simulation framework Workload Specification Mapping Specification Performance Analysis Results Performance Queries Compiler & Linker HdS & RTOS Simulation framework Sys. SW designer Application functional Simulation Mapping Optimization

30 January, 2007 30 Introduction to SHAPES30 RISC 0 DOL - Mapping Specification bus DSP 0 RISC 1 bus DSP 1 MEM AB C 7 0 1 2 3 4 5 6 0 4 6 2 5 1 3 7 NoC Mapping = binding + scheduling.xml schema definition available

31 January, 2007 31 Introduction to SHAPES31 VSP abstraction levels Statistical Analysis Tasks are represented as timing budgets: - Very high simulation speed - No architecture modeling & verification SHAPES Hardware platform Virtual Processing Unit (VPU) Generic abstract processor simulator: - adaptable to arbitrary processor core - high simulation speed - functional validation - user-dependent accuracy simulation speed accuracy WP1.4 WP1.11 WP1.1 Cycle Accurate (CA) Model Cycle accurate Instruction Set Simulators (ISS): - ARM9 (commercially available) - mAgic VLIW DSP (Target ISS) - DNP - STM Spidergon Network-on-Chip (STM model) Instruction Accurate (IA) Model Instruction accurate Instruction Set Simulators (ISS): - ARM9 - mAgic VLIW DSP - DNP - STM Spidergon Network-on-Chip


Download ppt "Scalable Software Hardware Architecture Platform for for Embedded Systems SHAPES at DATE 2007 Pier Stanislao PAOLUCCI chief technical officer – ATMEL Roma."

Similar presentations


Ads by Google