Embedded Computer Architecture

Presentation on theme: "Embedded Computer Architecture"— Presentation transcript:

1 Embedded Computer Architecture
Exploiting ILP: VLIW architectures. TU/e 5KK73, Henk Corporaal

2 What are we talking about?
ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.
VLIW = Very Long Instruction Word architecture.
Example instruction format of a 5-issue VLIW: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |
4/16/2017 Embedded Computer Architecture H. Corporaal and B. Mesman

3 Single Issue RISC vs VLIW
[Figure: a single-issue RISC CPU executes one instruction (one op) per cycle; for a 3-issue VLIW the compiler packs three ops per instruction (inserting nops where needed), so executing one instruction per cycle yields 3 ops/cycle.]

4 Topics Overview
How to speed up your processor? What options do you have?
Operation/instruction level parallelism; limits on ILP; VLIW; examples; clustering; code generation (2nd slide-set); hands-on.

5 Speed-up Pipelined Execution of Instructions
IF: Instruction Fetch; DC: Instruction Decode; RF: Register Fetch; EX: Execute instruction; WB: Write Result Register. [Diagram: a simple 5-stage pipeline, instructions 1-4 flowing through IF DC RF EX WB over cycles 1-8.]
Purpose of pipelining: reduce the number of gate levels in the critical path; reduce CPI to close to one (instead of a large number for the multicycle machine); more efficient hardware.
Problems -- hazards cause pipeline stalls: structural hazards (add more hardware); control hazards, branch penalties (use branch prediction); data hazards (bypassing required).

6 Speed-up Pipelined Execution of Instructions
Superpipelining: split one or more of the critical pipeline stages.
Superpipelining degree S: S(architecture) = Σ_{Op ∈ I_set} f(Op) * lt(Op), where f(Op) is the frequency of operation Op and lt(Op) is the latency of operation Op.
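The degree formula above is just a frequency-weighted latency sum. A small C sketch (function name and arrays are illustrative, not from the slides):

```c
#include <stddef.h>

/* Superpipelining degree from the slide:
   S(architecture) = sum over Op in I_set of f(Op) * lt(Op),
   i.e. the execution-frequency-weighted average operation latency. */
double superpipelining_degree(const double f[], const double lt[], size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += f[i] * lt[i];   /* f[i]: frequency of op i; lt[i]: its latency */
    return s;
}
```

For example, if half the executed operations take 1 cycle and half take 3, S = 0.5*1 + 0.5*3 = 2.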

7 Speed-up Powerful Instructions (1)
MD-technique: multiple data operands per operation. SIMD: Single Instruction Multiple Data.
Vector instruction: for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i]; or c = a + 5*b
Assembly: set vl,64 ldv v1,0(r2) mulvi v2,v1,5 ldv v1,0(r1) addv v3,v1,v2 stv v3,0(r3)
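In scalar C, the vector sequence above (vl = 64, then ldv/mulvi/addv/stv) computes the loop below; the 32-bit element type is an assumption:

```c
#include <stdint.h>

/* Scalar reference for the vector code on this slide:
   c[i] = a[i] + 5 * b[i] for all 64 elements covered by vl. */
void vec_madd(const int32_t a[64], const int32_t b[64], int32_t c[64]) {
    for (int i = 0; i < 64; i++)
        c[i] = a[i] + 5 * b[i];
}
```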

8 Speed-up Powerful Instructions (1)
SIMD computing: nodes used for independent operations; mesh or hypercube connectivity; exploits data locality of e.g. image processing applications; dense encoding (few instruction bits needed). [Figure: SIMD execution method -- instructions 1..n issued over time to nodes 1..K in lockstep.]

9 Speed-up Powerful Instructions (1)
Sub-word parallelism: SIMD on a restricted scale, used for multimedia instructions. Examples: MMX, SSE, Sun VIS, HP MAX-2, AMD 3DNow!, TriMedia II. Example operation: Σ_{i=1..4} |a_i - b_i|
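A scalar sketch of the sum-of-absolute-differences kernel Σ_{i=1..4} |a_i - b_i| that such sub-word instructions compute in one step, here over four 8-bit sub-words packed into 32-bit operands (the packing layout is an assumption, not taken from a specific instruction set):

```c
#include <stdint.h>
#include <stdlib.h>

/* Sum of absolute differences over the four 8-bit sub-words of two
   packed 32-bit operands -- the kernel a media SAD instruction
   evaluates in a single operation. */
uint32_t sad4(uint32_t a, uint32_t b) {
    uint32_t sum = 0;
    for (int i = 0; i < 4; i++) {
        int32_t ai = (a >> (8 * i)) & 0xFF;  /* extract sub-word i of a */
        int32_t bi = (b >> (8 * i)) & 0xFF;  /* extract sub-word i of b */
        sum += (uint32_t)abs(ai - bi);
    }
    return sum;
}
```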

10 Speed-up Powerful Instructions (2)
MO-technique: multiple operations per instruction. Two options: CISC (Complex Instruction Set Computer) and VLIW (Very Long Instruction Word).
VLIW instruction example, one field per FU: FU 1: sub r8,r5,3 | FU 2: and r1,r5,12 | FU 3: mul r6,r5,r2 | FU 4: ld r3,0(r5) | FU 5: bnez r5,13

11 VLIW architecture: central Register File
[Figure: a shared, multi-ported register file feeding exec units 1-9 through issue slots 1-3.] Q: How many ports does the register file need for n-issue?
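One common answer to the slide's question, assuming every issue slot performs dyadic operations (two register reads and one write per slot) -- a sketch of the usual count, not the slide's own answer:

```c
/* Ports needed by a shared register file in an n-issue VLIW,
   assuming 2 read ports + 1 write port per issue slot. */
int rf_ports(int n_issue) { return 3 * n_issue; }
```

For the 3-issue organization shown, this gives 9 ports; port count grows linearly with issue width, while port hardware cost grows faster, which motivates the clustering discussed later.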

12 Philips oldie: TriMedia TM32A processor
[Die plot: I-cache (32K) with tag, D-cache (16K), register file (128 regs x 32 bits), sequencer/decode, I/O interface, and function units including ALU0-ALU4, FALU0, FALU3, shifters, DSP ALUs and multipliers, IFMUL1/IFMUL2 (float), FCOMP2, FTOUGH1.]
0.18 micron; area: 16.9 mm2; 200 MHz (typ); 1.4 W, i.e. 7 mW/MHz (MIPS processor: 0.9 mW/MHz).

13 Speedup: Powerful Instructions (2) VLIW Characteristics
Only RISC-like operation support -> short cycle times. Flexible: can implement any FU mixture. Extensible. Tight inter-FU connectivity required. Large instructions (up to 1024 bits). Not binary compatible!!! But good compilers exist.

14 Speed-up Multiple instruction issue (per cycle)
Who guarantees semantic correctness, i.e. which instructions can be executed in parallel?
User: specifies multiple instruction streams -> multi-processor: MIMD (Multiple Instruction Multiple Data).
HW: run-time detection of ready instructions -> superscalar.
Compiler: compiles into a dataflow representation -> dataflow processors.

15 Multiple instruction issue Three Approaches
Example code: a := b + 15; c := 3.14 * d; e := c / f;
Translation to DDG (Data Dependence Graph): [graph with ld/st nodes for &a, &b, &c, &d, &e, &f, constants 15 and 3.14, and +, *, / operation nodes].

16 Generated Code 3 approaches:
Sequential code:
I1 ld r1,M(&b); I2 addi r1,r1,15; I3 st r1,M(&a); I4 ld r1,M(&d); I5 muli r1,r1,3.14; I6 st r1,M(&c); I7 ld r2,M(&f); I8 div r1,r1,r2; I9 st r1,M(&e)
Dataflow code:
I1 ld M(&b) -> I2; I2 addi 15 -> I3; I3 st M(&a); I4 ld M(&d) -> I5; I5 muli 3.14 -> I6, I8; I6 st M(&c); I7 ld M(&f) -> I8; I8 div -> I9; I9 st M(&e)
3 approaches:
A MIMD may execute two streams: (1) I1-I3, (2) I4-I9. No dependencies between the streams; in practice communication and synchronization are required between streams.
A superscalar issues multiple instructions from the sequential stream, obeying dependencies (true and name dependencies); reverse engineering of the DDG is needed at run-time.
Dataflow code is a direct representation of the DDG.

17 Multiple Instruction Issue: Data flow processor
[Figure: dataflow processor organization -- token matching, token store, instruction generate, instruction store, result tokens, reservation stations, FU-1 .. FU-K.]

18 Instruction Pipeline Overview
[Diagram comparing instruction pipelines: CISC (no pipelining: IF DC RF EX WB in sequence), RISC (IF DC/RF EX WB), superscalar (k parallel IF/DC/RF/EX/WB pipes with issue stage and ROB), superpipelined (IF1..IFs DC RF EX1..EX5 WB), dataflow (per-FU RF/EX/WB), and VLIW (single IF DC followed by k parallel RF/EX/WB lanes).]

19 Four dimensional representation of the architecture design space <I, O, D, S>
Axes: instructions/cycle 'I', operations/instruction 'O', data/operation 'D', superpipelining degree 'S' (scale roughly 0.1 to 100). Architectures placed in this space: CISC, RISC, VLIW, superscalar, superpipelined, vector, SIMD, MIMD, dataflow. Note: MIMD would better be a separate, 5th dimension!

20 Architecture design space
Typical values of K (# of functional units or processor nodes) and <I, O, D, S> for different architectures: CISC, RISC, VLIW, superscalar, superpipelined, vector, SIMD, MIMD, dataflow. [Table values not preserved in this transcript.]
S(architecture) = Σ_{Op ∈ I_set} f(Op) * lt(Op); Mpar = I*O*D*S

21 Overview Enhance performance: architecture methods
Instruction Level Parallelism (ILP); limits on ILP; VLIW; examples; clustering; code generation; hands-on.

22 General organization of an ILP architecture
[Figure: CPU with instruction memory, instruction fetch unit, instruction decode unit, FU-1 .. FU-5 connected by a bypassing network, register file, and data memory.]

23 Motivation for ILP Increasing VLSI densities; decreasing feature size
Increasing performance requirements. New application areas, like multimedia (image, audio, video, 3-D, holographic), intelligent search and filtering engines, and neural/fuzzy/genetic computing. More functionality. Use of existing code (compatibility). Low power: P = f*C*Vdd^2

24 Low power through parallelism
Sequential processor: switching capacitance C, frequency f, voltage V; P = f*C*V^2.
Parallel processor (two times the number of units): switching capacitance 2C, frequency f/2, voltage V' < V; P = (f/2)*2C*V'^2 = f*C*V'^2 < f*C*V^2.
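The slide's power argument as plain arithmetic -- a sketch; the function names and the example numbers are illustrative:

```c
/* P = f * C * V^2 for a sequential processor. A two-way parallel
   version doubles the switching capacitance but halves the frequency,
   so P = (f/2) * 2C * V'^2 = f * C * V'^2 -- a net win whenever the
   lower frequency allows a supply voltage V' < V. */
double power(double f, double c, double v) { return f * c * v * v; }

double parallel_power(double f, double c, double v_scaled) {
    /* two units at half the clock: capacitance 2C, frequency f/2 */
    return power(f / 2.0, 2.0 * c, v_scaled);
}
```

With V' = V the parallel design burns exactly as much as the sequential one; the entire saving comes from the voltage term, which enters quadratically.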

25 Measuring and exploiting available ILP
How much ILP is there in applications? How to measure parallelism within applications? Using an existing compiler; using trace analysis: track all the real data dependencies (RaW: read-after-write) of instructions from the issue window (register dependences, memory dependences); check for correct branch prediction: if the prediction is correct, continue; if wrong, flush the schedule and start in the next cycle.

26 Trace analysis How parallel can you execute this code?
Program:
for i := 0..2 A[i] := i; S := X+3;
Compiled code:
set r1,0 set r2,3 set r3,&A Loop: st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop add r1,r5,3
Trace (16 operations: the three set operations, three loop iterations of st/add/add/brne, and the final add r1,r5,3).
Explain trace analysis using this trace for different models: no renaming + oracle prediction + unlimited window size; full renaming + oracle prediction + unlimited window size (shown in the next slide); full renaming + 2-bit prediction (assuming back-edge taken) + unlimited window size; etc.
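The source program on this slide, rendered in C (variable names follow the slide; making X and S globals is an assumption):

```c
/* for i := 0..2  A[i] := i;  S := X + 3;  -- the program whose
   compiled code and trace the slide analyzes. */
int A[3], X, S;

void kernel(void) {
    for (int i = 0; i < 3; i++)   /* three iterations -> 12 of the 16 trace ops */
        A[i] = i;
    S = X + 3;                    /* the final add r1,r5,3 */
}
```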

27 Trace analysis Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7
Parallel trace:
set r1,0 set r2,3 set r3,&A
st r1,0(r3) add r1,r1,1 add r3,r3,4
st r1,0(r3) add r1,r1,1 add r3,r3,4 brne r1,r2,Loop
brne r1,r2,Loop
add r1,r5,3
Max ILP = speedup = L_serial / L_parallel = 16 / 6 = 2.7.
Note that with oracle prediction and renaming the last operation, add r1,r5,3, can be put in the first cycle.

28 Ideal Processor Assumptions for ideal/perfect processor:
1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided.
2. Branch and jump prediction: perfect => all program instructions available for execution.
3. Memory-address alias analysis: addresses are known; a store can be moved before a load provided the addresses are not equal.
Also: unlimited number of instructions issued per cycle (unlimited resources) and an unlimited instruction window; perfect caches; 1-cycle latency for all instructions (including FP * and /). Programs were compiled using the MIPS compiler with maximum optimization level.

29 Upper Limit to ILP: Ideal Processor
[Chart: IPC achieved on the ideal processor for integer and FP benchmarks.]

30 Window Size and Branch Impact
Change from an infinite window to examining 2000 instructions and issuing at most 64 instructions per cycle. [Chart: IPC per benchmark for perfect, tournament, BHT(512), profile-based, and no prediction; integer programs reach 6-12 IPC.]

31 Limiting nr. of Renaming Registers
Changes: 2000-instr. window, 64-instr. issue, 8K 2-level predictor (slightly better than the tournament predictor). [Chart: IPC for integer and FP benchmarks as the number of renaming registers is reduced from infinite.]

32 Memory Address Alias Impact
Changes: instr. window, 64-instr. issue, 8K 2-level predictor, 256 renaming registers. [Chart: IPC for perfect, global/stack-perfect, inspection, and no alias analysis; FP (Fortran, no heap) benefits most; integer programs reach 4-9 IPC.]

33 Reducing Window Size IPC
Assumptions: perfect disambiguation, 1K selective predictor, 16-entry return stack, 64 renaming registers, issue as many instructions as the window allows. [Chart: IPC for integer and FP benchmarks as the window size is reduced from infinite.]

34 How to Exceed ILP Limits of This Study?
WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not through memory.
Unnecessary dependences: the compiler did not unroll loops, so the iteration variable creates a dependence.
Overcoming the dataflow limit: value prediction -- predicting values and speculating on the prediction.
Address value prediction and speculation: predict addresses and speculate by reordering loads and stores; could provide better alias analysis.

35 Conclusions Amount of parallelism is limited
Higher in multimedia and signal-processing applications; higher in kernels.
Trace analysis detects all types of parallelism: task, data, and operation types.
Detected parallelism depends on the quality of the compiler, the hardware, and source-code transformations.

36 Overview Enhance performance: architecture methods
Instruction Level Parallelism; VLIW; examples: C6, TM, IA-64 (Itanium, ...), TTA; clustering; code generation; hands-on.

37 A VLIW architecture with 7 FUs
VLIW: general concept. [Figure: a VLIW architecture with 7 FUs -- instruction memory feeding an instruction register; integer FUs and a LD/ST unit connected to an integer register file and data memory; FP FUs connected to a floating-point register file.]

38 VLIW characteristics Multiple operations per instruction
One instruction issued per cycle (at most); the compiler is in control. Only RISC-like operation support -> short cycle times; easier to compile for. Flexible: can implement any FU mixture. Extensible / scalable. However: tight inter-FU connectivity required; not binary compatible!! (new long instruction format); low code density.

39 VelociTI C6x datapath

40 VLIW example: TMS320C62 TMS320C62 VelociTI Processor
8 operations (of 32 bits each) per instruction (256 bits). Two clusters: 8 FUs, 4 FUs per cluster (2 multipliers, 6 ALUs), 2 x 16 registers. One bus available to write into the register file of the other cluster. Flexible addressing modes (like circular addressing). Flexible instruction packing. All instructions conditional. Originally: 5 ns, 200 MHz, 0.25 um, 5-layer CMOS; 128 KB on-chip RAM.

41 VLIW example: Philips TriMedia TM1000
[Figure: register file (128 regs, 32 bit, 15 ports) feeding 5 exec units; instruction register with 5 issue slots; PC; instruction cache (32 kB); data cache (16 kB).] Function units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 int/FP ALU, 1 FP compare, 1 FP div/sqrt.

42 Intel EPIC Architecture IA-64
Explicitly Parallel Instruction Computing (EPIC); the IA-64 architecture -> Itanium, first realization 2001.
Register model: 128 64-bit integer registers (with register stack and rotation), 128 82-bit floating-point registers (rotating), 64 1-bit boolean (predicate) registers, 8 64-bit branch target address registers, plus system control registers.

43 EPIC Architecture: IA-64
Instructions are grouped in 128-bit bundles: 3 x 41-bit instructions, plus 5 template bits that indicate instruction types and stop locations. Each 41-bit instruction starts with a 4-bit opcode and ends with a 6-bit guard (boolean) register id. Supports speculative loads.
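A quick consistency check of the bundle layout described above (the constant names are illustrative):

```c
/* IA-64 bundle layout from the slide: three 41-bit instruction slots
   plus 5 template bits exactly fill a 128-bit bundle. */
enum {
    SLOT_BITS     = 41,
    SLOTS         = 3,
    TEMPLATE_BITS = 5,
    BUNDLE_BITS   = SLOTS * SLOT_BITS + TEMPLATE_BITS  /* 3*41 + 5 = 128 */
};
```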

44 Itanium organization

45 Itanium 2: McKinley

46 EPIC Architecture: IA-64
EPIC allows for more binary compatibility than a plain VLIW: function unit assignment is performed at run-time; lock when FU results are not available. See the website of course 5MD00 for more info on IA-64 (look at related material).

47 What did we talk about?
ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel.
VLIW = Very Long Instruction Word architecture.
Example instruction format (5-issue): | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

48 VLIW evaluation
[Figure: CPU with instruction memory, instruction fetch unit, instruction decode unit (control problem), FU-1 .. FU-5 on a bypassing network (O(N^2) with N function units), register file (O(N)-O(N^2)), and data memory.]

49 VLIW evaluation Strong points of VLIW: Weak points:
Strong points of VLIW: scalable (add more FUs); flexible (an FU can be almost anything, e.g. multimedia support).
Weak points, with N FUs: bypassing complexity O(N^2); register file complexity O(N); register file size O(N^2); register file design restricts FU flexibility.
Solution: ?

50 Solution Mirroring the Programming Paradigm
TTA: Transport Triggered Architecture. [Figure: FUs (+, -, >, *, st) attached to a transport network, mirroring the programming paradigm.]

51 Transport Triggered Architecture
General organization of a TTA: [Figure: CPU with instruction memory, instruction fetch unit, instruction decode unit; FU-1 .. FU-5 and the register file all attach to the bypassing (transport) network; data memory behind the FUs.]

52 TTA structure; datapath details
[Figure: TTA datapath -- instruction memory and instruct. unit driving the transport buses; integer RF, float RF, boolean RF, immediate unit, load/store units and ALUs attach to the buses through sockets; data memory behind the load/store units.]

53 TTA hardware characteristics
Modular: building blocks easy to reuse. Very flexible and scalable: easy inclusion of Special Function Units (SFUs). Very low complexity: > 50% reduction in # register ports; reduced bypass complexity (no associative matching); up to 80% reduction in bypass connectivity; trivial decoding; reduced register pressure; easy register file partitioning (a single port is enough!).

54 TTA software characteristics
add r3, r1, r2 becomes: r1 -> add.o1; r2 -> add.o2; add.r -> r3 (operands o1, o2 in; result r out).
That does not look like an improvement!?! It is more difficult to schedule -- but it enables extra scheduling optimizations.

55 Program TTAs
How to do data operations?
1. Transport of operands to the FU: operand move(s), then a trigger move.
2. Transport of results from the FU: result move(s).
FU pipeline: trigger/operand registers -> internal stage -> result register.
Example: add r3,r1,r2 becomes
r1 -> Oint // operand move to integer unit
r2 -> Tadd // trigger move to integer unit
.......... // addition operation in progress
Rint -> r3 // result move from integer unit
How to do control flow?
1. Jump: #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call: pc -> r; #call-address -> pcd
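A toy model of the transport-triggered add above: writing the trigger register starts the operation, and the result register is read afterwards. The struct and function names are illustrative, not part of any TTA toolchain:

```c
/* Minimal FU model for:  r1 -> Oint;  r2 -> Tadd;  Rint -> r3 */
typedef struct {
    int o;   /* operand register (Oint) */
    int r;   /* result register  (Rint) */
} IntFU;

static void move_operand(IntFU *fu, int v) { fu->o = v; }          /* r1 -> Oint */
static void move_trigger(IntFU *fu, int v) { fu->r = fu->o + v; }  /* r2 -> Tadd: triggers the add */
static int  move_result(const IntFU *fu)   { return fu->r; }       /* Rint -> r3 */
```

The point of the model: the program specifies only data transports; the operation itself is a side effect of the trigger move.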

56 Scheduling example
Code: add r1,r1,r2; sub r4,r1,95
TTA moves (scheduled over a datapath with a load/store unit, two integer ALUs, an immediate unit, and an integer RF):
r1 -> add.o1, r2 -> add.o2
add.r -> sub.o1, 95 -> sub.o2
sub.r -> r4

57 TTA Instruction format
General MOVE field: g (guard specifier), i (immediate specifier), src (source), dst (destination).
General MOVE instructions contain multiple fields: move 1 | move 2 | move 3 | move 4.
How to use immediates? Small: 6 bits, within the src field. Long: 32 bits, occupying a field of the form g | 1 | imm | dst.

58 Programming TTAs: how to do conditional execution? Each move is guarded.
Example:
r1 -> cmp.o1 // operand move to compare unit
r2 -> cmp.o2 // trigger move to compare unit
cmp.r -> g // put the result in boolean register g
g:r3 -> r4 // guarded move takes place only when r1 = r2
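The guarded sequence above as a toy C model (names illustrative): the final move executes only when the guard set by the compare unit holds.

```c
#include <stdbool.h>

/* r1 -> cmp.o1;  r2 -> cmp.o2;  cmp.r -> g;  g:r3 -> r4 */
typedef struct { bool g; } Guard;

/* Trigger move to the compare unit: sets the boolean guard. */
static void cmp_trigger(Guard *gd, int o1, int o2) { gd->g = (o1 == o2); }

/* Guarded move: copies src to *dst only when the guard is true. */
static void guarded_move(const Guard *gd, int src, int *dst) {
    if (gd->g)
        *dst = src;
}
```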

59 Register file port pressure for TTAs

60 Summary of TTA Advantages
Better usage of transport capacity: instead of 3 transports per dyadic operation, about 2 are needed. # register ports reduced by at least 50%. Inter-FU connectivity reduced by 50-70%; no full connectivity required. Both the transport capacity and the # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs. Flexible: FUs can incorporate arbitrary functionality. Scalable: #FUs, #reg.files, etc. can be changed. FU splitting results in extra exploitable concurrency. TTAs are easy to design and can have short cycle times.

61 TTA automatic DSE
[Figure: the MOVE framework -- an optimizer explores architecture parameters with user interaction; a parametric compiler produces parallel object code and a hardware generator produces the chip; feedback loops drive toward a Pareto curve of cost vs. execution time over the solution space.]

62 Overview Enhance performance: architecture methods
Instruction Level Parallelism; VLIW; examples: C6, TM, TTA; clustering and reconfigurable components; code generation; hands-on.

63 Clustered VLIW
Clustering = splitting up the VLIW data path; the same can be done for the instruction path. [Figure: clusters of FUs with loop buffers and register files, sharing a level-1 instruction cache, a level-1 data cache, and a shared level-2 cache.]

64 Clustered VLIW
Why clustering? Timing: faster clock. Lower cost: silicon area, T2M (time-to-market). Lower energy.
What's the disadvantage? Want to know more: see the PhD thesis of Andrei Terechko.

65 Fine-Grained reconfigurable: Xilinx XC4000 FPGA
[Figure: XC4000 internals -- programmable interconnect; I/O blocks (IOBs) with output buffer, slew-rate control, passive pull-up/pull-down, input delay, and pad; configurable logic blocks (CLBs) with F, G, and H function generators, flip-flops with set/reset control, and inputs F1-F4, G1-G4, C1-C4.]

66 Recent Coarse Grain Reconfigurable Architectures
SmartCell (2009), Montium (reconfigurable VLIW), RaPiD, NIOS II, RAW, PicoChip, PACT XPP64, ADRES (IMEC), and many more ...

67 Xilinx Zynq with 2 ARM processors

68 ADRES Combines VLIW and reconfig. Array
PEs have local registers; top-row PEs share registers.

69 PACT XPP: Architecture
XPP (eXtreme Processing Platform): a hierarchical structure consisting of PAEs. PAEs are coarse-grain, adaptive PEs, clustered in PACs; PA = PAC + CM. A hierarchical configuration tree; memory elements (beside the PAs); I/O elements (on each side of the chip). PAE: Processing Array Element; PAC: Processing Array Cluster; CM: Configuration Manager.

70 RAW with mesh network
Compute pipeline; 8 32-bit channels; registered at input -> longest wire = length of a tile.

71 Granularity Makes a Difference

72 HW or SW reconfigurable?
[Figure: design space from HW- to SW-reconfigurable -- axes: data path granularity (fine to coarse) and reconfiguration time (FPGA reset, context, configuration, loop buffer, 1 cycle); spatial vs. temporal mapping; bandwidth going up toward fine grain; FPGA, subword parallelism and VLIW placed in the space.]
Speaker notes: The space of HW- and SW-reconfigurable systems is an almost continuous space. Starting from a VLIW we can go more spatial; the PipeRench architecture is an example -- there, only part of the data path is reconfigured each cycle. If we have a single-instruction loop, we have a completely spatial mapping. A more fine-grain fabric allows more effective use of the available hardware, but in general gives more routing overhead. In the extreme case of a fine-grain FPGA we have complete control at gate level, but with substantial interconnect and reconfiguration overhead. So I expect hybrid solutions for many application-specific platforms. Note that in principle spatial mapping is worse for area, but good for low activation and configuration power; Pstatic may increase, however.
Power = Pact + Pconf + Pstatic; Pconf ~ Nbits * configuration-rate ~ Nbits / config-time; Nbits ~ granularity.

