1
Modular design and the importance of models of computation
Bart Kienhuis, Leiden University, LIACS Computer Systems Group. Based on the presentation given at the 37th DAC tutorial on Embedded System Design.
2
New Applications
Stream-oriented applications: multimedia, (smart) imaging, bioinformatics, classical digital signal processing
Ferocious appetite for compute power
3
Smart Imaging: the Camellia project
Core for ambient and mobile intelligent imaging applications; an IST project (FP5) with partners including Philips and Renault
Example target: detection of a pedestrian walking in front of a car
Huge compute requirements: giga-operations per second, real-time embedded system, low cost, low power
4
Heterogeneous Architectures
Microprocessors (DSP, CPU), reconfigurable units (FPGA), dedicated hardware (IP cores), distributed memory (memory banks)
Stream-based applications: autonomously operating components (task-level parallelism), low-bandwidth communication between components
Programmable interconnect, distributed memory
5
Intrinsic Computational Efficiency (ICE)
[Figure: intrinsic computational efficiency of silicon, in MOPS/W, versus feature size in µm. The intrinsic-efficiency curve of silicon bounds the "playing field"; microprocessors (i386SX, i486DX, P5, 68040, microSPARC, SuperSPARC, 601, 604, 604e, Ultra, P6, 21164a, TurboSPARC, 21364, 7400) lie well below it.]
E. Roza, "System-on-chip: what are the limits?", IEE Electronics & Communication Engineering Journal, vol. 13, no. 6, Dec. 2001.
6
Reconfigurable logic and memory blocks
Xilinx Virtex-II Pro: PowerPC-based, 1 to 4 embedded PowerPC cores, 420 Dhrystone MIPS at 300 MHz
Xilinx Virtex-4: already over a billion transistors, 90 nm technology
Reconfigurable logic and memory blocks
Source: Xilinx
7
Programmable Interconnect
SpaceCake: a current development in Philips Research. The idea is to make a platform that is Moore's-law resilient: homogeneous tiles (CPU0, CPU1, CPU2, RPU, IP core, memory) connected by a programmable interconnect. If more transistors become available, more tiles can be combined, delivering more compute power.
Stravers, P. and Hoogerbrugge, J., "Homogeneous multiprocessoring and the future of silicon design paradigms," in Proceedings of the Int. Symposium on VLSI Technology, Systems, and Applications.
8
picoArray (startup, England)
Processor with an array of small RISC cores on a single die. Projected markets: WCDMA. Homogeneous architecture.
9
Different Views
Homogeneous architectures:
Linear scaling: more CPUs lead to an equal increase in available compute power
Load balancing: each CPU works at the same workload level
Heterogeneous architectures:
Flexibility: match each computation to the best-suited component in terms of ICE; take advantage of heterogeneity
10
Software Efficiency
[Figure: logic transistors per chip (K) and productivity in transistors per staff-month, 1981-2009. Chip complexity grows at a 58%/yr compound rate while design productivity grows at only 21%/yr, leaving a widening productivity gap. Source: SEMATECH © Keutzer]
11
Problem
“We know how to build billion-transistor ICs, but we do not know how to program them.”
Current compiler technology is not capable of handling the heterogeneity of these architectures. Why is that, and how can it be solved?
12
Y-chart Approach
The Y-chart shows three different ways to improve the performance of a system: suggest architectural improvements, use different mapping strategies, or rewrite the applications.
[Y-chart: Applications and an Architecture Instance feed a Mapping step; Performance Analysis of the mapping produces Performance Numbers.]
Instead of making changes in the architecture alone, the Y-chart shows that a better-performing system can also be obtained by changing the way we map applications onto an architecture and the way we describe the set of applications; it is not limited to architectures only.
The central element in the Y-chart is performance analysis. Within performance analysis there is a fundamental trade-off between three elements: the cost of modeling, the cost of evaluating the model, and the accuracy of the evaluation. These three elements can be drawn in a diagram we call the "Abstraction Pyramid".
Kienhuis, B., Deprettere, E., van der Wolf, P., and Vissers, K., "A Methodology to Design Programmable Embedded Systems," LNCS, Springer-Verlag, pp. 18-37.
13
Mapping
[Y-chart: Applications and an Architecture Instance are combined in a Mapping step; Performance Analysis produces Performance Numbers.]
Example application: MPEG decoding. [Block diagram: coded MPEG video enters a Demux; VLD, inverse quantization (Q-1), IDCT, motion compensation with a buffer, and reordering produce the decoded video, steered by the quantization control and the motion vectors & mode signals.]
14
Mapping
Both the application and the architecture describe a network of components that perform a particular function and that communicate in a particular way.
Architecture: resources such as ALUs, CORDICs, PEs; registers, SRAM, DRAM; busses, switches. Communication: bits, signals.
Application: computations such as IDCT, SQRT, quantizer. Communication: pixels, blocks.
15
Mapping
Can we formalize the description of these networks? This is the role of "Models of Architecture" and "Models of Computation".
16
Model of Computation
A model of computation is a formal representation of the operational semantics of networks of functional blocks describing computations.
The presence of a token signals that something needs to be done by an actor. The actor can react to this presence by itself (an active actor), or be activated by a scheduler to react to the token (a passive actor).
17
Model of Computation Terminology
Actor: describes the functionality.
Relation: actors communicate with each other using relations.
Token: the exchange of a quantum of information; it represents a signal.
Firing: a quantum of computation and a moment of interaction with other actors, e.g. fire { token = get(); send(token); }
Port (active/passive): the point at which a relation connects to an actor.
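A minimal sketch of this terminology in Python (the names Port and Copy, and the deque-based relation, are illustrative assumptions, not an API from the presentation):

```python
# Minimal sketch of the terminology above (illustrative only; the names
# Port and Copy are assumptions, not an API from the presentation).
from collections import deque

class Port:
    """A port gives an actor access to a relation, modelled here as a FIFO of tokens."""
    def __init__(self):
        self.tokens = deque()

    def send(self, token):
        """Produce a quantum of information on the relation."""
        self.tokens.append(token)

    def get(self):
        """Consume a quantum of information from the relation."""
        return self.tokens.popleft()

class Copy:
    """An actor: each call to fire() is one quantum of computation."""
    def __init__(self, inp, out):
        self.inp, self.out = inp, out

    def fire(self):
        token = self.inp.get()   # moment of interaction: read a token
        self.out.send(token)     # moment of interaction: write a token
```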
18
Active/Passive Actors
Two kinds of actors:
Passive actor: a scheduler is needed (e.g. the schedule A B B C D). A firing needs to terminate: fire-and-exit behavior.
fire { token = get(); ... send(token); }
Active actor: schedules itself. A firing typically doesn't terminate: an endless while loop, process behavior.
fire { while(1) { token = get(); ... send(token); } }
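A sketch of the two actor styles in Python, assuming queue.Queue objects as ports (all names here are illustrative): the passive actor's fire() terminates and must be invoked by a scheduler, while the active actor schedules itself by running an endless loop in its own thread.

```python
# Sketch of the two actor styles (illustrative names; ports are modelled
# as queue.Queue objects). The passive actor's fire() terminates and is
# invoked by a scheduler; the active actor runs an endless loop in its
# own thread and so schedules itself.
import threading
import queue

class PassiveActor:
    def __init__(self, inp, out):
        self.inp, self.out = inp, out

    def fire(self):
        self.out.put(self.inp.get())   # fire-and-exit: one token in, one out, return

class ActiveActor(threading.Thread):
    def __init__(self, inp, out):
        super().__init__(daemon=True)
        self.inp, self.out = inp, out

    def run(self):
        while True:                    # process behaviour: the firing never terminates
            self.out.put(self.inp.get())

inp, out = queue.Queue(), queue.Queue()
inp.put("token")
PassiveActor(inp, out).fire()          # a scheduler decides when this firing happens
ActiveActor(inp, out).start()          # runs on its own until the program ends
```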
19
Communication Between Actors
Now that we know the structure in which the actors are placed, the next thing to focus on is the interaction between the actors. Each time an actor fires, it performs a finite amount of computation and exchanges information with other actors by reading and writing tokens. There are two issues with the way this exchange takes place.
First, the data being exchanged needs to be of the same type. If actor 1 wants to exchange a complete matrix (the token represents a matrix) while actor 2 only expects a single integer value, a problem occurs. Data type of the token: integer, double, complex, matrix, vector, record.
Second, given that both actors exchange tokens of the same kind, the way the exchange of data takes place is precisely what distinguishes the different models of computation. Does actor 2 see the token right away, or is the token buffered? Is the exchange synchronized in some way, so that the token is only available at specific instants? Is there a notion of time? Do actors 1 and 2 operate concurrently, or is there a defined schedule? Way the exchange takes place: buffered, timed, synchronized.
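A small sketch of how the exchange style changes the semantics, using Python's standard queue module (an illustrative choice): with a plain buffered FIFO the sender continues immediately, whereas the join()/task_done() handshake below approximates a synchronized, rendezvous-style exchange in which the sender waits until the receiver has processed the token.

```python
# Illustrative contrast between a buffered exchange and a synchronized
# (rendezvous-style) exchange, built from Python's standard queue module.
import threading
import queue

def buffered_sender(ch):
    ch.put("token")                  # buffered: the sender continues immediately

def rendezvous_sender(ch):
    ch.put("token")                  # hand the token over...
    ch.join()                        # ...and wait until the receiver has processed it

def receiver(ch, synchronized):
    token = ch.get()                 # blocking read
    print("received", token)
    if synchronized:
        ch.task_done()               # releases the sender waiting in join()

for sender, synchronized in [(buffered_sender, False), (rendezvous_sender, True)]:
    ch = queue.Queue()
    ts = [threading.Thread(target=sender, args=(ch,)),
          threading.Thread(target=receiver, args=(ch, synchronized))]
    for t in ts: t.start()
    for t in ts: t.join()
```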
20
The way the exchange of the token takes place distinguishes a MoC.
Different semantics:
continuous time: analog computers (ODEs)
discrete time: difference equations
discrete events: discrete-event systems (DE)
partially-ordered events: process networks (Kahn), sequential processes with rendezvous (CSP), dataflow (Dennis)
synchronous/reactive: synchronous-reactive systems (SR), codesign finite state machines (CFSM)
In continuous time, actors represent ordinary differential equations, and the tokens moving through the network represent tentative solutions of the ODE solvers until a fixed-point solution is found; communication between actors is instantaneous. In discrete time, actors represent difference equations and time can only take discrete integer values; in multi-rate discrete time, actors operate on different discrete time scales. In discrete-event models, actors are fired at time instants by a global scheduler on the basis of a global notion of time, and tokens carry time tags that express when an actor needs to be activated. In partially-ordered systems (a partial order of events E1 ... E6 on the slide), a token represents the fulfillment of a data dependency; the tokens between actors can be buffered or unbuffered. In synchronous-reactive models, the presence of a token represents that an event took place to which an actor has to react.
21
Synchronous/reactive Models (SR)
Network of concurrently executing actors
Passive actors
Communication is unbuffered
Computation and communication are instantaneous
A model progresses as a sequence of "ticks." At a tick, the signals are defined by a fixed-point equation.
Characteristics of SR models: tightly synchronized, stable state points, suited to control-intensive systems.
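A minimal sketch of one SR tick as a fixed-point computation, assuming a sentinel value UNKNOWN for undefined signals and simple function-style actors (both are illustrative choices, not part of the slide): all signals start out unknown, the actors are evaluated repeatedly and instantaneously, and the tick ends when no signal changes any more.

```python
# Sketch of one SR "tick" as a fixed-point computation (the sentinel
# UNKNOWN and the function-style actors are illustrative assumptions).
UNKNOWN = object()

def tick(signals, actors):
    """Evaluate passive actors instantaneously and repeatedly until no signal changes."""
    while True:
        before = dict(signals)
        for actor in actors:        # unbuffered, instantaneous communication
            actor(signals)
        if signals == before:       # stable state point: the fixed point of this tick
            return signals

# Example signals and actors for a single tick: y = x + 1 and z = 2 * y.
def actor_a(s):
    if s["x"] is not UNKNOWN:
        s["y"] = s["x"] + 1

def actor_b(s):
    if s["y"] is not UNKNOWN:
        s["z"] = 2 * s["y"]

print(tick({"x": 3, "y": UNKNOWN, "z": UNKNOWN}, [actor_a, actor_b]))
```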
22
Process Network (PN)
Network of concurrently executing processes
Active actors
Processes communicate over unbounded FIFOs, performing some operation followed by a blocking read or a non-blocking write
Characteristics of process networks: deterministic execution, does not impose a particular schedule, (dynamic) dataflow.
23
Synchronous Dataflow (SDF)
Network of concurrently executing actors
Passive actors
Communication is buffered
A model progresses as a sequence of "iterations." A "firing rule" determines the firing condition of an actor; at each firing, a fixed number of tokens is consumed and produced (the rates 1, 3, 3 on the slide, giving the static schedule A B B B C).
Characteristics of SDF: compile-time analyzable (memory, schedule, speed); static dataflow.
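A small sketch of one SDF iteration, assuming the rates mean that A produces 3 tokens, B consumes 1 and produces 1, and C consumes 3 (an interpretation of the "1 3 3" annotation): the firing rule is checked before each firing, and the static schedule A B B B C returns the buffers to their initial state, which is what makes SDF analyzable at compile time.

```python
# Sketch of one SDF iteration, assuming the rates mean: A produces 3
# tokens, B consumes 1 and produces 1, C consumes 3 (an interpretation
# of the "1 3 3" annotation on the slide).
buffers = {"AB": 0, "BC": 0}      # token counts on the two buffered channels

def fire(actor):
    """Fire an actor only if its firing rule is satisfied."""
    if actor == "A":
        buffers["AB"] += 3
    elif actor == "B" and buffers["AB"] >= 1:
        buffers["AB"] -= 1
        buffers["BC"] += 1
    elif actor == "C" and buffers["BC"] >= 3:
        buffers["BC"] -= 3
    else:
        raise RuntimeError("firing rule of " + actor + " not satisfied")

for actor in ["A", "B", "B", "B", "C"]:   # the static schedule A B B B C
    fire(actor)
print(buffers)                            # back to the initial state: {'AB': 0, 'BC': 0}
```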
24
Codesign Finite State Machine (CFSM)
Network of concurrently executing actors
Passive actors
Locally synchronous, globally asynchronous
A timed "event" causes the evaluation (firing) of an FSM.
Characteristics of CFSM: compile-time analyzable; suited to reactive systems.
25
Finite State Machine (FSM)
Example: a seat-belt alarm controller with states OFF, WAIT, and ALARM and ports Port_KEY, Port_BELT, Port_END, Port_START, Port_ALARM. KEY=ON starts the timer (START) and moves to WAIT; KEY=OFF or BELT=ON returns to OFF; END=5 switches ALARM=ON and moves to ALARM; END=10 or BELT=ON or KEY=OFF switches ALARM=OFF and returns to OFF.
An FSM may only have one state active at a time and has only a finite number of states. It is a more efficient way to describe sequential control, with formal semantics that allow verifying properties like safety, liveness, and fairness.
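A minimal sketch of this controller as an explicit transition table; the state and input names follow the slide, but the exact pairing of conditions to transitions and the dictionary encoding are an interpretation, not taken verbatim from the presentation.

```python
# Sketch of the seat-belt controller as an explicit transition table.
# The state and port/event names follow the slide, but the exact pairing
# of conditions to transitions is an interpretation.
TRANSITIONS = {
    ("OFF",   "KEY=ON"):  ("WAIT",  "START"),      # start the timer
    ("WAIT",  "KEY=OFF"): ("OFF",   None),
    ("WAIT",  "BELT=ON"): ("OFF",   None),
    ("WAIT",  "END=5"):   ("ALARM", "ALARM=ON"),
    ("ALARM", "END=10"):  ("OFF",   "ALARM=OFF"),
    ("ALARM", "BELT=ON"): ("OFF",   "ALARM=OFF"),
    ("ALARM", "KEY=OFF"): ("OFF",   "ALARM=OFF"),
}

def step(state, event):
    """Fire the FSM on one input event; exactly one state is active at a time."""
    return TRANSITIONS.get((state, event), (state, None))

state = "OFF"
for event in ["KEY=ON", "END=5", "BELT=ON"]:
    state, output = step(state, event)
    print(event, "->", state, output)
```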
26
Model of Architecture
A model of architecture is a formal representation of the operational semantics of networks of functional blocks describing architectures. A, B, C, and D are now hardware resources such as CPUs, busses, memories, and dedicated coprocessors.
A model of architecture is similar to a model of computation, but the focus is on the architecture instead of on the applications.
27
Programmable Communication Network
Examples, ordered by increasing complexity [1]:
CPU + bus + memory: control-dominated tasks, sequential, mutually exclusive, centralized computation (low complexity).
CPU + PE + memory: mixed control/data tasks, still largely sequential, centralized computation.
CPU + PE1 + PE2 + PE3 + memory over a programmable communication network: data-dominated tasks, parallel / DMA, dataflow, distributed computation (high complexity).
Models of architecture are less mature than models of computation.
[1] Wolf, Wayne, "A Decade of Hardware/Software Codesign," IEEE Computer, vol. 36, April 2003, pp. 38-43.
28
Conclusion: Matching Models
An architecture is described by a model of architecture, an application by a model of computation, and the data types exchanged on both sides must match as well. When the MoC and the MoA match (a "natural fit"), a simple mapping results.
29
Putting It Together Example 1.
Platform: microprocessor ("von Neumann architecture").
[Y-chart instance: applications such as the SPECint benchmarks and a machine description for architecture instances (Pentium/ARM, MIPS/Alpha) feed a compiler such as GCC, which produces the performance numbers.]
30
Putting It Together Example 1.
Micro Processor
Application: picture-in-picture, e.g.
for i=1:1:10, for j=1:1:10, A(i,j) = FIR(); end; end
for i=1:1:10, for j=1:1:10, A(i,j) = SRC(A(i,j)); end; end
The compiler maps this program onto memory (addresses), a program counter, an instruction decoder, and an ALU; a simulator delivers the performance numbers.
Model of Architecture: sequential (program counter), one item over the bus at a time, shared memory.
Model of Computation: sequential, shared memory.
Natural fit.
31
Putting It Together Example 2.
[Y-chart: applications and an architecture instance are mapped; performance analysis produces performance numbers.]
Matlab code (QR algorithm):
%parameter N 8 16;
%parameter K ;
for k = 1:1:K,
  for j = 1:1:N,
    [ r(j,j), x(k,j), t ] = Vectorize( r(j,j), x(k,j) );
    for i = j+1:1:N,
      [ r(j,i), x(k,i), t ] = Rotate( r(j,i), x(k,i), t );
    end
  end
end
Model of Architecture: task-level parallelism, heterogeneity, distributed memory.
Model of Computation: imperative, sequential execution, global memory.
NO natural fit.
32
Putting It Together Example 2.
[Y-chart as before, but the application is now given as a process network: a Source, processes P1-P4 and S1, and a Sink.]
Model of Architecture: task-level parallelism, heterogeneity, distributed memory.
Model of Computation: process networks, distributed memory, distributed control.
Natural fit.
33
Our Research Focus
Compaan converts the Matlab code (the QR algorithm shown on the previous slides) into a process network (Source, P1-P4, S1, Sink). Laura/ESPAM then maps this process network onto a platform with a microprocessor, CPU, RPU, IP core, and memory connected by a programmable interconnect (NoC).
34
Other Examples
Other examples of the "natural fit" concept:
VSP architecture: CSDF
Tangram: CSP
Polis system: CFSM
35
Conclusions
We will get billion-transistor ICs.
We already know how to build them, but not how to program them.
Current compilers do not take into account the notion of "natural fit" in terms of models of computation and models of architecture.
New compiler research is needed that takes different models of computation into account.