
1 Embedded Systems and Software Ed F. Deprettere, Todor Stefanov, Hristo Nikolov {edd, stefanov, nikolov}@liacs.nl Leiden Embedded Research Center Spring 2010; http://www.liacs.nl/~cserc/EMBSYST/ESSOFIA2010

2 Part II: Process Networks. Process networks are more general than dataflow graphs. Four models are covered: Communicating Sequential Processes (CSP), Kahn Process Networks (KPN), Dataflow Process Networks (DPN), and Polyhedral Process Networks (PPN).

3 What is the difference? CSP: typical control-type applications, not necessarily determinate; processes communicate by means of rendezvous. KPN: processes are functional when seen as maps from streams to streams, and are determinate. DPN: processes are functional maps from tokens to tokens. PPN: a special case of DPN (see later).

4 4/13/201504ESSOFIA Usage of KPNs The KPN model of computation is used to specify applications in a concurrent language. Processes are specified in a host language (C, C++, Java). The communication between processes is specified in a co-ordination language: blocking read. KPN is a convenient model for streaming data applications: audio, video, and multimedia in general. Processes operate on infinite streams of data, one quantum of data at a time, i.e., the streams need not be available as a whole.

5 Dataflow and Kahn Process Networks Recall: Actors in Dataflow Graphs are functional. Dataflow Graphs that operate on (unbounded) streams are called Dataflow Process Networks. In Dataflow Process Networks, the processes are repetitively firing functional actors that are guided by firing rules. They are globally scheduled. In Kahn Process Networks, the processes are threads. There are no firing rules, and there is no global schedule.

6 KPN: an example
Figure: process P1 ('producer') and process P2 ('consumer') connected by unbounded FIFO channels C1, C2, C3, and C4.
Process P1 ('producer'):
    while (1) {
        Read(C1, token);
        if (token != Token) {
            Write(C2, Execute(token));
        } else {
            Write(C3, token);
        }
    }
Process P2 ('consumer'):
    while (1) {
        Read(C2, token);
        Write(C4, Execute(token));
    }
The characteristic operation triplet is {Read, Execute, Write}. Execute refers to some abstract computational operator. Communication is point-to-point.
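The producer/consumer pair above can be sketched in Python with threads and blocking queues. This is only an illustration under assumptions that are not in the slides: the `STOP` end-of-stream marker, the concrete `execute` functions, and the choice of `0` as the special `Token` value are all hypothetical, and real KPN streams are unbounded.

```python
import queue
import threading

STOP = object()  # hypothetical end-of-stream marker (real KPN streams never end)

def p1(c1, c2, c3, special, execute):
    """Process P1: blocking read on C1, then write to C2 or C3."""
    while True:
        token = c1.get()                 # blocking read
        if token is STOP:
            c2.put(STOP)
            c3.put(STOP)
            return
        if token != special:
            c2.put(execute(token))
        else:
            c3.put(token)

def p2(c2, c4, execute):
    """Process P2: blocking read on C2, write the result to C4."""
    while True:
        token = c2.get()                 # blocking read
        if token is STOP:
            c4.put(STOP)
            return
        c4.put(execute(token))

def drain(channel):
    """Collect everything written to a channel up to the STOP marker."""
    out = []
    while True:
        token = channel.get()
        if token is STOP:
            return out
        out.append(token)

def run_network(inputs, special=0):
    """Run P1 and P2 as threads; return the streams seen on C3 and C4."""
    c1, c2, c3, c4 = (queue.Queue() for _ in range(4))  # unbounded FIFOs
    for token in inputs:
        c1.put(token)
    c1.put(STOP)
    threads = [
        threading.Thread(target=p1, args=(c1, c2, c3, special, lambda t: t + 1)),
        threading.Thread(target=p2, args=(c2, c4, lambda t: t * 2)),
    ]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return drain(c3), drain(c4)
```

Because each process only performs blocking reads and the channels are FIFO, the streams on C3 and C4 do not depend on how the threads are interleaved, which is the determinacy property attributed to KPNs earlier in the deck.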

7 Stream Based Function Model
Figure: a private memory, an A-gen, a function repertoire {f}, and a controller, connected to input and output channels; labeled operations are get, put, load, store, execute, and select.
State: S = C × D, with C ∩ D = ∅.
Controller transition function: ω: C × D → C, ω(c, d) = c'.
Binding function: μ: C → {f}, μ(c) = f.
Function repertoire: {f}. Each f binds to its own unique subset of input and output channels.

8 Mapping An application modeled in terms of a KPN is to be transformed (mapped or deployed) to a parallel multi-processor architecture. Figure: several PUs connected to a Shared Memory by a Bus.

9 Part II: applying it all. Overview. Figure labels: Applications, Sequential Process, KPN application model, FPGA platform, Communication Structure, PE, Mem, Component.

10 Converting C to KPN Model Most applications are (still) specified as imperative sequential programs in C, C++, or other host languages. In some cases, they can be automatically converted to input-output equivalent Kahn or Dataflow Process Networks. Process Networks are better suited for mapping on multi-processor execution platforms.

11 Translating and Mapping
Sequential Application Specification (EASY to specify, DIFFICULT to map):
    for j = 1:1:N,
        [x(j)] = Source1( );
    end
    for i = 1:1:K,
        [y(i)] = Source2( );
    end
    for j = 1:1:N,
        for i = 1:1:K,
            [y(i), x(j)] = F( y(i), x(j) );
        end
    end
    for i = 1:1:K,
        [Out(i)] = Sink( y(i) );
    end
Parallel Application Specification (DIFFICULT to specify, EASY to map): a process network with processes source1, source2, F, and sink.
A Translator goes from the sequential to the parallel specification; a Mapper maps the parallel specification onto the platform.
Figure: platform with a Programmable Interconnect (NoC), IP core, RPU, CPU, MicroProcessor, and Memories.

12 Affine Nested Loop Programs
From now on, the given sequential programs are static affine nested loop programs (for simplicity; some dynamic behavior is also possible).
nested loops: statements are surrounded by one or more loops →
    for k = k lower bound (parameters) : stride : k upper bound (parameters)
        for l = l lower bound (k, parameters) : stride : l upper bound (k, parameters)
static: no data dependent conditions.
affine: loop bounds, conditions, and variable index functions (see next page) are affine functions of the iterators and parameters.
f(x, y, z) is affine if it is of the form ax + by + cz + d, and linear if it is of the form ax + by + cz.

13 Affine Nested Loop Programs (2)
Loop bound: l = k+1 : 1 : N → the lower bound is k+1, or l – k – 1 ≥ 0; the upper bound is N, or N – l ≥ 0.
Condition: if l – k ≤ N → if k – l + N ≥ 0.
Variable indexing function: x(f(k,l)) is the variable with name x and indexing function f(k,l) → f(k,l) is affine (ak + bl + c).
Extensions to non-static conditions do exist.

14 Extensions
1. Affine nested loop programs, except for the fact that conditions may be data dependent and of any form. E.g., if f(x) ≥ y.
2. Affine nested loop programs with non-static parameters. Values of parameters may change (possibly from within the program) during execution. E.g.,
    for i = 1 : 1 : N,
        for j = 1 : 1 : M,
            [v(i), w(j), N] = f(v(i), w(j), M);

15 Affine Nested Loop Programs (3)
Structure of an affine nested loop program:
parameter range:
    % parameter N 20 100
initialization of input data, called sources:
    [x(n)] = Read_SourceX();
loops, conditions, and functions:
    for i = 1 : 1 : N,
        if i-2 ≥ 0,
            [y(i,j), x(i,j)] = f(y(i,j), x(i,j));
collecting output data, called sinks:
    [sink(y(i,j))] = Write(y(i,j));

16 Affine NLP – Example (2)
Main:
    for n = 1 : 1 : N+M-1,
        if n < M,
            for m = 1 : 1 : n,
                [y(n)] = y(n) + h(m).x(n-m+1);
            end
        elseif n > N,
            for m = n-(N-1) : 1 : M,
                [y(n)] = y(n) + h(m).x(n-m+1);
            end
        else
            for m = 1 : 1 : M,
                [y(n)] = y(n) + h(m).x(n-m+1);
            end
        end
    end
Compact form: y = y + h.x for n = 1 : 1 : N+M-1, m = max(1, n-(N-1)) : 1 : min(n, M).
Figure: node computing y = y + h.x with inputs y, h, x and outputs y, h, x.
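As a quick check of the example above, the following Python sketch compares the three-case loop nest with the compact max/min form. It assumes N ≥ M (so that the three cases are mutually exclusive), uses 0-based Python lists for the 1-based arrays of the slide, and the function names are made up for this illustration.

```python
def conv_cases(x, h):
    """y(n) = sum over m of h(m) * x(n-m+1), with bounds split into three cases."""
    N, M = len(x), len(h)            # assumes N >= M
    y = [0] * (N + M - 1)
    for n in range(1, N + M):        # n = 1 .. N+M-1
        if n < M:
            ms = range(1, n + 1)
        elif n > N:
            ms = range(n - (N - 1), M + 1)
        else:
            ms = range(1, M + 1)
        for m in ms:
            y[n - 1] += h[m - 1] * x[n - m]   # x(n-m+1) in 1-based indexing
    return y

def conv_compact(x, h):
    """Same computation with the bounds m = max(1, n-(N-1)) .. min(n, M)."""
    N, M = len(x), len(h)
    y = [0] * (N + M - 1)
    for n in range(1, N + M):
        for m in range(max(1, n - (N - 1)), min(n, M) + 1):
            y[n - 1] += h[m - 1] * x[n - m]
    return y
```

For example, `conv_cases([1, 2, 3], [1, 1])` and `conv_compact([1, 2, 3], [1, 1])` both give `[1, 3, 5, 3]`. All loop bounds are affine expressions (or max/min of affine expressions) in n, N, and M, which is what makes the program an affine nested loop program.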

17 From ANLP to KPN Converting ANLPs to input/output equivalent KPNs provides (equivalent) concurrent processing specifications that facilitate mapping onto parallel architectures. Because ANLPs are static, the corresponding KPNs are also static. They are in some sense similar to Cyclo-Static dataflow process networks. Global schedules can be derived, and the sizes of buffers can be determined, or at least an upper bound for them.

18 From ANLP to PN (2)
The conversion requires three steps:
1. Conversion to single assignment code (SAC) by dependency analysis. Variables in an ANLP may be assigned more than one value: e.g., x(i+j) may have different values for all i+j = c. In a SAC, each variable gets assigned only one value: e.g., x_1(i+1, j-1).
2. An intermediate compact dependence graph representation of the SAC.
3. Construction of the PN from the intermediate format.

19 Steps involved: overview
Matlab Program (or C, C++, Java):
    %parameter N 8 16;
    %parameter K 100 1000;
    for k = 1:1:K,
        for j = 1:1:N,
            [r(j,j), x(k,j), t] = F( r(j,j), x(k,j) );
            for i = j+1:1:N,
                [r(j,i), x(k,i), t] = G( r(j,i), x(k,i), t );
            end
        end
    end
Tool flow: Matlab Application → MatParser (Data Dependency Analysis) → Single Assignment Code (SAC) → DgParser → Polyhedral Reduced Dependence Graph (PRDG) → Panda (Linearization) → Kahn Process Network.
Figure: resulting process network with nodes inputSamples, initialR, F, G, and outputR.

20 Data Dependency Analysis
    for i = 1 : 1 : N,
        for j = 1 : 1 : N,
            [a(i+j)] = f( a(i+j) );
        end
    end
The for loops define a rectangular iteration domain. Each dot is an invocation of f(). The iterations follow the lexicographic schedule.
Dependency: a(i,j) → a(i-1,j+1). The consumer reads from the producer.
Figure: iteration domain for N = 6 with axes i and j, the line i+j = 6, and the dependency arrows along it.

21 Data Dependency Analysis (2)
    for i = 1 : 1 : N,
        for j = 1 : 1 : M,
            [x(g(I))] = F_1();
            [ ] = F_2(x(f(I)));
        end
    end
I = (i, j)^T is the iteration vector (iterators i and j). x(h(I)) is the variable with name x and indexing function h(I). Example: h(I) = [1 1] (i, j)^T = i + j. Observe that [1 1] has a null space.
Consumer F_2 is dependent on producer F_1 iff, in the domain {i,j | 1 ≤ i ≤ N ^ 1 ≤ j ≤ M},
(a) g(I_1) = f(I_2)
(b) I_1 <_l I_2 (<_l means lexicographically preceding: the producer comes before the consumer)
(c) I_1 is the lexicographically largest iteration satisfying (a) and (b)

22 Data Dependency Analysis (3)
Consumer F_2 is dependent on producer F_1 iff, in the domain,
(a) g(I_1) = f(I_2)
(b) I_1 <_l I_2 (<_l means lexicographically preceding)
(c) I_1 is the lexicographically largest iteration satisfying (a) and (b)
Dependency: d = I_1 – I_2 (the consumer takes from the producer).
Figure: iterations I_1 and I_2 in the (i, j) plane, connected by the dependency vector d.
The equations look like an (integer) linear programming problem, except for (b), which is not an affine expression. This problem can be overcome:

23 Data Dependency Analysis (4)
Consumer F_2 is dependent on producer F_1 iff, in the domain,
(a) g(I_1) = f(I_2)
(b) I_1 <_l I_2 (<_l means lexicographically preceding)
(c) I_1 is the lexicographically largest iteration satisfying (a) and (b)
I_1 <_l I_2 is either i_1 < i_2, or i_1 = i_2 and j_1 < j_2. This gives two sets of linear equations instead of one non-linear set. Of course, we have to add the range of the parameters, e.g., 30 ≤ N ≤ 100, M ≤ N.

24 Single Assignment Code
Original program:
    % parameter N 10 20;
    % parameter M 10 20;
    for i = 1 : 1 : N,
        for j = 1 : 1 : M,
            [a(i+j)] = f( a(i+j) );
        end
    end
Single assignment code:
    for i = 1 : 1 : N,
        for j = 1 : 1 : M,
            if i-2 ≥ 0,
                if j ≤ M-1,
                    [in_0] = ipd( a_1(i-1, j+1) );
                else
                    [in_0] = ipd( a(i+j) );
                end
            else
                [in_0] = ipd( a(i+j) );
            end
            [out_0] = f( in_0 );
            [a_1(i,j)] = opd( out_0 );
        end
    end
The three cases are: i ≥ 2 and j ≤ M-1; i ≥ 2 and j = M; i = 1.
ipd: input port domain; opd: output port domain → identity functions.
Figure: (i, j) iteration domain in which a(4) corresponds to a_1(1,3), a_1(2,2), and a_1(3,1).
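The dependency used in the single assignment code can be checked by brute force. The Python sketch below is not the analysis the tools perform (that is done symbolically, as integer linear programs); it simply searches, for each consumer iteration (i, j), for the lexicographically largest earlier iteration that wrote a(i+j). The function names are hypothetical.

```python
def last_writer(i, j, N, M):
    """Lexicographically largest (i', j') before (i, j) that writes a(i+j), or None."""
    best = None
    for ip in range(1, N + 1):
        for jp in range(1, M + 1):
            # Python compares tuples lexicographically, like the loop schedule.
            if (ip, jp) < (i, j) and ip + jp == i + j:
                best = (ip, jp)   # loops run in lexicographic order: last hit is largest
    return best

def matches_sac_conditions(N, M):
    """True if the producer is (i-1, j+1) exactly when i >= 2 and j <= M-1."""
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            expected = (i - 1, j + 1) if (i >= 2 and j <= M - 1) else None
            if last_writer(i, j, N, M) != expected:
                return False
    return True
```

When `last_writer` returns `None`, no earlier iteration produced the value, which corresponds to the branches that read the original a(i+j).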

25 Polyhedron
A polyhedron P is the intersection of a set of finitely many closed half-spaces:
    P = { x ∈ Q^n | Ax ≥ b, Cx = d }
where A is an integral k × n matrix, b is an integral k-vector, C is an integral l × n matrix, and d is an integral l-vector.
Figure: hyper-plane and half-space. The hyper-plane is ax = b; the half-space is ax ≥ b.

26 Polytopes
Informally: a multidimensional volume with flat faces (multidimensional extension of a polygon).
Formally: a bounded N-dimensional figure whose faces are hyperplanes.
Example: k = 1 : 1 : K, j = 1 : 1 : N, i = j : 1 : N gives
    [  1  0  0 ]             [  1 ]
    [  0  1  0 ]   [ k ]     [  1 ]
    [  0 -1  1 ] · [ j ]  ≥  [  0 ]
    [ -1  0  0 ]   [ i ]     [ -K ]
    [  0 -1  0 ]             [ -N ]
    [  0  0 -1 ]             [ -N ]
→ we only consider convex polytopes. f(x) is convex if f(λx_1 + (1-λ)x_2) ≤ λf(x_1) + (1-λ)f(x_2), for x_1 and x_2 in the domain of f and λ ∈ [0,1].
Figure: the polytope in (k, j, i) space with corner (1,1,1) and extent N along j and i; plot of a convex function f between x_1 and x_2 with the values f(x_1), f(x_2), and f(λx_1 + (1-λ)x_2).

27 Polytopes (2)
Example: k = 1 : 1 : K, j = 1 : 1 : N, i = j : 1 : N gives
    [  1  0  0 ]             [  1 ]
    [  0  1  0 ]   [ k ]     [  1 ]
    [  0 -1  1 ] · [ j ]  ≥  [  0 ]
    [ -1  0  0 ]   [ i ]     [ -K ]
    [  0 -1  0 ]             [ -N ]
    [  0  0 -1 ]             [ -N ]
More general: P(p) = { x | Ax ≥ Bp + d } where x is rational. The points of interest are still the integral points in the polytope, i.e., P(p) ∩ Z^n, which is of the form { I | AI ≥ Bp + d } with A, B, and d integral and p the parameter vector.
Each row is a half space: a_n^T I ≥ b_n^T p + d_n (the rows of A are normals to the half planes a_n^T I = b_n^T p + d_n).
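To make the correspondence between a loop nest and a system A·x ≥ b concrete, the Python sketch below enumerates the integer points of the example polytope and compares them with the iterations of the loop nest. The bounding box used for the search and the helper names are assumptions of this illustration, and brute-force enumeration is only practical for small parameter values.

```python
from itertools import product

def points_in_polytope(A, b, box):
    """Integer points x inside the search box with A x >= b (row by row)."""
    points = []
    for x in product(*(range(lo, hi + 1) for lo, hi in box)):
        if all(sum(a * v for a, v in zip(row, x)) >= rhs
               for row, rhs in zip(A, b)):
            points.append(x)
    return points

def loop_nest_polytope(K, N):
    """A and b for k = 1:1:K, j = 1:1:N, i = j:1:N with x = (k, j, i)."""
    A = [[1, 0, 0],    #  k      >= 1
         [0, 1, 0],    #  j      >= 1
         [0, -1, 1],   #  i - j  >= 0
         [-1, 0, 0],   # -k      >= -K
         [0, -1, 0],   # -j      >= -N
         [0, 0, -1]]   # -i      >= -N
    b = [1, 1, 0, -K, -N, -N]
    return A, b

def loop_nest_iterations(K, N):
    """The same points, generated by running the loop nest itself."""
    return [(k, j, i)
            for k in range(1, K + 1)
            for j in range(1, N + 1)
            for i in range(j, N + 1)]
```

With `A, b = loop_nest_polytope(2, 3)`, calling `points_in_polytope(A, b, [(0, 3), (0, 4), (0, 4)])` yields the same 12 points, in the same lexicographic order, as `loop_nest_iterations(2, 3)`.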

28 Polytopes (3)
Why is x rational? One of the faces is the line l: x_1 = -1/2 x_2 + 6, which enters the polytope as 2x_1 ≤ -x_2 + 12, or 2x_1 + x_2 ≤ 12. The point (x_1, x_2) = (9/2, 3) satisfies this equation (it lies on l) but is not an integral point (black dot). The nearest integral point is (4,3). Rational is sufficient because l passes through at least two integral points.
Figure: polytope in the (x_1, x_2) plane with origin (0,0) and face l.

29 Example
    for i = 1 : 1 : N,
        for j = 1 : 1 : N,
            [a(i+j)] = f( a(i+j) );
        end
    end
    [  1  0 ]               [  1 ]
    [  0  1 ]   [ x_1 ]     [  1 ]
    [ -1  0 ] · [ x_2 ]  ≥  [ -N ]
    [  0 -1 ]               [ -N ]
x is a rational vector; the dots are the intersection with Z^2.
Figure: iteration domain for N = 6 with axes i and j.

30 Dependence Graph
In SAC, variables get assigned a value only once.
→ ANLP: x(f(I)) with, e.g., f(I) = [1 1] (i, j)^T, i.e., x(i + j). [1 1] has a null space, spanned by μ = (1, -1)^T → f(I + aμ) = f(I).
→ SAC: x(f(I)) → x_1(Φ(I)) with Φ(I) = I or I + d: no null space; d is the dependency vector.
Variables x_1(Φ(I)) propagate from function call to function call. Example: [x_1(i,j)] = F( x_1(i-1, j+2) ).
This can be visualized graphically → leads to the dependence graph.
Figure: function call F in the (i, j) plane, consuming x_1(i-1, j+2) and producing x_1(i,j).

31 ANLP, SAC, and DG
ANLP:
    % parameter N 10 20;
    % parameter M 10 20;
    for i = 1 : 1 : N,
        for j = 1 : 1 : M,
            [a(i+j)] = f( a(i+j) );
        end
    end
SAC:
    for i = 1 : 1 : N,
        for j = 1 : 1 : M,
            if i-2 ≥ 0,
                if j ≤ M-1,
                    [in_0] = ipd( a_1(i-1, j+1) );
                else
                    [in_0] = ipd( a(i+j) );
                end
            else
                [in_0] = ipd( a(i+j) );
            end
            [out_0] = f( in_0 );
            [a_1(i,j)] = opd( out_0 );
        end
    end
DG (figure): (i, j) iteration domain with the three regions i ≥ 2 and j ≤ M-1; i ≥ 2 and j = M; i = 1. a(4) corresponds to a_1(1,3), a_1(2,2), and a_1(3,1).

32 Other example
Matlab Code:
    %parameter N 8 16;
    %parameter K 100 1000;
    for k = 1:1:K,
        for j = 1:1:N,
            [r(j,j), x(k,j), t] = F( r(j,j), x(k,j) );
            for i = j+1:1:N,
                [r(j,i), x(k,i), t] = G( r(j,i), x(k,i), t );
            end
        end
    end
SAC: shown as a figure on the slide.

33 Polyhedral Reduced Dependence Graph
A function call with its surrounding loops forms a polytope and becomes a Node (in fact a node domain) in the reduced DG.
Figure: PRDG with Nodes A, B, C, D, and E; edge labels r (at k = 1), x, r_1, x_1, t_1, r_1(K,j,j), and r_1(K,j,i).

34 PRDG (2)
The Nodes in the PRDG have Ports (input and output), which are also polyhedral domains. Example: the input Port of (yellow) Node C for variable r_1 corresponds to all r_1 input ports of the atomic yellow function calls in the SAC or DG. Port domains are subsets of the Node domain.
Figure: PRDG with Nodes A, B, C, D, and E; edge labels r (at k = 1), x, r_1, x_1, t_1, r_1(K,j,j), and r_1(K,j,i).

35 PRDG (3)
The arrows are called Channels. A Channel is directed from an output Port (domain) of a Node (domain) to an input Port (domain) of another or the same Node (domain). There is an affine mapping function from points in the input Port to points in the output Port, which is the dependency function from the SAC or DG (opposite to the token flow direction). Mapping function + input Port domain defines the output Port domain.
Example (Node D, variable x_1): out: x_1(k,j,i); in: x_1(k,j-1,i); mapping function: (k,j-1,i) = (k,j,i) + (0,-1,0) (consumer (k,j,i) takes from producer (k,j-1,i)).

36 PRDG (4)
The SAC is in output normal form: output variables are always of the form v(I), where I is the iteration vector. The SAC does not tell where they are sent. This follows from the input Port domain and the mapping function.
Example: input Port domain { j = 2:1:N-1 ^ i = j+1:1:N }, mapping function (-1, 0), output Port domain { j = 1:1:N-2 ^ i = j+1:1:N }.
Example (Node D, variable x_1): out: x_1(k,j,i); in: x_1(k,j-1,i); mapping function: (k,j-1,i) = (k,j,i) + (0,-1,0) (consumer (k,j,i) takes from producer (k,j-1,i)).
Figure: Polytope "C" and Polytope "D" with channels x, x_1, r_1, and t_1.

37 Producer Consumer Pair
Producer with Node N_p: domain 1 ≤ j_2 ≤ N ^ j_2 ≤ j_1 ≤ N, and Node function [x_1(j_2, j_1), r_1(j_2, j_1)] = f( );
Consumer with Node N_c: domain 1 ≤ i_1 ≤ N ^ 1 ≤ i_2 ≤ i_1, and Node function [ ] = g( x_1(i_1, i_2), r_1(i_1, i_2) );
To each input (output) variable corresponds an input (output) Port and Port domain. Shown here are the output Port domain (left gray triangle) for variable x_1 and the input Port domain (right gray triangle) for variable x_1.
Producer schedule:
    for j2 = 1 : 1 : N
        for j1 = j2 : 1 : N
Consumer schedule:
    for i1 = 1 : 1 : N
        for i2 = 1 : 1 : i1
Figure: Node domains N_P (axes j2, j1) and N_C (axes i1, i2) for N = 6, related by the mapping M_x1; N_p writes to and N_c reads from the x_1 channel through their ports.

38 Producer Consumer Pair (2)
(j_2, j_1)^T = M_x1 (i_1, i_2)^T is the (dependency) affine mapping function. Here:
    [ j_2 ]   [ 0  1 ]   [ i_1 ]   [ -1 ]
    [ j_1 ] = [ 1  0 ] · [ i_2 ] + [  0 ]
Example: right (4,4) → left (3,4).
The consumer takes from the producer 'function'. But, of course, producer tokens are sent to the consumer.
Producer schedule:
    for j2 = 1 : 1 : N
        for j1 = j2 : 1 : N
Consumer schedule:
    for i1 = 1 : 1 : N
        for i2 = 1 : 1 : i1
Figure: Node domains N_P and N_C for N = 6, related by the mapping M_x1; N_p writes to and N_c reads from the x_1 channel through their ports.

39 Linearization
Tokens are sent from Producer to Consumer over a linear (FIFO) Channel buffer. However, the corresponding produced and consumed variables are multidimensional:
    [x1(j2, j1), r1(j1, j2)] = g( x1(i1-1, i2), r1(i1, i2) );
This is because the P and C schedules are loop nests:
    Producer: for j2 = 1 : 1 : N, for j1 = j2 : 1 : N
    Consumer: for i1 = 1 : 1 : N, for i2 = 1 : 1 : i1
A schedule is a linear ordering: { (j_2, j_1) } → {k}. Nevertheless, I shall show that we can get these maps by means of polynomials:
Figure: triangular iteration domain i = 1:1:N, j = i:1:N.

40 Linearization (2)
For the given domain {(i,j) | 1 ≤ i ≤ N ^ i ≤ j ≤ N}, and the given lexicographic order
    for i = 1 : 1 : N,
        for j = i : 1 : N,
there exists a (pseudo) polynomial E(i,j) such that, if (i',j') is the lexicographically k-th vector, then E(i',j') = k. Pseudo polynomials are defined on the next slide. Because the polynomial E(i,j) represents a ranking of vectors, we call it the ranking polynomial. The underlying theory is polynomial counting of integral points in polytopes.
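For this triangular domain the ranking polynomial can be written down and checked directly. The Python sketch below uses the closed form E(i,j) = -1/2·i² + i·(1/2 + N) + j - N, which is the production polynomial given later in the deck with (j2, j1) renamed to (i, j); the comparison against explicit enumeration is an addition made for illustration.

```python
from fractions import Fraction

def rank_poly(i, j, N):
    """Ranking polynomial E(i, j) for the domain 1 <= i <= N, i <= j <= N."""
    return Fraction(-1, 2) * i * i + i * (Fraction(1, 2) + N) + j - N

def rank_by_enumeration(N):
    """Map each point (i, j) to its position k in the lexicographic schedule."""
    ranks = {}
    k = 0
    for i in range(1, N + 1):
        for j in range(i, N + 1):
            k += 1
            ranks[(i, j)] = k
    return ranks
```

For N = 6 the two agree on all 21 points; for instance, both give rank 13 for the point (3, 4).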

41 Polynomial counting
If P(p) is a parameterized polytope in Q^d, then the number of points in P(p) ∩ Z^d is called the enumerator of P(p). It is a polynomial or a pseudo-polynomial, and is called the Ehrhart polynomial E(p).
Example: for p = 2q: E(p) = p/2 + 1; for p = 2q + 1: E(p) = p/2 + 3/2. Hence E(p) = ½·p + [1, 3/2]_p is a pseudo-polynomial (l = 2).
The function c(p) from Z to Q with c(p) = c(p mod l) is called a periodic coefficient with period l. The l possible values are made explicit by representing c(p) as an indexed l-array [c_0, c_1, …, c_{l-1}]_p → if (p mod l) = k, then c(p) = c_k.
Figure: integer points 0 to 4 on a line, for p = 4 and p = 5.
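The periodic coefficient in the example can be made concrete in a few lines of Python. The pseudo-polynomial E(p) = ½·p + [1, 3/2]_p is taken from the slide; the polytope it is compared against, {x | 0 ≤ 2x ≤ p + 1}, is an assumption chosen for this sketch because its integer-point count reproduces that formula, and it is not necessarily the polytope drawn on the original slide.

```python
from fractions import Fraction

# Periodic coefficient [c_0, c_1] with period l = 2, as in the slide's example.
PERIODIC = [Fraction(1), Fraction(3, 2)]

def ehrhart(p):
    """Pseudo-polynomial E(p) = p/2 + [1, 3/2]_p."""
    return Fraction(p, 2) + PERIODIC[p % len(PERIODIC)]

def count_points(p):
    """Integer points of the assumed 1-D polytope {x | 0 <= 2x <= p + 1}."""
    return sum(1 for x in range(0, p + 2) if 0 <= 2 * x <= p + 1)
```

The coefficient selected by `p % 2` absorbs the rounding that a plain polynomial in p cannot express, which is why the enumerator is called a pseudo-polynomial.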

42 Theorem
The number of integer points in a parameterized polytope is given as a pseudo-polynomial iff the polytope is an affine-vertex polytope. P(p) with vertex set {v_i(p)} is an affine-vertex polytope when v_i(p) = M_i p + m_i, with M_i a rational matrix and m_i a rational vector, and all v_i(p) valid for the whole parameter range. If a polytope is not an affine-vertex polytope, then it has to be partitioned into a number of affine-vertex polytopes, and a pseudo-polynomial can be derived for each of them.
Figure: polytope bounded by half-spaces ax ≥ b, with a vertex v.

43 Theorem (2)
Let P(p) = {x ∈ Q^d | Ax ≥ Bp + d} be an affine-vertex polytope. The enumerator E(p) of P(p) is a pseudo-polynomial of degree d and pseudo-period equal to the denominator of P(p). The dimension of the pseudo-coefficients is equal to the dimension of p. The denominator of P(p) is the least common multiple of the denominators of its vertices. The denominator of a vertex v(p) is the least common multiple of the denominators of its co-ordinates.

44 Polynomial counting (2)
P(p,q) = {(x_1, x_2) | 0 ≤ x_2 ≤ 1/2 q ^ 2x_2 ≤ x_1 ≤ 2x_2 + 1/2 p}, p,q ≥ 0.
There are 24 unknowns: set up a set of 24 equations with 24 particular values of E(p,q) for particular values of p and q. For example, E(p + Δp, q + Δq) with (p,q) = (0,0), (2,0), (4,0), (0,2), (0,4), and (2,2), and Δp, Δq ∈ {0,1}.
Figure: the polytope in the (x_1, x_2) plane with vertices (0,0), (1/2 p, 0), (q, 1/2 q), and (q + 1/2 p, 1/2 q).

45 General polytope
P(p) = {(x_1, x_2) ∈ Q^2 | 0 ≤ x_2 ≤ 4 ^ x_2 ≤ x_1 ≤ x_2 + 9 ^ x_1 ≤ p ^ p ≤ 40}
Four affine-vertex polytopes:
    0 ≤ p ≤ 4: {v_1, v_2, v_6} = {(0,0), (p,0), (p,p)}
    4 ≤ p ≤ 9: {v_1, v_2, v_7, v_8} = {(0,0), (p,0), (4,4), (p,4)}
    9 ≤ p ≤ 13: {v_1, v_3, v_4, v_7, v_8} = {(0,0), (9,0), (p,p-9), (4,4), (p,4)}
    13 ≤ p ≤ 40: {v_1, v_3, v_5, v_7} = {(0,0), (9,0), (13,4), (4,4)}
Four polynomials, one per range: 0 ≤ p ≤ 4, 4 ≤ p ≤ 9, 9 ≤ p ≤ 13, 13 ≤ p ≤ 40.
Figure: the polytope in the (x_1, x_2) plane with vertices v_1 to v_8, the moving bound x_1 ≤ p, and axis marks 2, 4, 9, and 13.

46 What is to be counted?
How many times has a function been fired before it is invoked at point (i,j) in its function domain D(i,j)?
Figure: Node domains N_P and N_C for N = 6, related by the mapping M_x1. Producer schedule: for j2 = 1 : 1 : N, for j1 = j2 : 1 : N. Consumer schedule: for i1 = 1 : 1 : N, for i2 = 1 : 1 : i1.

47 What is to be counted?
If the m-th invocation of function f_c in the consumer domain has to consume a token produced by the n-th invocation of function g_p in the producer domain, what is n, given m? Recall that the destination (address) is not given.
Figure: Node domains N_P and N_C for N = 6, related by the mapping M_x1. Producer schedule: for j2 = 1 : 1 : N, for j1 = j2 : 1 : N. Consumer schedule: for i1 = 1 : 1 : N, for i2 = 1 : 1 : i1.

48 What is to be counted?
Recall that producer-consumer communication is through FIFO buffers.
Figure: FIFO buffer holding tokens 5, 4, 3, 2, 1, with a question mark at the consumer side.

49 Ranking polynomials
A ranking polynomial is a polynomial counting the lexicographically ordered points (j2,j1), resp. (i1,i2).
Example: (j2,j1) = (3,4) → 13; (i1,i2) = (4,3) → 9.
Producer schedule: for j2 = 1 : 1 : N, for j1 = j2 : 1 : N. Consumer schedule: for i1 = 1 : 1 : N, for i2 = 1 : 1 : i1.
Figure: producer-consumer pair N_P, N_C for N = 6; N_p writes to and N_c reads from the x_1 channel through their ports.

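50 Ranking polynomials (2)
Take the Producer.
# points in the shaded triangle is ½ j_2 (j_2 + 1).
# points in the shaded rectangle is (j_2 – 1)(N – j_2).
These are all lexicographically less than or equal to (j_2, j_2).
# remaining points up to and including (j_2, j_1) is (j_1 – j_2).
Summing the three counts gives rank(j_2, j_1) = ½ j_2 (j_2 + 1) + (j_2 – 1)(N – j_2) + (j_1 – j_2) = -½ j_2² + j_2 (½ + N) + j_1 – N.
Producer schedule: for j2 = 1 : 1 : N, for j1 = j2 : 1 : N. Consumer schedule: for i1 = 1 : 1 : N, for i2 = 1 : 1 : i1.
Figure: producer-consumer pair N_P, N_C for N = 6.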

51 Production and consumption polynomials
The producer writes in order to the producer-consumer channel. Therefore, the production (or write) polynomial is the same as the ranking polynomial:
    p(j2,j1) = rank(j2,j1) = -1/2 j2*j2 + j2*(1/2+N) + j1 - N
The consumer reads as dictated by the consumer-producer mapping function (j2,j1) = M(i1,i2). Suppose that M is the skew identity,
    [ j_2 ]   [ 0  1 ]   [ i_1 ]
    [ j_1 ] = [ 1  0 ] · [ i_2 ]
then the consumption polynomial is
    c(i1,i2) = p(j2=i2, j1=i1) = -1/2 i2*i2 + i2*(1/2+N) + i1 - N
while the consumer ranking polynomial is
    rank(i1,i2) = 1/2 i1*i1 - 1/2 i1 + i2
Recall: the consumer reads from the channel in the same order as the producer writes to the channel, because the channel is a FIFO buffer: the reading order may be different from the consuming order.
Figure: producer-consumer pair N_P, N_C for N = 6, related by the mapping M_r1.

52 Production and consumption polynomials (2)
Recall: the consumer reads from the channel in the same order as the producer writes to the channel, because the channel is a FIFO buffer: the reading order may be different from the consuming order. This will be so when the consumer ranking polynomial is not equal to the consumption polynomial (the k-th function call does not consume the k-th sent token), as is the case here: c(i_1,i_2) = -1/2 i_2*i_2 + i_2*(1/2+N) + i_1 - N differs from rank(i_1,i_2) = 1/2 i_1*i_1 - 1/2 i_1 + i_2.
Figure: producer-consumer pair N_P, N_C for N = 6, related by the mapping M_r1.

53 Consuming in-order/out-of-order If the consumer ranking polynomial is equal to c(i1,i2), then consuming is in order, that is, a token read from the channel is immediately consumed. Otherwise, consuming is out of order, that is, a token read from the channel is not necessarily immediately consumed, hence must be stored in private memory until it is needed for consumption.

54 Consuming in-order/out-of-order (2)
c(1,1) = 1 → consume 1st token = read 1st token
c(2,1) = 2 → consume 2nd token = read 2nd token
c(2,2) = 7 → read and store tokens 3 to 6, and read and consume token 7
Figure: N_p sending tokens 1 to 8 through the FIFO channel to N_c.
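The reordering described above can be simulated in a few lines of Python. This is a sketch under the assumptions of the skew-identity example: tokens are identified by their production rank, the whole triangular domain communicates (no port-domain conditions), and the names `private` and `read_count` are made up for the illustration.

```python
from collections import deque

def c(i1, i2, N):
    """Consumption polynomial c(i1, i2) = -1/2*i2^2 + i2*(1/2 + N) + i1 - N.

    Evaluated in integers: i2 * (2N + 1 - i2) is always even.
    """
    return (i2 * (2 * N + 1 - i2)) // 2 + i1 - N

def simulate_consumer(N):
    """Return the order in which tokens are consumed and the peak private memory use."""
    total = N * (N + 1) // 2
    fifo = deque(range(1, total + 1))   # token k is the k-th token written
    private = {}                        # private memory: rank -> token
    read_count = 0
    consumed, peak = [], 0
    for i1 in range(1, N + 1):
        for i2 in range(1, i1 + 1):
            need = c(i1, i2, N)
            while read_count < need:    # FIFO: tokens can only be read in order
                read_count += 1
                private[read_count] = fifo.popleft()
            peak = max(peak, len(private))
            consumed.append(private.pop(need))  # consume and release the location
    return consumed, peak
```

For N = 6 the consumed sequence starts 1, 2, 7, matching the three cases listed above: tokens 3 to 6 are read before token 7 and wait in private memory until their own consumer iterations come up.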

55 Consumer structure
Various types of private memory.
Figure: consumer with private memory, A-gen, function repertoire {f}, and controller, connected to a channel; labeled operations are get, put, load, store, execute, and select.

56 Structure of P-C pair (gray areas are port domains)
process P(double out wp1):
    for j2 = 1 to N
        for j1 = j2 to N
            [out] = f(…);
            if (j2 + 1 ≤ j1)
                write(wp1, out);
            end
        end
    end
process C(double in rp1):
    for i1 = 1 to N
        for i2 = 1 to i1
            if (2 ≤ i2)
                while (l <= c(i1,i2))
                    x(l++) = read(rp1);
                end
                in = x(c(i1,i2));
            end
            … = g(in);
        end
    end
wp1 is the write port for x_1; rp1 is the read port for x_1.
Producer Consumer Network N:
    double channel ch1;
    P(ch1) par C(ch1);
Figure: Node domains N_P and N_C for N = 6 with port domains I_opd and I_ipd, related by the mapping M_x1; P and C connected by a FIFO buffer between wp1 and rp1.

57 Structure of P-C pair (2)
The two if conditions define the gray (write resp. read) subdomains of the Node domains. The while condition models the out-of-order consumption and the empty-channel blocking mechanism.
process P(double out wp1):
    for j2 = 1 to N
        for j1 = j2 to N
            [out] = f(…);
            if (j2 + 1 ≤ j1)
                write(wp1, out);
            end
        end
    end
process C(double in rp1):
    for i1 = 1 to N
        for i2 = 1 to i1
            if (2 ≤ i2)
                while (l <= c(i1,i2))
                    x(l++) = read(rp1);
                end
                in = x(c(i1,i2));
            end
            … = g(in);
        end
    end
Figure: N_p sending tokens 1 to 8 to N_c; X-array with locations 1 to 8 showing read and consume positions; Node domains for N = 6 with port domains I_opd and I_ipd, related by the mapping M_x1.

58 Summary
The number of integer points in an affine-vertex polytope, lexically ordered, is a pseudo-polynomial, called the Ehrhart polynomial. Three Ehrhart polynomials are important:
ranking polynomial: rank(J), if the integer points represent atomic functions, atomic function output ports, or atomic function input ports.
production polynomial: p(J), equal to the output Port ranking polynomial.
consumption polynomial: c(I), equal to p(J = MI + m), where J = MI + m is the consumer-to-producer affine mapping or dependency function.
If the input Port ranking function is equal to c(I), then consumption is in order: tokens are consumed in the order they have been produced. Otherwise, the consumption is out of order.

59 Summary (2)
Example: j_2 is the outer loop of the producer, i_1 the outer loop of the consumer. The left gray area is the output Port domain; the right gray area is the input Port domain.
    [ j_2 ]   [ 0  1 ]   [ i_1 ]   [ -1 ]
    [ j_1 ] = [ 1  0 ] · [ i_2 ] + [  0 ]
output Port ranking: rank(j_2, j_1)
input Port ranking: rank(i_1, i_2)
production polynomial: p(j_2, j_1) = rank(j_2, j_1)
consumption polynomial: c(i_1, i_2) = p(j_2 = i_2 - 1, j_1 = i_1)
Consumption is out of order.
Figure: Node domains N_P and N_C for N = 6 with port domains I_opd and I_ipd, related by the mapping M_x1.

60 Multiplicity
If p consecutive tokens sent by the producer have equal value, then this token is sent only once and is said to have multiplicity p. The consumer then stores that token in private memory and consumes it p times, after which the storage location is released. There are thus 4 cases:
in-order without multiplicity (IOM-)
in-order with multiplicity (IOM+)
out-of-order without multiplicity (OOM-)
out-of-order with multiplicity (OOM+)
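The in-order case with multiplicity can be sketched in Python as run-length coding of the token stream. The function names and the (value, p) pair used on the channel are assumptions of this illustration; the slides do not say how the multiplicity is conveyed to the consumer.

```python
from itertools import groupby

def send_with_multiplicity(tokens):
    """Producer side: each run of equal consecutive tokens is sent once, as (value, p)."""
    return [(value, len(list(run))) for value, run in groupby(tokens)]

def consume_with_multiplicity(channel):
    """Consumer side: store each token once and consume it p times."""
    consumed = []
    for value, p in channel:
        stored = value              # token kept in private memory
        for _ in range(p):
            consumed.append(stored)
        # the storage location is released here, after p consumptions
    return consumed
```

For the stream 7, 7, 7, 2, 5, 5 only three tokens cross the channel, with multiplicities 3, 1, and 2, while the consumer still sees all six values in the original order.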

61 IOM-, IOM+, OOM-, and OOM+ Examples
Figure: four producer-consumer examples, labeled IOM-, IOM+, OOM-, and OOM+, over iteration domains i = 1 : 1 : 4, j = 1 : 1 : 4 and j_1 = 1 : 1 : 4, j_2 = j_1 : 1 : 4.

62 Polynomial evaluation
c(i,j) = -1/2 j*j + j*(1/2+N) + i - N
c(i,j) is linear in i → c(i,j) = c(0,j) + i.
c(0,j) is not linear: how to avoid multiplications? Answer: use the method of differences.
The first difference is of degree one less than the degree of the polynomial. The second difference is of degree one less than the degree of the first difference. Eventually, the n-th difference is constant.

63 Polynomial evaluation (2)
Polynomials can be evaluated inexpensively by using the method of differences.
    c(i,j) = -1/2 j² + j (N + 1/2) - N + i = c(0,j) + i
Define
    Δ_1(0,j) = c(0,j+1) - c(0,j) = N - j
    Δ_2(0,j) = Δ_1(0,j+1) - Δ_1(0,j) = -1
Then
    Δ_1(0,j+1) = Δ_1(0,j) + Δ_2(0,j)
    c(0,j+1) = c(0,j) + Δ_1(0,j)
Example on the slide: N = 6.
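A short Python sketch of the method of differences for c(0,j) = -1/2·j² + j·(N + 1/2) - N follows. The starting values c(0,1) = 0 and Δ1(0,1) = N - 1 follow from the formula; the function names are hypothetical, and the closed form is included only to check the incremental version.

```python
def c_closed(i, j, N):
    """c(i, j) = -1/2*j^2 + j*(N + 1/2) - N + i, evaluated in integers."""
    return (j * (2 * N + 1 - j)) // 2 - N + i

def c_column_by_differences(N, count):
    """Return [c(0,1), ..., c(0,count)] using additions only."""
    values = []
    c0 = 0          # c(0, 1) = 0
    d1 = N - 1      # first difference at j = 1: c(0,2) - c(0,1) = N - 1
    d2 = -1         # second difference, constant
    for _ in range(count):
        values.append(c0)
        c0 += d1    # c(0, j+1) = c(0, j) + d1(j)
        d1 += d2    # d1(j+1)   = d1(j) + d2
    return values
```

For N = 6 this yields 0, 5, 9, 12, 14, 15 for j = 1 to 6, and c(i,j) is then obtained with one more addition, c(0,j) + i; for example c(2,2) = 5 + 2 = 7.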

64 Polynomial evaluation (3) → additions only
Figure: datapath computing c(0,j) with two register-adder stages (one register loads N if j = 1, the other loads 0 if j = 1), followed by an adder that adds i.

65 Transformations
    for j = 1:1:N,
        [x(j)] = Source1( );
    end
    for i = 1:1:K,
        [y(i)] = Source2( );
    end
    for j = 1:1:N,
        for i = 1:1:K,
            [y(i), x(j)] = F( y(i), x(j) );
        end
    end
    for i = 1:1:K,
        [Out(i)] = Sink( y(i) );
    end
Alternatives? Generate alternative application instances, map them, and explore.
Figure: platform with a Programmable Interconnect (NoC), IP core, RPU, CPU, MicroProcessor, and Memories.

66 Alternatives
Apply transformations on graphs or source code, to:
increase parallelism
reduce parallelism
increase throughput
reduce power consumption

67 Examples of transformations
Unrolling or unfolding: data parallelism (single instruction, multiple data)
Skewing: retiming (postpone operation to next period)
Merging: sequentializing

68 Unfolding/unrolling
    %parameter N 100 1000;
    %parameter K 8 48;
    for j = 1:1:N,
        for i = 1:1:K,
            [y(i), x(j)] = F( y(i), x(j) );
        end
    end
U = [u_1, u_2] → unroll the outer loop with factor u_1 and the inner loop with factor u_2.
Example: u_1 = 2, u_2 = 1
    for j = 1 : 1 : N,
        if mod(j, 2) = 0,
            for i = 1 : 1 : K,
                ………….
        else % if mod(j, 2) = 1,
            for i = 1 : 1 : K,
                ……………
        end

69 Unrolling/Unfolding (2)
    %parameter N 100 1000;
    %parameter K 8 48;
    for j = 1:1:N,
        for i = 1:1:K,
            [y(i), x(j)] = F( y(i), x(j) );
        end
    end
MatTransform, U = [2, 1]:
    for j = 1:1:N,
        if mod( j, 2 ) = 1,
            for i = 1:1:K,
                [y(i), x(j)] = F( y(i), x(j) );
            end
        end
        if mod( j, 2 ) = 0,
            for i = 1:1:K,
                [y(i), x(j)] = F( y(i), x(j) );
            end
        end
    end
Figure: two grids of F nodes over axes j and i with inputs x(1) to x(4) and y(1) to y(3); labels: Compaan, U = [N, K], "Difficult to derive".

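70 Retiming/skewing
    %parameter N 100 1000;
    %parameter K 8 48;
    for i = 1:1:N,
        for j = 1:1:K,
            [y(i), x(j)] = F( y(i), x(j) );
        end
    end
Skewing transformation:
    [ i' ]   [ 1  1 ]   [ i ]
    [ j' ] = [ 0  1 ] · [ j ]
Figure: the original (i, j) domain of size N × K and the skewed (i', j') domain, with axis marks 1, 2, K, N, and N+K.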

71 Skewing
Original:
    %parameter N 100 1000;
    %parameter K 8 48;
    for j = 1:1:N,
        for i = 1:1:K,
            [y(i), x(j)] = F( y(i), x(j) );
        end
    end
Skewed:
    for j = 2:1:N+K,
        for i = max(1, j-N):1:min(j-1, K),
            [y(i), x(j-i)] = F( y(i), x(j-i) );
        end
    end
Bounds as listed on the slide: j = 2 : 1 : N+K; if j < K, i = 1 : 1 : j; else if j < N, i = j-(N-1) : 1 : K; else i = 1 : 1 : K.
Figure: skewed iteration domain with axes j and i and axis marks 2, K, and N+K.
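A small Python check that the skewed loop nest with the max/min bounds visits exactly the iterations of the original nest, only in a different order. The function names are hypothetical; each iteration is recorded as the pair (index of x, index of y).

```python
def original_iterations(N, K):
    """(j, i) pairs visited by: for j = 1:N, for i = 1:K."""
    return [(j, i) for j in range(1, N + 1) for i in range(1, K + 1)]

def skewed_wavefronts(N, K):
    """Iterations of the skewed nest, grouped by the new outer index js = j + i."""
    fronts = []
    for js in range(2, N + K + 1):
        front = [(js - i, i)                       # original j is js - i
                 for i in range(max(1, js - N), min(js - 1, K) + 1)]
        fronts.append(front)
    return fronts

def same_iteration_set(N, K):
    skewed = [it for front in skewed_wavefronts(N, K) for it in front]
    return sorted(skewed) == sorted(original_iterations(N, K))
```

Within one wavefront (fixed outer index) every iteration touches a different x(j) and a different y(i), so under this reading of the example the iterations of a wavefront do not depend on each other.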

72 Skewing + Unfolding
Original:
    %parameter N 100 1000;
    %parameter K 8 48;
    for j = 1:1:N,
        for i = 1:1:K,
            [y(i), x(j)] = F( y(i), x(j) );
        end
    end
Skewing matrix:
    M = [ m_11  m_12 ] = [ 1  1 ]
        [ m_21  m_22 ]   [ 0  1 ]
Unfolding vector: U = [u_1, u_2] = [2, 1]
Result:
    for j = 2:1:N+K,
        if mod( j, 2 ) = 1,
            for i = max(1, j-N):1:min(j-1, K),
                [y(i), x(j-i)] = F( y(i), x(j-i) );
            end
        end
        if mod( j, 2 ) = 0,
            for i = max(1, j-N):1:min(j-1, K),
                [y(i), x(j-i)] = F( y(i), x(j-i) );
            end
        end
    end
Figure: three grids of F nodes over axes j and i with inputs x(1) to x(4) and y(1) to y(3); labels: Compaan, "Difficult to derive".

73 Typical Architectures (1)
p-x can be an ISA micro-processor or a dedicated Read/Execute/Write module.
Figure: processors p1 to p-n, each with program/data memory, a communication controller (controller1 to Controller-n), and communication memory; a progr./data bus; a data/control (crossbar) communication component.

74 Typical Architectures (2)
Figure: processors p1 to p-n, each with program/data memory, a communication controller, and communication memory; a progr./data bus; a communication component (cc) with FIFOs; a module with input ports IP1 and IP2, output ports OP1 and OP2, and read, execute, write, and control blocks.

75 Typical Architectures (3)
Also: Hierarchical Memory.
Figure: processors p1 to p-n, each with program/data memory, a communication controller, and communication memory; a progr./data bus; a data/control (crossbar) communication component; level-2 memory with a level-2 data memory controller; Large FIFO.

76 Daedalus
Figure: Daedalus tool flow. Inputs and tools: Application, KPNgen (KPN in XML), Sesame (High-Level Performance Analysis and Exploration), Library of IP cores, Platform in XML, Mapping in XML, ESPAM. Outputs of ESPAM: platform topology description, C/C++ code for processors (program code for Processor 1, 2, 3), IP cores in VHDL, and auxiliary files, passed to the Xilinx Platform Studio (XPS) Tool for a VirtexII-Pro FPGA. Performance Model Calibration/Validation links the real platform back to the model.
Plots: Simulated Performance Numbers (1 hour) and Real Performance Numbers (1 day), each showing millions of cycles against Nr. of MicroBlazes and Nr. of Processors.

