DSP Design – Lecture 7 Unfolding cont. & Folding Fredrik Edman fredrik


1 DSP Design – Lecture 7 Unfolding cont. & Folding Fredrik Edman fredrik

2 Unfolding (Repetition)
Unfolding creates a program that describes more than one iteration of the original program; J is the unfolding factor.
Unfolding is a structured way to achieve parallel processing.
Applications: revealing hidden concurrencies so that the program can be scheduled to a smaller iteration period; parallel processing; bit-serial and digit-serial architectures.
Unfolding corresponds to loop unrolling in assembly programming and compiler theory.

3 General Algorithm for Unfolding (Repetition)
Step 1. For each node U in the original DFG, draw J nodes U0, U1, U2, …, UJ-1.
Step 2. For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays, for i = 0, 1, …, J-1.
Example (figure): an edge U → V with 37 delays (37D) unfolded with J = 4 gives U0 → V1, U1 → V2 and U2 → V3 with 9D each, and U3 → V0 with 10D.
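The two steps can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the `Edge` tuple and the `unfold` function are hypothetical names, and a DFG is assumed to be given simply as a list of edges with delay counts.

```python
from collections import namedtuple

# Hypothetical representation: an edge src -> dst carrying `delays` delays.
Edge = namedtuple("Edge", "src dst delays")

def unfold(edges, J):
    """J-unfold a DFG given as a list of edges.

    Node U becomes U0..U(J-1); edge U -> V with w delays becomes the
    J edges Ui -> V((i+w)%J) with floor((i+w)/J) delays.
    """
    result = []
    for e in edges:
        for i in range(J):
            result.append(Edge(f"{e.src}{i}",
                               f"{e.dst}{(i + e.delays) % J}",
                               (i + e.delays) // J))
    return result

# The 37D edge unfolded by J = 4.
unfolded = unfold([Edge("U", "V", 37)], J=4)
for e in unfolded:
    print(f"{e.src} -> {e.dst} with {e.delays}D")
```

The total number of delays on the four unfolded edges is still 37, which is the delay-preservation property of unfolding.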

4 Properties of Unfolding (Repetition)
Unfolding preserves the number of delays in a DFG: ⌊w/J⌋ + ⌊(w+1)/J⌋ + … + ⌊(w+J-1)/J⌋ = w.
Unfolding preserves precedence constraints.
J-unfolding of a loop with wl delays in the original DFG leads to gcd(wl, J) loops in the unfolded DFG. Each of these loops contains wl/gcd(wl, J) delays and J/gcd(wl, J) copies of each node.
Unfolding a DFG with iteration bound T results in a J-unfolded DFG with iteration bound JT.
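The first and third properties are easy to check numerically. The sketch below is only an illustration under simplifying assumptions: the loop is modelled as a single node A with a self-loop of w_l delays, and the function names `delay_sum` and `unfolded_loops` are made up for this example.

```python
from math import gcd

def delay_sum(w, J):
    """Total delays on the J unfolded copies of an edge with w delays."""
    return sum((w + i) // J for i in range(J))

def unfolded_loops(w_l, J):
    """Unfold the self-loop A -> A with w_l delays by J and trace the loops.

    Returns one (copies_of_A, delays) pair per loop in the unfolded DFG.
    """
    seen, loops = set(), []
    for start in range(J):
        if start in seen:
            continue
        i, copies, delays = start, 0, 0
        while True:
            seen.add(i)
            copies += 1
            delays += (i + w_l) // J   # delays on edge A_i -> A_((i+w_l)%J)
            i = (i + w_l) % J
            if i == start:
                break
        loops.append((copies, delays))
    return loops

# Delay preservation for a range of w and J.
assert all(delay_sum(w, J) == w for w in range(50) for J in range(1, 10))

# A loop with 6 delays unfolded by 4: gcd(6, 4) = 2 loops,
# each with 6/2 = 3 delays and 4/2 = 2 copies of the node.
loops = unfolded_loops(6, 4)
print(loops)
```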

5 Unfolding of Switches (Repetition)
The following assumptions are made when unfolding an edge U → V containing a switch:
The wordlength W is a multiple of the unfolding factor J, i.e. W = W'J.
All edges into and out of the switch have no delays.
If so, an edge U → V that is switched at time instance Wl + u can be unfolded as follows:
Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J).
Draw an edge from node Uu%J to node Vu%J, which is switched at time instance W'l + ⌊u/J⌋.
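As a small sketch of this rule (assuming W is a multiple of J; `unfold_switch` is a hypothetical helper, not something defined in the lecture), each switching offset u is mapped to the node copy u%J and to the new offset ⌊u/J⌋ within the new period W' = W/J:

```python
def unfold_switch(W, J, offsets):
    """Unfold a switch that closes at time instances W*l + u, for u in offsets.

    Returns (W_prime, schedule), where schedule[k] lists the offsets t such
    that the edge U_k -> V_k is switched at W_prime*l + t.
    """
    if W % J != 0:
        raise ValueError("W must be a multiple of J for this simple case")
    schedule = {}
    for u in offsets:
        schedule.setdefault(u % J, []).append(u // J)
    return W // J, {k: sorted(v) for k, v in schedule.items()}

# A switch closed at 4l+1, 4l+2 and 4l+3, unfolded by J = 2.
period, schedule = unfold_switch(4, 2, [1, 2, 3])
print(period, schedule)
```

In this example copy 0 is switched at 2l+1 only, while copy 1 is switched at 2l+0 and 2l+1, i.e. it is always closed.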

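6 What about Switches with Delays? (Repetition)
Unfolding a DFG containing an edge that has both a switch and a positive number of delays is done by introducing a dummy node.
[Figure: nodes A and B feed node C through switches closed at 6l + 1, 5 and at 6l + 0, 2, 3, 4, with 2D on a switched edge. After inserting a dummy node D, the delays sit on the edge into D and the switched edges carry no delays.]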

7 Unfolding (cont.) Chapter 5

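8 Example: How to Unfold a Bit-serial Adder
[Figure: bit-serial adder X with inputs A and B and sum output S. The carry output is fed back to the carry input through a delay D. A switch on the carry input selects Z (carry reset, carry = 0) at 4l+0 and the delayed carry at 4l+1,2,3.]

9 Example: How to Unfold a Bit-serial Adder
[Figure: the same bit-serial adder with a dummy node D inserted between the carry delay and the switch. The switch selects Z → X at 4l+0 and D → X at 4l+1,2,3.]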

10 Unfold Bit-serial Adder, J=2
For each node U in the original DFG, draw J nodes U0, U1, U2, …, UJ-1.
[Figure: nodes A0, B0, X0, S0, D0, Z0 and A1, B1, X1, S1, D1, Z1.]

11 Unfold Bit-serial Adder, J=2
For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays, for i = 0, 1, …, J-1.
If an edge has w = 0, this gives Ui → Vi with 0 delays.

12 Unfold Bit-serial Adder, J=2
For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays, for i = 0, 1, …, J-1.
The edge X → D has w = 1: for i = 0 it gives X0 → D1 with 0 delays, and for i = 1 it gives X1 → D0 with 1 delay.

13 Unfold the Switch, J=2
Original switch: Z → X at 4l+0 and D → X at 4l+1,2,3.
Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J):
Z → X: 4l+0 = 2(2l+0)+0
D → X: 4l+1 = 2(2l+0)+1
D → X: 4l+2 = 2(2l+1)+0
D → X: 4l+3 = 2(2l+1)+1

14 Unfold the Switch, J=2
Original switch: Z → X at 4l+0 and D → X at 4l+1,2,3.
Z → X: 4l+0 = 2(2l+0)+0
D → X: 4l+1 = 2(2l+0)+1
D → X: 4l+2 = 2(2l+1)+0
D → X: 4l+3 = 2(2l+1)+1
Z0 → X0 at time 2l+0.
D0 → X0 at time 2l+1.

15 Unfold the Switch, J=2
Original switch: Z → X at 4l+0 and D → X at 4l+1,2,3.
Z → X: 4l+0 = 2(2l+0)+0
D → X: 4l+1 = 2(2l+0)+1
D → X: 4l+2 = 2(2l+1)+0
D → X: 4l+3 = 2(2l+1)+1
Z0 → X0 at time 2l+0.
D0 → X0 at time 2l+1.
D1 → X1 at time 2l+0,1, i.e. always closed.

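16 Unfold the Switch, J=2
Z0 → X0 at time 2l+0.
D0 → X0 at time 2l+1.
D1 → X1 at time 2l+0,1, i.e. always closed.
[Figure: the unfolded adder with nodes A0, B0, X0, S0, D0, Z0 and A1, B1, X1, S1, D1, Z1, one delay D and a switch at 2l+0 / 2l+1. Z1 has no outgoing edge and is marked as a dead node.]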

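17 Remove Dead and Dummy Nodes
Z0 → X0 at time 2l+0.
D0 → X0 at time 2l+1.
D1 → X1 at time 2l+0,1, i.e. always closed.
[Figure: the unfolded adder after removing dead and dummy nodes. The carry input of X0 takes Z0 at 2l+0 and the delayed carry (D) at 2l+1; the carry of X0 goes directly to X1.]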

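18 The Digit Serial Adder
[Figure: digit-serial adder built from X0 and X1, processing two bits per cycle. Digit 1 is a1 a0, b1 b0, s1 s0 and digit 2 is a3 a2, b3 b2, s3 s2. The carry from X0 to X1 stays within the iteration; the carry to the next iteration passes through one delay (D = 1) and is replaced by Z0 at 2l+0.]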

19 Fully Parallel Adder, i.e. J=4
For each node U in the original DFG, draw J nodes U0, U1, U2, …, UJ-1.
For each edge U → V with w delays in the original DFG, draw the J edges Ui → V(i+w)%J with ⌊(i+w)/J⌋ delays, for i = 0, 1, …, J-1.
[Figure: four copies of the adder from LSB to MSB, with nodes A0 to A3, B0 to B3, X0 to X3, S0 to S3, D0 to D3 and Z0 to Z3, and one delay D.]

20 Unfold the Switch, J=4
Original switch: Z → X at 4l+0 and D → X at 4l+1,2,3.
Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J):
Z → X: 4l+0 = 4(1l+0)+0
D → X: 4l+1 = 4(1l+0)+1
D → X: 4l+2 = 4(1l+0)+2
D → X: 4l+3 = 4(1l+0)+3

21 Unfold the Switch, J=4
Original switch: Z → X at 4l+0 and D → X at 4l+1,2,3.
Z → X: 4l+0 = 4(1l+0)+0
D → X: 4l+1 = 4(1l+0)+1
D → X: 4l+2 = 4(1l+0)+2
D → X: 4l+3 = 4(1l+0)+3
There is only one time instance, 0, i.e. the adder is fully parallel: Z0 → X0, D1 → X1, D2 → X2 and D3 → X3.

22 Bit-parallel Adder
There is only one time instance, 0, i.e. the adder is fully parallel: Z0 → X0, D1 → X1, D2 → X2 and D3 → X3.
[Figure: the four unfolded adder copies with nodes B0 to B3, X0 to X3, D0 to D3 and Z0 to Z3, and one delay D.]

23 Bit-parallel Adder
[Figure: the bit-parallel adder from LSB to MSB with the "dead" nodes marked, i.e. the unfolded nodes that have no outgoing edge.]

24 Remove "Dead" and Dummy Nodes
"Dead" nodes can be removed.
Dummy nodes can be removed.
[Figure: the bit-parallel adder with the dead nodes and the dummy nodes D marked for removal.]

25 Bit-parallel Adder
The result is a carry ripple adder with inputs a3 a2 a1 a0 and b3 b2 b1 b0 and sum s3 s2 s1 s0.
The carry from the MSB is used as overflow, or as carry output if the adder is to be used as a 4-bit module.
The carry input of the LSB is Z (carry = 0); a switch is needed there if the adder is to be used as a 4-bit module.

26 If Wordlength is not a multiple of J
Determine L = lcm{W, J}, where lcm is the least common multiple.
Replace the switching instance Wl+u with the L/W instances Ll + u + wW, for w = 0, …, L/W-1, i.e. the switching periodicity is changed from W to L.
Perform the unfolding as previously.
Identify the correspondence between the original instances and the expanded instances.
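The expansion to the period L followed by the regular unfolding can be sketched as follows. This is a minimal illustration; the function name `unfold_switch_general` and the dictionary layout are assumptions made for this example.

```python
from math import gcd

def unfold_switch_general(W, J, offsets):
    """Unfold a switch closed at W*l + u (u in offsets) by any factor J.

    The instances are first expanded to the period L = lcm(W, J) and then
    unfolded as in the regular case. Returns (new_period, schedule), where
    schedule[k] lists the offsets t such that U_k -> V_k is switched at
    new_period*l + t.
    """
    L = W * J // gcd(W, J)
    expanded = sorted(u + w * W for u in offsets for w in range(L // W))
    schedule = {}
    for t in expanded:
        schedule.setdefault(t % J, []).append(t // J)
    return L // J, schedule

# W = 4 and J = 3, so L = 12 and the new period is 12/3 = 4.
reset = unfold_switch_general(4, 3, [0])        # 4l+0 -> 12l+0,4,8
carry = unfold_switch_general(4, 3, [1, 2, 3])  # 4l+1,2,3 -> 12l+1,2,3,5,...
print(reset)
print(carry)
```

With these inputs the reset edge is switched at 4l+0, 4l+1 and 4l+2 for copies 0, 1 and 2 respectively, and each carry edge is switched at the three remaining offsets.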

27 Example: Unfold Bit-serial Adder by J=3
The wordlength W=4 is not a multiple of the unfolding factor J=3.
Determine L = lcm{W, J} = lcm{4, 3} = 12.
Replace Wl+u with Ll + u + wW for w = 0 to L/W-1.
[Figure: bit-serial adder with dummy node D and a switch that selects Z → X at 4l+0 and D → X at 4l+1,2,3.]

28 Example: Unfold Bit-serial Adder by J=3
Replace Wl+u with Ll + u + wW for w = 0 to L/W-1.
4l+0 is now equivalent to 12l+0, 12l+4 and 12l+8, and so on for the other instances.
Unfold as for the regular case.
[Figure: bit-serial adder with the switch selecting Z → X at 12l+0,4,8 and D → X at 12l+1,2,3,5,6,7,9,10,11.]

29 Example: Unfold Bit-serial Adder by J=3
[Figure: the three unfolded copies of the adder with nodes Z0, Z1, Z2 and one delay D.]

30 Example: Unfold Bit-serial Adder by J=3
Z → X is switched at 12l+0,4,8. Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J):
12l+0 = 3(4l+0)+0
12l+4 = 3(4l+1)+1
12l+8 = 3(4l+2)+2
Draw an edge from node Uu%J to node Vu%J, which is switched at time instance W'l + ⌊u/J⌋: Z0 → X0 at 4l+0, Z1 → X1 at 4l+1 and Z2 → X2 at 4l+2.

31 Example: Unfold Bit-serial Adder by J=3
[Figure: the three unfolded copies of the adder with the edges from Z0, Z1 and Z2 drawn, and one delay D.]

32 Example: Unfold Bit-serial Adder by J=3
D → X is switched at 12l+1,2,3,5,6,7,9,10,11. Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J):
12l+1 = 3(4l+0)+1
12l+2 = 3(4l+0)+2
12l+3 = 3(4l+1)+0
12l+5 = 3(4l+1)+2
12l+6 = 3(4l+2)+0
12l+7 = 3(4l+2)+1
12l+9 = 3(4l+3)+0
12l+10 = 3(4l+3)+1
12l+11 = 3(4l+3)+2
The instances with remainder 0 (12l+3, 12l+6, 12l+9) give D0 → X0 at 4l+1, 4l+2 and 4l+3.

33 Example: Unfold Bit-serial Adder by J=3
[Figure: the three unfolded copies of the adder with the edges from Z0, Z1, Z2 and D0 drawn, and one delay D.]

34 Example: Unfold Bit-serial Adder by J=3
D → X is switched at 12l+1,2,3,5,6,7,9,10,11. Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J):
12l+1 = 3(4l+0)+1
12l+2 = 3(4l+0)+2
12l+3 = 3(4l+1)+0
12l+5 = 3(4l+1)+2
12l+6 = 3(4l+2)+0
12l+7 = 3(4l+2)+1
12l+9 = 3(4l+3)+0
12l+10 = 3(4l+3)+1
12l+11 = 3(4l+3)+2
The instances with remainder 1 (12l+1, 12l+7, 12l+10) give D1 → X1 at 4l+0, 4l+2 and 4l+3.

35 Example: Unfold Bit-serial Adder by J=3
[Figure: the three unfolded copies of the adder with the edges from Z0, Z1, Z2, D0 and D1 drawn, and one delay D.]

36 Example: Unfold Bit-serial Adder by J=3
D → X is switched at 12l+1,2,3,5,6,7,9,10,11. Write the switching instance as Wl + u = J(W'l + ⌊u/J⌋) + (u%J):
12l+1 = 3(4l+0)+1
12l+2 = 3(4l+0)+2
12l+3 = 3(4l+1)+0
12l+5 = 3(4l+1)+2
12l+6 = 3(4l+2)+0
12l+7 = 3(4l+2)+1
12l+9 = 3(4l+3)+0
12l+10 = 3(4l+3)+1
12l+11 = 3(4l+3)+2
The instances with remainder 2 (12l+2, 12l+5, 12l+11) give D2 → X2 at 4l+0, 4l+1 and 4l+3.

37 Example: Unfold Bit-serial Adder by J=3
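[Figure: the resulting 3-unfolded adder with carry resets Z0, Z1, Z2 and one delay D. The bit streams of two consecutive 4-bit additions are shown: a3 a2 a1 a0 + b3 b2 b1 b0 = s3 s2 s1 s0 and c3 c2 c1 c0 + d3 d2 d1 d0 = s'3 s'2 s'1 s'0.]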

38 End of Unfolding

39 Folding Chapter 6

40 What is folding?
Folding is the "inverse" of unfolding.
Unfolding by J turns a node A into the J nodes A0, A1, …, AJ-1; folding by N (N = folding factor) maps N nodes back onto a single node A.

41 Folding?
Used to minimize silicon area (trading area for time)!
The folding transformation is a way to systematically determine the control circuits in DSP architectures where multiple algorithm operations are time-multiplexed onto a single functional unit.
Use it for synthesis of DSP architectures that can be operated at single or multiple clocks.
Use it to reduce the number of hardware functional units (FUs), such as adders and multipliers, by a factor of N at the expense of increasing the computation time by a factor of N.
But folding can lead to an architecture that uses a large number of registers, so a register minimization technique sometimes needs to be applied.

42 Hardware Mapped vs. Time multiplexed
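Hardware mapped: 1 sample/cc, N fixed multipliers, N-1 adders.
Time multiplexed: N cc/sample, 1 generalized multiplier, adders, 1 coefficient memory + control.
[Figure: time-multiplexed datapath with a MUX, a coefficient input c and a register REG.]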

43 Hardware Mapped vs. Time multiplexed/Microcoded
Biquad filter.
Hardware mapped: 5 multipliers with fixed coefficients, 4 adders, 2 delays. Latency = 1 cc.
Microcoded: 1 multiplier, 1 adder, coefficient memory, 3 registers, controller. Latency = 5 cc.

44 Folding – Time-shared Architecture
Folding is a technique to reduce the silicon area by time-multiplexing many operations onto single functional units.
The figure on the right shows a 2 times folded architecture where the additions are folded, or time-multiplexed, onto a single adder.
Folding introduces registers/storage.
The computation time is increased, e.g. one output sample every 2 cc (one input sample consumed every 2 cc).

45 Folding Example – A more detailed look!
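Cycle 0 (2l+0): inputs a(0) and b(0); the register D holds y(-1).
Cycle 1 (2l+1): input c(0); the register D holds a(0)+b(0).
Cycle 2 (2l+0): inputs a(1) and b(1); the register D holds a(0)+b(0)+c(0).
Cycle 3 (2l+1): input c(1).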
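The cycle-by-cycle behaviour of the folded adder can be mimicked with a short simulation. This is only a sketch under assumptions: the output is taken to be y(l) = a(l) + b(l) + c(l), the single register is modelled by the variable `reg`, and `folded_adder` is a hypothetical name.

```python
def folded_adder(a, b, c):
    """Simulate one adder that is time-multiplexed over two additions.

    Even cycles (2l+0): reg <- a(l) + b(l)
    Odd cycles  (2l+1): reg <- reg + c(l)
    Returns the register content after every clock cycle.
    """
    reg = 0
    trace = []
    for cycle in range(2 * len(a)):
        l, phase = divmod(cycle, 2)
        if phase == 0:
            reg = a[l] + b[l]      # switch position 2l+0
        else:
            reg = reg + c[l]       # switch position 2l+1
        trace.append(reg)
    return trace

trace = folded_adder(a=[1, 2], b=[10, 20], c=[100, 200])
outputs = trace[1::2]   # a finished sum appears every second cycle
print(trace, outputs)
```

One result is produced every 2 clock cycles, which is the price paid for using one adder instead of two.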

46 Control Unit
[Figure: finite state machine with two states S0 and S1 and transitions labelled -/1 and -/0, generating the control signal(s).]
Control units can be complex in large systems!

47 Folding
Reduce hardware by N-folding.
The computation time is increased by a factor of N → latency.
Extremes: fully parallel (1 unit per algorithmic operation) and fully time multiplexed.
Folding leads to extra registers (i.e. extra storage), a more complex control unit and more latency.

48 Folding Transformation
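N = folding factor, i.e. the number of operations folded to a single unit.
l = iteration.
Hu and Hv are the hardware units that execute nodes U and V.
Pu = level of pipelining of Hu.
DF(U → V) = number of delays on the edge in the folded graph.
u and v are the folding orders, i.e. the scheduled times, of U and V.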

49 Folding Transformation
Hu is pipelined by Pu stages, so the output of the l-th iteration of U is available at time Nl + u + Pu.
The edge U → V has w(e) delays, so the result of the l-th iteration of U is used by the (l + w(e))-th iteration of node V, which is executed at time N(l + w(e)) + v.
So the result should be stored for
DF(U → V) = [N(l + w(e)) + v] - [Nl + Pu + u]
DF(U → V) = Nw(e) - Pu + v - u (independent of l)
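The folding equation is a one-liner, and the claim that it is independent of l can be checked by computing both time stamps explicitly. The function names and the numbers in the example are illustrative assumptions, not values taken from a figure.

```python
def folded_delay(N, w, P_u, u, v):
    """D_F(U -> V) = N*w(e) - P_u + v - u."""
    return N * w - P_u + v - u

def storage_time(N, w, P_u, u, v, l):
    """Same quantity from the two time stamps, for a given iteration l."""
    available = N * l + u + P_u      # result of iteration l of U is ready
    consumed = N * (l + w) + v       # iteration l + w of V executes
    return consumed - available

# Illustrative numbers: N = 4, edge with 2 delays, one pipeline stage,
# U scheduled at folding order 3 and V at folding order 1.
d = folded_delay(N=4, w=2, P_u=1, u=3, v=1)
assert all(storage_time(4, 2, 1, 3, 1, l) == d for l in range(10))
print(d)
```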

50 Folding Set
A folding set is an ordered set of operations to be executed on the same functional unit.
The folding sets are typically obtained from a scheduling and allocation algorithm (ref. Appendix B).
The folding sets represent the underlying folding transformation.
Each set contains N entries, where N is the folding factor. The position of an entry in the set is its folding order (0 … N-1); an unused position holds a null operation.
Example (N=3): A1 belongs to folding set S1 with folding order 0.

51 Ex. Folding of Biquad filter
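[Figure: biquad filter DFG with input In, output Out, delays D, nodes 1 to 8 and coefficients a, b, c and d; one node is labelled (S2|2).]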

52 Ex. Folding of Biquad filter
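The additions are folded to set S1 and the multiplications to set S2. Each node is labelled (set|folding order); the labels in the figure are (S1|0), (S1|1), (S1|2), (S1|3), (S2|0), (S2|1), (S2|2) and (S2|3).
[Figure: biquad filter DFG with input In, output Out, delays D, nodes 1 to 8 and coefficients a, b, c and d.]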

53 Folding of Biquad filter, N=4
DF(U → V) = Nw(e) - Pu + v - u, where U sends and V receives.
DF(1→2) = -3
DF(1→5) = 0
DF(1→6) = 2
DF(1→7) = 7
DF(1→8) = 5
DF(3→1) = 0
DF(4→2) = 0
DF(5→3) = 0
DF(6→4) = -4
DF(7→3) = -3
DF(8→4) = -3
A delay cannot be negative, so this is not a valid folding!
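The listed values can be reproduced with the folding equation. The sketch below rests on assumptions that are not legible in the transcript: the edge delays w(e), the folding orders per node, and pipeline depths of 1 for adders (nodes 1 to 4) and 2 for multipliers (nodes 5 to 8) were chosen so that the equation gives exactly the DF values on the slide.

```python
N = 4

# ASSUMED edge delays w(e) of the unretimed biquad, keyed by (U, V).
w = {(1, 2): 0, (1, 5): 1, (1, 6): 1, (1, 7): 2, (1, 8): 2, (3, 1): 0,
     (4, 2): 0, (5, 3): 0, (6, 4): 0, (7, 3): 0, (8, 4): 0}

# ASSUMED folding order of each node.
order = {1: 3, 2: 1, 3: 2, 4: 0, 5: 0, 6: 2, 7: 3, 8: 1}

def pipeline_stages(node):
    """ASSUMED pipelining: 1 stage for adders, 2 for multipliers."""
    return 1 if node <= 4 else 2

def folded_delays(w, order, N):
    return {(U, V): N * d - pipeline_stages(U) + order[V] - order[U]
            for (U, V), d in w.items()}

df = folded_delays(w, order, N)
invalid = sorted(e for e, d in df.items() if d < 0)
print(df)
print("edges with negative folded delay:", invalid)
```

Four edges come out negative, which is why the DFG has to be retimed before it can be folded with these folding sets.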

54 Retiming: Folding of Biquad filter, N=4
Retiming: split and move delays.
Feedforward cutset → pipelining.
[Figure: biquad filter DFG with nodes 1 to 8 and their folding assignments; the delays D to be moved by retiming are marked.]

55 Folding of retimed Biquad filter
[Figure: the retimed biquad filter DFG with input In, output Out, nodes 1 to 8, coefficients a, b, c and d, and the folding assignments (S1|0) to (S1|3) and (S2|0) to (S2|3).]

56 Systematic way of: Retiming for Folding
Retiming changes the number of delays on an edge e from U (send) to V (receive): wr(e) = w(e) + r(V) - r(U).
If D'F(U → V) denotes the folded delays of the edge U → V in the retimed graph, then D'F(U → V) ≥ 0 requires
N wr(e) - Pu + v - u ≥ 0
N(w(e) + r(V) - r(U)) - Pu + v - u ≥ 0
N(r(U) - r(V)) ≤ Nw(e) - Pu + v - u
r(U) - r(V) ≤ DF(U → V)/N
r(U) - r(V) ≤ ⌊DF(U → V)/N⌋ (floor since retiming values are integers)
Then solve the system of inequalities!
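One standard way to solve such a system of difference constraints is a Bellman-Ford style relaxation; the lecture does not say which method is intended, so the solver below is a swapped-in technique and its name `solve_retiming` is an assumption. The input is the list of folded delays DF(U → V) of the unretimed biquad for N = 4.

```python
def solve_retiming(nodes, constraints):
    """Solve r(U) - r(V) <= c for all (U, V, c) by repeated relaxation.

    Starting from r = 0 everywhere is equivalent to a virtual source with
    zero-weight edges to every node. Raises ValueError if infeasible.
    """
    r = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for U, V, c in constraints:
            if r[V] + c < r[U]:
                r[U] = r[V] + c
                changed = True
        if not changed:
            return r
    raise ValueError("the inequalities have no solution")

N = 4
# Folded delays D_F(U -> V) before retiming.
DF = {(1, 2): -3, (1, 5): 0, (1, 6): 2, (1, 7): 7, (1, 8): 5, (3, 1): 0,
      (4, 2): 0, (5, 3): 0, (6, 4): -4, (7, 3): -3, (8, 4): -3}

# r(U) - r(V) <= floor(D_F / N); Python's // is a floor division.
constraints = [(U, V, d // N) for (U, V), d in DF.items()]
r = solve_retiming(range(1, 9), constraints)

# Retiming adds r(V) - r(U) delays, i.e. N*(r(V) - r(U)) folded delays.
DF_retimed = {(U, V): d + N * (r[V] - r[U]) for (U, V), d in DF.items()}
print(r)
print(DF_retimed)
```

Any solution of the inequalities is acceptable; the one found here makes every folded delay non-negative.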

57 Folding of retimed Biquad filter, N=4
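The additions are folded to set S1 and the multiplications to set S2.
[Figure: the retimed biquad filter DFG with input In, output Out, nodes 1 to 8, coefficients a, b, c and d, and the folding assignments (S1|0) to (S1|3) and (S2|0) to (S2|3).]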

58 Folding of retimed Biquad filter, N=4
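DF(U → V) = Nw(e) - Pu + v - u, where U sends and V receives. With Pu = 1 for the adders and Pu = 2 for the multipliers:
DF(1→2) = 4(1) - 1 + 1 - 3 = 1
DF(1→5) = 4(1) - 1 + 0 - 3 = 0
DF(1→6) = 4(1) - 1 + 2 - 3 = 2
DF(1→7) = 4(1) - 1 + 3 - 3 = 3
DF(1→8) = 4(2) - 1 + 1 - 3 = 5
DF(3→1) = 4(0) - 1 + 3 - 2 = 0
DF(4→2) = 4(0) - 1 + 1 - 0 = 0
DF(5→3) = 4(0) - 2 + 2 - 0 = 0
DF(6→4) = 4(1) - 2 + 0 - 2 = 0
DF(7→3) = 4(1) - 2 + 2 - 3 = 1
DF(8→4) = 4(1) - 2 + 0 - 1 = 1
All folded delays are non-negative: valid folding.
[Figure: retimed biquad with the folding assignments node 1 (S1|3), node 2 (S1|1), node 3 (S1|2), node 4 (S1|0), node 5 (S2|0), node 6 (S2|2), node 7 (S2|3) and node 8 (S2|1); coefficients a, b, c and d belong to multipliers 5, 6, 7 and 8.]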

59 Folding of retimed Biquad filter, N=4
DF(U → V) = Nw(e) - Pu + v - u, where U sends and V receives.
The folded delays of the retimed graph are DF(1→2) = 1, DF(1→5) = 0, DF(1→6) = 2, DF(1→7) = 3, DF(1→8) = 5, DF(3→1) = 0, DF(4→2) = 0, DF(5→3) = 0, DF(6→4) = 0, DF(7→3) = 1 and DF(8→4) = 1.
All folded delays are non-negative: valid folding.
[Figure: the retimed biquad filter DFG with nodes 1 to 8 and their folding assignments.]

60 Folding of retimed Biquad filter, N=4
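DF(U → V) = Nw(e) - Pu + v - u
DF(1→8) = 4(2) - 1 + 1 - 3 = 5 → a path from the adder to the multiplier with 5 delays (5D).
Node 8 has folding order 1 → the switch closes at 4l+1.
[Figure: folded architecture with one unit for the additions and one for the multiplication.]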

61 Folding of retimed Biquad filter
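DF(3→1) = 4(0) - 1 + 3 - 2 = 0 → a path from the adder to the adder with 0D.
Node 1 has folding order 3 → the switch closes at 4l+3.
Node 1 is also connected to the input.
[Figure: folded architecture with one unit for the additions and one for the multiplication.]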

62 Folding of retimed Biquad filter
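Execution of node 2 (inputs from nodes 1 and 4):
DF(4→2) = 4(0) - 1 + 1 - 0 = 0 → a path from the adder to the adder with 0D.
DF(1→2) = 4(1) - 1 + 1 - 3 = 1 → a path from the adder to the adder with 1D.
Node 2 has folding order 1 → the switch closes at 4l+1.
[Figure: folded architecture with one unit for the additions and one for the multiplication.]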

63 Folding & Register Minimization Chapter 6.3

64 Register/Storage Minimization
Folding inserts registers.
Lifetime analysis is used for register minimization in DSP hardware.
A variable is live from the time it is produced until the time it is consumed. After that it is dead.
Linear lifetime chart: represents the lifetime of the variables in a linear fashion.
Convention: a variable is not live during the clock cycle when it is produced, but it is live during the clock cycle when it is consumed.
[Figure: linear lifetime chart; one iteration is 6 cc, so N=6.]

65 Register Minimization
Max. number of live variables → min. number of registers.
Use the previous iteration to avoid drawing the lifetime chart over several iterations.
Within a single iteration the chart shows 2 live variables, but there are 3 when several iterations are taken into account.

66 Register Minimization
Max. number of live variables → min. number of registers.
[Figure: lifetime charts of consecutive iterations of 6 cc each. A single iteration shows 2 live variables, but there are 3 when several iterations are taken into account.]

67 3x3 Matrix Transpose
Example of a systematic way of working with lifetime charts.
The matrix transposer receives the samples in the order i | h | g | f | e | d | c | b | a (a first) and outputs them in the order i | f | c | h | e | b | g | d | a (a first).
One iteration = 9 clock cycles.

68 Lifetime Table - 3x3 Matrix Transpose
Transposer input: i | h | g | f | e | d | c | b | a. Output: i | f | c | h | e | b | g | d | a.
Sample  Tin  Tzlout
a       0    0
b       1    3
c       2    6
d       3    1
e       4    4
f       5    7
g       6    2
h       7    5
i       8    8
For sample d the output would come before the input (Tzlout - Tin = -2).

69 Lifetime Table - 3x3 Matrix Transpose
Transposer input: i | h | g | f | e | d | c | b | a. Output: i | f | c | h | e | b | g | d | a.
Tdiff = Tzlout - Tin, where Tzlout is the output time with zero latency.
Sample  Tin  Tzlout  Tdiff
a       0    0       0
b       1    3       2
c       2    6       4
d       3    1       -2
e       4    4       0
f       5    7       2
g       6    2       -4
h       7    5       -2
i       8    8       0

70 3x3 Matrix Transpose
If Tdiff < 0 for some sample, the system is not causal → add a latency equal to the magnitude of the most negative Tdiff to all nodes.
Here the most negative Tdiff is -4 (sample g), so the latency is 4.

71 3x3 Matrix Transpose
If Tdiff < 0 for some sample, the system is not causal → add a latency Tlat equal to the magnitude of the most negative Tdiff to all nodes.
Tout = Tzlout + Tlat, here with Tlat = 4.
Sample  Tin  Tzlout  Tdiff  Tout  Life
a       0    0       0      4     0→4
b       1    3       2      7     1→7
c       2    6       4      10    2→10
d       3    1       -2     5     3→5
e       4    4       0      8     4→8
f       5    7       2      11    5→11
g       6    2       -4     6     6→6
h       7    5       -2     9     7→9
i       8    8       0      12    8→12
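The lifetime table follows mechanically from the input and output orders. The sketch below assumes one sample per clock cycle starting at cycle 0; `lifetime_table` is a hypothetical helper name.

```python
def lifetime_table(input_order, output_order):
    """Build the lifetime table for a reordering block.

    Returns (table, t_lat), where table[s] = (Tin, Tzlout, Tdiff, Tout)
    and t_lat is the latency needed to make the block causal.
    """
    t_in = {s: t for t, s in enumerate(input_order)}
    t_zlout = {s: t for t, s in enumerate(output_order)}
    t_diff = {s: t_zlout[s] - t_in[s] for s in input_order}
    t_lat = max(0, -min(t_diff.values()))   # magnitude of most negative Tdiff
    table = {s: (t_in[s], t_zlout[s], t_diff[s], t_zlout[s] + t_lat)
             for s in input_order}
    return table, t_lat

# 3x3 transpose: samples arrive as a..i and leave as a, d, g, b, e, h, c, f, i.
table, t_lat = lifetime_table("abcdefghi", "adgbehcfi")
for s, (tin, tzl, tdiff, tout) in table.items():
    print(f"{s}: Tin={tin} Tzlout={tzl} Tdiff={tdiff} Tout={tout} life {tin}->{tout}")
```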

72 Lifetime chart 3x3 Matrix Transpose
Lifetimes (Tin → Tout): a 0→4, b 1→7, c 2→10, d 3→5, e 4→8, f 5→11, g 6→6, h 7→9, i 8→12.
One iteration = 9 clock cycles.
cycle  1  2  3  4  5  6  7  8  9  10  11  12
#live  1  2  3  4  4  4  4  4  4  3   2   1
Contribution from the next iteration in cycles 9 to 12: +0, +1, +2, +3.

73 Lifetime chart 3x3 Matrix Transpose
Lifetimes (Tin → Tout): a 0→4, b 1→7, c 2→10, d 3→5, e 4→8, f 5→11, g 6→6, h 7→9, i 8→12.
One iteration = 9 clock cycles.
cycle  1  2  3  4  5  6  7  8  9  10  11  12
#live  1  2  3  4  4  4  4  4  4  3   2   1
The total in cycles 9 to 12, including the contribution from the next iteration: 4+0=4, 3+1=4, 2+2=4, 1+3=4.
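The totals can be checked by folding the lifetimes modulo the iteration period, which adds the contributions from neighbouring iterations automatically. A minimal sketch, assuming the convention that a variable is live in the cycles Tin+1 to Tout; the name `live_counts` is made up for this example.

```python
def live_counts(lifetimes, period):
    """Number of live variables per cycle in steady state.

    A variable with lifetime (t_in, t_out) is live in cycles
    t_in + 1 .. t_out (not in the cycle where it is produced). Cycles are
    taken modulo the period so that overlapping iterations add up.
    """
    counts = [0] * period
    for t_in, t_out in lifetimes.values():
        for cycle in range(t_in + 1, t_out + 1):
            counts[cycle % period] += 1
    return counts

lifetimes = {"a": (0, 4), "b": (1, 7), "c": (2, 10), "d": (3, 5),
             "e": (4, 8), "f": (5, 11), "g": (6, 6), "h": (7, 9),
             "i": (8, 12)}

counts = live_counts(lifetimes, period=9)
min_registers = max(counts)
print(counts, min_registers)
```

For the 3x3 transposer every cycle of the steady state has the same number of live variables, so the register count is fully used.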

74 Lifetime chart
One iteration = 9 clock cycles.
[Figure: the lifetime chart of one iteration (cycles 1 to 12) and the charts of two consecutive iterations drawn over cycles 1 to 18, showing the contribution from the next iteration (+0, +1, +2, +3) to the number of live variables.]

75 Lifetime chart 3x3 Matrix Transpose
Lifetimes (Tin → Tout): a 0→4, b 1→7, c 2→10, d 3→5, e 4→8, f 5→11, g 6→6, h 7→9, i 8→12.
max #live = 4 → 4 registers.

76 Circular lifetime chart
Useful to represent the periodic nature of DSP programs.
The circle is divided into N pie slices, numbered 0 to N-1.
The number in parentheses represents the number of live variables at each time instance.

77 Forward Backward Register Allocation Technique
Steps for forward-backward register allocation:
1. Determine the minimum number of registers using lifetime analysis.
2. Input each variable at the time step corresponding to the beginning of its lifetime. If multiple variables are input in a given cycle, these are allocated to multiple registers with preference given to the variable with the longest lifetime.
3. Each variable is allocated in a forward manner until it is dead or it reaches the last register. In forward allocation, if register i holds the variable in the current cycle, then register i + 1 holds the same variable in the next cycle. If register i + 1 is not free, use the first available forward register.
4. Being periodic, the allocation repeats in each iteration. So hash out register Rj for cycle l + N if it holds a variable during cycle l.
5. Variables that reach the last register and are still alive are allocated in a backward manner on a first come, first served basis.
6. Repeat steps 4 and 5 until the allocation is complete.

78 Forward Backward Register Allocation Technique
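[Figure: chain of registers with forward allocation from one register to the next, backward allocation from the last register to an earlier one, and outputs (Out) taken from the registers.]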

79 Lifetime chart 3x3 Matrix Transpose
4 registers are needed. It is also possible to use a memory with 4 positions.

80 Controller for Folded Architecture
for Switches

81 Algorithm for Register Minimization Procedure
1. Perform retiming for folding. (6.2)
2. Write the folding equations. (6.2)
3. Construct a lifetime chart and determine the required number of registers. (6.3.1)
4. Perform forward-backward register allocation. (6.3.2)
5. Draw the register minimized folded architecture.

