Algorithmic Transformations


1 Algorithmic Transformations

2 Goals
The goal: get the DSP algorithm into an amenable form before synthesizing the design on the selected platform (FPGA or PDSP).
The algorithms themselves do not change, only the way they are prepared for implementation.
This requires understanding aspects of timing, pipelining, and parallelism.
(C) Yu Hen Hu

3 Overview
Algorithm Representations and Iteration Bound
Parallelism and Pipelining
Retiming
Unfolding
Folding

7 Data Flow Graph
Node: a computation, associated with a computing time.
Directed edge: data path and delay.
Delay: iteration count, labeled with D or a positive integer on an edge.
Example: y(n) = a*y(n-1) + b*u(n)
The delay of 1 u.t. indicates that computing y(n+1) in the next iteration depends on the result y(n) of the present iteration.

8 DFG
Intra-iteration dependency: a directed edge without any delay.
Inter-iteration dependency: a directed edge with 1 or more delays.
Node computing delays are given in parentheses.
Critical path: the longest path between registers.
Example (FIR filter with input x(n), two delays D, multipliers M0, M1, M2 at 4 t.u. each, adders A0, A1 at 2 t.u. each, output y(n)): critical path delay = 4 + 2 + 2 = 8 t.u.
Recursive DFG: contains loops. Every loop must have at least one delay element; otherwise the algorithm is NON-computable!

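9 Loop bound and Iteration bound
Loop bound: total computing time of the nodes in a loop divided by the number of delays in that loop.
Iteration bound: the maximum loop bound over all loops.
Example 1: nodes A (2) and B (4) in a loop A-B-A with 2D: T{A-B-A} = (2+4)/2 = 3 t.u.
Example 2: nodes A (2), B (4), C (5), with the loop A-B-A carrying 2D and a loop through A, B and C carrying 1 delay: T = max{(2+4)/2, (2+4+5)/1} = max{3, 11} = 11 t.u.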
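As a rough illustration of the loop-bound arithmetic on this slide, the Python sketch below enumerates the loops of a small DFG and takes the maximum of (loop computing time)/(loop delays). The edge list is an assumption read off the slide's numbers (a loop A-B-A with 2 delays and a loop A-B-C-A with 1 delay), and the function names are hypothetical.

```python
from fractions import Fraction

def simple_cycles(edges):
    """Enumerate the simple cycles of a small directed graph.

    edges: list of (u, v, delays). Returns a list of (nodes, total_delays).
    Each cycle is reported once, starting from its smallest node label.
    """
    graph = {}
    for u, v, w in edges:
        graph.setdefault(u, []).append((v, w))
    cycles = []

    def dfs(start, node, path, delays):
        for nxt, w in graph.get(node, []):
            if nxt == start:
                cycles.append((list(path), delays + w))
            elif nxt > start and nxt not in path:
                path.append(nxt)
                dfs(start, nxt, path, delays + w)
                path.pop()

    for start in sorted(graph):
        dfs(start, start, [start], 0)
    return cycles

def iteration_bound(times, edges):
    """Maximum over all loops of (sum of node times) / (number of delays)."""
    bounds = []
    for nodes, delays in simple_cycles(edges):
        if delays == 0:
            raise ValueError("delay-free loop: the algorithm is non-computable")
        bounds.append(Fraction(sum(times[n] for n in nodes), delays))
    return max(bounds, default=Fraction(0))

# Node computing times from the slide; the edge topology is assumed.
times = {"A": 2, "B": 4, "C": 5}
edges = [("A", "B", 0), ("B", "A", 2), ("B", "C", 0), ("C", "A", 1)]
print(iteration_bound(times, edges))
```

Exhaustive cycle enumeration is only practical for small graphs like the ones on these slides.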

12 Solution
To achieve high speed, the length of the critical path can be reduced by pipelining and parallel processing.

13 Overview
Algorithm Representations and Iteration Bound
Parallelism and Pipelining
Retiming
Unfolding
Folding

14 Basic Ideas
Parallel processing: less inter-processor communication, more complicated processor hardware.
Pipelined processing: more inter-processor communication, simpler processor hardware.
[Figure: time charts for processors P1 and P2 under each scheme. Colors: different types of operations performed. a, b, c, d: different data streams processed.]

15 Data Dependence
Parallel processing requires NO data dependence between processors.
Pipelined processing involves inter-processor communication.
[Figure: time charts for processors P1 to P4 under each scheme.]

16 Usage of Pipelined Processing
Inserting latches or registers between combinational logic circuits shortens the critical path.
Consequence: shorter clock cycle time, higher clock frequency.
Suitable for DSP applications that have (infinitely) long data streams.
Method to incorporate pipelining: cut-set retiming.
Cut set: a set of edges of a graph such that removing them from the original graph splits it into two separate graphs.
Retiming: the timing of an algorithm is re-adjusted while the partial ordering of execution is kept unchanged, so that the results remain correct.

17 Pipelining

18 Pipelining of FIR filters

19 Pipelining

20 Fine-grain pipelining
To further reduce TM, the multiplier is split into two pipelined stages with delays TM1 and TM2.
Critical path = Max{TM1, TM2, TA}

21 Graphic Transpose Theorem
The transfer function of a signal flow graph remains unchanged if
the direction of each arc is reversed, and
the input and output labels are switched.
[Figure: a 3-tap FIR signal flow graph with input x[n], delays z-1, coefficients h[0], h[1], h[2] and output y[n], shown next to its transposed form.]

22 Data broadcast structure
An algorithm transform may lead to a pipelined structure without adding delays.
Given an FIR filter SFG: critical path TM+2TA.
Use the graph transposition theorem: reverse all arcs, and swap input and output.
We obtain: critical path Max(TM, TA).
No additional delay added!
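To make the transposition concrete, the Python sketch below simulates a direct-form FIR filter and its transposed (data-broadcast) form and checks that they give the same output. It only illustrates functional equivalence under zero initial state; the function names are hypothetical and nothing here models hardware timing.

```python
def fir_direct(h, x):
    """Direct form: y(n) = sum_i h[i] * x(n-i), with zero initial state."""
    taps = [0] * len(h)                # taps[i] holds x(n-i)
    out = []
    for sample in x:
        taps = [sample] + taps[:-1]    # shift the delay line
        out.append(sum(hi * ti for hi, ti in zip(h, taps)))
    return out

def fir_transposed(h, x):
    """Data-broadcast form: x(n) feeds every multiplier at once."""
    state = [0] * (len(h) - 1)         # registers sitting between the adders
    out = []
    for sample in x:
        products = [hi * sample for hi in h]
        y = products[0] + (state[0] if state else 0)
        # Each register stores its product plus the next register's old value.
        state = [products[i + 1] + (state[i + 1] if i + 1 < len(state) else 0)
                 for i in range(len(state))]
        out.append(y)
    return out

h = [1, 2, 3]
x = [1, 0, 0, 2, -1, 4]
print(fir_direct(h, x) == fir_transposed(h, x))
```

In the transposed form the delay line moves from the input side to the adder chain, which is why no registers are added.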

23 Block Processing
One form of vectorized parallel processing of DSP algorithms (not parallel processing in the most general sense).
Block vector: [x(3k) x(3k+1) x(3k+2)]
Clock cycle: can be 3 times longer.
Steps on the slide: start from the original FIR filter equation, rewrite 3 equations at a time, define the block vector, and obtain the block formulation. [Equations not preserved in the transcript.]

24 Block Processing

25 General approach for block processing

27 Timing Comparison
[Figure: timing charts for three schemes. A single MAC unit takes inputs x(1) to x(4) and produces y(1) to y(4). Pipelining uses separate Mul and Add stages over time steps 1 to 8. Block processing uses two units, one for the odd samples x(1), x(3), x(5), x(7) and one for the even samples x(2), x(4), x(6), x(8).]

28 Overview
Algorithm Representations and Iteration Bound
Parallelism and Pipelining
Retiming
Unfolding
Folding

29 Definitions
Retiming is a mapping from a given DFG, G, to a retimed DFG, Gr, such that the transfer functions of G and Gr differ only by a pure delay z^-L.
Purposes:
To facilitate pipelining and so reduce clock cycle time.
To reduce the number of registers needed.

30 Cut Set Retiming

31 Cut set delay transfer

32 Cut-set delay transfer failure

33 Cut-set Retiming
Delay transfer theorem
Feed-forward cut-set: adding the same arbitrary non-negative number of delays to each edge of a feed-forward cut-set of a DFG will not alter its output, except that the output timing will be delayed.
Feed-back cut-set: transferring the same number of delays from the edges of one direction across a feed-back cut-set of a DFG to all edges of the opposite direction across the same cut-set will not alter the output, only its timing.

34 Feed-forward Cut-Set Retiming
Consider the FIR digital filter and its DFG: y(n) = b0x(n) + b1x(n-1)
Critical path length = TM+TA
Select a cut set and insert one delay on each edge in the cut set.
Retiming: ynew(n) = b0x(n-1) + b1x(n-2), so ynew(n) = y(n-1)
Critical path = Max(TM, TA)
[Figure: the filter DFG before and after a delay D is inserted on each multiplier output.]
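The relation ynew(n) = y(n-1) can be checked with a small simulation. The Python sketch below is only an illustration, assuming zero initial register contents and a cut set placed on the two multiplier outputs; the function names and the registers p0, p1 are hypothetical.

```python
def fir2(b0, b1, x):
    """Original filter: y(n) = b0*x(n) + b1*x(n-1)."""
    x_prev = 0
    out = []
    for xn in x:
        out.append(b0 * xn + b1 * x_prev)
        x_prev = xn
    return out

def fir2_pipelined(b0, b1, x):
    """Same filter with one register on each multiplier output (the cut set)."""
    x_prev = 0
    p0 = p1 = 0                        # pipeline registers, initially zero
    out = []
    for xn in x:
        out.append(p0 + p1)            # the adder reads last cycle's products
        p0, p1 = b0 * xn, b1 * x_prev  # the multipliers fill the registers
        x_prev = xn
    return out

x = [3, 1, 4, 1, 5, 9]
y = fir2(2, 7, x)
y_new = fir2_pipelined(2, 7, x)
print(y_new == [0] + y[:-1])           # output delayed by one sample
```

Because the adder now only sees registered values, the multiplier and adder no longer lie on the same register-to-register path.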

35 Feed-back Cut Set Retiming
Consider an IIR digital filter y(n) = a·y(n-2) + x(n)
Before: loop bound = (TM+TA)/2, clock cycle = TM+TA
Shift 1 delay to the other edge across a feed-back cut set. The filter remains unchanged.
After: loop bound = (TM+TA)/2, clock cycle = Max(TM, TA)
[Figure: the loop with 2D on one edge before retiming, and one D on each edge after.]

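36 Feed-back Cut Set Retiming
Consider an IIR digital filter y(n) = ay(n-1) + x(n)
loop bound = TM+TA, throughput = 1/(TM+TA)
Slowed-down version with input x(m) and output y(m), where D is replaced by 2D: x(m) takes the value x(k) at m = 2k-1 and 0 at m = 2k.
Clock period = TM+TA, throughput = 1/[2(TM+TA)]
[Figure: the original loop with D and the slowed-down loop with 2D.]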

37 Time scaling

38 Slowing down the input rate

39 Loss of Efficiency

40 Slowdown + Retiming
Start with y(n) = a y(n-1) + x(n): clock cycle = Max(TM, TA), throughput = 1/[2 Max(TM, TA)]
Start with y(n) = a y(n-2) + x(n): loop bound = (TM+TA)/2, clock cycle = Max(TM, TA), throughput = 1/Max(TM, TA)
[Figure: both loops after retiming, each with one D on each edge.]

41 Slow Down for Cut-Set Retiming

42 Example of retiming
Node delay = 1 t.u.
Before retiming: critical path a3 → a4 → a5 → a6, clock cycle time = 4, 2 delay units.
After cut-set retiming: critical paths a3 → a5 and a4 → a6, clock cycle time = 2, 6 delay units.
After additional retiming: critical path none, clock cycle time = 1, 11 delay units.
[Figure: the DFG with nodes a1 to a6 at each of the three stages.]

43 Node Retiming
Transfer delay through a node in a DFG:
r(v) = # of delays transferred from the outgoing edges to the incoming edges of node v
w(e) = # of delays on edge e
wr(e) = # of delays on edge e after retiming
Retiming equation, for an edge e from u to v: wr(e) = w(e) + r(v) – r(u), subject to wr(e) ≥ 0.
Let p be a path from v0 to vk; then wr(p) = w(p) + r(vk) – r(v0).
[Figure: a node v with r(v) = 2, showing the delays on its edges before and after retiming.]

44 Invariant Properties
Retiming does NOT change the total number of delays in any cycle.
Retiming does not change the loop bounds or the iteration bound of the DFG.
If the same constant integer j is added to the retiming value of every node v in a DFG G, the retimed graph Gr is not affected. That is, the weights (# of delays) of the retimed graph remain the same.

45 Node Retiming Examples

46 DFG Illustration of the Example
Before retiming: T = max{(1+2+1)/2, (1+2+1)/3} = 2, critical path delay = 2+1 = 3 t.u.
After retiming: T = max{(1+2+1)/2, (1+2+1)/3} = 2, critical path delay = max{2, 2, 1+1} = 2 t.u.

47 Retiming for Minimizing Clock Period
Note that retiming will NOT alter the iteration bound T. The iteration bound is the theoretical minimum clock period to execute the algorithm.
Let edge e connect node u to node v. If the node computing times satisfy t(u) + t(v) > T, then the clock period exceeds T unless a delay sits on e. For such an edge, we require wr(e) ≥ 1.
To generalize, for any path p from v0 to vk whose total computing time exceeds T, we require wr(p) ≥ 1.
In other words, every possible critical path in the DFG that is longer than T must carry at least one delay after retiming.

48 Retiming Example Revisited
wr(e21) ≥ 0, since t(2)+t(1) = 2 = T.
wr(e13) ≥ 1, since t(1)+t(3) = 3 > T.
wr(e14) ≥ 1, since t(1)+t(4) = 3 > T.
wr(e32) ≥ 1, since t(3)+t(2) = 3 > T.
wr(e42) ≥ 1, since t(4)+t(2) = 3 > T.
Using wr(euv) = w(euv) + r(v) – r(u):
w(e21) + r(1) – r(2) = 1 + r(1) – r(2) ≥ 0
w(e13) + r(3) – r(1) = 1 + r(3) – r(1) ≥ 1
w(e14) + r(4) – r(1) = 2 + r(4) – r(1) ≥ 1
w(e32) + r(2) – r(3) = 0 + r(2) – r(3) ≥ 1
w(e42) + r(2) – r(4) = 0 + r(2) – r(4) ≥ 1

49 Solution continues
The retimed graph Gr remains the same if the same constant is added to all node retiming values, so we can set r(1) = 0. The inequalities become
1 – r(2) ≥ 0, or r(2) ≤ 1
1 + r(3) ≥ 1, or r(3) ≥ 0
2 + r(4) ≥ 1, or r(4) ≥ –1
r(2) – r(3) ≥ 1, or r(3) ≤ r(2) – 1
r(2) – r(4) ≥ 1, or r(2) ≥ r(4) + 1
Since 1 ≥ r(2) ≥ r(3) + 1 ≥ 1, one must have r(2) = +1. This implies r(3) ≤ 0. But we also have r(3) ≥ 0, hence r(3) = 0. These leave –1 ≤ r(4) ≤ 0. Hence the two sets of solutions are: r(3) = 0, r(2) = +1, and r(4) = 0 or –1.
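Both solutions can be checked mechanically. The Python sketch below applies wr(e) = w(e) + r(v) – r(u) to the example's edges and measures the longest zero-delay path. The edge delays are taken from the inequalities on the previous slide; the node times t(1) = t(2) = 1 and t(3) = t(4) = 2 are inferred from the sums quoted there, and the function names are hypothetical.

```python
def retime(edges, r):
    """edges: {(u, v): w}. Returns the retimed delays wr = w + r(v) - r(u)."""
    retimed = {(u, v): w + r[v] - r[u] for (u, v), w in edges.items()}
    if any(w < 0 for w in retimed.values()):
        raise ValueError("retiming produces a negative delay")
    return retimed

def critical_path(times, edges):
    """Longest computing time along any path made only of zero-delay edges.

    Assumes the zero-delay edges form no cycle (a computable DFG).
    """
    zero = {}
    for (u, v), w in edges.items():
        if w == 0:
            zero.setdefault(u, []).append(v)

    def longest_from(u):
        return times[u] + max((longest_from(v) for v in zero.get(u, [])),
                              default=0)

    return max(longest_from(u) for u in times)

times = {1: 1, 2: 1, 3: 2, 4: 2}                      # inferred node times
edges = {(2, 1): 1, (1, 3): 1, (1, 4): 2, (3, 2): 0, (4, 2): 0}

before = critical_path(times, edges)
after_a = critical_path(times, retime(edges, {1: 0, 2: 1, 3: 0, 4: 0}))
after_b = critical_path(times, retime(edges, {1: 0, 2: 1, 3: 0, 4: -1}))
print(before, after_a, after_b)
```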

50 Systematic Solutions
Given a system of inequalities: r(i) – r(j) ≤ k; 1 ≤ i, j ≤ N
Construct a constraint graph:
Map each r(i) to node i.
Add a node N+1.
For each inequality r(i) – r(j) ≤ k, draw an edge eji from node j to node i with weight w(eji) = k.
Draw N edges eN+1,i from node N+1 to each node i, each with weight 0.
The system of inequalities has a solution if and only if the constraint graph contains no negative cycles.
If a solution exists, one solution is r(i) = the length of the shortest path from node N+1 to node i.
Shortest path algorithms: Bellman-Ford algorithm, Floyd-Warshall algorithm
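The construction above can be sketched in a few lines of Python using Bellman-Ford. This is a minimal illustration, not a tuned implementation; the function name and the (i, j, k) tuple encoding of r(i) – r(j) ≤ k are assumptions. The sample constraints are the five retiming inequalities of the earlier example, rewritten in "≤" form.

```python
def solve_difference_constraints(n, constraints):
    """Solve r(i) - r(j) <= k for nodes 1..n.

    constraints: list of (i, j, k) meaning r(i) - r(j) <= k.
    Returns {i: r(i)}, or None if the constraint graph has a negative cycle.
    """
    source = n + 1
    # Edge j -> i with weight k for every inequality, plus source -> i edges.
    edges = [(j, i, k) for i, j, k in constraints]
    edges += [(source, i, 0) for i in range(1, n + 1)]

    dist = {v: float("inf") for v in range(1, n + 2)}
    dist[source] = 0
    for _ in range(n):                 # n + 1 nodes need at most n passes
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    for u, v, w in edges:              # any further improvement: negative cycle
        if dist[u] + w < dist[v]:
            return None
    return {i: dist[i] for i in range(1, n + 1)}

# r(2)-r(1) <= 1, r(1)-r(3) <= 0, r(1)-r(4) <= 1, r(3)-r(2) <= -1, r(4)-r(2) <= -1
constraints = [(2, 1, 1), (1, 3, 0), (1, 4, 1), (3, 2, -1), (4, 2, -1)]
r = solve_difference_constraints(4, constraints)
shifted = {i: r[i] - r[1] for i in r}  # normalize so that r(1) = 0
print(r, shifted)
```

Shifting every value by the same constant leaves the retimed graph unchanged, so the normalized result can be compared directly with the hand-derived solution.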

51 Overview
Algorithm Representations and Iteration Bound
Parallelism and Pipelining
Retiming
Unfolding
Folding

52 Definitions
Unfolding is the process of unrolling a loop so that several iterations are combined into the same iteration.
Also known as: loop unrolling (in compilers for parallel programs), block processing.
Applications:
Reducing the sampling period to achieve the iteration bound (desired throughput rate) T.
Parallel (block) processing to execute several iterations concurrently.
Digit-serial or bit-serial processing.

53 An example
Before unfolding:
For n = 0 to N-1, y(n)=a*y(n-9)+x(n) end
Unfolding once (J = 2):
For k = 0 to N/2-1, y(2k)=a*y(2k-9)+x(2k); y(2k+1)=a*y(2k-8)+x(2k+1) end
Unfolding twice (J = 3):
For k = 0 to N/3-1, y(3k)=a*y(3k-9)+x(3k); y(3k+1)=a*y(3k-8)+x(3k+1); y(3k+2)=a*y(3k-7)+x(3k+2) end
Block processing formulation, J = 3, 9/J = 3 (an integer):
X(k) = [x(3k) x(3k+1) x(3k+2)]T
Y(k) = [y(3k) y(3k+1) y(3k+2)]T
Y(k) = a*Y(k-3) + X(k)
J = 2, 9/J = ? (not an integer):
X(k) = [x(2k) x(2k+1)]T
Y(k) = [y(2k) y(2k+1)]T
Y(k) = a*Y(k- ? ) + X(k)
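The unfolded loops compute exactly the same sequence as the original loop. The Python sketch below checks this for J = 2 and J = 3, assuming zero initial conditions (y(n) = 0 for n < 0) and a signal length divisible by J; the function names are hypothetical.

```python
def original(a, x, delay=9):
    """y(n) = a*y(n-delay) + x(n), one sample per iteration."""
    y = []
    for n in range(len(x)):
        prev = y[n - delay] if n >= delay else 0
        y.append(a * prev + x[n])
    return y

def unfolded(a, x, J, delay=9):
    """Same recursion with J samples computed in each loop iteration."""
    N = len(x)
    if N % J != 0:
        raise ValueError("signal length must be a multiple of J")
    y = [0] * N
    for k in range(N // J):
        for i in range(J):             # the J statements of one unfolded iteration
            n = J * k + i
            prev = y[n - delay] if n >= delay else 0
            y[n] = a * prev + x[n]
    return y

x = list(range(1, 19))                 # 18 samples, divisible by 2 and 3
reference = original(2, x)
print(unfolded(2, x, 2) == reference, unfolded(2, x, 3) == reference)
```

Since the feedback delay (9) is at least J, the J statements inside one unfolded iteration do not depend on each other and could run concurrently.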

54 Unfolding the DFG
Rewrite the algorithm formulation:
y(2k)=a*y(2k-9)+x(2k) becomes y(2k)=a*y(2(k-5)+1)+x(2k)
y(2k+1)=a*y(2k-8)+x(2k+1) becomes y(2k+1)=a*y(2(k-4))+x(2k+1)
After J-fold unfolding, the clock period is T = J Ts, where Ts is the data sampling period (T = Ts before unfolding).

55 General DFG Unfolding Method
Notation: % is the remainder after integer division, and floor(x) is the largest integer not exceeding x.
Step 1. For each node U in the original DFG, draw J nodes {Ui; 0 ≤ i ≤ J-1} in the unfolded DFG.
Step 2. For each edge from U to V with w delays, draw J edges from Ui to V(i+w)%J with floor((i+w)/J) delays, for i = 0, ..., J-1.
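The two steps translate almost directly into code. The Python sketch below is a minimal illustration with a hypothetical edge-list representation (U, V, w); it is applied to the loop of y(n) = a*y(n-9) + x(n), modeled here as an adder node "A" feeding a multiplier node "M" through 9 delays.

```python
def unfold(edges, J):
    """J-unfold a DFG given as a list of (U, V, w) edges.

    Returns edges ((U, i), (V, (i + w) % J), (i + w) // J) for i = 0..J-1.
    """
    return [((u, i), (v, (i + w) % J), (i + w) // J)
            for (u, v, w) in edges
            for i in range(J)]

# Assumed model of y(n) = a*y(n-9) + x(n): A -> M carries 9D, M -> A carries none.
loop = [("A", "M", 9), ("M", "A", 0)]
for edge in unfold(loop, 2):
    print(edge)
```

For J = 2 the edge A1 to M0 carries 5 delays and A0 to M1 carries 4, matching y(2k) = a*y(2(k-5)+1) + x(2k) and y(2k+1) = a*y(2(k-4)) + x(2k+1), and the total of 9 delays is preserved.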

56 Another DFG Unfolding Example
J = 2. The original DFG has nodes Q, R, S, T, edges carrying 2D and 3D, and T = 3.
Step 1. Duplicate J copies of each node: Q0, Q1, R0, R1, S0, S1, T0, T1.

57 Another DFG Unfolding Example
Step 2. Add all edges with 0 delay on them.

58 Another DFG Unfolding Example
Step 3. Use the table of i, w and (i+w)%J to work out the edges with delays. The unfolded DFG has T = 6.

59 Properties of Unfolding
Unfolding preserves the number of registers (delays) in a DFG.
A loop with w delays in a DFG that is unfolded J times leads to g.c.d.(w, J) loops in the unfolded DFG, each containing w/g.c.d.(w, J) delays and J/g.c.d.(w, J) copies of each node that appears in the original loop.
Unfolding a DFG with iteration bound T results in a J-unfolded DFG with iteration bound JT.
A path with w (< J) delays in a DFG leads to J-w paths with no delays and w paths with 1 delay each in the J-unfolded DFG.
Any path in the original DFG containing J or more delays leads to J paths with 1 or more delays in each path. Therefore, it cannot create a critical path in the J-unfolded DFG.
Any clock period that can be achieved by retiming a J-unfolded DFG can be achieved by retiming the original DFG followed by J-unfolding.

