Download presentation

Presentation is loading. Please wait.

Published byRafael Prigge Modified about 1 year ago

1
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-1 VLSI Signal Processing Lecture 2 Unfolding Transformation

2
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-2 Multiple-Data Processing Create a program with more than one iteration, e.g. J loops unrolling Example: Loop unrolling + software pipelining 1 2 3 4 5 6 7 8 clock cycle operation 1 2 3 1 2 3 1 2 1 1 1 2 2 2 3 3 3 1 2 3 4 5 6 7 8 clock cycle

3
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-3 Basic Ideas Parallel processingPipelined processing a1a2a3a4 b1b2b3b4 c1c2c3c4 d1d2d3d4 a1b1c1d1 a2b2c2d2 a3b3c3d3 a4b4c4d4 P1 P2 P3 P4 P1 P2 P3 P4 time

4
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-4 Data Dependence Parallel processing requires NO data dependence between processors Pipelined processing will involve inter-processor communication P1 P2 P3 P4 P1 P2 P3 P4 time

5
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-5 Parallel Processing In a J-unfolded system, each delay is J-slow. That is, if input to a delay element is x(kJ+m), then the output is x((k-1)J+m) = x(kJ+m-J)

6
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-6 Parallel Processing Block processing –the number of inputs processed in a clock cycle is referred to as the block size –at the k-th clock cycle, three inputs x(3k), x(3k+1), and x(3k+2) are processed simultaneously to generate y(3k), y(3k+1), and y(3k+2)

7
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-7 I/O Conversion Serial to parallel converter Parallel to serial converter

8
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-8 General approach for block processing

9
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-9 Mathematical Formulation e.g. y(n) = ay(n-9) + x(n) 2-parallel Y(2k) = ay(2k-9) + x(2k) Y(2k+1) = ay(2k-8) + x (2k+1) In 2-parallel SDFG, one active clock edge leads two samples Y(2k) = ay(2(k-5)+1) + x(2k) Y(2k+1) = ay(2(k-4)+0) + x(2k+1) Dependency with less than # parallelism of sample delays can be implemented with internal routing

10
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-10 Unfolding the DFG T=T s T=J T s Not trivial, even for a simple graph

11
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-11 Block Processing for FIR Filter One form of vectorized parallel processing of DSP algorithms. (Not the parallel processing in most general sense) Block vector: [x(3k) x(3k+1) x(3k+2)] Clock cycle: can be 3 times longer Original (FIR filter): Rewrite 3 equations at a time:

12
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-12 Block Processing

13
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-13 Block Processing for IIR Digital Filter Original formulation: Rewrite: Vector formulation: n: sample period k: processor period T sample ≠T clk

14
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-14 Block IIR Filter D D S/PP/S + + x(2k) x(2k+1) y(2k+1) y(2k) x(n)y(n) y(2(k 1)) y(2(k 1)+1) clock period not equal to sampling period

15
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-15 Timing Comparison Pipelining Block processing 1234 x(1)x(2)x(3)x(4) y(1)y(2)y(3)y(4) 12345678 x(1)x(2)x(3)x(4)x(5)x(6)x(7) MAC 12345678 y(1)y(2)y(3)y(4)y(5)y(6)y(7) Add a y(1) Mul 11335577 22446688 x(2)x(4)x(6)x(8) x(1)x(3)x(5)x(7)

16
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-16 Definitions Unfolding is the process of unfolding a loop so that several iterations are unrolled into the same iteration. Also known as (a.k.a.) –Loop unrolling (in compilers for parallel programs) –Block processing Applications –Reducing sampling period to achieve iteration bound (desired throughput rate) T . –Parallel (block processing) to execute several iterations concurrently. –Digit-serial or bit-serial processing

17
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-17 Unfolding the DFG y(n)=ay(n-9)+x(n) Rewrite the algorithm formulation: y(2k)=ay(2k-9)+x(2k) y(2k+1)=ay(2k-8)+x(2k+1) y(2k)=ay(2(k-5)+1)+x(2k) y(2k+1)=ay(2(k-4))+x(2k+1) After J-folded unfolding, the clock period T = J T s, where T s is the data sampling period.

18
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-18 Timing Diagram Above timing diagram is obtained assuming that the sampling period T s remains unchanged. Thus, the clock period T is increased J-fold. Since 9/2 is not an integer, output (y(0), y(1)) will be needed by two different future iterations, 4T and 5T later. y(0)y(1)y(2)y(3)y(4)y(5)y(6)y(7)y(8)y(9)y(10)y(11)y(12)y(13) T=T s y(0)y(2)y(4)y(6)y(8)y(10)y(12) y(1)y(3)y(5)y(7)y(9)y(11)y(13) T=2T s 9 T 4T 5T 9 T

19
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-19 Another DFG Unfolding Example Q S T R 3D 2D Q0Q0 S0S0 T0T0 R0R0 Q1Q1 S1S1 T1T1 R1R1 J=2 T =3 iw(i+w)%J 0000 0201 0311 1010 1211 1302 Step 1. Duplicate J copies of each node

20
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-20 Another DFG Unfolding Example Q S T R 3D 2D Q0Q0 S0S0 T0T0 R0R0 Q1Q1 S1S1 T1T1 R1R1 J=2 T =3 iw(i+w)%J 0000 0201 0311 1010 1211 1302 Step 2. Add all edges with 0 delay on them.

21
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-21 Another DFG Unfolding Example Q S T R 3D 2D Q0Q0 S0S0 T0T0 R0R0 D Q1Q1 S1S1 T1T1 R1R1 D D J=2 T =3 T =6 iw(i+w)%J 0000 0201 0311 1010 1211 1302 Step 3. Use table on the left to figure out edges with delays.

22
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-22 Unfolding Transformation For each node U in the original DFG, draw J node U 0, U 1,…, U J-1 For each edge UV with w delays in the original DFG, draw the J edges U i V (i + w)%J with floor[(i+w)/J] delays for i=0,1,…, J-1 Example Unfolding of an edge with w delays in the original DFG produces J- w edges with no delays and w edges with 1delay in J-unfolded DFG for w < J Unfolding preserves precedence constraints of a DSP algorithm

23
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-23 Precedence Preservation

24
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-24 Delay Preservation Unfolding preserves the number of delays in a DFG Let, where

25
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-25 Example Unfold the following DFG using folding factor 2 and 5

26
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-26 Properties of Unfolding Unfolding preserves the number of registers (delays) in a DFG For a loop with w delays in a DFG that has been unfolded J times, it leads to –g.c.d.(w, J) loops in the unfolded DFG, with each of these loops containing W/(g.c.d.(w,J)) delays and J/(g.c.d.(w,J)) copies of each node that appear in the original loop. Unfolding a DFG with iteration bound T results in a J-folded DFG with iteration bound JT . A path with w (< J) delays in a DFG will lead to J - w paths with no delays, and w paths with 1 delay each in the J-unfolded DFG. Any clock period that can be achieved by retiming a J-unfolded DFG can be achieved by retiming the original DFG and followed by J-unfolding.

27
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-27 When a Loop is Unfolded A loop ℓ with w delays in a DFG Travel the loop A~>A p times also a loop with pw delays In J-unfolded DFG, consider the path A i A (i+pw)%J. It is a loop if i=(i+ pw)%J. This implies that J | pw The smallest p = J/gcd(J, w). That is, in J-unfolded DFG, one can travel the loop A~>A J/gcd(J, w) times. Recall that there are totally J copies of node A. Hence, there are J/(J/gcd(J,w))=gcd(J, w) loops and each loop contains w/ gcd(J, w) delays. The iteration bound in J-unfolded DFG is then

28
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-28 When a Path is Unfolded If w

29
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-29 When a Path is Unfolded Suppose r’ is a legal retiming for the J-unfolded DFG, G J, which leads to critical path c. Let r(U) = i r’(U i ), 0≤i≤J-1. –r is a feasible retiming for the original DFG, G. –The retiming leads to a critical path c 0≤i≤J-1 ii

30
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-30 Sample Period Reduction Case1: A node in the DFG having computation time greater than T ∞ Case2: Iteration bound is not an integer Case3: Longest node computation is larger than the iteration T ∞, and T ∞ is not an integer

31
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-31 Case 1 Critical path dominates, since a node computation time is more than iteration bound Retiming cannot be used to reduce sample period

32
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-32 Sample Period Reduction Rule of Thumb: T ∞ =6, T critical =6

33
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-33 Case 2 Iteration period cannot not achieve the iteration bound

34
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-34 Sample Period Reduction

35
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-35 Case 3

36
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-36 Parallel Processing Parallel processing can be performed by unfolding

37
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-37 Bit-Level Parallel Processing

38
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-38

39
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-39 Bit-Serial Adder

40
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-40 Unfolding of Switches

41
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-41 Example

42
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-42 Example

43
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-43 Example

44
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-44 Example

45
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-45 Switches with Delays

46
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-46 Switch with Delays

47
ADSP Lecture2 - Unfolding (cwliu@twins.ee.nctu.edu.tw)2-47 If Wordlength is not a Multiple of J

Similar presentations

© 2017 SlidePlayer.com Inc.

All rights reserved.

Ads by Google