Presenter MaxAcademy Lecture Series – V1.0, September 2011 Stream Scheduling.


1 Presenter MaxAcademy Lecture Series – V1.0, September 2011 Stream Scheduling

2 Overview
Latencies in stream computing
Scheduling algorithms
Stream offsets

3 Latencies in Stream Computing
Consider a simple arithmetic pipeline: (A + B) + C
Each operation has a latency
– Number of cycles from input to output
– May be zero
– Throughput is still 1 value per cycle; L values can be in flight in the pipeline
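The latency and throughput claims above can be illustrated with a small software model. This is only a sketch: the `Pipeline` class and `run` helper are hypothetical names, and a hardware stage is modelled here as a shift register holding `latency` values.

```python
from collections import deque


class Pipeline:
    """Hypothetical model of a fixed-latency pipeline stage (a shift register)."""

    def __init__(self, latency):
        # One slot per cycle of latency; None marks "no valid data yet".
        self.regs = deque([None] * latency)

    def tick(self, value):
        """Advance one cycle: accept one value, emit the one entered `latency` cycles ago."""
        if not self.regs:  # latency 0 behaves like a plain wire
            return value
        self.regs.append(value)
        return self.regs.popleft()


def run(latency, inputs, extra_cycles):
    """Feed one input per cycle, then idle for `extra_cycles` to drain the pipeline."""
    stage = Pipeline(latency)
    return [stage.tick(v) for v in list(inputs) + [None] * extra_cycles]


outputs = run(3, [10, 20, 30, 40], 3)
```

With a latency of 3, the first result appears three cycles late, but after that one result leaves the stage on every cycle, and up to three values are in flight at once.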

4 [Diagram: two chained adders computing (A + B) + C, fed by Input A, Input B and Input C, driving Output] Basic hardware implementation

5 [Same diagram] Data propagates through the circuit in “lock step”

6 [Same diagram, next animation step]

7 [Same diagram] Data arrives at the wrong time due to pipeline latency

8 [Same diagram] Insert buffering to correct

9 [Same diagram] Now with buffering

10 to 13 [Same diagram, animation steps]

14 [Same diagram] Success!
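The effect shown in these slides can be reproduced in a rough software model. The sketch below assumes each adder has a latency of 1 cycle and represents a stream as a list with one entry per cycle (`None` for "no valid data"); `delay`, `add` and `valid` are hypothetical helpers, not part of any real toolchain.

```python
def delay(stream, cycles):
    """Buffer a per-cycle stream by `cycles` cycles."""
    return [None] * cycles + list(stream)


def add(xs, ys, latency=1):
    """Pipelined adder: adds whatever is present on the same cycle,
    and the result appears `latency` cycles later."""
    n = max(len(xs), len(ys))
    xs = list(xs) + [None] * (n - len(xs))
    ys = list(ys) + [None] * (n - len(ys))
    sums = [None if x is None or y is None else x + y for x, y in zip(xs, ys)]
    return delay(sums, latency)


def valid(stream):
    """Drop the cycles that carry no valid data."""
    return [v for v in stream if v is not None]


A, B, C = [1, 2, 3], [10, 20, 30], [100, 200, 300]
ab = add(A, B)                          # A + B arrives one cycle late
unbuffered = valid(add(ab, C))          # C is one cycle ahead of A + B
buffered = valid(add(ab, delay(C, 1)))  # one buffer on C realigns the streams
```

Without the buffer, each partial sum is paired with the following element of C, so `unbuffered` holds wrong values; with one cycle of buffering on C, `buffered` holds the intended results. In real hardware the misaligned cycles would produce garbage rather than being dropped, which this model glosses over.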

15 Stream Scheduling Algorithms
A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations.
Can be applied automatically to a large dataflow graph (many thousands of nodes).
Can try to optimize for various metrics
– Latency from inputs to outputs
– Amount of buffering inserted → generally most interesting
– Area (resource sharing)

16 ASAP: As Soon As Possible

17 [Diagram: Input A, Input B and Input C, each labelled with latency 0] Build up the circuit incrementally, keeping track of latencies

18 [Diagram: first adder placed; inputs labelled 0, adder output labelled 1]

19 [Diagram] Input latencies are mismatched

20 [Diagram] Insert buffering

21 [Diagram: completed circuit from Input A, Input B and Input C to Output]
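The ASAP steps in these slides can be sketched as follows. This is a minimal illustration, not the MaxCompiler implementation: the graph encoding (a dict mapping each node to its latency and predecessors, listed in topological order), the node names and the adder latency of 1 are all assumptions.

```python
def asap(nodes):
    """ASAP schedule. nodes: name -> (latency, [predecessors]), in topological order.

    Returns (start, buffers): the cycle at which each node consumes its inputs,
    and the buffering needed on each edge whose data arrives early.
    """
    start, buffers = {}, {}
    for name, (_, preds) in nodes.items():
        # Cycle at which each predecessor's result becomes available.
        ready = [start[p] + nodes[p][0] for p in preds]
        # Inputs have no predecessors and are fixed at cycle 0.
        start[name] = max(ready, default=0)
        for p, r in zip(preds, ready):
            if start[name] > r:
                buffers[(p, name)] = start[name] - r
    return start, buffers


# (A + B) + C with an assumed adder latency of 1 cycle.
graph = {
    "A": (0, []), "B": (0, []), "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out": (0, ["add2"]),
}
asap_start, asap_buffers = asap(graph)
```

Each node starts as soon as its slowest input is ready, and any faster input is buffered to match. For this graph the only buffer is one cycle on the edge from Input C to the second adder.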

22 ALAP: As Late As Possible

23 [Diagram: Output labelled with latency 0] Start at the output

24 [Diagram: adder placed before Output] Latencies are negative relative to the end of the circuit

25 [Diagram: Output and Input C; latency labels -2 and 0]

26 [Diagram: Output with Input A, Input B and Input C; latency labels -2 and 0]

27 [Same diagram] Buffering is saved
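The ALAP procedure can be sketched in the same style, working backwards from the output. As before, the graph encoding, node names and adder latency of 1 are assumptions made for illustration.

```python
def alap(nodes):
    """ALAP schedule. nodes: name -> (latency, [predecessors]), in topological order.

    Outputs are fixed at cycle 0; every other node starts as late as its
    earliest consumer allows, so start times are zero or negative.
    """
    succs = {name: [] for name in nodes}
    for name, (_, preds) in nodes.items():
        for p in preds:
            succs[p].append(name)

    start, buffers = {}, {}
    for name in reversed(list(nodes)):  # visit consumers before producers
        latency = nodes[name][0]
        start[name] = min((start[s] for s in succs[name]), default=0) - latency
        for s in succs[name]:
            slack = start[s] - (start[name] + latency)
            if slack > 0:
                buffers[(name, s)] = slack
    return start, buffers


# (A + B) + C with an assumed adder latency of 1 cycle.
graph = {
    "A": (0, []), "B": (0, []), "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out": (0, ["add2"]),
}
alap_start, alap_buffers = alap(graph)
```

Here Input C is simply scheduled one cycle later than Inputs A and B, so no buffer is needed inside the circuit.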

28 [Diagram: the same circuit with a second output, Output 2, alongside Output 1] Sometimes this is suboptimal. What if we add an extra output?

29 [Same diagram; latency labels -2, 0 and 0] Unnecessary buffering is added. Neither ASAP nor ALAP can schedule this design optimally

30 Optimal Scheduling
ASAP and ALAP each fix either the inputs or the outputs in place.
More complex scheduling algorithms, e.g. using integer linear programming (ILP), may be able to find a better schedule.
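To make the gap concrete, the sketch below evaluates three schedules of a small two-output graph. Several things here are assumptions: the transcript does not say where the second output is attached, so `out2` is wired directly to Input A purely for illustration; adder latency is taken as 1; and an exhaustive search over small start times stands in for an ILP solver.

```python
from itertools import product

# name -> (latency, [predecessors]); "out2" tapping Input A is an assumed wiring.
NODES = {
    "A": (0, []), "B": (0, []), "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out1": (0, ["add2"]),
    "out2": (0, ["A"]),
}


def buffering(start):
    """Total cycles of buffering for a schedule, or None if any edge
    would need data before it has been produced."""
    total = 0
    for name, (_, preds) in NODES.items():
        for p in preds:
            slack = start[name] - (start[p] + NODES[p][0])
            if slack < 0:
                return None
            total += slack
    return total


def best_schedule(span=3):
    """Exhaustively try every start time in range(span) for every node."""
    names = list(NODES)
    best = None
    for times in product(range(span), repeat=len(names)):
        start = dict(zip(names, times))
        cost = buffering(start)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, start)
    return best


# ASAP pins every input at cycle 0; ALAP pins every output at cycle 0.
asap_cost = buffering({"A": 0, "B": 0, "C": 0, "add1": 0, "add2": 1, "out1": 2, "out2": 0})
alap_cost = buffering({"A": -2, "B": -2, "C": -1, "add1": -2, "add2": -1, "out1": 0, "out2": 0})
best_cost, best_start = best_schedule()
```

Under these assumptions ASAP needs one cycle of buffering (on Input C), ALAP needs two (on the path to the second output), and letting both inputs and outputs float needs none. Exhaustive search only works for toy graphs; the constraints in `buffering` are the inequalities an ILP formulation would encode.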

31 Buffering data on-chip
Consider: a[i] = a[i] + (a[i - 1] + b[i - 1])
We can see that we might need some explicit buffering to hold more than one data element on-chip.
We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))
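The data-level meaning of the explicit buffers can be sketched on ordinary lists. The `buffer` function below is a hypothetical model of the slide's `buffer(x, 1)`, and the fill value used before the start of the stream is an assumption, since the slides do not say what happens at the boundary.

```python
def buffer(stream, depth, fill=0):
    """Hypothetical model of buffer(x, depth): element i of the result is x[i - depth].

    `fill` stands in for the values "before" the start of the stream.
    """
    return ([fill] * depth + list(stream))[:len(stream)]


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]

# a = a + (buffer(a, 1) + buffer(b, 1)), evaluated element by element
result = [x + (pa + pb) for x, pa, pb in zip(a, buffer(a, 1), buffer(b, 1))]
```

Each output element combines the current value of `a` with the previous values of `a` and `b`, which is why one earlier element of each stream has to be held on-chip.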

32 [Diagram: Input A and Input B feeding Output through Buffer(1) elements] The buffer has zero latency in the schedule

33 [Same diagram] This will schedule thus. Buffering =

34 Buffers and Latency
Accessing previous values with buffers is looking backwards in the stream.
This is equivalent to having a wire with negative latency
– Cannot be implemented directly, but can affect the schedule

35 [Diagram: Input A and Input B feeding Output through Offset(-1) wires] Offset wires can have negative latency

36 [Same diagram] This is scheduled. Buffering = 0

37 Stream Offsets
A stream offset is just a wire with a positive or negative latency.
Negative latencies look backwards in the stream.
Positive latencies look forwards in the stream.
The entire dataflow graph will re-schedule to make sure the right data value is present when needed.
Buffering could be placed anywhere, or pushed into inputs or outputs → more optimal than manual instantiation
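At the data level, an offset can be pictured as indexing the stream relative to the current position. The `offset` function below is a hypothetical list-based model, not the MaxCompiler `stream.offset` call, and returning `None` outside the stream is an assumed boundary behaviour.

```python
def offset(stream, k, fill=None):
    """Hypothetical model of a stream offset: element i of the result is stream[i + k].

    Negative k looks backwards in the stream, positive k looks forwards.
    """
    n = len(stream)
    return [stream[i + k] if 0 <= i + k < n else fill for i in range(n)]


a = [1, 2, 3, 4]
previous = offset(a, -1)   # looks one element back
following = offset(a, +1)  # looks one element ahead
```

The model describes only which value each position sees; where the corresponding buffering ends up in hardware is decided by the scheduler.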

38 [Diagram: Input A (latency 0) feeding an adder both directly and through Offset(1), driving Output] a = a + stream.offset(a, +1), i.e. a[i] = a[i] + a[i + 1]

39 [Same diagram; latency labels 0, 1, 1 and 2] Scheduling produces a circuit with 1 buffer
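Treating an offset as extra (possibly negative) latency on an edge is enough to reproduce both offset examples from these slides. The sketch below is an ASAP-style pass under assumed conventions: each edge is a `(predecessor, offset)` pair, adders take 1 cycle, and the function and node names are hypothetical.

```python
def schedule(nodes):
    """ASAP-style schedule where every edge carries a stream offset.

    nodes: name -> (latency, [(predecessor, offset), ...]), in topological order.
    Returns (start, total): start cycle per node and total cycles of buffering.
    """
    start, total = {}, 0
    for name, (_, edges) in nodes.items():
        # An offset behaves like a wire whose latency may be negative.
        arrivals = [start[p] + nodes[p][0] + off for p, off in edges]
        start[name] = max(arrivals, default=0)
        total += sum(start[name] - t for t in arrivals)
    return start, total


# a[i] + (a[i - 1] + b[i - 1]) using Offset(-1) wires
backward = {
    "A": (0, []), "B": (0, []),
    "add1": (1, [("A", -1), ("B", -1)]),
    "add2": (1, [("A", 0), ("add1", 0)]),
    "out": (0, [("add2", 0)]),
}

# a[i] + a[i + 1] using an Offset(+1) wire
forward = {
    "A": (0, []),
    "add": (1, [("A", 0), ("A", 1)]),
    "out": (0, [("add", 0)]),
}

backward_start, backward_buffering = schedule(backward)
forward_start, forward_buffering = schedule(forward)
```

In the backward case the negative wire latency cancels the adder latency, so no buffering is needed. In the forward case the adder has to wait one cycle for the next element, so the direct path from Input A picks up one buffer.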

40 Exercises
For the questions below, assume that the latency of an addition operation is 10 cycles and that of a multiplication is 5 cycles, while inputs and outputs take 0 cycles.
1. Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph.
2. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling for:
a) c = ((a1 + a2) + a3) + a4
b) c = (a1 + a2) + (a3 + a4)
3. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:
a) c = ((a1 * a2) + (a3 * a4)) + a1
b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4
How many values of stream a1 will be buffered on-chip for (b)?

