Presenter MaxAcademy Lecture Series – V1.0, September 2011 Stream Scheduling.

Overview
Latencies in stream computing
Scheduling algorithms
Stream offsets

Latencies in Stream Computing
Consider a simple arithmetic pipeline: (A + B) + C
Each operation has a latency
– Number of cycles from input to output
– May be zero
– Throughput is still 1 value per cycle; L values can be in flight in the pipeline
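The latency and throughput point can be illustrated with a small software model. This is only a sketch: `run_pipeline` and its parameters are hypothetical names introduced here, and the operator is modelled as a shift register with one slot per cycle of latency, which is an assumption about the hardware rather than something the slides state.

```python
from collections import deque

def run_pipeline(inputs, latency, op=lambda x: x):
    """Feed one value per cycle through an operator with `latency` stages.

    Returns a list of (cycle, result) pairs, where `cycle` is the clock
    cycle on which each result leaves the pipeline.
    """
    stages = deque([None] * latency)  # None marks an empty pipeline slot
    results = []
    # After the last input, run `latency` more cycles to drain the pipeline.
    feed = list(inputs) + [None] * latency
    for cycle, value in enumerate(feed):
        stages.append(None if value is None else op(value))
        out = stages.popleft()  # the oldest in-flight value leaves
        if out is not None:
            results.append((cycle, out))
    return results

# Three values through a 3-cycle operator: each result appears 3 cycles
# after its input, but one result still emerges on every cycle.
timeline = run_pipeline([10, 20, 30], latency=3)
```

In this model the deque always holds exactly `latency` slots between cycles, which corresponds to the L values in flight.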

Basic hardware implementation
[Figure: Inputs A and B feed the first adder; its result and Input C feed the second adder, which drives Output]

Data propagates through the circuit in “lock step”
[Figure sequence: the input values 1, 2 and 3 advance through the unbuffered circuit one stage per cycle]
Data arrives at wrong time due to pipeline latency

Insert buffering to correct
[Figure sequence: now with buffering, the inputs 1, 2 and 3 advance in lock step; the first adder produces 3, which meets the buffered 3 at the second adder, and the output is 6]
Success!
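The lock-step behaviour in these slides can be reproduced with a cycle-by-cycle model. The sketch below assumes each adder has a latency of one cycle and that the correcting buffer is a single register on Input C; `simulate` and its register names are hypothetical, introduced only for illustration.

```python
def simulate(a, b, c, buffer_c):
    """Cycle-by-cycle model of (A + B) + C with one-cycle adders.

    If `buffer_c` is True, input C passes through a one-cycle buffer
    before it reaches the second adder.
    """
    n = len(a)
    add1_reg = None  # result register of the first adder
    add2_reg = None  # result register of the second adder
    c_reg = None     # optional one-cycle buffer on input C
    outputs = []
    for cycle in range(n + 2):  # two extra cycles drain both adders
        if add2_reg is not None:
            outputs.append(add2_reg)
        have_input = cycle < n
        a_in = a[cycle] if have_input else None
        b_in = b[cycle] if have_input else None
        c_in = c[cycle] if have_input else None
        # The second adder sees either the buffered or the raw C value.
        c_seen = c_reg if buffer_c else c_in
        if add1_reg is not None and c_seen is not None:
            next_add2 = add1_reg + c_seen
        else:
            next_add2 = None  # nothing valid to add on this cycle
        next_add1 = a_in + b_in if have_input else None
        # All registers update together at the clock edge.
        add1_reg, add2_reg, c_reg = next_add1, next_add2, c_in
    return outputs

a, b, c = [1, 10], [2, 20], [3, 30]
unbuffered = simulate(a, b, c, buffer_c=False)  # pairs 1 + 2 with 30
buffered = simulate(a, b, c, buffer_c=True)     # pairs 1 + 2 with 3
```

Without the buffer, the first adder's result for element 0 meets the C value of element 1, so the sums are wrong. With the buffer, each partial sum meets the C value of its own element. The model simply drops cycles where one operand is missing, whereas real hardware would compute on whatever value happens to be on the wire.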

Stream Scheduling Algorithms
A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations
Can be automatically applied on a large dataflow graph (many thousands of nodes)
Can try to optimize for various metrics
– Latency from inputs to outputs
– Amount of buffering inserted → generally most interesting
– Area (resource sharing)

ASAP: As Soon As Possible
Build up circuit incrementally, keeping track of latencies
[Figure sequence: Inputs A, B and C are placed at latency 0; the first adder is added and its output is at latency 1; the second adder is added]
Input latencies are mismatched
Insert buffering
[Figure: the completed circuit, with buffering on the input that arrives early and Output at latency 2]
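The ASAP walkthrough can be written as a short forward pass over the dataflow graph. This is a minimal sketch, not MaxCompiler's implementation: the graph encoding (`nodes` in topological order, `edges` as source/destination pairs, a `latency` table) and the function name are assumptions made for illustration, and adders are given a latency of 1 as in the slides.

```python
def asap_schedule(nodes, edges, latency):
    """ASAP scheduling: every node starts as soon as all inputs are ready.

    `nodes` must be in topological order. Returns the start time of each
    node and the buffering (in cycles) needed on each edge.
    """
    start = {}
    for node in nodes:
        # Time at which each predecessor's result becomes available.
        ready = [start[s] + latency[s] for s, d in edges if d == node]
        start[node] = max(ready, default=0)  # graph inputs start at 0
    buffering = {
        (s, d): start[d] - (start[s] + latency[s]) for s, d in edges
    }
    return start, buffering

# (A + B) + C with 1-cycle adders and zero-latency inputs and output.
nodes = ["A", "B", "C", "add1", "add2", "out"]
edges = [("A", "add1"), ("B", "add1"),
         ("add1", "add2"), ("C", "add2"), ("add2", "out")]
latency = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out": 0}

start, buffering = asap_schedule(nodes, edges, latency)
```

For this graph the only edge with non-zero buffering is the one from Input C to the second adder, which matches the mismatch shown on the slides.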

ALAP: As Late As Possible
Start at output
[Figure sequence: Output is placed at latency 0 and the circuit is built backwards: the second adder, then the first adder and Input C, then Inputs A and B; the earliest point in the circuit is at latency -2]
Latencies are negative relative to end of circuit
Buffering is saved
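ALAP is the mirror image of ASAP: a backward pass from the outputs. The sketch below uses the same hypothetical graph encoding as one might use for ASAP (topologically ordered `nodes`, `edges` as pairs, a `latency` table) and assumes 1-cycle adders. It illustrates the idea and is not MaxCompiler's scheduler.

```python
def alap_schedule(nodes, edges, latency):
    """ALAP scheduling: every node starts as late as its consumers allow.

    `nodes` must be in topological order. Outputs are pinned at time 0,
    so all other start times are zero or negative.
    """
    start = {}
    for node in reversed(nodes):
        # Start times of the nodes that consume this node's result.
        needed = [start[d] for s, d in edges if s == node]
        start[node] = min(needed, default=0) - latency[node]
    buffering = {
        (s, d): start[d] - (start[s] + latency[s]) for s, d in edges
    }
    return start, buffering

# (A + B) + C with 1-cycle adders and zero-latency inputs and output.
nodes = ["A", "B", "C", "add1", "add2", "out"]
edges = [("A", "add1"), ("B", "add1"),
         ("add1", "add2"), ("C", "add2"), ("add2", "out")]
latency = {"A": 0, "B": 0, "C": 0, "add1": 1, "add2": 1, "out": 0}

start, buffering = alap_schedule(nodes, edges, latency)
```

Here Input C is scheduled one cycle later than Inputs A and B instead of being buffered, so no edge needs any buffering.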

Sometimes this is suboptimal
What if we add an extra output?
[Figure: the same circuit with two outputs, Output 1 and Output 2; ALAP places both outputs at latency 0]
Unnecessary buffering is added
Neither ASAP nor ALAP can schedule this design optimally

Optimal Scheduling
ASAP and ALAP both fix either inputs or outputs in place
More complex scheduling algorithms may be able to develop a better schedule, e.g. using integer linear programming (ILP)
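To make the comparison concrete, the sketch below scores three schedules for a two-output circuit. Several things here are assumptions: Output 2 is taken to tap the result of the first adder, adders have a latency of 1, buffering is counted as the sum of slack over all edges, and an exhaustive search over small start times stands in for a real ILP solver. The ASAP and ALAP schedules are written out by hand under those assumptions.

```python
from itertools import product

NODES = ["A", "B", "C", "add1", "add2", "out1", "out2"]
EDGES = [("A", "add1"), ("B", "add1"), ("add1", "add2"),
         ("C", "add2"), ("add2", "out1"), ("add1", "out2")]
LATENCY = {"A": 0, "B": 0, "C": 0,
           "add1": 1, "add2": 1, "out1": 0, "out2": 0}

def total_buffering(start):
    """Sum of per-edge slack, or None if any edge is scheduled too early."""
    slack = [start[d] - (start[s] + LATENCY[s]) for s, d in EDGES]
    return None if min(slack) < 0 else sum(slack)

# ASAP pins all inputs at 0; ALAP pins all outputs at 0.
ASAP = {"A": 0, "B": 0, "C": 0, "add1": 0, "add2": 1, "out1": 2, "out2": 1}
ALAP = {"A": -2, "B": -2, "C": -1, "add1": -2, "add2": -1,
        "out1": 0, "out2": 0}

def exhaustive_search(max_time=2):
    """Try every assignment of start times in 0..max_time (tiny graphs only)."""
    best_cost, best_start = None, None
    for times in product(range(max_time + 1), repeat=len(NODES)):
        start = dict(zip(NODES, times))
        cost = total_buffering(start)
        if cost is not None and (best_cost is None or cost < best_cost):
            best_cost, best_start = cost, start
    return best_cost, best_start

asap_cost = total_buffering(ASAP)   # buffers Input C
alap_cost = total_buffering(ALAP)   # buffers the path to Output 2
best_cost, best_start = exhaustive_search()
```

Under these assumptions both ASAP and ALAP need one cycle of buffering, while a schedule that lets Input C arrive a cycle late and Output 2 leave a cycle early needs none. The search is exponential in the number of nodes, which is why a practical scheduler would use a solver rather than enumeration.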

Buffering data on-chip
Consider: a[i] = a[i] + (a[i - 1] + b[i - 1])
We can see that we might need some explicit buffering to hold more than one data element on-chip
We could do this explicitly, with buffering elements: a = a + (buffer(a, 1) + buffer(b, 1))

[Figure: two-adder circuit with Inputs A and B and an explicit Buffer(1) element]
The buffer has zero latency in the schedule

[Figure: the same circuit annotated with schedule latencies 0, 1 and 2]
This will schedule thus: Buffering = 3

Buffers and Latency
Accessing previous values with buffers is looking backwards in the stream
This is equivalent to having a wire with negative latency
– Cannot be implemented directly, but can affect the schedule

[Figure: the two-adder circuit with the explicit buffer replaced by an Offset(-1) wire]
Offset wires can have negative latency

[Figure: the same circuit annotated with schedule latencies 0 and 1]
This is scheduled: Buffering = 0

Stream Offsets
A stream offset is just a wire with a positive or negative latency
Negative latencies look backwards in the stream
Positive latencies look forwards in the stream
The entire dataflow graph will re-schedule to make sure the right data value is present when needed
Buffering could be placed anywhere, or pushed into inputs or outputs → more optimal than manual instantiation
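Treating an offset as extra wire latency fits naturally into an ASAP-style pass. In the sketch below each edge carries an `offset` that is added to the producer's latency. The graph encoding and function name are hypothetical, adders are assumed to take 1 cycle, and the second example assumes the Offset(-1) wire sits on the output of the inner adder.

```python
def schedule_with_offsets(nodes, edges, latency):
    """ASAP-style pass where each edge is (src, dst, offset).

    An offset of +k delays the wire by k cycles (looks forwards in the
    stream); an offset of -k gives it a latency of -k (looks backwards).
    """
    start = {}
    for node in nodes:  # nodes must be in topological order
        arrivals = [start[s] + latency[s] + off
                    for s, d, off in edges if d == node]
        start[node] = max(arrivals, default=0)
    total = sum(start[d] - (start[s] + latency[s] + off)
                for s, d, off in edges)
    return start, total

# a[i] = a[i] + a[i + 1]: the direct wire must wait for the offset wire.
fwd_nodes = ["A", "add", "out"]
fwd_edges = [("A", "add", 0), ("A", "add", +1), ("add", "out", 0)]
fwd_latency = {"A": 0, "add": 1, "out": 0}
fwd_start, fwd_buffering = schedule_with_offsets(
    fwd_nodes, fwd_edges, fwd_latency)

# a[i] = a[i] + (a[i - 1] + b[i - 1]): the inner adder's 1-cycle latency
# cancels the -1 offset, so its result lines up with the direct a wire.
back_nodes = ["A", "B", "add1", "add2", "out"]
back_edges = [("A", "add1", 0), ("B", "add1", 0),
              ("add1", "add2", -1), ("A", "add2", 0), ("add2", "out", 0)]
back_latency = {"A": 0, "B": 0, "add1": 1, "add2": 1, "out": 0}
back_start, back_buffering = schedule_with_offsets(
    back_nodes, back_edges, back_latency)
```

The forward-offset example needs one cycle of buffering and the backward-offset example needs none, in line with the figures. In the second case the adder's own pipeline register is what holds the previous value.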

[Figure: Input A feeds an adder both directly and through an Offset(1) wire; the adder drives Output]
a = a + stream.offset(a, +1)
a[i] = a[i] + a[i + 1]

[Figure: the same circuit annotated with schedule latencies 0, 1 and 2]
Scheduling produces a circuit with 1 buffer

Exercises
For the questions below, assume that the latency of an addition operation is 10 cycles, and a multiply takes 5 cycles, while inputs/outputs take 0 cycles.
1. Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph
2. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling to:
   a) c = ((a1 + a2) + a3) + a4
   b) c = (a1 + a2) + (a3 + a4)
3. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:
   a) c = ((a1 * a2) + (a3 * a4)) + a1
   b) c = stream.offset(a1, -10)*a2 + stream.offset(a1, -5)*a3 + stream.offset(a1, +15)*a4
   How many values of stream a1 will be buffered on-chip for (b)?
