Published by Herbert Mayers. Modified over 2 years ago.

1
Stream Scheduling
MaxAcademy Lecture Series – V1.0, September 2011

2
Overview
Latencies in stream computing
Scheduling algorithms
Stream offsets

3
Latencies in Stream Computing
Consider a simple arithmetic pipeline: (A + B) + C
Each operation has a latency
– Number of cycles from input to output
– May be zero
– Throughput is still 1 value per cycle; L values can be in flight in the pipeline
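
The distinction between latency and throughput can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the function name `pipelined_add` and the sample data are made up for illustration, and the adder latency of 3 cycles is an arbitrary choice.

```python
from collections import deque

def pipelined_add(xs, ys, latency):
    """Simulate an adder whose result appears `latency` cycles after its inputs.

    Returns (outputs, peak), where outputs is a list of (cycle, value) pairs
    and peak is the largest number of values in flight at the end of a cycle.
    """
    in_flight = deque()   # (cycle at which the result becomes visible, result)
    outputs = []
    peak = 0
    cycle = 0
    while len(outputs) < len(xs):
        # Issue one new pair of inputs per cycle while input data remains.
        if cycle < len(xs):
            in_flight.append((cycle + latency, xs[cycle] + ys[cycle]))
        # Retire every result that has completed its trip through the pipeline.
        while in_flight and in_flight[0][0] <= cycle:
            outputs.append((cycle, in_flight.popleft()[1]))
        peak = max(peak, len(in_flight))
        cycle += 1
    return outputs, peak

outputs, peak = pipelined_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], latency=3)
print(outputs)  # [(3, 11), (4, 22), (5, 33), (6, 44), (7, 55)]
print(peak)     # 3
```

The first result appears at cycle 3, but after that one result emerges every cycle, and at most 3 values (the latency L) are inside the adder at once.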

4
Basic hardware implementation
[Diagram: Input A and Input B feed the first adder; its result and Input C feed the second adder, which drives Output.]

5
Data propagates through the circuit in “lock step”
[Diagram: the two-adder circuit with the data values 1, 2 and 3 at its inputs.]

6
[Diagram: next animation frame; the values 1, 2 and 3 advance through the circuit.]

7
Data arrives at wrong time due to pipeline latency
[Diagram: the same circuit with the mismatched arrival marked X.]

8
Insert buffering to correct
[Diagram: the two-adder circuit with a buffer inserted.]

9
Now with buffering
[Diagram: the buffered circuit with the data values 1, 2 and 3 at its inputs.]

10
[Diagram: next animation frame; the values 1, 2 and 3 advance through the buffered circuit.]

11
[Diagram: animation frame showing the value 3 in the circuit.]

12
[Diagram: animation frame showing the value 3 in the circuit.]

13
[Diagram: animation frame showing the value 6 in the circuit.]

14
Success!
[Diagram: final animation frame showing the value 6 in the circuit.]
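
The effect shown in the animation can be reproduced in a few lines. The sketch below is an illustration under stated assumptions, not the slides' own model: each adder is assumed to have a latency of 1 cycle, the helper names `delay`, `add` and `valid` are hypothetical, and `None` stands for a cycle on which a wire carries no valid data.

```python
def delay(stream, cycles):
    """Model buffering or pipeline latency: every value appears `cycles` later."""
    return [None] * cycles + list(stream)

def add(xs, ys, latency=1):
    """Pipelined adder: adds whatever arrives on the same cycle.

    None marks a cycle with no valid data on one of the inputs.
    """
    n = max(len(xs), len(ys))
    xs = list(xs) + [None] * (n - len(xs))
    ys = list(ys) + [None] * (n - len(ys))
    sums = [None if x is None or y is None else x + y for x, y in zip(xs, ys)]
    return delay(sums, latency)

def valid(stream):
    """Drop the cycles that carry no valid data."""
    return [v for v in stream if v is not None]

a, b, c = [1, 10, 100], [2, 20, 200], [3, 30, 300]

# Without buffering, C reaches the second adder one cycle before A + B does,
# so each partial sum is paired with the wrong element of C.
unbuffered = valid(add(add(a, b), c))

# Delaying C by the latency of the first adder lines the operands up again.
buffered = valid(add(add(a, b), delay(c, 1)))

print(unbuffered)  # [33, 330]
print(buffered)    # [6, 60, 600]
```

Only the buffered version computes (A + B) + C element by element.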

15
Stream Scheduling Algorithms
A stream scheduling algorithm transforms an abstract dataflow graph into one that produces the correct results given the latencies of the operations
Can be automatically applied on a large dataflow graph (many thousands of nodes)
Can try to optimize for various metrics
– Latency from inputs to outputs
– Amount of buffering inserted (generally most interesting)
– Area (resource sharing)

16
ASAP
As Soon As Possible

17
Build up circuit incrementally, keeping track of latencies
[Diagram: Input A, Input B and Input C, each at latency 0.]

18
[Diagram: the first adder is added to the inputs (latency 0); latency label 1 at the adder.]

19
Input latencies are mismatched
[Diagram: the second adder is added; its inputs carry latency labels 1 and 0.]

20
Insert buffering
[Diagram: a buffer is inserted; latency labels 0, 0, 0 at the inputs, then 1, 1 and 2.]

21
[Diagram: Output is attached; latency labels 0, 0, 0 at the inputs, then 1, 1 and 2.]
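
The incremental procedure on these slides can be sketched as a single pass over the graph in topological order. This is one possible formulation, not the scheduler the lecture series describes internally: the dictionary layout, the function name `asap_schedule` and the adder latency of 1 cycle (read off the latency labels in the diagrams) are assumptions.

```python
def asap_schedule(graph):
    """ASAP scheduling sketch.

    graph maps node -> (latency, [predecessors]) and must list nodes in
    topological order (Python dicts preserve insertion order).
    Returns (start, buffering): the cycle at which each node consumes its
    inputs, and the number of buffer stages needed on each edge.
    """
    start, buffering = {}, {}
    for node, (_, preds) in graph.items():
        # Cycle at which each predecessor's result becomes available.
        ready = {p: start[p] + graph[p][0] for p in preds}
        # A node fires as soon as its slowest input has arrived;
        # nodes without predecessors (inputs) are pinned at cycle 0.
        start[node] = max(ready.values(), default=0)
        # Inputs that arrive early must be buffered until then.
        for p, t in ready.items():
            buffering[(p, node)] = start[node] - t
    return start, buffering

graph = {
    "A": (0, []), "B": (0, []), "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out": (0, ["add2"]),
}
start, buffering = asap_schedule(graph)
print(start["add2"], start["out"])  # 1 2
print(buffering[("C", "add2")])     # 1
```

For (A + B) + C the only buffering is one stage on the edge from Input C to the second adder, which matches the buffer inserted in the diagrams.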

22
ALAP
As Late As Possible

23
Start at output
[Diagram: Output at latency 0.]

24
Latencies are negative relative to end of circuit
[Diagram: an adder is attached to Output, which is at latency 0.]

25
[Diagram: a second adder and Input C are attached; latency labels -2 and 0.]

26
[Diagram: Input A and Input B are attached; latency labels -2 and 0.]

27
Buffering is saved
[Diagram: the complete circuit; latency labels -2 and 0.]
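
ALAP is the mirror image: pin the outputs at cycle 0 and walk the graph backwards. The sketch below makes the same assumptions as before (hypothetical dictionary layout and function name, adder latency of 1 cycle) and is meant only as an illustration of the idea.

```python
def alap_schedule(graph):
    """ALAP scheduling sketch.

    graph maps node -> (latency, [predecessors]) in topological order.
    Nodes without consumers (outputs) are pinned at cycle 0; every other
    node starts as late as its earliest consumer allows, so start times
    are zero or negative.
    """
    consumers = {n: [] for n in graph}
    for node, (_, preds) in graph.items():
        for p in preds:
            consumers[p].append(node)

    start = {}
    for node in reversed(list(graph)):      # reverse topological order
        latency = graph[node][0]
        if consumers[node]:
            start[node] = min(start[c] for c in consumers[node]) - latency
        else:
            start[node] = 0

    # Buffer stages per edge: gap between result availability and consumption.
    buffering = {(p, n): start[n] - (start[p] + graph[p][0])
                 for n, (_, preds) in graph.items() for p in preds}
    return start, buffering

graph = {
    "A": (0, []), "B": (0, []), "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out": (0, ["add2"]),
}
start, buffering = alap_schedule(graph)
print(start["A"], start["C"], start["out"])  # -2 -1 0
print(sum(buffering.values()))               # 0
```

Because Input C is simply started one cycle later than Inputs A and B, no buffer is needed on this graph.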

28
Sometimes this is suboptimal
What if we add an extra output?
[Diagram: the two-adder circuit with Output 1 and an added Output 2.]

29
Unnecessary buffering is added
Neither ASAP nor ALAP can schedule this design optimally
[Diagram: the two-output circuit; latency labels -2, 0 and 0.]

30
Optimal Scheduling
ASAP and ALAP both fix either inputs or outputs in place
More complex scheduling algorithms may be able to develop a better schedule, e.g. using ILP (integer linear programming)
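
To see why pinning inputs or outputs costs buffering, the two-output example can be solved exhaustively. The sketch below uses brute-force search as a stand-in for ILP (it is not an ILP solver), and it rests on assumptions that the transcript does not confirm: Output 2 is taken to tap the result of the first adder, each adder has a latency of 1 cycle, and all names are hypothetical.

```python
from itertools import product

# Assumed graph: node -> (latency, [predecessors]); Output 2 taps the first adder.
graph = {
    "A": (0, []), "B": (0, []), "C": (0, []),
    "add1": (1, ["A", "B"]),
    "add2": (1, ["add1", "C"]),
    "out1": (0, ["add2"]),
    "out2": (0, ["add1"]),
}

def total_buffering(graph, start):
    """Sum of buffer stages over all edges, or None if a value would be late."""
    total = 0
    for node, (_, preds) in graph.items():
        for p in preds:
            slack = start[node] - (start[p] + graph[p][0])
            if slack < 0:
                return None
            total += slack
    return total

def best_schedule(graph, pinned, horizon=3):
    """Try every start time in range(horizon) for the nodes that are not pinned."""
    free = [n for n in graph if n not in pinned]
    best = None
    for times in product(range(horizon), repeat=len(free)):
        start = {**pinned, **dict(zip(free, times))}
        cost = total_buffering(graph, start)
        if cost is not None and (best is None or cost < best[0]):
            best = (cost, start)
    return best

inputs_pinned = best_schedule(graph, {"A": 0, "B": 0, "C": 0})   # ASAP-style constraint
outputs_pinned = best_schedule(graph, {"out1": 2, "out2": 2})    # ALAP-style constraint
unconstrained = best_schedule(graph, {})

print(inputs_pinned[0], outputs_pinned[0], unconstrained[0])  # 1 1 0
```

With all inputs aligned, Input C must be buffered; with all outputs aligned, Output 2 must be buffered; letting both float removes the buffering entirely. Exhaustive search only works for tiny graphs, which is why the slide points to ILP for real designs.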

31
Buffering data on-chip
Consider:
a[i] = a[i] + (a[i - 1] + b[i - 1])
We can see that we might need some explicit buffering to hold more than one data element on-chip
We could do this explicitly, with buffering elements:
a = a + (buffer(a, 1) + buffer(b, 1))

32
The buffer has zero latency in the schedule
[Diagram: two-adder circuit with Input A, Input B, a Buffer(1) element and Output.]

33
This will schedule thus
Buffering = 3
[Diagram: the same circuit; latency labels 0, 0, 0, 0, 1, 1 and 2.]

34
Buffers and Latency
Accessing previous values with buffers is looking backwards in the stream
This is equivalent to having a wire with negative latency
– Cannot be implemented directly, but can affect the schedule

35
Offset wires can have negative latency
[Diagram: two-adder circuit with Input A, Input B, an Offset(-1) wire and Output; latency labels 0, 0, 0 and 1.]

36
This is scheduled
Buffering = 0
[Diagram: the same circuit with the Offset(-1) wire; latency labels 0, 0, 0 and 1.]

37
Stream Offsets
A stream offset is just a wire with a positive or negative latency
Negative latencies look backwards in the stream
Positive latencies look forwards in the stream
The entire dataflow graph will re-schedule to make sure the right data value is present when needed
Buffering could be placed anywhere, or pushed into inputs or outputs
– Better than manual instantiation
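
Treating an offset as extra (possibly negative) latency on a wire needs only a small change to an ASAP-style pass. The sketch below is an illustration under assumptions: adders have a latency of 1 cycle, the backward-looking example is modelled as a + offset(a + b, -1) with a single Offset(-1) wire, and the function and node names are hypothetical rather than MaxCompiler's.

```python
def schedule_with_offsets(graph):
    """ASAP-style pass where each edge carries a stream offset.

    graph maps node -> (latency, [(predecessor, offset), ...]) in
    topological order. An offset of k acts like k cycles of wire latency:
    negative looks backwards in the stream, positive looks forwards.
    Returns (start, total buffer stages).
    """
    start, total = {}, 0
    for node, (_, edges) in graph.items():
        # Effective arrival time of each operand, offset included.
        arrivals = [start[p] + graph[p][0] + k for p, k in edges]
        start[node] = max(arrivals, default=0)
        # Operands that arrive early wait in buffers.
        total += sum(start[node] - t for t in arrivals)
    return start, total

# a[i] + (a[i - 1] + b[i - 1]), written as a + offset(a + b, -1)
backward = {
    "A": (0, []), "B": (0, []),
    "add1": (1, [("A", 0), ("B", 0)]),
    "add2": (1, [("A", 0), ("add1", -1)]),
    "out": (0, [("add2", 0)]),
}

# a[i] + a[i + 1], written as a + offset(a, +1)
forward = {
    "A": (0, []),
    "add": (1, [("A", 0), ("A", 1)]),
    "out": (0, [("add", 0)]),
}

print(schedule_with_offsets(backward))  # buffering 0, out at cycle 1
print(schedule_with_offsets(forward))   # buffering 1, out at cycle 2
```

In the backward case the one-cycle latency of the first adder already supplies the delay that Offset(-1) asks for, so no buffer is needed. In the forward case the adder has to wait one cycle for a[i + 1], so a[i] is held in a single buffer.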

38
a = a + stream.offset(a, +1)
a[i] = a[i] + a[i + 1]
[Diagram: Input A at latency 0 feeds an adder both directly and through an Offset(1) wire; the adder drives Output.]

39
Scheduling produces a circuit with 1 buffer
[Diagram: the same circuit with the Offset(1) wire; latency labels 0, 1, 1 and 2.]

40
Exercises
For the questions below, assume that the latency of an addition operation is 10 cycles, and a multiply takes 5 cycles, while inputs/outputs take 0 cycles.
1. Write pseudo-code algorithms for ASAP and ALAP scheduling of a dataflow graph
2. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and draw the buffering introduced by ASAP scheduling to:
a) c = ((a1 + a2) + a3) + a4
b) c = (a1 + a2) + (a3 + a4)
3. Consider a MaxCompiler kernel with inputs a1, a2, a3, a4 and an output c. Draw the dataflow graph and write out the inequalities that must be satisfied to schedule:
a) c = ((a1 * a2) + (a3 * a4)) + a1
b) c = stream.offset(a1, -10) * a2 + stream.offset(a1, -5) * a3 + stream.offset(a1, +15) * a4
How many values of stream a1 will be buffered on-chip for (b)?
