
ACTORS: Adaptivity & Control of Resources in Embedded Systems What does it take to realize CAL networks efficiently in software? Carl von Platen, Ericsson.


1 ACTORS: Adaptivity & Control of Resources in Embedded Systems
What does it take to realize CAL networks efficiently in software?
Carl von Platen, Ericsson AB, Mobile Platforms
Artist2 - ACTORS workshop in Pisa, Italy, May 5, 2008

2 Our facts of life
Mobile-phone chipsets are manufactured in very long series
Cents of manufacturing cost matter
Power consumption (and heat dissipation) matters
…and performance, of course
Resource utilization is everything
Development costs are secondary
Carefully tuned low-level implementations

3 Current Practices are Challenged
Multiple processor cores
Vector instructions
Complex execution models
[Figure: processor pipeline block diagram showing instruction fetch, instruction decode, the integer ALU, multiply and load/store pipes, the 10-stage NEON pipeline, the L1 and L2 caches and the Embedded Trace Macrocell. Source: Presentation at ARM Developers’ Conference ‘05]
L1: ADD r3,r1,r4
    VMOV.I16 q2,#0x14
    ADD r6,r3,#1
    VLD1.8 {d2},[r12],r2
    SUB r8,r3,#1
    ADD r9,r3,#2
    VLD1.8 {d3},[r6]
    SUB r7,r3,#2
    ADD r3,r3,#3
    ADD r5,r5,#1
    VADDL.U8 q1,d2,d3
    VLD1.8 {d0},[r8]
    CMP r5,#8
    ADD r4,r4,r2
    VLD1.8 {d1},[r9]
    VMUL.I16 q1,q2,q1
    VLD1.8 {d7},[r7]
    VADDL.U8 q0,d0,d1
    VLD1.8 {d6},[r3]
    VMOV.I16 q2,#0x5
    VMUL.I16 q0,q2,q0
    VSUB.I16 q0,q1,q0
    VMOVL.U8 q1,d7
    VADD.I16 q1,q0,q1
    VMOVL.U8 q0,d6
    VADDL.S16 q2,d2,d0
    VADDL.S16 q1,d3,d1
    VRSHRN.I32 d0,q2,#5
    VRSHRN.I32 d1,q1,#5
    VCLT.S16 q1,q0,#0
    VMOVN.I16 d4,q1
    VMOV.I16 q1,#0xff
    VMIN.S16 q0,q0,q1
    VMOVN.I16 d1,q0
    VMOV.I16 d0,#0
    VBSL d4,d0,d1
    VST1.8 {d4},[r0],r2
    BLT L1

4 Why is vectorization hard?
for (i=0; i<64; i++)
    x[i] = x[i] + y[i];
for (i=0; i<64; i+=L)
    x[i:i+L-1] += y[i:i+L-1];
[Figure: the scalar loop adds x[0] and y[0], then x[1] and y[1], and so on; the vector loop adds the pair (x[0], x[1]) to (y[0], y[1]), then (x[2], x[3]) to (y[2], y[3])]
What if x = &y[1]?
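The aliasing question can be made concrete with a small sketch. It is not taken from the talk: the functions `scalar_add` and `vector_add`, the flat buffer `buf` and the offsets `x_off` and `y_off` are hypothetical names, and Python lists stand in for C arrays so that "x = &y[1]" becomes "x starts one element after y in the same buffer". The vector version loads a whole chunk of x and y before storing, which is what a SIMD add does, and it assumes `n` is a multiple of `lanes`.

```python
def scalar_add(buf, x_off, y_off, n):
    """Reference semantics: x[i] = x[i] + y[i], one element at a time."""
    for i in range(n):
        buf[x_off + i] = buf[x_off + i] + buf[y_off + i]
    return buf


def vector_add(buf, x_off, y_off, n, lanes):
    """Chunked version: load `lanes` elements of x and y, add, then store."""
    for i in range(0, n, lanes):
        xs = buf[x_off + i : x_off + i + lanes]  # vector load of x
        ys = buf[y_off + i : y_off + i + lanes]  # vector load of y
        buf[x_off + i : x_off + i + lanes] = [a + b for a, b in zip(xs, ys)]
    return buf


if __name__ == "__main__":
    # x = &y[1]: each scalar iteration reads the value the previous one wrote.
    print(scalar_add([1] * 6, 1, 0, 4))     # [1, 2, 3, 4, 5, 1]
    print(vector_add([1] * 6, 1, 0, 4, 2))  # [1, 2, 2, 3, 2, 1]
```

With disjoint arrays both versions agree; with the overlap they do not, so a compiler may only vectorize the loop once it has ruled the overlap out.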

5 Wouldn’t it be easier if the compiler decided where to put the results?

6 Yes.

7 Why is parallelization hard?
/*******************************/
/* Decode luma AC coefficients */
/*******************************/
for(BlockNr4x4 = 0 ; BlockNr4x4 < 4 ; BlockNr4x4++) {
    x = ((BlockNr & 0x1) << 1) + (BlockNr4x4 & 0x1);
    y = (BlockNr & 0x2) + ((BlockNr4x4 & 0x2) >> 1);
    Block = Ctemp + (BlockNr4x4<<4);
    nC = ComputeNCLuma(MbData_p, Sessiondata_p->MbDataLeft_p, Sessiondata_p->MbDataUp_p, y, x);
    nonZeroCoeffsLumaYX[y][x] = Residual(Sessiondata_p, Block, maxNumCoeffAC, nC);
}
/************************************/
/* Average interpolated half-pixels */
/************************************/
for(y=PartitionHeight; y!=0; y--) {
    Result_p = ReconstructionPtr_p;
    for(x=PartitionWidth; x!=0; x--) {
        P1 = CLIP((*hpixel1_p )>>5);
        P2 = CLIP((*hpixel2_p )>>5);
        *Result_p++ = (uint8)((P1+P2+1)>>1);
    }
    ReconstructionPtr_p += Width;
}
We just know there is no data dependence
The compiler has to prove it

8 [Figure: dataflow network of a video decoder; labels include parser, acdc, idct2d, motion, DDR, display, bitstream, btype, mv, tex and video]
Dependence is explicit
Local state is retained within actors

9 [Figure: dataflow network of a video decoder; labels include parser, acdc, idct2d, motion, DDR, display, bitstream, btype, mv, tex and video]
We could run all actors concurrently, couldn’t we..?

10 …and we’ve got lots of them

11 Isn’t this just function partitioning?
- We already do that in C with threads

12 Isn’t this just function partitioning?
We haven’t said anything about processor assignment or thread mapping
In fact, we do not need any concurrency (context switches, preemption etc.)
Purely an issue of load balancing!
Typical actors are too fine-grain to be mapped to a processor
We need to form larger-grain actors

13 Actors are too fine-grain…
[Figure: the video decoder dataflow network, with speech bubbles attached to the actors]
“Sorry… I’m out of them, hang on a sec.”
“Hey guys! Gimme some pixels and motion vectors. Pronto!”
“Yeah, yeah… No need to shout!”
“Oh sorry, here you go!”
“Are you crazy? How should I know where to look?”
“And finally, the pixel is served”
“Here, have all you can eat!”
“Would you mind passing me a pixel?”

14 Points made so far
Compilers have a hard time restructuring C programs
– Much of this boils down to dependence analysis
– C programs tend to be over-specified
Manual optimization is becoming harder
– Increasingly complex execution models
– Only realistic to fine-tune a tiny fraction of the code

15 Points made so far
CAL does not over-specify sequencing of the computations (true data dependence)
CAL says nothing at all about
– Buffers (size, location, layout, alignment etc.)
  The FIFOs could use other mechanisms…
– Mapping to threads/processors
Toolchain has many degrees of freedom
– Parallelization and vectorization appear practical
– Naive mapping actors→threads inefficient

16 Intro
SDF
DDF
Efficient CAL s/w Realization

17 Synchronous Dataflow (SDF) [Lee87]
Actors consume and produce a fixed number of tokens in each firing
Expressiveness is sacrificed
Allows for extensive compile-time analysis
– Static scheduling (no risk of deadlock)
– Static allocation of buffers (no unbounded buffering)
– Possible to reason about performance metrics
Possible to generate tight code

18 SDF Example
[Figure: SDF graph with actors A, B, C, D, E and F; surviving edge labels: 1, 1, 1]
Token rates are shown at input/output ports
There may be sources and sinks
There may be cycles, but initial tokens (delays) are required to avoid deadlock
The “dots” are duplicators

19 Finding a static schedule
[Figure: SDF graph with actors A, B, C, D, E and F]
Non-terminating execution normally assumed
We want to repeat the schedule indefinitely
Two requirements on such a schedule:
– Balanced token production/consumption (consistency)
– No deadlock (sufficient delay on cycles)

20 Balance equations
[Figure: SDF graph with actors A, B, C, D, E and F]
[Equation: a matrix with one row per channel, (E,F), (E,D), (D,E), (C,E), (B,D), (B,C), (A,B), and one column per actor, F, E, D, C, B, A, multiplied by the repetitions vector $(r_F, r_E, r_D, r_C, r_B, r_A)^T$, equals the zero vector]

21 [Figure: SDF graph with actors A, B, C, D, E and F]
Repetitions vector: r_F = 2n, r_E = 2n, r_D = 1n, r_C = 10n, r_B = 5n, r_A = 10n, for any positive integer n
[Figure: precedence graph over the individual firings A1, A2, …, B1, …, B5, C1, C2, …, D, E1, E2, F1, F2]
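A minimal sketch of how such a repetitions vector can be computed from the balance equations. The token rates in `EDGES` are assumed values, chosen only so that the result matches the vector on the slide; the function name `repetitions_vector` is likewise hypothetical. For every channel the balance equation is r_src × produced = r_dst × consumed, and the sketch propagates rational firing ratios across the graph before scaling to the smallest integers (the case n = 1).

```python
from fractions import Fraction
from functools import reduce
from math import gcd

# (producer, consumer, tokens produced per firing, tokens consumed per firing)
# Assumed rates, picked to be consistent with the slide's repetitions vector.
EDGES = [
    ("A", "B", 1, 2),
    ("B", "C", 2, 1),
    ("B", "D", 1, 5),
    ("C", "E", 1, 5),
    ("D", "E", 2, 1),
    ("E", "D", 1, 2),
    ("E", "F", 1, 1),
]


def repetitions_vector(edges):
    """Solve r_src * produced == r_dst * consumed for the smallest integer r."""
    actors = sorted({a for e in edges for a in e[:2]})
    rate = {actors[0]: Fraction(1)}  # fix one actor, derive the others
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, dst, prod, cons = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * prod / cons
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * cons / prod
            elif src in rate and dst in rate:
                if rate[src] * prod != rate[dst] * cons:
                    raise ValueError("inconsistent rates on (%s,%s)" % (src, dst))
            else:
                continue  # neither end known yet, retry in a later pass
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the rational solution to integers, then divide out the common factor.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    ints = {a: int(r * scale) for a, r in rate.items()}
    common = reduce(gcd, ints.values())
    return {a: v // common for a, v in ints.items()}


if __name__ == "__main__":
    print(repetitions_vector(EDGES))
```

A graph whose balance equations have only the all-zero solution is rejected with `ValueError`; such a graph has no schedule that can repeat indefinitely with bounded buffers.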

22 Constructing a schedule
Any topological ordering of the precedence graph is a valid schedule
– Fire as soon as enabled
  A A B C C A A B C C A A B C C E F A A B C C A A B C C D E F
– Minimize buffers
– Minimize appearances (in looped schedule)
  (A^2 B C^2)^5 E D E F^2
– Other criteria…
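The two schedules on this slide can be checked by simulating token counts. The sketch below is an illustration under assumptions: the rates and initial tokens (delays) in `EDGES` are hypothetical values consistent with the repetitions vector shown earlier, and `run_schedule` is an invented helper. It verifies that every firing is enabled, that the schedule returns all channels to their initial state, and it records the peak number of tokens per channel.

```python
# (producer, consumer): (produced per firing, consumed per firing, initial tokens)
# All numbers are assumed for illustration.
EDGES = {
    ("A", "B"): (1, 2, 0),
    ("B", "C"): (2, 1, 0),
    ("B", "D"): (1, 5, 0),
    ("C", "E"): (1, 5, 0),
    ("D", "E"): (2, 1, 1),  # delay on the D/E cycle
    ("E", "D"): (1, 2, 1),  # delay on the D/E cycle
    ("E", "F"): (1, 1, 0),
}

ASAP = "A A B C C A A B C C A A B C C E F A A B C C A A B C C D E F".split()
LOOPED = ["A", "A", "B", "C", "C"] * 5 + ["E", "D", "E", "F", "F"]


def initial_tokens(edges):
    return {ch: delay for ch, (_, _, delay) in edges.items()}


def run_schedule(edges, schedule):
    """Fire actors in order; return (final token counts, peak token counts)."""
    tokens = initial_tokens(edges)
    peak = dict(tokens)
    for actor in schedule:
        inputs = [ch for ch in edges if ch[1] == actor]
        outputs = [ch for ch in edges if ch[0] == actor]
        for ch in inputs:
            if tokens[ch] < edges[ch][1]:
                raise RuntimeError("%s is not enabled: too few tokens on %s" % (actor, ch))
        for ch in inputs:
            tokens[ch] -= edges[ch][1]
        for ch in outputs:
            tokens[ch] += edges[ch][0]
            peak[ch] = max(peak[ch], tokens[ch])
    return tokens, peak


if __name__ == "__main__":
    for name, schedule in (("fire as soon as enabled", ASAP), ("looped", LOOPED)):
        final, peak = run_schedule(EDGES, schedule)
        print(name, "periodic:", final == initial_tokens(EDGES),
              "peak on (C,E):", peak[("C", "E")])
```

Under these assumed rates both schedules are periodic, but the looped schedule lets more tokens accumulate on the (C,E) channel, which illustrates the trade-off between few appearances and small buffers.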

23 Code synthesis
A A B C C A A B C C A A B C C E F A A B C C A A B C C D E F
A B C A C A B C A C E F A … A
(A^2 B C^2)^5 E D E F^2
for i=1,2,…,5 A for j=1,2 B C E D E F
Precedence graph also good starting point for multi-processor scheduling algorithms
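One way to picture the step from a looped schedule to code is sketched below. The nested representation, the helpers `flatten` and `emit`, and the generated `fire_X()` calls are all assumptions made for illustration, not the tooling described in the talk. A repeated term is written as a `(count, body)` tuple, a sequence as a list, and an actor as a string.

```python
# (A^2 B C^2)^5 E D E F^2 as nested (count, body) terms
LOOPED = [(5, [(2, "A"), "B", (2, "C")]), "E", "D", "E", (2, "F")]


def flatten(term):
    """Expand a looped schedule into the flat firing sequence it denotes."""
    if isinstance(term, str):
        return [term]
    if isinstance(term, tuple):
        count, body = term
        return flatten(body) * count
    return [actor for sub in term for actor in flatten(sub)]


def emit(term, indent=0):
    """Emit C-like text: one loop per repeated term, one call per appearance."""
    pad = "    " * indent
    if isinstance(term, str):
        return [pad + "fire_%s();" % term]
    if isinstance(term, tuple):
        count, body = term
        lines = [pad + "for (i%d = 0; i%d < %d; i%d++) {" % (indent, indent, count, indent)]
        lines += emit(body, indent + 1)
        lines.append(pad + "}")
        return lines
    return [line for sub in term for line in emit(sub, indent)]


if __name__ == "__main__":
    print(" ".join(flatten(LOOPED)))
    print("\n".join(emit(LOOPED)))
```

The emitted text contains one call site per actor appearance in the looped schedule, which is why minimizing appearances keeps the synthesized code small.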

24 Limitations of the SDF model
Fixed token rates ≈ one CAL action only
In SDF all tokens must be consumed and produced in a single firing
SDF can’t handle conditional actors
– Fixed iteration supported by SDF
– Data-dependent iteration is not
Delays required on feedback loops
– CAL actors can use state variables
– Avoid reading tokens from loop until tokens produced (e.g. initialization phase)
[Figure: actors A and B, illustrating fixed iteration]

25 Conditional actors
Conditional actors are not SDF
SDF + Conditions = Boolean Dataflow
– Turing complete language
– Interesting properties no longer decidable
switch (inputs in, ctrl; outputs T, F):
action In:[x],Ctrl:[true] ==> T:[x];
action In:[x],Ctrl:[false] ==> F:[x];
select (inputs T, F, ctrl; output out):
action T:[x],Ctrl:[true] ==> Out:[x];
action F:[x],Ctrl:[false] ==> Out:[x];
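The behaviour of the two actors can be sketched with FIFO queues. This is an illustration only: the functions `switch` and `select` mirror the actions above, but the queue-based calling convention (one call attempts one firing and reports whether it fired) is an assumption of this sketch.

```python
from collections import deque


def switch(inp, ctrl, t_out, f_out):
    """Route one token from `inp` to T or F, as chosen by one control token."""
    if not inp or not ctrl:
        return False  # not enabled
    x, c = inp.popleft(), ctrl.popleft()
    (t_out if c else f_out).append(x)
    return True


def select(t_in, f_in, ctrl, out):
    """Forward one token from T or F, as chosen by one control token."""
    if not ctrl:
        return False
    src = t_in if ctrl[0] else f_in  # which input is read depends on the value
    if not src:
        return False
    ctrl.popleft()
    out.append(src.popleft())
    return True


if __name__ == "__main__":
    conds = [True, False, False, True]
    t, f, out = deque(), deque(), deque()
    data, c1, c2 = deque([1, 2, 3, 4]), deque(conds), deque(conds)
    while switch(data, c1, t, f):
        pass
    print(list(t), list(f))  # tokens split by control value
    while select(t, f, c2, out):
        pass
    print(list(out))         # original order restored
```

The number of tokens produced on T and F (or consumed from them) depends on the control values, so neither actor has the fixed rates that SDF requires.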

26 “Well-behaved” dataflow [Gao92]
Restricted use of conditional actors
“Conditional schema”: a switch routes in to X (T) or Y (F), and a select merges the results into out
if (cond) then out := X(in); else out := Y(in);
“Loop schema”: a select and a switch around B, controlled by C (initial control token false)
x := in; while (C(x)) do x := B(x); end; out := x;
These “clusters” of actors are SDF

27 Cyclo-Static dataflow [Bilsen96]
Actors have periodic token rates
[Figure: actor with rates in (0,1,1,1,1,1,1,1,1), mode (2,0,0,0,0,0,0,0,0) and out (1,1,1,1,1,1,1,1,0)]
This actor has period 9
Each “phase” within the period has fixed rates
8 delays required on feedback loop
Allows more flexible scheduling
– Avoids excessive buffer sizes
– Models dataflow that would deadlock in SDF

28 Next: Dynamic Dataflow (DDF)
[Figure: SDF, boolean dataflow and dynamic dataflow shown within the “universe” of CAL programs]

29 Dynamic dataflow (DDF) [Lee95]
A determinate model of computation
– outputs depend only on past inputs
Can be implemented using blocking reads from FIFO channels
– infinite capacity and non-blocking writes assumed
May have several firing rules (≈CAL actions)
– conditions on token availability and values (≈guards)
Mapping from input to output functional
– but state variables can be thought of as feedback

30 Sequential firing rules
Firing rules are evaluated (as if) using blocking reads
NDMerge (inputs x, y; output out):
action X:[x] ==> Out:[x];
action Y:[y] ==> Out:[y];
Reading either X or Y may block, although the other rule may fire. Not a DDF actor!
FairMerge (inputs x, y, state; outputs out, state’):
action X:[x], State:[0] ==> Out:[x], State’:[1];
action Y:[y], State:[1] ==> Out:[y], State’:[0];
[Figure: FairMerge with state’ fed back to state (initial token 0); decision tree: read State, then read X if State = 0 or read Y if State = 1]
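The difference between the two merges can be sketched by feeding the same two input streams with different arrival timings. The helper `run_merge` and its event-list interface are hypothetical; the `fair` branch follows the FairMerge actions (the state decides which port is read next), while the other branch models NDMerge by taking whatever token happens to be available.

```python
from collections import deque


def run_merge(arrivals, fair):
    """Replay token arrivals (port, value) and fire the merge greedily."""
    x, y, out = deque(), deque(), []
    state = 0  # FairMerge state: 0 means read X next, 1 means read Y next
    for port, value in arrivals:
        (x if port == "X" else y).append(value)
        while True:
            if fair:
                src = x if state == 0 else y
                if not src:
                    break  # a blocking read would wait here
                out.append(src.popleft())
                state = 1 - state
            elif x:
                out.append(x.popleft())
            elif y:
                out.append(y.popleft())
            else:
                break
    return out


if __name__ == "__main__":
    # Same streams (X: 1, 2 and Y: 10, 20), two different interleavings.
    early_x = [("X", 1), ("X", 2), ("Y", 10), ("Y", 20)]
    early_y = [("Y", 10), ("X", 1), ("Y", 20), ("X", 2)]
    print(run_merge(early_x, fair=True), run_merge(early_y, fair=True))
    print(run_merge(early_x, fair=False), run_merge(early_y, fair=False))
```

FairMerge produces the same output for both timings, which is what determinacy means here; the NDMerge output depends on arrival order, so it cannot be expressed with blocking reads alone.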

31 Scheduling DDF
Can’t be scheduled statically in general
– Absence of deadlock undecidable
– Buffer bounds undecidable
Dynamic (run-time) scheduling
Deadlock is a property of a dataflow graph
– Unaffected by execution order
Boundedness of buffers is not (necessarily)
– Unfortunate order → unbounded buffers

32 Avoiding unbounded buffering
Limiting channel capacities
– Bound can generally not be determined
– Setting too low a capacity leads to deadlock
Purely data-driven or demand-driven policy
[Figure: graph with actors A, B, C and D]
– Data-driven: source B is always enabled, but does C consume tokens on the (B,C) channel at the same rate?
– Demand-driven: sink D always demands tokens, but are tokens on the (B,C) channel consumed at the same rate?
More clever regulation needed!

33 Bounded Scheduling [Parks95]
Start with (arbitrarily) bounded buffers
Block on write to full buffer
Use a simple basic scheduling algorithm
– data-driven and demand-driven both work OK
Grow smallest buffer on “artificial” deadlock
Deadlock-free graphs execute indefinitely, with bounded buffers when possible
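The steps above can be sketched as a small simulation. This is a simplified illustration, not the algorithm from [Parks95] as published: actors here have fixed token rates (real DDF actors do not), the scheduler is a plain data-driven loop, and a buffer grows by one token at a time. The function `bounded_schedule` and its parameters are hypothetical names.

```python
def bounded_schedule(edges, delays, n_firings, capacity=1):
    """edges: {(src, dst): (produced, consumed)}; returns (firings, capacities)."""
    tokens = {ch: delays.get(ch, 0) for ch in edges}
    cap = {ch: max(capacity, tokens[ch]) for ch in edges}
    actors = sorted({a for ch in edges for a in ch})

    def has_input(a):
        return all(tokens[ch] >= c for ch, (_, c) in edges.items() if ch[1] == a)

    def has_room(a):
        return all(tokens[ch] + p <= cap[ch] for ch, (p, _) in edges.items() if ch[0] == a)

    fired = []
    while len(fired) < n_firings:
        ready = [a for a in actors if has_input(a) and has_room(a)]
        if ready:
            a = ready[0]
            for ch, (p, c) in edges.items():
                if ch[1] == a:
                    tokens[ch] -= c
                if ch[0] == a:
                    tokens[ch] += p
            fired.append(a)
            continue
        # Nothing can fire. Channels that block a writer whose inputs are
        # available indicate an artificial deadlock caused by the bounds.
        blocked = [ch for ch, (p, _) in edges.items()
                   if has_input(ch[0]) and tokens[ch] + p > cap[ch]]
        if not blocked:
            raise RuntimeError("true deadlock: no actor has enough input tokens")
        smallest = min(blocked, key=lambda ch: cap[ch])
        cap[smallest] += 1  # grow the smallest full buffer and retry
    return fired, cap


if __name__ == "__main__":
    # A produces 2 tokens per firing, B consumes 3; start with capacity 1.
    print(bounded_schedule({("A", "B"): (2, 3)}, {}, 5))
```

In this two-actor example the capacity stops growing once the channel can hold 4 tokens, after which the graph runs without further artificial deadlocks. A cycle without initial tokens is reported as a true deadlock instead, since enlarging buffers cannot help there.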

34 Hybrid static/dynamic scheduling
Schedule statically when possible
Use run-time techniques when necessary
Practical to identify statically schedulable clusters of actors in a CAL network?
– We believe so and intend to explore this option within ACTORS
– Novel analysis techniques required
– “SDF actors” and “switch/select” likely to be rare
– CAL doesn’t provide notation for cyclo-static actors
The actor clusters useful building blocks of a “fully dynamic” multi-processor schedule?
– This is our working assumption

35 An example from the MPEG4 SP decoder
[Figure: Interpolate actor with ports halfPel, RD and MOT; its schedule automaton has states q0 and q1 and transitions start, row_col_0, other and done]
start: action halfpel:[f] ==>
done: action ==> guard y = 9;
row_col_0: action RD:[d] ==> guard (x=0) or (y=0);
other: action RD:[d] ==> MOT:[p]
priority done > row_col_0 > other;

36 An example from the MPEG4 SP decoder
[Figure: Interpolate actor (ports halfPel, RD, MOT) and its schedule automaton (states q0, q1)]
Decision diagram of action firings (≈firing rules):
state=q0: start
state=q1: done if y=9; row_col_0 if y≠9 and (y=0 or x=0); other if y≠9, y≠0 and x≠0

37 An example from the MPEG4 SP decoder
[Figure: the same decision diagram (state=q0: start; state=q1: done if y=9, row_col_0 if y=0 or x=0, other if y≠9, y≠0 and x≠0), annotated with the fsa-state transitions]

38 An example from the MPEG4 SP decoder
[Figure: the same decision diagram, annotated with state-variable updates]
start: x := 0; y := 0;
row_col_0 and other: if x<8 then x := x+1; otherwise x := 0; y := y+1;

39 An example from the MPEG4 SP decoder
[Figure: control-flow graph over the actions start, row_col_0, other and done, with the state-variable updates (x := x+1 if x<8; x := 0; y := y+1 if x≥8) and the branch conditions y=9, y≠9, y=0, y≠0]
We have constructed the control-flow graph
Standard program analysis techniques (e.g. loop analyses) apply

40 Cyclo-static behavior
[Figure: grid of action firings for x=0…8 and y=0…9: start, then row_col_0 wherever x=0 or y=0, other for the remaining positions, and done at y=9]
Cyclo-static period, N=64
1 input on halfPel, 81 on RD
64 output tokens on port MOT
Interpolate token rates:
halfPel (1,0,…,0)
RD (11,1,1,1,1,1,1,1, 2,1,1,1,1,1,1,1, … 2,1,1,1,1,1,1,1)
MOT (1,1,…,1)
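The token counts on this slide can be reproduced by stepping through the actor's actions. The sketch below is an illustration built from the guards, priorities and state-variable updates shown on the preceding slides; the functions `simulate_interpolate` and `rd_rates` are hypothetical, and grouping the RD reads into one phase per MOT output is an assumption about how the 64 phases are delimited.

```python
def simulate_interpolate():
    """Return the sequence of actions fired in one pass from start to done."""
    log = []
    state, x, y = "q0", 0, 0
    while True:
        if state == "q0":
            log.append("start")  # consumes one halfPel token
            x, y = 0, 0
            state = "q1"
            continue
        # state q1, priority: done > row_col_0 > other
        if y == 9:
            log.append("done")
            return log
        log.append("row_col_0" if (x == 0 or y == 0) else "other")  # one RD token
        if x < 8:
            x += 1
        else:
            x, y = 0, y + 1


def rd_rates(log):
    """RD tokens consumed per phase, a phase ending with each MOT output."""
    rates, pending = [], 0
    for action in log:
        if action in ("row_col_0", "other"):
            pending += 1
        if action == "other":  # `other` is the only action that writes MOT
            rates.append(pending)
            pending = 0
    return rates


if __name__ == "__main__":
    log = simulate_interpolate()
    print(log.count("start"), log.count("row_col_0") + log.count("other"), log.count("other"))
    print(rd_rates(log)[:16])
```

The simulation fires `row_col_0` 17 times and `other` 64 times, which gives the 81 reads on RD and 64 writes on MOT, and the per-phase RD counts start with 11 followed by rows that begin with 2.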

41 Vectorization [Ritz93]
Aggregation of multiple firings
Limited by feedback loops
[Figure: the Interpolate actor with rates halfPel (1,0,…,0), RD (11,1,1,1,1,1,1,1, 2,1,1,1,1,1,1,1, … 2,1,1,1,1,1,1,1) and MOT (1,1,…,1), next to a vectorized Interpolate marked “?”]

42 Data-dependent behavior
[Figure: Add actor with ports VID, MOT, BTYPE and TEX; from state q0 it reads BTYPE and branches to newVop, texture only, motion only or combine (states q1 to q4)]

43 Data-dependent behavior
[Figure: Add actor with ports VID, MOT, BTYPE and TEX; after reading BTYPE it branches to newVop, texture only, motion only or combine]
Sometimes distinct “operational modes” with static behavior are identifiable

44 Data-dependent behavior
Sometimes distinct “operational modes” with static behavior are identifiable
Token rates of the Add actor in each mode:
“new VOP” (N=1): BTYPE 3
“texture only” (N=64): VID (1,1,…,1), BTYPE (1,0,…,0), TEX (1,1,…,1)
“motion only” (N=64): VID (1,1,…,1), MOT (1,1,…,1), BTYPE (1,0,…,0)
“combine” (N=64): VID (1,1,…,1), MOT (1,1,…,1), BTYPE (1,0,…,0), TEX (1,1,…,1)

45 Clustering
Integration of adjacent actors
[Figure: the Interpolate actor (ports halfPel, RD) and the Add actor (ports VID, BTYPE, TEX) merged into one cluster]
The cluster inherits its “operational modes” and cyclo-static behavior from the original actors
The Interpolate actor will be fired in two of the modes only: “motion only” and “combine”

46 Proposed Tools Infrastructure
[Figure: tool infrastructure; labels: CAL network/model, CAL Actors, opendf Compiler (XLIM), WP1 Model Compiler, model-level analysis and annotations, larger-grain actors (CAL), Cal2C, Cal2HDL, WP2 ARM Compiler]
Source-to-source transformation allows existing tools to benefit from model compilation

47 Summary
Dataflow programming and CAL offer a promising alternative to current practices
– Making better use of parallelism
Naive mapping actors → threads won’t do the trick
We need larger-grain actors
– Reduced overhead of run-time scheduling
There is an extensive body of work on efficient realization of SDF (+extensions)
CAL requires additional, novel techniques
– Some initial ideas were presented today

48 Some references
[Lee87] E. A. Lee and D. G. Messerschmitt, “Static scheduling of synchronous dataflow programs for digital signal processing,” IEEE Trans. Comput., vol. 36, no. 1, pp. 24–35.
[Lee95] E. A. Lee and T. M. Parks, “Dataflow process networks,” Proceedings of the IEEE, vol. 83, no. 5, pp. 773–801.
[Gao92] G. R. Gao, R. Govindarajan, and P. Panangaden, “Well-behaved dataflow programs for dsp computation,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-92, vol. 5, pp. 561–564, IEEE, March.
[Bilsen96] G. Bilsen, M. Engels, R. Lauwereins, and J. Peperstraete, “Cyclo-static dataflow,” IEEE Trans. Signal Processing, vol. 44, no. 2, pp. 397–408.
[Parks95] T. M. Parks, Bounded Scheduling of Process Networks. PhD thesis, EECS Department, University of California, Berkeley.
[Bhat95] S. Bhattacharyya, P. Murthy, and E. Lee, “Optimal parenthesization of lexical orderings for dsp block diagrams,” in Proceedings of the International Workshop on VLSI Signal Processing, pp. 177–186, October.
[Ritz93] S. Ritz, M. Pankert, V. Živojnović, and H. Meyr, “Optimum vectorization of scalable synchronous dataflow graphs,” in Intl. Conf. on Application-Specific Array Processors, pp. 285–296, Prentice Hall, IEEE Computer Society, 1993.

