What does it take to realize CAL networks efficiently in software?


1 What does it take to realize CAL networks efficiently in software?
Carl von Platen, Ericsson AB, Mobile Platforms. ARTIST2 / ACTORS workshop, Pisa, Italy, May 5, 2008

2 Our facts of life
- Mobile-phone chipsets are manufactured in very long series
- Cents of manufacturing cost matter
- Power consumption (and heat dissipation) matters
- …and performance, of course
- Resource utilization is everything
- Development costs are secondary
- Carefully tuned low-level implementations

3 Current Practices are Challenged
- Multiple processor cores
- Vector instructions
- Complex execution models

Example: hand-vectorized ARM NEON assembly (inner loop of an interpolation kernel):

L1:  ADD r3,r1,r4        VMOV.I16 q2,#0x14   ADD r6,r3,#1        VLD {d2},[r12],r2
     SUB r8,r3,#1        ADD r9,r3,#2        VLD {d3},[r6]       SUB r7,r3,#2
     ADD r3,r3,#3        ADD r5,r5,#1        VADDL.U8 q1,d2,d3   VLD {d0},[r8]
     CMP r5,#8           ADD r4,r4,r2        VLD {d1},[r9]       VMUL.I16 q1,q2,q1
     VLD {d7},[r7]       VADDL.U8 q0,d0,d1   VLD {d6},[r3]       VMOV.I16 q2,#0x5
     VMUL.I16 q0,q2,q0   VSUB.I16 q0,q1,q0   VMOVL.U8 q1,d7      VADD.I16 q1,q0,q1
     VMOVL.U8 q0,d6      VADDL.S16 q2,d2,d0  VADDL.S16 q1,d3,d1  VRSHRN.I32 d0,q2,#5
     VRSHRN.I32 d1,q1,#5 VCLT.S16 q1,q0,#0   VMOVN.I16 d4,q1     VMOV.I16 q1,#0xff
     VMIN.S16 q0,q0,q1   VMOVN.I16 d1,q0     VMOV.I16 d0,#0      VBSL d4,d0,d1
     VST {d4},[r0],r2    BLT L1

[Block diagram of the ARM Cortex-A8 integer and NEON pipelines: instruction fetch, decode, integer execute/load-store, NEON pipeline stages, L1/L2 memory system. Source: presentation at ARM Developers' Conference '05]

4 Why is vectorization hard?
for (i = 0; i < 64; i++)
    x[i] = x[i] + y[i];

Vectorized with vector length L (array notation):

for (i = 0; i < 64; i += L)
    x[i:i+L-1] += y[i:i+L-1];

[Diagram: element-wise addition of x and y, one element at a time vs. L elements at a time]

What if x = &y[1]?
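The transformation is only legal if x and y do not overlap, and the compiler generally cannot prove that from the pointer declarations alone. A minimal sketch (not from the slides) of how C99 restrict lets the programmer assert non-aliasing, after which auto-vectorization becomes legal:

#include <stddef.h>

/* May alias: if x == &y[1], each iteration depends on the previous one,
 * so the compiler must keep the scalar order. */
void add_may_alias(short *x, const short *y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}

/* restrict promises that the arrays do not overlap; the iterations are
 * now independent and the compiler is free to vectorize the loop. */
void add_no_alias(short *restrict x, const short *restrict y, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] = x[i] + y[i];
}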

5 Wouldn’t it be easier if the compiler decided where to put the results?

6 Yes.

7 Why is parallelization hard?
/* Decode luma AC coefficients */
for (BlockNr4x4 = 0; BlockNr4x4 < 4; BlockNr4x4++) {
    x = ((BlockNr & 0x1) << 1) + (BlockNr4x4 & 0x1);
    y = (BlockNr & 0x2) | ((BlockNr4x4 & 0x2) >> 1);
    Block = Ctemp + (BlockNr4x4 << 4);
    nC = ComputeNCLuma(MbData_p, Sessiondata_p->MbDataLeft_p,
                       Sessiondata_p->MbDataUp_p, y, x);
    nonZeroCoeffsLumaYX[y][x] =
        Residual(Sessiondata_p, Block, maxNumCoeffAC, nC);
}

/* Average interpolated half-pixels */
for (y = PartitionHeight; y != 0; y--) {
    Result_p = ReconstructionPtr_p;
    for (x = PartitionWidth; x != 0; x--) {
        P1 = CLIP((*hpixel1_p++) >> 5);
        P2 = CLIP((*hpixel2_p++) >> 5);
        *Result_p++ = (uint8)((P1 + P2 + 1) >> 1);
    }
    ReconstructionPtr_p += Width;
}

Can these loops run in parallel? We just know there is no data dependence; the compiler has to prove it.
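The second loop is a case in point: its iterations are independent, but only the programmer knows that. A minimal sketch (hypothetical function and CLIP macro, not the decoder code above) of stating that knowledge explicitly with an OpenMP pragma, which a C compiler cannot infer on its own:

#include <stdint.h>

#define CLIP(v) ((v) < 0 ? 0 : ((v) > 255 ? 255 : (v)))

/* Average two half-pixel planes. The rows are independent, but only
 * because we know dst, hp1 and hp2 do not overlap; the pragma states
 * what the compiler cannot prove (compile with -fopenmp). */
void average_halfpel(uint8_t *dst, const int16_t *hp1, const int16_t *hp2,
                     int width, int height, int stride)
{
    #pragma omp parallel for
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            int p1 = CLIP(hp1[y * width + x] >> 5);
            int p2 = CLIP(hp2[y * width + x] >> 5);
            dst[y * stride + x] = (uint8_t)((p1 + p2 + 1) >> 1);
        }
    }
}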

8 Dependence is explicit
[Dataflow network of an MPEG-4 video decoder: parser, texture decoding (acdc, idct2d), motion compensation and display actors connected by FIFO channels, with DDR memory accessed through read/write address and data ports]
Local state is retained within actors.

9 We could run all actors concurrently, couldn’t we..?
[The same decoder network, with every actor a candidate for concurrent execution]

10 …and we’ve got lots of them

11 Isn’t this just function partitioning
Isn’t this just function partitioning? - We already do that in C with threads

12 Isn’t this just function partitioning?
We haven’t said anything about processor assignment nor thread mapping In fact, we do no need any concurrency (context switches, preemtion etc.) Purely an issue of load balancing! Typical actors are too fine-grain to be mapped to a processor We need to form larger-grain actors

13 Actors are too fine-grain…
[Cartoon: the decoder actors pass single pixels and motion vectors back and forth ("Hey guys! Gimme some pixels and motion vectors. Pronto!", "Would you mind passing me a pixel?", "Sorry… I'm out of them, hang on a sec.", "Here, have all you can eat!" says the DDR, "And finally, the pixel is served"), caricaturing the overhead of very fine-grained, token-by-token actor communication]

14 Points made so far
- Compilers have a hard time restructuring C programs
  - Much of this boils down to dependence analysis
  - C programs tend to be over-specified
- Manual optimization is becoming harder
  - Increasingly complex execution models
  - Only realistic to fine-tune a tiny fraction of the code

15 Points made so far
- CAL does not over-specify sequencing of the computations (true data dependence)
- CAL says nothing at all about
  - Buffers (size, location, layout, alignment, etc.); the FIFOs could use other mechanisms…
  - Mapping to threads/processors
- The toolchain has many degrees of freedom
  - Parallelization and vectorization appear practical
  - A naive mapping actors→threads is inefficient

16 Efficient CAL s/w Realization
Outline: Intro, SDF, DDF

17 Synchronous Dataflow (SDF) [Lee87]
- Actors consume and produce a fixed number of tokens in each firing (see the sketch below)
- Expressiveness is sacrificed
- Allows for extensive compile-time analysis
  - Static scheduling (no risk of deadlock)
  - Static allocation of buffers (no unbounded buffering)
  - Possible to reason about performance metrics
  - Possible to generate tight code
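As an illustration, an SDF actor is fully described by fixed port rates plus a firing function; this hypothetical C representation (not a specific tool's data structure) is what makes the compile-time analyses above possible:

#include <stddef.h>

/* Fixed token rates: known at compile time, independent of data values. */
typedef struct {
    const int *in_rates;    /* tokens consumed per input port per firing  */
    const int *out_rates;   /* tokens produced per output port per firing */
    size_t n_in, n_out;
    /* One firing: reads exactly in_rates[i] tokens from each input buffer
     * and writes exactly out_rates[j] tokens to each output buffer. */
    void (*fire)(int *const *in, int *const *out);
} sdf_actor_t;

/* Example: a 2-input adder, single-rate on all ports (1 token in, 1 out). */
static void add_fire(int *const *in, int *const *out)
{
    out[0][0] = in[0][0] + in[1][0];
}

static const int add_in_rates[]  = { 1, 1 };
static const int add_out_rates[] = { 1 };
static const sdf_actor_t add_actor = {
    add_in_rates, add_out_rates, 2, 1, add_fire
};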

18 SDF Example
[SDF graph with actors A through F; production/consumption rates per edge (consistent with the balance equations on the following slides): A→B 1:2, B→C 2:1, B→D 1:5, C→E 1:5, D→E 2:1, E→D 1:2, E→F 1:1]
- Token rates are shown at input/output ports
- There may be sources and sinks
- There may be cycles, but initial tokens (delays) are required to avoid deadlock
- The "dots" are duplicators

19 Finding a static schedule
[The same SDF graph as on the previous slide]
- Non-terminating execution is normally assumed: we want to repeat the schedule indefinitely
- Two requirements on such a schedule:
  - Balanced token production/consumption (consistency)
  - No deadlock (sufficient delay on cycles)

20 Balance equations
One equation per edge: (tokens produced per firing) times (repetitions of the producer) equals (tokens consumed per firing) times (repetitions of the consumer).

(A,B):  rA - 2 rB = 0
(B,C):  2 rB - rC = 0
(B,D):  rB - 5 rD = 0
(C,E):  rC - 5 rE = 0
(D,E):  2 rD - rE = 0
(E,D):  rE - 2 rD = 0
(E,F):  rE - rF = 0
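Collecting the equations into matrix form (a reconstruction of the slide's figure; rows ordered as the edge list above, columns A through F), the balance equations read Γr = 0, and the smallest positive integer solution is the repetitions vector shown on the next slide:

\Gamma r =
\begin{pmatrix}
 1 & -2 &  0 &  0 &  0 &  0 \\
 0 &  2 & -1 &  0 &  0 &  0 \\
 0 &  1 &  0 & -5 &  0 &  0 \\
 0 &  0 &  1 &  0 & -5 &  0 \\
 0 &  0 &  0 &  2 & -1 &  0 \\
 0 &  0 &  0 & -2 &  1 &  0 \\
 0 &  0 &  0 &  0 &  1 & -1
\end{pmatrix}
\begin{pmatrix} r_A \\ r_B \\ r_C \\ r_D \\ r_E \\ r_F \end{pmatrix}
= 0,
\qquad
r = n\,(10,\; 5,\; 10,\; 1,\; 2,\; 2)^{T}, \quad n = 1, 2, 3, \ldots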

21 Precedence graph and repetitions vector
[Precedence graph: one node per firing (A1…A10, B1…B5, C1…C10, D, E1, E2, F1, F2), with an edge wherever one firing produces tokens consumed by another]

Repetitions vector (for any positive integer n):
rA = 10n, rB = 5n, rC = 10n, rD = n, rE = 2n, rF = 2n

22 Constructing a schedule
- Any topological ordering of the precedence graph is a valid schedule
- Criteria for choosing among them:
  - Fire as soon as enabled: A A B C C  A A B C C  A A B C C  E F  A A B C C  A A B C C  D E F
  - Minimize buffers
  - Minimize appearances (in a looped schedule): (A2 B C2)5 E D E F2
  - Other criteria…

23 Code synthesis
Firing sequence:  A A B C C  A A B C C  A A B C C  E F  A A B C C  A A B C C  D E F
Looped schedule:  (A2 B C2)5 E D E F2

for i = 1,2,…,5
    for j = 1,2
        A
    B
    for j = 1,2
        C
E
D
E
for j = 1,2
    F

The precedence graph is also a good starting point for multi-processor scheduling algorithms.
[Figure: the precedence graph with firings assigned to processors]
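As a concrete rendering, synthesized single-processor code for the looped schedule could look roughly like this (hypothetical fire_X functions standing in for the actors' action bodies; not the output of an actual tool):

/* Actor firing functions; their bodies would be generated from the CAL actions. */
void fire_A(void); void fire_B(void); void fire_C(void);
void fire_D(void); void fire_E(void); void fire_F(void);

/* One iteration of the single-appearance looped schedule (A2 B C2)5 E D E F2.
 * Buffer sizes and token positions are fixed at compile time, so the FIFO
 * accesses inside fire_X() can be reduced to plain array accesses. */
void run_one_iteration(void)
{
    for (int i = 0; i < 5; i++) {
        for (int j = 0; j < 2; j++) fire_A();
        fire_B();
        for (int j = 0; j < 2; j++) fire_C();
    }
    fire_E();
    fire_D();
    fire_E();
    for (int j = 0; j < 2; j++) fire_F();
}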

24 Limitations of the SDF model
- Fixed token rates ≈ one CAL action only
  - In SDF all tokens must be consumed and produced in a single firing
- SDF can't handle conditional actors
- Fixed iteration is supported by SDF; data-dependent iteration is not
  [Figure: actor A producing 100 tokens that B consumes one at a time, i.e. fixed iteration]
- Delays are required on feedback loops
  - CAL actors can use state variables
  - Avoid reading tokens from a loop until tokens have been produced (e.g. an initialization phase)

25 Conditional actors
[Figure: a "switch" actor routing its input to port T or F depending on a control token, and a "select" actor merging T or F back onto its output]

switch:
  action In:[x], Ctrl:[true]  ==> T:[x];
  action In:[x], Ctrl:[false] ==> F:[x];

select:
  action T:[x], Ctrl:[true]  ==> Out:[x];
  action F:[x], Ctrl:[false] ==> Out:[x];

- Conditional actors are not SDF
- SDF + conditions = Boolean Dataflow
  - A Turing-complete language
  - Interesting properties are no longer decidable

26 “Well-behaved” dataflow [Gao92]
Restricted use of conditional actors 1 B X 1 switch select switch in T T out T out F F select F 1 Y 1 T in C cond F false “conditional schema” ≈ ”loop schema” these “clusters” of actors are SDF if (cond) then out := X(in); else out := Y(in); x := in; while (C(x)) do x := B(x); end; out = x; in out in out cond cond

27 Cyclo-Static dataflow [Bilsen96]
- Actors have periodic token rates
  - Each "phase" within the period has fixed rates
  [Figure: an actor with period 9; each of its ports (in, mode, out) has a phase-wise rate vector, e.g. (2,0,0,0,0,0,0,0,0), (1,1,1,1,1,1,1,1,0), (0,1,1,1,1,1,1,1,1)]
- Allows more flexible scheduling
  - Avoids excessive buffer sizes
  - Models dataflow that would deadlock in SDF (8 delays would be required on the feedback loop)

28 next: Dynamic Dataflow (DDF)
[Diagram: nested classes of dataflow programs within the "universe" of CAL programs: SDF, cyclo-static dataflow, well-behaved dataflow, boolean dataflow, dynamic dataflow]

29 Dynamic dataflow (DDF) [Lee95]
- A determinate model of computation
  - Outputs depend only on past inputs
- Can be implemented using blocking reads from FIFO channels
  - Infinite capacity and non-blocking writes are assumed
- May have several firing rules (≈ CAL actions)
  - Conditions on token availability and values (≈ guards)
- The mapping from input to output is functional
  - but state variables can be thought of as feedback

30 Sequential firing rules
Firing rules are evaluated (as if) using blocking reads.

NDMerge (non-determinate merge):
  action X:[x] ==> Out:[x];
  action Y:[y] ==> Out:[y];
Reading either X or Y may block, although the other rule may fire. Not a DDF actor!

FairMerge (state modeled as a feedback loop State/State'):
  action X:[x], State:[0] ==> Out:[x], State':[1];
  action Y:[y], State:[1] ==> Out:[y], State':[0];
[Decision tree: read State; if 0, read X; if 1, read Y]
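A minimal sketch (hypothetical blocking-FIFO helpers, not from any particular runtime) of why FairMerge fits the blocking-read discipline while NDMerge does not:

#include <stdbool.h>

/* Hypothetical FIFO primitives: fifo_read blocks until a token is
 * available; fifo_write never blocks (unbounded capacity assumed). */
typedef struct fifo fifo_t;
int  fifo_read(fifo_t *f);           /* blocking */
void fifo_write(fifo_t *f, int v);   /* non-blocking */

/* FairMerge: the internal state tells us which input to read next, so a
 * blocking read on that one input is always the right thing to do. */
void fair_merge(fifo_t *x, fifo_t *y, fifo_t *out)
{
    bool read_x = true;                  /* plays the role of the State token */
    for (;;) {
        fifo_write(out, fifo_read(read_x ? x : y));
        read_x = !read_x;
    }
}

/* NDMerge has no such state: it should forward whichever input has data.
 * With blocking reads we must commit to one input first (x below) and may
 * block on it even though tokens are waiting on y, so NDMerge cannot be
 * expressed with sequential (blocking) firing rules; it is not a DDF actor. */
void nd_merge_wrong(fifo_t *x, fifo_t *y, fifo_t *out)
{
    for (;;) {
        fifo_write(out, fifo_read(x));   /* may block while y has tokens */
        fifo_write(out, fifo_read(y));
    }
}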

31 Scheduling DDF
- Can't be scheduled statically in general
  - Absence of deadlock is undecidable
  - Buffer bounds are undecidable
  - Dynamic (run-time) scheduling is needed
- Deadlock is a property of a dataflow graph
  - Unaffected by execution order
- Boundedness of buffers is not (necessarily)
  - An unfortunate order → unbounded buffers

32 Avoiding unbounded buffering
- Limiting channel capacities
  - The bound can generally not be determined
  - Setting too low a capacity leads to deadlock
- Purely data-driven or demand-driven policy
  [Figure: a small graph with a source B, a sink D and a channel (B,C)]
  - Source B is always enabled, but does C consume tokens on the (B,C) channel at the same rate?
  - Sink D always demands tokens, but are tokens on the (B,C) channel consumed at the same rate?
- More clever regulation is needed!

33 Bounded Scheduling [Parks95]
- Start with (arbitrarily) bounded buffers
  - Block on write to a full buffer
- Use a simple basic scheduling algorithm (see the sketch below)
  - Data-driven and demand-driven both work OK
- Grow the smallest buffer on "artificial" deadlock
  - Deadlock-free graphs execute indefinitely, with bounded buffers when possible
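A minimal sketch of Parks' strategy (hypothetical actor and channel helpers; assuming writes block on full channels, and simplifying "artificial deadlock" detection to "no actor can fire while some channel is full"):

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical interfaces of a dynamically scheduled dataflow network. */
typedef struct channel channel_t;
typedef struct actor   actor_t;

bool   actor_can_fire(actor_t *a);        /* enough tokens and buffer space? */
void   actor_fire(actor_t *a);            /* one firing */
bool   channel_is_full(channel_t *c);
size_t channel_capacity(channel_t *c);
void   channel_set_capacity(channel_t *c, size_t cap);

/* Parks' bounded scheduling: run a simple data-driven schedule; when no
 * actor can fire but some channel is full ("artificial" deadlock), grow
 * the smallest full channel and continue. */
void parks_schedule(actor_t **actors, size_t n_actors,
                    channel_t **chans, size_t n_chans)
{
    for (;;) {
        bool progress = false;
        for (size_t i = 0; i < n_actors; i++) {
            if (actor_can_fire(actors[i])) {
                actor_fire(actors[i]);
                progress = true;
            }
        }
        if (progress)
            continue;

        /* No actor is enabled: find the smallest full channel, if any. */
        channel_t *smallest = NULL;
        for (size_t i = 0; i < n_chans; i++) {
            if (channel_is_full(chans[i]) &&
                (!smallest ||
                 channel_capacity(chans[i]) < channel_capacity(smallest)))
                smallest = chans[i];
        }
        if (!smallest)
            break;                        /* true deadlock: terminate */
        channel_set_capacity(smallest, 2 * channel_capacity(smallest));
    }
}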

34 Hybrid static/dynamic scheduling
- Schedule statically when possible; use run-time techniques when necessary
- Is it practical to identify statically schedulable clusters of actors in a CAL network?
  - We believe so, and intend to explore this option within ACTORS
  - Novel analysis techniques are required: "SDF actors" and "switch/select" are likely to be rare, and CAL doesn't provide notation for cyclo-static actors
- Are the actor clusters useful building blocks of a "fully dynamic" multi-processor schedule?
  - This is our working assumption

35 An example from the MPEG4 SP decoder
Actor Interpolate (abridged):

start:     action halfPel:[f] ==> ;
done:      action ==>  guard y = 9;
row_col_0: action RD:[d] ==>  guard (x=0) or (y=0);
other:     action RD:[d] ==> MOT:[p];
priority done > row_col_0 > other;

[Figure: the Interpolate actor with input ports halfPel and RD, output port MOT, and an action schedule with states q0 and q1]

36 An example from the MPEG4 SP decoder
[Decision diagram of the action firings (≈ firing rules) of Interpolate: in state q0 only start is eligible; in state q1 the tests y=9, x=0 and y=0 select among done, row_col_0 and other]

37 An example from the MPEG4 SP decoder
[The same decision diagram, now showing the transitions of the action schedule between states q0 and q1]

38 An example from the MPEG4 SP decoder
[The same diagram, annotated with state-variable updates: start sets x := 0; y := 0; after each RD-consuming firing, x := x+1 while x < 8, otherwise x := 0; y := y+1; done fires when y = 9]

39 An example from the MPEG4 SP decoder
We have constructed the control-flow graph of the actor; standard program analysis techniques (e.g. loop analyses) apply.
[Control-flow graph of Interpolate: start (x := 0; y := 0), a loop in which row_col_0 or other fires depending on the guards, with x := x+1 while x < 8, x := 0; y := y+1 when x ≥ 8, and done when y = 9]
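Read as ordinary code, the control-flow graph corresponds roughly to the following loop nest (a sketch; the functions stand for the corresponding CAL actions and the bounds follow the guards above):

/* The actions of Interpolate, as plain functions. */
void start(void); void row_col_0(void); void other(void); void done(void);

/* One period of the Interpolate actor, derived from its control-flow
 * graph: a 9x9 grid of firings where the first row and first column are
 * handled by row_col_0 and the remaining 8x8 positions by other. */
void interpolate_period(void)
{
    start();                       /* x := 0; y := 0 (consumes 1 halfPel token) */
    for (int y = 0; y <= 8; y++) {
        for (int x = 0; x <= 8; x++) {
            if (x == 0 || y == 0)
                row_col_0();       /* consumes 1 RD token */
            else
                other();           /* consumes 1 RD token, emits 1 on MOT */
        }
    }
    done();                        /* fires when y = 9 */
}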

40 Cyclo-static behavior
- Cyclo-static period, N = 64
  - 1 input token on halfPel, 81 on RD
  - 64 output tokens on port MOT
- Phase-wise rates of Interpolate:
  - halfPel: (1,0,…,0)
  - MOT: (1,1,…,1)
  - RD: (11,1,1,1,1,1,1,1, 2,1,1,1,1,1,1,1, …, 2,1,1,1,1,1,1,1)
[Table: the grid of start/row_col_0/other/done firings for x = 0…8, y = 0…9 from which these rates follow]

41 Vectorization [Ritz93]
- Aggregation of multiple firings
- Limited by feedback loops
[Figure: the cyclo-static Interpolate actor next to a fully vectorized version that consumes 1 halfPel token and 81 RD tokens and produces 64 MOT tokens in a single aggregated firing]

42 Data-dependent behavior
[The Add actor of the decoder: input ports BTYPE, TEX and MOT, output port VID, an action schedule with states q0 through q4, and actions for reading BTYPE and for the newVop, "texture only", "motion only" and "combine" cases]

43 Data-dependent behavior
Sometimes distinct "operational modes" with static behavior are identifiable.
[The same Add actor, with its four operational modes highlighted]

44 Data-dependent behavior
Sometimes distinct "operational modes" with static behavior are identifiable.
[Figure: the Add actor in each of its four operational modes: "texture only" (N=64), "motion only" (N=64), "combine" (N=64) and "new VOP" (N=1). Each mode has fixed phase-wise rates on its ports, e.g. BTYPE (1,0,…,0) and VID (1,1,…,1) in the 64-phase modes.]

45 Clustering
- Integration of adjacent actors
[Figure: a cluster formed from the Add and Interpolate actors, with external ports BTYPE, TEX, RD, halfPel and VID]
- The cluster inherits its "operational modes" and cyclo-static behavior from the original actors
- The Interpolate actor will be fired in two of the modes only: "motion only" and "combine"

46 Proposed Tools Infrastructure
[Toolchain diagram: a CAL network/model and CAL actors (XLIM) are processed by a Model Compiler that performs model-level analysis, annotations and source-to-source transformation into larger-grain actors (CAL); this allows existing tools (the opendf compiler, Cal2C, Cal2HDL, the ARM compiler) to leverage the model compilation. WP1 and WP2 mark the work packages involved.]

47 Summary
- Dataflow programming and CAL offer a promising alternative to current practices
- Making better use of parallelism
  - A naive mapping actors→threads won't do the trick
  - We need larger-grain actors: reduced overhead of run-time scheduling
- There is an extensive body of work on efficient realization of SDF (+ extensions)
  - CAL requires additional, novel techniques
  - Some initial ideas were presented today

48 Some references
[Lee87] E. A. Lee and D. G. Messerschmitt, "Static scheduling of synchronous dataflow programs for digital signal processing," IEEE Trans. Comput., vol. 36, no. 1, pp. 24–35, 1987.
[Lee95] E. A. Lee and T. M. Parks, "Dataflow process networks," Proceedings of the IEEE, vol. 83, no. 5, pp. 773–801, 1995.
[Gao92] G. R. Gao, R. Govindarajan, and P. Panangaden, "Well-behaved dataflow programs for DSP computation," in IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-92, vol. 5, pp. 561–564, IEEE, March 1992.
[Bilsen96] G. Bilsen, M. Engels, R. Lauwereins, and J. Peperstraete, "Cyclo-static dataflow," IEEE Trans. Signal Processing, vol. 44, no. 2, pp. 397–408, 1996.
[Parks95] T. M. Parks, Bounded Scheduling of Process Networks. PhD thesis, EECS Department, University of California, Berkeley, 1995.
[Bhat95] S. Bhattacharyya, P. Murthy, and E. Lee, "Optimal parenthesization of lexical orderings for DSP block diagrams," in Proceedings of the International Workshop on VLSI Signal Processing, pp. 177–186, October 1995.
[Ritz93] S. Ritz, M. Pankert, V. Živojnović, and H. Meyr, "Optimum vectorization of scalable synchronous dataflow graphs," in Intl. Conf. on Application-Specific Array Processors, pp. 285–296, Prentice Hall, IEEE Computer Society, 1993.

