
Compilation for Scalable, Paged Virtual Hardware. Eylon Caspi, Qualifying Exam, 3/6/01, University of California, Berkeley.


1 Compilation for Scalable, Paged Virtual Hardware
Eylon Caspi, Qualifying Exam, 3/6/01, University of California, Berkeley

2 The Compilation Problem
- Programming model: communicating EFSM operators (unrestricted size, # IOs, timing)
- Execution model: communicating page configs (fixed size, # IOs, timing) on paged virtual hardware
- Compilation is a resource-binding transform on state machines + data-paths
[Figure: "Compile" maps TDF operators, streams, and memory segments onto compute pages, streams, and memory segments]

3 Overview
- Motivation
  - Paged virtual hardware: software survival + scalability
  - SCORE programming model
- Compilation methodology
  - New page partitioning techniques
  - Automatic synthesis & partitioning of communicating FSMs
- Evaluation + Architectural Studies
- Timeline

4 Reconfigurable Computing
- Programmable logic + programmable interconnect (e.g. FPGA)
- 10x-100x gain vs. microprocessors in:
  - Performance
  - Functional density (work per area-time)
- Spatial computing
  - Parallelism; custom data paths
- Programmability
  - Custom execution sequence; specialization
- BUT current models expose resource constraints to the programmer
  - Programmer has to target a specific device
  - Limits software longevity
Graphics copyright by their respective companies

5 Solution: Virtual Hardware
- Compute model with unbounded resources
  - Programmer no longer targets a specific device
  - Enables software longevity, scalability
- Requires efficient hardware virtualization
  - Large device ⇒ concurrent spatial execution
  - Small device ⇒ time multiplexing
- Paging model

6 Previous Approaches to Paging
- WASMII: Register IO
  - [Ling+Amano, FCCM ‘93]
  - Page IO via registers
  - Evaluate each page for a cycle, then reconfigure
  - Reconfiguration time dominates execution
- DPGA: Configuration Cache
  - [DeHon, FPGA ‘94], TM-FPGA [Xilinx, FCCM ‘97]
  - Fast reconfiguration ⇒ area, power
  - Reconfiguration power dominates execution
- PipeRench: Stripes
  - [CMU, FPGA ‘98]
  - Pipelined reconfiguration
  - Feed-forward computation only
[Figure: axis label "time"]

7 Paging + Streaming
- Streaming allows efficient, useful virtualization
  - Amortizes reconfiguration cost over a larger epoch
  - Exploits program structure
  - Less restrictive communication topology
- Compiler and scheduler’s joint responsibility
[Figure labels: buffers, swap]

8 SCORE Compute Model
- Program = DFG of compute nodes
  - Kahn process network ⇒ blocking read, non-blocking write
- Compute: SFSM (Streaming Finite State Machine)
  - Concretely: page + FSM to implement token-flow semantics
  - Abstractly: task with local control
- Communication: Stream
  - Abstraction of wire, with buffering
- Storage: Memory Segment
- Dynamics:
  - Dynamic local behavior in SFSM
  - Unbounded resource usage: stream buffer expansion
  - Dynamic graph allocation in STM (Streaming Turing Machine)

9 SCORE Programming Model: TDF
- TDF = intermediate, behavioral language for:
  - EFSM operators
  - Static operator graphs
- State machine for:
  - Firing signatures
  - Control flow (branching)
- Firing semantics:
  - When in state X, wait for X’s inputs, then fire (consume, act)

    select ( input  boolean     s,
             input  unsigned[8] t,
             input  unsigned[8] f,
             output unsigned[8] o )
    {
      state S (s) : if (s) goto T; else goto F;
      state T (t) : o=t; goto S;
      state F (f) : o=f; goto S;
    }

[Figure: select operator with inputs s, t, f and output o]
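
The firing semantics of the `select` operator can be mimicked in ordinary software. The following is a minimal Python sketch, not part of SCORE or the TDF compiler: `run_select` is a hypothetical helper that models each stream as a finite queue. It simply stops when the awaited input is empty, whereas a real operator would block until a token arrives.

```python
from collections import deque

def run_select(s, t, f):
    """Simulate the slide's `select` operator on finite input streams."""
    queues = {"S": deque(s), "T": deque(t), "F": deque(f)}
    out = []
    state = "S"
    # Each state's input signature names the one stream it must read.
    while queues[state]:                  # fire only if the awaited token is present
        token = queues[state].popleft()   # consume
        if state == "S":
            state = "T" if token else "F" # branch on the select bit
        else:
            out.append(token)             # act: o = t (state T) or o = f (state F)
            state = "S"
    return out

if __name__ == "__main__":
    print(run_select([True, False, True], [10, 11], [20]))
```

With select bits `[True, False, True]` the operator reads from `t`, then `f`, then `t` again, so each state consumes only the stream named in its own signature.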

10 SCORE Hardware Model
- Paged FPGA
  - Compute Page (CP)
  - Fixed-size slice of RC hardware
  - Fixed number of I/O ports
- Distributed, on-chip memory
  - Configurable Memory Block (CMB)
  - Stream access
- High-level interconnect
- Microprocessor
  - Run-time support + user code

11 SCORE Software Infrastructure
- Device Simulator
  - Cycle-accurate behavioral simulation
  - Parameterized (e.g. #pages)
  - Interacts with concurrent user processes (STMs) via stream API
- Page Scheduler
  - Version 1: dynamic, list-based scheduling (by input availability)
  - Version 2: static, precedence-based
- TDF Compiler
  - Compiles to working C++ simulation code
  - No partitioning (page = 1 TDF operator)
- Applications
  - Wavelet, JPEG, MPEG, IIR
[Figure: axes "Device size" and "Run time"]

12 Communication is King
- With virtualization, inter-page delay is unknown, sensitive to:
  - Placement
  - Interconnect implementation
  - Page schedule
  - Technology: wire delay is growing
- Inter-page feedback is SLOW
  - Partition to contain FB loops in page
  - Schedule to contain FB loops on device

13 Structural Partitioning is Not Enough
- Structural partitioning does not address feedback loops
  - Wire min-cut
    - FM, flow-based
  - Minimum wire length
    - Spectral
  - Delay-optimal DAG mapping
    - DAGON, FlowMap, Wong
- Structural partitioning does not address communication rates, dynamics
  - All loops are NOT created equal

14 FSM Decomposition is Not Enough
- Ashar+Devadas+Newton (ICCAD ‘89)
  - Minimize logic
- Kuo+Liu+Cheng (ISCAS ‘95)
  - Minimize wires
- Benini+DeMicheli+Vermeulen (ISCAS ‘98)
  - Minimize power
- None consider inter-page delay
- None consider cutting / scheduling data-path separately from FSM
[Figure labels: Ma, Mb, Fa, Fb]

15 Outline
- Motivation
- Compilation Methodology
- Evaluation + Architectural Studies
- Time Line

16 Compilation – Scope
- Synthesis + partitioning of SFSMs
  - TDF ⇒ Pages
  - Resource binding
- Target
  - Parameterized hardware model / simulation
- Constrained optimization problem
  - Constraints
    - Page area, IO, timing
  - Optimality criteria
    - Primary: communication delay
    - Secondary: communication bandwidth, area
[Figure: "Compile" maps TDF operators, streams, and memory segments onto compute pages, streams, and memory segments]

17 Compilation Flow Overview
(1) Optimizations
(2) Data path timing + scheduling
(3) Partitioning
- Ignore:
  - Place / route / retime in page
    - Known solutions in the community
  - Page scheduling
    - Responsibility of separate scheduler

18 Synthesis + Partitioning Flow
[Figure: flow diagram. Stages: Compiler Optimizations, Pipeline Extraction, Data Path Mapping, Partition Large States, Schedule DF into States, Cluster States, Page Packing, Synthesize Page FSMs. Stage groups: Optimization, Data-path, Partitioning. Legend: p (Preliminary Code)]

19 How Big is an Operator?
[Figure: chart labels Wavelet Decode, Wavelet Encode, JPEG Encode, MPEG Encode; legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]

20 Partitioning Tasks
(1) Decompose / shrink SFSMs
(2) Pack SFSMs onto page
[Figure: flow diagram. Stages: Compiler Optimizations, Pipeline Extraction, Data Path Mapping, Partition Large States, Schedule DF into States, Cluster States, Page Packing, Synthesize Page FSMs; several stages marked "p"]

21 Pipeline Extraction
- Hoist uncontrolled FF data-flow out of FSMD
- Benefits:
  - Shrink FSM cyclic core
  - Extracted pipeline has more freedom for scheduling and partitioning
[Figure: "Extract". Before: state foo(x): if (x==0)... with DF and CF inside the state. After: a pipeline computes xz from x (x==0) and the state becomes foo(xz): if (xz)...]

22 Pipeline Extraction – Extractable Area
[Figure: legend JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]

23 Pipeline Extraction – Residual SFSM
[Figure: legend JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]

24 Data-path Mapping / Scheduling
- Task:
  - Bind technology-specific area/time to data-path primitives
  - Schedule data-path primitives in state machine
- Fixed-frequency target
  - Decompose primitives into multi-cycle operations
- Data-path module library / tree matching
- Pipeline linearized sequences / loops
  - DAG mapping state logic is insufficient
- Compiler technology
  - Code motion
  - Software pipelining

25 Delay-Oriented State Clustering
- Indivisible unit: state (CF+DF)
  - Spatial locality in state logic
- Cluster states into page-size sub-machines
  - Inter-page communication for data flow, state flow
- Sequential delay is in inter-page state transfer
  - Cluster to maintain local control
  - Cluster to contain state loops
- Similar to:
  - VLIW trace scheduling [Fisher ‘81]
  - FSM decomp. for low power [Benini/DeMicheli ISCAS ‘98]
  - VM/cache code placement
  - GarpCC HW/SW partitioning [Callahan ‘00]

26 State Clustering Formulation
- Min-cut transition probabilities in state flow graph
  - Probabilities from profiling
- Area-constrained
  - Balanced min-cut partitioning [Yang+Wong, ACM ‘94]
  - Iterate to desired partition area: (1-ε)A ≤ a(X) ≤ (1+ε)A
- IO-constrained
  - Add wire edges
  - Mix edge weights: c·w_wire + (1-c)·w_SF
  - Use smallest IO-feasible c
- Requires all states to be smaller than page
[Figure: graph labeled with p1-p5, w1-w9, a1-a4]
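
To make the formulation concrete, here is a small Python sketch that scores a two-way split of a state flow graph with the mixed edge weight and rejects splits that violate the area bound. It uses brute-force enumeration rather than the flow-based balanced min-cut of Yang+Wong, so it only suits toy graphs. The function names, the dictionary encoding of edges, and all numeric values are assumptions for illustration.

```python
from itertools import combinations

def mixed_cut_cost(part_x, sf_edges, wire_edges, c):
    """Cut cost with mixed weights: c * w_wire + (1 - c) * w_SF."""
    def cut(edges):
        # An edge is cut when exactly one endpoint lies in part_x.
        return sum(w for (u, v), w in edges.items()
                   if (u in part_x) != (v in part_x))
    return c * cut(wire_edges) + (1 - c) * cut(sf_edges)

def area_ok(part_x, area, target, eps):
    """Check (1 - eps) * A <= a(X) <= (1 + eps) * A."""
    a = sum(area[s] for s in part_x)
    return (1 - eps) * target <= a <= (1 + eps) * target

def best_bipartition(area, sf_edges, wire_edges, c, target, eps):
    """Exhaustively search for the cheapest area-feasible partition X."""
    states = sorted(area)
    best = None
    for k in range(1, len(states)):
        for subset in combinations(states, k):
            x = frozenset(subset)
            if not area_ok(x, area, target, eps):
                continue
            cost = mixed_cut_cost(x, sf_edges, wire_edges, c)
            if best is None or cost < best[0]:
                best = (cost, x)
    return best

if __name__ == "__main__":
    # Hypothetical 4-state machine: two tight loops joined by rare transitions.
    area = {"A": 10, "B": 10, "C": 10, "D": 10}
    sf = {("A", "B"): 1.0, ("B", "A"): 0.9, ("B", "C"): 0.1,
          ("C", "D"): 1.0, ("D", "C"): 0.9, ("D", "A"): 0.1}
    print(best_bipartition(area, sf, {}, c=0.0, target=20, eps=0.1))
```

In this toy input the cheapest feasible split keeps each loop (A,B and C,D) inside one partition, which is the behavior the delay-oriented clustering aims for.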

27 Page Packing
- Cluster SFSMs + pipelines
  - Avoid page fragmentation
- Min-cut streams of top-level DFG
  - Allow cutting pipelines, not SFSMs
  - Area and IO constrained (Wong balanced min-cut partition)
- Disallow certain topologies
  - No dynamic-rate streams in page
  - Data-flow feedback?

28 Outline
- Motivation
- Compilation Methodology
- Evaluation + Architectural Studies
- Time Line

29 Evaluating Paging Overhead
- Applications
  - Must be rewritten in TDF
  - Existing: Wavelet, JPEG, MPEG, IIR
  - To do: ADPCM, BABAR particle detector
- Metrics
  - Circuit area (#pages x page-size)
  - Page delay (LUT depth per firing)
  - Performance (total run-time, “makespan”)
- Baseline comparison
  - “Unpartitioned”: page = 1 TDF operator
  - Ideal virtualization with zero partitioning cost: cannot do better

30 Page Size Studies
- Paging overhead varies with:
  - Application
  - Page size, IO
  - Match thereof
- Is paging overhead robust to a mismatch?
- Vary page parameters, measure:
  - (1) Pure area overhead
  - (2) Pure performance overhead
    - Execute spatially in expanded hardware
  - (3) Virtualized performance overhead
    - Execute in fixed device size
[Figure: panels (1), (2), (3)]
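
One way to tabulate these overheads is sketched below in Python. The slides give the area metric (#pages x page-size) and name the unpartitioned design as the baseline, but they do not define "overhead" numerically. Treating it as the relative increase over the baseline is an assumption here, and the function names and sample numbers are hypothetical.

```python
def circuit_area(num_pages, page_size):
    """Circuit area metric from the evaluation slide: #pages x page-size."""
    return num_pages * page_size

def overhead(measured, baseline):
    """Relative increase over the unpartitioned baseline (assumed definition)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return measured / baseline - 1.0

if __name__ == "__main__":
    # Hypothetical numbers: 12 pages of 512 LUTs vs. 5000 LUTs unpartitioned.
    paged = circuit_area(12, 512)
    print(paged, overhead(paged, 5000))
```

The same `overhead` helper applies to makespan for studies (2) and (3), with the baseline run time measured on the unpartitioned design.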

31 Outline
- Motivation
- Compilation Methodology
- Evaluation + Architectural Studies
- Time Line

32 Status
- SCORE compiler / simulator / scheduler
  - Compile + execute unpartitioned (page = 1 TDF op)
- Preliminary synthesis + partitioning work
  - Pipeline extraction
  - FSM synthesis to SIS
  - Area-constrained state clustering
- To do
  - Complete initial implementation
  - Evaluate
  - Improve: secondary implementation

33 To Complete Initial Implementation
- IO-constrained state clustering
- Decompose large states
- Page packing
- Data-path scheduling in states
- Synthesize partitioned SFSMs

34 Secondary Implementation – Possibilities
- Optimizations
  - SW pipelining
  - Use SUIF
- State clustering with replication
- Unified state clustering + page packing
  - Cluster states of all operators simultaneously
- Finer-grained clustering
  - Recast as BDF, min-cut stream rates

35 Time Line
[Figure: Gantt chart over months 3-12 of 2001 and 1-8 of 2002; phases in order: Impl. 1, Eval, Impl. 2, Eval, Thesis writing]

36 Summary
- Partitioning and paging enable
  - Software survival / scaling
  - Efficient use of small HW for dynamic apps
- My contributions
  - Methodology for page synthesis + partitioning
    - Necessary for efficient virtualization
  - Evaluation framework
    - Verify that paging can be efficient
  - Architectural studies

37 Supplemental Material
- SFSMs + transforms
- SCORE simulation + scaling results
- Page hardware model
- Synthesis observations
- Architectural studies

38 TDF ∈ Dataflow Process Network
- Dataflow Process Network [Parks+Lee, IEEE May ‘95]
  - Process enabled by set of firing rules: R = {R_1, R_2, …, R_N}
  - Firing rule = set of patterns: R_i = {R_{i,1}, R_{i,2}, …, R_{i,p}}
- DF process for a TDF operator:
  - Feedback arc for state
  - One firing rule per state
  - Patterns match state value + presence of desired inputs
    - E.g. for state i: R_i = {R_{i,1}, R_{i,2}, …, [i]}
    - Patterns:
        R_{i,j} = [*]  if input j is in state i’s input signature
        R_{i,j} = ⊥    if input j is not in state i’s input signature
        R_{i,p} = [i]  for final input, representing state arc
  - These are sequential firing rules
- Partitioned SFSM adds “wait” state
[Figure labels: process, state]
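
The mapping from state signatures to firing rules can be sketched in a few lines of Python. This is an illustration only: `firing_rules` and `enabled` are hypothetical helpers, the string `"*"` stands for the pattern [*], and `None` stands for ⊥. The example reuses the three states of the `select` operator from slide 9.

```python
ANY = "*"       # pattern [*]: at least one token must be present
BOTTOM = None   # pattern ⊥: no token is required on this input

def firing_rules(signatures, inputs):
    """Build one rule per state: a pattern per input plus the state-arc value."""
    rules = {}
    for state, needed in signatures.items():
        patterns = {name: (ANY if name in needed else BOTTOM) for name in inputs}
        rules[state] = (patterns, state)   # final pattern [state] on feedback arc
    return rules

def enabled(rule, queues, current_state):
    """A rule fires when the state arc matches and every [*] input has a token."""
    patterns, state = rule
    if state != current_state:
        return False
    return all(len(queues[name]) >= 1
               for name, pat in patterns.items() if pat == ANY)

if __name__ == "__main__":
    rules = firing_rules({"S": {"s"}, "T": {"t"}, "F": {"f"}}, ["s", "t", "f"])
    queues = {"s": [1], "t": [], "f": []}
    for name, rule in rules.items():
        print(name, rule[0], enabled(rule, queues, "S"))
```

Because the state arc carries exactly one value at a time, at most one rule can match, which is why one rule per state keeps the process deterministic.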

39 SFSM Partitioning Transform
- Only 1 partition active at a time
  - Transform to activate via streams
- New state in each partition: “wait”
  - Used when not active
  - Waits for activation from other partition(s)
  - Has one input signature (firing rule) per activator
- Firing rules are not sequential, but determinism guaranteed
  - Only 1 possible activator
- Activation streams from given source to given dest. partitions can be merged + binary-encoded
[Figure: SFSM with states A, B, C, D split into partition {A,B} with state "Wait AB" and partition {C,D} with state "Wait CD"]
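
A rough Python sketch of this transform follows. It rewrites a state-transition table into one table per partition, replacing each cross-partition transition with "send an activation, then go to wait" and adding a matching entry to the destination's wait state. The data layout and the name `partition_transitions` are assumptions for illustration; inputs, outputs, and data-path actions are omitted.

```python
WAIT = "wait"

def partition_transitions(transitions, part_of):
    """Split {state: [successors]} into per-partition tables with a wait state.

    Ordinary states map to a list of (next_state, activation), where
    activation is None for a local move or (dest_partition, target_state)
    when control leaves the partition. The wait state maps to a list of
    (activator_partition, target_state) entries, one per incoming activation.
    """
    parts = {}
    for src, dsts in transitions.items():
        p = part_of[src]
        table = parts.setdefault(p, {WAIT: []})
        for dst in dsts:
            q = part_of[dst]
            if q == p:
                table.setdefault(src, []).append((dst, None))
            else:
                # Leaving partition p: emit activation token, then idle in wait.
                table.setdefault(src, []).append((WAIT, (q, dst)))
                # Partition q's wait state gains a rule for activator p.
                parts.setdefault(q, {WAIT: []})[WAIT].append((p, dst))
    return parts

if __name__ == "__main__":
    transitions = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": ["C", "A"]}
    part_of = {"A": 0, "B": 0, "C": 1, "D": 1}
    for pid, table in partition_transitions(transitions, part_of).items():
        print(pid, table)
```

Each wait state ends up with one entry per activator, mirroring the slide's "one input signature per activator".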

40 Distributing/Collecting Shared Streams
- Requires inter-page synchronization for ordering
- Two schemes for input distribution
  - (1) send token to all pages
    - Inactive pages must discard tokens, must know how many to discard
  - (2) send token only to active page
    - Distributor must know state
    - (a) present state requests token OR
    - (b) previous state pre-fetches token
- One scheme for output collection
  - Collector must know state
- How to cluster distributors / collectors?
  - Distributor scheme (1) and collector incur no sequential delay (wire min-cut ok)
  - Distributor scheme (2)(a) can be cast into delay-optimal state clustering:
    - Decompose reading states into sequences of single-read states
    - Pre-cluster states that read same stream: this forms distributors
    - Sequential delay of read request is now modeled as state transfer to distributor
[Figure: states A, B, C, D with shared input stream i and output stream o]

41 Decomposing Large States
- A state may be larger than a page
- Decomposing into a sequence of page-size states leads to excessive inter-page transfer
- Better: delay-optimal DAG-mapping into parallel pages

42 SFSM Optimizations
- Many traditional compiler optimization techniques apply to TDF
  - State flow ~ basic block flow
- Different cost model
  - “Unlimited” registers and functional units
- E.g. work-reducing optimizations
  - Constant folding / propagation
  - Common subexpression elimination
  - Hoist loop invariants
  - Strength reduction

43 SCORE Functional Simulation
- FPGA based on HSRA [Berkeley, FPGA ’99]
  - CP: 512 4-LUTs
  - CMB: 2 Mbit DRAM
  - Area for CP-CMB pair:
    - 0.25 μm: 12.9 mm² (1/9 of PII-450)
    - 0.18 μm: 6.7 mm² (1/16 of PIII-600)
  - Page reconfiguration: 5000 cycles (from CMB)
  - Synchronous operation (same clock speed as processor)
- x86 microprocessor
- Page Scheduler task
  - Swap on timer interrupt (every 250,000 cycles)
  - Fully dynamic scheduling
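
The two timing parameters on this slide give a quick feel for how reconfiguration cost is amortized. The Python sketch below assumes, for illustration only, that a page is reconfigured at most once per timer interval and that reconfiguration does not overlap with computation; the slides do not state either condition.

```python
RECONFIG_CYCLES = 5_000      # page reconfiguration from CMB (from the slide)
TIMESLICE_CYCLES = 250_000   # timer interrupt period (from the slide)

def reconfig_fraction(reconfig_cycles, timeslice_cycles):
    """Fraction of a timeslice spent reconfiguring, under the stated assumptions."""
    if timeslice_cycles <= 0:
        raise ValueError("timeslice must be positive")
    return reconfig_cycles / timeslice_cycles

if __name__ == "__main__":
    frac = reconfig_fraction(RECONFIG_CYCLES, TIMESLICE_CYCLES)
    print(f"{frac:.1%} of each timeslice")
```

Under these assumptions a single reconfiguration costs about 2% of a timeslice, which illustrates why a long epoch amortizes the swap cost.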

44 Application: JPEG Encode

45 Scaling Results: JPEG Encode
[Figure: Total Time (makespan in millions of cycles) vs. Physical Compute Pages]

46 Page Hardware Model
- Page = fixed-size slice of resources + stream interface
- FSM for:
  - Firing
  - Output emission
  - Data-path control
  - Branching
[Figure labels: FSM, Reconfigurable, Fixed logic]

47 Page Firing Logic
- Sample firing logic
  - 3 inputs (A,B,C)
  - 3 outputs (X,Y,Z)
  - Single signature

48 How Large is a State?
[Figure: legend JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), IIR]

49 SFSM Firing Delay
- Complex SFSM may require ≥1 cycle just for control
  - Evaluate firing rule, generate control signals, compute next state
- Should we partition SFSM to minimize FSM logic?
  - No: incurring inter-page communication latency is worse!
[Figure: Histogram of FSM Delay for 47 Operators (unpartitioned); axis: 4-LUT Depth; legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]
[Figure: Histogram of FSM Inputs for 47 Operators (unpartitioned); legend: JPEG Encode, JPEG Decode, MPEG (I), MPEG (P), Wavelet Encode, IIR]

50 Scaling the Hardware Resources
- A simplified scaling model for architectural studies
  - Scaling page size (LUTs) induces scaling of other resources, e.g.:
- Scaling memory
  - Constant CP-to-CMB ratio
- Scaling page IO
  - Rent’s Rule: IO = C·A^p, (0 ≤ p ≤ 1)
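
Rent’s Rule as written above can be evaluated directly. The Python sketch below shows how page IO would scale with page area under that rule. The values of C and p are placeholders chosen for round numbers, not figures from the slides.

```python
def rent_io(area_luts, c, p):
    """Rent's Rule: IO = C * A**p, with 0 <= p <= 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("Rent exponent p must lie in [0, 1]")
    return c * area_luts ** p

def io_scale_factor(area_scale, p):
    """Factor by which IO grows when page area grows by area_scale."""
    return area_scale ** p

if __name__ == "__main__":
    # Hypothetical constants: C = 2.0, p = 0.5.
    for luts in (256, 1024, 4096):
        print(luts, rent_io(luts, c=2.0, p=0.5))
```

With p below 1, IO grows more slowly than area, so larger pages get fewer IO ports per LUT.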
