
1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio
Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge
Advanced Computer Architecture Laboratory, University of Michigan at Ann Arbor

2 Software Defined Radio
- Use software routines instead of ASICs for the physical layer operations of a wireless communication system
- Advantages:
  - Multi-mode operation
  - Lower costs
  - Faster time to market
  - Prototyping and bug fixes
  - Chip volumes
  - Longevity of platforms
  - Enables future wireless communication innovations
  - Complexity favors software-based solutions

3 Case Study: W-CDMA
- Key software characteristics
  - Multiple kernels connected together as a system
  - Streaming computation
  - Vector-based inter-kernel communications
  - Mostly static computation patterns

4 SODA: An SDR DSP Architecture (ISCA 06)
- Control-data decoupled multi-core architecture
- 1 ARM general purpose control processor
  - Scalar algorithms and protocol controls
- 4 data processing elements (PEs)
  - SIMD + scalar units
  - Used for high-throughput DSP algorithms

5 SODA Execution Model
- Software managed scratchpad memories
  - Each PE can only access its local memory
- DMA operations
  - Access global memory
  - Inter-PE communications
- Algorithms statically mapped onto PEs
- RPCs from the ARM control processor

6 Compilation Challenges for SDR
- Compilation support for SDR is essential
  - Flexibility
  - Lower development cost
  - More complex protocols
- Compilation support for SDR is challenging
  - Heterogeneous multiprocessor hardware
    - ARM + DSPs
  - Two-level scratchpad memories
  - Multiple software constraints
    - Throughput + code & data size + real-time execution + others

7 2-Tier Compilation Process
[Diagram: multiprocessor system compilation; DSP kernel compilation]
- This study is focused on system compilation
- Kernel compilation is treated as a black box
  - Existing libraries
  - SIMD compilers
- Objective
  - Kernel-to-PE assignments
  - Memory allocations
- Subject to
  - Throughput constraints
  - Memory constraints

8 System Compilation Outline
- SPIR: function-level IR
  - Traditional IR is not adequate
  - Complex inter-function interactions
- Backend compilation
  - Scheduling functions instead of instructions
  - Function-level modulo scheduling

9 SPIR Overview
- Dataflow programming model
- Graph consists of nodes and edges
- Two types of nodes
  - Kernel (yellow) nodes for modeling functions
  - Memory (blue) nodes for modeling vector buffers
    - Buffer stream description + vector stream description
- Dataflow edges
  - Synchronous dataflow (in the scope of this paper)

10 SPIR Overview
- Problems with flat dataflow graph representations
  - Matched to the highest rate
  - SDR kernels have very different stream rates
    - Turbo decoder: input rate = 9600; output rate = 3200
    - LPF: input rate = 1; output rate = 1

11 SPIR Overview
- Problems with flat dataflow graph representations
  - All kernels must match the Turbo decoder's rate of 9600
    - Minimum LPF rate: input = 38.4K, output = 38.4K
  - Stream rates translate to memory buffers
    - Unnecessarily large memory buffers

12 SPIR Overview
- Hierarchical dataflow graphs
  - Different hierarchy levels have different streaming rates
  - Streaming vectors are modeled as hierarchical communications
    - Top level: buffer queue descriptions
    - Bottom level: vector streaming descriptions

13 SPIR Overview
- W-CDMA
  - Modeled with a 3-level hierarchy in SPIR
  - Memory nodes are inserted between nodes that have child graphs
  - 4x decrease in memory buffer usage
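
The hierarchical graph described in the SPIR slides can be pictured with a small data structure. This is only a sketch: the class names `Graph` and `Node`, the `kind` field, and every kernel name in the example are hypothetical and do not come from the SPIR implementation. It shows kernel and memory nodes, optional child graphs, and a helper that counts hierarchy levels.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Graph:
    nodes: List["Node"] = field(default_factory=list)
    edges: List[Tuple[str, str]] = field(default_factory=list)  # (producer, consumer)


@dataclass
class Node:
    name: str
    kind: str                      # "kernel" (function) or "memory" (vector buffer)
    child: Optional[Graph] = None  # a kernel node may expand into a child graph


def depth(graph: Graph) -> int:
    """Number of hierarchy levels at and below this graph."""
    return 1 + max((depth(n.child) for n in graph.nodes if n.child is not None),
                   default=0)


# Hypothetical 3-level example; node names are illustrative only.
inner = Graph(nodes=[Node("fir", "kernel")])
mid = Graph(
    nodes=[Node("lpf", "kernel", child=inner),
           Node("buf", "memory"),
           Node("searcher", "kernel")],
    edges=[("lpf", "buf"), ("buf", "searcher")],
)
top = Graph(
    nodes=[Node("frontend", "kernel", child=mid),
           Node("queue", "memory"),
           Node("turbo", "kernel")],
    edges=[("frontend", "queue"), ("queue", "turbo")],
)
print(depth(top))
```

Each level can then carry its own streaming rate, which is the property the slides credit for the smaller buffers.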

14 Coarse-grained System Compilation
- Three major tasks
  - Resource allocation (processor, memory and DMA)
  - Kernel execution ordering
  - Kernel execution timing
- Static or dynamic?
  - Static: compiler
    - Less flexible, more efficient
  - Dynamic: run-time scheduler or OS
    - More flexible, less efficient
- For SDR applications
  - Resource allocation: static
  - Kernel execution ordering: static
  - Kernel execution timing: dynamic

15 Software Pipelining Streaming Kernels
- Problem with coarse-grained compilation
  - Requires kernel-level parallelism to utilize the PEs
  - SDR protocols do not have many data-independent kernels
- Compiler optimization: coarse-grained software pipelining
  - Stream computation: pipeline parallelism
  - Modulo scheduling

16 Coarse-grained System Compilation
- Input: hierarchical graph
- Step 1: Dataflow rate matching
- Step 2: Stream size selection
- Step 3: Modulo scheduling
- Step 4: Hierarchical compilation
[Flow diagram: Dataflow rate matching, Stream size selection, Modulo compilation, Hierarchical scheduling]

17 Coarse-grained System Compilation
- Step 1: Dataflow rate matching
  - Each producer and consumer pair must have the same rates
  - Edges are memory buffers
  - Well studied, with many existing algorithms
  - Single appearance schedule
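
Rate matching for synchronous dataflow is commonly done by solving the balance equations: for every edge, firings of the producer times tokens produced must equal firings of the consumer times tokens consumed. The sketch below is one standard way to do that, not necessarily the algorithm used in this work. The function name `repetition_vector` and the three-kernel chain are assumptions; the rates are borrowed from the Turbo decoder and LPF figures on the earlier slides, but the chain itself is hypothetical and is not the real W-CDMA graph.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(nodes, edges):
    """Smallest positive firing counts that balance every edge.

    edges: list of (producer, consumer, produced_per_firing, consumed_per_firing).
    Assumes a connected graph; raises ValueError if the rates are inconsistent.
    """
    rate = {nodes[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, dst, produced, consumed in edges:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
                changed = True
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates on edge %s -> %s" % (src, dst))
    if len(rate) != len(nodes):
        raise ValueError("graph is not connected")
    # Scale the fractional rates to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rate.values()), 1)
    counts = {n: int(rate[n] * scale) for n in nodes}
    common = reduce(gcd, counts.values())
    return {n: c // common for n, c in counts.items()}


# Hypothetical chain: LPF (1 out) -> Turbo decoder (9600 in, 3200 out) -> sink (1 in).
q = repetition_vector(
    ["lpf", "turbo", "sink"],
    [("lpf", "turbo", 1, 9600), ("turbo", "sink", 3200, 1)],
)
print(q)
```

In this chain the LPF must fire 9600 times for each Turbo decoder firing, which illustrates why a flat graph forces low-rate kernels up to the highest rate.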

18 Coarse-grained System Compilation
- Step 2: Stream size selection
  - Pick optimal input/output buffer size
  - Multiple of the base rate
  - Binary search algorithm
  - Modulo schedule each candidate buffer size
- Example: rate = 1, streaming N elements
  - Case 1: N iterations
    - Too much DMA overhead
  - Case 2: 1 iteration
    - Cannot software pipeline
  - Case 3: N/M iterations
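
The binary search over candidate stream sizes might look like the sketch below. The slide does not state the search objective, so this assumes the goal is the largest multiple of the base rate that still schedules, and that schedulability is monotone (if a size fails, every larger size fails too). The function `select_stream_size` and the `schedulable` callback, which stands in for running the modulo scheduler on a candidate size, are hypothetical.

```python
def select_stream_size(base_rate, max_multiple, schedulable):
    """Largest multiple of base_rate accepted by `schedulable`, or None.

    Assumes `schedulable(size)` is monotone: True up to some size, False after.
    """
    lo, hi = 1, max_multiple
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        size = mid * base_rate
        if schedulable(size):
            best = size      # feasible, try a larger stream
            lo = mid + 1
        else:
            hi = mid - 1     # infeasible, try a smaller stream
    return best


# Illustrative stand-in for the scheduler: a double-buffered stream must fit
# in a 4096-element scratchpad (both numbers are made up for the example).
def fits(size):
    return 2 * size <= 4096


print(select_stream_size(1, 9600, fits))
print(select_stream_size(3, 3200, fits))
```

With N = 9600 elements and the selected size of 2048, the stream is processed in several iterations rather than N or 1, which corresponds to Case 3 on the slide.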

19 Coarse-grained System Compilation
- Step 3: Function-level modulo scheduling
  - II selection (Initiation Interval)
    - Interval between the start of successive iterations
    - MinII = max(ResMII, RecMII)
    - ResMII: total latency of all nodes divided by the number of PEs
    - RecMII: maximum latency of feedback paths
  - Constraint-based modulo scheduling
    - SMT-based algorithm
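
The MinII bound on the slide can be computed directly from kernel latencies. The sketch below follows the slide's definitions literally; the function name `min_ii`, the rounding up of ResMII to a whole cycle, and the latencies in the example are assumptions made for illustration.

```python
import math


def min_ii(latencies, num_pes, feedback_paths):
    """Lower bound on the initiation interval, per the slide's definitions.

    latencies: kernel name -> latency in cycles.
    feedback_paths: list of kernel-name lists, one per feedback path.
    """
    # ResMII: total latency of all nodes divided by the number of PEs.
    res_mii = math.ceil(sum(latencies.values()) / num_pes)
    # RecMII: the longest feedback path (0 when the graph has no feedback).
    rec_mii = max((sum(latencies[k] for k in path) for path in feedback_paths),
                  default=0)
    return max(res_mii, rec_mii)


# Made-up latencies for four kernels on two PEs.
lat = {"a": 40, "b": 50, "c": 60, "d": 30}
print(min_ii(lat, 2, []))            # resource-bound case
print(min_ii(lat, 2, [["b", "c"]]))  # recurrence-bound case
```

The scheduler would start at this bound and raise II until a valid schedule is found.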

20 SMT-based Modulo Scheduling
- Uses the Satisfiability Modulo Theories (SMT) solver Yices
  - Input: a set of constraints expressed as equations
  - Output: a set of conditions under which the constraints evaluate to true
- Constraints
  - Throughput constraints
    - i.e. total execution time must be less than or equal to II
  - Memory constraints
    - i.e. buffer size must be less than the PE's scratchpad memory
  - Communication constraints
    - i.e. DMA added for communicating kernels on different PEs
[Equation not recovered; surviving symbol legend: status of kernel v_i assigned to processor j (1 or 0); number of kernels]
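
To make the three constraint families concrete, the sketch below checks them by brute-force enumeration of kernel-to-PE assignments. This is a stand-in for the SMT solver, not the Yices API and not the encoding used in this work. The function `find_assignment`, the choice to charge DMA latency to the producer's PE, and all numbers in the example are assumptions.

```python
from itertools import product


def find_assignment(kernels, edges, num_pes, ii, scratchpad, dma_latency):
    """Return a kernel -> PE mapping satisfying the constraints, or None.

    kernels: name -> (latency, buffer_size); edges: list of (producer, consumer).
    """
    names = list(kernels)
    for choice in product(range(num_pes), repeat=len(names)):
        pe_of = dict(zip(names, choice))
        time = [0] * num_pes
        mem = [0] * num_pes
        for name, (latency, buf) in kernels.items():
            time[pe_of[name]] += latency
            mem[pe_of[name]] += buf
        # Communication: add a DMA transfer when an edge crosses PEs
        # (assumed here to be charged to the producer's PE).
        for src, dst in edges:
            if pe_of[src] != pe_of[dst]:
                time[pe_of[src]] += dma_latency
        # Throughput: per-PE work fits in II. Memory: buffers fit the scratchpad.
        if max(time) <= ii and max(mem) <= scratchpad:
            return pe_of
    return None


kernels = {"a": (40, 1024), "b": (50, 2048), "c": (60, 1024)}
edges = [("a", "b"), ("b", "c")]
print(find_assignment(kernels, edges, 2, 100, 4096, 5))
print(find_assignment(kernels, edges, 2, 80, 4096, 5))
```

Enumeration is exponential in the number of kernels, which is one reason to hand the same constraints to a solver instead.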

21 Coarse-grained System Compilation
- Step 4: Hierarchical scheduling
  - Bottom-up scheduling
  - Treat each child graph as a single node
  - Memory nodes assigned to global memory
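
Bottom-up scheduling can be sketched as a recursion that schedules the deepest child graphs first and then treats each result as a single node one level up. The dictionary layout, the function `schedule_bottom_up`, and the `schedule_graph` callback are hypothetical; the callback stands in for the modulo scheduler, and the example uses a purely sequential schedule (sum of latencies) to keep the arithmetic obvious.

```python
def schedule_bottom_up(node, schedule_graph):
    """Collapse a hierarchical node into a single latency.

    node: {"name": ..., "latency": n} for a leaf kernel, or
          {"name": ..., "children": [...]} for a node with a child graph.
    schedule_graph: maps {child name: latency} to the child graph's latency.
    """
    if "children" not in node:
        return node["latency"]
    child_latencies = {
        child["name"]: schedule_bottom_up(child, schedule_graph)
        for child in node["children"]
    }
    # The scheduled child graph now behaves like one node in the parent graph.
    return schedule_graph(child_latencies)


# Illustrative 3-level tree with made-up latencies.
tree = {"name": "top", "children": [
    {"name": "frontend", "children": [
        {"name": "lpf", "children": [{"name": "fir", "latency": 10}]},
        {"name": "searcher", "latency": 25},
    ]},
    {"name": "turbo", "latency": 100},
]}


def sequential(latencies):
    return sum(latencies.values())


print(schedule_bottom_up(tree, sequential))
```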

22 Conclusion
- Compilation support for SDR is essential
- 2-tiered compilation process
  - System compilation
  - DSP compilation
- System compilation is function-level scheduling
- Hierarchical dataflow IR
  - ~4x saving in memory buffer allocation
- SMT-based modulo scheduling
  - Linear speedup up to 8 PEs
  - Resulting in ~23% faster schedules than greedy

23 Questions

24 Case Study: W-CDMA

25 Results: Average Speedup

