Hierarchical Coarse-grained Stream Compilation for Software Defined Radio. Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge.

Presentation transcript:

Slide 1: Hierarchical Coarse-grained Stream Compilation for Software Defined Radio
Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge
Advanced Computer Architecture Laboratory, University of Michigan at Ann Arbor

Slide 2: Software Defined Radio
- Use software routines instead of ASICs for the physical-layer operations of a wireless communication system
- Advantages:
  - Multi-mode operation
  - Lower costs
  - Faster time to market
  - Prototyping and bug fixes
  - Chip volumes
  - Longevity of platforms
  - Enables future wireless communication innovations
  - Complexity favors software-based solutions

Slide 3: Case Study: W-CDMA
- Key software characteristics
  - Multiple kernels connected together as a system
  - Streaming computation
  - Vector-based inter-kernel communications
  - Mostly static computation patterns

Slide 4: SODA: A SDR DSP Architecture (ISCA 06)
- Control-data decoupled multi-core architecture
- 1 ARM general-purpose control processor
  - Scalar algorithms and protocol controls
- 4 data processing elements (PEs)
  - SIMD + scalar units
  - Used for high-throughput DSP algorithms

Slide 5: SODA Execution Model
- Software-managed scratchpad memories
  - Each PE can only access its local memory
- DMA operations
  - Access global memory
  - Inter-PE communications
- Algorithms statically mapped onto PEs
  - RPCs from the ARM control processor

Slide 6: Compilation Challenges for SDR
- Compilation support for SDR is essential
  - Flexibility
  - Lower development cost
  - More complex protocols
- Compilation support for SDR is challenging
  - Heterogeneous multiprocessor hardware
    - ARM + DSPs
    - Two-level scratchpad memories
  - Multiple software constraints
    - Throughput + code and data size + real-time execution + others

Slide 7: 2-Tier Compilation Process
- Two tiers: multiprocessor system compilation and DSP kernel compilation
- This study is focused on system compilation
- Kernel compilation is treated as a black box
  - Existing libraries
  - SIMD compilers
- Objective
  - Kernel-to-PE assignments
  - Memory allocations
- Subject to
  - Throughput constraints
  - Memory constraints

Slide 8: System Compilation Outline
- SPIR: function-level IR
  - Traditional IR is not adequate
  - Complex inter-function interactions
- Backend compilation
  - Scheduling functions instead of instructions
  - Function-level modulo scheduling

Slide 9: SPIR Overview
- Dataflow programming model
  - Graph consists of nodes and edges
- Two types of nodes
  - Kernel (yellow) nodes for modeling functions
  - Memory (blue) nodes for modeling vector buffers
    - Buffer stream description + vector stream description
- Dataflow edges
  - Synchronous dataflow (in the scope of this paper)

Slide 10: SPIR Overview
- Problems with flat dataflow graph representations
  - Matched to the highest rate
  - SDR kernels have very different stream rates
    - Turbo decoder: input rate = 9600; output rate = 3200
    - LPF: input rate = 1; output rate = 1

Slide 11: SPIR Overview
- Problems with flat dataflow graph representations
  - All kernels must be matched to the Turbo decoder's rate of 9600
    - Minimum LPF rate: input = 38.4K, output = 38.4K
  - Stream rates translate to memory buffers
    - Unnecessarily large memory buffers

Slide 12: SPIR Overview
- Hierarchical dataflow graphs
  - Different hierarchy levels have different streaming rates
- Streaming vectors are modeled as hierarchical communications
  - Top level: buffer queue descriptions
  - Bottom level: vector streaming descriptions

Slide 13: SPIR Overview
- W-CDMA
  - Modeled with a 3-level hierarchy in SPIR
  - Memory nodes are inserted between nodes with child graphs
  - 4x decrease in memory buffer usage

Slide 14: Coarse-grained System Compilation
- Three major tasks
  - Resource allocation (processor, memory and DMA)
  - Kernel execution ordering
  - Kernel execution timing
- Static or dynamic?
  - Static: compiler
    - Less flexible, more efficient
  - Dynamic: run-time scheduler or OS
    - More flexible, less efficient
- For SDR applications
  - Resource allocation: static
  - Kernel execution ordering: static
  - Kernel execution timing: dynamic

Slide 15: Software Pipelining Streaming Kernels
- Problem with coarse-grained compilation
  - Requires kernel-level parallelism to utilize the PEs
  - SDR protocols do not have many data-independent kernels
- Compiler optimization: coarse-grained software pipelining
  - Stream computation: pipeline parallelism
  - Modulo scheduling

Slide 16: Coarse-grained System Compilation
- Input: hierarchical graph
- Step 1: Dataflow rate matching
- Step 2: Stream size selection
- Step 3: Modulo scheduling
- Step 4: Hierarchical compilation
[Figure: compilation flow with stages labeled dataflow rate matching, stream size selection, modulo compilation, hierarchical scheduling]

Slide 17: Coarse-grained System Compilation
- Step 1: Dataflow rate matching
  - Each producer and consumer pair must have the same rates
    - Edges are memory buffers
  - Well studied, with many existing algorithms
    - Single appearance schedule
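The rate-matching step can be illustrated with the standard synchronous-dataflow balance equations: for every edge, (producer firings) x (tokens produced) must equal (consumer firings) x (tokens consumed). The following is a minimal sketch, not the authors' implementation. The function name `repetition_vector`, the edge tuple layout, and the two-kernel toy graph are assumptions made for illustration, and the graph is assumed to be connected.

```python
from fractions import Fraction
from math import lcm  # Python 3.9+


def repetition_vector(edges):
    """Solve the SDF balance equations for a connected graph.

    edges: list of (producer, produce_rate, consumer, consume_rate).
    Returns the smallest positive integer firing count per kernel such that
    firings[producer] * produce_rate == firings[consumer] * consume_rate.
    """
    # Undirected adjacency: neighbor plus the factor that converts firing counts.
    adj = {}
    for prod, p_rate, cons, c_rate in edges:
        adj.setdefault(prod, []).append((cons, Fraction(p_rate, c_rate)))
        adj.setdefault(cons, []).append((prod, Fraction(c_rate, p_rate)))

    start = edges[0][0]
    ratio = {start: Fraction(1)}
    stack = [start]
    while stack:
        node = stack.pop()
        for other, factor in adj[node]:
            expected = ratio[node] * factor
            if other not in ratio:
                ratio[other] = expected
                stack.append(other)
            elif ratio[other] != expected:
                raise ValueError("rates are inconsistent; no periodic schedule exists")

    # Scale the rational solution to the smallest all-integer solution.
    scale = lcm(*(r.denominator for r in ratio.values()))
    return {name: int(r * scale) for name, r in ratio.items()}


# Toy graph (hypothetical): an LPF producing 1 element per firing feeds a
# Turbo decoder consuming 9600 elements per firing.
firings = repetition_vector([("lpf", 1, "turbo", 9600)])
print(firings)
```

In a flat graph this forces the low-rate kernel to fire thousands of times per period, which is the buffer-size problem the hierarchical representation is meant to avoid.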

Slide 18: Coarse-grained System Compilation
- Step 2: Stream size selection
  - Pick the optimal input/output buffer size
    - Multiple of the base rate
  - Binary search algorithm
    - Modulo schedule each candidate buffer size
- Example: rate = 1, streaming N elements
  - Case 1: N iterations
    - Too much DMA overhead
  - Case 2: 1 iteration
    - Cannot software pipeline
  - Case 3: N/M iterations
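The slide only states that a binary search is run over candidate buffer sizes, with a modulo schedule attempted for each candidate. A minimal sketch of such a search follows. It assumes, without confirmation from the slides, that larger streams are preferred (fewer DMA transfers) and that feasibility is monotone, so a size that fails implies all larger sizes fail. The callback `schedulable` is a hypothetical stand-in for "modulo scheduling succeeds at this stream size".

```python
def select_stream_size(base_rate, max_multiple, schedulable):
    """Binary search for the largest feasible stream size.

    base_rate:    elements per firing after rate matching
    max_multiple: largest multiple of base_rate to consider
    schedulable:  hypothetical predicate; True if a modulo schedule exists
                  for the given stream size (assumed monotone in size)
    Returns the chosen stream size, or None if no multiple is feasible.
    """
    lo, hi = 1, max_multiple
    best = None
    while lo <= hi:
        mid = (lo + hi) // 2
        size = mid * base_rate
        if schedulable(size):
            best = size      # feasible: remember it and try larger streams
            lo = mid + 1
        else:
            hi = mid - 1     # infeasible: only smaller streams can work
    return best


# Toy feasibility check (assumption): double-buffered streams must fit in a
# 4096-element scratchpad.
fits = lambda size: 2 * size <= 4096
print(select_stream_size(base_rate=64, max_multiple=256, schedulable=fits))
```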

Slide 19: Coarse-grained System Compilation
- Step 3: Function-level modulo scheduling
  - II (initiation interval) selection
    - Interval between the start of successive iterations
    - MinII = max(ResMII, RecMII)
    - ResMII: total latency of all nodes divided by the number of PEs
    - RecMII: maximum latency of feedback paths
  - Constraint-based modulo scheduling
    - SMT-based algorithm
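The MinII bound from the slide can be written out directly. This is a minimal sketch: the function name, the dictionary layout, and the example latencies are assumptions, and rounding ResMII up to an integer is also an assumption (the slide only says "divided by").

```python
from math import ceil


def min_ii(latencies, num_pes, feedback_paths):
    """Lower bound on the initiation interval for function-level modulo scheduling.

    latencies:      {kernel_name: latency}
    num_pes:        number of processing elements
    feedback_paths: list of kernel-name lists, one per feedback path
    """
    # ResMII: total latency of all nodes divided by the number of PEs
    # (rounded up here, which is an assumption).
    res_mii = ceil(sum(latencies.values()) / num_pes)
    # RecMII: maximum latency over all feedback paths (0 if there are none).
    rec_mii = max(
        (sum(latencies[k] for k in path) for path in feedback_paths),
        default=0,
    )
    return max(res_mii, rec_mii)


# Hypothetical kernel latencies in cycles.
lat = {"a": 40, "b": 30, "c": 50}
print(min_ii(lat, num_pes=2, feedback_paths=[]))            # resource-bound case
print(min_ii(lat, num_pes=2, feedback_paths=[["b", "c"]]))  # recurrence-bound case
```

MinII is only a starting point: the scheduler still has to find an assignment that meets it, and increases II when none exists.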

Slide 20: SMT-based Modulo Scheduling
- Uses the Satisfiability Modulo Theories (SMT) solver Yices
  - Input: a set of constraints expressed as equations
  - Output: a set of conditions under which the constraints evaluate to true
- Constraints
  - Throughput constraints
    - i.e. total execution time must be less than or equal to II
  - Memory constraints
    - i.e. buffer size must be less than the PE's scratchpad memory
  - Communication constraints
    - i.e. DMA is added for communicating kernels on different PEs
[Equation not recovered; surviving annotations: status of kernel v_i assigned to processor j (1 or 0); number of kernels]
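To make the three constraint families concrete, the sketch below checks them by exhaustive enumeration. This is a deliberately simple stand-in for the SMT solver, not the Yices formulation from the talk, and it only scales to toy inputs. All names and numbers are hypothetical. Charging the DMA cost to the consumer's PE and using "less than or equal" for the memory bound are assumptions.

```python
from itertools import product


def find_assignment(latency, buffer_size, edges, num_pes, ii, scratchpad, dma_cost):
    """Search kernel-to-PE assignments for one that satisfies all constraints.

    latency, buffer_size: {kernel: value}
    edges:                list of (producer, consumer) kernel pairs
    Returns {kernel: pe_index} or None if no assignment is feasible.
    """
    kernels = sorted(latency)
    for choice in product(range(num_pes), repeat=len(kernels)):
        pe_of = dict(zip(kernels, choice))  # each kernel on exactly one PE
        time = [0] * num_pes
        mem = [0] * num_pes
        for k in kernels:
            time[pe_of[k]] += latency[k]
            mem[pe_of[k]] += buffer_size[k]
        # Communication constraint: kernels on different PEs need a DMA
        # transfer (assumption: its cost is charged to the consumer's PE).
        for src, dst in edges:
            if pe_of[src] != pe_of[dst]:
                time[pe_of[dst]] += dma_cost
        # Throughput constraint: per-PE execution time must fit within II.
        # Memory constraint: per-PE buffers must fit in the scratchpad.
        if all(t <= ii for t in time) and all(m <= scratchpad for m in mem):
            return pe_of
    return None


lat = {"a": 40, "b": 30, "c": 50}
buf = {"a": 1, "b": 1, "c": 1}
chain = [("a", "b"), ("b", "c")]
print(find_assignment(lat, buf, chain, num_pes=2, ii=75, scratchpad=2, dma_cost=5))
```

An SMT solver explores the same search space symbolically, which is what makes the approach practical for real kernel counts.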

Slide 21: Coarse-grained System Compilation
- Step 4: Hierarchical scheduling
  - Bottom-up scheduling
    - Treat each child graph as a single node
  - Memory nodes are assigned to global memory
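The bottom-up traversal can be sketched as a recursion that schedules each child graph first and then exposes it to its parent as a single node with one latency. The nested-dictionary graph layout and the `schedule_level` callback are hypothetical. The callback stands in for whatever per-level scheduler is used, and memory-node placement is not modeled.

```python
def schedule_bottom_up(node, schedule_level):
    """Collapse a hierarchical graph from the leaves upward.

    node:           {"name": str, "latency": int} for a leaf kernel, or
                    {"name": str, "children": [node, ...]} for a child graph
    schedule_level: hypothetical callback mapping {child_name: latency} to
                    the latency of the scheduled level
    Returns the latency the node presents to its parent.
    """
    if "children" not in node:
        return node["latency"]
    # Schedule every child graph first (bottom up) ...
    child_latencies = {
        child["name"]: schedule_bottom_up(child, schedule_level)
        for child in node["children"]
    }
    # ... then treat this graph as a single node with the resulting latency.
    node["latency"] = schedule_level(child_latencies)
    return node["latency"]


# Toy hierarchy with a sequential stand-in scheduler (sum of child latencies).
top = {
    "name": "receiver",
    "children": [
        {"name": "lpf", "latency": 10},
        {"name": "rake", "children": [
            {"name": "descrambler", "latency": 5},
            {"name": "combiner", "latency": 7},
        ]},
    ],
}
print(schedule_bottom_up(top, lambda lats: sum(lats.values())))
```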

Slide 22: Conclusion
- Compilation support for SDR is essential
- 2-tiered compilation process
  - System compilation
  - DSP compilation
- System compilation is function-level scheduling
  - Hierarchical dataflow IR
    - ~4x saving in memory buffer allocation
  - SMT-based modulo scheduling
    - Linear speedup up to 8 PEs
    - Resulting in ~23% faster schedules than greedy

Slide 23: Questions

Slide 24: Case Study: W-CDMA

Slide 25: Results: Average Speedup