Winter-Spring 2001, Codesign of Embedded Systems
Co-Synthesis Algorithms: HW/SW Partitioning
Part of HW/SW Codesign of Embedded Systems Course (CE 40-226)

Presentation transcript:


Topics
- Introduction
- Preliminaries
- Hardware/Software Partitioning
- Distributed System Co-Synthesis

Topics
- Introduction
- A Classification
- Examples
  - Vulcan
  - Cosyma

Introduction to HW/SW Partitioning
- The first variety of co-synthesis applications
- Definition: a HW/SW partitioning algorithm implements a specification on some sort of multiprocessor architecture
- Usually, the multiprocessor architecture = one CPU + some ASICs on the CPU bus

Introduction to HW/SW Partitioning (cont'd)
Terminology
- Allocation: synthesis methods which design the multiprocessor topology along with the processing elements (PEs) and the SW architecture
- Scheduling: the process of assigning PE (CPU and/or ASIC) time to processes so that they get executed

Introduction to HW/SW Partitioning (cont'd)
In most partitioning algorithms:
- The type of CPU is fixed and given
- ASICs must be synthesized
  - What function to implement on each ASIC?
  - What characteristics should the implementation have?
- The problems solved are single-rate synthesis problems
  - A CDFG (control/data flow graph) is the starting model

HW/SW Partitioning (cont'd)
- Normal use of architectural components
  - The CPU performs the less computationally intensive functions
  - ASICs are used to accelerate core functions
- Where to use?
  - High-performance applications: no CPU is fast enough for the operations
  - Low-cost applications: ASIC accelerators allow use of a much smaller, cheaper CPU

A Classification
Criterion: optimization strategy
- Trade-off between performance and cost
- Primal approach
  - Performance is the primary goal
  - First, all functionality in ASICs; progressively move more to the CPU to reduce cost
- Dual approach
  - Cost is the primary goal
  - First, all functions in the CPU; move operations to the ASIC to meet the performance goal

A Classification (cont'd)
Classification by optimization strategy (cont'd)
- Example co-synthesis systems
  - Vulcan (Stanford): primal strategy
  - Cosyma (Braunschweig, Germany): dual strategy

Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Vulcan

Partitioning Examples: Vulcan
- Gupta and De Micheli, Stanford University
- Primal approach
  1. All-HW initial implementation
  2. Iteratively move functionality to the CPU to reduce cost
- System specification language: HardwareC
  - Is compiled into a flow graph

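Partitioning Examples: Vulcan (cont'd)
[Figure: two HardwareC fragments and the flow graphs they compile to]
- HardwareC: x=a; y=b;
  Flow graph: a nop node with two outgoing edges, each labeled 1, leading to the nodes x=a and y=b
- HardwareC: if (c>d) x=e; else y=f;
  Flow graph: a cond node with an edge labeled c>d leading to x=e and an edge labeled c<=d leading to y=f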

Partitioning Examples: Vulcan (cont'd)
Flow graph definition
- A variation of a (single-rate) task graph
- Nodes
  - Represent operations
  - Typically low-level operations: mult, add
- Edges
  - Represent data dependencies
  - Each carries a Boolean condition under which the edge is traversed
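
The flow-graph definition above can be illustrated with a small data structure. This is only a sketch in Python, not Vulcan's actual representation: the names `Edge`, `FlowGraph`, `enabled_successors` and `build_conditional_example` are hypothetical, and the example graph mirrors the "if (c>d) x=e; else y=f;" fragment from the previous slide.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Env = Dict[str, int]  # variable values used to evaluate edge conditions


@dataclass
class Edge:
    target: str
    # Boolean condition under which the edge is traversed (default: always).
    condition: Callable[[Env], bool] = lambda env: True


@dataclass
class FlowGraph:
    # Maps each operation node to its outgoing (conditional) edges.
    edges: Dict[str, List[Edge]] = field(default_factory=dict)

    def add_edge(self, src: str, dst: str,
                 condition: Callable[[Env], bool] = lambda env: True) -> None:
        self.edges.setdefault(src, []).append(Edge(dst, condition))
        self.edges.setdefault(dst, [])  # make sure the target node exists

    def enabled_successors(self, node: str, env: Env) -> List[str]:
        """Successors whose edge condition holds for the given values."""
        return [e.target for e in self.edges[node] if e.condition(env)]


def build_conditional_example() -> FlowGraph:
    """Flow graph for: if (c>d) x=e; else y=f;"""
    g = FlowGraph()
    g.add_edge("cond", "x=e", lambda env: env["c"] > env["d"])
    g.add_edge("cond", "y=f", lambda env: env["c"] <= env["d"])
    return g
```

Calling `build_conditional_example().enabled_successors("cond", {"c": 5, "d": 3})` follows only the edge guarded by c>d.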

Partitioning Examples: Vulcan (cont'd)
Flow graph
- Is executed repeatedly at some rate
- Can have initiation-time constraints for each node, bounding the start time of $v_j$ relative to $v_i$: $t(v_i) + l_{ij} \le t(v_j) \le t(v_i) + u_{ij}$
- Can have rate constraints on each node: $m_i \le R_i \le M_i$
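
As a rough illustration of the two constraint types, the sketch below checks a candidate schedule against them. It assumes the initiation-time constraint is read as bounding the start time of node j relative to node i by a minimum separation l_ij and a maximum separation u_ij, and that the rate constraint bounds the rate R_i of node i between m_i and M_i. The function names and data layout are hypothetical.

```python
def check_initiation_constraints(start, constraints):
    """Return the (i, j) pairs whose timing constraint is violated.

    start:       dict mapping node name -> initiation time t(v)
    constraints: list of (i, j, l_ij, u_ij) tuples, read as
                 t(v_i) + l_ij <= t(v_j) <= t(v_i) + u_ij  (assumed reading)
    """
    violated = []
    for i, j, lower, upper in constraints:
        if not (start[i] + lower <= start[j] <= start[i] + upper):
            violated.append((i, j))
    return violated


def check_rate_constraints(rates, bounds):
    """Return the nodes whose rate R_i falls outside [m_i, M_i].

    rates:  dict mapping node name -> execution rate R_i
    bounds: dict mapping node name -> (m_i, M_i)
    """
    return [n for n, (lo, hi) in bounds.items() if not (lo <= rates[n] <= hi)]
```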

Partitioning Examples: Vulcan (cont'd)
Vulcan co-synthesis algorithm
- The partitioning quantum is a thread
  - The algorithm divides the flow graph into threads and allocates them
- A thread boundary is determined by
  1. (always) a non-deterministic delay element, such as a wait for an external variable
  2. (by choice) other points of the flow graph
- Target architecture: CPU + co-processor (multiple ASICs)

Partitioning Examples: Vulcan (cont'd)
Vulcan co-synthesis algorithm (cont'd)
- Allocation: primal approach
- Scheduling is done by a scheduler on the target CPU
  - It is generated as part of the synthesis process
  - It schedules all threads (both HW and SW threads)
  - It cannot be static, because some threads have non-deterministic initiation times

Partitioning Examples: Vulcan (cont'd)
Vulcan co-synthesis algorithm (cont'd)
Cost estimation
- SW implementation
  - Code size: relatively straightforward
  - Data size: the biggest challenge; Vulcan puts some effort into finding bounds for each thread
- HW implementation: ?

Partitioning Examples: Vulcan (cont'd)
Vulcan co-synthesis algorithm (cont'd)
Performance estimation
- For both SW and HW implementations
- Derived from the flow graph and the basic execution times of the operators

Partitioning Examples: Vulcan (cont'd)
Algorithm details
- Partitioning goal
  - Allocate each thread to one of two partitions
    - CPU set: $\Phi_S$
    - Co-processor set: $\Phi_H$
  - The required execution rate must be met, and the total cost minimized

Partitioning Examples: Vulcan (cont'd)
Algorithm details (cont'd)
Algorithm steps
1. Put all threads in the $\Phi_H$ set
2. Iteratively:
  2.1. Move some operations to $\Phi_S$
    - Select a group of operations to move to $\Phi_S$
    - Check performance feasibility by computing the worst-case delay through the flow graph given the new thread times
    - Do the move if it is feasible
  2.2. Incrementally update the cost function to reflect the new partition
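
The steps above can be sketched as a greedy loop. This is a simplified illustration, not Vulcan's implementation: it moves one thread at a time, models the flow graph as a DAG of threads, treats the performance requirement as a single `deadline` on the worst-case (longest-path) delay, and assumes that every feasible move to software lowers cost, so the cost update of step 2.2 is left out. The names `worst_case_delay` and `primal_partition` and the per-thread time tables are hypothetical. Successors of a moved thread are evaluated next and a moved thread never returns to hardware.

```python
from collections import deque


def worst_case_delay(succ, time):
    """Longest path through a DAG of threads, given per-thread times."""
    memo = {}

    def longest_from(node):
        if node not in memo:
            tail = max((longest_from(s) for s in succ[node]), default=0)
            memo[node] = time[node] + tail
        return memo[node]

    return max(longest_from(n) for n in succ)


def primal_partition(succ, hw_time, sw_time, deadline):
    """Start all-HW, then greedily move threads to SW while feasible.

    succ: dict mapping each thread to the list of its successor threads.
    Returns (hw_set, sw_set).
    """
    hw, sw = set(succ), set()      # step 1: every thread starts in hardware
    queue = deque(succ)            # candidate order (an assumption)
    tried = set()
    while queue:
        t = queue.popleft()
        if t in tried:             # no backtracking, no re-evaluation
            continue
        tried.add(t)
        # Thread times if t joined the software set.
        trial = {n: sw_time[n] if (n in sw or n == t) else hw_time[n]
                 for n in succ}
        if worst_case_delay(succ, trial) <= deadline:
            hw.remove(t)
            sw.add(t)
            # Evaluate the immediate successors of t next.
            queue.extendleft(reversed(succ[t]))
    return hw, sw
```

For a diamond-shaped graph a -> {b, c} -> d, the loop keeps moving threads to the CPU until the next move would push the longest path past the deadline.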

Partitioning Examples: Vulcan (cont'd)
Algorithm details (cont'd)
Vulcan cost function
$f(w) = c_1 S_h(\Phi_H) - c_2 S_s(\Phi_S) + c_3 B - c_4 P + c_5 |m|$
- $c_i$: weight constants
- $S_h()$, $S_s()$: size functions
- $B$: bus utilization (<1)
- $P$: processor utilization (<1)
- $m$: variables to be transferred between the CPU and the co-processor ($|m|$ is their total number)
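
A minimal sketch of evaluating this cost function is shown below, assuming the form given on the slide: weighted hardware size, minus weighted software size, plus weighted bus utilization, minus weighted processor utilization, plus the weighted number of transferred variables. The function name, argument names and default weights are hypothetical; the size estimates are taken as plain numbers supplied by the caller.

```python
def vulcan_cost(hw_size, sw_size, bus_util, cpu_util, n_transfers,
                weights=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Evaluate c1*S_h - c2*S_s + c3*B - c4*P + c5*|m| for one partition.

    hw_size, sw_size: size estimates of the hardware and software sets
    bus_util, cpu_util: utilizations, each required to be below 1
    n_transfers: number of variables moved between CPU and co-processor
    """
    if not (0 <= bus_util < 1 and 0 <= cpu_util < 1):
        raise ValueError("utilizations must lie in [0, 1)")
    c1, c2, c3, c4, c5 = weights
    return (c1 * hw_size - c2 * sw_size
            + c3 * bus_util - c4 * cpu_util + c5 * n_transfers)
```

With unit weights, `vulcan_cost(100, 20, 0.5, 0.25, 4)` evaluates to 84.25. In practice the weights would be tuned so that the terms are comparable in magnitude.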

Partitioning Examples: Vulcan (cont'd)
Algorithm details (cont'd)
Complementary notes
- A heuristic to minimize communication: once a thread is moved to $\Phi_S$, its immediate successors are placed in the list for evaluation in the next iteration
- No backtracking: once a thread is assigned to $\Phi_S$, it remains there
- Experimental results: the implementations produced are considerably faster than all-SW designs, yet much cheaper than all-HW designs

Co-Synthesis Algorithms: HW/SW Partitioning
HW/SW Partitioning Examples: Cosyma

Partitioning Examples: Cosyma
- Rolf Ernst et al., Technical University of Braunschweig, Germany
- Dual approach
  1. All-SW initial implementation
  2. Iteratively move basic blocks to the ASIC accelerator to meet the performance objective
- System specification language: C^x
  - Is compiled into an ESG (Extended Syntax Graph)
  - An ESG is much like a CDFG

Partitioning Examples: Cosyma (cont'd)
Cosyma co-synthesis algorithm
- The partitioning quantum is a basic block
  - A basic block is a branch-free block of program code
- Target architecture: CPU + accelerator ASIC(s)
- Scheduling
- Allocation
- Cost estimation
- Performance estimation
- Algorithm details

Partitioning Examples: Cosyma (cont'd)
Cosyma co-synthesis algorithm (cont'd)
Performance estimation
- SW implementation
  - Done by examining the object code that a compiler generates for the basic block
- HW implementation
  - Assumes one operator per clock cycle
  - Creates a list schedule for the DFG of the basic block
  - The depth of the list gives the number of clock cycles required
- Communication
  - Done by data-flow analysis of the adjacent basic blocks
  - In shared memory: proportional to the number of variables to be accessed
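
The hardware estimate can be sketched as the depth of a schedule over the basic block's data-flow graph. The sketch below assumes that every operator takes exactly one clock cycle and that there is no limit on the number of operators per cycle, in which case a list schedule reduces to an as-soon-as-possible schedule and its depth is the longest dependency chain. The function name `hw_cycles` and the predecessor-list encoding are hypothetical, not Cosyma's own data structures.

```python
def hw_cycles(dfg_preds):
    """Estimate clock cycles for a basic block from its data-flow graph.

    dfg_preds: dict mapping each operation to the list of operations
               whose results it consumes (its predecessors).
    Each operation is placed one cycle after its latest predecessor;
    the result is the number of schedule steps (the schedule depth).
    """
    step = {}

    def schedule(op):
        if op not in step:
            step[op] = 1 + max((schedule(p) for p in dfg_preds[op]), default=0)
        return step[op]

    return max((schedule(op) for op in dfg_preds), default=0)


# Example block:  t1 = a*b;  t2 = c*d;  t3 = t1 + t2;  t4 = t3 - e
example_block = {
    "mul1": [],
    "mul2": [],
    "add": ["mul1", "mul2"],
    "sub": ["add"],
}
```

For `example_block` the two multiplications share the first step, so the estimate is three cycles rather than four.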

Partitioning Examples: Cosyma (cont'd)
Algorithm steps
Change in execution time caused by moving basic block $b$ from the CPU to the ASIC:
$\Delta c(b) = w \cdot \big( t_{HW}(b) - t_{SW}(b) + t_{com}(Z) - t_{com}(Z \cup b) \big) \times It(b)$
- $w$: constant weight
- $t_{HW}(b)$, $t_{SW}(b)$: execution time of basic block $b$ in hardware and in software
- $t_{com}(Z)$: estimated communication time between the CPU and the accelerator ASIC, given a set $Z$ of basic blocks implemented on the ASIC
- $It(b)$: total number of times that $b$ is executed
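
The formula can be evaluated directly once the three estimates are available. The sketch below is illustrative only: `delta_c` and its argument names are hypothetical, and the communication model passed in the usage example (time proportional to the number of blocks on the ASIC) is a stand-in for the data-flow analysis described on the previous slide.

```python
def delta_c(b, Z, t_hw, t_sw, t_com, iterations, w=1.0):
    """Change in execution time from moving basic block b to the ASIC.

    b:          name of the basic block being considered
    Z:          set of basic blocks already implemented on the ASIC
    t_hw, t_sw: dicts of per-block hardware and software execution times
    t_com:      function mapping a set of ASIC blocks to the estimated
                CPU <-> ASIC communication time
    iterations: dict giving the number of times each block executes
    w:          constant weight
    A negative result means the move is estimated to save time.
    """
    per_execution = (t_hw[b] - t_sw[b]) + t_com(Z) - t_com(Z | {b})
    return w * per_execution * iterations[b]
```

For instance, with `t_hw = {"b": 3}`, `t_sw = {"b": 10}`, `Z = {"a"}`, a block executed 100 times and the stand-in model `t_com = lambda blocks: 2 * len(blocks)`, the call `delta_c("b", Z, t_hw, t_sw, t_com, {"b": 100})` returns -900.0. A dual-strategy partitioner would favor the candidate block with the most negative value.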

Partitioning Examples: Cosyma (cont'd)
Experimental results
- Moving only basic blocks to HW
  - Typical speedup of only 2x
  - Reason: limited parallelism within a basic block
  - Cure: implement several control-flow optimizations to increase parallelism in the basic block, and hence in the ASIC
  - Examples: loop pipelining, speculative branch execution with multiple branch prediction, operator pipelining
- Result
  - Speedups: 2.7 to 9.7
  - CPU times: 35 to 304 seconds on a typical workstation

What we learned today
- HW/SW partitioning: one broad category of co-synthesis algorithms
- Criteria by which a co-synthesis algorithm is categorized