High Performance Embedded Computing © 2007 Elsevier Lecture 18: Hardware/Software Codesign Embedded Computing Systems Mikko Lipasti, adapted from M. Schulte.


Based on slides and textbook from Wayne Wolf

© 2006 Elsevier Topics Platforms. Performance analysis. Design representations. Hardware/software partitioning. Co-synthesis for general multiprocessors. Optimization concepts. Simulation.

© 2006 Elsevier Design platforms Different levels of integration:  PC + board.  Custom board with CPU + FPGA or ASIC.  Platform FPGA.  System-on-chip.

© 2006 Elsevier CPU/accelerator architecture CPU is sometimes called the host. CPU and accelerator communicate via shared memory.  May use DMA to communicate. [figure: CPU, memory, and accelerator on a shared bus]

© 2006 Elsevier Example: Xilinx Virtex-4 System-on-chip:  FPGA fabric.  PowerPC.  On-chip RAM.  Specialized I/O devices. FPGA fabric is connected to PowerPC bus. MicroBlaze CPU can be added in FPGA fabric.

© 2006 Elsevier Example: WILDSTAR II Pro

© 2006 Elsevier Performance analysis Must analyze accelerator performance to determine system speedup. High-level synthesis helps:  Use as estimator for accelerator performance.  Use to implement accelerator.

© 2006 Elsevier Data path/controller architecture Data path performs regular operations, stores data in registers. Controller provides required sequencing. [figure: data path paired with its controller]

© 2006 Elsevier High-level synthesis High-level synthesis creates a register-transfer description from a behavioral description. Schedules and allocates:  Operators.  Variables.  Connections. A control step or time step is one cycle in the system controller. Components may be selected from a technology library.
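The scheduling step can be illustrated with a minimal ASAP (as-soon-as-possible) scheduler, which assigns every operation the earliest control step its inputs allow. The dataflow graph below and the assumption of single-cycle operators are invented for the sketch.

```python
# Hypothetical dataflow graph for z = (a + b) * (c + d) - e.
# Each node is a single-cycle operator; edges are data dependences.
deps = {
    "add1": [],          # a + b
    "add2": [],          # c + d
    "mul":  ["add1", "add2"],
    "sub":  ["mul"],     # result - e
}

def asap_schedule(deps):
    """Return {op: control_step}, steps numbered from 1."""
    step = {}
    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in deps[op]), default=0)
        return step[op]
    for op in deps:
        visit(op)
    return step

schedule = asap_schedule(deps)
print(schedule)  # {'add1': 1, 'add2': 1, 'mul': 2, 'sub': 3}
```

The two additions share control step 1 because they are independent; the multiply and subtract must wait for their operands, giving a three-step schedule.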

© 2006 Elsevier Models Model as a data flow graph. The critical path is the set of nodes on the path that determines schedule length.
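The critical path can be found as the longest delay-weighted path through the dataflow graph, e.g. by memoized longest-path search. The node names and delays below are invented for the sketch.

```python
# Invented node delays (cycles) and dependences of a small dataflow graph.
delays = {"ld_a": 1, "ld_b": 1, "mul": 2, "add": 1, "st": 1}
deps = {"ld_a": [], "ld_b": [], "mul": ["ld_a", "ld_b"],
        "add": ["mul"], "st": ["add"]}

def critical_path(deps, delays):
    """Return (length, node list) of the longest delay-weighted path."""
    finish, best_pred = {}, {}
    def visit(n):
        if n not in finish:
            prev = max(deps[n], key=lambda p: visit(p), default=None)
            finish[n] = delays[n] + (visit(prev) if prev else 0)
            best_pred[n] = prev
        return finish[n]
    end = max(deps, key=visit)          # sink with the latest finish time
    path, n = [], end
    while n:                            # walk predecessors back to a source
        path.append(n)
        n = best_pred[n]
    return finish[end], path[::-1]

print(critical_path(deps, delays))  # (5, ['ld_a', 'mul', 'add', 'st'])
```

The path length (5 cycles here) is the schedule length no resource allocation can beat.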

© 2006 Elsevier Accelerator estimation How do we use high-level synthesis, etc. to estimate the performance of an accelerator? We have a behavioral description of the accelerator function. Need an estimate of the number of clock cycles. Need to evaluate a large number of candidate accelerator designs.  Can’t afford to synthesize them all.

© 2006 Elsevier Estimation methods Hermann et al. used numerical methods.  Estimated incremental costs due to adding blocks to the accelerator. Henkel and Ernst used path-based scheduling.  Cut CDFG into subgraphs: reduce loop iteration count; cut at large joins; divide into equal-sized pieces.  Schedule each subgraph independently. Vahid and Gajski estimate controller and data path costs incrementally.

© 2006 Elsevier Single- vs. multi-threaded One critical factor is available parallelism:  single-threaded/blocking: CPU waits for accelerator;  multithreaded/non-blocking: CPU continues to execute along with accelerator. To multithread, CPU must have useful work to do.  But software must also support multithreading.

© 2006 Elsevier Total execution time [figure: single-threaded vs. multi-threaded timelines for processes P1 through P4 and accelerator A1]

© 2006 Elsevier Execution time analysis Single-threaded:  Count execution time of all component processes. Multi-threaded:  Find longest path through execution.
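A small numerical sketch of the two analyses, with invented times for the processes and the accelerator: the single-threaded total is the sum of all components, while the multi-threaded total is the longest dependence chain. The chains assumed below are hypothetical.

```python
# Invented execution times for processes P1..P4 and accelerator A1.
times = {"P1": 5, "P2": 3, "A1": 4, "P3": 2, "P4": 6}

# Single-threaded (blocking): CPU waits for the accelerator,
# so every component contributes to the total.
single_threaded = sum(times.values())

# Multi-threaded (non-blocking), assuming P1 and A1 overlap:
# only the longest path through the execution matters.
chains = [["P2", "P1", "P3", "P4"], ["P2", "A1", "P3", "P4"]]
multi_threaded = max(sum(times[p] for p in chain) for chain in chains)

print(single_threaded, multi_threaded)  # 20 16
```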

© 2006 Elsevier Hardware-software partitioning Partitioning methods usually allow more than one ASIC. Typically ignore CPU memory traffic in bus utilization estimates. Typically assume that the CPU process blocks while waiting for the ASIC. [figure: CPU and ASIC sharing a bus with memory]

© 2006 Elsevier Synthesis tasks Scheduling: make sure that data is available when it is needed. Allocation: make sure that processes don’t compete for the PE. Partitioning: break operations into separate processes to increase parallelism, put serial operations in one process to reduce communication. Mapping: take PE, communication link characteristics into account.

© 2006 Elsevier Scheduling and allocation Must schedule/allocate  computation  communication Performance may vary greatly with allocation choice. [figure: processes P1, P2, P3 allocated across CPU1 and ASIC1]
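A two-line sketch of why the allocation choice matters, with invented times and communication cost ignored: if P1 and P2 both feed P3, putting them on the same PE serializes them, while putting them on different PEs lets them overlap.

```python
# Invented process times; P3 depends on both P1 and P2.
times = {"P1": 4, "P2": 4, "P3": 2}

# Allocation A: P1 and P2 share CPU1, so they serialize.
finish_a = times["P1"] + times["P2"] + times["P3"]

# Allocation B: P1 on CPU1, P2 on ASIC1, so they overlap.
finish_b = max(times["P1"], times["P2"]) + times["P3"]

print(finish_a, finish_b)  # 10 6
```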

© 2006 Elsevier Problems in scheduling/allocation Can multiple processes execute concurrently? Is the performance granularity of available components fine enough to allow efficient search of the solution space? Do computation and communication requirements conflict? How accurately can we estimate performance?  Software.  Custom ASICs.

© 2006 Elsevier Partitioning example Before: r = p1(a,b); s = p2(c,d); z = r + s;  After: { r = p1(a,b); s = p2(c,d); } and { z = r + s } as separate processes.
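The key structure in this example is that p1 and p2 share no data, while z needs both results. A hedged sketch of exploiting that parallelism, with placeholder bodies for p1 and p2:

```python
# p1 and p2 have no data dependence, so they can run concurrently;
# z = r + s is the join point that must wait for both.
# The function bodies are placeholders invented for the sketch.
from concurrent.futures import ThreadPoolExecutor

def p1(a, b):
    return a * b          # placeholder computation

def p2(c, d):
    return c + d          # placeholder computation

with ThreadPoolExecutor() as pool:
    fr = pool.submit(p1, 2, 3)   # runs concurrently with p2
    fs = pool.submit(p2, 4, 5)
    z = fr.result() + fs.result()  # join: z needs both r and s

print(z)  # 6 + 9 = 15
```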

© 2006 Elsevier Problems in partitioning l At what level of granularity must partitioning be performed? l How well can you partition the system without an allocation? l How does communication overhead figure into partitioning?

© 2006 Elsevier Problems in mapping Mapping and allocation are strongly connected when the components vary widely in performance. Software performance depends on bus configuration as well as CPU type. Mappings of PEs and communication links are closely related.

© 2006 Elsevier Program representations CDFG: single-threaded, executable, can extract some parallelism. Task graph: task-level parallelism, no operator-level detail.  TGFF generates random task graphs. UNITY: based on parallel programming language.

© 2006 Elsevier Platform representations Technology table describes PE, channel characteristics:  CPU time.  Communication time.  Cost.  Power. Example table: Type | Speed | Cost: ARM7 | 50E6 | 10; MIPS | 50E6 | 8. Multiprocessor connectivity graph describes PEs, channels. [figure: connectivity graph of PE 1, PE 2, PE 3]
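A technology-table lookup can be sketched directly from the ARM7/MIPS rows above (speed taken as Hz, cost in arbitrary units); the cycle count in the example call is invented.

```python
# Technology table from the slide: PE type -> speed (Hz) and cost.
tech_table = {
    "ARM7": {"speed": 50e6, "cost": 10},
    "MIPS": {"speed": 50e6, "cost": 8},
}

def exec_time(cycles, pe_type):
    """Estimated run time of a task needing `cycles` cycles on a PE type."""
    return cycles / tech_table[pe_type]["speed"]

# Invented workload: a task of one million cycles.
print(exec_time(1_000_000, "ARM7"))  # 0.02 (seconds)
```

Both PE types run at the same clock here, so a co-synthesis tool would choose between them on cost (8 vs. 10), not speed.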

© 2006 Elsevier Hardware/software partitioning assumptions CPU type is known.  Can determine software performance. Number of processing elements is known.  Simplifies system-level performance analysis. Only one processing element can multi-task.  Simplifies system-level performance analysis.

© 2006 Elsevier Two early HW/SW partitioning systems Vulcan:  Start with all tasks on accelerator.  Move tasks to CPU to reduce cost. COSYMA:  Start with all functions on CPU.  Move functions to accelerator to improve performance.
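A COSYMA-style greedy loop (a simplified sketch, not the tool's actual algorithm) starts with everything in software and repeatedly moves the function with the largest speedup to hardware while an area budget holds. All numbers below are invented.

```python
# name -> (sw_time, hw_time, hw_area); all figures invented.
funcs = {
    "fft":    (100, 10, 40),
    "filter": (60,  20, 30),
    "ui":     (30,  25, 50),
}

def partition(funcs, area_budget):
    """Greedily move functions to HW by speedup gain under an area budget."""
    hw, area = set(), 0
    while True:
        best = max(
            (f for f in funcs
             if f not in hw and area + funcs[f][2] <= area_budget),
            key=lambda f: funcs[f][0] - funcs[f][1],   # speedup gain
            default=None)
        if best is None or funcs[best][0] - funcs[best][1] <= 0:
            break
        hw.add(best)
        area += funcs[best][2]
    return hw

print(sorted(partition(funcs, 70)))  # ['fft', 'filter']
```

Vulcan works in the opposite direction: start all-hardware and move tasks to the CPU while performance constraints still hold.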

© 2006 Elsevier Additional Co-synthesis Approaches Vahid: Binary constraint search CoWare: communicating processes model Simulated annealing & Tabu search heuristics [Ele96] LYCOS: CDFG representation [Mad97] Several others in book (skim)

© 2006 Elsevier Multi-objective optimization Operations research provides notions for optimizing functions with multiple objectives. Pareto optimality: a Pareto-optimal solution cannot be improved in one objective without making another objective worse.
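A Pareto front can be extracted with a simple dominance filter. The (cost, execution time) design points below are invented; both objectives are minimized.

```python
# Invented candidate designs as (cost, execution_time) pairs.
designs = [(10, 5.0), (8, 7.0), (12, 4.0), (9, 6.0), (11, 6.5)]

def dominates(a, b):
    """a dominates b if it is no worse in both objectives and differs."""
    return a[0] <= b[0] and a[1] <= b[1] and a != b

# Pareto-optimal points: those no other design dominates.
pareto = [d for d in designs if not any(dominates(o, d) for o in designs)]
print(sorted(pareto))  # [(8, 7.0), (9, 6.0), (10, 5.0), (12, 4.0)]
```

Only (11, 6.5) drops out: (9, 6.0) is both cheaper and faster, so no rational trade-off chooses it.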

© 2006 Elsevier Large search space: Genetic algorithms Modeled as:  Genes = strings of symbols.  Mutations = changes to strings. Types of moves:  Reproduction makes a copy of a string.  Mutation changes a string.  Crossover interchanges parts of two strings.
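The three move types can be sketched on partitioning "genes" encoded as strings of H/S symbols, one per task. The encoding is an assumption for illustration; fitness evaluation and selection are omitted.

```python
# Genes are strings over {H, S}: task i in hardware or software.
import random

def reproduce(s):
    return s                      # copy a string unchanged

def mutate(s, rng):
    i = rng.randrange(len(s))     # flip one task between HW and SW
    return s[:i] + ("H" if s[i] == "S" else "S") + s[i+1:]

def crossover(a, b, rng):
    i = rng.randrange(1, len(a))  # swap tails of two parent strings
    return a[:i] + b[i:], b[:i] + a[i:]

rng = random.Random(0)            # seeded for repeatability
print(mutate("SSSS", rng))
print(crossover("SSSS", "HHHH", rng))
```

A full GA would evaluate each gene's cost/performance, keep the fitter strings, and iterate these moves over a population.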

© 2006 Elsevier Hardware/software co-simulation Must connect models with different models of computation and different time scales. Simulation backplane manages communication. Becker et al. used the PLI in Verilog-XL to add C code that communicates with software models, and UNIX networking to connect to the hardware simulator.

© 2006 Elsevier Mentor Graphics Seamless Hardware modules described using standard HDLs. Software can be loaded as C or binary. Bus interface module connects hardware models to processor instruction set simulator. Coherent memory server manages shared memory.

© 2006 Elsevier Summary Platforms. Performance analysis. Design representations. Hardware/software partitioning. Co-synthesis for general multiprocessors. Optimization concepts. Simulation.