High Performance Embedded Computing © 2007 Elsevier. Chapter 7, part 2: Hardware/Software Co-Design. Wayne Wolf.

© 2006 Elsevier Topics Hardware/software partitioning. Co-synthesis for general multiprocessors.

© 2006 Elsevier Hardware/software partitioning assumptions CPU type is known, so software performance can be determined. Number of processing elements is known, which simplifies system-level performance analysis. Only one processing element can multi-task, which again simplifies system-level performance analysis.

© 2006 Elsevier Two early HW/SW partitioning systems Vulcan:  Start with all tasks on accelerator.  Move tasks to CPU to reduce cost. COSYMA:  Start with all functions on CPU.  Move functions to accelerator to improve performance.

© 2006 Elsevier Gupta and De Micheli Target architecture: CPU + ASICs on bus Break behavior into threads at nondeterministic delay points; delay of thread is bounded Software threads run under RTOS; threads communicate via queues

© 2006 Elsevier Specification and modeling Specified in Hardware C. Spec divided into threads at non-deterministic delay points. Hardware properties: size, # clock cycles. CPU/software thread properties: thread latency, thread reaction rate, processor utilization, bus utilization. CPU and ASIC execution are non-overlapping.

© 2006 Elsevier HW/SW allocation Start with the unbounded-delay threads on the CPU and the rest of the threads in the ASIC. Optimization: test one thread for a move; if moving it to software does not violate the performance requirement, move the thread (feasibility depends on software and hardware run times and bus utilization); if a thread is moved, immediately try moving its successor threads.
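A minimal sketch of this move-based loop (not the actual Vulcan code): `violates_performance` and `successors` are hypothetical helpers standing in for the tool's latency/reaction-rate/bus-utilization check and the thread flow graph.

```python
def partition(threads, unbounded_delay, violates_performance, successors):
    """Greedy HW->SW moves: start with unbounded-delay threads on the CPU,
    everything else on the ASIC, then move threads to software when the
    performance constraints still hold."""
    sw = set(t for t in threads if t in unbounded_delay)   # must run on the CPU
    hw = set(threads) - sw

    worklist = list(hw)
    while worklist:
        t = worklist.pop()
        if t not in hw:
            continue
        # Tentatively move the thread to software and test feasibility.
        if not violates_performance(sw | {t}, hw - {t}):
            hw.discard(t)
            sw.add(t)
            # A successful move changes the costs of its neighbours,
            # so immediately retry the successor threads.
            worklist.extend(s for s in successors(t) if s in hw)
    return sw, hw
```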

© 2006 Elsevier COSYMA Ernst et al.: moves operations from software to hardware. Operations are moved to hardware in units of basic blocks. Estimates communication overhead based on bus operations and register allocation. Hardware and software communicate by shared memory.

© 2006 Elsevier COSYMA design flow (diagram): C* input, ES graph, partitioning driven by cost estimation and run-time analysis, GNU C compilation for the software partition, and high-level synthesis of a CDFG for the hardware partition.

© 2006 Elsevier Cost estimation Speedup estimate for basic block b: c(b) = w * (t_HW(b) - t_SW(b) + t_com(Z) - t_com(Z ∪ {b})) * It(b), where w is a weight, Z is the set of blocks already in hardware, and It(b) is the number of iterations taken on b. Sources of estimates: software execution time (t_SW) is estimated from source code; hardware execution time (t_HW) is estimated by list scheduling; communication time (t_com) is estimated by data flow analysis of adjacent basic blocks.
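A small sketch of the estimate above, assuming hypothetical callables `t_hw`, `t_sw`, `t_com`, and `iterations` that wrap COSYMA's list-scheduling, source-level, data-flow, and profiling estimators.

```python
def cost_gain(b, Z, t_hw, t_sw, t_com, iterations, w=1.0):
    """c(b): estimated change from moving basic block b to hardware,
    given the set Z of blocks already mapped to hardware.
    t_hw/t_sw: per-block execution-time estimates,
    t_com: communication time for a given hardware block set,
    iterations: profiled iteration count It(b)."""
    delta_exec = t_hw(b) - t_sw(b)            # execution-time change
    delta_com = t_com(Z) - t_com(Z | {b})     # communication-time change
    return w * (delta_exec + delta_com) * iterations(b)
```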

© 2006 Elsevier COSYMA optimization Goal: satisfy the execution time constraint. User specifies the maximum number of function units in the co-processor. Start with all basic blocks in software. Estimate the potential speedup of moving a basic block to hardware using execution profiling. Search using simulated annealing. Impose a high cost penalty on solutions that don't meet the execution time.
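A hedged sketch of such a simulated-annealing search (generic, not COSYMA's actual implementation); `cost` is assumed to already include the high penalty for missing the execution-time constraint, and `hw_units` to report co-processor function-unit usage.

```python
import math
import random

def anneal_partition(blocks, cost, max_hw_units, hw_units,
                     t0=100.0, alpha=0.95, steps=2000):
    """Simulated-annealing search over basic-block partitions:
    the neighbourhood move flips one block between software and hardware."""
    hw = set()                        # start with everything in software
    best, best_cost = set(hw), cost(hw)
    temp = t0
    for _ in range(steps):
        b = random.choice(blocks)     # flip one block's mapping
        cand = hw ^ {b}
        if hw_units(cand) > max_hw_units:
            continue                  # respect the user-set function-unit limit
        delta = cost(cand) - cost(hw)
        if delta < 0 or random.random() < math.exp(-delta / temp):
            hw = cand
            if cost(hw) < best_cost:
                best, best_cost = set(hw), cost(hw)
        temp *= alpha                 # cooling schedule
    return best
```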

© 2006 Elsevier Improved hardware cost estimation Used the BSS high-level synthesis system to estimate costs: force-directed scheduling, simple allocation. (Estimation flow: CDFG, scheduling, allocation, controller generation, logic synthesis, yielding area and cycle-time estimates.)

© 2006 Elsevier Vahid et al. Uses binary search to minimize hardware cost while satisfying performance. Accepts any solution with cost below C_size. Cost function: k_perf * (sum of performance violations) + k_area * (sum of hardware sizes). [Vah94]
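A sketch of the two ingredients under the assumptions stated in the comments; `feasible` is a hypothetical stand-in for the inner partitioning heuristic, and the cost function mirrors the weighted sum above.

```python
def cost(perf_violations, hw_sizes, k_perf, k_area):
    """Weighted cost: k_perf * (sum of performance violations)
    + k_area * (sum of hardware sizes)."""
    return k_perf * sum(perf_violations) + k_area * sum(hw_sizes)

def min_hw_size(lo, hi, feasible, tol=1):
    """Binary search over the hardware size bound C_size: feasible(C)
    reports whether some partition meets all performance constraints
    within hardware size C."""
    while hi - lo > tol:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid      # feasible within this bound: try a smaller one
        else:
            lo = mid      # infeasible: the bound must grow
    return hi
```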

© 2006 Elsevier CoWare Describe behavior as communicating processes. Refine system description to create an implementation. Co-synthesis implements communicating processes. Library describes CPU, bus.

© 2006 Elsevier Simulated annealing vs. tabu search Eles et al. compared simulated annealing and tabu search. Tabu search uses short-term and long-term memory data structures. Both methods gave results of similar quality, but tabu search was about 20 times faster.
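An illustrative tabu-search skeleton for the same block-level move set (not Eles et al.'s formulation): the tabu list is the short-term memory, and a simple move-frequency table plays the role of long-term memory for diversification.

```python
from collections import deque

def tabu_partition(blocks, cost, steps=1000, tabu_len=7):
    """Tabu search over single-block moves, with an aspiration rule that
    allows a tabu move if it improves on the best solution seen so far."""
    hw = set()
    best, best_cost = set(hw), cost(hw)
    tabu = deque(maxlen=tabu_len)          # short-term memory
    freq = {b: 0 for b in blocks}          # long-term memory (move frequencies)
    for _ in range(steps):
        candidates = []
        for b in blocks:
            cand = hw ^ {b}
            c = cost(cand)
            if b not in tabu or c < best_cost:
                # Penalize frequently moved blocks slightly to diversify.
                candidates.append((c + 0.01 * freq[b], c, b, cand))
        if not candidates:
            continue
        _, c, b, cand = min(candidates, key=lambda x: x[0])
        hw = cand
        tabu.append(b)
        freq[b] += 1
        if c < best_cost:
            best, best_cost = set(hw), c
    return best
```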

© 2006 Elsevier LYCOS Uses a unified representation, Quenya, that can be derived from several languages; Quenya is based on colored Petri nets. [Mad97]

© 2006 Elsevier LYCOS HW/SW partitioning Computes the speedup obtained by moving a basic scheduling block (BSB) to hardware. Evaluates sequences of BSBs and tries to find the combination of non-overlapping BSBs that gives the largest speedup while satisfying the area constraint.
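A knapsack-style sketch of the selection step, simplified to independent BSBs with integer area costs; the real LYCOS algorithm works on sequences of adjacent BSBs and enforces non-overlap between the chosen sequences.

```python
def select_bsbs(bsbs, speedup, area, max_area):
    """Pick a set of BSBs that maximizes total speedup within the
    accelerator area budget (0/1 knapsack; areas assumed integral,
    e.g. in gate or LUT units)."""
    # dp[a] = (best speedup achievable within area a, chosen BSB set)
    dp = [(0.0, frozenset()) for _ in range(max_area + 1)]
    for b in bsbs:
        a_b, s_b = area(b), speedup(b)
        for a in range(max_area, a_b - 1, -1):
            prev_s, prev_set = dp[a - a_b]
            if prev_s + s_b > dp[a][0]:
                dp[a] = (prev_s + s_b, prev_set | {b})
    return max(dp, key=lambda x: x[0])[1]
```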

© 2006 Elsevier Estimation using high-level synthesis Xie and Wolf used high-level synthesis to estimate performance and area.  Used fast ILP-based high-level synthesis system. Global slack: slack between deadline and task completion. Local slack: slack between accelerator’s completion time and start of successor tasks. Start with fast accelerators, use global and local slacks to redesign and slow down accelerators.

© 2006 Elsevier Serra Combines static and dynamic scheduling.  Static scheduling performed by hardware unit.  Dynamic scheduling performed by preemptive scheduler. Never set defines combinations of tasks that cannot execute simultaneously. Uses heuristic form of dynamic programming to schedule.

© 2006 Elsevier Co-synthesis to general architectures Allocation and scheduling are closely related:  Need schedule/performance information to choose allocation.  Can’t determine performance until processes are allocated. Must make some assumptions to break the Gordian knot. Systems differ in the types of assumptions they make.

© 2006 Elsevier Co-synthesis as ILP Prakash and Parker formulated distributed system co-synthesis as an ILP problem: the application is specified as a system of tasks (a data flow graph); the architecture model is a set of processors with direct and indirect communication; constraints model data flow, processing times, and communication times.
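A hedged LaTeX sketch of a task-to-processor ILP in this spirit (the notation is invented here, not Prakash and Parker's own): x_{t,p} maps task t to processor p, y_p buys processor p at cost c_p, s_t is a start time, e_{t,p} an execution time, and D a deadline.

```latex
% Illustrative ILP, not the exact Prakash/Parker formulation.
\begin{align*}
\min \ & \textstyle\sum_{p} c_p\, y_p \\
\text{s.t.}\ & \textstyle\sum_{p} x_{t,p} = 1 && \forall t \quad\text{(each task on one processor)}\\
& x_{t,p} \le y_p && \forall t, p \quad\text{(a processor is paid for if used)}\\
& s_{t'} \ge s_t + \textstyle\sum_{p} x_{t,p}\, e_{t,p} + \delta_{t,t'} && \forall (t,t') \in E \quad\text{(data-flow precedence)}\\
& s_t + \textstyle\sum_{p} x_{t,p}\, e_{t,p} \le D && \forall t \quad\text{(deadline)}\\
& x_{t,p},\, y_p \in \{0,1\}, \quad s_t \ge 0
\end{align*}
% \delta_{t,t'} is the communication delay on edge (t,t'); it is zero when both
% tasks share a processor, which needs extra variables to linearize in practice.
```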

© 2006 Elsevier Kalavade et al. Uses both local and global measures to meet performance objectives and minimize cost. Global criterion: degree to which performance is critically affected by a component. Local criterion: heterogeneity of a node = implementation cost.  a function which has a high cost in one mapping but low cost in the other is an extremity  two functions which have very different implementation requirements (precision, etc.) repel each other into different implementations

© 2006 Elsevier GCLP algorithm Schedule one node at a time: compute the critical path; select a node on the critical path for assignment; evaluate the effect of changing this node's allocation; if performance is critical, reallocate for performance, else reallocate for cost. The extremity value helps avoid assigning an operation to a partition where it clearly doesn't belong. Repellers help reduce implementation cost.
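A schematic, one-node-per-iteration sketch of this loop; every helper (`critical_path`, `perf_critical`, `extremity`, `repeller`, and the time/cost estimators) is a hypothetical stand-in for the corresponding GCLP measure.

```python
def gclp(nodes, critical_path, perf_critical, cost_of, time_of,
         extremity, repeller):
    """Map one node per iteration, switching between a performance-driven
    and a cost-driven objective depending on how critical timing is."""
    mapping = {}
    unmapped = set(nodes)
    while unmapped:
        path = critical_path(mapping, unmapped)   # critical path over unmapped nodes
        node = next(n for n in path if n in unmapped)
        if perf_critical(mapping, unmapped):
            # Timing is critical: favour the faster implementation,
            # biased by the extremity measure.
            choice = min(("hw", "sw"),
                         key=lambda m: time_of(node, m) + extremity(node, m))
        else:
            # Otherwise favour the cheaper implementation, letting repellers
            # push functions with very different needs apart.
            choice = min(("hw", "sw"),
                         key=lambda m: cost_of(node, m) + repeller(node, m, mapping))
        mapping[node] = choice
        unmapped.remove(node)
    return mapping
```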

© 2006 Elsevier Two-phase optimization Inner loop uses estimates to search through design space quickly. Outer loop uses detailed measurements to check validity of inner loop assumptions:  code is compiled and measured  ASIC is synthesized Results of detailed estimate are used to apply correction to current solution for next run of inner loop.

© 2006 Elsevier SpecSyn Supports specify- explore-refine methodology. Functional description represented in SLIF. Statechart-like representation of program state machine. SLIF annotated with area, profiling information, etc. [Gaj98]

© 2006 Elsevier SpecSyn synthesis Allocation phase can allocate standard/custom processors, memories, busses. Partitioning assigns operations to hardware. The refined design continues to be simulatable and synthesizable: control refinement adds detail to protocols, etc.; data refinement updates values of variables; architectural refinements resolve conflicts and improve data transfers.

© 2006 Elsevier SpecSyn refinement [Gon97b] © 1997 ACM Press

© 2006 Elsevier Successive-refinement co-synthesis Wolf: scheduling, allocation, and mapping are intertwined: process execution time depends on CPU type selection; scheduling depends on process execution times; process allocation depends on scheduling; CPU type selection depends on feasibility of scheduling. Solution: allocate and map conservatively to meet deadlines, then re-synthesize to reduce implementation cost.

© 2006 Elsevier A heuristic algorithm 1. Allocate processes to CPUs and select CPU types to meet all deadlines. 2. Schedule processes based on the current CPU type selection; analyze utilization. 3. Reallocate processes to CPUs to reduce cost. 4. Reallocate again to minimize inter-CPU communication. 5. Allocate communication channels to minimize cost. 6. Allocate devices, using CPU-internal devices if possible.

© 2006 Elsevier Example (figure, not reproduced): step 1 allocates and maps processes P1, P2, P3 onto CPUs (ARM9, ARM7 types) to meet deadlines; step 3 reallocates for cost (e.g., substituting a VLIW processor); step 4 reallocates to reduce inter-CPU communication; step 5 allocates the communication channels.

© 2006 Elsevier PE cost reduction step Step 3 contributes most to minimizing implementation cost. Want to eliminate unnecessary PEs. Iterative cost reduction:  reallocate all processes in one PE;  pairwise merge PEs;  balance load in system. Repeat until system cost is not reduced.
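A rough sketch of that iterative cost-reduction loop, assuming hypothetical `try_merge` and `rebalance` helpers that only return configurations in which all deadlines are still met.

```python
def reduce_pe_cost(pes, system_cost, try_merge, rebalance):
    """Repeatedly try to empty one PE by merging its processes onto
    another PE and rebalancing, keeping any change that lowers system
    cost; stop when no merge reduces cost further."""
    improved = True
    while improved:
        improved = False
        for a in list(pes):
            for b in list(pes):
                if a is b:
                    continue
                candidate = try_merge(pes, a, b)   # move a's processes onto b
                if candidate is None:
                    continue                       # merge would miss a deadline
                candidate = rebalance(candidate)
                if system_cost(candidate) < system_cost(pes):
                    pes = candidate
                    improved = True
                    break
            if improved:
                break
    return pes
```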

© 2006 Elsevier COSYN Dave and Jha: co-synthesize systems with large task graphs. A prototype task graph may be replicated many times. Useful in communication systems: many separate tasks perform the same operation on different data streams. COSYN will adjust deadlines by up to 3% to reduce the length of the hyperperiod.
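For intuition, the hyperperiod is the least common multiple of the task periods, so a small adjustment can shrink it dramatically and reduce the number of task copies that must be scheduled. The numbers below are invented and adjust one period by about 2%.

```python
from math import gcd
from functools import reduce

def hyperperiod(periods):
    """Hyperperiod = least common multiple of the task periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)

print(hyperperiod([25, 50, 98]))    # 2450
print(hyperperiod([25, 50, 100]))   # 100, after nudging 98 -> 100 (~2%)
```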

© 2006 Elsevier COSYN task and hardware models Technology table. Communication vector gives communication time for each edge in task graph. Preference vector identifies the PEs to which a process can be mapped. Exclusion vector identifies processes that cannot share a PE. Average power vector. Memory vector defines memory requirements. Preemption overhead for each PE.
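One way to picture these per-task vectors as a data structure; the field names below are invented for illustration and do not match COSYN's internal representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class Task:
    """Illustrative container for the per-task vectors in the COSYN model."""
    exec_time: Dict[str, float]       # technology table: execution time per PE type
    avg_power: Dict[str, float]       # average power vector per PE type
    memory: Dict[str, float]          # memory vector (e.g. code and data sizes)
    preference: Set[str]              # PE types this task may be mapped to
    exclusion: Set[str]               # tasks that must not share a PE with this one
    comm_time: Dict[str, float] = field(default_factory=dict)  # per outgoing edge

@dataclass
class PEType:
    cost: float
    preemption_overhead: float        # context-switch cost charged per preemption
```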

© 2006 Elsevier COSYN synthesis procedure Cluster tasks to reduce search space. Allocate tasks to PEs.  Driven by hardware cost. Schedule tasks and processes.  Concentrates on scheduling first copy of each task.  Allows mixed supply voltages. [Dav99b] © 1999 IEEE

© 2006 Elsevier Allocating concurrent tasks for pipelining Proper allocation helps pipelining of tasks. Allocate processes in hardware pipeline to minimize communication cost, time.

© 2006 Elsevier Hierarchical co-synthesis Task graph node may contain its own task graph. Hardware node is built from several smaller PEs. Co-synthesize by clustering, allocating, then scheduling.

© 2006 Elsevier Co-synthesis for fault tolerance COFTA uses two types of checks:  Assertion tasks compute assertions and issue an error when the assertion fails.  Compare tasks compare results of duplicate copies of tasks and issue error upon disagreement. System designer specifies assertions.  Assertions can be much more efficient than duplication. Duplicate tasks are generated for tasks that do not have assertions.

© 2006 Elsevier Allocation for fault tolerance Allocation is key phase for fault tolerance. Assign metrics to each task:  Assertion overhead of task with assertion is computation + communication times for all tasks in transitive fanin.  Fault tolerance level is assertion overhead plus maximum fault tolerance level of all processes in its fanout.  Both values must be recomputed as design is reclustered. COFTA shares assertion tasks when possible.
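A sketch of how those two metrics could be computed over the task graph, assuming `fanin`/`fanout` adjacency maps and per-task computation and communication times; this illustrates the definitions above, not COFTA's code.

```python
def assertion_overhead(task, fanin, comp, comm):
    """Assertion overhead: computation + communication time of every task
    in the transitive fanin of an assertion-checked task."""
    seen, stack, total = set(), list(fanin[task]), 0.0
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        total += comp[t] + comm[t]
        stack.extend(fanin[t])
    return total

def fault_tolerance_levels(tasks_in_reverse_topo_order, fanout, overhead):
    """Fault-tolerance level: a task's assertion overhead plus the maximum
    level over its fanout; computed sinks-first (reverse topological order)."""
    ftl = {}
    for t in tasks_in_reverse_topo_order:
        succ = [ftl[s] for s in fanout[t]]
        ftl[t] = overhead[t] + (max(succ) if succ else 0.0)
    return ftl
```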

© 2006 Elsevier Protection in a failure group 1-by-n failure group: n service modules that perform useful work and one protection module. Hardware compares the protection module against the service modules. The general case is m-by-n. [Dav99b]