Architecture Mapping. Kiyoung Choi (최기영), Seoul National University, School of Electrical Engineering and Computer Science. Copyrightⓒ2003.


Course overview (Learning Map): SoC Design Methodology
- Design flow: Specification, Design, Synthesis, Verification, Test
- Classes
  - SoC Design Flow Introduction
  - Design Reuse & SoC Platform
  - System Specification
  - HW/SW Interface Design
  - Power Estimation & Management
  - DSM Design & Signal Integrity
  - Design Language
  - Synthesis
  - Architecture Mapping
  - Verification
  - HW/SW Co-simulation
  - Prototyping & Emulation
  - SoC Testing
  - Design for Testability
- Labs: System Spec. Lab., Design Language Lab., Co-simulation Lab., SoC Test Lab.

Outline
- Introduction
  - Trend in System-on-Chip (SoC) design
  - Design reuse
- Platform-Based Design
  - Platform-based design flow
- Application to Architecture Mapping
  - Y-chart approach
  - YAPI
  - Trace-driven approach
  - Hybrid mapping approach
  - POLIS
  - COSY
- References

Introduction
- Trend in System-on-Chip (SoC) design
  - Larger design space
    - Exponentially growing transistor counts (Moore's law)
  - Ever increasing complexity of applications
    - Multi-functional and multi-standard
    - More flexibility, higher performance, lower energy, ...
  - Shorter time-to-market
- Need a more efficient design methodology

Design reuse
- Complexity vs. productivity
[Figure: complexity grows at 58%/yr while productivity grows at 21%/yr]
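
As a rough worked example, assume both figures are compound annual rates. The ratio of complexity to productivity then widens by about 31% per year, and by roughly 14x over ten years. The symbols C (complexity), P (productivity), and t (years) are introduced here for illustration and do not appear on the slide.

```latex
\frac{C(t)}{P(t)} = \frac{C_0}{P_0}\left(\frac{1.58}{1.21}\right)^{t}
\approx \frac{C_0}{P_0}\,(1.306)^{t},
\qquad
\left(\frac{1.58}{1.21}\right)^{10} \approx 14.4
```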

Reuse of
- Cell (standard cell)
- IP
- Architecture (platform) --> platform-based design
- IC (reconfigurability)
[Figure: single-chip videophone (H.263). Block labels: Memory (video RAM); I/O (host interface); DSP core 1 (D950), modem; DSP core 2, sound; ASIP 1, master control; ASIP 2, controller; ASIP 3, bit manipulation; ASIP 4 (VLIW DSP), programmable video operations and standard extensions; S interface; glue logic; A/D & D/A; high-speed HW, video operations for DCT, IDCT, motion estimation]

Platform-Based Design
[Figure labels: Soft IP, Hard IP, Others, EDA tools, Integrator, Application-specific integration platform, Derivative]

Platform-Based Design: Design-Space Exploration
[Figure: conventional design vs. platform-based design. Labels: Specification, System, Application Space, Application Instance, Architectural Space, Platform Instance, Large Design-Space Exploration]

Platform-based design flow
[Figure: Application, Architecture, and Constraints feed Mapping; the mapping results feed SW synthesis, IF synthesis, and HW synthesis, which produce SW and HW]

Application to Architecture Mapping
Application in C:
    for(i = 0; i < 18; i++) {
        s = (mpfloat)0.0f;
        k = 0;
        do {
            s += X[k] * v[k];
            s += X[k+1] * v[k+1];
            s += X[k+2] * v[k+2];
            s += X[k+3] * v[k+3];
            s += X[k+4] * v[k+4];
            s += X[k+5] * v[k+5];
            k += 6;
        } while(k < 18);
        v += 18;
        ISCALE(s);
        t[i] = s;
    }
    /* correct the transform into the 18x36 IMDCT we need */
    /* 36 muls */
    for(i = 0; i < 9; i++) {
        x[i] = t[i+9] * Granule_imdct_win[gr->block_type][i];
        ISCALE(x[i]);
        x[i+9] = t[17-i] * Granule_imdct_win[gr->block_type][i+9];
        ISCALE(x[i+9]);
        x[i+18] = t[8-i] * Granule_imdct_win[gr->block_type][i+18];
        ISCALE(x[i+18]);
        x[i+27] = t[i] * Granule_imdct_win[gr->block_type][i+27];
        ISCALE(x[i+27]);
    }
Platform architecture: [figure]

HW-SW partitioning
- Partitioning system functionality into
  - Application-specific hardware and
  - Software executing on one (or more) processor(s)
- Partitioning problem
  - Find the minimum-cost HW-SW combination satisfying the constraints
  - Cost = f(HW area, HW delay, SW size, SW time, interface size, interface delay, power, ...)
- Need efficient and accurate performance, cost, and power estimation models
- Need efficient partitioning algorithms
  - Greedy method
  - Simulated annealing
  - Kernighan-Lin
  - Integer linear programming
  - Global criticality/local phase
  - Manual
  - ...
- Computationally intractable in general
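
To make the greedy method concrete, here is a minimal sketch under strong simplifying assumptions: tasks run one after another, so total time is a plain sum; the only constraint is a deadline; and the only cost is hardware area. The `Task` record and the function name `greedy_partition` are hypothetical and do not come from the slides or from any tool.

```c
#include <stddef.h>

/* Hypothetical task record: none of these names come from the slides. */
typedef struct {
    int sw_time;   /* execution time when run in software */
    int hw_time;   /* execution time when run in dedicated hardware */
    int hw_area;   /* area cost of the hardware implementation (must be > 0) */
    int in_hw;     /* output: 1 if the task was moved to hardware */
} Task;

/* Greedy heuristic: start all-software, then repeatedly move the task with
 * the best time saving per unit area into hardware until the deadline holds.
 * Returns the total hardware area used, or -1 if the deadline cannot be met. */
int greedy_partition(Task *tasks, size_t n, int deadline)
{
    int total_time = 0, total_area = 0;
    for (size_t i = 0; i < n; i++) {
        tasks[i].in_hw = 0;
        total_time += tasks[i].sw_time;
    }

    while (total_time > deadline) {
        size_t best = n;   /* n means "no candidate found yet" */
        for (size_t i = 0; i < n; i++) {
            int gain = tasks[i].sw_time - tasks[i].hw_time;
            if (tasks[i].in_hw || gain <= 0)
                continue;
            if (best == n) {
                best = i;
                continue;
            }
            int best_gain = tasks[best].sw_time - tasks[best].hw_time;
            /* Compare gain/area ratios by cross-multiplying (areas are > 0). */
            if ((long)gain * tasks[best].hw_area > (long)best_gain * tasks[i].hw_area)
                best = i;
        }
        if (best == n)
            return -1;     /* nothing left to move: infeasible */
        tasks[best].in_hw = 1;
        total_time -= tasks[best].sw_time - tasks[best].hw_time;
        total_area += tasks[best].hw_area;
    }
    return total_area;
}
```

With tasks {sw 10, hw 2, area 4}, {sw 8, hw 4, area 1}, and {sw 6, hw 5, area 3} and a deadline of 16, the sketch moves the second task and then the first into hardware, for a total area of 5. Like any greedy method it can miss the minimum-cost solution, which is why the slide also lists simulated annealing, Kernighan-Lin, and ILP.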

R. Niemann [1]
- Concurrent partitioning, scheduling, and sharing
- Integer linear programming
[Figure labels: VHDL, C code, VHDL code, retargetable compilation, high-level synthesis, SW costs, HW costs, partitioning (solve ILP), cluster SW nodes]

POLIS [2]
- A design environment for control-dominated embedded systems
- MoC
  - CFSM (Co-design Finite State Machine)
  - Globally asynchronous/locally synchronous
- Formal verification or simulation for the analysis of a system at the behavioral level
- Can generate C code and HDL code
- Weak points
  - Only CFSM: control-dominated applications
  - Does not support estimation techniques for complex processor models
  - Does not support multiple hardware and software partitions

Overall flow
[Figure labels: formal languages (Esterel), translators, CFSMs, simulation, partitioning, verification, intermediate format, formal verification, partitioned CFSMs, SW synthesis, HW synthesis, interface synthesis, S-graph, BLIF, HW interface, scheduler template + timing constraints, OS synthesis, logic synthesis, C code, optimized hardware, integration]

Heterogeneous multiprocessor scheduling [3]
- Allocate additional PEs until the given time constraint is satisfied
- Perform list scheduling with the allocated PEs
[Figure labels: task-PE time table, heterogeneous multiprocessor scheduler, task-PE allocation controller, performance evaluation (Fail / Good), cosynthesis result]

Priority for the list scheduling is given by
$$\mathrm{BIL}(i,j) = E(i,j) + \max_{d}\Big[\min\big(\mathrm{BIL}(d,j),\ \min_{k}\big(\mathrm{BIL}(d,k) + C(i,d)\big)\big)\Big]$$
where E(i,j) is the execution time of node i on processor j and C(i,d) is the IPC overhead between i and d. BIL(i,j) is the shortest possible path length from node i to the sink.
[Figure: example task graph with tasks A, B, C, D; a task-PE profile table giving exec time (cost) on P0 (HW; columns B0, B1, B2), P1 (processor cost 1), and P2 (processor cost 5); and the resulting solution on P0 and P1. Row values as extracted: A: 3(4), 2(6), 1(10), 7, 2; B: 4(5), 2(8), 10, 3; C: 2(3), 1(5), 5; D: 5(10), 3(15), 15]
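
The BIL recurrence can be evaluated bottom-up over the task graph. The sketch below is one possible reading rather than the implementation from [3]. It assumes that d ranges over the direct successors of node i, that k ranges over the PEs other than j, and that nodes are numbered in topological order. The function name `compute_bil`, the fixed sizes, and the use of -1 for "no edge" are all hypothetical.

```c
#define N_NODES 4   /* hypothetical example size */
#define N_PES   2

/* Computes BIL(i,j) for a DAG whose nodes are numbered in topological order
 * (every edge goes from a lower to a higher index).
 *   E[i][j] : execution time of node i on PE j
 *   C[i][d] : IPC overhead on edge i -> d, or -1 if there is no such edge
 * Assumptions not stated on the slide: d ranges over the direct successors
 * of i, and k ranges over the PEs other than j. */
void compute_bil(int E[N_NODES][N_PES],
                 int C[N_NODES][N_NODES],
                 int BIL[N_NODES][N_PES])
{
    for (int i = N_NODES - 1; i >= 0; i--) {
        for (int j = 0; j < N_PES; j++) {
            int worst = 0;                     /* max over successors d */
            for (int d = i + 1; d < N_NODES; d++) {
                if (C[i][d] < 0)
                    continue;                  /* no edge i -> d */
                int best = BIL[d][j];          /* successor stays on PE j */
                for (int k = 0; k < N_PES; k++) {
                    if (k == j)
                        continue;
                    int moved = BIL[d][k] + C[i][d];   /* switch PE, pay IPC */
                    if (moved < best)
                        best = moved;
                }
                if (best > worst)
                    worst = best;
            }
            BIL[i][j] = E[i][j] + worst;       /* sink nodes get E(i,j) only */
        }
    }
}
```

For a diamond-shaped graph 0 -> {1, 2} -> 3 with E = {{2,3},{4,1},{3,3},{2,5}} and IPC overheads 1, 2, 1, 3 on edges 0->1, 0->2, 1->3, 2->3, the sketch yields BIL(0,0) = 7 and BIL(0,1) = 10. These numbers are invented for illustration and are unrelated to the table on the slide.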

Y-chart approach [4]

Abstraction pyramid [5]

Design approach using Y-chart environment
[Figure labels: design trajectory, golden point design, design approach using Y-chart environment]

Stack of Y-chart
- Use different models at different levels of abstraction

Mapping
- A crucial step in design-space exploration (DSE) to evaluate the performance of different application-architecture combinations
- For smooth mapping
  - Need a good match in data and operation types between the corresponding model of architecture and model of computation
[Figure: Application (model of computation) mapped onto Architecture (model of architecture), with a match in data/operation type]

Model of computation (MoC)
- A formal representation of the operational semantics of networks of functional blocks describing computations
- Well-known MoCs
  - Discrete Events (DE)
  - Finite State Machines (FSM)
  - Process Networks (PN)
  - Synchronous Data Flow (SDF)
  - Synchronous/Reactive (SR)
- Many different MoCs for various application domains
- May need multiple MoCs for modeling an application

Model of architecture (MoA)
- A formal representation of the operational semantics of networks of functional blocks describing architectures
- It is for modeling an architecture instance of the architecture template
- Architecture template
  - A specification of a class of architectures in a parameterized form
  - Parameters are the number of functional units, buffer size, bus type, latency, etc.
- Architecture instance
  - The result of assigning values to the parameters of the architecture template
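
The template/instance distinction can be sketched with two C structs: a template that records the allowed range of each parameter, and an instance that fixes one value per parameter. All type, field, and function names below are hypothetical, as are the two bus kinds; the slide only names the parameters.

```c
#include <stddef.h>

/* Hypothetical bus kinds; the slide only says "bus type". */
typedef enum { BUS_SHARED, BUS_CROSSBAR } BusType;

/* Architecture template: the allowed range of each parameter. */
typedef struct {
    int    min_units, max_units;       /* number of functional units */
    size_t min_buffer, max_buffer;     /* buffer size in entries */
    int    min_latency, max_latency;   /* latency in cycles */
} ArchTemplate;

/* Architecture instance: one concrete value per template parameter. */
typedef struct {
    int     num_units;
    size_t  buffer_size;
    BusType bus_type;
    int     latency;
} ArchInstance;

/* Returns 1 if every value of the instance lies within the template's ranges. */
int arch_instance_is_valid(const ArchTemplate *t, const ArchInstance *a)
{
    return a->num_units   >= t->min_units   && a->num_units   <= t->max_units
        && a->buffer_size >= t->min_buffer  && a->buffer_size <= t->max_buffer
        && a->latency     >= t->min_latency && a->latency     <= t->max_latency;
}
```

A design-space exploration loop would enumerate instances of one template and evaluate each; the check above only guards against instances that fall outside the template.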

YAPI [6]
- Y-chart API
- Application modeling for signal processing systems
  - For the reuse of signal processing applications
  - For the mapping of signal processing applications onto heterogeneous systems
- Kahn process network (KPN)
  - Often used for modeling signal processing applications
  - Concurrent processes communicate through unidirectional first-in-first-out channels
    - Blocking read
    - Non-blocking write
  - Deterministic
- A limitation of KPN
  - Cannot model reactiveness such as user interaction, that is, non-deterministic events
  - Control flow models such as finite state machines are a solution, but are less suited to implementing computationally intensive applications

- To extend KPN with non-deterministic events
  - Introduce a communication primitive (channel selection primitive)
- YAPI separates the concerns of the application programmer and the system designer
- Implementation of YAPI
  - In the form of a C++ run-time library
  - read(), write(), execute(), and select()
  - The implementation of these functions is a concern of the system designer (they may be implemented in different ways)
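
The channel semantics can be illustrated with a small single-threaded sketch in C. This is not the YAPI library, which is a C++ run-time library whose real signatures are not shown on the slides. The type `Channel` and the functions `channel_write`, `channel_read`, and `channel_select` are hypothetical. Blocking is only modeled: a read on an empty channel returns 0 at the point where a real KPN process would suspend, and the selection function polls instead of waiting.

```c
#include <stdlib.h>

/* Unbounded FIFO channel of ints; all names here are hypothetical. */
typedef struct {
    int   *data;
    size_t head, count, cap;
} Channel;

void channel_init(Channel *c) { c->data = NULL; c->head = c->count = c->cap = 0; }
void channel_free(Channel *c) { free(c->data); channel_init(c); }

/* Non-blocking write: never waits, because the buffer grows on demand.
 * Returns 1 on success and 0 only if memory runs out. */
int channel_write(Channel *c, int v)
{
    if (c->count == c->cap) {
        size_t ncap = c->cap ? c->cap * 2 : 4;
        int *nd = malloc(ncap * sizeof *nd);
        if (!nd)
            return 0;
        for (size_t i = 0; i < c->count; i++)      /* unwrap the ring buffer */
            nd[i] = c->data[(c->head + i) % c->cap];
        free(c->data);
        c->data = nd;
        c->head = 0;
        c->cap = ncap;
    }
    c->data[(c->head + c->count) % c->cap] = v;
    c->count++;
    return 1;
}

/* Read: returns 1 and stores the oldest token, or returns 0 at the point
 * where a real KPN process would block on the empty channel. */
int channel_read(Channel *c, int *out)
{
    if (c->count == 0)
        return 0;
    *out = c->data[c->head];
    c->head = (c->head + 1) % c->cap;
    c->count--;
    return 1;
}

/* Channel selection: index of the first channel holding data, or -1 if all
 * are empty. The result depends on arrival order, hence non-determinism. */
int channel_select(Channel *const chans[], size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (chans[i]->count > 0)
            return (int)i;
    return -1;
}
```

A process that only ever reads from one fixed channel stays deterministic. A process that calls the selection function first can react to whichever input arrives, which is what the extension is for.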

Architecture evaluation in YAPI
- VIDEOTOP application
  - The top-level process network model
[Figure labels: MPEG2 stream, channel selection to be decoded]

- Simulation to measure the workload
  - Communication requirement: the amount of data that is transferred between processes
  - Computation requirement: the amount of computation of processes
- From the result
  - We know that the required communication bandwidth is 150 MB/s
  - We select an initial architecture as input for a more detailed mapping and performance analysis
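
A bandwidth figure of this kind can be derived from the communication workload by summing the data moved per frame on every channel and scaling by the frame rate. The sketch below shows only that arithmetic; the function name `required_bandwidth` and its parameters are hypothetical.

```c
#include <stddef.h>

/* Sums the bytes moved per frame on every channel and scales by the frame
 * rate, giving the required communication bandwidth in bytes per second. */
unsigned long long required_bandwidth(const unsigned long *bytes_per_frame,
                                      size_t n_channels,
                                      unsigned frames_per_second)
{
    unsigned long long total = 0;
    for (size_t i = 0; i < n_channels; i++)
        total += bytes_per_frame[i];
    return total * frames_per_second;
}
```

For instance, three channels carrying 2,000,000, 3,000,000, and 1,000,000 bytes per frame at 25 frames per second need 150,000,000 bytes/s. These inputs are invented so that the result lands on 150 MB/s; they are not measurements from the VIDEOTOP study.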

Trace-driven approach
- SPADE (System level Performance Analysis and Design space Exploration) [7]
  - For architecture exploration of heterogeneous signal processing systems
  - Supports an explicit mapping step
  - Cosimulation of application models and architecture models using a trace-driven simulation technique
  - Architecture models do not need to model the functional behavior, yet still handle data-dependent behavior correctly

- In SPADE, applications and architectures are modeled separately
- An application imposes a workload on the resources provided by an architecture
- Workload
  - Computation and communication workload
- Resources
  - Processing resources: programmable cores or dedicated hardware
  - Communication resources: bus structures and memory resources such as RAMs or FIFO buffers

Trace-driven simulation
- Application model
  - A network of concurrent communicating processes
- Each process of the application model
  - Produces a so-called trace, which contains information on the communication and computation operations
- The traces are interfaced to an architecture model
  - They drive computation and communication activities in the architecture

Application modeling
- Kahn Process Network model
- Modeled with a YAPI-based API: read(), write(), and execute()
  - They generate trace entries
  - The execute() function takes a symbolic instruction as an argument
Architecture modeling
- The architecture model does not model the functional behavior
- It is constructed from generic building blocks
  - Trace-driven execution unit (TDEU): interprets trace entries and has a configurable number of I/O ports
  - Interfaces: translate the generic protocol (FIFO) into a communication-resource-specific protocol (e.g. bus)

Architecture modeling (Cont’d)
- All blocks are parameterized
  - TDEU: a list of symbolic instructions and latencies
  - Interface block: buffer size, bus width, setup delay, and transfer delay

Mapping
- Each process is mapped onto a TDEU
  - Can be many-to-one
  - Processes then need to be scheduled by the TDEU (round robin)
- Each process port is mapped one-to-one onto an I/O port
Simulation
- Concurrent simulation of the application model and the architecture model
- Architecture simulation
  - TSS (Tool for System Simulation): Philips in-house architecture modeling and simulation framework
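
The trace-driven idea can be sketched in a few lines of C: each process contributes a trace of read, write, and execute entries, and a TDEU charges a latency per entry while serving its processes round robin. This is a toy model, not SPADE or TSS. It ignores blocking on channels and interface timing, and every name in it (`TraceEntry`, `Trace`, `Tdeu`, `tdeu_run`) is hypothetical.

```c
#include <stddef.h>

typedef enum { TR_READ, TR_WRITE, TR_EXECUTE } TraceOp;

typedef struct {
    TraceOp op;
    int     arg;   /* TR_EXECUTE: symbolic instruction id; otherwise a port id */
} TraceEntry;

typedef struct {
    const TraceEntry *entries;
    size_t            len;
    size_t            pos;           /* next entry to consume */
    long              finish_time;   /* cycle at which the last entry completed */
} Trace;

typedef struct {
    const int *instr_latency;   /* latency per symbolic instruction id */
    int        io_latency;      /* fixed cost charged per read or write */
} Tdeu;

/* Runs all traces mapped onto one TDEU, consuming one entry per process per
 * turn (round robin). Returns the total number of cycles consumed. */
long tdeu_run(const Tdeu *u, Trace *traces, size_t n)
{
    long cycles = 0;
    int progressed = 1;
    while (progressed) {
        progressed = 0;
        for (size_t p = 0; p < n; p++) {
            Trace *t = &traces[p];
            if (t->pos == t->len)
                continue;                        /* this process is done */
            const TraceEntry *e = &t->entries[t->pos++];
            cycles += (e->op == TR_EXECUTE) ? u->instr_latency[e->arg]
                                            : u->io_latency;
            if (t->pos == t->len)
                t->finish_time = cycles;
            progressed = 1;
        }
    }
    return cycles;
}
```

Note that the TDEU never computes any data: it only looks up latencies, which is why the architecture model does not need to reproduce the functional behavior. The traces come from actually running the application, so data-dependent behavior is still captured.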

Case study [8]
- M-JPEG application
  - JPEG compression applied to each frame in the video sequence
- Workload analysis
  - To get an initial idea of the bandwidth

M-JPEG application model
- KPN model

Target architecture
- Five processing components and a bus, communicating via shared memory

Description of the architecture model
- Specified using the SPADE architecture description language

Mapping description

Simulation results
- Utilization of the initial architecture
  - Utilization of the µP and the VLEP is low
  - To reduce cost, the VLEP's workload can be mapped onto the µP while still satisfying the required throughput
[Figure panels: utilization of the initial architecture; utilization after removing the VLEP]

Hybrid mapping approach [9]
- Two existing mapping approaches
  - Trace-driven approach
    - Fast simulation time
    - Insufficient accuracy
    - For architecture exploration only, i.e. does not connect well to a trajectory for detailed design
    - Potential concurrency (instruction-level parallelism, ILP) is hidden in the trace
  - CDFG (control/data flow graph) mapping approach
    - Long simulation time
    - Fairly accurate performance numbers
    - Better for system synthesis

Hybrid mapping approach
- Takes a symbolic program (SP) approach
  - Positioned between the two extremes
- The SP itself is not executable
  - But we want to simulate the architecture in a trace-driven way
  - Needs control information
- Exploits both
  - Control information in a trace-driven (TD) manner
  - SP code (CDFG-like representation)

Hybrid mapping approach
[Figure: three approaches, each feeding an architecture simulation.
 TD approach: Trace Gen.; low accuracy, high simulation speed.
 Hybrid approach: Control Trace Gen. and Symbolic Program Gen.
 CDFG approach: CDFG Gen.; high accuracy, low simulation speed.
 Other labels: instruction stream, data stream]

- To support this approach
  - Introduce the Symbolic Program Unit (SPU)
- The SPU model
  - Front-end part: interprets an SP and fits the result into the available resources
  - Back-end part: dispatches the symbolic load and executes symbolic instructions

Symbolic program architecture model [10]
- SPUs + read/write interfaces + FIFO buffers
- Instruction-level parallelism in SPUs
- Task-level parallelism by execution of different tasks on different SPUs
- Each process is mapped onto an SPU

Simulation results (QR algorithm example)
- Comparison with SPADE (TD approach)
- SPU approach
  - Case 1: sequential SPs and non-pipelined FIFO
  - Case 2: VLIW-like SPs and non-pipelined FIFO
  - Case 3: VLIW-like SPs and pipelined FIFO
- Results show that the SPU model is the more accurate model
[Figure axis: loop unfold factor]

COSY [11]
- COdesign Simulation and Synthesis
- Focus is on communication refinement and DSE
- Input specification in YAPI
  - Simplifies the design process
- The level of abstraction for the communication mechanism
  - The application level, system level, virtual component level, and physical transfer level introduced by the VSI Alliance are adopted by this approach
- Four COSY levels
  - APP: executable untimed specification
  - SYS: timed functional abstraction level (transactions)
  - VCI: the interfaces work with addresses and split data into chunks manageable by a bus or a switching network
  - PHY: deals with physical bus size, signaling, and arbitration protocols

COSY communication refinement

References
[1] R. Niemann and P. Marwedel, "Hardware/software partitioning using integer programming," Proc. ED&TC, Mar. 1996.
[2] F. Balarin, M. Chiodo, A. Jurecska, H. Hsieh, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara, Hardware-Software Co-Design of Embedded Systems: The Polis Approach, Kluwer Academic Publishers, 1997.
[3] H. Oh and S. Ha, "A hardware-software cosynthesis technique based on heterogeneous multiprocessor scheduling," Proc. CODES, May 1999.
[4] B. Kienhuis, E. Deprettere, K. Vissers, and P. van der Wolf, "An approach for quantitative analysis of application-specific dataflow architectures," Proc. ASAP'97, 1997.
[5] A. Kienhuis, Design Space Exploration of Stream-based Dataflow Architectures, Ph.D. Thesis, Delft University of Technology, 1999.
[6] E. de Kock, G. Essink, P. van der Wolf, J.-Y. Brunel, W. Kruijtzer, P. Lieverse, and K. Vissers, "YAPI: Application Modeling for Signal Processing Systems," Proc. DAC, 2000.
[7] P. Lieverse, P. van der Wolf, E. Deprettere, and K. Vissers, "A methodology for architecture exploration of heterogeneous signal processing systems," Proc. SIPS, 1999.
[8] P. Lieverse, T. Stefanov, P. van der Wolf, and E. Deprettere, "System level design with SPADE: an M-JPEG case study," Proc. ICCAD, 2001.
[9] V. Zivkovic, P. van der Wolf, E. Deprettere, and E. de Kock, "Design space exploration of streaming multiprocessor architectures," Proc. SIPS, 2002.
[10] V. Zivkovic, E. de Kock, P. van der Wolf, and E. Deprettere, "Fast and accurate multiprocessor architecture exploration with symbolic programs," Proc. DATE, 2003.
[11] Jean-Yves Brunel et al., "COSY communication IP's," Proc. DAC, 2000.