Architecture Mapping. Kiyoung Choi (Seoul National University, School of Electrical Engineering and Computer Science). Copyright ⓒ 2003.

Course Overview (Learning Map): SoC Design Methodology
Design flow stage: classes (lab)
- Introduction: SoC Design Flow; Design Reuse & SoC Platform
- Specification: System Specification (System Spec. Lab.)
- Design: HW/SW Interface Design; Power Estimation & Management; DSM Design & Signal Integrity; Design Language (Design Language Lab.)
- Synthesis: Synthesis; Architecture Mapping
- Verification: Verification; HW/SW Co-simulation; Prototyping & Emulation (Co-simulation Lab.)
- Test: SoC Testing; Design for Testability (SoC Test Lab.)

Outline
- Introduction
  - Trend in System-on-Chip (SoC) design
  - Design reuse
- Platform-Based Design
  - Platform-based design flow
- Application to Architecture Mapping
  - Y-chart approach
  - YAPI
  - Trace-driven approach
  - Hybrid mapping approach
  - POLIS
  - COSY
- References

Introduction: Trend in System-on-Chip (SoC) design
- Larger design space
  - Exponentially growing transistor counts (Moore's law)
- Ever increasing complexity of applications
  - Multi-functional and multi-standard
  - More flexibility, higher performance, lower energy, ...
- Shorter time-to-market
- Need a more efficient design methodology

Design reuse: complexity vs. productivity
- Complexity: 58%/yr growth rate
- Productivity: 21%/yr growth rate

Reuse of
- Cell (standard cell)
- IP
- Architecture (platform) --> platform-based design
- IC (reconfigurability)
[Figure: single-chip videophone (H.263). Block labels: Memory; Video RAM; I/O; Host interface; DSP core 1 (D950); Modem; DSP core 2; Sound; ASIP 1; Master Control; ASIP 2; Controller; ASIP 3; Bit Manipulation; ASIP 4 (VLIW DSP); Programmable video operations, standard extensions; S interface; Glue logic; A/D & D/A; High-speed HW; Video operations for DCT, IDCT, motion estimation]

Platform-Based Design
[Figure labels: Soft IP; Hard IP; Others; EDA Tools; Integrator; Application-specific integration platform; Derivative]

Platform-Based Design
[Figure: design-space exploration in conventional design versus platform-based design. Labels: Specification; System; Application Space; Application Instance; Platform Instance; Architectural Space; Design-Space Exploration; Large Design-Space Exploration]

Platform-based design flow
[Figure: the application, the architecture, and the constraints feed the mapping step; the mapping results drive SW synthesis, IF synthesis, and HW synthesis, which produce the SW and the HW.]

Application to Architecture Mapping
Application in C (excerpt):

    for (i = 0; i < 18; i++) {
        s = (mpfloat)0.0f;
        k = 0;
        do {
            s += X[k]   * v[k];
            s += X[k+1] * v[k+1];
            s += X[k+2] * v[k+2];
            s += X[k+3] * v[k+3];
            s += X[k+4] * v[k+4];
            s += X[k+5] * v[k+5];
            k += 6;
        } while (k < 18);
        v += 18;
        ISCALE(s);
        t[i] = s;
    }

    /* correct the transform into the 18x36 IMDCT we need */
    /* 36 muls */
    for (i = 0; i < 9; i++) {
        x[i]    = t[i+9]  * Granule_imdct_win[gr->block_type][i];
        ISCALE(x[i]);
        x[i+9]  = t[17-i] * Granule_imdct_win[gr->block_type][i+9];
        ISCALE(x[i+9]);
        x[i+18] = t[8-i]  * Granule_imdct_win[gr->block_type][i+18];
        ISCALE(x[i+18]);
        x[i+27] = t[i]    * Granule_imdct_win[gr->block_type][i+27];
        ISCALE(x[i+27]);
    }

[Figure: the application in C is mapped onto the platform architecture.]

HW-SW partitioning
- Partitioning system functionality into
  - Application-specific hardware, and
  - Software executing on one (or more) processor(s)
- Partitioning problem
  - Find the minimum-cost HW-SW combination satisfying the constraints
  - Cost = f(HW area, HW delay, SW size, SW time, interface size, interface delay, power, ...)
- Need efficient and accurate performance, cost, and power estimation models
- Need efficient partitioning algorithms
  - Greedy method
  - Simulated annealing
  - Kernighan-Lin
  - Integer linear programming
  - Global criticality/local phase
  - Manual
  - ...
- Computationally intractable in general
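
As an illustration of the greedy method named in the list above, here is a minimal sketch in C. It is not taken from any of the cited tools. The `Task` fields, the purely sequential timing model (total time is the sum of task times), and the "time saved per unit of area" selection rule are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task estimates; field names are illustrative only. */
typedef struct {
    double sw_time;  /* execution time when implemented in software */
    double hw_time;  /* execution time when implemented in hardware */
    double hw_area;  /* area cost when implemented in hardware */
} Task;

/*
 * Greedy HW-SW partitioning sketch (assumed cost model):
 * start with every task in software, then repeatedly move the task with
 * the best time saved per unit of area into hardware until the deadline
 * is met. Tasks are assumed to run sequentially.
 * On return, in_hw[i] is 1 for hardware tasks and 0 for software tasks.
 * Returns the total hardware area, or -1.0 if the deadline cannot be met.
 */
double greedy_partition(const Task *tasks, size_t n, double deadline, int *in_hw)
{
    double total_time = 0.0;
    double total_area = 0.0;

    assert(tasks != NULL && in_hw != NULL);

    for (size_t i = 0; i < n; i++) {
        in_hw[i] = 0;
        total_time += tasks[i].sw_time;
    }

    while (total_time > deadline) {
        size_t best = n;          /* n means "no candidate found" */
        double best_gain = 0.0;

        for (size_t i = 0; i < n; i++) {
            double saved = tasks[i].sw_time - tasks[i].hw_time;
            if (in_hw[i] || saved <= 0.0 || tasks[i].hw_area <= 0.0)
                continue;
            double gain = saved / tasks[i].hw_area;
            if (gain > best_gain) {
                best_gain = gain;
                best = i;
            }
        }
        if (best == n)
            return -1.0;          /* nothing left to move: infeasible */

        in_hw[best] = 1;
        total_time -= tasks[best].sw_time - tasks[best].hw_time;
        total_area += tasks[best].hw_area;
    }
    return total_area;
}
```

A greedy pass like this is fast but can miss the minimum-cost solution, which is one reason the slide also lists simulated annealing, Kernighan-Lin, and integer linear programming.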

R. Niemann [1]
- Concurrent partitioning, scheduling, and sharing
- Integer linear programming
[Figure labels: VHDL; C code; VHDL code; retargetable compilation; high-level synthesis; SW costs; HW costs; partitioning (solve ILP); cluster SW nodes]

POLIS [2]
- A design environment for control-dominated embedded systems
- MoC: CFSM (Co-design Finite State Machine)
  - Globally asynchronous, locally synchronous
- Formal verification or simulation for analyzing a system at the behavioral level
- Can generate C code and HDL code
- Weak points
  - Supports only CFSMs, so it targets control-dominated applications
  - Does not support estimation techniques for complex processor models
  - Does not support partitioning across multiple hardware and software components

Overall flow
[Figure labels: formal languages (Esterel); translators; CFSMs; simulation; partitioning; verification; intermediate format; formal verification; partitioned CFSMs; SW synthesis; HW synthesis; interface synthesis; S-graph; BLIF; HW interface; scheduler template + timing constraints; OS synthesis; logic synthesis; C code; optimized hardware; integration]

Heterogeneous multiprocessor scheduling [3]
- Allocate additional processing elements (PEs) until the given time constraint is satisfied
- Perform list scheduling with the allocated PEs
[Figure labels: task-PE time table; heterogeneous multiprocessor scheduler; task-PE allocation controller; performance evaluation (Fail / Good); cosynthesis result]

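Priority for the list scheduling is given by

$$\mathrm{BIL}(i,j) = E(i,j) + \max_{d}\left[\min\left(\mathrm{BIL}(d,j),\ \min_{k}\bigl(\mathrm{BIL}(d,k) + C(i,d)\bigr)\right)\right]$$

where E(i,j) is the execution time of node i on processor j and C(i,d) is the IPC overhead between i and d. The index d ranges over the successors of node i and k over the processors. BIL(i,j) is the shortest possible path length from node i to the sink.

[Figure: example task graph with nodes A, B, C, D and a solution scheduled on P0 and P1.]

Task-PE profile table, execution time (cost). Processors: P0 (HW, with implementations B0, B1, B2), P1 (processor cost 1), P2 (processor cost 5). Entries per task, in source order:
- A: 3(4), 2(6), 1(10), 7, 2
- B: 4(5), 2(8), 10, 3
- C: 2(3), 1(5), 5
- D: 5(10), 3(15), 15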
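
The BIL recursion can be evaluated bottom-up over the task graph. The following C sketch is one way to do that under stated assumptions, not the implementation from [3]: tasks are assumed to be numbered in topological order (every successor has a larger index), d is taken to range over the direct successors of i, k over all processors, and the `NO_EDGE` sentinel and fixed array sizes are hypothetical conveniences.

```c
#include <assert.h>

#define N_TASKS 4
#define N_PES   2
#define NO_EDGE (-1.0)  /* hypothetical marker: no edge from i to d */

/*
 * exec[i][j] : E(i,j), execution time of task i on processor j
 * comm[i][d] : C(i,d), IPC overhead on edge i -> d, or NO_EDGE
 * bil[i][j]  : output, BIL(i,j)
 * Assumes tasks are topologically numbered, so successors of i have
 * indices greater than i and are already computed when i is visited.
 */
void compute_bil(double exec[N_TASKS][N_PES],
                 double comm[N_TASKS][N_TASKS],
                 double bil[N_TASKS][N_PES])
{
    for (int i = N_TASKS - 1; i >= 0; i--) {
        for (int j = 0; j < N_PES; j++) {
            double worst = 0.0;  /* max over successors; stays 0 for a sink */

            for (int d = i + 1; d < N_TASKS; d++) {
                if (comm[i][d] < 0.0)
                    continue;    /* d is not a successor of i */

                /* Keep d on processor j: no IPC overhead. */
                double best = bil[d][j];

                /* Or place d on some processor k and pay C(i,d). */
                for (int k = 0; k < N_PES; k++) {
                    double via_k = bil[d][k] + comm[i][d];
                    if (via_k < best)
                        best = via_k;
                }
                if (best > worst)
                    worst = best;
            }
            bil[i][j] = exec[i][j] + worst;
        }
    }
}
```

For a diamond-shaped graph 0 -> {1, 2} -> 3 with unit IPC overhead, the values for the sink are just its execution times, and each earlier node adds its own execution time to the worst of its successors' best-case levels.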

Abstraction pyramid [5]

Design approach using Y-chart environment
[Figure labels: Design trajectory; Golden point design; Design approach using Y-chart environment]

Mapping
- A crucial step in design-space exploration (DSE) to evaluate the performance of different application-architecture combinations
- For smooth mapping, the model of architecture and the corresponding model of computation need a good match in data and operation types
[Figure: the application (model of computation) is mapped onto the architecture (model of architecture), with a match in data/operation types.]

Model of computation (MoC)
- A formal representation of the operational semantics of networks of functional blocks describing computations
- Well-known MoCs
  - Discrete Events (DE)
  - Finite State Machines (FSM)
  - Process Networks (PN)
  - Synchronous Data Flow (SDF)
  - Synchronous/Reactive (SR)
- Many different MoCs exist for various application domains
- Modeling one application may require multiple MoCs

Model of architecture (MoA)
- A formal representation of the operational semantics of networks of functional blocks describing architectures
- Used to model an architecture instance of the architecture template
- Architecture template
  - A specification of a class of architectures in a parameterized form
  - Parameters include the number of functional units, buffer size, bus type, latency, etc.
- Architecture instance
  - The result of assigning values to the parameters of the architecture template
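
The template/instance relationship can be pictured as a set of allowed parameter ranges versus one concrete assignment. The C sketch below is purely illustrative: the `Range`, `ArchTemplate`, and `ArchInstance` types and the choice of three numeric parameters are assumptions, and a real template would also cover non-numeric parameters such as bus type.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical inclusive range of allowed values for one parameter. */
typedef struct { unsigned min, max; } Range;

/* Architecture template: a class of architectures in parameterized form. */
typedef struct {
    Range functional_units;
    Range buffer_size;
    Range latency;
} ArchTemplate;

/* Architecture instance: one value assigned to each template parameter. */
typedef struct {
    unsigned functional_units;
    unsigned buffer_size;
    unsigned latency;
} ArchInstance;

static int in_range(unsigned v, Range r)
{
    return v >= r.min && v <= r.max;
}

/* Returns 1 if every parameter value lies within the template's range. */
int is_instance_of(const ArchInstance *inst, const ArchTemplate *tmpl)
{
    assert(inst != NULL && tmpl != NULL);
    return in_range(inst->functional_units, tmpl->functional_units)
        && in_range(inst->buffer_size, tmpl->buffer_size)
        && in_range(inst->latency, tmpl->latency);
}
```

Design-space exploration then amounts to evaluating many such instances of one template against the application.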

YAPI [6]: Y-chart API
- Application modeling for signal processing systems
  - For the reuse of signal processing applications
  - For the mapping of signal processing applications onto heterogeneous systems
- Kahn process network (KPN)
  - Often used for modeling signal processing applications
  - Concurrent processes communicate through unidirectional first-in-first-out channels
  - Blocking read
  - Non-blocking write
  - Deterministic
- A limitation of KPN
  - Cannot model reactiveness, such as user interaction, that is, non-deterministic events
  - Control flow models such as finite state machines are a solution, but they are less suited to implementing computationally intensive applications
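
The KPN properties listed above (FIFO channels, blocking read, non-blocking write, determinism) can be illustrated with a small single-threaded C sketch. This is not YAPI code. The names `Fifo`, `Producer`, `producer_step`, `scaler_step`, and `run_network` are hypothetical, a blocking read is modeled cooperatively by a step function that reports "no progress" when its input is empty, and the fixed `FIFO_CAP` stands in for the conceptually unbounded KPN channel.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 64  /* simplification: KPN channels are conceptually unbounded */

typedef struct {
    int data[FIFO_CAP];
    size_t head;   /* index of the oldest token */
    size_t count;  /* number of tokens stored */
} Fifo;

/* Non-blocking write: always succeeds (capacity overflow is a sketch limit). */
void fifo_write(Fifo *f, int v)
{
    assert(f->count < FIFO_CAP);
    f->data[(f->head + f->count) % FIFO_CAP] = v;
    f->count++;
}

/* Returns 1 and stores a token, or 0 if the reader would have to block. */
int fifo_try_read(Fifo *f, int *out)
{
    if (f->count == 0)
        return 0;
    *out = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

/* Producer process: emits next, next+1, ..., limit. */
typedef struct { int next, limit; } Producer;

int producer_step(Producer *p, Fifo *out)
{
    if (p->next > p->limit)
        return 0;                /* finished */
    fifo_write(out, p->next++);
    return 1;
}

/* Scaler process: reads one token and writes twice its value. */
int scaler_step(Fifo *in, Fifo *out)
{
    int v;
    if (!fifo_try_read(in, &v))
        return 0;                /* blocked on an empty input channel */
    fifo_write(out, 2 * v);
    return 1;
}

/* Runs both processes until neither can make progress.
 * scaler_first only changes the scheduling order. */
void run_network(Producer *p, Fifo *a, Fifo *b, int scaler_first)
{
    int progress;
    do {
        progress = 0;
        if (scaler_first) {
            progress |= scaler_step(a, b);
            progress |= producer_step(p, a);
        } else {
            progress |= producer_step(p, a);
            progress |= scaler_step(a, b);
        }
    } while (progress);
}
```

Running the network with either scheduling order leaves the same token sequence in the output channel, which is the determinism property that makes KPN attractive for signal processing and also why it cannot express non-deterministic events.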

To extend KPN with non-deterministic events
- Introduce a communication primitive (channel selection primitive)
- YAPI separates the concerns of the application programmer and the system designer
- Implementation of YAPI
  - In the form of a C++ run-time library
  - read(), write(), execute(), and select()
  - The implementation of these functions is a concern of the system designer (they may be implemented in different ways)

Architecture evaluation in YAPI
- VIDEOTOP application
- The top-level process network model
[Figure: top-level process network of VIDEOTOP. Labels: "MPEG2 stream"; "Channel selection to be decoded"]

Simulation to measure the workload
- Communication requirement
  - The amount of data transferred between processes
- Computation requirement
  - The amount of computation performed by the processes
- From the result
  - We know that the required communication bandwidth is 150 MB/s
  - We select an initial architecture as input for a more detailed mapping and performance analysis

Trace-driven approach
- SPADE (System level Performance Analysis and Design space Exploration) [7]
  - For architecture exploration of heterogeneous signal processing systems
  - Supports an explicit mapping step
  - Cosimulation of application models and architecture models using a trace-driven simulation technique
  - Architecture models do not need to model the functional behavior, yet still handle data-dependent behavior correctly

In SPADE, applications and architectures are modeled separately.
- An application imposes a workload on the resources provided by an architecture
- Workload
  - Computation and communication workload
- Resources
  - Processing resources: programmable cores or dedicated hardware
  - Communication resources: bus structures and memory resources such as RAMs or FIFO buffers

Trace-driven simulation
- Application model
  - A network of concurrent communicating processes
- Each process of the application model
  - Produces a so-called trace, which contains information on its communication and computation operations
- The traces are interfaced to an architecture model
  - They drive the computation and communication activities in the architecture

Application modeling and architecture modeling
- Application modeling
  - Kahn Process Network model
  - Modeled with a YAPI-based API: read(), write(), and execute()
    - These calls generate trace entries
    - The execute() function takes a symbolic instruction as an argument
- Architecture modeling
  - The architecture model does not model the functional behavior
  - It is constructed from generic building blocks
    - Trace-driven execution unit (TDEU): interprets trace entries and has a configurable number of I/O ports
    - Interfaces: translate the generic protocol (FIFO) into a communication-resource-specific protocol (e.g. bus)

Architecture modeling (Cont'd)
- All blocks are parameterized
  - TDEU: a list of symbolic instructions and latencies
  - Interface block: buffer size, bus width, setup delay, and transfer delay
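
To make the idea of a parameterized, trace-driven execution unit concrete, here is a minimal C sketch. It is not SPADE or TSS code. The `TraceEntry` and `Tdeu` types and the `tdeu_run` function are hypothetical, and the interface block is reduced to a single fixed `transfer_delay` per read or write, ignoring buffer size, bus width, setup delay, and contention.

```c
#include <assert.h>
#include <stddef.h>

/* Kinds of trace entries produced by read(), write(), and execute(). */
typedef enum { TR_READ, TR_WRITE, TR_EXECUTE } TraceKind;

typedef struct {
    TraceKind kind;
    size_t arg;  /* port index for READ/WRITE, symbolic instruction id for EXECUTE */
} TraceEntry;

/* Hypothetical TDEU parameters. */
typedef struct {
    const unsigned *latency;  /* cycles per symbolic instruction, indexed by id */
    size_t n_instr;           /* number of entries in latency[] */
    unsigned transfer_delay;  /* assumed fixed cycles per read or write */
} Tdeu;

/*
 * Interprets a trace and returns the accumulated cycle count.
 * No functional behavior is modeled: only the latencies attached to the
 * symbolic instructions and to communication are added up.
 */
unsigned long tdeu_run(const Tdeu *u, const TraceEntry *trace, size_t n)
{
    unsigned long cycles = 0;

    assert(u != NULL && trace != NULL);

    for (size_t i = 0; i < n; i++) {
        switch (trace[i].kind) {
        case TR_READ:
        case TR_WRITE:
            cycles += u->transfer_delay;
            break;
        case TR_EXECUTE:
            assert(trace[i].arg < u->n_instr);
            cycles += u->latency[trace[i].arg];
            break;
        }
    }
    return cycles;
}
```

Because the trace comes from actually running the application, data-dependent control flow is already resolved in it, so the unit can account for data-dependent behavior without computing any data itself.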

Mapping and simulation
- Mapping
  - Each process is mapped onto a TDEU
    - The mapping can be many-to-one
    - Processes sharing a TDEU need to be scheduled by the TDEU (round robin)
  - Each process port is mapped one-to-one onto an I/O port
- Simulation
  - Concurrent simulation of the application model and the architecture model
  - Architecture simulation uses TSS (Tool for System Simulation), a Philips in-house architecture modeling and simulation framework

Simulation results
- Utilization of the μP and the VLEP is low
- To reduce cost, the VLEP can be mapped onto the μP while still satisfying the required throughput
[Figures: utilization of the initial architecture; utilization after removing the VLEP]

Hybrid mapping approach [9]
- Two existing mapping approaches
  - Trace-driven approach
    - Fast simulation time
    - Insufficient accuracy
    - For architecture exploration only, i.e. does not connect well to a trajectory for detailed design
    - Potential concurrency (ILP) is hidden in the trace
  - CDFG mapping approach
    - Long simulation time
    - Fairly accurate performance numbers
    - Better for system synthesis

Hybrid mapping approach
- Takes a symbolic program (SP) approach
  - Positioned between the two extremes
- The SP itself is not executable
  - But we want to simulate the architecture in a trace-driven way
  - This requires control information
- Exploits both
  - Control information, in a trace-driven (TD) manner
  - SP code (a CDFG-like representation)

Hybrid mapping approach
[Figure: the three approaches side by side. TD approach: trace generation feeding architecture simulation. Hybrid approach: control trace generation and symbolic program generation feeding architecture simulation. CDFG approach: CDFG generation feeding architecture simulation. Other labels: instruction stream; data stream. The TD end of the scale is low accuracy and high simulation speed; the CDFG end is high accuracy and low simulation speed.]

To support this approach
- Introduce the Symbolic Program Unit (SPU)
- The SPU model
  - Front-end part: interprets an SP and fits the result into the available resources
  - Back-end part: dispatches the symbolic load and executes symbolic instructions

Symbolic program architecture model [10]
- SPUs + read/write interfaces + FIFO buffers
- Instruction-level parallelism within SPUs
- Task-level parallelism by executing different tasks on different SPUs
- Each process is mapped onto an SPU

Simulation results (QR algorithm example)
- Comparison with SPADE (TD approach)
- SPU approach
  - Case 1: sequential SPs and non-pipelined FIFO
  - Case 2: VLIW-like SPs and non-pipelined FIFO
  - Case 3: VLIW-like SPs and pipelined FIFO
- Results show that the SPU model is the more accurate model
[Figure: results plotted against the loop unfold factor]

COSY [11]: COdesign Simulation and Synthesis
- Focus is on communication refinement and DSE
- Input specification in YAPI
  - Simplifies the design process
- Level of abstraction for the communication mechanism
  - Adopts the application level, system level, virtual component level, and physical transfer level introduced by the VSI Alliance
- Four COSY levels
  - APP: executable untimed specification
  - SYS: timed functional abstraction level (transactions)
  - VCI: the interfaces work with addresses and split data into chunks manageable by a bus or a switching network
  - PHY: deals with physical bus size, signaling, and arbitration protocols

References
[1] R. Niemann and P. Marwedel, "Hardware/software partitioning using integer programming," Proc. ED&TC, Mar.
[2] F. Balarin, M. Chiodo, A. Jurecska, H. Hsieh, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara, Hardware-Software Co-Design of Embedded Systems: The Polis Approach, Kluwer Academic Publishers, 1997.
[3] H. Oh and S. Ha, "A hardware-software cosynthesis technique based on heterogeneous multiprocessor scheduling," Proc. CODES, May 1999.
[4] B. Kienhuis, E. Deprettere, K. Vissers, and P. van der Wolf, "An approach for quantitative analysis of application-specific dataflow architectures," Proc. ASAP'97, 1997.
[5] A. Kienhuis, Design Space Exploration of Stream-based Dataflow Architectures, Ph.D. Thesis, Delft University of Technology, 1999.
[6] E. de Kock, G. Essink, P. van der Wolf, J.-Y. Brunel, W. Kruijtzer, P. Lieverse, and K. Vissers, "YAPI: Application Modeling for Signal Processing Systems," Proc. DAC, 2000.
[7] P. Lieverse, P. van der Wolf, E. Deprettere, and K. Vissers, "A methodology for architecture exploration of heterogeneous signal processing systems," Proc. SIPS, 1999.
[8] P. Lieverse, T. Stefanov, P. van der Wolf, and E. Deprettere, "System level design with SPADE: an M-JPEG case study," Proc. ICCAD, 2001.
[9] V. Zivkovic, P. van der Wolf, E. Deprettere, and E. de Kock, "Design space exploration of streaming multiprocessor architectures," Proc. SIPS, 2002.
[10] V. Zivkovic, E. de Kock, P. van der Wolf, and E. Deprettere, "Fast and accurate multiprocessor architecture exploration with symbolic programs," Proc. DATE, 2003.
[11] J.-Y. Brunel et al., "COSY communication IP's," Proc. DAC, 2000.
