L11: Lower Power High Level Synthesis (2), August 1999. Professor Jun-Dong Cho, Sungkyunkwan University

Exploiting spatial locality for interconnect power reduction A spatially local cluster is a group of algorithm operations that are tightly connected to each other in the flowgraph representation. Two nodes are tightly connected to each other in the flowgraph representation if the shortest distance between them, in terms of the number of edges traversed, is low. A spatially local assignment is a mapping of the algorithm operations to specific hardware units such that no operations in different clusters share the same hardware. Partitioning the algorithm into spatially local clusters ensures that the majority of the data transfers take place within clusters (over local busses) and relatively few occur between clusters (over global busses). The partitioning information is passed to the architecture netlist and floorplanning tools. Local: a given adder outputs data to its own inputs. Global: a given adder outputs data to another adder's inputs.
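
A minimal sketch of the clustering idea, assuming a toy flowgraph and hand-picked seed operations; this is only an illustration, not the actual HYPER-LP partitioning algorithm.

```python
# Hypothetical sketch: group flowgraph operations into spatially local
# clusters by shortest edge-count distance from seed nodes.
# The graph, seeds, and assignment rule are illustrative only.
from collections import deque

def bfs_distances(adj, seed):
    """Shortest edge-count distance from `seed` to every reachable node."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def spatially_local_clusters(adj, seeds):
    """Assign every operation to the seed it is closest to (ties -> first seed)."""
    per_seed = {s: bfs_distances(adj, s) for s in seeds}
    clusters = {s: [] for s in seeds}
    for node in adj:
        best = min(seeds, key=lambda s: per_seed[s].get(node, float("inf")))
        clusters[best].append(node)
    return clusters

# Toy flowgraph: two tightly connected groups joined by a single edge.
adj = {
    "a1": ["a2", "a3"], "a2": ["a1", "a3"], "a3": ["a1", "a2", "b1"],
    "b1": ["b2", "a3"], "b2": ["b1", "b3"], "b3": ["b2"],
}
print(spatially_local_clusters(adj, seeds=["a1", "b3"]))
# Transfers inside a cluster stay on local busses; only the few
# inter-cluster transfers need a global bus.
```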

Hardware Mapping The last step in the synthesis process maps the allocated, assigned, and scheduled flow graph (called the decorated flow graph) onto the available hardware blocks. The result of this process is a structural description of the processor architecture (e.g., sdl input to the Lager IV silicon assembly environment). The mapping process transforms the flow graph into three structural sub-graphs:
– the data path structure graph
– the controller state machine graph
– the interface graph (between data path control inputs and the controller output signals)
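
A rough data-structure view of this step, with hypothetical names and fields; it does not follow the actual sdl/Lager IV format.

```python
# Hypothetical sketch of the hardware-mapping step: a decorated flow graph
# (operations annotated with their assigned unit and control step) is split
# into a datapath structure graph, a controller FSM graph, and an interface
# graph. All names and fields are illustrative only.

decorated_flow_graph = [
    # (operation, assigned unit, control step)
    ("mul1", "mult0", 0),
    ("add1", "alu0", 1),
    ("add2", "alu0", 2),
]

def map_hardware(dfg):
    datapath = {unit for _, unit, _ in dfg}                    # instantiated blocks
    fsm_states = sorted({step for _, _, step in dfg})          # controller states
    interface = [(f"s{step}", unit) for _, unit, step in dfg]  # control-signal edges
    return {"datapath": datapath, "controller": fsm_states, "interface": interface}

print(map_hardware(decorated_flow_graph))
```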

Spectral Partitioning in High-Level Synthesis The eigenvector placement obtained forms an ordering in which nodes tightly connected to each other are placed close together; the relative distances are a measure of the tightness of the connections. Use the eigenvector ordering to generate several partitioning solutions. The area estimates are based on distribution graphs; a distribution graph displays the expected number of operations executed in each time slot. Local bus power: the number of local data transfers times the area of the cluster. Global bus power: the number of global data transfers times the total area.
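
A minimal sketch of the eigenvector ordering (the Fiedler vector of the graph Laplacian) and of sweeping that ordering to generate candidate partitions. The graph and the cut points are illustrative; the real tool chooses among cuts using the distribution-graph area and bus-power estimates.

```python
# Hypothetical sketch: order flowgraph nodes by the Fiedler vector (the
# eigenvector of the 2nd-smallest Laplacian eigenvalue), then sweep the
# ordering to enumerate candidate two-way partitions.
import numpy as np

nodes = ["a1", "a2", "a3", "b1", "b2", "b3"]
edges = [("a1", "a2"), ("a2", "a3"), ("a1", "a3"),   # tightly connected group A
         ("b1", "b2"), ("b2", "b3"), ("b1", "b3"),   # tightly connected group B
         ("a3", "b1")]                                # single inter-group transfer

idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0
L = np.diag(A.sum(axis=1)) - A                        # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
fiedler = eigvecs[:, 1]                               # 2nd-smallest eigenvalue's vector
order = [nodes[i] for i in np.argsort(fiedler)]
print("spectral ordering:", order)

# Each prefix/suffix split of the ordering is one candidate partition.
for cut in range(1, len(order)):
    print(order[:cut], "|", order[cut:])
```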

Finding a good Partition

Interconnection Estimation For connections within a datapath (over-the-cell routing), routing between units increases the actual height of the datapath by approximately 20-30%, and most wire lengths are about 30-40% of the datapath height. The average global bus length is estimated as the square root of the estimated chip area. The chip-area estimate is a sum of three terms representing white space, active area of the components, and wiring area; the coefficients are derived statistically.
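
The area formula itself did not survive the transcript; a plausible reconstruction consistent with the three-term description (the symbols and coefficients below are assumptions, not the slide's exact expression) is

$$
A_{\text{chip}} \approx \alpha\, A_{\text{white}} + \beta \sum_i A_{\text{comp},i} + \gamma\, A_{\text{wire}},
\qquad
\bar{L}_{\text{global bus}} \approx \sqrt{A_{\text{chip}}},
$$

with the coefficients $\alpha, \beta, \gamma$ fitted statistically from existing layouts.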

Incorporating into HYPER-LP

Experiments

Datapath Generation Register file recognition and multiplexer reduction:
– Individual registers are merged as much as possible into register files.
– This reduces the number of bus multiplexers, the overall number of busses (since all registers in a file share the input and output busses), and the number of control signals (since a register file uses a local decoder).
Minimize the multiplexers and I/O busses simultaneously (clique partitioning is NP-complete, so simulated annealing is used). Data path partitioning is used to optimize the processor floorplan. The core idea is to grow pairs of isomorphic regions, as large as possible, from corresponding pairs of seed nodes.
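
A toy sketch of register-file formation viewed as clique partitioning of a compatibility graph. The compatibility criterion and the greedy pass below are illustrative simplifications; as noted above, the actual problem is NP-complete and the tool solves it with simulated annealing.

```python
# Hypothetical sketch: two registers are treated as "compatible" if they are
# never accessed in the same control step, so they can share one register
# file (and hence its busses and local decoder). Toy data and toy criterion.
# register -> set of control steps in which it is read or written
access = {
    "r1": {0, 2}, "r2": {1, 3}, "r3": {0, 1},
    "r4": {2, 3}, "r5": {4},
}

def compatible(a, b):
    return not (access[a] & access[b])          # no common control step

def greedy_register_files(regs):
    files = []                                  # each file = list of registers
    for r in regs:
        for f in files:
            if all(compatible(r, other) for other in f):
                f.append(r)
                break
        else:
            files.append([r])                   # open a new register file
    return files

print(greedy_register_files(sorted(access)))
# Fewer files -> fewer busses, fewer bus multiplexers, fewer control signals.
```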

Hardware Mapper

HYPER's Basic Architecture Model

HYPER's Crossbar Network

Refined Architecture Model

Bus Merging

Fanin Bus Merging

Fanout Bus Merging

Global Bus Merging

Test Example

Control Signal Assignment

Factors of the coarse-grained model (obtained by a switch-level simulator)

- Design Automation Lab - Low Power Scheduling and Binding (a) scheduling without low-power considerations (b) low-power-aware scheduling

The coarse-grained model provides a fast estimate of the power consumption when no information about the activity of the input data to the functional units is available.

Fine-grained model: used when information about the activity of the input data to the functional units is available.
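
One plausible way to write the two models (an assumption based on the slide text, not the exact formulation): with $N_i$ the number of activations of functional unit $i$ over execution time $T$, the coarse-grained model uses a fixed average energy per activation $\bar{E}_i$ characterized by switch-level simulation, while the fine-grained model makes that energy a function of the operand activity $\alpha_i$ (e.g., the average Hamming distance of successive inputs):

$$
P_{\text{coarse}} \approx \frac{1}{T}\sum_i N_i\,\bar{E}_i,
\qquad
P_{\text{fine}} \approx \frac{1}{T}\sum_i N_i\,E_i(\alpha_i).
$$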

Effect of the operand activity on the power consumption of an 8 x 8-bit Booth multiplier. (Figure: power consumption as a function of the AHD of the input data.)

High-Level Power Estimation: P_MUX and P_FU

Loop Interchange If matrix A is laid out in memory in column-major form, execution order (a.2) implies more cache misses than the execution order in (b.2). Thus, the compiler chooses algorithm (b.1) to reduce the running time.
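
The figure panels (a.1)-(b.2) are not in the transcript; the sketch below shows, under the column-major assumption, the two loop orders they presumably contrast. In Python the interpreter overhead hides the cache effect, so this only illustrates the difference in memory access order.

```python
# Hypothetical sketch of loop interchange for a column-major matrix:
# elements of one column are contiguous in memory.
import numpy as np

N = 512
A = np.asfortranarray(np.zeros((N, N)))   # Fortran order = column-major

# (a) row-wise traversal: consecutive accesses jump between columns,
# i.e. stride N elements in memory -> poor cache behaviour.
def sum_row_wise(A):
    s = 0.0
    for i in range(A.shape[0]):
        for j in range(A.shape[1]):
            s += A[i, j]
    return s

# (b) column-wise traversal after loop interchange: consecutive accesses
# are adjacent in memory -> far fewer cache misses.
def sum_column_wise(A):
    s = 0.0
    for j in range(A.shape[1]):
        for i in range(A.shape[0]):
            s += A[i, j]
    return s

# Same result either way; only the access order (and the cache behaviour
# on a real target) differs.
assert sum_row_wise(A) == sum_column_wise(A)
```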