Slide 1 (08/31/2001, Copyright CECS & The Spark Project): Center for Embedded Computer Systems, University of California, Irvine
High-Level Synthesis of High Performance Microprocessor Blocks
Nick Savoiu, Nikil Dutt, Rajesh Gupta, Alex Nicolau
SPARK High Level Synthesis System
Supported by Semiconductor Research Corporation and Intel
Timothy Kam, Michael Kishinevsky, Steve Haynal, Abdallah Tabbara, Sumit Gupta
Strategic CAD Labs, Design Technologies, Intel Inc., Hillsboro

Slide 2: Overview
- Brief background
  - Spark High-Level Synthesis Framework
  - Previous work in the Spark framework
- High-level synthesis for microprocessor blocks
- Instruction Length Decoder
  - Design behavior
  - Steps involved in synthesis
- Work done this summer at SCL
- Future plans

Slide 3: High-Level Synthesis (figure: from C to CDFG to architecture)

Slide 4: Scheduling with a Given Resource Allocation (figure: schedule under resource constraints, with one adder and one comparator)

Slide 5: The Spark High-Level Synthesis Framework (figure only)

Slide 6: Limitations of High-Level Synthesis Targeted by Spark
- Quality of synthesis results is severely affected by complex control flow
  - Control flow style affects the effectiveness of optimizations
  - Nested ifs and loops are not handled, or are handled poorly
- Poor understanding (much less integration) of the interaction between source-level and fine-grain "compiler" transformations
- No comprehensive synthesis framework
  - Few and scattered optimizations
  - Results presented for scheduling only
    - Effects on logic synthesis not understood
  - Small, synthetic benchmarks

Slide 7: Generalized Code Motions (figure: an if-node with T/F branches illustrating speculation, reverse speculation, conditional speculation, and speculation across hierarchical blocks)
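These motions can be illustrated in plain C. Below is a minimal sketch of plain speculation, with made-up operands (none of these names come from the slides): the branch-bound addition is hoisted above the condition into a temporary, and only the cheap commit remains control-dependent.

```c
#include <assert.h>

/* Original: the addition executes only inside the taken branch. */
static int original(int c, int a, int b, int x) {
    int r = x;
    if (c)
        r = a + b;          /* operation tied to the branch */
    return r;
}

/* Speculated: the addition is hoisted above the condition;
 * only the select/commit stays control-dependent. */
static int speculated(int c, int a, int b, int x) {
    int t = a + b;          /* executed unconditionally (speculatively) */
    int r = x;
    if (c)
        r = t;              /* commit the speculative result */
    return r;
}
```

Reverse speculation is the inverse motion: `t = a + b` would be pushed back down into the one branch that actually uses its result.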

Slide 8: Characteristics of ASIC Design
- Large designs such as MPEG
- Multi-cycle implementation
- Resource constrained
Implications on transformations applied:
- Extraction of parallelism constrained by area limitations
  - Speculation may lead to additional registers
- More conservative with transformations such as loop unrolling

Slide 9: Characteristics of Microprocessor Blocks
- Smaller designs
- Single- or dual-cycle implementation
- High performance
  - Extract maximal parallelism
  - Area constraints are more lax
Implications on transformations applied:
- Operations within a behavior are chained together with no latching
- All loops can be unrolled

Slide 10: Simplified Instruction Length Decoder (figure: Byte 0 through Byte 3, each with a Length Contribution and a NeedNextByte output)

Slide 11: Simplified Instruction Length Decoder (same figure, with the first instruction marked)

Slide 12: Behavioral Description in C

    NextStartByte = 0;
    for (i = 0; i < n; i++) {
        len[i] = CalculateLength(i);
        if (i == NextStartByte) {
            NextStartByte = len[i];
            Mark[i] = 1;
        }
    } /* for (i = 0; i < n; i++) */

    int CalculateLength(i) {
        lc1 = LengthContribution(i);
        need1 = need_next_byte(i);
        if (need1) {
            lc2 = LengthContribution(i+1);
            need2 = need_next_byte(i+1);
            if (need2) {
                lc3 = LengthContribution(i+2);
                need3 = need_next_byte(i+2);
                if (need3) {
                    lc4 = LengthContribution(i+3);
                    Length = lc1 + lc2 + lc3 + lc4;
                } else
                    Length = lc1 + lc2 + lc3;
            } else
                Length = lc1 + lc2;
        } else
            Length = lc1;
        return Length;
    }

Slide 13: Speculate Maximally (data calculation separated from control logic)

    NextStartByte = 0;
    for (i = 0; i < n; i++) {
        len[i] = CalculateLength(i);
        if (i == NextStartByte) {
            NextStartByte = len[i];
            Mark[i] = 1;
        }
    } /* for (i = 0; i < n; i++) */

    int CalculateLength(i) {
        /* Data calculation (speculated) */
        lc1 = LengthContribution(i);    need1 = need_next_byte(i);
        lc2 = LengthContribution(i+1);  need2 = need_next_byte(i+1);
        lc3 = LengthContribution(i+2);  need3 = need_next_byte(i+2);
        lc4 = LengthContribution(i+3);
        TempLength1 = lc1 + lc2 + lc3 + lc4;
        TempLength2 = lc1 + lc2 + lc3;
        TempLength3 = lc1 + lc2;
        /* Control logic */
        if (need1) {
            if (need2) {
                if (need3)
                    Length = TempLength1;
                else
                    Length = TempLength2;
            } else
                Length = TempLength3;
        } else
            Length = lc1;
        return Length;
    }
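The nested and the maximally speculated versions of CalculateLength must compute the same Length for every start byte. The sketch below checks that equivalence; LengthContribution() and need_next_byte() are hypothetical stand-ins, since the slides do not give the real decode tables.

```c
#include <assert.h>

/* Hypothetical stand-ins for the slide's LengthContribution() and
 * need_next_byte(); the real x86 decode tables are not shown. */
static int LengthContribution(int i) { return (i % 3) + 1; }
static int need_next_byte(int i)     { return (i % 2) == 0; }

/* Original nested-if form (slide 12). */
static int CalculateLengthNested(int i) {
    int Length = LengthContribution(i);
    if (need_next_byte(i)) {
        int lc2 = LengthContribution(i + 1);
        if (need_next_byte(i + 1)) {
            int lc3 = LengthContribution(i + 2);
            if (need_next_byte(i + 2))
                Length = Length + lc2 + lc3 + LengthContribution(i + 3);
            else
                Length = Length + lc2 + lc3;
        } else
            Length = Length + lc2;
    }
    return Length;
}

/* Maximally speculated form (slide 13): every contribution is computed
 * up front; control logic only selects among precomputed sums. */
static int CalculateLengthSpeculated(int i) {
    int lc1 = LengthContribution(i),     need1 = need_next_byte(i);
    int lc2 = LengthContribution(i + 1), need2 = need_next_byte(i + 1);
    int lc3 = LengthContribution(i + 2), need3 = need_next_byte(i + 2);
    int lc4 = LengthContribution(i + 3);
    int t1 = lc1 + lc2 + lc3 + lc4;
    int t2 = lc1 + lc2 + lc3;
    int t3 = lc1 + lc2;
    if (need1) {
        if (need2)
            return need3 ? t1 : t2;
        return t3;
    }
    return lc1;
}
```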

Slide 14: Inlining (done earlier; data calculation and control logic split into separate functions)

    NextStartByte = 0;
    for (i = 0; i < n; i++) {
        Results(i) = DataCalculation(i, i+1, i+2, i+3);
        Length(i) = ControlLogic(Results(i));
        len[i] = Length(i);
        if (i == NextStartByte) {
            NextStartByte = len[i];
            Mark[i] = 1;
        }
    } /* for (i = 0; i < n; i++) */

(The right half of the slide repeats the speculated CalculateLength code from slide 13.)

Slide 15: Unroll Loop Completely (shown for only 2 unrolls)

    NextStartByte = 0;
    i = 0;
    Results(i) = DataCalculation(i, i+1, i+2, i+3);
    Length(i) = ControlLogic(Results(i));
    len[i] = Length(i);
    if (i == NextStartByte) {
        NextStartByte = len[i];
        Mark[i] = 1;
    }
    Results(i+1) = DataCalculation(i+1, i+2, i+3, i+4);
    Length(i+1) = ControlLogic(Results(i+1));
    len[i+1] = Length(i+1);
    if (i+1 == NextStartByte) {
        NextStartByte = len[i+1];
        Mark[i+1] = 1;
    }

Slide 16: Propagate Constant: Loop Index

    NextStartByte = 0;
    Results(0) = DataCalculation(0, 1, 2, 3);
    Length(0) = ControlLogic(Results(0));
    len[0] = Length(0);
    if (0 == NextStartByte) {
        NextStartByte = len[0];
        Mark[0] = 1;
    }
    Results(1) = DataCalculation(1, 2, 3, 4);
    Length(1) = ControlLogic(Results(1));
    len[1] = Length(1);
    if (1 == NextStartByte) {
        NextStartByte = len[1];
        Mark[1] = 1;
    }
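Full unrolling followed by constant propagation of the loop index must preserve the loop's behavior. A small executable check, using a hypothetical CalculateLength stub and n fixed at 2 as on the unrolling slide:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stub; the real length calculation is in slides 12-13. */
static int CalculateLength(int i) { return (i % 3) + 1; }

/* Loop form, with n fixed at 2 for illustration. */
static void decode_loop(int len[], int Mark[]) {
    int NextStartByte = 0;
    for (int i = 0; i < 2; i++) {
        len[i] = CalculateLength(i);
        if (i == NextStartByte) { NextStartByte = len[i]; Mark[i] = 1; }
    }
}

/* Fully unrolled, with the loop index propagated as a constant:
 * each copy of the body now references fixed positions 0 and 1. */
static void decode_unrolled(int len[], int Mark[]) {
    int NextStartByte = 0;
    len[0] = CalculateLength(0);
    if (0 == NextStartByte) { NextStartByte = len[0]; Mark[0] = 1; }
    len[1] = CalculateLength(1);
    if (1 == NextStartByte) { NextStartByte = len[1]; Mark[1] = 1; }
}
```

Once every index is a constant, the comparisons `0 == NextStartByte`, `1 == NextStartByte`, and so on become the ripple control logic of slide 17.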

Slide 17: Maximally Parallelize/Compact (data calculation, control logic, and ripple control logic grouped)

    Results(0) = DataCalculation(0, 1, 2, 3);
    Results(1) = DataCalculation(1, 2, 3, 4);
    ...
    Results(n) = DataCalculation(n, n+1, n+2, n+3);

    Length(0) = ControlLogic(Results(0));
    Length(1) = ControlLogic(Results(1));
    ...
    Length(n) = ControlLogic(Results(n));

    len[0] = Length(0);
    len[1] = Length(1);
    ...
    len[n] = Length(n);

    NextStartByte = 0;
    if (0 == NextStartByte) { NextStartByte = len[0]; Mark[0] = 1; }
    if (1 == NextStartByte) { NextStartByte = len[1]; Mark[1] = 1; }
    ...
    if (n == NextStartByte) { NextStartByte = len[n]; Mark[n] = 1; }

Slide 18: Final Design Architecture (figure: instruction buffer feeding data calculation, control logic, and ripple logic blocks)

    Results(0) = DataCalculation(0, 1, 2, 3);
    Results(1) = DataCalculation(1, 2, 3, 4);
    ...
    Results(n) = DataCalculation(n, n+1, n+2, n+3);

    Length(0) = ControlLogic(Results(0));
    Length(1) = ControlLogic(Results(1));
    ...
    Length(n) = ControlLogic(Results(n));

    if (0 == NextStartByte) { NextStartByte = len[0]; Mark[0] = 1; }
    ...
    if (n == NextStartByte) { NextStartByte = len[n]; Mark[n] = 1; }

Slide 19: ILD Tasks Achieved This Summer
- Chaining across conditional boundaries
  - Enables single-cycle schedules
  - Useful as a general high-level synthesis transformation as well
  - Had implications on other things, such as VHDL generation
- Complete unrolling of loops
  - Was implemented previously
- Constant propagation
  - Useful for loop-index propagation after unrolling

Slide 20: Other Interaction within SCL
- Interfacing with the HLD team via XML
  - Implemented an XML generation pass
  - Creates a path from C for NexSiS and the rest of the HLD flow
  - Being driven by requirements from Abdallah
- Analyzed some other designs
  - Whitney: 3-D design
  - FAX: Willamette floating point unit

Slide 21: Future Plans
- Continue to work on the ILD with the more complicated (complete) design
- Look at similar designs
  - Detect the first 3 zeros in a 32-bit vector
- Develop a set of transformations targeted at such high-performance blocks
- Expand interaction with the HLD design flow
  - Do some transformations before handing over the CDFG via XML to symbolic scheduling
  - For example: transformations that lead to node duplication, source-to-source transformations, some loop transformations

Slide 22: Additional Slides

Slide 23: Spark's Methodology
- Applies coarse- and fine-grain compiler optimizations
  - Targets control flow transformations
  - "Fine grain" loop optimization techniques for multiple and nested loops
  - Mixed IR suitable for fine- and coarse-grain compiler transformations (similar to other systems such as SUIF)
- Synthesis from C provides
  - A flow from architecture design to synthesis
  - The opportunity to apply coarse-grain optimizations
- Compiler transformations modified to target HLS
  - Multiple mutually exclusive operations can be scheduled on the same resource in the same cycle

Slide 24: Spark's Methodology
- Customizable, extensible scheduler
  - Range of transformations in a modular toolbox
    - Percolation, trailblazing, loop pipelining (RDLP), inlining
  - Selected under heuristics and/or user control
    - Code motion, loop transformations
- Ability to generate synthesizable RTL VHDL
  - Integrates with current IC design flows
  - Code generation at various levels:
    - Behavioral C
    - Behavioral VHDL
    - Structural VHDL

Slide 25: Generalized Code Motions
- Hierarchical code motions
  - Operations are moved across entire conditional structures
- Speculation, to improve resource utilization
  - Has to be controlled to limit the impact on the number of registers
- Reverse speculation
  - Moves operations down into conditional branches
- Early condition execution
  - Evaluates conditionals as soon as the corresponding operation has been executed
- Conditional speculation
  - Duplicates operations up into conditional branches
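Conditional speculation, the last motion above, duplicates an operation up into every branch of a conditional so the copies can fill otherwise idle resource slots in each branch's schedule. A minimal C sketch with made-up operations (not from the slides):

```c
#include <assert.h>

/* Original: the multiply is scheduled after the conditional joins. */
static int after_if(int c, int a, int b) {
    int x = c ? 1 : 2;
    int y = a * b;          /* executes after the if, in its own step */
    return x + y;
}

/* Conditionally speculated: the multiply is duplicated up into both
 * branches, where it can share a cycle with each branch's own work. */
static int cond_speculated(int c, int a, int b) {
    int x, y;
    if (c) { x = 1; y = a * b; }   /* duplicated copy */
    else   { x = 2; y = a * b; }   /* duplicated copy */
    return x + y;
}
```

The duplication costs area (two multipliers, unless the copies are bound to the same unit as mutually exclusive operations) but shortens the schedule.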

Slide 26: Scheduling Results on MPEG Prediction Block (figure only)

Slide 27: Scheduling Results on ADPCM Encoder (figure only)

Slide 28: Scheduling Results
- Synthesis results after scheduling by Spark show:
  - Considerable gain in execution cycles
  - Critical path decreases marginally
  - Area can increase significantly
- Benchmarks used are large, real-life applications
  - Well written; no gains due to sloppy code

Slide 29: Interconnect Minimization by Resource Binding
- Minimize the complexity of steering logic
  - Multiplexors and demultiplexors
- Bind operations with the same inputs and outputs to the same functional units
- Bind variables that are inputs/outputs of the same functional units to the same registers
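A toy version of the first binding rule: operations reading the same source registers are mapped to the same functional unit, so that unit's input multiplexors gain no new inputs. This sketch is hypothetical and deliberately ignores the scheduling constraint that operations sharing a unit must execute in different cycles (or be mutually exclusive):

```c
#include <assert.h>

/* An operation, described only by its two source registers. */
typedef struct { int src1, src2; } Op;

/* Greedily assigns a functional-unit id to each op, reusing a unit
 * whose previously bound op reads the same sources.  Returns the
 * number of functional units used; fu_of[i] is op i's unit. */
static int bind_ops(const Op ops[], int n, int fu_of[]) {
    int n_fu = 0;
    for (int i = 0; i < n; i++) {
        fu_of[i] = -1;
        for (int j = 0; j < i; j++)
            if (ops[j].src1 == ops[i].src1 && ops[j].src2 == ops[i].src2) {
                fu_of[i] = fu_of[j];   /* same inputs: same unit, no new mux inputs */
                break;
            }
        if (fu_of[i] < 0)
            fu_of[i] = n_fu++;         /* otherwise allocate a fresh unit */
    }
    return n_fu;
}
```

The second rule on the slide is the dual for registers: variables flowing through the same unit's ports are folded into the same register to shrink the demultiplexing at the unit's output.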

Slide 30: Results after Binding (figure only)

Slide 31: Results after Binding: ADPCM (figure only)

Slide 32: Future Plans
- Synthesis for high-performance microprocessor blocks
  - Single-cycle behavioral descriptions
- Timing analysis and time budgeting
  - Introducing time-constrained synthesis
- Loop transformations
  - Parallelizing compiler transformations: loop interchange, exchange, splitting, fusion
- Resource versus throughput analysis
- Cost models for code motions

Slide 33: The Intermediate Representation (figure: AST and EDG feeding the HTG/CDFG)
- The Hierarchical Task Graph (HTG) is the main structure in the intermediate representation (IR)
- Maintains information on:
  - Code structure (ifs, loops)
  - Loop bounds and type (for, while)
- Array accesses are not lowered to address calculation followed by memory access
- The IR is complete: the input C code can be regenerated from it

Slide 34: IR Examples (figure only)

Slide 35: C code, CDFG, and HTG (figure: side-by-side example)

Slide 36: Scheduling

Slide 37: The Scheduler Framework
- Scheduler framework philosophy:
  - Modularity and reusability
  - Allow a designer to write new scheduling algorithms with minimal effort
- Toolbox approach:
  - Core transformations: percolation, trailblazing, RDLP
  - Heuristics decide which transformations are to be applied

Slide 38: The Scheduler Framework
- Designed to be completely customizable in terms of the scheduling algorithms and heuristics used
- An instance of a scheduling algorithm consists of a set of:
  - IR traversal algorithms
  - Code motion algorithms
  - Scheduling heuristics
- The designer can use predefined algorithms and heuristics or design new ones
  - Enabled by the toolbox approach
(Figure: an Algorithm assembled from Scheduling Heuristics, Candidate Validators, a Candidate Provider, and IR Walkers)

Slide 39: Extracting Parallelism with Speculation (figure only)

Slide 40: Reverse Speculation
- Moves operations down into conditionals
- Moves operations only into branches that require their result
- Moves operations with lower priority

Slide 41: Early Condition Execution
- Evaluates conditions as soon as possible (ASAP)
- Moves all unscheduled operations into conditionals
- Uses reverse speculation to achieve this

Slide 42: Conditional Speculation (figure only)

Slide 43: RDLP Example (figure: a loop with operations A: i=i+1, B: j=i+h, C: k=i+g, D: l=j+1, shown as the original loop, after compaction, after shift-and-pipeline, and after unroll-and-compact)
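Loop shifting, the core RDLP move, peels the first copy of an operation into a prologue and then overlaps an operation from iteration i+1 with the rest of iteration i. A runnable sketch with two hypothetical dependent operations standing in for the slide's A and B:

```c
#include <assert.h>
#include <string.h>

#define N 8

/* Hypothetical per-iteration operations: B depends on A's result. */
static void A(int t[], int i)                  { t[i] = i * 2; }
static void B(int out[], const int t[], int i) { out[i] = t[i] + 1; }

/* Original loop: A and B execute back-to-back each iteration. */
static void original_loop(int out[]) {
    int t[N];
    for (int i = 0; i < N; i++) { A(t, i); B(out, t, i); }
}

/* Loop-shifted form: A of iteration i+1 sits next to B of iteration i,
 * so the two independent operations can be compacted into one step. */
static void shifted_loop(int out[]) {
    int t[N];
    A(t, 0);                          /* prologue: first A peeled out */
    for (int i = 0; i < N - 1; i++) {
        B(out, t, i);
        A(t, i + 1);                  /* shifted copy from the next iteration */
    }
    B(out, t, N - 1);                 /* epilogue: last B */
}
```

Unroll-and-compact, the slide's other variant, exposes the same overlap by unrolling the loop body before compaction instead of shifting it.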