High Performance Embedded Computing © 2007 Elsevier Lecture 10: Code Generation Embedded Computing Systems Michael Schulte Based on slides and textbook.

Presentation transcript:

High Performance Embedded Computing © 2007 Elsevier
Lecture 10: Code Generation
Embedded Computing Systems
Michael Schulte
Based on slides and textbook from Wayne Wolf

© 2006 Elsevier Topics
- Code generation overview.
- Instruction selection.
- Register allocation.
- Instruction representation and scheduling.
- Code placement.
- Programming environments.

© 2006 Elsevier Embedded vs. general-purpose compilers
- General-purpose compilers must generate code for a wide range of programs:
  - No real-time requirements.
  - Often no explicit low-power requirements.
  - Fast compilation times are generally wanted.
- Embedded compilers must meet real-time and low-power requirements.
  - Developers may be willing to wait longer for compilation results.

© 2006 Elsevier Code generation steps
- Instruction selection chooses opcodes and modes.
- Register allocation binds values to registers.
  - Many DSPs and ASIPs have irregular register sets.
- Address generation selects the addressing mode, registers, etc.
- Instruction scheduling is important for pipelining and parallelism.

© 2006 Elsevier twig model for instruction selection
- twig models both instructions and programs as graphs.
- It covers the program graph with instruction graphs.
  - The covering can be driven by costs.

© 2006 Elsevier twig instruction models
- Rewriting rule:
  - replacement <- template {cost} = action
- Dynamic programming can be used to cover the program with instructions when the instructions are tree-structured.
  - Heuristics are needed for more general instructions.
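
Cost-driven tree covering by dynamic programming can be sketched roughly as follows. The expression tree, the small pattern set (ADD, MUL, and a fused multiply-accumulate MAC), and the costs are all hypothetical and chosen only for illustration; this is not twig's actual rule syntax.

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Tuple


@dataclass(frozen=True)
class Node:
    op: str                          # "reg", "add", or "mul"
    kids: Tuple["Node", ...] = ()


# Each matcher returns the subtrees the pattern leaves uncovered (they must
# be covered recursively), or None if the pattern does not match at n.
def match_reg(n):
    return () if n.op == "reg" else None


def match_add(n):
    return n.kids if n.op == "add" else None


def match_mul(n):
    return n.kids if n.op == "mul" else None


def match_mac(n):
    # add(x, mul(y, z)) covered by one multiply-accumulate instruction
    if n.op == "add" and n.kids[1].op == "mul":
        return (n.kids[0],) + n.kids[1].kids
    return None


# Hypothetical instruction patterns: (name, cost, matcher).
PATTERNS = [("REG", 0, match_reg), ("ADD", 1, match_add),
            ("MUL", 2, match_mul), ("MAC", 2, match_mac)]


@lru_cache(maxsize=None)
def best_cover(n):
    """Return (minimum cost, instruction sequence) for the subtree at n."""
    best = None
    for name, cost, matcher in PATTERNS:
        uncovered = matcher(n)
        if uncovered is None:
            continue
        total, code = cost, []
        for sub in uncovered:
            sub_cost, sub_code = best_cover(sub)   # memoized subproblem
            total += sub_cost
            code.extend(sub_code)
        if name != "REG":                          # a register needs no code
            code.append(name)
        if best is None or total < best[0]:
            best = (total, tuple(code))
    if best is None:
        raise ValueError(f"no pattern matches {n.op}")
    return best


r = Node("reg")
tree = Node("add", (r, Node("mul", (r, r))))       # r + r * r
cost, code = best_cover(tree)
```

With these costs, covering the tree with a single MAC (cost 2) beats MUL followed by ADD (cost 3), so the cheaper cover is selected.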

© 2006 Elsevier ASIP instruction description
- Designing code generators for general-purpose machines:
  - Only need to describe how instructions modify the programmer-visible registers.
- Designing code generators for application-specific instruction processors (ASIPs):
  - May need to describe the complete behavior of the instruction in the pipeline.
  - Most ASIPs do not have general-purpose registers, and many important instructions use specialized registers. Why?

© 2006 Elsevier Register allocation and lifetimes
- Two variables can be assigned to the same register if they are not live at the same time:
  - That is, the last use of one variable comes before the first use of the other.
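
The lifetime test can be sketched for straight-line code as follows. The `(dest, sources)` instruction format and the five-instruction program are assumptions made for illustration, and the simple interval model ignores branches and loops.

```python
def lifetimes(code):
    """Map each variable to (first, last) instruction index where it appears.

    `code` is a list of (dest, sources) tuples for straight-line code.
    """
    span = {}
    for i, (dest, srcs) in enumerate(code):
        for v in (dest, *srcs):
            first, _ = span.get(v, (i, i))
            span[v] = (first, i)
    return span


def can_share(span, a, b):
    """True if a and b are never live at the same time.

    The strict comparison is conservative: a value that dies in the same
    instruction that defines another is still treated as overlapping.
    """
    (a0, a1), (b0, b1) = span[a], span[b]
    return a1 < b0 or b1 < a0


code = [
    ("a", ()),           # 0: a = input
    ("b", ()),           # 1: b = input
    ("c", ("a", "b")),   # 2: c = a + b   (last use of a and b)
    ("d", ("c",)),       # 3: d = c * 2   (last use of c)
    ("e", ("d",)),       # 4: e = d + 1
]
span = lifetimes(code)
```

Here `can_share(span, "a", "d")` holds because `a` is last used at instruction 2 and `d` first appears at instruction 3, while `a` and `b` overlap and need separate registers.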

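© 2006 Elsevier Conflict graphs and clique covering
- In the conflict graph as defined here, edges connect nodes (variables) that have disjoint lifetimes.
  - Variables joined by an edge can be assigned to the same register.
  - This is the complement of the interference graph used in graph-coloring allocators, where edges join variables whose lifetimes overlap.
- Clique: a set of vertices in which every pair is connected by an edge.
- Cliques in the graph correspond to registers.
  - Cliques should be maximal.
  - Each node should belong to exactly one clique.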
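
A minimal sketch of clique covering on such a graph is given below. The `(first, last)` lifetimes are hypothetical, and the greedy partition is an assumed heuristic for illustration: it puts each node in exactly one clique but is not guaranteed to find the minimum number of cliques on an arbitrary graph.

```python
from itertools import combinations


def compatibility_edges(span):
    """Edges join variables whose (first, last) lifetimes are disjoint."""
    edges = set()
    for a, b in combinations(sorted(span), 2):
        (a0, a1), (b0, b1) = span[a], span[b]
        if a1 < b0 or b1 < a0:
            edges.add(frozenset((a, b)))
    return edges


def greedy_clique_partition(span):
    """Put each variable into the first clique all of whose members it joins."""
    edges = compatibility_edges(span)
    cliques = []                      # each clique stands for one register
    for v in sorted(span, key=lambda v: span[v][0]):   # by lifetime start
        for clique in cliques:
            if all(frozenset((v, u)) in edges for u in clique):
                clique.append(v)
                break
        else:
            cliques.append([v])       # no compatible clique: new register
    return cliques


# Hypothetical lifetimes, as (first, last) instruction indices.
span = {"a": (0, 2), "b": (1, 2), "c": (2, 3), "d": (3, 4), "e": (4, 4)}
registers = greedy_clique_partition(span)
```

For these lifetimes the partition is `[["a", "d"], ["b", "e"], ["c"]]`, so three registers suffice; three are also necessary here because `a`, `b`, and `c` all appear at instruction 2.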

© 2006 Elsevier Instruction selection and scheduling
- Instruction selection is more challenging when processors have limited or irregular resources (e.g., DSPs).
- When resources are limited, instruction selection and scheduling often interact.
- FlexWare system:
  - Includes a code generation system, called CodeSyn, for ASIPs and DSPs with irregular register files.
  - Has an intermediate representation (IR) for a program's control and data flow (see next slide).
  - Target instructions use the same basic format as the IR, but include information about how registers can communicate.
  - Covers the program graph using dynamic programming for data flow and heuristics for control flow.

© 2006 Elsevier FlexWare intermediate representation [Lie94] © 1994 IEEE

© 2006 Elsevier Register connectivity and classification [Lie94] © 1994 IEEE
- A separate representation indicates which registers can be used by which types of operations.
- This information has to be taken into account during instruction selection, scheduling, and register allocation.

© 2006 Elsevier Template matching [Lie94] © 1994 IEEE

© 2006 Elsevier Code placement
- Place code to minimize cache conflicts.
- Possible cache conflicts can be determined from addresses.
  - The interesting conflicts are identified through analysis.
- May require blank areas (padding) in the program.

© 2006 Elsevier Code placement to reduce cache conflicts
- Blocks of instructions (e.g., functions) that are accessed frequently can be placed so that they map to different cache lines.
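
How addresses determine possible conflicts can be sketched for a direct-mapped instruction cache. The cache geometry (128 lines of 32 bytes, 4 KiB in total) and the function addresses and sizes are hypothetical values chosen for illustration.

```python
def cache_lines(start, size, line_size, num_lines):
    """Set of cache line indices touched by code at [start, start + size)."""
    first = start // line_size
    last = (start + size - 1) // line_size
    return {block % num_lines for block in range(first, last + 1)}


def conflicts(a, b, line_size=32, num_lines=128):
    """Cache lines that two (start, size) code regions both map to."""
    return (cache_lines(*a, line_size, num_lines)
            & cache_lines(*b, line_size, num_lines))


f = (0x0000, 256)         # hypothetical hot function: 256 bytes at 0x0000
g = (0x1000, 256)         # 4 KiB away, so it maps onto the same cache lines
g_moved = (0x1100, 256)   # same function after 256 bytes of padding
```

In this setup `f` and `g` collide on eight cache lines, while `g_moved` shares none with `f`. The padding that achieves this is the kind of blank area in the program that code placement may require.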

© 2006 Elsevier Hwu and Chang code placement
- Analyzed traces to find the relative execution times of code sections.
- Inline-expanded frequently used subroutines.
  - Eliminates function call overhead.
- Placed frequently used traces using a greedy algorithm.
  - The most frequently used code sections are assigned to the blocks with the fewest conflicts.

© 2006 Elsevier McFarling code placement
- Analyzed program structure and trace information.
- Annotated the program with loop execution counts, basic block sizes, and procedure call frequencies.
- Walked through the program to propagate labels, grouped code based on the labels, and placed the code groups to minimize interference.

© 2006 Elsevier McFarling procedure inlining
- Estimated the number of cache misses in a loop, using:
  - s_l = effective loop body size.
  - s_b = basic block size.
  - f = average execution frequency of the block.
  - M_l = number of misses per loop instance.
  - l = average number of loop iterations.
  - S = cache size.
- Estimated the new cache miss rate after inlining; used a greedy algorithm to select functions to inline.

© 2006 Elsevier Pettis and Hansen
- Profiled programs using gprof.
- Put caller and callee close together in the program, increasing the chance that they would be on the same page.
  - Ordered procedures using the call graph, weighted by number of invocations, merging highly weighted edges.
- Optimized if-then-else code to take advantage of the processor's branch prediction mechanism.
  - If branches are predicted taken, the code is restructured so that the more frequent path is the one predicted to be taken.
- Identified basic blocks that were not executed for the given input data (fluff blocks); moved them to separate procedures to improve memory system behavior.
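
The procedure-ordering step can be sketched as greedy chain merging over the weighted call graph. This is a simplified illustration under stated assumptions, not the published algorithm in full: the call counts are hypothetical, and chains are always concatenated as-is, without choosing which chain ends to join.

```python
def order_procedures(call_counts):
    """Greedy chain merging over an undirected, weighted call graph.

    `call_counts` maps (caller, callee) pairs to invocation counts.
    """
    # Combine calls in both directions into one undirected edge weight.
    weights = {}
    for (u, v), n in call_counts.items():
        if u != v:                                # ignore self-recursion
            key = tuple(sorted((u, v)))
            weights[key] = weights.get(key, 0) + n

    # Every procedure starts in a chain of its own.
    chain_of = {}
    for u, v in call_counts:
        chain_of.setdefault(u, [u])
        chain_of.setdefault(v, [v])

    # Merge the chains joined by each edge, heaviest edge first.
    for u, v in sorted(weights, key=weights.get, reverse=True):
        a, b = chain_of[u], chain_of[v]
        if a is b:
            continue                              # already in the same chain
        a.extend(b)
        for p in b:
            chain_of[p] = a

    # Emit each distinct chain once, in first-seen order.
    layout, seen = [], set()
    for chain in chain_of.values():
        if id(chain) not in seen:
            seen.add(id(chain))
            layout.extend(chain)
    return layout


# Hypothetical profile: (caller, callee) -> number of invocations.
calls = {
    ("main", "parse"): 10,
    ("main", "eval"): 100,
    ("eval", "lookup"): 500,
    ("parse", "error"): 1,
}
layout = order_procedures(calls)
```

The heaviest edge, between `eval` and `lookup`, is merged first, so those two procedures end up adjacent in the layout, and the rarely called `error` is attached last.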

© 2006 Elsevier FlexWare programming environment [Pau02] © 2002 IEEE