Performance Optimization for Embedded Software Presented by: Yingjun Lyu

What is Software Optimization? The process of modifying a software system so that it works more efficiently or uses fewer resources.

Do you Optimize your Program?

When to Optimize? A better approach: design first, code from the design, then profile the code. Keep performance goals in mind throughout.

Levels of Optimization: design level (algorithms and data structures), source code level (e.g. while(1) vs. for(;;)), build level, compile level, assembly level, and run time.

The Code Optimization Process: Build —> Optimize —> Check outputs. Better: Build —> Generate tests —> Optimize —> Check outputs, so the optimized code can be verified against the tests.

Basic C Optimization Techniques. Choose the right data type. Example: if a processor does not support 32-bit multiplication, using a 32-bit type in a multiply forces the compiler to emit a sequence of 16-bit operations. What if only 16-bit precision is needed? Solution: use the smaller type, and use intrinsics to leverage embedded processor features.
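A minimal sketch of the data-type point, assuming a core whose multiplier is only 16 bits wide (function names are hypothetical):

#include <stdint.h>

/* If the target multiplier is 16 bits wide, a 32-bit multiply is expanded
 * by the compiler into several 16-bit multiplies and adds.  When 16-bit
 * precision is all the algorithm needs, declaring the operands as int16_t
 * lets the compiler emit a single native multiply. */
int32_t scale_sample_32(int32_t sample, int32_t gain)
{
    return sample * gain;          /* may expand to a multi-instruction sequence */
}

int32_t scale_sample_16(int16_t sample, int16_t gain)
{
    return (int32_t)sample * gain; /* one 16x16->32 multiply on many embedded cores */
}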

An intrinsic function is a function available for use in a given programming language whose implementation is handled specially by the compiler.
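For example, a sketch assuming a GCC/Clang toolchain, where __builtin_clz is an intrinsic that typically maps to a single count-leading-zeros instruction (e.g. CLZ on ARM) instead of a bit-by-bit C loop; helper names are illustrative:

#include <stdint.h>

/* Count leading zeros via the compiler intrinsic (undefined for 0, so guard it). */
static inline unsigned leading_zeros(uint32_t x)
{
    return x ? (unsigned)__builtin_clz(x) : 32u;
}

/* Normalize a fixed-point value so its most significant bit is set,
 * returning the shift that was applied. */
static inline uint32_t normalize_fixed(uint32_t x, unsigned *shift)
{
    *shift = leading_zeros(x);
    return (*shift < 32u) ? (x << *shift) : 0u;
}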

Function calling conventions Definition: an implementation-level (low-level) scheme for how callees receive parameters from their caller and how they return a result. Stack-based or Register-based?

Restrict and pointer aliasing: when the compiler knows pointers do not alias, it can exploit parallelism.
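A minimal sketch of the C99 restrict qualifier: it promises the compiler that the pointers do not alias, so the iterations are independent and can be scheduled in parallel or vectorized (the function name is illustrative):

void vec_add(float *restrict dst,
             const float *restrict a,
             const float *restrict b,
             int n)
{
    /* With the restrict promise, the compiler may vectorize or software-
     * pipeline this loop without re-loading a[] and b[] after each store. */
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}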

Loops. Communicate loop count information: specify the loop count bounds to the compiler. Example: a hardware loop keeps the loop body in a buffer or prefetches it.
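Loop-count hints are toolchain-specific. As an illustration only, TI-style compilers accept a MUST_ITERATE pragma giving a minimum trip count and a divisibility factor, which lets the compiler unroll the loop or emit a zero-overhead hardware loop safely; check your compiler manual for the exact pragma:

void clear_block(int *buf, int n)   /* caller guarantees n >= 8 and n % 4 == 0 */
{
    /* Toolchain-specific hint: at least 8 iterations, trip count a multiple of 4. */
#pragma MUST_ITERATE(8, , 4)
    for (int i = 0; i < n; i++)
        buf[i] = 0;
}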

General Loop Transformations: loop unrolling, multisampling, partial summation, software pipelining.

Loop unrolling: the loop body is duplicated one or more times, and the loop count is reduced by the same factor to compensate.
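A sketch of four-way unrolling; the cleanup loop handles the remainder when n is not a multiple of the unroll factor (the function name is illustrative):

int sum_unrolled(const int *x, int n)
{
    int sum = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {     /* body duplicated four times */
        sum += x[i];
        sum += x[i + 1];
        sum += x[i + 2];
        sum += x[i + 3];
    }
    for (; i < n; i++)              /* remainder iterations */
        sum += x[i];
    return sum;
}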

Multisampling: compute several independent output values per iteration when their input source data overlaps, so shared inputs are loaded once and reused.
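A sketch of multisampling on a 3-tap moving average, computing two outputs per iteration; the two outputs share overlapping inputs, so each sample is loaded once and reused (assumes n is even and x[] holds n + 2 samples; names are illustrative):

void mavg3_two_per_iter(const short *x, short *y, int n)
{
    for (int i = 0; i < n; i += 2) {
        short x0 = x[i], x1 = x[i + 1], x2 = x[i + 2], x3 = x[i + 3];
        y[i]     = (short)((x0 + x1 + x2) / 3);
        y[i + 1] = (short)((x1 + x2 + x3) / 3);   /* reuses x1 and x2 */
    }
}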

Partial Summation: The computation for one output sum is divided into multiple smaller, or partial, sums.
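A sketch of partial summation on a dot product: four independent accumulators can be updated in parallel on a multi-ALU (e.g. VLIW) core and are combined once at the end (assumes n is a multiple of 4; the function name is illustrative):

long dot_partial_sums(const short *a, const short *b, int n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    for (int i = 0; i < n; i += 4) {
        s0 += (long)a[i]     * b[i];
        s1 += (long)a[i + 1] * b[i + 1];
        s2 += (long)a[i + 2] * b[i + 2];
        s3 += (long)a[i + 3] * b[i + 3];
    }
    return s0 + s1 + s2 + s3;       /* combine the partial sums */
}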

Software pipelining: A sequence of instructions is transformed into a pipeline of several copies of that sequence
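The idea expressed in C (compilers normally do this at the instruction level): the load for iteration i + 1 is issued while the work for iteration i completes, with a prologue that primes the pipeline and an epilogue that drains it (assumes n >= 1; names are illustrative):

void scale_pipelined(const int *x, int *y, int k, int n)
{
    int cur = x[0];                 /* prologue: first load */
    for (int i = 0; i < n - 1; i++) {
        int next = x[i + 1];        /* load for the next iteration ... */
        y[i] = cur * k;             /* ... overlaps with this iteration's work */
        cur = next;
    }
    y[n - 1] = cur * k;             /* epilogue: drain the last value */
}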

Is there any cost for performance optimization?

Example: Loop Unrolling

Code Size Optimization. Why? Code size determines how much memory the code occupies at program run time and how much instruction cache the device needs.

Compiler flags (configure the compiler). Optimize for code size: for example, the command-line option -Os in the GNU GCC compiler. Optimize for performance: -O3. -O3 or -Os? Critical code can be optimized for speed while the bulk of the code is optimized for size.
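One way to mix the two, sketched for a GCC-style toolchain: build the project with -Os and mark only the hot function for speed with GCC's optimize function attribute (function names are hypothetical; an alternative is simply giving the hot source file -O3 in the build system):

/* Hot inner loop: request speed optimization for this function only. */
__attribute__((optimize("O3")))
void hot_filter(const short *in, short *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (short)(in[i] >> 1);
}

/* Cold configuration code: built with the project-wide -Os setting. */
void cold_setup(void)
{
    /* runs once at start-up; size matters more than speed here */
}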

“Premium encodings”: the most commonly used instructions can be represented in a reduced binary footprint. Example: integer add instructions on a 32-bit device are represented with a premium 16-bit encoding. Drawback: performance degradation.

Tuning the ABI for code size. ABI: application binary interface, the interface between a given program and the OS, system libraries, etc. To reduce code size, there are two areas of interest: the calling convention and alignment.

Calling convention: fewer instructions are required to set up parameters passed in registers than parameters passed on the stack.
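An illustrative sketch, assuming an ABI that passes the first four integer arguments in registers (ARM AAPCS is a common example); all names are hypothetical:

int blend4(int a, int b, int c, int d);                /* all arguments travel in registers */
int blend6(int a, int b, int c, int d, int e, int f);  /* e and f are typically written to the
                                                          stack at every call site */

/* Packing the extra parameters behind a pointer keeps the argument
 * count within the register-passing limit. */
struct blend_opts { int e, f; };
int blend4_opt(int a, int b, int c, const struct blend_opts *opts);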

Space-time tradeoff: the cost depends on the unrolling factor; aggressive unrolling increases cache misses and register pressure.

Improve performance through memory layout optimization. Vectorization of loops: computation performed across multiple loop iterations can be combined into single vector instructions.

An important concern for vectorizing is loop dependence analysis: array accesses, data modification, conditional statements, etc. Challenge: pointer aliasing. Solution: place the restrict keyword on the pointers.

Array-of-structures or structure-of-arrays? Hint: memory is most efficiently accessed sequentially.
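A sketch contrasting the two layouts: in the structure-of-arrays form all x values are contiguous, so a loop over x alone walks memory sequentially and vectorizes cleanly, whereas the array-of-structures form strides over the unused y/z fields on every access (names are illustrative):

#define N 1024

struct point_aos { float x, y, z; };
struct point_aos points_aos[N];            /* array of structures */

struct points_soa { float x[N], y[N], z[N]; };
struct points_soa points_soa;              /* structure of arrays */

float sum_x_soa(void)
{
    float s = 0.0f;
    for (int i = 0; i < N; i++)
        s += points_soa.x[i];              /* unit-stride, cache- and SIMD-friendly */
    return s;
}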

Source Code Level Optimization. Performance bug: a bug that causes significant performance degradation. PerfChecker: a performance-bug detection tool for mobile applications (static analysis).

GUI lagging is the most dominant bug type (75.7%), typically caused by long-running operations in the main thread.

View Holder design pattern (Android): cache a list item's child-view references in the adapter so scrolling does not repeatedly look them up, a common fix for GUI lagging.

[1] Oshana and Kraeling. Software Engineering for Embedded Systems: Methods, Practical Techniques, and Applications. Chapter 11: Optimizing Embedded Software for Performance.
[2] Oshana and Kraeling. Software Engineering for Embedded Systems: Methods, Practical Techniques, and Applications. Chapter 12: Optimizing Embedded Software for Memory.
[3] Heydemann, K., Bodin, F., Knijnenburg, P. M. W. and Morin, L. (2006). UFS: a global trade-off strategy for loop unrolling for VLIW architectures. Concurrency and Computation: Practice and Experience, 18: 1413–1434. doi:10.1002/cpe.1014
[4] Yepang Liu, Chang Xu, and Shing-Chi Cheung. 2014. Characterizing and detecting performance bugs for smartphone applications. In Proceedings of the 36th International Conference on Software Engineering (ICSE 2014). ACM, New York, NY, USA, 1013-1024. DOI: http://dx.doi.org/10.1145/2568225.2568229
[5] http://sccpu2.cse.ust.hk/andrewust/files/ICSE2014_presentation.pdf