
OPTIMIZING DSP SCHEDULING VIA ADDRESS ASSIGNMENT WITH ARRAY AND LOOP TRANSFORMATION Chun Xue, Zili Shao, Ying Chen, Edwin H.-M. Sha Department of Computer Science University of Texas at Dallas

Background DSP processors generally provide dedicated address generation units (AGUs) An AGU can eliminate address-arithmetic instructions by modifying an address register in parallel with the current instruction –Three modes: Auto-increment, Auto-decrement, and using a Modify Register

Goal Unlike traditional compilers, DSP compilers should carefully determine the relative location of data in memory to achieve compact object code and improved performance. We propose a scheme that exploits address assignment and scheduling for loops.

Contribution The algorithm combines rotation scheduling, address assignment, and array transformation to minimize both address instructions and schedule length. Compared to list scheduling, AIRLS shows an average reduction of 35.4% in schedule length and an average reduction of 38.3% in address instructions.

AGU Example To calculate C = A + B, with A, B, and C in consecutive memory locations and AR0 initially pointing to A.

Assembly code without AGU:
1 Load *(AR0)
2 Adar AR0, 1
3 Add *(AR0)
4 Adar AR0, 1
5 Stor *(AR0)

Assembly code with AGU:
1 Load *(AR0)+
2 Add *(AR0)+
3 Stor *(AR0)

The address-arithmetic instructions (Adar) are eliminated because the AGU modifies the address register in parallel with the current instruction.
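The saving above can be counted mechanically. Below is a minimal sketch of that cost model, assuming a single address register with post-increment/decrement only, where every move of more than one memory slot costs one explicit address instruction (an Adar); the function name and the layout encoding are my own, not from the paper.

```python
def address_instructions(layout, accesses):
    """Count the explicit address-arithmetic instructions needed to visit
    `accesses` (variable names, in order) given `layout` (name -> slot),
    when adjacent slots are reachable for free via post-inc/dec."""
    cost = 0
    pos = layout[accesses[0]]          # AR starts at the first operand
    for name in accesses[1:]:
        nxt = layout[name]
        if abs(nxt - pos) > 1:         # auto-inc/dec cannot reach it
            cost += 1                  # needs an explicit Adar
        pos = nxt
    return cost

# The slide's example: C = A + B with layout A, B, C
layout = {"A": 0, "B": 1, "C": 2}
print(address_instructions(layout, ["A", "B", "C"]))  # prints 0
```

With a worse layout such as A, C, B the same access sequence would need one explicit address instruction, which is exactly what careful placement avoids.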

A LOOP WITH DFG AND SCHEDULE

SCHEDULE I

SCHEDULE II

PROCESSOR MODEL In a processor with multiple functional units, each functional unit has an accumulator and an address register. Memory can be accessed only indirectly via the address registers, AR 0 through AR k, using indirect addressing with post-increment and post-decrement.

ADDRESS ASSIGNMENT With a careful placement of variables in memory, –the total number of address instructions can be reduced –both code size and timing performance are improved Address assignment – the optimization of the memory layout of program variables –For single-functional-unit processors, this problem has been studied extensively. –However, little research has been done for multiple-functional-unit architectures such as the TI C6x VLIW processors.

PREVIOUS WORK Address assignment was first studied by Bartley and Liao, who modeled the program as a graph-theoretic optimization problem. The problem is proven to be NP-hard. An efficient heuristic is used to find a Maximum Weighted Path Cover of the access graph.
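The path-cover idea can be sketched as follows. This is a hedged reconstruction of a Liao-style greedy heuristic, not the paper's exact algorithm: edges of the access graph are weighted by how often two variables are accessed consecutively, then greedily added to a path cover (rejecting anything that would create a node of degree 3 or a cycle), and the resulting paths are read off as the memory layout. All names are illustrative.

```python
from collections import Counter

def offset_assignment(accesses):
    """Greedy maximum-weight path cover of the access graph; the node
    order along the chosen paths becomes the memory layout."""
    # Edge weights: how often two distinct variables appear consecutively.
    weight = Counter()
    for a, b in zip(accesses, accesses[1:]):
        if a != b:
            weight[frozenset((a, b))] += 1

    degree = Counter()
    comp = {v: v for v in set(accesses)}   # union-find to reject cycles
    def find(v):
        while comp[v] != v:
            v = comp[v]
        return v

    chosen = []
    for edge, _ in weight.most_common():   # heaviest edges first
        a, b = tuple(edge)
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            chosen.append((a, b))
            degree[a] += 1; degree[b] += 1
            comp[find(a)] = find(b)

    # Walk each path end-to-end; the concatenated order is the layout.
    adj = {v: [] for v in set(accesses)}
    for a, b in chosen:
        adj[a].append(b); adj[b].append(a)
    order, seen = [], set()
    for v in adj:
        if len(adj[v]) <= 1 and v not in seen:   # path endpoint or isolated
            prev = None
            while v is not None and v not in seen:
                order.append(v); seen.add(v)
                nxts = [u for u in adj[v] if u != prev]
                prev, v = v, (nxts[0] if nxts else None)
    return {name: i for i, name in enumerate(order)}
```

For the access sequence a b c a b c, the heuristic places a, b, c in consecutive slots (in some direction), so every consecutive access is reachable by auto-increment or auto-decrement.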

LOOP SCHEDULING DFG – Data Flow Graph G=(V, E, OP, d) Static Schedule – the repeated pattern of an execution of an iteration Unfolding – a schedule with unfolding factor f can be obtained by unfolding G f times Rotation – a scheduling technique used to optimize a loop schedule under resource constraints; it iteratively transforms a schedule into a more compact one.

Algorithm Step 1 Put all nodes in the first row of schedule S into a set R Delete the first row of S Shift S up by one control step Record the schedule length of S in L
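Step 1 can be sketched as below, assuming a schedule is represented as a list of control steps, each a list of node names (this representation and the function name are mine, not the paper's).

```python
def rotate_first_row(schedule):
    """Step 1 (sketch): pull the first control step out of the schedule
    and shift the rest up; the removed nodes form the set R that will be
    rescheduled, and L records the length of the shifted schedule."""
    R = set(schedule[0])
    shifted = schedule[1:]            # shifting up = dropping row 0
    return R, shifted, len(shifted)

R, S, L = rotate_first_row([["u1"], ["u2", "u3"], ["u4"]])
print(R, S, L)   # prints {'u1'} [['u2', 'u3'], ['u4']] 2
```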

Algorithm Step 2 Retime each node u in set R: r(u) = r(u) + 1 Update each node with its new computation, e.g. B[i] = A[i] + 5 is transformed into B[i+1] = A[i+1] + 5
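The index update of Step 2 can be sketched as follows, assuming a computation is encoded as (destination array, destination index offset, operation, source accesses) relative to the loop index i; this encoding is an assumption for illustration only.

```python
def retime_computation(comp):
    """Shift every array index in a rotated node's computation by +1,
    e.g. B[i] = A[i] + 5  ->  B[i+1] = A[i+1] + 5."""
    dst, off, op, srcs = comp
    return (dst, off + 1, op, [(name, o + 1) for name, o in srcs])

before = ("B", 0, "+5", [("A", 0)])
print(retime_computation(before))   # prints ('B', 1, '+5', [('A', 1)])
```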

Algorithm Step 3 Generate a new array transformation assignment based on the new computations from Step 2.

Algorithm Step 4 Rotate each node u in set R by putting u into the location with the minimum number of address operations among all available locations. Function to calculate the number of address operations: –AD(x,y) = 0 if x and y are the same location –AD(x,y) = 1 if x and y are adjacent –AD(x,y) = 2 otherwise
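The AD cost function above is simple to state in code. In this sketch the memory layout is passed in explicitly as a name-to-slot map (the slide's AD takes only x and y; the extra parameter and the `best_slot` helper are my additions for illustration).

```python
def AD(layout, x, y):
    """Number of address operations to go from variable x to variable y
    under `layout` (name -> memory slot), per the slide's definition."""
    if layout[x] == layout[y]:
        return 0                               # same location
    if abs(layout[x] - layout[y]) == 1:
        return 1                               # adjacent locations
    return 2                                   # otherwise

def best_slot(layout, candidates, neighbors):
    """Step 4 selection rule (sketch): among candidate locations, pick
    the one minimizing total AD cost to the node's neighboring accesses."""
    return min(candidates,
               key=lambda c: sum(AD(layout, c, n) for n in neighbors))

layout = {"A": 0, "B": 1, "C": 2}
print(AD(layout, "A", "B"))                    # prints 1
print(best_slot(layout, ["A", "C"], ["A", "B"]))  # prints A
```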

EXPERIMENT SETTING Experiments were conducted on DSP filter benchmarks Performed on a PC running Linux All running times were within 10 seconds

EXPERIMENT RESULT 1

EXPERIMENT RESULT 2

EXPERIMENT RESULT 3

RESULT SUMMARY Compared to list scheduling, the reduction in schedule length becomes 45.3% and the reduction in address instructions becomes 40.7%. Compared to rotation scheduling, the reduction in schedule length becomes 24.8% and the reduction in address instructions becomes 40.7%.

FURTHER IMPROVEMENT When the unfolding technique is applied with an unfolding factor of 2, the average reductions in schedule length and in the number of address instructions both increase A higher unfolding factor gives further reduction in both schedule length and address instructions

CONCLUSION We propose an algorithm, AIRLS It utilizes array transformation, address assignment, and rotation scheduling to reduce schedule length and address operations for loops on DSPs with multiple functional units It significantly reduces schedule length and address instructions compared to previous work

FUTURE WORK Consider multiple address registers per functional unit Consider shared address registers Consider the minimum number of address registers needed

Outline Introduction Motivating Example Basic Concepts And Models Algorithm Experiments Conclusion