
C.E. Goutis, V.I. Kelefouras, University of Patras, Department of Electrical and Computer Engineering, VLSI lab. Date: 20/11/2015. Compilers for Embedded Systems, Lecture 7: Integrated Systems of Hardware and Software

Single Instruction Multiple Data (SIMD) (1)

MMX technology
- 8 mmx registers of 64 bits, an extension aliased onto the x87 floating-point registers
- each register can be handled as 8 8-bit, 4 16-bit, 2 32-bit or 1 64-bit operands
- 64 bits of an L1 cache line are loaded into the register file in 1 clock cycle

SSE technology
- 8/16 xmm registers of 128 bits (32-bit architectures support only 8 registers)
- each register can be handled as anything from 16 8-bit to 2 64-bit operands
- 128 bits of an L1 cache line are loaded into the register file in 1 clock cycle

AVX technology
- 8/16 ymm registers of 256 bits (32-bit architectures support only 8 registers)
- each register can be handled as anything from 32 8-bit to 4 64-bit operands
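A minimal sketch of the three register files and their packing, using the standard intrinsic types (the packed values are arbitrary):

#include <mmintrin.h>   /* MMX:  __m64, 64-bit registers              */
#include <emmintrin.h>  /* SSE2: __m128i, 128-bit integer packing     */
#include <immintrin.h>  /* AVX:  __m256, 256-bit registers            */

void widths_demo(void) {
    __m64   m = _mm_set_pi16(4, 3, 2, 1);   /* 4 x 16-bit in one mmx register   */
    __m128i x = _mm_set1_epi8(7);           /* 16 x 8-bit in one xmm register   */
    __m256  y = _mm256_set1_ps(1.0f);       /* 8 x 32-bit floats in one ymm register */
    (void)m; (void)x; (void)y;
    _mm_empty();                            /* clear MMX state (x87 aliasing)   */
}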

Single Instruction Multiple Data (SIMD) (2)

- SSE instructions work only on data written in consecutive main memory addresses
- Aligned load/store instructions are faster than unaligned ones
- MMX instructions have lower latency, but SSE instructions have higher throughput
- MMX instructions are preferred for 64-bit operations; the packing/unpacking overhead may be high
- We can use both mmx and xmm registers
- SSE memory and arithmetic instructions can be executed in parallel
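A minimal sketch of the aligned vs. unaligned distinction (buffer and function names are illustrative):

#include <xmmintrin.h>  /* SSE */

void load_example(void) {
    /* 16-byte aligned allocation so the aligned load is legal */
    float *buf = _mm_malloc(1024 * sizeof(float), 16);
    __m128 a = _mm_load_ps(buf);       /* aligned load: requires 16-byte alignment, faster */
    __m128 u = _mm_loadu_ps(buf + 1);  /* unaligned load: works anywhere, historically slower */
    _mm_store_ps(buf, _mm_add_ps(a, u));
    _mm_free(buf);
}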

Speeding up MVM for regular matrices using SIMD (1)

[Figure: Y = A × X with A (N×N); registers XMM0..XMM5 accumulate y0..y5, while XMM6 and XMM7 hold elements of A and X]

- We select the optimum production-consumption and the sub-optimum data reuse: (Regs − 2) registers hold Y values, 1 register holds A and 1 register holds X, i.e. Regs = m + 2.
- The scheduling with the optimum production-consumption of Y is the optimum one: each register of Y contains more than one Y value which has to be summed, unpacked and stored into memory; thus, by maximizing production-consumption, the number of SSE instructions (both load/store and arithmetic) is minimized.
- The scheduling is shown below.
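A minimal reconstruction of the inner loop this schedule implies, assuming 4-byte floats, m = 6 and N a multiple of 4 (mvm_block, hsum and the loop structure are my naming, not the lecture's code; A and X are assumed 16-byte aligned):

#include <xmmintrin.h>

/* Horizontal sum of one XMM register (option 2 on a later slide). */
static float hsum(__m128 v) {
    float t[4];
    _mm_storeu_ps(t, v);
    return t[0] + t[1] + t[2] + t[3];
}

/* One block step: rows r..r+5 of A times X; the six partial sums live in
 * registers (the slide's XMM0..XMM5), one register streams X (XMM7) and
 * one streams A (XMM6). */
void mvm_block(const float *A, const float *X, float *Y, int N, int r) {
    __m128 y0 = _mm_setzero_ps(), y1 = _mm_setzero_ps(), y2 = _mm_setzero_ps();
    __m128 y3 = _mm_setzero_ps(), y4 = _mm_setzero_ps(), y5 = _mm_setzero_ps();
    for (int j = 0; j < N; j += 4) {
        __m128 x = _mm_load_ps(&X[j]);                                       /* 4 elements of X */
        y0 = _mm_add_ps(y0, _mm_mul_ps(_mm_load_ps(&A[(r + 0) * N + j]), x));
        y1 = _mm_add_ps(y1, _mm_mul_ps(_mm_load_ps(&A[(r + 1) * N + j]), x));
        y2 = _mm_add_ps(y2, _mm_mul_ps(_mm_load_ps(&A[(r + 2) * N + j]), x));
        y3 = _mm_add_ps(y3, _mm_mul_ps(_mm_load_ps(&A[(r + 3) * N + j]), x));
        y4 = _mm_add_ps(y4, _mm_mul_ps(_mm_load_ps(&A[(r + 4) * N + j]), x));
        y5 = _mm_add_ps(y5, _mm_mul_ps(_mm_load_ps(&A[(r + 5) * N + j]), x));
    }
    /* sum the y0..y5 and store the results into Y[] */
    Y[r + 0] = hsum(y0); Y[r + 1] = hsum(y1); Y[r + 2] = hsum(y2);
    Y[r + 3] = hsum(y3); Y[r + 4] = hsum(y4); Y[r + 5] = hsum(y5);
}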

Speeding up MVM for regular matrices using SIMD (2)

[Code slide: the SSE kernel; only its closing fragment survives in the transcript]
// sum the y0, y1, y2, y3, y4, y5 and store the results into Y[]
count += 6;
}

[Figure: same Y = A × X diagram as the previous slide]
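A hypothetical driver consistent with the surviving fragment, advancing six rows of Y per iteration (function names are mine; the kernel is the sketch from the previous slide):

void mvm_block(const float *A, const float *X, float *Y, int N, int r);  /* kernel sketched above */

void mvm_regular(const float *A, const float *X, float *Y, int M, int N) {
    int count = 0;
    while (count + 6 <= M) {
        mvm_block(A, X, Y, N, count);   /* computes Y[count..count+5] */
        count += 6;                     /* matches the fragment's count += 6 */
    }
    for (; count < M; count++) {        /* remainder rows, plain scalar code */
        float s = 0.0f;
        for (int j = 0; j < N; j++)
            s += A[count * N + j] * X[j];
        Y[count] = s;
    }
}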

Speeding up MVM for regular matrices using SIMD (3)

[Figure: same Y = A × X diagram as the previous slides]

There are several ways to sum the XMM0:XMM5 data:
1. accumulate the four values of each XMM register, pack the results into new registers and store each one directly
2. accumulate the four values of each XMM register and store each single value separately
3. pack the XMM0:XMM5 values into new registers in such a way as to add elements of different registers
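As an illustration, option 1 maps naturally onto SSE3 horizontal adds; a sketch for four of the registers (_mm_hadd_ps is a real SSE3 intrinsic, the function name is mine; whether this variant is fastest depends on the microarchitecture):

#include <pmmintrin.h>  /* SSE3: _mm_hadd_ps */

/* Option 1: accumulate within registers, pack the partial sums into new
 * registers and store four Y values with a single store. */
void reduce4_store(__m128 y0, __m128 y1, __m128 y2, __m128 y3, float *Y) {
    __m128 s01 = _mm_hadd_ps(y0, y1);   /* {y0a+y0b, y0c+y0d, y1a+y1b, y1c+y1d} */
    __m128 s23 = _mm_hadd_ps(y2, y3);
    __m128 s   = _mm_hadd_ps(s01, s23); /* {sum y0, sum y1, sum y2, sum y3}      */
    _mm_storeu_ps(Y, s);
}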

Speeding up MVM for regular matrices using SIMD (4)

[Figure: panels a), b), c) showing the three summation schemes of the previous slide]

Speeding up MVM for regular matrices using SIMD (5)

[Figure: same Y = A × X diagram as the previous slides]

- In the case that the m rows of A and the X array do not fit in the L1 data cache, tiling for L1 is applied; a schematic version is sketched below.
- In many general purpose processors, tiling for the data cache is not performance efficient (the extra addressing and load/store instructions degrade performance).
- Tiling for L1 is used only to decrease the number of cache misses of the X array, which is small in any case (Y is stored to memory once and A is fetched only once).
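A schematic, non-SIMD sketch of what tiling the X loop looks like (B, the tile size, is a tuning parameter; note that in this naive form the Y partial sums are revisited once per tile, which is exactly the overhead the slide warns about):

/* Block the j-loop so that X[jj..jj+B) stays resident in the L1 data
 * cache while every row of A streams past it. */
void mvm_tiled(const float *A, const float *X, float *Y, int M, int N, int B) {
    for (int i = 0; i < M; i++) Y[i] = 0.0f;
    for (int jj = 0; jj < N; jj += B) {
        int jmax = (jj + B < N) ? (jj + B) : N;
        for (int i = 0; i < M; i++)
            for (int j = jj; j < jmax; j++)
                Y[i] += A[i * N + j] * X[j];   /* Y accumulated across tiles */
    }
}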

Speeding up BTMVM using SIMD

Regarding SIMD, opt1 and opt2 are not good solutions. The structure of the BT matrix cannot be profitably exploited by a SIMD architecture for two basic reasons:
1. it is more performance efficient to load address-aligned data than unaligned data
2. it is faster to load 128 bits into one XMM register at once than to load fewer bits and apply shuffling operations

To implement opt1 with SSE, four copies of each element of X (for float numbers of 4 bytes each) have to be stored into one XMM register; this needs more than one SSE instruction, as sketched below. To implement opt2 with SSE, more shuffling and packing operations are needed. Thus, reusing the elements of the A array does not lead to high performance when SSE instructions are used.
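For instance, getting four copies of X[i] into one register, as opt1 requires, costs more than a plain vector load; a sketch:

#include <xmmintrin.h>

/* opt1 needs four copies of X[i] in one register: on plain SSE,
 * _mm_load1_ps expands to a scalar load plus a shuffle, i.e. more than
 * one instruction per element, which is the overhead the slide refers to. */
__m128 broadcast(const float *X, int i) {
    return _mm_load1_ps(&X[i]);
}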

Speeding up BTMVM and TMVM using SIMD (1)

[Figure: Y = Arow × X, where Arow is a 1-d array of size 2N − 1 holding elements a0..aN followed by a1'..aN'; XMM0..XMM5 accumulate y0..y5, while XMM6 and XMM7 stream X and Arow]

- For T and BT matrices the same schedule as for regular matrices is used, but matrix A is stored as a 1-d array Arow of size (2 × N − 1). The first N elements of Arow are those of the first column of A in reversed order, i.e. N, N−1, ..., 1, and the next N − 1 elements are the elements of the first row of A except the first one.
- This schedule needs a larger number of SSE instructions than MVM because the elements of A are fetched from non-aligned memory locations.
- On the other hand, the size of A is much smaller, and the m different elements fetched lie in consecutive memory locations (smaller number of cache misses).
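Under that layout the indexing collapses to a single formula; a sketch of the access pattern (my reconstruction from the description above, not the lecture's code):

/* Toeplitz access through the 1-d array described above:
 * Arow[0..N-1]  = first column of A in reversed order,
 * Arow[N..2N-2] = first row of A without its first element,
 * so A[i][j] == Arow[N-1 + j - i]. Row i is the consecutive slice
 * Arow[N-1-i .. 2N-2-i], which is why consecutive rows hit
 * consecutive (but unaligned) memory locations. */
static inline float toeplitz_at(const float *Arow, int N, int i, int j) {
    return Arow[N - 1 + j - i];
}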

Speeding up BTMVM and TMVM using SIMD (2)

[Figure: same Arow diagram as the previous slide]

- When (m + 1) × N ≤ L1 holds (i.e. the m rows of A plus X fit in the L1 data cache), we use the regular MVM routine instead of BTMVM/TMVM (the Toeplitz symmetry is not utilized). In this case MVM is faster than BTMVM/TMVM because it needs a lower number of SSE instructions.
- When (m + 1) × N > L1 holds, we use the BTMVM/TMVM routine. In this case BTMVM/TMVM is faster than MVM because the benefit of MVM's lower number of SSE instructions is outweighed by its higher number of data cache misses. The selection is sketched below.
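The selection rule reduces to a run-time size test; a sketch (L1_SIZE and the routine names are placeholders, not the lecture's identifiers):

#include <stddef.h>

#define L1_SIZE (32 * 1024)   /* placeholder: L1 data cache size in bytes */

void mvm_regular(const float *A, const float *X, float *Y, int M, int N);   /* regular kernel  */
void tmvm(const float *Arow, const float *X, float *Y, int M, int N);       /* Toeplitz kernel */

/* Pick the routine by the slide's criterion: (m + 1) rows' worth of data vs. L1. */
void toeplitz_mvm_dispatch(const float *A, const float *Arow,
                           const float *X, float *Y, int M, int N, int m) {
    if ((size_t)(m + 1) * N * sizeof(float) <= L1_SIZE)
        mvm_regular(A, X, Y, M, N);  /* fewer SSE instructions wins */
    else
        tmvm(Arow, X, Y, M, N);      /* fewer data cache misses wins */
}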

Speeding up BTMVM and TMVM using SIMD (3)

[Code slide: the SSE kernel over Arow; only its closing fragment survives in the transcript]
// sum the y0, y1, y2, y3, y4, y5 and store the results into Y[]
count += 6;
}

[Figure: same Arow diagram as the previous slides]

Speeding up BTMVM and TMVM using SIMD (4)

Speeding up BTMVM and TMVM using SIMD (5)

- For Toeplitz matrices, the MVM problem can also be implemented using the FFT algorithm: we use three FFTs and one element-wise vector multiplication.
- The complexity is O(15n log(2n) + 18n), which is lower than O(n²).
- It is obvious that for medium/large input sizes, MVM using FFT achieves a lower number of instructions.
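A sketch of this approach via the standard circulant embedding, using the FFTW 3 API (the embedding layout and all names here are my assumptions, not the lecture's code; it performs exactly two forward FFTs, one pointwise multiply and one inverse FFT):

#include <fftw3.h>
#include <string.h>

/* y = T x for an N x N Toeplitz matrix given by its first column tc[0..N-1]
 * and first row tr[0..N-1] (tr[0] == tc[0]): embed T in a 2N circulant,
 * then y = first N entries of IFFT( FFT(c) .* FFT([x; 0]) ). */
void toeplitz_mvm_fft(const double *tc, const double *tr,
                      const double *x, double *y, int N) {
    int M = 2 * N;
    double *c  = fftw_alloc_real(M), *xp = fftw_alloc_real(M), *out = fftw_alloc_real(M);
    fftw_complex *C = fftw_alloc_complex(M / 2 + 1), *X = fftw_alloc_complex(M / 2 + 1);

    memcpy(c, tc, N * sizeof(double));          /* circulant first column: [tc; 0; tr[N-1]..tr[1]] */
    c[N] = 0.0;
    for (int k = 1; k < N; k++) c[N + k] = tr[N - k];
    memcpy(xp, x, N * sizeof(double));
    memset(xp + N, 0, N * sizeof(double));      /* zero-pad x to length 2N */

    fftw_plan pc = fftw_plan_dft_r2c_1d(M, c,  C, FFTW_ESTIMATE);
    fftw_plan px = fftw_plan_dft_r2c_1d(M, xp, X, FFTW_ESTIMATE);
    fftw_execute(pc);
    fftw_execute(px);
    for (int k = 0; k <= M / 2; k++) {          /* pointwise complex multiply */
        double re = C[k][0] * X[k][0] - C[k][1] * X[k][1];
        double im = C[k][0] * X[k][1] + C[k][1] * X[k][0];
        X[k][0] = re / M;                       /* fold in FFTW's 1/M scaling */
        X[k][1] = im / M;
    }
    fftw_plan pi = fftw_plan_dft_c2r_1d(M, X, out, FFTW_ESTIMATE);
    fftw_execute(pi);
    memcpy(y, out, N * sizeof(double));         /* first N entries are T x */

    fftw_destroy_plan(pc); fftw_destroy_plan(px); fftw_destroy_plan(pi);
    fftw_free(c); fftw_free(xp); fftw_free(out); fftw_free(C); fftw_free(X);
}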