Single Thread Parallelism

Outline
  Pipelining
  SIMD (both vector and array processing)
  VLIW
  DAE
  HPS
  Data Flow

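An example: Vector processing

The scalar code:
  for i=1,50
    A(i) = (B(i)+C(i))/2;

The vector code:
  lvs  1           ; set the vector stride to 1
  lvl  50          ; set the vector length to 50
  vld  V0,B        ; load B into vector register V0
  vld  V1,C        ; load C into vector register V1
  vadd V2,V0,V1    ; V2 = V0 + V1, element by element
  vshf V3,V2,1     ; shift each element right by 1 (divide by 2)
  vst  V3,A        ; store V3 to A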
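The vector sequence above can be emulated in a few lines of Python. This is only a sketch under stated assumptions: the data are non-negative integers, `vshf ...,1` is read as a right shift by one position, and the flat memory layout (B, then C, then A) and all function names are hypothetical choices made for illustration.

```python
VLEN = 50    # vector length, as set by "lvl 50"
STRIDE = 1   # vector stride, as set by "lvs 1"

def vld(mem, base, vlen=VLEN, stride=STRIDE):
    """Load vlen elements from memory, starting at base, stride apart."""
    return [mem[base + i * stride] for i in range(vlen)]

def vadd(v0, v1):
    """Element-wise add of two vector registers."""
    return [x + y for x, y in zip(v0, v1)]

def vshf(v, amount):
    """Shift every element right by `amount` bits (assumed meaning of vshf)."""
    return [x >> amount for x in v]

def vst(mem, base, v, stride=STRIDE):
    """Store a vector register to memory, starting at base."""
    for i, x in enumerate(v):
        mem[base + i * stride] = x

def scalar_version(B, C):
    """The scalar loop: one add and one divide per iteration."""
    return [(B[i] + C[i]) // 2 for i in range(VLEN)]

def vector_version(B, C):
    """The seven-instruction vector sequence on a hypothetical flat memory."""
    mem = list(B) + list(C) + [0] * VLEN   # B at 0, C at 50, A at 100
    v0 = vld(mem, 0)
    v1 = vld(mem, VLEN)
    v2 = vadd(v0, v1)
    v3 = vshf(v2, 1)
    vst(mem, 2 * VLEN, v3)
    return mem[2 * VLEN:3 * VLEN]

B = list(range(50))
C = list(range(100, 150))
A_scalar = scalar_version(B, C)
A_vector = vector_version(B, C)
```

Each vector helper touches all 50 elements in one call, which mirrors how one vector instruction replaces 50 iterations of the scalar loop.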

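Vector processing example (continued)

Timing is compared for four cases:
  Scalar code (loads take 11 clock cycles)
  Vector code (no vector chaining)
  Vector code (with chaining)
  Vector code (with 2 load, 1 store port to memory)

[Timing diagrams for the four cases are not included in the transcript.]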
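The effect of chaining and extra memory ports can be approximated with a small cycle-count model. Only the 11-cycle load latency comes from the slide; the add, shift, and store latencies, the one-cycle cost of each setup instruction, and the function name are assumptions chosen for illustration, so the totals below are not the slide's own figures.

```python
def vector_cycles(n=50, load=11, add=4, shift=1, store=11, setup=2):
    """Rough cycle counts for the vector sequence on n elements.

    load=11 is taken from the slide. add, shift, store and setup
    (one cycle each for lvs and lvl) are ASSUMED values.
    """
    def full(latency):
        # A pipelined vector operation delivers its first result after
        # `latency` cycles and one more result per cycle after that.
        return latency + n - 1

    # No chaining: every instruction waits for the previous one to finish.
    no_chaining = setup + 2 * full(load) + full(add) + full(shift) + full(store)

    # Chaining, one memory port: add and shift overlap with the second
    # load, but the two loads and the store still serialize on the port.
    # (Valid when add + shift <= n - 1, so the data are ready in time.)
    chaining = setup + 2 * full(load) + full(store)

    # Chaining, 2 load ports + 1 store port: the second load issues one
    # cycle after the first, and add, shift and store chain off it.
    multi_port = setup + 1 + load + add + shift + full(store)

    return no_chaining, chaining, multi_port

no_chaining, chaining, multi_port = vector_cycles()
print(no_chaining, chaining, multi_port)  # 285 182 79
```

Under these assumed latencies, chaining removes the wait for the add and shift, and the extra ports remove the serialization of the two loads.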

The HPS Paradigm

Incorporated the following:
  Aggressive branch prediction
  Speculative execution
  Wide issue
  Out-of-order execution
  In-order retirement

First published in Micro-18 (1985):
  Patt, Hwu, Shebanow: Introduction to HPS
  Patt, Melvin, Hwu, Shebanow: Critical issues
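The pairing of out-of-order execution with in-order retirement can be illustrated with a toy model. This is a sketch, not the HPS design: it assumes all instructions are independent and issue at cycle 0, and the queue standing in for a reorder buffer and the name `simulate` are hypothetical.

```python
from collections import deque

def simulate(latencies):
    """latencies[i] is the execution time of instruction i in program order.

    Returns (completion_order, retired), where retired is a list of
    (instruction index, retirement cycle) pairs.
    """
    n = len(latencies)
    done_cycle = list(latencies)          # all instructions start at cycle 0

    # Execution finishes out of order: shortest latency first.
    completion_order = sorted(range(n), key=lambda i: (done_cycle[i], i))

    # Retirement is in order: only the oldest instruction may retire,
    # and only once it has finished executing.
    window = deque(range(n))
    retired = []
    cycle = 0
    while window:
        cycle += 1
        while window and done_cycle[window[0]] <= cycle:
            retired.append((window.popleft(), cycle))
    return completion_order, retired

completion_order, retired = simulate([11, 1, 4, 1])
```

In this run the 11-cycle instruction at the head finishes last, so the three younger instructions complete early but cannot retire until it does.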

Data Flow

Data-driven execution of instruction-level graphical code
  Nodes are operators
  Arcs are inputs and outputs
Only REAL dependencies constrain processing
  Anti-dependencies don't (write-after-read)
  Output dependencies don't (write-after-write)
NO sequential I-stream (no program counter)
Operations execute ASYNCHRONOUSLY
Instructions do not reference memory (at least memory as we understand it)
Execution is triggered by presence of data
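The firing rule, where a node executes as soon as all of its input tokens are present, can be sketched as a tiny interpreter. The graph encoding, the node names, and the function `run_dataflow` are hypothetical; a real dataflow machine fires ready nodes concurrently, which this sequential sketch does not attempt to model.

```python
import operator

def run_dataflow(nodes, inputs):
    """nodes maps a node name to (operator, list of input arc names).
    inputs maps arc names to their initial tokens.
    Returns (all tokens produced, order in which nodes fired).
    """
    tokens = dict(inputs)
    pending = dict(nodes)
    fired = []
    progress = True
    while pending and progress:
        progress = False
        for name, (fn, sources) in list(pending.items()):
            # A node fires only when every input arc carries a token.
            if all(s in tokens for s in sources):
                tokens[name] = fn(*(tokens[s] for s in sources))
                fired.append(name)
                del pending[name]
                progress = True
    return tokens, fired

# (b + c) / 2 as a two-node graph. The nodes are listed "out of order"
# on purpose: there is no program counter, so listing order is irrelevant.
graph = {
    "half": (lambda s: s // 2, ["sum"]),
    "sum": (operator.add, ["b", "c"]),
}
tokens, fired = run_dataflow(graph, {"b": 6, "c": 10})
```

Here `sum` fires first because its inputs arrive first, and `half` fires only once the `sum` token exists, even though `half` appears earlier in the graph description.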