Supercomputers: Parallel Processing. By Lecturer: Aisha Dawood

Text book "INTRODUCTION TO PARALLEL COMPUTING", by Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar.

Sequential computing: A sequential computer consists of a memory connected to a processor via a data path. All three components (processor, memory, and data path) present a bottleneck to the computational processing rate of the system. This bottleneck has motivated parallelism, both implicit (exploited inside the processor) and explicit (expressed by the programmer).

Motivating Parallelism: The computational power argument (turning transistors into FLOPS). The memory/disk speed argument (DRAM latency, memory bandwidth, and caches). The data communication argument (networked data sources make centralized approaches undesirable).

Parallel and Distributed Computing: Parallel computing (processing): the use of two or more processors (computers), usually within a single system, working simultaneously to solve a single problem. Distributed computing (processing): any computing that involves multiple computers, remote from each other, that each have a role in a computation problem or information processing. Parallel programming: the human process of developing programs that express what computations should be executed in parallel.

Pipelining and Superscalar execution: Processors rely on pipelining to improve execution rates. Pipelining overlaps the various stages of instruction execution (fetch, decode, operand fetch, execute, store). For example, the assembly of a car that takes 100 time units can be broken into 10 pipelined stages, so that once the pipeline is full a car is completed every 10 time units.
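
The tiny Python sketch below works out the car-assembly analogy under assumed numbers (100 units of work, 10 equal stages, 50 cars); it is illustrative only, not taken from the slides.

```python
# A minimal sketch comparing unpipelined and pipelined completion times for the
# car-assembly analogy: 100 time units of work split into 10 equal stages.
# All numbers here are illustrative assumptions.

def unpipelined_time(total_work, n_jobs):
    """Each job must finish all of its work before the next one starts."""
    return total_work * n_jobs

def pipelined_time(total_work, n_stages, n_jobs):
    """With n_stages equal stages, one job completes every total_work / n_stages
    units once the pipeline is full: fill time plus steady-state completions."""
    stage_time = total_work / n_stages
    return total_work + (n_jobs - 1) * stage_time

if __name__ == "__main__":
    work, stages, cars = 100, 10, 50
    t_seq = unpipelined_time(work, cars)
    t_pipe = pipelined_time(work, stages, cars)
    print(f"sequential: {t_seq}, pipelined: {t_pipe}, speedup: {t_seq / t_pipe:.2f}")
```

As the number of cars grows, the speedup approaches the number of stages (10 here).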

Pipelining and Superscalar execution: To increase the speed of a pipeline we break the tasks into smaller subtasks, lengthening the pipeline and increasing the overlap between instructions in execution. This enables faster clock rates, since each task is now smaller; e.g. the Pentium 4, which operates at 2.0 GHz, has a 20-stage pipeline. The speed of a single pipeline is limited by its largest atomic task. Long instruction pipelines need effective techniques for predicting branches, and the penalty of a misprediction grows as the pipeline becomes deeper, since a larger number of in-flight instructions must be flushed.
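
To see why deeper pipelines magnify the misprediction penalty, the following hedged sketch computes an effective CPI for a few assumed pipeline depths; the branch frequency (20%) and misprediction rate (5%) are invented for illustration.

```python
# A hedged sketch of the cost of branch mispredictions: the flush penalty is
# taken to be roughly proportional to pipeline depth, and the workload numbers
# are assumptions, not measured data.

def effective_cpi(base_cpi, branch_frac, mispredict_rate, flush_penalty):
    """Average cycles per instruction once misprediction flushes are included."""
    return base_cpi + branch_frac * mispredict_rate * flush_penalty

if __name__ == "__main__":
    for depth in (5, 10, 20):                  # pipeline depths to compare
        penalty = depth - 1                    # assume a near-full flush on a miss
        cpi = effective_cpi(base_cpi=1.0,
                            branch_frac=0.20,  # 20% of instructions are branches
                            mispredict_rate=0.05,
                            flush_penalty=penalty)
        print(f"depth {depth:2d}: effective CPI = {cpi:.2f}")
```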

Superscalar execution: The ability of a processor to issue multiple instructions in the same cycle is referred to as superscalar execution. An architecture that allows two issues per clock cycle is referred to as two-way superscalar.

Superscalar execution Example 2.1

Superscalar execution: The two Loads are mutually independent, while the Add R1, R2 and the Store depend on them. True data dependency: the result of one instruction may be required by subsequent instructions, e.g. the Load and the Add.

Superscalar execution: Dependencies must be resolved before instructions are issued; this has two implications. First, since the resolution is done at runtime, it must be supported in hardware, and the complexity of that hardware can be high. Second, the amount of instruction-level parallelism is limited and is a function of how the code is written; more parallelism can often be extracted by reordering instructions, as in Example 2.1 (i) and in the sketch below.
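
The sketch below models a highly simplified two-way, in-order issue machine to show how reordering the same six instructions (computing a + b and c + d) reduces the issue cycle count. It is not the textbook's Example 2.1; the pairing rule and one-cycle latencies are assumptions made for illustration.

```python
# A minimal model of a two-way superscalar, in-order issue processor.
# Simplifications: one-cycle latencies, and two adjacent instructions may issue
# together only when the second neither reads nor overwrites the first's result.

def issue_cycles(instructions):
    """Count the cycles a 2-way in-order machine needs for a list of
    (destination, sources) tuples."""
    cycles, i = 0, 0
    while i < len(instructions):
        dst1, _ = instructions[i]
        if i + 1 < len(instructions):
            dst2, src2 = instructions[i + 1]
            if dst1 not in src2 and dst1 != dst2:   # no true/output dependency
                cycles += 1
                i += 2
                continue
        cycles += 1                                  # issue the instruction alone
        i += 1
    return cycles

# Compute (a + b) and (c + d); only the ordering differs between the two lists.
dependent_order = [
    ("R1", []),            # load R1 <- a
    ("R2", []),            # load R2 <- b
    ("R1", ["R1", "R2"]),  # add  R1 <- R1 + R2
    ("R3", []),            # load R3 <- c
    ("R4", []),            # load R4 <- d
    ("R3", ["R3", "R4"]),  # add  R3 <- R3 + R4
]
reordered = [
    ("R1", []), ("R2", []), ("R3", []), ("R4", []),
    ("R1", ["R1", "R2"]), ("R3", ["R3", "R4"]),
]
print("original order :", issue_cycles(dependent_order), "cycles")
print("reordered      :", issue_cycles(reordered), "cycles")
```

Under these assumptions the original ordering needs 4 issue cycles, while the reordered version, which groups the independent loads, needs only 3.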

Superscalar execution: Consider the scheduling of two floating-point operations on a dual-issue machine with a single floating-point unit: the instructions cannot be issued together because they compete for a single processor resource; this is called a resource dependency. For a conditional branch instruction, the branch destination is known only at execution time, so scheduling instructions past the branch can lead to errors; this is referred to as a branch dependency (or procedural dependency) and is handled by rolling back incorrectly issued instructions.

Superscalar execution: Accurate branch prediction is critical for efficient superscalar execution. The ability of a processor to detect and schedule concurrent instructions is also critical to superscalar performance (Example 2.1 (iii)); instruction issue may be in-order or out-of-order (dynamic instruction issue). When examining the execution of a program on a superscalar architecture (assuming the multiply-add units of Example 2.1), unused issue slots appear as vertical waste (cycles in which nothing issues) and horizontal waste (cycles that are only partly filled); a sketch follows.
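
A small, hedged illustration of the vertical/horizontal waste terminology follows; the per-cycle issue counts are invented for a hypothetical two-way machine.

```python
# Vertical waste: whole cycles in which nothing issues.
# Horizontal waste: empty slots in cycles that issue fewer than issue_width
# instructions. The issue trace below is an assumption for illustration.

def waste(issue_width, issued_per_cycle):
    slots = issue_width * len(issued_per_cycle)
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    horizontal = sum(issue_width - n for n in issued_per_cycle if 0 < n < issue_width)
    used = sum(issued_per_cycle)
    return used / slots, vertical / slots, horizontal / slots

utilization, v, h = waste(2, [2, 1, 0, 2, 1, 0, 2, 2])
print(f"utilization {utilization:.0%}, vertical waste {v:.0%}, horizontal waste {h:.0%}")
```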

Very Long Instruction Word (VLIW) Processors: The parallelism exploited by superscalar processors is often limited by the instruction look-ahead window. VLIW processors instead rely on the compiler to resolve dependencies and resource availability at compile time.

Very Long Instruction Word (VLIW) Processors: VLIW has advantages and disadvantages. Since scheduling is done in software, the decoding and instruction-issue hardware is simpler, and compilers can optimize over a larger scope than a hardware issue unit. However, compilers do not have the dynamic program state available to make scheduling decisions, which restricts them to more static prediction schemes, and runtime situations such as stalls on data fetches caused by cache misses are difficult to predict. VLIW performance is therefore very sensitive to the compiler's ability to detect dependencies. Superscalar processors, in turn, can exploit only limited parallelism compared to what a compiler can expose for a VLIW processor.

Limitations of memory system performance: The effective performance of a computer relies not just on the processor speed but also on the memory system's ability to feed data to the processor. A memory system takes in a request for a word and returns a block of size b containing the requested word after l nanoseconds; l is referred to as the latency. The rate at which data can be pumped to the processor determines the bandwidth (in the water-hose analogy, latency is how long the first drop takes to arrive and bandwidth is how much water flows per second).
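
A back-of-the-envelope sketch of how latency alone bounds performance follows; the figures (a 1 GHz processor performing one operation per fetched word, 100 ns DRAM latency) are assumptions for illustration, not values from the slides.

```python
# If the processor must wait l ns for every data word, it cannot compute faster
# than one word's worth of work per l ns, regardless of its clock rate.

clock_ghz = 1.0          # assumed peak of one operation per ns
dram_latency_ns = 100.0  # assumed time to fetch one word from memory
ops_per_word = 1         # assumed one operation per fetched operand

peak_gflops = clock_ghz * ops_per_word
latency_bound_gflops = ops_per_word / dram_latency_ns  # one word every 100 ns

print(f"peak: {peak_gflops} GFLOPS, latency-bound: {latency_bound_gflops} GFLOPS")
```

Under these assumed numbers the latency-bound rate is 0.01 GFLOPS, two orders of magnitude below the peak rate, which is why caches and other latency-hiding techniques matter.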

Improving memory latency using caches: One memory system innovation that addresses the speed mismatch between the processor and DRAM is the cache: a low-latency, high-bandwidth storage placed between them. The fraction of data references satisfied by the cache is called the cache hit ratio. When the computation rate of an application is limited by the rate at which data can be pumped into the CPU, the computation is said to be memory bound. The performance improvement from caches rests on the assumption that data are referenced repeatedly within a small time window; this is called temporal locality.
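
A minimal sketch of how the hit ratio shapes the average memory access time follows; the 1 ns cache latency and 100 ns DRAM latency are assumed values.

```python
# Hits are served from the fast cache; misses still pay the full DRAM latency,
# so the average access time interpolates between the two.

def average_access_time(hit_ratio, cache_latency_ns, dram_latency_ns):
    return hit_ratio * cache_latency_ns + (1.0 - hit_ratio) * dram_latency_ns

for hit_ratio in (0.0, 0.8, 0.9, 0.99):
    t = average_access_time(hit_ratio, cache_latency_ns=1.0, dram_latency_ns=100.0)
    print(f"hit ratio {hit_ratio:.2f}: average access time {t:5.1f} ns")
```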