Processor Level Parallelism


Parallelism Levels
Levels at which we can attack parallelism:

Bit Level Parallelism
- Circuits process bits in parallel

Instruction Level Parallelism
- The organization level may process instructions in parallel

Higher Levels
- Thread Level: ability to run multiple simultaneous streams of instructions
- Task Level: ability to run parts of a program on different chips
- Application Level: run separate jobs on different machines

Process vs Thread
Process: a program in execution
- Own memory space
- Has at least one thread

Multitasking
- Done on single cores running multiple programs
- OS handles the switch
- "Large" chunks of time per program
- Cache flushed on switch

Process vs Thread
Thread: an instruction sequence
- Own registers and stack
- Shares memory with the other threads in its process

Threaded Code Demo…

Resource Usage
- Four threads running in a 4-wide pipeline
- Can't always fill all 4 issue slots
- Bubbles arise from memory accesses, page faults, etc.
[Figure: issue slot utilization over time]

Multithreading
- Alternate or combine threads to maximize use of the processor
- Finer timescale than OS multitasking
- Cache contents maintained across switches
Hardware required:
- Multiple register sets
- Tracking the "owner" thread of in-flight pipeline instructions

Coarse-Grained Multithreading
- Threads run for a number of cycles
- Must drain the pipeline before switching

Coarse-Grained Multithreading: Single Pipeline
- Assumption: 1 cycle to retire after a stall
[Figure: threads to run, scheduled on a single pipeline over time]

Coarse-Grained Multithreading: Dual Pipeline
- Assumption: 1 cycle to retire after a stall
[Figure: threads to run, scheduled on a dual pipeline over time]

Latency vs Throughput
- Multithreading favors throughput over latency: an individual thread may finish later, but more total work completes per unit time

Fine-Grained Multithreading
- Hardware can switch to a new thread each cycle without draining the pipeline

Fine-Grained Multithreading: Single Pipeline
- Assumption: switches every cycle
[Figure: threads to run, scheduled on a single pipeline over time]

Fine-Grained Multithreading: Dual Pipeline
- Assumption: switches every cycle
[Figure: threads to run, scheduled on a dual pipeline over time]

SMT
SMT: Simultaneous Multithreading (Intel's implementation is branded Hyper-Threading)
- Issue ops from multiple threads in the same cycle
[Figure: issue slots over time]

SMT
- Try to start the next thread early if there is a spare pipeline slot
- In the example, C gets to jump in early because B2 is not ready
[Figure: threads to run vs time]

SMT
- Otherwise, switch like fine-grained
- In the example, C gets a full turn; A is up next
[Figure: threads to run vs time]

SMT
- Still constrained by load delays
- In the example, C5 and B3 are not ready until cycle 8; A7 is not ready until cycle 9
[Figure: threads to run vs time]

SMT Challenges
- Resources must be duplicated or split between threads
- Split them too thin and performance suffers
- Duplicate everything and you are no longer maximizing use of the hardware

Intel vs AMD
- Variations on SMT

Processor Level Parallelism Styles

Processor Parallelism
- Process parallelism: run multiple instruction streams simultaneously

Flynn's Taxonomy
Categorization of architectures based on:
- Number of simultaneous instruction streams
- Number of simultaneous data items

SISD
SISD: Single Instruction – Single Data
- One instruction at a time
- One piece of data at a time
- May still be pipelined or superscalar

SIMD
SIMD: Single Instruction – Multiple Data
- One instruction
- Multiple pieces of data

SIMD Roots: ILLIAC IV
- One instruction issued to 64 processing units

SIMD Roots: Cray-1
- Vector processor
- One instruction applied to all elements of a vector register

Modern SIMD: x86 Processors
SSE units: Streaming SIMD Extensions
- Operate on special 128-bit registers, treated as:
- 4 × 32-bit chunks
- 2 × 64-bit chunks
- 16 × 8-bit chunks
- …

MISD
MISD: Multiple Instruction – Single Data
- One piece of data processed by multiple instruction streams
- Rare
- Example: the Space Shuttle's five processors handle fly-by-wire input and vote on the result

MIMD
MIMD: Multiple Instruction – Multiple Data
- Multiple instruction streams operating on multiple pieces of data

MIMD Examples
- Multi-core processors
- Supercomputers
- Computational grids

Coupling and Topologies
MIMD systems differ in:
- How tightly connected the nodes are
- How shared the memory is

BlueGene http://s.top500.org/static/lists/2012/11/TOP500_201211_Poster.png

BG/P
- Full system: a 72 × 32 × 32 torus of nodes

COW Cluster of Workstations