Parallelism in Processors: Several Approaches
Copyright © 2005-2011 Curt Hill


Why Parallelism?
There is never enough processor speed.
Performance gains come from two areas:
–Better integration technology
–Better implementation of parallelism
The next two graphics show this.

Chip Performance
(graphic not captured in the transcript)

Gains From Parallelism
(graphic not captured in the transcript)

Summary
The bulk of the gains have come from faster and smaller components.
A significant amount has come from parallelism.
Parallelism has also offset the greater complexity of the instruction set.

Approaches
Instruction-level parallelism
–Instructions operate in parallel
–Pipelining
Data parallelism
–Vector processors
Processor-level parallelism
–Multiple CPUs

First Attempt
One bottleneck is that fetching instructions from memory is slow.
The processor is usually an order of magnitude faster, and usually faster than the cache as well.
Therefore, have a fetch engine that fetches instructions all the time.
This is the prefetch buffer.

Prefetch Buffer
Don't wait for the current instruction to finish:
–Fetch the next instruction as soon as the current one arrives
This scheme can guess wrong, since a goto or branch makes the next instruction difficult to predict.
You may also fetch in both directions and discard the unused path.
–The fetched instructions are stored in the prefetch buffer
A sketch of the idea follows.
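A minimal sketch of the idea in C++, assuming a toy model in which a memory fetch takes several cycles; all names and latencies are hypothetical, chosen only to show the fetch engine running independently of execution:

```cpp
#include <cstdio>
#include <queue>

// Toy model: a fetch engine fills a small prefetch buffer while the
// execute unit works, so instruction fetch overlaps with execution.
int main() {
    const int FETCH_LATENCY = 2;     // cycles per memory fetch (assumed)
    const int EXEC_LATENCY  = 3;     // cycles per instruction executed (assumed)
    const unsigned CAPACITY = 4;     // prefetch buffer slots
    std::queue<int> prefetch_buffer; // fetched instruction addresses
    int next_addr = 0, fetch_timer = 0, exec_timer = 0;

    for (int cycle = 0; cycle < 15; ++cycle) {
        // Fetch engine: always running, independent of execution.
        if (prefetch_buffer.size() < CAPACITY && ++fetch_timer == FETCH_LATENCY) {
            prefetch_buffer.push(next_addr++);
            fetch_timer = 0;
        }
        // Execute unit: takes its next instruction from the buffer,
        // never waiting on memory as long as the buffer is non-empty.
        if (!prefetch_buffer.empty() && ++exec_timer == EXEC_LATENCY) {
            printf("cycle %2d: executed instruction at %d\n",
                   cycle, prefetch_buffer.front());
            prefetch_buffer.pop();
            exec_timer = 0;
        }
    }
}
```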

Two Stages
Now we have two independent pieces:
The instruction fetch mechanism
–Using the prefetch buffer
The instruction execute mechanism
–This is where most of the work is done
This generalizes into a pipeline of several stages.

Pipelines
Each of the following is a stage:
–Fetch the instruction
–Decode the instruction
–Locate and fetch operands
–Execute the operation
–Write the results back
These may belong to separate hardware units that operate in parallel.

Example
All of this goes on in parallel:
–Fetch instruction 8
–Decode instruction 7
–Fetch operands for instruction 6
–Execute instruction 5
–Write back data for instruction 4
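A minimal cycle-by-cycle sketch in C++ of that five-stage overlap; the stage names follow the slide, and the instruction count is arbitrary:

```cpp
#include <cstdio>

// Toy five-stage pipeline: each cycle, every stage works on a
// different instruction, exactly as in the slide's snapshot.
int main() {
    const char* stages[5] = {"Fetch", "Decode", "Operands", "Execute", "Writeback"};
    const int NUM_INSTRUCTIONS = 8;

    for (int cycle = 0; cycle < NUM_INSTRUCTIONS + 4; ++cycle) {
        printf("cycle %2d:", cycle + 1);
        for (int s = 0; s < 5; ++s) {
            int instr = cycle - s;              // instruction in stage s this cycle
            if (instr >= 0 && instr < NUM_INSTRUCTIONS)
                printf("  %s(i%d)", stages[s], instr + 1);
        }
        printf("\n");
    }
}
```

At cycle 8 the output matches the slide: instruction 8 is being fetched while instruction 4 writes back.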

A Simulator
(demonstration not captured in the transcript)

Superscalar Architectures
Have a single fetcher drive two different pipelines, each consisting of these stages.
Decode through write-back occur in parallel on two or more separate pipelines.
This is the Pentium approach:
–The main pipeline can handle anything
–The second pipeline can handle integer operations or simple floating-point operations
–"Simple" meaning things like loads and stores for the floating-point unit
A sketch of the issue decision follows.
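A minimal sketch in C++ of the pairing decision such a design implies; the Instruction struct and the rule shown are simplified assumptions for illustration, not the actual Pentium pairing rules:

```cpp
#include <cstdio>

// Simplified dual-issue check: the second pipe only takes "simple"
// instructions, and the pair must not have a register dependence.
struct Instruction {
    int dest, src;      // register numbers
    bool simple;        // eligible for the second pipe?
};

bool can_issue_together(const Instruction& a, const Instruction& b) {
    bool dependent = (b.src == a.dest) || (b.dest == a.dest);
    return b.simple && !dependent;
}

int main() {
    Instruction add   = {1, 2, true};   // r1 = f(r2)
    Instruction use   = {3, 1, true};   // r3 = f(r1): depends on add
    Instruction other = {4, 5, true};   // r4 = f(r5): independent
    printf("add+use   pair: %s\n", can_issue_together(add, use)   ? "yes" : "no");
    printf("add+other pair: %s\n", can_issue_together(add, other) ? "yes" : "no");
}
```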

CDC 6600
Just the execute stage is parallel.
This only works well if the execute step takes longer than the other steps.
This is particularly true for floating-point and memory-access instructions.
The 6600 had multiple I/O and floating-point processors that could execute in parallel.
–This was one of Seymour Cray's machines at CDC in the 1960s

Problems?
Pipelining needs some instruction independence to work optimally.
If instructions A, B, and C are consecutive, B depends on the result of A, and C depends on the result of B, we may have a problem with either approach.
The operand fetch of B cannot complete until the write-back of A, stalling the whole pipeline.
However, the average mix of instructions tends not to have these hard dependencies in every instruction.
Compilers can also optimize by reordering the generated code, as in the sketch below.
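A source-level illustration in C++ of both situations; real scheduling happens on the compiled instructions, so this is only a sketch of the idea:

```cpp
// Dependent chain: each line needs the previous result, so the
// operand fetch of one instruction waits on the write-back of the
// one before it, stalling the pipeline.
int chained(int b, int c, int e, int g) {
    int a = b + c;
    int d = a + e;   // must wait for a
    int f = d + g;   // must wait for d
    return f;
}

// The same work interleaved with independent operations: a compiler
// can schedule these between the dependent pairs so the pipeline has
// something to execute while each result is still in flight.
int interleaved(int b, int c, int e, int g, int y, int z) {
    int a = b + c;
    int x = y + z;   // independent: fills the slot after a
    int d = a + e;
    int w = x + z;   // independent: fills the slot after d
    int f = d + g;
    return f + w;    // keep x and w live
}
```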

Problem Example
(graphic not captured in the transcript)

Limits on Instruction-Level Parallelism
There is a limit on the gains:
–The more stages, the less likely the instruction sequence will be suitable
–The more expensive the recovery from a mistake
–Dividing instruction processing into too many stages leaves too little work for each stage
–The more complicated the processor, the more heat it generates

Chip Power Consumption
(graphic not captured in the transcript)

Operating System Parallelism
Next we need the types of parallel processing enabled by the OS.
This usually involves multiple processes and threads.
Several flavors:
–Uniprocessing
–Hyperthreading
–Multiprocessing

Uniprocessing
Single CPU, but the appearance of multiple simultaneous tasks.
Permissive (cooperative):
–Any system call allows the current task to be suspended and another started
–Windows 3
Preemptive:
–A task is suspended when it makes a system call that could require waiting
–A time slice occurs
Applies whether the processor is scalar, array, or vector.

Multiple Processors: Multiprocessing
Real multiprocessing involves multiple CPUs.
Multiple CPUs can be executing different jobs.
They may also work in the same job, if it allows.
The CPUs are almost completely independent.
–They may share memory, disk, or both

Multiprocessors
Two or more CPUs with shared memory.
Multiprocessors generally need both hardware and OS support.
This technique has been used since the 1960s.
The idea is that two CPUs can outperform one, as in the sketch below.
It will become even more important.
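A minimal shared-memory sketch in C++ of two CPUs cooperating on one job; std::thread is the standard-library facility, and the workload is illustrative:

```cpp
#include <cstdio>
#include <numeric>
#include <thread>
#include <vector>

// Two threads (ideally on two CPUs) each sum half of a shared array.
int main() {
    std::vector<int> data(1'000'000, 1);
    long long half1 = 0, half2 = 0;
    size_t mid = data.size() / 2;

    std::thread worker([&] {     // runs on a second CPU if one is available
        half1 = std::accumulate(data.begin(), data.begin() + mid, 0LL);
    });
    half2 = std::accumulate(data.begin() + mid, data.end(), 0LL);  // first CPU
    worker.join();               // wait for the second CPU to finish

    printf("sum = %lld\n", half1 + half2);
}
```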

Half Way: Hyperthreading
Hyperthreaded CPUs are a transitional form.
There is one CPU with two register sets.
The CPU alternates between the register sets during execution, giving better concurrency than a uniprocessor.
Windows XP considers it two CPUs, as the check below illustrates.
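Because the OS sees two logical processors, portable code reports two; std::thread::hardware_concurrency is the real standard-library call, the interpretation in the comment is ours:

```cpp
#include <cstdio>
#include <thread>

// Reports the number of logical processors the OS exposes. On a
// hyperthreaded single-core machine this returns 2, even though
// there is only one physical core underneath.
int main() {
    unsigned n = std::thread::hardware_concurrency();
    printf("logical CPUs visible to the OS: %u\n", n);
}
```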

Multi-Tasking
Operating System:
–There are multiple processes
–Each has its own memory
–In a single-CPU system, a process executes until:
It is waiting for I/O
It has used its time slice
Something with higher priority is now ready
–When a process is suspended, a queue of processes waiting to execute is examined; the first is chosen and executed (see the dispatcher sketch below)
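A minimal sketch of that dispatch loop in C++; the Process struct and the time-slice-only suspension are hypothetical simplifications:

```cpp
#include <cstdio>
#include <deque>
#include <queue>

// Toy dispatcher: when the running process is suspended (here,
// because its time slice expired), examine the queue of waiting
// processes and run the first one.
struct Process { int pid; };

int main() {
    std::deque<Process> waiting = {{1}, {2}, {3}};
    std::queue<Process> ready(waiting);        // processes awaiting the CPU

    for (int slice = 0; slice < 5; ++slice) {
        Process current = ready.front();       // choose the first waiting process
        ready.pop();
        printf("time slice %d: running pid %d\n", slice, current.pid);
        ready.push(current);                   // slice used up: back of the queue
    }
}
```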

Multiple CPUs
–Updating this to multiple CPUs mostly requires that both CPUs cannot be running in the dispatcher at the same time
–This requires some type of exclusive instruction, and the dispatcher must use it (a sketch follows)
–Windows 95 and DOS cannot do this
–Windows NT, OS/2, and UNIX can
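A minimal sketch in C++ of the kind of exclusive instruction involved: std::atomic_flag compiles down to an atomic test-and-set, which is what lets one CPU lock the other out of the dispatcher. The dispatcher body here is a placeholder:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// test_and_set is an atomic read-modify-write: only one CPU at a
// time can win it, so only one CPU can be inside the dispatcher.
std::atomic_flag dispatcher_lock = ATOMIC_FLAG_INIT;

void enter_dispatcher(int cpu) {
    while (dispatcher_lock.test_and_set(std::memory_order_acquire))
        ;                                   // spin until the other CPU leaves
    printf("CPU %d is in the dispatcher\n", cpu);  // placeholder for dispatch work
    dispatcher_lock.clear(std::memory_order_release);
}

int main() {
    std::thread other([] { enter_dispatcher(1); });
    enter_dispatcher(0);
    other.join();
}
```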

MPU Loss
Because of the need to have one CPU lock out the other in certain instances, two CPUs never perform at the level of a single CPU that is twice as fast.
–90% efficiency seems to be typical
–Thus an MPU with two 1 GHz processors will perform similarly to a 1.8 GHz uniprocessor
More than two CPUs yields more loss.
Most servers are duals or more.
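The arithmetic behind that figure, taking the slide's 90% efficiency as given:

\[
\text{effective speed} \approx n \cdot f \cdot e = 2 \times 1\,\text{GHz} \times 0.9 = 1.8\,\text{GHz}
\]

where \(n\) is the number of CPUs, \(f\) the clock rate of each, and \(e\) the efficiency factor.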

Multiprocessors Again
Before the Pentium, a multiprocessor needed extra hardware to prevent the CPUs from committing a race error of some sort.
Pentium CPUs could share four pins, and that was all the hardware support that was needed.
The next advance was multicore chips.

Multicore Chips
Instead of one very fast CPU on a chip, put two not-so-fast CPUs.
These are the multicore chips.
They actually remove some of the complexity of pipelining to make each core smaller, and also use a slower, cooler technology.

Manufacturers' Offerings
Intel's HyperThreading chips were a transitional form.
AMD and Intel dual-core processors became available in 2005.
Sun had a four-core SPARC to be released.
Microsoft changed its licensing to be per chip, so that a multicore chip is considered one processor.

Disadvantages
The bus to memory becomes the bottleneck.
Several things access the memory independently: two or more CPUs, plus Direct Memory Access controllers (disk controllers, video).
One solution is dual-port memory.
Separate caches can also help.
Another solution is to give each processor its own local, private memory, but this diminishes the kind of sharing that can go on.

Chip MultiProcessors
(graphic not captured in the transcript)

Multicomputers
When the number of connections gets large, sharing memory gets hard.
A multicomputer consists of many parallel processors, each with its own memory and disk.
Communication is then accomplished by messages sent from one processor to all, or from one to another, as in the sketch below.
Grid computing is one alternative.
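A minimal message-passing sketch using MPI, a standard library for exactly this style of node-to-node communication (MPI is our choice of example here, not named in the slides); MPI_Send and MPI_Recv are real MPI calls, while the payload and ranks are illustrative:

```cpp
// Build with an MPI wrapper, e.g.: mpicxx msg.cpp && mpirun -np 2 ./a.out
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      // which node am I?

    if (rank == 0) {                           // node 0 sends a value...
        int payload = 42;
        MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {                    // ...node 1 receives it
        int payload = 0;
        MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("node 1 received %d from node 0\n", payload);
    }
    MPI_Finalize();
    return 0;
}
```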

Conclusion
Moore's Law has not been about just better integration techniques.
Parallelism within a single CPU and across multiple CPUs has also contributed.
Pipelining has been the major technique for single CPUs.
There are other presentations on multicomputer and multiprocessor systems.