Copyright © 2005-2011 Curt Hill

Parallelism in Processors
Several Approaches

Why Parallelism?
The simple fact is that there is never enough processor speed
Performance gains come from two areas:
–Better integration technology
–Better implementation of parallelism
The next two graphics show this

Chip Performance

Gains From Parallelism

Summary
The bulk of the gains have come from faster and smaller components
A significant amount has come from parallelism
Parallelism has also offset the greater complexity of the instruction set

Approaches
Instruction level parallelism
–Instructions operate in parallel
–Pipelining
Data parallelism
–Vector processors
Processor level parallelism
–Multiple CPUs

First Attempt
One bottleneck is that fetching instructions from memory is slow
The processor is usually an order of magnitude faster than memory
–It is usually faster than the cache as well
Therefore have a fetch engine that gets instructions all the time
This uses the prefetch buffer

Prefetch Buffer
Don’t wait for the current instruction to finish
–Fetch the next instruction as soon as the current instruction arrives
This scheme can make a mistake, since a goto or branch makes the next instruction difficult to guess
You may also fetch in both directions and discard the unused instructions
–These are stored in the prefetch buffer

Two Stages
Now we have two independent pieces:
The instruction fetch mechanism
–Using the prefetch buffer
The instruction execute mechanism
–This is where most of the work is done
This generalizes into a pipeline of several stages

Pipelines
Each of the following is a stage:
–Fetch the instruction
–Decode the instruction
–Locate and fetch operands
–Execute the operation
–Write the results back
These may belong to separate hardware units that operate in parallel

Example
All of this goes on in parallel:
–Fetch instruction 8
–Decode instruction 7
–Fetch operands for instruction 6
–Execute instruction 5
–Write back data for instruction 4

A Simulator

Superscalar Architectures
Have a single fetcher drive two different pipelines, each of which consists of these stages
Decode through write-back occurs in parallel on two or more separate pipelines
This is the Pentium approach:
–The main pipeline can handle anything
–The second pipeline can handle integer operations or simple floating point operations
–Simple, such as a load or store from the floating point processor

CDC 6600
Only the execute stage is parallel
This works well only if the execute step takes longer than the other steps
That is particularly true for floating point and memory access instructions
The 6600 had multiple I/O and floating point units that could execute in parallel
–This was one of Seymour Cray’s machines of the 1960s

Problems?
Pipelining needs some instruction independence to work optimally
If instructions A, B, C are consecutive, B depends on the result of A, and C depends on the result of B, we may have a problem with either approach
–The operand fetch of B cannot complete until the write-back of A, stalling the whole pipeline
However, the average mix of instructions tends not to have these hard dependencies in every instruction
Compilers can also optimize by reordering the instructions they emit for an expression

Problem Example

Limits on Instruction Level Parallelism
There is a limit on the gains:
–The more stages, the less likely that the instruction sequence will be suitable
–The more expensive the recovery from a mistake
–Dividing instruction processing into more than 10-20 stages leaves too little work for each stage
–The more complicated the processor, the more heat it generates

Chip Power Consumption

Operating System Parallelism
Next we need the types of parallel processing enabled by the OS
This usually involves multiple processes and threads
Several flavors:
–Uniprocessing
–Hyperthreading
–Multiprocessing

UniProcessing
Single CPU, but apparently multiple tasks
Permissive
–Any system call allows the current task to be suspended and another started
–Windows 3
Preemptive
–A task is suspended when it makes a system call that could require waiting
–A time slice occurs

MultiProcessing
Real multiprocessing involves multiple CPUs
Multiple CPUs can be executing different jobs
They may also be in the same job, if it allows
The CPUs are almost completely independent
–They may share memory or disk or both

Multiprocessors
Two or more CPUs with shared memory
Multiprocessors generally need both hardware and OS support
This technique has been used since the 60s
The idea is that two CPUs can outperform one
It will become even more important

Half Way: HyperThreading
HyperThreading CPUs are a transitional form
There is one CPU with two register sets
The CPU alternates between register sets in execution, thus giving better concurrency than a uniprocessor
Windows XP considers it two CPUs

Multi-Tasking
Operating system view:
–There are multiple processes
–Each has its own memory
–In a single-CPU system a process executes until it is waiting for I/O, has used its time slice, or something with higher priority is now ready
–When a process is suspended, the queue of processes waiting to execute is examined; the first is chosen and executed

Multiple CPUs
–Updating this to multiple CPUs mostly requires that both CPUs cannot be running in the dispatcher at the same time
–This requires some type of exclusive instruction, and the dispatcher must use it
–Windows 95 and DOS cannot
–Windows NT, OS/2 and UNIX can

MPU Loss
Because of the need to have one CPU lock out the other in certain instances, two CPUs never perform at the level of one that is twice as fast
–90% seems to be average
–Thus an MPU with two 1 GHz processors will perform similarly to a 1.8 GHz uniprocessor
More than two CPUs yields more loss
Most servers are duals or more

Multiprocessors Again
Before the Pentium, a multiprocessor needed extra hardware to prevent the CPUs from causing a race error of some sort
The Pentium could share four pins, and that was all the hardware support that was needed
The next advance was multicore chips

Multicore Chips
Instead of one very fast CPU on a chip, put on two not-so-fast CPUs
These are the multicore chips
They actually remove some of the complexity of pipelining to make each core smaller, and also use a slower and cooler technology

Manufacturers’ Offerings
Intel’s HyperThreading chips were a transitional form
AMD and Intel dual-core processors became available in 2005
Sun had a 4-core SPARC to be released in 2005-2006
Microsoft changed its license to be per chip, so that a multi-core chip is considered one processor

Disadvantages
The bus to memory becomes the bottleneck
Several things access memory independently: two or more CPUs, plus Direct Memory Access controllers (disk controllers, video)
One solution is dual-port memory
Separate caches can also help
Another solution is to give each processor its own local, private memory, but this diminishes the sharing that can go on

Chip MultiProcessors

Multicomputers
When the number of connections gets large, sharing memory gets hard
A multicomputer consists of many parallel processors, each with its own memory and disk
Communication is accomplished by messages sent from one to all or from one to another
Grid computing is one alternative

Conclusion
Moore’s Law has not been about just better integration techniques
Parallelism within a single CPU and across multiple CPUs has also contributed
Pipelining has been the major technique for single CPUs
There are other presentations on multicomputer and multiprocessor systems