Classic Model of Parallel Processing


1 Classic Model of Parallel Processing
An example parallel process with a serial time of 10 units:
- Multiple processors available (4)
- A process can be divided into serial and parallel portions; the parallel parts are executed concurrently
- Serial time: 10 time units; parallel time: 4 time units
- S: the serial (non-parallel) portion
- A: all A parts can be executed concurrently
- B: all B parts can be executed concurrently
- All A parts must be completed before the B parts begin
Diagram: executed on a single processor, S followed by the A parts and then the B parts takes 10 units; executed in parallel on 4 processors, the same work takes 4 units.

2 Amdahl’s Law (Analytical Model)
- Analytical model of parallel speedup from the 1960s
- The parallel fraction (α) is run over n processors, taking α/n time
- The part that must be executed serially (1 - α) gets no speedup
- Overall performance is limited by the fraction of the work that cannot be done in parallel (1 - α): diminishing returns with increasing processors (n)
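The slide's formula itself did not survive the transcript; the standard form of Amdahl's law implied by these definitions, Speedup = 1 / ((1 - α) + α/n), can be sketched as:

```python
def amdahl_speedup(alpha: float, n: int) -> float:
    """Amdahl's law: speedup of a workload whose parallel fraction
    alpha runs on n processors; the serial fraction (1 - alpha)
    gets no speedup."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

# Slide 1 example: 2 of 10 time units are serial (alpha = 0.8), 4 processors.
# 1 / (0.2 + 0.8/4) = 2.5, i.e. the 10-unit job finishes in 10/2.5 = 4 units.
print(amdahl_speedup(0.8, 4))  # 2.5
```

Note how the serial term (1 - α) bounds the result: even with unlimited processors the speedup here can never exceed 1/0.2 = 5.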

3 Pipelined Processing
- Single processor enhanced with discrete stages
- Instructions “flow” through the pipeline stages
- Parallel speedup comes from multiple instructions being executed (by parts) simultaneously
- Realized speedup is partly determined by the number of stages: 5 stages = at most 5 times faster
- Stages: F (Instruction Fetch), D (Instruction Decode), OF (Operand Fetch), EX (Execute), WB (Write Back / Result Store)
- The processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle
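The “at most 5 times faster” claim can be checked with the usual ideal-pipeline timing (a sketch; the slide does not give this formula explicitly): a k-stage pipeline completes n instructions in k + n - 1 sub-cycles, versus k·n serially.

```python
def pipeline_time(stages: int, instructions: int) -> int:
    """Sub-cycles for an ideal pipeline: the first instruction takes
    `stages` sub-cycles to reach the end, then one instruction
    completes every sub-cycle after that."""
    return stages + instructions - 1

serial = 5 * 100                    # 100 instructions, 5 sub-cycles each
pipelined = pipeline_time(5, 100)   # 104 sub-cycles
print(serial / pipelined)           # ~4.81, approaching the 5x stage limit
```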

4 Pipeline Performance
- Speedup is serial time (n instructions × S stages, i.e. nS) over parallel time
- Performance is limited by the number of pipeline flushes due to jumps; speculative execution and branch prediction can minimize pipeline flushes
- Performance is also reduced by pipeline stalls (s), due to bus-access conflicts, data-not-ready delays, and other sources
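A hedged model of these penalties (the exact costs are assumptions, not given in the transcript): each flush refills the pipeline, costing roughly one pipeline depth of sub-cycles, and each stall adds one sub-cycle.

```python
def pipeline_speedup(stages: int, instructions: int,
                     flushes: int = 0, stalls: int = 0) -> float:
    """Hedged sketch: speedup = serial time over parallel time,
    where each flush costs ~`stages` extra sub-cycles and each
    stall costs one sub-cycle."""
    serial = stages * instructions
    parallel = stages + instructions - 1 + flushes * stages + stalls
    return serial / parallel

print(pipeline_speedup(5, 100))              # ideal: ~4.81
print(pipeline_speedup(5, 100, flushes=10))  # jumps erode the speedup
```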

5 Super-Scalar: Multiple Pipelines
- Concurrent execution of multiple sets of instructions
- Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline
- Compiler: identifies and specifies separate instruction sets for concurrent execution through different pipes

6 Algorithm/Thread Level Parallelism
- Example: algorithms to compute the Fast Fourier Transform (FFT), used in Digital Signal Processing (DSP)
- Many separate computations in parallel (high degree of parallelism)
- Large exchange of data: much communication between processors (fine-grained parallelism)
- Communication time (latency) may be a consideration if multiple processors are combined on a board or motherboard
- A large communication load (fine-grained parallelism) can force the algorithm to become bandwidth-bound rather than computation-bound

7 Simple Algorithm/Thread Parallelism Model
- Parallel “threads of execution”: each could be a separate process or part of a multi-threaded process
- Each thread of execution obeys Amdahl’s parallel speedup model
- Multiple concurrently executing processes result in multiple serial components executing concurrently: another level of parallelism
Observe that the serial parts of Program 1 and Program 2 are now running in parallel with each other. Each program would take 6 time units on a uniprocessor, for a total workload serial time of 12. Each has a speedup of 1.5. The total speedup is 12/4 = 3, which is also the sum of the program speedups.
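The worked example can be checked numerically (all times taken from the slide):

```python
uni_each = 6    # each program alone on a uniprocessor: 6 time units
par_each = 4    # each program under the thread-parallel model: 4 time units
finish = 4      # both programs run concurrently and finish together

per_program = uni_each / par_each       # 1.5
total = (2 * uni_each) / finish         # 12 / 4 = 3.0
print(per_program, total)               # 1.5 3.0

# The total speedup equals the sum of the individual program speedups.
assert total == 2 * per_program
```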

8 Multiprocess Speedup
- Concurrent execution of multiple processes
- Each process is limited by Amdahl’s parallel speedup
- Multiple concurrently executing processes result in multiple serial components executing concurrently: another level of parallelism
- Avoids Degree of Parallelism (DOP) speedup limitations
- Linear scaling up to machine limits of processors and memory: N × single-process speedup
Diagram: no speedup (uniprocessor), 12 t; single process, 8 t, speedup = 1.5; two processes, 4 t, speedup = 3.

9 Algorithm/Thread Parallelism - Analytical Model
Multi-process/thread speedup:
- α = fraction of work that can be done in parallel
- n = number of processors
- N = number of concurrent processes or threads
The slide gives two variants of the model: one assuming similar processes and one assuming dissimilar processes.
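The formula itself did not survive the transcript. A reconstruction consistent with the slide-8 numbers (α = 2/3, n = 4, N = 2 giving a speedup of 3, and N = 1 giving 2) is Speedup = N / ((1 - α) + αN/n) — a sketch assuming similar processes sharing the n processors, not necessarily the slide's exact form:

```python
def multiprocess_speedup(alpha: float, n: int, N: int) -> float:
    """Hedged reconstruction: N similar processes, each with parallel
    fraction alpha, share n processors. Serial parts overlap across
    processes; the combined parallel work N*alpha is spread over n."""
    return N / ((1 - alpha) + alpha * N / n)

# Slide 8: alpha = 2/3 (4 of 6 units parallel), 4 processors, 2 processes.
print(multiprocess_speedup(2/3, 4, 2))  # 3.0
```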

10 (Simple) Unified Model with Scaled Speedup
- Adds a scaling factor on parallel work, while holding serial work constant
- k1 = scaling factor on the parallel portion
- α = fraction of work that can be done in parallel
- n = number of processors
- N = number of concurrent (assumed dissimilar) processes or threads
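The unified formula is not in the transcript. One Gustafson-style form matching the description (parallel work scaled by k1, serial work held constant) that reduces to the multiprocess model when k1 = 1 is sketched below; this is an assumption, not the slide's verified equation:

```python
def scaled_speedup(alpha: float, n: int, N: int, k1: float) -> float:
    """Hedged sketch of the unified model: the parallel portion of
    each process is scaled by k1 while the serial portion stays
    fixed; N processes share n processors."""
    work = (1 - alpha) + k1 * alpha           # scaled per-process work
    time = (1 - alpha) + k1 * alpha * N / n   # serial + shared parallel time
    return work * N / time

print(scaled_speedup(2/3, 4, 2, 1.0))  # 3.0: reduces to the unscaled model
print(scaled_speedup(2/3, 4, 2, 2.0))  # growing the parallel work raises it
```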

11 Capturing Multiple Levels of Parallelism
- Most parallelism suffers from diminishing returns, resulting in limited scalability
- Allocating hardware resources across multiple levels of parallelism keeps each level operating at the efficient end of its speedup curve
- Manufacturers of microcontrollers are integrating multiple levels of parallelism on a single chip

12 Trend in Microprocessor Architectures
Three levels of parallelism:
1. Intra-instruction parallelism: pipelines
2. Instruction-level parallelism: super-scalar (multiple pipelines)
3. Algorithm/thread parallelism: multiple processing elements
Architectural variations:
- Integrated DSP with microcontroller (DSP and microcontroller cores on the same chip)
- Enhanced microcontroller that also does DSP
- Enhanced DSP processor that also functions as a microcontroller
- Multiprocessor
Each variation captures some speedup from all three levels, with varying amounts of speedup from each. Each parallel level operates more efficiently than if all hardware resources were allocated to a single parallel level.

13 More Levels of Parallelism Outside the Chip
- Multiple processors in a box: on a motherboard, or on a back-plane with daughter-boards
- Shared-memory multiprocessors: communication is through shared memory
- Clustered multiprocessors: another hierarchical level; processors are grouped into clusters; intra-cluster and inter-cluster connections can each be bus or network
- Distributed multicomputers: multiple computers loosely coupled through a network
- n-tiered architectures: modern client/server architectures

14 Speedup of Client-Server, 2-Tier Systems
- β = workload balance, the fraction of the workload on the clients
- β = 1 (100% on clients): completely distributed
- β = 0 (100% on servers): completely centralized
- n clients, m servers
Diagram: n clients and m servers connected via LAN / Internet / LAN.
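The speedup expression is missing from the transcript; a natural form given these definitions, Speedup = 1 / (β/n + (1 - β)/m), is sketched below as an assumption. It yields n when fully distributed and m when fully centralized, matching the slide's two limiting cases:

```python
def two_tier_speedup(beta: float, n_clients: int, m_servers: int) -> float:
    """Hedged sketch: fraction beta of the workload is spread over
    n_clients machines, the remainder over m_servers machines;
    speedup is relative to one machine doing everything."""
    return 1.0 / (beta / n_clients + (1 - beta) / m_servers)

print(two_tier_speedup(1.0, 10, 2))  # 10.0: fully distributed -> n
print(two_tier_speedup(0.0, 10, 2))  # 2.0:  fully centralized -> m
```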

15 Speedup of Client-Server, n-Tier Systems
- m1 level-1 machines (clients); m2 level-2 servers, m3 level-3 servers, etc.
- β1 = workload balance, the fraction of the workload on the clients
- β2 = fraction of the workload on level-2 servers, β3 = fraction on level-3 servers, etc.
Diagram: m1 clients and server tiers m2, m3, m4 connected via LAN / Internet / LAN / SAN.
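Generalizing the 2-tier sketch to n tiers (again an assumption, since the transcript omits the formula): Speedup = 1 / Σ (βi / mi), with the βi summing to 1.

```python
def n_tier_speedup(betas: list, machines: list) -> float:
    """Hedged generalization of the 2-tier sketch: tier i handles
    fraction betas[i] of the workload on machines[i] machines."""
    assert abs(sum(betas) - 1.0) < 1e-9, "workload fractions must sum to 1"
    return 1.0 / sum(b / m for b, m in zip(betas, machines))

# Hypothetical 3-tier split: 50% on 10 clients, 30% on 3 servers, 20% on 2.
print(n_tier_speedup([0.5, 0.3, 0.2], [10, 3, 2]))  # 4.0
```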

16 Presentation Summary
- Architects and chip manufacturers are integrating additional levels of parallelism.
- Multiple levels of speedup achieve higher speedups and greater efficiencies than increasing hardware at a single parallel level.
- A balanced approach achieves roughly the same efficiency, in hardware cost per unit of parallel speedup delivered, at each level of parallelism.
- Numerous architectural approaches are possible, each with different trade-offs and performance returns.
- Current technology integrates DSP processing with microcontroller functionality, achieving up to three levels of parallelism.
