
1 Extending the Unified Parallel Processing Speedup Model
Computer architectures take advantage of low-level parallelism: multiple pipelines.
The next generations of integrated circuits will continue to support increasing numbers of transistors. How can the additional transistors be used efficiently?
Answer: parallelism beyond multiple pipelines, by adding multiple processors or processing components in a single chip or single package.
At each level of parallelism, performance suffers from the law of diminishing returns outlined by Amdahl. Incorporating multiple levels of parallelism results in higher overall performance and efficiency.

2 Presentation Content
A discussion of practical and theoretical methods of parallel speedup, and the efficient use of hardware/processing resources in capturing speedup:
– Parallel speedup: Amdahl's Law and scaled speedup
– Pipelined processors
– Multiprocessors and multicomputers
– Multiple concurrent threads
– Multiple concurrent processes
– Multiple levels of parallelism, with integrated chips/packages that combine microcontrollers with Digital Signal Processing (DSP) chips

3 Presentation Summary
Architects and chip manufacturers are integrating additional levels of parallelism.
Multiple levels of speedup achieve higher speedups and greater efficiencies than adding hardware at a single parallel level.
A balanced approach allocates hardware resources so that each level of parallelism delivers speedup at about the same efficiency per unit of hardware cost.
Numerous architectural approaches are possible, each with different trade-offs and performance returns.
Current technology is integrating DSP processing with microcontroller functionality, achieving up to three levels of parallelism.

4 Classic Model of Parallel Processing
Multiple processors available (4). A process can be divided into serial and parallel portions; the parallel parts are executed concurrently.
S - serial (non-parallel) portion
A - all A parts can be executed concurrently
B - all B parts can be executed concurrently
All A parts must be completed prior to executing the B parts.
An example parallel process of time 10, executed on a single processor: S A A A A B B B B S (serial time: 10 time units).
Executed in parallel on 4 processors: S, then all four A parts together, then all four B parts together, then S (parallel time: 4 time units).
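The schedule above can be sketched in a few lines of Python. The function name and the one-time-unit-per-part assumption are mine, not the slide's:

```python
import math

def schedule_time(serial_units, a_parts, b_parts, processors):
    # Serial units run one at a time; each parallel phase (all A parts,
    # then all B parts) takes ceil(parts / processors) time units.
    return (serial_units
            + math.ceil(a_parts / processors)
            + math.ceil(b_parts / processors))

serial_time = schedule_time(2, 4, 4, 1)    # 2 + 4 + 4 = 10 time units
parallel_time = schedule_time(2, 4, 4, 4)  # 2 + 1 + 1 = 4 time units
print(serial_time, parallel_time, serial_time / parallel_time)  # 10 4 2.5
```

The 2.5x speedup on 4 processors already shows the serial bottleneck the next slide formalizes.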

5 Amdahl's Law (Analytical Model)
Analytical model of parallel speedup from the 1960s.
The parallel fraction (α) is run over n processors, taking α/n time.
The part that must be executed serially (1-α) gets no speedup.
Overall performance is limited by the fraction of the work that cannot be done in parallel (1-α): diminishing returns with increasing processors (n).
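Amdahl's law as described here can be written directly as code (the standard formulation, consistent with the slide's definitions of α and n):

```python
def amdahl_speedup(alpha, n):
    """Amdahl's law: speedup of a workload whose fraction alpha is
    parallelizable over n processors; the (1 - alpha) part stays serial."""
    return 1.0 / ((1.0 - alpha) + alpha / n)

# Diminishing returns: with alpha = 0.8 the speedup can never exceed 1/0.2 = 5.
print(amdahl_speedup(0.8, 4))     # 2.5
print(amdahl_speedup(0.8, 1000))  # ~4.98, approaching the limit of 5
```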

6 Pipelined Processing
Single processor enhanced with discrete stages; instructions "flow" through pipeline stages.
Parallel speedup comes from multiple instructions being executed (by parts) simultaneously.
Realized speedup is partly determined by the number of stages: 5 stages = at most 5 times faster.
Stages: F - instruction fetch, D - instruction decode, OF - operand fetch, EX - execute, WB - write back (result store).
The processor clock/cycle is divided into sub-cycles; each stage takes one sub-cycle (cycles 1-5).

7 Pipeline Performance
Speedup is serial time (nS) over parallel time.
Performance is limited by the number of pipeline flushes (n) due to jumps; speculative execution and branch prediction can minimize pipeline flushes.
Performance is also reduced by pipeline stalls (s), due to bus-access conflicts, data-not-ready delays, and other sources.
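A hedged sketch of this model: the slide does not give its exact formula, so the following uses the textbook timing for a k-stage pipeline (n instructions finish in k + n - 1 cycles, where n here counts instructions rather than flushes) with simple additive penalties for stalls and flushes. The function name and penalty costs are assumptions:

```python
def pipeline_speedup(k, n, stalls=0, flushes=0):
    # Textbook model: n instructions through a k-stage pipeline take
    # k + n - 1 cycles; each stall adds one cycle, and each flush
    # refills the pipeline at a cost of k - 1 cycles.
    serial_cycles = n * k
    pipelined_cycles = (k + n - 1) + stalls + flushes * (k - 1)
    return serial_cycles / pipelined_cycles

print(pipeline_speedup(5, 1000))                        # ~4.98, near the 5x bound
print(pipeline_speedup(5, 1000, stalls=50, flushes=20)) # penalties cut the speedup
```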

8 Super-Scalar: Multiple Pipelines
Concurrent execution of multiple sets of instructions.
Example: simultaneous execution of instructions through an integer pipeline while processing instructions through a floating-point pipeline.
Compiler: identifies and specifies separate instruction sets for concurrent execution through different pipes.

9 Algorithm/Thread-Level Parallelism
Example: algorithms to compute the Fast Fourier Transform (FFT), used in Digital Signal Processing (DSP)
– Many separate computations in parallel (high degree of parallelism)
– Large exchange of data: much communication between processors (fine-grained parallelism)
– Communication time (latency) may be a consideration if multiple processors are combined on a board or motherboard
– The large communication load of fine-grained parallelism can force the algorithm to become bandwidth-bound rather than computation-bound

10 Simple Algorithm/Thread Parallelism Model
Parallel "threads of execution":
– could be separate processes
– could be a multi-threaded process
Each thread of execution obeys Amdahl's parallel speedup model. Multiple concurrently executing processes result in multiple serial components executing concurrently - another level of parallelism.
Example: two programs, P1 and P2, each S A A B B S. Observe that the serial parts of Program 1 and Program 2 are now running in parallel with each other. Each program would take 6 time units on a uniprocessor, for a total workload serial time of 12. Each has a speedup of 1.5. The total speedup is 12/4 = 3, which is also the sum of the program speedups.

11 Multiprocess Speedup
Concurrent execution of multiple processes; each process is limited by Amdahl's parallel speedup.
Multiple concurrently executing processes result in multiple serial components executing concurrently - another level of parallelism.
Avoids degree-of-parallelism (DOP) speedup limitations; linear scaling up to machine limits of processors and memory: n × single-process speedup.
Example (two S A A B B S programs):
– Uniprocessor, no speedup: 12 t
– Programs run one at a time, each parallelized: 8 t, speedup = 1.5
– Multi-process, both run concurrently: 4 t, speedup = 3
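The three cases in this example can be checked numerically. This is a sketch of my reading of the slide's 12 t / 8 t / 4 t figures; the function and variable names are mine:

```python
import math

def program_time(serial_units, a_parts, b_parts, procs):
    # One S A A B B S program: two serial units plus two parallel phases.
    return (serial_units
            + math.ceil(a_parts / procs)
            + math.ceil(b_parts / procs))

one_serial = program_time(2, 2, 2, 1)    # 6 time units on one processor
one_parallel = program_time(2, 2, 2, 2)  # 4 time units on two processors

uniprocessor = 2 * one_serial      # 12 t: both programs, no parallelism
sequential_par = 2 * one_parallel  # 8 t: one program at a time, each parallelized
concurrent_par = one_parallel      # 4 t: both programs run concurrently
print(uniprocessor / sequential_par, uniprocessor / concurrent_par)  # 1.5 3.0
```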

12 Algorithm/Thread Parallelism - Analytical Model
Multi-process/thread speedup:
α = fraction of work that can be done in parallel
n = number of processors
N = number of concurrent processes or threads
One form of the model assumes similar processes or threads; a second form assumes dissimilar ones.
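The speedup formulas themselves were lost in transcription. A reconstruction for the similar-process case, consistent with the definitions above and with slide 10's 12/4 = 3 example, is the following (an assumption, not the slide's verbatim equation):

```python
def multiprocess_speedup(alpha, n, N):
    # Reconstructed model: N similar processes share n processors (n/N each).
    # Serial parts of different processes overlap, so each process finishes
    # in (1 - alpha) + alpha*N/n of its serial time, giving
    #   speedup = N / ((1 - alpha) + alpha*N/n).
    return N / ((1.0 - alpha) + alpha * N / n)

print(multiprocess_speedup(0.8, 4, 1))  # 2.5: with N = 1 this is Amdahl's law
print(multiprocess_speedup(4/6, 4, 2))  # ~3.0: slide 10's two-program example
```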

13 (Simple) Unified Model with Scaled Speedup
Adds a scaling factor on parallel work, while holding serial work constant:
k1 = scaling factor on the parallel portion
α = fraction of work that can be done in parallel
n = number of processors
N = number of concurrent (assumed dissimilar) processes or threads
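The equation for this model is also missing from the transcript. A hedged single-process reconstruction, assuming Gustafson-style scaling in which only the parallel portion grows by k1 (the multi-process term with N is omitted here for simplicity):

```python
def scaled_speedup(alpha, n, k1):
    # Assumed form: scaled serial time over scaled parallel time, where the
    # parallel work grows by k1 and the serial work stays constant.
    return ((1.0 - alpha) + k1 * alpha) / ((1.0 - alpha) + k1 * alpha / n)

print(scaled_speedup(0.8, 4, 1))   # 2.5: with k1 = 1 this reduces to Amdahl
print(scaled_speedup(0.8, 4, 10))  # scaling the parallel work raises the speedup
```

The design point: as k1 grows, the serial fraction of the scaled workload shrinks, so speedup approaches n rather than Amdahl's fixed-workload bound.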

14 Capturing Multiple Levels of Parallelism
Most parallelism suffers from diminishing returns, resulting in limited scalability.
Allocating hardware resources to capture multiple levels of parallelism keeps each level operating at the efficient end of its speedup curve.
Manufacturers of microcontrollers are integrating multiple levels of parallelism on a single chip.

15 Trend in Microprocessor Architectures
Levels of parallelism:
1. Intra-instruction parallelism: pipelines
2. Instruction-level parallelism: super-scalar (multiple pipelines)
3. Algorithm/thread parallelism:
– multiple processing elements
– integrated DSP with microcontroller
– enhanced microcontroller that also does DSP
– enhanced DSP processor that also functions as a microcontroller
Architectural variations: DSP and microcontroller cores on the same chip; a DSP that also serves as the microprocessor; a microprocessor that also does DSP; a multiprocessor.
Each variation captures some speedup from all three levels, with varying amounts of speedup from each. Each parallel level operates more efficiently than if all hardware resources were allocated to a single parallel level.

16 More Levels of Parallelism Outside the Chip
Multiple processors in a box:
– on a motherboard
– on a back-plane with daughter-boards
Shared-memory multiprocessors - communication is through shared memory
Clustered multiprocessors - another hierarchical level; processors are grouped into clusters; intra-cluster and inter-cluster links can each be a bus or a network
Distributed multicomputers - multiple computers loosely coupled through a network
n-tiered architectures - modern client/server architectures

17 Speedup of Client-Server, 2-Tier Systems
β = workload balance, the % of the workload on the clients
– β = 1 (100% on clients): completely distributed
– β = 0 (100% on servers): completely centralized
n clients and m servers, connected through a LAN and the Internet.
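The slide's formula is not shown in the transcript. One simple model consistent with the β definitions splits the workload between the client tier and the server tier, each tier distributing its share over its machines (the function name and the additive-time assumption are mine):

```python
def two_tier_speedup(beta, n_clients, m_servers):
    # Assumed model: fraction beta of the workload spreads across n clients,
    # the remaining (1 - beta) across m servers; tiers execute in sequence.
    # Speedup is relative to the whole workload on one machine.
    return 1.0 / (beta / n_clients + (1.0 - beta) / m_servers)

print(two_tier_speedup(1.0, 10, 2))  # 10.0: fully distributed to the clients
print(two_tier_speedup(0.0, 10, 2))  # 2.0: fully centralized on the servers
```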

18 Speedup of Client-Server, n-Tier Systems
m1 level-1 machines (clients); m2 machines at server tier 2, m3 at tier 3, m4 at tier 4, etc.
β1 = workload balance, the % of the workload on the clients; β2 = % of the workload on server tier 2, β3 = % on tier 3, etc.
Tiers are connected through a LAN, the Internet, and a SAN.

19 Hierarchy of Embedded Parallelism 1. N-tiered Client-Server Distributed Systems 2. Clustered Multi-computers 3. Clustered-Multiprocessor 4. Multiple Processors on a Chip 5. Multiple Processing Elements 6. Multiple Pipelines 7. Multiple Stages per Pipeline Goals: Single analytical model that captures parallelism from all levels Simulator that allows exploration

20 References
K. Hoganson, "Alternative Mechanisms to Achieve Parallel Speedup", First IEEE Online Symposium for Electronics Engineers, IEEE Society, August 2000.
K. Hoganson, "Mapping Parallel Application Communication Topology to Rhombic Overlapping-Cluster Multiprocessors", The Journal of Supercomputing, Vol. 17, No. 1, to appear August 2000.
K. Hoganson, "Workload Execution Strategies and Parallel Speedup on Clustered Computers", IEEE Transactions on Computers, Vol. 48, No. 11, November 1999.
Undergraduate Research Project: Unified Parallel System Modeling, Directed Study, Summer-Fall 2000.

