Presentation on theme: "1. Introduction to Parallel Computing"— Presentation transcript:
11. Introduction to Parallel Computing Parallel computing is a subject of interest in the computing community.Ever-growing size of databases and increasing complexity are putting great stress on the single processor computers.Now the entire computer science community is looking for some computing environment where current computational capacity can be enhanced.By improving the performance of a single computer( Uniprocessor system)By parallel processing
2The most obvious solution is the introduction of multiple processors working in tandem to solve a given problem.The Architecture used to solve this problem is Advanced/ Parallel Computer Architecture and the algorithms are known as Parallel Algorithms and programming of these computers is known as Parallel Programming.Parallel computing is the simultaneous execution of the same task, split into subtasks, on multiple processors in order to obtain results faster.
3why we require parallel computing? what are the levels of parallel processing ?how flow of data occurs in parallel processing?What is the Role the of parallel processing in some fields like science and engineering, database queries and artificial intelligence?
4Objectives Historical facts of parallel computing. • Explain the basic concepts of program, process, thread, concurrent execution, parallel execution and granularity• Explain the need of parallel computing.• Describe the levels of parallel processing .Describe Parallel computer classification Schemes.• Describe various applications of parallel computing.
5Why Parallel Processing? Computation requirements are ever increasing:simulations, scientific prediction (earthquake), distributed databases, weather forecasting (will it rain tomorrow?), search engines, e-commerce, Internet service applications, Data Center applications, Finance (investment risk analysis), Oil Exploration, Mining, etc.
6Why Parallel Processing? Hardware improvements like pipelining, superscalar are not scaling well and require sophisticated compiler technology to exploit performance out of them.Techniques such as vector processing works well for certain kind of problems.
7Why Parallel Processing? Significant development in networking technology is paving a way for network-based cost-effective parallel computing.The parallel processing technology is mature and is being exploited commercially.
8Constraints of conventional architecture : von Neumann machine( sequential computers)
9Parallelism in uniprocesor System Parallel processing mechanisms to achieve parallelism in uniprocessor system are :Multiple function unitsParallelism and pipelining within CPUOverlapped CPU and i/o operationsUse of hierarchical memory systemMultiprogramming and time sharing
13Comparison between Sequential and Parallel Computer Sequential ComputersAre uniprocessor systems ( 1 CPU)Can Execute 1 Instruction at a timeSpeed is limitedIt is quite expensive to make single cpu fasterArea where it can be used : colleges, labs,Ex: Pentium PCParallel ComputersAre Multiprocessor Systems ( many CPU’s )Can Execute several Instructions at a time.No limitation on speedLess expensive if we use larger number of fast processors to achieve better performance.Ex : CRAY 1, CRAY-XMP(USA) and PARAM ( India )
14History of Parallel Computing The experiments and implementations of the use of parallelism started in the 1950s by the IBM.A serious approach towards designing parallel computers was started with the development of ILLIAC IV inThe concept of pipelining was introduced in computer CDC 7600 in 1969.
15History of Parallel Computing In 1976, the CRAY1 was developed by Seymour Cray. Cray1 was a pioneering effort in the development of vector registers.The next generation of Cray called Cray XMP was developed in the yearsIt was coupled with supercomputers and used a shared memory.In the 1980s Japan also started manufacturing high performance supercomputers. Companies like NEC, Fujitsu and Hitachi were the main manufacturers.
16Parallel computers * Parallel computers are those systems which emphasize on parallel processing * Parallel processing is an efficient form of information processing which emphasis the exploitation of concurrent events in computing process
18pipelined computers performs overlapped computations to exploit temporal parallelism . here successive instructions are executed in overlapped fashion as shown in figure(next…).In Nonpipelined computers the execution of first instruction must be completed before the next instruction can be issued
19Pipeline Computers These computers performs overlapped computations. Instruction cycle of digital computer involves 4 major steps :IF (Instruction Fetch)ID (Instruction Decode)OF (Operand Fetch)EX (Execute)
22Array ProcessorArray processor is synchronous parallel computer with multiple ALUs ,called as processing elements (PE), these PE’s can operate in parallel mode.An appropriate data routing algorithm must be established among PE’s.
23Multiprocessor system This system achieves asynchronous parallelism through a set of interactive processors with shared resources (memories ,databases etc.) .
24PROBLEM SOLVING IN PARALLEL: Temporal Parallelism : Ex: submission of Electricity Bills :Suppose there are residents in a locality and they are supposed to submit their electricity bills in one office.
25steps to submit the bill are as follows: Go to the appropriate counter to take the form to submit the bill.Submit the filled form along with cash.Get the receipt of submitted bill.
26Serial Vs. ParallelCOUNTER 2COUNTERCOUNTER 1QPlease
27sequential execution Giving application form = 5 seconds Accepting filled application form and counting the cash and returning, if required = 5mnts, i.e., 5 ×60= 300 sec.Giving receipts = 5 seconds.Total time taken in processing one bill = = 310 seconds
28if we have 3 persons sitting at three different counters with : One person giving the bill submission formii) One person accepting the cash andReturning ,if necessary andiii) One person giving the receipt.
29As three persons work in the same time, it is called temporal parallelism. Here, a task is broken into many subtasks, and those subtasks are executed simultaneously in the time domain .
30Data ParallelismIn data parallelism, the complete set of data is divided into multiple blocks and operations on the blocks are applied parallel.data parallelism is faster as compared to Temporal parallelism.Here, no synchronization is required between counters (or processors ).It is more tolerant of faults.The working of one person does not effect the other.Inter-processor communication is less.
31Disadvantages : Data parallelism The task to be performed by each processor is pre-decided i.e., assignment of load is static.It should be possible to break the input task into mutually exclusive tasks.space would be required for counters. This requires multiple hardware which may be costly.
32PERFORMANCE EVALUATION The performance attributes are:Cycle time (T): It is the unit of time for all the operations of a computer system. It is the inverse of clock rate (l/f). The cycle time is represented in n sec.Cycles Per Instruction (CPI): Different instructions takes different number of cycles for exection. CPI is measurement of number of cycles per instructionInstruction count (LC): Number of instruction in a program is called instruction count. If we assume that all instructions have same number of cycles, then the total execution time of a program
33the total execution time of a program= number of instruction in the program * number of cycle required by one instruction * time of one cycle.execution time T=Ic*CPI*Tsec.Practically the clock frequency of the system is specified in MHz.the processor speed is measured in terms of million instructions per sec (MIPS).
34SOME ELEMENTARY CONCEPTS ProgramProcessThreadConcurrent and Parallel ExecutionGranularityPotential of Parallelism
35processEach process has a life cycle, which consists of creation, execution and termination phases. A process may create several new processes, which in turn may also create a new processes.
36Process creation requires four actions Setting up the process descriptionAllocating an address spaceLoading the program into the allocated address spacePassing the process description to the process schedulerThe process scheduling involves three concepts: process state, state transition and scheduling policy.
38Thread Thread is a sequential flow of control within a process. A process can contain one or more threads.Threads have their own program counter and register values, but they share the memory space and other resources of the process.
39Thread is basically a lightweight process . Advantages: It takes less time to create and terminate a new thread than to create, and terminate a process.It takes less time to switch between two threads within the same process .Less communication overheads.
40Study of concurrent and parallel executions is important due to following reasons: i) Some problems are most naturally solved by using a set of co-operating processes.ii) To reduce the execution time.Concurrent execution is the temporal behavior of the N-client 1-server model .Parallel execution is associated with the N-client N-server model. It allows the servicing of more than one client at the same time as the number of servers is more than one.
41Granularity refers to the amount of computation done in parallel relative to the size of the whole program.In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
42Potential of Parallelism Some problems may be easily parallelized.On the other hand, there are some inherent sequential problems (computation of Fibonacci sequence) whose parallelization is nearly impossible .If processes don’t share address space and we could eliminate data dependency among instructions, we can achieve higher level of parallelism.
43Speed-upThe concept of speed up is used as a measure of the speed up that indicates up to what extent to which a sequential program can be parallelized.
47Computational Power Improvement MultiprocessorUniprocessorC.P.INo. of Processors
48Characteristics of Parallel computer Parallel computers can be characterized based onthe data and instruction streams forming various types of computer organizations.the computer structure, e.g. multiple processors having separate memory or one shared global memory.size of instructions in a program called grain size.
49TYPES OF CLASSIFICATION 1) Classification based on the instruction and data streams2) Classification based on the structure of computers3) Classification based on how the memory is accessed4) Classification based on grain size
50classification of parallel computers Flynn’s classification based on instruction and data streamsThe Structural classification based on different computer organizations;The Handler's classification based on three distinct levels of computer:Processor control unit (PCU), Arithmetic logic unit(ALU), Bit-level circuit (BLC)describe the sub-tasks or instructions of a program that can be executed in parallel based on the grain size.
51FLYNN’S CLASSIFICATION Proposed by Michael Flynn in 1972.Introduced the concept of instruction and data streams for categorizing of computers.This classification is based on instruction and data streamsWorking of the instruction cycle.
52Instruction CycleThe instruction cycle consists of a sequence of steps needed for the execution of an instruction in a program
53The control unit fetches instructions one at a time. The fetched Instruction is then decoded by the decoderthe processor executes the decoded instructions.The result of execution is temporarily stored in Memory Buffer Register (MBR).
54Instruction Stream and Data Stream flow of instructions is called instruction stream.flow of operands between processor and memory is bi-directional. This flow of operands is called data stream.
55Flynn’s Classification Based on multiplicity of instruction streams and data streams observed by the CPU during program execution.Single Instruction and Single Data stream (SISD)Single Instruction and multiple Data stream (SIMD)Multiple Instruction and Single Data stream (MISD)Multiple Instruction and Multiple Data stream (MIMD)
56SISD : A Conventional Computer ProcessorData InputData OutputInstructionsSpeed is limited by the rate at which computer can transfer information internally.Ex: PCs, Workstations
57Single Instruction and Single Data stream (SISD) sequential execution of instructions is performed by one CPU containing a single processing element (PE)Therefore, SISD machines are conventional serial computers that process only one stream of instructions and one stream of data. Ex: Cray-1, CDC 6600, CDC 7600
58The MISD ArchitectureDataInputStreamOutputProcessorABCInstructionStream AStream BInstruction Stream CMore of an intellectual exercise than a practical configuration. Few built, but commercially not available
59Multiple Instruction and Single Data stream (MISD) multiple processing elements are organized under the control of multiple control units.Each control unit is handling one instruction stream and processed through its corresponding processing element.each processing element is processing only a single data stream at a time.Ex:C.mmp built by Carnegie-Mellon University.
60All processing elements are interacting with the common shared memory for the organization of single data stream
61Advantages of MISD for the specialized applications like Real time computers need to be fault tolerant where several processors execute the same data for producing the redundant data.All these redundant data are compared as results which should be same otherwise faulty unit is replaced.Thus MISD machines can be applied to fault tolerant real time computers.
64Single Instruction and multiple Data stream (SIMD) multiple processing elements work under the control of a single control unit.one instruction and multiple data stream.All the processing elements of this organization receive the same instruction broadcast from the CU.Main memory can also be divided into modules for generating multiple data streams.Every processor must be allowed to complete its instruction before the next instruction is taken for execution.The execution of instructions is synchronous
65SIMD ProcessorsSome of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2 are belonged to this class of machines.Variants of this concept have found use in co-processing units such as the MMX units in Intel processors and IBM Cell processor.SIMD relies on the regular structure of computations (such as those in image processing).It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming architectures allow for an ``activity mask'', which determines if a processor should participate in a computation or not.
67Shared Memory MIMD machine ProcessorAProcessorBProcessorCMEMORYBUSMEMORYBUSMEMORYBUSGlobal Memory SystemCommunication : Source PE writes data to GM & destination retrieves itEasy to build, conventional OSes of SISD can be easily portedLimitation : reliability & expandability. A memory component or any processor failure affects the whole system.Increase of processors leads to memory contention.Ex. : Silicon graphics supercomputers....
68Distributed Memory MIMD IPCchannelIPCchannelProcessorAProcessorBProcessorCMEMORYBUSMEMORYBUSMEMORYBUSMemorySystem ASystem BSystem CCommunication : IPC (Inter-Process Communication) via High Speed Network.Network can be configured to ... Tree, Mesh, Cube, etc.Unlike Shared MIMDeasily/ readily expandableHighly reliable (any CPU failure does not affect the whole system)
69MIMD ProcessorsIn contrast to SIMD processors, MIMD processors can execute different programs on different processors.A variant of this, called single program multiple data streams (SPMD) executes the same program on different processors.It is easy to see that SPMD and MIMD are closely related in terms of programming flexibility and underlying architectural support.Examples of such platforms include current generation Sun Ultra Servers, SGI Origin Servers, multiprocessor PCs, workstation clusters.
70Multiple Instruction and Multiple Data stream (MIMD) multiple processing elements and multiple control units are organized as in MISD.for handling multiple instruction streams, multiple control units are there and For handling multiple data streams, multiple processing elements are organized.The processors work on their own data with their own instructions.Tasks executed by different processors can start or finish at different times.
71in the real sense MIMD organization is said to be a Parallel computer.
72All multiprocessor systems fall under this classification. Examples :C.mmp, Cray-2, Cray X-MP, IBM 370/168 MP, Univac 1100/80, IBM 3081/3084.MIMD organization is the most popular for a parallel computer.In the real sense, parallel computers execute the instructions in MIMD mode
73SIMD-MIMD ComparisonSIMD computers require less hardware than MIMD computers (single control unit).However, since SIMD processors are specially designed, they tend to be expensive and have long design cycles.Not all applications are naturally suited to SIMD processors.In contrast, platforms supporting the SPMD paradigm can be built from inexpensive off-the-shelf components with relatively little effort in a short amount of time.
74HANDLER’S CLASSIFICATION In 1977, Handler proposed an elaborate notation for expressing the pipelining and parallelism of computers.Handler's classification addresses the computer at three distinct levels:Processor control unit (PCU)---- CPUArithmetic logic unit (ALU)--- processing elementBit-level circuit (BLC)--- logic circuit .
75Way to describe a computer Computer = (p * p', a * a', b * b')Where p = number of PCUsp'= number of PCUs that can be pipelineda = number of ALUs controlled by each PCUa'= number of ALUs that can be pipelinedb = number of bits in ALU or processing element (PE) wordb'= number of pipeline segments on all ALUs or in a single PE
76Relationship between various elements of the computer • The '*' operator is used to indicate that the units are pipelined or macro-pipelined with a stream of data running through all the units.• The '+' operator is used to indicate that the units are not pipelined but work on independent streams of data.• The 'v' operator is used to indicate that the computerhardware can work in one of several modes.• The '~' symbol is used to indicate a range of values for any one of the parameters.
77Ex:The CDC 6600 has a single main processor supported by 10 I/O processors. One control unit coordinates one ALU with a 60-bit word length. The ALU has 10 functional units which can be formed into a pipeline. The 10 peripheral I/O processors may work in parallel with each other and with the CPU. Each I/O processor contains one 12-bit ALU.
78CDC 6600I/O = (10, 1, 12)The description for the main processor is:CDC 6600main = (1, 1 * 10, 60)The main processor and the I/O processors can be regarded as forming a macro-pipeline so the '*' operator is used to combine the two structures:CDC 6600 = (I/O processors) * (central processor) =(10, 1, 12) * (1, 1 * 10, 60)
80STRUCTURAL CLASSIFICATION a parallel computer (MIMD) can be characterized as a set of multiple processors and shared memory or memory modules communicating via an interconnection network.When multiprocessors communicate through the global shared memory modules then this organization is called Shared memory computer or Tightly coupled systems
81Shared memory multiprocessors have the following characteristics: Every processor communicates through a shared global memoryFor high speed real time processing, these systems are preferable as their throughput is high as compared to loosely coupled systems.
83In tightly coupled system organization, multiple processors share a global main memory, which may have many modules.The processors have also access to I/O devices. The inter- communication between processors, memory, and other devices are implemented through various interconnection networks,
84Types of Interconnection n/w Processor-Memory Interconnection Network (PMIN)This is a switch that connects various processors to different memory modules.Input-Output-Processor Interconnection Network (IOPIN)This interconnection network is used for communication between processors and I/O channelsInterrupt Signal Interconnection Network (ISIN)When a processor wants to send an interruption to another processor, then this interrupt first goes to ISIN, through which it is passed to the destination processor. In this way, synchronization between processor is implemented by ISIN.
89Uniform Memory Access Model (UMA) In this model, main memory is uniformly shared by all processors in multiprocessor systems and each processor has equal access time to shared memory.This model is used for time-sharing applications in a multi user environment
90Uniform Memory Access (UMA): Most commonly represented today by Symmetric Multiprocessor (SMP) machinesIdentical processorsEqual access and access times to memorySometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level.
92Non-Uniform Memory Access Model (NUMA) In shared memory multiprocessor systems, local memories can be connected with every processor. The collection of all local memories form the global memory being shared.global memory is distributed to all the processors .In this case, the access to a local memory is uniform for its corresponding processor ,but if one reference is to the local memory of some other remote processor, then the access is not uniform.It depends on the location of the memory. Thus, all memory words are not accessed uniformly.
93Non-Uniform Memory Access (NUMA): Often made by physically linking two or more SMPsOne SMP can directly access memory of another SMPNot all processors have equal access time to all memoriesMemory access across link is slowerIf cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMAAdvantages: Global address space provides a user-friendly programming perspective to memoryData sharing between tasks is both fast and uniform due to the proximity of memory to CPUsDisadvantages:Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increases traffic on the shared memory-CPU path, and for cache coherent systems, geometrically increase traffic associated with cache/memory management.Programmer responsibility for synchronization constructs that ensure "correct" access of global memory.Expense: it becomes increasingly difficult and expensive to design and produce shared memory machines with ever increasing numbers of processors
95Cache-Only Memory Access Model (COMA) shared memory multiprocessor systems may use cache memories with every processor for reducing the execution time of an instruction .
96Loosely coupled system when every processor in a multiprocessor system, has its own local memory and the processors communicate via messages transmitted between their local memories, then this organization is called Distributed memory computer or Loosely coupled system
99each processor in loosely coupled systems is having a large local memory (LM), which is not shared by any other processor.such systems have multiple processors with their own local memory and a set of I/O devices.This set of processor, memory and I/O devices makes a computer system.these systems are also called multi-computer systems.
100These computer systems are connected together via message passing interconnection network through which processes communicate by passing messages to one another.Also called as distributed multi computer system .
101CLASSIFICATION BASED ON GRAIN SIZE This classification is based on recognizing the parallelism in a program to be executed on a multiprocessor system.The idea is to identify the sub-tasks or instructions in a program that can be executed in parallel .
105Data DependencyIt refers to the situation in which two or more instructions share same data.The instructions in a program can be arranged based on the relationship of data dependencyhow two instructions or segments are data dependent on each other
106Types of data dependencies i) Flow Dependence : If instruction I2 follows I1 and output of I1 becomes input of I2, then I2 is said to be flow dependent on I1.ii) Antidependence : When instruction I2 follows I1 such that output of I2 overlaps with the input of I1 on the same data.iii) Output dependence : When output of the two instructions I1 and I2 overlap on the same data, the instructions are said to be output dependent.iv) I/O dependence : When read and write operations by two instructions are invoked on the same file, it is a situation of I/O dependence.
107Control DependenceInstructions or segments in a program may contain control structures.dependency among the statements can be in control structures also. But the order of execution in control structures is not known before the run time.control structures dependency among the instructions must be analyzed carefully
108Resource DependenceThe parallelism between the instructions may also be affected due to the shared resources.If two instructions are using the same shared resource then it is a resource dependency condition
109Bernstein Conditions for Detection of Parallelism For execution of instructions or block of instructions in parallel, The instructions should be independent of each other.These instructions can be data dependent / control dependent / resource dependent on each other .
110Bernstein conditions are based on the following two sets of variables i) The Read set or input set RI that consists of memory locations read by the statement of instruction I1.ii) The Write set or output set WI that consists of memory locations written into by instruction I1.
111Parallelism based on Grain size Grain size: Grain size or Granularity is a measure which determines how much computation is involved in a process.Grain size is determined by counting the number of instructions in a program segment.
1131) Fine Grain: This type contains approximately less than 20 instructions. 2) Medium Grain: This type contains approximately less than 500 instructions.3) Coarse Grain: This type contains approximately greater than or equal to one thousand instructions.
114LEVELS OF PARALLEL PROCESSING Instruction LevelLoop levelProcedure levelProgram level
115Instruction levelThis is the lowest level and the degree of parallelism is highest at this level.The fine grain size is used at instruction levelThe fine grain size may vary according to the type of the program. For example, for scientific applications, the instruction level grain size may be higher.As the higher degree of parallelism can be achieved at this level, the overhead for a programmer will be more.
116Loop LevelThis is another level of parallelism where iterative loop instructions can be parallelized.Fine grain size is used at this level also.Simple loops in a program are easy to parallelize whereas the recursive loops are difficult.This type of parallelism can be achieved through the compilers .
117Procedure or Sub Program Level This level consists of procedures, subroutines or subprograms.Medium grain size is used at this level containing some thousands of instructions in a procedure.Multiprogramming is implemented at this level.
118Program LevelIt is the last level consisting of independent programs for parallelism.Coarse grain size is used at this level containing tens of thousands of instructions.Time sharing is achieved at this level of parallelism
121Operating Systems for High Performance Computing
122Operating Systems for PP MPP systems having thousands of processors requires OS radically different from current ones.Every CPU needs OS :to manage its resourcesto hide its detailsTraditional systems are heavy, complex and not suitable for MPP
123Operating System Models Frame work that unifies features, services and tasks performed.Three approaches to building OS....Monolithic OSLayered OSMicrokernel based OSClient server OSSuitable for MPP systemsSimplicity, flexibility and high performance are crucial for OS.
124Monolithic Operating System ApplicationProgramsApplicationProgramsUser ModeKernel ModeSystem ServicesHardwareBetter application PerformanceDifficult to extendEx: MS-DOS
126Traditional OS OS User Mode Kernel Mode OS Designer Application ProgramsUser ModeKernel ModeOSHardwareOS Designer
127New trend in OS design Servers Microkernel User Mode Kernel Mode ApplicationProgramsApplicationProgramsUser ModeKernel ModeMicrokernelHardware
128Microkernel/Client Server OS (for MPP Systems) ApplicationThreadlib.FileServerNetworkServerDisplayServerUserKernelMicrokernelSendReplyHardwareTiny OS kernel providing basic primitive (process, memory, IPC)Traditional services becomes subsystemsOS = Microkernel + User Subsystems
129Few Popular Microkernel Systems MACH, CMUPARAS, C-DACChorusQNX(Windows)
130ADVANTAGES OF PARALLEL COMPUTATION Reasons for using parallel computing:save time and solve larger problemswith the increase in number of processors working in parallel, computation time is bound to reduce .Cost savingsOvercoming memory constraintsLimits to serial computing
131APPLICATIONS OF PARALLEL PROCESSING Weather forecastingPredicting results of chemical and nuclear reactionsDNA structures of various species• Design of mechanical devices• Design of electronic circuits• Design of complex manufacturing processes• Accessing of large databases• Design of oil exploration systems
132Design of web search engines, web based business services • Design of computer-aided diagnosis in medicine• Development of MIS for national and multi-national corporations• Development of advanced graphics and virtual reality software, particularly for the entertainment industry, including networked video and multi-media technologies• Collaborative work (virtual) environments .
133Scientific Applications/Image processing Global atmospheric circulation,• Blood flow circulation in the heart,• The evolution of galaxies,• Atomic particle movement,• Optimization of mechanical components
134Engineering Applications Simulations of artificial ecosystems,• Airflow circulation over aircraft components
135Database Query/Answering Systems To speed up database queries we can use Teradata computer, which employs parallelism in processing complex queries.
136AI Applications Search through the rules of a production system, • Using fine-grain parallelism to search the semantic networks• Implementation of Genetic Algorithms,• Neural Network processors,• Preprocessing inputs from complex environments, such as visual stimuli.
137Mathematical Simulation and Modeling Applications Parsec, a C-based simulation language for sequential and parallel execution of discrete-event simulation models.• Omnet++ a discrete-event simulation software development environment written in C++.• Desmo-J a Discrete event simulation framework in Java.
138INDIA’S PARALLEL COMPUTERS In India, the development and design of parallel computers started in the early 80’s.(CDAC) in 1988 was designed the high-speed parallel machines
139India’s Parallel Computer Sailent Features of PARAM series:PARAM 8000 CDAC 1991: 256 Processor parallel computer,PARAM 8600 CDAC 1994: PARAM 8000 enhanced with Intel i860 vector microprocessor.. Improved software for numerical applications.PARAM 9000/SS CDAC 1996 : Used Sunsparc II processors.
140MARK Series: Flosolver Mark I NAL 1986 Flosolver Mark II NAL 1988 Flosolver Mark III NAL 1991
141Summary/ConclusionsIn the simplest sense, parallel computing is the simultaneous use of multiple compute resources to solve a computational problem:To be run using multiple CPUsA problem is broken into discrete parts that can be solved concurrentlyEach part is further broken down to a series of instructionsInstructions from each part execute simultaneously
144Uses for Parallel Computing: Science and Engineering: Historically, parallel computing has been considered to be "the high end of computing", and has been used to model difficult problems in many areas of science and engineering: Atmosphere, Earth, EnvironmentPhysics - applied, nuclear, particle, condensed matter, high pressure, fusion, photonicsBioscience, Biotechnology, GeneticsChemistry, Molecular SciencesGeology, SeismologyMechanical Engineering - from prosthetics to spacecraftElectrical Engineering, Circuit Design, MicroelectronicsComputer Science, Mathematics
145Industrial and Commercial: Today, commercial applications provide an equal or greater driving force in the development of faster computers. These applications require the processing of large amounts of data in sophisticated ways. For example: Databases, data miningOil explorationWeb search engines, web based business servicesMedical imaging and diagnosisPharmaceutical designFinancial and economic modellingManagement of national and multi-national corporationsAdvanced graphics and virtual reality, particularly in the entertainment industryNetworked video and multi-media technologiesCollaborative work environments
146Why Use Parallel Computing? Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel computers can be built from cheap, commodity components.Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example:Web search engines/databases processing millions of transactions per secondProvide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid provides a global collaboration network where people from around the world can meet and conduct work "virtually".
147Use of non-local resources: Using compute resources on a wide area network, or even the Internet when local compute resources are scarce. For example:over 1.3 million users, 3.2 million computers in nearly every country in the world.Limits to serial computing: Both physical and practical reasons pose significant constraints to simply building ever faster serial computers:Transmission speeds - the speed of a serial computer is directly dependent upon how fast data can move through hardware.Economic limitations - it is increasingly expensive to make a single processor faster. Using a larger number of moderately fast commodity processors to achieve the same (or better) performance is less expensive.Current computer architectures are increasingly relying upon hardware level parallelism to improve performance:Multiple execution unitsPipelined instructionsMulti-core
148Terminologies related to PC Supercomputing / High Performance Computing (HPC) Using the world's fastest and largest computers to solve large problems.Node A standalone "computer in a box". Usually comprised of multiple CPUs/processors/cores. Nodes are networked together to comprise a supercomputer.CPU / Socket / Processor / Core a CPU (Central Processing Unit) was a singular execution component for a computer. Then, multiple CPUs were incorporated into a node. Then, individual CPUs were subdivided into multiple "cores", each being a unique execution unit. CPUs with multiple cores are sometimes called "sockets" - vendor dependent. The result is a node with multiple CPUs, each containing multiple cores.
149Task A logically discrete section of computational work Task A logically discrete section of computational work. A task is typically a program or program-like set of instructions that is executed by a processor. A parallel program consists of multiple tasks running on multiple processors.Pipelining Breaking a task into steps performed by different processor units, with inputs streaming through, much like an assembly line; a type of parallel computing.Shared Memory From a strictly hardware point of view, describes a computer architecture where all processors have direct (usually bus based) access to common physical memory. In a programming sense, it describes a model where parallel tasks all have the same "picture" of memory and can directly address and access the same logical memory locations regardless of where the physical memory actually exists.
150Symmetric Multi-Processor (SMP) Hardware architecture where multiple processors share a single address space and access to all resources;Distributed Memory In hardware, refers to network based memory access for physical memory that is not common. As a programming model, tasks can only logically "see" local machine memory and must use communications to access memory on other machines where other tasks are executing.Communications Parallel tasks typically need to exchange data. several ways of communication: through a shared memory bus or over a network, however the actual event of data exchange is commonly referred to as communications regardless of the method employed.
151Synchronization The coordination of parallel tasks in real time, very often associated with communications.Granularity In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.Coarse: relatively large amounts of computational work are done between communication eventsFine: relatively small amounts of computational work are done between communication events
152Parallel Overhead The amount of time required to coordinate parallel tasks, as opposed to doing useful work. Parallel overhead can include factors such as:Task start-up time -SynchronizationsData communicationsSoftware overhead imposed by parallel compilers, libraries, tools, operating system, etc.Task termination timeMassively Parallel Refers to the hardware that comprises a given parallel system - having many processors. The meaning of "many" keeps increasing, but currently, the largest parallel computers can be comprised of processors numbering in the hundreds of thousands.Embarrassingly Parallel Solving many similar, but independent tasks simultaneously; little to no need for coordination between the tasks.Scalability Refers to a parallel system's (hardware and/or software) ability to demonstrate a proportionate increase in parallel speedup with the addition of more processors. Factors that contribute to scalability include: Hardware - particularly memory-cpu bandwidths and network communicationsApplication algorithm
153The Future:During the past 20+ years, the trends indicated by ever faster networks, distributed systems, and multi-processor computer architectures (even at the desktop level) clearly show that parallelism is the future of computing.