PARALLEL APPLICATIONS EE 524/CS 561 Kishore Dhaveji 01/09/2000

Parallelism
Parallelism is the ability of many independent threads of control to make progress simultaneously. The first step towards parallelism is finding suitable parallel algorithms. Available parallelism is the maximum number of independent threads that can usefully work at the same time.

Amdahl’s law
If even one percent of a problem fails to parallelize, then no matter how much parallelism is available for the rest, the problem can never be solved more than a hundred times faster than in the sequential case.
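A minimal statement of the law, using standard notation that is not on the slide: if a fraction p of the work parallelizes perfectly over N processors, the speedup is

```latex
% Amdahl's law: speedup with parallel fraction p on N processors
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - p}
% The slide's example: p = 0.99 gives S <= 1 / 0.01 = 100,
% no matter how many processors are available.
```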

Reasons for Parallelizing Code
Problem size
–Large problems may need more memory than a single machine can affordably provide.
Real-time requirements
–Deadlines on the order of minutes, hours or days.
–e.g., weather forecasting, financial modeling.
If parallelizing is done for performance, the effect of other users on the system must be taken into account.

Categories of Parallel Algorithms
Regular and irregular
Synchronous and asynchronous
Coarse and fine grained
Bandwidth greedy and frugal
Latency tolerant and intolerant
Distributed and shared address spaces
Beowulf systems and choices of parallelism

Regular and Irregular
Regular algorithms
–Use data structures that fit naturally into rectangular arrays
–Easier to partition into separate parallel processes (see the sketch below)
Irregular algorithms
–Use complicated data structures, e.g., trees, graphs, lists, hash tables, indirect addressing
–Require careful consideration for parallelization
–Can exploit features like sparseness, hierarchies and unpredictable update patterns
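A minimal sketch, not from the slides, of why regular data is easy to partition: a 1-D block decomposition of an n-element array across p processes, each working only on its own contiguous slice. The function and variable names are illustrative.

```c
#include <stdio.h>

/* Block decomposition: process `rank` of `p` owns elements [lo, hi) of an
 * n-element array.  The formulas give nearly equal, contiguous blocks. */
static void block_range(int n, int p, int rank, int *lo, int *hi) {
    *lo = (int)((long long)rank * n / p);
    *hi = (int)((long long)(rank + 1) * n / p);
}

int main(void) {
    int n = 1000, p = 4;
    for (int rank = 0; rank < p; rank++) {
        int lo, hi;
        block_range(n, p, rank, &lo, &hi);
        printf("process %d works on indices [%d, %d)\n", rank, lo, hi);
    }
    return 0;
}
```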

Synchronous and Asynchronous
In synchronous algorithms, the parallel parts must proceed in “lockstep”.
In asynchronous algorithms, different parts of the algorithm can “get ahead” of one another.
Asynchronous algorithms often require less bookkeeping and hence are much easier to parallelize.

Coarse and Fine Grained
Grain size refers to the amount of work performed by each process in a parallel computation.
A large grain size calls for less interprocessor communication relative to computation.
The ratio of communication to computation behaves like the ratio of a grain’s surface area to its volume; this ratio should be kept small for best results.
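A sketch of the surface-to-volume argument for a regular 3-D domain decomposition (illustrative; the specific geometry is my assumption, not on the slide): if each process owns an n × n × n block of grid points, it computes on all n³ points but communicates only its boundary faces.

```latex
% Communication ~ surface, computation ~ volume for an n x n x n block
\frac{T_{\text{comm}}}{T_{\text{comp}}}
  \;\propto\;
\frac{\text{surface}}{\text{volume}}
  \;=\;
\frac{6 n^{2}}{n^{3}}
  \;=\;
\frac{6}{n}
% Larger grains (bigger n) shrink this ratio, which is why coarse-grained
% decompositions communicate relatively less.
```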

Bandwidth Greedy and Frugal
t_comm = t_latency + message length / bandwidth
CPU speed >> memory speed >> network speed (2400 MBps >> … MBps >> 15 MBps)
Memory bandwidth and network speed are the limiting factors in overall system performance.
Algorithms may demand a great deal of bandwidth (greedy) or relatively little (frugal).
Increasing the grain size leads to bandwidth-frugal and better performing algorithms.
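One way to connect this to grain size, under the same cost model (the aggregation over m messages is my addition, not stated on the slide):

```latex
% Total communication time for m messages carrying V bytes in all:
T_{\text{comm}} \;=\; m \, t_{\text{latency}} \;+\; \frac{V}{B}
% Coarser grains mean fewer messages (smaller m) and, by the surface-to-volume
% argument above, less data V per unit of computation -- hence bandwidth frugal.
```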

Latency Tolerant and Intolerant
Latency: the time from when a message is sent until its delivery begins.
Half-performance message length (in bytes): n_{1/2} = latency × bandwidth.
Messages longer than n_{1/2} are bandwidth dominated; shorter messages are latency dominated.
High latencies are the most conspicuous shortcoming of Beowulf systems, so successful algorithms are usually latency tolerant.
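Where n_{1/2} comes from, under the same t_comm model as above (a standard derivation, not spelled out on the slide): it is the message size at which the latency and bandwidth terms contribute equally, i.e. the size that achieves half of the peak bandwidth B.

```latex
% Effective bandwidth of an n-byte message under t_comm = t_latency + n/B:
B_{\text{eff}}(n) \;=\; \frac{n}{t_{\text{latency}} + n/B}
% Setting n = n_{1/2} = t_{\text{latency}} \cdot B gives
B_{\text{eff}}(n_{1/2})
  \;=\; \frac{t_{\text{latency}} B}{t_{\text{latency}} + t_{\text{latency}}}
  \;=\; \frac{B}{2}
```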

Distributed and Shared Address Spaces
Shared address space
–All processors access a common shared address space.
–Simplifies the design of parallel algorithms.
–Introduces race conditions and non-determinacy.
Distributed address spaces
–Processes communicate through message-passing procedures (see the sketch below).
–Designing languages and compilers for this model has been difficult.
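A minimal message-passing sketch in C using MPI (MPI is one common choice on Beowulf clusters; the slides do not name a specific library): each process has its own private address space, and data moves only through explicit send/receive calls.

```c
#include <mpi.h>
#include <stdio.h>

/* Distributed address space: rank 0 owns `value` and must explicitly send a
 * copy to rank 1; there is no shared memory for rank 1 to read directly. */
int main(int argc, char **argv) {
    int rank;
    double value = 0.0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 3.14159;
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
        printf("rank 1 received %g from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```

Compile with mpicc and run with at least two processes, e.g. mpirun -np 2 ./a.out.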

Choice of Parallelism
Requires:
–Large grain size algorithms
–Latency tolerant and bandwidth frugal algorithms
–Regular and asynchronous algorithms
–Tuning of the Beowulf system

Process Level Parallelism
Beowulf systems are well suited to process-level parallelism, i.e., running multiple independent processes.
The requirement for process-level parallelism is the existence of sequential code that must be run many times.
If the process pool is large, processes can self-schedule and load-balance across the system.

Utilities for Parallel Computing
A dispatcher program takes an arbitrary list of commands and dispatches them to a set of processors.
No more than one command is run on any host at any one time.
The order in which commands are dispatched to hosts, and thus the order in which they run, is arbitrary.
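A minimal sketch of such a dispatcher in C (the slides give no implementation; the host names, command strings and fork/wait scheme are assumptions for illustration, with rsh as the remote shell as in the next slide): it keeps one command running per host and hands the next command to whichever host finishes first.

```c
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical host list and command list for illustration. */
static const char *hosts[] = { "node1", "node2", "node3" };
static const char *cmds[]  = { "job 1", "job 2", "job 3", "job 4", "job 5" };

static pid_t run_on(const char *host, const char *cmd) {
    pid_t pid = fork();
    if (pid == 0) {                       /* child: run the command remotely */
        execlp("rsh", "rsh", host, cmd, (char *)NULL);
        perror("execlp");                 /* only reached if rsh fails */
        _exit(127);
    }
    return pid;                           /* parent: remember which child */
}

int main(void) {
    int nhosts = sizeof(hosts) / sizeof(hosts[0]);
    int ncmds  = sizeof(cmds)  / sizeof(cmds[0]);
    pid_t busy[16] = { 0 };               /* busy[i] = pid running on hosts[i] */
    int next = 0, running = 0;

    /* Start at most one command per host. */
    for (int i = 0; i < nhosts && next < ncmds; i++, next++, running++)
        busy[i] = run_on(hosts[i], cmds[next]);

    /* Whenever a command finishes, reuse its host for the next command. */
    while (running > 0) {
        int status;
        pid_t done = wait(&status);
        for (int i = 0; i < nhosts; i++) {
            if (busy[i] == done) {
                busy[i] = 0;
                running--;
                if (next < ncmds) {
                    busy[i] = run_on(hosts[i], cmds[next++]);
                    running++;
                }
                break;
            }
        }
    }
    return 0;
}
```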

Overheads - rsh and File I/O
Startup costs: establishing connections, authenticating the user, and executing remote commands.
Additional overhead from I/O.
If the executable is stored on an NFS-mounted file system, it is transferred over the network on every invocation.
NFS is limited by the performance of the server and its network connections.

Overheads - rsh and File I/O
Startup problems can be reduced by pre-staging data and executables on local disks and storing the results of computations there as well.
Disciplined placement of files in directories avoids unnecessary network connections.
NFS serves requests sequentially and has no broadcast; this can be mitigated by caching data that does not change, e.g., input files and executables, on local storage (swap space and /scratch).

Summary
Process-level parallelism is the easiest and most cost-effective approach for Beowulf systems.
Optimize to reduce network traffic and improve I/O performance.
Process-level parallelism load-balances automatically when processes self-schedule.
Multiple users could be accommodated with variable priorities and scheduling policies.
A GUI can display the status of the system.