1 Day 2

2 Agenda Parallelism basics Parallel machines Parallelism again High Throughput Computing –Finding the right grain size

3 One thing to remember Easy Hard

4 Seeking Concurrency Data dependence graphs Data parallelism Functional parallelism Pipelining

5 Data Dependence Graph Directed graph Vertices = tasks Edges = dependences

6 Data Parallelism Independent tasks apply same operation to different elements of a data set Okay to perform operations concurrently for i ← 0 to 99 do a[i] ← b[i] + c[i] endfor
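
A minimal sketch (not from the original slides) of the same data-parallel loop in C with OpenMP; the arrays and their contents are made up for illustration.

#include <stdio.h>

#define N 100

int main(void) {
    double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }   /* sample data */

    /* each iteration is independent, so the iterations may run concurrently */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    printf("a[99] = %f\n", a[99]);
    return 0;
}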

7 Functional Parallelism Independent tasks apply different operations to different data elements First and second statements Third and fourth statements a ← 2 b ← 3 m ← (a + b) / 2 s ← (a² + b²) / 2 v ← s − m²
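
As an illustrative sketch (assumed, not from the slides), the third and fourth statements above are independent of each other; in C with OpenMP this could be expressed with parallel sections.

#include <stdio.h>

int main(void) {
    double a = 2.0, b = 3.0, m, s, v;

    /* m and s depend only on a and b, so they can be computed concurrently */
    #pragma omp parallel sections
    {
        #pragma omp section
        m = (a + b) / 2.0;
        #pragma omp section
        s = (a * a + b * b) / 2.0;
    }
    v = s - m * m;          /* v needs both m and s, so it runs afterwards */
    printf("mean = %f, variance = %f\n", m, v);
    return 0;
}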

8 Pipelining Divide a process into stages Produce several items simultaneously

9 Data Clustering Data mining = looking for meaningful patterns in large data sets Data clustering = organizing a data set into clusters of “similar” items Data clustering can speed retrieval of related items

10 Document Vectors (figure: document vectors for Alice in Wonderland, A Biography of Jules Verne, The Geology of Moon Rocks, and The Story of Apollo 11, plotted against the terms "Moon" and "Rocket")

11 Document Clustering

12 Clustering Algorithm Compute document vectors Choose initial cluster centers Repeat –Compute performance function –Adjust centers Until function value converges or max iterations have elapsed Output cluster centers
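
A compact, hypothetical C sketch of this clustering loop. To keep it short it uses one-dimensional "document vectors" (plain doubles), two clusters, and a fixed iteration count in place of the convergence test; all data values are invented.

#include <math.h>
#include <stdio.h>

#define N 8   /* number of document vectors */
#define K 2   /* number of clusters */

int main(void) {
    double doc[N] = {0.1, 0.2, 0.15, 0.9, 0.85, 0.8, 0.05, 0.95};
    double center[K] = {0.0, 1.0};          /* initial cluster centers */
    int assign[N];

    for (int iter = 0; iter < 100; iter++) {
        /* assign each document to the closest center (a data-parallel step) */
        for (int i = 0; i < N; i++) {
            int best = 0;
            for (int k = 1; k < K; k++)
                if (fabs(doc[i] - center[k]) < fabs(doc[i] - center[best]))
                    best = k;
            assign[i] = best;
        }
        /* adjust each center to the mean of its members */
        for (int k = 0; k < K; k++) {
            double sum = 0.0; int count = 0;
            for (int i = 0; i < N; i++)
                if (assign[i] == k) { sum += doc[i]; count++; }
            if (count > 0) center[k] = sum / count;
        }
    }
    for (int k = 0; k < K; k++)
        printf("center %d = %f\n", k, center[k]);
    return 0;
}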

13 Data Parallelism Opportunities Operation being applied to a data set Examples –Generating document vectors –Finding closest center to each vector –Picking initial values of cluster centers

14 Functional Parallelism Opportunities Draw data dependence diagram Look for sets of nodes such that there are no paths from one node to another

15 Data Dependence Diagram Build document vectors Compute function value Choose cluster centers Adjust cluster centers Output cluster centers

16 Programming Parallel Computers Extend compilers: translate sequential programs into parallel programs Extend languages: add parallel operations Add parallel language layer on top of sequential language Define totally new parallel language and compiler system

17 Strategy 1: Extend Compilers Parallelizing compiler –Detect parallelism in sequential program –Produce parallel executable program Focus on making Fortran programs parallel

18 Extend Compilers (cont.) Advantages –Can leverage millions of lines of existing serial programs –Saves time and labor –Requires no retraining of programmers –Sequential programming easier than parallel programming

19 Extend Compilers (cont.) Disadvantages –Parallelism may be irretrievably lost when programs written in sequential languages –Performance of parallelizing compilers on broad range of applications still up in the air

20 Extend Language Add functions to a sequential language –Create and terminate processes –Synchronize processes –Allow processes to communicate

21 Extend Language (cont.) Advantages –Easiest, quickest, and least expensive –Allows existing compiler technology to be leveraged –New libraries can be ready soon after new parallel computers are available

22 Extend Language (cont.) Disadvantages –Lack of compiler support to catch errors –Easy to write programs that are difficult to debug

23 Add a Parallel Programming Layer Lower layer –Core of computation –Process manipulates its portion of data to produce its portion of result Upper layer –Creation and synchronization of processes –Partitioning of data among processes A few research prototypes have been built based on these principles

24 Create a Parallel Language Develop a parallel language “from scratch” –occam is an example Add parallel constructs to an existing language –Fortran 90 –High Performance Fortran –C*

25 New Parallel Languages (cont.) Advantages –Allows programmer to communicate parallelism to compiler –Improves probability that executable will achieve high performance Disadvantages –Requires development of new compilers –New languages may not become standards –Programmer resistance

26 Current Status Low-level approach is most popular –Augment existing language with low-level parallel constructs –MPI and OpenMP are examples Advantages of low-level approach –Efficiency –Portability Disadvantage: More difficult to program and debug

27 Architectures Interconnection networks Processor arrays (SIMD/data parallel) Multiprocessors (shared memory) Multicomputers (distributed memory) Flynn’s taxonomy

28 Interconnection Networks Uses of interconnection networks –Connect processors to shared memory –Connect processors to each other Interconnection media types –Shared medium –Switched medium

29 Shared versus Switched Media

30 Shared Medium Allows only one message at a time Messages are broadcast Each processor “listens” to every message Arbitration is decentralized Collisions require resending of messages Ethernet is an example

31 Switched Medium Supports point-to-point messages between pairs of processors Each processor has its own path to switch Advantages over shared media –Allows multiple messages to be sent simultaneously –Allows scaling of network to accommodate increase in processors

32 Switch Network Topologies View switched network as a graph –Vertices = processors or switches –Edges = communication paths Two kinds of topologies –Direct –Indirect

33 Direct Topology Ratio of switch nodes to processor nodes is 1:1 Every switch node is connected to –1 processor node –At least 1 other switch node

34 Indirect Topology Ratio of switch nodes to processor nodes is greater than 1:1 Some switches simply connect other switches

35 Evaluating Switch Topologies Diameter Bisection width Number of edges / node Constant edge length? (yes/no)

36 2-D Mesh Network Direct topology Switches arranged into a 2-D lattice Communication allowed only between neighboring switches Variants allow wraparound connections between switches on edge of mesh

37 2-D Meshes

38 Vector Computers Vector computer: instruction set includes operations on vectors as well as scalars Two ways to implement vector computers –Pipelined vector processor: streams data through pipelined arithmetic units –Processor array: many identical, synchronized arithmetic processing elements

39 Why Processor Arrays? Historically, high cost of a control unit Scientific applications have data parallelism

40 Processor Array

41 Data/instruction Storage Front end computer –Program –Data manipulated sequentially Processor array –Data manipulated in parallel

42 Processor Array Performance Performance: work done per time unit Performance of processor array –Speed of processing elements –Utilization of processing elements

43 Performance Example 1 1024 processors Each adds a pair of integers in 1 μsec What is performance when adding two 1024-element vectors (one element per processor)?
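
A worked answer (assuming, per the previous slide, that performance is operations per unit time and all processing elements operate in lockstep): the 1024 additions complete in a single 1 μsec step, so performance is 1024 operations / 1 μsec ≈ 1.0 × 10^9 operations per second.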

44 Performance Example 2 512 processors Each adds two integers in 1 μsec Performance adding two vectors of length 600?
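
A worked answer under the same assumptions: 512 processors can add at most 512 pairs per 1 μsec step, so the 600 additions need two steps (512 in the first, the remaining 88 in the second). Performance is therefore 600 operations / 2 μsec = 3 × 10^8 operations per second; utilization falls because 424 processors sit idle during the second step.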

45 2-D Processor Interconnection Network Each VLSI chip has 16 processing elements

46 if (COND) then A else B

49 Processor Array Shortcomings Not all problems are data-parallel Speed drops for conditionally executed code Don’t adapt to multiple users well Do not scale down well to “starter” systems Rely on custom VLSI for processors Expense of control units has dropped

50 Multicomputer, aka Distributed Memory Machines Distributed memory multiple-CPU computer Same address on different processors refers to different physical memory locations Processors interact through message passing Commercial multicomputers Commodity clusters

51 Asymmetrical Multicomputer

52 Asymmetrical MC Advantages Back-end processors dedicated to parallel computations → easier to understand, model, and tune performance Only a simple back-end operating system needed → easy for a vendor to create

53 Asymmetrical MC Disadvantages Front-end computer is a single point of failure Single front-end computer limits scalability of system Primitive operating system in back-end processors makes debugging difficult Every application requires development of both front-end and back-end program

54 Symmetrical Multicomputer

55 Symmetrical MC Advantages Alleviate performance bottleneck caused by single front-end computer Better support for debugging Every processor executes same program

56 Symmetrical MC Disadvantages More difficult to maintain illusion of single “parallel computer” No simple way to balance program development workload among processors More difficult to achieve high performance when multiple processes on each processor

57 Commodity Cluster Co-located computers Dedicated to running parallel jobs No keyboards or displays Identical operating system Identical local disk images Administered as an entity

58 Network of Workstations Dispersed computers First priority: person at keyboard Parallel jobs run in background Different operating systems Different local images Checkpointing and restarting important

59 DM programming model Communicating sequential programs Disjoint address spaces Communicate by sending “messages” A message is an array of bytes –send(dest, char *buf, int len); –receive(&dest, char *buf, int &len);
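
A hedged sketch of the same send/receive pattern using MPI's point-to-point calls in C (MPI_Send, MPI_Recv, and MPI_Get_count are the standard API; the message text, tag, and ranks are made up). Run with at least two processes.

#include <mpi.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        char buf[] = "hello from rank 0";
        /* send(dest, buf, len) */
        MPI_Send(buf, (int)strlen(buf) + 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        char buf[64];
        MPI_Status status;
        int len;
        /* receive(&source, buf, &len) */
        MPI_Recv(buf, sizeof buf, MPI_CHAR, MPI_ANY_SOURCE, 0,
                 MPI_COMM_WORLD, &status);
        MPI_Get_count(&status, MPI_CHAR, &len);
        printf("rank 1 got %d bytes from rank %d: %s\n",
               len, status.MPI_SOURCE, buf);
    }
    MPI_Finalize();
    return 0;
}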

60 Multiprocessors Multiprocessor: multiple-CPU computer with a shared memory Same address on two different CPUs refers to the same memory location Avoid three problems of processor arrays –Can be built from commodity CPUs –Naturally support multiple users –Maintain efficiency in conditional code

61 Centralized Multiprocessor Straightforward extension of uniprocessor Add CPUs to bus All processors share same primary memory Memory access time same for all CPUs –Uniform memory access (UMA) multiprocessor –Symmetrical multiprocessor (SMP)

62 Centralized Multiprocessor

63 Private and Shared Data Private data: items used only by a single processor Shared data: values used by multiple processors In a multiprocessor, processors communicate via shared data values

64 Problems Associated with Shared Data Cache coherence –Replicating data across multiple caches reduces contention –How to ensure different processors have same value for same address? Synchronization –Mutual exclusion –Barrier
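
A small illustrative sketch (assumed, not from the slides) of the two synchronization mechanisms just listed, using OpenMP in C; the shared counter is an invented example.

#include <stdio.h>

int main(void) {
    int counter = 0;                 /* shared data */

    #pragma omp parallel
    {
        /* mutual exclusion: only one thread updates the shared value at a time */
        #pragma omp critical
        counter++;

        /* barrier: no thread proceeds until every thread has updated counter */
        #pragma omp barrier

        #pragma omp single
        printf("all %d threads checked in\n", counter);
    }
    return 0;
}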

65 Cache-coherence Problem (figure: memory holds X = 7; the caches of CPU A and CPU B are empty)

66 Cache-coherence Problem (figure: CPU A reads X and now caches the value 7)

67 Cache-coherence Problem (figure: CPU B also reads X, so both caches hold the value 7)

68 Cache-coherence Problem (figure: CPU A updates X to 2; memory and CPU A’s cache hold 2, but CPU B’s cache still holds the stale value 7)

69 Write Invalidate Protocol (figure: both caches hold X = 7; a cache control monitor watches the shared bus)

70 Write Invalidate Protocol (figure: CPU A broadcasts an “intent to write X” message on the bus)

71 Write Invalidate Protocol (figure: CPU B’s copy of X is invalidated; only CPU A’s cache still holds 7)

72 Write Invalidate Protocol (figure: CPU A writes X = 2 in its cache; CPU B’s copy remains invalid)

73 Distributed Multiprocessor Distribute primary memory among processors Increase aggregate memory bandwidth and lower average memory access time Allow greater number of processors Also called non-uniform memory access (NUMA) multiprocessor

74 Distributed Multiprocessor

75 Cache Coherence Some NUMA multiprocessors do not support it in hardware –Only instructions, private data in cache –Large memory access time variance Implementation more difficult –No shared memory bus to “snoop” –Directory-based protocol needed

76 Flynn’s Taxonomy Instruction stream Data stream Single vs. multiple Four combinations –SISD –SIMD –MISD –MIMD

77 SISD Single Instruction, Single Data Single-CPU systems Note: co-processors don’t count –Functional –I/O Example: PCs

78 SIMD Single Instruction, Multiple Data Two architectures fit this category –Pipelined vector processor (e.g., Cray-1) –Processor array (e.g., Connection Machine)

79 MISD Multiple Instruction, Single Data Example: systolic array

80 MIMD Multiple Instruction, Multiple Data Multiple-CPU computers –Multiprocessors –Multicomputers

81 Summary Commercial parallel computers appeared in 1980s Multiple-CPU computers now dominate Small-scale: Centralized multiprocessors Large-scale: Distributed memory architectures (multiprocessors or multicomputers)

82 Programming the Beast Task/channel model Algorithm design methodology Case studies

83 Task/Channel Model Parallel computation = set of tasks Task –Program –Local memory –Collection of I/O ports Tasks interact by sending messages through channels

84 Task/Channel Model Task Channel

85 Foster’s Design Methodology Partitioning Communication Agglomeration Mapping

86 Foster’s Methodology

87 Partitioning Dividing computation and data into pieces Domain decomposition –Divide data into pieces –Determine how to associate computations with the data Functional decomposition –Divide computation into pieces –Determine how to associate data with the computations

88 Example Domain Decompositions

89 Example Functional Decomposition

90 Partitioning Checklist At least 10x more primitive tasks than processors in target computer Minimize redundant computations and redundant data storage Primitive tasks roughly the same size Number of tasks an increasing function of problem size

91 Communication Determine values passed among tasks Local communication –Task needs values from a small number of other tasks –Create channels illustrating data flow Global communication –Significant number of tasks contribute data to perform a computation –Don’t create channels for them early in design

92 Communication Checklist Communication operations balanced among tasks Each task communicates with only a small group of neighbors Tasks can perform communications concurrently Tasks can perform computations concurrently

93 Agglomeration Grouping tasks into larger tasks Goals –Improve performance –Maintain scalability of program –Simplify programming In MPI programming, goal often to create one agglomerated task per processor

94 Agglomeration Can Improve Performance Eliminate communication between primitive tasks agglomerated into consolidated task Combine groups of sending and receiving tasks

95 Agglomeration Checklist Locality of parallel algorithm has increased Replicated computations take less time than communications they replace Data replication doesn’t affect scalability Agglomerated tasks have similar computational and communications costs Number of tasks increases with problem size Number of tasks suitable for likely target systems Tradeoff between agglomeration and code modification costs is reasonable

96 Mapping Process of assigning tasks to processors Centralized multiprocessor: mapping done by operating system Distributed memory system: mapping done by user Conflicting goals of mapping –Maximize processor utilization –Minimize interprocessor communication

97 Mapping Example

98 Optimal Mapping Finding optimal mapping is NP-hard Must rely on heuristics

99 Mapping Decision Tree Static number of tasks –Structured communication: constant computation time per task → agglomerate tasks to minimize communication and create one task per processor; variable computation time per task → cyclically map tasks to processors –Unstructured communication → use a static load balancing algorithm Dynamic number of tasks

100 Mapping Strategy Static number of tasks Dynamic number of tasks –Frequent communication between tasks → use a dynamic load balancing algorithm –Many short-lived tasks → use a run-time task-scheduling algorithm

101 Mapping Checklist Considered designs based on one task per processor and multiple tasks per processor Evaluated static and dynamic task allocation If dynamic task allocation chosen, task allocator is not a bottleneck to performance If static task allocation chosen, ratio of tasks to processors is at least 10:1

102 Case Studies Boundary value problem Finding the maximum The n-body problem Adding data input

103 Boundary Value Problem Ice water Rod Insulation

104 Rod Cools as Time Progresses

105 Finite Difference Approximation

106 Partitioning One data item per grid point Associate one primitive task with each grid point Two-dimensional domain decomposition

107 Communication Identify communication pattern between primitive tasks Each interior primitive task has three incoming and three outgoing channels

108 Agglomeration and Mapping Agglomeration

109 Sequential Execution Time χ – time to update an element n – number of elements m – number of iterations Sequential execution time: m(n−1)χ

110 Parallel Execution Time p – number of processors λ – message latency Parallel execution time: m(χ⌈(n−1)/p⌉ + 2λ)

111 Reduction Given associative operator ⊕ a₀ ⊕ a₁ ⊕ a₂ ⊕ … ⊕ aₙ₋₁ Examples –Add –Multiply –And, Or –Maximum, Minimum
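
A minimal sketch of a sum reduction using MPI's built-in collective (MPI_Reduce with MPI_SUM is the standard API; the per-process value is invented).

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, local, global;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;                 /* stand-in for a locally computed value */
    MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %d\n", global);
    MPI_Finalize();
    return 0;
}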

112 Parallel Reduction Evolution

115 Binomial Trees Subgraph of hypercube

116 Finding Global Sum (figure: initial values, one per task, in a 4 × 4 grid) 4 2 0 7 / -3 5 -6 -3 / 8 1 2 3 / -4 4 6 -1

117 Finding Global Sum (figure: after the first reduction step) 1 7 -6 4 / 4 5 8 2

118 Finding Global Sum (figure: after the second step) 8 -2 / 9 10

119 Finding Global Sum (figure: after the third step) 17 8

120 Finding Global Sum (figure: final result) 25 Binomial Tree
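
For comparison, a hand-rolled sketch of the binomial-tree sum above using MPI point-to-point calls; the pairing by bit position mirrors the figure, and the local value (the process rank) is just a stand-in.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, p, sum, other;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    sum = rank;                        /* stand-in for a local partial sum */
    for (int step = 1; step < p; step <<= 1) {
        if (rank & step) {
            /* this process is done: ship its partial sum to its partner */
            MPI_Send(&sum, 1, MPI_INT, rank - step, 0, MPI_COMM_WORLD);
            break;
        } else if (rank + step < p) {
            MPI_Recv(&other, 1, MPI_INT, rank + step, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            sum += other;
        }
    }
    if (rank == 0)
        printf("global sum = %d\n", sum);
    MPI_Finalize();
    return 0;
}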

121 Agglomeration

122 sum

123 The n-body Problem

125 Partitioning Domain partitioning Assume one task per particle Task has particle’s position, velocity vector Iteration –Get positions of all other particles –Compute new position, velocity

126 Gather

127 All-gather
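
A sketch of the all-gather step used in the n-body iteration (MPI_Allgather is the standard MPI collective; the 3-D position layout and values are illustrative).

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double my_pos[3] = {1.0 * rank, 0.0, 0.0};     /* this task's particle position */
    double *all_pos = malloc(3 * p * sizeof(double));

    /* after the call, every process holds the positions of all p particles */
    MPI_Allgather(my_pos, 3, MPI_DOUBLE, all_pos, 3, MPI_DOUBLE, MPI_COMM_WORLD);
    /* all_pos[3*i .. 3*i+2] is particle i's position; compute new velocities here */

    free(all_pos);
    MPI_Finalize();
    return 0;
}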

128 Complete Graph for All-gather

129 Hypercube for All-gather

130 Communication Time Hypercube Complete graph

131 Adding Data Input

132 Scatter

133 Scatter in log p Steps (figure: eight items 1–8 distributed in three steps; at each step every holder sends half of its remaining items to a partner)
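
A sketch of distributing the input with MPI_Scatter (the standard MPI collective); the block size and the data created on rank 0 are invented for illustration.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCK 2   /* elements each process receives */

int main(int argc, char *argv[]) {
    int rank, p;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    double *all = NULL, mine[BLOCK];
    if (rank == 0) {                       /* only the root holds the full input */
        all = malloc(p * BLOCK * sizeof(double));
        for (int i = 0; i < p * BLOCK; i++) all[i] = i;
    }
    MPI_Scatter(all, BLOCK, MPI_DOUBLE, mine, BLOCK, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    printf("rank %d received %.0f %.0f\n", rank, mine[0], mine[1]);

    free(all);                             /* free(NULL) is a no-op on non-root ranks */
    MPI_Finalize();
    return 0;
}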

134 Summary: Task/channel Model Parallel computation –Set of tasks –Interactions through channels Good designs –Maximize local computations –Minimize communications –Scale up

135 Summary: Design Steps Partition computation Agglomerate tasks Map tasks to processors Goals –Maximize processor utilization –Minimize inter-processor communication

136 Summary: Fundamental Algorithms Reduction Gather and scatter All-gather

137 High Throughput Computing Easy problems – formerly known as “embarrassingly parallel” – now known as “pleasingly parallel” Basic idea – “Gee, I have a whole bunch of jobs (each a single run of a program) that I need to do; why not run them concurrently rather than sequentially?” Sometimes called “bag of tasks” or parameter sweep problems

138 Bag-of-tasks

139 Examples A large number of proteins – each represented by a different file – to “dock” with a target protein –For all files x, execute f(x,y) Exploring a parameter space in n dimensions –Uniform –Non-uniform Monte Carlo simulations

140 Tools The most common tool is a queuing system – sometimes called a load management system or a local resource manager PBS, LSF, and SGE are the three most common. Condor is also often used. They all have the same basic functions; we’ll use PBS as an exemplar. Script languages (bash, Perl, etc.)

141 PBS qsub [options] script-file –Submit the script to run –Options can specify number of processors, other required resources (memory, etc.) –Returns the job ID (a string)

144 Other PBS Commands qstat – gives the status of jobs submitted to the queue qdel – deletes a job from the queue

145 Blasting a set of jobs

146 Issues Overhead per job is substantial –Don’t want to run millisecond jobs –May need to “bundle them up” May not be enough jobs to saturate resources –May need to break up jobs The I/O system may become saturated –Copy large files to /tmp, check for existence in your shell script, and copy only if not there May be more jobs than the queuing system can handle (many start to break down at several thousand jobs) Jobs may fail for no good reason –Develop scripts to check for output and re-submit failed jobs, up to k times

147 Homework 1. Submit a simple job to the queue that echoes the host name; redirect output to a file of your choice. 2. Via a script, submit 100 “hostname” jobs to the queue. Output should go to “output.X”, where X is the job number. 3. For each file in a rooted directory tree, run “wc” to count the words. Maintain the results in a “shadow” directory tree. Your script should be able to detect results that have already been computed.

