
Parallel Computing Approaches & Applications
Arthur Asuncion
April 15, 2008

Roadmap
- Brief Overview of Parallel Computing
- U. Maryland work:
  - PRAM prototype
  - XMT programming model
- Current Standards:
  - MPI
  - OpenMP
- Parallel Algorithms for Bayesian Networks, Gibbs Sampling

Why Parallel Computing?
- Moore's law will eventually end.
- Processors are becoming cheaper.
- Parallel computing provides significant time and memory savings!

Parallel Computing
- Goal is to maximize efficiency / speedup:
  - Efficiency = T_seq / (P * T_par) < 1
  - Speedup = T_seq / T_par < P
- In practice, time savings are substantial, assuming communication costs are low and processor idle time is minimized.
- Orthogonal to:
  - Advancements in processor speeds
  - Code optimization and data structure techniques
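As an illustration (my numbers, not from the slides): if a serial run takes T_seq = 100 s and P = 8 processors finish in T_par = 20 s, then Speedup = 100 / 20 = 5 and Efficiency = 100 / (8 * 20) = 0.625, i.e., each processor does useful work about 63% of the time.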

Some issues to consider
- Implicit vs. Explicit Parallelization
- Distributed vs. Shared Memory
- Homogeneous vs. Heterogeneous Machines
- Static vs. Dynamic Load Balancing
- Other Issues:
  - Communication Costs
  - Fault-Tolerance
  - Scalability

Main Questions
- How can we design parallel algorithms?
  - Need to think of places in the algorithm that can be made concurrent
  - Need to understand data dependencies (the "critical path" is the longest chain of dependent calculations); see the sketch after this list
- How do we implement these algorithms?
  - An engineering issue with many different options
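A minimal sketch in C (my illustration, not from the slides) of the two situations: a loop whose iterations are independent and can run concurrently, and a loop whose iterations form a chain of dependent calculations, i.e., a critical path.

    /* Hypothetical arrays a, b, c, x of length n; r is a scalar. */
    void dependence_example(int n, const double *a, const double *b,
                            double *c, double *x, double r)
    {
        /* Iterations do not depend on one another: easy to run concurrently. */
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];

        /* Each iteration reads the previous result (loop-carried dependence),
           so these n-1 steps form a critical path and cannot simply be
           split across processors without restructuring. */
        for (int i = 1; i < n; i++)
            x[i] = x[i-1] * r + a[i];
    }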

U. Maryland Work (Vishkin)
- "FPGA-Based Prototype of a PRAM-On-Chip Processor"
- Xingzhi Wen and Uzi Vishkin, ACM Computing Frontiers, 2008

Goals
- Find a parallel computing framework that:
  - is easy to program
  - gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability, including backwards compatibility on serial code
  - supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming
  - fits current chip technology and scales with it
- They claim that PRAM/XMT can meet these goals

What is PRAM?
- "Parallel Random Access Machine"
- Virtual model of computation with some simplifying assumptions:
  - No limit to the number of processors.
  - No limit on the amount of shared memory.
  - Any number of concurrent accesses to shared memory take the same time as a single access.
- Simple model that can be analyzed theoretically
- Eliminates focus on details like synchronization and communication
- Different types (see the sketch after this list):
  - EREW: Exclusive read, exclusive write.
  - CREW: Concurrent read, exclusive write.
  - CRCW: Concurrent read, concurrent write.
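To make the write rules concrete, here is a sketch (my example, not from the slides) of a classic CRCW PRAM algorithm: computing the OR of n bits in O(1) parallel time. The loop body should be read as running on n processors simultaneously; all writers store the same value, which the "common" CRCW rule permits. An EREW machine would instead need roughly log n steps for this reduction, which is why the write rule matters.

    /* Serial C emulation of the O(1) CRCW-PRAM OR; the loop stands in for
       n processors acting at the same time. */
    int crcw_or(const int *A, int n)
    {
        int result = 0;              /* single shared memory cell */
        for (int i = 0; i < n; i++)  /* "processor i" runs this concurrently */
            if (A[i] == 1)
                result = 1;          /* concurrent writes of the same value */
        return result;
    }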

XMT Programming Model
- XMT = "Explicit Multi-Threading"
- Assumes a CRCW PRAM
- Multithreaded extension of C with 3 commands:
  - Spawn: starts parallel execution mode
  - Join: resumes serial mode
  - Prefix-sum: atomic command for incrementing a variable

RAM vs. PRAM

Simple Example
- Task: Copy nonzero elements from A to B
- $ is the thread ID
- PS is Prefix-Sum
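A sketch of the slide's example in XMT-C-style pseudocode (my reconstruction; the exact syntax may differ from the real XMT compiler): copy the nonzero elements of A[0..n-1] into consecutive slots of B, using a shared counter advanced with prefix-sum.

    int x = 0;                   /* shared counter: next free slot in B */
    spawn(0, n - 1) {            /* one virtual thread per index; $ is the thread ID */
        int e = 1;               /* local increment */
        if (A[$] != 0) {
            ps(e, x);            /* prefix-sum: e receives the old value of x,
                                    and x is atomically increased by 1 */
            B[e] = A[$];
        }
    }                            /* join: execution returns to serial mode */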

Architecture of PRAM prototype
- MTCU ("Master Thread Control Unit"): handles sequential portions
- TCU clusters: handle parallel portions
  - 64 separate processors, each at 75 MHz
  - 1 GB RAM, 32 KB per cache (8 shared cache modules)
- Shared cache
- Shared PS unit: the only way to communicate!

Envisioned Processor

Performance Results
- Using 64 processors
- Projected results: 75 MHz -> 800 MHz

Human Results
- "As PRAM algorithms are based on first principles that require relatively little background, a full day (300-minute) PRAM/XMT tutorial was offered to a dozen high-school students in September. Followed up with only a weekly office-hour by an undergraduate assistant, some strong students have been able to complete 5 of 6 assignments given in a graduate course on parallel algorithms."
- In other words: XMT is an easy way to program in parallel

Main Claims
- "First commitment to silicon for XMT"
  - An actual attempt to implement a PRAM
- "Timely case for the education enterprise"
  - XMT can be learned easily, even by high schoolers.
- "XMT is a candidate for the Processor of the Future"

My Thoughts
- Making parallel programming as pain-free as possible is desirable, and XMT makes a good attempt to do this.
- Performance is a secondary goal.
- Their technology does not seem to be ready for prime time yet:
  - 75 MHz processors
  - No floating-point operations, no OS

MPI Overview
- MPI ("Message Passing Interface") is the standard for distributed computing
- Basically, it is an extension of C/Fortran that allows processors to send messages to each other.
- A tutorial:
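A minimal sketch in C (my example, not from the slides) of the message-passing style: rank 0 sends an integer to rank 1. Compile with mpicc and launch with at least two processes (e.g., mpirun -np 2).

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, value = 42;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* send one int, with tag 0, to rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d from rank 0\n", value);
        }

        MPI_Finalize();
        return 0;
    }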

OpenMP overview
- OpenMP is the standard for shared-memory computing
- Extends C with compiler directives to denote parallel sections
- Normally used for the parallelization of "for" loops.
- Tutorial:
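A minimal sketch in C (my example, not from the slides) of the directive style: a "for" loop split across threads, with a reduction clause combining the per-thread partial sums. Compile with an OpenMP-enabled compiler (e.g., gcc -fopenmp).

    #include <stdio.h>
    #include <omp.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* iterations are divided among the available threads */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1);

        printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
        return 0;
    }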

Parallel Computing in AI/ML
- Parallel Inference in Bayesian networks
- Parallel Gibbs Sampling
- Parallel Constraint Satisfaction
- Parallel Search
- Parallel Neural Networks
- Parallel Expectation Maximization, etc.

Finding Marginals in Parallel through "Pointer Jumping" (Pennock, UAI 1998)
- Each variable is assigned to a separate processor
- Processors rewrite conditional probabilities in terms of the grandparent:
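The rewriting step (my reconstruction of the pointer-jumping update, in the deck's plain-text notation; see Pennock, UAI 1998, for the exact form): each processor sums out its parent so that its variable is conditioned directly on its grandparent,

    P(X_i | X_g) = sum over x_p of P(X_i | X_p = x_p) * P(X_p = x_p | X_g),

where X_p is the parent of X_i and X_g is its grandparent. Each round doubles the length of the "jump," so every variable becomes conditioned on the root after O(log n) rounds.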

Algorithm

Evidence Propagation
- "Arc Reversal" + "Evidence Absorption"
- Step 1: Make the evidence variable the root node and create a preorder walk (can be done in parallel)
- Step 2: Reverse arcs not consistent with that preorder walk (can be done in parallel), and absorb evidence
- Step 3: Run the "Parallel Marginals" algorithm

Generalizing to Polytrees
- Note: Converting Bayesian networks to junction trees can also be done in parallel
- Namasivayam et al., "Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference," 18th Int. Symp. on Comp. Arch. and High Perf. Comp., 2006.

Complexity
- Time complexity:
  - O(log n) for polytree networks!
    - Assuming 1 processor per variable
    - n = # of processors/variables
  - O(r^(3w) * log n) for arbitrary networks
    - r = domain size, w = largest cluster size

Parallel Gibbs Sampling
- Running multiple parallel chains is trivial.
- Parallelizing a single chain can be difficult:
  - Can use a Metropolis-Hastings step to sample from the joint distribution correctly.
  - Related ideas: Metropolis-coupled MCMC, Parallel Tempering, Population MCMC
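A sketch in C with OpenMP (my example, not from the slides) of the trivial case: running several independent chains at once. Here run_chain() is a hypothetical user-supplied Gibbs sampler, and each chain gets its own random seed so the chains never interact.

    #include <omp.h>

    /* hypothetical sampler: fills out[] with n_samples draws from one chain */
    void run_chain(unsigned int seed, int n_samples, double *out);

    void run_parallel_chains(int n_chains, int n_samples, double **results)
    {
        #pragma omp parallel for
        for (int c = 0; c < n_chains; c++) {
            unsigned int seed = 1234u + (unsigned int)c;  /* independent seeds */
            run_chain(seed, n_samples, results[c]);       /* one chain per thread */
        }
    }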

Recap
- Many different ways to implement parallel algorithms (XMT, MPI, OpenMP)
- In my opinion, designing efficient parallel algorithms is the harder part.
- Parallel computing in the context of AI/ML is still not fully explored!