
1 Parallel Computing Approaches & Applications Arthur Asuncion April 15, 2008

2 Roadmap
- Brief overview of parallel computing
- U. Maryland work: PRAM prototype, XMT programming model
- Current standards: MPI, OpenMP
- Parallel algorithms for Bayesian networks and Gibbs sampling

3 Why Parallel Computing?
- Moore's law will eventually end.
- Processors are becoming cheaper.
- Parallel computing provides significant time and memory savings!

4 Parallel Computing
- Goal is to maximize efficiency / speedup:
  Efficiency = T_seq / (P * T_par) < 1
  Speedup = T_seq / T_par < P
- In practice, time savings are substantial, assuming communication costs are low and processor idle time is minimized.
- Orthogonal to:
  - advancements in processor speeds
  - code optimization and data structure techniques
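As a worked illustration of these definitions (numbers chosen for illustration, not taken from the talk): if the sequential runtime is T_seq = 100 s and P = 8 processors finish in T_par = 16 s, then Speedup = 100 / 16 = 6.25 < 8 and Efficiency = 100 / (8 * 16) ≈ 0.78 < 1. The gap from the ideal values (8 and 1) reflects communication costs and processor idle time.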

5 Some Issues to Consider
- Implicit vs. explicit parallelization
- Distributed vs. shared memory
- Homogeneous vs. heterogeneous machines
- Static vs. dynamic load balancing
- Other issues: communication costs, fault tolerance, scalability

6 Main Questions
- How can we design parallel algorithms?
  - Need to identify the places in the algorithm that can be made concurrent
  - Need to understand data dependencies ("critical path" = longest chain of dependent calculations)
- How do we implement these algorithms?
  - An engineering issue with many different options

7 U. Maryland Work (Vishkin)
- "FPGA-Based Prototype of a PRAM-On-Chip Processor," Xingzhi Wen and Uzi Vishkin, ACM Computing Frontiers, 2008
- Video: http://videos.webpronews.com/2007/06/28/supercomputer-arrives/

8 Goals
Find a parallel computing framework that:
- is easy to program
- gives good performance with any amount of parallelism provided by the algorithm; namely, up- and down-scalability, including backwards compatibility on serial code
- supports application programming (VHDL/Verilog, OpenGL, MATLAB) and performance programming
- fits current chip technology and scales with it
They claim that PRAM/XMT can meet these goals.

9 What is PRAM?
- "Parallel Random Access Machine"
- A virtual model of computation with some simplifying assumptions:
  - No limit on the number of processors
  - No limit on the amount of shared memory
  - Any number of concurrent accesses to shared memory take the same time as a single access
- A simple model that can be analyzed theoretically; eliminates focus on details like synchronization and communication
- Different types:
  - EREW: exclusive read, exclusive write
  - CREW: concurrent read, exclusive write
  - CRCW: concurrent read, concurrent write

10 XMT Programming Model
- XMT = "Explicit Multi-Threading"
- Assumes a CRCW PRAM
- A multithreaded extension of C with 3 commands:
  - Spawn: starts parallel execution mode
  - Join: resumes serial mode
  - Prefix-sum: atomic command for incrementing a variable

11 RAM vs. PRAM

12 Simple Example
- Task: copy the nonzero elements of array A into array B
- $ is the thread ID; PS is the prefix-sum command
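The slide's actual code appears only as an image in the original deck. As a rough stand-in, below is a plain C sketch of the computation being described (array compaction); this is ordinary sequential C, not XMT syntax. In the XMT version, each loop iteration would run as its own spawned thread (indexed by $), and the counter increment would be the atomic prefix-sum (PS) command, which both reserves a slot in B and advances the counter.

    #include <stdio.h>

    /* Copy the nonzero elements of A into B, returning the count.
     * Sequential sketch only: in XMT, each iteration is an independent
     * thread and the increment of "count" is an atomic prefix-sum. */
    int compact_nonzero(const int *A, int *B, int n) {
        int count = 0;                 /* plays the role of the prefix-sum base */
        for (int i = 0; i < n; i++) {  /* i plays the role of the thread ID $ */
            if (A[i] != 0) {
                B[count] = A[i];       /* slot reserved by the increment below */
                count++;
            }
        }
        return count;
    }

    int main(void) {
        int A[] = {0, 3, 0, 7, 5, 0, 2};
        int B[7];
        int m = compact_nonzero(A, B, 7);
        for (int i = 0; i < m; i++)
            printf("%d ", B[i]);       /* prints: 3 7 5 2 */
        printf("\n");
        return 0;
    }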

13 Architecture of the PRAM Prototype
- MTCU ("Master Thread Control Unit"): handles sequential portions
- TCU clusters: handle parallel portions; 64 separate processors, each at 75 MHz
- 1 GB RAM; 32 KB per cache module (8 shared cache modules)
- Shared prefix-sum (PS) unit: the only way to communicate!

14 Envisioned Processor

15 Performance Results
- Measured on the prototype using 64 processors
- Projected results extrapolate the clock from 75 MHz to 800 MHz

16 Human Results
- "As PRAM algorithms are based on first principles that require relatively little background, a full day (300-minute) PRAM/XMT tutorial was offered to a dozen high-school students in September 2007. Followed up with only a weekly office-hour by an undergraduate assistant, some strong students have been able to complete 5 of 6 assignments given in a graduate course on parallel algorithms."
- In other words: XMT is an easy way to program in parallel.

17 Main Claims
- "First commitment to silicon for XMT": an actual attempt to implement a PRAM
- "Timely case for the education enterprise": XMT can be learned easily, even by high schoolers
- "XMT is a candidate for the Processor of the Future"

18 My Thoughts
- Making parallel programming as pain-free as possible is desirable, and XMT makes a good attempt at this.
- Performance is a secondary goal.
- Their technology does not seem ready for prime time yet: 75 MHz processors, no floating-point operations, no OS.

19 MPI Overview
- MPI ("Message Passing Interface") is the standard for distributed computing
- Basically an extension of C/Fortran that allows processors to send messages to each other
- A tutorial: http://www.cs.gsu.edu/~cscyip/csc4310/MPI1.ppt
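To make the message-passing style concrete, here is a minimal sketch in C (not part of the original slides): rank 0 sends one integer to rank 1. It assumes a working MPI installation; compile with mpicc and run with mpirun -np 2.

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal message-passing sketch: rank 0 sends one integer to rank 1. */
    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            int payload = 42;
            MPI_Send(&payload, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            printf("rank 0 sent %d\n", payload);
        } else if (rank == 1) {
            int received;
            MPI_Recv(&received, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", received);
        }

        MPI_Finalize();
        return 0;
    }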

20 OpenMP Overview
- OpenMP is the standard for shared-memory computing
- Extends C with compiler directives to denote parallel sections
- Normally used for the parallelization of "for" loops
- Tutorial: http://vergil.chemistry.gatech.edu/resources/programming/OpenMP.pdf
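A minimal sketch of the typical usage described above (loop parallelization, plus a reduction); illustrative only, not from the talk, and assumes an OpenMP-enabled compiler such as gcc with -fopenmp.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        double sum = 0.0;

        /* Loop iterations are split across threads by the directive. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            a[i] = 0.5 * i;

        /* The reduction clause gives each thread a private partial sum. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
        return 0;
    }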

21 Parallel Computing in AI/ML
- Parallel inference in Bayesian networks
- Parallel Gibbs sampling
- Parallel constraint satisfaction
- Parallel search
- Parallel neural networks
- Parallel expectation maximization, etc.

22 Finding Marginals in Parallel through "Pointer Jumping" (Pennock, UAI 1998)
- Each variable is assigned to a separate processor
- Processors rewrite conditional probabilities in terms of the grandparent
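The rewriting formula itself appears only as an image in the original deck. As a hedged illustration of the pointer-jumping pattern alone (Pennock's algorithm additionally recomputes each node's conditional probability table with respect to its new ancestor at every jump, which is omitted here), the C sketch below repeatedly replaces each node's parent pointer with its grandparent, so every node reaches the root after O(log n) rounds; in the parallel version each iteration of the inner loop would run on its own processor.

    #include <stdio.h>

    #define N 8

    /* Pointer jumping on a rooted tree given as a parent array, with
     * parent[root] == root. Each round, every node jumps to its
     * grandparent; after about log2(N) rounds all nodes point to the root. */
    void pointer_jump(int parent[], int n) {
        int changed = 1;
        while (changed) {
            changed = 0;
            int next[N];
            for (int i = 0; i < n; i++) {      /* each i: one processor */
                next[i] = parent[parent[i]];   /* jump to grandparent */
                if (next[i] != parent[i])
                    changed = 1;
            }
            for (int i = 0; i < n; i++)        /* synchronous update */
                parent[i] = next[i];
        }
    }

    int main(void) {
        /* A chain 0 <- 1 <- 2 <- ... <- 7, with node 0 as root. */
        int parent[N] = {0, 0, 1, 2, 3, 4, 5, 6};
        pointer_jump(parent, N);
        for (int i = 0; i < N; i++)
            printf("parent[%d] = %d\n", i, parent[i]);  /* all become 0 */
        return 0;
    }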

23 Algorithm

24 Evidence Propagation
- "Arc reversal" + "evidence absorption"
- Step 1: make the evidence variable the root node and create a preorder walk (can be done in parallel)
- Step 2: reverse arcs not consistent with that preorder walk, and absorb evidence (can be done in parallel)
- Step 3: run the "Parallel Marginals" algorithm

25 Generalizing to Polytrees
- Note: converting Bayesian networks to junction trees can also be done in parallel
- Namasivayam et al., "Scalable Parallel Implementation of Bayesian Network to Junction Tree Conversion for Exact Inference," 18th Int. Symp. on Comp. Arch. and High Perf. Comp., 2006

26 Complexity
- Time complexity:
  - O(log n) for polytree networks! (assuming 1 processor per variable; n = # of processors/variables)
  - O(r^(3w) log n) for arbitrary networks (r = domain size, w = largest cluster size)

27 Parallel Gibbs Sampling
- Running multiple parallel chains is trivial (see the sketch below).
- Parallelizing a single chain can be difficult: a Metropolis-Hastings step can be used to sample from the joint distribution correctly.
- Related ideas: Metropolis-coupled MCMC, parallel tempering, population MCMC
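As a hedged sketch of the trivial case (independent chains, one per thread), the C/OpenMP example below runs several Gibbs chains on a toy two-spin model chosen purely for illustration, P(x, y) proportional to exp(J*x*y) with x, y in {-1, +1}; the model, variable names, and settings are my own assumptions, not from the talk. Each chain keeps its own RNG state, so the chains never interact, and their estimates of E[x*y] are averaged at the end (the exact value is tanh(J)).

    #include <math.h>
    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Sample a spin in {-1,+1} with P(+1 | field) = 1 / (1 + exp(-2*field)). */
    static int sample_spin(double field, unsigned int *seed) {
        double u = (double)rand_r(seed) / RAND_MAX;
        return (u < 1.0 / (1.0 + exp(-2.0 * field))) ? +1 : -1;
    }

    int main(void) {
        const double J = 0.5;
        const int n_chains = 8, n_sweeps = 100000, burn_in = 1000;
        double estimates[8];

        #pragma omp parallel for
        for (int c = 0; c < n_chains; c++) {
            unsigned int seed = 1234u + (unsigned int)c;  /* per-chain RNG state */
            int x = 1, y = 1;
            double sum = 0.0;
            for (int t = 0; t < n_sweeps; t++) {
                x = sample_spin(J * y, &seed);   /* Gibbs update of x given y */
                y = sample_spin(J * x, &seed);   /* Gibbs update of y given x */
                if (t >= burn_in)
                    sum += x * y;
            }
            estimates[c] = sum / (n_sweeps - burn_in);
        }

        double avg = 0.0;
        for (int c = 0; c < n_chains; c++)
            avg += estimates[c] / n_chains;
        printf("estimated E[x*y] = %f (exact: tanh(J) = %f)\n", avg, tanh(J));
        return 0;
    }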

28 Recap
- Many different ways to implement parallel algorithms (XMT, MPI, OpenMP)
- In my opinion, designing efficient parallel algorithms is the harder part.
- Parallel computing in the context of AI/ML is still not fully explored!

