LogP Model: Motivation

- The BSP model is limited to the bandwidth of the network (g) and the load of the PEs, and requires a large load per superstep.
- We need better models for portable algorithms.
- Converging hardware:
  - independent of network topology
  - independent of programming models
- Assumption: the number of data elements is much larger than the number of PEs.

Parameters

- L: latency, the delay across the network
- o: overhead on a PE for sending or receiving a message
- g: gap, the minimum interval between consecutive messages (due to bandwidth)
- P: the number of PEs

Notes:
- L, o, g are independent of P and of the distance between nodes.
- Message length: L, o, g are per word, or per message of fixed short length. A k-word message counts as k short messages (k*o overhead); L is independent of message length.

Parameters (continued)

- Bandwidth: 1/g per unit message length.
- At most L/g messages can be in transit to or from any one PE at a time.
- Send-to-receive total time: L + 2o; if o >> g, g can be ignored.
- Similar to BSP, except there is no synchronization step.
- No overlap of communication and computation is assumed; overlapping would buy a speed-up factor of at most two.
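To make the formulas above concrete, here is a minimal sketch in Python (mine, not from the lecture); the function names and the assumption that a sender can inject a new message every max(g, o) time units are my own reading of the model.

```python
# Sketch of the LogP cost formulas from this slide; all function
# names and the max(g, o) injection spacing are my assumptions.

def send_to_receive(L, o):
    """One short message: sender overhead + latency + receiver overhead."""
    return o + L + o                       # = L + 2o, as on the slide

def k_word_time(k, L, o, g):
    """A k-word message modeled as k short messages, pipelined:
    injections spaced max(g, o) apart, latency L paid once."""
    return o + (k - 1) * max(g, o) + L + o

def capacity(L, g):
    """At most L/g messages in transit to or from any one PE."""
    return L // g

# With the CM-5-style numbers used later (L=6, o=2, g=4):
print(send_to_receive(6, 2))    # 10
print(k_word_time(4, 6, 2, 4))  # 22
print(capacity(6, 4))           # 1
```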

Broadcast

Optimal broadcast tree for P=8, L=6, g=4, o= [value lost]
[Figure: the broadcast tree over PEs P0..P7, with edges annotated by o, g, and L]
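The optimal broadcast tree can be built greedily: every PE that already holds the value keeps forwarding it to PEs that do not. Below is a small simulation of that schedule (my sketch; the slide's value of o is truncated, so I assume o=2, the value the CM-5 slide gives later).

```python
import heapq

def broadcast_times(P, L, o, g):
    """Greedy single-item broadcast under LogP: each send occupies the
    sender for max(g, o), and the value arrives L + 2o after the send
    starts; both sender and receiver then keep forwarding."""
    recv = [0]                    # the root holds the value at time 0
    ready = [0]                   # times at which holders can next send
    while len(recv) < P:
        t = heapq.heappop(ready)  # earliest PE free to send
        arrival = t + 2 * o + L
        recv.append(arrival)
        heapq.heappush(ready, t + max(g, o))  # sender sends again
        heapq.heappush(ready, arrival)        # new holder joins in
    return recv

print(broadcast_times(8, L=6, o=2, g=4))
# -> [0, 10, 14, 18, 20, 22, 24, 24]: all 8 PEs informed by t = 24
```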

Optimal Sum

Given time T, how many items can we add?

Recursive approach:
- At the root, if T <= L + 2o, use a single PE (it can add T + 1 items).
- If T > L + 2o, the root must have the result ready at time T, so a sender must have its partial sum ready at T - L - 2o - 1; recursively construct the sum tree at that sender.
- If T - g > L + 2o, the root can also receive more data, computing as a root with deadline T - g.
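The recursion can be written directly as a recurrence, f(T) = f(T - g) + f(T - L - 2o - 1) once T exceeds L + 2o. The sketch below is my reading of the slide; the off-by-one bookkeeping for the final addition is an assumption.

```python
from functools import lru_cache

def max_summable(T, L, o, g):
    """f(t): how many items one PE can have summed by time t,
    assuming one addition per time unit."""
    @lru_cache(maxsize=None)
    def f(t):
        if t < 0:
            return 0
        if t <= L + 2 * o:
            return t + 1       # a lone PE just adds locally
        # The root keeps receiving with deadline t - g while a subtree
        # delivers a partial sum ready by t - L - 2o - 1 (transfer time
        # plus the final addition at the root).
        return f(t - g) + f(t - L - 2 * o - 1)
    return f(T)

print(max_summable(28, L=6, o=2, g=4))  # 60 items in 28 time units
```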

Applications

FFT on the butterfly network:
- Data placement:
  - cyclic layout: the first log(n/P) iterations are local, the last log P need global communication
  - blocked layout: the first log P iterations need global communication, the remaining ones are local
  - hybrid: start cyclic; after log(n/P) iterations, re-map to a blocked layout so that the remaining iterations are also local
- Communication time of the re-map: g * (n/P^2) * (P-1) + L
  (each PE holds n/P data, and a 1/P share of it goes to each other PE)
- Total time is within a factor of (1 + g/log n) of optimal.
- All-to-all communication schedule (see the sketch after this slide):
  - Approach 1: every PE sends to PE1, then PE2, ... in the same order => bottleneck at PE1, then PE2, and so on
  - Approach 2 (staggered): no congestion
    - PE1 sends to PE2, PE3, ...
    - PE2 sends to PE3, PE4, ...
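The sketch referenced above contrasts the two schedules and evaluates the re-map cost formula; it is my illustration (with 0-based PE numbers), not code from the lecture.

```python
def staggered_all_to_all(P):
    """Approach 2: in round r, PE i sends to PE (i + r) mod P, so every
    round is a permutation and no receiver is hit twice (no congestion).
    Approach 1 would have every PE target PE 0, then PE 1, ... in the
    same order, piling all P - 1 messages onto one receiver at a time."""
    return [[(i, (i + r) % P) for i in range(P)] for r in range(1, P)]

def remap_comm_time(n, P, L, g):
    """The slide's re-map cost: each PE holds n/P elements and sends
    n/P**2 of them to each of the other P - 1 PEs."""
    return g * (n // P**2) * (P - 1) + L

for r, rnd in enumerate(staggered_all_to_all(4), start=1):
    print(f"round {r}: {rnd}")
print(remap_comm_time(n=4096, P=8, L=6, g=4))  # 4 * 64 * 7 + 6 = 1798
```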

Implementation on the CM-5

- CM-5: 33 MHz nodes, fat-tree network, global control network for scan/prefix/broadcast.
- One CM node: [MFLOPS figure missing]; FFT on local data: [MFLOPS figure missing] (cache effect).
- Per butterfly cycle:
  - multiply and add: 4.5 us
  - o: 2 us
  - L: 6 us
  - g: 4 us
  - load and store overhead per cycle: 1 us
- Communication time: (n/P) * max(1 us + 2o, g) + L
- Bottleneck: processing and overhead, not bandwidth.
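Plugging the slide's numbers into its formula shows why bandwidth is not the bottleneck: the per-element cost is max(1 + 2*2, 4) = 5 us, set by load/store plus overhead rather than by g. A quick check (my values for n and P are hypothetical; times in microseconds):

```python
def fft_remap_time(n, P, o=2.0, g=4.0, L=6.0, load_store=1.0):
    """(n/P) * max(1 us + 2o, g) + L, as on the slide."""
    per_element = max(load_store + 2 * o, g)  # 5.0 us: overhead-bound
    return (n / P) * per_element + L

print(fft_remap_time(n=4096, P=64))  # 326.0 us
```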

LU decomposition: the data arrangement is critical.

Matching the model to real machines

Using the average distance, independent of topology, usually works for networks of up to n = 1024 nodes; the average distance and the maximum distance do not differ by much.

Potential Concerns

- Algorithmic concerns:
  - Does it support theory?
  - Is it too complex?
- Communication concerns:
  - How to exploit trivial communication such as local exchange?
  - What about topology dependencies?

Comparison with BSP

- Superstep length: a message is not usable until the next superstep.
- BSP assumes special hardware for synchronization.
- If the ratio of virtual to physical processors is large, context switching may be expensive.