Presentation transcript:

Given UPC algorithm – Cyclic Distribution
- The simple algorithm uses a cyclic distribution (UPC's default layout for shared arrays).
- This means the data a cell needs is not local unless the item's weight is a multiple of THREADS.
- It is worse still if CAPACITY+1 is not divisible by THREADS, which turns the ownership into a checkerboard pattern.
- NOTE: in this table T, capacity runs horizontally and items run vertically.
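
For concreteness, here is a minimal UPC sketch of this given cyclic version (the deck shows no code, so ITEMS, CAPACITY, w, v, and T are illustrative names, and a static THREADS compilation environment is assumed). With the default block size of 1, consecutive capacities sit on consecutive threads, so the read of T[i-1][c-w[i]] is local only when w[i] is a multiple of THREADS, and even T[i-1][c] changes threads when CAPACITY+1 is not divisible by THREADS:

```c
#include <upc.h>

#define ITEMS    64
#define CAPACITY 1023                       /* table has CAPACITY+1 columns */

shared int T[ITEMS + 1][CAPACITY + 1];      /* default layout: cyclic, blocksize 1 */
shared int w[ITEMS + 1], v[ITEMS + 1];      /* weights and values, 1-indexed */

int main(void) {
    int i, c;

    upc_forall (c = 0; c <= CAPACITY; c++; &T[0][c])
        T[0][c] = 0;                        /* row 0: no items, value 0 */
    upc_barrier;

    for (i = 1; i <= ITEMS; i++) {
        /* Each thread updates the cells of row i that it owns; both reads
         * hit row i-1, whose cells generally have a different affinity. */
        upc_forall (c = 0; c <= CAPACITY; c++; &T[i][c]) {
            int best = T[i - 1][c];                       /* skip item i */
            if (c >= w[i] && T[i - 1][c - w[i]] + v[i] > best)
                best = T[i - 1][c - w[i]] + v[i];         /* take item i */
            T[i][c] = best;
        }
        upc_barrier;        /* row i must be complete before row i+1 starts */
    }
    return 0;
}
```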

Possible algorithm – BlockCyclic Distribution
- A simple fix: a block-cyclic layout.
- Benefit: the previous item's cell at the same capacity now has the same affinity, so more of the computation is local.
- Local-only computation can be performed if the items are sorted by weight beforehand, so that processors generally only touch local data at first.
[Figure: the table colored by owning processor, 1–4.]
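
In UPC the fix is a layout qualifier on the declaration. A sketch, with BLOCK as an illustrative tuning constant:

```c
/* Block-cyclic: each thread owns BLOCK consecutive capacities before
 * ownership wraps to the next thread. If CAPACITY+1 is padded to a
 * multiple of BLOCK*THREADS, every row tiles the threads identically,
 * so T[i-1][c] always has the same affinity as T[i][c]. */
#define BLOCK 4
shared [BLOCK] int T[ITEMS + 1][CAPACITY + 1];
```

Only the remaining read, T[i-1][c - w[i]], still reaches left across processor boundaries, which is what the next two slides examine.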

Possible algorithm – BlockCyclic Distribution
- Looking at the communication for one processor:
- The algorithm communicates a lot of data for every item, with the amount depending on the item's weight.
- Data is exchanged with two other processors, in a communication pattern that is not known in advance.
[Figure: communication for one processor in the four-processor (1–4) layout.]
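
The weight dependence follows directly from the 0-1 knapsack recurrence (standard notation; the slide does not spell it out): each cell reads one cell directly above it and one cell w_i columns to its left in the previous row,

```latex
T[i][c] = \max\bigl(T[i-1][c],\; T[i-1][c - w_i] + v_i\bigr)
```

so the left read reaches across roughly w_i / BLOCK block boundaries: the heavier the item, the more cells come from other processors, and which one or two processors they come from changes from item to item.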

Possible algorithm – BlockCyclic Distribution
- A more detailed look at communication.
- Since communication is going to be the dominant cost, let's focus on a subset of processor 3's data and look at what it needs.
- Almost all of the data the processor requires lies horizontally, with very little required vertically.
[Figure: the data needed by a subset of processor 3's cells, in the 1–4 layout.]

New algorithm – Blocked Distribution
- A more detailed look at communication.
- Change the layout to fully blocked, which makes most of the needed data local.
- The only communication the subset now requires is the part coming from processor 2.
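
A sketch of the layout change (BAND is the number of capacities per thread; as before, a static THREADS environment is assumed and CAPACITY+1 is taken to be a multiple of THREADS):

```c
/* Fully blocked: thread t owns one contiguous band of capacities,
 * columns [t*BAND, (t+1)*BAND), in every row. Both reads from row i-1
 * are then local except where c - w[i] falls left of the band edge. */
#define COLS (CAPACITY + 1)
#define BAND (COLS / THREADS)
shared [BAND] int T[ITEMS + 1][COLS];
```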

New algorithm – Blocked Distribution
- BIG PROBLEM: the algorithm becomes serial. If each processor were to compute its entire data at once, processor 3 would need data from processor 2 before it could continue.
- IDEA: run over subsets of the data while sticking with the blocked distribution.

Pipeline algorithm – Blocked Distribution
- New pipelined algorithm: the processors run in parallel along a diagonal, with processor 1 starting the work and processor 4 finishing it.
- Full parallelism is achieved only once the pipeline is full.
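
A hedged sketch of one way to code the pipeline in UPC (the deck shows no code; the row-block size RBLK, the strict ready flags, and the spin-wait are illustrative choices). Each thread works through its capacity band RBLK rows at a time, and may start a row-block only after its left neighbor has finished the same block, so the threads sweep the table in a diagonal wavefront:

```c
#include <upc.h>

#define COLS (CAPACITY + 1)
#define BAND (COLS / THREADS)       /* capacities per thread; COLS padded */
#define RBLK 4                      /* rows per pipeline stage; tune this */

shared [BAND] int T[ITEMS + 1][COLS];
shared int w[ITEMS + 1], v[ITEMS + 1];

strict shared int ready[THREADS];   /* row-blocks completed, per thread */

int main(void) {
    int lo = MYTHREAD * BAND, hi = lo + BAND, blk, i, c;

    for (c = lo; c < hi; c++) T[0][c] = 0;       /* my band of row 0 */

    for (blk = 0; blk * RBLK < ITEMS; blk++) {
        /* Wait until the left neighbor has finished this row-block, so
         * every cross-boundary read T[i-1][c - w[i]] is already valid. */
        if (MYTHREAD > 0)
            while (ready[MYTHREAD - 1] <= blk) /* spin */ ;

        int first = blk * RBLK + 1;
        int last  = first + RBLK - 1 < ITEMS ? first + RBLK - 1 : ITEMS;
        for (i = first; i <= last; i++)
            for (c = lo; c < hi; c++) {
                int best = T[i - 1][c];
                if (c >= w[i] && T[i - 1][c - w[i]] + v[i] > best)
                    best = T[i - 1][c - w[i]] + v[i];
                T[i][c] = best;
            }

        upc_fence;                   /* make my cells visible ...        */
        ready[MYTHREAD] = blk + 1;   /* ... then signal the right side   */
    }
    return 0;
}
```

Thread 0 (processor 1 on the slide) never waits, so it starts immediately; the last thread drains the pipeline, and all threads are busy only while each has a row-block in flight.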

Pipeline algorithm – Other considerations
- Different layouts for different problem sizes: if the table isn't close to square, consider changing the layout so that the pipeline fills earlier. The optimal choice is a matter of tuning.
- Other optimizations:
  - If the items are sorted in decreasing weight order, the pipeline fills earlier (the top-left corner of the table is all zeros). Sorting costs O(n log n).
  - Most of the table is only ever accessed locally, so we can avoid keeping the entire table T shared: keep just the boundary row segments shared between processors and move them with upc_memget/upc_memput, as sketched below.
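
A hedged sketch of that last optimization (the slide names only the bulk-copy routines; Tloc, edge, and the helper functions are illustrative). Each thread keeps its band of T in ordinary private memory and shares only a small per-row exchange buffer:

```c
#include <upc.h>

#define COLS (CAPACITY + 1)
#define BAND (COLS / THREADS)

int Tloc[ITEMS + 1][BAND];                /* my band of T, private memory */
shared [BAND] int edge[THREADS * BAND];   /* one finished row band per thread */

/* After computing row i of my band, publish it for my right neighbor. */
void publish_row(int i) {
    upc_memput(&edge[MYTHREAD * BAND], Tloc[i], BAND * sizeof(int));
}

/* Fetch the left neighbor's band of the row it just published, before
 * starting the rows that read across the band boundary. The pipeline
 * synchronization (ready flags, as above) is still required on top. */
void fetch_left(int *leftbuf) {
    if (MYTHREAD > 0)
        upc_memget(leftbuf, &edge[(MYTHREAD - 1) * BAND],
                   BAND * sizeof(int));
}
```

With the pipeline flags enforcing the ordering, only these BAND-cell transfers ever cross threads; the rest of T never leaves private memory.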