© Culler 1997 CS267 L28 Sort.1 CS 267 Applications of Parallel Computers Lecture 28: LogP and the Implementation and Modeling of Parallel Sorts James Demmel (taken from David Culler, Lecture 18, CS267, 1997)

© Culler 1997 CS267 L28 Sort.2 Practical Performance Target (circa 1992): sort one billion large keys in one minute on one thousand processors. A good sort on a workstation can do 1 million keys in about 10 seconds – just fits in memory – 16-bit radix sort. Performance unit: µs per key per processor – ~10 µs for a single Sparc 2. (The target works out to 60 s × 1000 procs / 10^9 keys = 60 µs per key per processor, so each processor has a budget of roughly six times the serial cost per key.)

© Culler 1997 CS267 L28 Sort.3 Studies on Parallel Sorting. [Diagram: three settings in which sorting has been studied – sorting networks and PRAM sorts (processors sharing one memory), sorting on a concrete network Y or machine X (processor–memory nodes connected by a network), and LogP sorts in between.]

© Culler 1997 CS267 L28 Sort.4 The Study: take interesting parallel sorting algorithms (Bitonic, Column, Histo-radix, Sample); analyze them under LogP with parameters for the CM-5; estimate execution times; implement in Split-C; execute on the CM-5; compare predicted and measured times.

© Culler 1997 CS267 L28 Sort.5 LogP

© Culler 1997 CS267 L28 Sort.6 Deriving the LogP Model
° Processing: powerful microprocessor, large DRAM, cache => P
° Communication:
+ significant latency (100's of cycles) => L
+ limited bandwidth (1 – 5% of memory BW) => g
+ significant overhead (10's – 100's of cycles, on both ends) => o
+ limited capacity
– no consensus on topology => should not exploit structure
– no consensus on programming model => should not enforce one

© Culler 1997 CS267 L28 Sort.7 LogP. [Diagram: P processor–memory modules connected by an interconnection network; the network carries a limited volume of data – at most L/g messages to or from any one processor.]
L atency in sending a (small) message between modules
o verhead felt by the processor on sending or receiving a message
g ap between successive sends or receives (1/BW)
P rocessors
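
As a minimal sketch (my own struct, not from the slides), the four parameters can be carried around in code like this:

    /* LogP machine parameters; times in microseconds. */
    typedef struct {
        double L;   /* latency of a small message between modules   */
        double o;   /* overhead paid by the processor per send/recv */
        double g;   /* gap between successive sends or receives     */
        int    P;   /* number of processor/memory modules           */
    } logp_t;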

© Culler 1997 CS267 L28 Sort.8 Using the Model
° Send n messages from proc to proc in time 2o + L + g(n−1)
– each processor spends o·n cycles on overhead
– and has (g−o)(n−1) + L compute cycles available
° Send n messages from one to many: same time
° Send n messages from many to one: same time
– but all but L/g senders block, so fewer compute cycles are available
[Timeline diagram: sends separated by gap g, each costing overhead o, the last message arriving L after its send.]
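
As a sketch, assuming the hypothetical logp_t above, the point-to-point formula translates directly:

    /* Estimated time for n pipelined small messages between two
       processors: sender overhead, gap-limited injection, latency,
       and receiver overhead on the last message. */
    double logp_send_n(logp_t m, int n) {
        return 2.0 * m.o + m.L + m.g * (n - 1);
    }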

© Culler 1997 CS267 L28 Sort.9 Use of the Model (cont)
° Two processors sending n words to each other (i.e., an exchange) take time 2o + L + max(g, 2o)·(n−1) ≈ n·max(g, 2o) + L
° P processors each sending n words to all processors (n/P to each) in a static, balanced pattern without conflicts (e.g., transpose, FFT, cyclic-to-block, block-to-cyclic): same formula
Same exercise: what's wrong with the formula above? It assumes an optimal pattern of sends and receives, so it can underestimate the time.
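
Again as a sketch under the same assumptions: in an exchange each word costs the larger of the gap and the two overheads, since sends and receives interleave on each processor.

    /* Estimated time for two processors to exchange n words each. */
    double logp_exchange(logp_t m, int n) {
        double per_word = (m.g > 2.0 * m.o) ? m.g : 2.0 * m.o;
        return 2.0 * m.o + m.L + per_word * (n - 1);
    }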

© Culler 1997 CS267 L28 Sort.10 LogP "philosophy". Think about: the mapping of N words onto P processors; computation within a processor, its cost, and balance; communication between processors, its cost, and balance – given a characterization of processor and network performance. Do not think about what happens within the network. This should be good enough!

© Culler 1997 CS267 L28 Sort.11 A typical sort exploits the n = N/P grouping: ° significant local computation ° very general global communication / transformation ° computation of the transformation itself

© Culler 1997 CS267 L28 Sort.12 Split-C. [Diagram: a global address space spanning processors P0 … Pprocs−1, each owning a local region.] Explicitly parallel C. 2D global address space – with a linear ordering on the local spaces. Local and global pointers – and spread arrays too. Read/write and get/put (overlap compute and communication): x := G; ... sync(); Signaling store (one-way): G :– x; ... store_sync(); or all_store_sync(); Bulk transfer. Global communication.

© Culler 1997 CS267 L28 Sort.13 Basic costs of operations in Split-C:
Read, write    x = *G, *G = x    2(L + 2o)
Store          *G :– x           L + 2o
Get            x := *G           o to issue ... data back after 2L + 2o; sync() costs o; repeatable with interval g
Bulk store     (n words, several words per message)    2o + (n−1)g + L
Exchange       (n words each way)                      2o + 2L + (n − L/g)·max(g, 2o)
One to many and many to one follow similar patterns.
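
For completeness, the bulk-store row in the same sketch style (hypothetical helper, same logp_t):

    /* Estimated time to store n words as pipelined messages. */
    double splitc_bulk_store(logp_t m, int n) {
        return 2.0 * m.o + (n - 1) * m.g + m.L;
    }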

© Culler 1997 CS267 L28 Sort.14 LogP model parameters. CM-5: L = 6 µs, o = 2.2 µs, g = 4 µs, P varies from 32 to 1024. NOW: L = 8.9 µs, o = 3.8 µs, g = 12.8 µs, P varies up to 100. What is the processor performance?
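
Plugging the CM-5 numbers into the sketch gives concrete estimates to check the formulas against:

    logp_t cm5 = { .L = 6.0, .o = 2.2, .g = 4.0, .P = 512 };
    /* logp_exchange(cm5, 1000): max(g, 2o) = 4.4 µs per word, so
       2(2.2) + 6 + 4.4 * 999 ≈ 4406 µs for a 1000-word exchange. */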

© Culler 1997 CS267 L28 Sort.15 Sorting

© Culler 1997 CS267 L28 Sort.16 Local sort performance (11-bit radix sort of 32-bit numbers). [Plot: µs per key vs. lg N/P, one curve per level of entropy in the key values.] Entropy = −Σ_i p_i lg p_i, where p_i = probability of key i.
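
A sketch of the entropy measure in code (my own helper, not from the study):

    #include <math.h>

    /* Shannon entropy (bits) of a key distribution, given the
       probability p[i] of each of k distinct key values. */
    double key_entropy(const double *p, int k) {
        double h = 0.0;
        for (int i = 0; i < k; i++)
            if (p[i] > 0.0)
                h -= p[i] * log2(p[i]);
        return h;
    }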

© Culler 1997 CS267 L28 Sort.17 Local computation parameters – empirical:

Sort              Parameter   Operation                          µs per key
Bitonic           swap        simulate cycle butterfly per key   0.025 lg N
                  mergesort   sort a bitonic sequence            1.0
Bitonic & Column  scatter     move key for cyclic-to-block       0.46
                  gather      move key for block-to-cyclic       0.52 if n <= 64K or P <= 64, 1.1 otherwise
                  local sort  local radix sort (11 bit)          4.5 if n < 64K (281000/n)
Column            merge       merge sorted lists                 1.5
                  copy        shift key                          0.5
Radix             zero        clear histogram bin                0.2
                  hist        produce histogram                  1.2
                  add         produce scan value                 1.0
                  bsum        adjust scan of bins                2.5
                  address     determine destination              4.7
Sample            compare     compare key to splitter            0.9
                  localsort8  local radix sort of samples        5.0

© Culler 1997 CS267 L28 Sort.18 Bottom Line (Preview). [Plot: µs per key vs. N/P for Bitonic, Column, Radix, and Sample sorts at P = 32 and P = 1024.] Good fit between predicted and measured times (within 10%). Different sorts for different sorts – the winner changes with processor count, input size, and sensitivity to key distribution. All are global/local hybrids – and the local part is hard to implement and model.

© Culler 1997 CS267 L28 Sort.19 Odd-Even Merge – a classic parallel sort. Treat the N values as two lists of M = N/2: A_0 … A_{M−1} and B_0 … B_{M−1}. Sort each separately. Redistribute into even and odd sublists: A_0 A_2 … A_{M−2} with B_0 B_2 … B_{M−2}, and A_1 A_3 … A_{M−1} with B_1 B_3 … B_{M−1}. Merge into two sorted lists E_0 E_1 … E_{M−1} and O_0 O_1 … O_{M−1}. Pairwise swaps of E_i and O_i then put the whole list in order.
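
A sequential sketch of the recursion the slide describes (Batcher's odd-even merge; my own code, assuming the length n is a power of two and the two halves of a[lo..lo+n) are already sorted; call with r = 1):

    /* Recursively merge the even- and odd-indexed subsequences
       (stride r), then fix up neighbors with pairwise swaps. */
    void oddeven_merge(int *a, int lo, int n, int r) {
        int step = 2 * r;
        if (step < n) {
            oddeven_merge(a, lo, n, step);      /* even subsequence */
            oddeven_merge(a, lo + r, n, step);  /* odd subsequence  */
            for (int i = lo + r; i + r < lo + n; i += step)
                if (a[i] > a[i + r]) {
                    int t = a[i]; a[i] = a[i + r]; a[i + r] = t;
                }
        } else if (a[lo] > a[lo + r]) {
            int t = a[lo]; a[lo] = a[lo + r]; a[lo + r] = t;
        }
    }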

© Culler 1997 CS267 L28 Sort.20 Where's the Parallelism? [Diagram: the merge unfolds from one 1×N merge into two N/2 merges, then four N/4 merges, and so on, ending with the pairwise swaps of E_i and O_i – each level's merges run in parallel.]

© Culler 1997 CS267 L28 Sort.21 Mapping to a Butterfly (or Hypercube). [Diagram: two sorted sublists A0 A1 A2 A3 and B0 B1 B2 B3 on a butterfly; the order of one list is reversed via the cross edges, then pairwise swaps on the way back complete the merge.]

© Culler 1997 CS267 L28 Sort.22 Bitonic Sort with N/P keys per node:

    all_bitonic(int A[PROCS]::[n])
      sort(tolocal(&A[ME][0]), n, 0)        /* local sort             */
      for (d = 1; d <= logProcs; d++)       /* each merge stage       */
        for (i = d-1; i >= 0; i--) {        /* each butterfly step    */
          swap(A, T, n, pair(i));           /* exchange with partner  */
          merge(A, T, n, mode(d, i));       /* local bitonic merge    */
        }
      sort(tolocal(&A[ME][0]), n, mask(i));

A bitonic sequence decreases and then increases (or vice versa); bitonic sequences can be merged like monotonic sequences.

© Culler 1997 CS267 L28 Sort.23 Bitonic Sort layout: the first lg N/P stages are a local sort in block layout; the remaining stages involve a block-to-cyclic remap, local merges (the i − lg N/P columns), a cyclic-to-block remap, and local merges (the lg N/P columns within a stage).

© Culler 1997 CS267 L28 Sort.24 Analysis of Bitonic: how do you do the transpose? Reading exercise.

© Culler 1997 CS267 L28 Sort.25 Bitonic Sort: time per key. [Plots: predicted and measured µs per key vs. N/P.]

© Culler 1997 CS267 L28 Sort.26 Bitonic: breakdown (P = 512, random keys).

© Culler 1997 CS267 L28 Sort.27 Bitonic: effect of key distributions (P = 64, N/P = 1M).

© Culler 1997 CS267 L28 Sort.28 Column Sort: treat the data as an n × P array with n >= P^2, i.e., N >= P^3. Steps: (1) sort, (2) transpose block-to-cyclic, (3) sort, (4) transpose cyclic-to-block without scatter, (5) sort, (6) shift, (7) merge, (8) unshift. Work efficient.

© Culler 1997 CS267 L28 Sort.29 Column Sort: times. [Plots: predicted and measured µs per key vs. N/P.] Only works for N >= P^3.

© Culler 1997 CS267 L28 Sort.30 Column: breakdown (P = 64, random keys).

© Culler 1997 CS267 L28 Sort.31 Column: key distributions (P = 64, N/P = 1M). [Plot: µs per key vs. entropy in bits, broken into merges, sorts, remaps, and shifts.]

© Culler 1997 CS267 L28 Sort.32 Histo-radix sort: n = N/P keys per processor, digits of r bits, 2^r buckets. Per pass: 1. compute the local histogram; 2. compute the position of the first member of each bucket in the global array – 2^r scans with end-around; 3. distribute all the keys. Only r = 8, 11, 16 make sense for sorting 32-bit numbers.

© Culler 1997 CS267 L28 Sort.33 Histo-Radix Sort (again). [Diagram: P processors, each with local data and a local histogram.] Each pass: form local histograms, form the global histogram, globally distribute the data.
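
A single-processor sketch of one pass (my own code; in the parallel version step 2 becomes 2^r global scans and step 3 becomes remote writes):

    #include <stdlib.h>

    /* One pass of an r-bit radix sort over the digit at `shift`:
       histogram, prefix-scan, then distribute keys into dst. */
    void radix_pass(const unsigned *src, unsigned *dst, int n, int shift, int r) {
        int buckets = 1 << r;
        int *count = calloc(buckets, sizeof *count);
        for (int i = 0; i < n; i++)                   /* 1. local histogram */
            count[(src[i] >> shift) & (buckets - 1)]++;
        for (int b = 0, pos = 0; b < buckets; b++) {  /* 2. scan -> bucket starts */
            int c = count[b]; count[b] = pos; pos += c;
        }
        for (int i = 0; i < n; i++)                   /* 3. distribute keys */
            dst[count[(src[i] >> shift) & (buckets - 1)]++] = src[i];
        free(count);
    }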

© Culler 1997 CS267 L28 Sort.34 Radix Sort: times. [Plots: predicted and measured µs per key vs. N/P.]

© Culler 1997 CS267 L28 Sort.35 Radix: breakdown.

© Culler 1997 CS267 L28 Sort.36 Radix: key distribution. Slowdown is due to contention in the redistribution.

© Culler 1997 CS267 L28 Sort.37 Radix: Stream Broadcast Problem. [Diagram: one processor streams n messages past P−1 others.] Is the time (P−1)(2o + L + (n−1)g)? Need to slow the first processor down for the stream to pipeline well.

© Culler 1997 CS267 L28 Sort.38 What's the right communication mechanism? Permutation via writes – consistency model? false sharing? Reads? Bulk transfers – what do you need to change in the algorithm? Network scheduling?

© Culler 1997 CS267 L28 Sort.39 Sample Sort: 1. compute P−1 splitter keys that divide the input into roughly equal pieces – take S ~ 64 samples per processor, sort the P·S sampled keys, take keys S, 2S, …, (P−1)S, and broadcast these splitters; 2. distribute keys based on the splitters; 3. local sort; [4.] possibly reshift to rebalance.
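
A sketch of step 1 once the samples have been gathered (my own helper names, not from the study):

    #include <stdlib.h>

    static int cmp_uint(const void *a, const void *b) {
        unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
        return (x > y) - (x < y);
    }

    /* Sort the P*S gathered samples and pick every S-th key,
       producing the P-1 splitters the slide calls for. */
    void choose_splitters(unsigned *samples, int P, int S, unsigned *splitters) {
        qsort(samples, (size_t)P * S, sizeof *samples, cmp_uint);
        for (int i = 1; i < P; i++)
            splitters[i - 1] = samples[i * S];
    }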

© Culler 1997 CS267 L28 Sort.40 Sample Sort: times. [Plots: predicted and measured µs per key vs. N/P.]

© Culler 1997 CS267 L28 Sort.41 Sample breakdown. [Plot: µs per key vs. N/P for the Split, Sort, and Dist phases and their modeled counterparts Split-m, Sort-m, Dist-m.]

© Culler 1997 CS267 L28 Sort.42 Comparison. [Plot: µs per key vs. N/P for Bitonic, Column, Radix, and Sample sorts at P = 32 and P = 1024.] Good fit between predicted and measured times (within 10%). Different sorts for different sorts – scaling by processor count, input size, and sensitivity. All are global/local hybrids – the local part is hard to implement and model.

© Culler 1997 CS267 L28 Sort.43 Conclusions. The distributed-memory model leads to hybrid global/local algorithms. The LogP model is good enough for the global part – bandwidth (g) or overhead (o) matter most, including end-point contention; latency (L) only matters when bandwidth doesn't; g is going to be what really matters in the days ahead (NOW). Local computational performance is hard! – dominated by effects of the storage hierarchy (TLBs); getting trickier with multiple levels (the physical address determines L2 cache behavior); and with real computers at the nodes (VM); and with variations in the model (cycle time, caches, ...). See also the disk-to-disk parallel sorting work.