Radix Sort and Hash-Join for Vector Computers
Ripal Nathuji
6.893: Advanced VLSI Computer Architecture
10/12/00

What is Radix Sorting?
Sort by the least significant digit first instead of the most significant digit.
Sorting least-significant-digit first is better because it avoids having to keep track of multiple independent sort jobs.

Properties of Radix Sorting Algorithms
Treat keys as multidigit numbers, where each digit is an integer in [0, m-1] and m is the radix.
The radix m is variable, and is chosen to minimize running time.
Example: treat a 32-bit key as a 4-digit number; each digit is then 8 bits, so m = 2^8 = 256.
Performance: runs in O(n) time, whereas comparison-based sorts such as quicksort run in O(n log n) time.
Note: not advantageous on machines with caches.

Serial Radix Sort
N = number of keys to sort; K = array of keys; D = array of r-bit digits.
Values of Bucket[] after each phase:
Histogram-Keys: Bucket[i] contains the number of digits having value i.
Scan-Buckets: Bucket[i] contains the number of digits with value < i.
Rank-And-Permute: each key whose digit has value i is placed in its final location by taking the offset from Bucket[i] and then incrementing the bucket.
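
A minimal sketch of one counting-sort pass over a single r-bit digit, following the three phases above; the function name, the fixed digit width R = 8, and the bit-shift digit extraction are illustrative assumptions rather than details from the slides.

    #include <stdlib.h>
    #include <string.h>

    #define R     8              /* bits per digit (illustrative choice)  */
    #define RADIX (1u << R)      /* number of bucket values = 2^r         */

    /* Sort keys K[0..N-1] into Out[0..N-1] by the digit starting at bit `shift`. */
    void counting_sort_pass(const unsigned *K, unsigned *Out, size_t N, int shift)
    {
        size_t Bucket[RADIX];
        memset(Bucket, 0, sizeof Bucket);

        /* Histogram-Keys: Bucket[i] = number of keys whose digit equals i */
        for (size_t k = 0; k < N; k++)
            Bucket[(K[k] >> shift) & (RADIX - 1)]++;

        /* Scan-Buckets: Bucket[i] = number of keys whose digit is < i */
        size_t sum = 0;
        for (size_t i = 0; i < RADIX; i++) {
            size_t count = Bucket[i];
            Bucket[i] = sum;
            sum += count;
        }

        /* Rank-And-Permute: place each key at its offset, then bump the bucket */
        for (size_t k = 0; k < N; k++) {
            size_t d = (K[k] >> shift) & (RADIX - 1);
            Out[Bucket[d]++] = K[k];
        }
    }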

How Can We Parallelize the Serial Radix Sort?
Problem: loop dependencies in all three phases.
Solution: use a separate set of buckets for each processor, with each processor handling N/P keys, where P is the number of processors.
This resolves the data dependencies but creates a new problem with Scan-Buckets: how can we rank the digits globally rather than only within the scope of each individual processor?

Fully Parallelizing Scan-Buckets
Instead of having each processor simply scan its own buckets, after Scan-Buckets we would like Bucket[i,j] to be the number of keys with digit value less than i on all processors, plus the number of keys with digit value i on processors 0 through j-1.
This sum can be calculated by flattening the matrix of per-processor buckets and executing Scan-Buckets over the flattened matrix.
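
A sketch of the flattened-matrix scan, assuming the per-processor histogram is stored digit-major (all processors' counts for digit 0, then digit 1, and so on) so that a single exclusive prefix sum yields the desired Bucket[i,j] values; the array names are illustrative.

    #include <stddef.h>

    /* Count is a (radix x P) matrix stored digit-major: Count[i*P + j] is the
     * number of keys with digit value i held by processor j.  After the scan,
     * Bucket[i*P + j] = (# keys with digit < i on all processors)
     *                 + (# keys with digit i on processors 0..j-1),
     * which is exactly the global offset processor j needs for digit i. */
    void scan_buckets_global(const size_t *Count, size_t *Bucket,
                             size_t radix, size_t P)
    {
        size_t sum = 0;
        for (size_t idx = 0; idx < radix * P; idx++) {
            Bucket[idx] = sum;
            sum += Count[idx];
        }
    }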

Techniques Used in the Data-Parallel Radix Sort
Virtual processors
Loop raking
Processor memory layout

Virtual Processors
Vector multiprocessors offer two levels of parallelism: multiprocessor facilities and vector facilities.
To take advantage of both, view each element of a vector register as a virtual processor. A machine with register length L and P processors then has L x P virtual processors, and the total number of keys can be divided into L x P sets.

Loop Raking
Operations on arrays are usually vectorized using strip mining, in which each element of a vector register handles every L-th element of the array.
Unfortunately, with strip mining each virtual processor handles a strided set of keys instead of the contiguous block required by the parallel algorithm.
Using a technique called loop raking, each virtual processor handles a contiguous block of keys; loop raking uses a constant stride of N/L to access elements.
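
An illustrative contrast between the two access patterns, written as plain scalar loops whose inner loop stands in for one vector operation of length L; the `process` helper and the assumption that N is a multiple of L are hypothetical simplifications.

    #include <stddef.h>

    #define L 64   /* vector register length (illustrative) */

    static void process(unsigned x) { (void)x; /* placeholder for per-key work */ }

    /* Strip mining: virtual processor i touches elements i, i+L, i+2L, ...
     * i.e. an interleaved, strided set of keys. */
    void strip_mined(const unsigned *K, size_t N)
    {
        for (size_t base = 0; base < N; base += L)
            for (size_t i = 0; i < L; i++)        /* one vector operation */
                process(K[base + i]);
    }

    /* Loop raking: virtual processor i owns the contiguous block
     * [i*N/L, (i+1)*N/L); each vector access uses a constant stride of N/L. */
    void loop_raked(const unsigned *K, size_t N)
    {
        size_t seg = N / L;                       /* block owned by each VP */
        for (size_t off = 0; off < seg; off++)
            for (size_t i = 0; i < L; i++)        /* one vector operation, stride N/L */
                process(K[i * seg + off]);
    }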

Processor Memory Layout
A memory location X is contained in bank (X mod B), where B is the number of banks, and simultaneous accesses to the same bank result in delay.
There are two possible ways to lay out the buckets in memory:
1. Place the buckets for each virtual processor in contiguous memory locations. This approach can cause multiple virtual processors to access the same bank simultaneously.
2. Place the buckets so that the buckets used by each virtual processor are in separate memory banks, i.e. place all the buckets for a given digit value from all virtual processors in contiguous memory locations. This approach keeps multiple virtual processors from accessing the same bank simultaneously.
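
A sketch of the indexing arithmetic behind the two layouts; the function names and the fixed digit width are illustrative, and bank conflicts are assumed to follow the (address mod B) interleaving described above.

    #include <stddef.h>

    enum { R = 8, RADIX = 1 << R };   /* illustrative digit width */

    /* Layout 1: each virtual processor's RADIX buckets are contiguous.
     * When all VPs increment their bucket for the same digit value, the
     * addresses differ by multiples of RADIX and can land in the same bank. */
    size_t index_vp_major(size_t vp, size_t digit)
    {
        return vp * RADIX + digit;
    }

    /* Layout 2: all VPs' buckets for a given digit value are contiguous.
     * Simultaneous accesses for the same digit then hit consecutive
     * addresses, which fall in different memory banks. */
    size_t index_digit_major(size_t vp, size_t digit, size_t num_vp)
    {
        return digit * num_vp + vp;
    }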

Processor Memory Layout: Example

Implementation of Radix Sort on an 8-Processor CRAY Y-MP
Four routines:
1. Extract-Digit: extracts the current digit from each key and computes an index into the array of buckets. Uses loop raking. Time: T_Extract-Digit = 1.2 · N/P.
2. Histogram-Keys: uses loop raking. Two steps: T_Clear-Buckets = 2^r · L and T_Histogram-Keys = 2.4 · N/P.
3. Scan-Buckets: uses loop raking. Time: T_Scan-Buckets = 2^r · L · P / P = 2^r · L.

Implementation of Radix Sort on an 8-Processor CRAY Y-MP (continued)
4. Permute-Keys: uses loop raking. The time to permute a vector ranges from 1.3 to 5.5 cycles per element. Time: T_Rank-And-Permute = 3.5 · N/P.

Performance Analysis
Total sorting times:
T_Counting-Sort = L · 2^r · T_bucket + (N/P) · T_key
T_Radix-Sort = (b/r) · (L · 2^r · T_bucket + (N/P) · T_key), where b is the number of bits per key.
Choice of radix: the optimal value of r increases with the number of elements per processor. Choosing r below the optimal value puts too much work on the keys; choosing r above it puts too much work on the buckets.
The value of r and an approximation of the total sort time follow from minimizing T_Radix-Sort with respect to r.
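
A small sketch that evaluates the cost model above for each candidate radix and keeps the cheapest; the numeric values of N, P, L, b, T_bucket, and T_key are illustrative placeholders, not measurements from the slides.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double N = 1e7, P = 8, Lvec = 64;     /* keys, processors, vector length  */
        double b = 64;                        /* bits per key                     */
        double T_bucket = 1.0, T_key = 7.0;   /* per-bucket / per-key cycle costs */

        double best_time = INFINITY;
        int best_r = 0;
        for (int r = 1; r <= 24; r++) {
            /* T_Radix-Sort = (b/r) * (L * 2^r * T_bucket + (N/P) * T_key) */
            double passes = ceil(b / r);
            double t = passes * (Lvec * ldexp(1.0, r) * T_bucket + (N / P) * T_key);
            if (t < best_time) { best_time = t; best_r = r; }
        }
        printf("optimal r = %d, predicted time = %.3g cycles\n", best_r, best_time);
        return 0;
    }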

Choosing a Value for r

Predicted vs. Measured Performance

Other Factors of Performance
Vector length
Multiple processors
Space

Varying Vector Length
Decreasing the vector length decreases the number of virtual processors.
Advantage: decreases the time for clearing and scanning buckets.
Disadvantage: increases the cost per element of performing the histogram, T_key.
Conclusion: reducing the vector length is only beneficial when N/P < 9000.

Change in Performance with the Number of Processors
If N/P is held constant, speedup is linear with increasing P.
If N is fixed, speedup is not linear with increasing P, due to changes in the optimal r.

Memory Issues
Memory needed for radix sort: a temporary array of size N to hold the extracted digits, plus an array of size N as the destination of the permute, plus an array of size L · 2^r · P for the buckets, roughly 2.5N in total.
Possible ways to conserve memory:
Extract each digit as needed instead of using a temporary array
Lower the radix (the 2^r term)
Reduce the vector length L

Conclusions on Vectorized Radix Sort
Radix sort can be vectorized using three major techniques:
1. Virtual processors
2. Loop raking
3. Efficient memory allocation
Overall performance can be optimized by adjusting:
1. The radix r
2. The vector length L
3. The number of processors
4. Memory considerations

Introduction to Hash-Join
The join is one of the most frequently executed, time-consuming, and data-intensive operations performed in databases.
Idea: vectorize the computational aspects of the hash and join phases.

Equijoin

Naive Approach
Compare every tuple of one relation against every tuple of the other. This approach is too expensive: it runs in O(n^2) time.
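
A minimal sketch of the naive nested-loop equijoin the slide dismisses; the tuple layout and the `emit` callback are illustrative assumptions.

    #include <stddef.h>

    typedef struct { int key; int payload; } Tuple;

    /* Compare every tuple of R against every tuple of S: O(|R| * |S|) work. */
    size_t nested_loop_join(const Tuple *R, size_t nR,
                            const Tuple *S, size_t nS,
                            void (*emit)(const Tuple *, const Tuple *))
    {
        size_t matches = 0;
        for (size_t i = 0; i < nR; i++)
            for (size_t j = 0; j < nS; j++)
                if (R[i].key == S[j].key) {   /* keys match: join the tuples */
                    emit(&R[i], &S[j]);
                    matches++;
                }
        return matches;
    }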

Reduction of Loads by Hashing
By hashing the tuples of each relation into buckets, we go from having to compare the entire area to comparing only the areas in which keys hash to the same bucket (the shaded areas in the figure).

Grace Hash-Join
Two phases:
1. The relations are hashed into buckets so that each bucket is small enough to fit into main memory.
2. A bucket from one relation is brought into memory and hashed. Then every key of the second relation is hashed and compared to every key of the first relation that hashed to the same bucket.

Phases of the Sequential Hash
Extract-Buckets and Histogram phase
Scan phase
Rank-and-Permute phase

Extract-Buckets and Histogram Phase
The hash function used is key mod (number of buckets).
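
A sketch of this phase under the stated hash function; the array names are illustrative, and keys are assumed to be nonnegative so that the C modulo maps cleanly onto bucket numbers.

    #include <stddef.h>
    #include <string.h>

    /* For each tuple key, record which bucket it hashes to and count the
     * population of every bucket (the histogram). */
    void extract_buckets_and_histogram(const int *keys, size_t n,
                                       size_t *bucket_of, size_t *count,
                                       size_t num_buckets)
    {
        memset(count, 0, num_buckets * sizeof *count);
        for (size_t i = 0; i < n; i++) {
            size_t b = (size_t)keys[i] % num_buckets;  /* hash = key mod #buckets */
            bucket_of[i] = b;
            count[b]++;
        }
    }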

Scan Phase

Rank and Permute Phase
After this phase, the result and bucket arrays form a hash table.

Sequential Join Algorithm
The disk bucket R_i is brought into memory.
Each record of S_i is hashed and compared to every record in R_i.
Any matches that are found are concatenated and written to the final output file.
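
A sketch of this probe step, assuming R_i has already been hashed into an in-memory chained table; the record layout, chain structure, and output format are illustrative assumptions.

    #include <stddef.h>
    #include <stdio.h>

    typedef struct Rec { int key; int payload; struct Rec *next; } Rec;

    /* table[h] is the chain of R_i records whose (nonnegative) keys hash to h.
     * Each record of S_i is hashed, compared against that chain, and any
     * matches are concatenated and written to the output file. */
    void probe_and_join(Rec * const *table, size_t num_buckets,
                        const Rec *S, size_t nS, FILE *out)
    {
        for (size_t j = 0; j < nS; j++) {
            size_t h = (size_t)S[j].key % num_buckets;
            for (const Rec *r = table[h]; r != NULL; r = r->next)
                if (r->key == S[j].key)
                    fprintf(out, "%d %d %d\n", r->key, r->payload, S[j].payload);
        }
    }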

Vectorized Algorithm
Two techniques are used:
1. Virtual processors
2. Loop raking

Join
The mask vector is generated by a scalar-vector comparison.
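
A sketch of what the mask generation and the subsequent compress step compute, written as scalar loops standing in for the vector compare and compress instructions; the function names are illustrative.

    #include <stddef.h>

    /* Compare one scalar probe key against a vector of candidate keys,
     * producing a 0/1 mask (what a vector compare against a scalar yields). */
    void make_mask(int probe_key, const int *candidates, size_t L,
                   unsigned char *mask)
    {
        for (size_t i = 0; i < L; i++)
            mask[i] = (candidates[i] == probe_key);
    }

    /* Compress: gather the positions where the mask is set, so only the
     * matching candidates are carried forward to the join output. */
    size_t compress_indices(const unsigned char *mask, size_t L, size_t *out_idx)
    {
        size_t n = 0;
        for (size_t i = 0; i < L; i++)
            if (mask[i])
                out_idx[n++] = i;
        return n;
    }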

Problems that Occurred
The compiler had trouble vectorizing certain parts of the code:
Getting the compiler to vectorize certain loops required extra effort.
The compiler would not vectorize the compress operation in the join phase.

Results (using the CRAY C90)

Results: scalar vs. vector timings

Results