Cache effective mergesort and quicksort Nir Zepkowitz Based on: “Improving Memory Performance of Sorting Algorithms” by Li Xiao, Xiaodong Zhang, Stefan A. Kubricht


2 The Goal Optimize the performance of mergesort and quicksort. We do this by restructuring them.

3 What we saw already The algorithms we have seen so far tried to reduce capacity misses on direct-mapped caches. The new algorithms try to reduce other types of cache misses as well, such as conflict misses and TLB misses.

4 What do we use For the best optimization, the algorithms combine tiling and padding techniques, data-set repartitioning, and knowledge of the processor hardware (such as cache and TLB associativity).

5 Parameter usage We work with a generic unit element to specify cache capacity. N: the size of the data set. C: the data cache size. L: the size of a cache line. K: the cache associativity. T_S: the number of entries in a TLB set. K_TLB: the TLB associativity. P_S: the size of a memory page.

6 Old mergesort – Tiled mergesort First phase: subarrays of length C/2 (half the cache size) are sorted by the base mergesort. Second phase: the base mergesort completes the sorting of the entire data set. The first phase lets the algorithm avoid capacity misses and fully use the data loaded into the cache.
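The two phases above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the tile size is an assumed constant standing in for C/2, and Python's built-in sort stands in for the base mergesort inside each tile.

```python
# Assumed tile size; in practice this would be C/2, half the data-cache
# capacity measured in elements.
TILE = 4

def merge(a, lo, mid, hi, buf):
    # Standard merge of the sorted runs a[lo:mid] and a[mid:hi],
    # using buf as scratch space.
    buf[lo:hi] = a[lo:hi]
    i, j = lo, mid
    for k in range(lo, hi):
        if i < mid and (j >= hi or buf[i] <= buf[j]):
            a[k] = buf[i]; i += 1
        else:
            a[k] = buf[j]; j += 1

def tiled_mergesort(a):
    n = len(a)
    buf = [0] * n
    # Phase 1: sort each cache-sized tile independently, so the tile's
    # data stays resident and capacity misses are avoided.
    for lo in range(0, n, TILE):
        a[lo:lo + TILE] = sorted(a[lo:lo + TILE])  # base mergesort stand-in
    # Phase 2: bottom-up merge passes over the whole data set.
    width = TILE
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            merge(a, lo, mid, hi, buf)
        width *= 2
    return a
```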

7 Old mergesort - multimergesort The first phase: as in tiled mergesort. The second phase: a multiway merge merges all the sorted subarrays together in a single pass. We do that by holding the heads of the lists (the sorted subarrays) being merged.
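The single-pass second phase can be sketched with a heap over the current head of each sorted run (the function name and the use of Python's `heapq` are illustrative choices, not the paper's code):

```python
import heapq

def multiway_merge(runs):
    """Merge any number of sorted runs in one pass.

    The heap holds one entry per run: (current head value, run index,
    position within the run)."""
    heap = [(run[0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    out = []
    while heap:
        val, i, j = heapq.heappop(heap)
        out.append(val)
        if j + 1 < len(runs[i]):
            # Advance this run's head and push it back into the heap.
            heapq.heappush(heap, (runs[i][j + 1], i, j + 1))
    return out
```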

8 Areas for improvement The algorithms significantly reduce capacity misses but do not sufficiently reduce conflict misses. In caches with low associativity, mapping conflicts occur frequently among the elements of the three subarrays (the target and the two sources). Reducing TLB misses is not considered, yet TLB misses can severely hurt execution performance.

9 Tiled mergesort - the problem In the second phase of tiled mergesort, pairs of sorted subarrays are merged into a destination array. At any given time we hold three elements: two source elements and one target element. These three data elements can fall into conflicting cache blocks, because they may map to the same block in a direct-mapped or 2-way set-associative cache.

10 Tiled mergesort with padding We insert L elements (a spacing of one cache line) to separate every section of C elements in the data set during the second phase of tiled mergesort. These padding elements can significantly reduce cache conflicts, and the extra memory is trivial compared to the size of the data set.
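A toy model of a direct-mapped cache shows why one cache line of padding is enough. The sizes C and L below are assumptions chosen for illustration:

```python
C = 1024  # assumed cache size, in elements
L = 16    # assumed cache-line size, in elements

def cache_line_index(addr):
    # In a direct-mapped cache, an element address maps to one of the
    # C/L cache lines by simple modular arithmetic.
    return (addr // L) % (C // L)

# Without padding: the same offset in sections spaced C apart (e.g. the
# two sources and the target of a merge) maps to the same cache line.
assert cache_line_index(0) == cache_line_index(C) == cache_line_index(2 * C)

# With L padding elements after each C-element section, section k starts
# at k * (C + L), and the same offsets now land in different lines.
starts = [0, C + L, 2 * (C + L)]
assert len({cache_line_index(s) for s in starts}) == 3
```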

11 Multimergesort – The problem In the second phase of multimergesort, the multiple subarrays are completely sorted in a single pass. This is done by using a heap structure over the heads of the subarrays. However, the working set is much larger than that of the base mergesort, and this large working set causes TLB misses that degrade performance.

12 TLB - reminder The TLB is a special cache that stores the most recently used virtual-to-physical page translations for memory accesses. A TLB miss forces the system to retrieve the missing translation from the page table in memory, and then to replace an existing TLB entry with it.

13 Multimergesort with TLB padding In the second phase of multimergesort, we insert P_S elements (one page of space) to separate every sorted subarray in the data set, in order to reduce or eliminate TLB conflict misses. The padding shifts the base addresses of these lists in page units to avoid potential TLB conflicts.
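The same toy-model argument as for cache padding applies at page granularity. The page size and TLB geometry below are assumptions for illustration only:

```python
PS = 4096       # assumed page size, in elements
TLB_SETS = 16   # assumed number of TLB sets

def tlb_set(addr):
    # A set-indexed TLB picks the set from the virtual page number.
    return (addr // PS) % TLB_SETS

STRIDE = TLB_SETS * PS  # a subarray length that is a worst case

# Without padding: subarrays spaced STRIDE apart all start on pages
# that index the same TLB set, so their translations evict each other.
assert tlb_set(0) == tlb_set(STRIDE) == tlb_set(2 * STRIDE)

# With P_S elements of padding after each subarray, subarray k starts
# at k * (STRIDE + PS), shifting its base address by one page unit.
starts = [k * (STRIDE + PS) for k in range(3)]
assert len({tlb_set(s) for s in starts}) == 3
```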


16 Trade-offs The algorithm increases the instruction count, because the padding elements must be moved; this costs additional CPU cycles. But memory accesses are far more expensive than CPU cycles.

17 Measurement results Tiled mergesort with padding is highly effective in reducing conflict misses on machines with direct-mapped caches. Multimergesort with TLB padding performs very well on all types of architectures.

18 Old quicksort: memory-tuned quicksort A modification of the basic quicksort. Instead of saving small subarrays to sort at the end, memory-tuned quicksort sorts these subarrays when they are first encountered, in order to reuse the data elements while they are still in the cache.
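The idea can be sketched as a quicksort that insertion-sorts small subarrays immediately, while their elements are cache-resident. The cutoff value and the Hoare-style partition below are illustrative assumptions, not the paper's exact code:

```python
THRESHOLD = 16  # assumed cutoff; in practice tuned to the cache

def insertion_sort(a, lo, hi):
    # Sort a[lo..hi] in place (inclusive bounds).
    for i in range(lo + 1, hi + 1):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key

def memory_tuned_quicksort(a, lo=0, hi=None):
    if hi is None:
        hi = len(a) - 1
    if hi - lo + 1 <= THRESHOLD:
        # Sort the small subarray now, while its data is in cache,
        # instead of deferring it to a final pass.
        insertion_sort(a, lo, hi)
        return a
    # Hoare-style partition around a middle pivot.
    pivot = a[(lo + hi) // 2]
    i, j = lo, hi
    while i <= j:
        while a[i] < pivot: i += 1
        while a[j] > pivot: j -= 1
        if i <= j:
            a[i], a[j] = a[j], a[i]
            i += 1; j -= 1
    memory_tuned_quicksort(a, lo, j)
    memory_tuned_quicksort(a, i, hi)
    return a
```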

19 Old quicksort - Multiquicksort Divides the full data set into multiple subarrays, with the hope that each subarray will be smaller than the cache capacity. The performance gain of these two algorithms reported in experiments is modest.

20 The challenge In practice, the quicksort algorithms exploit cache locality well on balanced data sets, but they are not efficient on unbalanced data sets. The challenge is to make quicksort perform well on unbalanced data sets.

21 Flash quicksort A combination of flashsort and quicksort.

22 Flashsort The maximum and minimum values are first found in the data set to determine the data range. The range is then evenly divided into classes to form subarrays.

23 Flashsort… Three steps: “classification” determines the size of each class; “permutation” moves each element into its class, using a single temporary variable to hold the displaced element; “straight insertion” sorts the elements within each class using insertion sort.

24 Flashsort (cont.) If the data set is balanced, the subarrays produced by the first two steps are of similar size and small enough to fit in the cache. However, if the data set is unbalanced, the generated subarrays are disproportionate in size, causing ineffective use of the cache and making flashsort as slow as insertion sort in the worst case.

25 The good and the bad In comparison with the pivoting process of quicksort, the classification step of flashsort is more likely to generate balanced subarrays, which favors better cache utilization. Quicksort, in turn, outperforms insertion sort on unbalanced data sets.

26 Flash quicksort Combining the advantages of flashsort and quicksort gives flash quicksort. The first two steps are as in flashsort (“classification” and “permutation”); the last step uses quicksort to sort the elements in each class.
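A hedged sketch of the combination: the class count `m` and helper names are assumptions, flashsort's in-place cycle-chasing permutation is simplified into a gather via lists, and Python's built-in sort stands in for the per-class quicksort:

```python
def flash_quicksort(a, m=None):
    n = len(a)
    if n <= 1:
        return list(a)
    m = m or max(2, n // 4)  # assumed number of classes
    amin, amax = min(a), max(a)
    if amin == amax:
        return list(a)

    # Classification: a value's class follows from its position in the
    # range [amin, amax]; integer arithmetic keeps it exact.
    def cls(x):
        return min(m - 1, (m * (x - amin)) // (amax - amin + 1))

    # "Permutation", simplified: gather each element into its class.
    # (The real algorithm does this in place, chasing cycles with a
    # single temporary variable.)
    classes = [[] for _ in range(m)]
    for x in a:
        classes[cls(x)].append(x)

    # Final step: sort within each class; classes are already in
    # increasing value order, so concatenation yields the result.
    out = []
    for bucket in classes:
        out.extend(sorted(bucket))
    return out
```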

27 Inplaced flash quicksort An improvement to flash quicksort; the only change is in the second phase. We use an additional array as a buffer to hold the permuted elements. Because a cache line usually holds more than one element, we try to reuse elements in the cache before they are replaced.

28 Measurement results On balanced data sets the performance of memory-tuned quicksort, flash quicksort, and inplaced flash quicksort is similar, with a small advantage to memory-tuned quicksort. On unbalanced data sets, flash quicksort and inplaced flash quicksort significantly outperform memory-tuned quicksort.


30 Conclusion We developed cache-effective algorithms for both mergesort and quicksort. The techniques of padding, partitioning, and buffering can also be used for other cache-directed optimizations.

31 Padding The danger of conflict misses exists whenever a program regularly accesses a large data set, particularly when the algorithm partitions the data set into sizes that are powers of 2. Padding is effective for this kind of program to eliminate or reduce conflict misses.

32 Partitioning When a program sequentially and repeatedly scans a large data set that cannot be stored in the cache in its entirety, the program suffers capacity misses. Partitioning the data set based on the cache size, to localize the memory used by each stage of execution, is effective for this kind of program.

33 Buffering The buffering technique reduces or eliminates conflict misses by using an additional buffer to temporarily hold data elements, for later reuse, that would otherwise be swapped out of the cache.

34 The End