Binary Merge-Sort


Binary Merge-Sort

Merge-Sort(A, i, j)
    if (i < j) then
        m = (i+j)/2              // Divide
        Merge-Sort(A, i, m)      // Conquer
        Merge-Sort(A, m+1, j)    // Conquer
        Merge(A, i, m, j)        // Combine

Merge is linear in the number of items to be merged.
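The pseudocode above can be sketched in Python as follows. This is a minimal in-memory version, assuming inclusive index bounds `i` and `j` and integer division for the midpoint; the body of `merge` is not given on the slide, so the one below is only one plausible way to write a linear-time merge.

```python
def merge(a, i, m, j):
    """Merge the sorted ranges a[i..m] and a[m+1..j] (inclusive) back into a."""
    left = a[i:m + 1]
    right = a[m + 1:j + 1]
    l = r = 0
    k = i
    # Repeatedly copy the smaller head element; <= keeps the sort stable.
    while l < len(left) and r < len(right):
        if left[l] <= right[r]:
            a[k] = left[l]
            l += 1
        else:
            a[k] = right[r]
            r += 1
        k += 1
    # Copy whatever remains in either half.
    while l < len(left):
        a[k] = left[l]
        l += 1
        k += 1
    while r < len(right):
        a[k] = right[r]
        r += 1
        k += 1


def merge_sort(a, i, j):
    """Sort a[i..j] (inclusive bounds) in place."""
    if i < j:
        m = (i + j) // 2          # Divide
        merge_sort(a, i, m)       # Conquer
        merge_sort(a, m + 1, j)   # Conquer
        merge(a, i, m, j)         # Combine
```

A call such as `merge_sort(data, 0, len(data) - 1)` sorts the whole list.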

Few key observations

Items = (short) strings = atomic...
On English Wikipedia, there are about $10^9$ tokens to sort.
Sorting takes $\Theta(n \log n)$ memory accesses (I/Os?).
If every access were a disk I/O: [5 ms] $\times\, n \log_2 n \approx 3$ years.
In practice it is faster. Why?
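The back-of-the-envelope estimate can be reproduced with a few lines of Python. The function name and the 365-day year are assumptions made for this sketch; the 5 ms access time and $n = 10^9$ come from the slide. With exactly these inputs the result lands on the order of a few years, the same order of magnitude as the slide's rough figure.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: a 365-day year


def naive_sort_years(n_items, access_seconds=5e-3):
    """Estimated years if each of the n*log2(n) accesses cost one disk I/O."""
    accesses = n_items * math.log2(n_items)
    return accesses * access_seconds / SECONDS_PER_YEAR


years = naive_sort_years(10**9)
print(f"about {years:.1f} years")
```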

Recursion depth: $\log_2 N$ levels.

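Implicit Caching…

The recursion tree has $\log_2 N$ levels, but subproblems of size $M$ fit in memory: the bottom of the tree yields $N/M$ runs, each sorted in internal memory (no I/Os).
Each of the remaining $\log_2 (N/M)$ merge levels takes 2 passes over the data (one read, one write) = $2\,(N/B)$ I/Os.
The I/O-cost for binary merge-sort is therefore $\approx 2\,(N/B) \log_2 (N/M)$.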
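As a rough illustration of the formula $2\,(N/B)\log_2(N/M)$, the sketch below counts the I/Os of the merge levels only. The function name is hypothetical, all sizes are measured in items, and rounding up to whole pages and whole levels is an assumption of this sketch rather than something the slide specifies.

```python
def binary_mergesort_merge_ios(n_items, mem_items, page_items):
    """Return (merge_levels, ios) for binary merge-sort, per 2*(N/B)*log2(N/M)."""
    runs = -(-n_items // mem_items)        # ceil(N/M) runs sorted in memory
    pages = -(-n_items // page_items)      # ceil(N/B) pages on disk
    # (runs - 1).bit_length() equals ceil(log2(runs)) for runs >= 1.
    merge_levels = (runs - 1).bit_length()
    # Each level reads and writes every page once.
    return merge_levels, 2 * pages * merge_levels


levels, ios = binary_mergesort_merge_ios(2**20, 2**10, 2**5)
print(levels, ios)
```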

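A key inefficiency

After a few steps, every run is longer than $B$!
We are using only 3 pages (one input page per run plus one output buffer), but memory contains $M/B$ pages $\approx 2^{30}/2^{15} = 2^{15}$.

[Figure: two sorted runs on disk are merged through page-sized buffers of $B$ items; the output buffer is flushed to the output run on disk.]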

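Multi-way Merge-Sort

Sort $N$ items with main memory $M$ and disk pages of size $B$:
Pass 1: produce $N/M$ sorted runs.
Pass i: merge $X = M/B - 1$ runs at a time ⇒ $\log_X (N/M)$ passes.

[Figure: main memory holds buffers of $B$ items each, one page per input run (run 1, run 2, ..., run $X$) plus one output page; runs are read from and written back to disk.]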
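The two phases can be simulated in memory with Python's `heapq.merge`, which lazily merges any number of sorted inputs. This is only a sketch: lists stand in for runs on disk, no real I/O or paging happens, and the function and parameter names (`multiway_merge_sort`, `mem_items`, `fan_in`) are made up for illustration. Here `fan_in` plays the role of $X$.

```python
import heapq


def multiway_merge_sort(items, mem_items, fan_in):
    """Return (sorted_items, merge_passes) using run formation + X-way merging."""
    if mem_items < 1 or fan_in < 2:
        raise ValueError("need mem_items >= 1 and fan_in >= 2")
    # Pass 1: cut the input into chunks that fit in memory and sort each one.
    runs = [sorted(items[s:s + mem_items])
            for s in range(0, len(items), mem_items)]
    merge_passes = 0
    # Later passes: merge groups of fan_in runs until a single run remains.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[g:g + fan_in]))
                for g in range(0, len(runs), fan_in)]
        merge_passes += 1
    return (runs[0] if runs else []), merge_passes
```

With 100 items, `mem_items=10` and `fan_in=3`, run formation yields 10 runs, and the merge passes reduce them to 4, then 2, then 1.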

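How it works…

Run formation: $N/M$ runs of size $M$, each sorted in internal memory = $2\,(N/B)$ I/Os.
Each merge level combines $X$ runs at a time and takes 2 passes over the data (one read, one write) = $2\,(N/B)$ I/Os.
There are $\log_X (N/M)$ merge levels, so the I/O-cost for $X$-way merge is $\approx 2\,(N/B)$ I/Os per level.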

Cost of Multi-way Merge-Sort

Number of passes = $\log_X (N/M) \approx \log_{M/B} (N/M)$.
Total I/O-cost is $\Theta\big( (N/B) \log_{M/B} (N/M) \big)$ I/Os.
A large fan-out ($M/B$) decreases the number of passes.
In practice $M/B \approx 10^5$ ⇒ #passes = 1 ⇒ a few minutes.
Tuning depends on disk features.
Compression would decrease the cost of a pass!
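To make the effect of the fan-out concrete, the sketch below counts merge passes and total I/Os for given $N$, $M$ and $B$ (all in items). The function name is hypothetical, and counting the run-formation pass as an extra $2\,(N/B)$ I/Os, as well as rounding up to whole pages, are assumptions of this sketch.

```python
def multiway_mergesort_cost(n_items, mem_items, page_items):
    """Return (merge_passes, total_ios) for X-way merge-sort with X = M/B - 1."""
    fan_in = mem_items // page_items - 1   # one page is reserved for output
    if fan_in < 2:
        raise ValueError("memory must hold at least 3 pages")
    runs = -(-n_items // mem_items)        # ceil(N/M) initial runs
    pages = -(-n_items // page_items)      # ceil(N/B) pages
    merge_passes = 0
    while runs > 1:
        runs = -(-runs // fan_in)          # each pass merges fan_in runs into one
        merge_passes += 1
    # Run formation and every merge pass each read and write all pages once.
    return merge_passes, 2 * pages * (merge_passes + 1)


print(multiway_mergesort_cost(10**8, 10**6, 10**3))
```

With these illustrative sizes there are 100 initial runs and a fan-in of 999, so a single merge pass suffices, whereas binary merging of the same 100 runs would need 7 levels.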