CSE 544: Relational Operators, Sorting Wednesday, 5/12/2004.

Slides:



Advertisements
Similar presentations
Lecture 07: Relational Algebra
Advertisements

1 Relational Algebra. Motivation Write a Java program Translate it into a program in assembly language Execute the assembly language program As a rough.
CS 540 Database Management Systems
Lecture 13: Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data.
1 Relational Algebra Lecture #9. 2 Querying the Database Goal: specify what we want from our database Find all the employees who earn more than $50,000.
Query Optimization Goal: Declarative SQL query
Relational Algebra Maybe -- SQL. Confused by Normal Forms ? 3NF BCNF 4NF If a database doesn’t violate 4NF (BCNF) then it doesn’t violate BCNF (3NF) !
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Query Evaluation Chapter 11 External Sorting.
Lecture 24: Query Execution Monday, November 20, 2000.
1 Lecture 22: Query Execution Wednesday, March 2, 2005.
ACS-4902 Ron McFadyen Chapter 15 Algorithms for Query Processing and Optimization.
1 Lecture 07: Relational Algebra. 2 Outline Relational Algebra (Section 6.1)
External Sorting 198:541. Why Sort?  A classic problem in computer science!  Data requested in sorted order e.g., find students in increasing gpa order.
Relational Schema Design (end) Relational Algebra Finally, querying the database!
One More Normal Form Consider the dependencies: Product Company Company, State Product Is it in BCNF?
1 Relational Operators. 2 Outline Logical/physical operators Cost parameters and sorting One-pass algorithms Nested-loop joins Two-pass algorithms.
CSE 444: Lecture 24 Query Execution Monday, March 7, 2005.
Lecture 3: Relational Algebra and SQL Tuesday, March 25, 2008.
CSCE Database Systems Chapter 15: Query Execution 1.
Query Processing. Steps in Query Processing Validate and translate the query –Good syntax. –All referenced relations exist. –Translate the SQL to relational.
1 Introduction to Database Systems CSE 444 Lecture 20: Query Execution: Relational Algebra May 21, 2008.
Sorting.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 External Sorting Chapter 13.
Transactions, Relational Algebra, XML February 11 th, 2004.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
Relational Algebra 2. Relational Algebra Formalism for creating new relations from existing ones Its place in the big picture: Declartive query language.
CS411 Database Systems Kazuhiro Minami 11: Query Execution.
Lecture 24 Query Execution Monday, November 28, 2005.
CS4432: Database Systems II Query Processing- Part 2.
Lecture 13: Relational Decomposition and Relational Algebra February 5 th, 2003.
The Relational Model Lecture 16. Today’s Lecture 1.The Relational Model & Relational Algebra 2.Relational Algebra Pt. II [Optional: may skip] 2 Lecture.
1 Lecture 10: Database Design and Relational Algebra Monday, October 20, 2003.
Lecture 17: Query Execution Tuesday, February 28, 2001.
Query Execution. Where are we? File organizations: sorted, hashed, heaps. Indexes: hash index, B+-tree Indexes can be clustered or not. Data can be stored.
CS 440 Database Management Systems Lecture 5: Query Processing 1.
Introduction to Database Systems1 External Sorting Query Processing: Topic 0.
Hash Tables and Query Execution March 1st, Hash Tables Secondary storage hash tables are much like main memory ones Recall basics: –There are n.
Query Processing – Implementing Set Operations and Joins Chap. 19.
CS 540 Database Management Systems
Query Execution Query compiler Execution engine Index/record mgr. Buffer manager Storage manager storage User/ Application Query update Query execution.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
DMBS Internals I. What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently.
DMBS Architecture May 15 th, Generic Architecture Query compiler/optimizer Execution engine Index/record mgr. Buffer manager Storage manager storage.
1 CSE544: Lecture 7 XQuery, Relational Algebra Monday, 4/22/02.
Tallahassee, Florida, 2016 COP5725 Advanced Database Systems Query Processing Spring 2016.
External Sorting. Why Sort? A classic problem in computer science! Data requested in sorted order –e.g., find students in increasing gpa order Sorting.
1 Lecture 23: Query Execution Monday, November 26, 2001.
CS 405G: Introduction to Database Systems Instructor: Jinze Liu Fall 2007.
What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the database concurrently and safely. Provide.
1 Lecture 16: Data Storage Wednesday, November 6, 2006.
Relational Algebra.
CS 440 Database Management Systems
Lecture 8: Relational Algebra
Database Management Systems (CS 564)
Lecture 33: The Relational Model 2
Query Execution Presented by Jiten Oswal CS 257 Chapter 15
Issues in Indexing Multi-dimensional indexing:
Relational Algebra Friday, 11/14/2003.
Lecture 13: Query Execution
CS639: Data Management for Data Science
Lecture 23: Query Execution
CSE 444: Lecture 25 Query Execution
Lecture 22: Query Execution
Lecture 3: Relational Algebra and SQL
Lecture 11: B+ Trees and Query Execution
CSE 544: Query Execution Wednesday, 5/12/2004.
Lecture 11: Functional Dependencies
Lecture 20: Query Execution
Lecture 20: Representing Data Elements
Presentation transcript:

CSE 544: Relational Operators, Sorting Wednesday, 5/12/2004

Relational Algebra Operates on relations, i.e. sets –Later: we discuss how to extend this to bags Five operators: –Union:  –Difference: - –Selection:  –Projection:  –Cartesian Product:  Derived or auxiliary operators: –Intersection, complement –Joins (natural,equi-join, theta join, semi-join) –Renaming: 

1. Union and 2. Difference R 1  R 2 Example: ActiveEmployees  RetiredEmployees R 1 – R 2 Example: AllEmployees – RetiredEmployees

What about Intersection ? It is a derived operator R 1  R 2 = R 1 – (R 1 – R 2 ) Also expressed as a join (will see later) Example UnionizedEmployees  RetiredEmployees

3. Selection Returns all tuples which satisfy a condition Notation:  c (R) Examples  Salary > (Employee)  name = “Smithh” (Employee) The condition c can be =,, , <>

4. Projection Eliminates columns, then removes duplicates Notation:  A1,…,An (R) Example: project social-security number and names:  SSN, Name (Employee) Output schema: Answer(SSN, Name)

5. Cartesian Product Each tuple in R 1 with each tuple in R 2 Notation: R 1  R 2 Example: Employee  Dependents Very rare in practice; mainly used to express joins

Renaming Changes the schema, not the instance Notation:  B1,…,Bn (R) Example:  LastName, SocSocNo (Employee) Output schema: Answer(LastName, SocSocNo)

Renaming Example Employee NameSSN John Tony LastNameSocSocNo John Tony  LastName, SocSocNo (Employee)

Natural Join Notation: R 1 |  | R 2 Meaning: R 1 |  | R 2 =  A (  C (R 1  R 2 )) Where: –The selection  C checks equality of all common attributes –The projection eliminates the duplicate common attributes

Employee Dependents =  Name, SSN, Dname (  SSN=SSN2 (Employee x  SSN2, Dname (Dependents)) Natural Join Example Employee NameSSN John Tony Dependents SSNDname Emily Joe NameSSNDname John Emily Tony Joe

Natural Join R= S= R |  | S= AB XY XZ YZ ZV BC ZU VW ZV ABC XZU XZV YZU YZV ZVW

Natural Join Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R |  | S ? Given R(A, B, C), S(D, E), what is R |  | S ? Given R(A, B), S(A, B), what is R |  | S ?

Theta Join A join that involves a predicate R1 |  |  R2 =   (R1  R2) Here  can be any condition: =, 

Eq-join A theta join where  is an equality R 1 |  | A=B R 2 =  A=B (R 1  R 2 ) Example: Employee |  | SSN=SSN Dependents Most useful join in practice

Semijoin R |  S =  A1,…,An (R |  | S) Where A 1, …, A n are the attributes in R Example: Employee |  Dependents

Semijoins in Distributed Databases Semijoins are used in distributed databases SSNName... SSNDnameAge... Employee Dependents network Employee |  | ssn=ssn (  age>71 (Dependents)) T =  SSN  age>71 (Dependents) R = Employee |  T Answer = R |  | Dependents

Complex RA Expressions Person Purchase Person Product  name=fred  name=gizmo  pid  ssn seller-ssn=ssnpid=pidbuyer-ssn=ssn  name

Operations on Bags A bag = a set with repeated elements Relational Engines work on bags, not sets ! All operations need to be defined carefully on bags {a,b,b,c}  {a,b,b,b,e,f,f}={a,a,b,b,b,b,b,c,e,f,f} {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b,d}  C (R): preserve the number of occurrences  A (R): no duplicate elimination Cartesian product, join: no duplicate elimination

Logical Operators in the Bag Algebra Union, intersection, difference Selection  Projection  Join ⋈ Duplicate elimination  Grouping  Sorting  Relational Algebra (on bags)

Example SELECT city, count(*) FROM sales GROUP BY city HAVING sum(price) > 100 SELECT city, count(*) FROM sales GROUP BY city HAVING sum(price) > 100 sales  city, sum(price)→p, count(*) → c  p > 100  city, c T(city,p,c)

Physical Operators SELECT S.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘ ’ SELECT S.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘ ’ Query Plan: logical tree implementation choice at every node scheduling of operations. Purchase Person Buyer=name City=‘seattle’ phone>’ ’ buyer (Simple Nested Loops)  (Table scan)(Index scan) Some operators are from relational algebra, and others (e.g., scan) are not.

Architecture of a Database Engine Parse Query Select Logical Plan Select Physical Plan Query Execution SQL query Query optimization Logical plan Physical plan

Cost Parameters In database systems the data is on disks, not in main memory The cost of an operation = total number of I/Os Cost parameters: B(R) = number of blocks for relation R T(R) = number of tuples in relation R V(R, a) = number of distinct values of attribute a

Cost Parameters Clustered table R: –Blocks consists only of records from this table –B(R)  T(R) / blockSize Unclustered table R: –Its records are placed on blocks with other tables –When R is unclustered: B(R)  T(R) When a is a key, V(R,a) = T(R) When a is not a key, V(R,a)

Cost Cost of an operation = number of disk I/Os needed to: –read the operands –compute the result Cost of writing the result to disk is not included on the following slides Question: the cost of sorting a table with B blocks ? Answer:

Scanning Tables The table is clustered: –Table-scan: if we know where the blocks are –Index scan: if we have a sparse index to find the blocks The table is unclustered –May need one read for each record

Sorting While Scanning Sometimes it is useful to have the output sorted Three ways to scan it sorted: –If there is a primary or secondary index on it, use it during scan –If it fits in memory, sort there –If not, use multi-way merge sort

Cost of the Scan Operator Clustered relation: –Table scan: Unsorted: B(R) Sorted: 3B(R) –Index scan Unsorted: B(R) Sorted: B(R) or 3B(R) Unclustered relation –Unsorted: T(R) –Sorted: T(R) + 2B(R)

Sorting Problem: sort 1 GB of data with 1MB of RAM. Where we need this: –Data requested in sorted order (ORDER BY) –Needed for grouping operations –First step in sort-merge join algorithm –Duplicate removal –Bulk loading of B+-tree indexes.

2-Way Merge-sort: Requires 3 Buffers in RAM Pass 1: Read 1MB, sort it, write it. Pass 2, 3, …, etc.: merge two runs, write them Main memory buffers INPUT 1 INPUT 2 OUTPUT Disk Runs of length L Runs of length 2L

Two-Way External Merge Sort Assume block size is B = 4Kb Step 1  runs of length L = 1MB Step 2  runs of length L = 2MB Step 3  runs of length L = 4MB Step 10  runs of length L = 1GB (why ?) Need 10 iterations over the disk data to sort 1GB

Can We Do Better ? Hint: We have 1MB of main memory, but only used 12KB

Cost Model for Our Analysis B: Block size ( = 4KB) M: Size of main memory ( = 1MB) For later use (won’t need now): N: Number of records in the file R: Size of one record

External Merge-Sort Phase one: load M bytes in memory, sort –Result: runs of length M bytes ( 1MB ) M bytes of main memory Disk... M/R records

Phase Two Merge M/B – 1 runs into a new run (250 runs ) Result: runs of length M (M/B – 1) bytes (250MB) M bytes of main memory Disk... Input M/B Input 1 Input 2.. Output

Phase Three Merge M/B – 1 runs into a new run Result: runs of length M (M/B – 1) 2 records (250*250MB = 62.5GB – larger than the file) M bytes of main memory Disk... Input M/B Input 1 Input 2.. Output Need 3 iterations over the disk data to sort 1GB

Cost of External Merge Sort Number of passes: How much data can we sort with 10MB RAM? –1 pass  10MB data –2 passes  25GB data (M/B = 2500) Can sort everything in 2 or 3 passes !

External Merge Sort The xsort tool in the XML toolkit sorts using this algorithm Can sort 1GB of XML data in about 8 minutes