ICS 624 Spring 2011 Overview of DB & IR Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 1/12/20111Lipyeow.

Slides:



Advertisements
Similar presentations
Relational Algebra Jianlin Feng School of Software SUN YAT-SEN UNIVERSITY courtesy of Joe Hellerstein for some slides.
Advertisements

Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A Modified by Donghui Zhang.
INFS614, Fall 08 1 Relational Algebra Lecture 4. INFS614, Fall 08 2 Relational Query Languages v Query languages: Allow manipulation and retrieval of.
CS 166: Database Management Systems
Database Management Systems 1 Raghu Ramakrishnan SQL: Queries, Programming, Triggers Chpt 5.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 SQL: Queries, Programming, Triggers Chapter 5.
1 Relational Algebra & Calculus. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
SQL Review.
SPRING 2004CENG 3521 Query Evaluation Chapters 12, 14.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
FALL 2004CENG 351 File Structures and Data Managemnet1 Relational Algebra.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
1 SQL (Simple Query Language). 2 Query Components A query can contain the following clauses –select –from –where –group by –having –order by Only select.
FALL 2004CENG 351 File Structures and Data Management1 SQL: Structured Query Language Chapter 5.
1 Relational Algebra. 2 Relational Query Languages Query languages: Allow manipulation and retrieval of data from a database. Relational model supports.
CS 4432query processing1 CS4432: Database Systems II.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 SQL: Queries, Constraints, Triggers Chapter 5.
Relational Algebra Chapter 4 - part I. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
By relieving the brain of all unnecessary work, a good notation sets it free to concentrate on more advanced problems, and, in effect, increases the mental.
1 Relational Algebra and Calculus Yanlei Diao UMass Amherst Feb 1, 2007 Slides Courtesy of R. Ramakrishnan and J. Gehrke.
Rutgers University Relational Algebra 198:541 Rutgers University.
Relational Algebra Chapter 4 - part I. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.  Relational.
CSCD343- Introduction to databases- A. Vaisman1 Relational Algebra.
Relational Algebra.
Relational Algebra, R. Ramakrishnan and J. Gehrke (with additions by Ch. Eick) 1 Relational Algebra.
Introduction to SQL Basics; Christoph F. Eick & R. Ramakrishnan and J. Gehrke 1 Introduction to SQL Basics --- SQL in 45 Minutes Chapter 5.
1 Relational Algebra and Calculus Chapter 4. 2 Relational Query Languages  Query languages: Allow manipulation and retrieval of data from a database.
CSC 411/511: DBMS Design Dr. Nan WangCSC411_L6_SQL(1) 1 SQL: Queries, Constraints, Triggers Chapter 5 – Part 1.
Lecture 05 Structured Query Language. 2 Father of Relational Model Edgar F. Codd ( ) PhD from U. of Michigan, Ann Arbor Received Turing Award.
SQL: Queries, Programming, Triggers. Example Instances We will use these instances of the Sailors and Reserves relations in our examples. If the key for.
ICS 321 Fall 2009 SQL: Queries, Constraints, Triggers Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 9/8/20091Lipyeow.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Relational Algebra.
Relational Algebra  Souhad M. Daraghma. Relational Query Languages Query languages: Allow manipulation and retrieval of data from a database. Relational.
CS 4432query processing1 CS4432: Database Systems II Lecture #11 Professor Elke A. Rundensteiner.
1 Relational Algebra & Calculus Chapter 4, Part A (Relational Algebra)
1 Relational Algebra and Calculas Chapter 4, Part A.
1.1 CAS CS 460/660 Introduction to Database Systems Relational Algebra.
Database Management Systems 1 Raghu Ramakrishnan Relational Algebra Chpt 4 Xin Zhang.
Relational Algebra.
ICS 321 Fall 2011 The Relational Model of Data (i) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 8/29/20111Lipyeow.
1 Database Systems ( 資料庫系統 ) October 24, 2005 Lecture #5.
1 Relational Algebra Chapter 4, Sections 4.1 – 4.2.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Database Management Systems Chapter 4 Relational Algebra.
Database Management Systems 1 Raghu Ramakrishnan Relational Algebra Chpt 4 Xin Zhang.
CSCD34-Data Management Systems - A. Vaisman1 Relational Algebra.
ICS 321 Spring 2011 The Database Language SQL (iii) Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 3/14/20111Lipyeow.
Database Management Systems, R. Ramakrishnan1 Relational Algebra Module 3, Lecture 1.
CMPT 258 Database Systems Relational Algebra (Chapter 4)
CMPT 258 Database Systems SQL Queries (Chapter 5).
Relational Operator Evaluation. Overview Application Programmer (e.g., business analyst, Data architect) Sophisticated Application Programmer (e.g.,
1 SQL: Structured Query Language (‘Sequel’) Chapter 5.
SQL: The Query Language Part 1 R &G - Chapter 5 The important thing is not to stop questioning. Albert Einstein.
1 SQL: The Query Language. 2 Example Instances R1 S1 S2 v We will use these instances of the Sailors and Reserves relations in our examples. v If the.
Alon Levy 1 Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation. – Projection ( ) Deletes.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Relational Algebra Chapter 4, Part A.
1 CS122A: Introduction to Data Management Lecture #7 Relational Algebra I Instructor: Chen Li.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Basic SQL Queries.
1 SQL: The Query Language. 2 Example Instances R1 S1 S2 v We will use these instances of the Sailors and Reserves relations in our examples.
COP Introduction to Database Structures
Relational Algebra Chapter 4, Part A
Relational Algebra 461 The slides for this text are organized into chapters. This lecture covers relational algebra, from Chapter 4. The relational calculus.
Relational Algebra.
Relational Operations
CS 405G: Introduction to Database Systems
Relational Algebra Chapter 4, Sections 4.1 – 4.2
Evaluation of Relational Operations: Other Techniques
CENG 351 File Structures and Data Managemnet
Presentation transcript:

ICS 624 Spring 2011 Overview of DB & IR Asst. Prof. Lipyeow Lim Information & Computer Science Department University of Hawaii at Manoa 1/12/20111Lipyeow Lim -- University of Hawaii at Manoa

Example Relations Sailors( sid: integer, sname: string, rating: integer, age: real) Boats( bid: integer, bname: string, color: string) Reserves( sid: integer, bid: string, day: date) sidbidday /10/ /12/96 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa2 sidsnameratingage 22Dustin Lubber Rusty R1 S1 bidbnamecolor 101InterlakeBlue 102InterlakeRed 103Clippergreen 104MarineRed B1

Basic SQL Query relation-list A list of relation names (possibly with a range-variable after each name). target-list A list of attributes of relations in relation-list qualification Comparisons (Attr op const or Attr1 op Attr2, where op is one of, ≤, ≥, =, ≠) combined using AND, OR and NOT. DISTINCT is an optional keyword indicating that the answer should not contain duplicates. Default is that duplicates are not eliminated! 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa3 SELECT [ DISTINCT ] target-list FROM relation-list WHERE qualification

Example Q1 Range variables really needed only if the same relation appears twice in the FROM clause. Good style to always use range variables 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa4 SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND bid=103 SELECT sname FROM Sailors, Reserves WHERE Sailors.sid=Reserves.sid AND bid=103 Without range variables

Conceptual Evaluation Strategy Semantics of an SQL query defined in terms of the following conceptual evaluation strategy: 1. Compute the cross-product of relation-list. 2. Discard resulting tuples if they fail qualifications. 3. Delete attributes that are not in target-list. 4. If DISTINCT is specified, eliminate duplicate rows. This strategy is probably the least efficient way to compute a query! An optimizer will find more efficient strategies to compute the same answers. 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa5

Example Q1: conceptual evaluation Conceptual Evaluation Steps: 1. Compute cross-product 2. Discard disqualified tuples 3. Delete unwanted attributes 4. If DISTINCT is specified, eliminate duplicate rows. 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa6 S.sidsnameratingageR.sidbidday 22Dustin /10/96 22Dustin /12/96 31Lubber /10/96 31Lubber /12/96 58Rusty /10/96 58Rusty /12/96 SELECT S.sname FROM Sailors S, Reserves R WHERE S.sid=R.sid AND bid=103 S.sidsnameratingageR.sidbidday 58Rusty /12/96 sname Rusty

Relational Algebra Basic operations: – Selection ( σ ) Selects a subset of rows from relation. – Projection ( π ) Deletes unwanted columns from relation. – Cross-product ( × ) Allows us to combine two relations. – Set-difference ( − ) Tuples in reln. 1, but not in reln. 2. – Union ( U ) Tuples in reln. 1 and in reln. 2. Additional operations: – Intersection, join, division, renaming: Not essential, but (very!) useful. Since each operation returns a relation, operations can be composed! (Algebra is “closed”.) 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa7

Projection Deletes attributes that are not in projection list. Schema of result contains exactly the fields in the projection list, with the same names that they had in the (only) input relation. Projection operator has to eliminate duplicates! (Why??) Note: real systems typically don’t do duplicate elimination unless the user explicitly asks for it. (Why not?) 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa8 π sname, rating (S2) snamerating Yuppy9 Lubber8 Guppy5 Rusty10 π age (S2) age

Selection Selects rows that satisfy selection condition. No duplicates in result! (Why?) Schema of result identical to schema of (only) input relation. Result relation can be the input for another relational algebra operation! (Operator composition.) 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa9 sidsnameratingage 28Yuppy Lubber Guppy Rusty sidsnameratingage 28Yuppy Lubber Guppy Rusty σ rating > 8 (S2) π sname, rating ( σ rating>8 (S2))

Union, Intersection, Set-Difference All of these operations take two input relations, which must be union-compatible: – Same number of fields. – `Corresponding’ fields have the same type. What is the schema of result? 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa10 sidsnameratingage 22Dustin Lubber Rusty S1S2 sidsnameratingage 28Yuppy Lubber Guppy Rusty S1 U S2 sidsnameratingage 22Dustin Yuppy Lubber Guppy Rusty1035.0

Intersection & Set-Difference 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa11 sidsnameratingage 22Dustin Lubber Rusty S1S2 sidsnameratingage 28Yuppy Lubber Guppy Rusty S1 − S2 sidsnameratingage 22Dustin745.0 S1 ∩ S2 sidsnameratingage 31Lubber Rusty1035.0

Cross-Product Consider the cross product of S1 with R1 Each row of S1 is paired with each row of R1. Result schema has one field per field of S1 and R1, with field names `inherited’ if possible. – Conflict: Both S1 and R1 have a field called sid. – Rename to sid1 and sid2 sidsnameratingagesidbidday 22Dustin /10/96 22Dustin /12/96 31Lubber /10/96 31Lubber /12/96 58Rusty /10/96 58Rusty /12/96 S1 × R1 sidbidday /10/ /12/96 sidsnameratingage 22Dustin Lubber Rusty R1 S1 1/12/201112Lipyeow Lim -- University of Hawaii at Manoa

Joins Condition Join: Result schema same as that of cross-product. Fewer tuples than cross-product, might be able to compute more efficiently Sometimes called a theta-join. 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa13 sidsnameratingagesidbidday 22Dustin /12/96 31Lubber /12/96

Equi-Joins & Natural Joins Equi-join: A special case of condition join where the condition c contains only equalities. – Result schema similar to cross-product, but only one copy of fields for which equality is specified. Natural Join: Equi-join on all common fields. 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa14 sidsnameratingagebidday 22Dustin /10/96 58Rusty /12/96

1/12/2011Lipyeow Lim -- University of Hawaii at Manoa15 Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Query SELECT * FROM Reserves WHERE sid=101  Sid=101 Reserves SCAN (sid=101) Reserves IDXSCAN (sid=101) Reserves Index(sid) fetch Pick B AB Evaluate Plan A Optimizer

Parse Query Input : SQL – Eg. SELECT-FROM-WHERE, CREATE TABLE, DROP TABLE statements Output: Some data structure to represent the “query” – Relational algebra ? Also checks syntax, resolves aliases, binds names in SQL to objects in the catalog How ? 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa16 Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Query

Enumerate Plans Input : a data structure representing the “query” Output: a collection of equivalent query evaluation plans Query Execution Plan (QEP): tree of database operators. – high-level: RA operators are used – low-level: RA operators with particular implementation algorithm. Plan enumeration: find equivalent plans – Different QEPs that return the same results – Query rewriting : transformation of one QEP to another equivalent QEP. 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa17 Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Query

Estimate Cost Input : a collection of equivalent query evaluation plans Output: a cost estimate for each QEP in the collection Cost estimation: a mapping of a QEP to a cost – Cost Model: a model of what counts in the cost estimate. Eg. Disk accesses, CPU cost … Statistics about the data and the hardware are used. 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa18 Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Query

Choose Best Plan Input : a collection of equivalent query evaluation plans and their cost estimate Output: best QEP in the collection The steps: enumerate plans, estimate cost, choose best plan collectively called the: Query Optimizer: – Explores the space of equivalent plan for a query – Chooses the best plan according to a cost model 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa19 Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Query

Evaluate Query Plan Input : a QEP (hopefully the best) Output: Query results Often includes a “code generation” step to generate a lower level QEP in executable “code”. Query evaluation engine is a “virtual machine” that executes some code representing low level QEP. 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa20 Parse Query Enumerate Plans Estimate Cost Choose Best Plan Evaluate Query Plan Result Query

Query Execution Plans (QEPs) A tree of database operators: each operator is a RA operator with specific implementation Selection  : Index Scan or Table Scan Projection π: – Without DISTINCT : Table Scan – With DISTINCT : requires sorting or index scan Join : – Nested loop joins (naïve) – Index nested loop joins – Sort merge joins Sort : – In-memory sort – External sort 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa21

QEP Examples 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa22 SELECT S.sname FROM Reserves R, Sailors S WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5  S.rating>5 AND R.bid=100 Reserves Sailors R.sid=S.sid π S.sname Nested Loop Join On the fly (SCAN)  S.rating>5 AND R.bid=100 Reserves Sailors R.sid=S.sid π S.sname  S.rating>5 Reserves Sailors R.sid=S.sid π S.sname  R.bid=100  S.rating>5 Reserves Sailors R.sid=S.sid π S.sname Nested Loop Join On the fly  R.bid=100 (SCAN) Temp T1

Access Paths An access path is a method of retrieving tuples. Eg. Given a query with a selection condition: – File or table scan – Index scan Index matching problem: given a selection condition, which indexes can be used for the selection, i.e., matches the selection ? – Selection condition normalized to conjunctive normal form (CNF), where each term is a conjunct – Eg. (day<8/9/94 AND rname=‘Paul’) OR bid=5 OR sid=3 – CNF: (day<8/9/94 OR bid=5 OR sid=3 ) AND (rname=‘Paul’ OR bid=5 OR sid=3) 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa23  S.rating>5 ReservesSailors R.sid=S.sid π S.sname Nested Loop Join On the fly  R.bid=100 (SCAN) Temp T1 Index(R.bid)  R.bid=100 (IDXSCAN) Fetch Reserves

Index Matching A tree index matches a selection condition if the selection condition is a prefix of the index search key. A hash index matches a selection condition if the selection condition has a term attribute=value for every attribute in the index search key 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa24 I1: Tree Index (a,b,c) I2: Tree Index (b,c,d) I3: Hash Index (a,b,c) Q1:  a=5 AND b=3 Q2:  a=5 AND b>6 Q3:  b=3 Q4:  a=5 AND b=3 AND c=5 Q5:  a>5 AND b=3 AND c=5

Unstructured Text Data Field of “Information Retrieval” Data Model – Collection of documents – Each document is a bag of words (aka terms) Query Model – Keyword + Boolean Combinations – Eg. DBMS and SQL and tutorial Details: – Not all words are equal. “Stop words” (eg. “the”, “a”, “his”...) are ignored. – Stemming : convert words to their basic form. Eg. “Surfing”, “surfed” becomes “surf” 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa25

Inverted Indexes Recall: an index is a mapping of search key to data entries – What is the search key ? – What is the data entry ? Inverted Index: – For each term store a list of postings – A posting consists of pairs 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa26 DBMSdoc doc02538doc0113 SQLdoc06112doc0949doc2012 triggerdoc011215doc091421doc lexicon Posting lists What is the data in an inverted index sorted on ?

Lookups using Inverted Indexes Given a single keyword query “k” (eg. SQL) – Find k in the lexicon – Retrieve the posting list for k – Scan posting list for document IDs [and positions] What if the query is “k1 and k2” ? – Retrieve document IDs for k1 and k2 – Perform intersection 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa27 DBMSdoc doc02538doc0113 SQLdoc06112doc0949doc2012 triggerdoc011215doc091421doc lexicon Posting lists

Too Many Matching Documents Rank the results by “relevance”! Vector-Space Model – Documents are vectors in hi- dimensional space – Each dimension in the vector represents a term – Queries are represented as vectors similarly – Vector distance (dot product) between query vector and document vector gives ranking criteria – Weights can be used to tweak relevance PageRank (later) 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa28 Star Diet Doc about astronomy Doc about movie stars Doc about behavior

How good are the retrieved docs? Precision : Fraction of retrieved docs that are relevant to user’s information need Recall : Fraction of relevant docs in collection that are retrieved 1/12/2011Lipyeow Lim -- University of Hawaii at Manoa29