Download presentation
Presentation is loading. Please wait.
1
Relational Algebra Friday, 11/14/2003
2
Outline Relational Algebra: 5.2, 5.3, 5.4 Reading assignment:
Extensible hash tables: , Linear hash tables: ,
3
Some History The Relational Data Model
Invented by Ted Codd, circa 1970 Main idea: Data = mathematical relations Queries = mathematical formulas (“abstract SQL”) Elegant, clean But practical ?
4
History Continued State of the art in databases in 1970s:
The COBOL/CODASYL approach Data = a file of records, with pointers No queries, only pointer navigation Very efficient But hard to write queries/programs
5
More History For the relational model to prove itself it needed to show how to implement mathematical “queries” efficiently. Three ideas: Intermediate language = relational algebra Efficient implementation of each operator Optimization First proof of concept: System R (1979)
6
Relational Algebra Operates on relations, i.e. sets Five operators:
Later: we discuss how to extend this to bags Five operators: Union: Difference: - Selection: s Projection: P Cartesian Product: Derived or auxiliary operators: Intersection, complement Joins (natural,equi-join, theta join, semi-join) Renaming: r
7
1. Union and 2. Difference R1 R2 Example: R1 – R2
ActiveEmployees RetiredEmployees R1 – R2 AllEmployees – RetiredEmployees
8
What about Intersection ?
It is a derived operator R1 R2 = R1 – (R1 – R2) Also expressed as a join (will see later) Example UnionizedEmployees RetiredEmployees
9
3. Selection Returns all tuples which satisfy a condition
Notation: sc(R) Examples sSalary > (Employee) sname = “Smithh” (Employee) The condition c can be =, <, , >, , <>
10
Find all employees with salary more than $40,000.
s Salary > (Employee)
11
4. Projection Eliminates columns, then removes duplicates
Notation: P A1,…,An (R) Example: project social-security number and names: P SSN, Name (Employee) Output schema: Answer(SSN, Name)
12
P SSN, Name (Employee)
13
5. Cartesian Product Each tuple in R1 with each tuple in R2
Notation: R1 R2 Example: Employee Dependents Very rare in practice; mainly used to express joins
15
Renaming Changes the schema, not the instance Notation: r B1,…,Bn (R)
Example: rLastName, SocSocNo (Employee) Output schema: Answer(LastName, SocSocNo)
16
LastName, SocSocNo (Employee)
Renaming Example Employee Name SSN John Tony LastName, SocSocNo (Employee) LastName SocSocNo John Tony
17
Natural Join Notation: R1 ⋈ R2 Meaning: R1 ⋈ R2 = PA(sC(R1 R2))
Where: The selection sC checks equality of all common attributes The projection eliminates the duplicate common attributes
18
Natural Join Example Employee Name SSN John Tony Dependents SSN Dname Emily Joe Employee Dependents = PName, SSN, Dname(s SSN=SSN2(Employee x rSSN2, Dname(Dependents)) Name SSN Dname John Emily Tony Joe
19
Natural Join R= S= R ⋈ S= A B X Y Z V B C Z U V W A B C X Z U V Y W
20
Natural Join Given the schemas R(A, B, C, D), S(A, C, E), what is the schema of R ⋈ S ? Given R(A, B, C), S(D, E), what is R ⋈ S ? Given R(A, B), S(A, B), what is R ⋈ S ?
21
Theta Join A join that involves a predicate R1 ⋈ q R2 = s q (R1 R2)
Here q can be any condition: =, <, , , >
22
Eq-join A theta join where q is an equality
R1 ⋈A=B R2 = s A=B (R1 R2) Example: Employee ⋈SSN=SSN Dependents Most useful join in practice
23
Semijoin R ⋉ S = P A1,…,An (R ⋈ S)
Where A1, …, An are the attributes in R Example: Employee ⋉ Dependents
24
Semijoins in Distributed Databases
Semijoins are used in distributed databases Dependents Employee SSN Dname Age . . . SSN Name . . . network Employee ⋈ssn=ssn (s age>71 (Dependents)) T = P SSN s age>71 (Dependents) R = Employee ⋉ T Answer = R ⋈ Dependents
25
Complex RA Expressions
P name buyer-ssn=ssn pid=pid seller-ssn=ssn P ssn P pid sname=fred sname=gizmo Person Purchase Person Product
26
Operations on Bags A bag = a set with repeated elements
Relational Engines work on bags, not sets ! All operations need to be defined carefully on bags {a,b,b,c}{a,b,b,b,e,f,f}={a,a,b,b,b,b,b,c,e,f,f} {a,b,b,b,c,c} – {b,c,c,c,d} = {a,b,b,d} sC(R): preserve the number of occurrences PA(R): no duplicate elimination Cartesian product, join: no duplicate elimination
27
Logical Operators in the Bag Algebra
Union, intersection, difference Selection s Projection P Join ⋈ Duplicate elimination d Grouping g Sorting t Relational Algebra (on bags)
28
Example SELECT city, count(*) FROM sales P city, c GROUP BY city
HAVING sum(price) > 100 P city, c s p > 100 T(city,p,c) g city, sum(price)→p, count(*) → c sales
29
Physical Operators Query Plan: logical tree
SELECT S.buyer FROM Purchase P, Person Q WHERE P.buyer=Q.name AND Q.city=‘seattle’ AND Q.phone > ‘ ’ Purchase Person Buyer=name City=‘seattle’ phone>’ ’ buyer (Simple Nested Loops) (Table scan) (Index scan) Query Plan: logical tree implementation choice at every node scheduling of operations. Some operators are from relational algebra, and others (e.g., scan) are not.
30
Architecture of a Database Engine
SQL query Parse Query Logical plan Select Logical Plan Query optimization Select Physical Plan Physical plan Query Execution
31
Question in Class Logical operator:
Product(pname, cname) ⋈ Company(cname, city) Propose three physical operators for the join, assuming the tables are in main memory:
32
Question in Class Product(pname, cname) ⋈ Company(cname, city)
products 1000 companies How much time do the following physical operators take if the data is in main memory ? Nested loop join time = Sort and merge = merge-join time = Hash join time =
33
Cost Parameters In database systems the data is on disks, not in main memory The cost of an operation = total number of I/Os Cost parameters: B(R) = number of blocks for relation R T(R) = number of tuples in relation R V(R, a) = number of distinct values of attribute a
34
Cost Parameters Clustered table R: Unclustered table R:
Blocks consists only of records from this table B(R) T(R) / blockSize Unclustered table R: Its records are placed on blocks with other tables When R is unclustered: B(R) T(R) When a is a key, V(R,a) = T(R) When a is not a key, V(R,a)
35
Cost Cost of an operation = number of disk I/Os needed to:
read the operands compute the result Cost of writing the result to disk is not included on the following slides Question: the cost of sorting a table with B blocks ? Answer:
36
Scanning Tables The table is clustered: The table is unclustered
Table-scan: if we know where the blocks are Index scan: if we have a sparse index to find the blocks The table is unclustered May need one read for each record
37
Sorting While Scanning
Sometimes it is useful to have the output sorted Three ways to scan it sorted: If there is a primary or secondary index on it, use it during scan If it fits in memory, sort there If not, use multi-way merge sort
38
Cost of the Scan Operator
Clustered relation: Table scan: Unsorted: B(R) Sorted: 3B(R) Index scan Unsorted: B(R) Sorted: B(R) or 3B(R) Unclustered relation Unsorted: T(R) Sorted: T(R) + 2B(R)
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.