Presentation is loading. Please wait.

Presentation is loading. Please wait.

Database Management Systems

Similar presentations


Presentation on theme: "Database Management Systems"— Presentation transcript:

1 Database Management Systems
Faculty of Computer Science Technion – Israel Institute of Technology Database Management Systems Course Lecture 3: Relational Algebra

2 Outline

3 The Relational Model A conceptual model for representing data, integrity constraints, and queries All based on the notion of a schema DBMS is responsible for translating specifications into the physical environment at hand Storage in files, caches, indexes Queries translated to query plans (high-level imperative programs) Query plans translated to low-level execution over stored data Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

4 Querying: Which Courses Avia Took?
ID name addr 1234 Avia Haifa 2345 Boris Nesher number name lecturer 363 DB Anna 319 PL Barak sID cNum grade 1234 363 95 319 82 2345 73 Assembly ... mov $1, %rax mov $1, %rdi         mov $message, %rsi       mov $13, %rdx               syscall mov $60, %rax xor %rdi, %rdi         QL SQL SELECT C.name FROM S,C,T WHERE S.name = ‘Avia’ AND S.ID = T.sID AND T.cNum = C.number Logic (RC) {⟨x⟩|∃y,n,z,l,g [S(y,n,'Avia')∧C(z,x,l)∧T(y,z,g)]} Python for s in S: for c in C: for t in T: if s.sName==‘Avia’ and s.ID==t.sID and t.cNum == c.number: print c.name Logic Programming (Datalog) Q(x)  S(y,n,‘Avia’),C(z,x,l),T(y,z,g) Algebra (RA) πC.name(σS.name=‘Avia’, number=cNum, ID=sID(S⨉C⨉T)))

5 The Relational Algebra (RA)
Mathematical query language Introduced by Edgar Codd [E. F. Codd: A Relational Model of Data for Large Shared Data Banks. Commun. ACM 13(6): (1970)] [E. F. Codd: Relational Completeness of Data Base Sublanguages. In: R. Rustin (ed.): Database Systems: 65-98, Prentice Hall and IBM Research Report RJ 987, San Jose, California (1972)] Since invention, developed and studied by Codd and many others Edgar F. Codd ( )

6 Names of students who study DB:
RA Example Names of students who study DB: name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies)))) Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76 name Alma Amir

7 Why RA? Understanding the relational algebra is a key understanding central concepts in databases: SQL, query evaluation, query optimization Tool for building theoretical foundations of various query languages (e.g., SQL) Tool for developing novel data/query models See, e.g., [Fagin, Kimelfeld, Reiss, Vansummeren: Document Spanners: A Formal Approach to Information Extraction. J. ACM 62(2): 12 (2015)]

8 RA vs Other QLs Some subtle (yet important) differences between RA and other languages Can tables have duplicate records? (RA vs. SQL) Are missing (NULL) values allowed? Is there any order among records? Is the answer dependent on the domain from which values are taken (not just the DB)? (RA vs. RC)

9 Relation Schema A relation schema is a finite sequence of distinct attribute names att with a mapping of each to a domain dom of legal values Notation: (att1:dom1,…,attk:domk) Example: (sid:int, name:string, year:int) sid name year

10 Tuples Let s be a relation schema (att1:dom1,…,attk:domk)
A tuple (over s) is a sequence (v1,…,vk) of values vi, where each vi is in domi That is, a tuple is an element of dom1×...×domk sid name year 861 Alma 2

11 Relations A relation R is a pair (s,r) s is a relation schema
Called the header of R r is a finite set r of tuples over s sid name year 861 Alma 2 753 Amir 1 955 Ahuva

12 Ignoring Domains In this lecture we ignore the attribute domains, since they play no special role (Well, almost; they make a difference for query equivalence, but we do not get there…) For example, we will write (sid, name, year) instead of (sid:int, name:string, year:int)

13 Notation Notation 1: Notation 2:
Let R be a relation with the header (att1,…,attk) Let t=(v1,…,vk) be a tuple of R We refer to vi by t.atti Notation 2: Let a1,…,am be attributes among att1,…,attk We denote by t[a1,…,am] the tuple (t.a1,…,t.am) t.sid = 861 t.name = Alma t[sid,name] = (861,Alma) t[name,sid,year] = (Alma,861,2) sid name year 861 Alma 2 753 Amir 1 955 Ahuva t

14 Databases A database schema is finite set of relation names, each mapped into a relation schema Example: Student(sid,name,year) , Course(cid,topic) , Studies(sid,cid) A database (or instance) over a schema consists of a relation for each relation schema Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

15 What is “Algebra”? An abstract algebra consists of: Each operator:
A class of elements A collection of operators Each operator: Has an arity d Has a domain of sequences (e1,…,ed) of elements Maps every sequence in its domain to an element e The definition of an operator allows for composition: o1(o2(x),o1(y,o4(x,z))) Examples: Ring of integers: (𝕫,{+,·}) Boolean algebra: ({true,false},{∧,∨,¬}) Relational algebra

16 The Relational Algebra
In the relational algebra (RA) the elements are relations Recall: a relation is a pair (s,r) RA has 6 primitive operators: Unary: projection, selection, renaming Binary: union, difference, Cartesian product Each of the six is essential (independent)—we cannot define it using the others We will see what exactly this means and how this can be proved We commonly allow many more useful operators that can be defined by the primitive ones For example, intersection via union and difference

17 Outline

18 6 Primitive (Basic) Operators
Projection () Selection () Renaming () Union (∪) Difference (\) Cartesian Product (×)

19 Task (end of this section)
Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 Phrase a query that finds the names of students who get private lessons (i.e., the student takes a course that no one else takes)

20 R = sid,name(R) = year(R) = Projection by Example sid name year 861
Alma 2 753 Amir 1 955 Ahuva R = sid name 861 Alma 753 Amir 955 Ahuva sid,name(R) = year 2 1 year(R) = Fewer tuples (why?)

21 Definition of Projection
Projection is a unary operator of the form A1,…,Ak where each Ai is an attribute name A projection is parameterized by attributes, so we actually have infinitely many different projection operators Legal input: relation R in with attributes A1,…,Ak (and possibly others) A1,…,Ak(R) is the relation S with: Header (A1,…,Ak) Tuple set {t[A1,…,Ak] | t ∊ R} Q: If R has 1000 tuples, how many tuples can A1,…,Ak(R) have?

22 year=1 ⋀ grade>84(R) =
Selection by Example student year course grade Alma 1 DB 80 PL 94 Ahuva 2 72 R = student year course grade Alma 1 DB 80 Ahuva 2 72 course=‘DB’(R) = year=1 ⋀ grade>84(R) = student year course grade Alma 1 PL 94

23 Definition of Selection
Selection is a unary operator of the form c where c is a logical condition (selection predicate) on attributes c consists of comparisons and logical connectors (∧,∨,¬) item1 = item2 price ≥ 500 ∧ price ≤ budget Legal input: A relation with all the attributes mentioned in the selection predicate The condition is applied to each tuple in the input, and each violating tuple is filtered out Formally, c(R) is the relation S with the header of R and the tuple set {t | t ∊ R and t ⊨ c} Q: If R has 1000 tuples, how many tuples can c(R) have?

24 Variants of Selection Various variants of RA may allow different languages for specifying selection predicates e.g., c2>a2+b2 ; name starts with ‘A’, etc. Common to all predicate formalisms: a predicate applies to a single tuple Cannot state cross-tuple conditions, e.g., “there is another tuple with the same name” “contains at least 100 tuples”

25 R = year/level(R) = Renaming by Example student year course grade
Alma 1 DB 80 PL 94 Ahuva 2 72 R = student level course grade Alma 1 DB 80 PL 94 Ahuva 2 72 year/level(R) =

26 Definition of Renaming
Renaming is a unary operator of the form A/B where A and B are attribute names Legal input: A relation with a header that contains A and does not contain B Renaming changes only the header—attribute A becomes B Formally, A/B(R) is the relation S with The header of R with A replaced by B The tuple set of R Q: If R has 1000 tuples, how many tuples can A/B(R) have?

27 Union and Difference by Example
student year Alma 1 Anna Ahuva 2 student year Alma 1 Amir 3 R = S = student year Alma 1 Anna Ahuva 2 Amir 3 student year Anna 1 Ahuva 2 R∪S = R \ S =

28 Definition of Union and Difference
Binary operators, interpreted as operations over the tuple sets Legal input: a pair of relations R and S with the exact same header We then say that R and S are union compatible Formally: R∪S is the relation with the header of R (and S) and the union of the tuple sets R\S is the relation with the header of R (and S) and the difference between the tuple sets Q: If each of R and S have 1000 tuples, how many tuples can be in R∪S? R\S?

29 Cartesian Product by Example
sid name year 861 Alma 2 753 Amir 1 955 Ahuva cid topic 23 PL 45 DB 76 OS R = S = sid name year cid topic 861 Alma 2 23 PL 753 Amir 1 955 Ahuva 45 DB 76 OS R×S =

30 Definition of Cartesian Product
Binary operator, similar to set product, but each output pair is combined into a single tuple Legal input: A pair of relations with disjoint sets of attributes So how to cross-product Mom(ssn) with Dad(ssn)? Formally, let R and S have the headers (A1,…,Ak) and (B1,…,Bm), respectively; then R×S is the relation T with: Header (A1,…,Ak,B1,…,Bm) Tuple set {r∘s | r ∊ R and s ∊ S} ∘ denotes concatenation Q: If each of R and S have 1000 tuples, how many tuples can be in R×S?

31 R = S = R×S = Shorthand Notation
For Cartesian product of named relations (e.g., R, S), we actually allow common attributes, and implicitly assume their renaming to name.attribute sid name year 861 Alma 2 753 Amir 1 955 Ahuva sid cid 861 23 753 45 R = S = R.sid name year S.sid cid 861 Alma 2 23 753 Amir 1 955 Ahuva 45 R×S =

32 Parentheses Convention
We have defined 3 unary operators and 3 binary operators It is acceptable to omit the parentheses from o(R) when o is unary Then, unary operators take precedence over binary ones Example: (course=‘DB’(Course)) ×(cid/cid1(Studies) ) becomes course=‘DB’Course × cid/cid1Studies

33 Names of students who study DB:
Composition Example Names of students who study DB: name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies)))) Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

34 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

35 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
861 23 45 753 955 76 Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

36 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
45 DB sid cid1 861 23 45 753 955 76 Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

37 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
45 DB 861 23 753 955 76 Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

38 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
45 DB 861 753 Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

39 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
45 861 753 Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

40 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
year 861 Alma 2 753 Amir 1 955 Ahuva cid sid 45 861 753 Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

41 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
year cid sid 861 Alma 2 45 753 Amir 1 955 Ahuva Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

42 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
year cid sid 861 Alma 2 45 753 Amir 1 Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

43 name(sid=sid1(sid/sid1Student × sid,cid(cid=cid1(topic=‘DB’Course × cid/cid1Studies))))
Alma Amir Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 955 76

44 (i.e., the student takes a course that no one else takes)
Task Student sid name year 861 Alma 2 753 Amir 1 955 Ahuva Course cid topic 23 PL 45 DB 76 OS Studies sid cid 861 23 45 753 Phrase a query that finds the names of students who get private lessons (i.e., the student takes a course that no one else takes)

45 Outline

46 Implied Operators We now discuss relational operators that are:
Not among the 6 basic operators Can be expressed in RA (implied) Very common in practice Enhancing the available operator set with the implied operators is a good idea! Easier to write queries Easier to understand/maintain queries Easier for DBMS to apply specialized optimizations

47 Outline

48 Joins Cartesian product is rarely standalone without selection, and is commonly followed by projection The combination π× is referred to generally as “join” There are several common cases that apply specific selections and projections, which we introduce here

49 Conditional Join Binary operator R ⨝c S where c is a condition over the header of R×S Shorthand notation for: c(R×S) Example: R ⨝a=b ⋀ c<d S

50 Theta Join and Equijoin
Theta join is a special case of conditional join ⨝c where c has the form AθB or Aθv where A and B are attributes, v a constant value, and θ a comparison operator Example: R ⨝c<d S Equijoin is the special case where c has the form A=B where A and B belong to the left and right operands, respectively Example: Course⨝name=courseStudies

51 S = T = S⨝sid=studT = Equijoin Example sid name year 861 Alma 2 753
Amir 1 955 Ahuva stud course 861 PL DB 762 OS 955 S = T = sid name year stud course 861 Alma 2 PL DB 955 Ahuva OS S⨝sid=studT =

52 πB1,...,Bm,...,C1...,ClA1=A'1,…,Ak=A'k (R×A1/A'1,…,Ak/A’kS)
Natural Join ⨝ Cartesian product, equality on all common attributes, projection on unique attributes Formally, R ⨝ S is equivalent to: πB1,...,Bm,...,C1...,ClA1=A'1,…,Ak=A'k (R×A1/A'1,…,Ak/A’kS) where: (B1,...,Bm) is the header of R A1,...,Ak are the attributes common to R and S (C1,...,Cl) is the header of S with A1,...,Ak removed Should we care about which new names are defined by renaming?

53 S = T = S⨝T = Natural Join Example sid name year 861 Alma 2 753 Amir 1
955 Ahuva sid course 861 PL DB 762 OS 955 S = T = sid name year course 861 Alma 2 PL DB 955 Ahuva OS S⨝T =

54 where (A1,...,Am) is the header of R
Semijoin Semijoin of R and S is the restriction of R to the tuples that can naturally join with S Formally: R⋉S is the operator equivalent to πA1,...,Am(R⨝S) where (A1,...,Am) is the header of R

55 S = T = Semijoin Example S⋉T = sid name year 861 Alma 2 753 Amir 1 955
Ahuva sid course 861 PL DB 762 OS 955 S = T = sid name year 861 Alma 2 955 Ahuva S⋉T =

56 Intersection The usual binary set-theoretic operator ∩
Legal input: a pair of relations that are union compatible (i.e., same header) Special case of natural join and semijoin If R and S have the same header, then R ⨝ S and R ⋉ S are equal to R ∩ S

57 Outline

58 Who took all core courses?
Studies CourseType sid student course 861 Alma DB PL 753 Amir AI 955 Ahuva course type DB core PL AI elective DC Who took all core courses?

59 Consider two relations R(X,Y) and S(Y)
Division Consider two relations R(X,Y) and S(Y) Here, X and Y are tuples of attributes R  S is the relation T(X) that contains all the Xs that occur with every Y in S

60 {t[X]∊πXR | t[X,Y]∊R for all s[Y]∊S}
Formal Definition Legal input: (R,S) such that R has all the attributes of S RS is the relation T with: The header of R, with all attributes of S removed Tuple set {t[X]∊πXR | t[X,Y]∊R for all s[Y]∊S} This is an abuse of notation, since the attributes in X need not necessarily come before those of Y

61 ? ? ? ?  =  = = = Questions (RxS)S (RxS)R sid student course 861
Alma DB PL 753 Amir AI 955 Ahuva course AI = course DB PL AI ? = Q: If R has 1000 tuples and S has 100 tuples, how many tuples can be in RS? ? (RxS)S = Q: If R has 1000 tuples and S has 1001 tuples, how many tuples can be in RS? ? (RxS)R =

62 Who took all core courses?
Studies CourseType sid student course 861 Alma DB PL 753 Amir AI 955 Ahuva course type DB core PL AI elective DC Who took all core courses? Studies  πcoursetype=‘core’CourseType

63 RS in Primitive RA πXR \ πX( (πXR × S) \ R ) R(X,Y) S(Y) RS
Each X of R w/ each Y of S (X,Y) s.t. X in R, Y in S, but (X,Y) not in R Xs in R where for some Y in S, (X,Y) is not in R RS

64 Examples of Inexpressible Queries
Some very useful queries cannot be expressed in RA! follower followed Amir Alma Ahuva Anna Aggregates: How many followers does Ahuva have? How many persons does one follow on average? Transitive closure: Is there a follower path from Anna to Amir? Is there a cycle? (How can one prove inexpressiveness?)

65 Outline

66 RA Expressions (Queries)
Let S be a relation schema Recall: S is a finite set of named relation schemas An RA expression (RA query) over S is an expression in RA, applied to the relation names of S For example: πsid(sid=stud(Student × sid/studStudies)) Student sid name year Course cid topic Studies sid cid

67 Query Result Let S be a database schema Let φ be an RA query over S
Let I be a database over S The result of evaluating φ over I, denoted φ(I), is the relation obtained by applying φ to the relations of I That is, every relation name is replaced with the corresponding relation in I

68 Equivalence of RA Expressions
Let S be a database schema, and let φ and ψ be two RA queries over S We say that φ and ψ are equivalent, denoted φ ≡ ψ, if: for every database I over S it holds that φ(I) = ψ(I)

69 Who Cares? Query optimization: we wish to allow DBMS to replace a query with an equivalent one that is more efficient to evaluate Expressiveness: do different sets of operators “give the same” class of expressible questions? Examples on R(A,B), S(A,B), T(A,B) A=‘a’ (R⨝S) ≡ (A=‘a’ R)⨝(A=‘a’ S) (selection push) πA(R∪S) ≡ πA(R)∪πA(S) (R⨝S)⨝T ≡ (T⨝S)⨝R Is it true that R.A/AπR.A(R×S) ≡ πAR ?

70 Q: How does containment relate to equivalence?
Let S be a database schema, and let φ and ψ be two RA queries over S We say that φ is contained in ψ, denoted φ⊆ψ, if for every instance I over S we have φ(I) ⊆ ψ(I) Q: How does containment relate to equivalence?

71 ? ? πsid(Student ⨝ Studies)) ≡ πsidStudent ∩ πsidStudies
name year Course cid topic Studies sid cid ? πsid(Student ⨝ Studies)) ≡ πsidStudent ∩ πsidStudies ? πsid(Student ⨝ Studies)) ⊆ πsidStudent ∩ πsidStudies Q: How do we prove containment? equivalence?

72 ? ? ? πsid(Student ∩ TA)) ≡ πsidStudent ∩ πsidTA
cid year TA sid cid year ? πsid(Student ∩ TA)) ≡ πsidStudent ∩ πsidTA ? πsid(Student ∩ TA)) ⊆ πsidStudent ∩ πsidTA ? πsid(Student ∩ TA)) ⊇ πsidStudent ∩ πsidTA Q: How do we prove non-containment? non-equivalence?

73 6 Primitive Operators Projection () Selection () Renaming ()
Union (∪) Difference (\) Cartesian Product (×) Q: Is this a "good" set of primitives? Could we drop an operator “without losing anything”?

74 Independence Let o be an RA operator, and let A be a set of RA operators We say that o is independent of A if o cannot be expressed in A; that is, no expression in A is equivalent to o

75 Independence among Primitives
π σ ρ ×∪- Theorem: Each of the six primitives is independent of the other five Proof: Separate argument for each of the six Arguments follow a common pattern (next slide) We will do one operator here (union)

76 Recipe for Proving Independence
Proving that operator o is independent: Fix a schema S and an instance I over S Find a property P over relations Prove that for every expression φ over S that does not use o, the relation φ(I) satisfies P Such proofs are typically by induction on the size of the expression, since operators compose Find an expression ψ such that ψ uses o and ψ(I) violates P

77 Independence of Union Fix a schema S and an instance I over S
S: R(A), S(A) I: {R(0), S(1)} Find a property P over relations #tuples < 2 Prove that for every expression φ that does not use o, the relation φ(I) satisfies P Induction base: R and S have #tuples<2 Inductive: If φ1(I) and φ2(I) have #tuples<2, then so do σc(φ1(I)), A(φ1(I)), ρA/B(φ1(I)), φ1(I)×φ2(I), φ1(I)\φ2(I) Find an expression ψ such that ψ uses o and ψ(I) violates P ψ=R∪S

78 Outline

79 Task: find Israelis who like albums with dog pictures
Person(ssn,country) Picture(pid,topic,album) Likes(ssn,album) Task: find Israelis who like albums with dog pictures Which of the equivalent expressions is more efficient to apply? πssntopic=‘dog’⋀country=‘Israel’ (Likes ⨝ (Person ⨝ Picture)) × πssntopic=‘dog’⋀country=‘Israel’ ((Likes ⨝ Person) ⨝ Picture)) πssn ((Likes ⨝ country=‘Israel’Person) ⨝ topic=‘dog’Picture)) πssn ((Likes ⨝ country=‘Israel’Person) ⨝ πalbumtopic=‘dog’Picture)) (πssncountry=‘Israel’Person) ⨝ (πssn (Likes ⨝ πalbumtopic=‘dog’Picture)))

80 Rules of Thumb for Optimization
Main computational challenges in RA: Large intermediate results Join is expensive Make intermediate results as small as possible before joining (while preserving equivalence) Apply selection and projection as early as possible (“push select/projection”) Reorder joins to minimize intermediate relations Some optimization decisions are “always beneficial” (e.g., push selection) while others require knowledge on the data (e.g., join order)

81 Pushing Projection πX (πZ1R1 ⨝ πZ2R2)
Projection reduces the length of each row, and can substantially reduce the number of rows Example: Person(ssn,country) Consider the query πX(R1 ⨝ R2); denote: Y = R1∩R2 (i.e. the attributes in both R1 and R2) X1 = X∩R1 X2 = X∩R2 (Note the abuse of notation – we mix attribute sequences with attributes sets) We would like to push projections into the join, that is: πX (πZ1R1 ⨝ πZ2R2) Which Z1 and Z2 can work (equivalence preserved)?

82 Correct Projection Push
Y = R1∩R2 X1 = X∩R1 X2 = X∩R2 ? πX (R1 ⨝ R2) πX1(R1) ⨝ πX2(R2) ? πX (R1 ⨝ R2) πX1Y(R1) ⨝ πX2Y(R2) ? πX (R1 ⨝ R2) πX (πX1Y(R1) ⨝ πX2Y(R2)) When we push projection, we need to retain all the attributes that are used for (1) joining, and (2) operations outside the join

83 Pushing Down the Expression Tree
Y = R1∩R2 X1 = X∩R1 X2 = X∩R2 πX (R1 ⨝ R2) πX (πX1Y(R1) ⨝ πX2Y(R2)) πX πX πX1Y πX2Y R1 R2 R1 R2

84 Selection Push Can we rewrite c(R1 ⨝ R2) as (cR1 ⨝ cR2)?
If all the attributes of C are in R1, then c(R1 ⨝ R2) ≡ (cR1 ⨝ R2) If all the attributes of C are in R2, then c(R1 ⨝ R2) ≡ (R1 ⨝ cR2) If all the attributes of C in both R1 and R2, then c(R1 ⨝ R2) ≡ (cR1 ⨝ cR2) Pushing selection is generally beneficial; we may need some rewriting to get opportunities...

85 Examples of Rewriting Operations
Splitting conjunctions: c⋀d(R) ≡ c(d(R)) ≡ d(c(R)) Applies to disjunction as well? Pushing through selection: c(d(R)) ≡ d(c(R)) Pushing through projection: c(πA(R)) ≡ πA(c(R)) Assuming that c uses only attributes from A!

86 Pushing Down the Expression Tree
c⋀dπX (R1 ⨝ R2) πXc⋀d (R1 ⨝ R2) πXcd (R1 ⨝ R2) πXc (R1 ⨝ dR2) c⋀d πX πX πX πX c⋀d c c d R1 R2 R1 R2 R1 d R1 R2 R2

87 Rewriting Joins Up to order of attributes, the natural join is commutative and associative Commutative: R ⨝ S ≡ S ⨝ R Associative: (R ⨝ S) ⨝ T = R ⨝ (S ⨝ T) Proof: straightforward So, given an RA query that involves only natural joins, apply the joins in whatever order you want (similarly to addition) We may need to reorder attributes... nonissue

88 Example Person(ssn,country) Picture(pid,topic,album) Likes(ssn,album)
πssntopic=‘dog’⋀country=‘Israel’ (Likes ⨝ (Person ⨝ Picture)) Reorder joins πssntopic=‘dog’⋀country=‘Israel’ (Person ⨝ (Likes ⨝ Picture)) Split selection πssntopic=‘dog’country=‘Israel’ (Person ⨝ (Likes ⨝ Picture)) Push selection πssntopic=‘dog’ ((country=‘Israel’Person) ⨝ (Likes ⨝ Picture)) Push selection (x2) πssn ((country=‘Israel’Person) ⨝ (Likes ⨝ (topic=‘dog’Picture)))

89 Example (cont’d) Person(ssn,country) Picture(pid,topic,album) Likes(ssn,album) πssn ((country=‘Israel’Person) ⨝ (Likes ⨝ (topic=‘dog’Picture))) Push projection πssn ((country=‘Israel’Person) ⨝ πssn(Likes ⨝ (topic=‘dog’Picture))) Push projection πssn ((country=‘Israel’Person) ⨝ πssn(Likes ⨝(πalbumtopic=‘dog’Picture))) Push projection πssn ((πssncountry=‘Israel’Person) ⨝ πssn(Likes ⨝ (πalbumtopic=‘dog’Picture))) Remove redundant projection (πssncountry=‘Israel’Person) ⨝ πssn(Likes ⨝ (πalbumtopic=‘dog’Picture))

90 Perspective on Query-Plan Optimization
Algorithms for RA query-plan optimization have been the subject of much research One of the first and common algorithms is the “Sellinger algorithm” from IBM Almaden [Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, Thomas G. Price: Access Path Selection in a Relational Database Management System. SIGMOD Conference 1979: 23-34] Idea: dynamic programming; compute cost & size estimation for every possible subquery, using the costs of smaller subqueries General toolkit and concepts apply to many data/query models: algebra, equivalence, cost, plan optimization

91 Note on Alternative Approaches
In a recent line of research, several alternative algorithms for RA computation are developed These algorithm do not construct intermediate results from sub-queries Rather, compute answers by simultaneously scanning all input relations More reading: LogicBlox’s Leapfrog Trie Join [Todd L. Veldhuizen: Triejoin: A Simple, Worst-Case Optimal Join Algorithm. ICDT 2014: ] Stanford’s Minesweeper [Hung Q. Ngo, Dung T. Nguyen, Christopher Re, Atri Rudra: Beyond worst-case analysis for joins with minesweeper. PODS 2014: ] Not discussed in this course


Download ppt "Database Management Systems"

Similar presentations


Ads by Google