Databases 2 Storage optimization: functional dependencies, normal forms, data warehouses.


1 Databases 2 Storage optimization: functional dependencies, normal forms, data warehouses

2 Functional Dependencies X -> A is an assertion about a relation R that whenever two tuples of R agree on all the attributes of X, then they must also agree on the attribute A. – Say “X -> A holds in R.” – Notice convention: …,X, Y, Z represent sets of attributes; A, B, C,… represent single attributes. – Convention: no set formers in sets of attributes, just ABC, rather than {A,B,C }. 2

3 Example Drinkers(name, addr, beersLiked, manf, favBeer). Reasonable FD’s to assert: 1. name -> addr (the name determines the address) 2. name -> favBeer (the name determines the favourite beer) 3. beersLiked -> manf (every beer has only one manufacturer)

4 Example

name | addr | beersLiked | manf | favBeer
Janeway | Voyager | Bud | A.B. | WickedAle
Janeway | Voyager | WickedAle | Pete’s | WickedAle
Spock | Enterprise | Bud | A.B. | Bud

name -> addr, name -> favBeer, beersLiked -> manf
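The definition of an FD can be checked mechanically against a concrete instance. The following is a minimal Python sketch, assuming the relation is held as a list of dictionaries; the helper name `fd_holds` is illustrative and not part of the slides. Note that an instance can only refute an FD: an FD is an assertion about every legal instance of the relation, so passing this check on three tuples proves nothing in general.

```python
from itertools import combinations

# The three example tuples from the Drinkers slide.
DRINKERS = [
    {"name": "Janeway", "addr": "Voyager", "beersLiked": "Bud",
     "manf": "A.B.", "favBeer": "WickedAle"},
    {"name": "Janeway", "addr": "Voyager", "beersLiked": "WickedAle",
     "manf": "Pete's", "favBeer": "WickedAle"},
    {"name": "Spock", "addr": "Enterprise", "beersLiked": "Bud",
     "manf": "A.B.", "favBeer": "Bud"},
]

def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on all of lhs yet differ on some rhs attribute."""
    for t, u in combinations(rows, 2):
        agree_on_lhs = all(t[a] == u[a] for a in lhs)
        differ_on_rhs = any(t[a] != u[a] for a in rhs)
        if agree_on_lhs and differ_on_rhs:
            return False
    return True

print(fd_holds(DRINKERS, ["name"], ["addr"]))        # the instance is consistent with name -> addr
print(fd_holds(DRINKERS, ["name"], ["manf"]))        # Janeway appears with two manufacturers
```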

5 FD’s With Multiple Attributes No need for FD’s with more than one attribute on right. – But sometimes convenient to combine FD’s as a shorthand. – Example: name -> addr and name -> favBeer become name -> addr favBeer More than one attribute on left may be essential. – Example: bar beer -> price 5

6 Keys of Relations K is a key for relation R if: 1. Set K functionally determines all attributes of R. 2. For no proper subset of K is (1) true. – If K satisfies (1), but perhaps not (2), then K is a superkey. – Consequence: there are no two tuples having the same value in every attribute of the key. – Note E/R keys have no requirement for minimality, as in (2) for relational keys.

7 Example Consider relation Drinkers(name, addr, beersLiked, manf, favBeer). {name, beersLiked} is a superkey because together these attributes determine all the other attributes. – name -> addr favBeer – beersLiked -> manf 7

8 Example

name | addr | beersLiked | manf | favBeer
Janeway | Voyager | Bud | A.B. | WickedAle
Janeway | Voyager | WickedAle | Pete’s | WickedAle
Spock | Enterprise | Bud | A.B. | Bud

Every (name, beersLiked) pair is different => the rest of the attributes are determined.

9 Example, Cont. {name, beersLiked} is a key because neither {name} nor {beersLiked} is a superkey. – name doesn’t -> manf; beersLiked doesn’t -> addr. In this example, there are no other keys, but lots of superkeys. – Any superset of {name, beersLiked}. 9

10 Example

name | addr | beersLiked | manf | favBeer
Janeway | Voyager | Bud | A.B. | WickedAle
Janeway | Voyager | WickedAle | Pete’s | WickedAle
Spock | Enterprise | Bud | A.B. | Bud

name doesn’t -> manf; beersLiked doesn’t -> addr.

11 E/R and Relational Keys Keys in E/R are properties of entities. Keys in relations are properties of tuples. Usually, one tuple corresponds to one entity, so the ideas are the same. But in poor relational designs, one entity can become several tuples, so E/R keys and relational keys are different.

12 Example

Single relation, relational key {name, beersLiked}:
name | addr | beersLiked | manf | favBeer
Janeway | Voyager | Bud | A.B. | WickedAle
Janeway | Voyager | WickedAle | Pete’s | WickedAle
Spock | Enterprise | Bud | A.B. | Bud

Drinkers:
name | addr
Janeway | Voyager
Spock | Enterprise

Beers:
beer | manf
Bud | A.B.
WickedAle | Pete’s

In E/R, name is a key for Drinkers, and beersLiked is a key for Beers.

13 Where Do Keys Come From? 1. We could simply assert a key K. Then the only FD’s are K -> A for all attributes A, and K turns out to be the only key obtainable from the FD’s. 2. We could assert FD’s and deduce the keys by systematic exploration. – E/R gives us FD’s from entity-set keys and many-one relationships.
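Option 2, deducing keys from asserted FD’s, can be sketched in Python as a brute-force search. The sketch relies on the attribute-closure test that these slides introduce later (slide 18); the function names `closure` and `candidate_keys` are illustrative choices, and the search is exponential in the number of attributes, so it is only suitable for small schemas.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure: keep adding right sides of FD's whose left side is already covered."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """Return every minimal attribute set whose closure is the whole schema."""
    schema = set(schema)
    keys = []
    # Smaller sets are tried first, so supersets of a known key can be skipped.
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            candidate = set(combo)
            if any(k <= candidate for k in keys):
                continue  # a superkey, but not minimal
            if closure(candidate, fds) == schema:
                keys.append(candidate)
    return keys

SCHEMA = ["name", "addr", "beersLiked", "manf", "favBeer"]
FDS = [(["name"], ["addr", "favBeer"]), (["beersLiked"], ["manf"])]
print(candidate_keys(SCHEMA, FDS))
```

For the Drinkers schema this finds the single key {name, beersLiked}, in line with the earlier example slides.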

14 Armstrong’s axioms Let X, Y, Z ⊆ R (i.e. X, Y, Z are attribute sets of R). Armstrong’s axioms: Reflexivity: if Y ⊆ X then X -> Y. Augmentation: if X -> Y then for every Z: XZ -> YZ. Transitivity: if X -> Y and Y -> Z then X -> Z.

15 FD’s From “Physics” While most FD’s come from E/R keyness and many-one relationships, some are really physical laws. Example: “no two courses can meet in the same room at the same time” tells us: hour room -> course. 15

16 Inferring FD’s: Motivation In order to design relation schemas well, we often need to tell what FD’s hold in a relation. We are given FD’s X1 -> A1, X2 -> A2, …, Xn -> An, and we want to know whether an FD Y -> B must hold in any relation that satisfies the given FD’s. – Example: If A -> B and B -> C hold, surely A -> C holds, even if we don’t say so.

17 Inference Test To test if Y -> B, start assuming two tuples agree in all attributes of Y. Use the given FD’s to infer that these tuples must also agree in certain other attributes. If B is eventually found to be one of these attributes, then Y -> B is true; otherwise, the two tuples, with any forced equalities form a two-tuple relation that proves Y -> B does not follow from the given FD’s. 17

18 Closure Test An easier way to test is to compute the closure of Y, denoted Y +. Basis: Y + = Y. Induction: Look for an FD’s left side X that is a subset of the current Y +. If the FD is X -> A, add A to Y +. 18

19 [Figure: an FD X -> A whose left side X lies inside the current Y+; adding A gives the new Y+.]
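The basis and induction steps of the closure test translate almost directly into a loop. Below is a minimal Python sketch; the function name `closure` and the representation of FD’s as (left side, right side) pairs are assumptions made for illustration. Single-letter attributes are written as strings so that `"AB"` stands for the set {A, B}.

```python
def closure(attrs, fds):
    """Compute Y+ for the attribute set attrs under the given FD's."""
    result = set(attrs)              # basis: Y+ = Y
    changed = True
    while changed:                   # induction: repeat until nothing new is added
        changed = False
        for lhs, rhs in fds:
            # If the left side X is inside the current Y+, add the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def follows(fds, lhs, rhs):
    """Y -> B follows from fds exactly when B is contained in Y+."""
    return set(rhs) <= closure(lhs, fds)

FDS = [("A", "B"), ("B", "C")]
print(sorted(closure("A", FDS)))
print(follows(FDS, "A", "C"))
```

With the FD’s A -> B and B -> C this confirms the motivating example: C is in A+, so A -> C follows even though it was never stated.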

20 Finding All Implied FD’s Motivation: “normalization,” the process where we break a relation schema into two or more schemas. Example: ABCD with FD’s AB ->C, C ->D, and D ->A. – Decompose into ABC, AD. What FD’s hold in ABC ? – Not only AB ->C, but also C ->A ! 20

21 Basic Idea To know what FD’s hold in a projection, we start with given FD’s and find all FD’s that follow from given ones. Then, restrict to those FD’s that involve only attributes of the projected schema. 21

22 Simple, Exponential Algorithm 1. For each set of attributes X, compute X+. 2. Add X -> A for all A in X+ - X. 3. However, drop XY -> A whenever we discover X -> A. – Because XY -> A follows from X -> A. 4. Finally, use only FD’s involving projected attributes.

23 A Few Tricks Never need to compute the closure of the empty set or of the set of all attributes: – ∅+ = ∅ – R+ = R. If we find X+ = all attributes, don’t bother computing the closure of any supersets of X: – X+ = R and X ⊆ Y => Y+ = R

24 Example ABC with FD’s A ->B and B ->C. Project onto AC. – A + =ABC ; yields A ->B, A ->C. We do not need to compute AB + or AC +. – B + =BC ; yields B ->C. – C + =C ; yields nothing. – BC + =BC ; yields nothing. 24

25 Example, Continued Resulting FD’s: A ->B, A ->C, and B ->C. Projection onto AC : A ->C. – Only FD that involves a subset of {A,C }. 25
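The projection procedure can be sketched in Python as follows. This is a variant of the exponential algorithm rather than a literal transcription: it enumerates only subsets of the projected attributes, which is sufficient because any FD that survives the projection has its left side inside the projected schema. The names `closure` and `project_fds` are illustrative.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure under the given FD's."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def project_fds(fds, target):
    """Return FD's (X, A) with X and A inside target, keeping only minimal left sides."""
    target = set(target)
    projected = []
    # Small left sides first, so that XY -> A can be dropped once X -> A is known.
    for size in range(1, len(target) + 1):
        for combo in combinations(sorted(target), size):
            x = frozenset(combo)
            for a in sorted((closure(x, fds) & target) - x):
                if not any(y <= x and b == a for y, b in projected):
                    projected.append((x, a))
    return projected

# Slide 24: ABC with A -> B and B -> C, projected onto AC.
print(project_fds([("A", "B"), ("B", "C")], "AC"))
# Slide 20: ABCD with AB -> C, C -> D, D -> A, projected onto ABC.
print(project_fds([("AB", "C"), ("C", "D"), ("D", "A")], "ABC"))
```

The first call yields only A -> C; the second yields C -> A and AB -> C, matching the claim on the "Finding All Implied FD’s" slide.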

26 A Geometric View of FD’s Imagine the set of all instances of a particular relation. That is, all finite sets of tuples that have the proper number of components. Each instance is a point in this space. 26

27 Example: R(A,B) [Figure: four instances drawn as points in the space: {(1,2), (3,4)}, {}, {(1,2), (3,4), (1,3)}, and {(5,1)}.]

28 An FD is a Subset of Instances For each FD X -> A there is a subset of all instances that satisfy the FD. We can represent an FD by a region in the space. Trivial FD : an FD that is represented by the entire space. – Example: A -> A. 28

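29 Example: A -> B for R(A,B) [Figure: the same four instances, with a region labeled A -> B around those that satisfy the FD. The instance {(1,2), (3,4), (1,3)} falls outside the region, because A = 1 appears with both B = 2 and B = 3.]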

30 Representing Sets of FD’s If each FD is a set of relation instances, then a collection of FD’s corresponds to the intersection of those sets. – Intersection = all instances that satisfy all of the FD’s. 30

31 Example [Figure: three overlapping regions labeled A->B, B->C, and CD->A; their common intersection holds the instances satisfying A->B, B->C, and CD->A.]

32 Implication of FD’s If an FD Y -> B follows from FD’s X1 -> A1, …, Xn -> An, then the region in the space of instances for Y -> B must include the intersection of the regions for the FD’s Xi -> Ai. – That is, every instance satisfying all the FD’s Xi -> Ai surely satisfies Y -> B. – But an instance could satisfy Y -> B, yet not be in this intersection.

33 Example [Figure: regions labeled A->B, B->C, and A->C.]

34 34 Normalization: Anomalies Goal of relational schema design is to avoid anomalies and redundancy. – Update anomaly : one occurrence of a fact is changed, but not all occurrences. – Deletion anomaly : valid fact is lost when a tuple is deleted.

35 Example of Bad Design Drinkers(name, addr, beersLiked, manf, favBeer)

name | addr | beersLiked | manf | favBeer
Janeway | Voyager | Bud | A.B. | WickedAle
Janeway | ??? | WickedAle | Pete’s | ???
Spock | Enterprise | Bud | ??? | Bud

Data is redundant, because each of the ???’s can be figured out by using the FD’s name -> addr favBeer and beersLiked -> manf.

36 This Bad Design Also Exhibits Anomalies

name | addr | beersLiked | manf | favBeer
Janeway | Voyager | Bud | A.B. | WickedAle
Janeway | Voyager | WickedAle | Pete’s | WickedAle
Spock | Enterprise | Bud | A.B. | Bud

Update anomaly: if Janeway is transferred to Intrepid, will we remember to change each of her tuples? Deletion anomaly: if nobody likes Bud, we lose track of the fact that Anheuser-Busch manufactures Bud.

37 37 Boyce-Codd Normal Form We say a relation R is in BCNF : if whenever X ->A is a nontrivial FD that holds in R, X is a superkey. – Remember: nontrivial means A is not a member of set X. – Remember, a superkey is any superset of a key (not necessarily a proper superset).

38 38 Example Drinkers(name, addr, beersLiked, manf, favBeer) FD’s: name->addr favBeer, beersLiked->manf Only key is {name, beersLiked}. In each FD, the left side is not a superkey. Any one of these FD’s shows Drinkers is not in BCNF

39 39 Another Example Beers(name, manf, manfAddr) FD’s: name->manf, manf->manfAddr Only key is {name}. name->manf does not violate BCNF, but manf->manfAddr does.

40 40 Decomposition into BCNF Given: relation R with FD’s F. Look among the given FD’s for a BCNF violation X ->B. – If any FD following from F violates BCNF, then there will surely be an FD in F itself that violates BCNF. Compute X +. – Not all attributes, or else X is a superkey.

41 Decompose R Using X -> B Replace R by relations with schemas: 1. R1 = X+. 2. R2 = (R – X+) ∪ X. Project given FD’s F onto the two new relations: 1. Compute the closure of F = all nontrivial FD’s that follow from F. 2. Use only those FD’s whose attributes are all in R1 or all in R2.

42 Decomposition Picture [Figure: R divided into three parts, R – X+, X, and X+ – X. R1 covers X and X+ – X; R2 covers R – X+ and X.]

43 43 Example Drinkers(name, addr, beersLiked, manf, favBeer) F = name->addr, name -> favBeer, beersLiked->manf Pick BCNF violation name->addr. Close the left side: {name} + = {name, addr, favBeer}. Decomposed relations: 1.Drinkers1(name, addr, favBeer) 2.Drinkers2(name, beersLiked, manf)

44 44 Example, Continued We are not done; we need to check Drinkers1 and Drinkers2 for BCNF. Projecting FD’s is complex in general, easy here. For Drinkers1(name, addr, favBeer), relevant FD’s are name->addr and name->favBeer. – Thus, name is the only key and Drinkers1 is in BCNF.

45 45 Example, Continued For Drinkers2(name, beersLiked, manf), the only FD is beersLiked->manf, and the only key is {name, beersLiked}. – Violation of BCNF. beersLiked + = {beersLiked, manf}, so we decompose Drinkers2 into: 1.Drinkers3(beersLiked, manf) 2.Drinkers4(name, beersLiked)

46 Example, Concluded The resulting decomposition of Drinkers: 1. Drinkers1(name, addr, favBeer) 2. Drinkers3(beersLiked, manf) 3. Drinkers4(name, beersLiked) – Notice: Drinkers1 tells us about drinkers, Drinkers3 tells us about beers, and Drinkers4 tells us the relationship between drinkers and the beers they like.
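The whole decomposition can be sketched as a short recursive Python function. This is an illustrative sketch under stated assumptions, not the slides’ exact procedure: instead of looking only among the given FD’s, it searches every subset X of the current schema for a violation (X determines something new inside the schema but not all of it), which sidesteps explicit FD projection at the cost of exponential time. The names `closure`, `find_violation`, and `bcnf_decompose` are hypothetical.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure under the given FD's."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def find_violation(schema, fds):
    """Return (X, X+ within schema) for some BCNF violation, or None if schema is in BCNF."""
    for size in range(1, len(schema)):
        for combo in combinations(sorted(schema), size):
            x = frozenset(combo)
            x_plus = closure(x, fds) & schema
            # Nontrivial (X+ is larger than X) and X is not a superkey (X+ is not everything).
            if x < x_plus < schema:
                return x, x_plus
    return None

def bcnf_decompose(schema, fds):
    """Split schema recursively until no piece has a BCNF violation."""
    schema = frozenset(schema)
    found = find_violation(schema, fds)
    if found is None:
        return [schema]
    x, x_plus = found
    r1 = frozenset(x_plus)           # R1 = X+
    r2 = (schema - x_plus) | x       # R2 = (R - X+) united with X
    return bcnf_decompose(r1, fds) + bcnf_decompose(r2, fds)

DRINKERS = ["name", "addr", "beersLiked", "manf", "favBeer"]
FDS = [(["name"], ["addr", "favBeer"]), (["beersLiked"], ["manf"])]
for relation in bcnf_decompose(DRINKERS, FDS):
    print(sorted(relation))
```

The sketch happens to pick beersLiked -> manf as its first violation rather than name -> addr, but it ends with the same three schemas as the worked example. In general, different choices of violation can lead to different BCNF decompositions.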

47 47 Third Normal Form - Motivation There is one structure of FD’s that causes trouble when we decompose. AB ->C and C ->B. – Example: A = street address, B = city, C = zip code. There are two keys, {A,B } and {A,C }. C ->B is a BCNF violation, so we must decompose into AC, BC.

48 48 We Cannot Enforce FD’s The problem is that if we use AC and BC as our database schema, we cannot enforce the FD AB ->C by checking FD’s in these decomposed relations. Example with A = street, B = city, and C = zip on the next slide.

49 An Unenforceable FD

street | zip
545 Tech Sq. | 02138
545 Tech Sq. | 02139

city | zip
Cambridge | 02138
Cambridge | 02139

Join tuples with equal zip codes.

street | city | zip
545 Tech Sq. | Cambridge | 02138
545 Tech Sq. | Cambridge | 02139

Although no FD’s were violated in the decomposed relations, FD street city -> zip is violated by the database as a whole.
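The street-city-zip situation can be replayed in a few lines of Python. This is a sketch using the two decomposed relations from the slide as lists of tuples; `fd_holds` is an illustrative helper that takes column positions rather than names.

```python
from itertools import combinations

STREET_ZIP = [("545 Tech Sq.", "02138"), ("545 Tech Sq.", "02139")]
CITY_ZIP = [("Cambridge", "02138"), ("Cambridge", "02139")]

def fd_holds(rows, lhs_idx, rhs_idx):
    """True if no two rows agree on the lhs positions yet differ on an rhs position."""
    for t, u in combinations(rows, 2):
        if all(t[i] == u[i] for i in lhs_idx) and any(t[i] != u[i] for i in rhs_idx):
            return False
    return True

# The only projected FD, zip -> city, is satisfied in the city-zip relation.
print(fd_holds(CITY_ZIP, [1], [0]))

# Natural join on zip reconstructs (street, city, zip) tuples.
joined = [(street, city, z1)
          for street, z1 in STREET_ZIP
          for city, z2 in CITY_ZIP
          if z1 == z2]

# street city -> zip fails on the join: same street and city, two zip codes.
print(fd_holds(joined, [0, 1], [2]))
```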

50 3NF Lets Us Avoid This Problem 3rd Normal Form (3NF) modifies the BCNF condition so we do not have to decompose in this problem situation. An attribute is prime if it is a member of any key. X -> A violates 3NF if and only if X is not a superkey, and also A is not prime.

51 51 Example In our problem situation with FD’s AB ->C and C ->B, we have keys AB and AC. Thus A, B, and C are each prime. Although C ->B violates BCNF, it does not violate 3NF.
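The difference between the two conditions can be checked programmatically. The sketch below, in Python, finds the keys by brute force, collects the prime attributes, and then tests a single FD against both definitions. All function names are illustrative assumptions, and single-letter attributes are written as strings.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure under the given FD's."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    schema, keys = set(schema), []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            cand = set(combo)
            if not any(k <= cand for k in keys) and closure(cand, fds) == schema:
                keys.append(cand)
    return keys

def violates_bcnf(lhs, rhs, schema, fds):
    """Nontrivial FD whose left side is not a superkey."""
    nontrivial = not set(rhs) <= set(lhs)
    return nontrivial and not closure(lhs, fds) >= set(schema)

def violates_3nf(lhs, rhs, schema, fds):
    """BCNF violation in which some newly determined attribute is not prime."""
    prime = set().union(*candidate_keys(schema, fds))
    non_prime_rhs = (set(rhs) - set(lhs)) - prime
    return violates_bcnf(lhs, rhs, schema, fds) and bool(non_prime_rhs)

FDS = [("AB", "C"), ("C", "B")]
print(violates_bcnf("C", "B", "ABC", FDS), violates_3nf("C", "B", "ABC", FDS))
```

For the problem situation, C -> B is reported as a BCNF violation but not as a 3NF violation, since B belongs to the key AB.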

52 52 What 3NF and BCNF Give You There are two important properties of a decomposition: 1.Recovery : it should be possible to project the original relations onto the decomposed schema, and then reconstruct the original. 2.Dependency preservation : it should be possible to check in the projected relations whether all the given FD’s are satisfied.

53 3NF and BCNF, Continued We can get (1) with a BCNF decomposition. – Explanation needs to wait for relational algebra. We can get both (1) and (2) with a 3NF decomposition. But we can’t always get (1) and (2) with a BCNF decomposition. – street-city-zip is an example.

54 54 A New Form of Redundancy Multivalued dependencies (MVD’s) express a condition among tuples of a relation that exists when the relation is trying to represent more than one many-many relationship. Then certain attributes become independent of one another, and their values must appear in all combinations.

55 55 Example Drinkers(name, addr, phones, beersLiked) A drinker’s phones are independent of the beers they like. Thus, each of a drinker’s phones appears with each of the beers they like in all combinations. This repetition is unlike redundancy due to FD’s, of which name->addr is the only one.

56 Tuples Implied by Independence If we have tuples:

name | addr | phones | beersLiked
Sue | a | p1 | b1
Sue | a | p2 | b2

Then these tuples must also be in the relation:

name | addr | phones | beersLiked
Sue | a | p1 | b2
Sue | a | p2 | b1

57 57 Definition of MVD A multivalued dependency (MVD) X ->->Y is an assertion that if two tuples of a relation agree on all the attributes of X, then their components in the set of attributes Y may be swapped, and the result will be two tuples that are also in the relation.

58 58 Example The name-addr-phones-beersLiked example illustrated the MVD name->->phones and the MVD name ->-> beersLiked.

59 Picture of MVD X ->->Y [Figure: two tuples with columns X, Y, and others; the tuples are equal on X, and their Y components are exchanged.]

60 60 MVD Rules Every FD is an MVD. – If X ->Y is a FD, then swapping Y ’s between two tuples that agree on X doesn’t change the tuples. – Therefore, the “new” tuples are surely in the relation, and we know X ->->Y. Complementation : If X ->->Y, and Z is all the other attributes, then X ->->Z.

61 61 Splitting Doesn’t Hold Like FD’s, we cannot generally split the left side of an MVD. But unlike FD’s, we cannot split the right side either --- sometimes you have to leave several attributes on the right side.

62 62 Example Consider a drinkers relation: Drinkers(name, areaCode, phone, beersLiked, manf) A drinker can have several phones, with the number divided between areaCode and phone (last 7 digits). A drinker can like several beers, each with its own manufacturer.

63 Example, Continued Since the areaCode-phone combinations for a drinker are independent of the beersLiked-manf combinations, we expect that the following MVD’s hold: name ->-> areaCode phone; name ->-> beersLiked manf

64 Example Data Here is possible data satisfying these MVD’s:

name | areaCode | phone | beersLiked | manf
Sue | 650 | 555-1111 | Bud | A.B.
Sue | 650 | 555-1111 | WickedAle | Pete’s
Sue | 415 | 555-9999 | Bud | A.B.
Sue | 415 | 555-9999 | WickedAle | Pete’s

But we cannot swap area codes or phones by themselves. That is, neither name ->-> areaCode nor name ->-> phone holds for this relation.
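The swap definition of an MVD can be tested directly on this data. The following Python sketch assumes the relation is a set of tuples with a parallel tuple of column names; `mvd_holds` is an illustrative name. As with FD’s, a check on one instance can refute an MVD but cannot establish it for the relation in general.

```python
from itertools import product

COLUMNS = ("name", "areaCode", "phone", "beersLiked", "manf")
ROWS = {
    ("Sue", "650", "555-1111", "Bud", "A.B."),
    ("Sue", "650", "555-1111", "WickedAle", "Pete's"),
    ("Sue", "415", "555-9999", "Bud", "A.B."),
    ("Sue", "415", "555-9999", "WickedAle", "Pete's"),
}

def mvd_holds(rows, columns, x, y):
    """Check X ->-> Y: for tuples t, u agreeing on X, t with u's Y-values must be in rows."""
    xi = [columns.index(a) for a in x]
    yi = {columns.index(a) for a in y}
    # Ordered pairs cover both directions of the swap.
    for t, u in product(rows, repeat=2):
        if all(t[i] == u[i] for i in xi):
            swapped = tuple(u[i] if i in yi else t[i] for i in range(len(columns)))
            if swapped not in rows:
                return False
    return True

print(mvd_holds(ROWS, COLUMNS, ["name"], ["areaCode", "phone"]))
print(mvd_holds(ROWS, COLUMNS, ["name"], ["areaCode"]))
```

Swapping only the area code would produce, for example, (Sue, 415, 555-1111, Bud, A.B.), which is not in the relation, so the second check fails.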

65 65 Fourth Normal Form The redundancy that comes from MVD’s is not removable by putting the database schema in BCNF. There is a stronger normal form, called 4NF, that (intuitively) treats MVD’s as FD’s when it comes to decomposition, but not when determining keys of the relation.

66 4NF Definition A relation R is in 4NF if whenever X ->->Y is a nontrivial MVD, then X is a superkey. – “Nontrivial” means that: 1. Y is not a subset of X, and 2. X and Y are not, together, all the attributes. – Note that the definition of “superkey” still depends on FD’s only.

67 67 BCNF Versus 4NF Remember that every FD X ->Y is also an MVD, X ->->Y. Thus, if R is in 4NF, it is certainly in BCNF. – Because any BCNF violation is a 4NF violation. But R could be in BCNF and not 4NF, because MVD’s are “invisible” to BCNF.

68 68 Decomposition and 4NF If X ->->Y is a 4NF violation for relation R, we can decompose R using the same technique as for BCNF. 1.XY is one of the decomposed relations. 2.All but Y – X is the other.

69 69 Example Drinkers(name, addr, phones, beersLiked) FD: name -> addr MVD’s: name ->-> phones name ->-> beersLiked Key is {name, phones, beersLiked}. All dependencies violate 4NF.

70 70 Example, Continued Decompose using name -> addr: 1.Drinkers1(name, addr) – In 4NF, only dependency is name -> addr. 2.Drinkers2(name, phones, beersLiked) – Not in 4NF. MVD’s name ->-> phones and name ->-> beersLiked apply. No FD’s, so all three attributes form the key.

71 71 Example: Decompose Drinkers2 Either MVD name ->-> phones or name ->-> beersLiked tells us to decompose to: – Drinkers3(name, phones) – Drinkers4(name, beersLiked)

72 On-Line Application Processing Warehousing Data Cubes Data Mining 72

73 Overview Traditional database systems are tuned to many, small, simple queries. Some new applications use fewer, more time-consuming, complex queries. New architectures have been developed to handle complex “analytic” queries efficiently.

74 74 The Data Warehouse The most common form of data integration. – Copy sources into a single DB (warehouse) and try to keep it up-to-date. – Usual method: periodic reconstruction of the warehouse, perhaps overnight. – Frequently essential for analytic queries.

75 OLTP Most database operations involve On-Line Transaction Processing (OLTP). – Short, simple, frequent queries and/or modifications, each involving a small number of tuples. – Examples: answering queries from a Web interface; sales at cash registers; selling airline tickets.

76 76 OLAP Of increasing importance are On-Line Analytical Processing (OLAP) queries. – Few, but complex queries --- may run for hours. – Queries do not depend on having an absolutely up-to-date database. Sometimes called Data Mining.

77 77 OLAP Examples 1.Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer. 2.Analysts at Wal-Mart look for items with increasing sales in some region.

78 78 Common Architecture Databases at store branches handle OLTP. Local store databases copied to a central warehouse overnight. Analysts use the warehouse for OLAP.

79 Star Schemas A star schema is a common organization for data at a warehouse. It consists of: 1. Fact table: a very large accumulation of facts such as sales. – Often “insert-only.” 2. Dimension tables: smaller, generally static information about the entities involved in the facts.

80 80 Example: Star Schema Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged. The fact table is a relation: Sales(bar, beer, drinker, day, time, price)

81 81 Example, Continued The dimension tables include information about the bar, beer, and drinker “dimensions”: Bars(bar, addr, license) Beers(beer, manf) Drinkers(drinker, addr, phone)

82 82 Dimensions and Dependent Attributes Two classes of fact-table attributes: 1.Dimension attributes : the key of a dimension table. 2.Dependent attributes : a value determined by the dimension attributes of the tuple.

83 83 Example: Dependent Attribute price is the dependent attribute of our example Sales relation. It is determined by the combination of dimension attributes: bar, beer, drinker, and the time (combination of day and time attributes).

84 84 Approaches to Building Warehouses 1.ROLAP = “relational OLAP”: Tune a relational DBMS to support star schemas. 2.MOLAP = “multidimensional OLAP”: Use a specialized DBMS with a model such as the “data cube.”

85 85 ROLAP Techniques 1.Bitmap indexes : For each key value of a dimension table (e.g., each beer for relation Beers) create a bit-vector telling which tuples of the fact table have that value. 2.Materialized views : Store the answers to several useful queries (views) in the warehouse itself.

86 86 Typical OLAP Queries Often, OLAP queries begin with a “star join”: the natural join of the fact table with all or most of the dimension tables. Example: SELECT * FROM Sales, Bars, Beers, Drinkers WHERE Sales.bar = Bars.bar AND Sales.beer = Beers.beer AND Sales.drinker = Drinkers.drinker;

87 87 Typical OLAP Queries --- 2 The typical OLAP query will: 1.Start with a star join. 2.Select for interesting tuples, based on dimension data. 3.Group by one or more dimensions. 4.Aggregate certain attributes of the result.

88 Example: OLAP Query For each bar in Palo Alto, find the total sales of each beer manufactured by Anheuser-Busch. 2. Filter: addr = “Palo Alto” and manf = “Anheuser-Busch”. 3. Grouping: by bar and beer. 4. Aggregation: Sum of price.

89 Example: In SQL

SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
GROUP BY bar, beer;

90 90 Using Materialized Views A direct execution of this query from Sales and the dimension tables could take too long. If we create a materialized view that contains enough information, we may be able to answer our query much faster.

91 Example: Materialized View Which views could help with our query? Key issues: 1. It must join Sales, Bars, and Beers, at least. 2. It must group by at least bar and beer. 3. It must not select out Palo-Alto bars or Anheuser-Busch beers. 4. It must not project out addr or manf.

92 92 Example --- Continued Here is a materialized view that could help: CREATE VIEW BABMS(bar, addr, beer, manf, sales) AS SELECT bar, addr, beer, manf, SUM(price) sales FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers GROUP BY bar, addr, beer, manf; Since bar -> addr and beer -> manf, there is no real grouping. We need addr and manf in the SELECT.

93 Example --- Concluded Here’s our query using the materialized view BABMS:

SELECT bar, beer, sales
FROM BABMS
WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch';
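To see that the view-based query returns the same answer as the direct star-join query, both can be run against a small in-memory database. The Python sketch below uses the standard `sqlite3` module. Two assumptions should be kept in mind: the sample rows are made up purely for illustration, and SQLite has no materialized views, so BABMS is emulated with `CREATE TABLE ... AS SELECT`, which stores the result once and does not refresh it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sales(bar TEXT, beer TEXT, drinker TEXT, day TEXT, time TEXT, price REAL);
CREATE TABLE Bars(bar TEXT PRIMARY KEY, addr TEXT, license TEXT);
CREATE TABLE Beers(beer TEXT PRIMARY KEY, manf TEXT);
""")

# Hypothetical sample data, not taken from the slides.
conn.executemany("INSERT INTO Sales VALUES (?, ?, ?, ?, ?, ?)", [
    ("Joe's", "Bud", "Sue", "Mon", "20:00", 3.0),
    ("Joe's", "Bud", "Ann", "Tue", "21:00", 3.5),
    ("Joe's", "WickedAle", "Sue", "Mon", "20:30", 5.0),
    ("Moe's", "Bud", "Ann", "Mon", "19:00", 2.5),
])
conn.executemany("INSERT INTO Bars VALUES (?, ?, ?)", [
    ("Joe's", "Palo Alto", "L1"),
    ("Moe's", "San Jose", "L2"),
])
conn.executemany("INSERT INTO Beers VALUES (?, ?)", [
    ("Bud", "Anheuser-Busch"),
    ("WickedAle", "Pete's"),
])

STAR = "FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers"
FILTER = "WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'"

# Direct execution: star join, filter, group, aggregate.
direct = conn.execute(
    f"SELECT bar, beer, SUM(price) {STAR} {FILTER} GROUP BY bar, beer"
).fetchall()

# Stand-in for the materialized view BABMS: precomputed totals per bar and beer.
conn.execute(
    f"CREATE TABLE BABMS AS SELECT bar, addr, beer, manf, SUM(price) AS sales "
    f"{STAR} GROUP BY bar, addr, beer, manf"
)
via_view = conn.execute(f"SELECT bar, beer, sales FROM BABMS {FILTER}").fetchall()

print(direct)
print(via_view)
```

The second query touches only the small precomputed table, which is the point of materializing the view: the join and aggregation over the large fact table are paid once rather than per query.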

94 94 MOLAP and Data Cubes Keys of dimension tables are the dimensions of a hypercube. – Example: for the Sales data, the four dimensions are bars, beers, drinkers, and time. Dependent attributes (e.g., price) appear at the points of the cube.

95 95 Marginals The data cube also includes aggregation (typically SUM) along the margins of the cube. The marginals include aggregations over one dimension, two dimensions,…

96 96 Example: Marginals Our 4-dimensional Sales cube includes the sum of price over each bar, each beer, each drinker, and each time unit (perhaps days). It would also have the sum of price over all bar-beer pairs, all bar-drinker-day triples,…

97 97 Structure of the Cube Think of each dimension as having an additional value *. A point with one or more *’s in its coordinates aggregates over the dimensions with the *’s. Example: Sales(“Joe’s Bar”, “Bud”, *, *) holds the sum over all drinkers and all time of the Bud consumed at Joe’s.
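The "*" convention can be illustrated with a small Python sketch that materializes every point of the cube, including all marginals, from a list of facts. The sample facts are hypothetical, the dependent attribute summed is price, and `build_cube` is an illustrative name. Each fact contributes to 2^d points for d dimensions, so a fully materialized cube grows quickly.

```python
from collections import defaultdict
from itertools import product

# Hypothetical facts: (bar, beer, drinker, day, price).
SALES = [
    ("Joe's Bar", "Bud", "Sue", "Mon", 3.0),
    ("Joe's Bar", "Bud", "Ann", "Tue", 3.5),
    ("Joe's Bar", "WickedAle", "Sue", "Mon", 5.0),
    ("Moe's", "Bud", "Ann", "Mon", 2.5),
]

def build_cube(facts, n_dims):
    """Sum the dependent value at every point, where '*' aggregates over a dimension."""
    cube = defaultdict(float)
    for *dims, value in facts:
        # Each dimension either keeps its value or is replaced by '*'.
        for mask in product((False, True), repeat=n_dims):
            point = tuple("*" if starred else d for d, starred in zip(dims, mask))
            cube[point] += value
    return dict(cube)

cube = build_cube(SALES, 4)
print(cube[("Joe's Bar", "Bud", "*", "*")])   # Bud at Joe's, over all drinkers and days
print(cube[("*", "*", "*", "*")])             # grand total
```

In these terms, roll-up replaces a coordinate by "*", and drill-down goes the other way, from a starred point to the points that feed into it.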

98 98 Drill-Down Drill-down = “de-aggregate” = break an aggregate into its constituents. Example: having determined that Joe’s Bar sells very few Anheuser-Busch beers, break down his sales by particular A.-B. beer.

99 99 Roll-Up Roll-up = aggregate along one or more dimensions. Example: given a table of how much Bud each drinker consumes at each bar, roll it up into a table giving total amount of Bud consumed for each drinker.

100 100 Materialized Data-Cube Views Data cubes invite materialized views that are aggregations in one or more dimensions. Dimensions may not be completely aggregated --- an option is to group by an attribute of the dimension table.

101 101 Example A materialized view for our Sales data cube might: 1.Aggregate by drinker completely. 2.Not aggregate at all by beer. 3.Aggregate by time according to the week. 4.Aggregate according to the city of the bar.

102 102 Data Mining Data mining is a popular term for queries that summarize big data sets in useful ways. Examples: 1.Clustering all Web pages by topic. 2.Finding characteristics of fraudulent credit-card use.

103 103 Market-Basket Data An important form of mining from relational data involves market baskets = sets of “items” that are purchased together as a customer leaves a store. Summary of basket data is frequent itemsets = sets of items that often appear together in baskets.

104 104 Example: Market Baskets If people often buy hamburger and ketchup together, the store can: 1.Put hamburger and ketchup near each other and put potato chips between. 2.Run a sale on hamburger and raise the price of ketchup.

105 105 Finding Frequent Pairs The simplest case is when we only want to find “frequent pairs” of items. Assume data is in a relation Baskets(basket, item). The support threshold s is the minimum number of baskets in which a pair appears before we are interested.

106 106 Frequent Pairs in SQL SELECT b1.item, b2.item FROM Baskets b1, Baskets b2 WHERE b1.basket = b2.basket AND b1.item < b2.item GROUP BY b1.item, b2.item HAVING COUNT(*) >= s; Look for two Basket tuples with the same basket and different items. First item must precede second, so we don’t count the same pair twice. Create a group for each pair of items that appears in at least one basket. Throw away pairs of items that do not appear at least s times.
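The same computation can be written outside SQL. Here is a minimal Python sketch, assuming Baskets is a list of (basket, item) pairs without duplicate rows; the data and the name `frequent_pairs` are illustrative. Sorting the items of each basket plays the role of the `b1.item < b2.item` condition, so every pair is counted once per basket.

```python
from collections import Counter
from itertools import combinations

# Hypothetical Baskets(basket, item) relation.
BASKETS = [
    (1, "hamburger"), (1, "ketchup"), (1, "chips"),
    (2, "hamburger"), (2, "ketchup"),
    (3, "hamburger"), (3, "chips"),
    (4, "ketchup"),
]

def frequent_pairs(baskets, s):
    """Return the item pairs that occur together in at least s baskets."""
    items_by_basket = {}
    for basket, item in baskets:
        items_by_basket.setdefault(basket, set()).add(item)
    counts = Counter()
    for items in items_by_basket.values():
        # One group per pair; sorted order avoids counting (a, b) and (b, a) separately.
        counts.update(combinations(sorted(items), 2))
    return {pair for pair, n in counts.items() if n >= s}

print(frequent_pairs(BASKETS, 2))
```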

107 107 A-Priori Trick --- 1 Straightforward implementation involves a join of a huge Baskets relation with itself. The a-priori algorithm speeds the query by recognizing that a pair of items {i,j } cannot have support s unless both {i } and {j } do.

108 A-Priori Trick --- 2 Use a materialized view to hold only information about frequent items.

INSERT INTO Baskets1(basket, item)
SELECT * FROM Baskets
WHERE item IN (
  SELECT item FROM Baskets
  GROUP BY item
  HAVING COUNT(*) >= s
);

The subquery selects the items that appear in at least s baskets.

109 109 A-Priori Algorithm 1.Materialize the view Baskets1. 2.Run the obvious query, but on Baskets1 instead of Baskets. Baskets1 is cheap, since it doesn’t involve a join. Baskets1 probably has many fewer tuples than Baskets. – Running time shrinks with the square of the number of tuples involved in the join.
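The two passes can be sketched in Python as follows. The first function plays the role of Baskets1, keeping only tuples whose item is frequent on its own; the second runs the pair count on that reduced relation. The sample data and function names are assumptions for illustration, and Baskets is assumed to contain no duplicate rows.

```python
from collections import Counter
from itertools import combinations

# Hypothetical Baskets(basket, item) relation; "mustard" is a rare item.
BASKETS = [
    (1, "hamburger"), (1, "ketchup"), (1, "chips"),
    (2, "hamburger"), (2, "ketchup"),
    (3, "hamburger"), (3, "chips"),
    (4, "ketchup"), (4, "mustard"),
]

def frequent_item_rows(baskets, s):
    """Pass 1 (Baskets1): keep only tuples whose item appears in at least s baskets."""
    item_counts = Counter(item for _, item in baskets)
    return [(b, i) for b, i in baskets if item_counts[i] >= s]

def apriori_pairs(baskets, s):
    """Pass 2: count pairs, but only among items that survived pass 1."""
    items_by_basket = {}
    for basket, item in frequent_item_rows(baskets, s):
        items_by_basket.setdefault(basket, set()).add(item)
    counts = Counter()
    for items in items_by_basket.values():
        counts.update(combinations(sorted(items), 2))
    return {pair for pair, n in counts.items() if n >= s}

print(len(BASKETS), len(frequent_item_rows(BASKETS, 2)))
print(apriori_pairs(BASKETS, 2))
```

No frequent pair is lost by the filter, because a pair that appears in s baskets forces each of its items to appear in at least s baskets as well.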

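110 Example: A-Priori Suppose: 1. A supermarket sells 10,000 items. 2. The average basket has 10 items. 3. The support threshold is 1% of the baskets. At most 1/10 of the items can be frequent: with N baskets there are about 10N item occurrences, and each frequent item needs at least N/100 of them, so at most 1,000 of the 10,000 items qualify. Probably, the minority of items in one basket are frequent; since the join cost grows with the square of the number of tuples, cutting Baskets to less than half its size gives at least a factor 4 speedup.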

