Query Flocks. Umar Hammoud, Elizabeth Cash. March 25, 2003.


Presentation Based On: Paper title: "Query Flocks: A Generalization of Association-Rule Mining". Authors: Dick Tsur, Jeffrey D. Ullman, Serge Abiteboul, Chris Clifton, Rajeev Motwani, Svetlozar Nestorov, Arnon Rosenthal.

Association Rules: The goal is to find sets of items that are associated. A statement of such an association is called an association rule.

Market Basket Mining: Understand how customers behave when they shop in order to improve marketing. It is an attempt by a retail store to learn which items its customers purchase together, that is, a way to find items that tend to appear together in a market basket.

Precise Measures of Association: Given a relation baskets(BID, Item), where BID is the basket ID: 1. Support: The items must appear in many baskets. 2. Confidence: The probability of one item given that the others are in the basket must be high. 3. Interest: That probability must be significantly higher or lower than the expected probability if items were purchased at random.

Examples: Measures of Association. People who buy milk often buy cereal: {cereal, milk}. 1. High support means that many people buy both cereal and milk. 2. High confidence means that a lot of people who buy cereal also buy milk. 3. High interest means that if you buy cereal, then you are much more likely to buy milk than the general population.
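
The three measures can be made concrete with a small sketch. The toy data, the function names, and the choice to express interest as the ratio of confidence to the baseline probability are all assumptions made for illustration; the slides only require that the probability differ significantly from the expected one.

```python
from collections import defaultdict

# Hypothetical toy data in the shape of baskets(BID, Item).
baskets = [
    (1, "milk"), (1, "cereal"),
    (2, "milk"), (2, "cereal"), (2, "bread"),
    (3, "milk"),
    (4, "bread"),
]

def items_by_basket(rows):
    """Group the (BID, Item) rows into one item set per basket."""
    by_bid = defaultdict(set)
    for bid, item in rows:
        by_bid[bid].add(item)
    return by_bid

def measures(rows, lhs, rhs):
    """Return (support, confidence, interest) for the rule lhs -> rhs.

    Assumes lhs and rhs each occur in at least one basket.
    """
    by_bid = items_by_basket(rows)
    n = len(by_bid)
    both = sum(1 for s in by_bid.values() if lhs in s and rhs in s)
    with_lhs = sum(1 for s in by_bid.values() if lhs in s)
    with_rhs = sum(1 for s in by_bid.values() if rhs in s)
    support = both                          # baskets containing both items
    confidence = both / with_lhs            # estimate of P(rhs | lhs)
    interest = confidence / (with_rhs / n)  # assumed form: ratio to P(rhs)
    return support, confidence, interest
```

With this definition, an interest value far from 1 in either direction indicates that buying the first item changes the likelihood of buying the second.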

Association-Rule Optimization: Association-rule mining can be optimized by taking advantage of many query-optimization ideas (e.g. “a-priori”).

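The A-Priori Optimization: Using this technique, tuples can be eliminated before the join. Let S be a set of items that appears in at least n baskets, and let S’ be a subset of S. Then S’ also appears in at least n baskets. Consequently, an item that appears in fewer than n baskets cannot belong to any item set that appears in at least n baskets.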
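
A minimal sketch of the a-priori idea for item pairs, assuming the input is a list of (BID, Item) rows. The function name `frequent_pairs` is hypothetical. Items below the threshold are discarded before any pairs are formed, which is the "eliminate tuples before the join" step.

```python
from collections import Counter, defaultdict
from itertools import combinations

def frequent_pairs(rows, n):
    """Return {(item1, item2): count} for pairs found in at least n baskets."""
    # Pass 1: count baskets per item (duplicate rows are ignored).
    item_counts = Counter(item for _, item in set(rows))
    frequent = {item for item, c in item_counts.items() if c >= n}

    # Pass 2: keep only frequent items, then count pairs per basket.
    by_bid = defaultdict(set)
    for bid, item in rows:
        if item in frequent:
            by_bid[bid].add(item)

    pair_counts = Counter()
    for items in by_bid.values():
        # sorted() yields each pair once, in lexicographic order.
        pair_counts.update(combinations(sorted(items), 2))
    return {pair: c for pair, c in pair_counts.items() if c >= n}
```

The first pass is cheap (one scan), and it can shrink the second pass considerably when most items are rare.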

A-Priori Generalization: The a-priori idea can be extended to provide efficient mining of very large databases for many different kinds of patterns. It can be used for general-purpose mining systems and for a future generation of query optimizers. The generalization is known as “query flocks”.

Query Flocks: A query flock is a parameterized query with a filter condition that eliminates the “uninteresting” values of the parameters. It is represented in Datalog.

Mining Languages: Can SQL be used as a mining language? In principle it can, but the right optimization is not there.

SQL: What’s the Problem?
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING 20 <= COUNT(i1.BID)
The A-Priori trick has not been implemented by any conventional optimizer. Better performance can be achieved if the query is rewritten in the following way: first find those items that appear in at least 20 baskets, then join the set of these items with the baskets relation.
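
The rewrite described above can be sketched and checked with Python's built-in `sqlite3` module. The `ok` subquery, the `run` helper, the primary key on (BID, Item), and the parameter `:n` (standing in for the threshold of 20) are assumptions introduced for this illustration, not part of the original slides.

```python
import sqlite3

# The query from the slide, with the threshold as a named parameter.
NAIVE = """
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
GROUP BY i1.Item, i2.Item
HAVING COUNT(i1.BID) >= :n
"""

# Hand-applied a-priori rewrite: find frequent items first,
# then restrict both sides of the self-join to them.
REWRITTEN = """
WITH ok(Item) AS (
    SELECT Item FROM baskets GROUP BY Item HAVING COUNT(BID) >= :n
)
SELECT i1.Item, i2.Item
FROM baskets i1, baskets i2
WHERE i1.Item < i2.Item AND i1.BID = i2.BID
  AND i1.Item IN (SELECT Item FROM ok)
  AND i2.Item IN (SELECT Item FROM ok)
GROUP BY i1.Item, i2.Item
HAVING COUNT(i1.BID) >= :n
"""

def run(sql, rows, n):
    """Load rows into an in-memory baskets table and run one query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE baskets (BID INTEGER, Item TEXT, PRIMARY KEY (BID, Item))"
    )
    conn.executemany("INSERT INTO baskets VALUES (?, ?)", rows)
    result = sorted(conn.execute(sql, {"n": n}).fetchall())
    conn.close()
    return result
```

Both queries return the same pairs; the rewritten form only gives the engine a smaller join input when many items fall below the threshold.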

Mining with Flocks: Many data mining problems can benefit from a-priori-style optimization. The formalism of “query flocks” is an important tool for building better optimizers.

Query Flocks: A query flock is a family of identical queries, differing only in their parameter values, that are asked simultaneously. The answers to these queries are filtered, and the parameter values whose answers pass the filter become the answer to the flock.

Query Flock Settings: Queries are parameterized by one or more parameters. There is an ability to express filter conditions about the results of the query.

Query Flock Designation: –One or more predicates that represent data stored as relations –A set of parameters with names starting with $ –A query –A filter that specifies a condition

Language for Flocks: “Conjunctive queries” augmented with arithmetic and with union. Datalog is used rather than SQL because: –The notion of a “safe query” for Datalog figures into potential optimizations. –The set of options for adapting the a-priori trick to arbitrary flocks is most easily expressed in Datalog. SQL is used for the filter language only.

Market Basket as a Query Flock: QUERY: answer(B) :- baskets(B,$1) AND baskets(B,$2) FILTER: COUNT(answer.B) >= 20

Language Extensions: To apply the proposed query optimizations, extensions must be added: –Negated subgoals –Arithmetic subgoals for variables and parameters

Extensions Usage: Add an arithmetic subgoal to the previous query to restrict item pairs to appear in lexicographic order: answer(B) :- baskets(B,$1) AND baskets(B,$2) AND $1 < $2

Extensions Usage: Given the following relations: diagnoses(Patient, Disease), exhibits(Patient, Symptom), treatments(Patient, Medicine), causes(Disease, Symptom). Find unexplained side effects. QUERY: answer(P) :- exhibits(P,$s) AND treatments(P,$m) AND diagnoses(P,D) AND NOT causes(D,$s) FILTER: COUNT(answer.P) >= 20
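
One way to read this flock is as a direct, unoptimized evaluation over in-memory relations. The sketch below is only an illustration: the function name and the representation of relations as lists of tuples are assumptions, and the negated subgoal is interpreted as "the patient has some diagnosed disease D that does not cause $s".

```python
from collections import defaultdict

def side_effect_flock(diagnoses, exhibits, treatments, causes, threshold=20):
    """Return the ($s, $m) pairs whose answer(P) has >= threshold patients."""
    causes = set(causes)

    diseases = defaultdict(set)          # patient -> diagnosed diseases
    for p, d in diagnoses:
        diseases[p].add(d)

    medicines = defaultdict(set)         # patient -> medicines taken
    for p, m in treatments:
        medicines[p].add(m)

    patients = defaultdict(set)          # ($s, $m) -> patients in answer(P)
    for p, s in exhibits:
        # diagnoses(P,D) AND NOT causes(D,$s): some disease of p
        # does not explain symptom s.
        if any((d, s) not in causes for d in diseases.get(p, ())):
            for m in medicines.get(p, ()):
                patients[(s, m)].add(p)

    return {pair for pair, ps in patients.items() if len(ps) >= threshold}
```

This evaluates every symptom and medicine combination that occurs in the data, which is exactly the cost the later filter steps try to avoid.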

Generalizing A-Priori Techniques: Evaluate a less expensive query first. Its answer gives an upper bound on the size of the answer obtained for particular parameter values. If the bound is below the filter threshold, those parameter values can be eliminated without further consideration. For a query Q1 to put an upper bound on the size of the result of a query Q2, it must be provable that the result of Q2 is a subset of the result of Q1. The containment-mapping theorem says that Q2 ⊆ Q1 holds if Q1 is constructed from Q2 by: 1. taking a subset of the subgoals of Q2, and 2. splitting zero or more variables into several variables.

Safe Query Example: answer(B) :- baskets(B,$1) AND baskets(B,$2) AND ... Two safe subqueries are formed by taking proper subsets of the subgoals: answer(B) :- baskets(B,$1) and answer(B) :- baskets(B,$2)

Safe Query Example cont.: If we take the first subquery, we can ask: for which values of $1 does the query answer(B) :- baskets(B,$1) produce a number of values of B that meets the threshold given in the filter? Any other value of $1 can be eliminated as a member of a pair of items meeting the filter condition.

Search for Optimal Query-Flock Evaluators: R(P) := FILTER(P,Q,C), where P is a set of parameters, Q is a query involving the parameters P, R is a relation whose tuples are values of the parameters P, and C is a condition on the result of the query Q.
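
A filter step can be sketched as a small higher-order function. This is a hypothetical illustration, not the paper's implementation: here the query Q is modeled as a callable that yields (parameter values, answer tuple) pairs, and the condition C as a predicate on the set of answer tuples collected for one parameter assignment.

```python
from collections import defaultdict

def filter_step(query, condition):
    """Sketch of R(P) := FILTER(P, Q, C).

    query:     callable yielding (params, answer) pairs, i.e. one answer
               tuple of Q for one assignment of the parameters P.
    condition: predicate applied to the set of answers for an assignment.
    Returns the set of parameter assignments that satisfy the condition.
    """
    answers = defaultdict(set)
    for params, answer in query():
        answers[params].add(answer)
    return {params for params, result in answers.items() if condition(result)}

def at_least(threshold):
    """Build a COUNT(answer) >= threshold condition."""
    return lambda result: len(result) >= threshold
```

For instance, `filter_step(lambda: ((s, p) for p, s in exhibits), at_least(20))` would correspond to keeping the symptoms exhibited by at least 20 patients, assuming `exhibits` is a list of (patient, symptom) tuples.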

A Query Plan:
okS($s) := FILTER($s, answer(P) :- exhibits(P,$s), COUNT(answer.P) >= 20);
okM($m) := FILTER($m, answer(P) :- treatments(P,$m), COUNT(answer.P) >= 20);
ok($s,$m) := FILTER({$s,$m}, answer(P) :- okS($s) AND okM($m) AND diagnoses(P,D) AND exhibits(P,$s) AND treatments(P,$m) AND NOT causes(D,$s), COUNT(answer.P) >= 20);
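
The three steps of this plan can be sketched in Python as follows. The helper names `count_filter` and `plan`, and the representation of relations as lists of tuples, are assumptions for illustration only. The first two steps prune symptoms and medicines individually; the final step evaluates the full query only over the surviving values.

```python
from collections import defaultdict

def count_filter(pairs, threshold):
    """Keep the keys associated with at least `threshold` distinct patients."""
    seen = defaultdict(set)
    for key, patient in pairs:
        seen[key].add(patient)
    return {key for key, ps in seen.items() if len(ps) >= threshold}

def plan(diagnoses, exhibits, treatments, causes, threshold=20):
    # okS($s) and okM($m): cheap single-relation filter steps.
    ok_s = count_filter(((s, p) for p, s in exhibits), threshold)
    ok_m = count_filter(((m, p) for p, m in treatments), threshold)

    causes = set(causes)
    diseases = defaultdict(set)
    for p, d in diagnoses:
        diseases[p].add(d)
    medicines = defaultdict(set)
    for p, m in treatments:
        if m in ok_m:                    # okM($m) subgoal
            medicines[p].add(m)

    # ok($s,$m): the full query, restricted to the surviving parameters.
    candidates = (
        ((s, m), p)
        for p, s in exhibits if s in ok_s            # okS($s) subgoal
        for m in medicines.get(p, ())
        if any((d, s) not in causes for d in diseases.get(p, ()))
    )
    return count_filter(candidates, threshold)
```

Because each pre-filter uses a query that contains the full query, no qualifying ($s, $m) pair is lost; only the amount of intermediate work changes.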

Is There a Rule for Generating the Query Plans? Consider only sequences of filter steps that satisfy these conditions: –Each step must use the same filter condition as the original query flock. –Each step must define a uniquely named relation. –The final step must not delete any subgoals of the original query; it may have additional subgoals derived from previous steps. –Each step is derived from the given query flock as follows: start with the original query flock; add zero or more subgoals that are copies of the left side of the assignment (:=) in some previous filter step; delete zero or more subgoals but, following the optimization principle for conjunctive queries, make sure that the resulting query is safe.

Exponential Search: Query Plan. Candidate for the best possible plan: –A long sequence of steps in which each step uses the results of the previous step. How to restrict the search: –Select sets of parameters. –Select a list of subsets of the subgoals of the original query that form safe queries.

Dynamic Selection of Filter Steps: We let the sizes of intermediate relations determine whether or not to apply filters. The important special case is when the set of parameters for a relation has not previously been encountered: –If the support threshold is low, then it is likely to be useful to filter. –If the support threshold is high, then filtering is unlikely to be useful.

Possible Query Plan:
temp1($s) := FILTER($s, answer(P) :- exhibits(P,$s), COUNT(answer.P) >= 20)
temp2(P,$s,$m) := (temp1($s) JOIN exhibits(P,$s)) JOIN treatments(P,$m)
temp3($s,$m) := FILTER({$s,$m}, answer(P) :- temp2(P,$s,$m), COUNT(answer.P) >= 20)
temp4(P,D,$s,$m) := ((temp3($s,$m) JOIN temp2(P,$s,$m)) JOIN diagnoses(P,D)) JOIN (NOT causes(D,$s))
sideEffect($s,$m) := FILTER({$s,$m}, answer(P) :- temp4(P,D,$s,$m), COUNT(answer.P) >= 20)

Conclusions: Query flocks are a generate-and-test model for data-mining problems. They use "parts of queries" constructively to prune answer sets for main queries. They provide a parameterized way to specify a set of queries whose answer is the parameter(s).

So What Should Tim Tell His Mother? In one sentence: Generalization of query optimization techniques to be used for data mining. And Questions?
