1 Optimization of Association Rules Extraction Through Exploitation of Context Dependent Constraints Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta Dipartimento di Informatica, Università di Torino

2 Outline
- Motivations: Knowledge Discovery in Databases (KDD), Inductive Databases, Constraint-Based Mining, Incremental Constraint Evaluation
- Association Rule Mining
- Incremental Algorithms
- Constraint properties: Item Dependent Constraints (IDC), Context Dependent Constraints (CDC)
- Incremental Algorithms for IDC and CDC
- Performance results and Conclusions

3 Motivations: KDD process and Inductive Databases (IDB)
- The KDD process consists of the non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
- Inductive Databases have been proposed by Mannila and Imielinski [CACM'96] as a support for KDD.
- KDD is an interactive and iterative process.
- Inductive Databases contain both the data and the inductive generalizations (e.g. patterns, models) extracted from the data.
- Users can query the inductive database with an advanced, ad-hoc data mining query language: constraint-based queries.

4 Motivations: Constraint-Based Mining and Incrementality
Why constraints?
- They can be pushed into the pattern computation, pruning the search space.
- They provide the user with a tool to express her interests (both in the data and in the knowledge).
In IDBs, constraint-based queries are very often a refinement of previous ones:
- explorative process
- reconciling background and extracted knowledge
Why execute each query from scratch? The new query can be executed incrementally! [Baralis et al., DaWaK'99]

5 A Generic Mining Language
A very generic constraint-based mining query R = Q(T, G, I, M, …) requests:
- extraction from a source table T
- of sets of items (itemsets) on some schema I
- from the groups of the database (grouping constraints), G
- satisfying some user-defined constraints (mining constraints), M
- such that the number of such groups is sufficient (user-defined statistical evaluation measures, such as support).
In our case R contains association rules.

6 An Example
Mining query: R = Q(purchase, customer, product, price > 100, support_count >= 2)

Source table purchase (transaction, customer, product, date, price, quantity):
1, 1001, hiking boots, 12/7/98, 140, 1
1, 1001, ski pants, 12/7/98, 180, 1
3, 1001, jacket, 17/7/98, 300, 1
2, 2256, col shirt, 12/7/98, 25, 2
2, 2256, ski pants, 13/7/98, 180, 1
2, 2256, jacket, 13/7/98, 300, 1
4, 3441, col shirt, 13/7/98, 25, 3
5, 3441, jacket, 20/8/98, 300, 2

Frequent itemsets (itemset, support_count):
{jacket}, 3
{ski pants}, 2
{jacket, ski pants}, 2

Result R (body, head, confidence, frequency):
ski pants, jacket, 1, 2/3
jacket, ski pants, 2/3, 2/3
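As a rough illustration of how the query above could be evaluated from scratch (apply the mining constraint price > 100, group by customer, count itemsets and derive rules), here is a minimal Python sketch; the function name mine and its parameters are illustrative, not part of the paper:

```python
from itertools import combinations
from collections import defaultdict

# Source table: (transaction, customer, product, date, price, quantity)
purchase = [
    (1, 1001, "hiking boots", "12/7/98", 140, 1),
    (1, 1001, "ski pants",    "12/7/98", 180, 1),
    (3, 1001, "jacket",       "17/7/98", 300, 1),
    (2, 2256, "col shirt",    "12/7/98",  25, 2),
    (2, 2256, "ski pants",    "13/7/98", 180, 1),
    (2, 2256, "jacket",       "13/7/98", 300, 1),
    (4, 3441, "col shirt",    "13/7/98",  25, 3),
    (5, 3441, "jacket",       "20/8/98", 300, 2),
]

def mine(table, min_support_count=2, min_price=100):
    # Group by customer, keeping only items that satisfy the mining constraint.
    groups = defaultdict(set)
    for _, customer, product, _, price, _ in table:
        if price > min_price:
            groups[customer].add(product)

    # Count the support of every itemset (up to size 2, enough for this example).
    support = defaultdict(int)
    for items in groups.values():
        for size in (1, 2):
            for itemset in combinations(sorted(items), size):
                support[frozenset(itemset)] += 1
    frequent = {s: c for s, c in support.items() if c >= min_support_count}

    # Derive rules body -> head with their confidence and frequency.
    n_groups = len(groups)
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue
        for body in itemset:
            head = itemset - {body}
            conf = count / frequent[frozenset({body})]
            rules.append((body, set(head), conf, count / n_groups))
    return frequent, rules

frequent, rules = mine(purchase)
# yields ski pants -> jacket (confidence 1, frequency 2/3)
# and jacket -> ski pants (confidence 2/3, frequency 2/3)
```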

7 Incremental Algorithms
- We studied an incremental approach to answering new constraint-based queries which makes use of the information (rules with support and confidence) contained in previous results.
- We identified two classes of query constraints: item dependent (IDC) and context dependent (CDC).
- We propose two newly developed incremental algorithms which allow the exploitation of past results in the two cases (IDC and CDC).

8 Relationships between two queries
- Query equivalence: R1 = R2, no computation is needed [FQAS'04]
- Query containment [this paper]:
  - Inclusion: R2 ⊆ R1 and common elements have the same statistical measures: R2 = σ_C(R1)
  - Dominance: R2 ⊆ R1 but common elements do not have the same statistical measures: R2 ≠ σ_C(R1)
We can speed up the execution of a new query using the results of previous queries. Which previous queries? How can we recognize inclusion or dominance between two constraint-based queries?

9 IDC vs CDC
Example table (transaction, customer, product, date, price, quantity), with IDC: price > 150 and CDC: qty > 1:
1, 1001, ski pants, 12/7/98, 140, 1
1, 1001, hiking boots, 12/7/98, 180, 1
2, 1001, jacket, 17/7/98, 300, 2
2, 2256, col shirt, 12/7/98, 25, 2
2, 2256, ski pants, 13/7/98, 140, 2
3, 2256, jacket, 13/7/98, 300, 1
4, 2256, col shirt, 13/7/98, 25, 3
4, 2256, jacket, 20/8/98, 300, 2

Item Dependent Constraints (IDC):
- are functionally dependent on the item extracted
- are satisfied, for a given itemset, either by all the groups in the database or by none
- if an itemset is common to R1 and R2, it has the same support: inclusion

Context Dependent Constraints (CDC):
- depend on the transactions in the database
- might be satisfied, for a given itemset, only by some groups in the database
- an itemset common to R1 and R2 might not have the same support: dominance
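A small sketch (Python, using the rows of the table above) of why the two classes behave differently; the helper support and the chosen itemsets are illustrative:

```python
from collections import defaultdict

# Rows of the example table: (customer, product, price, quantity).
rows = [
    (1001, "ski pants",    140, 1), (1001, "hiking boots", 180, 1),
    (1001, "jacket",       300, 2), (2256, "col shirt",     25, 2),
    (2256, "ski pants",    140, 2), (2256, "jacket",       300, 1),
    (2256, "col shirt",     25, 3), (2256, "jacket",       300, 2),
]

def support(itemset, keep=lambda price, qty: True):
    """Count the customer groups whose kept rows contain every item in itemset."""
    groups = defaultdict(set)
    for customer, product, price, qty in rows:
        if keep(price, qty):
            groups[customer].add(product)
    return sum(1 for items in groups.values() if itemset <= items)

# IDC (price > 150): depends only on the item, so an itemset that survives keeps
# exactly the support it had before -> inclusion.
print(support({"jacket"}), support({"jacket"}, lambda p, q: p > 150))      # 2 2

# CDC (qty > 1): depends on the transaction, so a surviving itemset may lose
# support in some groups -> dominance.
print(support({"ski pants"}), support({"ski pants"}, lambda p, q: q > 1))  # 2 1
```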

10 Incremental Algorithm for IDC
- Previous query Q1 (constraint: price > 5) produced R1, kept in memory as rules (BODY, HEAD, SUPP, CONF), e.g. A -> B with support 2 and confidence 1, A -> C, ...
- Current query Q2 has the stricter IDC constraint price > 10.
- Looking up the item domain table (item, price, category: A, 12, hi-tech; B, 14, housing; C, 8, ...), item C belongs to a row that fails the new IDC constraint.
- It is therefore sufficient to delete from R1 all rules containing item C: R2 = σ_P(R1).
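In code, the IDC case reduces to a selection over the previous result. A minimal Python sketch; the rule values and helper names are placeholders, not taken from the paper:

```python
# Item domain table (item -> price), as in the slide's example.
item_price = {"A": 12, "B": 14, "C": 8}

# Previous result R1: rules as (body, head, support, confidence);
# the numeric values here are placeholders.
R1 = [
    (frozenset("A"), frozenset("B"), 2, 1.0),
    (frozenset("A"), frozenset("C"), 2, 1.0),
]

def refine_idc(previous_rules, item_satisfies):
    """Answer the new query by selection on the old result: R2 = sigma_P(R1).
    A rule survives only if every item in its body and head satisfies the new IDC."""
    return [
        (body, head, supp, conf)
        for body, head, supp, conf in previous_rules
        if all(item_satisfies(i) for i in body | head)
    ]

# New IDC: price > 10. Item C fails it, so every rule mentioning C is dropped;
# supports and confidences of the surviving rules are unchanged (inclusion).
R2 = refine_idc(R1, lambda item: item_price[item] > 10)
```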

11 Incremental Algorithm for CDC
- Previous query Q1 (constraint: qty > 5) produced R1, kept in memory as rules (BODY, HEAD, SUPP, CONF).
- Current query Q2 has the stricter CDC constraint qty > 10. Since the constraint is context dependent, the statistical measures must be recomputed:
- build a BHF (Body-Head Forest) from the rules in R1
- read the DB and find the groups in which the new constraints are satisfied and which contain items belonging to the BHF
- update the support counters in the BHF
- produce R2 from the updated counters.

12 Body-Head Forest (BHF)
- The body (head) tree contains itemsets which are candidates for being in the body (head) part of a rule.
- An itemset is represented as a single path in the tree, and vice versa.
- Each path in the body (head) tree is associated with a counter representing the body (rule) support.
Example: rule a -> f g, with body counter 4 and head counter 3, i.e. rule support = 3 and confidence = 3/4.
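A sketch of how slides 11 and 12 fit together, in Python. For brevity this flattens the two tries of the BHF into dictionaries of frozensets (an assumption made here, not the paper's actual data structure), but the counter update during the DB scan follows the same idea:

```python
from collections import defaultdict

class SimpleBHF:
    """Deliberately flattened stand-in for the Body-Head Forest: bodies and heads
    are kept as frozensets with counters instead of paths in two trees. The body
    counter is the confidence denominator, the head counter is the rule support."""

    def __init__(self, rules):
        # rules: (body, head) pairs taken from the previous result R1
        self.heads = defaultdict(set)
        for body, head in rules:
            self.heads[frozenset(body)].add(frozenset(head))
        self.body_count = defaultdict(int)
        self.rule_count = defaultdict(int)

    def update(self, group_items):
        """Call once per database group that satisfies the new (CDC) constraints."""
        for body, heads in self.heads.items():
            if body <= group_items:
                self.body_count[body] += 1
                for head in heads:
                    if head <= group_items:
                        self.rule_count[(body, head)] += 1

    def rules(self, min_support=1):
        for body, heads in self.heads.items():
            for head in heads:
                supp = self.rule_count[(body, head)]
                if supp >= min_support:
                    yield body, head, supp, supp / self.body_count[body]

# CDC incremental step (slide 11): rebuild the counters by scanning only the
# groups that satisfy the new constraint.
bhf = SimpleBHF([({"a"}, {"f", "g"})])
for group in [{"a", "f", "g"}, {"a", "f", "g"}, {"a", "f", "g"}, {"a", "b"}]:
    bhf.update(group)   # `group` = items of a DB group whose rows pass the new CDC

print(list(bhf.rules()))  # one rule: {a} -> {f, g}, support 3, confidence 3/4
```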

13 Experiments (1): ID vs CD algorithm
[Figure: four plots. (a), (b): ID algorithm, execution time vs constraint selectivity and execution time vs volume of the previous result. (c), (d): CD algorithm, same quantities.]

14 Experiments (2): CARE vs Incremental
[Figure: three plots comparing CARE with the incremental algorithms — execution time vs cardinality of the previous result, vs support threshold, and vs selectivity of constraints.]

15 Conclusions and future work
- We proposed two incremental algorithms for constraint-based mining which make use of the information contained in previous results to answer new queries.
- The first algorithm deals with item dependent constraints, the second with context dependent ones.
- We evaluated the incremental algorithms on a fairly large dataset. The results show that the approach drastically reduces the execution time.
An interesting direction for future research: integration of condensed representations with these incremental techniques.

16 The end. Questions?

17 Condensed representation
It is well known that the set of association rules can rapidly grow to be unwieldy, especially as the frequency bound decreases. Since most of these rules turn out to be redundant, it is not necessary to mine rules from all frequent itemsets: it is sufficient to consider only the rules among closed frequent itemsets. In fact, frequent closed itemsets are a small subset (a condensed representation) of the frequent itemsets, without information loss. For this reason, mining the frequent closed itemsets instead of all frequent itemsets brings great advantages.
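For reference, a closed itemset is one with no proper superset having the same support. A minimal Python check under that definition, reusing the counts from the example of slide 6; the helper name is illustrative:

```python
def closed_itemsets(support):
    """Keep only the itemsets with no proper superset of equal support.
    `support` maps frozenset itemsets to their support counts."""
    return {
        itemset: count
        for itemset, count in support.items()
        if not any(itemset < other and count == support[other] for other in support)
    }

support = {
    frozenset({"jacket"}): 3,
    frozenset({"ski pants"}): 2,
    frozenset({"jacket", "ski pants"}): 2,
}
# {ski pants} has the same support as its superset {jacket, ski pants}, so it is
# not closed and can be dropped without losing information.
print(closed_itemsets(support))  # keeps {jacket} (3) and {jacket, ski pants} (2)
```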

