
Optimization of Association Rules Extraction Through Exploitation of Context Dependent Constraints Arianna Gallo, Roberto Esposito, Rosa Meo, Marco Botta Dipartimento di Informatica, Università di Torino

Outline
- Motivations: Knowledge Discovery in Databases (KDD), Inductive Databases
- Constraint-Based Mining, Incremental Constraint Evaluation
- Association Rule Mining
- Incremental Algorithms
- Constraint properties: Item Dependent Constraints (IDC), Context Dependent Constraints (CDC)
- Incremental Algorithms for IDC and CDC
- Performance results and Conclusions

Motivations: the KDD process and Inductive Databases (IDB)
- The KDD process consists of the non-trivial extraction of implicit, previously unknown, and potentially useful information from data.
- KDD is an interactive and iterative process.
- Inductive Databases were proposed by Imielinski and Mannila [CACM’96] as a support for KDD.
- Inductive Databases contain both data and inductive generalizations (e.g. patterns, models) extracted from the data.
- Users can query the inductive database with an advanced, ad-hoc data mining query language: constraint-based queries.

Motivations: Constraint-Based Mining and Incrementality
Why constraints?
- They can be pushed into the pattern computation, pruning the search space.
- They provide the user with a tool to express her interests (both in the data and in the knowledge).
In IDBs, constraint-based queries are very often a refinement of previous ones:
- explorative process;
- reconciling background and extracted knowledge.
Why execute each query from scratch? The new query can be executed incrementally! [Baralis et al., DaWaK’99]

A Generic Mining Language
A very generic constraint-based mining query R = Q(T, G, I, M, support) requests:
- extraction from a source table T
- of sets of items (itemsets) on some schema I
- from the groups of the database (grouping constraints G)
- satisfying some user-defined constraints (mining constraints M)
- such that the number of such groups is sufficient (user-defined statistical evaluation measures, such as support).
In our case R contains association rules.

An Example
Mining query: R = Q(purchase, customer, product, price > 100, support_count >= 2)
Source table purchase, schema (transaction, customer, product, date, price, quantity): eight purchases of hiking boots, ski pants, jackets and col shirts made by three customers.
Frequent itemsets (groups are customers, support_count >= 2):
  itemset               support_count
  {ski pants}           2
  {jacket, ski pants}   2
  {jacket}              3
Result R:
  body        head        confidence   frequency
  ski pants   jacket      1            2/3
  jacket      ski pants   2/3          2/3
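
To make the numbers above concrete, here is a minimal Python sketch of how such a result is computed once the mining constraint price > 100 has been applied and the rows have been grouped by customer. The three baskets are an assumption chosen to be consistent with the support counts on the slide (the original row values are not reproduced), and the helper names are illustrative.

```python
from itertools import combinations

# Items bought by each customer that satisfy the mining constraint
# (price > 100), grouped by customer. These baskets are assumed for
# illustration; they are consistent with the support counts above.
baskets = [
    {"ski pants", "jacket"},   # customer 1001
    {"ski pants", "jacket"},   # second customer
    {"jacket"},                # third customer
]

def support_count(itemset):
    """Number of customer groups whose basket contains the itemset."""
    return sum(1 for basket in baskets if itemset <= basket)

min_support_count = 2
items = sorted(set().union(*baskets))

# Frequent itemsets: support_count >= 2
frequent = {}
for size in range(1, len(items) + 1):
    for combo in combinations(items, size):
        sc = support_count(set(combo))
        if sc >= min_support_count:
            frequent[combo] = sc
# frequent == {('jacket',): 3, ('ski pants',): 2, ('jacket', 'ski pants'): 2}

# Association rules body -> head, with confidence and frequency
n_groups = len(baskets)
for itemset, sc in frequent.items():
    for body_size in range(1, len(itemset)):
        for body in combinations(itemset, body_size):
            head = tuple(i for i in itemset if i not in body)
            confidence = sc / support_count(set(body))
            frequency = sc / n_groups
            print(body, "->", head, confidence, frequency)
# e.g. ('ski pants',) -> ('jacket',) with confidence 1.0 and frequency 2/3,
#      ('jacket',) -> ('ski pants',) with confidence 2/3 and frequency 2/3.
```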

Incremental Algorithms
- We studied an incremental approach to answering new constraint-based queries which makes use of the information (rules with support and confidence) contained in previous results.
- We identified two classes of query constraints: item dependent (IDC) and context dependent (CDC).
- We propose two newly developed incremental algorithms which exploit past results in the two cases (IDC and CDC).

Relationships between two queries
- Query equivalence: R1 = R2, no computation is needed [FQAS’04].
- Query containment [this paper]:
  - Inclusion: R2 ⊆ R1 and the common elements have the same statistical measures, so R2 = σ_C(R1).
  - Dominance: R2 ⊆ R1 but the common elements do not have the same statistical measures, so R2 ≠ σ_C(R1).
We can speed up the execution of a new query by using the results of previous queries. Which previous queries? How can we recognize inclusion or dominance between two constraint-based queries?

IDC vs CDC (illustrated on the purchase table of the example)
Item Dependent Constraints (IDC), e.g. price > 150:
- are functionally dependent on the items extracted;
- are satisfied for a given itemset either for all the groups in the database or for none;
- an itemset common to R1 and R2 has the same support: inclusion.
Context Dependent Constraints (CDC), e.g. a constraint on qty:
- depend on the transactions in the database;
- might be satisfied for a given itemset only for some groups in the database;
- an itemset common to R1 and R2 might not have the same support: dominance.

Incremental Algorithm for IDC
The previous query Q1 (constraint: price > 5) and its result R1, a set of rules (BODY, HEAD, SUPP, CONF), are kept in memory. The current query Q2 has the more selective constraint price > 10.
The item domain table (item, price, category) is checked against the new constraint: item C belongs to a row that does not satisfy the new IDC constraint (fail), so all rules containing item C are deleted from R1. The new result is obtained by selection: R2 = σ_P(R1).
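
The item-dependent case can be sketched in a few lines of Python. The item table values and the rules in R1 below are illustrative assumptions (the actual values on the slide are not reproduced); the point is that R2 is a pure selection over R1 that consults only the item domain table, without rescanning the transactions.

```python
# Item domain table: attributes that depend only on the item (illustrative values).
item_table = {
    "A": {"price": 12, "category": "hi-tech"},
    "B": {"price": 30, "category": "hi-tech"},
    "C": {"price": 7,  "category": "housing"},
}

# Previous result R1, kept in memory as (body, head, support, confidence).
R1 = [
    (frozenset({"A"}), frozenset({"B"}), 2, 1.0),
    (frozenset({"A"}), frozenset({"C"}), 2, 0.5),
]

def satisfies_new_idc(item):
    """New, more selective item-dependent constraint of Q2: price > 10."""
    return item_table[item]["price"] > 10

def incremental_idc(prev_rules):
    """R2 = selection over R1: drop every rule mentioning an item that fails
    the new constraint; supports and confidences stay unchanged (inclusion)."""
    return [
        (body, head, supp, conf)
        for body, head, supp, conf in prev_rules
        if all(satisfies_new_idc(i) for i in body | head)
    ]

R2 = incremental_idc(R1)   # the rule containing item C has been deleted
```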

Incremental Algorithm for CDC
The previous query Q1 (constraint: qty > 5) and its result R1, a set of rules (BODY, HEAD, SUPP, CONF), are kept in memory. The current query Q2 has the more selective constraint qty > 10.
Steps: build a BHF from the rules in R1; read the DB and find the groups in which the new constraints are satisfied and which contain items belonging to the BHF; update the support counters in the BHF; obtain R2 from the updated counters.
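
A rough Python sketch of this pass, using flat dictionaries of counters in place of the BHF described on the next slide; the data layout and names are assumptions made for illustration.

```python
from collections import defaultdict

def incremental_cdc(db_groups, prev_rules, new_constraint, min_support_count):
    """Re-count the rules of R1 under a new context-dependent constraint
    with a single scan of the database.

    db_groups: iterable of (group_id, rows); each row is a dict holding an
               'item' field plus context attributes such as 'qty'.
    prev_rules: (body, head) frozenset pairs taken from R1.
    """
    body_count = defaultdict(int)   # plays the role of the body-tree counters
    rule_count = defaultdict(int)   # plays the role of the head-tree counters

    for _, rows in db_groups:
        # Keep only the items of this group whose rows satisfy the new constraint.
        items = {row["item"] for row in rows if new_constraint(row)}
        if not items:
            continue
        for body, head in prev_rules:
            if body <= items:
                body_count[(body, head)] += 1
                if head <= items:
                    rule_count[(body, head)] += 1

    R2 = []
    for (body, head), supp in rule_count.items():
        if supp >= min_support_count:
            R2.append((body, head, supp, supp / body_count[(body, head)]))
    return R2
```

For the query above, new_constraint would be something like lambda row: row["qty"] > 10.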

Body-Head Forest (BHF)
- The body (head) tree contains itemsets which are candidates for being the body (head) part of a rule.
- An itemset is represented as a single path in the tree, and vice versa.
- Each path in the body (head) tree is associated with a counter representing the body (rule) support.
In the example in the diagram, the rule a → f g has body support 4 and rule support 3, hence confidence 3/4.
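
Below is a simplified Python sketch of the BHF idea: a trie of candidate bodies whose path-ending nodes each carry a trie of candidate heads, with counters for body support and rule support. It follows the description on this slide rather than the authors' actual implementation, so the node layout and method names are assumptions.

```python
class Node:
    def __init__(self):
        self.children = {}        # item -> Node
        self.count = 0            # body support (body tree) or rule support (head tree)
        self.head_tree = None     # set on nodes that end a body path
        self.is_head_end = False  # True on nodes that end a head path

    def insert(self, itemset):
        """One itemset corresponds to one path (and vice versa)."""
        node = self
        for item in sorted(itemset):
            node = node.children.setdefault(item, Node())
        return node


class BHF:
    """Simplified body-head forest: a body trie whose path-ending nodes each
    carry a head trie; the counters record body support and rule support."""

    def __init__(self):
        self.body_root = Node()

    def add_rule(self, body, head):
        """Insert a candidate rule taken from the previous result R1."""
        body_end = self.body_root.insert(body)
        if body_end.head_tree is None:
            body_end.head_tree = Node()
        body_end.head_tree.insert(head).is_head_end = True

    def update_group(self, group_items):
        """Called once per database group that satisfies the new
        context-dependent constraint, with the set of items it contains."""
        self._update(self.body_root, group_items, in_head=False)

    def _update(self, node, group_items, in_head):
        for item, child in node.children.items():
            if item not in group_items:
                continue                          # this path is not contained in the group
            if not in_head and child.head_tree is not None:
                child.count += 1                  # body support
                self._update(child.head_tree, group_items, in_head=True)
            if in_head and child.is_head_end:
                child.count += 1                  # rule support
            self._update(child, group_items, in_head)


# Usage reproducing the numbers of the slide's example (the groups are assumed):
bhf = BHF()
bhf.add_rule({"a"}, {"f", "g"})
for group in [{"a", "f", "g"}, {"a", "f", "g"}, {"a", "f", "g"}, {"a", "m"}]:
    bhf.update_group(group)
# body support of {a} = 4, rule support of a -> {f, g} = 3, confidence = 3/4
```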

Experiments (1): ID vs CD algorithm
Plots (a) and (b): ID algorithm, execution time vs constraint selectivity and vs volume of the previous result. Plots (c) and (d): the same measurements for the CD algorithm.

Experiments (2): CARE vs the incremental algorithms
Plots (a)-(c): execution time vs cardinality of the previous result, vs support threshold, and vs selectivity of the constraints.

Conclusions and future work
- We proposed two incremental algorithms for constraint-based mining which make use of the information contained in previous results to answer new queries.
- The first algorithm deals with item dependent constraints, the second one with context dependent ones.
- We evaluated the incremental algorithms on a fairly large dataset. The results show that the approach drastically reduces the execution time.
An interesting direction for future research: the integration of condensed representations with these incremental techniques.

The end. Questions?

Condensed representation
It is well known that the set of association rules can rapidly grow to be unwieldy, especially as the frequency bound decreases. Since most of these rules turn out to be redundant, it is not necessary to mine rules from all frequent itemsets: it is sufficient to consider only the rules among closed frequent itemsets. In fact, the frequent closed itemsets are a small subset (a condensed representation) of the frequent itemsets, without information loss. For this reason, mining the frequent closed itemsets instead of all frequent itemsets brings great advantages.
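
As a small illustration of the point, here is a Python sketch that extracts the closed itemsets from a table of frequent itemsets with their supports, reusing the itemsets of the earlier example; the helper name is ours.

```python
# Frequent itemsets with their support counts (from the earlier example slide).
frequent = {
    frozenset({"jacket"}): 3,
    frozenset({"ski pants"}): 2,
    frozenset({"jacket", "ski pants"}): 2,
}

def closed_itemsets(frequent):
    """An itemset is closed if no proper superset has the same support."""
    return {
        itemset: supp
        for itemset, supp in frequent.items()
        if not any(itemset < other and supp == other_supp
                   for other, other_supp in frequent.items())
    }

closed = closed_itemsets(frequent)
# {jacket}: 3 and {jacket, ski pants}: 2 are closed; {ski pants} is not,
# because its superset {jacket, ski pants} has the same support, so its
# support can be recovered from the closed sets without information loss.
```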