Data Science Algorithms: The Basic Methods


Data Science Algorithms: The Basic Methods
Mining Association Rules
WFH: Data Mining, Chapter 4.5
Rodney Nielsen
Many of these slides were adapted from: I. H. Witten, E. Frank and M. A. Hall

Algorithms: The Basic Methods
- Inferring rudimentary rules
- Naïve Bayes, probabilistic model
- Constructing decision trees
- Constructing rules
- Association rule learning
- Linear models
- Instance-based learning
- Clustering

Mining Association Rules
Baseline method for finding association rules:
- Use the separate-and-conquer method
- Treat every possible combination of attribute values as a separate class
Two problems:
- Computational complexity
- Resulting number of rules (which would have to be pruned on the basis of support and confidence)
But: we can look for high-support rules directly!

Item Sets
- Support: number of instances correctly covered by an association rule
  - The same as the number of instances covered by all tests in the rule (LHS and RHS!)
- Item: one test / attribute-value pair
- Item set: all items occurring in a rule
- Goal: only rules that exceed a pre-defined support
  - Do it by finding all item sets with the given minimum support and generating rules from them!

Weather Data

Outlook   Temp  Humidity  Windy  Play
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Item Sets for Weather Data
Example item sets (support in parentheses):
- One-item sets: Outlook = Sunny (5); Temperature = Cool (4); ...
- Two-item sets: Outlook = Sunny, Temperature = Hot (2); Outlook = Sunny, Humidity = High (3); ...
- Three-item sets: Outlook = Sunny, Temperature = Hot, Humidity = High (2); Outlook = Sunny, Humidity = High, Windy = False (2); ...
- Four-item sets: Outlook = Sunny, Temperature = Hot, Humidity = High, Play = No (2); Outlook = Rainy, Temperature = Mild, Windy = False, Play = Yes (2); ...
In total: 12 one-item sets, 47 two-item sets, 39 three-item sets, 6 four-item sets, and 0 five-item sets (with minimum support of two)
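The totals quoted above can be reproduced by brute force. The sketch below hard-codes the standard 14-instance weather dataset from Witten, Frank & Hall and counts the k-item sets that meet a minimum support of two; the function and variable names are illustrative, not from the slides.

```python
from itertools import combinations

# The standard 14-instance weather dataset (Witten, Frank & Hall).
DATA = [
    ("Sunny", "Hot", "High", "False", "No"),
    ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"),
    ("Rainy", "Mild", "High", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"),
    ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"),
    ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"),
    ("Rainy", "Mild", "High", "True", "No"),
]
ATTRS = ("Outlook", "Temp", "Humidity", "Windy", "Play")

# Each instance becomes a set of items, i.e. (attribute, value) pairs.
transactions = [{(a, v) for a, v in zip(ATTRS, row)} for row in DATA]

def support(item_set):
    """Number of instances containing every item in the set."""
    return sum(1 for t in transactions if item_set <= t)

def frequent_item_sets(k, min_support=2):
    """Brute force: all k-item sets meeting the minimum support."""
    items = sorted(set().union(*transactions))
    return [frozenset(c) for c in combinations(items, k)
            if support(set(c)) >= min_support]

counts = [len(frequent_item_sets(k)) for k in range(1, 6)]
print(counts)  # → [12, 47, 39, 6, 0], matching the totals on the slide
```

Brute force is fine for 12 distinct items; the Apriori merging idea later in the deck is what makes this scale.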

Generating Rules From an Item Set
Once all item sets with minimum support have been generated, we can turn them into rules
Example item set: Humidity = Normal, Windy = False, Play = Yes (4)
Seven (2^3 − 1) potential rules, with confidence:
If Humidity = Normal and Windy = False then Play = Yes (4/4)
If Humidity = Normal and Play = Yes then Windy = False (4/6)
If Windy = False and Play = Yes then Humidity = Normal (4/6)
If Humidity = Normal then Windy = False and Play = Yes (4/7)
If Windy = False then Humidity = Normal and Play = Yes (4/8)
If Play = Yes then Humidity = Normal and Windy = False (4/9)
If True then Humidity = Normal and Windy = False and Play = Yes (4/14)
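The rule enumeration above is mechanical: every non-empty subset of the item set can serve as a consequent, with the rest as antecedent, and the confidence is the item set's support divided by the antecedent's support. A minimal sketch, with the relevant support counts entered by hand from the weather data:

```python
from itertools import combinations

# Support counts for the relevant subsets, read off the weather data.
SUPPORT = {
    frozenset(): 14,  # empty antecedent covers all 14 instances
    frozenset({"Humidity=Normal"}): 7,
    frozenset({"Windy=False"}): 8,
    frozenset({"Play=Yes"}): 9,
    frozenset({"Humidity=Normal", "Windy=False"}): 4,
    frozenset({"Humidity=Normal", "Play=Yes"}): 6,
    frozenset({"Windy=False", "Play=Yes"}): 6,
    frozenset({"Humidity=Normal", "Windy=False", "Play=Yes"}): 4,
}

item_set = frozenset({"Humidity=Normal", "Windy=False", "Play=Yes"})

rules = []
# Every non-empty consequent yields one rule: antecedent -> consequent.
for r in range(1, len(item_set) + 1):
    for consequent in combinations(sorted(item_set), r):
        antecedent = item_set - set(consequent)
        conf = SUPPORT[item_set] / SUPPORT[antecedent]
        rules.append((sorted(antecedent), sorted(consequent), conf))

print(len(rules))  # 2^3 - 1 = 7
for ante, cons, conf in rules:
    print(f"If {ante} then {cons}: {conf:.2f}")
```

The confidences come out as 4/4, 4/6, 4/6, 4/7, 4/8, 4/9 and 4/14, as on the slide.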

Rules for Weather Data
Rules with support > 1 and confidence = 100%:

 #   Association rule                                       Sup.  Conf.
 1   Humidity = Normal, Windy = False ⇒ Play = Yes           4    100%
 2   Temperature = Cool ⇒ Humidity = Normal                  4    100%
 3   Outlook = Overcast ⇒ Play = Yes                         4    100%
 4   Temperature = Cool, Play = Yes ⇒ Humidity = Normal      3    100%
 ...
 58  Outlook = Sunny, Temperature = Hot ⇒ Humidity = High    2    100%

In total:
- 3 rules with support four
- 5 with support three
- 50 with support two

Example Rules From the Same Set
Item set: Temperature = Cool, Humidity = Normal, Windy = False, Play = Yes (2)
Resulting rules with 100% confidence:
Temperature = Cool, Windy = False ⇒ Humidity = Normal, Play = Yes
Temperature = Cool, Windy = False, Humidity = Normal ⇒ Play = Yes
Temperature = Cool, Windy = False, Play = Yes ⇒ Humidity = Normal

Generating Item Sets Efficiently
How can we efficiently find all frequent item sets?
- Finding one-item sets is easy
- Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, ...
  - If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  - In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
  - Compute k-item sets by merging (k−1)-item sets

Example
Given: five three-item sets (lexicographically ordered!):
(A B C), (A B D), (A C D), (A C E), (B C D)
Candidate four-item sets:
(A B C D) — OK: (A B C), (A B D), (A C D), (B C D) are all frequent
(A C D E) — Not OK because of (C D E)
Final check by counting instances in the dataset!
(k−1)-item sets are stored in a hash table
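The merge-and-prune step in this example can be sketched directly. This is a minimal, illustrative implementation of the Apriori candidate-generation idea: merge (k−1)-item sets that share their first k−2 items, then discard any candidate with an infrequent (k−1)-subset (a plain Python set stands in for the hash table).

```python
from itertools import combinations

def generate_candidates(frequent_k_minus_1):
    """Merge lexicographically ordered (k-1)-item sets that share their
    first k-2 items, then prune candidates with an infrequent subset."""
    frequent = set(frequent_k_minus_1)  # stands in for the hash table
    ordered = sorted(frequent_k_minus_1)
    candidates = []
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            a, b = ordered[i], ordered[j]
            if a[:-1] != b[:-1]:
                continue  # the first k-2 items must match to merge
            cand = a + (b[-1],)
            # Prune: every (k-1)-subset of the candidate must be frequent.
            if all(sub in frequent
                   for sub in combinations(cand, len(cand) - 1)):
                candidates.append(cand)
    return candidates

three_sets = [("A", "B", "C"), ("A", "B", "D"), ("A", "C", "D"),
              ("A", "C", "E"), ("B", "C", "D")]
print(generate_candidates(three_sets))
# Only (A B C D) survives; (A C D E) is pruned because subsets such as
# (C D E) are not frequent.
```

The surviving candidates still need the final support check against the dataset, exactly as the slide says.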

Generating Rules Efficiently
We are looking for all high-confidence rules
- Support of the antecedent can be obtained from the hash table
- But: the brute-force method considers 2^N − 1 candidate rules per N-item set
Better way: build (c + 1)-consequent rules from c-consequent ones
- Observation: a (c + 1)-consequent rule can only hold if all corresponding c-consequent rules also hold
Resulting algorithm is similar to the procedure for large item sets

Example
1-consequent rules:
If Outlook = Sunny and Windy = False and Play = No then Humidity = High (2/2)
If Humidity = High and Windy = False and Play = No then Outlook = Sunny (2/2)
Corresponding 2-consequent rule:
If Windy = False and Play = No then Outlook = Sunny and Humidity = High (2/2)
Final check of antecedent against hash table!
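The merge behind this example can be sketched with the support counts implied by the slide's 2/2 confidences (entered by hand below; in a real implementation they would come from the item-set hash table):

```python
# Support counts read off the weather data for the relevant subsets.
SUPPORT = {
    frozenset({"Outlook=Sunny", "Humidity=High", "Windy=False", "Play=No"}): 2,
    frozenset({"Outlook=Sunny", "Windy=False", "Play=No"}): 2,
    frozenset({"Humidity=High", "Windy=False", "Play=No"}): 2,
    frozenset({"Windy=False", "Play=No"}): 2,
}
ITEM_SET = frozenset({"Outlook=Sunny", "Humidity=High", "Windy=False", "Play=No"})

def confidence(consequent):
    """Confidence of (ITEM_SET - consequent) => consequent."""
    antecedent = ITEM_SET - consequent
    return SUPPORT[ITEM_SET] / SUPPORT[antecedent]

# The two 1-consequent rules that hold with 100% confidence:
one_consequents = [frozenset({"Humidity=High"}), frozenset({"Outlook=Sunny"})]
assert all(confidence(c) == 1.0 for c in one_consequents)

# Merge them: the candidate 2-consequent rule only needs checking
# because both of its 1-consequent "parents" held.
candidate = one_consequents[0] | one_consequents[1]
print(confidence(candidate))  # 2/2 = 1.0
```

This mirrors the item-set algorithm: candidate consequents are built by merging, and only candidates whose smaller-consequent rules all held are evaluated.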


Discussion
- Standard ARFF format is very inefficient for typical market-basket data
  - Attributes represent items in a basket and most items are usually missing
  - Data should be represented in sparse format
- Instances are also called transactions
- Confidence is not necessarily the best measure
  - Example: milk occurs in almost every supermarket transaction
  - Other measures have been devised (e.g., lift)
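The milk point can be made concrete with lift, defined as confidence(LHS ⇒ RHS) divided by the support of RHS, so lift ≈ 1 means the rule does no better than chance. The baskets below are invented for illustration, not from the slides:

```python
# Toy market-basket data (illustrative, not from the slides): milk is in
# almost every basket, so rules predicting milk look deceptively strong.
transactions = [
    {"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "eggs"},
    {"milk"}, {"milk", "beer"}, {"milk", "bread"}, {"eggs"},
    {"milk", "bread"}, {"milk", "eggs"}, {"milk", "bread", "beer"},
]

def support(items):
    """Fraction of transactions containing all the given items."""
    return sum(1 for t in transactions if items <= t) / len(transactions)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    # Lift > 1: LHS and RHS co-occur more often than if independent.
    return confidence(lhs, rhs) / support(rhs)

# "bread => milk" has 100% confidence here...
print(confidence({"bread"}, {"milk"}))
# ...but milk is in 9/10 baskets anyway, so the lift is only ~1.11.
print(lift({"bread"}, {"milk"}))
```

With sparse transaction data like this, high confidence toward a near-ubiquitous item is expected by chance; lift corrects for the consequent's base rate.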

Association Rules: Discussion
- The above method makes one pass through the data for each different item-set size
- Other possibility: generate (k+2)-item sets just after the (k+1)-item sets have been generated
  - Result: more (k+2)-item sets than necessary will be considered, but fewer passes through the data
  - Makes sense if the data is too large for main memory
- Practical issue: generating a certain number of rules (e.g., by incrementally reducing the minimum support)