Chapter 11 Statistical Techniques. Data Warehouse and Data Mining Chapter 11 2 Chapter Objectives  Understand when linear regression is an appropriate.

Presentation transcript:

Chapter 11 Statistical Techniques

Data Warehouse and Data Mining, Chapter 11

Chapter Objectives
- Understand when linear regression is an appropriate data mining technique.
- Know how to perform linear regression with Microsoft Excel's LINEST function.
- Know that logistic regression can be used to build supervised learner models for datasets having a binary outcome.
- Understand how Bayes classifier is able to build supervised models for datasets having categorical data, numeric data, or a combination of both data types.

- Know how agglomerative clustering is applied to partition data instances into disjoint clusters.
- Understand that conceptual clustering is an unsupervised data mining technique that builds a concept hierarchy to partition data instances.
- Know that the EM algorithm uses a statistical parameter adjustment technique to cluster data instances.
- Understand the basic features that differentiate statistical and machine learning data mining methods.

Linear Regression Analysis

Logistic Regression

Bayes Classifier

Clustering Algorithms

Heuristics or Statistics?
Here is one way to categorize inductive problem-solving methods:
- Query and visualization techniques
- Machine learning techniques
- Statistical techniques
Query and visualization techniques generally fall into one of three groups:
- Query tools
- OLAP tools
- Visualization tools

Chapter Summary
Data mining techniques come in many shapes and forms. A favorite statistical technique for estimation and prediction problems is linear regression. Linear regression attempts to model the variation in a dependent variable as a linear combination of one or more independent variables. Linear regression is an appropriate data mining strategy when the relationship between the dependent and independent variables is nearly linear. Microsoft Excel's LINEST function provides an easy mechanism for performing multiple linear regression.
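As a rough illustration of the least-squares idea behind linear regression, the sketch below fits a simple (single-variable) regression line in plain Python. The function name `fit_line` and the sample data are assumptions made for this example; it does not use or reproduce Excel's LINEST.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical, exactly linear data generated from y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Because the sample data lie exactly on a line, the fitted slope and intercept recover the generating values; with noisy data they would only approximate them.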

Linear regression is a poor choice when the outcome is binary. The problem lies in the fact that the value restriction placed on the dependent variable is not observed by the regression equation. That is, because linear regression produces a straight-line function, values of the dependent variable are unbounded in both the positive and negative directions. For the two-outcome case, logistic regression is a better choice. Logistic regression is a nonlinear regression technique that associates a conditional probability value with each data instance.
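The contrast between the unbounded straight line and the bounded logistic curve can be sketched as follows. The coefficients `a` and `b` are hypothetical values chosen only for illustration, not fitted to any dataset.

```python
import math


def linear(x, a, b):
    """Straight-line model: unbounded in both directions."""
    return a * x + b


def logistic(x, a, b):
    """Logistic model: squashes a*x + b into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(a * x + b)))


# Hypothetical coefficients, chosen only for illustration
a, b = 0.8, -2.0
for x in (-10, 0, 10):
    print(x, linear(x, a, b), logistic(x, a, b))
```

The linear output leaves the range 0 to 1 at both ends of the input range, while the logistic output stays strictly inside it and can therefore be read as a probability.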

Bayes classifier offers a simple yet powerful supervised classification technique. The model assumes all input attributes to be of equal importance and independent of one another. Even though these assumptions are likely to be false, Bayes classifier still works quite well in practice. Bayes classifier can be applied to datasets containing both categorical and numeric data. Also, unlike many statistical classifiers, Bayes classifier can be applied to datasets containing a wealth of missing items.
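A minimal sketch of a Bayes classifier for categorical attributes is shown below. The data, attribute meanings, and function names are hypothetical. Missing items are represented as `None` and simply skipped, and no smoothing is applied, so an unseen attribute value drives a class score to zero; a practical implementation would usually handle that case.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            if value is not None:  # missing items are skipped
                value_counts[(label, i)][value] += 1
    return class_counts, value_counts


def classify(row, class_counts, value_counts):
    """Return the best class and the score P(class) * product of P(value | class)."""
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # a priori probability of the class
        for i, value in enumerate(row):
            if value is None:
                continue
            seen = sum(value_counts[(label, i)].values())
            if seen == 0:
                continue  # no evidence for this attribute in this class
            score *= value_counts[(label, i)][value] / seen
        scores[label] = score
    return max(scores, key=scores.get), scores


# Hypothetical training data: (responded to promotion, sex) -> bought insurance
rows = [("yes", "male"), ("yes", "female"), ("no", "male"),
        ("yes", "female"), ("no", None), ("no", "male")]
labels = ["yes", "yes", "no", "yes", "no", "no"]

class_counts, value_counts = train(rows, labels)
label, scores = classify(("yes", "female"), class_counts, value_counts)
print(label, scores)
```

Multiplying the per-attribute probabilities is where the independence assumption enters: each attribute contributes to the score without regard to the others.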

Agglomerative clustering is a favorite unsupervised clustering technique. Agglomerative clustering begins by assuming each data instance represents its own cluster. Each iteration of the algorithm merges the most similar pair of clusters. Several options for computing instance and cluster similarity scores and cluster merging procedures exist. Also, when the data to be clustered is real-valued, defining a measure of instance similarity can be a challenge. One common approach is to use simple Euclidean distance. A widespread application of agglomerative clustering is its use as a prelude to other clustering techniques.
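The merge loop described above can be sketched in a few lines. This version makes two assumptions that the text leaves open: cluster similarity is measured by single-link Euclidean distance (the closest pair of members), and merging stops at a caller-chosen number of clusters rather than continuing to a single cluster. The points are made up for the example.

```python
import math


def cluster_distance(c1, c2):
    """Single-link distance: smallest Euclidean distance between members."""
    return min(math.dist(p, q) for p in c1 for q in c2)


def agglomerate(points, target_clusters):
    """Merge the closest pair of clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # each instance starts as its own cluster
    while len(clusters) > target_clusters:
        # Find the most similar (closest) pair of clusters
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return clusters


# Hypothetical two-dimensional instances
points = [(1.0, 1.0), (1.5, 1.0), (5.0, 5.0), (5.5, 5.2), (9.0, 1.0)]
result = agglomerate(points, 3)
print(result)
```

Swapping `min` for `max` or an average inside `cluster_distance` would give complete-link or average-link merging, which is one of the "several options" the summary mentions.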

Conceptual clustering is an unsupervised technique that incorporates incremental learning to form a hierarchy of concepts. The concept hierarchy takes the form of a tree structure where the root node represents the highest level of concept generalization. Conceptual clustering systems are particularly appealing because the trees they form have been shown to consistently determine psychologically preferred levels in human classification hierarchies. Also, conceptual clustering systems lend themselves well to explaining their behavior. A major problem with conceptual clustering systems is that instance ordering can have a marked impact on the results of the clustering. A nonrepresentative ordering of data instances can lead to a less than optimal clustering.

The EM (expectation-maximization) algorithm is a statistical technique that makes use of the finite Gaussian mixtures model. The mixtures model assigns each individual data instance a probability that it would have a certain set of attribute values given it was a member of a specified cluster. The model assumes all attributes to be independent random variables. The EM algorithm is similar to the K-Means procedure in that a set of parameters is recomputed until a desired convergence value is achieved. A lack of explanation about what has been discovered is a problem with EM, as it is with many clustering systems. Applying a supervised model to analyze the results of an unsupervised clustering is one technique to help explain the results of an EM clustering.
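To make the parameter-recomputation loop concrete, here is a small sketch of EM for a mixture of two Gaussians over a single numeric attribute. Several things are assumptions of this example rather than part of the chapter: the data, the fixed iteration count used in place of a convergence test, the initial guesses, and the variance floor.

```python
import math


def gaussian(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_two_clusters(data, iterations=50):
    # Crude initial guess (an assumption): start the means at the extremes
    means = [min(data), max(data)]
    variances = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: probability that each instance belongs to each cluster
        resp = []
        for x in data:
            p = [weights[k] * gaussian(x, means[k], variances[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate the parameters from the weighted instances
        for k in range(2):
            nk = sum(r[k] for r in resp)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                1e-6,  # floor keeps the variance from collapsing to zero
            )
            weights[k] = nk / len(data)
    return means, variances, weights


# Hypothetical data drawn around two well-separated centers
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.3, 4.9, 5.1]
means, variances, weights = em_two_clusters(data)
print(means, variances, weights)
```

Unlike K-Means, each instance contributes to both clusters in proportion to its membership probability, which is why the E-step stores a pair of weights per instance rather than a single assignment.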

Key Terms
A priori probability. The probability a hypothesis is true lacking evidence to support or reject the hypothesis.
Agglomerative clustering. An unsupervised technique where each data instance initially represents its own cluster. Successive iterations of the algorithm merge pairs of highly similar clusters until all instances become members of a single cluster. In the last step, a decision is made about which clustering is the best final result.
Basic-level nodes. The nodes in a concept hierarchy that represent concepts easily identified by humans.

Bayes classifier. A supervised learning approach that classifies new instances by using Bayes theorem.
Bayes theorem. The probability of a hypothesis given some evidence is equal to the probability of the evidence given the hypothesis, times the probability of the hypothesis, divided by the probability of the evidence.
Bayesian Information Criterion (BIC). The BIC gives the posterior odds for one data mining model against another model assuming neither model is favored initially.
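The verbal statement of Bayes theorem above corresponds to P(H | E) = P(E | H) P(H) / P(E). A short arithmetic sketch, using made-up probabilities chosen only to keep the numbers simple:

```python
def posterior(p_e_given_h, p_h, p_e):
    """Bayes theorem: P(H | E) = P(E | H) * P(H) / P(E)."""
    return p_e_given_h * p_h / p_e


# Hypothetical probabilities, chosen only to make the arithmetic easy
p_h = 0.4          # a priori probability of the hypothesis
p_e_given_h = 0.5  # conditional probability of the evidence given the hypothesis
p_e = 0.25         # probability of the evidence
result = posterior(p_e_given_h, p_h, p_e)
print(result)
```

Here the evidence raises the probability of the hypothesis from its a priori value of 0.4 to 0.5 * 0.4 / 0.25 = 0.8.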

Category utility. An unsupervised evaluation function that measures the gain in the "expected number" of correct attribute-value predictions for a specific object if it were placed within a given category or cluster.
Coefficient of determination. For a regression analysis, the squared correlation between actual and estimated values for the dependent variable.
Concept hierarchy. A tree structure where each node of the tree represents a concept at some level of abstraction. Nodes toward the top of the tree are the most general. Leaf nodes represent individual data instances.

Conceptual clustering. An incremental unsupervised clustering method that creates a concept hierarchy from a set of input instances.
Conditional probability. The conditional probability of evidence E given hypothesis H, denoted by P(E | H), is the probability E is true given H is true.
Incremental learning. A form of learning that is supported in an unsupervised environment where instances are presented sequentially. As each new instance is seen, the learning model is modified to reflect the addition of the new instance.

Linear regression. A statistical technique that models the variation in a numeric dependent variable as a linear combination of one or several independent variables.
Logistic regression. A nonlinear regression technique for problems having a binary outcome. A created regression equation limits the values of the output attribute to values between 0 and 1. This allows output values to represent a probability of class membership.

Logit. The natural logarithm of the odds p(y = 1 | x) / [1 - p(y = 1 | x)], where p(y = 1 | x) is the conditional probability that the output y equals 1 given feature vector x.
Mixture. A set of n probability distributions where each distribution represents a cluster.
Model tree. A decision tree where each leaf node contains a linear regression equation.
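The logit definition above can be checked numerically with a short sketch. The function names are assumptions made for this example; the inverse shown is the standard logistic function.

```python
import math


def logit(p):
    """Natural log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Map a logit value back to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


print(logit(0.5))                  # even odds
print(inverse_logit(logit(0.8)))   # round trip back to the probability
```

A probability of 0.5 corresponds to even odds and therefore a logit of zero; probabilities above 0.5 give positive logits and probabilities below 0.5 give negative ones.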

Regression. The process of developing an expression that predicts a numeric output value.
Regression tree. A decision tree where leaf nodes contain averaged numeric values.
Simple linear regression. A regression equation with a single independent variable.
Slope-intercept form. A linear equation of the form y = ax + b where a is the slope of the line and b is the y-intercept.


Chapter 10 Association Rules

Content
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouse
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

What Is Association Mining?
- Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, clustering, classification

Association Rule: Basic Concepts
- Given: (1) a database of transactions, (2) each transaction is a list of items (purchased by a customer in a visit)
- Find: all rules that correlate the presence of one set of items with that of another set of items
  - E.g., 98% of people who purchase tires and auto accessories also get automotive services done

Applications
- * => Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
- Home Electronics => * (what other products the store should stock up on)
- Attached mailing in direct marketing

Rule Measures: Support and Confidence
Find all the rules X & Y => Z with minimum confidence and support
- support, s: probability that a transaction contains all of X, Y, and Z
- confidence, c: conditional probability that a transaction having {X & Y} also contains Z
[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

With minimum support 50% and minimum confidence 50%, we have
- A => C (50%, 66.6%)
- C => A (50%, 100%)

Mining Association Rules: An Example
Min. support 50%, min. confidence 50%
For rule A => C:
- support = support({A & C}) = 2/4 = 50%
- confidence = support({A & C}) / support({A}) = 2/3 = 66.6%
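The support and confidence figures above can be reproduced with a short sketch. The transaction table from the slide is not part of this transcript, so the four transactions below are a hypothetical database chosen to be consistent with the stated fractions (2/4 and 2/3); the function names are also assumptions.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical database consistent with the fractions quoted above
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

print(support({"A", "C"}, transactions))       # support of A => C and C => A
print(confidence({"A"}, {"C"}, transactions))  # confidence of A => C
print(confidence({"C"}, {"A"}, transactions))  # confidence of C => A
```

Support is symmetric in the two sides of a rule, while confidence is not, which is why A => C and C => A share one support value but have different confidences.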

Mining Frequent Itemsets: the Key Step
- Find the frequent itemsets: the sets of items that have minimum support
  - The Apriori principle: any subset of a frequent itemset must also be a frequent itemset, i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
- Use the frequent itemsets to generate association rules.

The Apriori Algorithm
- Join Step: C_k is generated by joining L_{k-1} with itself
- Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
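The join and prune steps can be sketched as a single candidate-generation function. This is a simplified join (it unions any two (k-1)-itemsets whose union has exactly k items) rather than the prefix-based join often used in practice; after pruning, both produce the same candidates. The function name and the sample L_2 are hypothetical.

```python
from itertools import combinations


def generate_candidates(prev_frequent, k):
    """Build C_k from L_{k-1}: join, then prune."""
    prev = set(prev_frequent)
    # Join step: union pairs of (k-1)-itemsets whose union has exactly k items
    joined = {a | b for a in prev for b in prev if len(a | b) == k}
    # Prune step: drop candidates that have an infrequent (k-1)-subset
    return {
        c for c in joined
        if all(frozenset(s) in prev for s in combinations(c, k - 1))
    }


# Hypothetical L_2, used only for illustration
l2 = {frozenset("AB"), frozenset("AC"), frozenset("BC"), frozenset("BD")}
c3 = generate_candidates(l2, 3)
print(c3)
```

In this example the join proposes {A, B, C}, {A, B, D}, and {B, C, D}; the last two are pruned because {A, D} and {C, D} are not in L_2, so only {A, B, C} needs to be counted against the database.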

The Apriori Algorithm
Pseudo-code:
    C_k: candidate itemsets of size k
    L_k: frequent itemsets of size k

    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = candidates in C_{k+1} with min_support
    end
    return ∪_k L_k;
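One possible Python rendering of the pseudo-code is sketched below. It is written for clarity rather than efficiency (the database is rescanned for every candidate), and the transaction data and function name are assumptions made for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all frequent itemsets."""
    min_count = min_support * len(transactions)
    # L_1: frequent single items
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        count = sum(item in t for t in transactions)
        if count >= min_count:
            current[frozenset([item])] = count
    frequent = dict(current)
    k = 1
    while current:
        # C_{k+1}: unions of frequent k-itemsets whose k-subsets are all frequent
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k))
        }
        # Scan the database and count each candidate
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # L_{k+1}: candidates that meet minimum support
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transaction database
transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
result = apriori(transactions, 0.5)
print(result)
```

The loop ends when a level produces no frequent itemsets, matching the `L_k != ∅` condition in the pseudo-code, and the returned dictionary plays the role of the union of all L_k.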

The Apriori Algorithm: Example
[Figure: database D is scanned repeatedly to produce candidate sets C1, C2, C3 and frequent itemsets L1, L2, L3]

Generating Association Rules
Confidence and Support
- Milk
- Cheese
- Bread
- Eggs
Possible associations include the following:
1. If customers purchase milk they also purchase bread.
2. If customers purchase bread they also purchase milk.
3. If customers purchase milk and eggs they also purchase cheese and bread.
4. If customers purchase milk, cheese, and eggs they also purchase bread.
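Rules like the ones listed above can be enumerated from a single itemset by trying every non-empty proper subset as the antecedent and keeping those that meet a confidence threshold. The baskets, threshold, and function name below are hypothetical, and the sketch assumes the itemset occurs in at least one basket.

```python
from itertools import combinations


def rules_from_itemset(itemset, transactions, min_confidence):
    """Return (antecedent, consequent, confidence) for rules built from itemset."""
    itemset = frozenset(itemset)

    def count(s):
        return sum(s <= t for t in transactions)

    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            # Confidence = count(whole itemset) / count(antecedent)
            conf = count(itemset) / count(antecedent)
            if conf >= min_confidence:
                rules.append((antecedent, itemset - antecedent, conf))
    return rules


# Hypothetical market baskets
baskets = [
    {"milk", "bread"},
    {"milk", "bread", "eggs"},
    {"milk", "cheese"},
    {"bread", "eggs"},
]
rules = rules_from_itemset({"milk", "bread"}, baskets, 0.6)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "=>", sorted(consequent), round(conf, 3))
```

For the two-item set {milk, bread} this yields the two directions of the rule, corresponding to associations 1 and 2 in the list; larger itemsets produce rules with multi-item antecedents or consequents, as in associations 3 and 4.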

Generating Association Rules
Mining Association Rules: An Example
Two possible two-item set rules are:
Here are three of several possible three-item set rules:

Reference
Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques (slides for textbook Chapter 6), Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada