
1 Chapter 11 Statistical Techniques

2 Data Warehouse and Data Mining, Chapter 11: Chapter Objectives
- Understand when linear regression is an appropriate data mining technique.
- Know how to perform linear regression with Microsoft Excel's LINEST function.
- Know that logistic regression can be used to build supervised learner models for datasets having a binary outcome.
- Understand how the Bayes classifier is able to build supervised models for datasets having categorical data, numeric data, or a combination of both data types.

3 Chapter Objectives
- Know how agglomerative clustering is applied to partition data instances into disjoint clusters.
- Understand that conceptual clustering is an unsupervised data mining technique that builds a concept hierarchy to partition data instances.
- Know that the EM algorithm uses a statistical parameter-adjustment technique to cluster data instances.
- Understand the basic features that differentiate statistical and machine learning data mining methods.

4–8 Linear Regression Analysis (slide content not captured in transcript)

9–10 Logistic Regression (slide content not captured in transcript)

11–13 Bayes Classifier (slide content not captured in transcript)

14–19 Clustering Algorithms (slide content not captured in transcript)

20 Heuristics or Statistics?
Here is one way to categorize inductive problem-solving methods:
- Query and visualization techniques
- Machine learning techniques
- Statistical techniques
Query and visualization techniques generally fall into one of three groups:
- Query tools
- OLAP tools
- Visualization tools

21 Chapter Summary
Data mining techniques come in many shapes and forms. A favorite statistical technique for estimation and prediction problems is linear regression. Linear regression attempts to model the variation in a dependent variable as a linear combination of one or more independent variables. Linear regression is an appropriate data mining strategy when the relationship between the dependent and independent variables is nearly linear. Microsoft Excel's LINEST function provides an easy mechanism for performing multiple linear regression.
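The slides perform multiple linear regression with Excel's LINEST; an equivalent least-squares fit can be sketched in Python with NumPy. The data below is invented and follows y = 2·x1 + x2 + 1 exactly, so the recovered coefficients can be checked against the known values.

```python
# A minimal multiple-linear-regression sketch using NumPy's least-squares
# solver -- conceptually what Excel's LINEST computes. Data values are invented.
import numpy as np

# Two independent variables and a dependent variable following y = 2*x1 + x2 + 1.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.0, 6.0, 11.0, 12.0, 16.0])

# Append a column of ones so the model includes an intercept term.
A = np.column_stack([X, np.ones(len(X))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)

slopes, intercept = coeffs[:-1], coeffs[-1]
predictions = A @ coeffs   # estimated values of the dependent variable
```

Because the invented data is exactly linear, the fit recovers slopes (2, 1) and intercept 1; with real data the residuals would be nonzero and the coefficient of determination would measure fit quality.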

22 Chapter Summary
Linear regression is a poor choice when the outcome is binary. The problem lies in the fact that the value restriction placed on the dependent variable is not observed by the regression equation. That is, because linear regression produces a straight-line function, values of the dependent variable are unbounded in both the positive and negative directions. For the two-outcome case, logistic regression is a better choice. Logistic regression is a nonlinear regression technique that associates a conditional probability value with each data instance.
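The bounding argument above can be illustrated directly: the logistic (sigmoid) function maps any value of the unbounded linear part into (0, 1), so the output can be read as a probability of class membership. The coefficients below are hypothetical, not fitted to real data.

```python
# A minimal sketch of why logistic regression bounds its output: the sigmoid
# squashes the unbounded linear combination into the interval (0, 1).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

b0, b1 = -4.0, 0.08            # hypothetical intercept and slope
for x in (0, 50, 100):
    z = b0 + b1 * x            # unbounded linear part
    p = sigmoid(z)             # bounded value, 0 < p < 1
    print(f"x={x:3d}  linear={z:+.1f}  p={p:.3f}")
```

A plain linear fit would return z itself, which can fall outside [0, 1]; the sigmoid transform is what makes the probability interpretation valid.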

23 Chapter Summary
Bayes classifier offers a simple yet powerful supervised classification technique. The model assumes all input attributes to be of equal importance and independent of one another. Even though these assumptions are likely to be false, Bayes classifier still works quite well in practice. Bayes classifier can be applied to datasets containing both categorical and numeric data. Also, unlike many statistical classifiers, Bayes classifier can be applied to datasets containing a wealth of missing items.
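The independence assumption described above makes the classifier easy to sketch: each class is scored by its prior times the product of per-attribute conditional probabilities. The tiny weather-style dataset below is invented for illustration (no smoothing, for clarity).

```python
# A minimal naive Bayes sketch on an invented categorical dataset.
# Assumes attribute independence given the class, as the summary notes.
from collections import Counter, defaultdict

data = [  # (outlook, windy, play)
    ("sunny", "no", "yes"), ("sunny", "yes", "no"),
    ("rain",  "no", "yes"), ("rain",  "yes", "no"),
    ("sunny", "no", "yes"),
]

classes = Counter(row[-1] for row in data)
counts = defaultdict(Counter)          # (attribute index, class) -> value counts
for *attrs, label in data:
    for i, v in enumerate(attrs):
        counts[(i, label)][v] += 1

def classify(attrs):
    """Pick the class maximizing P(class) * prod P(attr=value | class)."""
    scores = {}
    for c, n in classes.items():
        p = n / len(data)              # prior probability of the class
        for i, v in enumerate(attrs):
            p *= counts[(i, c)][v] / n # conditional probability of each value
        scores[c] = p
    return max(scores, key=scores.get)

print(classify(("sunny", "no")))
```

A production version would add Laplace smoothing so unseen attribute values do not zero out a class score, and would model numeric attributes with a Gaussian density.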

24 Chapter Summary
Agglomerative clustering is a favorite unsupervised clustering technique. Agglomerative clustering begins by assuming each data instance represents its own cluster. Each iteration of the algorithm merges the most similar pair of clusters. Several options for computing instance and cluster similarity scores and cluster merging procedures exist. Also, when the data to be clustered is real-valued, defining a measure of instance similarity can be a challenge. One common approach is to use simple Euclidean distance. A widespread application of agglomerative clustering is its use as a prelude to other clustering techniques.
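The merge loop described above can be sketched in a few lines. This version uses Euclidean distance with single-link (nearest-member) cluster similarity, one of the several options the summary mentions; the points and the stopping criterion of two clusters are chosen for illustration.

```python
# A minimal agglomerative-clustering sketch: single-link merging with
# Euclidean distance. Each instance starts as its own cluster.
import math

points = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (0.9, 1.1)]
clusters = [[p] for p in points]          # every instance is its own cluster

def single_link(a, b):
    """Cluster similarity = smallest pairwise distance between members."""
    return min(math.dist(p, q) for p in a for q in b)

while len(clusters) > 2:                  # merge until 2 clusters remain
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]),
    )
    clusters[i].extend(clusters.pop(j))   # merge the most similar pair

print(sorted(len(c) for c in clusters))
```

Complete-link or average-link similarity could be swapped in by replacing `single_link`; with this data the three points near (1, 1) and the two near (5, 5) end up in separate clusters either way.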

25 Chapter Summary
Conceptual clustering is an unsupervised technique that incorporates incremental learning to form a hierarchy of concepts. The concept hierarchy takes the form of a tree structure where the root node represents the highest level of concept generalization. Conceptual clustering systems are particularly appealing because the trees they form have been shown to consistently determine psychologically preferred levels in human classification hierarchies. Also, conceptual clustering systems lend themselves well to explaining their behavior. A major problem with conceptual clustering systems is that instance ordering can have a marked impact on the results of the clustering. A nonrepresentative ordering of data instances can lead to a less than optimal clustering.

26 Chapter Summary
The EM (expectation-maximization) algorithm is a statistical technique that makes use of the finite Gaussian mixtures model. The mixtures model assigns each individual data instance a probability that it would have a certain set of attribute values given it was a member of a specified cluster. The model assumes all attributes to be independent random variables. The EM algorithm is similar to the K-Means procedure in that a set of parameters is recomputed until a desired convergence value is achieved. A lack of explanation about what has been discovered is a problem with EM as it is with many clustering systems. Applying a supervised model to analyze the results of an unsupervised clustering is one technique to help explain the results of an EM clustering.
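The alternating parameter recomputation described above can be sketched for a two-component one-dimensional Gaussian mixture. For brevity this sketch holds the variances and mixture weights fixed and re-estimates only the two means; the data and starting guesses are invented.

```python
# A minimal EM sketch for a two-component 1-D Gaussian mixture. Only the
# means are updated; a full implementation would also re-estimate variances
# and mixture weights each M-step.
import math

data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
mu = [0.0, 6.0]        # initial guesses for the two cluster means
sigma = 1.0            # fixed standard deviation, for simplicity

def density(x, m):
    return math.exp(-((x - m) ** 2) / (2 * sigma ** 2))

for _ in range(20):
    # E-step: probability each instance belongs to each cluster.
    resp = []
    for x in data:
        w = [density(x, m) for m in mu]
        resp.append([wi / sum(w) for wi in w])
    # M-step: re-estimate each mean as a responsibility-weighted average.
    for k in range(2):
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / sum(r[k] for r in resp)

print([round(m, 1) for m in mu])   # means settle near the two data groups
```

As with K-Means, the loop would normally stop when the parameter change (or the log-likelihood gain) falls below a convergence threshold rather than after a fixed iteration count.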

27 Key Terms
A priori probability. The probability a hypothesis is true lacking evidence to support or reject the hypothesis.
Agglomerative clustering. An unsupervised technique where each data instance initially represents its own cluster. Successive iterations of the algorithm merge pairs of highly similar clusters until all instances become members of a single cluster. In the last step, a decision is made about which clustering is the best final result.
Basic-level nodes. The nodes in a concept hierarchy that represent concepts easily identified by humans.

28 Key Terms
Bayes classifier. A supervised learning approach that classifies new instances by using Bayes theorem.
Bayes theorem. The probability of a hypothesis given some evidence is equal to the probability of the evidence given the hypothesis, times the probability of the hypothesis, divided by the probability of the evidence.
Bayesian Information Criterion (BIC). The BIC gives the posterior odds for one data mining model against another model, assuming neither model is favored initially.

29 Key Terms
Category utility. An unsupervised evaluation function that measures the gain in the "expected number" of correct attribute-value predictions for a specific object if it were placed within a given category or cluster.
Coefficient of determination. For a regression analysis, the correlation between actual and estimated values for the dependent variable.
Concept hierarchy. A tree structure where each node of the tree represents a concept at some level of abstraction. Nodes toward the top of the tree are the most general. Leaf nodes represent individual data instances.

30 Key Terms
Conceptual clustering. An incremental unsupervised clustering method that creates a concept hierarchy from a set of input instances.
Conditional probability. The conditional probability of evidence E given hypothesis H, denoted by P(E | H), is the probability E is true given H is true.
Incremental learning. A form of learning, supported in an unsupervised environment, where instances are presented sequentially. As each new instance is seen, the learning model is modified to reflect the addition of the new instance.

31 Key Terms
Linear regression. A statistical technique that models the variation in a numeric dependent variable as a linear combination of one or several independent variables.
Logistic regression. A nonlinear regression technique for problems having a binary outcome. The created regression equation limits the values of the output attribute to values between 0 and 1. This allows output values to represent a probability of class membership.

32 Key Terms
Logit. The natural logarithm of the odds ratio p(y = 1 | x) / [1 − p(y = 1 | x)], where p(y = 1 | x) is the conditional probability that the outcome is 1 given the feature vector x.
Mixture. A set of n probability distributions where each distribution represents a cluster.
Model tree. A decision tree where each leaf node contains a linear regression equation.

33 Key Terms
Regression. The process of developing an expression that predicts a numeric output value.
Regression tree. A decision tree where leaf nodes contain averaged numeric values.
Simple linear regression. A regression equation with a single independent variable.
Slope-intercept form. A linear equation of the form y = ax + b, where a is the slope of the line and b is the y-intercept.


35 Chapter 10 Association Rules

36 Content
- Association rule mining
- Mining single-dimensional Boolean association rules from transactional databases
- Mining multilevel association rules from transactional databases
- Mining multidimensional association rules from transactional databases and data warehouses
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

37 What Is Association Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Applications: basket data analysis, clustering, classification.

38 Association Rule: Basic Concepts
Given: (1) a database of transactions; (2) each transaction is a list of items (purchased by a customer in a visit).
Find: all rules that correlate the presence of one set of items with that of another set of items.
E.g., 98% of people who purchase tires and auto accessories also get automotive services done.

39 Association Rule: Basic Concepts
Applications:
- * => Maintenance Agreement (what the store should do to boost Maintenance Agreement sales)
- Home Electronics => * (what other products the store should stock up on)
- Attached mailing in direct marketing

40 Rule Measures: Support and Confidence
Find all rules X & Y => Z with minimum confidence and support:
- support, s: probability that a transaction contains {X, Y, Z}
- confidence, c: conditional probability that a transaction having {X, Y} also contains Z
(Venn diagram of customers buying beer, diapers, or both not captured in transcript.)

41 Rule Measures: Support and Confidence
With minimum support 50% and minimum confidence 50%, we have:
- A => C (support 50%, confidence 66.6%)
- C => A (support 50%, confidence 100%)

42 Mining Association Rules: An Example
Min. support 50%, min. confidence 50%.
For the rule A => C:
support = support({A, C}) = 2/4 = 50%
confidence = support({A, C}) / support({A}) = 2/3 = 66.6%
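The support and confidence arithmetic above can be checked in a few lines of Python. The transaction table from the original slide is not in the transcript, so the four transactions below are reconstructed to reproduce the quoted counts: {A, C} appears in 2 of 4 transactions and A in 3 of 4.

```python
# A minimal sketch computing support and confidence for the rule A => C.
# Transactions are reconstructed to match the 2/4 and 2/3 counts above.
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

sup = support({"A", "C"})                     # 2/4 = 0.5
conf = support({"A", "C"}) / support({"A"})   # (2/4) / (3/4) = 2/3
print(f"support={sup:.0%}  confidence={conf:.1%}")
```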

43 Mining Frequent Itemsets: the Key Step
The Apriori principle: any subset of a frequent itemset must be frequent.

44 The Apriori Algorithm
Find the frequent itemsets: the sets of items that have minimum support.
- Any subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Use the frequent itemsets to generate association rules.
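The final step named above, generating rules from a frequent itemset, can be sketched as: for every nonempty proper subset used as a left-hand side, keep the rule if its confidence meets a threshold. The transactions and thresholds below are illustrative.

```python
# A minimal sketch of rule generation from a frequent itemset: emit every
# rule (subset => remainder) whose confidence meets min_conf. Data is invented.
from itertools import combinations

transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def rules_from(freq_itemset, min_conf):
    rules = []
    items = sorted(freq_itemset)
    for r in range(1, len(items)):            # every nonempty proper subset
        for lhs in combinations(items, r):
            lhs = frozenset(lhs)
            rhs = freq_itemset - lhs
            conf = support(freq_itemset) / support(lhs)
            if conf >= min_conf:
                rules.append((set(lhs), set(rhs), conf))
    return rules

for lhs, rhs, conf in rules_from(frozenset({"A", "C"}), 0.5):
    print(f"{sorted(lhs)} => {sorted(rhs)}  confidence={conf:.2f}")
```

Because every candidate rule shares the itemset's support, only confidence needs to be re-checked per rule; support was already verified when the itemset was found frequent.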

45 The Apriori Algorithm
Join step: C_k is generated by joining L_{k-1} with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.

46 The Apriori Algorithm
Pseudo-code:
  C_k: candidate itemsets of size k
  L_k: frequent itemsets of size k
  L_1 = {frequent items};
  for (k = 1; L_k != {}; k++) do begin
      C_{k+1} = candidates generated from L_k;
      for each transaction t in database do
          increment the count of all candidates in C_{k+1} that are contained in t;
      L_{k+1} = candidates in C_{k+1} with min_support;
  end
  return union over k of L_k;
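The pseudo-code above can be turned into a runnable sketch. The transactions and the absolute support threshold below are illustrative; the join and prune steps follow the two rules from the previous slide.

```python
# A runnable sketch of the Apriori pseudo-code: join frequent k-itemsets into
# (k+1)-candidates, prune those with an infrequent subset, then count the
# survivors in one database scan. Transactions are invented for illustration.
from itertools import combinations

transactions = [
    {"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"},
]
min_support = 2          # absolute count (50% of 4 transactions)

def apriori(transactions, min_support):
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    level = [frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_support]
    frequent = list(level)
    while level:
        k = len(level[0]) + 1
        # Join step: union pairs of frequent k-1 itemsets into k-candidates.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Scan the database, keeping candidates that meet minimum support.
        level = [c for c in candidates
                 if sum(c <= t for t in transactions) >= min_support]
        frequent.extend(level)
    return frequent

result = apriori(transactions, min_support)
print(sorted(sorted(s) for s in result))
```

On this data the loop finds {A}, {B}, {C}, and {A, C}, matching the 50%-support example on the earlier slide; {A, B} and {B, C} are generated as candidates but fail the support scan.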

47 The Apriori Algorithm: Example (trace of scanning database D to produce C1/L1, C2/L2, C3/L3 not captured in transcript)

48 Generating Association Rules: Confidence and Support
Items: milk, cheese, bread, eggs.
Possible associations include the following:
1. If customers purchase milk they also purchase bread.
2. If customers purchase bread they also purchase milk.
3. If customers purchase milk and eggs they also purchase cheese and bread.
4. If customers purchase milk, cheese, and eggs they also purchase bread.

49–51 Generating Association Rules: Mining Association Rules, An Example (worked-example slides; content not captured in transcript)

52 Generating Association Rules: Mining Association Rules, An Example
Two possible two-item set rules are: (rules not captured in transcript)

53 Generating Association Rules: Mining Association Rules, An Example
Here are three of several possible three-item set rules: (rules not captured in transcript)

54 Reference
Data Mining: Concepts and Techniques (slides for Chapter 6 of the textbook), Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada.

