
1 Chapter 11 Statistical Techniques

2 Data Warehouse and Data Mining, Chapter 11
Chapter Objectives
 Understand when linear regression is an appropriate data mining technique.
 Know how to perform linear regression with Microsoft Excel’s LINEST function.
 Know that logistic regression can be used to build supervised learner models for datasets having a binary outcome.
 Understand how Bayes classifier is able to build supervised models for datasets having categorical data, numeric data, or a combination of both data types.

3 Chapter Objectives
 Know how agglomerative clustering is applied to partition data instances into disjoint clusters.
 Understand that conceptual clustering is an unsupervised data mining technique that builds a concept hierarchy to partition data instances.
 Know that the EM algorithm uses a statistical parameter adjustment technique to cluster data instances.
 Understand the basic features that differentiate statistical and machine learning data mining methods.

4-8 Linear Regression Analysis

9-10 Logistic Regression

11-13 Bayes Classifier

14-19 Clustering Algorithms

20 Heuristics or Statistics?
Here is one way to categorize inductive problem-solving methods:
 Query and visualization techniques
 Machine learning techniques
 Statistical techniques
Query and visualization techniques generally fall into one of three groups:
 Query tools
 OLAP tools
 Visualization tools

21 Chapter Summary
Data mining techniques come in many shapes and forms. A favorite statistical technique for estimation and prediction problems is linear regression. Linear regression attempts to model the variation in a dependent variable as a linear combination of one or more independent variables. Linear regression is an appropriate data mining strategy when the relationship between the dependent and independent variables is nearly linear. Microsoft Excel’s LINEST function provides an easy mechanism for performing multiple linear regression.
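The straight-line fit that LINEST computes can also be sketched outside Excel. Below is a minimal pure-Python sketch of simple (single-variable) least-squares regression; the small dataset is illustrative, not taken from the chapter.

```python
# A minimal sketch of simple linear regression: the ordinary
# least-squares fit y = a*x + b that LINEST performs for one predictor.

def simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope a = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.0, 8.1, 9.9]   # roughly y = 2x, for illustration
slope, intercept = simple_linear_regression(xs, ys)
```

LINEST generalizes this to several independent variables; the covariance-over-variance formula above is the one-predictor special case.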

22 Chapter Summary
Linear regression is a poor choice when the outcome is binary. The problem is that the value restriction placed on the dependent variable is not enforced by the regression equation: because linear regression produces a straight-line function, values of the dependent variable are unbounded in both the positive and negative directions. For the two-outcome case, logistic regression is a better choice. Logistic regression is a nonlinear regression technique that associates a conditional probability value with each data instance.
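The bounding behavior described above comes from the logistic (sigmoid) transform, which maps any real-valued linear combination into (0, 1). A minimal sketch follows; the slope and intercept are made-up illustrative coefficients, not fitted values.

```python
import math

# A minimal sketch of the logistic transform used by logistic regression.
# It squashes the unbounded linear output into a probability in (0, 1).

def logistic(z):
    """Map the linear combination z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(x, slope=1.5, intercept=-4.0):
    # slope and intercept are assumed example coefficients, not fitted ones
    return logistic(slope * x + intercept)
```

However extreme the input, the output stays between 0 and 1, so it can be read as a conditional probability of class membership.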

23 Chapter Summary
Bayes classifier offers a simple yet powerful supervised classification technique. The model assumes all input attributes to be of equal importance and independent of one another. Even though these assumptions are likely to be false, Bayes classifier still works quite well in practice. Bayes classifier can be applied to datasets containing both categorical and numeric data. Also, unlike many statistical classifiers, Bayes classifier can be applied to datasets containing many missing values.
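The attribute-independence assumption makes the classifier easy to sketch for categorical data: multiply the class prior by the per-attribute conditional probabilities and pick the highest-scoring class. The tiny training set below is illustrative only, and the sketch omits the smoothing a real implementation would use for unseen values.

```python
from collections import Counter, defaultdict

# A minimal sketch of a (naive) Bayes classifier for categorical attributes,
# assuming attribute independence given the class.

def train(instances):
    """instances: list of (attribute_dict, class_label) pairs."""
    class_counts = Counter(label for _, label in instances)
    cond_counts = defaultdict(Counter)  # (attr, class) -> value counts
    for attrs, label in instances:
        for attr, value in attrs.items():
            cond_counts[(attr, label)][value] += 1
    return class_counts, cond_counts

def classify(attrs, class_counts, cond_counts):
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for attr, value in attrs.items():
            # P(value | class); unseen values score 0 here
            # (a real implementation would smooth, e.g. Laplace correction)
            score *= cond_counts[(attr, label)][value] / count
        if score > best_score:
            best_label, best_score = label, score
    return best_label

data = [({"outlook": "sunny"}, "no"),
        ({"outlook": "sunny"}, "no"),
        ({"outlook": "rain"}, "yes"),
        ({"outlook": "overcast"}, "yes")]
class_counts, cond_counts = train(data)
```

Missing attributes are handled naturally: an instance simply contributes no factor for an attribute it lacks, which is one reason Bayes classifier tolerates missing data well.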

24 Chapter Summary
Agglomerative clustering is a favorite unsupervised clustering technique. It begins by assuming each data instance represents its own cluster. Each iteration of the algorithm merges the most similar pair of clusters. Several options exist for computing instance and cluster similarity scores and for merging clusters. Also, when the data to be clustered is real-valued, defining a measure of instance similarity can be a challenge. One common approach is to use simple Euclidean distance. A widespread application of agglomerative clustering is its use as a prelude to other clustering techniques.
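The merge loop can be sketched directly. The version below makes two of the choices the summary mentions, both illustrative: Euclidean distance for instance similarity and single-link (closest pair) for cluster similarity, stopping when a requested number of clusters remains.

```python
import math

# A minimal sketch of agglomerative clustering: every instance starts as
# its own cluster, and each pass merges the most similar pair.

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(c1, c2):
    """Cluster distance = distance between the closest pair of members."""
    return min(euclidean(p, q) for p in c1 for q in c2)

def agglomerate(points, k):
    clusters = [[p] for p in points]       # each instance is its own cluster
    while len(clusters) > k:
        # Find and merge the closest (most similar) pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
result = agglomerate(points, 2)
```

Stopping at a fixed k is one convention; the full algorithm instead merges all the way to a single cluster and chooses the best level afterward.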

25 Chapter Summary
Conceptual clustering is an unsupervised technique that incorporates incremental learning to form a hierarchy of concepts. The concept hierarchy takes the form of a tree structure where the root node represents the highest level of concept generalization. Conceptual clustering systems are particularly appealing because the trees they form have been shown to consistently determine psychologically preferred levels in human classification hierarchies. Also, conceptual clustering systems lend themselves well to explaining their behavior. A major problem with conceptual clustering systems is that instance ordering can have a marked impact on the results of the clustering. A nonrepresentative ordering of data instances can lead to a less than optimal clustering.

26 Chapter Summary
The EM (expectation-maximization) algorithm is a statistical technique that makes use of the finite Gaussian mixtures model. The mixtures model assigns each individual data instance a probability that it would have a certain set of attribute values given that it is a member of a specified cluster. The model assumes all attributes to be independent random variables. The EM algorithm is similar to the K-Means procedure in that a set of parameters is recomputed until a desired convergence value is achieved. A lack of explanation about what has been discovered is a problem with EM, as it is with many clustering systems. Applying a supervised model to analyze the results of an unsupervised clustering is one technique to help explain the results of an EM clustering.
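The recompute-until-convergence loop can be sketched for a deliberately simplified case: a one-dimensional mixture of two Gaussians with fixed equal weights and unit variances, so only the two means are re-estimated. The data and starting means are illustrative.

```python
import math

# A minimal EM sketch for a two-component, 1-D Gaussian mixture.
# Simplifying assumptions: equal mixture weights and unit variances,
# so the M-step only re-estimates the two means.

def gaussian(x, mean):
    """Density of a unit-variance normal distribution at x."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2 * math.pi)

def em_two_means(data, m1, m2, iterations=25):
    for _ in range(iterations):
        # E-step: probability each instance belongs to component 1
        resp = []
        for x in data:
            p1, p2 = gaussian(x, m1), gaussian(x, m2)
            resp.append(p1 / (p1 + p2))
        # M-step: re-estimate means as responsibility-weighted averages
        m1 = sum(r * x for r, x in zip(resp, data)) / sum(resp)
        m2 = (sum((1 - r) * x for r, x in zip(resp, data))
              / sum(1 - r for r in resp))
    return m1, m2

data = [0.0, 0.2, -0.1, 5.0, 5.2, 4.9]
m1, m2 = em_two_means(data, m1=1.0, m2=4.0)
```

The resemblance to K-Means is visible here: the soft responsibilities play the role of K-Means' hard cluster assignments, and the weighted means play the role of its centroid updates.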

27 Key Terms
A priori probability. The probability that a hypothesis is true lacking evidence to support or reject the hypothesis.
Agglomerative clustering. An unsupervised technique where each data instance initially represents its own cluster. Successive iterations of the algorithm merge pairs of highly similar clusters until all instances become members of a single cluster. In the last step, a decision is made about which clustering is the best final result.
Basic-level nodes. The nodes in a concept hierarchy that represent concepts easily identified by humans.

28 Key Terms
Bayes classifier. A supervised learning approach that classifies new instances by using Bayes theorem.
Bayes theorem. The probability of a hypothesis given some evidence is equal to the probability of the evidence given the hypothesis, times the probability of the hypothesis, divided by the probability of the evidence.
Bayesian Information Criterion (BIC). The BIC gives the posterior odds for one data mining model against another model, assuming neither model is favored initially.
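The statement of Bayes theorem above can be made concrete with a tiny worked example, P(H | E) = P(E | H) P(H) / P(E); all three input probabilities below are made-up illustrative values.

```python
# A tiny worked illustration of Bayes theorem. The probabilities are
# invented for illustration; the denominator P(E) is expanded with the
# rule of total probability.

p_h = 0.01             # prior probability of hypothesis H
p_e_given_h = 0.9      # probability of the evidence if H is true
p_e_given_not_h = 0.1  # probability of the evidence if H is false

# Total probability of the evidence E
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Posterior probability of H given E
p_h_given_e = p_e_given_h * p_h / p_e
```

Even with strong evidence (0.9 versus 0.1), the posterior stays small here because the prior is small, which is exactly the trade-off the theorem encodes.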

29 Key Terms
Category utility. An unsupervised evaluation function that measures the gain in the expected number of correct attribute-value predictions for a specific object if it were placed within a given category or cluster.
Coefficient of determination. For a regression analysis, the squared correlation between the actual and estimated values of the dependent variable.
Concept hierarchy. A tree structure where each node of the tree represents a concept at some level of abstraction. Nodes toward the top of the tree are the most general. Leaf nodes represent individual data instances.
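The coefficient of determination is commonly computed as one minus the ratio of residual to total variation; for a least-squares linear fit this equals the squared correlation between actual and estimated values. A minimal sketch with illustrative numbers:

```python
# A minimal sketch of the coefficient of determination (r squared),
# computed as 1 - SSE/SST over actual values and regression estimates.

def coefficient_of_determination(actual, estimated):
    """1 - (sum of squared errors) / (total sum of squares about the mean)."""
    mean = sum(actual) / len(actual)
    sse = sum((a - e) ** 2 for a, e in zip(actual, estimated))
    sst = sum((a - mean) ** 2 for a in actual)
    return 1 - sse / sst

# Illustrative values: a perfect fit and a nearly perfect one
r2_perfect = coefficient_of_determination([1, 2, 3], [1, 2, 3])
r2_rough = coefficient_of_determination([1, 2, 3], [1.1, 2.0, 2.9])
```

A value near 1 means the regression explains almost all of the variation in the dependent variable; a value near 0 means it explains almost none.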

30 Key Terms
Conceptual clustering. An incremental unsupervised clustering method that creates a concept hierarchy from a set of input instances.
Conditional probability. The conditional probability of evidence E given hypothesis H, denoted by P(E | H), is the probability that E is true given that H is true.
Incremental learning. A form of learning, supported in an unsupervised environment, where instances are presented sequentially. As each new instance is seen, the learning model is modified to reflect the addition of the new instance.

31 Key Terms
Linear regression. A statistical technique that models the variation in a numeric dependent variable as a linear combination of one or several independent variables.
Logistic regression. A nonlinear regression technique for problems having a binary outcome. The created regression equation limits the values of the output attribute to values between 0 and 1. This allows output values to represent a probability of class membership.

32 Key Terms
Logit. The natural logarithm of the odds ratio p(y = 1 | x) / [1 - p(y = 1 | x)], where p(y = 1 | x) is the conditional probability that the outcome y is 1 given feature vector x.
Mixture. A set of n probability distributions where each distribution represents a cluster.
Model tree. A decision tree where each leaf node contains a linear regression equation.

33 Key Terms
Regression. The process of developing an expression that predicts a numeric output value.
Regression tree. A decision tree where leaf nodes contain averaged numeric values.
Simple linear regression. A regression equation with a single independent variable.
Slope-intercept form. A linear equation of the form y = ax + b, where a is the slope of the line and b is the y-intercept.


