Presentation is loading. Please wait.

Presentation is loading. Please wait.

Www.monash.edu.au Classification and Prediction Monash University Semester 1, March 2006.

Similar presentations


Presentation on theme: "Www.monash.edu.au Classification and Prediction Monash University Semester 1, March 2006."— Presentation transcript:

1 Classification and Prediction Monash University Semester 1, March 2006

2 2 Patterns Data mining aims to reveal knowledge about the data under consideration Knowledge is viewed as Patterns within the data Patterns are also referred to as structures, models and relationships Learning approach is inherently linked to the patterns Which approach for what data ? – eternal question!

3 3 Taxonomy of Learning Approaches Verification Driven –Query and reporting Statistical Analysis Discovery Driven –Supervised Learning >Predictive >Typical Learning Techniques: Classification and Regression –Unsupervised Learning >Informative >Typical Learning Techniques: Clustering, Association Analysis, Deviation/Outlier Detections

4 4 Problems and Uses of Verification Driven Techniques Statistics –Assumption – you have a well-defined model >In reality – you often don’t –Focus on model estimation rather than model selection –Don’t cope well with large amounts of data >Time and Effort required is very high –Highlight linear relationships –Non-linear stats requires knowledge of the non-linearity >The ways in which variables interact >In real multi-dimensional data -> Not often known or specificiable –Important lessons from Statistics >Estimation of uncertainty and checking of assumptions is absolutely essential

5 5 Discovery Driven Data Mining Predictive/Supervised Learning –Build patterns by making a prediction of some unknown attribute given the values of other known attributes –Principal Techniques: Classification and Regression –Supervision: Training data (a set of observations/measurements) is accompanied by labels indicating the class of the observations –New data is classified based on the training data.

6 6 Discovery Driven Data Mining Informative/Unsupervised Learning –Do not present a solution to a known problem –They present “interesting” patterns for consideration by a domain expert –Principal Techniques: Clustering and Association Analysis –Labels of training data unknown –Given a set of observations/measurements (data), need to establish existence of classes or clusters in the data.

7 7 Classification Difficult to define Usually based on a set of classes that are pre-defined and a set of training samples that are used for building class models Each new object is then assigned to one of the classes already defined in the class model Typical Apps: Credit Approval, Target Marketing, Medical Diagnosis…

8 8 Classification Vs. Regression Classification and Regression are two major types of prediction techniques Classification –Predict nominal or discrete values Regression –Predict continuous or ordered values

9 9 Classification – A two stage process Model Construction –Each tuple/sample is assumed to belong to a pre- defined class, as determined by the class label attribute –Set of tuples used for model construction: Training Set –The model can be represented as a set of IF- THEN rules, Decision Trees of Mathematical Formulae

10 10 Classification Process (1): Model Construction Training Data Classification Algorithms IF rank = ‘professor’ OR years > 6 THEN tenured = ‘yes’ Classifier (Model)

11 11 Classification – A two stage process Model Usage –For classifying future or unknown objects –Need to estimate accuracy of the model –The known label of the test sample is compared with the classified result from the model –Accuracy rate is the percentage of test set samples that are correctly classified by the model –Test set is independent of the training set

12 12 Classification Process (2): Use the Model in Prediction Classifier Testing Data Unseen Data (Jeff, Professor, 4) Tenured?

13 13 Evaluating Classification Methods Predictive accuracy Speed and scalability –time to construct the model –time to use the model Robustness –handling noise and missing values Scalability –efficiency in disk-resident databases Interpretability: –understanding and insight provided by the model Goodness of rules –decision tree size –compactness of classification rules

14 14 Decision Trees Assumption: –Data consists of records that have a number of independent attributes and a dependent attribute –A flow-chart-like tree structure –A structure to support a game of 20 questions: >Each internal node represents a test on an attribute >Each branch represents the outcome of the test >Each leaf represents a class

15 15 Classification by Decision Tree Induction Decision tree generation consists of two phases –Tree construction >At start, all the training examples are at the root >Partition examples recursively based on selected attributes –Tree pruning >Identify and remove branches that reflect noise or outliers Use of decision tree: Classifying an unknown sample –Test the attribute values of the sample against the decision tree

16 16 Training Dataset This follows an example from Quinlan’s ID3

17 17 Output: A Decision Tree for “buys_computer” age? overcast student?credit rating? noyes fair excellent <=30 >40 no yes

18 18 Decision Trees Classify objects based on the values of their independent attributes Classification is based on a tree structure where each node of the tree is a rule that involves a multi-way decision To classify an object, the attributes are compared with tests in each node starting from the root. The path found leads to a leaf node which is the class to which the object belongs to.

19 19 Decision Trees Attractive due to ease of understanding Rules can be expressed in natural language Aim: Build a DT consisting of a root node, number of internal nodes and a set of pre- defined leaf nodes. Continuous splitting of root node until the process is complete.

20 20 Algorithm for Decision Tree Induction Basic algorithm (a greedy algorithm) –Tree is constructed in a top-down recursive divide-and-conquer manner –At start, all the training examples are at the root –Attributes are categorical (if continuous-valued, they are discretized in advance) –Examples are partitioned recursively based on selected attributes –Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain) Conditions for stopping partitioning –All samples for a given node belong to the same class –There are no remaining attributes for further partitioning – majority voting is employed for classifying the leaf –There are no samples left

21 21 Decision Trees Attractive due to ease of understanding Rules can be expressed in natural language Aim: Build a DT consisting of a root node, number of internal nodes and a set of pre- defined leaf nodes. Continuous splitting of root node until the process is complete.

22 22 Decision Trees The quality of the tree depends on the quality of the training data 100% accurate for training data, but training data is only a sample…cannot be accurate for all data.

23 23 Finding the Split A variable has to be chosen to split the data This variable has to be such that it does the best job of discriminating among target classes. Finding which variable is most influential is non- trivial. One approach – finding the data’s diversity and choosing a variable that minimises the diversity amongst the children nodes. Several Techniques have been proposed – Information Gain/Entropy, Gini, Chi-Squared, Rough Sets etc.

24 24 Attribute Selection Measure Information gain (ID3/C4.5) –All attributes are assumed to be categorical –Can be modified for continuous-valued attributes Gini index (IBM IntelligentMiner) –All attributes are assumed continuous-valued –Assume there exist several possible split values for each attribute –May need other tools, such as clustering, to get the possible split values –Can be modified for categorical attributes

25 25 Information Theory – 101 Suppose there is a variable s that can take either value a or value b. If s is always going to be a, then there is no uncertainty and no information The most common measure of the mount of information (also known as entropy) is: –I = - SUM (p i log (p i )) Let p(a) = 0.9 and p(b) = 0.1 Let p(a) = 0.5 and p(b) = 0.5 When is I maximised ?

26 26 Information Gain (ID3/C4.5) Select the attribute with the highest information gain Assume there are two classes, P and N –Let the set of examples S contain p elements of class P and n elements of class N –The amount of information, needed to decide if an arbitrary example in S belongs to P or N is defined as

27 27 Information Gain in Decision Tree Induction Assume that using attribute A a set S will be partitioned into sets {S 1, S 2, …, S v } –If S i contains p i examples of P and n i examples of N, the entropy, or the expected information needed to classify objects in all subtrees S i is The encoding information that would be gained by branching on A

28 28 Attribute Selection by Information Gain Computation  Class P: buys_computer = “yes”  Class N: buys_computer = “no”  I(p, n) = I(9, 5) =0.940  Compute the entropy for age: Hence Similarly

29 29 Extracting Classification Rules from Trees Represent the knowledge in the form of IF-THEN rules One rule is created for each path from the root to a leaf Each attribute-value pair along a path forms a conjunction The leaf node holds the class prediction Rules are easier for humans to understand

30 30 Extracting Classification Rules from Trees Example IF age = “<=30” AND student = “no” THEN buys_computer = “no” IF age = “<=30” AND student = “yes” THEN buys_computer = “yes” IF age = “31…40” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes” IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”

31 31 Avoid Overfitting in Classification The generated tree may overfit the training data –Too many branches, some may reflect anomalies due to noise or outliers –Result is in poor accuracy for unseen samples

32 32 Avoid Overfitting in Classification Two approaches to avoid overfitting –Prepruning: Halt tree construction early—do not split a node if this would result in the goodness measure falling below a threshold >Difficult to choose an appropriate threshold –Postpruning: Remove branches from a “fully grown” tree—get a sequence of progressively pruned trees >Use a set of data different from the training data to decide which is the “best pruned tree”

33 33 Approaches to Determine the Final Tree Size Separate training (2/3) and testing (1/3) sets Use cross validation, e.g., 10-fold cross validation Use all the data for training –but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution Use minimum description length (MDL) principle: –halting growth of the tree when the encoding is minimized

34 34 Decision Trees: Strengths and Weaknesses Strengths –Ability to generate understandable rules –Classify with minimal computational overhead Issues –Need to find the right split variable –Visualising Trees is tedious – particularly as data dimensionality increases –Handling both continuous and categorical data

35 35 Enhancements to basic decision tree induction Allow for continuous-valued attributes –Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals Handle missing attribute values –Assign the most common value of the attribute –Assign probability to each of the possible values Attribute construction –Create new attributes based on existing ones that are sparsely represented –This reduces fragmentation, repetition, and replication

36 36 Classification in Large Databases Classification—a classical problem extensively studied by statisticians and machine learning researchers Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed Why decision tree induction in data mining? –relatively faster learning speed (than other classification methods) –convertible to simple and easy to understand classification rules –can use SQL queries for accessing databases –comparable classification accuracy with other methods

37 37 Other Classification Methods Bayesian approaches Neural Networks k-nearest neighbor classifier case-based reasoning Genetic algorithm Rough set approach Fuzzy set approaches


Download ppt "Www.monash.edu.au Classification and Prediction Monash University Semester 1, March 2006."

Similar presentations


Ads by Google