Classification (presentation transcript)

1 Classification

2 Classification
- Task: given a set of pre-classified examples, build a model or classifier to classify new cases.
- Supervised learning: classes are known for the examples used to build the classifier.
- A classifier can be a set of rules, a decision tree, a neural network, etc.
- Typical applications: credit approval, direct marketing, fraud detection, medical diagnosis, …

3 Simplicity first
- Simple algorithms often work very well!
- There are many kinds of simple structure, e.g.:
  - One attribute does all the work
  - All attributes contribute equally and independently
  - A weighted linear combination might do
  - Instance-based: use a few prototypes
  - Use simple logical rules
- Success of a method depends on the domain

4 Inferring rudimentary rules
- 1R: learns a 1-level decision tree, i.e. rules that all test one particular attribute
- Basic version:
  - One branch for each value
  - Each branch assigns the most frequent class
  - Error rate: proportion of instances that don't belong to the majority class of their corresponding branch
  - Choose the attribute with the lowest error rate (assumes nominal attributes)

5 Pseudo-code for 1R

For each attribute,
  For each value of the attribute, make a rule as follows:
    count how often each class appears
    find the most frequent class
    make the rule assign that class to this attribute-value
  Calculate the error rate of the rules
Choose the rules with the smallest error rate

- Note: "missing" is treated as a separate attribute value
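For concreteness, here is a minimal Python sketch of the 1R procedure above; the function name `one_r` and the list-of-dicts data format are illustrative choices, not something prescribed by the slides.

```python
from collections import Counter, defaultdict

def one_r(instances, class_key):
    """Learn a 1R rule set: pick the single attribute whose one-level
    rules make the fewest errors on the training data."""
    best = None
    attributes = [a for a in instances[0] if a != class_key]
    for attr in attributes:
        # Count class frequencies for each value of this attribute
        # ("missing" simply becomes one more attribute value).
        counts = defaultdict(Counter)
        for inst in instances:
            counts[inst.get(attr, "missing")][inst[class_key]] += 1
        # Rule for each value: predict the most frequent class.
        rules = {val: cls.most_common(1)[0][0] for val, cls in counts.items()}
        # Errors: instances not in the majority class of their branch.
        errors = sum(sum(cls.values()) - max(cls.values())
                     for cls in counts.values())
        if best is None or errors < best[1]:
            best = (attr, errors, rules)
    return best  # (attribute, total errors, {value: predicted class})

# Toy usage with two rows of the weather data:
data = [
    {"Outlook": "Sunny", "Windy": "False", "Play": "No"},
    {"Outlook": "Overcast", "Windy": "False", "Play": "Yes"},
]
print(one_r(data, "Play"))
```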

6 Evaluating the weather attributes

Attribute   Rules             Errors   Total errors
Outlook     Sunny → No        2/5      4/14
            Overcast → Yes    0/4
            Rainy → Yes       2/5
Temp        Hot → No*         2/4      5/14
            Mild → Yes        2/6
            Cool → Yes        1/4
Humidity    High → No         3/7      4/14
            Normal → Yes      1/7
Windy       False → Yes       2/8      5/14
            True → No*        3/6
* indicates a tie

The weather data:

Outlook    Temp   Humidity   Windy   Play
Sunny      Hot    High       False   No
Sunny      Hot    High       True    No
Overcast   Hot    High       False   Yes
Rainy      Mild   High       False   Yes
Rainy      Cool   Normal     False   Yes
Rainy      Cool   Normal     True    No
Overcast   Cool   Normal     True    Yes
Sunny      Mild   High       False   No
Sunny      Cool   Normal     False   Yes
Rainy      Mild   Normal     False   Yes
Sunny      Mild   Normal     True    Yes
Overcast   Mild   High       True    Yes
Overcast   Hot    Normal     False   Yes
Rainy      Mild   High       True    No

7 Dealing with numeric attributes
- Discretize numeric attributes
  - Divide each attribute's range into intervals
  - Sort instances according to the attribute's values
  - Place breakpoints where the (majority) class changes
  - This minimizes the total error
- Example: temperature from the weather data

64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

Outlook    Temperature   Humidity   Windy   Play
Sunny      85                       False   No
Sunny      80            90         True    No
Overcast   83            86         False   Yes
Rainy      75            80         False   Yes
…          …             …          …       …

8 The problem of overfitting
- This procedure is very sensitive to noise
  - One instance with an incorrect class label will probably produce a separate interval
- Also: a time stamp attribute will have zero errors
- Simple solution: enforce a minimum number of instances in the majority class per interval

9 Discretization example
- Example (with min = 3):

64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes | No | Yes Yes Yes | No No Yes | Yes Yes | No | Yes Yes | No

- Final result for temperature attribute:

64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
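The interval-building step with a minimum bucket size can be sketched in Python as follows; the function name `discretize_1r`, the tie-breaking behaviour, and the merging of adjacent intervals with the same majority class are assumptions based on the description above, not an exact reproduction of any particular implementation.

```python
from collections import Counter

def discretize_1r(pairs, min_bucket=3):
    """Split sorted (value, class) pairs into intervals for 1R.
    A bin is closed once its majority class has at least `min_bucket`
    members and the next class differs; adjacent bins with the same
    majority class are then merged.  Ties are broken arbitrarily."""
    pairs = sorted(pairs)
    bins, current = [], []
    for i, (value, cls) in enumerate(pairs):
        current.append((value, cls))
        counts = Counter(c for _, c in current)
        majority_cls, majority_n = counts.most_common(1)[0]
        next_cls = pairs[i + 1][1] if i + 1 < len(pairs) else None
        if majority_n >= min_bucket and next_cls is not None and next_cls != majority_cls:
            bins.append((current, majority_cls))
            current = []
    if current:
        bins.append((current, Counter(c for _, c in current).most_common(1)[0][0]))
    # Merge neighbouring bins that predict the same class.
    merged = []
    for contents, cls in bins:
        if merged and merged[-1][1] == cls:
            merged[-1] = (merged[-1][0] + contents, cls)
        else:
            merged.append((contents, cls))
    return merged

# Temperature values and classes from the weather data; with min = 3 the
# result corresponds to the "<= 77.5 -> Yes, > 77.5 -> No" rule on slide 10.
temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
for contents, cls in discretize_1r(list(zip(temps, play)), min_bucket=3):
    print([v for v, _ in contents], "->", cls)
```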

10 With overfitting avoidance
- Resulting rule set:

Attribute     Rules                     Errors   Total errors
Outlook       Sunny → No                2/5      4/14
              Overcast → Yes            0/4
              Rainy → Yes               2/5
Temperature   ≤ 77.5 → Yes              3/10     5/14
              > 77.5 → No*              2/4
Humidity      ≤ 82.5 → Yes              1/7      3/14
              > 82.5 and ≤ 95.5 → No    2/6
              > 95.5 → Yes              0/1
Windy         False → Yes               2/8      5/14
              True → No*                3/6

11 Missing Values

12 Missing Values
Many data sets are plagued by the problem of missing values.
- Missing values can be a result of manual data entry, incorrect measurements, equipment errors, etc.
- They are usually denoted by special characters such as: NULL, *, ?

13 Table 2.1

14 Missing Values
- Imputation (filling-in) of missing data
- We will use two methods of single imputation:
  - Mean imputation
  - Hot deck imputation

15 Missing Values - single imputation
- The mean imputation method uses the mean of the values of a feature that contains missing data
- In the case of a symbolic/categorical feature, the mode (the most frequent value) is used
- The algorithm imputes missing values for each attribute separately
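A minimal sketch of mean/mode imputation, assuming the data set is held in a pandas DataFrame; the column names below are invented for illustration.

```python
import pandas as pd

def mean_impute(df):
    """Fill missing values column by column: numeric columns get the
    column mean, symbolic/categorical columns get the mode."""
    result = df.copy()
    for col in result.columns:
        if pd.api.types.is_numeric_dtype(result[col]):
            result[col] = result[col].fillna(result[col].mean())
        else:
            result[col] = result[col].fillna(result[col].mode().iloc[0])
    return result

# Hypothetical data with one missing numeric and one missing symbolic value:
patients = pd.DataFrame({
    "Cholesterol": [261.2, None, 331.2, 407.5],
    "Sex": ["M", "F", None, "F"],
})
print(mean_impute(patients))
```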

16 Table 2.2

17 Missing Values - single imputation
- Hot deck imputation: for each object that contains missing values, the most similar object (according to some distance function) is found, and the missing values are imputed from that object
- If the most similar object also has a missing value for the same feature, it is discarded and the next closest object is found
- The procedure is repeated until all the missing values are imputed
- When no similar object is found, the closest object with the minimum number of missing values is chosen to impute the missing values
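A rough Python sketch of hot deck imputation as described above; the distance function (absolute difference for numeric features, 0/1 mismatch for symbolic ones, computed over the features both objects have) is an assumption for illustration, and donors missing the same feature are simply skipped.

```python
import math

def hot_deck_impute(records):
    """Hot deck imputation over a list of dicts; None marks a missing value."""
    def distance(a, b):
        # Compare only features present in both records.
        shared = [k for k in a if a[k] is not None and b.get(k) is not None]
        if not shared:
            return math.inf
        total = 0.0
        for k in shared:
            if isinstance(a[k], (int, float)):
                total += abs(a[k] - b[k])
            else:
                total += 0.0 if a[k] == b[k] else 1.0
        return total

    completed = [dict(r) for r in records]
    for i, rec in enumerate(completed):
        for feature in [k for k, v in rec.items() if v is None]:
            # Candidate donors: other records that actually have this feature,
            # tried from most to least similar.
            donors = [r for j, r in enumerate(records)
                      if j != i and r.get(feature) is not None]
            donors.sort(key=lambda d: distance(records[i], d))
            if donors:
                rec[feature] = donors[0][feature]
    return completed

# Hypothetical rows; the missing cholesterol is copied from the closest donor.
rows = [
    {"Cholesterol": 261.2, "Sex": "M"},
    {"Cholesterol": None,  "Sex": "M"},
    {"Cholesterol": 407.5, "Sex": "F"},
]
print(hot_deck_impute(rows))
```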

18 Table 2.3

19 Noise

20 Noise
Def.: noise in the data is a value that is a random error or variance in a measured feature.
- The amount of noise in the data can jeopardize the results of the entire KDP (knowledge discovery process)
- The influence of noise on the data can be prevented by imposing constraints on features to detect anomalies when the data is entered
- For instance, a DBMS usually provides facilities to define constraints for individual attributes

21 Noise Detection
- In manual inspection, the user checks feature values against predefined constraints and manually detects the noise
- For example, for object 5 in Table 2.3 the cholesterol value is 45.0, which is outside the predefined acceptable interval for this feature, namely [50.0, 600.0]
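A small sketch of such a constraint check; the interval bound for cholesterol is taken from the example above, while the function name and record layout are illustrative.

```python
# Acceptable [low, high] interval per feature (cholesterol bound from the slide).
CONSTRAINTS = {"Cholesterol": (50.0, 600.0)}

def find_noise(records, constraints=CONSTRAINTS):
    """Return (record index, feature, value) for every value that falls
    outside its acceptable interval."""
    violations = []
    for i, rec in enumerate(records):
        for feature, (low, high) in constraints.items():
            value = rec.get(feature)
            if value is not None and not (low <= value <= high):
                violations.append((i, feature, value))
    return violations

print(find_noise([{"Cholesterol": 261.2}, {"Cholesterol": 45.0}]))
# -> [(1, 'Cholesterol', 45.0)]
```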

22 Noise
Noise can be removed using binning:
- Binning first orders the values of the noisy feature and then replaces them with a mean or median value for each of the predefined bins
- In Table 2.3 the Cholesterol attribute contains the value 45.0, which is noise
- Example: consider the cholesterol feature with values 45.0, 261.2, 331.2, and 407.5. If the bin size equals two, two bins are created: bin1 with 45.0 and 261.2, and bin2 with 331.2 and 407.5. For bin1 the mean value is 153.1, and for bin2 it is 369.4. Therefore the values 45.0 and 261.2 are replaced with 153.1, and the values 331.2 and 407.5 with 369.4. Note that the two new values are within the acceptable interval.
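A short sketch of binning with equal-frequency bins and mean smoothing, matching the worked example above; the function name is illustrative.

```python
def smooth_by_bin_means(values, bin_size=2):
    """Equal-frequency binning: sort the values, group them into bins of
    `bin_size`, and replace each value in a bin by the bin mean."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed  # returned in sorted order

# Cholesterol values from the example, with bin size 2:
print(smooth_by_bin_means([45.0, 261.2, 331.2, 407.5]))
# ≈ [153.1, 153.1, 369.35, 369.35]  (the slide rounds 369.35 to 369.4)
```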

