Presentation on theme: "Classification". Presentation transcript:
Purpose The goal of classification is to assign a new document or new words to one of a set of predefined groups in such a way that it best fits the elements of that group – the groups are defined in advance – supervised learning – produces an assignment rule
By Dr. Borne 2005 UMUC Data Mining Lecture 4 Introduction to Classification Applications Classification = learning a function that assigns data to one of a set of predefined classes. – predicts categorical class labels (i.e., discrete labels) – constructs a model from the training set and from the values (class labels) of a classifying attribute, and then uses the model to classify new database entries. Example: A bank might want to learn a function that determines whether a customer should get a loan or not. This is called Credit Scoring. Decision trees and Bayesian classifiers are examples of classification algorithms. Other applications: Credit approval; Target marketing; Medical diagnosis; Outcome (e.g., Treatment) analysis.
Classification - a 2-Step Process Step 1, Model Construction (Description): describing a set of predetermined classes = Build the Model. – Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute – The set of tuples used for model construction = the training set – The model is represented by classification rules, decision trees, or mathematical formulae Step 2, Model Usage (Prediction): classifying future or unknown objects, or predicting missing values = Apply the Model. – It is important to estimate the accuracy of the model: the known label of each test sample is compared with the class assigned by the model. The accuracy rate is the percentage of test set samples that are correctly classified by the model. The test set must be chosen completely independently of the training set; otherwise the accuracy estimate is over-optimistic and over-fitting goes undetected.
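The accuracy estimate from the second step can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function name `accuracy` and the toy label lists are assumptions made for the example.

```python
def accuracy(known_labels, predicted_labels):
    """Fraction of test samples whose predicted class equals the known class."""
    if len(known_labels) != len(predicted_labels):
        raise ValueError("both label lists must have the same length")
    correct = sum(k == p for k, p in zip(known_labels, predicted_labels))
    return correct / len(known_labels)


# Hypothetical test set: labels known in advance vs. labels assigned by a model.
known = ["yes", "no", "yes", "yes"]
predicted = ["yes", "no", "no", "yes"]
print(accuracy(known, predicted))  # 3 of 4 correct
```

The labels passed in should come from samples that were held out of model construction, as the slide requires.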
When to use Classification Applications? If you do not know the types of objects stored in your database, then you should begin with a Clustering algorithm, to find the various clusters (classes) of objects within the DB. This is Unsupervised Learning. If you already know the classes of objects in your database, then you should apply Classification algorithms, to classify all remaining (or newly added) objects in the database using the known objects as a training set. This is Supervised Learning. If you are still learning about the properties of known objects in the database, then this is Semi-Supervised Learning, which may involve Neural Network techniques.
Document classification The document is assigned to one (or a group) of the classes known in advance. Word set → Category. The mapping is learned with statistical methods based on a training sample: – Bayes – Decision tree – K nearest neighbors – SVM
Issues in Classification - 1 Data Preparation: – Data cleaning: preprocess data in order to reduce noise and handle missing values – Relevance analysis (feature selection): the “interestingness problem”; remove the irrelevant or redundant attributes
Issues in Classification - 3 Robustness: – handling noise and missing values. Speed and scalability of model: – time to construct the model – time to use the model. Scalability of implementation: – ability to handle ever-growing databases. Interpretability: – understanding and insight provided by the model. Goodness of rules: – decision tree size – compactness of classification rules. Predictive accuracy.
Issues in Classification - 4 Overfitting – Definition: If your classifier (machine learning model) fits noise (i.e., pays attention to parts of the data that are irrelevant), then it is overfitting. [Figure: two example fits, labeled GOOD and BAD]
Bayesian Methods Learning and classification methods based on probability theory (see spelling / POS). Bayes theorem plays a critical role. Build a generative model that approximates how data is produced. Uses the prior probability of each category given no information about an item. Categorization produces a posterior probability distribution over the possible categories given a description of an item.
Bayesian Classifiers Bayes Theorem: P(C|X) = P(X|C) P(C) / P(X) which states … posterior = (likelihood x prior) / evidence P(C) = prior probability = probability that any given sample data is in class C, estimated before we have measured the sample data. We wish to determine the posterior probability P(C|X) that estimates whether C is the correct class for a given set of sample data X.
Estimating Bayesian Classifiers P(C|X) = P(X|C) P(C) / P(X) … – Estimate P(C_j) by counting the frequency of occurrence of each class C_j in the training data set.* – Estimate P(X_k) by counting the frequency of occurrence of each attribute value X_k in the data.* – Estimate P(X_k | C_j) by counting how often the attribute value X_k occurs in class C_j in the training data set.* – Calculate the desired end-result P(C_j | X_k), which is the classification = the probability that C_j is the correct class for a data item having attribute X_k. (*Estimating these probabilities can be computationally very expensive for very large data sets.)
Example of Bayes Classification Show sample database. Show application of Bayes theorem: – Use sample database as the “set of priors” – Use Bayes results to classify new data
Example of Bayesian Classification: Suppose that you have a database D that contains characteristics of a large number of different kinds of cars that are sorted according to each car’s manufacturer = the car’s classification C. Suppose one of the attributes X in D is the car’s “color”. Measure P(C) from the frequency of different manufacturers in D. Measure P(X) from the frequency of different colors among the cars in D. (This estimate is made independent of manufacturer.) Measure P(X|C) from the frequency of cars with color X made by manufacturer C. Okay, now you see a red car flying down the beltway. What is the car’s make (manufacturer)? You can estimate the likelihood that the car is from a given manufacturer C by calculating P(C|X) via Bayes Theorem: – P(C|X) = P(X|C) P(C) / P(X) (Class is “C” when P(C|X) is a maximum.) With only one attribute, this is a trivial result, and not very informative. However, using a larger set of attributes (e.g., two-door, with sun roof) leads to a much better classification estimator: an example of a Bayes Belief Network.
Sample Database for Bayes Classification Example
x = car color, C = class of car (manufacturer)
Car Database:
Tuple  x      C
1      red    honda
2      blue   honda
3      white  honda
4      red    chevy
5      blue   chevy
6      white  chevy
7      red    toyota
8      white  toyota
9      white  toyota
10     red    chevy
11     white  ford
12     white  ford
13     blue   ford
14     red    chevy
15     red    dodge
Some statistical results:
x1 = red     P(x1) = 6/15
x2 = white   P(x2) = 6/15
x3 = blue    P(x3) = 3/15
C1 = chevy   P(C1) = 5/15
C2 = honda   P(C2) = 3/15
C3 = toyota  P(C3) = 3/15
C4 = ford    P(C4) = 3/15
C5 = dodge   P(C5) = 1/15
Application #1 of Bayes Theorem
Example #1: We see a red car. What type of car is it?
Recall the theorem: P(C|X) = P(X|C) P(C) / P(X)
From the last slide, we know P(C) and P(X). Calculate P(X|C) and then we can perform the classification.
P(C | red) = P(red | C) * P(C) / P(red)
P(red | chevy) = 3/5
P(red | honda) = 1/3
P(red | toyota) = 1/3
P(red | ford) = 0/3
P(red | dodge) = 1/1
Therefore...
P(chevy | red) = 3/5 * 5/15 * 15/6 = 3/6 = 50%
P(honda | red) = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(toyota | red) = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(ford | red) = 0
P(dodge | red) = 1/1 * 1/15 * 15/6 = 1/6 = 17%
Results from Bayes Example #1 Therefore, the red car is most likely a Chevy (maybe a Camaro or Corvette?). The red car is unlikely to be a Ford. We choose the most probable class as the Classification of the new data item (red car): therefore, Classification = C1 (Chevy).
Application #2 of Bayes Theorem
Example #2: We see a white car. What type of car is it?
Recall the theorem: P(C|X) = P(X|C) P(C) / P(X)
P(C | white) = P(white | C) * P(C) / P(white)
P(white | chevy) = 1/5
P(white | honda) = 1/3
P(white | toyota) = 2/3
P(white | ford) = 2/3
P(white | dodge) = 0/1
Therefore...
P(chevy | white) = 1/5 * 5/15 * 15/6 = 1/6 = 17%
P(honda | white) = 1/3 * 3/15 * 15/6 = 1/6 = 17%
P(toyota | white) = 2/3 * 3/15 * 15/6 = 2/6 = 33%
P(ford | white) = 2/3 * 3/15 * 15/6 = 2/6 = 33%
P(dodge | white) = 0
Results from Bayes Example #2 Therefore, the white car is equally likely to be a Ford or a Toyota. The white car is unlikely to be a Dodge. If we choose the most probable class as the Classification, we have a tie. You can either pick one of the two classes randomly (if you must pick), or else weight each class 0.50 in the output classification (C3, C4), if a probabilistic classification is permitted.
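The two worked examples can be reproduced with a short Python sketch that estimates P(C), P(X) and P(X|C) by counting over the car database and then applies Bayes theorem. The function name `posterior` and the use of exact fractions are choices made for this illustration, not something the slides prescribe.

```python
from collections import Counter
from fractions import Fraction

# The 15-tuple car database from the slides: (color, manufacturer).
CARS = [
    ("red", "honda"), ("blue", "honda"), ("white", "honda"),
    ("red", "chevy"), ("blue", "chevy"), ("white", "chevy"),
    ("red", "toyota"), ("white", "toyota"), ("white", "toyota"),
    ("red", "chevy"), ("white", "ford"), ("white", "ford"),
    ("blue", "ford"), ("red", "chevy"), ("red", "dodge"),
]


def posterior(color):
    """Return P(C | color) for every manufacturer C via Bayes theorem."""
    n = len(CARS)
    class_counts = Counter(make for _, make in CARS)
    color_count = sum(1 for c, _ in CARS if c == color)
    if color_count == 0:
        raise ValueError("color never observed, P(X) would be zero")
    joint_counts = Counter(make for c, make in CARS if c == color)
    evidence = Fraction(color_count, n)                      # P(X)
    result = {}
    for make, n_c in class_counts.items():
        prior = Fraction(n_c, n)                             # P(C)
        likelihood = Fraction(joint_counts[make], n_c)       # P(X | C)
        result[make] = likelihood * prior / evidence         # P(C | X)
    return result


for color in ("red", "white"):
    probs = posterior(color)
    best = max(probs.values())
    winners = sorted(m for m, p in probs.items() if p == best)
    print(color, {m: str(p) for m, p in probs.items()}, "most probable:", winners)
```

Returning every class that attains the maximum makes the tie in Example #2 visible instead of silently picking one of the two classes.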
Why Use Bayesian Classification? Probabilistic Learning: Allows you to calculate explicit probabilities for a hypothesis -- “learn as you go”. This is among the most practical approaches to certain types of learning problems (e.g., e-mail Spam detection). Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Data-Driven: Prior knowledge can be combined with observed data. Probabilistic Prediction: Allows you to predict multiple hypotheses, each weighted by their own probabilities. The Standard: Bayesian methods provide a standard of optimal decision-making against which other methods can be compared.
Naïve Bayesian Classification Naïve Bayesian Classification assumes that all attributes are independent of one another, given the class. Naïve Bayes assumption: attribute independence P(x_1,…,x_k | C) = P(x_1 | C)·…·P(x_k | C) (= a simple product of probabilities). P(x_i | C) is estimated as the relative frequency of samples in class C for which attribute “i” has the value “x_i”. This assumes that there is no correlation among the attribute values x_1,…,x_k within a class (attribute independence).
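As a rough sketch of how the product rule is applied, the Python below scores a new instance against the 14-row weather data set that appears later in these slides. The function name `naive_bayes_posteriors` and the query instance are assumptions made for the example, and no smoothing of zero counts is applied.

```python
from collections import Counter

# Weather data from the slides: outlook, temperature, humidity, windy, play.
WEATHER = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]


def naive_bayes_posteriors(instance):
    """Return normalized P(C | x_1..x_k) under the attribute-independence assumption."""
    class_counts = Counter(row[-1] for row in WEATHER)
    scores = {}
    for label, n_c in class_counts.items():
        score = n_c / len(WEATHER)                      # prior P(C)
        for i, value in enumerate(instance):
            matches = sum(1 for row in WEATHER
                          if row[-1] == label and row[i] == value)
            score *= matches / n_c                      # relative frequency P(x_i | C)
        scores[label] = score
    total = sum(scores.values())
    if total == 0:
        return scores                                   # every class got a zero factor
    return {label: s / total for label, s in scores.items()}


# Hypothetical new day to classify.
query = ("sunny", "cool", "high", "true")
posteriors = naive_bayes_posteriors(query)
print(posteriors, "->", max(posteriors, key=posteriors.get))
```

Dividing by the sum of the class scores plays the role of P(X), so the evidence term never has to be estimated separately.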
The Independence Hypothesis… … makes the computation possible (tractable) … yields optimal classifiers when satisfied … but is seldom satisfied in practice, as attributes (variables) are often correlated. Some approaches to overcome this limitation: – Bayesian networks, that combine Bayesian reasoning with causal relationships between attributes – Decision trees, that reason on one attribute at a time, considering most important attributes first
Decision Tree Based Classification Advantages: – Inexpensive to construct – Extremely fast at classifying unknown records – Easy to interpret for small-sized trees – Accuracy is comparable to other classification techniques for many simple data sets
Decision trees Decision trees are popular for pattern recognition because the models they produce are easier to understand. [Figure: a tree with a root node; legend: A. Nodes of the tree, B. Leaves (terminal nodes) of the tree, C. Branches (decision points) of the tree]
Weather Data: Play or not Play?
Outlook   Temperature  Humidity  Windy  Play?
sunny     hot          high      false  No
sunny     hot          high      true   No
overcast  hot          high      false  Yes
rain      mild         high      false  Yes
rain      cool         normal    false  Yes
rain      cool         normal    true   No
overcast  cool         normal    true   Yes
sunny     mild         high      false  No
sunny     cool         normal    false  Yes
rain      mild         normal    false  Yes
sunny     mild         normal    true   Yes
overcast  mild         high      true   Yes
overcast  hot          normal    false  Yes
rain      mild         high      true   No
Note: Outlook is the forecast; no relation to the Microsoft email program.
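Example Tree for “Play?” [Figure: decision tree with internal nodes Outlook, Humidity and Windy; branch labels sunny, overcast, rain (Outlook), high, normal (Humidity), false, true (Windy); leaves Yes and No]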
Building Decision Tree [Q93] Top-down tree construction – At start, all training examples are at the root. – Partition the examples recursively by choosing one attribute each time. Bottom-up tree pruning – Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.
Choosing the Splitting Attribute At each node, available attributes are evaluated on the basis of separating the classes of the training examples. A Goodness function is used for this purpose. Typical goodness functions: – information gain (ID3/C4.5) – information gain ratio – gini index witten&eibe
A criterion for attribute selection Which is the best attribute? – The one which will result in the smallest tree – Heuristic: choose the attribute that produces the “purest” nodes Popular impurity criterion: information gain – Information gain increases with the average purity of the subsets that an attribute produces Strategy: choose attribute that results in greatest information gain
Computing information Information is measured in bits – Given a probability distribution, the info required to predict an event is the distribution’s entropy – Entropy gives the information required in bits (this can involve fractions of bits!) Formula for computing the entropy: $\mathrm{entropy}(p_1, p_2, \ldots, p_n) = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n$
Alternative Splitting Criteria based on INFO Entropy at a given node t: $\mathrm{Entropy}(t) = -\sum_j p(j \mid t) \log_2 p(j \mid t)$ (NOTE: p(j | t) is the relative frequency of class j at node t). – Measures homogeneity of a node. Maximum ($\log_2 n_c$, where $n_c$ is the number of classes) when records are equally distributed among all classes, implying least information. Minimum (0.0) when all records belong to one class, implying most information. – Entropy based computations are similar to the GINI index computations
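A minimal Python sketch of the entropy measure, working from class counts at a node. The function name `entropy` and the count-list interface are choices made for this example; classes with zero count are skipped, which matches the usual convention of treating 0·log(0) as zero.

```python
import math


def entropy(class_counts):
    """Entropy in bits of a node described by its per-class record counts."""
    total = sum(class_counts)
    bits = 0.0
    for count in class_counts:
        if count > 0:                      # skip empty classes: 0 * log(0) := 0
            p = count / total
            bits -= p * math.log2(p)
    return bits


print(entropy([7, 7]))   # two equally frequent classes: the maximum, log2(2)
print(entropy([4, 0]))   # pure node: the minimum
print(entropy([9, 5]))   # a mixed node such as 9 "Yes" vs. 5 "No"
```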
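Example: attribute “Outlook” “Outlook” = “Sunny”: $\mathrm{info}([2,3]) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$ bits. “Outlook” = “Overcast”: $\mathrm{info}([4,0]) = -1 \log_2 1 - 0 \log_2 0 = 0$ bits. “Outlook” = “Rainy”: $\mathrm{info}([3,2]) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$ bits. Expected information for attribute: $\tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.693$ bits. Here $\mathrm{info}([a,b])$ denotes the entropy of a subset with a “Yes” and b “No” instances, counted from the weather data. Note: log(0) is not defined, but we evaluate 0*log(0) as zero.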
Computing the information gain Information gain: (information before split) – (information after split) Compute for attribute “Humidity”
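Example: attribute “Humidity” “Humidity” = “High”: $\mathrm{info}([3,4]) = 0.985$ bits. “Humidity” = “Normal”: $\mathrm{info}([6,1]) = 0.592$ bits. Expected information for attribute: $\tfrac{7}{14} \times 0.985 + \tfrac{7}{14} \times 0.592 = 0.788$ bits. Information Gain: $\mathrm{info}([9,5]) - 0.788 = 0.940 - 0.788 = 0.152$ bits. The counts [Yes, No] are taken from the weather data.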
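Computing the information gain Information gain: (information before split) – (information after split) Information gain for attributes from weather data: gain(“Outlook”) = 0.247 bits, gain(“Temperature”) = 0.029 bits, gain(“Humidity”) = 0.152 bits, gain(“Windy”) = 0.048 bits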
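The information gain of each attribute can be checked directly against the weather data. The following Python is a sketch under the definitions on these slides; the names `entropy`, `info_gain`, `ATTRIBUTES` and `WEATHER` are introduced only for this example.

```python
import math
from collections import Counter, defaultdict

ATTRIBUTES = ["Outlook", "Temperature", "Humidity", "Windy"]
WEATHER = [
    ("sunny", "hot", "high", "false", "No"),
    ("sunny", "hot", "high", "true", "No"),
    ("overcast", "hot", "high", "false", "Yes"),
    ("rain", "mild", "high", "false", "Yes"),
    ("rain", "cool", "normal", "false", "Yes"),
    ("rain", "cool", "normal", "true", "No"),
    ("overcast", "cool", "normal", "true", "Yes"),
    ("sunny", "mild", "high", "false", "No"),
    ("sunny", "cool", "normal", "false", "Yes"),
    ("rain", "mild", "normal", "false", "Yes"),
    ("sunny", "mild", "normal", "true", "Yes"),
    ("overcast", "mild", "high", "true", "Yes"),
    ("overcast", "hot", "normal", "false", "Yes"),
    ("rain", "mild", "high", "true", "No"),
]


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum(c / total * math.log2(c / total)
                for c in Counter(labels).values())


def info_gain(rows, attr_index):
    """(information before split) - (expected information after split)."""
    before = entropy([row[-1] for row in rows])
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attr_index]].append(row[-1])   # group class labels by attribute value
    after = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return before - after


gains = {name: info_gain(WEATHER, i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in gains.items():
    print(f"gain({name}) = {gain:.3f} bits")
print("split on:", max(gains, key=gains.get))
```

Applying `info_gain` recursively to each resulting subset is the top-down construction step described earlier in the slides.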
The final decision tree Note: not all leaves need to be pure; sometimes identical instances have different classes. Splitting stops when data can’t be split any further
Highly-branching attributes Problematic: attributes with a large number of values (extreme case: ID code). Subsets are more likely to be pure if there is a large number of values. Information gain is biased towards choosing attributes with a large number of values. This may result in overfitting (selection of an attribute that is non-optimal for prediction)
Weather Data with ID code
ID  Outlook   Temperature  Humidity  Windy  Play?
A   sunny     hot          high      false  No
B   sunny     hot          high      true   No
C   overcast  hot          high      false  Yes
D   rain      mild         high      false  Yes
E   rain      cool         normal    false  Yes
F   rain      cool         normal    true   No
G   overcast  cool         normal    true   Yes
H   sunny     mild         high      false  No
I   sunny     cool         normal    false  Yes
J   rain      mild         normal    false  Yes
K   sunny     mild         normal    true   Yes
L   overcast  mild         high      true   Yes
M   overcast  hot          normal    false  Yes
N   rain      mild         high      true   No
Split for ID Code Attribute Entropy of split = 0 (since each leaf node is “pure”, having only one case). Information gain is therefore maximal for ID code
Gain ratio Gain ratio: a modification of the information gain that reduces its bias on high-branch attributes Gain ratio should be – Large when data is evenly spread – Small when all data belong to one branch Gain ratio takes number and size of branches into account when choosing an attribute – It corrects the information gain by taking the intrinsic information of a split into account (i.e. how much info do we need to tell which branch an instance belongs to)
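Gain Ratio and Intrinsic Info. Intrinsic information: entropy of distribution of instances into branches: $\mathrm{IntrinsicInfo}(S, A) = -\sum_i \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}$, where $S_i$ is the subset of S sent down branch i of attribute A. Gain ratio (Quinlan’86) normalizes info gain by the intrinsic information: $\mathrm{GainRatio}(S, A) = \frac{\mathrm{Gain}(S, A)}{\mathrm{IntrinsicInfo}(S, A)}$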
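Computing the gain ratio Example: intrinsic information for ID code: $\mathrm{info}([1,1,\ldots,1]) = 14 \times \left(-\tfrac{1}{14} \log_2 \tfrac{1}{14}\right) = 3.807$ bits. Importance of attribute decreases as intrinsic information gets larger. Example of gain ratio: $\mathrm{gain\_ratio}(\text{ID code}) = 0.940 / 3.807 = 0.247$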
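A small Python sketch of the gain ratio, computed from per-branch class counts for the weather data. The helper names (`entropy`, `info_gain`, `gain_ratio`) and the `[yes, no]` count encoding are assumptions made for this example.

```python
import math


def entropy(counts):
    """Entropy in bits of a list of counts (zero counts contribute nothing)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def info_gain(branches):
    """Information gain of a split; each branch is a [yes, no] count pair."""
    parent = [sum(col) for col in zip(*branches)]
    total = sum(parent)
    after = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent) - after


def gain_ratio(branches):
    """Information gain divided by the intrinsic information of the split."""
    intrinsic = entropy([sum(b) for b in branches])   # entropy of the branch sizes
    return info_gain(branches) / intrinsic            # undefined for a one-branch split


# Weather data: Outlook splits into sunny / overcast / rain.
outlook = [[2, 3], [4, 0], [3, 2]]
# ID code: 14 single-instance branches, 9 of them "Yes" and 5 of them "No".
id_code = [[1, 0]] * 9 + [[0, 1]] * 5

print(f"Outlook: gain {info_gain(outlook):.3f}, ratio {gain_ratio(outlook):.3f}")
print(f"ID code: gain {info_gain(id_code):.3f}, ratio {gain_ratio(id_code):.3f}")
```

The ID code attribute keeps its maximal gain, but dividing by its large intrinsic information shrinks its score considerably.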
More on the gain ratio “Outlook” still comes out top among the original attributes. However: “ID code” has greater gain ratio – Standard fix: ad hoc test to prevent splitting on that type of attribute. Problem with gain ratio: it may overcompensate – May choose an attribute just because its intrinsic information is very low – Standard fix: first, only consider attributes with greater than average information gain; then, compare them on gain ratio
*CART Splitting Criteria: Gini Index If a data set T contains examples from n classes, the gini index gini(T) is defined as $\mathrm{gini}(T) = 1 - \sum_{j=1}^{n} p_j^2$ where p_j is the relative frequency of class j in T. gini(T) is minimized if the classes in T are skewed.
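The definition translates directly into code. A minimal Python sketch, with the function name `gini` and the count-list interface chosen for this example:

```python
def gini(class_counts):
    """Gini index of a data set described by its per-class example counts."""
    total = sum(class_counts)
    return 1.0 - sum((count / total) ** 2 for count in class_counts)


print(gini([7, 7]))   # evenly mixed two-class set: the two-class maximum
print(gini([4, 0]))   # completely skewed (pure) set: the minimum
print(gini([9, 5]))   # e.g. 9 "Yes" vs. 5 "No"
```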
Discussion Algorithm for top-down induction of decision trees (“ID3”) was developed by Ross Quinlan – Gain ratio is just one modification of this basic algorithm – Led to development of C4.5, which can deal with numeric attributes, missing values, and noisy data Similar approach: CART (to be covered later) There are many other attribute selection criteria! (But almost no difference in accuracy of result.)
C4.5 History ID3, CHAID – 1960s C4.5 innovations (Quinlan): – permit numeric attributes – deal sensibly with missing values – pruning to deal with noisy data C4.5 - one of the best-known and most widely-used learning algorithms – Last research version: C4.8, implemented in Weka as J4.8 (Java) – Commercial successor: C5.0 (available from Rulequest)
How to Address Overfitting Pre-Pruning (Early Stopping Rule) – Stop the algorithm before it becomes a fully-grown tree – Typical stopping conditions for a node: stop if all instances belong to the same class; stop if all the attribute values are the same – Based on a statistical significance test: stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node – More restrictive conditions: stop if the number of instances is less than some user-specified threshold; stop if the class distribution of the instances is independent of the available features (e.g., using a χ² test); stop if expanding the current node does not improve impurity measures (e.g., Gini or information gain).
How to Address Overfitting… Post-pruning – Grow decision tree to its entirety – Trim the nodes of the decision tree in a bottom-up fashion – If generalization error improves after trimming, replace sub-tree by a leaf node. – Class label of leaf node is determined from majority class of instances in the sub-tree – Postpruning is preferred in practice: prepruning can “stop too early”
Subtree replacement Bottom-up: consider replacing a tree only after considering all its subtrees
Estimating error rates Prune only if it reduces the estimated error. Error on the training data is NOT a useful estimator. Q: Why would it result in very little pruning? Use a hold-out set for pruning (“reduced-error pruning”). C4.5’s method – Derive confidence interval from training data – Use a heuristic limit, derived from this, for pruning – Standard Bernoulli-process-based method – Shaky statistical assumptions (based on training data)
Extracting Classification Rules from Trees Represent the knowledge in the form of IF-THEN rules. One rule is created for each path from the root to a leaf. Each attribute-value pair along a path forms a conjunction. The leaf node holds the class prediction. Rules are easier for humans to understand.
Example
IF age = “<=30” AND student = “no” THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “fair” THEN buys_computer = “no”
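Each root-to-leaf path becomes one conjunction of attribute tests. As an illustration only, the five example rules can be written as a Python function; the function name `buys_computer` and the decision to pass the attribute values as the category strings used on the slide are assumptions of this sketch.

```python
def buys_computer(age, student, credit_rating):
    """Apply the five IF-THEN rules; each branch is one root-to-leaf path."""
    if age == "<=30" and student == "no":
        return "no"
    if age == "<=30" and student == "yes":
        return "yes"
    if age == "31…40":
        return "yes"
    if age == ">40" and credit_rating == "excellent":
        return "yes"
    if age == ">40" and credit_rating == "fair":
        return "no"
    raise ValueError("no rule covers this combination of attribute values")


print(buys_computer("<=30", "yes", "fair"))
print(buys_computer(">40", "no", "fair"))
```

Because the rules come from distinct paths of one tree, at most one of them fires for any input.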
K Nearest Neighbor (KNN): Training set includes classes. Examine the K items nearest to the item to be classified. The new item is placed in the class with the largest number of close items. O(q) for each tuple to be classified. (Here q is the size of the training set.)
The k-Nearest Neighbor Algorithm All instances correspond to points in the n-D space. The nearest neighbors are defined in terms of Euclidean distance. The target function could be discrete- or real-valued. For a discrete-valued target, k-NN returns the most common value among the k training examples nearest to x_q. Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples. [Figure: query point x_q surrounded by training examples labeled + and _]
Discussion on the k-NN Algorithm The k-NN algorithm for continuous-valued target functions – Calculate the mean value of the k nearest neighbors. Distance-weighted nearest neighbor algorithm – Weight the contribution of each of the k neighbors according to its distance to the query point x_q, giving greater weight to closer neighbors – Similarly, for real-valued target functions. Robust to noisy data by averaging the k nearest neighbors. Curse of dimensionality: distance between neighbors could be dominated by irrelevant attributes. – To overcome it, stretch the axes or eliminate the least relevant attributes.
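For a real-valued target, distance weighting can be sketched as a weighted mean over the k nearest neighbors. The inverse-squared-distance weight used below is one common choice and is an assumption of this example (the slides only require that closer neighbors get greater weight); the function name and toy data are hypothetical as well.

```python
import math


def weighted_knn_predict(train, query, k):
    """Distance-weighted k-NN estimate for a real-valued target.

    train is a list of (point, value) pairs; points are equal-length tuples.
    """
    nearest = sorted(train, key=lambda pv: math.dist(pv[0], query))[:k]
    weighted_sum = 0.0
    weight_total = 0.0
    for point, value in nearest:
        d = math.dist(point, query)
        if d == 0:
            return value                 # exact match: use the stored value directly
        w = 1.0 / d ** 2                 # assumed weighting: inverse squared distance
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total


# Hypothetical one-dimensional training set.
train = [((0.0,), 0.0), ((2.0,), 10.0), ((10.0,), 100.0)]
print(weighted_knn_predict(train, (1.0,), k=2))   # equidistant neighbors: plain mean
print(weighted_knn_predict(train, (1.5,), k=2))   # pulled towards the closer neighbor
```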
K Nearest Neighbors – Advantages: Nonparametric architecture; Simple; Powerful; Requires no training time – Disadvantages: Memory intensive; Classification/estimation is slow
K Nearest Neighbors The key issues involved in training this model include setting – the variable K: validation techniques (e.g., cross-validation) – the type of distance metric: Euclidean measure
[Figure: K Nearest Neighbors Example. Legend: X = stored training set patterns; X = input pattern for classification; --- = Euclidean distance measure to the nearest three patterns]
Store all input data in the training set. For each pattern in the test set: search for the K nearest patterns to the input pattern using a Euclidean distance measure. For classification, compute the confidence for each class as C_i / K (where C_i is the number of patterns among the K nearest patterns belonging to class i). The classification for the input pattern is the class with the highest confidence.
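The procedure above can be sketched in Python as a brute-force search over the stored patterns. The function name `knn_classify` and the toy training set are assumptions made for this example.

```python
import math
from collections import Counter


def knn_classify(train, query, k):
    """Return (predicted class, per-class confidence C_i / K).

    train is a list of (point, label) pairs stored as-is; no model is fitted.
    """
    # Search for the K nearest stored patterns by Euclidean distance.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    confidences = {label: count / k for label, count in votes.items()}
    best = max(confidences, key=confidences.get)
    return best, confidences


# Hypothetical two-class training set in 2-D.
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B")]
label, conf = knn_classify(train, (4, 4), k=3)
print(label, conf)
```

Sorting all stored patterns costs more than the single O(q) scan mentioned earlier; it is used here only to keep the sketch short.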
Training parameters and typical settings Number of nearest neighbors – The number of nearest neighbors (K) should be chosen by cross-validation over a range of K settings. – K = 1 is a good baseline model to benchmark against. – A good rule of thumb is that K should be less than the square root of the total number of training patterns.
Training parameters and typical settings Input compression – Since KNN is very storage intensive, we may want to compress data patterns as a preprocessing step before classification. – Using input compression will typically result in slightly worse performance. – Sometimes compression will improve performance, because it performs automatic normalization of the data, which can equalize the effect of each input in the Euclidean distance measure.
Support Vector Machines (SVM) Support vectors. Maximizes the margin: the SVM maximizes the margin between the separating hyperplanes. The decision function is fully determined by a subset of the training data, the support vectors. Quadratic programming problem. Many consider it the most successful text classification method.
SVM Method The space is partitioned, in the basic case by a linear shape, so that the separator splits the objects belonging to different classes as well as possible. Typical application: two-class cases, e.g. spam filtering, linearly separable cases. Input data: – the objects together with their class membership data (x_i, y_i) – The goal is to determine the hyperplane that gives the best separation. Measuring the quality of the separation: – the size of the distance between the separating margins. The condition for separation is that instances with different class values fall on opposite sides.
Example of a case that is not linearly separable Let us look for a hyperplane that penalizes the points lying on the “wrong side”.
Penalizing overlapping points For every point, define its distance from the separator ax + by = c as (ax + by) - c for red points and c - (ax + by) for blue points. For overlapping points it will be negative.
Classification with an SVM Given a new point (x_1, x_2), determine its projection onto the normal of the hyperplane: – Compute: score = w · x + b – In 2 dimensions: score = w_1 x_1 + w_2 x_2 + b. – Specify a confidence threshold t. Score > t: yes. Score < -t: no. Otherwise: don't know.
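The scoring rule can be sketched in Python as follows. The weight vector, bias and threshold values are hypothetical; in practice they would come from training an SVM, which this sketch does not do.

```python
def svm_decision(w, x, b, t):
    """Three-way decision from the linear score w . x + b and threshold t."""
    score = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    if score > t:
        return "yes"
    if score < -t:
        return "no"
    return "don't know"          # the point lies inside the band around the hyperplane


# Hypothetical 2-D separator and confidence threshold.
w, b, t = (1.0, -1.0), 0.0, 1.0
for point in [(3.0, 1.0), (1.0, 3.0), (2.0, 1.5)]:
    print(point, svm_decision(w, point, b, t))
```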