
1 Information Organization: Classification

2 Classification: Types
Single-Label vs. Multi-Label
- Single-label: non-overlapping categories
- Multi-label: overlapping categories
Document-Pivoted vs. Category-Pivoted
- Document-pivoted: document → categories (e.g., filtering)
- Category-pivoted: category → documents (e.g., populating a newly created category)
Binary vs. Non-Binary
- Binary: yes|no (i.e., hard classification)
- Non-binary: multi-level (e.g., yes|no|maybe), scored/ranked
Automatic vs. Manual
- Automatic: no human intervention
- Manual: no machine intervention (interactive/semi-automatic)
Machine-Learning (ML) vs. Knowledge-Engineering (KE)
- ML: classifier automatically "learned" from training data
- KE: rule-based (e.g., if a or b or c, then category C)

3 Classification: Binary vs. Multi-Class
Binary Classification
- Task of classifying an item of a given set into one of two groups (based on its properties)
- Examples:
  - Color: put a tennis ball into the Color or No-Color bin
  - Decide if an email is spam or not
  - Medical test: determine if a patient has a certain disease or not
  - Quality control test: decide if a product should be sold or discarded
  - IR test: determine if a document should be in the search results or not
Multi-Class Classification
- Task of classifying an item of a given set into one of multiple groups (based on its properties)
- Examples:
  - Color: put a tennis ball into the Green, Orange, or White ball bin
  - Decide if an email is advertisement, newsletter, phishing, hack, or personal
  - Classify a document into Yahoo! categories
  - Optical recognition: classify a scanned character into a digit (0..9)

4 Classification: Multi-Class
One vs. All (a code sketch follows below)
- M binary classifiers (BC) for M-class classification
- each BC is trained to separate its own class from the rest
  - e.g., (0 vs. 1..9), (1 vs. 0, 2..9), …, (9 vs. 0..8)
- winner-take-all: the class with the highest BC score wins
- Characteristics:
  - several binary classifiers can assign an item to their classes
  - asymmetric training (many more negative than positive examples)
Pairwise Classification
- M(M−1)/2 binary classifiers for M-class classification
- a classifier is trained for each possible pair of classes
  - e.g., (0 vs. 1), (0 vs. 2), …, (0 vs. 9), (1 vs. 2), …, (1 vs. 9), …, (8 vs. 9)
- Voting: the class with the highest number of classifier votes wins
- DAG: evaluate the pairwise classifiers along a directed acyclic graph
- Characteristics:
  - need to train many binary classifiers
  - symmetric training (smaller problem space per classifier)
Multi-Class Objective Function
- one M-class classifier, trained to output an ordering of classes; the first class is the winner
- solves the problem directly
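A minimal one-vs.-all sketch, assuming scikit-learn's LogisticRegression as the underlying binary classifier (any scorer with a decision function would do); X, y, and the class list are hypothetical inputs:

```python
# One-vs-all: train one binary classifier per class, then let the
# class with the highest score win (winner-take-all).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_one_vs_all(X, y, classes):
    models = {}
    for c in classes:
        # Relabel: 1 for "this class", 0 for "the rest"
        # (asymmetric: far more negatives than positives).
        models[c] = LogisticRegression().fit(X, (y == c).astype(int))
    return models

def predict_one_vs_all(models, X):
    classes = list(models)
    # Score every item under each binary classifier; winner takes all.
    scores = np.column_stack([models[c].decision_function(X) for c in classes])
    return [classes[i] for i in scores.argmax(axis=1)]
```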

5 Classification: Procedure
Select/prepare the training data
- set of categories and classified items (positive and negative samples)
- split the data into training and validation/test sets
Build the classifier
- Classifier: feature vector consisting of the most "important" terms for the class
- Feature representation: transform documents into a set of features (e.g., terms)
- Dimensionality reduction: select the best set of features to improve accuracy and prevent overfitting
- Train/learn the classifier on the training set
- Test/evaluate the classifier on the validation/test set to optimize parameters (e.g., thresholds, feature count/weights)
- Retrain the classifier on the whole data
Apply the classifier to new data
Classifier (ML) algorithms (a pipeline sketch follows below)
- probability of class membership, e.g., Bayes method
- similarity to the class feature vector, e.g., Rocchio method, k-NN method
- others: Support Vector Machine, Decision Tree, regression models, etc.
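A sketch of these steps, assuming scikit-learn and a hypothetical load_corpus() that returns documents and labels:

```python
# End-to-end sketch: split the data, extract term features, reduce
# dimensionality, train a probabilistic classifier, evaluate, retrain.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs, labels = load_corpus()                      # hypothetical loader
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=0.2)

pipeline = make_pipeline(
    TfidfVectorizer(stop_words="english"),        # feature representation
    SelectKBest(chi2, k=1000),                    # dimensionality reduction
    MultinomialNB())                              # probability of class membership

pipeline.fit(train_docs, y_train)                 # train on the training set
print(pipeline.score(test_docs, y_test))          # evaluate; tune k, thresholds, ...
pipeline.fit(docs, labels)                        # retrain on the whole data
```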

6 Dimension Reduction: Why

Initial features: size, color, shape, pattern
- G1 (BG, SM): separable by size alone
- G2 (SQ, CR): separable by shape alone
- G3 (BL, RD): separable by color alone
- G4?
[Figure: objects grouped as Group 1 (BG, SM), Group 2 (SQ, CR), Group 3 (BL, RD), and Group 4, against Classes 1-3]

7 Classification: Dimension Reduction
Feature Reduction
- stopping (stop-word removal), stemming & lemmatization
Feature Selection: select a subset of the original feature set
- Document Frequency (df)
- Information Gain (IG), i.e., Kullback-Leibler divergence
  - measures the usefulness (gain in information) of a feature in predicting a class
  - best performance with a small feature set
- Mutual Information (MI)
  - measures the dependency between a feature and a class
  - sensitive to small counts (a sketch follows below)
- Chi-Square (χ²)
  - measures the lack of independence between a feature and a class
  - a way of measuring the degree to which two patterns (expected & observed) differ:
    χ² = Σ (observed freq − expected freq)² / expected freq
- Term Strength (TS)
  - estimates term importance based on how commonly a term is likely to appear in related documents
  - TS(t) = P(t ∈ x | t ∈ y), where x and y are a pair of related documents (e.g., from the training set)
Feature Extraction: extract a set of features by combination or transformation of the original feature set
- term clustering
- Latent Semantic Indexing (SVD/PCA)
[Table: 2×2 feature-class contingency table with cells A, B, C, D; see the Chi-Square slide below]
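For illustration, a sketch of mutual information computed from the 2×2 feature-class document counts A, B, C, D referenced above, using the standard approximation MI(t, C) ≈ log(A·N / ((A+B)(A+C))):

```python
# Mutual information between feature t and class C from document counts:
#   A = docs in C containing t,     B = docs outside C containing t,
#   C_ = docs in C without t,       D = docs outside C without t.
import math

def mutual_information(A, B, C_, D):
    N = A + B + C_ + D
    # MI(t, C) = log P(t, C) / (P(t) P(C)) ~ log (A * N) / ((A + B)(A + C_))
    # Undefined when A = 0 -- one reason MI is sensitive to small counts.
    return math.log((A * N) / ((A + B) * (A + C_)))
```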

8 Bayes Classifier

Pick the most probable class c, given the evidence d:
  c = argmax_Cj P(Cj | d)
- Cj = class/category j
- d = document (t1, t2, …, tn)
P(Cj | d) = probability that document d belongs to category j
  = P(Cj) P(d | Cj) / P(d)  ∝  P(Cj) P(d | Cj)
- P(Cj) = probability that a randomly picked document belongs to category j
- P(d | Cj) = probability that category j contains document d
Naïve Bayes assumption: terms occur independently of each other given the class, so
  P(d | Cj) = Πk P(tk | Cj)
Estimates:
- P(Cj) = number of documents belonging to Cj / total number of documents
- P(tk | Cj) = probability of a term (i.e., feature) k occurring in category j
  = number of documents in Cj with term k / number of documents in Cj
A worked sketch follows below.
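A minimal sketch of this classifier using the document-count estimates above (term presence only; the add-one smoothing is an assumption, added to avoid zero probabilities):

```python
# Naive Bayes: pick argmax over classes of log P(Cj) + sum log P(tk | Cj).
import math
from collections import defaultdict

def train_nb(training_docs):          # training_docs: [(terms, category), ...]
    class_docs = defaultdict(int)     # documents per category
    term_docs = defaultdict(lambda: defaultdict(int))  # docs per (category, term)
    for terms, c in training_docs:
        class_docs[c] += 1
        for t in set(terms):
            term_docs[c][t] += 1
    return class_docs, term_docs, len(training_docs)

def classify_nb(model, terms):
    class_docs, term_docs, n = model
    best, best_logp = None, float("-inf")
    for c, dc in class_docs.items():
        logp = math.log(dc / n)                             # log P(Cj)
        for t in terms:                                     # Naive Bayes product
            logp += math.log((term_docs[c][t] + 1) / (dc + 2))  # smoothed P(tk|Cj)
        if logp > best_logp:
            best, best_logp = c, logp
    return best
```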

9 Rocchio Classifier

Build the class vectors using Rocchio's relevance feedback formula (with no initial query):
  c_j = (β/|R|) Σ_{d ∈ R} d  −  (γ/|S|) Σ_{d ∈ S} d
- class prototype vector = average vector of the class
- R = positive examples, S = negative examples
Compute document-class similarity
- the class vector created from the training data is used to classify new documents
Rank classes by similarity
A prototype sketch follows below.
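A prototype-vector sketch with NumPy; the β and γ weights on the positive and negative centroids are illustrative assumptions:

```python
# Rocchio: class prototype = weighted positive centroid minus weighted
# negative centroid; rank classes by cosine similarity to the document.
import numpy as np

def rocchio_prototype(R, S, beta=1.0, gamma=0.25):
    # R, S: row-wise matrices of positive / negative example vectors.
    return beta * R.mean(axis=0) - gamma * S.mean(axis=0)

def rank_classes(doc, prototypes):
    cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = {c: cos(doc, p) for c, p in prototypes.items()}
    return sorted(sims, key=sims.get, reverse=True)   # best class first
```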

10 K-Nearest Neighbor Classifier
Similar to the Rocchio classifier
- all instances correspond to points in an n-dimensional Euclidean space (e.g., a vector space)
Use the document's k nearest neighbors in the training set to compute document-class similarity
- find the k nearest neighbors of the document in the training set
- rank categories by document-kNN similarity
- i.e., instead of using static class centroid vectors, use the k training vectors nearest to the document being classified
[Figure: 1-NN vs. 3-NN examples]
A classification sketch follows below.
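A cosine-based k-NN sketch with NumPy (summed-similarity voting is one common scoring choice, an assumption here):

```python
# k-NN: find the k training vectors nearest the document, then score
# each category by the summed similarity of its members among them.
import numpy as np

def knn_classify(doc, train_vecs, train_labels, k=3):
    sims = train_vecs @ doc / (
        np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(doc))
    scores = {}
    for i in np.argsort(sims)[-k:]:          # indices of the k nearest neighbors
        scores[train_labels[i]] = scores.get(train_labels[i], 0.0) + sims[i]
    return max(scores, key=scores.get)       # best-scoring category
```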

11 Other Classifiers

Support Vector Machine
- assumes linear separability
  - in 2 dimensions, classes can be separated by a line ax + by = c
  - in higher dimensions, by a hyperplane
- Which hyperplane to choose? The one that maximizes the margin
- the decision function is fully specified by a subset of the training samples: the support vectors
Decision Tree
- nodes = terms, branches = probabilities, leaves = categories
Decision Rule
- rule-based, e.g., if X or Y or Z, then C1
Regression
- fitting of the training data to a real-valued function, e.g., Linear Least Squares Fit
An SVM sketch follows below.
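A linear SVM sketch, assuming scikit-learn's LinearSVC and hypothetical X_train, y_train, X_new arrays:

```python
# Linear SVM: learn the maximum-margin hyperplane w.x + b = 0;
# the support vectors are the training samples that pin it down.
from sklearn.svm import LinearSVC

clf = LinearSVC().fit(X_train, y_train)
print(clf.coef_, clf.intercept_)    # hyperplane parameters w and b
labels = clf.predict(X_new)         # side of the hyperplane => class
```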

12 Classification: Problems
Noisy Data
- training data often contains noise: false positives and false negatives (e.g., medical tests)
Inconsistent Classification
- the classification structure is static and ordered
- categorization: does a tomato go in the fruit or the vegetable category?
- category labels: the same category may be labeled "retrieval", "search", or "IR"
- indexing inconsistency/error
Resource Intensive
Solutions?
- faceted classification
- fusion of IR & IO

13 Supplemental Material
(Optional) For curious-minded and advanced learners.

14 Information Theory

Information
- if there are n equally probable possible messages, then the probability p of each is 1/n
- the information conveyed by a message is −log2(p) = log2(n)
- Example: with 16 equally probable messages, we need log2(16) = 4 bits to identify/send each message
Entropy
- the information contained in a message in terms of expected value
- a measure of the average information content that is missing, i.e., the uncertainty
- the information conveyed by the distribution P = (p1, p2, …, pn) is its entropy E(P):
  E(P) = −(p1·log2(p1) + p2·log2(p2) + … + pn·log2(pn))
- Examples:
  - if P is (0.5, 0.5), then E(P) is 1
  - if P is (0.67, 0.33), then E(P) is 0.92
  - if P is (1, 0), then E(P) is 0
- the more uniform the probability distribution, the greater its entropy: more information is conveyed by a message telling you which event actually occurred
Information Gain
- the expected reduction in entropy:
  IG(S, F) = E(S) − Σv (|Sv| / |S|) · E(Sv), summing over the values v of feature F
An entropy sketch follows below.
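A sketch reproducing the entropy examples above:

```python
import math

def entropy(probs):
    # E(P) = -sum p_i * log2(p_i), with the convention 0 * log2(0) = 0.
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0
print(entropy([0.67, 0.33]))  # ~0.92
print(entropy([1.0, 0.0]))    # 0.0
```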

15 IG: Example

Day  Outlook   Temperature  Humidity  Wind    Play Tennis
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

Values(Wind) = {Weak, Strong}
E(S) = −(9/14)·log2(9/14) − (5/14)·log2(5/14) = 0.940
E(S_weak) = −(6/8)·log2(6/8) − (2/8)·log2(2/8) = 0.811
E(S_strong) = −(3/6)·log2(3/6) − (3/6)·log2(3/6) = 1.0
IG(S, Wind) = E(S) − (8/14)·E(S_weak) − (6/14)·E(S_strong)
            = 0.940 − (8/14)·0.811 − (6/14)·1.0 = 0.048
For comparison: IG(S, Outlook) = 0.246, IG(S, Humidity) = 0.151, IG(S, Temperature) = 0.029
This computation is reproduced in the sketch below.
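A sketch reproducing IG(S, Wind) from the counts in the table:

```python
import math

def entropy(pos, neg):
    total = pos + neg
    return -sum(p * math.log2(p) for p in (pos / total, neg / total) if p > 0)

E_S      = entropy(9, 5)   # all 14 days: 9 Yes, 5 No     -> 0.940
E_weak   = entropy(6, 2)   # 8 Weak days: 6 Yes, 2 No     -> 0.811
E_strong = entropy(3, 3)   # 6 Strong days: 3 Yes, 3 No   -> 1.0
print(E_S - (8 / 14) * E_weak - (6 / 14) * E_strong)      # 0.048
```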

16 Chi-Square (χ²)

Chi-square measures the lack of independence between a feature and a class
- a way of measuring the degree to which two patterns (expected & observed) differ:
  χ² = Σ (observed freq − expected freq)² / expected freq
Null hypothesis: feature and class are independent
Contingency table (document counts):

                Class C   not C   Total
  Feature t     A         B       A+B
  not t         C         D       C+D
  Total         A+C       B+D     N

Expected frequencies:
  E(A) = N·p(t)·p(C)   = N · (A+B)/N · (A+C)/N = (A+B)(A+C)/N
  E(B) = N·p(t)·p(¬C)  = N · (A+B)/N · (B+D)/N = (A+B)(B+D)/N
  E(C) = N·p(¬t)·p(C)  = N · (C+D)/N · (A+C)/N = (C+D)(A+C)/N
  E(D) = N·p(¬t)·p(¬C) = N · (C+D)/N · (B+D)/N = (C+D)(B+D)/N
A computation sketch follows below.
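A sketch computing χ² from the four cell counts, with the expected frequencies derived from the margins exactly as above:

```python
def chi_square(A, B, C, D):
    # Expected frequency of each cell under independence, then the
    # sum of (observed - expected)^2 / expected over all four cells.
    N = A + B + C + D
    expected = {"A": (A + B) * (A + C) / N, "B": (A + B) * (B + D) / N,
                "C": (C + D) * (A + C) / N, "D": (C + D) * (B + D) / N}
    observed = {"A": A, "B": B, "C": C, "D": D}
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)
```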


18 Decision Tree

       Cough  Fever  Weight  Pain     Classification
Mary   no     yes    normal  throat   flu
Fred   no     yes    normal  abdomen  appendicitis
Julie  yes    yes    skinny  none     flu
Elvis  yes    no     obese   chest    heart disease

A fitting sketch follows below.
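A fitting sketch over this toy table, assuming scikit-learn with one-hot encoded categorical features:

```python
# Learn a decision tree whose nodes test symptom features and whose
# leaves are diagnoses, from the four patients above.
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

rows = [["no",  "yes", "normal", "throat"],    # Mary
        ["no",  "yes", "normal", "abdomen"],   # Fred
        ["yes", "yes", "skinny", "none"],      # Julie
        ["yes", "no",  "obese",  "chest"]]     # Elvis
labels = ["flu", "appendicitis", "flu", "heart disease"]

encoder = OneHotEncoder()
tree = DecisionTreeClassifier().fit(encoder.fit_transform(rows), labels)
print(tree.predict(encoder.transform([["no", "yes", "normal", "throat"]])))  # ['flu']
```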

