Presentation on theme: "Feature Selection: Why?"— Presentation transcript:

1 Feature Selection: Why?
- Text collections have a large number of features: 10,000 to 1,000,000 unique words, and more
- May make using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features
- Reduces training time: training time for some methods is quadratic or worse in the number of features
- Can improve generalization (performance): eliminates noise features and avoids overfitting
- Think of the two NB models (Sec. 13.5)

2 Basic feature selection algorithm
For a given class c, we compute a utility measure A(t, c) for each term t of the vocabulary, then select the k terms that have the highest values of A(t, c).

SELECTFEATURES(D, c, k)
  V ← EXTRACTVOCABULARY(D)
  L ← []
  for each t ∈ V
  do A(t, c) ← COMPUTEFEATUREUTILITY(D, t, c)
     APPEND(L, ⟨A(t, c), t⟩)
  return FEATURESWITHLARGESTVALUES(L, k)
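A minimal Python sketch of this selection loop, assuming documents are given as (class_label, tokens) pairs and that compute_feature_utility is any scoring function such as the mutual information or chi-square measures on the following slides (all names here are illustrative, not from the slides):

```python
import heapq

def select_features(docs, c, k, compute_feature_utility):
    """Return the k terms with the highest utility A(t, c).

    docs: list of (class_label, tokens) pairs.
    compute_feature_utility: callable(docs, term, c) -> float.
    """
    vocabulary = {t for _, tokens in docs for t in tokens}
    scored = ((compute_feature_utility(docs, t, c), t) for t in vocabulary)
    # Keep only the k highest-scoring (utility, term) pairs.
    return [t for _, t in heapq.nlargest(k, scored)]
```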

3 Feature selection: how?
Three utility measures:
- Information theory: how much information does the value of one categorical variable give you about the value of another? (Mutual information)
- Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? (Chi-square test)
- Frequency
(Sec. 13.5)

4 Feature selection via Mutual Information
From the training set, choose the k words that best discriminate (give the most information on) the categories. The mutual information between a term and a class is computed for each word w and each category c:

  I(U; C) = sum over e_t in {1,0} and e_c in {1,0} of P(U = e_t, C = e_c) * log2 [ P(U = e_t, C = e_c) / ( P(U = e_t) * P(C = e_c) ) ]

where e_t = 1 means the document contains the term and e_c = 1 means the document is in the class. For MLEs of the probabilities, the P terms are estimated from the corresponding document counts divided by the total number of documents N.
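A small Python sketch of the MLE form of this measure, assuming the four cells of the term/class document-count table are already available (the argument names are mine, not from the slide): n11 = documents in the class containing the term, n10 = documents outside the class containing the term, n01 and n00 their term-absent counterparts.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """I(U; C) computed from 2x2 document counts (MLE of the probabilities)."""
    n = n11 + n10 + n01 + n00
    n1_ = n11 + n10   # documents containing the term
    n0_ = n01 + n00   # documents not containing the term
    n_1 = n11 + n01   # documents in the class
    n_0 = n10 + n00   # documents not in the class

    def contribution(n_joint, n_term, n_class):
        if n_joint == 0:
            return 0.0  # x * log x -> 0 as x -> 0
        return (n_joint / n) * math.log2(n * n_joint / (n_term * n_class))

    return (contribution(n11, n1_, n_1) + contribution(n01, n0_, n_1) +
            contribution(n10, n1_, n_0) + contribution(n00, n0_, n_0))
```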

5 Feature selection via Mutual Information (example)

6 Feature selection via Mutual Information
Mutual information measures how much information, in the information-theoretic sense, a term contains about the class.
- If a term's distribution in the class is the same as its distribution in the collection as a whole, then I(U; C) = 0.
- MI reaches its maximum value if the term is a perfect indicator of class membership, i.e. the term is present in a document if and only if the document is in the class.

7 χ² statistic
In statistics, the χ² test is applied to test the independence of two events. In feature selection, the two events are the occurrence of the term and the occurrence of the class:

  χ²(D, t, c) = sum over e_t in {0,1} and e_c in {0,1} of (N_{e_t e_c} - E_{e_t e_c})² / E_{e_t e_c}

where N is the observed document count and E the count expected if term and class were independent.

8 χ² statistic (example)

9 χ² statistic (example)
χ² is a measure of how much expected counts E and observed counts N deviate from each other. A high value of χ² indicates that the hypothesis of independence, which implies that expected and observed counts are similar, is incorrect.
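A matching Python sketch that derives the expected counts E from the table's marginals under the independence hypothesis and sums the squared deviations (again, the argument names are mine):

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for independence of term occurrence and class."""
    n = n11 + n10 + n01 + n00
    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    term_total = {1: n11 + n10, 0: n01 + n00}    # marginals over the term
    class_total = {1: n11 + n01, 0: n10 + n00}   # marginals over the class

    chi2 = 0.0
    for (et, ec), n_observed in observed.items():
        # Expected count if term occurrence and class membership were independent.
        expected = term_total[et] * class_total[ec] / n
        if expected > 0:
            chi2 += (n_observed - expected) ** 2 / expected
    return chi2
```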

10 Frequency-based feature selection
Select the terms that are most common in the class. Frequency can be interpreted either as document frequency (the number of documents in class c that contain term t) or as collection frequency (the number of tokens of t that occur in documents in c).
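A short Python sketch of both variants, assuming the same (class_label, tokens) document representation as above:

```python
from collections import Counter

def frequent_terms(docs, c, k, use_collection_frequency=False):
    """Top-k terms for class c by document frequency or collection frequency."""
    counts = Counter()
    for label, tokens in docs:
        if label != c:
            continue
        if use_collection_frequency:
            counts.update(tokens)        # count every token occurrence of t in c
        else:
            counts.update(set(tokens))   # count each document containing t at most once
    return [t for t, _ in counts.most_common(k)]
```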

11 Frequency-based feature selection
Discussion:
- Frequency-based feature selection selects some frequent terms that have no specific information about the class, e.g. the days of the week (Monday, Tuesday, ...), which are frequent across classes in newswire text.
- When many thousands of features are selected, frequency-based feature selection often does well.
- If somewhat suboptimal accuracy is acceptable, frequency-based feature selection can be a good alternative to more complex methods.

12 Comparison of feature selection methods
All three methods – MI, χ², and frequency-based – are greedy methods.

13 Feature selection for NB
In general, feature selection is necessary for multivariate Bernoulli NB. "Feature selection" really means something different for multinomial NB: it means dictionary truncation.

14 Evaluating Categorization
- Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
- Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
- Results can vary based on sampling error due to different training and test sets. (Sec. 13.6)

15 Classifier Accuracy Measures
Confusion matrix:

                     Classified C1      Classified C2
  True C1            true positive      false negative
  True C2            false positive     true negative

Example (buys_computer):

  classes               buy_computer = yes   buy_computer = no   total   recognition (%)
  buy_computer = yes    6954                 46                  7000    99.34
  buy_computer = no     412                  2588                3000    86.27
  total                 7366                 2634                10000   95.42

- Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M
- Error rate (misclassification rate) of M = 1 - acc(M)
- Alternative accuracy measures (e.g., for cancer diagnosis):
    sensitivity = t-pos / pos   (true positive recognition rate)
    specificity = t-neg / neg   (true negative recognition rate)
    precision = t-pos / (t-pos + f-pos)
    accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
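A small Python check of these measures on the confusion-matrix cells above (the function and argument names are mine):

```python
def accuracy_measures(tp, fn, fp, tn):
    """Accuracy, error rate, sensitivity, specificity and precision
    from the four cells of a binary confusion matrix."""
    pos, neg = tp + fn, fp + tn
    sensitivity = tp / pos                    # true positive recognition rate
    specificity = tn / neg                    # true negative recognition rate
    precision = tp / (tp + fp)
    accuracy = (sensitivity * pos / (pos + neg) +
                specificity * neg / (pos + neg))
    return {"accuracy": accuracy, "error_rate": 1 - accuracy,
            "sensitivity": sensitivity, "specificity": specificity,
            "precision": precision}

# The buys_computer example above: accuracy ≈ 0.9542, sensitivity ≈ 0.9934, specificity ≈ 0.8627.
print(accuracy_measures(tp=6954, fn=46, fp=412, tn=2588))
```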

16 Classification by decision tree induction

17 Decision Tree Induction: Training Dataset
This follows an example from Quinlan's ID3.

18 Output: A Decision Tree for “buys_computer”
Tree structure (from the slide's figure): the root tests age?; the "<=30" branch leads to a student? test (no -> no, yes -> yes); the "31..40" branch is a leaf labeled yes; the ">40" branch leads to a credit_rating? test (excellent -> no, fair -> yes).

19 Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm, sketched in code below):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all the training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Examples are partitioned recursively based on selected attributes
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning:
- All samples for a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)
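The following is a compact, illustrative Python sketch of this recursion, not the slides' own pseudocode; it assumes each example is a dict of categorical attribute values and uses information gain (defined on the next slide) as the selection measure.

```python
from collections import Counter
import math

def entropy(labels):
    """Expected information (entropy) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Reduction in entropy from splitting on a categorical attribute."""
    split_info = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        split_info += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - split_info

def build_tree(rows, labels, attributes):
    """Top-down, recursive, divide-and-conquer tree construction."""
    if len(set(labels)) == 1:          # all samples in the same class -> leaf
        return labels[0]
    if not attributes:                  # no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, labels, a))
    node = {}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node[(best, value)] = build_tree([rows[i] for i in idx],
                                         [labels[i] for i in idx],
                                         [a for a in attributes if a != best])
    return node
```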

20 Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain.
- Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D| / |D|
- Expected information (entropy) needed to classify a tuple in D: Info(D) = - sum_{i=1..m} pi log2(pi)
- Information needed (after using A to split D into v partitions) to classify D: Info_A(D) = sum_{j=1..v} (|Dj| / |D|) * Info(Dj)
- Information gained by branching on attribute A: Gain(A) = Info(D) - Info_A(D)
Here I is the expected information needed to classify a given sample, and E (entropy) is the expected information based on the partitioning into subsets by A.
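A count-based Python sketch of these three quantities; it assumes a class distribution is passed as a plain list of class counts (e.g. [9, 5]) and a split as a list of such count lists, one per partition:

```python
import math

def info(class_counts):
    """Expected information I needed to classify a tuple, from class counts."""
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in class_counts if c > 0)

def info_after_split(partition):
    """Entropy E after splitting D into the given partitions by attribute A."""
    total = sum(sum(p) for p in partition)
    return sum(sum(p) / total * info(p) for p in partition)

def gain(class_counts, partition):
    """Information gained by branching on A."""
    return info(class_counts) - info_after_split(partition)
```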

21 Attribute Selection: Information Gain
Class P: buys_computer = "yes"; Class N: buys_computer = "no".
The term (5/14) I(2,3) in Info_age(D) means that "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence Gain(age) can be computed as checked below, and similarly for the other attributes.
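Filling in the numbers as a short Python check: the 2/3 split for "age <= 30" is stated on the slide, while the 4/0 and 3/2 splits for the other two age groups are taken from the standard version of this dataset and are an assumption here.

```python
import math

def I(*counts):
    """Expected information I(c1, c2, ...) from class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

info_D = I(9, 5)                                              # ≈ 0.940
info_age = 5/14 * I(2, 3) + 4/14 * I(4, 0) + 5/14 * I(3, 2)   # ≈ 0.694
gain_age = info_D - info_age      # ≈ 0.246 with the rounded values above
```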

22 Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, the gini index gini(D) is defined as
  gini(D) = 1 - sum_{j=1..n} pj^2
where pj is the relative frequency of class j in D.
If a data set D is split on A into two subsets D1 and D2, the gini index of the split is defined as
  gini_A(D) = (|D1| / |D|) gini(D1) + (|D2| / |D|) gini(D2)
Reduction in impurity:
  Delta_gini(A) = gini(D) - gini_A(D)
The attribute that provides the smallest gini_A(D) (or the largest reduction in impurity) is chosen to split the node (this requires enumerating all possible splitting points for each attribute).

23 Gini index (CART, IBM IntelligentMiner)
Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":
  gini(D) = 1 - (9/14)^2 - (5/14)^2 = 0.459
Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
  gini_{income in {low,medium}}(D) = (10/14) gini(D1) + (4/14) gini(D2)
but gini_{income in {medium,high}} is 0.30 and thus the best split, since it is the lowest.
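A count-based Python sketch of the two gini formulas; the partition argument holds the class counts of each subset (those per-subset counts are not given on the slide, so they are left to the caller):

```python
def gini(class_counts):
    """Gini index of a node from its class counts, e.g. gini([9, 5])."""
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

def gini_split(partition):
    """Weighted gini index of a split; partition is a list of class-count lists."""
    total = sum(sum(p) for p in partition)
    return sum(sum(p) / total * gini(p) for p in partition)

print(round(gini([9, 5]), 3))   # 0.459, matching gini(D) on the slide
```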

24 Genetic Algorithms (GA)
Genetic Algorithm: based on an analogy to biological evolution.
- An initial population is created, consisting of randomly generated rules; each rule is represented by a string of bits. E.g., "if A1 and ¬A2 then C2" can be encoded as 100; if an attribute has k > 2 values, k bits can be used.
- Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring. The fitness of a rule is measured by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.
- The process continues until a population P evolves in which every rule satisfies a prespecified fitness threshold.
- Slow, but easily parallelizable.
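A minimal, illustrative Python sketch of this loop for bit-string rules; the fitness function, population size, rates, and threshold are placeholders rather than values from the slide.

```python
import random

def evolve(fitness, rule_len, pop_size=20, generations=50,
           mutation_rate=0.01, fitness_threshold=0.95):
    """Evolve bit-string rules (e.g. [1, 0, 0] for "if A1 and not A2 then C2").

    fitness: callable mapping a rule (list of bits) to its classification
    accuracy on a set of training examples. Assumes rule_len >= 2.
    """
    # Initial population of randomly generated rules.
    population = [[random.randint(0, 1) for _ in range(rule_len)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)   # survival of the fittest
        if all(fitness(rule) >= fitness_threshold for rule in population):
            break                                    # every rule meets the threshold
        parents = population[:pop_size // 2]
        offspring = []
        while len(parents) + len(offspring) < pop_size:
            a, b = random.sample(parents, 2)
            point = random.randrange(1, rule_len)    # single-point crossover
            child = a[:point] + b[point:]
            # Mutation: flip each bit with a small probability.
            child = [bit ^ (random.random() < mutation_rate) for bit in child]
            offspring.append(child)
        population = parents + offspring
    return population
```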

