Feature Selection
Advanced Statistical Methods in NLP, Ling 572, January 24, 2012

Roadmap
- Feature representations: features in attribute-value matrices
- Motivation: text classification
- Managing features:
  - General approaches
  - Feature selection techniques
  - Feature scoring measures
  - Alternative feature weighting
- Chi-squared feature selection

Representing Input: Attribute-Value Matrix

             f1=Currency  f2=Country  ...  fm=Date   Label
  x1 = Doc1       1            1      ...   0.30     Spam
  x2 = Doc2       1            1      ...   1.751    Spam
  ...
  xn = Doc4       0            0      ...   2        NotSpam

Choosing features:
- Define features (e.g., with feature templates)
- Instantiate features
- Perform dimensionality reduction

Weighting features: increase/decrease feature importance
- Global feature weighting: weight a whole column
- Local feature weighting: weight an individual cell, conditioned on context

Feature Selection Example
- Task: text classification
- Feature template definition: word (just one template)
- Feature instantiation: words from training (and test?) data
- Feature selection: stopword removal, i.e., remove the top K (~100) highest-frequency words: the, a, have, is, to, for, ...
- Feature weighting: apply tf*idf weighting (tf = term frequency; idf = inverse document frequency)

The Curse of Dimensionality
- Think of the instances as vectors of features; # of features = # of dimensions
- The number of features is potentially enormous: the number of words in a corpus continues to increase with corpus size
- High dimensionality is problematic:
  - Leads to data sparseness: hard to create a valid model, hard to predict and generalize (think kNN)
  - Leads to high computational cost
  - Leads to difficulty with estimation/learning: more dimensions mean more samples are needed to learn the model

Breaking the Curse
- Dimensionality reduction: produce a representation with fewer dimensions, but with comparable performance
- More formally: given an original feature set r, create a new set r' with |r'| < |r| and comparable performance
- Functionally, many ML algorithms do not scale well to high dimensions:
  - Expensive: training and testing cost
  - Poor prediction: overfitting, sparseness

Dimensionality Reduction
- Given an initial feature set r, create a feature set r' such that |r'| < |r|
- Approaches:
  - r' the same for all classes (aka global), vs. r' different for each class (aka local)
  - Feature selection/filtering, vs. feature mapping (aka extraction)

Feature Selection
- Feature selection: r' is a subset of r. How can we pick features?
- Extrinsic 'wrapper' approaches: for each subset of features, build and evaluate a classifier on the task; pick the subset of features with the best performance
- Intrinsic 'filtering' methods: use some intrinsic (statistical?) measure; pick the features with the highest scores
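The wrapper loop above can be sketched as an exhaustive search over feature subsets. The `score` callback below is a hypothetical stand-in for training and evaluating a real classifier; the exponential cost discussed on the next slide is visible in the nested loop.

```python
from itertools import combinations

def wrapper_select(features, score, max_size=None):
    """Exhaustive wrapper selection: score every non-empty subset of
    `features` and keep the best one.  Exponential in len(features)."""
    best_subset, best_score = None, float("-inf")
    sizes = range(1, (max_size or len(features)) + 1)
    for k in sizes:
        for subset in combinations(features, k):
            s = score(subset)  # stand-in for: train + evaluate a classifier
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Toy scorer (hypothetical): reward features in {"a", "b"}, penalize size.
best, _ = wrapper_select(["a", "b", "c"],
                         lambda sub: len(set(sub) & {"a", "b"}) - 0.1 * len(sub))
# best -> ('a', 'b')
```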

Feature Selection
- Wrapper approach:
  - Pros: easy to understand and implement; clear relationship between the selected features and task performance
  - Cons: computationally intractable: 2^|r| subsets * (training + testing); specific to the task and classifier; ad hoc
- Filtering approach:
  - Pros: theoretical basis; less task- and classifier-specific
  - Cons: doesn't always boost task performance

Feature Mapping
- Feature mapping (extraction) approaches: the features in the r' representation are combinations/transformations of the features in r
- Example: many words are near-synonyms, but are treated as unrelated
  - Map big, large, huge, gigantic, enormous to a new feature representing the concept of 'bigness'
- Examples:
  - Term classes (e.g., class-based n-grams), derived from term clusters
  - Dimensions in Latent Semantic Analysis (LSA/LSI): the result of Singular Value Decomposition (SVD) on the matrix; produces the 'closest' rank-|r'| approximation of the original matrix

Feature Mapping
- Pros: data-driven; theoretical basis (guarantees on matrix similarity); not bound by the initial feature space
- Cons: some ad-hoc factors (e.g., the number of dimensions); the resulting feature space can be hard to interpret

Feature Filtering
- Filtering approaches apply a scoring method to features, to rank their informativeness or importance w.r.t. some class
- Fairly fast and classifier-independent
- Many different measures: mutual information, information gain, chi-squared, etc.

Feature Scoring Measures

Basic Notation, Distributions
- Assume a binary representation of terms and classes
- t_k: a term in T; c_i: a class in C
- P(t_k): proportion of documents in which t_k appears
- P(c_i): proportion of documents of class c_i
- Binary, so P(!t_k) = 1 - P(t_k) and P(!c_i) = 1 - P(c_i)

Setting Up
A 2x2 contingency table of document counts for term t_k and class c_i:

           !c_i   c_i
  !t_k      a      b
   t_k      c      d

With N = a + b + c + d documents, the table gives, e.g.:
  P(t_k) = (c + d) / N,  P(c_i) = (b + d) / N,  P(t_k, c_i) = d / N

Feature Selection Functions
- Question: what makes a good feature?
- Perspective: the best features are those that are most DIFFERENTLY distributed across classes
- I.e., the best features are those that most effectively differentiate between classes

Term Selection Functions: DF
- Document frequency (DF): the number of documents in which t_k appears
- Applying DF: remove terms with DF below some threshold
- Intuition: very rare terms won't help with categorization, or are not useful globally
- Pros: easy to implement, scalable
- Cons: ad hoc; low-DF terms can be highly 'topical'
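DF thresholding, as described above, can be sketched in a few lines; documents are assumed to be pre-tokenized lists of terms.

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep only terms whose document frequency (# of docs containing
    the term) meets the threshold min_df."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {t for t, n in df.items() if n >= min_df}

# Terms appearing in fewer than 2 documents ('dog', 'a') are dropped:
df_filter([["the", "cat"], ["the", "dog"], ["a", "cat"]])
# -> {'the', 'cat'}
```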

Term Selection Functions: MI
- Pointwise Mutual Information (MI): MI(t_k, c_i) = log [ P(t_k, c_i) / (P(t_k) P(c_i)) ]
- MI = 0 if t and c are independent
- Issue: can be heavily influenced by the marginals; this is a problem when comparing terms of differing frequencies
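A minimal sketch of pointwise MI from document counts, using the contingency-table quantities defined earlier (n_tc = # docs containing t with class c):

```python
import math

def pmi(n_tc, n_t, n_c, n):
    """Pointwise mutual information between term t and class c,
    estimated from document counts over n documents."""
    p_tc = n_tc / n
    p_t, p_c = n_t / n, n_c / n
    return math.log2(p_tc / (p_t * p_c))

# Independence: P(t,c) = P(t)P(c), so PMI = 0.
pmi(25, 50, 50, 100)
# -> 0.0
```

Note how the marginals enter the denominator: a rare term needs only a few co-occurrences to get a large PMI, which is exactly the comparability issue mentioned above.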

Term Selection Functions: IG
- Information gain: Intuition: when transmitting Y, how many bits can we save if we know X?
- IG(Y, X_i) = H(Y) - H(Y | X_i)
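The definition IG(Y, X) = H(Y) - H(Y|X) can be computed directly from a joint count table; here `joint[x][y]` is assumed to hold document counts for feature value x and class y:

```python
import math

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def info_gain(joint):
    """IG(Y, X) = H(Y) - H(Y|X), from a table of joint counts joint[x][y]."""
    n = sum(sum(row) for row in joint)
    n_classes = len(joint[0])
    p_y = [sum(row[y] for row in joint) / n for y in range(n_classes)]
    h_y_given_x = 0.0
    for row in joint:
        nx = sum(row)
        if nx:
            h_y_given_x += (nx / n) * entropy([c / nx for c in row])
    return entropy(p_y) - h_y_given_x

# A feature that perfectly predicts a binary class saves the full bit:
info_gain([[10, 0], [0, 10]])
# -> 1.0
```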

Information Gain: Derivation (from F. Xia, 2011)
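The derivation itself appeared as a figure on the original slide; a standard reconstruction from the definitions H(Y) and H(Y|X), showing that IG equals the mutual information of X and Y:

```latex
\begin{align*}
IG(Y,X) &= H(Y) - H(Y \mid X)\\
        &= -\sum_y P(y)\log P(y)
           + \sum_x P(x)\sum_y P(y \mid x)\log P(y \mid x)\\
        &= \sum_{x,y} P(x,y)\,\log\frac{P(x,y)}{P(x)\,P(y)}
\end{align*}
```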

More Feature Selection (from F. Xia, 2011)
- GSS coefficient
- NGL coefficient (N: # of docs)
- Chi-square
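The formulas for these measures were images on the original slides; the following are the standard forms (as given, e.g., in Sebastiani's 2002 text categorization survey), reconstructed in the notation of the earlier contingency-table slides:

```latex
\begin{align*}
\mathrm{GSS}(t_k,c_i) &= P(t_k,c_i)\,P(\bar t_k,\bar c_i)
                        - P(t_k,\bar c_i)\,P(\bar t_k,c_i)\\
\mathrm{NGL}(t_k,c_i) &= \frac{\sqrt{N}\,\bigl[P(t_k,c_i)P(\bar t_k,\bar c_i)
                        - P(t_k,\bar c_i)P(\bar t_k,c_i)\bigr]}
                        {\sqrt{P(t_k)\,P(\bar t_k)\,P(c_i)\,P(\bar c_i)}}\\
\chi^2(t_k,c_i) &= \frac{N\,\bigl[P(t_k,c_i)P(\bar t_k,\bar c_i)
                        - P(t_k,\bar c_i)P(\bar t_k,c_i)\bigr]^2}
                        {P(t_k)\,P(\bar t_k)\,P(c_i)\,P(\bar c_i)}
\end{align*}
```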

More Term Selection (from F. Xia, 2011)
- Relevancy score
- Odds ratio

Global Selection (from F. Xia, 2011)
- The previous measures compute class-specific scores
- What if you want to filter across ALL classes? Compute an aggregate measure across classes: sum, average, or max
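The three aggregation strategies can be sketched as follows; `scores` holds the per-class values f(t, c_i) for one term, and the weighted average uses class priors when supplied (a common convention, assumed here):

```python
def aggregate(scores, class_priors=None, mode="max"):
    """Combine per-class feature scores f(t, c_i) into one global score."""
    if mode == "sum":
        return sum(scores)
    if mode == "avg":  # prior-weighted average; uniform if no priors given
        priors = class_priors or [1 / len(scores)] * len(scores)
        return sum(p * s for p, s in zip(priors, scores))
    return max(scores)

aggregate([1, 3, 2], mode="sum")  # -> 6
aggregate([1, 3, 2], mode="max")  # -> 3
```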

What's the Best? (from F. Xia, 2011)
- Answer: it depends on the classifier, the type of data, ...
- According to Yang and Pedersen (1997): {OR, NGL, GSS} > {X²_max, IG_sum} > {#_avg} >> {MI}
  - On text classification tasks, using kNN

Feature Weighting
For text classification, typical weights include:
- Binary: weights in {0, 1}
- Term frequency (tf): # of occurrences of t_k in document d_i
- Inverse document frequency (idf): idf = log(N / (1 + df_k)), where df_k is the # of docs in which t_k appears and N is the total # of docs
- tfidf = tf * idf
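A minimal sketch of tf*idf weighting using exactly the slide's smoothed idf, log(N / (1 + df)); note that with this smoothing, a term appearing in every document gets a negative idf:

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: tf*idf weight} dict per document,
    with idf = log(N / (1 + df)) as on the slide."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # document frequency: once per doc
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / (1 + df[t])) for t in tf})
    return weights

w = tfidf([["a", "a", "b"], ["b", "c"]])
# 'a' occurs in 1 of 2 docs: idf = log(2/2) = 0, so its weight is 0.0
```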

Chi Square
- Tests for the presence/absence of a relationship between random variables
- A bivariate analysis: tests 2 random variables
- Can test the strength of the relationship
- (Strictly speaking) doesn't test the direction of the relationship

Chi Square Example (due to F. Xia)
Can gender predict shoe choice?
- A (feature): gender, male/female
- B (classes): shoe choice, {sandal, sneaker, leather shoe, boot, other}

           sandal  sneaker  leather shoe  boot  other
  Male        6       17         13         9      5
  Female     13        5          7        16      9

Comparing Distributions (due to F. Xia)
Observed distribution (O):

           sandal  sneaker  leather shoe  boot  other  Total
  Male        6       17         13         9      5     50
  Female     13        5          7        16      9     50
  Total      19       22         20        25     14    100

Expected distribution (E), given the row and column totals:

           sandal  sneaker  leather shoe  boot  other  Total
  Male      9.5      11         10       12.5     7     50
  Female    9.5      11         10       12.5     7     50
  Total      19      22         20        25     14    100

Computing Chi Square
- Expected value for a cell = row_total * column_total / table_total
- X² = (6 - 9.5)²/9.5 + (17 - 11)²/11 + ... ≈ 14.027

Calculating X²
- Tabulate the contingency table of observed values O
- Compute row and column totals
- Compute the table of expected values E from the row/column totals, assuming no association
- Compute X² = Σ (O - E)² / E
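The four steps above, applied to the shoe table, can be sketched directly:

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table
    given as a list of rows of raw counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected, no association
            x2 += (o - e) ** 2 / e
    return x2

shoes = [[6, 17, 13, 9, 5],    # male
         [13, 5, 7, 16, 9]]    # female
round(chi_square(shoes), 3)
# -> 14.027
```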

For a 2x2 Table
O:
           !c_i   c_i
  !t_k      a      b
   t_k      c      d

E:
           !c_i            c_i             Total
  !t_k   (a+b)(a+c)/N   (a+b)(b+d)/N       a+b
   t_k   (c+d)(a+c)/N   (c+d)(b+d)/N       c+d
  Total      a+c            b+d             N

X² Test
- Tests whether two random variables are independent
- Null hypothesis: the 2 R.V.s are independent
- Compute the X² statistic
- Compute the degrees of freedom: df = (# rows - 1)(# cols - 1); for the shoe example, df = (2-1)(5-1) = 4
- Look up the probability of the X² statistic value in a X² table
- If the probability is low (below some significance level), the null hypothesis can be rejected

Requirements for the X² Test
- Observations are assumed independent and identically distributed
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient expected counts per cell: > 5
