
1 Feature Selection Advanced Statistical Methods in NLP Ling 572 January 24, 2012

2 Roadmap
- Feature representations: features in attribute-value matrices
- Motivation: text classification
- Managing features: general approaches
- Feature selection techniques:
  - Feature scoring measures
  - Alternative feature weighting
  - Chi-squared feature selection

3-7 Representing Input: Attribute-Value Matrix

                f1 Currency   f2 Country   ...    fm Date   Label
   x1 = Doc1         1             1       0.30      0      Spam
   x2 = Doc2         1             1       1.75      1      Spam
   ...
   xn = Doc4         0             0       0         2      NotSpam

Choosing features:
- Define features, e.g. with feature templates
- Instantiate features
- Perform dimensionality reduction

Weighting features: increase/decrease feature importance
- Global feature weighting: weight a whole column
- Local feature weighting: weight an individual cell, under specific conditions

8-12 Feature Selection Example

Task: text classification
- Feature template definition: word (just one template)
- Feature instantiation: words from the training (and test?) data
- Feature selection: stopword removal, i.e. remove the top K (~100) highest-frequency words
  - Words like: the, a, have, is, to, for, ...
- Feature weighting: apply tf*idf feature weighting
  - tf = term frequency; idf = inverse document frequency
  (a minimal sketch of this pipeline follows below)
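Below is a minimal sketch of the pipeline this example describes, assuming a toy corpus and a made-up cutoff K; it is illustrative only, not code from the course.

```python
# Sketch: frequency-based stopword removal followed by tf*idf weighting.
# The corpus and the cutoff K are made-up assumptions for the example.
import math
from collections import Counter

docs = [
    "send us your bank account and currency details today".split(),
    "the meeting is moved to friday the tenth".split(),
    "win a free prize claim your prize today".split(),
]

K = 2  # assumed cutoff: drop the K most frequent words as "stopwords"

# Feature selection: remove the top-K highest-frequency words.
freq = Counter(w for d in docs for w in d)
stopwords = {w for w, _ in freq.most_common(K)}
filtered = [[w for w in d if w not in stopwords] for d in docs]

# Feature weighting: tf * idf, with idf = log(N / (1 + df)) as on the slide.
N = len(filtered)
df = Counter(w for d in filtered for w in set(d))
weighted = []
for d in filtered:
    tf = Counter(d)
    weighted.append({w: tf[w] * math.log(N / (1 + df[w])) for w in tf})

print(sorted(stopwords))
print(weighted[0])
```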

13-19 The Curse of Dimensionality

- Think of the instances as vectors of features: # of features = # of dimensions
- Number of features is potentially enormous: the # of words in a corpus continues to increase with corpus size
- High dimensionality is problematic:
  - Leads to data sparseness
    - Hard to create a valid model
    - Hard to predict and generalize (think kNN)
  - Leads to high computational cost
  - Leads to difficulty with estimation/learning: more dimensions -> more samples needed to learn the model

20-23 Breaking the Curse

Dimensionality reduction:
- Produce a representation with fewer dimensions, but with comparable performance
- More formally, given an original feature set r, create a new set r' with |r'| < |r| and comparable performance

Functionally:
- Many ML algorithms do not scale well
- Expensive: training and testing cost
- Poor prediction: overfitting, sparseness

24-26 Dimensionality Reduction

Given an initial feature set r, create a feature set r' s.t. |r'| < |r|

Approaches:
- r': same for all classes (aka global), vs. r': different for each class (aka local)
- Feature selection/filtering, vs. feature mapping (aka extraction)

27-30 Feature Selection

Feature selection: r' is a subset of r. How can we pick features?

Extrinsic 'wrapper' approaches:
- For each subset of features: build and evaluate a classifier for some task
- Pick the subset of features with the best performance
  (a minimal sketch follows below)

Intrinsic 'filtering' methods:
- Use some intrinsic (statistical?) measure
- Pick the features with the highest scores
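As a rough illustration of the wrapper idea, the sketch below exhaustively scores every feature subset; the feature names and the evaluate() scorer are hypothetical stand-ins for a real classifier evaluated on a development set.

```python
# Sketch: exhaustive "wrapper" search over feature subsets.
# evaluate() is a hypothetical stand-in for training and scoring a classifier.
from itertools import combinations

features = ["currency", "country", "date", "sender"]

def evaluate(subset):
    # Hypothetical scorer; in practice: train a classifier using only
    # `subset` and return its accuracy on held-out data.
    return len(set(subset) & {"currency", "sender"}) - 0.1 * len(subset)

best_score, best_subset = float("-inf"), ()
for k in range(1, len(features) + 1):
    for subset in combinations(features, k):   # 2^|r| - 1 candidate subsets
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, subset

print(best_subset, best_score)
```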

31-35 Feature Selection

Wrapper approach:
- Pros:
  - Easy to understand and implement
  - Clear relationship between the selected features and task performance
- Cons:
  - Computationally intractable: 2^|r| subsets x (training + testing)
  - Specific to the task and classifier; ad hoc

Filtering approach:
- Pros: theoretical basis; less task- and classifier-specific
- Cons: doesn't always boost task performance

36-40 Feature Mapping

Feature mapping (extraction) approaches:
- Features in r' represent combinations/transformations of the features in r
- Example: many words are near-synonyms, but are treated as unrelated
  - Map them to a new concept representing all of them: big, large, huge, gigantic, enormous -> the concept of 'bigness'

Examples:
- Term classes, e.g. class-based n-grams, derived from term clusters
- Dimensions in Latent Semantic Analysis (LSA/LSI)
  - Result of Singular Value Decomposition (SVD) on the term matrix
  - Produces the 'closest' rank-|r'| approximation of the original matrix
  (a minimal SVD sketch follows below)
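A minimal sketch of the LSA-style mapping via truncated SVD, assuming a tiny made-up term matrix and a made-up target rank r'; it is an illustration with numpy, not the course's code.

```python
# Sketch: feature mapping via truncated SVD (LSA-style dimensionality reduction).
import numpy as np

# Rows = documents, columns = term features (a tiny attribute-value matrix).
X = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 2.0],
])

r_prime = 2  # assumed reduced dimensionality |r'| < |r|

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_reduced = U[:, :r_prime] * s[:r_prime]                      # documents in the r'-dim space
X_approx = (U[:, :r_prime] * s[:r_prime]) @ Vt[:r_prime, :]   # closest rank-r' approximation

print(X_reduced.shape)        # (3, 2)
print(np.round(X_approx, 2))
```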

41-43 Feature Mapping

Pros:
- Data-driven
- Theoretical basis, with guarantees on matrix similarity
- Not bound by the initial feature space

Cons:
- Some ad hoc factors, e.g. the # of dimensions
- The resulting feature space can be hard to interpret

44-46 Feature Filtering

Filtering approaches:
- Apply a scoring method to features to rank their informativeness or importance w.r.t. some class
- Fairly fast and classifier-independent
- Many different measures: mutual information, information gain, chi-squared, etc.

47 Feature Scoring Measures

48-50 Basic Notation, Distributions

Assume a binary representation of terms and classes:
- t_k: term in T; c_i: class in C
- P(t_k): proportion of documents in which t_k appears
- P(c_i): proportion of documents of class c_i
- Binary, so we also have P(!t_k) = 1 - P(t_k) and P(!c_i) = 1 - P(c_i)

51-57 Setting Up

A 2x2 contingency table of document counts for term t_k and class c_i:

            !c_i    c_i
   !t_k      a       b
    t_k      c       d
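One way to read the table (my reading, not stated explicitly in the transcript): with N = a + b + c + d documents, the probabilities defined above can be estimated as

  P(t_k) \approx \frac{c + d}{N}, \qquad P(c_i) \approx \frac{b + d}{N}, \qquad P(t_k, c_i) \approx \frac{d}{N}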

58-60 Feature Selection Functions

Question: What makes a good feature?

Perspective: the best features are those that are most DIFFERENTLY distributed across classes, i.e. features are best that most effectively differentiate between the classes.

61-65 Term Selection Functions: DF

Document frequency (DF): the number of documents in which t_k appears

Applying DF: remove terms with DF below some threshold (a minimal sketch follows below)

Intuition: very rare terms won't help with categorization, or are not useful globally

Pros: easy to implement, scalable
Cons: ad hoc; low-DF terms can be 'topical' and thus informative
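A minimal sketch of DF filtering, assuming a toy corpus and a made-up min_df threshold (illustrative only):

```python
# Sketch: document-frequency (DF) filtering with a minimum-DF threshold.
from collections import Counter

docs = [
    {"the", "currency", "offer", "prize"},
    {"the", "meeting", "agenda", "currency"},
    {"the", "prize", "claim", "currency"},
]

min_df = 2  # assumed threshold: keep terms appearing in at least 2 documents

df = Counter(term for doc in docs for term in doc)   # DF = # docs containing the term
kept = {term for term, count in df.items() if count >= min_df}

print(sorted(kept))  # ['currency', 'prize', 'the']
```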

66-68 Term Selection Functions: MI

Pointwise Mutual Information (MI):
- MI = 0 if t and c are independent
- Issue: can be heavily influenced by the marginals, which is a problem when comparing terms of differing frequencies
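The PMI formula itself does not survive in this transcript; its standard definition, written in the notation of the earlier contingency table (my reconstruction), is

  PMI(t_k, c_i) = \log \frac{P(t_k, c_i)}{P(t_k)\,P(c_i)} \approx \log \frac{d \cdot N}{(c + d)(b + d)}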

69 Term Selection Functions: IG

Information Gain. Intuition: transmitting Y, how many bits can we save if we know X_i? IG(Y, X_i) = H(Y) - H(Y | X_i)

70 Information Gain: Derivation From F. Xia, ‘11
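The derivation on this slide is not reproduced in the transcript; the usual closed form of information gain for a term (as in Yang and Pedersen, 1997) is

  IG(t_k) = -\sum_i P(c_i)\log P(c_i) + P(t_k)\sum_i P(c_i \mid t_k)\log P(c_i \mid t_k) + P(\bar{t}_k)\sum_i P(c_i \mid \bar{t}_k)\log P(c_i \mid \bar{t}_k)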

71-74 More Feature Selection

- GSS coefficient
- NGL coefficient (N: # of docs)
- Chi-square

From F. Xia, '11
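The formulas for these coefficients are not reproduced in the transcript; the standard forms from the text-categorization literature (my addition, not taken from the slides) are

  GSS(t_k, c_i) = P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)

  NGL(t_k, c_i) = \frac{\sqrt{N}\,\bigl[P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)\bigr]}{\sqrt{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)}}

  \chi^2(t_k, c_i) = \frac{N\,\bigl[P(t_k, c_i)\,P(\bar{t}_k, \bar{c}_i) - P(t_k, \bar{c}_i)\,P(\bar{t}_k, c_i)\bigr]^2}{P(t_k)\,P(\bar{t}_k)\,P(c_i)\,P(\bar{c}_i)}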

75-76 More Term Selection

- Relevancy score
- Odds ratio

From F. Xia, '11
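The odds-ratio formula likewise did not survive extraction; its usual form (my addition, not taken from the slides) is

  OR(t_k, c_i) = \frac{P(t_k \mid c_i)\,\bigl(1 - P(t_k \mid \bar{c}_i)\bigr)}{\bigl(1 - P(t_k \mid c_i)\bigr)\,P(t_k \mid \bar{c}_i)}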

77 Global Selection

The previous measures compute class-specific scores. What if you want to filter across ALL classes? Compute an aggregate measure across classes: sum, average, or max (definitions follow below).

From F. Xia, '11
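The sum/average/max combinations are not written out in the transcript; the usual definitions, for any class-specific score f(t_k, c_i) (my addition), are

  f_{sum}(t_k) = \sum_i f(t_k, c_i) \qquad f_{avg}(t_k) = \sum_i P(c_i)\,f(t_k, c_i) \qquad f_{max}(t_k) = \max_i f(t_k, c_i)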

78 What's the Best?

Answer: it depends on the classifiers, the type of data, ...

According to Yang and Pedersen (1997), on text classification tasks using kNN:
{OR, NGL, GSS} > {chi^2_max, IG_sum} > {#_avg} >> {MI}

From F. Xia, '11

79 Feature Weighting

For text classification, typical weights include:
- Binary: weights in {0, 1}
- Term frequency (tf): # of occurrences of t_k in document d_i
- Inverse document frequency (idf): idf_k = log(N / (1 + df_k)), where df_k is the # of docs in which t_k appears and N is the # of docs
- tfidf = tf * idf

80 Chi Square

- Tests for the presence/absence of a relation between random variables
- Bivariate analysis: tests 2 random variables
- Can test the strength of the relationship
- (Strictly speaking) doesn't test the direction

81 Chi Square Example

Can gender predict shoe choice?
- A: male/female -> features
- B: shoe choice -> classes: {sandal, sneaker, ...}

Observed counts (due to F. Xia):

             sandal   sneaker   leather shoe   boot   other
   Male         6        17          13          9      5
   Female      13         5           7         16      9

82-83 Comparing Distributions

Observed distribution (O):

             sandal   sneaker   leather shoe   boot   other
   Male         6        17          13          9      5
   Female      13         5           7         16      9

Expected distribution (E):

             sandal   sneaker   leather shoe   boot   other   Total
   Male        9.5       11          10        12.5     7      50
   Female      9.5       11          10        12.5     7      50
   Total       19        22          20         25     14     100

Due to F. Xia

84 Computing Chi Square

Expected value for a cell = row_total * column_total / table_total

X^2 = (6 - 9.5)^2/9.5 + (17 - 11)^2/11 + ... = 14.026

85 Calculating X^2

- Tabulate the contingency table of observed values: O
- Compute the row and column totals
- Compute the table of expected values, given the row/column totals, assuming no association
- Compute X^2

86 For a 2x2 Table

O:
            !c_i    c_i
   !t_k      a       b
    t_k      c       d

E:
            !c_i              c_i               Total
   !t_k     (a+b)(a+c)/N      (a+b)(b+d)/N      a+b
    t_k     (c+d)(a+c)/N      (c+d)(b+d)/N      c+d
   Total     a+c               b+d               N
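For the 2x2 case, substituting these expected values into the X^2 sum yields the familiar closed form (a standard identity, not shown on the slide):

  \chi^2 = \frac{N\,(ad - bc)^2}{(a+b)(c+d)(a+c)(b+d)}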

87 X^2 Test

Test whether the random variables are independent:
- Null hypothesis: the 2 R.V.s are independent
- Compute the X^2 statistic
- Compute the degrees of freedom: df = (# rows - 1)(# cols - 1)
  - Shoe example: df = (2 - 1)(5 - 1) = 4
- Look up the probability of the X^2 statistic value in a X^2 table
- If the probability is low (below some significance level), we can reject the null hypothesis
  (a quick computational check follows below)
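As a quick check of the shoe example (assuming SciPy is available; this is not part of the original slides), the observed table reconstructed above gives the same statistic:

```python
# Check the shoe-choice chi-square computation with SciPy.
from scipy.stats import chi2_contingency

observed = [
    [6, 17, 13, 9, 5],    # male:   sandal, sneaker, leather shoe, boot, other
    [13, 5, 7, 16, 9],    # female
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(round(chi2, 3), dof)   # ~14.03 with df = (2-1)*(5-1) = 4
print(round(p_value, 4))     # well below 0.05, so reject independence at that level
```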

88 Requirements for the X^2 Test

- Events are assumed independent and drawn from the same distribution
- Outcomes must be mutually exclusive
- Use raw frequencies, not percentages
- Sufficient (expected) counts per cell: > 5

