Published by Will Pine. Modified over 2 years ago.

1
Feature Selection
Advanced Statistical Methods in NLP
Ling 572
January 24, 2012

2
Roadmap
- Feature representations:
  - Features in attribute-value matrices
  - Motivation: text classification
- Managing features
  - General approaches
  - Feature selection techniques
  - Feature scoring measures
  - Alternative feature weighting
- Chi-squared feature selection

7
Representing Input: Attribute-Value Matrix

             f1 Currency   f2 Country   ...    fm Date   Label
x1 = Doc1    1             1            0.3    0         Spam
x2 = Doc2    1             1            1.75   1         Spam
...
xn = Doc4    0             0            0      2         NotSpam

- Choosing features:
  - Define features, i.e. with feature templates
  - Instantiate features
  - Perform dimensionality reduction
- Weighting features: increase/decrease feature importance
  - Global feature weighting: weight a whole column
  - Local feature weighting: weight individual cells, subject to conditions

12
Feature Selection Example
- Task: Text classification
- Feature template definition:
  - Word: just one template
- Feature instantiation:
  - Words from training (and test?) data
- Feature selection:
  - Stopword removal: remove the top K (~100) highest-frequency words
  - Words like: the, a, have, is, to, for, ...
- Feature weighting:
  - Apply tf*idf feature weighting
  - tf = term frequency; idf = inverse document frequency
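The instantiation and stopword-removal steps of this example can be sketched in a few lines. This is only an illustration under simple assumptions: whitespace tokenization, lowercasing, and stopwords defined purely as the top K most frequent training words. The function name `build_vocabulary` and the toy documents are hypothetical.

```python
from collections import Counter

def build_vocabulary(train_docs, top_k=100):
    """Instantiate word features from training docs, dropping the top_k most frequent words."""
    counts = Counter(word for doc in train_docs for word in doc.lower().split())
    # Treat the top_k highest-frequency words as stopwords.
    stopwords = {word for word, _ in counts.most_common(top_k)}
    return sorted(set(counts) - stopwords)

# Toy training data (made up for illustration).
docs = [
    "the cat sat on the mat",
    "the dog ate a bone",
    "a cat saw the dog and a bird",
]
vocab = build_vocabulary(docs, top_k=2)   # "the" and "a" are the two most frequent words
```

With a realistic corpus, K would be closer to the ~100 suggested on the slide; K=2 is used here only because the toy corpus is tiny.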

19
The Curse of Dimensionality
- Think of the instances as vectors of features
  - # of features = # of dimensions
- Number of features potentially enormous
  - # of words in corpus continues to increase with corpus size
- High dimensionality is problematic:
  - Leads to data sparseness
    - Hard to create a valid model
    - Hard to predict and generalize (think kNN)
  - Leads to high computational cost
  - Leads to difficulty with estimation/learning
    - More dimensions: more samples needed to learn the model

23
Breaking the Curse
- Dimensionality reduction:
  - Produce a representation with fewer dimensions
  - But with comparable performance
- More formally, given an original feature set r,
  - Create a new set r’ with |r’| < |r| and comparable performance
- Functionally,
  - Many ML algorithms do not scale well
    - Expensive: training cost, testing cost
    - Poor prediction: overfitting, sparseness

26
Dimensionality Reduction
- Given an initial feature set r,
  - Create a feature set r’ s.t. |r’| < |r|
- Approaches:
  - r’: same for all classes (aka global), vs
  - r’: different for each class (aka local)
  - Feature selection/filtering, vs
  - Feature mapping (aka extraction)

30
Feature Selection
- Feature selection: r’ is a subset of r
- How can we pick features?
  - Extrinsic ‘wrapper’ approaches:
    - For each subset of features:
      - Build and evaluate a classifier for some task
    - Pick the subset of features with the best performance
  - Intrinsic ‘filtering’ methods:
    - Use some intrinsic (statistical?) measure
    - Pick the features with the highest scores

35
Feature Selection
- Wrapper approach:
  - Pros:
    - Easy to understand and implement
    - Clear relationship between selected features and task performance
  - Cons:
    - Computationally intractable: 2^|r| * (training + testing)
    - Specific to task and classifier; ad hoc
- Filtering approach:
  - Pros: theoretical basis; less task- and classifier-specific
  - Cons: doesn’t always boost task performance
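To make the cost of the wrapper approach concrete, here is a minimal sketch of exhaustive subset search. The `evaluate` callback stands in for "train and test a classifier on this subset"; `toy_evaluate` and the feature names are hypothetical placeholders, not a real classifier.

```python
from itertools import combinations

def wrapper_select(features, evaluate):
    """Score every non-empty feature subset; return (best_subset, best_score, n_evaluated)."""
    best_subset, best_score, n_evaluated = None, float("-inf"), 0
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            score = evaluate(subset)      # stands in for one train + test cycle
            n_evaluated += 1
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score, n_evaluated

# Hypothetical scoring function: rewards useful features, penalizes subset size.
USEFUL = {"free", "money"}

def toy_evaluate(subset):
    return len(USEFUL & set(subset)) - 0.1 * len(subset)

best, score, n = wrapper_select(["free", "the", "money", "hello"], toy_evaluate)
```

Even with four features the loop runs 2^4 - 1 = 15 evaluations; with a vocabulary of tens of thousands of words the same loop is infeasible, which is why practical wrapper methods fall back on greedy or heuristic search.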

40
Feature Mapping
- Feature mapping (extraction) approaches
  - Features in r’ represent combinations/transformations of features in r
  - Example: many words are near-synonyms, but are treated as unrelated
    - Map to a new concept representing all of them
    - big, large, huge, gigantic, enormous -> concept of ‘bigness’
- Examples:
  - Term classes: e.g. class-based n-grams
    - Derived from term clusters
  - Dimensions in Latent Semantic Analysis (LSA/LSI)
    - Result of Singular Value Decomposition (SVD) on matrix
    - Produces the ‘closest’ rank r’ approximation of the original

43
Feature Mapping
- Pros:
  - Data-driven
  - Theoretical basis: guarantees on matrix similarity
  - Not bound by initial feature space
- Cons:
  - Some ad hoc factors: e.g. # of dimensions
  - Resulting feature space can be hard to interpret

46
Feature Filtering
- Filtering approaches:
  - Apply some scoring method to features to rank their informativeness or importance w.r.t. some class
  - Fairly fast and classifier-independent
- Many different measures:
  - Mutual information
  - Information gain
  - Chi-squared
  - etc.

47
Feature Scoring Measures

50
Basic Notation, Distributions
- Assume binary representation of terms, classes
- t_k: term in T; c_i: class in C
- P(t_k): proportion of documents in which t_k appears
- P(c_i): proportion of documents of class c_i
- Binary, so we have
  - P(!t_k) = 1 - P(t_k) and P(!c_i) = 1 - P(c_i)

57
Setting Up

         !c_i   c_i
!t_k     a      b
t_k      c      d
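The four cell counts can be collected directly from labeled documents. The sketch below assumes each document is a set of words with a single class label; the function name and toy data are hypothetical. The cell naming follows the table: a = (!t_k, !c_i), b = (!t_k, c_i), c = (t_k, !c_i), d = (t_k, c_i).

```python
def contingency_counts(docs, labels, term, target_class):
    """Return (a, b, c, d) for one (term, class) pair, using the slide's cell layout."""
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        in_class = label == target_class
        if has_term and in_class:
            d += 1
        elif has_term:
            c += 1
        elif in_class:
            b += 1
        else:
            a += 1
    return a, b, c, d

# Toy data (made up for illustration).
docs = [{"free", "money"}, {"free", "offer"}, {"meeting", "notes"}, {"free", "lunch"}]
labels = ["spam", "spam", "ham", "ham"]

a, b, c, d = contingency_counts(docs, labels, "free", "spam")
n = a + b + c + d
p_term = (c + d) / n    # P(t_k): proportion of documents containing the term
p_class = (b + d) / n   # P(c_i): proportion of documents in the class
```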

60
Feature Selection Functions
- Question: What makes a good feature?
- Perspective:
  - Best features: those that are most DIFFERENTLY distributed across classes
  - I.e. the best features are those that most effectively differentiate between classes

65
Term Selection Functions: DF
- Document frequency (DF):
  - Number of documents in which t_k appears
- Applying DF:
  - Remove terms with DF below some threshold
- Intuition:
  - Very rare terms won’t help with categorization
  - Or are not useful globally
- Pros: Easy to implement, scalable
- Cons: Ad hoc; low-DF terms can be ‘topical’
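A minimal sketch of DF thresholding, assuming whitespace tokenization and a hypothetical cutoff parameter `min_df`; each term is counted at most once per document.

```python
from collections import Counter

def document_frequency(docs):
    """Map each term to the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))   # set(): count a term once per document
    return df

def filter_by_df(docs, min_df=2):
    """Keep only terms whose document frequency is at least min_df."""
    df = document_frequency(docs)
    return sorted(term for term, count in df.items() if count >= min_df)

# Toy data (made up for illustration).
docs = ["cheap pills cheap", "cheap flights", "meeting agenda", "flights agenda"]
kept = filter_by_df(docs, min_df=2)
```

Here "pills" is dropped even though it might be a strong topical cue for one class, which is exactly the drawback listed under Cons.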

68
Term Selection Functions: MI
- Pointwise Mutual Information (MI)
- MI = 0 if t, c independent
- Issue: Can be heavily influenced by marginals
  - Problem comparing terms of differing frequencies
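The slide's formula did not survive extraction. The standard definition of pointwise mutual information is PMI(t, c) = log [ P(t, c) / (P(t) P(c)) ], and the sketch below computes it from 2x2 counts, assuming the cell layout a = (!t, !c), b = (!t, c), c = (t, !c), d = (t, c) and log base 2. The example counts are made up to illustrate the frequency issue.

```python
import math

def pmi(a, b, c, d):
    """PMI(t, c) in bits from 2x2 counts (a=!t,!c  b=!t,c  c=t,!c  d=t,c)."""
    n = a + b + c + d
    p_tc = d / n
    p_t = (c + d) / n
    p_c = (b + d) / n
    if p_tc == 0:
        return float("-inf")   # term never occurs in the class
    return math.log2(p_tc / (p_t * p_c))

independent = pmi(a=25, b=25, c=25, d=25)   # term and class unrelated
rare = pmi(a=50, b=49, c=0, d=1)            # seen once, and only in the class
frequent = pmi(a=45, b=0, c=5, d=50)        # covers the whole class, plus 5 other docs
```

The term seen in a single document scores higher than the term that covers every document of the class, which illustrates why PMI is hard to compare across terms of differing frequencies.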

69
Term Selection Functions: IG
- Information Gain:
  - Intuition: when transmitting Y, how many bits can we save if we know X?
  - IG(Y, X_i) = H(Y) - H(Y|X_i)
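As a sketch of how this is computed for a binary term and a binary class, the code below evaluates H(C) - H(C|T) from 2x2 counts. It assumes the cell layout a = (!t, !c), b = (!t, c), c = (t, !c), d = (t, c); the function names are illustrative.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(a, b, c, d):
    """IG(C, T) = H(C) - H(C|T) from 2x2 counts (a=!t,!c  b=!t,c  c=t,!c  d=t,c)."""
    n = a + b + c + d
    h_class = entropy([(a + c) / n, (b + d) / n])
    h_cond = 0.0
    for not_in_class, in_class in ((a, b), (c, d)):   # term absent, then term present
        row = not_in_class + in_class
        if row:
            h_cond += (row / n) * entropy([not_in_class / row, in_class / row])
    return h_class - h_cond

ig_independent = information_gain(25, 25, 25, 25)   # knowing the term saves nothing
ig_perfect = information_gain(50, 0, 0, 50)         # the term determines the class
```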

70
Information Gain: Derivation
(From F. Xia, ‘11)

74
More Feature Selection
- GSS coefficient:
- NGL coefficient:
  - N: # of docs
- Chi-square:
(From F. Xia, ‘11)
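The formulas on this slide did not survive extraction. As a hedged sketch, the code below uses the forms commonly given in the text categorization literature, which may differ in detail from the slide: GSS = P(t,c)P(!t,!c) - P(t,!c)P(!t,c); NGL = sqrt(N) * GSS / sqrt(P(t)P(!t)P(c)P(!c)); chi-square = N * GSS^2 / (P(t)P(!t)P(c)P(!c)). The cell layout a = (!t, !c), b = (!t, c), c = (t, !c), d = (t, c) is assumed.

```python
import math

def association_scores(a, b, c, d):
    """Return (GSS, NGL, chi-square) for one (term, class) pair from 2x2 counts."""
    n = a + b + c + d
    # P(t,c)P(!t,!c) - P(t,!c)P(!t,c)
    gss = (d * a - c * b) / n**2
    # P(t)P(!t)P(c)P(!c)
    marginals = ((c + d) * (a + b) * (b + d) * (a + c)) / n**4
    if marginals == 0:
        raise ValueError("a row or column total is zero")
    ngl = math.sqrt(n) * gss / math.sqrt(marginals)
    chi2 = n * gss**2 / marginals
    return gss, ngl, chi2

gss, ngl, chi2 = association_scores(a=40, b=10, c=10, d=40)
```

Under these definitions NGL is a signed square root of chi-square, so it keeps the direction of the association that chi-square discards.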

76
More Term Selection
- Relevancy score:
- Odds Ratio:
(From F. Xia, ‘11)
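The odds ratio formula is missing from the extracted slide. A common form is OR(t, c) = [P(t|c) (1 - P(t|!c))] / [(1 - P(t|c)) P(t|!c)], sketched below without smoothing. The cell layout a = (!t, !c), b = (!t, c), c = (t, !c), d = (t, c) is assumed, as is the presence of at least one document inside and one outside the class.

```python
def odds_ratio(a, b, c, d):
    """Unsmoothed odds ratio of a term for a class (a=!t,!c  b=!t,c  c=t,!c  d=t,c)."""
    p_t_given_c = d / (b + d)
    p_t_given_not_c = c / (a + c)
    denom = (1 - p_t_given_c) * p_t_given_not_c
    if denom == 0:
        # Zero cell: real systems usually smooth the counts instead.
        return float("inf")
    return p_t_given_c * (1 - p_t_given_not_c) / denom

or_associated = odds_ratio(a=40, b=10, c=10, d=40)
or_independent = odds_ratio(a=25, b=25, c=25, d=25)
```

In terms of raw counts this reduces to (a*d)/(b*c), so a value of 1 means the term is equally likely inside and outside the class.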

77
Global Selection
- Previous measures compute class-specific selection
- What if you want to filter across ALL classes?
  - Compute an aggregate measure across classes
    - Sum:
    - Average:
    - Max:
(From F. Xia, ‘11)
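The three aggregation formulas are missing from the extracted slide. One plausible reading, sketched below, is: sum = sum over classes of f(t, c_i); average = sum over classes of P(c_i) * f(t, c_i), i.e. a prior-weighted average; max = maximum over classes of f(t, c_i). The weighted form of the average is an assumption (a plain mean is another possible reading), and the scores and priors are made up.

```python
def aggregate_scores(scores, class_priors):
    """Combine class-specific scores f(t, c_i) into global scores for one term."""
    return {
        "sum": sum(scores.values()),
        "avg": sum(class_priors[cls] * s for cls, s in scores.items()),  # prior-weighted
        "max": max(scores.values()),
    }

# Hypothetical class-specific scores for a single term, and class priors P(c_i).
per_class = {"sports": 12.0, "politics": 1.5, "tech": 0.5}
priors = {"sports": 0.2, "politics": 0.5, "tech": 0.3}
agg = aggregate_scores(per_class, priors)
```

The max keeps a term that is highly informative for even one class, while the weighted average discounts a term whose only strong association is with a small class.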

78
What’s the best?
- Answer: It depends on ...
  - Classifiers
  - Type of data
  - ...
- According to (Yang and Pedersen, 1997):
  - {OR, NGL, GSS} > {X^2_max, IG_sum} > {#_avg} >> {MI}
  - On text classification tasks
  - Using kNN
(From F. Xia, ‘11)

79
Feature Weighting
- For text classification, typical weights include:
  - Binary: weights in {0,1}
  - Term frequency (tf):
    - # of occurrences of t_k in document d_i
  - Inverse document frequency (idf):
    - df_k: # of docs in which t_k appears; N: # of docs
    - idf = log(N / (1 + df_k))
  - tfidf = tf * idf
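A minimal sketch of this weighting, using the slide's idf = log(N / (1 + df_k)) with raw counts as tf. Whitespace tokenization, the natural logarithm, and the function name are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document, with idf = log(N / (1 + df))."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / (1 + count)) for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

# Toy data (made up for illustration).
docs = ["free money free offer", "meeting notes", "free lunch meeting", "project notes"]
vectors = tfidf_vectors(docs)
```

With the 1 + df_k smoothing, a term that appears in N - 1 documents gets idf = 0 and a term that appears in every document gets a slightly negative idf, so very common words are effectively suppressed.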

80
Chi Square
- Tests for presence/absence of a relation between random variables
  - Bivariate analysis: tests 2 random variables
- Can test strength of relationship
- (Strictly speaking) doesn’t test direction

81
Chi Square Example
- Can gender predict shoe choice?
  - A: male/female -> Features
  - B: shoe choice -> Classes: {sandal, sneaker, ...}

          sandal   sneaker   leather shoe   boot   other
Male      6        17        13             9      5
Female    13       5         7              16     9

(Due to F. Xia)

83
Comparing Distributions

Observed distribution (O):

          sandal   sneaker   leather shoe   boot   other
Male      6        17        13             9      5
Female    13       5         7              16     9

Expected distribution (E):

          sandal   sneaker   leather shoe   boot   other   Total
Male      9.5      11        10             12.5   7       50
Female    9.5      11        10             12.5   7       50
Total     19       22        20             25     14      100

(Due to F. Xia)

84
Computing Chi Square
- Expected value for a cell = row_total * column_total / table_total
- X^2 = (6-9.5)^2/9.5 + (17-11)^2/11 + ... ≈ 14.027
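The computation can be checked with a short script. This sketch derives the expected counts from the row and column totals and sums (O - E)^2 / E over all cells of the shoe table; the function name is illustrative.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total   # expected count under independence
            stat += (o - e) ** 2 / e
    return stat

shoes = [
    [6, 17, 13, 9, 5],    # Male:   sandal, sneaker, leather shoe, boot, other
    [13, 5, 7, 16, 9],    # Female
]
x2 = chi_square(shoes)    # approximately 14.03
```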

85
Calculating X^2
- Tabulate contingency table of observed values: O
- Compute row, column totals
- Compute table of expected values, given row/col totals
  - Assuming no association
- Compute X^2

86
For 2x2 Table

O:
         !c_i   c_i
!t_k     a      b
t_k      c      d

E:
         !c_i             c_i              Total
!t_k     (a+b)(a+c)/N     (a+b)(b+d)/N     a+b
t_k      (c+d)(a+c)/N     (c+d)(b+d)/N     c+d
Total    a+c              b+d              N
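For the 2x2 case the sum over cells collapses to a closed form. The derivation below is a standard identity rather than something recovered from the slide; it uses only the cell names a, b, c, d and N = a + b + c + d from the tables above.

```latex
% Every cell deviates from its expectation by the same magnitude:
O_{ij} - E_{ij} = \pm\,\frac{ad - bc}{N},
\qquad\text{e.g.}\quad
a - \frac{(a+b)(a+c)}{N} = \frac{ad - bc}{N}.

% The reciprocals of the expected counts sum to
\sum_{i,j} \frac{1}{E_{ij}} = \frac{N^{3}}{(a+b)(c+d)(a+c)(b+d)}.

% Hence
\chi^{2} = \sum_{i,j} \frac{(O_{ij} - E_{ij})^{2}}{E_{ij}}
         = \frac{(ad - bc)^{2}}{N^{2}} \cdot \frac{N^{3}}{(a+b)(c+d)(a+c)(b+d)}
         = \frac{N\,(ad - bc)^{2}}{(a+b)(c+d)(a+c)(b+d)}.
```

This form needs only the four observed counts, so the expected table never has to be built explicitly when scoring many (term, class) pairs.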

87
X^2 Test
- Test whether random variables are independent
  - Null hypothesis: the 2 R.V.s are independent
- Compute X^2 statistic: sum over all cells of (O - E)^2 / E
- Compute degrees of freedom
  - df = (# rows - 1)(# cols - 1)
  - Shoe example: df = (2-1)(5-1) = 4
- Test probability of X^2 statistic value
  - Look it up in an X^2 table
- If probability is low (below some significance level)
  - Can reject null hypothesis
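Instead of a printed table, the tail probability can be computed directly when df is even, because the chi-square survival function then has the closed form exp(-x/2) * sum over i < df/2 of (x/2)^i / i!. The sketch below applies it to the shoe example, taking the statistic as roughly 14.03 and assuming a 0.05 significance level; the function name is illustrative, and odd df would need a statistics library.

```python
import math

def chi2_sf_even_df(x, df):
    """P(X^2 >= x) for a chi-square distribution with a positive, even df."""
    if df <= 0 or df % 2:
        raise ValueError("this shortcut only covers positive even df")
    half = x / 2
    return math.exp(-half) * sum(half**i / math.factorial(i) for i in range(df // 2))

df = (2 - 1) * (5 - 1)                 # shoe example: 2 rows, 5 columns
p_value = chi2_sf_even_df(14.03, df)   # roughly 0.007
reject_null = p_value < 0.05           # 0.05 is an assumed significance level
```

Since the probability is well below 0.05, the null hypothesis that gender and shoe choice are independent would be rejected at that level.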

88
Requirements for X^2 Test
- Events assumed independent, same distribution
- Outcomes must be mutually exclusive
- Raw frequencies, not percentages
- Sufficient values per cell: > 5
