EACL-2006 Tutorial 1 Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth


1 EACL-2006 Tutorial 1 Language Independent Methods of Clustering Similar Contexts (with applications) Ted Pedersen University of Minnesota, Duluth http://www.d.umn.edu/~tpederse tpederse@d.umn.edu

2 EACL-2006 Tutorial 2 Language Independent Methods: Do not utilize syntactic information (no parsers, part of speech taggers, etc. required). Do not utilize dictionaries or other manually created lexical resources. Based on lexical features selected from corpora. Assumption: word segmentation can be done by looking for white spaces between strings. No manually annotated data of any kind; the methods are completely unsupervised in the strictest sense.

3 EACL-2006 Tutorial 3 Clustering Similar Contexts: A context is a short unit of text, often a phrase to a paragraph in length, although it can be longer. Input: N contexts. Output: K clusters, where each member of a cluster is a context that is more similar to the other members of its cluster than to the contexts found in other clusters.

4 EACL-2006 Tutorial 4 Applications: Headed contexts (contain target word): name discrimination, word sense discrimination. Headless contexts: email organization, document clustering, paraphrase identification. Clustering sets of related words.

5 EACL-2006 Tutorial 5 Tutorial Outline: Identifying lexical features (measures of association & tests of significance). Context representations (first & second order). Dimensionality reduction (Singular Value Decomposition). Clustering (partitional techniques, cluster stopping, cluster labeling). Hands-on exercises.

6 EACL-2006 Tutorial 6 General Info: Please fill out the short survey. Break from 4:00-4:30pm; finish at 6pm. Reception tonight at 7pm at the Castle (?). Slides and video from the tutorial will be posted (I will send you email when that is ready). Questions are welcome, now or via email to me or the SenseClusters list. Comments, observations, and criticisms are all welcome. The Knoppix CD will give you Linux and SenseClusters when your computer is booted from the CD.

7 EACL-2006 Tutorial 7 SenseClusters: A package for clustering contexts, http://senseclusters.sourceforge.net. SenseClusters Live! (Knoppix CD). Integrates with various other tools: Ngram Statistics Package, CLUTO, SVDPACKC.

8 EACL-2006 Tutorial 8 Many thanks… Amruta Purandare (M.S., 2004): founding developer of SenseClusters (2002-2004), now a PhD student in Intelligent Systems at the University of Pittsburgh, http://www.cs.pitt.edu/~amruta/. Anagha Kulkarni (M.S., 2006, expected): enhancing SenseClusters since Fall 2004, http://www.d.umn.edu/~kulka020/. National Science Foundation (USA) for supporting Amruta, Anagha and me via CAREER award #0092784.

9 EACL-2006 Tutorial 9 Background and Motivations

10 EACL-2006 Tutorial 10 Headed and Headless Contexts: A headed context includes a target word; our goal is to cluster the target words based on their surrounding contexts. The target word is the center of the context and of our attention. A headless context has no target word; our goal is to cluster the contexts based on their similarity to each other. The focus is on the context as a whole.

11 EACL-2006 Tutorial 11 Headed Contexts (input): I can hear the ocean in that shell. My operating system shell is bash. The shells on the shore are lovely. The shell command line is flexible. The oyster shell is very hard and black.

12 EACL-2006 Tutorial 12 Headed Contexts (output): Cluster 1: My operating system shell is bash. The shell command line is flexible. Cluster 2: The shells on the shore are lovely. The oyster shell is very hard and black. I can hear the ocean in that shell.

13 EACL-2006 Tutorial 13 Headless Contexts (input): The new version of Linux is more stable and better support for cameras. My Chevy Malibu has had some front end troubles. Osborne made one of the first personal computers. The brakes went out, and the car flew into the house. With the price of gasoline, I think I’ll be taking the bus more often!

14 EACL-2006 Tutorial 14 Headless Contexts (output): Cluster 1: The new version of Linux is more stable and better support for cameras. Osborne made one of the first personal computers. Cluster 2: My Chevy Malibu has had some front end troubles. The brakes went out, and the car flew into the house. With the price of gasoline, I think I’ll be taking the bus more often!

15 EACL-2006 Tutorial 15 Web Search as Application: Web search results are headed contexts; the search term is the target word (found in snippets). Web search results are often disorganized: two people sharing the same name, two organizations sharing the same abbreviation, etc. often have their pages “mixed up”. If you click on search results or follow links in the pages found, you will encounter headless contexts too…

16 EACL-2006 Tutorial16 Name Discrimination

17 EACL-2006 Tutorial17 George Millers!

18 EACL-2006 Tutorial18

19 EACL-2006 Tutorial19

20 EACL-2006 Tutorial20

21 EACL-2006 Tutorial21

22 EACL-2006 Tutorial22

23 EACL-2006 Tutorial 23 Email Foldering as Application: Email (public or private) is made up of headless contexts: short, usually focused… Cluster similar email messages together for automatic email foldering: take all messages from a sent-mail file or inbox and organize them into categories.

24 EACL-2006 Tutorial24

25 EACL-2006 Tutorial25

26 EACL-2006 Tutorial 26 Clustering News as Application: News articles are headless contexts (the entire article or the first paragraph): short, usually focused. Cluster similar articles together.

27 EACL-2006 Tutorial27

28 EACL-2006 Tutorial28

29 EACL-2006 Tutorial29

30 EACL-2006 Tutorial 30 What is it to be “similar”? You shall know a word by the company it keeps: Firth, 1957 (Studies in Linguistic Analysis). Meanings of words are (largely) determined by their distributional patterns (Distributional Hypothesis): Harris, 1968 (Mathematical Structures of Language). Words that occur in similar contexts will have similar meanings (Strong Contextual Hypothesis): Miller and Charles, 1991 (Language and Cognitive Processes). Various extensions: similar contexts will have similar meanings; names that occur in similar contexts will refer to the same underlying person; etc.

31 EACL-2006 Tutorial 31 General Methodology: Represent the contexts to be clustered using first or second order feature vectors (lexical features). Reduce dimensionality to make the vectors more tractable and/or understandable (singular value decomposition). Cluster the context vectors: find the number of clusters, label the clusters. Evaluate and/or use the contexts!

32 EACL-2006 Tutorial 32 Identifying Lexical Features Measures of Association and Tests of Significance

33 EACL-2006 Tutorial 33 What are features? Features represent the (hopefully) salient characteristics of the contexts to be clustered. Eventually we will represent each context as a vector, where the dimensions of the vector are associated with features. Vectors/contexts that include many of the same features will be similar to each other.

34 EACL-2006 Tutorial 34 Where do features come from? In unsupervised clustering, it is common for the feature selection data to be the same data that is to be clustered. This is not cheating, since the data to be clustered does not have any labeled classes that can be used to assist feature selection. It may also be necessary, since we may need to cluster all available data and not hold out some for a separate feature identification step (e.g., email or news articles).

35 EACL-2006 Tutorial 35 Feature Selection: “Test” data – the contexts to be clustered; assume that the feature selection data is the same as the test data, unless otherwise indicated. “Training” data – a separate corpus of held out feature selection data (that will not be clustered); may need to be used if you have a small number of contexts to cluster (e.g., web search results). This sense of “training” is due to Schütze (1998).

36 EACL-2006 Tutorial 36 Lexical Features: Unigram – a single word that occurs more than a given number of times. Bigram – an ordered pair of words that occur together more often than expected by chance; consecutive, or may have intervening words. Co-occurrence – an unordered bigram. Target co-occurrence – a co-occurrence where one of the words is the target word.

37 EACL-2006 Tutorial 37 Bigrams: fine wine (window size of 2), baseball bat, house of representatives (window size of 3), president of the republic (window size of 4), apple orchard. Selected using a small window size (2-4 words), trying to capture a regular (localized) pattern between two words (collocation?).

38 EACL-2006 Tutorial 38 Co-occurrences: tropics water, boat fish, law president, train travel. Usually selected using a larger window (7-10 words) of context, hoping to capture pairs of related words rather than collocations.
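
As a rough illustration of how such pair features might be collected (a hedged Python sketch, not the NSP implementation; whitespace tokenization and the window handling are simplifying assumptions, with the window size counting both words as on the slides):

from collections import Counter

def extract_pair_features(tokens, window=2, ordered=True):
    """Count word pairs that occur within 'window' tokens of each other.
    ordered=True gives bigrams (first word before second, possibly with
    intervening words); ordered=False gives unordered co-occurrences."""
    counts = Counter()
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            w2 = tokens[j]
            pair = (w1, w2) if ordered else tuple(sorted((w1, w2)))
            counts[pair] += 1
    return counts

text = "there was an island curse of black magic cast by that voodoo child"
tokens = text.split()                                          # whitespace segmentation, as assumed in the tutorial
print(extract_pair_features(tokens, window=2))                 # consecutive bigrams
print(extract_pair_features(tokens, window=9, ordered=False))  # wider co-occurrence window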

39 EACL-2006 Tutorial 39 Bigrams and Co-occurrences: Pairs of words tend to be much less ambiguous than unigrams: “bank” versus “river bank” and “bank card”; “dot” versus “dot com” and “dot product”. Trigrams and beyond occur much less frequently (Ngrams are very Zipfian). Unigrams are noisy, but bountiful.

40 EACL-2006 Tutorial 40 “occur together more often than expected by chance…” Observed frequencies for two words occurring together and alone are stored in a 2x2 matrix. Throw out bigrams that include one or two stop words. Expected values are calculated based on the model of independence and the observed values: how often would you expect these words to occur together if they only occurred together by chance? If two words occur “significantly” more often than the expected value, then the words do not occur together by chance.

41 EACL-2006 Tutorial 41 2x2 Contingency Table
              Intelligence   !Intelligence    Total
Artificial        100                           400
!Artificial
Total             300                       100,000

42 EACL-2006 Tutorial 42 2x2 Contingency Table
              Intelligence   !Intelligence    Total
Artificial        100             300           400
!Artificial       200          99,400        99,600
Total             300          99,700       100,000

43 EACL-2006 Tutorial 43 2x2 Contingency Table (observed / expected)
              Intelligence        !Intelligence           Total
Artificial    100.0 / 1.2         300.0 / 398.8             400
!Artificial   200.0 / 298.8       99,400.0 / 99,301.2    99,600
Total         300                 99,700                100,000
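
A small sketch of how the expected values in the last table follow from the marginal totals under the model of independence (Python; the variable names are mine):

# Observed 2x2 counts for (Artificial, Intelligence)
n11, n12 = 100.0, 300.0        # Artificial & Intelligence, Artificial & !Intelligence
n21, n22 = 200.0, 99400.0      # !Artificial & Intelligence, !Artificial & !Intelligence

n1p, n2p = n11 + n12, n21 + n22    # row totals: 400, 99600
np1, np2 = n11 + n21, n12 + n22    # column totals: 300, 99700
npp = n1p + n2p                    # grand total: 100000

# Under independence, expected count = (row total * column total) / grand total
m11 = n1p * np1 / npp   # 1.2
m12 = n1p * np2 / npp   # 398.8
m21 = n2p * np1 / npp   # 298.8
m22 = n2p * np2 / npp   # 99301.2
print(m11, m12, m21, m22)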

44 EACL-2006 Tutorial44 Measures of Association

45 EACL-2006 Tutorial45 Measures of Association

46 EACL-2006 Tutorial 46 Interpreting the Scores… G^2 and X^2 are asymptotically approximated by the chi-squared distribution. This means: if you fix the marginal totals of a table, randomly generate internal cell values in the table, calculate the G^2 or X^2 scores for each resulting table, and plot the distribution of the scores, you *should* get the chi-squared distribution…

47 EACL-2006 Tutorial47

48 EACL-2006 Tutorial 48 Interpreting the Scores… Values above a certain level of significance can be considered grounds for rejecting the null hypothesis (H0: the words in the bigram are independent). A value of 3.841 is associated with 95% confidence that the null hypothesis should be rejected.
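
To make the G^2 and X^2 scores concrete, here is a minimal sketch that computes both for the contingency table above using the standard textbook formulas (not the NSP code itself) and compares them to the 3.841 critical value:

import math

observed = [100.0, 300.0, 200.0, 99400.0]
expected = [1.2, 398.8, 298.8, 99301.2]    # from the independence model above

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))        # Pearson's chi-squared
g2 = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected)) # log-likelihood ratio
# (all observed counts here are non-zero, so the log is safe)

critical = 3.841   # chi-squared, 1 degree of freedom, alpha = 0.05
for name, score in (("X^2", x2), ("G^2", g2)):
    print(name, round(score, 1), "reject H0" if score > critical else "fail to reject H0")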

49 EACL-2006 Tutorial 49 Measures of Association: There are numerous measures of association that can be used to identify bigram and co-occurrence features. Many of these are supported in the Ngram Statistics Package (NSP): http://www.d.umn.edu/~tpederse/nsp.html

50 EACL-2006 Tutorial 50 Measures Supported in NSP: Log-likelihood Ratio (ll), True Mutual Information (tmi), Pearson’s Chi-squared Test (x2), Pointwise Mutual Information (pmi), Phi coefficient (phi), T-test (tscore), Fisher’s Exact Test (leftFisher, rightFisher), Dice Coefficient (dice), Odds Ratio (odds).

51 EACL-2006 Tutorial 51 NSP: We will explore NSP during the practical session. Integrated into SenseClusters; may also be used in stand-alone mode. Can be installed easily on a Linux/Unix system from the CD or downloaded from http://www.d.umn.edu/~tpederse/nsp.html. I’m told it can also be installed on Windows (via cygwin or ActivePerl), but I have no personal experience of this…

52 EACL-2006 Tutorial 52 Summary: Identify lexical features based on frequency counts or measures of association, either in the data to be clustered or in a separate set of feature selection data. Language independent. Unigrams are usually selected only by frequency; remember, there is no labeled data from which to learn, so they are somewhat less effective as features than in the supervised case. Bigrams and co-occurrences can also be selected by frequency, or better yet by measures of association; they need not be consecutive. Stop words should be eliminated. Frequency thresholds are helpful (e.g., a unigram or bigram that occurs once may be too rare to be useful).

53 EACL-2006 Tutorial 53 Related Work: Moore, 2004 (EMNLP), follow-up to Dunning and Pedersen on log-likelihood and exact tests, http://acl.ldc.upenn.edu/acl2004/emnlp/pdf/Moore.pdf. Pedersen, 1996 (SCSUG), explanation of exact tests and comparison to log-likelihood, http://arxiv.org/abs/cmp-lg/9608010 (also see Pedersen, Kayaalp, and Bruce, AAAI-1996). Dunning, 1993 (Computational Linguistics), introduces the log-likelihood ratio for collocation identification, http://acl.ldc.upenn.edu/J/J93/J93-1003.pdf.

54 EACL-2006 Tutorial 54 Context Representations First and Second Order Methods

55 EACL-2006 Tutorial 55 Once features selected… We will have a set of unigrams, bigrams, co-occurrences or target co-occurrences that we believe are somehow interesting and useful. We also have any frequency and measure of association scores that were used in their selection. Convert the contexts to be clustered into a vector representation based on these features.

56 EACL-2006 Tutorial 56 First Order Representation: Each context is represented by a vector with M dimensions, each of which indicates whether or not a particular feature occurred in that context. The value may be binary, a frequency count, or an association score. This is a context by feature representation.

57 EACL-2006 Tutorial 57 Contexts: Cxt1: There was an island curse of black magic cast by that voodoo child. Cxt2: Harold, a known voodoo child, was gifted in the arts of black magic. Cxt3: Despite their military might, it was a serious error to attack. Cxt4: Military might is no defense against a voodoo child or an island curse.

58 EACL-2006 Tutorial 58 Unigram Feature Set: island 1000, black 700, curse 500, magic 400, child 200 (assume these are frequency counts obtained from some corpus…).

59 EACL-2006 Tutorial 59 First Order Vectors of Unigrams
        island  black  curse  magic  child
Cxt1       1      1      1      1      1
Cxt2       0      1      0      1      1
Cxt3       0      0      0      0      0
Cxt4       1      0      1      0      1

60 EACL-2006 Tutorial 60 Bigram Feature Set: island curse 189.2, black magic 123.5, voodoo child 120.0, military might 100.3, serious error 89.2, island child 73.2, voodoo might 69.4, military error 54.9, black child 43.2, serious curse 21.2 (assume these are log-likelihood scores based on frequency counts from some corpus).

61 EACL-2006 Tutorial 61 First Order Vectors of Bigrams
        black magic  island curse  military might  serious error  voodoo child
Cxt1         1            1              0               0              1
Cxt2         1            0              0               0              1
Cxt3         0            0              1               1              0
Cxt4         0            1              1               0              1
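
A sketch of how the first order bigram vectors above could be built (Python; the contexts and feature list are the toy examples from the slides, and binary values are used):

contexts = {
    "Cxt1": "There was an island curse of black magic cast by that voodoo child.",
    "Cxt2": "Harold, a known voodoo child, was gifted in the arts of black magic.",
    "Cxt3": "Despite their military might, it was a serious error to attack.",
    "Cxt4": "Military might is no defense against a voodoo child or an island curse.",
}
features = ["black magic", "island curse", "military might", "serious error", "voodoo child"]

def first_order_vector(text, features):
    """Binary context-by-feature vector: 1 if the bigram appears in the context."""
    text = text.lower()
    return [1 if f in text else 0 for f in features]

for name, text in contexts.items():
    print(name, first_order_vector(text, features))   # matches the table above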

62 EACL-2006 Tutorial 62 First Order Vectors: Can have binary values or weights associated with frequency, etc. Forms a context by feature matrix. May optionally be smoothed/reduced with Singular Value Decomposition (more on that later). The contexts are then ready for clustering (more on that later).

63 EACL-2006 Tutorial 63 Second Order Features: First order features encode the occurrence of a feature in a context; feature occurrence is represented by a binary value. Second order features encode something ‘extra’ about a feature that occurs in a context; feature occurrence is represented by word co-occurrences or by context occurrences.

64 EACL-2006 Tutorial 64 Second Order Representation: First, build a word by word matrix from the features, based on bigrams or co-occurrences: the first word is the row, the second word is the column, and the cell is the score; (optionally) reduce dimensionality with SVD; each row forms a vector of first order co-occurrences. Second, replace each word in a context with its row/vector as found in the word by word matrix, and average all the word vectors in the context to create the second order representation. Due to Schütze (1998), related to LSI/LSA.

65 EACL-2006 Tutorial 65 Word by Word Matrix
           magic   curse   might   error   child
black      123.5     0       0       0      43.2
island       0     189.2     0       0      73.2
military     0       0     100.3    54.9     0
serious      0      21.2     0      89.2     0
voodoo       0       0      69.4     0     120.0

66 EACL-2006 Tutorial 66 Word by Word Matrix …can also be used to identify sets of related words. In the case of bigrams, rows represent the first word in a bigram and columns represent the second word, so the matrix is asymmetric. In the case of co-occurrences, rows and columns are equivalent, so the matrix is symmetric. The vector (row) for each word represents a set of first order features for that word. Each word in a context to be clustered for which a vector exists (in the word by word matrix) is replaced by that vector in that context.

67 EACL-2006 Tutorial 67 There was an island curse of black magic cast by that voodoo child.
          magic   curse   might   error   child
black     123.5     0       0       0      43.2
island      0     189.2     0       0      73.2
voodoo      0       0      69.4     0     120.0

68 EACL-2006 Tutorial 68 Second Order Co-Occurrences: The word vectors for “black” and “island” show similarity, as both occur with “child”. “black” and “island” are second order co-occurrences of each other, since both occur with “child” but not with each other (i.e., “black island” is not observed).

69 EACL-2006 Tutorial 69 Second Order Representation: There was an [curse, child] curse of [magic, child] magic cast by that [might, child] child (each word with a row in the word by word matrix is replaced by that row, shown here by its non-zero columns). The context is then represented by [curse, child] + [magic, child] + [might, child].

70 EACL-2006 Tutorial 70 There was an island curse of black magic cast by that voodoo child.
        magic   curse   might   error   child
Cxt1    41.2    63.1    24.4      0     78.8
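
The Cxt1 row above can be reproduced (approximately) by averaging the word-by-word rows for the context words that appear in the matrix, which is the core of the second order construction. A minimal sketch, assuming simple whitespace tokenization:

word_vectors = {                     # rows of the word-by-word matrix (magic, curse, might, error, child)
    "black":  [123.5, 0.0,   0.0,  0.0,  43.2],
    "island": [0.0,   189.2, 0.0,  0.0,  73.2],
    "voodoo": [0.0,   0.0,   69.4, 0.0,  120.0],
}

def second_order_vector(context, word_vectors):
    """Average the first order vectors of the context words found in the matrix."""
    rows = [word_vectors[w] for w in context.lower().split() if w in word_vectors]
    return [sum(col) / len(rows) for col in zip(*rows)]

cxt1 = "There was an island curse of black magic cast by that voodoo child"
print([round(v, 1) for v in second_order_vector(cxt1, word_vectors)])
# -> roughly [41.2, 63.1, 23.1, 0.0, 78.8]; the "might" value comes out slightly
# different from the slide's 24.4, presumably due to how the slide's numbers were produced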

71 EACL-2006 Tutorial 71 Second Order Representation: Results in a context by feature (word) representation. Cell values do not indicate whether the feature occurred in the context; rather, they show the strength of association of that feature with other words that occur with a word in the context.

72 EACL-2006 Tutorial 72 Summary: First order representations are intuitive, but can suffer from sparsity; contexts are represented based on the features that occur in those contexts. Second order representations are harder to visualize, but allow a word to be represented by the words it co-occurs with (i.e., the company it keeps) and a context to be represented by the words that occur with the words in the context. This helps combat sparsity…

73 EACL-2006 Tutorial 73 Related Work: Pedersen and Bruce 1997 (EMNLP) presented the first order method of discrimination, http://acl.ldc.upenn.edu/W/W97/W97-0322.pdf. Schütze 1998 (Computational Linguistics) introduced the second order method, http://acl.ldc.upenn.edu/J/J98/J98-1004.pdf. Purandare and Pedersen 2004 (CoNLL) compared first and second order methods, http://acl.ldc.upenn.edu/hlt-naacl2004/conll04/pdf/purandare.pdf: first order is better if you have lots of data; second order is better with smaller amounts of data.

74 EACL-2006 Tutorial 74 Dimensionality Reduction Singular Value Decomposition

75 EACL-2006 Tutorial 75 Motivation: First order matrices (context by feature, word by word) are very sparse. NLP data is noisy: no stemming performed, synonyms.

76 EACL-2006 Tutorial 76 Many Methods: Singular Value Decomposition (SVD), as in SVDPACKC http://www.netlib.org/svdpack/; Multi-Dimensional Scaling (MDS); Principal Components Analysis (PCA); Independent Components Analysis (ICA); Linear Discriminant Analysis (LDA); etc…

77 EACL-2006 Tutorial 77 Effect of SVD: SVD reduces a matrix to a given number of dimensions. This may convert a word level space into a semantic or conceptual space. If “dog”, “collie” and “wolf” are dimensions/columns in a word co-occurrence matrix, after SVD they may be replaced by a single dimension that represents “canines”.

78 EACL-2006 Tutorial 78 Effect of SVD: The dimensions of the matrix after SVD are principal components that represent the meaning of concepts; similar columns are grouped together. SVD is a way of smoothing a very sparse matrix, so that there are very few zero valued cells after SVD.

79 EACL-2006 Tutorial 79 How can SVD be used? SVD on first order contexts will reduce a context by feature representation down to a smaller number of features. Latent Semantic Analysis typically performs SVD on a feature by context representation, where the contexts are reduced. SVD is also used in creating second order context representations, to reduce the word by word matrix.

80 EACL-2006 Tutorial 80 Word by Word Matrix
        apple  blood  cells  ibm  data  box  tissue  graphics  memory  organ  plasma
pc        2      0      0     1    3     1     0        0         0      0      0
body      0      3      0     0    0     0     2        0         0      2      1
disk      1      0      0     2    0     3     0        1         2      0      0
petri     0      2      1     0    0     0     2        0         1      0      1
lab       0      0      3     0    2     0     2        0         2      1      3
sales     0      0      0     2    3     0     0        1         2      0      0
linux     2      0      0     1    3     2     0        1         1      0      0
debt      0      0      0     2    3     4     0        2         0      0      0

81 EACL-2006 Tutorial 81 Singular Value Decomposition: A = UDV’

82 EACL-2006 Tutorial 82 U
 .35  .09  -.2   .52  -.09  .40  .02  .63  .20  -.00  -.02
 .05  -.49  .59  .44   .08  -.09 -.44 -.04  -.6  -.02  -.01
 .35  .13   .39  -.60  .31  .41  -.22  .20  -.39  .00   .03
 .08  -.45  .25  -.02  .17  .09   .83  .05  -.26 -.01   .00
 .29  -.68 -.45  -.34 -.31  .02  -.21  .01   .43 -.02  -.07
 .37  -.01 -.31   .09  .72  -.48 -.04  .03   .31 -.00   .08
 .46  .11  -.08   .24 -.01  .39   .05  .08   .08 -.00  -.01
 .56  .25   .30  -.07 -.49  -.52  .14  -.3  -.30  .00  -.07

83 EACL-2006 Tutorial83 D 9.19 6.36 3.99 3.25 2.52 2.30 1.26 0.66 0.00 0.00 0.00

84 EACL-2006 Tutorial 84 V
 .21  .08  -.04  .28   .04  .86  -.05 -.05  -.31 -.12   .03
 .04  -.37  .57  .39   .23  -.04  .26 -.02   .03  .25   .44
 .11  -.39 -.27  -.32 -.30   .06  .17  .15  -.41  .58   .07
 .37  .15   .12  -.12  .39  -.17 -.13  .71  -.31 -.12   .03
 .63  -.01 -.45   .52 -.09  -.26  .08 -.06   .21  .08  -.02
 .49  .27   .50  -.32 -.45   .13  .02 -.01   .31  .12  -.03
 .09  -.51  .20   .05 -.05   .02  .29  .08  -.04 -.31  -.71
 .25  .11   .15  -.12  .02  -.32  .05 -.59  -.62 -.23   .07
 .28  -.23 -.14  -.45  .64   .17 -.04 -.32   .31  .12  -.03
 .04  -.26  .19   .17 -.06  -.07 -.87 -.10  -.07  .22  -.20
 .11  -.47 -.12  -.18 -.27   .03 -.18  .09   .12 -.58   .50

85 EACL-2006 Tutorial 85 Word by Word Matrix After SVD
        apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
pc       .73    .00    .11   1.3   2.0    .01      .86      .77    .00     .09
body     .00    1.2    1.3   .00   .33    1.6      .00      .85    .84     1.5
disk     .76    .00    .01   1.3   2.1    .00      .91      .72    .00     .00
germ     .00    1.1    1.2   .00   .49    1.5      .00      .86    .77     1.4
lab      .21    1.7    2.0   .35   1.7    2.5      .18      1.7    1.2     2.3
sales    .73    .15    .39   1.3   2.2    .35      .85      .98    .17     .41
linux    .96    .00    .16   1.7   2.7    .03      1.1      1.0    .00     .13
debt     1.2    .00    .00   2.1   3.2    .00      1.5      1.1    .00     .00
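
A small sketch of this kind of SVD-based smoothing, using numpy on the 8x11 word by word matrix from slide 80; keeping only the top singular values and reconstructing gives a dense matrix of the same general character (the exact values on the slide presumably come from SVDPACKC with a particular choice of rank, so this will not match cell for cell):

import numpy as np

words = ["pc", "body", "disk", "petri", "lab", "sales", "linux", "debt"]
A = np.array([
    [2,0,0,1,3,1,0,0,0,0,0],
    [0,3,0,0,0,0,2,0,0,2,1],
    [1,0,0,2,0,3,0,1,2,0,0],
    [0,2,1,0,0,0,2,0,1,0,1],
    [0,0,3,0,2,0,2,0,2,1,3],
    [0,0,0,2,3,0,0,1,2,0,0],
    [2,0,0,1,3,2,0,1,1,0,0],
    [0,0,0,2,3,4,0,2,0,0,0],
], dtype=float)

U, d, Vt = np.linalg.svd(A, full_matrices=False)   # A = U D V'
k = 2                                              # keep the top-k singular values
A_k = U[:, :k] @ np.diag(d[:k]) @ Vt[:k, :]        # rank-k reconstruction (smoothed matrix)

np.set_printoptions(precision=2, suppress=True)
print(np.round(d, 2))      # singular values
print(A_k)                 # dense, smoothed word by word matrix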

86 EACL-2006 Tutorial 86 Second Order Representation
Context 1: I got a new disk today!
Context 2: What do you think of linux?
These two contexts share no words in common, yet they are similar! disk and linux both occur with “Apple”, “IBM”, “data”, “graphics”, and “memory”. The two contexts are similar because they share many second order co-occurrences.
        apple  blood  cells  ibm  data  tissue  graphics  memory  organ  plasma
disk     .76    .00    .01   1.3   2.1    .00      .91      .72    .00     .00
linux    .96    .00    .16   1.7   2.7    .03      1.1      1.0    .00     .13
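
The similarity claim can be checked directly with cosine similarity on the two smoothed rows (a sketch; the values are those read off the slide):

import math

disk  = [0.76, 0.00, 0.01, 1.3, 2.1, 0.00, 0.91, 0.72, 0.00, 0.00]
linux = [0.96, 0.00, 0.16, 1.7, 2.7, 0.03, 1.10, 1.00, 0.00, 0.13]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(round(cosine(disk, linux), 3))   # very close to 1.0: the two contexts look similar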

87 EACL-2006 Tutorial 87 Relationship to LSA: Latent Semantic Analysis uses a feature by context first order representation, which indicates all the contexts in which a feature occurs. Use SVD to reduce the dimensions (contexts). Cluster features based on the similarity of the contexts in which they occur. Represent sentences using an average of feature vectors.

88 EACL-2006 Tutorial 88 Feature by Context Representation
                 Cxt1  Cxt2  Cxt3  Cxt4
black magic        1     1     0     1
island curse       1     0     0     1
military might     0     0     1     0
serious error      0     0     1     0
voodoo child       1     1     0     1

89 EACL-2006 Tutorial 89 References: Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, vol. 41, 1990. Landauer, T. and Dumais, S., A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition, Induction and Representation of Knowledge, Psychological Review, vol. 104, 1997. Schütze, H., Automatic Word Sense Discrimination, Computational Linguistics, vol. 24, 1998. Berry, M.W., Drmac, Z., and Jessup, E.R., Matrices, Vector Spaces, and Information Retrieval, SIAM Review, vol. 41, 1999.

90 EACL-2006 Tutorial 90 Clustering Partitional Methods Cluster Stopping Cluster Labeling

91 EACL-2006 Tutorial 91 Many many methods… Cluto supports a wide range of different clustering methods: agglomerative (average, single, complete link…), partitional (K-means (Direct)), and hybrid (repeated bisections). SenseClusters integrates with Cluto: http://www-users.cs.umn.edu/~karypis/cluto/

92 EACL-2006 Tutorial 92 General Methodology: Represent the contexts to be clustered in first or second order vectors. Cluster the context vectors directly (vcluster), or convert to a similarity matrix and then cluster (scluster).

93 EACL-2006 Tutorial 93 Agglomerative Clustering: Create a similarity matrix of the contexts to be clustered. This results in a symmetric “instance by instance” matrix, where each cell contains the similarity score between a pair of instances. Typically a first order representation, where similarity is based on the features observed in the pair of instances.

94 EACL-2006 Tutorial 94 Measuring Similarity: Integer values: matching coefficient, Jaccard coefficient, Dice coefficient. Real values: cosine.
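
These measures can be sketched in a few lines (Python; for the integer-valued measures the contexts are treated as sets of features, which is one common formulation, not necessarily the exact one used by any particular package):

import math

def matching(a, b):           # size of the overlap
    return len(a & b)

def jaccard(a, b):            # overlap / union
    return len(a & b) / len(a | b)

def dice(a, b):               # 2 * overlap / (sum of sizes)
    return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):             # for real-valued vectors
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

c1 = {"black magic", "island curse", "voodoo child"}
c2 = {"black magic", "voodoo child"}
print(matching(c1, c2), round(jaccard(c1, c2), 2), round(dice(c1, c2), 2))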

95 EACL-2006 Tutorial 95 Agglomerative Clustering: Apply the agglomerative clustering algorithm to the similarity matrix. To start, each context is its own cluster. Form a cluster from the most similar pair of contexts. Repeat until the desired number of clusters is obtained. Advantages: high quality clustering. Disadvantages: computationally expensive; must carry out exhaustive pairwise comparisons.

96 EACL-2006 Tutorial 96 Average Link Clustering
      S1  S2  S3  S4
S1         3   4   2
S2     3       2   0
S3     4   2       1
S4     2   0   1
(The remaining frames of the slide show the merge sequence: S1 and S3 are merged first as the most similar pair, then S2 joins {S1, S3}, and finally S4.)
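
A minimal sketch of average link agglomeration over this toy similarity matrix (not the CLUTO implementation; it simply replays the merge sequence the slide animates):

from itertools import combinations

sim = {("S1", "S2"): 3, ("S1", "S3"): 4, ("S1", "S4"): 2,
       ("S2", "S3"): 2, ("S2", "S4"): 0, ("S3", "S4"): 1}

def pair_sim(a, b):
    return sim.get((a, b), sim.get((b, a)))

def avg_link(c1, c2):
    """Average pairwise similarity between two clusters (tuples of items)."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(pair_sim(a, b) for a, b in pairs) / len(pairs)

clusters = [("S1",), ("S2",), ("S3",), ("S4",)]
while len(clusters) > 1:
    best = max(combinations(clusters, 2), key=lambda p: avg_link(*p))
    clusters = [c for c in clusters if c not in best] + [best[0] + best[1]]
    print(clusters)
# Merges S1+S3 first, then S2 joins them, then S4.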

97 EACL-2006 Tutorial 97 Partitional Methods: Randomly create centroids equal to the number of clusters you wish to find. Assign each context to the nearest centroid. After all contexts are assigned, re-compute the centroids; the “best” location is decided by a criterion function. Repeat until stable clusters are found, i.e., the centroids don’t shift from iteration to iteration.
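
A compact sketch of this partitional (k-means style) loop over plain Python lists; the random restarts and criterion-function details of a real implementation such as CLUTO are omitted:

import random

def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)          # random initial centroids
    for _ in range(iters):
        # assign each context to the nearest centroid (squared Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[i].append(p)
        # re-compute centroids; stop when they no longer shift
        new_centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

points = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.1], [4.8, 5.3]]
clusters, centroids = kmeans(points, k=2)
print(clusters)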

98 EACL-2006 Tutorial 98 Partitional Methods: Advantages: fast. Disadvantages: results can be dependent on the initial placement of centroids; must specify the number of clusters ahead of time (maybe not…).

99 EACL-2006 Tutorial99 Vectors to be clustered

100 EACL-2006 Tutorial100 Random Initial Centroids (k=2)

101 EACL-2006 Tutorial101 Assignment of Clusters

102 EACL-2006 Tutorial102 Recalculation of Centroids

103 EACL-2006 Tutorial103 Reassignment of Clusters

104 EACL-2006 Tutorial104 Recalculation of Centroid

105 EACL-2006 Tutorial105 Reassignment of Clusters

106 EACL-2006 Tutorial 106 Partitional Criterion Functions: Intra-cluster (internal) similarity/distance: how close together are members of a cluster? Closer together is better. Inter-cluster (external) similarity/distance: how far apart are the different clusters? Further apart is better.

107 EACL-2006 Tutorial 107 Intra Cluster Similarity: Ball of String (I1): how far is each member from each other member? Flower (I2): how far is each member of the cluster from the centroid?

108 EACL-2006 Tutorial108 Contexts to be Clustered

109 EACL-2006 Tutorial109 Ball of String (I1 Internal Criterion Function)

110 EACL-2006 Tutorial110 Flower (I2 Internal Criterion Function)

111 EACL-2006 Tutorial 111 Inter Cluster Similarity: The Fan (E1): how far is each centroid from the centroid of the entire collection of contexts? Maximize that distance.

112 EACL-2006 Tutorial112 The Fan (E1 External Criterion Function)

113 EACL-2006 Tutorial 113 Hybrid Criterion Functions: Balance internal and external similarity: H1 = I1/E1, H2 = I2/E1. We want internal similarity to increase while external similarity decreases; equivalently, we want internal distances to decrease while external distances increase.
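
As an illustration, the flower-style internal criterion (I2) and fan-style external criterion (E1) described above might be computed along the following lines. The exact formulas used by CLUTO differ in detail (it works with similarities and weighted sums), so treat this distance-based version as an assumption-laden sketch of the intuition only:

import math

def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def i2(clusters):
    """Internal: total distance of members to their own centroid (smaller is tighter)."""
    return sum(dist(p, centroid(c)) for c in clusters for p in c)

def e1(clusters, all_points):
    """External: weighted distance of cluster centroids to the overall centroid (larger is more separated)."""
    overall = centroid(all_points)
    return sum(len(c) * dist(centroid(c), overall) for c in clusters)

def hybrid(clusters, all_points):
    """Separation per unit of spread (larger is better). The slide's H2 = I2/E1 is stated
    over similarities; with distances, the ratio is inverted as done here."""
    return e1(clusters, all_points) / i2(clusters)

pts = [[1, 1], [1.2, 0.9], [5, 5.1], [4.8, 5.3]]
clusters = [pts[:2], pts[2:]]
print(round(i2(clusters), 3), round(e1(clusters, pts), 3), round(hybrid(clusters, pts), 3))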

114 EACL-2006 Tutorial 114 Cluster Stopping

115 EACL-2006 Tutorial 115 Cluster Stopping: Many clustering algorithms require that the user specify the number of clusters prior to clustering. But the user often doesn’t know the number of clusters, and in fact finding that out might be the goal of clustering.

116 EACL-2006 Tutorial 116 Criterion Functions Can Help: Run the partitional algorithm for k = 1 to deltaK, where deltaK is a user estimated or automatically determined upper bound on the number of clusters. Find the value of k at which the criterion function does not significantly increase at k+1. Clustering can stop at this value, since no further improvement in the solution is apparent with additional clusters (increases in k).
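
One simple way to operationalize "does not significantly increase" is a relative-gain threshold, sketched below over a list of criterion scores; the PK measures on the next slides are the more principled versions used by SenseClusters:

def pick_k(criterion_scores, rel_threshold=0.05):
    """criterion_scores[i] is the criterion value for k = i + 1 (assumed roughly non-decreasing).
    Return the smallest k whose relative gain at k+1 falls below the threshold."""
    for i in range(len(criterion_scores) - 1):
        gain = criterion_scores[i + 1] - criterion_scores[i]
        if gain / max(criterion_scores[i], 1e-12) < rel_threshold:
            return i + 1
    return len(criterion_scores)

# e.g. criterion values for k = 1..8 that level off after k = 3
scores = [10.0, 16.0, 21.0, 21.3, 21.5, 21.6, 21.7, 21.7]
print(pick_k(scores))   # -> 3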

117 EACL-2006 Tutorial 117 SenseClusters’ Approach to Cluster Stopping: Will be the subject of a demo at EACL, Demo Session 2, 5th April, 14:30-16:00: Ted Pedersen and Anagha Kulkarni, Selecting the "Right" Number of Senses Based on Clustering Criterion Functions.

118 EACL-2006 Tutorial118 H2 versus k T. Blair – V. Putin – S. Hussein

119 EACL-2006 Tutorial 119 PK2: Based on Hartigan, 1975. When the ratio approaches 1, clustering is at a plateau. Select the value of k which is closest to but outside of the standard deviation interval.

120 EACL-2006 Tutorial120 PK2 predicts 3 senses T. Blair – V. Putin – S. Hussein

121 EACL-2006 Tutorial 121 PK3: Related to Salvador and Chan, 2004. Inspired by the Dice Coefficient. Values close to 1 mean clustering is improving. Select the value of k which is closest to but outside of the standard deviation interval.
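
A hedged sketch of ratio-based stopping measures in the spirit of PK2 and PK3. The formulas below, PK2(k) = Cr(k)/Cr(k-1) and PK3(k) = 2*Cr(k)/(Cr(k-1)+Cr(k+1)) where Cr(k) is the criterion score at k, follow the published descriptions as I read them; the selection rule (pick the k closest to, but outside, one standard deviation of the scores) is then applied to the printed values:

def pk2(cr):
    """PK2(k) = Cr(k) / Cr(k-1); values near 1 suggest the criterion has plateaued."""
    return {k: cr[k] / cr[k - 1] for k in range(2, len(cr))}

def pk3(cr):
    """PK3(k) = 2 * Cr(k) / (Cr(k-1) + Cr(k+1)); values near 1 suggest no local jump at k."""
    return {k: 2 * cr[k] / (cr[k - 1] + cr[k + 1]) for k in range(2, len(cr) - 1)}

cr = [None, 10.0, 16.0, 21.0, 21.3, 21.5, 21.6]   # toy Cr(k) for k = 1..6, levelling off after k = 3
print({k: round(v, 2) for k, v in pk2(cr).items()})   # drops toward 1 as the curve flattens
print({k: round(v, 2) for k, v in pk3(cr).items()})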

122 EACL-2006 Tutorial122 PK3 predicts 3 senses T. Blair – V. Putin – S. Hussein

123 EACL-2006 Tutorial 123 References: Hartigan, J., Clustering Algorithms, Wiley, 1975 (basis for SenseClusters stopping method PK2). Mojena, R., Hierarchical Grouping Methods and Stopping Rules: An Evaluation, The Computer Journal, vol. 20, 1977 (basis for SenseClusters stopping method PK1). Milligan, G. and Cooper, M., An Examination of Procedures for Determining the Number of Clusters in a Data Set, Psychometrika, vol. 50, 1985 (very extensive comparison of cluster stopping methods). Tibshirani, R., Walther, G., and Hastie, T., Estimating the Number of Clusters in a Dataset via the Gap Statistic, Journal of the Royal Statistics Society (Series B), 2001. Pedersen, T. and Kulkarni, A., Selecting the "Right" Number of Senses Based on Clustering Criterion Functions, Proceedings of the Posters and Demo Program of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics, 2006 (describes the SenseClusters stopping methods).

124 EACL-2006 Tutorial 124 Cluster Labeling

125 EACL-2006 Tutorial 125 Cluster Labeling: Once a cluster is discovered, how can you generate a description of the contexts of that cluster automatically? In the case of contexts, you might be able to identify significant lexical features from the contents of the clusters, and use those as a preliminary label.

126 EACL-2006 Tutorial 126 Results of Clustering: Each cluster consists of some number of contexts, and each context is a short unit of text. Apply measures of association to the contents of each cluster to determine the N most significant bigrams, and use those bigrams as a label for the cluster.

127 EACL-2006 Tutorial 127 Label Types: The N most significant bigrams for each cluster act as a descriptive label. The M most significant bigrams that are unique to each cluster act as a discriminating label.
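
A rough sketch of this labeling idea: score the bigrams inside each cluster (raw frequency is used below as a stand-in for a proper association measure, purely for brevity), keep the top N as the descriptive label, and mark as discriminating those that appear in only one cluster's top list:

from collections import Counter

def top_bigrams(contexts, n=3):
    """N most frequent consecutive word pairs in a cluster (stand-in for an association measure)."""
    counts = Counter()
    for text in contexts:
        tokens = text.lower().split()
        counts.update(zip(tokens, tokens[1:]))
    return [bg for bg, _ in counts.most_common(n)]

clusters = {
    "C1": ["my operating system shell is bash", "the shell command line is flexible"],
    "C2": ["the shells on the shore are lovely", "the oyster shell is very hard and black"],
}
descriptive = {c: top_bigrams(ctxs) for c, ctxs in clusters.items()}
discriminating = {
    c: [bg for bg in bgs
        if not any(bg in other for name, other in descriptive.items() if name != c)]
    for c, bgs in descriptive.items()
}
print(descriptive)
print(discriminating)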

128 EACL-2006 Tutorial 128 Evaluation Techniques Comparison to gold standard data

129 EACL-2006 Tutorial 129 Evaluation: If sense tagged text is available, it can be used for evaluation, but don’t use the sense tags for clustering or feature selection! Assume that the sense tags represent “true” clusters, and compare these to the discovered clusters. Find the mapping of clusters to senses that attains maximum accuracy.

130 EACL-2006 Tutorial 130 Evaluation: Pseudo words are especially useful, since it is hard to find data that is discriminated. Pick two words or names from a corpus, and conflate them into one name; then see how well you can discriminate. http://www.d.umn.edu/~tpederse/tools.html. Baseline algorithm: group all instances into one cluster; this will reach “accuracy” equal to the majority classifier.

131 EACL-2006 Tutorial 131 Evaluation: Pseudo words are especially useful, since it is hard to find data that is discriminated. Pick two words or names from a corpus, and conflate them into one name; then see how well you can discriminate. http://www.d.umn.edu/~kulka020/kanaghaName.html

132 EACL-2006 Tutorial 132 Baseline Algorithm: Group all instances into one cluster; this will reach “accuracy” equal to the majority classifier. What if the clustering said everything should be in the same cluster?

133 EACL-2006 Tutorial 133 Baseline Performance
         S1   S2   S3   Totals
C1        0    0    0      0
C2        0    0    0      0
C3       80   35   55    170
Totals   80   35   55    170

         S3   S2   S1   Totals
C1        0    0    0      0
C2        0    0    0      0
C3       55   35   80    170
Totals   55   35   80    170
With the first ordering the diagonal gives (0+0+55)/170 = .32 (C3 labeled S3); with the second it gives (0+0+80)/170 = .47 (C3 labeled S1, the majority sense).

134 EACL-2006 Tutorial 134 Evaluation: Suppose that C1 is labeled S1, C2 as S2, and C3 as S3. Accuracy = (10 + 0 + 10) / 170 = 12%. The diagonal shows how many members of the cluster actually belong to the sense given on the column. Can the columns be rearranged to improve the overall accuracy? Optimally assign clusters to senses.
         S1   S2   S3   Totals
C1       10   30    5     45
C2       20    0   40     60
C3       50    5   10     65
Totals   80   35   55    170

135 EACL-2006 Tutorial 135 Evaluation: The assignment of C1 to S2, C2 to S3, and C3 to S1 results in 120/170 = 71%. Find the ordering of the columns in the matrix that maximizes the sum of the diagonal. This is an instance of the Assignment Problem from Operations Research, or of finding the Maximal Matching of a Bipartite Graph from Graph Theory.
         S2   S3   S1   Totals
C1       30    5   10     45
C2        0   40   20     60
C3        5   10   50     65
Totals   35   55   80    170
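
This optimal mapping can be computed with the Hungarian algorithm; a sketch using scipy on the confusion matrix above (scipy's linear_sum_assignment minimizes cost, so the counts are negated to maximize the diagonal):

import numpy as np
from scipy.optimize import linear_sum_assignment

# rows = discovered clusters C1..C3, columns = senses S1..S3
confusion = np.array([
    [10, 30,  5],
    [20,  0, 40],
    [50,  5, 10],
])

rows, cols = linear_sum_assignment(-confusion)   # maximize the diagonal sum
accuracy = confusion[rows, cols].sum() / confusion.sum()
for c, s in zip(rows, cols):
    print(f"C{c + 1} -> S{s + 1}")               # C1 -> S2, C2 -> S3, C3 -> S1
print(round(accuracy, 2))                        # 0.71, matching the slide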

136 EACL-2006 Tutorial 136 Analysis: Unsupervised methods may not discover clusters equivalent to the classes learned in supervised learning. Evaluation based on assuming that sense tags represent the “true” clusters is likely a bit harsh. Alternatives? Humans could look at the members of each cluster and determine the nature of the relationship or meaning that they all share. Use the contents of the cluster to generate a descriptive label that could be inspected by a human.

137 EACL-2006 Tutorial 137 Practical Session Experiments with SenseClusters

138 EACL-2006 Tutorial 138 Things to Try: Feature identification: type of feature, measures of association. Context representation (1st or 2nd order). Automatic stopping (or not). SVD (or not). Clustering algorithm and criterion function. Evaluation. Labeling.

139 EACL-2006 Tutorial 139 Experimental Data: Available on the web site, http://senseclusters.sourceforge.net, and on the LIVE CD. Mostly “name conflate” data.

140 EACL-2006 Tutorial 140 Creating Experimental Data: The NameConflate program creates name conflated data from the English GigaWord corpus. The Text2Headless program converts plain text into headless contexts. http://www.d.umn.edu/~tpederse/tools.html

141 EACL-2006 Tutorial 141 Headed Clustering: Name discrimination: Tom Hanks, Russell Crowe.

142 EACL-2006 Tutorial142

143 EACL-2006 Tutorial143

144 EACL-2006 Tutorial144

145 EACL-2006 Tutorial145

146 EACL-2006 Tutorial 146 Headless Contexts: Email / 20 newsgroups data. Spanish text.

147 EACL-2006 Tutorial147

148 EACL-2006 Tutorial148

149 EACL-2006 Tutorial 149 Thank you! Questions or comments on the tutorial or SenseClusters are welcome at any time: tpederse@d.umn.edu. SenseClusters is freely available via LIVE CD, the web, and in source code form: http://senseclusters.sourceforge.net. SenseClusters papers are available at: http://www.d.umn.edu/~tpederse/senseclusters-pubs.html

