A new Machine Learning algorithm for Neoposy: coining new Parts of Speech. Eric Atwell, Computer Vision and Language group, School of Computing, University of Leeds.


1 A new Machine Learning algorithm for Neoposy: coining new Parts of Speech Eric Atwell Computer Vision and Language group School of Computing University of Leeds http://www.comp.leeds.ac.uk/eric

2 Overview Neoposy: What? Why? conflicting criteria defining PoS Unsupervised Machine Learning Clustering of word-types Unification of word-tokens Problems with token-unification Conclusions: hybrid clustering

3 Neoposy CED: neology/neologism: “a newly coined word…; or the practice of using or introducing neologies” Cf: pos-tagger, pos-tagged corpus, bi-pos model, uniposy/polyposy (Elliott 2002) … Neoposy: neology meaning “a newly coined classification of words into Parts of Speech; or the practice of introducing or using neoposies”

4 Why neoposy? It’s interesting (well, it is to me…) “Traditional” PoS may not fit some languages Solutions may shed light on Language Universals, and on analysis of other language-like datasets A challenge for unsupervised Machine Learning, different from other classification/clustering tasks

5 Definition of “part of speech” CED: “a class of words sharing important syntactic or semantic features; a group of words in a language that may occur in similar positions or fulfil similar functions in a sentence” e.g. “a class of”, “a group of” in the last sentence.

6 BUT 3 criteria can conflict: Semantic feature: noun = thing Syntactic feature: noun can inflect, singular vs. plural Position/function: noun fits “a X of” A word TYPE may fit more than one category, because individual TOKENS behave differently

7 A challenge for unsupervised ML “A supervised algorithm is one which is given the correct answers for some of the data, using those answers to induce a model which can generalize to new data it hasn’t seen before… An unsupervised algorithm does this purely from the data.” (Jurafsky and Martin 2000)

8 Clustering word-types e.g. Atwell 1983, Atwell and Drakos 1987, Hughes and Atwell 1994, Elliott 2002, Roberts 2002… Cluster word-types whose representative tokens in a Corpus appeared in similar contexts (e.g. word before and/or after, neighbouring function-words), trying various similarity metrics and clustering algorithms
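As a rough illustration of comparing word-types by context, here is a minimal Python sketch using cosine similarity over following-word counts. Cosine is only one of many possible metrics and is an assumption here, not necessarily a metric used in the cited work; the counts are the visible (non-elided) entries from the Atwell 1983 example on slide 10.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse context-count vectors (dicts)."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Following-word counts for three word-types (elided entries omitted).
contexts = {
    "the": {"ant": 1, "cat": 14, "dog": 11},
    "a": {"bat": 2, "cat": 13, "dog": 12},
    "of": {"a": 90, "cat": 2, "dog": 7, "the": 130},
}

sim_the_a = cosine(contexts["the"], contexts["a"])
sim_the_of = cosine(contexts["the"], contexts["of"])
print(sim_the_a, sim_the_of)
```

On these counts "the" and "a" come out far more similar to each other than "the" is to "of", which is the kind of signal a type-clustering algorithm would use to merge the two articles.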

9 Features and clustering Every instance (word-type) must be characterised by a vector of feature-values (neighbour word-types and co-occurrence frequencies); instances with similar feature-vectors are lumped together (merged). Feature-vectors are also merged, AND all other feature-vectors where merged words appear in context must be updated

10 Example (Atwell 1983) THE: (ant 1),(cat 14),(dog 11),… OF: (a 90),(cat 2),(dog 7),…(the 130),… A: (bat 2),(cat 13),(dog 12),… … => THE: (ant 1),(bat 2),(cat 27),(dog 23),… OF: (cat 2),(dog 7),…(the 220),… …
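The merge in this example can be sketched in Python as follows. This is an illustrative reconstruction, not the 1983 implementation: the function name `merge_types` and the use of `Counter` are assumptions, and only the counts visible on the slide are included.

```python
from collections import Counter

def merge_types(vectors, a, b, merged):
    """Merge word-types a and b into one class named `merged`.

    The two feature-vectors are summed, and every other vector that
    mentions a or b as a context feature is updated to use `merged`.
    """
    combined = dict(vectors)
    combined[merged] = Counter(combined.pop(a)) + Counter(combined.pop(b))
    renamed = {}
    for word, feats in combined.items():
        out = Counter()
        for ctx, n in feats.items():
            out[merged if ctx in (a, b) else ctx] += n
        renamed[word] = out
    return renamed

vectors = {
    "the": {"ant": 1, "cat": 14, "dog": 11},
    "of": {"a": 90, "cat": 2, "dog": 7, "the": 130},
    "a": {"bat": 2, "cat": 13, "dog": 12},
}

merged_vectors = merge_types(vectors, "the", "a", "the")
print(dict(merged_vectors["the"]))
print(dict(merged_vectors["of"]))
```

The second loop is the non-standard part: because the features are themselves word-types, merging two instances also changes the feature space of every other instance (here, the OF vector's counts for "a" and "the" collapse into a single count of 220).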

11 Problems with word-type clusters These clustering algorithms assume a word-type can belong to only one class: OK for function words (articles, prepositions, personal pronouns) but not OK for open-class categories OK for high-freq word-types, but many types are sparse (Zipf) Features (context-word types) must be updated after each iteration – not part of standard clustering

12 Clustering tokens, not types “a class of”, “a group of” are TOKENS in similar context, but we shouldn’t generalise to say “class” and “group” are always the same PoS. Clustering relies on similar frequency-vectors, but for a TOKEN, f=1 ??? Instead of statistical models, use constraint logic programming

13 Unification of shared contexts ?- neoposy([the,cat,sat,on,the,mat], Tagged). Tagged= [[the,T1], [cat,T2], [sat,T3], [on,T4], [the,T5], [mat,T2]]
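The slides do not show how `neoposy` is defined. A minimal Python approximation of the behaviour in this query, assuming the context is simply "the word before" and that tokens sharing that context receive the same tag, might look like this:

```python
def neoposy(tokens):
    """Tag each token with a class determined by the word before it.

    Tokens preceded by the same word share a tag; the first token has
    a sentinel context (None). Tags are numbered in order of first use.
    """
    tag_of_context = {}
    tagged = []
    previous = None  # sentinel context for the first token
    for word in tokens:
        if previous not in tag_of_context:
            tag_of_context[previous] = f"T{len(tag_of_context) + 1}"
        tagged.append([word, tag_of_context[previous]])
        previous = word
    return tagged

print(neoposy("the cat sat on the mat".split()))
```

Note that the two tokens of "the" receive different tags (T1 and T5) because their contexts differ, while "cat" and "mat" share T2 because both follow "the". In Prolog the same effect would come from unifying tag variables rather than from a dictionary lookup.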

14 How many word-classes are learnt? N <= number of tokens (??) As many classes as “contexts”… Context = “word before”, N = number of word-TYPES e.g. 1M-corpus: c. 50K word-classes…
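To make the count concrete, here is a small Python sketch under the same "word before" assumption. The helper name `count_classes` is hypothetical; it counts one class per distinct preceding word, plus one for the start-of-text context.

```python
def count_classes(tokens):
    """Number of classes when context = the word before each token."""
    if not tokens:
        return 0
    # Every token except the last can act as a "word before";
    # None stands for the start-of-text context of the first token.
    contexts = {None} | set(tokens[:-1])
    return len(contexts)

tokens = "the cat sat on the mat".split()
n_classes = count_classes(tokens)
print(len(tokens), len(set(tokens)), n_classes)
```

For the six-token example this gives five classes, matching T1 to T5 above. The count can never exceed the number of word-types plus one, which is why a corpus with about 50K types yields about 50K classes.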

15 Iterating to use classes in contexts ?- neoposy2([the,cat,sat,on,the,mat,and,went,to,sleep], Tagged). Tagged= [[the,T1], [cat,T2], [sat,T3], [on,T4], [the,T5], [mat,T2], [and,T6], [went,T7], [to,T8], [sleep,T9]]; Tagged= [[the,T1], [cat,T2], [sat,T3], [on,T4], [the,T5], [mat,T2], [and,T3], [went,T7], [to,T8], [sleep,T9]];

16 Cascading word-class unification Tagged= [[the,T1], [cat,T2], [sat,T3], [on,T4], [the,T5], [mat,T2], [and,T3], [went,T7], [to,T8], [sleep,T9]]; Tagged= [[the,T1], [cat,T2], [sat,T3], [on,T4], [the,T5], [mat,T2], [and,T3], [went,T4], [to,T8], [sleep,T9]]; Tagged= [[the,T1], [cat,T2], [sat,T3], [on,T4], [the,T5], [mat,T2], [and,T3], [went,T4], [to,T5], [sleep,T9]]; Tagged= [[the,T1], [cat,T2], [sat,T3], [on,T4], [the,T5], [mat,T2], [and,T3], [went,T4], [to,T5], [sleep,T2]]
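The cascade can be approximated in Python with a union-find structure standing in for Prolog variable unification. This is a sketch under stated assumptions, not the original `neoposy2`: it first unifies tokens that follow the same word, then repeatedly unifies tokens that follow tokens of the same class, until nothing changes. Unlike the Prolog query, it returns only the final solution rather than each intermediate one.

```python
def neoposy2(tokens):
    """Cascading token unification: word-before, then class-before."""
    parent = list(range(len(tokens)))  # one provisional class per token

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri == rj:
            return False
        parent[max(ri, rj)] = min(ri, rj)
        return True

    # Step 1: tokens preceded by the same word share a class.
    first_after_word = {}
    for i in range(1, len(tokens)):
        j = first_after_word.setdefault(tokens[i - 1], i)
        union(i, j)

    # Step 2: tokens preceded by the same CLASS share a class;
    # repeat until a full pass makes no further merges.
    changed = True
    while changed:
        changed = False
        first_after_class = {}
        for i in range(1, len(tokens)):
            j = first_after_class.setdefault(find(i - 1), i)
            if union(i, j):
                changed = True

    # Name classes T1, T2, ... in order of first appearance.
    names = {}
    return [[word, names.setdefault(find(i), f"T{len(names) + 1}")]
            for i, word in enumerate(tokens)]

print(neoposy2("the cat sat on the mat and went to sleep".split()))
```

On the ten-token example this reaches the same final tagging as the last solution on the slide: one shared context ("the" before "cat" and "mat") propagates down the rest of the sentence, leaving only five classes for ten tokens.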

17 Alternative constraints? Merging by word-class context may be too powerful ?constrain a word-type to N classes ?very hard constraint-satisfaction problem: 1M words, 50K types = average 20 tokens per type (but I don’t know how to do this…)

18 Conclusion: Hybrid clustering? Word-token constraint-based clustering for rare words, hapax legomena Word-type statistical clustering for high-freq words (closed-class function words) “Learning hints”: “seed” with limited PoS-lexicon

19 Future work Combining Corpus Linguistics and Machine Learning in linguistic knowledge discovery Applications in other languages (e.g. with minimal lexical resources), other language-like datasets

20 Summary Neoposy: What? Why? conflicting criteria defining PoS Unsupervised Machine Learning Clustering of word-types Unification of word-tokens Problems with token-unification Conclusions: hybrid clustering Linguistic knowledge discovery beyond English…
