Machine Learning Approaches to the Analysis of Large Corpora: A Survey
Xunlei Rose Hu and Eric Atwell
University of Leeds
Introduction
Significant advances in Machine Learning approaches to the automatic analysis of corpora
A range of Machine Learning approaches
Three dimensions of classification
–Levels of linguistic analysis;
–Machine Learning techniques;
–Current research in Discourse Analysis
A framework for further development
Levels of linguistic analysis
Tokenisation
Part-of-Speech tagging
Parsing
Semantic analysis
Discourse analysis
Low-level Linguistic Analysis
Tokenisation: breaks up the sequence of characters in a text by locating the word boundaries
Part-of-Speech tagging: assigns the correct Part-of-Speech and additional grammatical features to each word
A forced move from hand-built to Machine Learning approaches
Many systems learn statistical models from a training corpus, e.g. CLAWS
Transformation-Based Learning is the most popular alternative approach
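As an illustration of tokenisation as word-boundary detection, here is a minimal Python sketch. The regular expression and the function name `tokenise` are assumptions made for this example only; they are not taken from CLAWS or any other system named in the survey, and real tokenisers handle many more cases (abbreviations, numbers, URLs).

```python
import re

# Hypothetical pattern (an assumption for this sketch): a token is either
# a run of word characters, optionally joined by internal hyphens or
# apostrophes, or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenise(text):
    """Split a character sequence into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)


if __name__ == "__main__":
    print(tokenise("The hand-built tagger didn't fail."))
```

A hand-written pattern like this is exactly the kind of hand-built component that the slide contrasts with learned models.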
Parsing and Semantic Analysis
Parsing: take a formal grammar and a linguistic input and apply the grammar to the input to produce a parse tree
–Top-Down and Bottom-Up reflect contrasting perspectives
Semantic Analysis: augment data to facilitate automatic recognition of the underlying semantic content and structure
–A common practice is to label documents with thesaurus classes for document classification and management
Discourse Analysis
Discourse analysis extends beyond sentence boundaries
No universal agreement on discourse analysis categories or labels
A growing range of dialogue transcript corpora have been hand-annotated with dialogue-act or speech-act tags designed for specific applications
Machine Learning Techniques for Linguistic Annotation of Corpora
N-gram Markov models, HMMs
Neural Networks, Semantic Networks
Transformation-Based Learning
Decision-Tree classification
Vector-based clustering
N-gram and Markov Models
A Markov Model of a sequence of states or symbols (e.g. words or Part-of-Speech tags) is used to estimate the probability or likelihood of a symbol sequence
Hidden Markov Models (HMMs) are a variant including two layers of states:
–a visible layer corresponding to input symbols
–a hidden layer learnt by the system
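To make the idea of estimating the probability of a symbol sequence concrete, the following Python sketch trains a first-order (bigram) Markov model by relative frequency and multiplies the transition probabilities along a sequence. The function names, the `<s>` start symbol and the toy tag sequences are assumptions for illustration; no smoothing is applied, so any unseen transition gives probability zero.

```python
from collections import Counter

START = "<s>"  # hypothetical start-of-sequence symbol


def train_bigram_model(sequences):
    """Count contexts and transitions over a list of symbol sequences."""
    context_counts = Counter()
    bigram_counts = Counter()
    for seq in sequences:
        padded = [START] + list(seq)
        for prev, cur in zip(padded, padded[1:]):
            context_counts[prev] += 1          # times prev precedes something
            bigram_counts[(prev, cur)] += 1    # times prev is followed by cur
    return context_counts, bigram_counts


def sequence_probability(seq, context_counts, bigram_counts):
    """Product of maximum-likelihood transition probabilities P(cur | prev)."""
    prob = 1.0
    padded = [START] + list(seq)
    for prev, cur in zip(padded, padded[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram_counts[(prev, cur)] / context_counts[prev]
    return prob


if __name__ == "__main__":
    corpus = [["DET", "NOUN", "VERB"], ["DET", "NOUN"], ["NOUN", "VERB"]]
    contexts, bigrams = train_bigram_model(corpus)
    print(sequence_probability(["DET", "NOUN", "VERB"], contexts, bigrams))
```

An HMM tagger would add a second set of probabilities linking hidden states to visible symbols; this sketch covers only the visible Markov chain.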
Neural Networks, Semantic Networks
Neural networks have been developed in many fields in the hope of achieving human-like learning
A related model is the semantic network
–Typically nodes represent concepts
–Connections represent semantically meaningful associations between these concepts.
Transformation-Based Learning
Brill (1995) developed a symbolic Machine Learning method called Transformation-Based Learning (TBL)
Given a tagged training corpus, TBL produces a sequence of rules that serves as a model of the training data
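The way TBL turns a tagged corpus into an ordered rule sequence can be sketched in Python as a greedy loop: start from an initial tagging, repeatedly pick the rule that removes the most errors against the gold tags, apply it, and stop when no rule helps. This is a simplified sketch, not Brill's implementation: it assumes a single hypothetical rule template ("change tag X to Y when the previous tag is Z"), and all function names and the toy data are invented for the example.

```python
def apply_rule(tags, rule):
    """Apply one rule (from_tag, to_tag, prev_tag) across a tag sequence."""
    from_tag, to_tag, prev_tag = rule
    new_tags = list(tags)
    for i in range(1, len(tags)):
        # Contexts are read from the tags as they were before this rule.
        if tags[i] == from_tag and tags[i - 1] == prev_tag:
            new_tags[i] = to_tag
    return new_tags


def count_errors(tags, gold_tags):
    """Number of positions where the current tag differs from the gold tag."""
    return sum(t != g for t, g in zip(tags, gold_tags))


def learn_rules(initial_tags, gold_tags, max_rules=10):
    """Greedily learn an ordered list of error-reducing rules."""
    current = list(initial_tags)
    tagset = sorted(set(initial_tags) | set(gold_tags))
    rules = []
    for _ in range(max_rules):
        best_rule, best_errors = None, count_errors(current, gold_tags)
        for from_tag in tagset:
            for to_tag in tagset:
                if to_tag == from_tag:
                    continue
                for prev_tag in tagset:
                    rule = (from_tag, to_tag, prev_tag)
                    errors = count_errors(apply_rule(current, rule), gold_tags)
                    if errors < best_errors:
                        best_rule, best_errors = rule, errors
        if best_rule is None:
            break  # no candidate rule reduces the error count
        rules.append(best_rule)
        current = apply_rule(current, best_rule)
    return rules


if __name__ == "__main__":
    initial = ["DET", "NOUN", "TO", "NOUN", "DET", "NOUN"]
    gold = ["DET", "NOUN", "TO", "VERB", "DET", "NOUN"]
    print(learn_rules(initial, gold))
```

The learned rule list is the model: tagging new text means producing the same kind of initial tagging and applying the rules in the order they were learned.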
Decision Tree Classification and Vector-Based Clustering
A decision tree is constructed by partitioning the training set, selecting, at each step, the feature that most reduces the uncertainty about the class in each partition, and using it as a split
Vector-based clustering uses co-occurrence statistics to construct vectors that represent word classes or meanings by virtue of their direction in multi-dimensional word-collocation space
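The split-selection step of decision-tree construction can be sketched in Python. This example assumes that "uncertainty" is measured by Shannon entropy, so the chosen feature is the one with the highest information gain; other criteria (such as the Gini index used by CART) are equally possible. The feature names and the toy dialogue-act data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by partitioning on one feature."""
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in partitions.values()
    )
    return entropy(labels) - remainder


def best_split(rows, labels, features):
    """Select the feature that most reduces uncertainty about the class."""
    return max(features, key=lambda f: information_gain(rows, labels, f))


if __name__ == "__main__":
    rows = [
        {"ends_with_qmark": True, "speaker": "A"},
        {"ends_with_qmark": True, "speaker": "B"},
        {"ends_with_qmark": False, "speaker": "A"},
        {"ends_with_qmark": False, "speaker": "B"},
    ]
    labels = ["QUESTION", "QUESTION", "STATEMENT", "STATEMENT"]
    print(best_split(rows, labels, ["ends_with_qmark", "speaker"]))
```

A full tree builder would apply `best_split` recursively to each partition until the labels in a partition are pure or no features remain.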
Discourse Analysis 1/2
1994: Woszczyna and Waibel – N-grams, Markov Model
1996: Reithinger, Engel, Kipp and Klesen – N-grams, HMM
1996: Mast et al. – Decision Trees, N-grams
1997: Reithinger and Klesen – N-grams, Bayesian network
Discourse Analysis 2/2
1998: Samuel, Carberry, and Vijay-Shanker – Transformation-Based Learning
1998: Wright – N-grams, CART Decision Tree, Neural Networks
1998: Taylor, King, Isard, and Wright – Combined N-grams and HMM
1998: Fukada et al. – Bi-grams, HMM
1998: Stolcke et al. – HMM, Decision Trees
Conclusion
This survey has explored algorithms underlying different levels of linguistic analysis, providing a framework for further research
Better to combine 2 or more ML approaches?
Discourse Analysis: HMM/n-grams + ano
Future work
–Explore systems which can be used and re-used
–Integrate such systems and comparatively evaluate Machine Learning techniques for corpus analysis