Language Technology: Machine learning of natural language. 11 February 2005, UA. Antal van den Bosch

Tutorial
Introduction
– Empiricism, analogy, induction, language: a lightweight historical overview
– Walkthrough example
– ML basics: greedy vs. lazy learning

Empiricism, analogy, induction, language: a lightweight historical overview
De Saussure: "Any creation [of a language utterance] must be preceded by an unconscious comparison of the material deposited in the storehouse of language, where productive forms are arranged according to their relations." (1916, p. 165)

Lightweight history (2)
Bloomfield: "The only useful generalizations about language are inductive generalizations." (1933, p. 20)
Zipf: n·f² = k (1935); r·f = k (1949)

Lightweight history (3)
Harris: "With an apparatus of linguistic definitions, the work of linguistics is reducible […] to establishing correlations. […] And correlations between the occurrence of one form and that of other forms yield the whole of linguistic structure." (1940)
Hjelmslev: "Induction leads not to constancy but to accident." (1943)

Lightweight history (4)
Firth: "A [linguistic] theory derives its usefulness and validity from the aggregate of experience to which it must continually refer." (1952, p. 168)
Chomsky: "I don't see any way of explaining the resulting final state [of language learning] in terms of any proposed general developmental mechanism that has been suggested by artificial intelligence, sensorimotor mechanisms, or anything else." (in Piattelli-Palmarini, 1980, p. 100)

Lightweight history (5)
Halliday: "The test of a theory [on language] is: does it facilitate the task at hand?" (1985)
Altmann: "After the blessed death of generative linguistics, a linguist no longer needs to find a competent speaker. Instead, he needs to find a competent statistician." (1997)

Analogical memory-based language processing
With a memory filled with instances of language mappings:
– from text to speech,
– from words to syntactic structure,
– from utterances to acts, …
With the use of analogical reasoning, process new instances from input
– text, words, utterances
to output
– speech, syntactic structure, acts.

Analogy (1) [diagram: an analogy square — two input sequences that are similar to each other map to two output sequences that are likewise similar (a : a′ :: b : b′)]

Analogy (2) [diagram: the same square with one corner unknown — a new sequence, similar to a stored sequence, maps to an output "?" to be inferred from the stored mapping]

Analogy (3) [diagram: the same inference with several stored sequences (a, f, n) and their outputs (a′, f′, n′) — the new sequence b is similar to all of them, and its output "?" is inferred from their outputs]

Memory-based parsing
Target sentence: zo werd het Grand een echt theater ("so the Grand became a real theatre")
Memory fragments:
… zo MOD/S wordt er …
… zo gaat HD/S het …
… en dan werd [ NP het DET dus …
… dan is het HD/SUBJ NP ] bijna erger …
… ergens ene keer [ NP een DET echt …
… ben ik een echt MOD maar …
… een echt bedrijf HD/PREDC NP ] …
Resulting analysis:
zo MOD/S werd HD/S [ NP het DET Grand HD/SUBJ NP ] [ NP een DET echt MOD theater HD/PREDC NP ]

CGN treebank

Make data (1)
#BOS
zo       BW     T901   MOD    502
werd     WW1    T304   HD     502
het      VNW3   U503b  DET    500
Grand*v  N1     T102   HD     500
een      LID    U608   DET    501
echt     ADJ9   T227   MOD    501
theater  N1     T102   HD     501
.        LET    T
#500     NP     --     SU     502
#501     NP     --     PREDC  502
#502     SMAIN
#EOS 54
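
A minimal sketch of reading this column-based export format, for illustration only (a hypothetical reader, not the tutorial's own tooling; the field interpretation is an assumption based on the visible columns):

```python
# Sketch: split one exported sentence into terminal (word) lines and
# non-terminal (phrase) lines. Assumed columns: word/node, tag,
# morphology, edge label, parent id.

def read_export_sentence(lines):
    terminals, nonterminals = [], []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] in ("#BOS", "#EOS"):
            continue                      # sentence delimiters
        if fields[0].startswith("#"):     # e.g. "#500 NP -- SU 502"
            nonterminals.append(fields)
        else:                             # e.g. "zo BW T901 MOD 502"
            terminals.append(fields)
    return terminals, nonterminals

words, phrases = read_export_sentence([
    "zo BW T901 MOD 502",
    "het VNW3 U503b DET 500",
    "#500 NP -- SU 502",
    "#EOS 54",
])
print([t[0] for t in words])      # ['zo', 'het']
print([n[1] for n in phrases])    # ['NP']
```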

Make data (2)
Given context, map individual words to a function+chunk code:
1. zo       MOD       O
2. werd     HD        O
3. het      DET       B-NP
4. Grand    HD/SU     I-NP
5. een      DET       B-NP
6. echt     MOD       I-NP
7. theater  HD/PREDC  I-NP

Make data (3)
Generate instances with context:
1. _ _ _ zo werd het Grand            MOD-O
2. _ _ zo werd het Grand een          HD-O
3. _ zo werd het Grand een echt       DET-B-NP
4. zo werd het Grand een echt theater HD/SU-I-NP
5. werd het Grand een echt theater _  DET-B-NP
6. het Grand een echt theater _ _     MOD-I-NP
7. Grand een echt theater _ _ _       HD/PREDC-I-NP
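
A minimal sketch of this windowing step (the function name and padding symbol are illustrative; the tutorial's actual data-preparation pipeline is not part of the transcript):

```python
# Sketch: turn a labelled sentence into fixed-width windowed instances,
# as in the "Make data" slides: 3 words of left context, the focus word,
# and 3 words of right context, plus the function+chunk label.

def make_instances(tokens, labels, left=3, right=3, pad="_"):
    padded = [pad] * left + list(tokens) + [pad] * right
    instances = []
    for i, label in enumerate(labels):
        window = padded[i:i + left + 1 + right]   # left + focus + right
        instances.append((tuple(window), label))
    return instances

tokens = ["zo", "werd", "het", "Grand", "een", "echt", "theater"]
labels = ["MOD-O", "HD-O", "DET-B-NP", "HD/SU-I-NP",
          "DET-B-NP", "MOD-I-NP", "HD/PREDC-I-NP"]

for features, label in make_instances(tokens, labels):
    print(" ".join(features), label)
```

Run as-is, this reproduces the seven instances listed above.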

Crash course: Machine Learning
"The field of machine learning is concerned with the question of how to construct computer programs that automatically learn with experience." (Mitchell, 1997)
Dynamic process: learner L shows improvement on task T after learning.
Getting rid of programming: handcrafting versus learning.
Machine learning is task-independent.

Machine Learning: Roots
– Information theory
– Artificial intelligence
– Pattern recognition
Took off during the 1970s; major algorithmic improvements during the 1980s.
Forking: neural networks, data mining.

Machine Learning: 2 strands
Theoretical ML (what can be proven to be learnable, and by what?)
– Gold: identification in the limit
– Valiant: probably approximately correct (PAC) learning
Empirical ML (on real or artificial data); evaluation criteria:
– accuracy
– quality of solutions
– time complexity
– space complexity
– noise resistance

Empirical ML: Key Terms 1
– Instances: individual examples of input-output mappings of a particular type.
– Input consists of features; features have values.
– Values can be:
  – symbolic (e.g. letters, words, …)
  – binary (e.g. indicators)
  – numeric (e.g. counts, signal measurements)
– Output can be:
  – symbolic (classification: linguistic symbols, …)
  – binary (discrimination, detection, …)
  – numeric (regression)

Empirical ML: Key Terms 2
– A set of instances is an instance base.
– Instance bases come as labeled training sets or unlabeled test sets (you know the labeling, the learner does not).
– An ML experiment consists of training on the training set, followed by testing on the disjoint test set.
– Generalisation performance (accuracy, precision, recall, F-score) is measured on the output predicted for the test set.
– Splits into training and test sets should be systematic: n-fold cross-validation
  – 10-fold CV
  – leave-one-out testing
– Significance tests on pairs or sets of (average) CV outcomes.
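
To make the evaluation regime concrete, a sketch of 10-fold cross-validation; scikit-learn and the iris data are stand-ins chosen for the sketch, not the tutorial's own tools or data:

```python
# Sketch: systematic 10-fold cross-validation, measuring generalisation
# accuracy on each held-out fold and averaging over the folds.
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

scores = []
for train_idx, test_idx in folds.split(X, y):
    clf = DecisionTreeClassifier(random_state=0)
    clf.fit(X[train_idx], y[train_idx])                  # train on 9 folds
    scores.append(clf.score(X[test_idx], y[test_idx]))   # test on the held-out fold

print("mean accuracy over 10 folds:", sum(scores) / len(scores))
```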

Empirical ML: 2 Flavours
Greedy:
– learning: abstract a model from the data
– classification: apply the abstracted model to new data
Lazy:
– learning: store the data in memory
– classification: compare new data to the data in memory

Greedy learning

Lazy Learning

Greedy vs Lazy Learning
Greedy:
– decision tree induction: CART, C4.5
– rule induction: CN2, Ripper
– hyperplane discriminators: Winnow, perceptron, backpropagation, SVM
– probabilistic: naïve Bayes, maximum entropy, HMM
– (hand-made rule sets)
Lazy:
– k-nearest neighbour: MBL, AM
– local regression
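
A sketch contrasting one greedy and one lazy learner from the families above, again with scikit-learn as an assumed stand-in (a C4.5-style tree vs. an MBL-style k-NN), not the tutorial's own software:

```python
# Greedy vs lazy, side by side. The decision tree abstracts a compact model
# at training time and discards the instances; the k-NN classifier's "fit"
# amounts to storing the training instances, deferring all comparison work
# to classification time.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

greedy = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)   # abstracts a tree
lazy = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)        # stores the data

print("decision tree (greedy):", greedy.score(X_te, y_te))
print("3-NN (lazy):           ", lazy.score(X_te, y_te))
```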

Greedy vs Lazy Learning
– Decision trees keep the smallest set of informative decision boundaries (in the spirit of MDL; Rissanen, 1983).
– Rule induction keeps the smallest number of rules with the highest coverage and accuracy (MDL).
– Hyperplane discriminators keep just one hyperplane (or the vectors that support it).
– Probabilistic classifiers convert the data to probability matrices.
– k-NN retains every piece of information available at training time.

Greedy vs Lazy Learning
Minimum Description Length principle:
– Ockham's razor
– length of the abstracted model (covering the core)
– length of the productive exceptions not covered by the core (the periphery)
– the sum of both sizes should be minimal
– more minimal models are better
The "learning = compression" dogma. In ML, the focus has been on the length of the abstracted model; the periphery is not stored.
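
In its standard two-part formulation (a textbook rendering added here for reference, not a formula taken from the slide):

```latex
% Two-part MDL: prefer the hypothesis H that minimises the length of the
% abstracted model (the "core") plus the length of the data encoded given
% that model (the uncovered "periphery" of exceptions).
\[
  H_{\mathrm{MDL}} \;=\; \operatorname*{arg\,min}_{H \in \mathcal{H}}
    \bigl( L(H) + L(D \mid H) \bigr)
\]
```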

Greedy vs Lazy Learning
[Diagram: learning methods placed along two axes, +/− abstraction and +/− generalization: decision tree induction, hyperplane discriminators, regression, handcrafting, table lookup, memory-based learning.]

Greedy vs Lazy: So?
– Highly relevant to ML of NL.
– In language data, what is core? What is periphery?
– Often little or no noise; productive exceptions.
– (Sub-)subregularities, pockets of exceptions, "disjunctiveness".
– Some important elements of language have different distributions than the "normal" one, e.g. word forms have a Zipfian distribution.

ML and Natural Language
Apparent conclusion: ML could be an interesting tool to do linguistics
– next to probability theory, information theory, statistical analysis (natural allies)
– "neo-Firthian" linguistics
More and more annotated data available; skyrocketing computing power and memory.

Entropy & IG: Formulas