Word Prediction

Word Prediction
Words do not randomly appear in text. The probability of a word appearing in a text is to a large degree related to the words that have appeared before it.
– e.g. I'd like to make a collect...
– Call is the most likely next word, but other words such as telephone, international... are also possible. Other (very common) words are unlikely (e.g. dog, house).

Word Prediction
Word prediction is very useful for applications such as:
– Speech recognition: it is possible to select between words that are hard for a speech recognizer to distinguish.
– Augmentative communication for the disabled: speech generation systems can become more effective.
– Spelling error detection: e.g. They are leaving in about 15 minuets. He is trying to fine out.
Word prediction is also related to the problem of computing the probability of a sentence.

Counting Words in Corpora
Text corpus: a collection of text (or speech).
– Brown corpus: 1 million words of text.
– Switchboard corpus: 3 million words of speech.
Word counting in corpora raises several questions:
– Does punctuation count as words or not?
– Are "don't", "O'Reilly", "non-carbonated" one word or two?
– Are "They" and "they" different words or the same word?
Many of these choices depend on the application.

Words in Corpora
Tokens: the total number of running words in the corpus (possibly including punctuation).
Types: the number of distinct words in the corpus.
Wordform: the inflected word as it appears in the corpus, e.g. cat ≠ cats.
Lemma: a set of wordforms having the same stem and the same word sense, e.g. cat and cats are not distinguished.
– Switchboard corpus: 2.4M tokens, 20,000 types
– Brown corpus: 1M tokens, 61,805 types, 37,851 lemmata
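To make these distinctions concrete, here is a minimal Python sketch; the whitespace tokenizer and the case-folding step are simplifying assumptions for illustration, not a claim about how any particular corpus was processed.

```python
text = "They said they don't count cats , but the cat counts ."

tokens = text.split()                            # running words; punctuation kept as separate tokens here
wordform_types = set(tokens)                     # distinct wordforms: "They" and "they" are different
case_folded_types = {t.lower() for t in tokens}  # one possible normalization choice: case folding

print("tokens:", len(tokens))                    # 12
print("wordform types:", len(wordform_types))    # 12
print("case-folded types:", len(case_folded_types))  # 11, since "They"/"they" collapse
```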

Simple N-Grams
Probabilistic models of word sequences.
Simplest model: every word may follow any other word, and all words have equal probability.
More complex: the probability of each word depends on its frequency in the corpus, e.g.:
– the appears 69,971 times in the Brown corpus (7%)
– rabbit appears 12 times (0.001%)
But suppose we have the sentence:
– Just then the white...
In this context rabbit is a far more likely next word than the, despite being much rarer overall.

Markov Assumption
The probability of the appearance of a word depends on the words that have appeared before it: P(rabbit | Just then the white).
It is practically impossible to calculate this probability from a corpus: the exact word sequence would have to appear in the corpus.
Markov simplifying assumption: we approximate the probability of a word given all the previous words by the probability given only the previous word:
P(rabbit | Just then the white) ≈ P(rabbit | white)
This idea was first used by Markov (1913) to calculate whether the upcoming letter in Pushkin's Eugene Onegin would be a vowel or a consonant.

Bigrams, Trigrams... N-grams
When looking only at the previous word we have a bigram model (a first-order Markov model).
When looking two words back we have a trigram model (a second-order Markov model).
Generally, looking N-1 words back gives an N-gram model (an (N-1)th-order Markov model).
The probability of a sentence is calculated by multiplying the (approximated) probabilities of its words, e.g.:
P(I want to eat Chinese food) = P(I | <s>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese)
where <s> marks the beginning of the sentence.

N-gram Probability Calculation
We count the appearances of a bigram in our training corpus and then normalize by the number of appearances of the first word:
P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
A problem with N-grams is that many valid N-grams will be missing from any finite training corpus. Many bigrams that should have a reasonable probability end up with zero, or very small, probability.
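A minimal sketch of this maximum-likelihood estimation and of scoring a sentence with it; the tiny corpus and the <s>/</s> boundary markers are illustrative assumptions, not data from the lecture.

```python
from collections import Counter

corpus = [
    "<s> I want to eat Chinese food </s>",
    "<s> I want to eat lunch </s>",
    "<s> I want Chinese food </s>",
]

unigrams = Counter()
bigrams = Counter()
for sent in corpus:
    words = sent.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

def p_bigram(w_prev, w):
    """MLE estimate P(w | w_prev) = C(w_prev w) / C(w_prev); zero if unseen."""
    if unigrams[w_prev] == 0:
        return 0.0
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_sentence(sentence):
    """Probability of a sentence as a product of bigram probabilities."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for w_prev, w in zip(words, words[1:]):
        p *= p_bigram(w_prev, w)
    return p

print(p_sentence("I want Chinese food"))  # 0.333...
```

Running p_sentence on a sentence containing any unseen bigram returns 0, which is exactly the problem that smoothing (next slide) addresses.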

Example unigram counts: I: 3437, want: 1215, to: 3256, eat: 938, Chinese: 213, food: 1506, lunch: 459.

Smoothing
Smoothing (or discounting) is the process of assigning some probability to zero- or low-probability N-grams.
Add-one smoothing adds one appearance to every possible N-gram.
This method does not work well, because it takes too much probability away from likely N-grams and shares it among many impossible N-grams, e.g. P(to | want) changes from 0.65 to 0.28.
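Continuing the toy counts from the sketch above, add-one smoothing could be written as follows (an illustration only).

```python
def p_addone(w_prev, w, vocab_size):
    """Add-one (Laplace) smoothed bigram probability:
    (C(w_prev w) + 1) / (C(w_prev) + V)."""
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

V = len(unigrams)                    # vocabulary size = number of observed types
print(p_addone("want", "to", V))     # a seen bigram: its estimate drops (here from 0.67 to 0.25)
print(p_addone("want", "lunch", V))  # an unseen bigram: now gets a small non-zero estimate
```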

Smoothing
Witten-Bell discounting: we model the probability of an unseen N-gram as the probability of seeing an N-gram for the first time, which we estimate by counting how many distinct N-gram types appear in the corpus (each of them was new when it was first encountered).
This moves far less probability from likely to unlikely events and gives better models than add-one smoothing.
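A rough sketch of the conditional (bigram) form of Witten-Bell discounting, again reusing the toy counts above; treat it as an illustration of the formula, not a production implementation.

```python
def p_witten_bell(w_prev, w, vocab_size):
    """Witten-Bell smoothed P(w | w_prev), conditional/bigram form:
    seen bigrams get C(w_prev w) / (N + T); the reserved mass T / (N + T)
    is shared equally among the Z word types never seen after w_prev,
    where N = C(w_prev) and T = number of distinct words seen after w_prev."""
    T = sum(1 for (a, b) in bigrams if a == w_prev)  # continuation types seen after w_prev
    N = unigrams[w_prev]                             # how often w_prev itself occurred
    Z = vocab_size - T                               # types never seen after w_prev
    c = bigrams[(w_prev, w)]
    if c > 0:
        return c / (N + T)
    return T / (Z * (N + T))
```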

Training and Test Sets
The statistical parameters of a model are trained on a training corpus, and the quality of the model is then tested on a test set.
The choice of training corpus is important. If it is too specific, the model may not generalize well to new sentences; if it is too general, it may not reflect the domain of the application where we want to use it.
The choice of test sentences is also critical. If the test sentences come from the training corpus, performance will be artificially high. If the domain of the test corpus differs from that of the training corpus, the statistical model will not reflect the properties of the test set.
Usually we take a single corpus and divide it into a training set and a test set.

N-grams and Training Corpora
Artificially generated sentences illustrate the performance of N-grams for various N, and the dependence of the model on the domain of the training corpus.
Trained on the works of Shakespeare:
– Unigram: Will rash been and by I the me loves gentle me not slavish page, the and hour, ill let
– Bigram: Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
– Trigram: Sweet Prince, Falstaff shall die. Harry of Monmouth's grave.
– Quadrigram: Enter Leonato's brother Antonio, and the rest, but seek the weary beds of people sick.

N-grams and Training Corpora
Similar sentences generated from the Wall Street Journal corpus:
– Unigram: Months the my and issue of wear foreign new exchange's September were recession exchange new endorsed a acquire to six executives.
– Bigram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners...
– Trigram: They also point to ninety nine point six billion dollars from two hundred oh six three per cent of the rates of interest...

Context-Sensitive Spelling Error Correction
Spelling errors that result in real words cannot be found without looking at the context of the word.
– According to empirical studies, between 15% and 40% of spelling errors result in real words.
Local errors can be detected by looking at the immediate surroundings of the word. Global errors require looking at a larger context.

Local and Global Errors

Context-Sensitive Spelling Error Correction
Generate every possible misspelling of a word that results in a real word. Candidates are generated using typographical transformations (letter insertion, deletion, substitution) or lists of homophones (e.g. piece – peace).
Then, among all these candidate words, we select the one that maximizes the N-gram probability of the sentence containing the word in question.
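A minimal sketch of this idea, reusing the p_bigram estimate from the earlier sketch; the confusion sets below are hypothetical examples, not a real homophone list.

```python
# For a word with a known confusion set (homophones or common real-word typos;
# the sets below are hypothetical), keep the variant that gives the surrounding
# bigrams the highest probability.
CONFUSION_SETS = {
    "peace": {"peace", "piece"},
    "piece": {"peace", "piece"},
    "fine": {"fine", "find"},
    "find": {"fine", "find"},
}

def best_variant(words, i):
    """Return the most probable variant of words[i] given its immediate context."""
    candidates = CONFUSION_SETS.get(words[i], {words[i]})
    def context_score(cand):
        left = p_bigram(words[i - 1], cand) if i > 0 else 1.0
        right = p_bigram(cand, words[i + 1]) if i + 1 < len(words) else 1.0
        return left * right
    return max(candidates, key=context_score)
```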

Part-of-Speech Tagging (POST)
POST is the process of assigning to each word in a text its part of speech or, more generally, its word class.
Input: a text and a set of possible word classes (a tagset).
Output: the text, with the word class of each input word marked next to it.
A good POS tagger is often the first step for many applications, such as:
– Speech generation and recognition
– Machine translation
– Information retrieval
– Computational lexicography
– Word sense disambiguation
– Prepositional phrase attachment disambiguation

Word Classes and Tagsets
For English there are a number of different tagsets, differing in the number of word classes they contain:
– Brown Corpus tagset: 87
– Penn Treebank tagset: 45
– C5 (British National Corpus): 67
– C7: 146
For Greek, different tagsets have been used, ranging from 58 to 584 word classes.
Example of a sentence with POS annotation:
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Penn Treebank Tagset

Ambiguity in POST
Ambiguity is a serious problem for POST: a word may belong to more than one word class. E.g. Book/VB that/DT flight/NN.
– Book: VB (verb) or NN (noun)
– that: DT (determiner) or IN (subordinating conjunction)
Many of the most commonly occurring words are ambiguous: only 11.5% of the lemmata in the Brown Corpus are ambiguous, but 40% of the word tokens are.
Number of words in the Brown Corpus per number of word classes:
– 1 WC: 35,340
– 2 WC: 3,760
– 3 WC: 264
– 4 WC: 61
– 5 WC: 12
– 6 WC: 2
– 7 WC: 1 (the word still)

POST Algorithms
Rule-based: use a large pool of hand-written rules for word class disambiguation.
Stochastic and probabilistic: use a training corpus to estimate the probability that a word belongs to a given word class, depending on the context of the word in the text.
Transformation-based learning (Brill algorithm): a combination of the rule-based and stochastic approaches.

Rule-Based POST
Usually two stages:
– Use a dictionary or morphological analysis to assign to each word all its possible POS tags.
– Use a list of hand-written disambiguation rules to select a single POS for each word.
ENGTWOL tagger:
– First stage: uses two-level morphological analysis to find the possible word classes for each word, plus some additional syntactic information (e.g. verb subcategorization).
– Second stage: uses about 1,100 constraints to rule out incorrect POS tags.

ENGTWOL Rules
Rules are used to eliminate tags inconsistent with the context. E.g.:
ADVERBIAL-THAT RULE
Given input: "that"
If (+1 A/ADV/QUANT);   // the next word is an adjective, adverb or quantifier
   (+2 SENT-LIM);      // and the word after that is a sentence boundary
   (NOT -1 SVOC/A);    // and the previous word is not a verb that allows adjectives as complements (e.g. consider)
then eliminate non-ADV tags
else eliminate the ADV tag
This rule checks whether the word "that" has an adverbial reading, e.g. This isn't that odd.
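The same constraint can be paraphrased in code. A minimal sketch, assuming each word carries a set of candidate tags; the SVOC/A verb list and the tag names are illustrative placeholders, not ENGTWOL's actual lexicon or formalism.

```python
def adverbial_that_rule(words, tags, i):
    """Code paraphrase of the ADVERBIAL-THAT constraint.
    tags[i] is the set of candidate tags for words[i] and is edited in place."""
    SVOC_A_VERBS = {"consider", "find", "deem"}   # illustrative stand-in for the SVOC/A verb class
    if words[i].lower() != "that":
        return
    next_is_adj_adv_quant = i + 1 < len(words) and tags[i + 1] & {"A", "ADV", "QUANT"}
    followed_by_boundary = i + 2 >= len(words) or words[i + 2] in {".", "!", "?"}
    prev_not_svoca = i == 0 or words[i - 1].lower() not in SVOC_A_VERBS
    if next_is_adj_adv_quant and followed_by_boundary and prev_not_svoca:
        tags[i] = {"ADV"}            # eliminate the non-ADV readings
    else:
        tags[i].discard("ADV")       # eliminate the ADV reading
```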

Stochastic Part-of-Speech Tagging
The intuition is to pick the most likely tag for each word. We select the tag that maximizes the probability
P(word | tag) * P(tag | previous n tags)
With the Markov simplifying assumption P(tag | previous n tags) = P(tag | previous tag), we want to maximize
P(word | tag) * P(tag | previous tag)
The probabilities P(word | tag) and P(tag | previous tag) are calculated by counting occurrences of words and tags in annotated training corpora.

Stochastic POST Example
Find the POS of the word race in the sentence: Secretariat is expected to race tomorrow. Is race a noun (NN) or a verb (VB)?
Assuming we know that the correct tag for to is TO, we have to compare P(VB|TO) P(race|VB) with P(NN|TO) P(race|NN).
From frequency counts over the Brown and Switchboard corpora: P(NN|TO) = .021, P(VB|TO) = .34, P(race|NN) = .00041, P(race|VB) = .00003.
So P(VB|TO) P(race|VB) ≈ .00001 > P(NN|TO) P(race|NN) ≈ .000009, and race is tagged as a verb.
To tag an entire sentence instead of just one word, we use a Hidden Markov Model.
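A minimal sketch of Viterbi decoding for a bigram HMM tagger; the probability tables and the "<s>" start state are assumed to be given, and in practice they would be estimated from a tagged corpus by counting, just as for word N-grams.

```python
def viterbi(words, tagset, p_trans, p_emit):
    """Most likely tag sequence under a bigram HMM.
    p_trans[(t_prev, t)] = P(t | t_prev), with "<s>" as the start state;
    p_emit[(t, w)] = P(w | t)."""
    # best[i][t] = (probability, previous tag) of the best path ending in tag t at position i
    best = [{t: (p_trans.get(("<s>", t), 0.0) * p_emit.get((t, words[0]), 0.0), None)
             for t in tagset}]
    for i in range(1, len(words)):
        column = {}
        for t in tagset:
            prob, prev = max(
                ((best[i - 1][tp][0] * p_trans.get((tp, t), 0.0)
                  * p_emit.get((t, words[i]), 0.0), tp) for tp in tagset),
                key=lambda x: x[0])
            column[t] = (prob, prev)
        best.append(column)
    # follow the backpointers from the best final tag
    last = max(best[-1], key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))
```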

Transformation-Based POST
Like rule-based systems, it uses rules to decide which word class to select for each word. However, like stochastic systems, it learns these rules from a pre-annotated training corpus.
The algorithm keeps an ordered list of transformations, which are applied to the text we want to tag.
Each transformation consists of a rewrite rule and an activation condition, e.g.:
– Rewrite rule: change the tag from MD (modal) to NN (noun)
– Activation condition: the tag of the previous word is DT (determiner)
The/DT can/MD rusted/VB ./.  →  The/DT can/NN rusted/VB ./.

Learning of Transformation Rules
Initially the training text is given as input without any tags, and an initial tagging is performed, e.g. assigning each word its most likely tag.
In each training stage a number of candidate transformations (based on a set of templates) are applied to the text.
Of these transformations, we select the one that produces the annotation closest to the correct annotation of the training text.
Training continues iteratively until no transformation can be found that improves the tagging of the text. A sketch of this loop is given below.
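A minimal sketch of that greedy loop, under the assumption that candidate rules are given as (from_tag, to_tag, condition) triples; it illustrates the selection criterion rather than Brill's exact implementation.

```python
def learn_transformations(words, gold_tags, initial_tags, candidate_rules):
    """Greedy sketch of Brill-style transformation-based learning.
    condition(words, tags, i) checks the activation context of a rule.
    Returns the learned rules in the order they were selected."""
    tags = list(initial_tags)
    learned = []
    while True:
        def net_improvement(rule):
            from_tag, to_tag, cond = rule
            delta = 0
            for i, t in enumerate(tags):
                if t == from_tag and cond(words, tags, i):
                    if gold_tags[i] == to_tag:
                        delta += 1        # this change fixes an error
                    elif gold_tags[i] == from_tag:
                        delta -= 1        # this change introduces an error
            return delta
        best_rule = max(candidate_rules, key=net_improvement)
        if net_improvement(best_rule) <= 0:
            break                         # no transformation improves the tagging: stop
        learned.append(best_rule)
        from_tag, to_tag, cond = best_rule
        tags = [to_tag if t == from_tag and cond(words, tags, i) else t
                for i, t in enumerate(tags)]
    return learned
```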

Allowed Transformations
The transformations may be lexicalized or not.
Non-lexicalized: change the tag of a word from A to B if:
– the previous word is tagged Z
– one of the two following words is tagged Z
– the previous word is tagged Z and the next word is tagged W
Lexicalized: change the tag of a word from A to B if:
– the previous word is w
– one of the two following words is w
– the previous word is w, the next word is x, and the previous word is tagged Z

Accuracy of Tagging
Factors affecting accuracy:
– the amount of training data available
– the tag set
– the difference between the training corpus (and dictionary) and the corpus of application
– the number of unknown words (coverage of the dictionary)
A 'dumb' tagger that always assigns each word its most common part of speech achieves an accuracy of about 90% (for English). Most taggers report accuracies between 95% and 97%.
These numbers can be somewhat misleading: 97% per word means only a 63% probability of getting every word in a 15-word sentence correct (0.97^15 ≈ 0.63).
When POS tagging is just a preprocessing step for other applications, very high accuracy is required.

Collocations
A collocation is an expression of two or more words that corresponds to a conventional way of saying things, e.g. strong tea, weapons of mass destruction, make up...
There are various definitions, e.g. (Choueka, 1988):
– A collocation is a sequence of two or more consecutive words that has the characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.
Adjacency of the words is not always necessary (e.g. knock...door).

Applications of Collocations
Language generation: to make the generated language sound natural.
Computational lexicography: to identify important collocations for listing in dictionaries.
Parsing: to prefer parses containing natural collocations.
Corpus linguistic research: to study social phenomena as they are expressed in language.

Criteria for Collocations
Non-compositionality: the meaning of a collocation is not a straightforward composition of the meanings of its parts, e.g. white in white wine, white hair and white woman refers to a somewhat different color in each collocation.
Non-substitutability: we cannot substitute other words for the components of a collocation, e.g. we cannot say yellow wine instead of white wine.
Non-modifiability: most collocations cannot be freely modified with additional lexical material, e.g. we cannot say kicked the wooden bucket.

Discovering Collocations
We can search for collocations based on their frequency in a corpus. The most common bigrams in a New York Times corpus (14M words), in decreasing order of frequency, are:
of the, in the, to the, on the, for the, and the, that the, at the, to be, in a,
of a (13,689), by the (13,361), with the (13,183), from the (12,622), New York (11,428), he said (10,007), as a (9,775), is a (9,231), has been (8,753), for a (8,573)

Discovering Collocations
But such bigrams are not really collocations. A better approach is to filter the candidates based on part-of-speech patterns (e.g. adjective-noun, noun-noun). The top candidates then include:
– New York (A N)
– United States: 7,261 (A N)
– Los Angeles: 5,412 (N N)
– last year: 3,301 (A N)
– Saudi Arabia: 3,191 (N N)
– last week: 2,699 (A N)
– vice president: 2,514 (A N)
A sketch of this frequency-plus-filter approach is given below.
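A minimal sketch of the frequency count plus part-of-speech filter, assuming the input is already tagged; the coarse tag names are placeholders for whatever tagset is actually used.

```python
from collections import Counter

def candidate_collocations(tagged_words, patterns=(("A", "N"), ("N", "N")), top=10):
    """Count adjacent bigrams over (word, tag) pairs and keep only those whose
    tag pattern looks phrase-like (adjective-noun or noun-noun here)."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_words, tagged_words[1:]):
        if (t1, t2) in patterns:
            counts[(w1, w2)] += 1
    return counts.most_common(top)
```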

Mean and Variance
Collocations are often not consecutive words, e.g. knock and door in:
– She knocked on his door
– A man knocked on the metal front door
– They knocked at the door
We therefore count co-occurrences of the words within a window (usually 3 to 4 words on each side of a word).
We can also calculate the mean and the variance of the distances between the two words. A low variance means that the words have a relatively fixed structural relationship (i.e. they usually occur at roughly the same distance), as in the sketch below.
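A minimal sketch of collecting these offsets and computing their mean and variance, assuming a plain list of tokens.

```python
from statistics import mean, pvariance

def offset_stats(words, w1, w2, window=4):
    """Signed distances between occurrences of w1 and w2 within a +/- window,
    together with their mean and (population) variance. A low variance suggests
    the two words tend to occur at a relatively fixed distance from each other."""
    offsets = []
    for i, w in enumerate(words):
        if w != w1:
            continue
        lo, hi = max(0, i - window), min(len(words), i + window + 1)
        offsets.extend(j - i for j in range(lo, hi) if j != i and words[j] == w2)
    if not offsets:
        return None
    return mean(offsets), pvariance(offsets)
```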

Hypothesis Testing
Even for frequently co-occurring words, the co-occurrence might be accidental. Hypothesis testing is a method for checking whether a co-occurrence is accidental.
We formulate the null hypothesis that the co-occurrences are accidental and compute the probability p of the words occurring together if the hypothesis is true. If p is too low, we reject the hypothesis (i.e. the words are related).
Various tests have been proposed in the statistics literature for determining the probability of the words occurring together: the t test, hypothesis testing of differences, Pearson's chi-square test, and the likelihood ratio.
Mutual information is a measure of word association based on information theory: it tells us how much the information we have about the occurrence of one word increases when we know about the occurrence of the other word. Sketches of the t score and of (pointwise) mutual information are given below.
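As an illustration, here are sketches of the t score (using the common approximation that the sample variance is roughly the sample mean) and of pointwise mutual information, both computed directly from corpus counts.

```python
import math

def t_score(c_w1, c_w2, c_bigram, n):
    """t statistic for the bigram w1 w2 under the null hypothesis that w1 and w2
    occur independently: t = (x_bar - mu) / sqrt(s^2 / n), with
    x_bar = C(w1 w2) / n, mu = P(w1) P(w2), and s^2 approximated by x_bar."""
    x_bar = c_bigram / n
    mu = (c_w1 / n) * (c_w2 / n)
    return (x_bar - mu) / math.sqrt(x_bar / n)

def pmi(c_w1, c_w2, c_bigram, n):
    """Pointwise mutual information: log2( P(w1 w2) / (P(w1) P(w2)) )."""
    return math.log2((c_bigram / n) / ((c_w1 / n) * (c_w2 / n)))
```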