Unsupervised language acquisition Carl de Marcken 1996

Viterbi-best parse

T H E R E N T I S D U E

Lexicon (cost in bits): A, B, C, … 3 bits each; HE 4; HER 4; THE 3; HERE 5; THERE 5; RENT 7; IS 4; TIS 8; DUE 7; ERE 8

T H E R E N T I S D U E

After character 1:
Best analysis: T: 3 bits

T H E R E N T I S D U E

After character 1:
Best (only) analysis: T: 3 bits

After character 2:
(1,2) not in lexicon
(1,1)'s best analysis + (2,2), which exists: 3 + 3 = 6 bits

T H E R E N T I S D U E

After character 1:
Best (only) analysis: T: 3 bits

After character 2:
(1,2) not in lexicon
(1,1)'s best analysis + (2,2), which exists: 3 + 3 = 6 bits (WINNER)

After character 3:
(1,3) is in lexicon: THE: 3 bits
(1,1) best analysis + (2,3), which exists: 3 + 4 = 7 bits: T-HE
(1,2) best analysis + (3,3): T-H-E: 6 + 3 = 9 bits
THE wins (3 bits)

T H E R E N T I S D U E

After character 1:
Best (only) analysis: T: 3 bits

After character 2:
(1,2) not in lexicon
(1,1)'s best analysis + (2,2), which exists: 3 + 3 = 6 bits

After character 3:
(1,3) is in lexicon: THE: 3 bits
(1,1) best analysis + (2,3), which exists: 3 + 4 = 7 bits: T-HE
(1,2) best analysis + (3,3): T-H-E: 6 + 3 = 9 bits
THE wins

After character 4:
(1,4) not in lexicon
(2,4) HER (4 bits): with the best analysis after (1), yields T-HER (3 + 4 = 7 bits)
(3,4) not in lexicon
Best up to 3 (THE) plus R yields THE-R; cost is 3 + 3 = 6 bits (WINNER)

T H E R E N T I S D U E

1: T 3
2: T-H 6
3: THE 3
4: THE-R 6

5: (1,5) THERE: 5
(1,1) + (2,5) HERE = 3 + 5 = 8
(1,2) + (3,5) ERE = 6 + 8 = 14
(4,5) not in lexicon
(1,4) + (5,5) = THE-R-E = 6 + 3 = 9
THERE is the winner (5 bits)

6: (1,6) not checked because it exceeds the lexicon's maximum length
(2,6) HEREN not in lexicon
(3,6) EREN not in lexicon
(4,6) REN not in lexicon
(5,6) EN not in lexicon
(1,5) + (6,6) = THERE-N = 5 + 3 = 8 (WINNER)

T H E R E N T I S D U E

1: T 3
2: T-H 6
3: THE 3
4: THE-R 6
5: THERE 5
6: THERE-N 8

7: Start with ERENT: not in lexicon
(1,3) + (4,7): THE-RENT = 3 + 7 = 10
ENT not in lexicon
NT not in lexicon
(1,6) + (7,7) = THERE-N-T = 8 + 3 = 11
THE-RENT is the winner (10 bits)

T H E R E N T I S D U E

1: T 3
2: T-H 6
3: THE 3
4: THE-R 6
5: THERE 5
6: THERE-N 8
7: THE-RENT 10

8: Start with RENTI: not in lexicon
ENTI, NTI, TI: none in lexicon
(1,7) THE-RENT + (8,8) I = 10 + 3 = 13
The winner by default

9: Start with ENTIS: not in lexicon, nor is NTIS
(1,6) THERE-N + (7,9) TIS = 8 + 8 = 16
(1,7) THE-RENT + (8,9) IS = 10 + 4 = 14
(1,8) THE-RENT-I + (9,9) S = 13 + 3 = 16
THE-RENT-IS is the winner (14 bits)

T H E R E N T I S D U E

1: T 3
2: T-H 6
3: THE 3
4: THE-R 6
5: THERE 5
6: THERE-N 8
7: THE-RENT 10
8: THE-RENT-I 13
9: THE-RENT-IS 14

10: Not found: NTISD, TISD, ISD, SD
(1,9) THE-RENT-IS + (10,10) D = 14 + 3 = 17

11: Not found: TISDU, ISDU, SDU, DU
(1,10) THE-RENT-IS-D + U = 17 + 3 = 20
Winner: THE-RENT-IS-D-U (20 bits)

12: Not found: ISDUE, SDUE, UE
(1,9) THE-RENT-IS + (10,12) DUE = 14 + 7 = 21 (WINNER!)
(1,11) THE-RENT-IS-D-U + (12,12) E = 20 + 3 = 23
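A minimal sketch of the dynamic program traced on the slides above, assuming Python; the lexicon costs are taken straight from the table on the first slide, and the function and variable names are ours, not de Marcken's:

```python
# The lexicon from the slides: single letters cost 3 bits; the listed
# multi-letter entries have the costs shown.
LEXICON = {c: 3 for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}
LEXICON.update({"HE": 4, "HER": 4, "THE": 3, "HERE": 5, "THERE": 5,
                "RENT": 7, "IS": 4, "TIS": 8, "DUE": 7, "ERE": 8})

def viterbi_parse(s, lexicon):
    """Cheapest segmentation of s into lexicon entries (cost in bits)."""
    max_len = max(len(w) for w in lexicon)
    best = [float("inf")] * (len(s) + 1)   # best[i]: cheapest cost of s[:i]
    back = [0] * (len(s) + 1)              # back[i]: where the last word of that parse starts
    best[0] = 0
    for i in range(1, len(s) + 1):
        # Only look back as far as the longest lexicon entry.
        for j in range(max(0, i - max_len), i):
            piece = s[j:i]
            if piece in lexicon and best[j] + lexicon[piece] < best[i]:
                best[i] = best[j] + lexicon[piece]
                back[i] = j
    # Recover the winning segmentation by following the back-pointers.
    words, i = [], len(s)
    while i > 0:
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words)), best[len(s)]

print(viterbi_parse("THERENTISDUE", LEXICON))
# (['THE', 'RENT', 'IS', 'DUE'], 21)
```

The intermediate values of `best` (3, 6, 3, 6, 5, 8, 10, 13, 14, 17, 20, 21) match the per-character winners worked out slide by slide above.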

Broad outline

Goal: take a large unbroken corpus (no indication of where word boundaries are) and find the best analysis of the corpus into words. "Best"? Interpret the goal in the context of MDL (Minimum Description Length) theory.

We begin with a corpus, and a lexicon which initially has all and only the individual characters (letters, or phonemes) as its entries.

1. Iterate several times (e.g., 7 times):
   1. Construct tentative new entries for the lexicon, with tentative counts; from the counts, calculate rough probabilities.
   2. EM (Expectation/Maximization): iterate 5 times:
      1. Expectation: find all possible occurrences of each lexical entry in the corpus; assign relative weights to each occurrence found, based on its probability; use this to assign (non-integral!) counts of words in the corpus.
      2. Maximization: convert counts into probabilities.
   3. Test each lexical entry to see whether the description length is better without it in the lexicon. If so, remove it.
2. Find the best parse (Viterbi-parse), the one with highest probability.

T H E R E N T I S D U E

Lexicon: D E H I N R S T U
Counts: T 2, E 3, all others 1. Total count: 12

Step 0

Initialize the lexicon with all of the symbols in the corpus (the alphabet, the set of phonemes, whatever it is). Each symbol has a probability, which is simply its frequency. There are no (non-trivial) chunks in the lexicon yet.
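A minimal sketch of this initialization for the toy corpus above, assuming Python; the variable names (`corpus`, `counts`, `probs`) are ours:

```python
from collections import Counter

corpus = "THERENTISDUE"

counts = Counter(corpus)                                  # T: 2, E: 3, all others 1; total 12
probs = {c: n / len(corpus) for c, n in counts.items()}   # pr(T) = 2/12, pr(E) = 3/12, ...
```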

Step 1: Create tentative members

– TH HE ER RE EN NT TI IS SD DU UE
– Give each of these a count of 1.
– Now the total count of "words" in the corpus is 12 + 11 = 23.
– Calculate new probabilities: pr(E) = 3/23; pr(TH) = 1/23.
– The probabilities of the lexicon form a distribution.
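Continuing the sketch (again in Python, with our own variable names), the tentative two-letter entries and the recomputed probabilities come out as on the slide:

```python
from collections import Counter

corpus = "THERENTISDUE"
counts = Counter(corpus)                    # the single-letter entries from Step 0

# Tentative new two-letter entries, each given a count of 1.
for i in range(len(corpus) - 1):
    counts[corpus[i:i+2]] += 1              # TH HE ER RE EN NT TI IS SD DU UE

total = sum(counts.values())                # 12 + 11 = 23
probs = {w: c / total for w, c in counts.items()}
print(round(probs["E"], 3), round(probs["TH"], 3))   # 0.13 (= 3/23), 0.043 (= 1/23)
```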

Expectation/Maximization (EM), iterative

This is a widely used algorithm that does something important and almost miraculous: it finds good values for parameters that depend on hidden structure.

Expectation: find all (possible) occurrences of each lexical item in the corpus. Use the Forward/Backward algorithm.

Forward algorithm

Find all ways of parsing the corpus from the beginning up to each point, and associate with each point the sum of the probabilities of all of those ways. We don't know which one is the right one, really.

Forward

Start at position 1, after T: THERENTISDUE. The only way to get there and put a word break there ("T HERENTISDUE") uses the word(?) "T", whose probability is 2/23. Forward(1) = 2/23 ≈ 0.087.

Now, after position 2, after TH, there are 2 ways to get this:
T H ERENTISDUE (a) or TH ERENTISDUE (b)
(a) has probability 2/23 * 1/23 = 2/529 ≈ 0.0038
(b) has probability 1/23 ≈ 0.0435

There are 2 ways to get this:
T H ERENTISDUE (a) or TH ERENTISDUE (b)
(a) has probability 2/23 * 1/23 = 2/529 ≈ 0.0038
(b) has probability 1/23 ≈ 0.0435

So the Forward probability after letter 2 (after "TH") is 0.0038 + 0.0435 ≈ 0.047.

After letter 3 (after "THE"), we have to consider the possibilities:
(1) T-HE and (2) TH-E and (3) T-H-E

(1) T-HE
(2) TH-E
(3) T-H-E

(1) We calculate this probability as: prob of a break after position 1 ("T") = 2/23 ≈ 0.0869, times prob(HE) (which is 1/23 ≈ 0.0435), giving ≈ 0.0038.
(2) We combine cases (2) and (3), giving us, for both together: prob of a break after position 2 (the H), already calculated as ≈ 0.047, times prob(E) = 0.13, giving ≈ 0.0061.

So Forward(3), the Forward probability after letter 3, is their sum: ≈ 0.0038 + 0.0061 ≈ 0.0099.

Forward

[Diagram: the string T H E, with arcs labeled P1a, P1b, and P2 leading to the break after E.]

The value of Forward here is the sum of the probabilities going by the two paths, P1 and P2.

Forward

[Diagram: the string T H E, with arcs labeled P2a, P2b, and P3 leading to the break after E.]

The value of Forward here is the sum of the probabilities going by the two paths, P2 and P3. You only need to look back (from where you are) the length of the longest lexical entry (which is now 2).

Conceptually

For each break (between letters), we are computing the probability that there is a break there, by considering all possible chunkings of the prefix string, the string up to that point from the left. This is the Forward probability of that break.
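A minimal sketch of the Forward computation just described, assuming Python and reusing the `corpus` string and `probs` dictionary from the Step 1 sketch above (the function name is ours):

```python
def forward(s, probs):
    """F[i] = summed probability of all chunkings of the prefix s[:i]."""
    max_len = max(len(w) for w in probs)
    F = [0.0] * (len(s) + 1)
    F[0] = 1.0                                   # the empty prefix
    for i in range(1, len(s) + 1):
        # Only look back as far as the longest lexical entry.
        for j in range(max(0, i - max_len), i):
            w = s[j:i]
            if w in probs:
                F[i] += F[j] * probs[w]          # break at j, then word w
    return F

F = forward(corpus, probs)
print(round(F[1], 4), round(F[2], 4), round(F[3], 4))   # ≈ 0.087, 0.0473, 0.0099
```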

Backward

We do exactly the same thing from right to left, giving us a Backward probability for each break.

[Diagram: the right edge of the string, … D U E]
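A matching sketch of the Backward pass, under the same assumptions as the Forward sketch above:

```python
def backward(s, probs):
    """B[i] = summed probability of all chunkings of the suffix s[i:]."""
    max_len = max(len(w) for w in probs)
    B = [0.0] * (len(s) + 1)
    B[len(s)] = 1.0                              # the empty suffix
    for i in range(len(s) - 1, -1, -1):
        # Only look forward as far as the longest lexical entry.
        for j in range(i + 1, min(len(s), i + max_len) + 1):
            w = s[i:j]
            if w in probs:
                B[i] += probs[w] * B[j]          # word w, then any chunking of the rest
    return B

# Sanity check: backward(corpus, probs)[0] == forward(corpus, probs)[-1] == Pr(string).
```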

Now the tricky step

T H E R E N T I S D U E

Note that we know the probability of the entire string: it's Forward(12), the sum of the probabilities of all the ways of chunking the string. Call it Pr(string).

What is the probability that -R- is a word, given the string?

T H E R E N T I S D U E

That is, we're wondering whether the R here is a chunk on its own, or part of the chunk ER, or part of the chunk RE. It can't be all three, but we're not in a position (yet) to decide which it is. How do we count it? We take the count of 1 and divide it up among the three options in proportion to their probabilities.

T H E R E N T I S D U E

The probability that R (the 4th character) is a word on its own, given the string, can be found in this expression, stated in terms of the Forward and Backward values just defined:

    Forward(3) × pr(R) × Backward(4) / Pr(string)

where Forward(3) covers every chunking of the string up to the break just before the R, and Backward(4) covers every chunking of the string after the break just following the R. This is the fractional count that goes to R.

Do this for all members of the lexicon

Compute Forward and Backward just once for the whole corpus, or for each sentence or sub-utterance if you have that information. Compute the counts of all lexical items that could conceivably occur (in each sentence, etc.). End of Expectation.
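A sketch of this Expectation step, reusing the `forward` and `backward` functions and the `corpus`/`probs` variables from the sketches above; the names are ours, and this illustrates the idea rather than reproducing de Marcken's code:

```python
from collections import Counter

def expected_counts(s, probs):
    """Fractional count of every lexical entry, via Forward/Backward."""
    max_len = max(len(w) for w in probs)
    F = forward(s, probs)
    B = backward(s, probs)
    total = F[len(s)]                            # Pr(string): all chunkings of s
    counts = Counter()
    for i in range(len(s)):
        for j in range(i + 1, min(len(s), i + max_len) + 1):
            w = s[i:j]
            if w in probs:
                # Probability that this particular span is used as a word.
                counts[w] += F[i] * probs[w] * B[j] / total
    return counts

counts = expected_counts(corpus, probs)          # e.g. counts["R"] is R's fractional count
```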

We’ll go through something just like this again in a few minutes, when we calculate the Viterbi-best parse.

Maximization

We now have a bunch of counts for the lexical items. None of the counts are integral (except by accident). Normalize: take the sum of the counts over the lexicon, N, and calculate the frequency of each word as count(word)/N. Set prob(word) = freq(word).
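Continuing the same sketch, the Maximization step is just a normalization (function name ours):

```python
def maximize(counts):
    """Convert fractional counts into a probability distribution."""
    N = sum(counts.values())
    return {w: c / N for w, c in counts.items()}   # prob(word) = freq(word)

probs = maximize(counts)                           # ready for the next EM iteration
```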

Why "Maximization"?

Because the probability values that maximize the probability of the whole corpus are the ones obtained by setting each probability to the corresponding frequency. That's not obvious….

Testing a lexical entry

A lexical entry makes a positive contribution to the analysis iff the description length of the corpus is lower when we incorporate that lexical entry than when we don't, all other things being equal. What is the description length (DL), and how is it calculated?
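For background, the usual two-part MDL accounting looks roughly like this (a standard formulation offered as a reading aid, not necessarily de Marcken's exact one):

```latex
\mathrm{DL}(\text{corpus}, \text{lexicon}) \;=\;
  \underbrace{\mathrm{DL}(\text{lexicon})}_{\text{cost of writing down the entries}}
  \;+\;
  \underbrace{\sum_{w \in \text{parse}} -\log_2 \Pr(w)}_{\text{cost of the corpus, given the lexicon}}
```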

Approximately?

The cost of a pointer to a lexical entry is (approximately) the entry's plog, that is, −log2 of its probability. Since the plog is rarely an integer, you may have to round up to the next integer, but that's all. So the more often something is used in the lexicon, the cheaper it is for it to be used by the words that use it. Just like in morphology.
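As a one-liner, under the same assumptions as the earlier sketches (rounding up is the only approximation mentioned on the slide):

```python
import math

def pointer_cost_bits(prob):
    """Bit cost of pointing to a lexical entry: its plog, rounded up."""
    return math.ceil(-math.log2(prob))

# e.g. an entry with probability 1/23 costs ceil(4.52) = 5 bits to point to.
```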

Let's look at results

It doesn't know whether it's finding letters, letter chunks, morphemes, words, or phrases. Why not? Statistical learning is heavily structure-bound: don't forget that! If the structure is there, it must be found.