LING/C SC 581: Advanced Computational Linguistics


1 LING/C SC 581: Advanced Computational Linguistics
Lecture Notes Feb 6th

2 Administrivia
The Homework Pipeline:
- Homework 2 graded
- Homework 4 not back yet… soon
- Homework 5 due Weds by midnight
No classes next week:
- I'm out of town on business
- No new homework assigned this week

3 Today's Topics Homework 4 review

4 Homework 4 Review: Question 1
Construct a WSJ text corpus that excludes both words tagged as -NONE- and punctuation words (defined previously).
- Show your Python console. How many words in the corpus? How many distinct words?
- Plot the cumulative frequency distribution graph.
- How many top words do you need to account for 50% of the corpus?
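The filtering step can be sketched on a toy tagged list (hypothetical data; the homework uses NLTK's ptb reader, which requires the licensed Penn Treebank, but the filtering logic is the same):

```python
# Toy stand-in for ptb.tagged_words(categories=['news']); hypothetical data.
tagged = [('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
          ('years', 'NNS'), ('old', 'JJ'), ('old', 'JJ'),
          ('*T*-1', '-NONE-'), ('.', '.')]
excluded = {'-NONE-', '-LRB-', '-RRB-', 'SYM', ':', '.', ',', '``', "''"}

# Keep only the word of each (word, tag) pair whose tag is not excluded.
tokens = [word for (word, tag) in tagged if tag not in excluded]
print(len(tokens))       # words in the corpus
print(len(set(tokens)))  # distinct words
```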

5 Homework 4 Review: Question 1
excluded = set(['-NONE-', '-LRB-', '-RRB-', 'SYM', ':', '.', ',', '``', "''"])
tokens = [x[0] for x in ptb.tagged_words(categories=['news']) if x[1] not in excluded]
words = set(tokens)
text = nltk.Text(tokens)
print('Tokens: {}; #Words: {}'.format(len(text), len(words)))
Tokens: …; #Words: 49184
len(words)
49184
print('Lexical diversity: {:.3f}%'.format(len(words)/len(text)))
Lexical diversity: 0.047%
dist = nltk.FreqDist(text)
print(dist)
<FreqDist with … samples and … outcomes>

6 Homework 4 Review: Question 1
list = sorted(dist.items(), key=lambda t: t[1], reverse=True)
half = len(text) / 2.0
total = 0
index = 0
while total < half:
    total += list[index][1]
    index += 1
print('No of words: {}; total: {}'.format(index, total))
No of words: 217; total: …
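The 50%-coverage loop can be checked on a toy token list (hypothetical data; the slide runs the same logic over the WSJ tokens):

```python
from collections import Counter

# Hypothetical toy corpus standing in for the WSJ token list.
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', 'the', 'end']
dist = Counter(tokens)

# Rank words by frequency, highest first.
ranked = sorted(dist.items(), key=lambda t: t[1], reverse=True)

# Count how many top-ranked words are needed to cover half the corpus.
half = len(tokens) / 2.0
total = 0
index = 0
while total < half:
    total += ranked[index][1]
    index += 1
print('No of words: {}; total: {}'.format(index, total))
# 'the' (3 occurrences) plus one more word (1) reaches half (4 of 8)
```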

7 Homework 4 Review: Question 1
print('{:12s} {:5s}'.format('Word', 'Freq'))
for word, freq in list[:index]:
    print('{:12s} {:5d}'.format(word, freq))

8 Homework 4 Review: Question 1

9 Homework 4 Review: Question 2
With case folding:
tokens = [x[0].lower() for x in ptb.tagged_words(categories=['news']) if x[1] not in excluded]
Tokens: …; #Words: 43746
Lexical diversity: 0.042%
No of words: 176; total: …

10 Homework 4 Review: Question 2

11 Colorless green ideas
Examples:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless
Chomsky (1957): ". . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally `remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
i.e. (1) is syntactically valid, (2) is word salad.
One piece of supporting evidence:
- (1) is pronounced with normal intonation
- (2) is pronounced like a list of words …

12 Background: Language Models and N-grams
Given a word sequence w1 w2 w3 ... wn, how do we compute the probability of the sequence?
Chain rule:
p(w1 w2) = p(w1) p(w2|w1)
p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
...
p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
Note: it's not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2 ... w1) for all possible word sequences.
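The chain rule can be checked numerically on a toy corpus (hypothetical three-word sentences, with probabilities estimated by relative frequency):

```python
from collections import Counter

# Hypothetical corpus of three-word sentences.
sents = [('ideas', 'sleep', 'furiously'),
         ('ideas', 'sleep', 'soundly'),
         ('dogs', 'sleep', 'soundly')]

c1 = Counter(s[:1] for s in sents)  # counts of w1
c2 = Counter(s[:2] for s in sents)  # counts of w1 w2
c3 = Counter(s[:3] for s in sents)  # counts of w1 w2 w3

def p_chain(w1, w2, w3):
    # p(w1 w2 w3) = p(w1) * p(w2|w1) * p(w3|w1 w2)
    return ((c1[(w1,)] / len(sents))
            * (c2[(w1, w2)] / c1[(w1,)])
            * (c3[(w1, w2, w3)] / c2[(w1, w2)]))

# The conditionals telescope, so this equals the direct relative frequency:
print(p_chain('ideas', 'sleep', 'furiously'))  # 2/3 * 2/2 * 1/2 = 1/3
```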

13 Background: Language Models and N-grams
Given a word sequence w1 w2 w3 ... wn
Bigram approximation: just look at the previous word only (not all the preceding words)
Markov Assumption: finite length history (1st order Markov Model)
p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-3 wn-2 wn-1)
p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
Note: p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
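The bigram approximation can be sketched with maximum-likelihood counts on toy data (a sketch only; the slides' experiment trains a bigram model on Penn Treebank sections):

```python
from collections import Counter

# Hypothetical toy corpus with sentence-boundary markers.
corpus = ['<s>', 'colorless', 'green', 'ideas', 'sleep', 'furiously', '</s>',
          '<s>', 'green', 'ideas', 'sleep', '</s>']

# MLE: p(w | prev) = count(prev w) / count(prev).
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def p(w, prev):
    return bigrams[(prev, w)] / unigrams[prev]

# Probability of a sentence under the bigram approximation:
sent = ['<s>', 'green', 'ideas', 'sleep', '</s>']
prob = 1.0
for prev, w in zip(sent, sent[1:]):
    prob *= p(w, prev)
print(prob)  # 1/2 * 2/2 * 2/2 * 1/2 = 0.25
```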

14 Colorless green ideas sentences
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless
Statistical Experiment (Pereira 2002): a bigram language model, p(wi | wi-1)

15 Part-of-Speech (POS) Tag Sequence
Chomsky's example:               colorless green ideas sleep furiously
(POS tags)                       JJ JJ NNS VBP RB
Similar but grammatical example: revolutionary new ideas appear infrequently
                                 JJ JJ NNS VBP RB
(LSLT pg. 146)

16 Stanford Parser
A probabilistic PS parser trained on the Penn Treebank

17 Stanford Parser
A probabilistic PS parser trained on the Penn Treebank

18 Penn Treebank (PTB) Corpus: word frequencies
Word           POS    Frequency
colorless      ?      ?
green          NNP    33
               JJ     19
               NN     5
ideas          NNS    32
sleep          VB     4
               VBP    2
               ?      1
furiously      RB     ?

Word           POS    Frequency
revolutionary  JJ     6
               NNP    2
               NN     ?
new            ?      1795
               ?      1459
               NNPS   1
ideas          NNS    32
appear         VB     55
               VBP    41
infrequently   ?      ?

19 Stanford Parser
Structure of NPs: colorless green ideas vs. revolutionary new ideas
Phrase             Frequency
[NP JJ JJ NNS]     1073
[NP NNP JJ NNS]    61

20 An experiment
Examples:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless
Question: Is (1) even the most likely permutation of these particular five words?

21 Parsing Data
All 5! (= 120) permutations of:
colorless green ideas sleep furiously .
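The 120 candidate sentences can be generated with itertools (a sketch; the slides' experiment then parses and scores each permutation with the trained parser):

```python
from itertools import permutations

# The five words of Chomsky's sentence (final period handled separately).
words = ['colorless', 'green', 'ideas', 'sleep', 'furiously']

# All orderings of the five words, joined back into sentence strings.
perms = [' '.join(p) for p in permutations(words)]
print(len(perms))  # 5! = 120
print('furiously sleep ideas green colorless' in perms)  # True
```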

22 Parsing Data
The winning sentence was:
furiously ideas sleep colorless green .
(after training on sections …, approx. 40,000 sentences)
- sleep selects for an ADJP object with 2 heads
- the adverb (RB) furiously modifies the noun

23 Parsing Data
The next two highest scoring permutations were:
Furiously green ideas sleep colorless .   (sleep takes NP object)
Green ideas sleep furiously colorless .   (sleep takes ADJP object)

24 Parsing Data
(Pereira 2002) compared Chomsky's original minimal pair:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless
Ranked #23 and #36 respectively out of 120.

25 Parsing Data
But the graph (next slide) shows how arbitrary these rankings are when trained on randomly chosen sections covering 14K-31K sentences.
Example: #36 furiously sleep ideas green colorless outranks #23 colorless green ideas sleep furiously (and the top 3) over much of the training space.
Example: Chomsky's original sentence, #23 colorless green ideas sleep furiously, outranks both the top 3 and #36 only briefly, at one data point.

26 Sentence Rank vs. Amount of Training Data
Best three sentences

27 Sentence Rank vs. Amount of Training Data
#23 colorless green ideas sleep furiously #36 furiously sleep ideas green colorless

28 Sentence Rank vs. Amount of Training Data
#23 colorless green ideas sleep furiously #36 furiously sleep ideas green colorless

