LING 388: Computers and Language

Lecture 19

Term Project Proposal Deadline: due by end of next week
One-page summary.
Ask yourself: what are you interested in exploring?
Must involve some use of what we've covered in terms of programming, e.g. straight Python or NLTK.
Propose some task, experiment or application you plan to prototype or build (it doesn't have to be a complete application).
Send it to me for project approval.

NLTK Book Chapter 1: section 4. http://www.nltk.org/book/ch01.html
Highlights:

NLTK Book
text1: Moby Dick
>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness',
 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness',
 'uncomfortableness']
text7: Wall Street Journal
>>> sorted(w for w in set(text7) if w.istitle() and '.' in w)
['A.', 'A.C.', 'A.D.', 'A.L.', 'Ala.', 'Ariz.', 'Aug.', 'B.', 'B.A.T', 'C.',
 'C.J.B.', 'Calif.', 'Co.', 'Colo.', 'Conn.', 'Corp.', 'Cos.', 'D.', 'D.C.', 'Dec.',
 'Del.', 'Dr.', 'E.', 'E.C.', 'E.W.', 'F.', 'F.H.', 'F.W.', 'Feb.', 'Fla.',
 'G.', 'Ga.', 'Gov.', 'H.', 'H.N.', 'I.', 'Ill.', 'Inc.', 'Ind.', 'J.',
 'J.L.', 'J.P.', 'Jan.', 'Jr.', 'K.', 'Ky.', 'L.', 'L.A.', 'L.P.', 'La.',
 'Lt.', 'Ltd.', 'M.', 'M.D.', 'Mass.', 'Md.', 'Messrs.', 'Mich.', 'Minn.', 'Miss.',
 'Mo.', 'Mr.', 'Mrs.', 'Ms.', 'N.', 'N.C', 'N.C.', 'N.H.', 'N.J', 'N.J.',
 'N.M.', 'N.V', 'N.V.', 'N.Y', 'N.Y.', 'Nev.', 'No.', 'Nov.', 'O.', 'Oct.',
 'Ore.', 'P.', 'Pa.', 'Prof.', 'Pty.', 'R.', 'R.D.', 'R.I.', 'R.P.', 'Rep.',
 'Rev.', 'S.', 'S.A', 'S.I.', 'Sen.', 'Sept.', 'Sept.30', 'Sino-U.S.', 'Sr.', 'St.',
 'T.', 'Tenn.', 'U.K.', 'U.S.', 'U.S.-Japan', 'U.S.-Japanese', 'U.S.A', 'U.S.A.', 'U.S.S.R.', 'Va.',
 'W.', 'W.D.', 'W.N.', 'W.R.', 'Wash.', 'Wis.', 'Z.']
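Both queries follow one pattern: deduplicate with set(), filter with a string test, then sort. The following is a minimal sketch of that pattern on a small made-up token list, so it runs without NLTK. The helper name sorted_vocab and the sample tokens are assumptions for illustration only; they are not part of NLTK or the NLTK Book.

```python
def sorted_vocab(tokens, predicate):
    """Return the sorted distinct tokens for which predicate(token) is true."""
    return sorted(w for w in set(tokens) if predicate(w))


# Hypothetical sample tokens standing in for text1 / text7.
tokens = ["Mr.", "Smith", "of", "Calif.", "met", "Mr.", "Jones",
          "in", "U.S.", "reasonableness"]

# Same tests as on the slide: a suffix check, and title case plus a period.
suffix_words = sorted_vocab(tokens, lambda w: w.endswith("ableness"))
abbreviations = sorted_vocab(tokens, lambda w: w.istitle() and "." in w)

print(suffix_words)
print(abbreviations)
```

With NLTK loaded, the same helper could presumably be called as sorted_vocab(text7, ...), since an nltk.Text can be iterated over like a list of tokens.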

NLTK Book
text1: Moby Dick
text7: Wall Street Journal
>>> len(set(text1).intersection(set(text7)))
4642
Still more than 4000 words in common
>>> len(set(text1))
19317
>>> len(set(text7))
12408
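To put the 4642 shared words in proportion, divide by each vocabulary size: that is roughly 24% of the Moby Dick vocabulary and roughly 37% of the Wall Street Journal vocabulary. A minimal sketch of the calculation follows. The function name vocab_overlap and the two toy sentences are assumptions for illustration, and the code uses plain Python sets rather than NLTK.

```python
def vocab_overlap(tokens_a, tokens_b):
    """Return (shared count, share of A's vocabulary, share of B's vocabulary)."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    shared = vocab_a & vocab_b  # same result as vocab_a.intersection(vocab_b)
    return len(shared), len(shared) / len(vocab_a), len(shared) / len(vocab_b)


# Hypothetical toy texts; both vocabularies must be non-empty.
toy_a = "the whale swam in the sea".split()
toy_b = "the market fell in the fall".split()
shared, share_a, share_b = vocab_overlap(toy_a, toy_b)
print(shared, share_a, share_b)

# The same ratios computed from the counts reported on the slide.
share_text1 = round(4642 / 19317, 2)
share_text7 = round(4642 / 12408, 2)
print(share_text1, share_text7)
```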

NLTK Book
About 1/3 of the 3-letter words in common
>>> t1_3 = [w for w in text1 if len(w) == 3]
>>> t7_3 = [w for w in text7 if len(w) == 3]
>>> len(set(t1_3))
658
>>> len(set(t7_3))
595
>>> len(set(t7_3).intersection(set(t1_3)))
232
>>> set(t7_3).intersection(set(t1_3))
{'his', 'sum', 'NEW', 'sad', '890', 'far', 'die', 'log', 'him', 'its',
 'how', 'Lee', 'Air', '144', 'six', 'end', 'air', 'jet', 'but', 'yon',
 'Don', 'fit', 'old', 'How', 'arm', '125', 'fly', 'Red', 'eat', 'saw',
 'day', 'raw', 'Too', 'not', 'aim', 'who', 'had', 'own', 'car', 'toy',
 '102', 'fee', 'Who', 'our', 'Leo', 'net', 'Del', 'sky', 'And', '800',
 'Ark', 'Nor', 'get', 'son', 'sit', 'Law', 'led', '100', 'For', 'III',
 'lay', 'joy', 'man', 'now', 'lot', 'New', 'Dan', '500', '132', 'may',
 'End', 'few', 'war', 'den', 'Not', 'hay', 'set', 'Can', 'art', 'act',
 'jam', 'Van', 'dam', 'fed', 'cow', 'hot', '128', 'add', 'ton', 'put',
 'Old', 'Ray', '108', 'two', 'try', 'bag', 'run', 'gas', 'one', 'Joe',
 'cry', '150', '103', 'Pan', 'beg', 'hit', 'THE', 'top', 'Any', 'pie',
 'Two', 'TWO', 'van', 'see', '114', 'via', 'tow', 'The', 'ire', 'yet',
 '107', 'out', 'odd', 'pit', 'say', 'You', 'any', 'Its', 'law', 'why',
 'bar', 'His', 'rim', 'won', 'you', '400', 'fat', 'low', 'has', 'rap',
 'tip', 'boy', 'too', '115', '118', 'But', 'new', 'she', 'per', 'hid',
 'Mrs', 'due', 'nor', 'key', '...', 'ask', 'eye', 'Yet', 'met', 'Put',
 '111', 'box', 'War', 'and', 'yes', 'for', 'map', 'Why', 'aid', 'bid',
 'Now', 'Sir', 'fan', 'big', 'all', '106', 'row', 'Far', '120', 'men',
 'All', 'sex', 'Few', 'Man', '1st', 'God', 'Her', '119', 'Ten', '105',
 'age', 'May', 'Day', 'bit', '110', 'did', 'was', 'ill', '130', 'her',
 'ran', 'lap', 'AND', 'job', 'the', 'dry', 'bad', 'red', 'tea', 'cap',
 'got', 'let', 'Tom', 'are', '180', '135', 'can', 'oil', 'cut', '133',
 'ago', 'She', 'One', '101', 'buy', 'use', 'vow', 'Sea', 'off', 'bat',
 'way', 'pay'}

NLTK Book
About 1/3 to 1/2 of the 5-letter words in common
>>> t1_5 = [w for w in text1 if len(w) == 5]
>>> t7_5 = [w for w in text7 if len(w) == 5]
>>> len(set(t1_5))
2397
>>> len(set(t7_5))
1531
>>> len(set(t1_5).intersection(set(t7_5)))
705
About 1/4 to 1/3 of the 7-letter words in common
>>> t1_7 = [w for w in text1 if len(w) == 7]
>>> t7_7 = [w for w in text7 if len(w) == 7]
>>> len(set(t1_7))
3005
>>> len(set(t7_7))
1937
>>> len(set(t1_7).intersection(set(t7_7)))
746
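The fractions quoted for each word length depend on which vocabulary is used as the denominator, which is why each is given as a range. The sketch below recomputes the ranges from the counts reported on the slides. The names length_vocab, overlap_shares and slide_counts, and the toy token lists, are assumptions for illustration, not NLTK functions.

```python
def length_vocab(tokens, length):
    """Distinct tokens of exactly the given length."""
    return {w for w in tokens if len(w) == length}


def overlap_shares(size_a, size_b, shared):
    """Shared count as a fraction of the larger and of the smaller vocabulary."""
    return shared / max(size_a, size_b), shared / min(size_a, size_b)


# (text1 vocabulary, text7 vocabulary, shared) as reported on the slides.
slide_counts = {3: (658, 595, 232), 5: (2397, 1531, 705), 7: (3005, 1937, 746)}

shares = {n: overlap_shares(*counts) for n, counts in slide_counts.items()}
for n, (low, high) in shares.items():
    print(f"{n}-letter words: {low:.0%} to {high:.0%} shared")

# Hypothetical toy token lists to exercise length_vocab.
toy_a = ["the", "sea", "was", "calm", "the"]
toy_b = ["the", "tax", "was", "high"]
toy_shared = length_vocab(toy_a, 3) & length_vocab(toy_b, 3)
print(sorted(toy_shared))
```

On real data, length_vocab(text1, 5) & length_vocab(text7, 5) would be expected to reproduce the set whose size is 705 above.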

NLTK Book Chapter 2: Accessing Text Corpora and Lexical Resources
>>> import nltk
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")
Displaying 25 of 37 matches:

NLTK Book: Chapter 2
>>> from nltk.corpus import gutenberg
>>> gutenberg
<PlaintextCorpusReader in '.../corpora/gutenberg' (not loaded yet)>
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt', 'bible-kjv.txt',
 'blake-poems.txt', 'bryant-stories.txt', 'burgess-busterbrown.txt',
 'carroll-alice.txt', 'chesterton-ball.txt', 'chesterton-brown.txt',
 'chesterton-thursday.txt', 'edgeworth-parents.txt', 'melville-moby_dick.txt',
 'milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> len(gutenberg.fileids())
18
Methods for PlaintextCorpusReader:
.raw(fileid)
.words(fileid)
.sents(fileid)

NLTK Book: Chapter 2
Other types of corpora:
from nltk.corpus import webtext    # Web text corpus
from nltk.corpus import nps_chat   # NPS chat corpus
from nltk.corpus import brown      # Brown corpus, "a balanced corpus"

Google Books austens-novels-and-minor-works/

Google Books Emma was published in 1815

NLTK Book face-fiction
free indirect style:
It describes the way in which a writer imbues a third-person narration with the habits of thought or expression of a fictional character. Before Austen, novelists chose between first-person narrative (letting us into the mind of a character, but limiting us to his or her understanding) and third-person narrative (allowing us a God-like view of all the characters, but making them pieces in an authorial game). Austen miraculously combined the internal and the external.
“But in every respect as she saw more of her, she was confirmed in all her kind designs.” (The sentence is in the third person, yet we are not exactly being told something by the author. “Kind designs” is Emma’s complacent judgment of herself.)
“Emma was gratified to observe such a proof in her of strengthened character.”

NLTK Book
Statistics:
average # letters per word
average # words per sentence
average # times each word is used
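One way these three statistics might be computed is sketched below on a tiny hand-made sample, so it runs without NLTK data. The function name text_stats and the sample sentences are assumptions for illustration. Note also two choices made here that may differ from the NLTK Book's own version: word length is the sum of token lengths (not raw character count), and punctuation tokens are counted as words, as they are in the token lists NLTK returns.

```python
def text_stats(sents):
    """Return (letters per word, words per sentence, uses per distinct word).

    sents is a list of sentences, each a list of token strings.
    """
    words = [w for sent in sents for w in sent]
    vocab = {w.lower() for w in words}  # case-folded distinct words
    avg_word_len = sum(len(w) for w in words) / len(words)
    avg_sent_len = len(words) / len(sents)
    avg_uses = len(words) / len(vocab)
    return avg_word_len, avg_sent_len, avg_uses


# Hypothetical sample in the shape a corpus reader's sentence list would have.
sample_sents = [["The", "cat", "sat", "."],
                ["The", "dog", "sat", "too", "."]]

avg_word_len, avg_sent_len, avg_uses = text_stats(sample_sents)
print(round(avg_word_len, 2), avg_sent_len, avg_uses)
```

With the Gutenberg corpus installed, text_stats(gutenberg.sents('austen-emma.txt')) would be the natural call, assuming the .sents(fileid) method behaves as a list of token lists.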

NLTK Book: Chapter 2
>>> from nltk.corpus import brown
>>> len(brown.fileids())
500
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
 'science_fiction']
>>> news = brown.words(categories='news')
>>> import nltk
>>> fd = nltk.FreqDist(w.lower() for w in news)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'shall', 'should', 'will', 'would']
>>> for m in modals:
...     print('{}: {}'.format(m, fd[m]), end=' ')
...
can: 94 could: 87 may: 93 might: 38 must: 53 shall: 5 should: 61 will: 389 would: 246

NLTK Book: Chapter 2
>>> modals = ['can', 'could', 'may', 'might', 'must', 'shall', 'should', 'will', 'would']
>>> cfd = nltk.ConditionalFreqDist((cat, word) for cat in brown.categories() for word in brown.words(categories=cat))
>>> cfd.tabulate(conditions=brown.categories(), samples=modals)
                   can  could    may  might   must  shall should   will  would
adventure           46    151      5     58     27      7     15     50    191
belles_lettres     246    213    207    113    170     34    102    236    392
editorial          121     56     74     39     53     19     88    233    180
fiction             37    166      8     44     55      3     35     52    287
government         117     38    153     13    102     98    112    244    120
hobbies            268     58    131     22     83      5     73    264     78
humor               16     30      8      8      9      2      7     13     56
learned            365    159    324    128    202     40    171    340    319
lore               170    141    165     49     96     12     76    175    186
mystery             42    141     13     57     30      1     29     20    186
news                93     86     66     38     50      5     59    389    244
religion            82     59     78     12     54     21     45     71     68
reviews             45     40     45     26     19      1     18     58     47
romance             74    193     11     51     45      3     32     43    244
science_fiction     16     49      4     12      8      3      3     16     79
The NLTK book says, "Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this?"
In this table, however, the most frequent modal in the romance genre is actually would (244), not could (193).
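The idea behind a conditional frequency distribution is one counter per condition (here, per genre). The sketch below imitates that with the standard library only; it is not NLTK's implementation. The names conditional_counts and most_frequent and the toy (genre, word) pairs are assumptions for illustration. The news and romance counts are copied from the table above and are used to check which modal is most frequent in each genre.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Build {condition: Counter of samples} from (condition, sample) pairs."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table


def most_frequent(table, condition, samples):
    """Return the sample with the highest count under the given condition."""
    return max(samples, key=lambda s: table[condition][s])


modals = ["can", "could", "may", "might", "must", "shall", "should", "will", "would"]

# Hypothetical toy pairs, in the same (category, word) shape used with the Brown corpus.
toy_pairs = [("news", "will"), ("news", "will"), ("news", "can"),
             ("romance", "would"), ("romance", "could"), ("romance", "would")]
toy_table = conditional_counts(toy_pairs)

# Rows copied from the tabulated output for the news and romance genres.
slide_table = {
    "news": Counter(dict(zip(modals, [93, 86, 66, 38, 50, 5, 59, 389, 244]))),
    "romance": Counter(dict(zip(modals, [74, 193, 11, 51, 45, 3, 32, 43, 244]))),
}

top_news = most_frequent(slide_table, "news", modals)
top_romance = most_frequent(slide_table, "romance", modals)
print(top_news, top_romance)
```

A Counter returns 0 for a word it has never seen, so looking up a modal that does not occur in a genre is safe.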
