LING 388: Computers and Language

1 LING 388: Computers and Language
Lecture 19

2 Term Project Proposal Deadline: due by end of next week
One page summary
Ask yourself: what are you interested in exploring?
Must involve some use of what we've covered in terms of programming, e.g. straight Python or NLTK
Propose some task, experiment or application you plan to prototype or build (doesn't have to be a complete application)
Send it to me for project approval

3 NLTK Book Chapter 1: section 4. http://www.nltk.org/book/ch01.html
Highlights:

4 NLTK Book
text1: Moby Dick
>>> from nltk.book import *
>>> sorted(w for w in set(text1) if w.endswith('ableness'))
['comfortableness', 'honourableness', 'immutableness', 'indispensableness',
 'indomitableness', 'intolerableness', 'palpableness', 'reasonableness',
 'uncomfortableness']
text7: Wall Street Journal
>>> sorted(w for w in set(text7) if w.istitle() and '.' in w)
['A.', 'A.C.', 'A.D.', 'A.L.', 'Ala.', 'Ariz.', 'Aug.', 'B.', 'B.A.T', 'C.',
 'C.J.B.', 'Calif.', 'Co.', 'Colo.', 'Conn.', 'Corp.', 'Cos.', 'D.', 'D.C.',
 'Dec.', 'Del.', 'Dr.', 'E.', 'E.C.', 'E.W.', 'F.', 'F.H.', 'F.W.', 'Feb.',
 'Fla.', 'G.', 'Ga.', 'Gov.', 'H.', 'H.N.', 'I.', 'Ill.', 'Inc.', 'Ind.', 'J.',
 'J.L.', 'J.P.', 'Jan.', 'Jr.', 'K.', 'Ky.', 'L.', 'L.A.', 'L.P.', 'La.', 'Lt.',
 'Ltd.', 'M.', 'M.D.', 'Mass.', 'Md.', 'Messrs.', 'Mich.', 'Minn.', 'Miss.',
 'Mo.', 'Mr.', 'Mrs.', 'Ms.', 'N.', 'N.C', 'N.C.', 'N.H.', 'N.J', 'N.J.',
 'N.M.', 'N.V', 'N.V.', 'N.Y', 'N.Y.', 'Nev.', 'No.', 'Nov.', 'O.', 'Oct.',
 'Ore.', 'P.', 'Pa.', 'Prof.', 'Pty.', 'R.', 'R.D.', 'R.I.', 'R.P.', 'Rep.',
 'Rev.', 'S.', 'S.A', 'S.I.', 'Sen.', 'Sept.', 'Sept.30', 'Sino-U.S.', 'Sr.',
 'St.', 'T.', 'Tenn.', 'U.K.', 'U.S.', 'U.S.-Japan', 'U.S.-Japanese', 'U.S.A',
 'U.S.A.', 'U.S.S.R.', 'Va.', 'W.', 'W.D.', 'W.N.', 'W.R.', 'Wash.', 'Wis.',
 'Z.']
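Both queries above follow the same pattern: take the set of distinct tokens, keep those that pass a test, and sort the result. A minimal sketch of that pattern follows; `vocab_where` is a hypothetical helper name, and the short token list is an invented stand-in for `text1` and `text7` so the example runs without the NLTK data.

```python
def vocab_where(tokens, predicate):
    """Return the sorted distinct tokens for which predicate(token) is true."""
    return sorted(w for w in set(tokens) if predicate(w))

# Invented toy tokens (assumption: any iterable of strings can be passed in).
toy = ['Mr.', 'Smith', 'of', 'Calif.', 'said', 'reasonableness',
       'and', 'Mr.', 'e.g.', 'palpableness']

# Same tests as on the slide: a suffix match, and title-cased tokens containing a period.
suffix_hits = vocab_where(toy, lambda w: w.endswith('ableness'))
abbrev_hits = vocab_where(toy, lambda w: w.istitle() and '.' in w)

print(suffix_hits)
print(abbrev_hits)
```

Note that `str.istitle()` treats a period or hyphen as a word boundary, which is why tokens such as 'U.S.' and 'U.S.-Japanese' appear in the Wall Street Journal output, while a lowercase abbreviation like 'e.g.' is filtered out.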

5 NLTK Book
text1: Moby Dick, text7: Wall Street Journal
>>> len(set(text1).intersection(set(text7)))
4642
Still more than 4000 words in common
>>> len(set(text1))
19317
>>> len(set(text7))
12408
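The three counts above can be gathered by one small helper. This is a sketch rather than code from the lecture: `vocab_overlap` is a hypothetical name, and the two toy token lists are invented stand-ins for `text1` and `text7`.

```python
def vocab_overlap(tokens_a, tokens_b):
    """Return (|A|, |B|, |A & B|) for the vocabularies of two token sequences."""
    vocab_a, vocab_b = set(tokens_a), set(tokens_b)
    return len(vocab_a), len(vocab_b), len(vocab_a & vocab_b)

# Invented toy data (assumption: any two iterables of strings work).
toy_a = ['Call', 'me', 'Ishmael', '.', 'Call', 'me', 'soon', '.']
toy_b = ['Call', 'the', 'broker', '.', 'Sell', 'soon', '.']

size_a, size_b, shared = vocab_overlap(toy_a, toy_b)
print(size_a, size_b, shared)
```

With the slide's figures, the 4642 shared items are roughly 24% of the Moby Dick vocabulary (4642/19317) and roughly 37% of the Wall Street Journal vocabulary (4642/12408). The sets are case-sensitive and include punctuation and numbers, so "words" here means distinct tokens.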

6 NLTK Book
>>> t1_3 = [w for w in text1 if len(w) == 3]
>>> t7_3 = [w for w in text7 if len(w) == 3]
>>> len(set(t1_3))
658
>>> len(set(t7_3))
595
>>> len(set(t7_3).intersection(set(t1_3)))
232
>>> set(t7_3).intersection(set(t1_3))
{'his', 'sum', 'NEW', 'sad', '890', 'far', 'die', 'log', 'him', 'its', 'how',
 'Lee', 'Air', '144', 'six', 'end', 'air', 'jet', 'but', 'yon', 'Don', 'fit',
 'old', 'How', 'arm', '125', 'fly', 'Red', 'eat', 'saw', 'day', 'raw', 'Too',
 'not', 'aim', 'who', 'had', 'own', 'car', 'toy', '102', 'fee', 'Who', 'our',
 'Leo', 'net', 'Del', 'sky', 'And', '800', 'Ark', 'Nor', 'get', 'son', 'sit',
 'Law', 'led', '100', 'For', 'III', 'lay', 'joy', 'man', 'now', 'lot', 'New',
 'Dan', '500', '132', 'may', 'End', 'few', 'war', 'den', 'Not', 'hay', 'set',
 'Can', 'art', 'act', 'jam', 'Van', 'dam', 'fed', 'cow', 'hot', '128', 'add',
 'ton', 'put', 'Old', 'Ray', '108', 'two', 'try', 'bag', 'run', 'gas', 'one',
 'Joe', 'cry', '150', '103', 'Pan', 'beg', 'hit', 'THE', 'top', 'Any', 'pie',
 'Two', 'TWO', 'van', 'see', '114', 'via', 'tow', 'The', 'ire', 'yet', '107',
 'out', 'odd', 'pit', 'say', 'You', 'any', 'Its', 'law', 'why', 'bar', 'His',
 'rim', 'won', 'you', '400', 'fat', 'low', 'has', 'rap', 'tip', 'boy', 'too',
 '115', '118', 'But', 'new', 'she', 'per', 'hid', 'Mrs', 'due', 'nor', 'key',
 '...', 'ask', 'eye', 'Yet', 'met', 'Put', '111', 'box', 'War', 'and', 'yes',
 'for', 'map', 'Why', 'aid', 'bid', 'Now', 'Sir', 'fan', 'big', 'all', '106',
 'row', 'Far', '120', 'men', 'All', 'sex', 'Few', 'Man', '1st', 'God', 'Her',
 '119', 'Ten', '105', 'age', 'May', 'Day', 'bit', '110', 'did', 'was', 'ill',
 '130', 'her', 'ran', 'lap', 'AND', 'job', 'the', 'dry', 'bad', 'red', 'tea',
 'cap', 'got', 'let', 'Tom', 'are', '180', '135', 'can', 'oil', 'cut', '133',
 'ago', 'She', 'One', '101', 'buy', 'use', 'vow', 'Sea', 'off', 'bat', 'way',
 'pay'}
About 1/3 of the 3 letter words in common

7 NLTK Book
>>> t1_5 = [w for w in text1 if len(w) == 5]
>>> t7_5 = [w for w in text7 if len(w) == 5]
>>> len(set(t1_5))
2397
>>> len(set(t7_5))
1531
>>> len(set(t1_5).intersection(set(t7_5)))
705
>>> t1_7 = [w for w in text1 if len(w) == 7]
>>> t7_7 = [w for w in text7 if len(w) == 7]
>>> len(set(t1_7))
3005
>>> len(set(t7_7))
1937
>>> len(set(t1_7).intersection(set(t7_7)))
746
About 1/3 to 1/2 of the 5 letter words in common
About 1/4 to 1/3 of the 7 letter words in common
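The 3-, 5- and 7-letter comparisons repeat the same steps with a different length each time, so they can be folded into one function. The sketch below assumes nothing beyond plain Python; `shared_fraction_by_length` is a hypothetical name and the toy token lists are invented for illustration.

```python
def shared_fraction_by_length(tokens_a, tokens_b, length):
    """Compare the distinct tokens of a given length in two texts.

    Returns (number shared, shared fraction of A's set, shared fraction of B's set).
    """
    a_n = {w for w in tokens_a if len(w) == length}
    b_n = {w for w in tokens_b if len(w) == length}
    common = a_n & b_n
    frac_a = len(common) / len(a_n) if a_n else 0.0
    frac_b = len(common) / len(b_n) if b_n else 0.0
    return len(common), frac_a, frac_b

# Invented toy data standing in for text1 and text7.
toy_a = ['the', 'old', 'man', 'and', 'the', 'sea', 'whale']
toy_b = ['the', 'new', 'man', 'and', 'his', 'stock', 'price']

common, frac_a, frac_b = shared_fraction_by_length(toy_a, toy_b, 3)
print(common, frac_a, frac_b)
```

The two fractions differ because the denominators differ. From the slide counts: 3-letter tokens, 232/658 (about 35%) and 232/595 (about 39%); 5-letter tokens, 705/2397 (about 29%) and 705/1531 (about 46%); 7-letter tokens, 746/3005 (about 25%) and 746/1937 (about 39%).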

8 NLTK Book Chapter 2: Accessing Text Corpora and Lexical Resources
>>> import nltk
>>> emma = nltk.Text(nltk.corpus.gutenberg.words('austen-emma.txt'))
>>> emma.concordance("surprize")
Displaying 25 of 37 matches:
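To make clear what a concordance shows, here is a simplified keyword-in-context sketch in plain Python. It is not NLTK's implementation: `concordance_lines` is a hypothetical name, the context window is counted in tokens for simplicity, and the sample sentence is invented.

```python
def concordance_lines(tokens, word, width=3):
    """Yield each occurrence of `word` (case-insensitive) with up to
    `width` tokens of context on each side."""
    target = word.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            yield f'{left} [{tok}] {right}'.strip()

# Invented sample text, already split into tokens.
toy = 'She felt great surprize at the news , and her Surprize grew'.split()

lines = list(concordance_lines(toy, 'surprize'))
for line in lines:
    print(line)
```

NLTK's `Text.concordance` appears to limit the number of lines it prints by default, which would explain the slide's "Displaying 25 of 37 matches".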

9 NLTK Book: Chapter 2
>>> from nltk.corpus import gutenberg
>>> gutenberg
<PlaintextCorpusReader in '.../corpora/gutenberg' (not loaded yet)>
>>> gutenberg.fileids()
['austen-emma.txt', 'austen-persuasion.txt', 'austen-sense.txt',
 'bible-kjv.txt', 'blake-poems.txt', 'bryant-stories.txt',
 'burgess-busterbrown.txt', 'carroll-alice.txt', 'chesterton-ball.txt',
 'chesterton-brown.txt', 'chesterton-thursday.txt', 'edgeworth-parents.txt',
 'melville-moby_dick.txt', 'milton-paradise.txt', 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']
>>> len(gutenberg.fileids())
18
Methods for PlaintextCorpusReader:
.raw(fileid)
.words(fileid)
.sents(fileid)

10 NLTK Book: Chapter 2 Other types of corpora:
from nltk.corpus import webtext     # Web text corpus
from nltk.corpus import nps_chat    # NPS chat corpus
from nltk.corpus import brown       # Brown corpus, "A balanced corpus"

11 Google Books austens-novels-and-minor-works/

12 Google Books Emma was published in 1815

13 NLTK Book face-fiction
free indirect style: It describes the way in which a writer imbues a third-person narration with the habits of thought or expression of a fictional character.
Before Austen, novelists chose between first-person narrative (letting us into the mind of a character, but limiting us to his or her understanding) and third-person narrative (allowing us a God-like view of all the characters, but making them pieces in an authorial game). Austen miraculously combined the internal and the external.
“But in every respect as she saw more of her, she was confirmed in all her kind designs.” (The sentence is in the third person, yet we are not exactly being told something by the author. “Kind designs” is Emma’s complacent judgment of herself.)
“Emma was gratified to observe such a proof in her of strengthened character.”

14 NLTK Book Statistics:
average # letters per word
average # words per sentence
average # times each word is used
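These three statistics can be computed from the outputs of a corpus reader's `.raw()`, `.words()` and `.sents()` methods. The sketch below takes those three values as plain arguments so that it runs without the NLTK data; `text_statistics` is a hypothetical name and the two-sentence sample is invented.

```python
def text_statistics(raw, words, sents):
    """Return (avg characters per word, avg words per sentence,
    avg number of times each distinct lowercased word is used)."""
    num_chars = len(raw)      # counts spaces and punctuation too
    num_words = len(words)    # tokens, including punctuation tokens
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})
    return num_chars / num_words, num_words / num_sents, num_words / num_vocab

# Invented sample, shaped like the outputs of .raw(), .sents() and .words().
raw = 'The cat sat. The dog sat too.'
sents = [['The', 'cat', 'sat', '.'], ['The', 'dog', 'sat', 'too', '.']]
words = [w for s in sents for w in s]

avg_word_len, avg_sent_len, avg_uses = text_statistics(raw, words, sents)
print(round(avg_word_len, 2), avg_sent_len, avg_uses)
```

Because the raw character count includes whitespace, the first figure overstates the true average number of letters per word. For a Gutenberg file, a call might look like `text_statistics(gutenberg.raw(f), gutenberg.words(f), gutenberg.sents(f))`, assuming the corpus data is installed.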

15 NLTK Book: Chapter 2
>>> from nltk.corpus import brown
>>> len(brown.fileids())
500
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government',
 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion',
 'reviews', 'romance', 'science_fiction']
>>> news = brown.words(categories='news')
>>> import nltk
>>> fd = nltk.FreqDist(w.lower() for w in news)
>>> modals = ['can', 'could', 'may', 'might', 'must', 'shall', 'should', 'will', 'would']
>>> for m in modals:
...     print('{}: {}'.format(m, fd[m]), end=' ')
...
can: 94 could: 87 may: 93 might: 38 must: 53 shall: 5 should: 61 will: 389 would: 246

16 NLTK Book: Chapter 2
>>> modals = ['can', 'could', 'may', 'might', 'must', 'shall', 'should', 'will', 'would']
>>> cfd = nltk.ConditionalFreqDist(
...     (cat, word)
...     for cat in brown.categories()
...     for word in brown.words(categories=cat))
>>> cfd.tabulate(conditions=brown.categories(), samples=modals)
                   can  could    may  might   must  shall should   will  would
      adventure     46    151      5     58     27      7     15     50    191
 belles_lettres    246    213    207    113    170     34    102    236    392
      editorial    121     56     74     39     53     19     88    233    180
        fiction     37    166      8     44     55      3     35     52    287
     government    117     38    153     13    102     98    112    244    120
        hobbies    268     58    131     22     83      5     73    264     78
          humor     16     30      8      8      9      2      7     13     56
        learned    365    159    324    128    202     40    171    340    319
           lore    170    141    165     49     96     12     76    175    186
        mystery     42    141     13     57     30      1     29     20    186
           news     93     86     66     38     50      5     59    389    244
       religion     82     59     78     12     54     21     45     71     68
        reviews     45     40     45     26     19      1     18     58     47
        romance     74    193     11     51     45      3     32     43    244
science_fiction     16     49      4     12      8      3      3     16     79
NLTK book says, "Observe that the most frequent modal in the news genre is will, while the most frequent modal in the romance genre is could. Would you have predicted this?"
(Actually, the table shows that the most frequent modal in the romance genre is would, 244, ahead of could, 193.)
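A conditional frequency distribution is, in essence, one counter per condition. The sketch below imitates that idea with the standard library instead of `nltk.ConditionalFreqDist`, then picks the most frequent modal for each genre. The names `conditional_counts` and `top_modal` are hypothetical, and the (genre, word) pairs are invented toy data, not Brown corpus counts.

```python
from collections import Counter, defaultdict

MODALS = ['can', 'could', 'may', 'might', 'must', 'shall', 'should', 'will', 'would']

def conditional_counts(pairs):
    """Build {condition: Counter of samples} from (condition, sample) pairs."""
    table = defaultdict(Counter)
    for condition, sample in pairs:
        table[condition][sample] += 1
    return table

def top_modal(counter):
    """Return the modal with the highest count (first listed wins on ties)."""
    return max(MODALS, key=lambda m: counter[m])

# Invented toy pairs; with NLTK one would pass the same generator as on the slide.
toy_pairs = [
    ('news', 'will'), ('news', 'will'), ('news', 'would'), ('news', 'the'),
    ('romance', 'would'), ('romance', 'would'), ('romance', 'could'), ('romance', 'she'),
]

table = conditional_counts(toy_pairs)
top = {genre: top_modal(table[genre]) for genre in table}
print(top)
```

Applying the same "largest count in the row" check to the slide's table gives will (389) for news and would (244) for romance.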

