LING 388: Computers and Language


1 LING 388: Computers and Language
Lecture 26

2 Reminder Term projects: send me your PDF
by the end of this week please!

3 Today's lecture Adapted from chapter 5 of the nltk textbook

4 nltk book: chapter 5
>>> from nltk import word_tokenize, pos_tag
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Cognate object constructions:
>>> pos_tag(word_tokenize("They fight a good fight"))
[('They', 'PRP'), ('fight', 'VBD'), ('a', 'DT'), ('good', 'JJ'), ('fight', 'NN')]
>>> pos_tag(word_tokenize("He will dance a dance"))
[('He', 'PRP'), ('will', 'MD'), ('dance', 'VB'), ('a', 'DT'), ('dance', 'NN')]

5 nltk book: chapter 5 Advanced: .similar() normally prints; instead we capture its results into a list
>>> import nltk
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> import io
>>> from contextlib import redirect_stdout
>>> f = io.StringIO()
>>> with redirect_stdout(f):
...     text.similar('man')
...
>>> sman = f.getvalue().split()
>>> sman
['time', 'day', 'and', 'one', 'it', 'way', 'year', 'woman', 'state', 'men', 'house', 'world', 'life', 'car', 'people', 'war', 'church', 'place', 'that', 'work']
>>> len(sman)
20
>>> f = io.StringIO()
>>> with redirect_stdout(f):
...     text.similar('woman')
...
>>> swoman = f.getvalue().split()
>>> swoman
['man', 'time', 'day', 'year', 'car', 'moment', 'world', 'family', 'house', 'boy', 'country', 'child', 'state', 'job', 'way', 'girl', 'place', 'war', 'room', 'work']
>>> len(swoman)
20
>>> set(sman) - set(swoman)
{'and', 'people', 'woman', 'life', 'men', 'one', 'church', 'it', 'that'}
>>> set(swoman) - set(sman)
{'man', 'child', 'moment', 'job', 'country', 'boy', 'girl', 'family', 'room'}
Note: 'women' does not appear in either list
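The capture trick above works for any function that prints instead of returning a value. A minimal sketch using only the standard library: print_similar() and its toy word lists are hypothetical stand-ins for text.similar(), and capture_words() is a helper name invented here.

```python
import io
from contextlib import redirect_stdout


def print_similar(word):
    """Hypothetical stand-in for Text.similar(): prints words, returns None."""
    toy = {"man": "time day woman", "woman": "man time girl"}  # invented data
    print(toy.get(word, ""))


def capture_words(func, *args):
    """Run func(*args), collect whatever it prints, and split it into a list."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args)
    return buffer.getvalue().split()


sman = capture_words(print_similar, "man")
swoman = capture_words(print_similar, "woman")

# Set difference: words printed for one query but not the other
only_man = set(sman) - set(swoman)
only_woman = set(swoman) - set(sman)
print(sorted(only_man), sorted(only_woman))
```

Wrapping the capture in a helper avoids re-creating the StringIO buffer by hand for every query.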

6 nltk book: chapter 5
>>> text.similar('woman', 150)
man time day year car moment world family house boy country child
state job way girl place war room work question case week word men
people problem city one book church situation wall picture market
system group fire table school class doctor bed meeting stage
president line little hand government and form night story change
thing others future hospital company other ball evening person body
trial month process town area program position project nation course
south point while ground report field game wife matter sun guy face
west window record land north surface back children air result plan
life truth letter road study press water enemy pool door right board
floor party sense difference mind patient century bottle schools part
bridge period years play committee horse river industry two union
university crowd officer problems police store score head list second
latter words spirit past community note scene gun law women

7 nltk book: chapter 5 fileid

8 nltk book: chapter 5 Pre-tagged corpus: .tagged_words()
ca01:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='brown')
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
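The raw word/tag text in ca01 can be turned into (word, tag) pairs by splitting each token at its last slash. The sketch below is a standard-library illustration, not the corpus reader's actual code; parse_tagged() is a name made up here (nltk itself offers nltk.tag.str2tuple() for single tokens).

```python
def parse_tagged(line):
    """Split 'word/tag' tokens into (word, TAG) pairs.

    Splitting at the LAST slash keeps tokens such as './.' intact.
    Upper-casing mirrors the tags shown in the .tagged_words() output.
    """
    pairs = []
    for token in line.split():
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag.upper()))
    return pairs


# A shortened piece of the first ca01 sentence shown above
raw = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd ./."
tagged = parse_tagged(raw)
for word, tag in tagged:
    print(word, tag)
```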

9 nltk book: chapter 5 Pre-tagged corpus: .tagged_sents()
>>> from nltk.corpus import brown
>>> brown.tagged_sents()
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')],
 [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'), ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'), ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','), ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'), ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'), ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'), ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'), ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'), ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'), ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')],
 ...]

10 nltk book: chapter 5 Brown tagset (wikipedia)
85 tags, plus modifiers: * negated (e.g. BER*); -NC emphasis; -HL headline; -TL title; FW- foreign word
Universal tagset: a dozen tags

11 nltk book: chapter 5
>>> len(brown.tagged_words(categories='news', tagset='universal'))
>>> fd = nltk.FreqDist(tag for word,tag in brown.tagged_words(categories='news', tagset='universal'))
>>> fd.most_common()
[('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389), ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264), ('NUM', 2166), ('X', 92)]
>>> fd.plot()
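The counting step behind FreqDist can be sketched with collections.Counter from the standard library. This is an illustration rather than nltk's implementation; the data is the tagged sentence from slide 4, and the variable names are chosen here.

```python
from collections import Counter

# Tagged sentence taken from the pos_tag() output on slide 4
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# Count tags only, ignoring the words (same idea as the generator passed to FreqDist)
fd = Counter(tag for word, tag in tagged)
total = sum(fd.values())

# Print each tag with its count and relative frequency, most frequent first
for tag, count in fd.most_common():
    print(f"{tag:4} {count:3} {count / total:.2f}")
```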

12 nltk book: chapter 5 Note: '$' in each tag is replaced with 'S'; otherwise nltk crashes with $$ in the tag
>>> len(brown.tagged_words(categories='news'))
>>> fd = nltk.FreqDist(tag.replace('$','S') for word,tag in brown.tagged_words(categories='news'))
>>> fd.most_common()
[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), ('NNS', 5276), (',', 5133), ('.', 4452), ('JJ', 4392), ('CC', 2664), ('VBD', 2524), ('NN-TL', 2486), ('VB', 2440), ('VBN', 2269), ('RB', 2166), ('PPS', 2107), ('CD', 2020), ('CS', 1509), ('VBG', 1398), ('TO', 1237), ('MD', 1031), ('AP', 923), ('NP-TL', 741), ('``', 732), ('BEZ', 730), ('BEDZ', 716), ("''", 702), ('JJ-TL', 689), ('PPSS', 605), ('DT', 589), ('BE', 525), ('VBZ', 519), ...]
>>> fd.plot()

13 nltk book: chapter 5 Zipf's Law: freq(POS_tag) ∝ 1/rank(POS_tag)
>>> import nltk
>>> from nltk.corpus import brown
>>> fd = nltk.FreqDist(tag.replace('$','S') for word,tag in brown.tagged_words(categories='news'))
>>> l = sorted((item[1]/fd.N() for item in fd.items()),reverse=True)
>>> import matplotlib.pyplot as plt
>>> plt.plot(l)
[<matplotlib.lines.Line2D object at 0x118c2f7f0>]
>>> plt.grid(True)
>>> plt.xlabel('log(rank(tag))')
Text(0.5,0,'log(rank(tag))')
>>> plt.ylabel('log(freq(tag))')
Text(0,0.5,'log(freq(tag))')
>>> plt.xscale('log')
>>> plt.yscale('log')
>>> plt.show()
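If freq ∝ 1/rank, the log-log plot is a straight line with slope -1. One rough numerical check is a least-squares fit of log(count) against log(rank). The sketch below uses only the standard library; loglog_slope() is a helper invented here, and the ideal counts are made up to show the exact Zipfian case next to the universal-tag counts from slide 11.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(count) against log(rank), with rank starting at 1."""
    ranked = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented, perfectly Zipfian counts: count = 1200 / rank
ideal = [1200 / rank for rank in range(1, 7)]

# Universal-tag counts for the news category, copied from fd.most_common() on slide 11
universal_news = [30654, 14399, 12355, 11928, 11389, 6706,
                  3349, 2717, 2535, 2264, 2166, 92]

print("ideal slope:", loglog_slope(ideal))
print("universal tag slope:", loglog_slope(universal_news))
```

With only a dozen tags the fit is crude; it is meant as a sanity check on the plot, not a test of the law.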

14 nltk book: chapter 5 Zipf's Law: for Brown corpus words
>>> import nltk
>>> from nltk.corpus import brown
>>> fd = nltk.FreqDist(word for word in brown.words())
>>> lf = sorted((item[1]/fd.N() for item in fd.items()),reverse=True)
>>> lc = sorted((item[1] for item in fd.items()),reverse=True)
>>> import matplotlib.pyplot as plt
>>> plt.xscale('log')
>>> plt.yscale('log')
>>> plt.grid(True)
>>> plt.xlabel('log(rank)')
Text(0.5,0,'log(rank)')
>>> plt.ylabel('log(freq(word)) and log(count(word))')
Text(0,0.5,'log(freq(word)) and log(count(word))')
>>> plt.plot(lf)
[<matplotlib.lines.Line2D object at 0x11c8aa1d0>]
>>> plt.plot(lc)
[<matplotlib.lines.Line2D object at 0x11c8aacc0>]
>>> plt.show()

15 nltk book: chapter 5 2.8 Exploring Tagged Corpora
What kinds of words follow often?
>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_text = brown.tagged_words()
>>> tags = [two[1] for one,two in nltk.bigrams(tagged_text) if one[0]=='often']
>>> len(tags)
349
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
 VBN   VB  VBD   JJ  VBZ   AT   IN    ,   CS   QL   HV  BEN  VBG   DO   RB    .  BED   NN  QLP   CC   TO   AP  DOD   --  PPS  BER  PP$  BEZ  MD*   RP   MD  WRB  NNS   ''  HVZ   DT  HVD BER*
  61   51   36   30   24   18   18   16   13   12    6    6    6    5    5    4    3    3    3    3    2    2    2    2    2    2    2    2    1    1    1    1    1    1    1    1    1    1
Original tagset too fine-grained; repeat with the universal tagset:
>>> tagged_text = brown.tagged_words(tagset='universal')
>>> tags = [two[1] for one,two in nltk.bigrams(tagged_text) if one[0]=='often']
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
VERB  ADJ  ADP    .  DET  ADV NOUN  PRT CONJ PRON
 209   32   31   23   21   21    4    3    3    2
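The bigram filter can be tried without the corpus. In this sketch, zip() over the list and its one-step shift stands in for nltk.bigrams(), and the short tagged text is invented for illustration (it is not Brown data).

```python
from collections import Counter

# Invented toy data with universal-style tags
tagged_text = [("He", "PRON"), ("often", "ADV"), ("walks", "VERB"),
               ("and", "CONJ"), ("is", "VERB"), ("often", "ADV"),
               ("late", "ADJ"), (".", "."), ("They", "PRON"),
               ("often", "ADV"), ("sing", "VERB"), (".", ".")]


def bigrams(seq):
    """Pair each item with its successor: (s0, s1), (s1, s2), ..."""
    return zip(seq, seq[1:])


# Keep the tag of the second item whenever the first word is 'often'
tags = [second[1] for first, second in bigrams(tagged_text)
        if first[0] == "often"]
fd = Counter(tags)
print(fd.most_common())
```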

16 nltk book: chapter 5 Trigrams: V* TO V* (all tags)
>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents()
>>> nltk.trigrams(tagged_sents[0])
<generator object trigrams at 0x101c46780>
>>> list(nltk.trigrams(tagged_sents[0]))
[(('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')), (('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL')), (('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL')), (('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD')), (('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR')), (('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT')), (('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN')), (('an', 'AT'), …
Trigram: (tg[0],tg[1],tg[2]); tg[0][0] = 1st word; tg[0][1] = 1st tag, etc.
>>> l = [(tg[0][0],tg[1][0],tg[2][0]) for ts in tagged_sents for tg in nltk.trigrams(ts) if tg[1][1]=='TO' and tg[0][1].startswith('V') and tg[2][1].startswith('V')]
>>> fd = nltk.FreqDist(l)
>>> fd.most_common(10)
[(('trying', 'to', 'get'), 14), (('want', 'to', 'go'), 10), (('want', 'to', 'see'), 10), (('trying', 'to', 'make'), 10), (('going', 'to', 'get'), 10), (('going', 'to', 'take'), 7), (('wanted', 'to', 'know'), 7), (('trying', 'to', 'find'), 6), (('like', 'to', 'think'), 6), (('got', 'to', 'get'), 6)]
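The V* TO V* filter can be checked on a single sentence. The sketch below applies the same three conditions to the tagged sentence from slide 4; the trigrams() helper is a standard-library stand-in written here, not nltk.trigrams() itself.

```python
# Tagged sentence taken from the pos_tag() output on slide 4
sentence = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
            ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
            ('refuse', 'NN'), ('permit', 'NN')]


def trigrams(seq):
    """Yield consecutive triples: (s0, s1, s2), (s1, s2, s3), ..."""
    return zip(seq, seq[1:], seq[2:])


# Keep the three words when the tags match V*, TO, V*
matches = [(tg[0][0], tg[1][0], tg[2][0])
           for tg in trigrams(sentence)
           if tg[1][1] == 'TO'
           and tg[0][1].startswith('V')
           and tg[2][1].startswith('V')]
print(matches)
```

Only "refuse to permit" qualifies: in "us to obtain" the first tag is PRP, so the startswith('V') test fails.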

17 nltk book: chapter 5 Words with multiple POS tags:
>>> cfd = nltk.ConditionalFreqDist((word.lower(),tag) for (word,tag) in brown.tagged_words())
>>> sorted(((word,cfd[word].B()) for word in cfd.conditions()),key=lambda x:x[1],reverse=True)[:10]
[('that', 15), ('a', 13), ('to', 11), ('in', 10), ('home', 10), ('out', 9), ('well', 9), (':', 9), ('right', 9), ('it', 9)]
>>> cfd['that']
FreqDist({'CS': 6464, 'DT': 2260, 'WPS': 1654, 'WPO': 135, 'QL': 56, 'DT-NC': 6, 'DT-TL': 5, 'WPS-TL': 3, 'WPS-NC': 3, 'WPS-HL': 2, ...})
>>> cfd['a']
FreqDist({'AT': 22943, 'AT-HL': 60, 'AT-TL': 55, 'NN': 50, 'NP': 30, 'NP-HL': 20, 'NP-TL': 14, 'AT-NC': 8, 'FW-IN': 5, 'AT-TL-HL': 4, ...})
>>> cfd['to']
FreqDist({'TO': 14917, 'IN': 11046, 'IN-HL': 68, 'TO-HL': 55, 'IN-TL': 36, 'TO-NC': 13, 'TO-TL': 10, 'IN-NC': 8, 'NIL': 3, 'QL': 1, ...})

18 nltk book: chapter 5 Words with multiple POS tags:
>>> cfd = nltk.ConditionalFreqDist((word.lower(),tag) for (word,tag) in brown.tagged_words(tagset='universal'))
>>> sorted(((word,cfd[word].B()) for word in cfd.conditions()),key=lambda x:x[1],reverse=True)[:10]
[('down', 6), ('damn', 5), ('well', 5), ('to', 5), ('that', 5), ('round', 5), ('outside', 4), ('still', 4), ('opposite', 4), ('parallel', 4)]
>>> cfd['down']
FreqDist({'PRT': 696, 'ADP': 192, 'NOUN': 2, 'ADV': 2, 'VERB': 2, 'ADJ': 1})
>>> cfd['damn']
FreqDist({'ADJ': 13, 'VERB': 10, 'NOUN': 4, 'PRT': 3, 'ADV': 2})
>>> cfd['well']
FreqDist({'ADV': 723, 'PRT': 138, 'NOUN': 17, 'ADJ': 15, 'VERB': 4})
>>> cfd['to']
FreqDist({'PRT': 14995, 'ADP': 11158, 'X': 3, 'NOUN': 1, 'ADV': 1})
>>> cfd['that']
FreqDist({'ADP': 6467, 'DET': 2272, 'PRON': 1798, 'ADV': 56, 'X': 1})
>>> cfd['round']
FreqDist({'ADJ': 32, 'NOUN': 19, 'ADV': 14, 'VERB': 6, 'ADP': 4})
>>> cfd['outside']
FreqDist({'ADP': 83, 'ADV': 65, 'ADJ': 40, 'NOUN': 22})
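A conditional frequency distribution is, roughly, one counter per condition. As a sketch of the idea (not nltk's implementation), a defaultdict of Counter objects ranks words by how many distinct tags they receive; the data is again the slide 4 sentence, and len(cfd[word]) plays the role of .B().

```python
from collections import Counter, defaultdict

# Tagged sentence taken from the pos_tag() output on slide 4
tagged = [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
          ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
          ('refuse', 'NN'), ('permit', 'NN')]

# One Counter of tags per lower-cased word
cfd = defaultdict(Counter)
for word, tag in tagged:
    cfd[word.lower()][tag] += 1

# Rank words by number of distinct tags, largest first
ranking = sorted(((word, len(cfd[word])) for word in cfd),
                 key=lambda x: x[1], reverse=True)
print(ranking[:2])
print(dict(cfd['refuse']))
```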

19 nltk book: chapter 5
There is much more on tagging strategies in the remainder of this chapter, including the underlying technology behind the pos_tag() function we use.
But we're out of time in this course…

