LING 388: Computers and Language

Lecture 26

Reminder
Term projects: send me your PDF by the end of this week, please!

Today's lecture
Adapted from chapter 5 of the nltk textbook

nltk book: chapter 5
>>> from nltk import word_tokenize, pos_tag
>>> text = word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
 ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
Cognate object constructions:
>>> pos_tag(word_tokenize("They fight a good fight"))
[('They', 'PRP'), ('fight', 'VBD'), ('a', 'DT'), ('good', 'JJ'), ('fight', 'NN')]
>>> pos_tag(word_tokenize("He will dance a dance"))
[('He', 'PRP'), ('will', 'MD'), ('dance', 'VB'), ('a', 'DT'), ('dance', 'NN')]

nltk book: chapter 5
Advanced: .similar() normally prints; instead we capture its results into a list
>>> import nltk
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> import io
>>> from contextlib import redirect_stdout
>>> f = io.StringIO()
>>> with redirect_stdout(f):
...     text.similar('man')
...
>>> sman = f.getvalue().split()
>>> sman
['time', 'day', 'and', 'one', 'it', 'way', 'year', 'woman', 'state', 'men',
 'house', 'world', 'life', 'car', 'people', 'war', 'church', 'place', 'that', 'work']
>>> len(sman)
20
>>> f = io.StringIO()
>>> with redirect_stdout(f):
...     text.similar('woman')
...
>>> swoman = f.getvalue().split()
>>> swoman
['man', 'time', 'day', 'year', 'car', 'moment', 'world', 'family', 'house', 'boy',
 'country', 'child', 'state', 'job', 'way', 'girl', 'place', 'war', 'room', 'work']
>>> len(swoman)
20
>>> set(sman) - set(swoman)
{'and', 'people', 'woman', 'life', 'men', 'one', 'church', 'it', 'that'}
>>> set(swoman) - set(sman)
{'man', 'child', 'moment', 'job', 'country', 'boy', 'girl', 'family', 'room'}
Note: no women
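The capture trick on this slide works for any function that prints instead of returning a value. A minimal sketch using only the standard library: print_similar() and its toy word lists are hypothetical stand-ins for text.similar(), and capture_words() is an illustrative helper name, not part of nltk.

```python
import io
from contextlib import redirect_stdout


def print_similar(word):
    # Hypothetical stand-in for text.similar(): it prints and returns None.
    toy = {"man": "time day woman men", "woman": "man time day girl"}
    print(toy.get(word, ""))


def capture_words(func, *args):
    """Run func(*args) and return whatever it printed, split into words."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func(*args)
    return buffer.getvalue().split()


sman = capture_words(print_similar, "man")
swoman = capture_words(print_similar, "woman")

# Set difference: words printed for one query but not for the other.
only_man = set(sman) - set(swoman)
only_woman = set(swoman) - set(sman)
print(sorted(only_man))
print(sorted(only_woman))
```

Wrapping the capture in a helper avoids re-creating the StringIO buffer by hand for each query.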

nltk book: chapter 5
>>> text.similar('woman', 150)
man time day year car moment world family house boy country child state job way
girl place war room work question case week word men people problem city one
book church situation wall picture market system group fire table school class
doctor bed meeting stage president line little hand government and form night
story change thing others future hospital company other ball evening person
body trial month process town area program position project nation course south
point while ground report field game wife matter sun guy face west window
record land north surface back children air result plan life truth letter road
study press water enemy pool door right board floor party sense difference mind
patient century bottle schools part bridge period years play committee horse
river industry two union university crowd officer problems police store score
head list second latter words spirit past community note scene gun law women

nltk book: chapter 5
fileid

nltk book: chapter 5
Pre-tagged corpus: .tagged_words()
ca01:
The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd Friday/nr an/at investigation/nn of/in Atlanta's/np$ recent/jj primary/nn election/nn produced/vbd ``/`` no/at evidence/nn ''/'' that/cs any/dti irregularities/nns took/vbd place/nn ./.
The/at jury/nn further/rbr said/vbd in/in term-end/nn presentments/nns that/cs the/at City/nn-tl Executive/jj-tl Committee/nn-tl ,/, which/wdt had/hvd over-all/jj charge/nn of/in the/at election/nn ,/, ``/`` deserves/vbz the/at praise/nn and/cc thanks/nns of/in the/at City/nn-tl of/in-tl Atlanta/np-tl ''/'' for/in the/at manner/nn in/in which/wdt the/at election/nn was/bedz conducted/vbn ./.
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ...]
>>> nltk.corpus.brown.tagged_words(tagset='brown')
>>> nltk.corpus.brown.tagged_words(tagset='universal')
[('The', 'DET'), ('Fulton', 'NOUN'), ...]
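To see how the word/tag text of ca01 relates to the (word, TAG) pairs that .tagged_words() returns, here is a rough sketch that splits each token at its last slash. parse_tagged() is an illustrative helper, not nltk's corpus reader (nltk itself offers nltk.tag.str2tuple for single tokens); the sample is a shortened piece of the sentence above.

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/tag' tokens into (word, TAG) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits at the LAST slash, so './.' becomes ('.', '.').
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag.upper()))
    return pairs


sample = "The/at Fulton/np-tl County/nn-tl Grand/jj-tl Jury/nn-tl said/vbd ./."
tagged = parse_tagged(sample)
print(tagged)
```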

nltk book: chapter 5
Pre-tagged corpus: .tagged_sents()
>>> from nltk.corpus import brown
>>> brown.tagged_sents()
[[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'),
  ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'),
  ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'),
  ('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'),
  ('no', 'AT'), ('evidence', 'NN'), ("''", "''"), ('that', 'CS'), ('any', 'DTI'),
  ('irregularities', 'NNS'), ('took', 'VBD'), ('place', 'NN'), ('.', '.')],
 [('The', 'AT'), ('jury', 'NN'), ('further', 'RBR'), ('said', 'VBD'), ('in', 'IN'),
  ('term-end', 'NN'), ('presentments', 'NNS'), ('that', 'CS'), ('the', 'AT'),
  ('City', 'NN-TL'), ('Executive', 'JJ-TL'), ('Committee', 'NN-TL'), (',', ','),
  ('which', 'WDT'), ('had', 'HVD'), ('over-all', 'JJ'), ('charge', 'NN'),
  ('of', 'IN'), ('the', 'AT'), ('election', 'NN'), (',', ','), ('``', '``'),
  ('deserves', 'VBZ'), ('the', 'AT'), ('praise', 'NN'), ('and', 'CC'),
  ('thanks', 'NNS'), ('of', 'IN'), ('the', 'AT'), ('City', 'NN-TL'),
  ('of', 'IN-TL'), ('Atlanta', 'NP-TL'), ("''", "''"), ('for', 'IN'),
  ('the', 'AT'), ('manner', 'NN'), ('in', 'IN'), ('which', 'WDT'), ('the', 'AT'),
  ('election', 'NN'), ('was', 'BEDZ'), ('conducted', 'VBN'), ('.', '.')],
 ...]

nltk book: chapter 5
Brown tagset (Wikipedia): 85 tags
Tag modifiers: -* negated; -NC emphasis; -HL headline; -TL title; FW- foreign word
Universal tagset: a dozen tags

nltk book: chapter 5
>>> len(brown.tagged_words(categories='news', tagset='universal'))
>>> fd = nltk.FreqDist(tag for word,tag in brown.tagged_words(categories='news', tagset='universal'))
>>> fd.most_common()
[('NOUN', 30654), ('VERB', 14399), ('ADP', 12355), ('.', 11928), ('DET', 11389),
 ('ADJ', 6706), ('ADV', 3349), ('CONJ', 2717), ('PRON', 2535), ('PRT', 2264),
 ('NUM', 2166), ('X', 92)]
>>> fd.plot()
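Why does the universal tagset give so few distinct tags? Each fine-grained Brown tag is collapsed onto a coarse class before counting. A rough sketch of that idea with collections.Counter standing in for nltk.FreqDist: the TO_UNIVERSAL table is a small illustrative subset assumed for this example (not nltk's actual mapping file), and stripping everything after the first hyphen is a simplification.

```python
from collections import Counter

# Assumed, illustrative subset of a Brown -> universal mapping.
TO_UNIVERSAL = {"AT": "DET", "NN": "NOUN", "NP": "NOUN",
                "VBD": "VERB", "JJ": "ADJ", "IN": "ADP"}


def coarse(tag):
    """Drop modifiers such as -TL or -HL, then look up the coarse class."""
    base = tag.split("-")[0]
    return TO_UNIVERSAL.get(base, "X")  # 'X' for anything not in the toy table


# First few tagged words of ca01, as shown on the slides.
tagged = [("The", "AT"), ("Fulton", "NP-TL"), ("County", "NN-TL"),
          ("Grand", "JJ-TL"), ("Jury", "NN-TL"), ("said", "VBD"),
          ("an", "AT"), ("investigation", "NN"), ("of", "IN")]

fine_counts = Counter(tag for word, tag in tagged)
coarse_counts = Counter(coarse(tag) for word, tag in tagged)
print(fine_counts.most_common())
print(coarse_counts.most_common())
```

Even on nine tokens the number of distinct tags drops from seven to five.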

nltk book: chapter 5
Note: tag.replace('$','S') is needed below; otherwise nltk crashes with $$ in the tag
>>> len(brown.tagged_words(categories='news'))
>>> fd = nltk.FreqDist(tag.replace('$','S') for word,tag in brown.tagged_words(categories='news'))
>>> fd.most_common()
[('NN', 13162), ('IN', 10616), ('AT', 8893), ('NP', 6866), ('NNS', 5276),
 (',', 5133), ('.', 4452), ('JJ', 4392), ('CC', 2664), ('VBD', 2524),
 ('NN-TL', 2486), ('VB', 2440), ('VBN', 2269), ('RB', 2166), ('PPS', 2107),
 ('CD', 2020), ('CS', 1509), ('VBG', 1398), ('TO', 1237), ('MD', 1031),
 ('AP', 923), ('NP-TL', 741), ('``', 732), ('BEZ', 730), ('BEDZ', 716),
 ("''", 702), ('JJ-TL', 689), ('PPSS', 605), ('DT', 589), ('BE', 525),
 ('VBZ', 519), ...]
>>> fd.plot()

nltk book: chapter 5
Zipf's Law: freq(POS_tag) ∝ 1/rank(POS_tag)
>>> import nltk
>>> from nltk.corpus import brown
>>> fd = nltk.FreqDist(tag.replace('$','S') for word,tag in brown.tagged_words(categories='news'))
>>> l = sorted((item[1]/fd.N() for item in fd.items()),reverse=True)
>>> import matplotlib.pyplot as plt
>>> plt.plot(l)
[<matplotlib.lines.Line2D object at 0x118c2f7f0>]
>>> plt.grid(True)
>>> plt.xlabel('log(rank(tag))')
Text(0.5,0,'log(rank(tag))')
>>> plt.ylabel('log(freq(tag))')
Text(0,0.5,'log(freq(tag))')
>>> plt.xscale('log')
>>> plt.yscale('log')
>>> plt.show()
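If freq ∝ 1/rank, then log(freq) = c - log(rank), so the log-log plot should be roughly a straight line with slope -1. One way to check this numerically, sketched with the standard library only: zipf_slope() is an illustrative helper, and the counts are synthetic (made-up tags T1..T6 constructed to follow 1200/rank exactly), not Brown corpus data.

```python
import math
from collections import Counter


def zipf_slope(counts):
    """Least-squares slope of log(freq) against log(rank), ranks start at 1."""
    freqs = sorted(counts.values(), reverse=True)
    total = sum(freqs)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f / total) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic counts: 1200, 600, 400, 300, 240, 200 (exactly 1200 / rank).
ideal = Counter({f"T{rank}": 1200 // rank for rank in (1, 2, 3, 4, 5, 6)})
slope = zipf_slope(ideal)
print(round(slope, 6))
```

Real tag or word counts will not give exactly -1; how far the slope is from -1 is one measure of how well Zipf's Law fits.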

nltk book: chapter 5
Zipf's Law: for Brown corpus words
>>> import nltk
>>> from nltk.corpus import brown
>>> fd = nltk.FreqDist(word for word in brown.words())
>>> lf = sorted((item[1]/fd.N() for item in fd.items()),reverse=True)
>>> lc = sorted((item[1] for item in fd.items()),reverse=True)
>>> import matplotlib.pyplot as plt
>>> plt.xscale('log')
>>> plt.yscale('log')
>>> plt.grid(True)
>>> plt.xlabel('log(rank)')
Text(0.5,0,'log(rank)')
>>> plt.ylabel('log(freq(word)) and log(count(word))')
Text(0,0.5,'log(freq(word)) and log(count(word))')
>>> plt.plot(lf)
[<matplotlib.lines.Line2D object at 0x11c8aa1d0>]
>>> plt.plot(lc)
[<matplotlib.lines.Line2D object at 0x11c8aacc0>]
>>> plt.show()

nltk book: chapter 5
2.8 Exploring Tagged Corpora
What kinds of words follow often?
>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_text = brown.tagged_words()
>>> tags = [two[1] for one,two in nltk.bigrams(tagged_text) if one[0]=='often']
>>> len(tags)
349
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
VBN VB VBD JJ VBZ AT IN  , CS QL HV BEN VBG DO RB . BED NN QLP CC TO AP DOD -- PPS BER PP$ BEZ MD* RP MD WRB NNS '' HVZ DT HVD BER*
 61 51  36 30  24 18 18 16 13 12  6   6   6  5  5 4   3  3   3  3  2  2   2  2   2   2   2   2   1  1  1   1   1  1   1  1   1    1
Original tagset too fine-grained…
>>> tagged_text = brown.tagged_words(tagset='universal')
>>> tags = [two[1] for one,two in nltk.bigrams(tagged_text) if one[0]=='often']
>>> fd = nltk.FreqDist(tags)
>>> fd.tabulate()
VERB ADJ ADP  . DET ADV NOUN PRT CONJ PRON
 209  32  31 23  21  21    4   3    3    2
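The bigram trick pairs each tagged word with its successor and keeps the successor's tag whenever the first word is the target. A small sketch of the same idea with zip() and Counter in place of nltk.bigrams and nltk.FreqDist: tags_following() is an illustrative helper and the toy sentence is hand-tagged for this example, not taken from Brown.

```python
from collections import Counter


def tags_following(tagged_words, target):
    """Count the tags of the tokens that immediately follow `target`."""
    pairs = zip(tagged_words, tagged_words[1:])  # adjacent (one, two) pairs
    return Counter(two[1] for one, two in pairs if one[0] == target)


# Hand-tagged toy data (assumed for illustration).
toy = [("He", "PRON"), ("often", "ADV"), ("sings", "VERB"), (",", "."),
       ("and", "CONJ"), ("often", "ADV"), ("dances", "VERB"),
       ("with", "ADP"), ("often", "ADV"), ("loud", "ADJ"), ("music", "NOUN")]

following = tags_following(toy, "often")
print(following.most_common())
```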

nltk book: chapter 5
Trigrams: V* TO V* (all tags)
>>> import nltk
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents()
>>> nltk.trigrams(tagged_sents[0])
<generator object trigrams at 0x101c46780>
>>> list(nltk.trigrams(tagged_sents[0]))
[(('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL')),
 (('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL')),
 (('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL')),
 (('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD')),
 (('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR')),
 (('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT')),
 (('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN')),
 (('an', 'AT'), …
Trigram: (tg[0],tg[1],tg[2]); tg[0][0] = 1st word; tg[0][1] = 1st tag, etc.
>>> l = [(tg[0][0],tg[1][0],tg[2][0]) for ts in tagged_sents
...      for tg in nltk.trigrams(ts)
...      if tg[1][1]=='TO' and tg[0][1].startswith('V') and tg[2][1].startswith('V')]
>>> fd = nltk.FreqDist(l)
>>> fd.most_common(10)
[(('trying', 'to', 'get'), 14), (('want', 'to', 'go'), 10), (('want', 'to', 'see'), 10),
 (('trying', 'to', 'make'), 10), (('going', 'to', 'get'), 10), (('going', 'to', 'take'), 7),
 (('wanted', 'to', 'know'), 7), (('trying', 'to', 'find'), 6), (('like', 'to', 'think'), 6),
 (('got', 'to', 'get'), 6)]
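The V* TO V* filter can be tried on a couple of sentences without loading Brown. A sketch with the standard library: trigrams() here is a local zip-based stand-in for nltk.trigrams, verb_to_verb() is an illustrative helper, the first sentence reuses the pos_tag() output from the start of the lecture, and the second is hand-tagged for this example.

```python
from collections import Counter


def trigrams(seq):
    """Consecutive triples from a list (local stand-in for nltk.trigrams)."""
    return zip(seq, seq[1:], seq[2:])


def verb_to_verb(tagged_sents):
    """Count (word, 'to', word) triples whose tags match V*, TO, V*."""
    hits = Counter()
    for sent in tagged_sents:  # per sentence, so triples never cross a boundary
        for first, second, third in trigrams(sent):
            if (second[1] == "TO" and first[1].startswith("V")
                    and third[1].startswith("V")):
                hits[(first[0], second[0], third[0])] += 1
    return hits


sents = [
    [("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
     ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
     ("refuse", "NN"), ("permit", "NN")],
    [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("go", "VB")],  # hand-tagged
]
hits = verb_to_verb(sents)
print(hits.most_common())
```

Note that "us to obtain" is skipped because the first tag is PRP, not V*.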

nltk book: chapter 5
Words with multiple POS tags:
>>> cfd = nltk.ConditionalFreqDist((word.lower(),tag) for (word,tag) in brown.tagged_words())
>>> sorted(((word,cfd[word].B()) for word in cfd.conditions()),key=lambda x:x[1],reverse=True)[:10]
[('that', 15), ('a', 13), ('to', 11), ('in', 10), ('home', 10), ('out', 9),
 ('well', 9), (':', 9), ('right', 9), ('it', 9)]
>>> cfd['that']
FreqDist({'CS': 6464, 'DT': 2260, 'WPS': 1654, 'WPO': 135, 'QL': 56, 'DT-NC': 6,
          'DT-TL': 5, 'WPS-TL': 3, 'WPS-NC': 3, 'WPS-HL': 2, ...})
>>> cfd['a']
FreqDist({'AT': 22943, 'AT-HL': 60, 'AT-TL': 55, 'NN': 50, 'NP': 30, 'NP-HL': 20,
          'NP-TL': 14, 'AT-NC': 8, 'FW-IN': 5, 'AT-TL-HL': 4, ...})
>>> cfd['to']
FreqDist({'TO': 14917, 'IN': 11046, 'IN-HL': 68, 'TO-HL': 55, 'IN-TL': 36,
          'TO-NC': 13, 'TO-TL': 10, 'IN-NC': 8, 'NIL': 3, 'QL': 1, ...})
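What the ConditionalFreqDist computes here is one tag counter per lowercased word; .B() is then the number of distinct tags seen for that word. A minimal sketch of that structure with defaultdict(Counter): tag_table() and most_ambiguous() are illustrative helper names, and the data is the tagged "refuse permit" sentence from the start of the lecture.

```python
from collections import Counter, defaultdict


def tag_table(tagged_words):
    """Map each lowercased word to a Counter of the tags it appears with."""
    table = defaultdict(Counter)
    for word, tag in tagged_words:
        table[word.lower()][tag] += 1
    return table


def most_ambiguous(table, n=10):
    """Words ranked by number of distinct tags (what .B() reports)."""
    ranked = sorted(((word, len(tags)) for word, tags in table.items()),
                    key=lambda x: x[1], reverse=True)
    return ranked[:n]


tagged = [("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
          ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
          ("refuse", "NN"), ("permit", "NN")]

table = tag_table(tagged)
print(most_ambiguous(table, 2))
print(dict(table["refuse"]))
```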

nltk book: chapter 5
Words with multiple POS tags (tagset='universal'):
>>> cfd = nltk.ConditionalFreqDist((word.lower(),tag) for (word,tag) in brown.tagged_words(tagset='universal'))
>>> sorted(((word,cfd[word].B()) for word in cfd.conditions()),key=lambda x:x[1],reverse=True)[:10]
[('down', 6), ('damn', 5), ('well', 5), ('to', 5), ('that', 5), ('round', 5),
 ('outside', 4), ('still', 4), ('opposite', 4), ('parallel', 4)]
>>> cfd['down']
FreqDist({'PRT': 696, 'ADP': 192, 'NOUN': 2, 'ADV': 2, 'VERB': 2, 'ADJ': 1})
>>> cfd['damn']
FreqDist({'ADJ': 13, 'VERB': 10, 'NOUN': 4, 'PRT': 3, 'ADV': 2})
>>> cfd['well']
FreqDist({'ADV': 723, 'PRT': 138, 'NOUN': 17, 'ADJ': 15, 'VERB': 4})
>>> cfd['to']
FreqDist({'PRT': 14995, 'ADP': 11158, 'X': 3, 'NOUN': 1, 'ADV': 1})
>>> cfd['that']
FreqDist({'ADP': 6467, 'DET': 2272, 'PRON': 1798, 'ADV': 56, 'X': 1})
>>> cfd['round']
FreqDist({'ADJ': 32, 'NOUN': 19, 'ADV': 14, 'VERB': 6, 'ADP': 4})
>>> cfd['outside']
FreqDist({'ADP': 83, 'ADV': 65, 'ADJ': 40, 'NOUN': 22})

nltk book: chapter 5
Much more on tagging strategies in the remainder of this chapter.
See the underlying technology behind the pos_tag() function we use.
But we're out of time in this course…
