LING 3820 & 6820 Natural Language Processing Harry Howard


1 Text statistics 2
Day 24 - 10/22/14
LING 3820 & 6820 Natural Language Processing, Harry Howard, Tulane University

2 Course organization http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction. Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
NLP, Prof. Howard, Tulane University 24-Oct-2014

3 Open Spyder

4 Review of NLTK modules

5 Put it all in a single line
>>> temp = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> temp1 = temp.words()
>>> temp2 = Text(temp1)
>>> text = Text(PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8').words())

6 The text preparation function
def textLoader(doc):
    from nltk.corpus import PlaintextCorpusReader
    from nltk.text import Text
    return Text(PlaintextCorpusReader('', doc, encoding='utf-8').words())
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')

7 A dictionary maps a key to a value
key1:value1
key2:value2
key3:value3
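The mapping can be tried directly at the interpreter with a plain dictionary (a minimal sketch; the keys and values here are made up):

```python
# A dictionary literal maps each key to a value.
tally = {'key1': 'value1', 'key2': 'value2', 'key3': 'value3'}
print(tally['key2'])   # look up a value by its key -> value2
print(len(tally))      # number of key:value pairs -> 3
```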

8 Dictionary methods
>>> tallyDict = {'all':13, 'semantic':1, 'pardon':1, 'switched':1, 'Kindred':1}
>>> type(tallyDict)
>>> len(tallyDict)
>>> str(tallyDict)
>>> 'pardon' in tallyDict
>>> tallyDict.items()
>>> tallyDict.keys()
>>> tallyDict.values()
>>> tallyDict['all']

9 Make a dictionary in a loop
>>> wubCount = {}
>>> for word in text:
...     if word in wubCount:
...         wubCount[word] += 1
...     else:
...         wubCount[word] = 1
...
>>> len(wubCount)
>>> wubCount.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1), (u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1), (u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2), (u'shrugged', 1), (u'milk', 1),… ]
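The same tally can be written more compactly with dict.get, which supplies a default of 0 the first time a word is seen (a Python 3 sketch, using a short stand-in word list rather than the Wub text):

```python
words = ['the', 'wub', 'said', 'the', 'wub']  # stand-in for the corpus text
count = {}
for word in words:
    # get() returns 0 for unseen words, so no if/else is needed
    count[word] = count.get(word, 0) + 1
print(count)  # {'the': 2, 'wub': 2, 'said': 1}
```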

10 8.3.4. How to keep a tally with FreqDist()
FreqDist does all of the work of creating a dictionary of word counts for us, with the single caveat that it only works on NLTK text.

11 Create a FreqDist tally with list comprehension syntax
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word for word in text)
>>> wubFD.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1), (u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1), (u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2), (u'shrugged', 1), (u'milk', 1), (u'issues', 1), (u'...."', 1), (u'apartment', 1), (u'to', 57), (u'tail', 3), (u'dejectedly', 1), (u'squeezing', 1), (u'Not', 1), (u'sorry', 2), (u'Now', 2), (u'Eat', 1), (u'fists', 1), (u'And', 5)]

12 A FreqDist has all the dict methods
>>> type(wubFD)
>>> len(wubFD)
>>> str(wubFD)
>>> 'pardon' in wubFD
>>> wubFD.items()
>>> wubFD.keys()
>>> wubFD.values()

13 New terminology from statistics
>>> wubFD
<FreqDist with 929 samples and 3693 outcomes>
>>> len(wubFD.samples())
929
>>> len(wubFD.keys())
>>> wubFD.B()
>>> sum(wubFD.values())
3693
>>> wubFD.N()

14 key ~ sample, value ~ outcome
A dictionary maps a key to a value; an experiment observes an outcome from a sample.
key ~ sample (B counts the distinct samples)
value ~ outcome (N counts the total outcomes)
key1:value1 key2:value2 key3:value3
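The two counts can be illustrated with the standard library's collections.Counter as a stand-in for FreqDist (a sketch on a made-up word list: B is the number of distinct samples, N the total number of outcomes):

```python
from collections import Counter

words = ['the', 'wub', 'said', 'the', 'wub', 'the']  # stand-in text
fd = Counter(words)
B = len(fd)            # distinct samples (word types), like wubFD.B()
N = sum(fd.values())   # total outcomes (word tokens), like wubFD.N()
print(B, N)  # 3 6
```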

15 More new methods
>>> wubFD['the'] > wubFD['wub']
>>> wubFD.max() # ~ value to key
u'.'
>>> wubFD[wubFD.max()]
289
>>> wubFD.Nr(289) # ~ value to key
1
>>> wubFD.freq('.')

16 An example
>>> wubFD['wub']
54
>>> wubFD.N()
3693
>>> from __future__ import division
>>> 54/3693
>>> wubFD.freq('wub')
>>> round(wubFD.freq('wub'), 3)
0.015
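A relative frequency is just the sample's count divided by the total number of outcomes. In Python 3, where / is already true division, the slide's arithmetic can be checked directly:

```python
count, total = 54, 3693   # occurrences of 'wub' and total tokens, from the slide
freq = count / total      # relative frequency, like wubFD.freq('wub')
print(round(freq, 3))  # 0.015
```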

17 A hapax is a word that only appears once in a text
>>> wubFD.hapaxes()[:30]
[u'...."', u'1952', u'://', u'Am', u'An', u'Anything', u'Apparently', u'Are', u'Atomic', u'BEYOND', u'Back', u'Be', u'Beyond', u'Blundell', u'By', u'Cheer', u'DICK', u'Dick', u'Distributed', u'Do', u'Earth', u'Earthmen', u'Eat', u'Eating', u'End', u'Extensive', u'Finally', u'For', u'Good', u'Greg']
>>> len(wubFD.hapaxes())
592
>>> wubFD.Nr(1)
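Hapaxes are simply the samples whose count is 1, so the same query can be written against any tally (a sketch with collections.Counter and a stand-in word list):

```python
from collections import Counter

words = ['the', 'wub', 'said', 'the', 'wub', 'softly']  # stand-in text
fd = Counter(words)
# keep only the words observed exactly once
hapaxes = sorted(w for w, c in fd.items() if c == 1)
print(hapaxes)  # ['said', 'softly']
```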

18 How to create a table of results with tabulate()
>>> wubFD.tabulate(10)
. " the , I ' said The to ."
>>> wubFD.tabulate(10, 20)
wub it ," and of you ?" It his s
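A rough standard-library stand-in for tabulate(): Counter.most_common() returns (sample, count) pairs in descending order of count, which can then be printed as a small table (stand-in word list again):

```python
from collections import Counter

words = ['the', 'wub', 'said', 'the', 'wub', 'the']  # stand-in text
fd = Counter(words)
for word, count in fd.most_common(2):  # top 2 samples, like tabulate(2)
    print(word, count)
```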

19 wubFD.plot(50)

20 wubFD.plot(50, cumulative=True)

21 Zipf's law http://en.wikipedia.org/wiki/Zipf's_law
Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table. Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million). True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852). Only 135 vocabulary items are needed to account for half the Brown Corpus.
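Under an ideal Zipf distribution, the count at rank r is roughly the rank-1 count divided by r. The Brown-corpus figures quoted above can be checked against that prediction directly (a sketch; the counts are the ones cited in the text):

```python
# Brown-corpus counts quoted above: 'the', 'of', 'and' at ranks 1, 2, 3.
c_the, c_of, c_and = 69971, 36411, 28852
print(round(c_the / c_of, 2))   # rank1/rank2 ratio; Zipf predicts ~2
print(round(c_the / c_and, 2))  # rank1/rank3 ratio; Zipf predicts ~3
```
The first ratio comes out very close to 2; the third-rank ratio is rougher, which is typical — Zipf's law is an approximation, tightest at the top of the frequency table.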

22 Other examples The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel, and so on. The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.

23 What to do about the most frequent words
>>> import nltk
>>> stop = nltk.corpus.stopwords.words('english')
>>> stop
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
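The usual move is to filter the tally through the stopword list before counting or plotting. A sketch with a small hand-made stopword set standing in for nltk.corpus.stopwords.words('english'):

```python
# Stand-in stopword set; the real list comes from NLTK as shown above.
stop = {'i', 'the', 'and', 'of', 'to'}
words = ['The', 'wub', 'and', 'the', 'captain', 'said']
# lowercase before testing, since the stopword list is all lowercase
content = [w for w in words if w.lower() not in stop]
print(content)  # ['wub', 'captain', 'said']
```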

24 Next time
Homework: 8.3.3. Practice with dictionaries
Conditional frequency

