TEXT STATISTICS 3 DAY 25 - 10/24/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


1 TEXT STATISTICS 3 DAY 25 - 10/24/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
24-Oct-2014, NLP, Prof. Howard, Tulane University
• http://www.tulane.edu/~howard/LING3820/
• The syllabus is under construction.
• http://www.tulane.edu/~howard/CompCultEN/
• Chapter numbering
• 3.7. How to deal with non-English characters
• 4.5. How to create a pattern with Unicode characters
• 6. Control

3 Open Spyder

4 Review of dictionaries & FreqDist
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist

5 NLTK version
>>> import nltk
>>> print(nltk.__version__)

6 Clarification of dictionary methods
>>> tallyDict = {'all':13, 'semantic':1}
>>> tallyDict['all']             # returns 13
>>> tallyDict['pardon']          # KeyError: the key does not exist yet
>>> tallyDict['pardon'] = 1      # adds a new key
>>> tallyDict['pardon'] = 2      # overwrites the old value
>>> tallyDict = {'all':13, 'semantic':1, 'semantic':5}   # the later duplicate wins: 'semantic' maps to 5
>>> tallyDict[13]                # KeyError: lookup goes by key, not by value
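
The lookups on this slide can be checked in plain Python. The following is a minimal sketch that reuses the slide's tallyDict data; the helper keys_with_value is hypothetical (it is not from the slides) and only illustrates that going from a value back to its keys needs an explicit search.

```python
tally_dict = {'all': 13, 'semantic': 1}

# Looking up a key that is not in the dictionary raises KeyError.
try:
    tally_dict['pardon']
    missing_raises = False
except KeyError:
    missing_raises = True

# Assigning to a key creates it; assigning again overwrites the value.
tally_dict['pardon'] = 1
tally_dict['pardon'] = 2

# When a literal repeats a key, the last value is the one that is kept.
dup = {'all': 13, 'semantic': 1, 'semantic': 5}

# Hypothetical helper: find every key that maps to a given value.
def keys_with_value(d, value):
    return sorted(k for k, v in d.items() if v == value)

print(missing_raises, tally_dict['pardon'], dup['semantic'])
print(keys_with_value(tally_dict, 13))
```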

7 A dictionary maps a key to a value
key1:value1 key2:value2 key3:value3
key (type) : value (tokens)

8 Create a FreqDist tally with list-comprehension-style syntax (a generator expression)
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word for word in text)
>>> wubFD.items()[:30]
[(u'all', 13), (u'semantic', 1), (u'pardon', 1), (u'switched', 1), (u'Kindred', 1), (u'splashing', 1), (u'excellent', 1), (u'month', 1), (u'four', 1), (u'sunk', 1), (u'straws', 1), (u'sleep', 1), (u'skin', 1), (u'go', 8), (u'meditation', 2), (u'shrugged', 1), (u'milk', 1), (u'issues', 1), (u'...."', 1), (u'apartment', 1), (u'to', 57), (u'tail', 3), (u'dejectedly', 1), (u'squeezing', 1), (u'Not', 1), (u'sorry', 2), (u'Now', 2), (u'Eat', 1), (u'fists', 1), (u'And', 5)]

9 New terminology from statistics
>>> wubFD
>>> len(wubFD.keys())
929
>>> len(wubFD.samples())
929
>>> wubFD.B()
929
>>> sum(wubFD.values())
3693
>>> wubFD.N()
3693

10 A dictionary maps a key to a value; an experiment observes an outcome from a sample
key1:value1 key2:value2 key3:value3
key (type): sample, B
value (tokens): outcome, N
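
The relation between types (B) and tokens (N) can be sketched without NLTK. This example assumes collections.Counter from the standard library as a stand-in for FreqDist, and the token list is invented for illustration; it is not taken from Wub.txt.

```python
from collections import Counter

# Invented toy text: 8 tokens, 4 distinct types.
tokens = ['the', 'wub', 'said', 'the', 'wub', '.', 'the', '.']

fd = Counter(tokens)

# B: the number of distinct samples (types), i.e. the number of keys.
B = len(fd)

# N: the number of outcomes (tokens), i.e. the sum of the values.
N = sum(fd.values())

print(B, N)
```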

11 Value to key
>>> wubFD.max()
u'.'
>>> wubFD[u'.']
289
>>> wubFD.Nr(289)
1
>>> wubFD.r_Nr(289)
defaultdict(<type 'int'>, {0: 0, 1: 592, 2: 140, 3: 51, 4: 32, 5: 17, 6: 13, 7: 13, 8: 8, 9: 6, 10: 5, 11: 3, 12: 1, 13: 3, 14: 3, 15: 3, 17: 1, 18: 4, 19: 2, 20: 2, 21: 1, 22: 1, 23: 2, 25: 1, 26: 1, 28: 1, 30: 1, 33: 2, 34: 4, 164: 1, 37: 1, 39: 1, 41: 1, 48: 1, 53: 1, 54: 1, 56: 1, 57: 1, 59: 1, 61: 1, 66: 1, 69: 1, 289: 1, 141: 1, 146: 1})
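
The frequency-of-frequencies table ties the earlier numbers together: summing the Nr values gives the number of types, and summing r times Nr gives the number of tokens. The sketch below copies the table from the r_Nr output on this slide (leaving out the 0: 0 entry) and also shows, on an invented toy list, one way such a table could be built with collections.Counter; the toy data and variable names are assumptions for illustration.

```python
from collections import Counter

# r (a count) -> Nr (how many word types occur exactly r times),
# copied from the r_Nr output shown on the slide.
r_Nr = {
    1: 592, 2: 140, 3: 51, 4: 32, 5: 17, 6: 13, 7: 13, 8: 8, 9: 6, 10: 5,
    11: 3, 12: 1, 13: 3, 14: 3, 15: 3, 17: 1, 18: 4, 19: 2, 20: 2, 21: 1,
    22: 1, 23: 2, 25: 1, 26: 1, 28: 1, 30: 1, 33: 2, 34: 4, 37: 1, 39: 1,
    41: 1, 48: 1, 53: 1, 54: 1, 56: 1, 57: 1, 59: 1, 61: 1, 66: 1, 69: 1,
    141: 1, 146: 1, 164: 1, 289: 1,
}

# Every type falls into exactly one row, so the Nr values sum to B.
total_types = sum(r_Nr.values())

# Each of the Nr types in a row contributes r tokens, so r * Nr sums to N.
total_tokens = sum(r * n for r, n in r_Nr.items())

# Building the same kind of table from an invented toy tally:
toy_fd = Counter(['a', 'b', 'a', 'c', 'b', 'a'])   # a: 3, b: 2, c: 1
toy_r_Nr = Counter(toy_fd.values())                # one type per count

print(total_types, total_tokens)
```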

12 More new methods
>>> wubFD['the'] > wubFD['wub']
>>> wubFD.freq('.')
0.07825616030327646

13 An example
>>> wubFD['wub']
54
>>> wubFD.N()
3693
>>> from __future__ import division
>>> 54/3693
0.01462225832656377
>>> wubFD.freq('wub')
0.01462225832656377
>>> round(wubFD.freq('wub'), 3)
0.015
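
The arithmetic on this slide is just a count divided by the total number of tokens. A minimal sketch, using the counts shown on the slides (54 for 'wub', 289 for '.', 3693 tokens); the function name relative_frequency is hypothetical. The explicit float() keeps the division exact under Python 2 as well; in Python 3 the / operator already performs true division.

```python
def relative_frequency(count, total):
    """Share of all tokens that one sample accounts for."""
    return count / float(total)

wub_freq = relative_frequency(54, 3693)
period_freq = relative_frequency(289, 3693)

print(round(wub_freq, 3))
print(round(period_freq, 3))
```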

14 A hapax is a word that appears only once in a text
>>> wubFD.hapaxes()[:30]
[u'...."', u'1952', u'://', u'Am', u'An', u'Anything', u'Apparently', u'Are', u'Atomic', u'BEYOND', u'Back', u'Be', u'Beyond', u'Blundell', u'By', u'Cheer', u'DICK', u'Dick', u'Distributed', u'Do', u'Earth', u'Earthmen', u'Eat', u'Eating', u'End', u'Extensive', u'Finally', u'For', u'Good', u'Greg']
>>> len(wubFD.hapaxes())
592
>>> wubFD.Nr(1)
592
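
Finding hapaxes amounts to selecting the keys whose count is 1. The sketch below assumes collections.Counter as a stand-in for FreqDist and uses an invented token list; it also confirms that the number of hapaxes equals the number of types with a count of 1, which is the relation the slide shows with Nr(1).

```python
from collections import Counter

# Invented toy text for illustration.
tokens = ['the', 'wub', 'said', 'the', 'captain', '.', 'the', 'wub', '.']
fd = Counter(tokens)

# Hapaxes: samples that occur exactly once.
hapaxes = sorted(word for word, count in fd.items() if count == 1)

# Number of types whose count is 1, computed from the counts alone.
types_with_count_one = Counter(fd.values())[1]

print(hapaxes, types_with_count_one)
```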

15 How to create a table of results with tabulate()
>>> wubFD.tabulate(10)
    .    "  the    ,    I    ' said  The   to   ."
  289  164  146  141   69   66   61   59   57   56
>>> wubFD.tabulate(10, 20)
  wub   it   ,"  and   of  you   ?"   It  his    s
   54   53   48   41   39   37   34   34   34   34
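
Judging from the output on this slide, tabulate(10) lists the ten most frequent samples and tabulate(10, 20) lists the next ten. The same slicing can be sketched with collections.Counter and most_common(); the function tabulate_slice and the toy counts are hypothetical and only illustrate the idea, not NLTK's own implementation.

```python
from collections import Counter

def tabulate_slice(fd, start, end):
    """Return (samples, counts) for ranks start..end-1, most frequent first."""
    ranked = fd.most_common()[start:end]
    samples = [sample for sample, _ in ranked]
    counts = [count for _, count in ranked]
    return samples, counts

# Invented toy tally with distinct counts, so the ranking is unambiguous.
toy_fd = Counter({'.': 9, 'the': 7, 'wub': 5, 'said': 3, 'captain': 2, 'pig': 1})

top = tabulate_slice(toy_fd, 0, 3)    # the three most frequent samples
rest = tabulate_slice(toy_fd, 3, 6)   # the next three

# Print each slice as two aligned rows, samples above counts.
for samples, counts in (top, rest):
    print(' '.join('%8s' % s for s in samples))
    print(' '.join('%8d' % c for c in counts))
```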

16 wubFD.plot(50)

17 wubFD.plot(50, cumulative=True)
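
A cumulative plot shows running totals of the ranked counts rather than the counts themselves. As a rough sketch of the numbers behind such a curve, the example below takes the ten highest counts from the tabulate(10) output and the token total of 3693 from the earlier slides, and accumulates them with itertools.accumulate (Python 3). The variable names are chosen here for illustration.

```python
from itertools import accumulate

# Ten highest counts from the tabulate(10) output, in rank order.
top_counts = [289, 164, 146, 141, 69, 66, 61, 59, 57, 56]
total_tokens = 3693   # wubFD.N() on the earlier slide

# Running totals: the k-th entry is the number of tokens covered
# by the k most frequent samples.
cumulative_counts = list(accumulate(top_counts))

# Share of the whole text covered by the ten most frequent samples.
coverage = cumulative_counts[-1] / total_tokens

print(cumulative_counts)
print(round(coverage, 2))
```

Under these figures the ten most frequent samples, out of 929 types, account for roughly 30% of all tokens.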

18 Zipf's law
• http://en.wikipedia.org/wiki/Zipf's_law
• Zipf's law states that given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
• Thus the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
• For example, in the Brown Corpus of American English text, the word "the" is the most frequently occurring word, and by itself accounts for nearly 7% of all word occurrences (69,971 out of slightly over 1 million).
• True to Zipf's Law, the second-place word "of" accounts for slightly over 3.5% of words (36,411 occurrences), followed by "and" (28,852).
• Only 135 vocabulary items are needed to account for half the Brown Corpus.
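
The Brown Corpus figures quoted above can be compared with what an idealized Zipf distribution would predict, namely that the word at rank r occurs about 1/r times as often as the top word. This is only a sketch on the three counts given on the slide; the variable names are chosen here for illustration.

```python
# Counts quoted on the slide for the three most frequent Brown Corpus words.
brown_counts = {'the': 69971, 'of': 36411, 'and': 28852}

ranked = sorted(brown_counts.items(), key=lambda kv: kv[1], reverse=True)
top_count = ranked[0][1]

# Idealized Zipf prediction: count at rank r is top_count / r.
comparison = []
for rank, (word, observed) in enumerate(ranked, start=1):
    predicted = top_count / rank
    comparison.append((rank, word, observed, predicted))

# How many times more frequent "the" is than the words at ranks 2 and 3.
ratio_2 = top_count / brown_counts['of']
ratio_3 = top_count / brown_counts['and']

for rank, word, observed, predicted in comparison:
    print(rank, word, observed, int(predicted))
print(round(ratio_2, 2), round(ratio_3, 2))
```

The ratio at rank 2 comes out close to the predicted 2, while the ratio at rank 3 falls short of 3, which fits the slide's wording that the relation holds approximately.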

19 Other examples
• The same relationship occurs in many other rankings unrelated to language, such as the population ranks of cities in various countries, corporation sizes, income rankings, ranks of number of people watching the same TV channel, and so on.
• The appearance of the distribution in rankings of cities by population was first noticed by Felix Auerbach in 1913.

20 What to do about the most frequent words
>>> import nltk
>>> stop = nltk.corpus.stopwords.words('english')
>>> stop
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
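
One common use of such a list is to drop stop words before counting. The sketch below assumes a small subset of the list shown above, stored as a set for fast membership tests, and an invented token list; because the stop words are all lowercase, each token is lowercased before the comparison.

```python
# A few entries copied from the stop word list on the slide.
stop = {'i', 'the', 'to', 'it', 'and', 'of', 'you', 'his', 's'}

# Invented toy tokens for illustration.
tokens = ['The', 'wub', 'said', 'to', 'the', 'captain', 'and', 'his', 'crew']

# Keep a token only if its lowercase form is not a stop word.
content_words = [word for word in tokens if word.lower() not in stop]

print(content_words)
```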

21 Next time
• Q7
• Conditional frequency

