TEXT STATISTICS 5 DAY 27 - 10/29/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

1 TEXT STATISTICS 5 DAY 27 - 10/29/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
29-Oct-2014 NLP, Prof. Howard, Tulane University
- http://www.tulane.edu/~howard/LING3820/
- The syllabus is under construction.
- http://www.tulane.edu/~howard/CompCultEN/
- Chapter numbering
  - 3.7. How to deal with non-English characters
  - 4.5. How to create a pattern with Unicode characters
  - 6. Control

3 Open Spyder

4 Review of dictionaries & FreqDist
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word.lower() for word in text)
>>> wubFD.plot()
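As a rough illustration of what FreqDist tallies, the sketch below uses the standard library's collections.Counter, which FreqDist subclasses in NLTK 3. The token list is made up for the example and is not the contents of Wub.txt.

```python
from collections import Counter

# Hypothetical tokens standing in for the words of a loaded text.
tokens = ["The", "wub", "looked", "at", "the", "captain", ".",
          "The", "captain", "sighed", "."]

# Lower-case every token before counting, as on the slide.
counts = Counter(word.lower() for word in tokens)

print(counts["the"])          # frequency of a single word
print(counts.most_common(3))  # the three most frequent words with their counts
```

Lower-casing first means "The" and "the" are tallied as one word type.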

5 wubFD.plot(50)

6 wubFD.plot(50, cumulative=True)
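A cumulative plot shows running totals rather than individual counts. The following is a minimal sketch of that arithmetic with the standard library only; the sentence is invented, and Counter stands in for FreqDist.

```python
from collections import Counter
from itertools import accumulate

# Made-up text; Counter stands in for FreqDist here.
tokens = "the wub said the captain said the wub was the wub".split()
counts = Counter(tokens)

# Frequencies from most to least frequent.
freqs = [count for _, count in counts.most_common()]

# Running totals: the i-th value covers the i+1 most frequent words.
cumulative = list(accumulate(freqs))

total = sum(freqs)
for (word, _), running in zip(counts.most_common(), cumulative):
    # word, tokens covered so far, and the same figure as a percentage
    print(word, running, round(100 * running / total))
```

The last running total always equals the number of tokens, which is why a cumulative curve flattens out at the top.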

7 Want to graph frequency by rank, logarithmically
(Plot axes: Rank, Frequency)

8 Tuples, cf. the (k, v) pairs in freqdist.items()
>>> singleton = (1)    # parentheses alone do not make a tuple; a one-element tuple is (1,)
>>> double = (1,2)
>>> triple = (1,2,3)
>>> quadruple = (1,2,3,4)
>>> singleton
>>> singleton[0]       # TypeError: an int cannot be indexed
>>> double[0]
>>> double[1]
>>> double[2]          # IndexError: the valid indices are 0 and 1
>>> triple[3]          # IndexError
>>> quadruple[4]       # IndexError
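The tuple behaviour above can be checked with a short script. This is only a sketch: the (word, count) pair is a hypothetical example of the kind of item a frequency distribution yields.

```python
singleton = (1,)     # the trailing comma makes this a tuple
pair = ("wub", 54)   # hypothetical (word, count) pair

print(type((1)).__name__)        # type of (1) without the comma
print(type(singleton).__name__)  # type of (1,)
print(singleton[0], pair[1])

# Tuple unpacking, the pattern behind `for (k, v) in ...`.
word, count = pair
print(word, count)

# Indexing past the last element raises IndexError.
try:
    pair[2]
except IndexError as err:
    print("IndexError:", err)
```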

9 range()
>>> range(5)    # Python 2; in Python 3, write list(range(5)) to see the list
[0, 1, 2, 3, 4]
# how would you get [1, 2, 3, 4, 5]?
>>> range(1,6)  # range(0+1,5+1)
[1, 2, 3, 4, 5]
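For readers on Python 3, where range() returns a lazy range object instead of a list, a small sketch of the same idea follows. The frequency values are invented for illustration.

```python
# In Python 3, wrap range() in list() to see its members.
print(list(range(5)))
print(list(range(1, 6)))

# Ranks start at 1, so shift both ends of the range by one.
freqs = [12, 7, 3]  # made-up frequencies, already sorted high to low
ranks = list(range(1, len(freqs) + 1))
print(list(zip(ranks, freqs)))
```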

10 How to plot values on a logarithmic scale
The task is to extract the values (outcomes) from the frequency distribution and graph them against their rank, without the words themselves. The values must be sorted from high to low so that their position reflects their rank.
>>> freq = [v for (k,v) in wubFD.items()]
>>> freq = sorted(freq,reverse=True)
>>> rank = range(1,len(freq)+1)
>>> import matplotlib.pyplot as plt
>>> plt.loglog(rank,freq)
>>> plt.title("Logarithmic rank-frequency plot of 'Beyond Lies the Wub'")
>>> plt.xlabel('Rank')
>>> plt.ylabel('Frequency')
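One way to put a number on how straight a rank-frequency curve is on log-log axes is to fit a line to the logged values. The sketch below is an addition, not part of the lecture: the helper names rank_frequency and loglog_slope are hypothetical, the tokens are synthetic, and Counter stands in for FreqDist.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (ranks, freqs) with freqs sorted from high to low."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs


def loglog_slope(ranks, freqs):
    """Least-squares slope of log(freq) against log(rank)."""
    xs = [math.log(r) for r in ranks]
    ys = [math.log(f) for f in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic tokens whose counts are 120, 60, 40, 30, 24, that is, 120 / rank.
tokens = ["a"] * 120 + ["b"] * 60 + ["c"] * 40 + ["d"] * 30 + ["e"] * 24
ranks, freqs = rank_frequency(tokens)
slope = loglog_slope(ranks, freqs)
print(freqs)
print(round(slope, 3))
```

Counts that fall off exactly as 1/rank give a slope of -1, a perfectly straight line on log-log axes. The same ranks and freqs lists can be passed to plt.loglog() as on the slide.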

11 The plot

12 Homework
Make a logarithmic rank-frequency plot of the words in the vampire novel. Is it straighter?

13 The problem of the most frequent words

14 What to do about the most frequent words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
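A stopword list is typically used as a filter before counting. The sketch below assumes a tiny hand-picked subset of the list above and an invented sentence; with NLTK installed, set(stopwords.words('english')) could be used in place of the small set.

```python
from collections import Counter

# A small hand-picked subset of the English stopword list (assumption:
# the full list would normally be used instead).
stop = {"the", "a", "of", "and", "to", "it", "was"}

# Invented sentence for illustration.
tokens = "The wub was a gift and the captain wanted to eat it".split()

# Keep only the tokens whose lower-cased form is not a stopword.
content = [w.lower() for w in tokens if w.lower() not in stop]
counts = Counter(content)

print(content)
print(counts.most_common(3))
```

Storing the stopwords in a set keeps each membership test fast, which matters when filtering a whole novel.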

15 Conditional frequency distribution

16 A corpus with genres or categories
19-mar-14 SPAN 4350 - Harry Howard - Tulane University
The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

17 Two simultaneous tallies
Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
Each pair can be understood as (condition, sample).

18 ConditionalFreqDist
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWords = [(c,w)
...     for c in cat
...     for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWords)
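Conceptually, a conditional frequency distribution keeps one tally per condition. The sketch below imitates that with a defaultdict of Counter objects from the standard library; the pairs are invented and cfd_like is a hypothetical name, so this is an approximation of the idea rather than NLTK's implementation.

```python
from collections import Counter, defaultdict

# Invented (condition, sample) pairs in the same shape as the list built above.
pairs = [("news", "The"), ("news", "jury"), ("news", "said"),
         ("romance", "She"), ("romance", "said"), ("romance", "said")]

# One Counter per condition, created on first use.
cfd_like = defaultdict(Counter)
for condition, sample in pairs:
    cfd_like[condition][sample] += 1

print(sorted(cfd_like))             # the conditions
print(cfd_like["romance"]["said"])  # count of one sample under one condition
print(cfd_like["news"].most_common(2))
```

Indexing first by condition and then by sample mirrors lookups of the form cfd['romance']['could'].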

19 Check results
>>> len(catWords)
170576
>>> catWords[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWords[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']
>>> cfd['romance']
>>> cfd['romance']['could']
193
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]

20 Next time
Q7 Conditional frequency

