TEXT STATISTICS 5 DAY 27 - 10/29/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


Course organization
29-Oct-2014, NLP, Prof. Howard, Tulane University
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control

Open Spyder

Review of dictionaries & FreqDist
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word.lower() for word in text)
>>> wubFD.plot()
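As a reminder of what the frequency distribution holds, here is a minimal sketch that uses Python's built-in collections.Counter instead of NLTK. The token list is a made-up stand-in for the contents of Wub.txt, not the real text.

```python
from collections import Counter

# Hypothetical toy tokens standing in for the text loaded from Wub.txt.
tokens = ["The", "wub", "looked", "at", "the", "captain",
          "and", "the", "captain", "looked", "away"]

# Lowercase each token before counting, as in the FreqDist call above.
counts = Counter(word.lower() for word in tokens)

# A frequency distribution behaves like a dictionary from word to count.
print(counts["the"])
print(counts.most_common(3))
```

In NLTK 3, FreqDist is a subclass of Counter, so dictionary-style lookup works the same way on wubFD.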

>>> wubFD.plot(50)

>>> wubFD.plot(50, cumulative=True)

Want to graph frequency by rank, logarithmically
(Figure labels: Rank, Frequency)

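Tuples, cf. the (k,v) pairs from freqdist.items()
>>> singleton = (1)   # without a trailing comma this is the integer 1; a one-item tuple is (1,)
>>> double = (1,2)
>>> triple = (1,2,3)
>>> quadruple = (1,2,3,4)
>>> singleton
>>> singleton[0]   # raises TypeError, because an integer cannot be indexed
>>> double[0]
>>> double[1]
>>> double[2]   # raises IndexError: the valid indices are 0 and 1
>>> triple[3]   # raises IndexError
>>> quadruple[4]   # raises IndexError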
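To connect tuples back to frequency distributions, the sketch below shows that a dictionary's items() yields (key, value) tuples that can be indexed or unpacked. The words and counts are invented for illustration.

```python
# A plain dictionary of counts; the words and numbers are made up.
counts = {"wub": 12, "captain": 9, "ship": 4}

# items() yields one (key, value) tuple per entry.
pairs = list(counts.items())
first = pairs[0]

# Indexing and unpacking give the same two parts of the tuple.
word, n = first
print(first[0], first[1])
print(word, n)

# Unpacking inside a comprehension keeps only the values.
values = [v for (k, v) in counts.items()]
print(values)
```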

range()
>>> range(5)
[0, 1, 2, 3, 4]
# how would you get [1, 2, 3, 4, 5]?
>>> range(1,6) # range(0+1,5+1)
[1, 2, 3, 4, 5]
# In Python 3, range() returns a range object; wrap it in list() to see these values.

How to plot values on a logarithmic scale
The task is to extract the values (outcomes) from the frequency distribution in order to graph them against their rank, without any words. The values must be sorted from high to low so that their position reflects their rank.
>>> freq = [v for (k,v) in wubFD.items()]
>>> freq = sorted(freq,reverse=True)
>>> rank = range(1,len(freq)+1)
>>> import matplotlib.pyplot as plt
>>> plt.loglog(rank,freq)
>>> plt.title("Logarithmic rank-frequency plot of 'Beyond Lies the Wub'")
>>> plt.xlabel('Rank')
>>> plt.ylabel('Frequency')
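The same steps can be wrapped in a small helper for reuse on another text. This is only a sketch: the function name rank_frequency and the toy sentence are assumptions, and it counts with collections.Counter rather than FreqDist so that it runs without NLTK.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (ranks, freqs): counts sorted high to low, ranks starting at 1."""
    counts = Counter(t.lower() for t in tokens)
    freqs = sorted(counts.values(), reverse=True)
    ranks = list(range(1, len(freqs) + 1))
    return ranks, freqs

# Hypothetical toy text; substitute the tokens of a real document.
tokens = "the wub ate the food and the captain ate the wub".split()
ranks, freqs = rank_frequency(tokens)
print(list(zip(ranks, freqs)))

# With matplotlib installed, the two lists can be passed straight to
# plt.loglog(ranks, freqs) as in the session above.
```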

The plot

Homework
Make a logarithmic rank-frequency plot of the words in the vampire novel. Is it straighter?

The problem of the most frequent words

What to do about the most frequent words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
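One common use of such a list is to drop its members before counting, so that content words rise to the top of the distribution. The sketch below assumes a tiny hand-picked subset of the list above (named STOPWORDS here) and a made-up sentence, so that it runs without downloading the NLTK corpus.

```python
from collections import Counter

# Assumption: a small subset of the English stopword list shown above.
STOPWORDS = {"the", "and", "a", "of", "to", "was", "it", "he"}

# Hypothetical toy tokens; substitute the tokens of a real text.
tokens = "The captain and the wub ate a bowl of food".split()

# Keep only the tokens whose lowercase form is not a stopword.
content = [w for w in tokens if w.lower() not in STOPWORDS]
counts = Counter(w.lower() for w in content)

print(content)
print(counts.most_common(3))
```

With the full list, the set could be built as set(stopwords.words('english')) so that each membership test stays fast.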

Conditional frequency distribution

19-Mar-14, SPAN 4350, Harry Howard, Tulane University
A corpus with genres or categories
The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Two simultaneous tallies
Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
These can be understood as (condition, sample) pairs.

ConditionalFreqDist
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWords = [(c,w)
...             for c in cat
...             for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWords)

Check results
>>> len(catWords)
170576
>>> catWords[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWords[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd   # displays a one-line summary of the conditional distribution
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']   # the frequency distribution for the 'news' condition
>>> cfd['romance']   # the frequency distribution for the 'romance' condition
>>> cfd['romance']['could']
193
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]
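To see what the two simultaneous tallies amount to, here is a minimal sketch that builds the same kind of structure from plain Python: one Counter per condition, stored in a dictionary. The pairs are invented toy data, not Brown corpus counts, and the real ConditionalFreqDist offers more than this.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs in the same shape as catWords.
pairs = [("news", "the"), ("news", "jury"), ("news", "the"),
         ("romance", "the"), ("romance", "could"), ("romance", "could")]

# One frequency tally per condition.
cfd = defaultdict(Counter)
for condition, sample in pairs:
    cfd[condition][sample] += 1

print(sorted(cfd))              # the conditions
print(cfd["romance"]["could"])  # count of 'could' under 'romance'
print(cfd["news"]["could"])     # an unseen sample counts as zero
```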

Next time
Q7
Conditional frequency
