TEXT STATISTICS 7 DAY 30 - 11/05/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

1 TEXT STATISTICS 7 DAY 30 - 11/05/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
03-Nov-2014, NLP, Prof. Howard, Tulane University
  http://www.tulane.edu/~howard/LING3820/
  The syllabus is under construction.
  http://www.tulane.edu/~howard/CompCultEN/
  Chapter numbering
  3.7. How to deal with non-English characters
  4.5. How to create a pattern with Unicode characters
  6. Control

3 Final project

4 Open Spyder

5 Review

6 ConditionalFreqDist
    >>> from nltk.corpus import brown
    >>> from nltk.probability import ConditionalFreqDist
    >>> cat = ['news', 'romance']
    >>> catWord = [(c, w)
    ...            for c in cat
    ...            for w in brown.words(categories=c)]
    >>> cfd = ConditionalFreqDist(catWord)

7 Conditional frequency distribution
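
To make the idea concrete without loading a corpus, here is a minimal plain-Python sketch of what a conditional frequency distribution counts: one frequency table per condition. It uses `collections` rather than NLTK, and the pairs and the name `cfd_sketch` are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy (condition, sample) pairs standing in for (category, word) pairs.
# These are invented for illustration, not taken from the Brown corpus.
pairs = [
    ("news", "the"), ("news", "will"), ("news", "the"),
    ("romance", "could"), ("romance", "the"), ("romance", "could"),
]

# One Counter of samples per condition.
cfd_sketch = defaultdict(Counter)
for condition, sample in pairs:
    cfd_sketch[condition][sample] += 1

print(sorted(cfd_sketch))              # the conditions
print(cfd_sketch["news"]["the"])       # count of 'the' under 'news'
print(cfd_sketch["romance"]["could"])  # count of 'could' under 'romance'
```

NLTK's `ConditionalFreqDist` is built from the same kind of (condition, sample) pairs and is indexed the same way, first by condition and then by sample.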

8 A more interesting example
              can  could  may  might  must  will
    news       93     86   66     38    50   389
    religion   82     59   78     12    54    71
    hobbies   268     58  131     22    83   264
    sci fi     16     49    4     12     8    16
    romance    74    193   11     51    45    43
    humor      16     30    8      8     9    13

9 Conditions = categories, sample = modal verbs
    # from nltk.corpus import brown
    # from nltk.probability import ConditionalFreqDist
    >>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
    >>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
    >>> catWord = [(c, w)
    ...            for c in cat
    ...            for w in brown.words(categories=c)
    ...            if w in mod]
    >>> cfd = ConditionalFreqDist(catWord)
    >>> cfd.tabulate()
    >>> cfd.plot()

10 cfd.tabulate()
                     can  could  may  might  must  will
    news              93     86   66     38    50   389
    religion          82     59   78     12    54    71
    hobbies          268     58  131     22    83   264
    science_fiction   16     49    4     12     8    16
    romance           74    193   11     51    45    43
    humor             16     30    8      8     9    13

11 cfd.plot()

12 Another example
  The task is to find the frequency of 'America' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
    >>> from nltk.corpus import inaugural
    >>> inaugural.fileids()
    ['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']

13 cfd2.plot()

14 First try
    from nltk.corpus import inaugural
    from nltk.probability import ConditionalFreqDist
    keys = ['america', 'citizen']
    keyYear = [(w, title[:4])
               for title in inaugural.fileids()
               for w in inaugural.words(title)
               if w.lower() in keys]
    cfd2 = ConditionalFreqDist(keyYear)
    cfd2.plot()
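
The following sketch shows, on a handful of made-up tokens, how the exact-match filter in the first try behaves. The token list is hypothetical and not drawn from the inaugural corpus. The filter compares the lowercased word to the keys, but the raw word `w` is what gets used as the condition.

```python
from collections import Counter

keys = ['america', 'citizen']

# Hypothetical tokens for illustration only.
tokens = ['America', 'america', 'Americans', 'Citizen', 'citizen', 'citizens']

# Same test as the first try: keep a word only if its lowercase form
# is exactly one of the keys.
matched = [w for w in tokens if w.lower() in keys]

# The raw word is the condition, so case variants stay separate.
conditions = Counter(matched)

print(matched)
print(sorted(conditions))
```

Two things stand out: inflected forms such as 'Americans' and 'citizens' are dropped, and capitalized and lowercase spellings end up as separate conditions.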

15 cfd2.plot()

16 Second try
    from nltk.corpus import inaugural
    from nltk.probability import ConditionalFreqDist
    keys = ['america', 'citizen']
    # Use the matching key k as the condition, so every word that starts
    # with a key is counted under that key.
    keyYear = [(k, title[:4])
               for title in inaugural.fileids()
               for w in inaugural.words(title)
               for k in keys
               if w.lower().startswith(k)]
    cfd3 = ConditionalFreqDist(keyYear)
    cfd3.plot()
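
As a small illustration of the prefix test used in the second try, the sketch below applies `startswith` to a few made-up tokens and groups the hits under the key they match. The token list is hypothetical, and `Counter` stands in for the per-key counts that a conditional frequency distribution would hold.

```python
from collections import Counter

keys = ['america', 'citizen']

# Hypothetical tokens for illustration only.
tokens = ['America', 'Americans', 'american', 'citizen', 'Citizens', 'city']

# Pair each token with the key it starts with (after lowercasing).
pairs = [(k, w)
         for w in tokens
         for k in keys
         if w.lower().startswith(k)]

# Count hits per key rather than per raw word.
counts = Counter(k for k, _ in pairs)

print(pairs)
print(counts['america'], counts['citizen'])
```

Because the key is the condition, all spellings and inflections collapse into two lines on the plot, one per key. Prefix matching can also over-match: any word that merely begins with a key is counted.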

17 cfd3.plot()

18 Stemming

19 Third try
    from nltk.stem.snowball import EnglishStemmer
    stemmer = EnglishStemmer()
    from nltk.corpus import inaugural
    from nltk.probability import ConditionalFreqDist
    keys = ['america', 'citizen']
    keyYear = [(w, title[:4])
               for title in inaugural.fileids()
               for w in inaugural.words(title)
               if stemmer.stem(w) in keys]
    cfd4 = ConditionalFreqDist(keyYear)
    cfd4.plot()
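
To see what the stemmer in the third try does with the target words, a quick check like the following may help. It assumes NLTK is installed; the Snowball stemmer itself needs no corpus download. The word list is chosen for illustration.

```python
from nltk.stem.snowball import EnglishStemmer

stemmer = EnglishStemmer()

# Illustrative words; the stemmer lowercases its input before stemming.
words = ['America', 'American', 'Americans', 'citizen', 'citizens']
stems = {w: stemmer.stem(w) for w in words}

for w in words:
    print(w, '->', stems[w])
```

The stem of 'citizens' is 'citizen', so plural forms now match the key. 'American' and 'Americans' both stem to 'american', which is not in `keys`, so the third try counts them differently from the prefix test in the second try.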

20 cfd4.plot()

21 Next time: Twitter

