LING 3820 & 6820 Natural Language Processing Harry Howard


1 Text statistics 6 Day 29 - 11/03/14
LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
NLP, Prof. Howard, Tulane University 03-Nov-2014

3 Open Spyder

4 Review of dictionaries & FreqDist
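As a quick refresher, here is a minimal sketch of tallying words first with a plain dictionary and then with collections.Counter. In NLTK 3, FreqDist is a subclass of Counter, so the standard-library class serves here as a stand-in. The sample sentence is made up for illustration and is not from the course materials.

```python
from collections import Counter

# Hypothetical sample text, not taken from the course materials.
tokens = "the cat saw the dog and the dog saw the cat".split()

# Tally with a plain dictionary: a missing key starts at 0, then add 1.
tally = {}
for token in tokens:
    tally[token] = tally.get(token, 0) + 1

# The same tally with Counter, the class that NLTK 3's FreqDist extends.
counts = Counter(tokens)

print(tally["the"])           # 4
print(counts.most_common(1))  # [('the', 4)]
```

Both approaches give the same counts; Counter (and so FreqDist) just saves the bookkeeping.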

5 The plot

6 What to do about the most frequent words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

7 Usage
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> from nltk.corpus import stopwords
>>> wubFD = FreqDist(word.lower() for word in text if word.lower() not in stopwords.words('english'))
>>> wubFD.plot(50)
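The filtering step can be sketched without the course files. In the version below, a tiny hand-picked stopword set stands in for stopwords.words('english'), the sentence is invented, and Counter stands in for FreqDist; all three are assumptions made only for illustration. The one design change is that the stopwords are stored once in a set before the loop, so each membership test is fast.

```python
from collections import Counter

# Assumption: a tiny hand-picked set stands in for stopwords.words('english').
stop = {"the", "a", "of", "and", "to", "in", "was"}

# Assumption: an invented sentence stands in for the text loaded from Wub.txt.
text = "The captain of the ship was afraid of the wub and the wub was calm".split()

# Lowercase each word, then keep it only if it is not a stopword.
content = [word.lower() for word in text if word.lower() not in stop]
fd = Counter(content)

print(content)            # ['captain', 'ship', 'afraid', 'wub', 'wub', 'calm']
print(fd.most_common(1))  # [('wub', 2)]
```

With NLTK itself, the same idea would be to assign set(stopwords.words('english')) to a variable first and test against that variable inside the generator.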

8 Plot without stopwords

9 A corpus with genres or categories
The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

10 Two simultaneous tallies
Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
Each pair can be understood as (condition, sample).
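To make the idea of two simultaneous tallies concrete, here is a minimal sketch that builds (condition, sample) pairs and counts them with a defaultdict of Counter objects. This mirrors how NLTK's ConditionalFreqDist is organized (a dictionary of frequency distributions keyed by condition), but the mini-corpus and the variable names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical mini-corpus: category label -> list of tokens.
corpus = {
    "news": ["The", "jury", "said", "the", "jury", "met"],
    "romance": ["She", "said", "she", "could", "not"],
}

# Pair every token with its category label: (condition, sample).
pairs = [(cat, word) for cat, words in corpus.items() for word in words]

# One tally per condition, all filled in a single pass over the pairs.
tallies = defaultdict(Counter)
for condition, sample in pairs:
    tallies[condition][sample] += 1

print(pairs[:2])                   # [('news', 'The'), ('news', 'jury')]
print(tallies["news"]["jury"])     # 2
print(tallies["romance"]["said"])  # 1
```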

11 ConditionalFreqDist
# from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c,w) for c in cat for w in brown.words(categories=c)]
>>> cfd=ConditionalFreqDist(catWord)

12 Conditional frequency distribution

13 Check results
>>> len(catWord)
170576
>>> catWord[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWord[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']
<FreqDist with outcomes>
>>> cfd['romance']
<FreqDist with outcomes>
>>> cfd['romance']['could']
193
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]

14 A more interesting example
           can  could    may  might   must   will
news        93     86     66     38     50    389
religion    82     59     78     12     54     71
hobbies    268     58    131     22     83    264
sci fi      16     49      4     12      8     16
romance     74    193     11     51     45     43
humor       16     30      8      8      9     13

15 Conditions = categories, sample = modal verbs
# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c,w) for c in cat for w in brown.words(categories=c) if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
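To show what a tabulation of conditions against samples involves, here is a rough sketch of a table printer over a conditional tally. The tabulate function below is a hypothetical helper written for this example, not NLTK's own method, and the two short "genres" are invented token lists.

```python
from collections import Counter, defaultdict

def tabulate(tallies, conditions, samples):
    """Return a text table: one row per condition, one column per sample."""
    width = max(len(c) for c in conditions)
    lines = [" " * width + "".join(f"{s:>7}" for s in samples)]
    for c in conditions:
        row = "".join(f"{tallies[c][s]:>7}" for s in samples)
        lines.append(f"{c:>{width}}" + row)
    return "\n".join(lines)

# Hypothetical token lists standing in for two Brown categories.
corpus = {
    "news": "they will meet and he can say they will".split(),
    "romance": "she could not and he could see she might".split(),
}
mod = ["can", "could", "might", "will"]

# Keep only the modal verbs and tally them per category.
tallies = defaultdict(Counter)
for cat, words in corpus.items():
    for w in words:
        if w in mod:
            tallies[cat][w] += 1

table = tabulate(tallies, ["news", "romance"], mod)
print(table)
```

A sample that never occurs under a condition simply shows up as 0, because a Counter returns 0 for a missing key.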

16 cfd.tabulate()
Columns: can could may might must will
Rows: news religion hobbies science_fiction romance humor

17 cfd.plot()

18 Another example
The task is to find the frequency of 'america' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']

19 The code
# from nltk.corpus import inaugural
# from nltk.probability import ConditionalFreqDist
>>> keys = ['america', 'citizen']
>>> keyYear = [(k, title[:4]) for title in inaugural.fileids() for w in inaugural.words(title) for k in keys if w.lower().startswith(k)]
>>> cfd2 = ConditionalFreqDist(keyYear)
>>> cfd2.tabulate()
>>> cfd2.plot()
annos = [fileid[:4] for fileid in inaugural.fileids()]
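The keyword-by-year tally can be tried out on made-up data. In the sketch below, the speeches dictionary is a hypothetical stand-in for the inaugural corpus and a defaultdict of Counter objects stands in for ConditionalFreqDist. Note that the word variable (w) and the keyword variable (k) need distinct names, and that startswith is given one keyword string at a time.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the inaugural corpus: file id -> tokens.
speeches = {
    "1789-Washington.txt": ["Citizens", "of", "America", "and", "fellow", "citizens"],
    "1793-Washington.txt": ["American", "citizen"],
}
keys = ["america", "citizen"]

# (condition, sample) = (keyword, year); the first four characters
# of each file id are the year.
key_year = [
    (k, fileid[:4])
    for fileid, words in speeches.items()
    for w in words
    for k in keys
    if w.lower().startswith(k)
]

# Tally the years separately for each keyword.
cfd2 = defaultdict(Counter)
for k, year in key_year:
    cfd2[k][year] += 1

print(cfd2["citizen"]["1789"])  # 2
print(cfd2["america"]["1793"])  # 1
```

Matching on prefixes means that 'American' counts toward 'america' and 'Citizens' toward 'citizen'.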

20 cfd2.tabulate() cannot be read
america citizen

21 It's something like …
1789 1793 1797 1801 1805 …
america 2 1 8
citizen 5 9 7 10

22 Same thing, changing axes
america citizen

23 cfd2.plot()

24 Next time
Q8
Twitter maybe

