LING 3820 & 6820 Natural Language Processing Harry Howard

Text statistics 6 Day 29 - 11/03/14
LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

Course organization http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
NLP, Prof. Howard, Tulane University 03-Nov-2014

Open Spyder

Review of dictionaries & FreqDist

The plot

What to do about the most frequent words
>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your',
 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her',
 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs',
 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those',
 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had',
 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if',
 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with',
 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after',
 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over',
 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where',
 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other',
 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too',
 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']

Usage
>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> from nltk.corpus import stopwords
>>> # build the stopword set once instead of reloading the list for every token
>>> stop = set(stopwords.words('english'))
>>> wubFD = FreqDist(word.lower() for word in text if word.lower() not in stop)
>>> wubFD.plot(50)
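The filtering step can be tried without NLTK or Wub.txt. The following is a minimal sketch, not the course code: it uses collections.Counter from the standard library in place of FreqDist (in NLTK 3, FreqDist is a Counter subclass), and both the STOP set and the tokens list are made up for illustration.

```python
from collections import Counter

# Hypothetical mini stopword list; NLTK's English list is much longer.
STOP = {"the", "a", "of", "and", "to", "was", "in", "it"}

# Hypothetical token list standing in for the loaded text.
tokens = ["The", "wub", "was", "in", "the", "hold", "of", "the", "ship",
          "and", "the", "wub", "ate", "a", "lot"]

# Lowercase each token once, then keep it only if it is not a stopword.
lowered = (tok.lower() for tok in tokens)
content_counts = Counter(tok for tok in lowered if tok not in STOP)

print(content_counts.most_common(3))
```

Testing membership in a set rather than a list keeps the filter fast on a long text, because each lookup does not have to scan the whole stopword list.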

Plot without stopwords

A corpus with genres or categories
The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies',
 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance',
 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Two simultaneous tallies
Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]
These can be understood as (condition, sample).
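To see what "two simultaneous tallies" means in plain Python, here is a small sketch under stated assumptions: it is not how NLTK implements ConditionalFreqDist, only the same idea built from a defaultdict of Counter objects, and the pairs list is invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical (condition, sample) pairs, in the same shape as the slide's list.
pairs = [("news", "The"), ("news", "Fulton"), ("news", "The"),
         ("romance", "could"), ("romance", "could"), ("romance", "The")]

# One Counter per condition: the outer key is the condition,
# the inner key is the sample, and the value is its tally.
tallies = defaultdict(Counter)
for condition, sample in pairs:
    tallies[condition][sample] += 1

print(sorted(tallies))               # the conditions
print(tallies["news"]["The"])        # tally of 'The' under 'news'
print(tallies["romance"]["could"])   # tally of 'could' under 'romance'
```

The lookup pattern tallies[condition][sample] matches the cfd[condition][sample] lookups used on the later slides.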

ConditionalFreqDist
# from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> catWord = [(c,w) for c in cat for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWord)

Conditional frequency distribution

Check results
>>> len(catWord)
170576
>>> catWord[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWord[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd
<ConditionalFreqDist with 2 conditions>
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']
<FreqDist with outcomes>
>>> cfd['romance']
<FreqDist with outcomes>
>>> cfd['romance']['could']
193
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]

A more interesting example
             can could   may might  must  will
news          93    86    66    38    50   389
religion      82    59    78    12    54    71
hobbies      268    58   131    22    83   264
sci fi        16    49     4     …     8     …
romance       74   193    11    51    45    43
humor          …    30     …     …     9    13

Conditions = categories, sample = modal verbs
# from nltk.corpus import brown
# from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> mod = ['can', 'could', 'may', 'might', 'must', 'will']
>>> catWord = [(c,w) for c in cat for w in brown.words(categories=c) if w in mod]
>>> cfd = ConditionalFreqDist(catWord)
>>> cfd.tabulate()
>>> cfd.plot()
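The same filter-then-tabulate pattern can be sketched without the Brown corpus. This is an illustration only: toy_corpus and format_table are hypothetical names that are not part of NLTK, the token lists are made up, and the layout only approximates what cfd.tabulate() prints.

```python
from collections import Counter, defaultdict

# Hypothetical tokens per category; real ones come from brown.words(categories=c).
toy_corpus = {
    "news": ["He", "will", "go", "and", "she", "will", "stay", "if", "she", "can"],
    "romance": ["She", "could", "not", "say", "what", "might", "happen",
                "if", "he", "could"],
}
modals = ["can", "could", "may", "might", "must", "will"]
modal_set = set(modals)  # set membership avoids scanning the list per token

# Same shape as the slide: keep only (category, word) pairs whose word is a modal.
pairs = [(cat, w) for cat, words in toy_corpus.items()
         for w in words if w in modal_set]

table = defaultdict(Counter)
for cat, w in pairs:
    table[cat][w] += 1

def format_table(table, rows, cols):
    """Return a plain-text table: one row per condition, one column per sample."""
    width = max(len(r) for r in rows)
    lines = [" " * width + "".join(f"{c:>6}" for c in cols)]
    for r in rows:
        lines.append(f"{r:>{width}}" + "".join(f"{table[r][c]:>6}" for c in cols))
    return "\n".join(lines)

print(format_table(table, list(toy_corpus), modals))
```

Note that, as on the slide, the match is case-sensitive: a sentence-initial 'Can' or 'Will' is not counted unless the word is lowercased first.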

cfd.tabulate()
[Table: columns can, could, may, might, must, will; rows news, religion, hobbies, science_fiction, romance, humor]

cfd.plot()

Another example
The task is to find the frequency of 'america' and 'citizen' in NLTK's corpus of presidential inaugural addresses:
>>> from nltk.corpus import inaugural
>>> inaugural.fileids()
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', ..., '2009-Obama.txt']

The code
# from nltk.corpus import inaugural
# from nltk.probability import ConditionalFreqDist
>>> keys = ['america', 'citizen']
>>> # pair each key with the year (first four characters of the file name) of every word that starts with it
>>> keyYear = [(k, title[:4]) for title in inaugural.fileids() for w in inaugural.words(title) for k in keys if w.lower().startswith(k)]
>>> cfd2 = ConditionalFreqDist(keyYear)
>>> cfd2.tabulate()
>>> cfd2.plot()
>>> # the years on their own
>>> annos = [fileid[:4] for fileid in inaugural.fileids()]
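The prefix match and the year slice can be checked on a toy example before running them on the real corpus. The sketch below is an assumption-laden stand-in: toy_addresses imitates inaugural.fileids() and inaugural.words(fileid) with invented tokens, and a defaultdict of Counter objects replaces ConditionalFreqDist.

```python
from collections import Counter, defaultdict

# Hypothetical file ids and token lists; the real corpus is not loaded here.
toy_addresses = {
    "1789-Washington.txt": ["Fellow", "Citizens", "of", "America", ",", "citizens"],
    "1793-Washington.txt": ["American", "citizen", "and", "the", "Americas"],
}
keys = ["america", "citizen"]

# Pair each matching key with the year (first four characters of the file id).
key_year = [(k, fileid[:4])
            for fileid, words in toy_addresses.items()
            for w in words
            for k in keys
            if w.lower().startswith(k)]

counts = defaultdict(Counter)
for k, year in key_year:
    counts[k][year] += 1

print(dict(counts["america"]))
print(dict(counts["citizen"]))
```

Because the test is startswith on the lowercased word, 'American', 'Americas' and 'Citizens' all count toward their key, which is why the tallies cover more than the two exact words.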

cfd2.tabulate() cannot be read
[Table: rows america, citizen; one column per inaugural year]

It's something like …
years:   1789 1793 1797 1801 1805 …
america: 2 1 8
citizen: 5 9 7 10

Same thing, changing axes
america citizen

cfd2.plot()

Next time
Q8
Twitter maybe
