TEXT STATISTICS 5 DAY 27 - 10/29/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.



Course organization
29-Oct-2014, NLP, Prof. Howard, Tulane University

The syllabus is under construction.
Chapter numbering:
  3.7. How to deal with non-English characters
  4.5. How to create a pattern with Unicode characters
  6. Control

Open Spyder

Review of dictionaries & FreqDist

>>> from corpFunctions import textLoader
>>> text = textLoader('Wub.txt')
>>> from nltk.probability import FreqDist
>>> wubFD = FreqDist(word.lower() for word in text)
>>> wubFD.plot()

wubFD.plot(50)

wubFD.plot(50, cumulative=True)
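With cumulative=True the plot shows running totals rather than individual counts. A minimal sketch of that idea using only the standard library; the toy token list is an assumption standing in for the course text, and Counter stands in for FreqDist:

```python
from collections import Counter
from itertools import accumulate

# Hypothetical toy tokens, not the contents of 'Wub.txt'.
tokens = "the wub said the captain ate the wub and the crew ate too".split()
counts = Counter(tokens)

# most_common(n) yields (word, count) pairs from most to least frequent.
top = counts.most_common(3)
top_counts = [c for _, c in top]

# Running total: each entry adds the count of the next most frequent word.
cumulative = list(accumulate(top_counts))
print(top_counts, cumulative)
```

The last cumulative value is the number of tokens covered by the top words, which is why a cumulative curve always rises.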

Want to graph frequency by rank, logarithmically

[Figure: axes labeled Rank and Frequency]

Tuples, cf. the (k, v) pairs in freqdist.items()

>>> singleton = (1)
>>> double = (1,2)
>>> triple = (1,2,3)
>>> quadruple = (1,2,3,4)
>>> singleton
>>> singleton[0]
>>> double[0]
>>> double[1]
>>> double[2]
>>> triple[3]
>>> quadruple[4]
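Two points in the tuple exercise are easy to miss: parentheses alone do not make a one-element tuple, and indices run from 0 to len - 1. A small sketch of both, with hypothetical variable names and a made-up list of (word, count) pairs:

```python
single_int = (1)      # parentheses alone: this is the int 1, not a tuple
single_tuple = (1,)   # the trailing comma makes a one-element tuple
double = (1, 2)

print(type(single_int).__name__)    # int
print(type(single_tuple).__name__)  # tuple
print(double[0], double[1])         # 1 2

# Indexing past the end raises IndexError; valid indices are 0 .. len - 1.
try:
    double[2]
except IndexError as err:
    print("IndexError:", err)

# Unpacking (k, v) pairs, the shape that a frequency distribution's items() yields.
pairs = [("the", 4), ("wub", 2)]
values = [v for (k, v) in pairs]
print(values)
```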

range()

>>> list(range(5))  # list() makes the values visible in Python 3
[0, 1, 2, 3, 4]
# how would you get [1, 2, 3, 4, 5]?
>>> list(range(1,6))  # range(0+1,5+1)
[1, 2, 3, 4, 5]

How to plot values on a logarithmic scale

The task is to extract the values/outcomes from the frequency distribution in order to graph them against their rank without any words. The values must be sorted from high to low in order to reflect their rank order.

>>> freq = [v for (k,v) in wubFD.items()]
>>> freq = sorted(freq,reverse=True)
>>> rank = range(1,len(freq)+1)
>>> import matplotlib.pyplot as plt
>>> plt.loglog(rank,freq)
>>> plt.title("Logarithmic rank-frequency plot of 'Beyond Lies the Wub'")
>>> plt.xlabel('Rank')
>>> plt.ylabel('Frequency')
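The extraction steps can be tried without NLTK or matplotlib. The following is a sketch under assumptions: a toy token list replaces 'Wub.txt', collections.Counter replaces FreqDist, and the log-log coordinates are computed directly instead of being plotted.

```python
from collections import Counter
import math

# Hypothetical toy tokens, not the contents of 'Wub.txt'.
tokens = "the wub said the captain ate the wub and the crew ate too".split()

counts = Counter(tokens)                       # word -> frequency
freq = sorted(counts.values(), reverse=True)   # frequencies, high to low
rank = list(range(1, len(freq) + 1))           # one rank per word type, starting at 1

# A log-log plot places each point at (log rank, log frequency).
log_points = [(math.log10(r), math.log10(f)) for r, f in zip(rank, freq)]
print(list(zip(rank, freq)))
```

Sorting before assigning ranks is what ties rank 1 to the most frequent word; the words themselves are discarded because only the counts are plotted.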

The plot

[Figure: logarithmic rank-frequency plot of 'Beyond Lies the Wub']

Homework  Make a logarithmic rank-frequency plot of the words in the vampire novel. Is it straighter? 29-Oct-2014NLP, Prof. Howard, Tulane University 12

The problem of the most frequent words

What to do about the most frequent words

>>> from nltk.corpus import stopwords
>>> stopwords.words('english')
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', 'her', 'hers', 'herself', 'it', 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', 'should', 'now']
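One common use of a stopword list is to filter the tokens before counting, so that function words do not dominate the distribution. A minimal sketch, assuming a hand-picked mini stopword set and a toy sentence in place of NLTK's list and the course text:

```python
from collections import Counter

# Hypothetical mini stopword set; the slide's list comes from stopwords.words('english').
stop = {"the", "and", "a", "of", "to", "it", "was"}

# Hypothetical toy tokens.
tokens = "the wub was heavy and the captain wanted to eat it".split()

# Lowercase each token before the membership test, since the stopword list is lowercase.
content = [w for w in tokens if w.lower() not in stop]
content_counts = Counter(content)
print(content)
```

Storing the stopwords in a set keeps each membership test fast, which matters once the token list is a whole novel.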

Conditional frequency distribution

A corpus with genres or categories

The Brown corpus has 1,161,192 samples (words) divided into 15 genres or categories:

>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]

Two simultaneous tallies

Every token of the corpus is paired with a category label:
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ...]

These can be understood as (condition, sample).
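The idea of tallying (condition, sample) pairs can be sketched with the standard library alone. This is an illustration under assumptions, not NLTK's implementation: the toy corpus and its contents are made up, and a defaultdict of Counters plays the role of the conditional frequency distribution.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus: category label -> tokens (stand-in for Brown genres).
toy = {
    "news": ["The", "jury", "said", "the", "jury"],
    "romance": ["She", "said", "she", "could"],
}

# Pair every token with its category label: (condition, sample).
pairs = [(c, w) for c in toy for w in toy[c]]

# One Counter per condition: the outer key is the condition, the inner key the sample.
cfd = defaultdict(Counter)
for condition, sample in pairs:
    cfd[condition][sample] += 1

print(sorted(cfd))
print(cfd["news"]["jury"], cfd["romance"]["could"])
```

Looking up cfd[condition][sample] mirrors the two-step indexing used with a conditional frequency distribution: first the category, then the word.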

ConditionalFreqDist

>>> from nltk.corpus import brown
>>> from nltk.probability import ConditionalFreqDist
>>> cat = ['news', 'romance']
>>> # pair each word with its category: (condition, sample)
>>> catWords = [(c,w)
...     for c in cat
...     for w in brown.words(categories=c)]
>>> cfd = ConditionalFreqDist(catWords)

Check results

>>> len(catWords)
>>> catWords[:4]
[('news', 'The'), ('news', 'Fulton'), ('news', 'County'), ('news', 'Grand')]
>>> catWords[-4:]
[('romance', 'afraid'), ('romance', 'not'), ('romance', "''"), ('romance', '.')]
>>> cfd
>>> cfd.conditions()
['news', 'romance']
>>> cfd['news']
>>> cfd['romance']
>>> cfd['romance']['could']
>>> list(cfd['romance'])
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was', 'with', 'you', 'for', 'at', 'He', 'on', 'him', 'said', '!', 'I', 'in', 'he', 'had', '?', 'her', 'that', 'it', 'his', 'she', ...]

Next time

Q7: Conditional frequency