NLTK & Python, Day 5
LING 681.02 Computational Linguistics
Harry Howard, Tulane University


31-Aug-2009, LING 681.02, Prof. Howard, Tulane University

Course organization
I have requested that Python and NLTK be installed on the computers in this room.

NLPP §1.2 A Closer Look at Python: Texts as Lists of Words

Variables
variable = expression
>>> my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
...            'forth', 'from', 'Camelot', '.']
>>> noun_phrase = my_sent[1:4]
>>> noun_phrase
['bold', 'Sir', 'Robin']
>>> wOrDs = sorted(noun_phrase)
>>> wOrDs
['Robin', 'Sir', 'bold']
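The slicing and sorting steps on this slide can be run outside the interpreter prompt as a short script. This is only a sketch of the same session; the names sorted_words and sorted_ignoring_case are not from the slide and are introduced here for illustration.

```python
# Rebuild the slide's sentence as a list of word tokens.
my_sent = ['Bravely', 'bold', 'Sir', 'Robin', ',', 'rode',
           'forth', 'from', 'Camelot', '.']

# The slice [1:4] takes the items at indexes 1, 2 and 3; the end index is excluded.
noun_phrase = my_sent[1:4]

# sorted() returns a new list; uppercase letters sort before lowercase ones.
sorted_words = sorted(noun_phrase)

# Illustrative extra: key=str.lower compares the words without regard to case.
sorted_ignoring_case = sorted(noun_phrase, key=str.lower)

print(noun_phrase)
print(sorted_words)
print(sorted_ignoring_case)
```

The default ordering explains why 'bold' comes last on the slide: it is the only word that starts with a lowercase letter.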

How to name variables
Valid names (or identifiers) …
  must start with a letter or an underscore, optionally followed by letters, digits, or underscores;
  are case-sensitive;
  cannot contain whitespace (use an underscore instead) or a dash (it means minus);
  cannot be a reserved word.
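The naming rules can be checked mechanically with the standard library. The sketch below uses str.isidentifier() and keyword.iskeyword(); the helper name is_valid_name and the list of candidates are made up for illustration.

```python
import keyword


def is_valid_name(name):
    """Hypothetical helper: True if name is a legal identifier and not a reserved word."""
    return name.isidentifier() and not keyword.iskeyword(name)


# Illustrative candidates: some legal, some breaking one rule each.
candidates = ['my_sent', 'wOrDs', 'text2', '2text', 'my sent', 'my-sent', 'not']

for name in candidates:
    print(repr(name), is_valid_name(name))
```

Here '2text' fails because it starts with a digit, 'my sent' and 'my-sent' fail because of the space and the dash, and 'not' fails because it is a reserved word.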

Strings
A string is a sequence of characters, such as an individual word. It is not a single-element list, but many list operations also work on strings.
Some operations and methods for strings
>>> name = 'Monty'
>>> name[0]
'M'
>>> name[:4]
'Mont'
>>> name * 2
'MontyMonty'
>>> name + '!'
'Monty!'
>>> ' '.join(['Monty', 'Python'])
'Monty Python'
>>> 'Monty Python'.split()
['Monty', 'Python']
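A string and a one-word list look similar but behave differently, which a few lines of Python make concrete. This is a small sketch; the names one_word_list and string_is_immutable are introduced here and do not come from the slide.

```python
name = 'Monty'               # a string: a sequence of 5 characters
one_word_list = ['Monty']    # a list: a sequence of 1 item

print(len(name))             # number of characters
print(len(one_word_list))    # number of items
print(list(name))            # the string split into its characters

# Indexing a string gives a character; indexing the list gives the whole word.
print(name[0], one_word_list[0])

# A list item can be reassigned, but a string cannot be changed in place.
one_word_list[0] = 'Python'
string_is_immutable = False
try:
    name[0] = 'P'
except TypeError:
    string_is_immutable = True
print(string_is_immutable)
```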

NLPP §1.3. Computing with Language: Simple Statistics

Frequency distribution
What is a frequency distribution?
  It tells us the frequency of each vocabulary item in a text.
  It is a "distribution" because it tells us how the total number of word tokens in the text is distributed across the vocabulary items.
What function in NLTK calculates it?
  fdist = FreqDist(text_name)
What expression lists the vocabulary items?
  fdist.keys(), called on the frequency distribution rather than on the text; fdist[w] gives the count of item w.
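The idea of a frequency distribution can be tried without NLTK or its corpora by using collections.Counter from the standard library as a stand-in for FreqDist. That substitution, and the toy sentence below, are assumptions made for the sake of a self-contained example.

```python
from collections import Counter

# Toy text (made up): 8 word tokens, 5 vocabulary items.
tokens = 'the whale and the sea and the ship'.split()

# Stand-in for FreqDist(tokens): maps each vocabulary item to its count.
fdist = Counter(tokens)

print(sorted(fdist.keys()))   # the vocabulary items
print(fdist['the'])           # the count of one item
print(fdist.most_common(2))   # the two most frequent items with their counts

# The counts sum to the number of tokens: that is what is being "distributed".
total_tokens = sum(fdist.values())
print(total_tokens == len(tokens))
```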

Very frequent words
How would you describe the 50 most frequent elements in Moby Dick?
>>> fdist1.plot(50, cumulative=True)
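What a cumulative plot reports is the share of all tokens accounted for by the n most frequent items. A rough sketch of that calculation follows, with collections.Counter standing in for FreqDist; the function name cumulative_coverage and the toy text are hypothetical.

```python
from collections import Counter


def cumulative_coverage(tokens, n):
    """Hypothetical helper: fraction of all tokens covered by the n most frequent items."""
    fdist = Counter(tokens)
    top_n_total = sum(count for _, count in fdist.most_common(n))
    return top_n_total / len(tokens)


# Toy text (made up): 'the' occurs 3 times and 'and' twice, out of 8 tokens.
tokens = 'the whale and the sea and the ship'.split()

coverage_top_2 = cumulative_coverage(tokens, 2)
print(coverage_top_2)
```

In this toy text the two most frequent words cover five of the eight tokens, which mirrors on a small scale how a few very frequent words dominate a long text.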

Very infrequent words
Words that occur only once are called hapaxes.
>>> fdist1.hapaxes()
In Moby Dick, "lexicographer, cetological, contraband, expostulations", and about 9,000 others.
How would you describe them?
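Hapaxes are simply the items whose count is 1, so they can be picked out of any table of counts. The sketch below again assumes collections.Counter as a stand-in for FreqDist and uses a made-up toy text.

```python
from collections import Counter

# Toy text (made up): 'the' and 'and' repeat, the other words occur once.
tokens = 'the whale and the sea and the ship'.split()
fdist = Counter(tokens)

# Keep every vocabulary item that occurs exactly once.
hapaxes = sorted(word for word, count in fdist.items() if count == 1)
print(hapaxes)
```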

Summary
                   Most frequent      Least frequent
Length             short              long
Meaning            very general       very specific
Coverage of text   large proportion   small proportion

Question
Which group would you look in to find words that help you understand what the text is about?
Neither.

Fine-grained word selection
Some Python expressions are based on set theory.
  a) {w | w ∈ V & P(w)}
  b) [w for w in V if p(w)], though this returns a list, not a set. (What's the difference?)
Real NLTK
>>> V = set(text1)
>>> long_words = [w for w in V if len(w) > 15]
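One way to explore the list-versus-set question is to build both from the same toy data. In this sketch the token list and the property p (word length greater than 4) are made up for illustration; the set comprehension in braces is the closer Python analogue of notation (a).

```python
# Toy tokens (made up); 'whale' occurs twice.
tokens = ['whale', 'sea', 'whale', 'ship', 'harpooneer']
V = set(tokens)   # the vocabulary: duplicates collapse


def p(w):
    """Illustrative property: the word has more than 4 characters."""
    return len(w) > 4


as_list = [w for w in V if p(w)]             # a list built from the vocabulary
as_set = {w for w in V if p(w)}              # a set: unordered, no duplicates
from_tokens = [w for w in tokens if p(w)]    # a list built from the tokens keeps repeats

print(sorted(as_list))
print(sorted(as_set))
print(from_tokens)
```

A list has an order and can hold the same word more than once; a set has neither property, so iterating over V instead of the tokens is what removes the repeats here.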

Finding words that characterize a text
Not too short (>?) and not too infrequent (>?)
>>> fdist1 = FreqDist(text1)
>>> informative_words = [w for w in V if len(w) > 7 and fdist1[w] > 7]
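The two-condition filter needs the count of each individual word, which is why it looks the word up in a frequency distribution built from the tokens. A self-contained sketch follows, with collections.Counter standing in for FreqDist; the function name informative_words, its default thresholds of 7, and the toy token list are assumptions.

```python
from collections import Counter


def informative_words(tokens, min_length=7, min_count=7):
    """Hypothetical helper: words longer than min_length that occur more than min_count times."""
    fdist = Counter(tokens)   # counts come from the tokens, not from the vocabulary set
    return sorted(w for w in fdist
                  if len(w) > min_length and fdist[w] > min_count)


# Toy tokens (made up): only 'harpooneer' is both long and frequent.
tokens = (['harpooneer'] * 8 + ['the'] * 20
          + ['lexicographer'] + ['whale'] * 10)

selected = informative_words(tokens)
print(selected)
```

Counting over the vocabulary set instead of the tokens would give every word a count of 1, so no word could pass the frequency test.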

Finding groups of words
What is the name for a sequence of two words?
  Bigram ~ bigrams()
>>> list(bigrams(['more', 'is', 'said', 'than', 'done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
(Wrapping the call in list() gives the same result whether bigrams() returns a list or a generator.)
What is the name for a sequence of words that occurs together unusually often?
  Collocation ~ collocations()
  Collocations are essentially bigrams that occur more often than we would expect based on the frequency of the individual words.
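Both ideas can be sketched in plain Python. Pairing a list with itself shifted by one position yields the bigrams, and dividing a bigram's observed count by the count expected if its two words were independent gives a crude measure of "more often than expected". The helper names make_bigrams and association_ratio, the toy text, and the ratio itself are illustrative assumptions; this is not the scoring that NLTK's collocations() uses.

```python
from collections import Counter


def make_bigrams(tokens):
    """Hypothetical helper: list of adjacent word pairs."""
    return list(zip(tokens, tokens[1:]))


def association_ratio(tokens, pair):
    """Hypothetical helper: observed bigram count divided by the count
    expected if the two words occurred independently of each other."""
    word_counts = Counter(tokens)
    pair_counts = Counter(make_bigrams(tokens))
    first, second = pair
    expected = word_counts[first] * word_counts[second] / len(tokens)
    if expected == 0:
        return 0.0   # one of the words never occurs in the text
    return pair_counts[pair] / expected


print(make_bigrams(['more', 'is', 'said', 'than', 'done']))

# Toy text (made up): 'red wine' occurs twice in 8 tokens.
tokens = 'red wine and red wine or white wine'.split()
ratio = association_ratio(tokens, ('red', 'wine'))
print(ratio > 1)
```

A ratio above 1 means the pair shows up more often than its parts alone would predict, though on a text this small the number is not meaningful.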

Example
>>> text4.collocations()
Building collocations list
United States; fellow citizens; years ago; Federal Government; General Government; American people; Vice President; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice; God bless; Indian tribes; public debt; foreign nations; political parties; State governments; National Government; United Nations; public money

Counting Other Things

Next time
First quiz/project
NLPP: finish §1 and do all exercises; do up to Ex 8 in §2
