# NLTK & Python Day 4 LING 681.02 Computational Linguistics Harry Howard Tulane University.



31-Aug-2009, LING 681.02, Prof. Howard, Tulane University

Course organization

- I have requested that Python and NLTK be installed on the computers in this room.

NLPP §1 Language processing & Python §1.1 Computing with language

Loading the book's texts

    >>> from nltk.book import *
    *** Introductory Examples for the NLTK Book ***
    Loading text1, ..., text9 and sent1, ..., sent9
    Type the name of the text or sentence to view it.
    Type: 'texts()' or 'sents()' to list the materials.
    text1: Moby Dick by Herman Melville 1851
    text2: Sense and Sensibility by Jane Austen 1811
    text3: The Book of Genesis
    text4: Inaugural Address Corpus
    text5: Chat Corpus
    text6: Monty Python and the Holy Grail
    text7: Wall Street Journal
    text8: Personals Corpus
    text9: The Man Who Was Thursday by G. K. Chesterton 1908
    >>>

Searching text

- Show every token of a word in context, called a concordance view: `text1.concordance("monstrous")`
- Show the words that appear in a similar range of contexts: `text1.similar("monstrous")`
- Show the contexts that two or more words share; the method takes a list of words: `text2.common_contexts(["monstrous", "very"])`
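
To make the idea of a concordance view concrete, here is a minimal sketch in plain Python. It is not NLTK's implementation: the function name `simple_concordance`, its `width` parameter, and the toy token list are all assumptions made for illustration.

```python
def simple_concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Toy token list (hypothetical, not taken from text1).
tokens = "the whale was a monstrous size and a most monstrous creature".split()

for line in simple_concordance(tokens, "monstrous"):
    print(line)
# whale was a [monstrous] size and a
# and a most [monstrous] creature
```

Each output line centers one token of the word, which is what lets you compare its contexts at a glance.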

Searching text, cont.

- Plot how far each token of a word is from the beginning of a text: `text1.dispersion_plot(["monstrous"])`
  - Needs NumPy and Matplotlib, though it didn't work for me.
- Generate random text: `text1.generate()`
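
If the plot will not draw, the numbers behind it can still be inspected. The sketch below only computes the offsets (token positions) that a dispersion plot would mark; it does no plotting and is not how NLTK does it internally. The helper name `token_offsets` and the token list are hypothetical.

```python
def token_offsets(tokens, word):
    """Return the position of every token equal to `word`, counting from 0."""
    return [i for i, token in enumerate(tokens) if token == word]


# Toy token list (hypothetical).
tokens = "a monstrous whale and a monstrous wave".split()

offsets = token_offsets(tokens, "monstrous")
print(offsets)  # [1, 5]
```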

Counting vocabulary

- Count the word and punctuation tokens in a text: `len(text1)`
- List the distinct words, i.e. the word types, in a text: `set(text1)`
- Count how many types there are in a text: `len(set(text1))`
- Count the tokens of a word type: `text1.count("smote")`
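
The same four operations work on any Python list of strings, so they can be tried without loading the book's texts. The token list below is made up for illustration.

```python
# Toy token list (hypothetical); punctuation counts as a token.
tokens = ["the", "whale", "smote", "the", "ship", ",",
          "and", "the", "ship", "sank", "."]

n_tokens = len(tokens)               # every token, repeats included
types = set(tokens)                  # a set keeps one copy of each distinct token
n_types = len(types)                 # number of types
the_count = tokens.count("the")      # tokens of one particular type

print(n_tokens, n_types, the_count)  # 11 8 3
```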

Lexical richness or diversity

- The lexical richness or diversity of a text can be estimated as tokens per type: `len(text1) / len(set(text1))`
- The frequency of a type can be expressed as a percentage of all tokens: `100 * text1.count('a') / len(text1)`
- In Python 2 this is integer division, however, so the fractional part is discarded.
  - p. 8: "_future_" is some kind of error; the module that enables true division is spelled `__future__`, with two underscores on each side (`from __future__ import division`).
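
The difference between the two kinds of division is easiest to see side by side. This sketch assumes Python 3, where `/` is always true division and `//` is floor division; in Python 2, `/` between two integers behaves like `//` unless `from __future__ import division` is in effect. The counts are toy values, not figures from any of the book's texts.

```python
# Toy counts (hypothetical).
n_tokens = 11
n_types = 8
n_the = 3

floor_ratio = n_tokens // n_types          # floor division drops the fraction
true_ratio = n_tokens / n_types            # true division keeps it

floor_percent = 100 * n_the // n_tokens
true_percent = round(100 * n_the / n_tokens, 2)

print(floor_ratio, true_ratio)             # 1 1.375
print(floor_percent, true_percent)         # 27 27.27
```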

Making your own function in Python

- To save you from typing the same thing over and over, you can define your own function:

      >>> def lexical_diversity(text):
      ...     return len(text) / len(set(text))

- You call this function by typing its name and filling in the argument, a text name, in the parentheses:

      >>> lexical_diversity(text1)
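
The body of the function must use the parameter `text` rather than a fixed name such as `text1`; otherwise every call returns the same answer. A small sketch, assuming Python 3 division and two made-up token lists, shows the parameter doing its job:

```python
def lexical_diversity(text):
    """Tokens per type for any sequence of tokens."""
    return len(text) / len(set(text))


# Two toy "texts" (hypothetical).
repetitive = ["a", "b", "a", "b"]   # 4 tokens, 2 types
varied = ["a", "b", "c", "d"]       # 4 tokens, 4 types

print(lexical_diversity(repetitive))  # 2.0
print(lexical_diversity(varied))      # 1.0
```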

Other functions

- Sort the word types in a text alphabetically: `sorted(set(text1))`
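
One thing to expect from `sorted` is that "alphabetically" means by character code, so punctuation and capitalized words come before lowercase ones. A short sketch on a made-up token list:

```python
# Toy token list (hypothetical).
tokens = ["whale", "Call", "me", "Ishmael", ".", "me", "whale"]

vocab = sorted(set(tokens))
print(vocab)       # ['.', 'Call', 'Ishmael', 'me', 'whale']

# Lowercasing first merges 'Call' and 'call' style variants into one type.
vocab_lower = sorted(set(t.lower() for t in tokens))
print(vocab_lower)  # ['.', 'call', 'ishmael', 'me', 'whale']
```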

Exercises 1.8.…

- Ex. 4. … How many words are there in text2? How many distinct words are there?
- Ex. 5. Compare the lexical diversity scores for humor and romance fiction in Table 1.1. Which genre is more lexically diverse?
- Ex. 8. Consider the following Python expression: `len(set(text4))`. State the purpose of this expression. Describe the two steps involved in performing this computation.

NLPP §1.2 A Closer Look at Python: Texts as Lists of Words

The representation of a text

- We will think of a text as nothing more than a sequence of words and punctuation.
- The opening sentence of Moby Dick: `>>> sent1 = ['Call', 'me', 'Ishmael', '.']`
- The bracketed material is known as a list in Python.
- We can inspect it by typing the name.
- How would you find out how many words it has?

List construction

- Append one list to the end of another with `+`, known as concatenation:

      >>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
      ['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
      >>> sent4 + sent1
      ['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']

- Append a single item to a list:

      >>> sent1.append("Some")
      >>> sent1
      ['Call', 'me', 'Ishmael', '.', 'Some']
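
The two operations differ in one way that is easy to miss: `+` builds a new list and leaves its operands alone, while `append` changes the list it is called on and returns `None`. A short sketch with ordinary lists:

```python
first = ["Monty", "Python"]
second = ["and", "the", "Holy", "Grail"]

combined = first + second      # a new list; first and second are unchanged
print(len(first), len(combined))  # 2 6

result = first.append("Some")  # modifies first in place
print(first)                   # ['Monty', 'Python', 'Some']
print(result)                  # None
```

This is why `sent1.append("Some")` prints nothing at the prompt, and you have to type `sent1` to see the change.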

List indexing

- Each element in a list is numbered in sequence; this number is known as the element's index.
- Show the item that occurs at an index such as 173 in a text:

      >>> text4[173]
      'awaken'

- Show the index of an element's first occurrence:

      >>> text4.index('awaken')
      173

- Show the elements between two indices (slicing):

      >>> text5[16715:16735]
      >>> text5[16715:]
      >>> text5[:16735]

- Assign an element to an index of a list, here one named `sent`:

      >>> sent[0] = 'First'
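
These operations can be tried on a short list, where the results are easy to check by eye. The list `words` below is made up for illustration. Note that a slice `m:n` includes index `m` but stops before index `n`.

```python
# Toy list (hypothetical).
words = ["Call", "me", "Ishmael", ".", "Some", "years", "ago"]

print(words[2])                # Ishmael
print(words.index("Ishmael"))  # 2

middle = words[1:4]            # indices 1, 2, 3; index 4 is excluded
start = words[:2]              # from the beginning up to, not including, index 2
rest = words[4:]               # from index 4 to the end
print(middle)                  # ['me', 'Ishmael', '.']
print(start, rest)             # ['Call', 'me'] ['Some', 'years', 'ago']

words[0] = "First"             # assignment replaces the element at that index
print(words[:3])               # ['First', 'me', 'Ishmael']
```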

Python counts from 0

- Create a list:

      >>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
      ...         'word6', 'word7', 'word8', 'word9', 'word10']

- Find the first word:

      >>> sent[0]
      'word1'

- Find the last word:

      >>> sent[9]
      'word10'

- What does `sent[10]` do? It produces a runtime error, `IndexError: list index out of range`, because a list of ten elements has indices 0 through 9.
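
Because the last valid index is always one less than the length, the last element can be reached without counting by hand. The sketch below also catches the error from `sent[10]` so the script keeps running; the variable names other than `sent` are assumptions made for illustration.

```python
sent = ["word%d" % i for i in range(1, 11)]  # 'word1' through 'word10'

last = sent[len(sent) - 1]   # the last valid index is len(sent) - 1
also_last = sent[-1]         # negative indices count from the end
print(last, also_last)       # word10 word10

try:
    sent[10]                 # one past the end
except IndexError as err:
    message = str(err)
    print("IndexError:", message)
```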

List exercises

Next time NLPP: finish §1 and do all exercises; do up to Ex 8 in §2
