NLTK & Python Day 4 LING 681.02 Computational Linguistics Harry Howard Tulane University
31-Aug-2009 LING 681.02, Prof. Howard, Tulane University
Course organization
I have requested that Python and NLTK be installed on the computers in this room.
NLPP §1 Language processing & Python §1.1 Computing with language
Loading the book's texts
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G. K. Chesterton 1908
>>>
Searching text
Show every token of a word in context, called concordance view:
>>> text1.concordance("monstrous")
Show the words that appear in a similar range of contexts:
>>> text1.similar("monstrous")
Show the contexts that two words share; the argument is a list of the two words, as in the book's example:
>>> text2.common_contexts(["monstrous", "very"])
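To make the idea of a concordance view concrete, here is a minimal sketch in plain Python 3. It is not NLTK's implementation: the function name `concordance`, the `width` parameter, the case-insensitive matching, and the toy token list are all assumptions made for illustration.

```python
def concordance(tokens, word, width=3):
    """Return each occurrence of `word` with up to `width` tokens of context per side.

    Illustrative stand-in for a concordance view, not NLTK's own code.
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == word.lower():  # assumption: ignore case when matching
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines


# Toy text standing in for text1 (hypothetical data)
tokens = "the monstrous whale and a most monstrous size".split()
for line in concordance(tokens, "monstrous"):
    print(line)
```

Each printed line is one token of the word together with its neighbours, which is what makes recurring contexts easy to spot.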
Searching text, cont.
Plot how far each token of a word is from the beginning of a text:
>>> text1.dispersion_plot(["monstrous"])
Needs NumPy & Matplotlib, though it didn't work for me.
Generate random text:
>>> text1.generate()
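If the plot is unavailable, the data behind it can still be inspected. The sketch below, in plain Python 3, collects the positions at which a word occurs; the helper name `offsets` and the toy token list are assumptions, and this is only the underlying idea, not NLTK's plotting code.

```python
def offsets(tokens, word):
    """Return the index of every occurrence of `word` in `tokens`."""
    return [i for i, token in enumerate(tokens) if token == word]


# Toy text standing in for text1 (hypothetical data)
tokens = "the monstrous whale and a most monstrous size".split()
positions = offsets(tokens, "monstrous")
print(positions)
```

A dispersion plot draws one mark per position, so clusters of nearby numbers correspond to passages where the word is frequent.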
Counting vocabulary
Count the word and punctuation tokens in a text:
>>> len(text1)
List the distinct words, i.e. the word types, in a text:
>>> set(text1)
Count how many types there are in a text:
>>> len(set(text1))
Count the tokens of a word type:
>>> text1.count("smote")
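The same counts can be tried without loading any corpus. This is a small sketch in Python 3 on an ordinary list; the toy sentence is invented for illustration and stands in for a text such as text1.

```python
# Toy token list standing in for an NLTK text (hypothetical data)
tokens = ["the", "whale", "smote", "the", "ship", ",",
          "and", "the", "ship", "sank", "."]

token_count = len(tokens)        # every word and punctuation token
types = set(tokens)              # distinct tokens, i.e. the types
type_count = len(types)          # size of the vocabulary
the_count = tokens.count("the")  # tokens of one particular type

print(token_count, type_count, the_count)
```

Note that punctuation marks count as tokens and as types here, just as they do in the book's texts.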
Lexical richness or diversity
The lexical richness or diversity of a text can be estimated as tokens per type:
>>> len(text1) / len(set(text1))
The frequency of a type can be estimated as its share of all tokens, expressed as a percentage:
>>> 100 * text1.count('a') / len(text1)
This is integer division, however. On p. 8 the book switches to floating-point division with "from __future__ import division"; the name takes two underscores on each side, and typing "_future_" produces an error.
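The difference between the two kinds of division can be seen on a small example. The sketch below assumes Python 3, where `/` is always true division and `//` is floor division, so `//` reproduces the integer result that Python 2 gives for `/` on integers. The toy token list is invented for illustration.

```python
# Toy token list standing in for an NLTK text (hypothetical data)
tokens = ["the", "whale", "smote", "the", "ship", ",",
          "and", "the", "ship", "sank", "."]

# True division: tokens per type as a fraction
tokens_per_type = len(tokens) / len(set(tokens))

# Floor division: what integer division would report instead
floored = len(tokens) // len(set(tokens))

# Share of all tokens that are "the", as a percentage
percent_the = 100 * tokens.count("the") / len(tokens)

print(tokens_per_type, floored, round(percent_the, 2))
```

With 11 tokens and 8 types, true division keeps the fractional part while floor division discards it, which is why the integer result is a poor measure of diversity.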
Making your own function in Python
To save you from typing the same thing over and over, you can define your own function:
>>> def lexical_diversity(text):
...     return len(text) / len(set(text))
You call this function just by typing its name and filling in the argument, a text name, in the parentheses:
>>> lexical_diversity(text1)
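Here is a runnable sketch of such a function in Python 3, tried on an ordinary list instead of an NLTK text. The `percentage` helper and the sample list are added for illustration and are not part of the slide.

```python
def lexical_diversity(text):
    """Tokens per type; uses the parameter `text`, not a fixed corpus."""
    return len(text) / len(set(text))


def percentage(count, total):
    """Express `count` as a percentage of `total` (illustrative helper)."""
    return 100 * count / total


# Toy token list (hypothetical data)
sample = ["Call", "me", "Ishmael", ".", "Call", "me", "."]

diversity = lexical_diversity(sample)
share_of_call = percentage(sample.count("Call"), len(sample))
print(diversity, round(share_of_call, 2))
```

Because the body refers only to its parameter, the same function works unchanged for text1, text2, or any other sequence of tokens.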
Other functions
Sort the word types in a text alphabetically:
>>> sorted(set(text1))
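A short Python 3 sketch shows what "alphabetically" means in practice: strings are compared by character code, so punctuation and capitalized words come before lowercase ones. The toy token list is invented for illustration.

```python
# Toy token list (hypothetical data)
tokens = ["whale", "Call", "me", ".", "Ishmael", "me", "whale"]

# set() removes the duplicates, sorted() returns a new ordered list
vocabulary = sorted(set(tokens))
print(vocabulary)
```

The result is a list, not a set, so it can be indexed and sliced like any other list.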
Exercises 1.8
…
4. … How many words are there in text2? How many distinct words are there?
5. Compare the lexical diversity scores for humor and romance fiction in Table 1.1. Which genre is more lexically diverse?
8. Consider the following Python expression: len(set(text4)). State the purpose of this expression. Describe the two steps involved in performing this computation.
NLPP §1.2 A Closer Look at Python: Texts as Lists of Words
The representation of a text
We will think of a text as nothing more than a sequence of words and punctuation.
The opening sentence of Moby Dick:
>>> sent1 = ['Call', 'me', 'Ishmael', '.']
The bracketed material is known as a list in Python. We can inspect it by typing the name.
How would you find out how many words it has?
List construction
Append one list to the end of another with '+', known as concatenation:
>>> ['Monty', 'Python'] + ['and', 'the', 'Holy', 'Grail']
['Monty', 'Python', 'and', 'the', 'Holy', 'Grail']
>>> sent4 + sent1
['Fellow', '-', 'Citizens', 'of', 'the', 'Senate', 'and', 'of', 'the', 'House', 'of', 'Representatives', ':', 'Call', 'me', 'Ishmael', '.']
Append a single item to a list:
>>> sent1.append("Some")
>>> sent1
['Call', 'me', 'Ishmael', '.', 'Some']
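One difference between the two operations is easy to miss: '+' builds a new list and leaves its operands alone, while append() changes the existing list in place. A small Python 3 sketch, with variable names chosen for illustration:

```python
sent1 = ["Call", "me", "Ishmael", "."]
title = ["Monty", "Python"]

# Concatenation returns a new list; title and sent1 are unchanged
combined = title + sent1

# append() modifies sent1 itself and returns None
result = sent1.append("Some")

print(combined)
print(sent1)
print(result)
```

Because append() returns None, writing `sent1 = sent1.append("Some")` would throw the list away.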
List indexing
Each element in a list is numbered in sequence, a number known as the element's index.
Show the item that occurs at an index such as 173 in a text:
>>> text4[173]
'awaken'
Show the index of an element's first occurrence:
>>> text4.index('awaken')
173
Show the elements between two indices (slicing):
>>> text5[16715:16735]
>>> text5[16715:]
>>> text5[:16735]
Assign an element to an index:
>>> text[0] = 'First'
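The same operations can be tried on a short list, where the results are easy to check by eye. This Python 3 sketch uses the first few tokens of sent4 as a plain list; the variable names are chosen for illustration.

```python
words = ["Fellow", "-", "Citizens", "of", "the", "Senate"]

item = words[2]               # element at index 2
position = words.index("of")  # index of the first occurrence of "of"
middle = words[1:4]           # from index 1 up to, but not including, 4
tail = words[4:]              # from index 4 to the end
head = words[:2]              # from the start up to, but not including, 2

words[0] = "First"            # assignment replaces the element at index 0

print(item, position, middle, tail, head, words)
```

A slice stops just before its second index, so `words[1:4]` has three elements, not four.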
Python counts from 0
Create a list:
>>> sent = ['word1', 'word2', 'word3', 'word4', 'word5',
...         'word6', 'word7', 'word8', 'word9', 'word10']
Find the first word:
>>> sent[0]
'word1'
Find the last word:
>>> sent[9]
'word10'
What does sent[10] do? It produces a runtime error (an IndexError), because the ten elements are numbered 0 through 9.
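The off-by-one behaviour can be checked directly. In this Python 3 sketch the list is built with a comprehension instead of being typed out, and the out-of-range access is wrapped in try/except so the script keeps running; both choices are for illustration only.

```python
# Ten elements, 'word1' through 'word10', at indices 0 through 9
sent = ["word%d" % n for n in range(1, 11)]

first = sent[0]
last = sent[len(sent) - 1]   # same element as sent[9] and sent[-1]

try:
    sent[10]                 # one past the last valid index
except IndexError as error:
    message = str(error)

print(first, last, message)
```

The last valid index is always `len(sent) - 1`; Python also accepts `sent[-1]` as a shorthand for the final element.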
List exercises
Next time NLPP: finish §1 and do all exercises; do up to Ex 8 in §2