NLTK & Python Day 7 LING 681.02 Computational Linguistics Harry Howard Tulane University.


1 NLTK & Python Day 7 LING 681.02 Computational Linguistics Harry Howard Tulane University

2 Course organization
09-Sept-2009, LING 681.02, Prof. Howard, Tulane University
- I have requested that NLTK be installed on the computers in this room.

3 NLPP §2 Accessing text corpora and lexical resources §2.1 Accessing text corpora

4 What's that word
- What is a corpus (plural: corpora)?
- "large bodies of linguistic data"

5 Some corpora in NLTK
- The Project Gutenberg electronic text archive
  - 25k free electronic books at http://www.gutenberg.org/
- Web and chat text
- The Brown corpus
  - First million-word electronic corpus, from 500 sources
- The Reuters corpus
- The Inaugural Address corpus
- Annotated text corpora
- Corpora in other languages

6 Using corpora in NLTK
- Only the texts loaded by nltk.book are already wrapped as nltk.Text objects, so only they can be used directly with Text methods such as concordance() and collocations().
- To wrap the words of another corpus in a Text object, use:
  your_text_name = nltk.Text(corpus_words)
  where corpus_words is a sequence of words, e.g. the result of a corpus's words() method.
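A minimal sketch of the conversion above, assuming NLTK is installed. The token list is made up for illustration and stands in for the output of a corpus's words() method; my_text is a hypothetical name.

```python
import nltk

# A made-up token list standing in for the words of a corpus.
tokens = ["the", "whale", "saw", "the", "ship", "."]

# Wrap the tokens so that Text methods become available.
my_text = nltk.Text(tokens)

# Text objects support len() and count(), among other methods.
print(len(my_text))
print(my_text.count("the"))
```

The same wrapping step works on a real word sequence such as gutenberg.words('austen-emma.txt'), provided the corpus data has been downloaded.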

7 Basic corpus functions (Table 2.3)
Example                      Description
fileids()                    the files of the corpus
categories()                 the categories of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories

8 Basic corpus functions (Table 2.3, continued)
Example                      Description
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories

9 Code to get started
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.words('austen-emma.txt')
>>> emma = nltk.Text(emma)
>>> emma.collocations()
Frank Churchill; Miss Woodhouse; Miss Bates; Jane Fairfax; Miss Fairfax;
young man; great deal; John Knightley; Maple Grove; Miss Smith;
Miss Taylor; Robert Martin; Colonel Campbell; Box Hill; Harriet Smith;
William Larkins; Brunswick Square; young lady; young woman; Miss Hawkins

10 Loading your own corpus (Table 2.3, continued)
Example                      Description
abspath(fileid)              the location of the file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root()                       the path to the root of locally installed corpus
readme()                     the contents of the README file of the corpus
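The slide lists the functions but not how a corpus of your own gets loaded. One way, sketched here under the assumption that NLTK is installed, is NLTK's PlaintextCorpusReader. The temporary directory, file names, and file contents below are invented for the example.

```python
import os
import tempfile

from nltk.corpus import PlaintextCorpusReader

# Build a tiny throwaway corpus on disk (names and contents are invented).
corpus_root = tempfile.mkdtemp()
samples = {"a.txt": "Hello world.", "b.txt": "Corpora are fun."}
for name, content in samples.items():
    with open(os.path.join(corpus_root, name), "w", encoding="utf8") as handle:
        handle.write(content)

# Treat every .txt file under corpus_root as one file of the corpus.
my_corpus = PlaintextCorpusReader(corpus_root, r".*\.txt")

file_ids = my_corpus.fileids()            # the files of the corpus
words_a = list(my_corpus.words("a.txt"))  # the words of one file
raw_b = my_corpus.raw("b.txt")            # the raw content of one file

print(file_ids)
print(words_a)
print(raw_b)
```

Only fileids(), words(), and raw() are used here because they need no extra data; sents() additionally relies on a sentence tokenizer model that has to be downloaded.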

11 NLPP §2 Accessing text corpora and lexical resources §2.2 Conditional frequency distributions

12 Back to frequency
- FreqDist(mylist) calculates the number of occurrences of each item in 'mylist'.
- ConditionalFreqDist(mypairs) takes a list of (condition, item) pairs and calculates the number of occurrences of each item separately for each condition,
  - where the pairing might be of author & word, genre & word, topic & word, etc.: in general, condition & item of text
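To make the difference concrete, here is a plain-Python sketch of what the two kinds of counting amount to. It uses only the standard library, not NLTK's own classes, and the (genre, word) pairs are invented.

```python
from collections import Counter, defaultdict

# Invented (condition, word) pairs, e.g. genre & word.
pairs = [
    ("news", "the"), ("news", "court"), ("news", "the"),
    ("romance", "the"), ("romance", "love"),
]

# Roughly what FreqDist does: count each item of a flat list.
word_counts = Counter(word for _, word in pairs)

# Roughly what ConditionalFreqDist does: keep one counter per condition.
counts_by_condition = defaultdict(Counter)
for condition, word in pairs:
    counts_by_condition[condition][word] += 1

print(word_counts["the"])                     # 3
print(counts_by_condition["news"]["the"])     # 2
print(counts_by_condition["romance"]["the"])  # 1
```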

13 An example
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
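Once built, a conditional frequency distribution can be queried by condition and then by item. The sketch below assumes NLTK is installed and uses a handful of invented (genre, word) pairs in place of the Brown corpus, so that it runs without downloading any corpus data.

```python
import nltk

# Invented (genre, word) pairs standing in for the Brown corpus pairs.
pairs = [
    ("news", "the"), ("news", "jury"), ("news", "the"),
    ("romance", "the"), ("romance", "heart"),
]
cfd = nltk.ConditionalFreqDist(pairs)

genres = sorted(cfd.conditions())  # the conditions seen in the pairs
news_the = cfd["news"]["the"]      # count of "the" under the "news" condition
news_total = cfd["news"].N()       # total number of items counted under "news"

print(genres)
print(news_the)
print(news_total)
```

Each cfd[condition] is itself a FreqDist, so everything that works on a frequency distribution works on one condition's counts.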

14 Next time
- NLPP: §2.3ff
- Do "Your Turn" up to p. 55
- Exercises 2.8.2-4, 2.8.8

