ON-LINE DOCUMENTS 3
Day 22 - 10/17/14
LING 3820 & 6820 Natural Language Processing
Harry Howard, Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
Open Spyder
Review: How to download a file from Project Gutenberg
The global working directory
2.2.1. How to set the global working directory in Spyder
I am getting tired of constantly double-checking that Python saves my stuff to pyScripts, but fortunately Spyder can do it for us.
Open Spyder. On a Mac, click on the python menu (top left); in Windows, click on the Tools menu.
Open Preferences > Global working directory > Startup.
For "At startup, the global working directory is", select "the following directory:" and set it to /Users/harryhow/Documents/pyScripts.
Set the next two selections to "the global working directory".
Leave the last option untouched and unchecked.
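If you would rather set the working directory from Python itself than through Spyder's preferences, the standard library's os module can do it. A minimal sketch; the path is the one from this slide, so substitute your own pyScripts location:

import os
os.chdir('/Users/harryhow/Documents/pyScripts')  # point Python at pyScripts
print os.getcwd()                                # confirm the current working directory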
gutenLoader

def gutenLoader(url, name):
    from urllib import urlopen
    # download the raw text of the e-book
    download = urlopen(url)
    text = download.read()
    print 'text length = ' + str(len(text))
    # trim off the Project Gutenberg header and footer
    lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = text.index('\n', lineIndex)
    endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    story = text[startIndex:endIndex]
    print 'story length = ' + str(len(story))
    # save the trimmed story to the current working directory
    tempFile = open(name, 'w')
    tempFile.write(story.encode('utf8'))
    tempFile.close()
    print 'File saved'
    return
Usage

# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> url = 'http://www.gutenberg.org/cache/epub/32154/pg32154.txt'
>>> name = 'VariableMan.txt'
>>> gutenLoader(url, name)
Homework
Since the documents in Project Gutenberg are numbered consecutively, you can download a series of them all at once with a for loop. Write the code for doing so. HINT: remember that strings and integers are different types. To turn an integer into a string, use the built-in str() function, e.g. str(1).
Answer

# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> base = 28554
>>> for n in [1, 2, 3]:
...     num = str(base + n)
...     url = 'http://www.gutenberg.org/cache/epub/' + num + '/pg' + num + '.txt'
...     name = 'story' + str(n) + '.txt'
...     gutenLoader(url, name)
...
Get pdfminer
You only do this once ~ 2.7.3.2. How to install a package by hand
Point your web browser at https://pypi.python.org/pypi/pdfminer/.
Click on the green button to download the compressed folder. It downloads to your Downloads folder.
Double-click it to decompress it if it doesn't decompress automatically. If you have no decompression utility, you need to get one.
Open the Terminal/Command Prompt (Windows Start > Search > cmd > Command Prompt):
$> cd {drop file here}
$> python setup.py install
$> pdf2txt.py samples/simple1.pdf
Get a pdf file
To get an idea of the source: http://www.federalreserve.gov/monetarypolicy/fomchistorical2008.htm
Download the file to pyScripts:

# make sure that the current working directory is pyScripts
from urllib import urlopen
url = 'http://www.federalreserve.gov/monetarypolicy/files/FOMC20080130meeting.pdf'
download = urlopen(url)
doc = download.read()
# open the output file in binary mode ('wb'), since a PDF is not plain text
tempFile = open('FOMC20080130.pdf', 'wb')
tempFile.write(doc)
tempFile.close()
Run pdf2text

# make sure that Python is looking at pyScripts
>>> from corpFunctions import pdf2text
>>> text = pdf2text('FOMC20080130.pdf')
>>> len(text)
>>> text[:50]
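corpFunctions.py is not reproduced on these slides, so for reference here is a minimal sketch of what a pdf2text helper could look like on top of pdfminer's (Python 2) API; the course's actual function may differ in its details:

from cStringIO import StringIO
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def pdf2text(fileName):
    # route every page of the PDF through pdfminer's text converter
    rsrcmgr = PDFResourceManager()
    outfp = StringIO()
    device = TextConverter(rsrcmgr, outfp, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    pdfFile = open(fileName, 'rb')
    for page in PDFPage.get_pages(pdfFile):
        interpreter.process_page(page)
    pdfFile.close()
    device.close()
    # collect the accumulated text and clean up
    text = outfp.getvalue()
    outfp.close()
    return text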
7.3.2. How to pre-process a text with the PlaintextCorpusReader
NLTK
One of the reasons for using NLTK is that it relieves us of much of the effort of making a raw text amenable to computational analysis. It does so by including a module of corpus readers, which pre-process files for certain tasks or formats. Most of them are specialized for particular corpora, so we will start with the basic one, called the PlaintextCorpusReader.
PlaintextCorpusReader
The PlaintextCorpusReader needs to know two things: where your file is and what its name is. If the current working directory is where the file is, the location argument can be left 'blank' by using the null string ''. We only have one file, 'Wub.txt'. It will also prevent problems down the line to give the method an optional third argument that relays its encoding, encoding='utf-8'. Now let NLTK tokenize the text into words and punctuation.
Usage

# make sure that the current working directory is pyScripts
>>> from nltk.corpus import PlaintextCorpusReader
>>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> wubWords = wubReader.words()
Initial look at the text

>>> len(wubWords)
>>> wubWords[:50]
>>> set(wubWords)
>>> len(set(wubWords))
>>> wubWords.count('wub')
>>> len(wubWords) / len(set(wubWords))  # integer division in Python 2
>>> from __future__ import division     # switch to true (float) division
>>> 100 * wubWords.count('a') / len(wubWords)
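The last two calculations are the token/type ratio and percentage measures from the NLTK book; if you find yourself retyping them, it may be worth wrapping them as functions. A minimal sketch, with names borrowed from the NLTK book rather than from the course's corpFunctions:

from __future__ import division  # true division, as on the slide

def lexical_diversity(words):
    # average number of times each distinct word (type) gets used
    return len(words) / len(set(words))

def percentage(count, total):
    # what percentage of the text a given count represents
    return 100 * count / total

For instance, lexical_diversity(wubWords) reproduces the token/type ratio computed above.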
Basic text analysis with NLTK

>>> from nltk.text import Text
>>> t = Text(wubWords)
>>> t.concordance('wub')
>>> t.similar('wub')
>>> t.common_contexts(['wub','captain'])
>>> t.dispersion_plot(['wub'])
>>> t.generate()
19
Q6 take home Intro to text stats Next time 17-Oct-2014NLP, Prof. Howard, Tulane University 19