ON-LINE DOCUMENTS 3 DAY 22 - 10/17/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

1 ON-LINE DOCUMENTS 3 DAY 22 - 10/17/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
 http://www.tulane.edu/~howard/LING3820/
 The syllabus is under construction.
 http://www.tulane.edu/~howard/CompCultEN/
 Chapter numbering
 3.7. How to deal with non-English characters
 4.5. How to create a pattern with Unicode characters
 6. Control
17-Oct-2014 NLP, Prof. Howard, Tulane University

3 Open Spyder

4 Review: How to download a file from Project Gutenberg

5 The global working directory
2.2.1. How to set the global working directory in Spyder
 I am getting tired of constantly double-checking that Python saves my stuff to pyScripts, but fortunately Spyder can do it for us.
 Open Spyder:
 On a Mac, click on the python menu (top left).
 In Windows, click on the Tools menu.
 Open Preferences > Global working directory > Startup: At startup, the global working directory is: > the following directory: /Users/harryhow/Documents/pyScripts
 Set the next two selections to "the global working directory".
 Leave the last selection untouched and unchecked.
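Spyder's preference dialog sets the working directory once and for all, but the same effect is available per session from the interpreter with the standard os module. A minimal Python 3 sketch (the home directory below is only a stand-in; you would point it at your own pyScripts folder):

```python
import os

# Show where Python will read and write files by default.
print('current working directory:', os.getcwd())

# Change it for the current session (stand-in path; substitute your pyScripts folder).
target = os.path.expanduser('~')
os.chdir(target)
print('now in:', os.getcwd())
```
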

6 gutenLoader

def gutenLoader(url, name):
    from urllib import urlopen
    download = urlopen(url)
    text = download.read()
    print 'text length = ' + str(len(text))
    lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = text.index('\n', lineIndex)
    endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    story = text[startIndex:endIndex]
    print 'story length = ' + str(len(story))
    tempFile = open(name, 'w')
    tempFile.write(story.encode('utf8'))
    tempFile.close()
    print 'File saved'
    return
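The core of gutenLoader is plain string slicing: find the START marker, skip to the end of that marker's line, and cut everything before the END marker. That logic can be exercised on an in-memory sample without any downloading (Python 3 here, whereas the slide's code is Python 2; the sample text is invented for illustration):

```python
# A tiny stand-in for a downloaded Project Gutenberg file (invented sample).
text = ('Header boilerplate\n'
        '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
        'Once upon a time there was a wub.\n'
        '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n')

# Find the line with the START marker, then skip past the rest of that line.
lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
startIndex = text.index('\n', lineIndex)

# Cut just before the END marker.
endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')

story = text[startIndex:endIndex]
print(story.strip())  # -> Once upon a time there was a wub.
```
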

7 Usage

# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> url = 'http://www.gutenberg.org/cache/epub/32154/pg32154.txt'
>>> name = 'VariableMan.txt'
>>> gutenLoader(url, name)

8 Homework
 Since the documents in Project Gutenberg are numbered consecutively, you can download a series of them all at once with a for loop. Write the code for doing so.
 HINT: remember that strings and integers are different types. To turn an integer into a string, use the built-in str() function, e.g. str(1).
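The hint matters because concatenating a number into a URL string with + raises a TypeError until the number is converted. A minimal Python 3 illustration, using the document ID from the usage slide:

```python
base = 32154
# 'http://...' + base would raise a TypeError: str and int cannot be concatenated.
num = str(base)  # int -> str
url = 'http://www.gutenberg.org/cache/epub/' + num + '/pg' + num + '.txt'
print(url)  # -> http://www.gutenberg.org/cache/epub/32154/pg32154.txt

# The reverse direction, str -> int, uses int():
assert int(num) == base
```
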

9 Answer

# make sure that the current working directory is pyScripts
>>> from corpFunctions import gutenLoader
>>> base = 28554
>>> for n in [1,2,3]:
...     num = str(base+n)
...     url = 'http://www.gutenberg.org/cache/epub/'+num+'/pg'+num+'.txt'
...     name = 'story'+str(n)+'.txt'
...     gutenLoader(url, name)
...

10 Get pdfminer
You only do this once ~ 2.7.3.2. How to install a package by hand
 Point your web browser at https://pypi.python.org/pypi/pdfminer/.
 Click on the green button to download the compressed folder.
 It downloads to your Downloads folder. Double-click it to decompress it if it doesn't decompress automatically. If you have no decompression utility, you need to get one.
 Open the Terminal/Command Prompt (Windows Start > Search > cmd > Command Prompt)
 $> cd {drop file here}
 $> python setup.py install
 $> pdf2txt.py samples/simple1.pdf

11 Get a pdf file
 To get an idea of the source: http://www.federalreserve.gov/monetarypolicy/fomchistorical2008.htm
 Download the file to pyScripts:

# make sure that the current working directory is pyScripts
from urllib import urlopen
url = 'http://www.federalreserve.gov/monetarypolicy/files/FOMC20080130meeting.pdf'
download = urlopen(url)
doc = download.read()
tempFile = open('FOMC20080130.pdf','wb')   # 'wb': a PDF is binary data, not text
tempFile.write(doc)
tempFile.close()
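Opening the output file in binary mode matters: a PDF is bytes, not text, and text mode can corrupt it (notably on Windows, where text mode rewrites newline bytes). A stdlib-only Python 3 sketch of the write/read round trip, using an invented byte string in place of a real download:

```python
import os
import tempfile

# Stand-in for download.read(): the first bytes of a typical PDF file.
doc = b'%PDF-1.4\n%\xe2\xe3\xcf\xd3\n'

path = os.path.join(tempfile.mkdtemp(), 'sample.pdf')
with open(path, 'wb') as f:   # 'wb', not 'w': write the raw bytes untouched
    f.write(doc)

with open(path, 'rb') as f:   # read back in binary mode too
    assert f.read() == doc
print('round trip ok')
```
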

12 Run pdf2text

# make sure that Python is looking at pyScripts
>>> from corpFunctions import pdf2text
>>> text = pdf2text('FOMC20080130.pdf')
>>> len(text)
>>> text[:50]

13 7.3.2. How to pre-process a text with the PlaintextCorpusReader

14 NLTK
 One of the reasons for using NLTK is that it relieves us of much of the effort of making a raw text amenable to computational analysis. It does so by including a module of corpus readers, which pre-process files for certain tasks or formats.
 Most of them are specialized for particular corpora, so we will start with the basic one, called the PlaintextCorpusReader.

15 PlaintextCorpusReader
 The PlaintextCorpusReader needs to know two things: where your file is and what its name is.
 If the current working directory is where the file is, the location argument can be left 'blank' by using the null string ''.
 We only have one file, 'Wub.txt'. Giving the method an optional third argument that relays its encoding, encoding='utf-8', will also prevent problems down the line.
 Now let NLTK tokenize the text into words and punctuation.

16 Usage

# make sure that the current working directory is pyScripts
>>> from nltk.corpus import PlaintextCorpusReader
>>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
>>> wubWords = wubReader.words()

17 Initial look at the text

>>> len(wubWords)
>>> wubWords[:50]
>>> set(wubWords)
>>> len(set(wubWords))
>>> wubWords.count('wub')
>>> len(wubWords) / len(set(wubWords))
>>> from __future__ import division
>>> 100 * wubWords.count('a') / len(wubWords)
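None of these statistics is NLTK-specific: they are plain Python over a list of tokens. A Python 3 sketch on an invented token list (Python 3 does true division by default, so the __future__ import on the slide is only needed in Python 2):

```python
# Invented token list standing in for wubWords.
tokens = ['the', 'wub', 'said', 'the', 'wub', 'is', 'a', 'wub', '.']

print(len(tokens))                       # total tokens: 9
print(len(set(tokens)))                  # distinct word types: 6
print(tokens.count('wub'))               # occurrences of 'wub': 3
print(len(tokens) / len(set(tokens)))    # tokens per type: 1.5
print(100 * tokens.count('the') / len(tokens))  # percentage of tokens that are 'the'
```
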

18 Basic text analysis with NLTK

>>> from nltk.text import Text
>>> t = Text(wubWords)
>>> t.concordance('wub')
>>> t.similar('wub')
>>> t.common_contexts(['wub','captain'])
>>> t.dispersion_plot(['wub'])
>>> t.generate()
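To see what concordance() is doing under the hood: it lists each hit of a word with its surrounding context. Here is a toy version of that idea (not NLTK's implementation) in Python 3, over an invented token list:

```python
def concordance(tokens, word, width=2):
    """List each occurrence of word with `width` tokens of context on each side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = ' '.join(tokens[max(0, i - width):i])
            right = ' '.join(tokens[i + 1:i + 1 + width])
            lines.append('{} [{}] {}'.format(left, tok, right))
    return lines

tokens = ['the', 'wub', 'stood', 'up', ';', 'the', 'captain', 'eyed', 'the', 'wub', '.']
for line in concordance(tokens, 'wub'):
    print(line)
# -> the [wub] stood up
# -> eyed the [wub] .
```
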

19 Next time
 Q6 take home
 Intro to text stats

