ON-LINE DOCUMENTS DAY 20 - 10/13/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


1 ON-LINE DOCUMENTS DAY 20 - 10/13/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering:
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
13-Oct-2014, NLP, Prof. Howard, Tulane University

3 Basic text analysis
The NLTK archive

4 Open Spyder

5 7. Corpora of digital texts
Now that you have gotten a taste of Python, let us turn to the main course: textual computing, or the computational analysis of text. But we do not have a text to work with yet, so let’s go and find one.

6 7.1. How to get a text from an on-line archive
The first step is to figure out where to put the file.

7 7.1.1. How to navigate folders with os
>>> import os
>>> os.getcwd()
'/Applications/IDEs/Spyder.app/Contents/Resources'
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts/')
>>> os.getcwd()
'/Users/{your_user_name}/Documents/pyScripts'
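
The session above changes the working directory by hand. As a rough sketch in Python 3 (the slides themselves use Python 2), the same step can be wrapped in a small function. The helper name go_to_scripts_folder and the choice to create the folder when it is missing are assumptions made for illustration; only the folder name pyScripts comes from the slide.

```python
import os


def go_to_scripts_folder(base_dir):
    """Make base_dir/pyScripts the working directory and return its path.

    Hypothetical helper: the slide calls os.chdir() directly on a
    hard-coded path instead.
    """
    target = os.path.join(base_dir, "pyScripts")
    os.makedirs(target, exist_ok=True)  # assumption: create it if missing
    os.chdir(target)
    return os.getcwd()


# Usage, assuming pyScripts lives under Documents in your home folder:
# go_to_scripts_folder(os.path.join(os.path.expanduser("~"), "Documents"))
```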

8 7.1.2. Project Gutenberg
http://www.gutenberg.org/ebooks/28554

9 7.1.3. How to download a file with urllib and convert it to a string with read()
# Python 2; in Python 3 the import is: from urllib.request import urlopen
>>> from urllib import urlopen
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = urlopen(url)
>>> downloadString = download.read()
>>> type(downloadString)
>>> len(downloadString) # 35739?
>>> downloadString[:50]
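
For readers on Python 3, where urlopen lives in urllib.request and read() returns bytes rather than a string, a minimal sketch of the same download step might look like this. The function name fetch_text is hypothetical, and the assumption that the file is UTF-8 encoded is mine, not the slide's.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and return its contents decoded to a str.

    Hypothetical helper; the encoding default is an assumption.
    """
    with urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
    return raw.decode(encoding)


# Usage with the URL from the slide (needs a network connection):
# download_string = fetch_text(
#     "http://www.gutenberg.org/cache/epub/28554/pg28554.txt")
# len(download_string), download_string[:50]
```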

10 7.1.4. How to save a file to your drive with open(), write(), and close()
# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(downloadString.encode('utf8'))
>>> tempFile.close()
# import os if you haven't already done so
>>> os.listdir('.')
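
The open(), write(), close() sequence can also be written with a with block, which closes the file even if the write fails. The following is a Python 3 sketch under that assumption; save_text is a hypothetical name, and passing encoding="utf-8" to open() stands in for the explicit encode('utf8') call used in the Python 2 session.

```python
def save_text(text, filename):
    """Write text to filename as UTF-8; return the number of characters written.

    Hypothetical helper, not part of the slides.
    """
    with open(filename, "w", encoding="utf-8") as out_file:
        return out_file.write(text)


# Usage, assuming the working directory is your pyScripts folder:
# save_text(download_string, "Wub.txt")
```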

11 7.1.5. How to look at a file with open() and read()
>>> tempFile = open('Wub.txt','r')
>>> text = tempFile.read()
>>> tempFile.close()
>>> type(text)
>>> len(text)
>>> text[:50]
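
The inspection steps above (read the file, check its length, look at the first 50 characters) can be bundled into one call. This is a Python 3 sketch; inspect_file is a hypothetical name, and reading the file as UTF-8 is an assumption.

```python
def inspect_file(filename, preview_chars=50):
    """Return (length, preview) for the text stored in filename.

    Hypothetical helper; assumes the file is UTF-8 encoded.
    """
    with open(filename, "r", encoding="utf-8") as in_file:
        text = in_file.read()
    return len(text), text[:preview_chars]


# Usage, assuming Wub.txt is in the working directory:
# length, preview = inspect_file("Wub.txt")
```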

12 7.1.6. How to slice away what you don’t need
>>> text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = text.index('\n',lineIndex)
>>> text[:startIndex]
>>> text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> story = text[startIndex:endIndex]
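
To see the slicing logic work without downloading anything, here is a small sketch that applies the same three index() calls to a made-up string. The marker strings are the ones on the slide; the function name strip_gutenberg and the sample text are assumptions for illustration, and files whose markers are worded differently would need the constants adjusted.

```python
START_MARKER = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END_MARKER = "*** END OF THIS PROJECT GUTENBERG EBOOK"


def strip_gutenberg(text):
    """Return the part of text between the START line and the END marker.

    str.index() raises ValueError if a marker is not found.
    """
    line_index = text.index(START_MARKER)        # where the START line begins
    start_index = text.index("\n", line_index)   # end of that START line
    end_index = text.index(END_MARKER)           # where the footer begins
    return text[start_index:end_index]


# Made-up sample, not real Project Gutenberg text:
sample = (
    "Header line\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "The story.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer line\n"
)
story = strip_gutenberg(sample)
```

Because the slice starts at the newline that ends the START line, the result begins with that newline character; calling story.strip() would remove it.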

13 Now save it as “Wub.txt”
# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(story.encode('utf8'))
>>> tempFile.close()

14 Homework
Turn the commands reviewed above into a function in a script. The function should take a URL and the name of a text file as arguments and save the Project Gutenberg file to your pyScripts folder without the Project Gutenberg header and footer.

15 Next time
How to use PDF files

