Flat text Day 6 - 9/12/16 LING 3820 & 6820 Natural Language Processing


1 Flat text Day 6 - 9/12/16 LING 3820 & 6820 Natural Language Processing
Harry Howard Tulane University

2 Course organization http://www.tulane.edu/~howard/NLP/
Schedule of assignments Is there anyone here that wasn't here last week? NLP, Prof. Howard, Tulane University 12-Sep-2016

3 Review The quiz was the review.

4 5. Flat text
Now that you have gotten a taste of Python, let us turn to the main course: textual computing, or the computational analysis of text. But we do not have a text to work with yet, so let’s go and find one.

5 7.1. How to get a text from an on-line archive
The first step is to figure out where to put the file.

6 How to navigate folders with os
# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
# if you have no pyScripts folder
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
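The folder check and creation steps above can be folded into one small helper. This is only a sketch, assuming Python 3; the name `ensure_folder` is hypothetical, and a temporary directory stands in for your Documents folder so the example can run anywhere.

```python
import os
import tempfile


def ensure_folder(parent, name="pyScripts"):
    """Create parent/name if it is missing and return its full path."""
    target = os.path.join(parent, name)
    # exist_ok=True means no error is raised if the folder is already there
    os.makedirs(target, exist_ok=True)
    return target


# Assumption: a temporary directory plays the role of
# /Users/{your_user_name}/Documents in this sketch.
parent = tempfile.mkdtemp()
scripts = ensure_folder(parent)
print(os.path.exists(scripts))
print(os.listdir(parent))
```

Calling `os.chdir(scripts)` afterwards would make that folder the working directory, as on the slide.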

7 Project Gutenberg http://www.gutenberg.org/ebooks/28554

8 How to download a file with requests and convert it to a string with .text
>>> import requests
>>> url = ' txt'
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
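The slide relies on the third-party requests package. As a rough alternative sketch, the same download-to-string step can be done with the standard library's `urllib.request` in Python 3. This swaps in a different technique from the slide; the function name `fetch_text` is hypothetical, and a `data:` URL stands in for a real Project Gutenberg address so that the example runs without a network connection.

```python
from urllib.request import urlopen


def fetch_text(url, default_encoding="utf-8"):
    """Download url and return its contents decoded to a string."""
    with urlopen(url) as response:
        raw = response.read()  # the body arrives as bytes
        # use the charset declared by the server, if any
        encoding = response.headers.get_content_charset() or default_encoding
    return raw.decode(encoding)


# Assumption: this data: URL is a stand-in for the real text file's URL.
download = fetch_text("data:text/plain;charset=utf-8,A%20tiny%20stand-in%20text")
print(type(download))
print(len(download))
print(download[:150])
```

With requests, the decoding step is what the `.text` attribute does for you.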

9 How to save a file to your hard drive
# it is assumed that Python is looking at your pyScripts folder
>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
>>> tempF

10 How to read a file from your hard drive
>>> tempF = open('Wub.txt','r')
>>> doc = tempF.read()
>>> tempF.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()
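The save and read steps above follow Python 2 conventions. In Python 3, a file opened in text mode expects a str, so passing the result of `.encode('utf8')` to `write()` raises a TypeError there. A minimal Python 3 sketch of the same round trip follows; the helper names `save_text` and `load_text` are hypothetical, and the file is written to a temporary folder rather than to pyScripts.

```python
import os
import tempfile


def save_text(path, text):
    """Write a string to path as UTF-8; the with block closes the file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def load_text(path):
    """Read the whole file at path back into one string."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


# Assumption: a short sample string stands in for the downloaded text.
sample = "A short sample with “curly quotes”.\n"
path = os.path.join(tempfile.mkdtemp(), "Wub.txt")
save_text(path, sample)
doc = load_text(path)
print(doc == sample)
```

Naming the encoding in `open()` keeps the result independent of the platform default, and the `with` statement replaces the explicit `close()` calls.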

11 Find out about it
>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)

12 How to slice away what you don’t need
# text holds the whole file as one string, e.g. text = doc from the previous step
>>> text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = text.index('\n',lineIndex)
>>> text[:startIndex]
>>> text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> story = text[startIndex:endIndex]
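The slicing steps above can be wrapped in a single function. This is a sketch only, assuming Python 3; the name `strip_gutenberg` and the sample string are made up for illustration. Unlike the slide, it adds 1 to the start index so that the newline ending the marker line is not included in the result.

```python
START = "*** START OF THIS PROJECT GUTENBERG EBOOK"
END = "*** END OF THIS PROJECT GUTENBERG EBOOK"


def strip_gutenberg(text, start_marker=START, end_marker=END):
    """Return the part of text between the start-marker line and the end marker."""
    line_index = text.index(start_marker)  # raises ValueError if the marker is absent
    # skip to the first character after the line that holds the start marker
    start_index = text.index("\n", line_index) + 1
    end_index = text.index(end_marker, start_index)
    return text[start_index:end_index]


# Assumption: a made-up miniature file, not a real Project Gutenberg text.
sample = (
    "Header line\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Body line one.\n"
    "Body line two.\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n"
    "Footer line\n"
)
story = strip_gutenberg(sample)
print(story)
```

Because `str.index` raises a ValueError when a marker is missing, a file with differently worded markers fails loudly instead of being sliced at the wrong place.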

13 Now save it as “Wub.txt”
# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(story.encode('utf8'))
>>> tempFile.close()

14 Homework Get another text from Project Gutenberg onto your computer.
(NOT YET) Turn the commands reviewed above into a function in a script that takes a url and the name of a text file as arguments and results in a Project Gutenberg file being saved to your pyScripts folder without the Project Gutenberg header & footer.

15 Next time Other sources of flat text

