Flat text 2 Day 7 - 9/14/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

Course organization http://www.tulane.edu/~howard/NLP/
Schedule of assignments NLP, Prof. Howard, Tulane University 14-Sep-2016

Review
I am going to review everything, because I have expanded on what I said Monday.

5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, from where you can open it in Spyder and run each line one by one.

How to navigate folders with os
# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
# if you have no pyScripts folder
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
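The same navigation steps can be tried safely in a throwaway directory. This is a minimal sketch, assuming Python 3 (the slides use Python 2); the temporary folder and the variable names are illustrative and are not part of nlp5.py.

```python
import os
import tempfile

# Remember where we started so the sketch can restore it afterwards.
original_dir = os.getcwd()

# A temporary directory stands in for the Documents folder (an assumption).
sandbox = tempfile.mkdtemp()
os.chdir(sandbox)

# Create pyScripts only if it is not already there.
if not os.path.exists('pyScripts'):
    os.makedirs('pyScripts')

folder_exists = os.path.isdir(os.path.join(sandbox, 'pyScripts'))
contents = os.listdir('.')

# Move into the new folder, then go back to the starting directory.
os.chdir('pyScripts')
inside = os.path.basename(os.getcwd())
os.chdir(original_dir)
```

Checking with os.path.exists() before os.makedirs() avoids the error that os.makedirs() raises when the folder already exists.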

Project Gutenberg http://www.gutenberg.org/ebooks/28554

How to download a file with requests and convert it to a string with .text
>>> import requests
>>> url = '
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
($> pip install chardet)
>>> import chardet
>>> chardet.detect(download)
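When inspecting a download, it helps to keep decoded text and raw bytes apart. The sketch below assumes Python 3, where the two are separate types; the sample string is made up and merely stands in for a downloaded file. With requests, the corresponding attributes of the response are .text (decoded) and .content (raw bytes).

```python
# A made-up stand-in for downloaded text (not real Gutenberg content).
sample = 'The caf\u00e9 was empty.'

text_type = type(sample).__name__   # decoded text is a str
char_count = len(sample)            # counts characters

# Encoding turns text into bytes; a non-ASCII character needs more than one byte.
raw = sample.encode('utf8')
raw_type = type(raw).__name__
byte_count = len(raw)

# Decoding with the same encoding recovers the original text.
round_trip = raw.decode('utf8')
```

Because len() counts characters for text and bytes for raw data, the two lengths differ as soon as the text contains a non-ASCII character.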

How to start a function for recurring file operations
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    download = requests.get(url).text

How to use try to catch errors
>>> try:
...     download = requests.get(url).text
... except:
...     print 'Download failed!'
...
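A bare except catches every error, including typos in your own code. A slightly more careful sketch names the kind of exception it expects. Here the download step is passed in as a function so the pattern can be tried without a network connection; safe_download, broken_fetch and working_fetch are hypothetical names, and the code assumes Python 3. The exceptions raised by requests derive from IOError, so the same except clause would cover them.

```python
def safe_download(fetch, url):
    """Return the text produced by fetch(url), or None if the fetch fails."""
    try:
        return fetch(url)
    except IOError as error:
        # Only input/output failures are handled here; other errors still surface.
        print('Download failed!', error)
        return None


def broken_fetch(url):
    # Hypothetical stand-in for a download that cannot reach the server.
    raise IOError('cannot reach ' + url)


def working_fetch(url):
    # Hypothetical stand-in for a successful download.
    return 'some text from ' + url


failed = safe_download(broken_fetch, 'http://example.org/missing.txt')
succeeded = safe_download(working_fetch, 'http://example.org/book.txt')
```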

Add the try block to gutenLoader():
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'

Warning
Project Gutenberg keeps track of how frequently you access it and will ask you to prove that you are human with a captcha. You will know that this has happened if the text that you downloaded is actually a bunch of HTML, as illustrated in the appendix "A snippet of Project Gutenberg’s captcha page". Since requests does download a sort of text, it does not throw an exception.
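Since no exception is thrown, the download has to be inspected directly. One possible check, sketched here for Python 3, looks for HTML markup near the start of the text. The function name looks_like_html and both sample strings are made up for illustration; the real captcha page may look different.

```python
def looks_like_html(text):
    """Guess whether a download is an HTML page rather than plain text."""
    start = text[:500].lower()
    return '<!doctype html' in start or '<html' in start


# Made-up samples: one HTML-like page and one plain-text opening.
captcha_like = '<!DOCTYPE html>\n<html><head><title>Check</title></head></html>'
book_like = 'A plain text ebook\n\n*** START OF THIS PROJECT GUTENBERG EBOOK'

html_detected = looks_like_html(captcha_like)
text_detected = looks_like_html(book_like)
```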

How to save a file to your hard drive
>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
>>> tempF
>>> import os
>>> os.listdir('.')

How to read a file from your hard drive
>>> tempFile = open('Wub.txt','r')
>>> doc = tempFile.read()
>>> tempFile.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()

Find out about it
>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)

How to read from and write to a file
>>> doc1 = open('Wub.txt', 'r').read()
>>> tempText = doc1.replace('Gutenberg', 'GUTENBERG')
>>> tempText = tempText.encode('utf8')
>>> tempFile = open('Wub2.txt','w')
>>> tempFile.write(tempText)
>>> tempFile.close()
# examine result
>>> doc3 = open('Wub2.txt', 'r').read()
>>> doc3[:150]

How to simplify file operations with the with statement
>>> with open('Wub.txt','r') as tempFile:
...     text = tempFile.read()
...     text = text.replace('Gutenberg', 'GUTENBERG')
...
>>> with open('Wub3.txt','w') as tempFile:
...     tempFile.write(text)
...
# test
>>> doc4 = open('Wub3.txt', 'r').read()
>>> doc4[:150]
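The read, replace and write cycle can be tried end to end on a small file. This is a sketch for Python 3, where open() takes an encoding argument and does the encoding and decoding itself; the file names and the sample line are invented, and a temporary directory is used so nothing in pyScripts is touched.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
source_path = os.path.join(workdir, 'sample.txt')
target_path = os.path.join(workdir, 'sample_upper.txt')

# Write a small made-up file to work on.
with open(source_path, 'w', encoding='utf8') as out_file:
    out_file.write('Project Gutenberg sample line\n')

# Read and transform; the file is closed as soon as the with block ends.
with open(source_path, 'r', encoding='utf8') as in_file:
    text = in_file.read()
text = text.replace('Gutenberg', 'GUTENBERG')

# Write the changed text to a second file.
with open(target_path, 'w', encoding='utf8') as out_file:
    out_file.write(text)

# Read the result back to confirm the change.
with open(target_path, 'r', encoding='utf8') as in_file:
    result = in_file.read()

file_is_closed = in_file.closed
```

The last line shows the point of with: the file object still exists after the block, but it has already been closed without an explicit close() call.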

Add the with block to gutenLoader():
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)

How to refresh your script with reload()
(>>> import textProc)
>>> reload(textProc)
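In Python 2, reload() is a built-in function; in Python 3 it lives in the importlib module. The sketch below assumes Python 3 and uses a throwaway module named demo_textproc (a hypothetical name, written to a temporary folder) to show that a changed file only takes effect after reloading.

```python
import importlib
import os
import sys
import tempfile

# Write a tiny module to a temporary folder and make that folder importable.
workdir = tempfile.mkdtemp()
module_path = os.path.join(workdir, 'demo_textproc.py')
with open(module_path, 'w') as module_file:
    module_file.write("MESSAGE = 'first version'\n")
sys.path.insert(0, workdir)

demo_textproc = importlib.import_module('demo_textproc')
first = demo_textproc.MESSAGE

# Edit the file on disk; the module already in memory does not change by itself.
with open(module_path, 'w') as module_file:
    module_file.write("MESSAGE = 'second, longer version'\n")
before_reload = demo_textproc.MESSAGE

# Reloading runs the edited file again and updates the module object.
importlib.reload(demo_textproc)
after_reload = demo_textproc.MESSAGE
```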

How to call your function
>>> url = ' txt'
>>> name = 'Eyes.txt'
>>> from textProc import gutenLoader
>>> gutenLoader(url, name)

How to get your function to communicate with the outside world with return
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)
    return 'File was written.'
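What return changes is easiest to see by comparing two otherwise identical functions. In this Python 3 sketch, count_silently and count_and_report are hypothetical names chosen for illustration: a function without a return statement hands back None, while one with return hands back its value.

```python
def count_silently(text):
    # No return statement, so the caller receives None.
    length = len(text)


def count_and_report(text):
    # The return value is handed back to the caller.
    length = len(text)
    return 'Counted %d characters.' % length


silent_result = count_silently('abc')
reported_result = count_and_report('abc')
```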

Call it by way of print
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
# open it
>>> with open('Wub.txt','r') as tempFile:
...     download = tempFile.read()
...

How to slice away what you don’t need
>>> download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = download.index('\n',lineIndex)
>>> download[:startIndex]
>>> download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> text = download[startIndex:endIndex]
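The slicing steps can be wrapped in a small helper and tried on a made-up text. This is a sketch for Python 3; strip_boilerplate is a hypothetical name and the sample is invented. It differs from the slide in two ways, both stated assumptions: it uses find(), which returns -1 when a marker is missing where index() would raise ValueError, and it strips surrounding whitespace from the result.

```python
START_MARKER = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def strip_boilerplate(download):
    """Return the text between the start-marker line and the end marker."""
    line_index = download.find(START_MARKER)
    end_index = download.find(END_MARKER)
    if line_index == -1 or end_index == -1:
        # A marker is missing, for example because a captcha page came back.
        return None
    # Skip to the end of the marker line so the marker itself is dropped.
    start_index = download.index('\n', line_index)
    return download[start_index:end_index].strip()


sample = ('Header and license notes\n'
          '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'First line of the story.\n'
          'Last line of the story.\n'
          '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'More license text\n')

body = strip_boilerplate(sample)
missing = strip_boilerplate('no markers here')
```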

Add this code to the function
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n',lineIndex)
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    text = download[startIndex:endIndex]
    with open(name,'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'
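For comparison, here is one way the whole function might look in Python 3. It is a sketch under stated assumptions, not the version in nlp5.py: the name guten_loader is hypothetical, the download step is passed in as a fetch argument (in real use it could be a small function that returns requests.get(url).text), and the function returns early when the download fails or the markers are absent, so that later lines never run on missing data.

```python
import os
import tempfile


def guten_loader(url, name, fetch):
    """Fetch url, keep the text between the Gutenberg markers, save it as name."""
    try:
        download = fetch(url)
    except IOError:
        # Stop here; otherwise the lines below would use an undefined name.
        return 'Download failed!'
    start_marker = '*** START OF THIS PROJECT GUTENBERG EBOOK'
    end_marker = '*** END OF THIS PROJECT GUTENBERG EBOOK'
    if start_marker not in download or end_marker not in download:
        return 'Markers not found; nothing was written.'
    line_index = download.index(start_marker)
    start_index = download.index('\n', line_index)
    end_index = download.index(end_marker)
    with open(name, 'w', encoding='utf8') as temp_file:
        temp_file.write(download[start_index:end_index])
    return 'File was written.'


def fake_fetch(url):
    # Hypothetical stand-in for a real download, so no network is needed.
    return ('License header\n'
            '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
            'Story text.\n'
            '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n')


def failing_fetch(url):
    # Hypothetical stand-in for a download that fails.
    raise IOError('no connection')


target = os.path.join(tempfile.mkdtemp(), 'Sample.txt')
status = guten_loader('http://example.org/sample.txt', target, fake_fetch)
failure = guten_loader('http://example.org/sample.txt', target, failing_fetch)
with open(target, 'r', encoding='utf8') as temp_file:
    saved = temp_file.read()
```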

Call the function as before
(>>> import textProc)
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)

Next time
We more or less did Practice 1 today. Do Practice 2.
Other sources of flat text.
