Flat text 2 Day 7 - 9/14/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

1 Flat text 2 Day 7 - 9/14/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization http://www.tulane.edu/~howard/NLP/
Schedule of assignments NLP, Prof. Howard, Tulane University 14-Sep-2016

3 Review I am going to review everything, because I have expanded on what I said Monday.

4 5. Flat text The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, from where you can open it in Spyder and run each line one by one.

5 How to navigate folders with os
# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
# if you have no pyScripts folder
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
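The folder steps above can be wrapped so that they are safe to repeat. This is only a minimal sketch, assuming Python 3 (the slides use Python 2); the helper name ensure_folder is hypothetical, and a temporary directory stands in for your Documents folder.

```python
import os
import tempfile


def ensure_folder(parent, name):
    """Create parent/name if it is missing, move into it, and return its path."""
    target = os.path.join(parent, name)
    # exist_ok=True (Python 3.2+) makes the call safe to repeat
    os.makedirs(target, exist_ok=True)
    os.chdir(target)
    return target


start_dir = os.getcwd()
# a temporary directory stands in for /Users/{your_user_name}/Documents
parent = tempfile.mkdtemp()
path = ensure_folder(parent, 'pyScripts')
folder_exists = os.path.exists(path)
listing = os.listdir('.')   # empty for a freshly made folder
os.chdir(start_dir)         # go back to where we started
```

Calling ensure_folder() a second time with the same arguments does not raise an error, unlike a bare os.makedirs('pyScripts') on a folder that already exists.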

6 Project Gutenberg http://www.gutenberg.org/ebooks/28554

7 How to download a file with requests and convert it to a string with .text
>>> import requests
>>> url = '
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
($> pip install chardet)
>>> import chardet
>>> chardet.detect(download)

8 How to start a function for recurring file operations
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    download = requests.get(url).text

9 How to use try to catch errors
>>> try:
...     download = requests.get(url).text
... except:
...     print 'Download failed!'
...

10 Add the try block to gutenLoader():
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
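One caveat about this try block: after the except branch prints its message, the function carries on, so any later line that uses download would raise a NameError because the variable was never assigned. The following is a minimal sketch of one way to guard against that, assuming Python 3. The names safe_download, fetch, fake_ok and fake_broken are hypothetical, and the fake fetchers stand in for requests.get(url).text so that the example runs offline.

```python
def safe_download(url, fetch):
    """Return the downloaded text, or None if the download failed.

    fetch is any function that takes a URL and returns text; in the
    setting of the slides it would wrap requests.get(url).text.
    """
    try:
        download = fetch(url)
    except Exception as err:   # narrower than a bare except:
        print('Download failed!', err)
        return None            # stop here so nothing below uses 'download'
    return download


def fake_ok(url):
    # hypothetical stand-in for a successful download
    return 'some text'


def fake_broken(url):
    # hypothetical stand-in for a network failure
    raise IOError('no connection')


good = safe_download('http://example.invalid/a.txt', fake_ok)
bad = safe_download('http://example.invalid/a.txt', fake_broken)
```

A caller can then test the result with `if bad is None:` before trying to write anything to disk.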

11 Warning Project Gutenberg keeps track of how frequently you access it and will ask you to prove that you are human with a captcha. You will know that this has happened if the text that you downloaded is actually a bunch of HTML, as illustrated in the appendix "A snippet of Project Gutenberg’s captcha page". Since requests still downloads text of a sort, it does not throw an exception.
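Because no exception is thrown, the download has to be inspected directly. The sketch below is a rough heuristic rather than a reliable detector, assuming Python 3; the function name looks_like_html and both sample strings are invented for illustration.

```python
def looks_like_html(text):
    """Guess whether a download is an HTML page rather than plain text."""
    # only the first non-blank characters are inspected
    head = text.lstrip()[:200].lower()
    return head.startswith('<!doctype html') or head.startswith('<html')


# invented samples: a captcha-like page and the opening of a plain-text file
captcha_like = '<!DOCTYPE html>\n<html><body>Are you human?</body></html>'
book_like = 'Title line of a plain-text book\n\nSome story text.'

flag_page = looks_like_html(captcha_like)
flag_book = looks_like_html(book_like)
```

A check like this could run right after the download, before the text is saved to a file.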

12 How to save a file to your hard drive
>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
>>> tempF
>>> import os
>>> os.listdir('.')

13 How to read a file from your hard drive
>>> tempFile = open('Wub.txt','r')
>>> doc = tempFile.read()
>>> tempFile.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()

14 Find out about it
>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)

15 How to read from and write to a file
>>> doc1 = open('Wub.txt', 'r').read()
>>> tempText = doc1.replace('Gutenberg', 'GUTENBERG')
>>> tempText = tempText.encode('utf8')
>>> tempFile = open('Wub2.txt','w')
>>> tempFile.write(tempText)
>>> tempFile.close()
# examine result
>>> doc3 = open('Wub2.txt', 'r').read()
>>> doc3[:150]

16 How to simplify file operations with the with statement
>>> with open('Wub.txt','r') as tempFile:
...     text = tempFile.read()
...     text = text.replace('Gutenberg', 'GUTENBERG')
...
>>> with open('Wub3.txt','w') as tempFile:
...     tempFile.write(text)
...
# test
>>> doc4 = open('Wub3.txt', 'r').read()
>>> doc4[:150]
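In Python 3, open() accepts an encoding argument, so the separate encode('utf8') step is not needed and every file can be handled inside a with block. A minimal sketch of the same read, replace, write round trip under that assumption; the sample string is made up, and a temporary folder is used so that nothing in pyScripts is overwritten.

```python
import os
import tempfile

folder = tempfile.mkdtemp()
source = os.path.join(folder, 'Wub.txt')
target = os.path.join(folder, 'Wub3.txt')

# made-up stand-in for the downloaded text, with one non-ASCII character
sample = 'Project Gutenberg\u2019s sample line\n'

with open(source, 'w', encoding='utf8') as tempFile:
    tempFile.write(sample)

with open(source, 'r', encoding='utf8') as tempFile:
    text = tempFile.read()
text = text.replace('Gutenberg', 'GUTENBERG')

with open(target, 'w', encoding='utf8') as tempFile:
    tempFile.write(text)

# test: the file is closed automatically when each with block ends
with open(target, 'r', encoding='utf8') as tempFile:
    doc4 = tempFile.read()
```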

17 Add the with block to gutenLoader():
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)

18 How to refresh your script with reload()
(>>> import textProc)
>>> reload(textProc)
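In Python 3 the built-in reload() is gone and the function lives in importlib instead. The sketch below, which assumes Python 3.4 or later, also shows why `from textProc import gutenLoader` has to be repeated after a reload: a name imported earlier keeps pointing at the old function. The module textProcDemo and its status() function are hypothetical and are written to a temporary folder just for the demonstration.

```python
import importlib
import os
import sys
import tempfile

sys.dont_write_bytecode = True   # avoid stale cached bytecode in this demo

folder = tempfile.mkdtemp()
sys.path.insert(0, folder)
module_path = os.path.join(folder, 'textProcDemo.py')

# first version of the script
with open(module_path, 'w') as f:
    f.write("def status():\n    return 'v1'\n")

importlib.invalidate_caches()
import textProcDemo
from textProcDemo import status
before = status()

# edit the script on disk, as you would in Spyder
with open(module_path, 'w') as f:
    f.write("def status():\n    return 'version two'\n")

importlib.reload(textProcDemo)
stale = status()                  # the old name still points at the old function
from textProcDemo import status   # re-import to pick up the new definition
after = status()
```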

19 How to call your function
>>> url = ' txt'
>>> name = 'Eyes.txt'
>>> from textProc import gutenLoader
>>> gutenLoader(url, name)

20 How to get your function to communicate with the outside world with return
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    with open(name,'w') as tempFile:
        download = download.encode('utf8')
        tempFile.write(download)
    return 'File was written.'

21 Call it by way of print
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)
# open it
>>> with open('Wub.txt','r') as tempFile:
...     download = tempFile.read()
...

22 How to slice away what you don’t need
>>> download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = download.index('\n',lineIndex)
>>> download[:startIndex]
>>> download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> text = download[startIndex:endIndex]
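Note that index() raises a ValueError when the marker is not found, which is what would happen if the download were a captcha page instead of a book. The slicing steps above can be sketched as a small function that handles that case, assuming Python 3; the name slice_body and the miniature sample text are made up for illustration.

```python
START = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def slice_body(download, start_marker=START, end_marker=END):
    """Return the text between the two marker lines, or None if one is missing."""
    try:
        lineIndex = download.index(start_marker)
        # skip to the end of the marker line
        startIndex = download.index('\n', lineIndex)
        endIndex = download.index(end_marker, startIndex)
    except ValueError:   # str.index raises ValueError when nothing is found
        return None
    return download[startIndex:endIndex]


# made-up miniature file with a header, a body and a footer
sample = ('Header line\n'
          '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'Body line one.\nBody line two.\n'
          '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'License text\n')

body = slice_body(sample)
missing = slice_body('<html>captcha page</html>')
```

The slice starts at the newline that ends the START line, so body begins with '\n'; calling body.strip() would remove the surrounding blank space.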

23 Add this code to the function
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        print 'Download failed!'
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n',lineIndex)
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    text = download[startIndex:endIndex]
    with open(name,'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'

24 Call the function as before
(>>> import textProc)
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)

25 Next time We more or less did Practice 1 today. Do Practice 2.
Other sources of flat text.

