Flat text 3 Day 8 - 9/16/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.

1 Flat text 3 Day 8 - 9/16/16 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University

2 Course organization http://www.tulane.edu/~howard/NLP/
Schedule of assignments NLP, Prof. Howard, Tulane University 16-Sep-2016

3 Review

4 Project Gutenberg http://www.gutenberg.org/ebooks/28554

5 Add this code to the function
# In "textProc.py"
def gutenLoader(url, name):
    import requests
    try:
        download = requests.get(url).text
    except:
        # Stop here: 'download' is undefined if the request failed.
        return 'Download failed!'
    # Keep only the text between the Project Gutenberg start and end markers.
    lineIndex = download.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
    startIndex = download.index('\n', lineIndex)
    endIndex = download.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
    text = download[startIndex:endIndex]
    with open(name, 'w') as tempFile:
        text = text.encode('utf8')
        tempFile.write(text)
    return 'File was written.'
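The marker-based slicing in gutenLoader can be tried without a network connection. The following is a minimal sketch in Python 3 (the slides use Python 2); the function name extract_body and the sample string are hypothetical and exist only for illustration.

```python
START_MARKER = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def extract_body(download):
    """Return the text between the Project Gutenberg start and end markers."""
    line_index = download.index(START_MARKER)
    # Skip the rest of the marker line by moving to its newline.
    start_index = download.index('\n', line_index)
    end_index = download.index(END_MARKER)
    return download[start_index:end_index]


# Hypothetical stand-in for a downloaded Project Gutenberg file.
sample = (
    'Header text\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'First line of the book.\n'
    'Second line of the book.\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'License text\n'
)

body = extract_body(sample)
print(body.strip())
```

Note that str.index raises ValueError when a marker is missing, so a text with different marker wording would need its own handling.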

6 Call the function as before
(>>> import textProc)
>>> reload(textProc)
>>> from textProc import gutenLoader
>>> print gutenLoader(url, name)

7 Practices
Practice 1: Try this with another PG text.
Practice 2: Add comments.

8 5. Flat text
The code for this chapter can be downloaded as nlp5.py, presumably to your Downloads folder, and then moved to pyScripts, where you can open it in Spyder and run it line by line.

9 Install textract & chardet
.html
Mac
    Install homebrew:
    $ brew install poppler antiword unrtf tesseract
All
    ($ pip install tesseract)
    $ pip install textract
    $ pip install chardet

10 EPUBs

11 Convert EPUBs
>>> from requests import get
>>> url = ' s'
>>> response = get(url)
>>> type(response)
<class 'requests.models.Response'>

12 More about the Response object
>>> response.headers
{'content-length': '16922',
 'x-varnish': ' ',
 'x-powered-by': '3',
 'set-cookie': 'session_id=c91e2c01ad330b816664af3600b141ed13f5be94; Domain=.gutenberg.org; expires=Thu, 15 Sep :05:50 GMT; Path=/',
 'age': '0',
 'server': 'Apache',
 'x-connection': 'Close',
 'via': '1.1 varnish',
 'x-rate-limiter': 'ratelimiter2.php57',
 'date': 'Thu, 15 Sep :35:50 GMT',
 'x-frame-options': 'sameorigin',
 'content-type': 'application/epub+zip'}
>>> response.text[:150]
>>> response.content[:150]
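The last two calls contrast the two views of the response body: in requests, response.content is the raw bytes and response.text is those bytes decoded to a string. The sketch below, in Python 3, imitates that difference with an ordinary byte string so it runs without a download; the variable names and the sample phrase are assumptions made for illustration.

```python
# Stand-in for response.content: raw bytes as they arrive from the server.
content = 'café au lait'.encode('utf8')

# Stand-in for response.text: the same bytes decoded to a string.
text = content.decode('utf8')

print(type(content), len(content))  # bytes: 'é' occupies two bytes in UTF-8
print(type(text), len(text))        # str: 'é' counts as one character

# Slicing bytes can cut a character in half; slicing text cannot.
print(content[:4])
print(text[:4])
```

For a binary format such as an EPUB, decoding makes no sense, which is why the bytes view is the one to save to disk.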

13 Textract expects to read a file from disk
with open('Wub.epub','wb') as tempFile:
    tempFile.write(response.content)
from textract import process
rawText = process('Wub.epub')
type(rawText)
from chardet import detect
detect(rawText)
len(rawText) # 34361
rawText[:150]
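The 'application/epub+zip' content type reported in the headers indicates that an EPUB is a zip container, which is why it has to be written in binary mode ('wb') before a tool can read it from disk. As a rough sketch in Python 3 using only the standard library, the code below builds a tiny stand-in archive, writes it to disk as bytes, and reopens it. The file names standin.epub and chapter1.xhtml are hypothetical, and this is not how textract itself is implemented.

```python
import io
import os
import tempfile
import zipfile

# Build a tiny zip archive in memory to stand in for response.content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, 'w') as archive:
    archive.writestr('mimetype', 'application/epub+zip')
    archive.writestr('chapter1.xhtml', '<p>Hello, Wub.</p>')
content = buffer.getvalue()

# Write the bytes to disk in binary mode, as the slide does with Wub.epub.
path = os.path.join(tempfile.mkdtemp(), 'standin.epub')
with open(path, 'wb') as temp_file:
    temp_file.write(content)

# Reopen the file from disk and look inside the container.
with zipfile.ZipFile(path) as archive:
    names = archive.namelist()
    chapter = archive.read('chapter1.xhtml').decode('utf8')

print(names)
print(chapter)
```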

14 Review
from requests import get
from textract import process
url = ' content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
response = get(url)
with open('MinorityReport.pdf', "wb") as tempFile:
    tempFile.write(response.content)
rawText = process('MinorityReport.pdf')

15 PDFs

16 Download & convert a PDF
url = ' content/uploads/2011/11/Philip-K-Dick-The-Minority-Report.pdf'
response = get(url)
with open('MinorityReport.pdf', "wb") as tempFile:
    tempFile.write(response.content)
rawText = process('MinorityReport.pdf')

17 Images

18 Download & write an image
from requests import get
url = '
try:
    response = get(url)
except:
    print 'Download failed!'
else:
    # Write the file only if the download succeeded.
    with open('ocrTest.png', "wb") as tempFile:
        tempFile.write(response.content)
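The download-then-write pattern used for the EPUB, the PDF, and the image can be wrapped in one helper. The following is a minimal sketch in Python 3, assuming the requests package is installed; the function name download_file, its timeout default, and the output file name are hypothetical. It catches requests.exceptions.RequestException rather than using a bare except, and it writes the file only when the download succeeds.

```python
import requests


def download_file(url, path, timeout=10):
    """Save the bytes at url to path; return True on success, False otherwise."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx replies as failures too
    except requests.exceptions.RequestException as error:
        print('Download failed:', error)
        return False
    # Binary mode, because images, PDFs, and EPUBs are not text.
    with open(path, 'wb') as temp_file:
        temp_file.write(response.content)
    return True


# A string with no scheme is rejected before any network access is attempted.
ok = download_file('not-a-url', 'example_output.png')
print(ok)
```

Returning a boolean lets the caller decide whether to go on to the conversion step.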

19 Try to OCR it
>>> from textract import process
>>> rawText = process('ocrTest.png')
# Switch to Terminal for tesseract
$ cd /Users/harryhow/Documents/pyScripts
$ tesseract ocrTest.png ocrText

20 Q1 stats
      MIN   AVG   MAX
Q1    6.0   9.3   10.0

21 Next time
Q2
Regex
