ON-LINE DOCUMENTS 3 DAY 22 - 10/17/14 LING 3820 & 6820 Natural Language Processing Harry Howard Tulane University.


Course organization

- The syllabus is under construction.
- Chapter numbering:
  - 3.7. How to deal with non-English characters
  - 4.5. How to create a pattern with Unicode characters
  - 6. Control

17-Oct-2014, NLP, Prof. Howard, Tulane University

Open Spyder

How to download a file from Project Gutenberg

Review

The global working directory

How to set the global working directory in Spyder

- I am getting tired of constantly double-checking that Python saves my stuff to pyScripts, but fortunately Spyder can do it for us.
- Open Spyder:
  - On a Mac, click on the python menu (top left).
  - In Windows, click on the Tools menu.
- Open Preferences > Global working directory > Startup … At startup, the global working directory is: > the following directory: /Users/harryhow/Documents/pyScripts
- Set the next two selections to "the global working directory".
- Leave the last untouched & unchecked.

gutenLoader

    def gutenLoader(url, name):
        from urllib import urlopen
        # download the raw text of the e-book
        download = urlopen(url)
        text = download.read()
        print 'text length = '+str(len(text))
        # the story starts on the line after the START marker ...
        lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
        startIndex = text.index('\n',lineIndex)
        # ... and stops at the END marker
        endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
        story = text[startIndex:endIndex]
        print 'story length = '+str(len(story))
        # save the story to a file in the current working directory
        tempFile = open(name,'w')
        tempFile.write(story.encode('utf8'))
        tempFile.close()
        print 'File saved'
        return
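The slicing step in gutenLoader can be tried without a network connection. The following is a minimal sketch in Python 3 syntax (the slides use Python 2); the helper name extract_story and the miniature sample text are made up for illustration and are not part of corpFunctions.

```python
START_MARKER = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THIS PROJECT GUTENBERG EBOOK'


def extract_story(text):
    """Return the part of text between the Gutenberg start and end markers."""
    line_index = text.index(START_MARKER)
    # Jump to the end of the marker line so the rest of that line is dropped.
    start_index = text.index('\n', line_index)
    end_index = text.index(END_MARKER)
    return text[start_index:end_index]


# Hypothetical miniature file, invented for this example.
sample = (
    'Legal header\n'
    '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'Once upon a time.\n'
    '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
    'Legal footer\n'
)
story = extract_story(sample)
print(repr(story))
```

Note that str.index raises a ValueError when a marker is missing, so a file with a differently worded marker line would stop the function rather than produce an empty story.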

Usage

    # make sure that the current working directory is pyScripts
    >>> from corpFunctions import gutenLoader
    >>> url = '
    >>> name = 'VariableMan.txt'
    >>> gutenLoader(url, name)

Homework

- Since the documents in Project Gutenberg are numbered consecutively, you can download a series of them all at once with a for loop. Write the code for doing so.
- HINT: remember that strings and integers are different types. To turn an integer into a string, use the built-in str() function, e.g. str(1).

Answer

    # make sure that the current working directory is pyScripts
    >>> from corpFunctions import gutenLoader
    >>> base =
    >>> for n in [1,2,3]:
    ...     num = str(base+n)
    ...     url = '
    ...     name = 'story'+str(n)+'.txt'
    ...     gutenLoader(url, name)
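The string-building part of the loop can be checked on its own before anything is downloaded. This is a sketch in Python 3 syntax; the function build_targets, the base number 100 and the example.org address are all placeholders invented for illustration, not the real Project Gutenberg values.

```python
def build_targets(base, count, url_template):
    """Return (url, file name) pairs for count consecutive document numbers."""
    targets = []
    for n in range(1, count + 1):
        # base + n is an integer; str() turns it into text for the URL.
        num = str(base + n)
        url = url_template.format(num=num)
        name = 'story' + str(n) + '.txt'
        targets.append((url, name))
    return targets


# Hypothetical base number and URL pattern, for illustration only.
targets = build_targets(100, 3, 'http://example.org/ebooks/{num}.txt')
for url, name in targets:
    print(url, name)
```

Each pair could then be handed to gutenLoader(url, name) inside the same loop.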

Get pdfminer

You only do this once ~ How to install a package by hand

- Point your web browser at
- Click on the green button to download the compressed folder.
- It downloads to your Downloads folder. Double-click it to decompress it if it doesn't decompress automatically. If you have no decompression utility, you need to get one.
- Open the Terminal/Command Prompt (Windows Start > Search > cmd > Command Prompt)

    $> cd {drop file here}
    $> python setup.py install
    $> pdf2txt.py samples/simple1.pdf

Get a pdf file

- To get an idea of the source:
- Download the file to pyScripts:

    # make sure that the current working directory is pyScripts
    from urllib import urlopen
    url = ' 0130meeting.pdf'
    download = urlopen(url)
    doc = download.read()
    # a PDF is binary data, so open the file in binary mode
    tempFile = open('FOMC pdf','wb')
    tempFile.write(doc)
    tempFile.close()

Run pdf2text

    # make sure that Python is looking at pyScripts
    >>> from corpFunctions import pdf2text
    >>> text = pdf2text('FOMC pdf')
    >>> len(text)
    >>> text[:50]

How to pre-process a text with the PlaintextCorpusReader

NLTK

- One of the reasons for using NLTK is that it relieves us of much of the effort of making a raw text amenable to computational analysis. It does so by including a module of corpus readers, which pre-process files for certain tasks or formats.
- Most of them are specialized for particular corpora, so we will start with the basic one, called the PlaintextCorpusReader.

PlaintextCorpusReader

- The PlaintextCorpusReader needs to know two things: where your file is and what its name is.
- If the current working directory is where the file is, the location argument can be left ‘blank’ by using the null string ''.
- We only have one file, ‘Wub.txt’. To prevent problems down the line, also give the reader an optional third argument that states the file's encoding, encoding='utf-8'.
- Now let NLTK tokenize the text into words and punctuation.

Usage

    # make sure that the current working directory is pyScripts
    >>> from nltk.corpus import PlaintextCorpusReader
    >>> wubReader = PlaintextCorpusReader('', 'Wub.txt', encoding='utf-8')
    >>> wubWords = wubReader.words()
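To get a feel for what splitting a text "into words and punctuation" produces, here is a rough stand-in in Python 3 syntax that uses only the standard library. It assumes a simple rule (runs of word characters, or runs of punctuation characters); the function name word_punct_tokenize and the sample sentence are invented for illustration, and the result may differ in detail from what wubReader.words() returns.

```python
import re


def word_punct_tokenize(text):
    """Split text into runs of word characters and runs of punctuation."""
    # \w+ matches a word; [^\w\s]+ matches consecutive punctuation marks.
    return re.findall(r"\w+|[^\w\s]+", text)


# Made-up sentence, not taken from Wub.txt.
sample = "The wub ate, slept, and said: 'Hello!'"
tokens = word_punct_tokenize(sample)
print(tokens)
```

Under this rule adjacent punctuation marks stay together as one token, which is why the final exclamation mark and quotation mark come out as a single item.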

Initial look at the text

    >>> len(wubWords)
    >>> wubWords[:50]
    >>> set(wubWords)
    >>> len(set(wubWords))
    >>> wubWords.count('wub')
    >>> len(wubWords) / len(set(wubWords))
    >>> from __future__ import division
    >>> 100 * wubWords.count('a') / len(wubWords)
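The same counts can be worked through by hand on a tiny list. This sketch uses Python 3 syntax, where / already performs true division, so the from __future__ import division line that Python 2 needs is not required; the word list is made up for illustration.

```python
# Made-up token list standing in for wubWords.
words = ['the', 'wub', 'is', 'a', 'wub', ',', 'said', 'the', 'captain', '.']

n_tokens = len(words)                    # every item counts, repeats included
n_types = len(set(words))                # a set keeps one copy of each item
wub_count = words.count('wub')           # how often one word occurs
tokens_per_type = n_tokens / n_types     # average uses of each distinct item
percent_a = 100 * words.count('a') / n_tokens  # share of the text that is 'a'

print(n_tokens, n_types, wub_count, tokens_per_type, percent_a)
```

In Python 2 without the __future__ import, the two divisions would be rounded down to whole numbers.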

Basic text analysis with NLTK

    >>> from nltk.text import Text
    >>> t = Text(wubWords)
    >>> t.concordance('wub')
    >>> t.similar('wub')
    >>> t.common_contexts(['wub','captain'])
    >>> t.dispersion_plot(['wub'])
    >>> t.generate()
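A concordance lists each occurrence of a word together with its neighbors. The sketch below, in Python 3 syntax, is a simplified stand-in written for illustration and is not NLTK's implementation; the function name, the width parameter and the word list are all assumptions.

```python
def concordance(words, target, width=3):
    """Return (left context, word, right context) for each match of target."""
    lines = []
    for i, word in enumerate(words):
        if word.lower() == target.lower():
            # Take up to `width` tokens on each side of the match.
            left = ' '.join(words[max(0, i - width):i])
            right = ' '.join(words[i + 1:i + 1 + width])
            lines.append((left, word, right))
    return lines


# Made-up token list standing in for wubWords.
words = ['The', 'wub', 'stood', 'there', ',', 'and', 'the',
         'captain', 'watched', 'the', 'wub', '.']
hits = concordance(words, 'wub', width=2)
for left, word, right in hits:
    print(left, '[' + word + ']', right)
```

Matching is case-insensitive here by design choice, so 'Wub' at the start of a sentence would be found as well.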

Next time

- Q6 take home
- Intro to text stats