NLTK & Python Day 9 LING 681.02 Computational Linguistics Harry Howard Tulane University.

14-Sept-2009, LING 681.02, Prof. Howard, Tulane University

Course organization
- NLTK should be installed on the computers in this room!

NLPP §3 Processing raw text
§3.1 Accessing text from the Web and from disk

Using e-books
- Download an e-book (the downloaded text is of type 'str')
- Tokenization: break up the string into words and punctuation
- Convert to NLTK text
- Remove headers

Download an e-book
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
'The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n'
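The session above is Python 2. In Python 3 the same download might look roughly like the sketch below. Assumptions: `fetch_text` is a hypothetical helper name, UTF-8 is only a guess at the file's encoding, and the Gutenberg address from the slide may have moved since 2009.

```python
from urllib.request import urlopen


def fetch_text(url, encoding="utf-8"):
    """Download url and return its body as a str (hypothetical helper)."""
    with urlopen(url) as response:
        data = response.read()  # bytes in Python 3, not str
    # Assumption: the caller knows the encoding; undecodable bytes are replaced.
    return data.decode(encoding, errors="replace")


# Usage (needs network access):
#   raw = fetch_text("http://www.gutenberg.org/files/2554/2554.txt")
#   raw[:75]
```

The main difference from the slide is that `read()` returns bytes, so the text has to be decoded before it can be tokenized as a string.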

Tokenize
>>> import nltk
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
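To see what "break up the string into words and punctuation" means without NLTK installed, here is a rough stand-in. It is not the algorithm behind `nltk.word_tokenize`; `simple_tokenize` and `TOKEN_PATTERN` are hypothetical names, and the pattern is an assumption chosen for illustration.

```python
import re

# Assumption: a word is a run of letters/digits/underscores (\w+), and every
# other non-space character is a one-character token. This is NOT what
# nltk.word_tokenize does internally; it only illustrates the idea.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text):
    """Split text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)


sample = "The Project Gutenberg EBook of Crime and Punishment, by Fyodor Dostoevsky\r\n"
tokens = simple_tokenize(sample)
print(tokens[:10])
```

On this one line the result matches the first ten tokens on the slide, but the two tokenizers will differ on contractions, abbreviations, and similar cases.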

Convert to NLTK text
>>> text = nltk.Text(tokens)
>>> type(text)
>>> text[1020:1060]
['CHAPTER', 'I', 'On', 'an', 'exceptionally', 'hot', 'evening', 'early', 'in', 'July', 'a', 'young', 'man', 'came', 'out', 'of', 'the', 'garret', 'in', 'which', 'he', 'lodged', 'in', 'S', '.', 'Place', 'and', 'walked', 'slowly', ',', 'as', 'though', 'in', 'hesitation', ',', 'towards', 'K', '.', 'bridge', '.']
>>> text.collocations()
Katerina Ivanovna; Pulcheria Alexandrovna; Avdotya Romanovna; Pyotr Petrovitch; Project Gutenberg; Marfa Petrovna; Rodion Romanovitch; Sofya Semyonovna; Nikodim Fomitch; did not; Hay Market; etc.

Remove headers
>>> raw.find("PART I")
5303
>>> raw.rfind("End of Project Gutenberg's Crime")
1157681
>>> raw = raw[5303:1157681]
>>> raw.find("PART I")
0
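The offsets 5303 and 1157681 only hold for that exact copy of the file. A small sketch of the same find/rfind/slice step that looks the offsets up each time is given below; `trim_between` is a hypothetical helper, and the sample string is made up.

```python
def trim_between(raw, start_marker, end_marker):
    """Return raw from the first start_marker up to (not including) the last end_marker."""
    start = raw.find(start_marker)  # first occurrence, -1 if absent
    end = raw.rfind(end_marker)     # last occurrence, -1 if absent
    if start == -1 or end == -1 or end < start:
        raise ValueError("marker not found, or markers out of order")
    return raw[start:end]


# Made-up miniature of a Project Gutenberg file.
sample = (
    "License header\n"
    "PART I\n"
    "CHAPTER I\n"
    "On an exceptionally hot evening\n"
    "End of Project Gutenberg's Crime\n"
    "footer"
)
body = trim_between(sample, "PART I", "End of Project Gutenberg's Crime")
print(body.find("PART I"))
```

The explicit check matters because `find()` and `rfind()` return -1 when the marker is missing, and slicing with -1 would quietly produce the wrong text instead of an error.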

Dealing with HTML
- Download a web page (the downloaded HTML is of type 'str')
- Tokenization: break up the string into words and punctuation
- Convert to NLTK text
- Remove headers

Download a web page
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> print html  # only if you want to see the HTML code

Tokenize
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out',...]
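As far as I know, NLTK 3.x no longer implements `clean_html` and directs users to Beautiful Soup instead. For simple pages, tag stripping can also be sketched with the standard library alone. `TextExtractor` and `strip_html` are hypothetical names and the sample page is made up; this is not a reimplementation of what `clean_html` did.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text outside <script> and <style> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def strip_html(html):
    """Return the text content of html with runs of whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = (
    "<html><head><title>BBC NEWS | Health</title>"
    "<style>p {color: red}</style></head>"
    "<body><p>Blondes 'to die out'</p></body></html>"
)
print(strip_html(page))
```

Because the chunks are joined with spaces, a word interrupted by inline markup (for example a bold first letter) would be split in two; a real HTML library handles such cases more carefully.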

Convert to NLTK text
>>> tokens = tokens[96:399]
>>> text = nltk.Text(tokens)
>>> text.concordance('gene')
 they say too few people now carry the gene for blondes to last beyond the next tw
t blonde hair is caused by a recessive gene. In order for a child to have blonde
 to have blonde hair, it must have the gene on both sides of the family in the gra
there is a disadvantage of having that gene or by chance. They don ' t disappear
ondes would disappear is if having the gene was a disadvantage and I do not think
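A concordance is simply every occurrence of a word shown with its neighbours. The sketch below shows the idea on a plain token list. `concordance_lines` is a hypothetical helper; it counts context in tokens and lowercases before comparing, both of which are my assumptions and not necessarily how `Text.concordance` works, since that method pads by characters, as the cut-off words in the output above show.

```python
def concordance_lines(tokens, word, width=3):
    """Return each occurrence of word with `width` tokens of context on both sides."""
    target = word.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            lines.append(" ".join(left + [token] + right))
    return lines


tokens = "blonde hair is caused by a recessive gene . it must have the gene on both sides".split()
for line in concordance_lines(tokens, "gene"):
    print(line)
```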

Remove headers
- Trial and error.

For more sophisticated processing of HTML
- Use the Beautiful Soup package, available from:
- http://www.crummy.com/software/BeautifulSoup/

Other Internet formats
- Search engine results
- Feeds/RSS

Search engine results
- Advantages
  - large size
  - easy to do
- Disadvantages
  - search engine restricts patterns
  - results vary according to time and place
  - content may be duplicated

Search engine API
- What is the Google AJAX Search API?
  - The Google AJAX Search API lets you put Google Search in your web pages with JavaScript.
  - You can embed a simple, dynamic search box and display search results in your own web pages or use the results in innovative, programmatic ways.
- http://code.google.com/apis/ajaxsearch/

RSS
- What is it?
- Use the Universal Feed Parser from http://feedparser.org/ to access the content of a blog, as in the following example.

RSS example
>>> import feedparser
>>> llog = feedparser.parse("http://languagelog.ldc.upenn.edu/nll/?feed=atom")
>>> llog['feed']['title']
u'Language Log'
>>> len(llog.entries)
15
>>> post = llog.entries[2]
>>> post.title
u"He's My BF"

RSS example, cont.
>>> content = post.content[0].value
>>> content[:70]
u' Today I was chatting with three of our visiting graduate students f'
>>> nltk.word_tokenize(nltk.clean_html(content))
[u'Today', u'I', u'was', u'chatting', u'with', u'three', u'of', u'our', u'visiting', u'graduate', u'students', u'from', u'the', u'PRC', u'.', u'Thinking', u'that', u'I', u'was', u'being', u'au', u'courant', u',', u'I', u'mentioned', u'the', u'expression', u'DUI4XIANG4', u'\u5c0d\u8c61', u'("', u'boy', u'/', u'girl', u'friend', u'"',...]
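To show what a feed looks like underneath, the sketch below reads a tiny Atom document with the standard library. It is not a substitute for the Universal Feed Parser, which copes with many feed formats and with malformed input. `read_atom` is a hypothetical helper and `SAMPLE_FEED` is a made-up feed, not the Language Log one.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}  # the standard Atom namespace

# A made-up two-entry feed standing in for a real download.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Blog</title>
  <entry>
    <title>First post</title>
    <content type="html">&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</content>
  </entry>
  <entry>
    <title>Second post</title>
    <content type="html">&lt;p&gt;More text&lt;/p&gt;</content>
  </entry>
</feed>"""


def read_atom(xml_text):
    """Return (feed_title, [(entry_title, entry_content), ...]) from Atom XML."""
    root = ET.fromstring(xml_text)
    feed_title = root.findtext("atom:title", default="", namespaces=ATOM_NS)
    entries = [
        (
            entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            entry.findtext("atom:content", default="", namespaces=ATOM_NS),
        )
        for entry in root.findall("atom:entry", ATOM_NS)
    ]
    return feed_title, entries


title, entries = read_atom(SAMPLE_FEED)
print(title, len(entries))
print(entries[0])
```

As in the session above, the entry content is still HTML, so the markup has to be stripped before tokenizing.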

Reading local files
- Plain text or ASCII
- Binary formats
- User input

Plain text or ASCII files
- Use the functions from §2 that involve open(); they are repeated on the next slide.

Loading your own corpus (Table 2.3)

Example             Description
abspath(fileid)     the location of the file on disk
encoding(fileid)    the encoding of the file (if known)
open(fileid)        open a stream for reading the given corpus file
root()              the path to the root of locally installed corpus
readme()            the contents of the README file of the corpus

Your turn, p. 84
- Create a file called document.txt using a text editor, type in a few lines of text, and save it as plain text.
- If you are using IDLE, select the New Window command in the File menu, type the required text into this window, and then save the file as document.txt inside the directory that IDLE offers in the pop-up dialogue box.
- Next, in the Python interpreter, open the file using f = open('document.txt'), then inspect its contents using print f.read().
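For anyone on Python 3, the exercise can be sketched as follows. Assumptions: the file is created from Python instead of a text editor, it lives in a temporary directory so the working directory is left alone, and the two sample sentences are just filler text.

```python
import os
import tempfile

# Assumption: write document.txt into a throwaway directory instead of
# the directory IDLE would offer.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "document.txt")

with open(path, "w", encoding="utf-8") as f:
    f.write("Time flies like an arrow.\nFruit flies like a banana.\n")

# Read the whole file at once ...
with open(path, encoding="utf-8") as f:
    contents = f.read()
print(contents)

# ... or line by line, stripping the trailing newline from each line.
with open(path, encoding="utf-8") as f:
    lines = [line.strip() for line in f]
print(lines)
```

The `with` blocks close the file automatically, and passing `encoding` explicitly avoids depending on the platform default.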

Text from binary
- Opening a PDF file as plain text shows its raw bytes, which are mostly compressed (note /FlateDecode) and unreadable:

%PDF-1.2 % ｄ恃 2 0 obj << /Length 3205 /Filter /FlateDecode >> stream HâîW€r€» Õ3øÇo°™$ 3∏ø≠Ì›$fiZªíX[~ê¸ í# kê‡‚"Y˛˙ÙÙm ä©T U 1 ÙÙtü>}:Z>.fi¸˝≥Y>ˆã? q - ˝øõd Kc”uY À‹ÿuë fπ=,fi}Xº˘≤4ã7ˇˇΩ˚Á{ˇÁ√«Â–çnÒ·ÁÂ_ ø|X¸kÒÓvÒÊ÷öYfi>,¢µ±d [...]

Text from binary
- Open it with third-party libraries such as pypdf or pywin32, or
- open it with the corresponding program and save it as text, or
- if it is on the Internet, see if Google has an HTML version.

User input
- Use raw_input() to capture what the user has typed or pasted.
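In Python 3, `raw_input()` was renamed `input()`. The sketch below captures a line and counts its words. `ask_for_tokens` is a hypothetical helper, splitting on whitespace is a simplification of real tokenization, and the `reader` parameter exists only so the example can run without a keyboard.

```python
def ask_for_tokens(prompt="Enter some text: ", reader=input):
    """Read one line via reader (input() by default) and split it on whitespace."""
    line = reader(prompt)
    return line.split()


# Interactive use:
#   words = ask_for_tokens()
# Non-interactive demo, passing a stand-in for input():
words = ask_for_tokens(reader=lambda prompt: "On an exceptionally hot evening early in July")
print("You typed", len(words), "words.")
```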

Summary: NLP pipeline
- Fig. 3.1

Next time
- NLPP §3.2 Strings
- NLPP §3.3 RegEx
