CSCE 590 Web Scraping - NLTK


1 CSCE 590 Web Scraping - NLTK
Topics: Introduction to NLTK; Parsing with NLTK
Readings: Online book (the NLTK Book)
February 21, 2017

7 The NLTK Book: table of contents
0. Preface
1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
4. Writing Structured Programs
5. Categorizing and Tagging Words (minor fixes still required)
6. Learning to Classify Text
7. Extracting Information from Text
8. Analyzing Sentence Structure
9. Building Feature Based Grammars
10. Analyzing the Meaning of Sentences (minor fixes still required)
11. Managing Linguistic Data (minor fixes still required)
12. Afterword: Facing the Language Challenge
Bibliography
Term Index

8 Installing NLTK
Install Setuptools
Install Pip: run sudo easy_install pip
Install Numpy (optional): run sudo pip install -U numpy
Install PyYAML and NLTK: run sudo pip install -U pyyaml nltk
Test installation: run python, then type import nltk

9 Installing NLTK Data
>>> import nltk
>>> nltk.download()
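nltk.download() with no arguments opens the interactive downloader. Specific data packages can also be fetched by name; the sketch below is a minimal example, assuming the package identifiers used by recent NLTK releases (names may vary slightly between versions).

import nltk

# Download only the data used in these slides.
nltk.download('brown')                       # Brown corpus
nltk.download('punkt')                       # sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')  # default POS tagger model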

10 Test NLTK Installation
1) Test the Brown corpus:
>>> from nltk.corpus import brown
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
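A slightly fuller sanity check, sketched here under the assumption that the 'brown' data package has already been downloaded:

from nltk.corpus import brown

print(brown.categories()[:5])               # first few of the 15 Brown categories
print(len(brown.words()))                   # total word tokens in the corpus
print(brown.words(categories='news')[:10])  # words restricted to one category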

11 Sentence Tokenization (sentence boundary detection / segmentation), Word Tokenization, and POS Tagging
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Machine learning …"
>>> sents = sent_tokenize(text)
>>> sents
>>> tokens = word_tokenize(text)
>>> tokens
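The slide's text string is abbreviated, so here is a minimal end-to-end sketch with a stand-in sentence (the sample text is an assumption, not the slide's passage); it assumes the 'punkt' tokenizer data has been downloaded.

from nltk import sent_tokenize, word_tokenize

# Stand-in text; the slide's "Machine learning ..." passage is not shown in full.
text = "Machine learning is fun. It lets computers learn from data."

sents = sent_tokenize(text)    # sentence boundary detection
print(sents)                   # ['Machine learning is fun.', 'It lets computers learn from data.']
tokens = word_tokenize(text)   # word tokenization
print(tokens)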

12 Part of Speech Tagging
>>> len(tokens)
161
>>> tagged_tokens = pos_tag(tokens)
>>> tagged_tokens
[('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), …
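To look up what a Penn Treebank tag such as NN or VBZ means, NLTK's built-in tag documentation can be queried; a minimal sketch, assuming the 'tagsets' data package has been downloaded:

import nltk

# nltk.download('tagsets')      # one-time download of the tag documentation
nltk.help.upenn_tagset('NN')    # prints the definition and examples for NN
nltk.help.upenn_tagset('VB.*')  # a regular expression selects a family of tags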

13 Parsing

14 Recursive Descent Parsing with NLTK
Parsers:
nltk.parse_cfg(grammar)  # build a CFG from a grammar string
nltk.ChartParser(g)  # build a chart parser from a grammar
nltk.RecursiveDescentParser(g)  # build a recursive descent parser from a grammar
nltk.app.rdparser_app.RecursiveDescentApp
nltk.app.srparser_app.ShiftReduceApp
Imports:
import string
import nltk
from nltk import parse, tokenize, Tree, in_idle
from nltk.draw.util import *
from nltk.draw.tree import *
from nltk.draw.cfg import *

15 Groucho Grammar
groucho_grammar = nltk.parse_cfg("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

16 The ChartParser program
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
print sent
parser = nltk.ChartParser(groucho_grammar)
trees = parser.nbest_parse(sent)
for tree in trees:
    print tree
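The slide's code targets Python 2 and the older NLTK 2 API. Under NLTK 3 the same example looks roughly like this: nltk.CFG.fromstring replaces nltk.parse_cfg, and parse() returns an iterator of trees instead of nbest_parse() returning a list.

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):   # yields both parses of the ambiguous sentence
    print(tree)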

17 Groucho Output
['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
(S (NP I) (VP (V shot) (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
(S (NP I) (VP (VP (V shot) (NP (Det an) (N elephant))) (PP (P in) (NP (Det my) (N pajamas)))))

18 Loading grammars
# NLTK - mygrammar.cfg - to illustrate loading of grammars
# grammar1 = nltk.data.load('file:mygrammar.cfg')
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'

19 Example loading "mygrammar.cfg"
grammar1 = nltk.data.load('file:mygrammar.cfg')
print grammar1
sent = "Mary saw Bob".split()
print sent
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.nbest_parse(sent):
    print tree
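If mygrammar.cfg does not exist yet, it can be written from Python first and then loaded back. A minimal Python 3 sketch of the same workflow, with parse() in place of the older nbest_parse():

import nltk

# Write the grammar from slide 18 to a local file, then load it back.
with open('mygrammar.cfg', 'w') as f:
    f.write("""
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'
""")

grammar1 = nltk.data.load('file:mygrammar.cfg')
print(grammar1)

rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse("Mary saw Bob".split()):
    print(tree)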

20 Checking the grammar
# to dump the grammar
grammar1 = nltk.data.load('file:mygrammar.cfg')
print grammar1
# or you can iterate through the productions
for p in grammar1.productions():
    print p

21 Extending the grammar
sent = 'Mary saw a cat'.split()
for t in rd_parser.nbest_parse(sent):
    print t

Traceback (most recent call last):
  File "C:/Python25/PythonCodeExamplesMMM/rdparser.py", line 59, in <module>
    for t in rd_parser.nbest_parse(sent):
  File "C:\Python25\lib\site-packages\nltk\parse\rd.py", line 77, in nbest_parse
    self._grammar.check_coverage(tokens)
  File "C:\Python25\lib\site-packages\nltk\grammar.py", line 431, in check_coverage
    "input words: %r." % missing)
ValueError: Grammar does not cover some of the input words: "'cat'".
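The fix is to extend the lexicon so that every input word is covered. A small sketch in NLTK 3 syntax, adding 'cat' to the N production:

import nltk

grammar2 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog' | 'cat'
V -> 'saw'
DET -> 'the' | 'a'
""")

sent = 'Mary saw a cat'.split()
grammar2.check_coverage(sent)   # raises ValueError only if a word is still missing

rd_parser = nltk.RecursiveDescentParser(grammar2)
for tree in rd_parser.parse(sent):
    print(tree)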

22 Tracing
rd_parser = nltk.RecursiveDescentParser(grammar1, 2)
Parsing 'Mary saw a dog'
    [ * S ]
  E [ * NP VP ]
  E [ * N VP ]
  E [ * 'Mary' VP ]
  M [ 'Mary' * VP ]
  E [ 'Mary' * V NP ]
  E [ 'Mary' * 'saw' NP ]
  M [ 'Mary' 'saw' * NP ]
  E [ 'Mary' 'saw' * N ]
  E [ 'Mary' 'saw' * 'Mary' ]
  E [ 'Mary' 'saw' * 'Bob' ]
  E [ 'Mary' 'saw' * 'dog' ]
  E [ 'Mary' 'saw' * DET N ]
  E [ 'Mary' 'saw' * 'the' N ]
  …
(S (NP (N Mary)) (VP (V saw) (NP (DET a) (N dog))))

RecursiveDescentParser() takes an optional parameter trace. If trace is greater than zero, then the parser will report the steps that it takes as it parses a text.
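In NLTK 3 the trace level is usually passed as a keyword argument; a minimal sketch, reusing grammar1 as loaded on slide 19:

import nltk

rd_parser = nltk.RecursiveDescentParser(grammar1, trace=2)
for tree in rd_parser.parse('Mary saw a dog'.split()):  # prints each expand/match step
    print(tree)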

23 Example grammar L0 based on the ATIS corpus
S -> NP VP
NP -> Pronoun | Proper-noun | Det Nominal
Nominal -> Nominal Noun
VP -> Verb | Verb NP | Verb NP PP | Verb PP
PP -> Preposition NP

24 Lexicon for L0
Noun -> flights | breeze | trip | morning
Verb -> is | prefer | like | need | want | fly
…
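The L0 rules and lexicon above can be written down as an NLTK grammar. The sketch below fills in a Nominal -> Noun base case and a small Pronoun/ProperNoun/Det/Preposition lexicon (assumptions, since the slide's lexicon is truncated), renames Proper-noun to ProperNoun to keep the symbol name simple, and uses a chart parser because the left-recursive rule Nominal -> Nominal Noun would send a recursive descent parser into an infinite loop.

import nltk

l0 = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pronoun | ProperNoun | Det Nominal
Nominal -> Noun | Nominal Noun
VP -> Verb | Verb NP | Verb NP PP | Verb PP
PP -> Preposition NP
Noun -> 'flights' | 'breeze' | 'trip' | 'morning'
Verb -> 'is' | 'prefer' | 'like' | 'need' | 'want' | 'fly'
# The entries below are assumed; the slide's lexicon is abbreviated.
Pronoun -> 'I'
ProperNoun -> 'Alaska'
Det -> 'a' | 'the'
Preposition -> 'from' | 'to'
""")

parser = nltk.ChartParser(l0)   # handles the left-recursive Nominal rule
for tree in parser.parse('I want a morning trip'.split()):
    print(tree)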

25 nltk.app.rdparser_app, lines 864-886
def app():
    """
    Create a recursive descent parser demo, using a simple grammar and text.
    """
    from nltk import parse_cfg
    grammar = parse_cfg("""
    # Grammatical productions.
    S -> NP VP
    NP -> Det N PP | Det N
    VP -> V NP PP | V NP | V
    PP -> P NP
    # Lexical productions.
    NP -> 'I'
    Det -> 'the' | 'a'
    N -> 'man' | 'park' | 'dog' | 'telescope'
    V -> 'ate' | 'saw'
    P -> 'in' | 'under' | 'with'
    """)

26 Example nltk.app.rdparser
import string
import nltk
from nltk import parse, tokenize, Tree, in_idle
from nltk.draw.util import *
from nltk.draw.tree import *
from nltk.draw.cfg import *

sent = 'the dog saw a man in the park'.split()
RecursiveDescentApp(grammar, sent).mainloop()
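In current NLTK releases the same demo can also be opened without the wildcard imports; a minimal sketch (requires Tkinter and a graphical display):

import nltk

nltk.app.rdparser()   # opens the interactive recursive descent parser demo window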

28 Example nltk.app.srparser
#import string
import nltk
from nltk import parse, tokenize, Tree, in_idle
from nltk.draw.util import *
from nltk.draw.tree import *
from nltk.draw.cfg import *
from nltk import parse_cfg
from nltk.app import *

nltk.app.srparser()
