CSCE 590 Web Scraping - NLTK


1 CSCE 590 Web Scraping - NLTK
Topics: Introduction to NLTK; Parsing with NLTK
Readings: Online book (the NLTK Book)
February 21, 2017

7 The NLTK Book: table of contents
0. Preface
1. Language Processing and Python
2. Accessing Text Corpora and Lexical Resources
3. Processing Raw Text
4. Writing Structured Programs
5. Categorizing and Tagging Words (minor fixes still required)
6. Learning to Classify Text
7. Extracting Information from Text
8. Analyzing Sentence Structure
9. Building Feature Based Grammars
10. Analyzing the Meaning of Sentences (minor fixes still required)
11. Managing Linguistic Data (minor fixes still required)
12. Afterword: Facing the Language Challenge
Bibliography
Term Index

8 Installing NLTK
Install Setuptools
Install Pip: run sudo easy_install pip
Install Numpy (optional): run sudo pip install -U numpy
Install PyYAML and NLTK: run sudo pip install -U pyyaml nltk
Test installation: run python, then type import nltk

9 Installing NLTK Data
>>> import nltk
>>> nltk.download()
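nltk.download() with no arguments opens the interactive downloader. Specific data packages can also be fetched by name; the sketch below is a minimal example, assuming the package identifiers used by recent NLTK releases (names may vary slightly between versions).

import nltk

# Download only the data used in these slides.
nltk.download('brown')                       # Brown corpus
nltk.download('punkt')                       # sentence/word tokenizer models
nltk.download('averaged_perceptron_tagger')  # default POS tagger model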

10 Test NLTK Installation
1) Test the Brown corpus:
>>> from nltk.corpus import brown
>>> brown.words()[0:10]
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of']
>>> brown.tagged_words()[0:10]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]
>>> len(brown.words())
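A slightly fuller sanity check, sketched here under the assumption that the 'brown' data package has already been downloaded:

from nltk.corpus import brown

print(brown.categories()[:5])               # first few of the 15 Brown categories
print(len(brown.words()))                   # total word tokens in the corpus
print(brown.words(categories='news')[:10])  # words restricted to one category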

11 Sentence Tokenization (sentence boundary detection / segmentation), Word Tokenization, and POS Tagging
>>> from nltk import sent_tokenize, word_tokenize, pos_tag
>>> text = "Machine learning …"
>>> sents = sent_tokenize(text)
>>> sents
>>> tokens = word_tokenize(text)
>>> tokens
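The slide's text string is abbreviated, so here is a minimal end-to-end sketch with a stand-in sentence (the sample text is an assumption, not the slide's passage); it assumes the 'punkt' tokenizer data has been downloaded.

from nltk import sent_tokenize, word_tokenize

# Stand-in text; the slide's "Machine learning ..." passage is not shown in full.
text = "Machine learning is fun. It lets computers learn from data."

sents = sent_tokenize(text)    # sentence boundary detection
print(sents)                   # ['Machine learning is fun.', 'It lets computers learn from data.']
tokens = word_tokenize(text)   # word tokenization
print(tokens)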

12 Part of Speech Tagging
>>> len(tokens)
161
>>> tagged_tokens = pos_tag(tokens)
>>> tagged_tokens
[('Machine', 'NN'), ('learning', 'NN'), ('is', 'VBZ'), ('the', 'DT'), ('science', 'NN'), ('of', 'IN'), ('getting', 'VBG'), ('computers', 'NNS'), ('to', 'TO'), ('act', 'VB'), …
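To look up what a Penn Treebank tag such as NN or VBZ means, NLTK's built-in tag documentation can be queried; a minimal sketch, assuming the 'tagsets' data package has been downloaded:

import nltk

# nltk.download('tagsets')      # one-time download of the tag documentation
nltk.help.upenn_tagset('NN')    # prints the definition and examples for NN
nltk.help.upenn_tagset('VB.*')  # a regular expression selects a family of tags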

13 Parsing

14 Recursive Descent Parsing with NLTK
Parsers:
nltk.parse_cfg(grammar)  # build a CFG from a grammar string
nltk.ChartParser(g)  # build a chart parser from a grammar
nltk.RecursiveDescentParser(g)  # build a recursive descent parser from a grammar
nltk.app.rdparser_app.RecursiveDescentApp
nltk.app.srparser_app.ShiftReduceApp
Imports:
import string
import nltk
from nltk import parse, tokenize, Tree, in_idle
from nltk.draw.util import *
from nltk.draw.tree import *
from nltk.draw.cfg import *

15 Groucho Grammar
groucho_grammar = nltk.parse_cfg("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

16 The ChartParser program
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
print sent
parser = nltk.ChartParser(groucho_grammar)
trees = parser.nbest_parse(sent)
for tree in trees:
    print tree
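The slide's code targets Python 2 and the older NLTK 2 API. Under NLTK 3 the same example looks roughly like this: nltk.CFG.fromstring replaces nltk.parse_cfg, and parse() returns an iterator of trees instead of nbest_parse() returning a list.

import nltk

groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")

sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)
for tree in parser.parse(sent):   # yields both parses of the ambiguous sentence
    print(tree)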

17 Groucho Output
['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
(S (NP I) (VP (V shot) (NP (Det an) (N elephant) (PP (P in) (NP (Det my) (N pajamas))))))
(S (NP I) (VP (VP (V shot) (NP (Det an) (N elephant))) (PP (P in) (NP (Det my) (N pajamas)))))

18 Loading grammars
# NLTK - mygrammar.cfg - to illustrate loading of grammars
# grammar1 = nltk.data.load('file:mygrammar.cfg')
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'

19 Example loading "mygrammar.cfg"
grammar1 = nltk.data.load('file:mygrammar.cfg')
print grammar1
sent = "Mary saw Bob".split()
print sent
rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.nbest_parse(sent):
    print tree
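If mygrammar.cfg does not exist yet, it can be written from Python first and then loaded back. A minimal Python 3 sketch of the same workflow, with parse() in place of the older nbest_parse():

import nltk

# Write the grammar from slide 18 to a local file, then load it back.
with open('mygrammar.cfg', 'w') as f:
    f.write("""
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog'
V -> 'saw'
DET -> 'the' | 'a'
""")

grammar1 = nltk.data.load('file:mygrammar.cfg')
print(grammar1)

rd_parser = nltk.RecursiveDescentParser(grammar1)
for tree in rd_parser.parse("Mary saw Bob".split()):
    print(tree)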

20 Checking the grammar
# to dump the grammar
grammar1 = nltk.data.load('file:mygrammar.cfg')
print grammar1
# or you can iterate through the productions
for p in grammar1.productions():
    print p

21 Extending the grammar
sent = 'Mary saw a cat'.split()
for t in rd_parser.nbest_parse(sent):
    print t

Traceback (most recent call last):
  File "C:/Python25/PythonCodeExamplesMMM/rdparser.py", line 59, in <module>
    for t in rd_parser.nbest_parse(sent):
  File "C:\Python25\lib\site-packages\nltk\parse\rd.py", line 77, in nbest_parse
    self._grammar.check_coverage(tokens)
  File "C:\Python25\lib\site-packages\nltk\grammar.py", line 431, in check_coverage
    "input words: %r." % missing)
ValueError: Grammar does not cover some of the input words: "'cat'".
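The fix is to extend the lexicon so that every input word is covered. A small sketch in NLTK 3 syntax, adding 'cat' to the N production:

import nltk

grammar2 = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> N | DET N
N -> 'Mary' | 'Bob' | 'dog' | 'cat'
V -> 'saw'
DET -> 'the' | 'a'
""")

sent = 'Mary saw a cat'.split()
grammar2.check_coverage(sent)   # raises ValueError only if a word is still missing

rd_parser = nltk.RecursiveDescentParser(grammar2)
for tree in rd_parser.parse(sent):
    print(tree)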

22 Tracing
rd_parser = nltk.RecursiveDescentParser(grammar1, 2)
Parsing 'Mary saw a dog'
    [ * S ]
  E [ * NP VP ]
  E [ * N VP ]
  E [ * 'Mary' VP ]
  M [ 'Mary' * VP ]
  E [ 'Mary' * V NP ]
  E [ 'Mary' * 'saw' NP ]
  M [ 'Mary' 'saw' * NP ]
  E [ 'Mary' 'saw' * N ]
  E [ 'Mary' 'saw' * 'Mary' ]
  E [ 'Mary' 'saw' * 'Bob' ]
  E [ 'Mary' 'saw' * 'dog' ]
  E [ 'Mary' 'saw' * DET N ]
  E [ 'Mary' 'saw' * 'the' N ]
  …
(S (NP (N Mary)) (VP (V saw) (NP (DET a) (N dog))))

RecursiveDescentParser() takes an optional parameter trace. If trace is greater than zero, then the parser will report the steps that it takes as it parses a text.
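In NLTK 3 the trace level is usually passed as a keyword argument; a minimal sketch, reusing grammar1 as loaded on slide 19:

import nltk

rd_parser = nltk.RecursiveDescentParser(grammar1, trace=2)
for tree in rd_parser.parse('Mary saw a dog'.split()):  # prints each expand/match step
    print(tree)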

23 Example grammar L0 based on the ATIS corpus
S -> NP VP
NP -> Pronoun | Proper-noun | Det Nominal
Nominal -> Nominal Noun
VP -> Verb | Verb NP | Verb NP PP | Verb PP
PP -> Preposition NP

24 Lexicon for L0
Noun -> flights | breeze | trip | morning
Verb -> is | prefer | like | need | want | fly
…
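The L0 rules and lexicon above can be written down as an NLTK grammar. The sketch below fills in a Nominal -> Noun base case and a small Pronoun/ProperNoun/Det/Preposition lexicon (assumptions, since the slide's lexicon is truncated), renames Proper-noun to ProperNoun to keep the symbol name simple, and uses a chart parser because the left-recursive rule Nominal -> Nominal Noun would send a recursive descent parser into an infinite loop.

import nltk

l0 = nltk.CFG.fromstring("""
S -> NP VP
NP -> Pronoun | ProperNoun | Det Nominal
Nominal -> Noun | Nominal Noun
VP -> Verb | Verb NP | Verb NP PP | Verb PP
PP -> Preposition NP
Noun -> 'flights' | 'breeze' | 'trip' | 'morning'
Verb -> 'is' | 'prefer' | 'like' | 'need' | 'want' | 'fly'
# The entries below are assumed; the slide's lexicon is abbreviated.
Pronoun -> 'I'
ProperNoun -> 'Alaska'
Det -> 'a' | 'the'
Preposition -> 'from' | 'to'
""")

parser = nltk.ChartParser(l0)   # handles the left-recursive Nominal rule
for tree in parser.parse('I want a morning trip'.split()):
    print(tree)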

25 nltk.app.rdparser_app, lines 864-886
def app():
    """
    Create a recursive descent parser demo, using a simple grammar and text.
    """
    from nltk import parse_cfg
    grammar = parse_cfg("""
    # Grammatical productions.
    S -> NP VP
    NP -> Det N PP | Det N
    VP -> V NP PP | V NP | V
    PP -> P NP
    # Lexical productions.
    NP -> 'I'
    Det -> 'the' | 'a'
    N -> 'man' | 'park' | 'dog' | 'telescope'
    V -> 'ate' | 'saw'
    P -> 'in' | 'under' | 'with'
    """)

26 Example nltk.app.rdparser
import string
import nltk
from nltk import parse, tokenize, Tree, in_idle
from nltk.draw.util import *
from nltk.draw.tree import *
from nltk.draw.cfg import *

sent = 'the dog saw a man in the park'.split()
RecursiveDescentApp(grammar, sent).mainloop()
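In current NLTK releases the same demo can also be opened without the wildcard imports; a minimal sketch (requires Tkinter and a graphical display):

import nltk

nltk.app.rdparser()   # opens the interactive recursive descent parser demo window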

28 Example nltk.app.srparser
#import string
import nltk
from nltk import parse, tokenize, Tree, in_idle
from nltk.draw.util import *
from nltk.draw.tree import *
from nltk.draw.cfg import *
from nltk import parse_cfg
from nltk.app import *

nltk.app.srparser()
