1 Asma Naseer

2 • Shallow Parsing, or Partial Parsing
• First proposed by Steven Abney (1991)
• Text is broken up into small pieces
• Each piece is parsed separately [1]

3 • Words are not arranged flatly in a sentence but are grouped into smaller parts called phrases
• The girl was playing in the street
• اس نے احمد کو کتاب دی (Urdu: "He/she gave the book to Ahmed")

4 • Chunks are non-recursive (a chunk does not contain a phrase of the same category as itself)
• NP → D? AdjP? AdjP? N
• The big red balloon
• [NP [D The] [AdjP [Adj big]] [AdjP [Adj red]] [N balloon]] [1]
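The rule above can be sketched as a regex over a POS-tag sequence. This is a minimal stand-in for what a regex chunker (such as NLTK's RegexpParser) does; the single-letter tagset and the approximation of AdjP by one Adj tag are assumptions for illustration.

```python
import re

def chunk_np(tags):
    """Return (start, end) token spans of NP chunks for the toy rule
    NP -> D? AdjP? AdjP? N, matched over a space-delimited tag string.
    AdjP is approximated here by a single Adj tag."""
    s = " ".join(tags) + " "
    pattern = re.compile(r"(?:D )?(?:Adj )*N ")
    spans = []
    for m in pattern.finditer(s):
        start = s[:m.start()].count(" ")          # tokens before the match
        end = start + m.group().strip().count(" ") + 1
        spans.append((start, end))
    return spans

# "The big red balloon popped in the street" as abstract tags:
tags = ["D", "Adj", "Adj", "N", "V", "P", "D", "N"]
print(chunk_np(tags))  # → [(0, 4), (6, 8)]
```

Because the rule is non-recursive, a single regex pass suffices; no NP can nest inside another NP.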

5 • Each phrase is dominated by a head h
• A man proud of his son. / A proud man
• The root of the chunk has h as its s-head (semantic head)
• The head of a noun phrase is usually a noun or pronoun [1]

6 • Chunk tagging schemes: IOBE, IOB, IO

7 • IOB (Inside, Outside, Begin) tags: I-NP, O-NP, B-NP, I-VP, O-VP, B-VP
• قائد اعظم محمد علی جناح نے قوم سے خطاب کیا (Urdu: "Quaid-e-Azam Muhammad Ali Jinnah addressed the nation")
• [قائد اعظم B-NP] [محمد I-NP] [علی I-NP] [جناح I-NP] [نے O-NP] [قوم B-NP] [سے O-NP] [خطاب B-NP] [کیا O-NP]
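IOB tagging can be sketched as a small conversion from chunked token groups to per-token tags: the first token of a chunk gets B-type, the rest get I-type, and tokens outside any chunk get O. The helper name and the English example are assumptions, not from the slides.

```python
def to_iob(groups):
    """groups: list of (tokens, chunk_type) pairs, chunk_type None for
    tokens outside any chunk. Returns (token, IOB-tag) pairs."""
    tagged = []
    for tokens, ctype in groups:
        if ctype is None:
            tagged += [(tok, "O") for tok in tokens]     # outside any chunk
        else:
            tagged.append((tokens[0], "B-" + ctype))     # chunk begins
            tagged += [(tok, "I-" + ctype) for tok in tokens[1:]]  # inside
    return tagged

groups = [(["The", "big", "red", "balloon"], "NP"),
          (["popped"], "VP"),
          (["."], None)]
print(to_iob(groups))
```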

8 • Rule-Based vs. Statistical Chunking [2]
• Use of Support Vector Learning for Chunk Identification [5]
• A Context Sensitive Maximum Likelihood Approach to Chunking [6]
• Chunking with Maximum Entropy Models [7]
• Single-Classifier Memory-Based Phrase Chunking [8]
• Hybrid Text Chunking [9]
• Shallow Parsing as POS Tagging [3]

9 • Two techniques are used
• Regular-expression rules
○ Shallow parse based on regular expressions
• N-gram statistical tagger (machine-learning based chunking)
○ NLTK (Natural Language Toolkit), based on the TnT tagger (Trigrams'n'Tags)
○ Basic idea: reuse a POS tagger for chunking

10 • Regular-expression rules
○ The regular expressions must be developed manually
• N-gram statistical tagger
○ Can be trained on gold-standard chunked data

11 • The focus is on verb- and noun-phrase chunking
• Noun phrases
○ A noun or pronoun is the head
○ May also contain determiners (articles, demonstratives, numerals, possessives, quantifiers), adjectives, and complements (adpositional phrases, relative clauses)
• Verb phrases
○ A verb is the head
○ Often one or two complements
○ Any number of adjuncts

12 • Training NLTK on chunk data
○ Start with an empty rule set
○ 1. Define or refine a rule
○ 2. Run the chunker on the training data
○ 3. Compare results with the previous run
○ Repeat steps 1-3 until performance no longer improves significantly
• Issue: the corpus contains 211,727 phrases in total; a subset of 1,000 phrases was used
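The refine-and-compare loop above can be sketched in a few lines. The rules, data, and scorer below are invented stand-ins: a "rule" is just a tag pattern counted as an NP, and the score is the fraction of gold chunks a rule set covers.

```python
def evaluate(rules, data):
    """Toy scorer: fraction of gold chunk patterns covered by the rules."""
    gold = [tuple(ch) for _, chunks in data for ch in chunks]
    return sum(1 for ch in gold if ch in rules) / len(gold) if gold else 0.0

# Toy training data: (POS tags, gold NP chunk patterns).
data = [(["D", "N", "V"], [["D", "N"]]),
        (["Adj", "N", "V"], [["Adj", "N"]])]

# Candidate rules a developer might try, in order.
candidates = [("D", "N"), ("Adj", "N"), ("V",)]

rules, best = set(), 0.0
for rule in candidates:               # 1. define or refine a rule
    trial = rules | {rule}
    score = evaluate(trial, data)     # 2. run the chunker on training data
    if score > best:                  # 3. compare with the previous run
        rules, best = trial, score    #    keep the rule only if it helps
print(best, sorted(rules))
```

The third candidate is discarded because it does not improve the score, which is exactly the stopping criterion on the slide.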

13 • Training TnT on chunk data
○ Chunking is treated as statistical tagging
○ Two steps:
○ Parameter generation: estimate model parameters from the training corpus
○ Tagging: label each word with a chunk tag
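The two steps can be sketched with a unigram stand-in for TnT's trigram model (an assumed simplification): parameter generation stores the most frequent chunk label per POS tag, and tagging applies that table to new input.

```python
from collections import Counter, defaultdict

def train(corpus):
    """Step 1, parameter generation: most frequent chunk label per POS tag.
    (A unigram stand-in for TnT's trigram statistics.)"""
    counts = defaultdict(Counter)
    for pos, chunk in corpus:
        counts[pos][chunk] += 1
    return {pos: c.most_common(1)[0][0] for pos, c in counts.items()}

def tag(model, pos_tags):
    """Step 2, tagging: label each word's POS tag with a chunk tag."""
    return [model.get(pos, "O") for pos in pos_tags]

corpus = [("DT", "B-NP"), ("NN", "I-NP"), ("VBD", "B-VP"),
          ("DT", "B-NP"), ("NN", "I-NP"), ("IN", "O")]
model = train(corpus)
print(tag(model, ["DT", "NN", "VBD", "IN"]))  # → ['B-NP', 'I-NP', 'B-VP', 'O']
```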

14 • Data set
○ WSJ: the Wall Street Journal, a New York (US) newspaper covering international business and financial news
○ Training: sections 15-18; testing: section 20
○ Both are tagged with POS and IOB labels
○ Special characters are treated like other POS tags; punctuation is tagged O

15 • Results are measured with
○ Precision P = |reference ∩ test| / |test|
○ Recall R = |reference ∩ test| / |reference|
○ F-measure F_α = 1 / (α/P + (1-α)/R)
○ With α = 0.5 this reduces to F = 2PR / (P + R)
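The three metrics can be computed directly over sets of chunk spans, which makes the set-intersection definitions on the slide concrete. The spans below are invented for illustration.

```python
def prf(reference, test, alpha=0.5):
    """Precision, recall, and F_alpha over sets of chunk spans."""
    correct = len(reference & test)       # |reference ∩ test|
    p = correct / len(test)
    r = correct / len(reference)
    f = 1.0 / (alpha / p + (1 - alpha) / r)   # the slide's F_alpha
    return p, r, f

reference = {(0, 2), (3, 5), (6, 7)}   # gold chunk spans
test = {(0, 2), (3, 5), (5, 7)}        # predicted chunk spans
p, r, f = prf(reference, test)
print(round(p, 3), round(r, 3), round(f, 3))  # → 0.667 0.667 0.667
```

With α = 0.5 the result equals the harmonic mean 2PR / (P + R), as the slide notes.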

16 Results

  NLTK:      Precision   Recall    F-measure
    VP       79.3 %      80.1 %    79.7 %
    NP       76.5 %      84.4 %    80.3 %

  TnT:       Precision   Recall    F-measure
    VP       79.59 %     82.35 %   80.95 %
    NP       78.36 %     76.76 %   77.55 %

17 • SVM (a large-margin classifier)
• Introduced by Vapnik (1995)
• Solves a two-class pattern recognition problem
• Good generalization performance
• High accuracy in text categorization without overfitting (Joachims, 1998; Taira and Haruno, 1999)

18 • Training data: (x_1, y_1), ..., (x_l, y_l), with x_i ∈ R^n, y_i ∈ {+1, -1}
○ x_i is the i-th sample, represented as an n-dimensional vector
○ y_i is the (positive or negative) class label of the i-th sample
• In an SVM, positive and negative examples are separated by a hyperplane; the SVM finds the optimal hyperplane

19 • Two possible hyperplanes (figure)

20 • Chunks in the CoNLL-2000 shared task are IOB-tagged
• Each chunk tag is either an I or a B variant of its chunk type, e.g. I-NP or B-NP
• 22 chunk tags occur in CoNLL-2000, so chunking becomes classification over these 22 types
• SVM is a binary classifier, so it is extended to k classes
○ One class vs. all others, or
○ Pairwise classification: k(k-1)/2 classifiers → 22 × 21 / 2 = 231 classifiers
○ A majority vote decides the final class
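The pairwise (one-vs-one) scheme can be sketched as follows: one binary classifier per class pair, each casting a vote, with the majority deciding. The toy "classifiers" below are an assumed lookup that just demonstrates the voting mechanics, not real SVMs.

```python
from collections import Counter
from itertools import combinations

classes = [f"C{i}" for i in range(22)]
pairs = list(combinations(classes, 2))
print(len(pairs))  # k(k-1)/2 = 22*21/2 = 231 binary classifiers

def pairwise_predict(x, binary_classifiers):
    """Each pairwise classifier votes for one of its two classes;
    the class with the most votes wins."""
    votes = Counter()
    for (a, b), clf in binary_classifiers.items():
        votes[clf(x)] += 1
    return votes.most_common(1)[0][0]

# Toy classifiers: each pair "prefers" the lexicographically smaller class,
# so C0 wins all 21 of its pairwise contests and takes the majority.
clfs = {(a, b): (lambda x, a=a, b=b: min(a, b)) for a, b in pairs}
print(pairwise_predict(None, clfs))  # → C0
```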

21 • The feature vector consists of
○ Words: w
○ POS tags: t
○ Chunk tags: c
• To identify the chunk tag c_i at the i-th word, use w_j and t_j for j = i-2, i-1, i, i+1, i+2, and c_j for j = i-2, i-1
• All features are expanded to binary values, either 0 or 1
• The feature vector then has 92,837 dimensions
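The binary expansion works by giving every (position, value) feature its own 0/1 dimension, which is why the full vector grows to tens of thousands of dimensions. A minimal sketch, with invented feature names:

```python
def build_index(window_features):
    """Map each distinct (position, value) feature to a dimension index."""
    index = {}
    for feats in window_features:
        for feat in feats:
            index.setdefault(feat, len(index))
    return index

def binarize(feats, index):
    """Expand one token's features to a 0/1 vector over all dimensions."""
    vec = [0] * len(index)
    for feat in feats:
        if feat in index:
            vec[index[feat]] = 1
    return vec

# Features for two tokens: word/tag/chunk-tag values keyed by offset.
samples = [
    [("w,0", "saw"), ("t,-1", "NNP"), ("t,0", "VBD"), ("c,-1", "B-NP")],
    [("w,0", "dog"), ("t,-1", "DT"), ("t,0", "NN"), ("c,-1", "B-NP")],
]
index = build_index(samples)
print(len(index), binarize(samples[0], index))  # → 7 [1, 1, 1, 1, 0, 0, 0]
```

Each sample activates only a handful of dimensions, so the vectors are extremely sparse despite their size.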

22 Results
• Training the 231 classifiers took about one day (PC-Linux, Celeron 500 MHz, 512 MB)
• Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP
○ Precision = 93.45 %
○ Recall = 93.51 %
○ F(β=1) = 93.48 %

23 Training
• POS-tag based
• Construct symmetric n-contexts from the training corpus
○ 1-context: the most common chunk label for each tag: [t0]
○ 3-context: the tag plus the tags before and after it: [t-1, t0, t+1]
○ 5-context: [t-2, t-1, t0, t+1, t+2]
○ 7-context: [t-3, t-2, t-1, t0, t+1, t+2, t+3]

24 Training
• For each context, find the most frequent label
○ CC → [O CC]
○ PRP CC RP → [B-NP CC]
• To save storage space, an n-context is added only if it differs from its nearest lower-order context

25 Testing
• Construct the maximal context for each tag
• Look it up in the database of most likely patterns
• If the largest context is not found, the context is narrowed step by step
• The only chunk-labeling rule is to look up [t-3, t-2, t-1, t0, t+1, t+2, t+3], then successively smaller contexts, down to [t0], until a context is found
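The train-then-back-off procedure of slides 23-25 can be sketched end to end. This is a simplified version under stated assumptions: contexts are padded with None at sentence edges, and the storage optimization (keeping an n-context only when it differs from the lower-order one) is omitted.

```python
from collections import Counter, defaultdict

def contexts(tags, i, max_n):
    """Yield symmetric contexts around position i, widest first."""
    padded = [None] * max_n + list(tags) + [None] * max_n
    j = i + max_n
    for k in range(max_n, -1, -1):
        yield tuple(padded[j - k : j + k + 1])

def train(corpus, max_n=1):
    """Store the most frequent chunk label for every observed context."""
    counts = defaultdict(Counter)
    for tags, labels in corpus:
        for i, label in enumerate(labels):
            for ctx in contexts(tags, i, max_n):
                counts[ctx][label] += 1
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def predict(model, tags, i, max_n=1):
    """Look up the widest context first, backing off until one is found."""
    for ctx in contexts(tags, i, max_n):
        if ctx in model:
            return model[ctx]
    return "O"

corpus = [(["DT", "NN"], ["B-NP", "I-NP"]),
          (["PRP", "VBD"], ["B-NP", "B-VP"])]
model = train(corpus, max_n=1)
# The full 3-context (DT, NN, VBD) is unseen, so it backs off to (NN,):
print(predict(model, ["DT", "NN", "VBD"], 1))  # → I-NP
```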

26 Results
• The best results are achieved with the 5-context
• Chunk types: ADJP, ADVP, CONJP, INTJ, LST, NP, PP, PRT, SBAR, VP
○ Precision = 86.24 %
○ Recall = 88.25 %
○ F(β=1) = 87.23 %

27 • Maximum Entropy models are exponential models
• Collect as much information as possible: frequencies of events relevant to the process
• A MaxEnt model has the form P(w|h) = (1/Z(h)) · exp(Σ_i λ_i f_i(h, w))
○ f_i(h, w) is a binary-valued feature describing an event
○ λ_i describes how important f_i is
○ Z(h) is a normalization factor
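The model form can be evaluated directly for a toy label set, which shows how Z(h) normalizes the exponentiated feature scores into a probability distribution. The features, weights, and labels below are invented for illustration.

```python
import math

def maxent_prob(h, w, labels, features, weights):
    """P(w|h) = (1/Z(h)) * exp(sum_i lambda_i * f_i(h, w))."""
    def score(label):
        return math.exp(sum(wt * f(h, label) for wt, f in zip(weights, features)))
    z = sum(score(label) for label in labels)   # normalization factor Z(h)
    return score(w) / z

# Two binary-valued features over (history, label), as on the slide.
features = [lambda h, w: 1.0 if h == "DT" and w == "B-NP" else 0.0,
            lambda h, w: 1.0 if w == "O" else 0.0]
weights = [2.0, 0.5]                 # lambda_i: importance of each feature
labels = ["B-NP", "I-NP", "O"]
p = maxent_prob("DT", "B-NP", labels, features, weights)
print(round(p, 3))  # → 0.736
```

Summing P(w|h) over all labels gives exactly 1, since Z(h) is the sum of the same exponentiated scores.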

28 Attributes Used
• Information in the WSJ corpus
○ Current word and its POS tag
○ Surrounding words and their POS tags
• Context
○ Left context: 3 words
○ Right context: 2 words
• Additional information
○ Chunk tags of the previous 2 words

29 Results
• Tagging accuracy = 95.5 % (# of correctly tagged words / total # of words)
• Recall = 91.86 % (# of correct proposed base NPs / # of correct base NPs)
• Precision = 92.08 % (# of correct proposed base NPs / # of proposed base NPs)
• F(β=1) = 91.97 %, where F_β = (β² + 1) · Precision · Recall / (β² · Precision + Recall)
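The F-measure formula can be checked directly against the slide's reported precision and recall, a quick sanity test of the β-weighted form.

```python
def f_beta(p, r, beta=1.0):
    """F_beta = (beta^2 + 1) * P * R / (beta^2 * P + R);
    with beta = 1 this is the harmonic mean 2PR / (P + R)."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

p, r = 92.08, 91.86          # the slide's precision and recall
print(round(f_beta(p, r), 2))  # → 91.97, matching the slide
```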

30 • Context-based lexicon and HMM-based chunker
• Statistics were first used for chunking by Church (1988)
○ Corpus frequencies were used
○ Non-recursive noun phrases were identified
• Skut & Brants (1998) modified Church's approach and used a Viterbi tagger

31 • Error-driven HMM-based text chunker
• Memory is decreased by keeping only positive lexical entries
• HMM-based text chunker with a context-dependent lexicon
○ Given G_1^n = g_1, g_2, ..., g_n, find the optimal tag sequence T_1^n = t_1, t_2, ..., t_n that maximizes log P(T_1^n | G_1^n)
○ log P(T_1^n | G_1^n) = log P(T_1^n) + log [ P(T_1^n, G_1^n) / (P(T_1^n) · P(G_1^n)) ]
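Finding the tag sequence that maximizes P(T|G) is standard Viterbi decoding. A compact sketch follows; the three-state tagset and all transition/emission probabilities are invented for illustration, not taken from the chunker described above.

```python
import math

def viterbi(obs, states, start, trans, emit):
    """Return the state sequence maximizing P(states | obs) in log space."""
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for o in obs[1:]:
        row, ptr = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][o])
            ptr[s] = prev
        V.append(row)
        back.append(ptr)
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for ptr in reversed(back):       # follow back-pointers to recover the path
        path.append(ptr[path[-1]])
    return list(reversed(path))

states = ["B-NP", "I-NP", "B-VP"]
start = {"B-NP": 0.8, "I-NP": 0.1, "B-VP": 0.1}
trans = {"B-NP": {"B-NP": 0.1, "I-NP": 0.7, "B-VP": 0.2},
         "I-NP": {"B-NP": 0.2, "I-NP": 0.4, "B-VP": 0.4},
         "B-VP": {"B-NP": 0.5, "I-NP": 0.1, "B-VP": 0.4}}
emit = {"B-NP": {"DT": 0.7, "NN": 0.2, "VBD": 0.1},
        "I-NP": {"DT": 0.1, "NN": 0.8, "VBD": 0.1},
        "B-VP": {"DT": 0.1, "NN": 0.1, "VBD": 0.8}}
print(viterbi(["DT", "NN", "VBD"], states, start, trans, emit))
```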

32 • CoNLL-2000 data: used for training and testing
• Ratnaparkhi's maximum-entropy POS tagger
○ No change to its internal operation
○ The information supplied for training is increased

33 Shallow Parsing vs. POS Tagging
• Shallow parsing requires more of the surrounding POS/lexical syntactic environment
• Training configurations
○ Words: w1 w2 w3
○ POS tags: t1 t2 t3
○ Chunk types: c1 c2 c3
○ Suffixes or prefixes

34 • The amount of information is gradually increased
○ Word only (w1)
○ Tag only (t1)
○ Word, tag, chunk label (w1 t1 c1): the current chunk label is accessed through another model configured with words and tags (w1 t1)
• To deal with sparseness
○ t1, t2
○ c1
○ c2 (last two letters)
○ w1 (first two letters)

35-38 Results charts, one per configuration: word w1; tag t1; (w1 t1 c1); sparseness handling (figures omitted)

39 Overall Results

  Configuration           Precision   Recall    F(β=1)
  Word w1                 88.06 %     88.71 %   88.38 %
  Tag t1                  88.15 %     88.07 %   88.11 %
  (w1 t1 c1)              89.79 %     90.70 %   90.24 %
  Sparseness handling     91.65 %     92.23 %   91.94 %

40 Error Analysis
• Three groups of errors
• Difficult syntactic constructs
○ Punctuation
○ Distinguishing ditransitive from transitive VPs
○ Adjective vs. adverbial phrases
• Mistakes made by annotators in the training or testing data
○ Noise
○ POS errors
○ Odd annotation decisions
• Errors peculiar to the approach
○ The exponential distribution assigns non-zero probability to all events
○ The tagger may assign illegal chunk labels (e.g. I-NP when the word is not inside an NP)

41 Comments
• PPs are easy to identify
• ADJPs and ADVPs are hard to identify correctly (more syntactic information is required)
• Performance on NPs can be further improved
• Performance using w1 alone or t1 alone is almost the same; using both features improves performance

42 References
• [1] Philip Brooks, "A Simple Chunk Parser", May 8, 2003.
• [2] Igor Boehm, "Rule Based vs. Statistical Chunking of CoNLL Data Set".
• [3] Miles Osborne, "Shallow Parsing as POS Tagging".
• [4] Hans van Halteren, "Chunking with WPDV Models".
• [5] Taku Kudoh and Yuji Matsumoto, "Use of Support Vector Learning for Chunk Identification", in Proceedings of CoNLL-2000 and LLL-2000, pages 142-144, Portugal, 2000.
• [6] Christer Johansson, "A Context Sensitive Maximum Likelihood Approach to Chunking".
• [7] Rob Koeling, "Chunking with Maximum Entropy Models".
• [8] Jorn Veenstra and Antal van den Bosch, "Single-Classifier Memory-Based Phrase Chunking".
• [9] GuoDong Zhou, Jian Su, and TongGuan Tey, "Hybrid Text Chunking".

