Comparing Information Extraction Pattern Models Mark Stevenson and Mark A. Greenwood Natural Language Processing Group University of Sheffield, UK.

Information Extraction Patterns
A popular approach to Information Extraction uses lexico-syntactic patterns that match text and identify items of interest.
Several recent approaches have been based on extraction patterns derived from dependency parses.
Unsupervised approaches to learning extraction patterns extract all possible patterns and then try to identify the useful ones.

“Microsoft, forced to recruit after Adams unexpectedly resigned, last week hired Boor as interim replacement.”
[Figure: dependency tree of the sentence. Nodes: hire/V, Microsoft/N, Boor/N, resign/V, Adams/N, unexpectedly/R, force/V, recruit/N, last/J, week/N, replacement/N, an/DT, interim/J. Edge labels: nsubj, nobj, as, to, after, partmod, dep, det, amod. A second view keeps only hire/V, Microsoft/N, Boor/N, resign/V and Adams/N with the edge labels nsubj, nobj and nsubj.]

Predicate Argument Model
Pattern consists of a subject-verb-object tuple; Yangarber (2003); Stevenson and Greenwood (2005)
[Figure: example dependency tree. Nodes: hire/V, IBM/N, Smith/N, resign/V, Jones/N. Edge labels: nsubj, nobj, after, nsubj.]

Chain Model
Extraction patterns are chain-shaped paths in the dependency tree rooted at a verb; Sudo et al. (2001), Sudo et al. (2003)
[Figure: example dependency tree. Nodes: hire/V, IBM/N, Smith/N, resign/V, Jones/N. Edge labels: nsubj, nobj, after, nsubj.]

Linked Chain Model
Patterns are chains or any pair of chains sharing their root; Greenwood et al. (2005)
[Figure: example dependency tree. Nodes: hire/V, IBM/N, Smith/N, resign/V, Jones/N. Edge labels: nsubj, nobj, after, nsubj.]

Subtree Model
Patterns are any subtree of the dependency tree.
By its definition it contains all the patterns proposed by the previous two models; Sudo et al. (2003)
[Figure: example dependency tree. Nodes: hire/V, IBM/N, Smith/N, resign/V, Jones/N. Edge labels: nsubj, nobj, after, nsubj.]
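The pattern models defined above can be made concrete with a small enumeration. The following is a minimal sketch, not the authors' implementation. It assumes a particular attachment of the edges in the hire/resign example tree (the transcript lists only nodes and labels), records a missing subject or object as None, and assumes that the two chains of a linked chain must descend through different children of the shared verb. All function names are hypothetical.

```python
from itertools import combinations

# Toy dependency tree. The edge attachments are an assumption for illustration.
# Each node maps to a list of (relation, child) pairs.
TREE = {
    "hire/V": [("nsubj", "IBM/N"), ("nobj", "Smith/N"), ("after", "resign/V")],
    "resign/V": [("nsubj", "Jones/N")],
    "IBM/N": [],
    "Smith/N": [],
    "Jones/N": [],
}


def is_verb(node):
    return node.endswith("/V")


def paths_below(node):
    """Yield every downward path (tuple of nodes) starting at a child of node."""
    for _, child in TREE[node]:
        yield (child,)
        for tail in paths_below(child):
            yield (child,) + tail


def svo_patterns():
    """Subject-verb-object tuples; a missing argument is recorded as None."""
    out = []
    for verb in filter(is_verb, TREE):
        args = dict(TREE[verb])  # relation -> child
        if "nsubj" in args or "nobj" in args:
            out.append((args.get("nsubj"), verb, args.get("nobj")))
    return out


def chains():
    """Every path that starts at a verb and follows edges downwards."""
    return [(verb,) + p for verb in filter(is_verb, TREE) for p in paths_below(verb)]


def linked_chains():
    """Pairs of chains that share a verb root (assumed: via different children)."""
    out = []
    for verb in filter(is_verb, TREE):
        for a, b in combinations(list(paths_below(verb)), 2):
            if a[0] != b[0]:
                out.append((verb, a, b))
    return out


print(len(svo_patterns()), len(chains()), len(linked_chains()))  # 2 5 5
```

On this toy tree the chain model already yields more patterns than the SVO model, and one of the linked chains joins Smith/N to Jones/N through hire/V, a connection that no single chain or SVO tuple in this sketch contains.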

Comparing Models
The models identify different parts of a sentence.
–“Smith joined Acme Inc. as CEO”
–SVO model identifies the link between “Smith” and “Acme Inc.”
–Chain model identifies the link between “Acme Inc.” and “CEO”
–Linked chain and subtree models could identify both links
But there is a price to be paid
–Models generate different numbers of patterns for a given dependency tree
–More patterns probably require more memory and processing

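Model Complexity
Let T be a dependency tree consisting of N nodes, and let V be the set of verb nodes in T.
Let d(v) be the number of nodes in the subtree rooted at v, that is, v (a member of V) together with its descendants.
Growth in the number of patterns:
–SVO: linear
–Chains: linear, polynomial in worst case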
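The quantity d(v) can be used to count chains without enumerating them. In a tree there is exactly one downward path from a verb v to each of its descendants, so v is the root of d(v) - 1 chains. The sketch below follows from that observation and is not taken from the slides. The tree encoding and the function names are hypothetical. It contrasts a flat tree with a degenerate tree in which every node is a verb stacked in a single path, which illustrates how the count can grow quadratically in the worst case.

```python
def descendant_counts(children, root):
    """Return {node: d(node)}, where d counts the node itself plus all descendants."""
    d = {}

    def visit(node):
        d[node] = 1 + sum(visit(c) for c in children.get(node, []))
        return d[node]

    visit(root)
    return d


def chain_count(children, root, verbs):
    """Sum of d(v) - 1 over the verb nodes: one chain per (verb, descendant) pair."""
    d = descendant_counts(children, root)
    return sum(d[v] - 1 for v in verbs)


# Flat tree: a single verb with four dependants (5 nodes).
flat = {"v0": ["a", "b", "c", "d"]}
print(chain_count(flat, "v0", ["v0"]))  # 4

# Degenerate tree: five verbs stacked in one path (5 nodes).
path = {"v0": ["v1"], "v1": ["v2"], "v2": ["v3"], "v3": ["v4"]}
print(chain_count(path, "v0", ["v0", "v1", "v2", "v3", "v4"]))  # 10
```

Both trees have five nodes, but the stacked tree produces 4 + 3 + 2 + 1 = 10 chains, which is N(N - 1)/2 for N = 5.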

Let C(v) denote the set of child nodes of a verb v and c_i be the i-th child, so C(v) = {c_1, c_2, …, c_|C(v)|}.
The number of subtrees can be defined recursively.
Growth in the number of patterns:
–Linked chains: polynomial
–Subtrees: exponential
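A recursive subtree count can be sketched as follows. This uses one standard recursion for counting connected subtrees and is an assumption, not necessarily the definition shown on the slide: each child is either left out or contributes one of its own rooted subtrees, so the number of subtrees whose topmost node is n is the product over the children c of (1 + s(c)). The function names and tree encoding are hypothetical.

```python
from math import prod


def rooted_subtrees(children, node):
    """Number of connected subtrees whose topmost node is `node`."""
    # For a leaf the product is empty and equals 1 (the node on its own).
    return prod(1 + rooted_subtrees(children, c) for c in children.get(node, []))


def all_subtrees(children, root):
    """Total number of connected subtrees anywhere in the tree."""
    total = rooted_subtrees(children, root)
    for c in children.get(root, []):
        total += all_subtrees(children, c)
    return total


# Toy tree: hire -> IBM, Smith, resign; resign -> Jones (attachments assumed).
toy = {"hire": ["IBM", "Smith", "resign"], "resign": ["Jones"]}
print(all_subtrees(toy, "hire"))  # 17

# Star-shaped tree: one verb with ten dependants.
star = {"v": [f"w{i}" for i in range(10)]}
print(rooted_subtrees(star, "v"))  # 1024
```

In the star-shaped tree every subset of the ten dependants gives a different subtree rooted at the verb, hence 2^10 = 1024, which shows where the exponential growth comes from.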

Experiments
Aim to identify how well each pattern model captures the relations occurring in an IE corpus.
Extract patterns from a parsed corpus and, for each model, check whether some pattern contains the items participating in each relation.
Do NOT attempt to extract the relations, just determine whether they can be represented.

Corpora
Used corpora representing two extraction tasks
–Management succession: “Stevens succeeds Fred Casey who retired from the OCC in June”
–Various biomedical texts: “Expression of sigma(K)-dependent cwlH gene depended on gerE”

Parsers
1. MINIPAR
2. Machinese Syntax Parser
3. Stanford Parser
Parser            SVO     Chains   Linked Chains   Subtrees
Minipar           2,980   52,659   149,504         353,778,240,702,149,000
Machinese Syntax  2,382   67,690   265,631         4,641,825,924
Stanford          2,950   76,620   478,643         1,696,259,251,073

Evaluating Expressivity
Coverage: proportion of relations in the corpus for which there exists a pattern that includes both items participating in that relation.
Analysis showed that parsers often failed to generate a parse which included all words in the sentence.
For some relations it may therefore be impossible to generate a pattern which covers them.
No model can outperform the subtree model.
Bounded coverage: proportion of the relations in the corpus that can be represented (given a dependency parse) for which there exists a pattern that includes both participating items.
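The two measures can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the evaluation code used in the experiments: a relation is modelled as a pair of items, a pattern as the set of words it contains, and a relation counts as representable when some subtree pattern includes both items. The function names and the example data are hypothetical.

```python
def covers(patterns, relation):
    """True if some pattern contains both items of the relation."""
    a, b = relation
    return any(a in p and b in p for p in patterns)


def coverage(relations, patterns):
    """Fraction of all relations covered by at least one pattern."""
    if not relations:
        return 0.0
    return sum(covers(patterns, r) for r in relations) / len(relations)


def bounded_coverage(relations, patterns, subtree_patterns):
    """Coverage restricted to relations that the parse can represent at all."""
    representable = [r for r in relations if covers(subtree_patterns, r)]
    return coverage(representable, patterns)


# Hypothetical data: the third relation lies outside the parsed fragment.
relations = [("Smith", "Acme"), ("Acme", "CEO"), ("Jones", "IBM")]
svo = [{"Smith", "join", "Acme"}]
subtrees = [{"Smith", "join", "Acme", "CEO"}]

print(coverage(relations, svo))                    # one of three relations
print(bounded_coverage(relations, svo, subtrees))  # one of two representable relations
```

Under this definition the subtree model always has a bounded coverage of 100% whenever at least one relation is representable, so bounded coverage shows how close each simpler model comes to that ceiling.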

Management Succession Results
                  Coverage (%)                              Bounded Coverage (%)
Parser            SVO    Chains  Linked Chains  Subtrees    SVO    Chains  Linked Chains
MINIPAR           7      41      82             83          9      50      99
Machinese Syntax  2      36      76             77          3      46      99
Stanford          15     41      95             99.7        15     41      93
SVO and chains do not cover many of the relations.
Subtree and linked chain models have roughly the same coverage.

Biomedical Results
                  Coverage (%)                              Bounded Coverage (%)
Parser            SVO    Chains  Linked Chains  Subtrees    SVO    Chains  Linked Chains
MINIPAR           0.93   41      65             71          1      50      92
Machinese Syntax  0.19   36      65             71          0.27   20      92
Stanford          0.46   17      89             95          0.49   17      93
More difference between linked chains and simpler models on biomedical text.
SVO and chains consistently perform badly, linked chains do well.

Bounded coverage results for all models are lower on the biomedical corpora
–Parsers are not generally well adapted to deal with these sorts of text; more parsing errors?
–Nominalisations appear more common in these texts, e.g. “the DNA-dependent assembly of regulon into rings”
[Figure: dependency tree of the phrase. Nodes: assembly/N, dependent/A, regulon/N, rings/N, DNA/N.]

Results Summary
Average coverage for each pattern model over all texts
No statistically significant difference between (1) SVO and chains or (2) linked chains and subtrees

Summary
Comparison of four models for Information Extraction patterns based on dependency trees
–Trade-off between pattern complexity and tractability
Linked chain model performs well
–But may have problems with certain linguistic constructions (such as nominalisations)
