Recognizing Textual Entailment with LCC’s Groundhog System
Andrew Hickl, Jeremy Bensley, John Williams, Kirk Roberts, Bryan Rink and Ying Shi
Introduction
We were grateful for the opportunity to participate in this year’s PASCAL RTE-2 Challenge
–First exposure to RTE came as part of the Fall 2005 AQUAINT “Knowledge Base” Evaluation, which included PASCAL veterans: University of Colorado at Boulder, University of Illinois at Urbana-Champaign, Stanford University, University of Texas at Dallas, and LCC (Moldovan)
While this year’s evaluation represented our first foray into RTE, our group has worked extensively on the types of textual inference that are crucial for:
–Question-Answering
–Information Extraction
–Multi-Document Summarization
–Named Entity Recognition
–Temporal and Spatial Normalization
–Semantic Parsing
Outline of Today’s Talk
Introduction
Groundhog Overview
–Preprocessing
–New Sources of Training Data
–Performing Lexical Alignment
–Paraphrase Acquisition
–Feature Extraction
–Entailment Classification
Evaluation
Conclusions
2 Feb: Giorno della Marmotta (Groundhog Day), the RTE-2 deadline
Architecture of the Groundhog System
Preprocessing (Named Entity Recognition, Name Aliasing, Name Coreference, Syntactic Parsing, Semantic Parsing, Semantic Annotation, Temporal Normalization, Temporal Ordering) → Lexical Alignment → Paraphrase Acquisition (WWW) → Feature Extraction → Entailment Classification → YES / NO
Inputs: RTE Dev, RTE Test; Training Corpora: Positive Examples, Negative Examples
A Motivating Example
Questions we need to answer:
–What are the “important” portions that should be considered by a system? Can lexical alignment be used to identify these strings?
–How do we determine that the same meaning is being conveyed by phrases that may not necessarily be lexically related? Can phrase-level alternations (“paraphrases”) help?
–How do we deal with the complexity that reduces the effectiveness of syntactic and semantic parsers? Annotations? Rules? Compression?
Example 139 (Task=SUM, Judgment=YES, LCC=YES, Conf = +0.8875)
Text: The Bills now appear ready to hand the reins over to one of their two top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Hypothesis: The Bills plan to give the starting job to J.P. Losman.
[The Bills]Arg0 now appear ready to hand [the reins]Arg1 over to [one]Arg2 of their two top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Preprocessing
Groundhog starts the process of RTE by annotating t-h pairs with a wide range of lexicosemantic information:
Named Entity Recognition
–LCC’s CiceroLite NER software is used to categorize more than 150 different types of named entities:
[The Bills]SPORTS_ORG plan to give the starting job to [J.P. Losman]PERSON.
[The Bills]SPORTS_ORG [now]TIMEX appear ready to hand the reins over to one of their two top picks from [a year ago]TIMEX in quarterback [J.P. Losman]PERSON, who missed most of [last season]TIMEX with [a broken leg]BODY_PART.
Name Aliasing and Coreference
–Lexica and grammars found in CiceroLite are used to identify coreferential names and to identify potential antecedents for pronouns
[The Bills]ID=01 plan to give the starting job to [J.P. Losman]ID=02.
[The Bills]ID=01 now appear ready to hand the reins over to [one of [their]ID=01 two top picks]ID=02 from a year ago in [quarterback]ID=02 [J.P. Losman]ID=02, [who]ID=02 missed most of last season with a broken leg.
Preprocessing
Temporal Normalization and Ordering
–Heuristics found in LCC’s TASER temporal normalization system are then used to normalize time expressions to their ISO 8601 values and to compute the relative order of time expressions within a context
POS Tagging and Syntactic Parsing
–We use LCC’s own implementation of the Brill POS tagger and the Collins Parser in order to syntactically parse sentences and to identify phrase chunks, phrase heads, relative clauses, appositives, and parentheticals.
The Bills plan to give the starting job to J.P. Losman.
The Bills [now]2006/01/01 appear ready to hand the reins over to one of their two top picks from [a year ago]2005/01/01-2005/12/31 in quarterback J.P. Losman, who missed most of [last season]2005/01/01-2005/12/31 with a broken leg.
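The normalization step above can be sketched with a few toy rules. This is not LCC’s TASER system, just an illustration, under assumed rules and coverage, of mapping relative time expressions to ISO 8601 values or ranges given a document date:

```python
from datetime import date

def normalize_timex(expression, document_date):
    """Toy normalization of a few relative time expressions to ISO 8601
    values/ranges, relative to the document's publication date.
    TASER covers far more; these rules are illustrative only."""
    year = document_date.year
    expr = expression.lower()
    if expr == "now":
        return document_date.isoformat()
    if expr in ("a year ago", "last year", "last season"):
        # Resolve to the whole previous calendar year as a range
        return f"{year - 1}-01-01/{year - 1}-12-31"
    return None  # unresolved by this sketch
```

Relative to a 2006-01-01 dateline, "a year ago" resolves to the range 2005-01-01/2005-12-31, matching the annotated example above.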
Preprocessing
Semantic Parsing
–Semantic parsing is performed using a Maximum Entropy-based semantic role labeling system trained on PropBank annotations
The Bills now appear ready to hand the reins over to one of their two top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Predicate chain: appear → ready → hand
hand: Arg0 = The Bills, ArgM = now, Arg1 = the reins, Arg2 = one of their two top picks
missed: Arg0 = who, Arg1 = most of last season, Arg3 = a broken leg
Preprocessing
Semantic Parsing
–Semantic parsing is performed using a Maximum Entropy-based semantic role labeling system trained on PropBank annotations
The Bills plan to give the starting job to J.P. Losman.
Predicate chain: plan → give
give: Arg0 = The Bills, Arg1 = the starting job, Arg2 = J.P. Losman
Preprocessing
Semantic Annotation
Heuristics were used to annotate the following semantic information:
–Polarity: Predicates and nominals were assigned a negative polarity value when found in the scope of an overt negative marker (no, not, never) or when associated with a negation-denoting verb (refuse).
Both owners and players admit there is [unlikely]TRUE to be much negotiating.
Never before had ski racing [seen]FALSE the likes of Alberto Tomba.
Members of Iraq's Governing Council refused to [sign]FALSE an interim constitution.
–Factive Verbs: Predicates such as acknowledge, admit, and regret conventionally imply the truth of their complement; complements associated with a list of factive verbs were always assigned a positive polarity value.
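The polarity and factivity heuristics above can be sketched as follows. The verb lists and the left-context notion of "scope" are illustrative assumptions, not LCC’s actual lexica or scoping rules:

```python
# Minimal sketch of the polarity/factivity heuristics; verb lists
# here are illustrative assumptions, not LCC's lexica.
NEGATIVE_MARKERS = {"no", "not", "never", "n't"}
NEGATION_VERBS = {"refuse", "deny", "fail"}         # negation-denoting verbs
FACTIVE_VERBS = {"acknowledge", "admit", "regret"}  # imply truth of complement

def predicate_polarity(predicate_index, tokens, governing_verb=None):
    """Assign a TRUE/FALSE polarity value to the predicate at
    tokens[predicate_index], given the lemma of its governing verb."""
    # Complements of factive verbs are always assigned positive polarity
    if governing_verb in FACTIVE_VERBS:
        return "TRUE"
    # Complements of negation-denoting verbs get negative polarity
    if governing_verb in NEGATION_VERBS:
        return "FALSE"
    # Overt negative marker in the predicate's scope (left context here)
    if any(t.lower() in NEGATIVE_MARKERS for t in tokens[:predicate_index]):
        return "FALSE"
    return "TRUE"
```

For instance, "refused to sign" yields FALSE for sign, while "Never before had ski racing seen..." yields FALSE for seen.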
Preprocessing
Semantic Annotation (Continued)
–Non-Factive Verbs: We refer to predicates that do not imply the truth of their complements as non-factive verbs. Predicates found as complements of the following contexts were marked as unresolved:
non-factive speech act verbs (deny, claim)
psych verbs (think, believe)
verbs of uncertainty or likelihood (be uncertain, be likely)
verbs marking intentions or plans (scheme, plot, want)
verbs in conditional contexts (whether, if)
Congress approved a different version of the COCOPA Law, which did not include the autonomy clauses, claiming they [were in contradiction]UNRESOLVED with constitutional rights.
Defence Minister Robert Hill says a decision would need to be made by February next year, if Australian troops [extend]UNRESOLVED their stay in southern Iraq.
Preprocessing
Semantic Annotation (Continued)
–Supplemental Expressions: Following (Huddleston and Pullum 2002), constructions that are known to trigger conventional implicatures – including nominal appositives, epithets/name aliases, as-clauses, and non-restrictive relative clauses – were also extracted from text and appended to the end of each text or hypothesis.
Nominal Appositives: Shia pilgrims converge on Karbala to mark the death of Hussein, the prophet Muhammad’s grandson, 1300 years ago. → Shia pilgrims converge on Karbala to mark the death of Hussein 1300 years ago AND Hussein is the prophet Muhammad’s grandson.
Epithets / Name Aliases: Ali al-Timimi had previously avoided prosecution, but now the radical Islamic cleric is behind bars in an American prison. → Ali al-Timimi had previously avoided prosecution but now the radical Islamic cleric is behind bars... AND Ali al-Timimi is a radical Islamic cleric.
Preprocessing
Semantic Annotation (Continued)
–Supplemental Expressions: Following (Huddleston and Pullum 2002), constructions that are known to trigger conventional implicatures – including nominal appositives, epithets/name aliases, as-clauses, and non-restrictive relative clauses – were also extracted from text and appended to the end of each text or hypothesis.
As-Clauses: The LMI was set up by Mr. Torvalds with John Hall as a non-profit organization to license the use of the word Linux. → The LMI was set up by Mr. Torvalds with John Hall as a non-profit organization to license the use of the word Linux AND the LMI is a non-profit organization.
Non-Restrictive Relative Clauses: The Bills now appear ready to... quarterback J.P. Losman, who missed most of last season with a broken leg. → The Bills now appear ready to... quarterback J.P. Losman, who missed most of last season with a broken leg AND J.P. Losman missed most of last season...
Lexical Alignment
We believe that these lexicosemantic annotations – along with the individual forms of the words – can provide us with the input needed to identify corresponding tokens, chunks, or collocations from the text and the hypothesis.
The Bills (Arg0, ID=01, Organization) ↔ The Bills: alignment probability 0.94
J.P. Losman (ID=02, Person) ↔ J.P. Losman: alignment probability 0.91
hand ↔ give (unresolved, WN similar): alignment probability 0.74
the reins (Arg1) ↔ the starting job: alignment probability 0.49
Lexical Alignment
In Groundhog, we used a Maximum Entropy classifier to compute the probability that an element selected from a text corresponds to – or can be aligned with – an element selected from a hypothesis.
Three-step process:
–First, sentences were decomposed into a set of “alignable chunks” that were derived from the output of a chunk parser and a collocation detection system.
–Next, chunks from the text (C_t) and hypothesis (C_h) were assembled into an alignment matrix (C_t × C_h).
–Finally, each pair of chunks was then submitted to a classifier, which output the probability that the pair represented a positive example of alignment.
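The three steps above can be sketched as follows, with the trained MaxEnt model abstracted as any callable that returns P(aligned | pair); the threshold value is an assumption for illustration:

```python
from itertools import product

def alignment_matrix(text_chunks, hyp_chunks):
    """Step 2: assemble every (C_t, C_h) chunk pair."""
    return list(product(text_chunks, hyp_chunks))

def align(text_chunks, hyp_chunks, classifier, threshold=0.5):
    """Step 3: score each pair with a classifier returning
    P(aligned | pair) and keep those scoring above threshold."""
    aligned = []
    for ct, ch in alignment_matrix(text_chunks, hyp_chunks):
        p = classifier(ct, ch)
        if p >= threshold:
            aligned.append((ct, ch, p))
    return aligned
```

With a toy classifier that scores case-insensitive string matches highly, only "The Bills" ↔ "the bills" survives from the pairs of ["The Bills", "the reins"] and ["the bills", "the starting job"].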
Lexical Alignment
Four sets of features were used:
–Statistical Features: Cosine Similarity; Glickman and Dagan (2005)’s Lexical Entailment Probability
–Lexicosemantic Features: WordNet Similarity (Pedersen et al. 2004); WordNet Synonymy/Antonymy; Named Entity Features; Alternations
–String-based Features: Levenshtein Edit Distance; Morphological Stem Equality
–Syntactic Features: Maximal Category Headedness; Structure of entity NPs (modifiers, PP attachment, NP-NP compounds)
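Two of these features are standard enough to sketch directly: bag-of-words cosine similarity (a statistical feature) and Levenshtein edit distance (a string-based feature). The tokenization here is a simplifying assumption:

```python
import math
from collections import Counter

def cosine_similarity(chunk_a, chunk_b):
    """Bag-of-words cosine similarity between two chunks."""
    a, b = Counter(chunk_a.lower().split()), Counter(chunk_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def levenshtein(s, t):
    """Edit distance between two strings, via the standard DP
    that keeps only the previous row of the distance matrix."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]
```

Both yield a single real-valued feature per chunk pair that the MaxEnt classifier can weigh alongside the lexicosemantic and syntactic features.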
Training the Alignment Classifier
Two developers annotated a held-out set of 10,000 alignment chunk pairs from the RTE-2 Development Set as either positive or negative examples of alignment. Performance for two different classifiers on a randomly selected set of 1000 examples from the RTE-2 Dev Set is presented below:

Classifier        Training Set  Precision  Recall  F1
Hillclimber       10K pairs     0.837      0.774   0.804
Maximum Entropy   10K pairs     0.881      0.851   0.866

While both classifiers performed relatively satisfactorily, F-measure varied significantly (p < 0.05) on different test sets.
Creating New Sources of Training Data
In order to perform more robust alignment, we experimented with two techniques for gathering training data:
Positive Examples:
–Following (Burger and Ferro 2005), we created a corpus of 101,329 positive examples of entailment by pairing the headline and first sentence from newswire documents.
First Line: Sydney newspapers made a secret bid not to report on the fawning and spending made during the city’s successful bid for the 2000 Olympics, former Olympics Minister Bruce Baird said today.
Headline: Papers Said To Protect Sydney Bid
–Examples were filtered extensively in order to select only those where the headline and the first line both synopsized the content of a document
–In an evaluation set of 2500 examples, annotators found 91.8% to be positive examples of “rough” entailment
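The headline-pairing idea can be sketched as below. The word-overlap filter is a stand-in assumption for the much more extensive filtering described above, and the field names are hypothetical:

```python
def harvest_positive_pairs(documents, min_overlap=0.5):
    """Pair each headline (hypothesis) with the document's first
    sentence (text); keep pairs whose content-word overlap is high,
    as a crude stand-in for the paper's extensive filtering."""
    pairs = []
    for doc in documents:
        text, hyp = doc["first_sentence"], doc["headline"]
        hyp_words = {w.lower() for w in hyp.split()}
        text_words = {w.lower() for w in text.split()}
        if hyp_words and len(hyp_words & text_words) / len(hyp_words) >= min_overlap:
            pairs.append((text, hyp, "YES"))
    return pairs
```

Each surviving (first sentence, headline) pair is treated as a rough positive example of entailment for training the alignment classifier.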
Creating New Sources of Training Data
Negative Examples:
–We gathered 119,113 negative examples of textual entailment by:
Selecting sequential sentences from newswire texts that featured a repeat mention of a named entity (98,062 examples)
Text: One player losing a close friend is Japanese pitcher Hideki Irabu, who was befriended by Wells during spring training last year.
Hypothesis: Irabu said he would take Wells out to dinner when the Yankees visit Toronto.
Extracting pairs of sentences linked by discourse connectives such as even though, although, otherwise, and in contrast (21,051 examples)
Text: According to the professor, present methods of cleaning up oil slicks are extremely costly and are never completely efficient.
Hypothesis: [In contrast], Clean Mag has a 1000 percent pollution retrieval rate, is low cost, and can be recycled.
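Both negative-example heuristics can be sketched in one pass over adjacent sentences; the entity extractor is passed in as a callable, and the connective list mirrors the examples above:

```python
CONTRAST_CONNECTIVES = ("even though", "although", "otherwise", "in contrast")

def harvest_negative_pairs(sentences, entities_of):
    """Pair adjacent newswire sentences that repeat a named entity,
    or whose second sentence opens with a contrastive discourse
    connective; treat both as rough negative examples of entailment."""
    pairs = []
    for s1, s2 in zip(sentences, sentences[1:]):
        shares_entity = bool(entities_of(s1) & entities_of(s2))
        has_connective = any(s2.lower().startswith(c) for c in CONTRAST_CONNECTIVES)
        if shares_entity or has_connective:
            pairs.append((s1, s2, "NO"))
    return pairs
```

In practice `entities_of` would be CiceroLite's NER output; the test below substitutes a toy capitalization heuristic.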
Training the Alignment Classifier
For performance reasons, the hillclimber trained on the 10K human-annotated pairs was used to annotate a selection of 450K chunk pairs selected equally from these two corpora. These annotations were then used to train a final MaxEnt classifier that was used in our final submission. Comparison of the three alignment classifiers is presented below for the same evaluation set of 1000 examples:

Classifier        Training Set  Precision  Recall  F1
Hillclimber       10K pairs     0.837      0.774   0.804
Maximum Entropy   10K pairs     0.881      0.851   0.866
Maximum Entropy   450K pairs    0.902      0.944   0.922
Paraphrase Acquisition
Groundhog uses techniques derived from automatic paraphrase acquisition (Dolan et al. 2004, Barzilay and Lee 2003, Shinyama et al. 2002) in order to identify phrase-level alternations for each t-h pair. Output from an alignment classifier can be used to determine a “target region” of high correspondence between a text and a hypothesis:
Text: The Bills now appear ready to hand the reins over to one of their two top picks from a year ago in quarterback J.P. Losman, who missed most of last season with a broken leg.
Hypothesis: The Bills plan to give the starting job to J.P. Losman.
If paraphrases can be found for the “target regions” of both the text and the hypothesis, we may have strong evidence that the two sentences exist in an entailment relationship.
Paraphrase Acquisition
For example, if a passage (or set of passages) can be found that are paraphrases of both a text and a hypothesis, those paraphrases can be said to encode the meaning that is common between the t and the h.
The Bills … J.P. Losman:
… appear ready to hand the reins over to …
… plan to give the starting job to …
… may go with quarterback …
… could decide to put their trust in …
… might turn the keys of the offense over to …
However, not all sentences containing both aligned entities will be true paraphrases:
… benched Bledsoe in favor of …
… is molding their QB of the future …
… are thinking about cutting …
Paraphrase Acquisition
Like Barzilay and Lee (2003), our approach focuses on creating clusters of potential paraphrases acquired automatically from the WWW.
–Step 1. The two entities with the highest alignment confidence from each t-h pair were selected from each example.
–Step 2. Text passages containing both aligned entities (and a context window of m words) were extracted from each original t and h.
–Step 3. The top 500 documents containing each pair of aligned entities are retrieved from Google; only the sentences that contain both entities are kept.
–Step 4. Text passages containing the aligned entities are extracted from the sentences collected from the WWW.
–Step 5. WWW passages and original t-h passages are then clustered using the complete-link clustering algorithm outlined in Barzilay and Lee (2003); clusters with fewer than 10 passages are discarded, even if they include the original t-h passage.
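Step 5 can be sketched as a greedy complete-link agglomerative clusterer; this is a simplification of Barzilay and Lee (2003), with the similarity function and threshold left as parameters:

```python
def complete_link_clusters(passages, similarity, threshold, min_size=10):
    """Greedy complete-link agglomerative clustering: merge two clusters
    only if their *least* similar cross-pair is above threshold; then
    discard clusters smaller than min_size (Step 5's filter)."""
    clusters = [[p] for p in passages]
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                link = min(similarity(a, b)
                           for a in clusters[i] for b in clusters[j])
                if link >= threshold:
                    clusters[i].extend(clusters.pop(j))
                    merged = True
                    break
            if merged:
                break
    return [c for c in clusters if len(c) >= min_size]
```

Complete-link (rather than single-link) merging keeps clusters tight: every passage in a surviving cluster must be similar to every other, which suits the "true paraphrases only" goal.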
Entailment Classification As with other approximation-based approaches to RTE (Haghighi et al. 2005, MacCartney et al. 2006), we use a supervised machine learning classifier in order to determine whether an entailment relationship exists for a particular t-h pair. –Experimented with a number of machine learning techniques: Support Vector Machines (SVMs) Maximum Entropy Decision Trees –February 2006: Decision Trees outperformed MaxEnt, SVMs –April 2006: MaxEnt comparable to Decision Trees, SVMs still lag behind
Entailment Classification
Information from the previous three components is used to extract 4 types of features to inform this entailment classifier. Selected examples of features used:
–Alignment Features:
Longest Common Substring: Longest contiguous string common to both t and h
Unaligned Chunk: Number of chunks in h not aligned with chunks in t
–Dependency Features:
Entity Role Match: Aligned entities assigned same role
Entity Near Role Match: Collapsed semantic roles commonly confused by semantic parser (e.g. Arg1, Arg2 >> Arg1&2; ArgM, etc.)
Predicate Role Match: Roles assigned by aligned predicates
Predicate Role Near Match: Compared collapsed set of roles assigned by aligned predicates
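The Longest Common Substring alignment feature is standard dynamic programming and can be sketched directly (computed over characters here; a token-level variant works the same way):

```python
def longest_common_substring(t, h):
    """Alignment feature: length of the longest contiguous string
    shared by text t and hypothesis h."""
    best = 0
    # prev[j] = length of the common suffix ending at t[i-1], h[j-1]
    prev = [0] * (len(h) + 1)
    for i in range(1, len(t) + 1):
        cur = [0] * (len(h) + 1)
        for j in range(1, len(h) + 1):
            if t[i - 1] == h[j - 1]:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

For the running example, "the starting job" and "starting job to" share the contiguous substring "starting job", so the feature value is 12.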
Entailment Classification
Classifier Features (Continued)
–Paraphrase Features:
Single Paraphrase Match: Paraphrase from a surviving cluster matches either the text or the hypothesis
–Did we select the correct entities at alignment?
–Are we dealing with something that can be expressed in multiple ways?
Both Unique Paraphrase Match: Paraphrase P1 matches t, while paraphrase P2 matches h; P1 ≠ P2
Category Match: Paraphrase P1 matches t, while paraphrase P2 matches h; P1 and P2 found in same surviving cluster of paraphrases.
–Semantic Features:
Truth-Value Mismatch: Aligned predicates differ in any truth value (true, false, unresolved)
Polarity Mismatch: Aligned predicates assigned truth values of opposite polarity
Entailment Classification
Alignment Features: What elements align in the t or h?
–The Bills ↔ The Bills: good alignment (0.94)
–J.P. Losman ↔ J.P. Losman: good alignment (0.91)
–hand ↔ give: passable alignment (0.79)
–the reins ↔ the starting job: marginal alignment (0.49)
Dependency Features: Are the same dependencies assigned to corresponding entities in the t and h? (Arg0 ↔ Arg0; Arg1/Arg2 roles assigned by hand and give)
Paraphrase Features: Were any paraphrases found that could be paraphrases of portions of the t and the h?
… have gone with quarterback …
… has turned the keys of the offense over to …
Semantic Features: Were predicates assigned the same truth values? (hand and give: both unresolved)
→ Likely Entailment!
Another Example
Not all examples, however, include as many complementary features as Example 139:
Example 734 (Task=IR, Judgment=NO, LCC=NO, Conf = -0.8344)
Text: In spite of that, the government’s “economic development first” priority did not initially recognize the need for preventative measures to halt pollution, which may have slowed economic growth.
Hypothesis: The government took measures to reduce pollution.
Example 734
Even though this pair has a number of points of alignment, annotations suggest that there are significant discrepancies between the sentences:
–the government’s priority ↔ the government: partial alignment (non-head), Arg role match, NE category mismatch: passable (0.39)
–the need for preventative measures ↔ measures: partial alignment (non-head), Arg role match: passable (0.41)
–halt ↔ reduce: degree / POS match: good (0.84)
–pollution ↔ pollution: lemma match, Arg role match: good (0.93)
–did not recognize ↔ took: POS alignment, polarity mismatch, non-synonymous: poor (0.23)
In addition, few “paraphrases” could be found that clustered with passages extracted from either the t or the h:
the govt’s priority … the government:
… not recognize need for measures to halt …
… took measures to reduce … pollution
… has allowed companies to get away with …
… is looking for ways to deal with …
… wants to forget about …
→ Unlikely Entailment!
Evaluation: 2006 RTE Performance
Groundhog correctly recognized entailment in 75.38% of examples in this year’s RTE-2 Test Set:

Task      Accuracy  Average Precision
QA-Test   69.5%     0.8237
IE-Test   73.0%     0.8351
IR-Test   74.5%     0.7774
SUM-Test  84.5%     0.8343
Total     75.38%    0.8082

Performance differed markedly across the 4 subtasks: while the system netted 84.5% of the examples in the summarization set, Groundhog correctly categorized only 69.5% of the examples in the question-answering set. This has something to do with our training data:
–The headline corpus features a large number of “sentence compression”-like examples; when Groundhog is trained on a balanced training corpus, performance on the SUM task falls to 79.3%.
Evaluation: Role of Training Data
Training data did play an important role in boosting our overall accuracy on the 2006 Test Set: performance increased from 65.25% to 75.38% when the entire training corpus was used. Refactoring features has allowed us to obtain some performance gains with smaller training sets, however: our performance when using only the 800 examples from the 2006 Dev Set has increased by 5.25%.

Training Set  # of Examples  Feb 2006 Accuracy  Change    April 2006 Accuracy  Change
2006 Dev      800            65.25%             n/a       70.50%               n/a
“25% LCC”     50,600         67.00%             +1.75%    73.75%               +3.25%
“50% LCC”     101,300        72.25%             +7.00%    74.625%              +4.125%
“75% LCC”     151,000        74.38%             +9.13%    76.00%               +5.5%
“100% LCC”    202,600        75.38%             +10.13%   76.25%               +5.75%

The performance increase appears to be tapering off as the amount of training data increases.
Evaluation: Role of Features in Entailment Classifier
While the best results were obtained by combining all 4 sets of features used in our entailment classifier, the largest gains were observed by adding Paraphrase features.
[Bar chart: accuracy for the individual feature sets (Alignment, Dependency, Paraphrase, Semantic) and their combinations, ranging from 58.00% for the weakest single feature set to 75.38% with all four combined.]
Conclusions
We have introduced a three-tiered approach for RTE:
–Alignment Classifier: Identifies “aligned” constituents using a wide range of lexicosemantic features
–Paraphrase Acquisition: Derives phrase-level alternations for passages containing high-confidence aligned entities
–Entailment Classifier: Combines lexical, semantic, and syntactic information with phrase-level alternation information in order to make an entailment decision
In addition, we showed that it is possible, by relaxing the notion of strict entailment, to create training corpora that can prove effective in training systems for RTE
–200K+ examples (100K positive, 100K negative)