Third Recognizing Textual Entailment Challenge: Potential SNeRG Submission
RTE3 Quick Notes
- RTE web site: http://www.pascal-network.org/Challenges/RTE3/
- Textual Entailment resource pool: http://aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool
- A new development set was released last week to correct errors
- Test set released on March 5th
- !!! New !!! submission date: March 12th
- Report deadline: March 26th
Development set examples
- Example of a YES result:
  – Text: A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.
  – Hypothesis: 30 die in a bus collision in Uganda.
- Example of a NO result:
  – Text: Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.
  – Hypothesis: Blue Mountain Lumber owns Ernslaw One.
Development set examples – cont.
- 4 different types of entailment tasks:
  – Information Retrieval (IR)
  – Question Answering (QA)
  – Information Extraction (IE)
  – Multi-document summarization (SUM)
- The development set consists of 200 pairs of each task type
- 400 pairs evaluate to "YES" and 400 to "NO"
- Another attribute in the development set, "length", has only 134 long samples and 666 short. [Note to self: gather a group of demon hunters to hunt down the short samples; will need volunteers and holy water.]
Evaluation
- Two submissions per team can be made
- Program output is a file with the following format:
  – Line 1: "ranked: yes/no"
  – Lines 2..end: "pair_id judgment"
- For example:
  ranked: yes
  4 YES
  3 YES
  6 YES
  1 NO
  5 NO
  2 NO
- Accuracy is the fraction of pairs judged correctly
- Precision (average precision) is determined by both the order and the correctness of the answers:
  precision = (1/R) * sum for i=1 to n of E(i) * (#correct-up-to-pair-i / i)
  where n is the number of pairs in the test set, R is the total number of positive pairs in the test set, E(i) is 1 if the i-th pair is positive and 0 otherwise, and i ranges over the pairs ordered by their ranking.
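As a sanity check on the formula, here is a minimal sketch of the ranked-precision measure (the function name and input representation are my own, not from the task definition):

```python
def average_precision(ranked_gold):
    """Average precision over a ranked list of pairs.

    ranked_gold: list of booleans, one per pair in ranked order,
                 True if the pair's gold answer is YES (positive).
    Implements (1/R) * sum_i E(i) * (#correct-up-to-pair-i / i).
    """
    R = sum(ranked_gold)          # total number of positive pairs
    if R == 0:
        return 0.0
    total = 0.0
    correct = 0                   # positives seen up to pair i
    for i, positive in enumerate(ranked_gold, start=1):
        if positive:              # E(i) = 1
            correct += 1
            total += correct / i
    return total / R

# A perfect ranking (all positives first) scores 1.0:
print(average_precision([True, True, False, False]))  # 1.0
```

Note that, unlike accuracy, this measure rewards placing confident YES judgments at the top of the ranking.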
Possible Implementation
- Discover features that can be measured as continuous variables
- For example:
  – Wordbag match ratio = # of words matched between text and hypothesis / # of words in the hypothesis
- Arrange the feature values in a feature vector x
- Apply the general multivariate normal density to the assembled feature vector x
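A minimal sketch of that pipeline using numpy, fitting one Gaussian per class and judging by the higher density; the toy feature values are invented for illustration, and equal class priors are assumed:

```python
import numpy as np

def mvn_density(x, mu, cov):
    """General multivariate normal density N(x; mu, cov)."""
    d = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff) / norm)

def fit_gaussian(samples):
    """Estimate mean vector and covariance matrix from rows of feature vectors."""
    X = np.asarray(samples, dtype=float)
    return X.mean(axis=0), np.cov(X, rowvar=False)

# Toy feature vectors (e.g. wordbag ratio, some second feature), invented for illustration:
yes_train = [[0.9, 0.2], [0.8, 0.3], [0.85, 0.25], [0.7, 0.35]]
no_train  = [[0.3, 0.8], [0.2, 0.7], [0.25, 0.9], [0.4, 0.75]]
mu_y, cov_y = fit_gaussian(yes_train)
mu_n, cov_n = fit_gaussian(no_train)

def judge(x):
    """YES if the YES-class density is higher (equal priors assumed)."""
    x = np.asarray(x, dtype=float)
    return "YES" if mvn_density(x, mu_y, cov_y) > mvn_density(x, mu_n, cov_n) else "NO"
```

With more features we would simply extend the vectors; the density formula is unchanged.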
Implementation to Determine Baseline
- I implemented a baseline to estimate what we can expect from a full implementation of all syntactic features
- First baseline result:
  – Used 1 feature: wordbag count > n, where n is chosen after the development set is processed
  – Success: 509, Fail: 290
  – Final rate: 63.9%
- Second baseline result:
  – Used simple preprocessing plus the wordbag count: removing punctuation, case insensitivity, ignoring simple (stop) words
  – Success: 534, Fail: 265
  – Final rate: 66.8%
- Attempted a little semantic processing, such as increasing the weight of "negative" words to return negative results, but accuracy did not improve
- In the RTE2 competition the highest accuracy was only 70%!
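The second baseline could be sketched roughly like this; the stop-word list and threshold below are placeholders, not the values actually tuned on the development set:

```python
import string

# Placeholder stop-word list (the real baseline's list is not recorded here):
STOP_WORDS = {"a", "an", "the", "of", "in", "is", "has", "and", "with", "at"}

def preprocess(sentence):
    """Lowercase, strip punctuation, and drop simple (stop) words."""
    table = str.maketrans("", "", string.punctuation)
    words = sentence.lower().translate(table).split()
    return [w for w in words if w not in STOP_WORDS]

def wordbag_count(text, hypothesis):
    """Number of preprocessed hypothesis words that also occur in the text."""
    text_words = set(preprocess(text))
    return sum(1 for w in preprocess(hypothesis) if w in text_words)

def judge(text, hypothesis, n=3):
    """YES if the wordbag count exceeds the threshold n (n tuned on the dev set)."""
    return "YES" if wordbag_count(text, hypothesis) > n else "NO"

text = ("A bus collision with a truck in Uganda has resulted in at least "
        "30 fatalities and has left a further 21 injured.")
hyp = "30 die in a bus collision in Uganda."
print(wordbag_count(text, hyp))   # 4 matched words: 30, bus, collision, uganda
print(judge(text, hyp))           # YES
```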
Potential Features
- Wordbag ratio = # of matches between text and hypothesis / # of words in the hypothesis
- Works for:
  – Text: A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.
  – Hypothesis: 30 die in a bus collision in Uganda.
  – Wordbag ratio = 6 / 8
- Fails for:
  – Text: Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.
  – Hypothesis: Blue Mountain Lumber owns Ernslaw One.
  – Wordbag ratio = 5 / 6
- A potential solution needs to process semantic knowledge about the relationship between the key words ("subsidiary of" vs. "owns"), which reverse the direction of the relationship.
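A minimal token-overlap sketch of this ratio; exact tokenization choices (e.g. how duplicate words are counted) can shift the numbers slightly from the hand counts on the slide:

```python
import string

def tokens(sentence):
    """Lowercased tokens with punctuation stripped."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()

def wordbag_ratio(text, hypothesis):
    """Fraction of hypothesis words that also occur somewhere in the text."""
    text_words = set(tokens(text))
    hyp_words = tokens(hypothesis)
    matched = sum(1 for w in hyp_words if w in text_words)
    return matched / len(hyp_words)

# The NO example scores high (5/6) despite not being entailed,
# which is exactly why this feature fails there:
blue_text = ("Blue Mountain Lumber is a subsidiary of Malaysian forestry "
             "transnational corporation, Ernslaw One.")
blue_hyp = "Blue Mountain Lumber owns Ernslaw One."
print(wordbag_ratio(blue_text, blue_hyp))  # 5/6 ≈ 0.833
```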
Potential Features – cont.
- Word proximity = average distance between matched words in the text
- For example:
  – Text: A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.
  – Hypothesis: 30 die in a bus collision in Uganda.
  – Matched words: 30, in, bus, collision, in, Uganda
  – 30: 3, 12, 11, 3, 6
  – in: 3, 5, 4, 1
  – bus: 12, 5, 1, 5, 6
  – collision: etc.
- May not help much or at all, but by adding additional independent features (drawn from a Gaussian distribution), we can potentially increase the posterior P(ω_n | x)
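One way to reduce those per-word distance lists to a single number is the average pairwise token distance between matched positions; this averaging scheme is my guess at the slide's intent, not a fixed definition:

```python
from itertools import combinations

def word_proximity(text, hypothesis):
    """Average pairwise token distance between matched words in the text.

    Smaller values suggest the hypothesis words cluster together in the
    text.  Returns 0.0 if fewer than two positions match.
    """
    text_tokens = text.lower().split()
    hyp_words = set(hypothesis.lower().split())
    positions = [i for i, w in enumerate(text_tokens) if w in hyp_words]
    pairs = list(combinations(positions, 2))
    if not pairs:
        return 0.0
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

# Two matches at positions 1 and 3 give an average distance of 2:
print(word_proximity("x a y a x", "a"))  # 2.0
```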
Potential Features – cont.
- Word grouping = # of hypothesis word groups of length 2 (adjacent bigrams) that also appear in the text / # of possible bigrams in the hypothesis
- For example:
  – Text: A bus collision with a truck in Uganda has resulted in at least 30 fatalities and has left a further 21 injured.
  – Hypothesis: 30 die in a bus collision in Uganda.
  – Matched groups: "bus collision", "in Uganda"; 7 possible combinations = 2/7
  – Text: Blue Mountain Lumber is a subsidiary of Malaysian forestry transnational corporation, Ernslaw One.
  – Hypothesis: Blue Mountain Lumber owns Ernslaw One.
  – Matched groups: "Blue Mountain", "Mountain Lumber", "Ernslaw One"; 5 combinations = 3/5
- Once again, this may not help much or at all, but it may help us brainstorm a bit
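Reading "word groups of length 2" as adjacent bigrams of the hypothesis, a sketch (automatic tokenization may count a match or two differently than the hand counts above):

```python
import string

def tokens(sentence):
    """Lowercased tokens with punctuation stripped."""
    table = str.maketrans("", "", string.punctuation)
    return sentence.lower().translate(table).split()

def bigrams(words):
    """Adjacent word pairs."""
    return [(words[i], words[i + 1]) for i in range(len(words) - 1)]

def word_grouping(text, hypothesis):
    """Fraction of hypothesis bigrams that also appear in the text."""
    text_bigrams = set(bigrams(tokens(text)))
    hyp_bigrams = bigrams(tokens(hypothesis))
    matched = sum(1 for b in hyp_bigrams if b in text_bigrams)
    return matched / len(hyp_bigrams)

# The second example from the slide: 3 of 5 hypothesis bigrams match.
blue_text = ("Blue Mountain Lumber is a subsidiary of Malaysian forestry "
             "transnational corporation, Ernslaw One.")
blue_hyp = "Blue Mountain Lumber owns Ernslaw One."
print(word_grouping(blue_text, blue_hyp))  # 0.6
```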
Potential Features – cont.
- Quick and easy stats we can generate may include:
  – Stemmers – count matching verbs?
  – Synonyms/antonyms – count matches of both types
  – Parts of speech – brainstorm, anyone?
  – Removal or weighting of names and place names – collapse a multi-word name into a single symbol so it does not get extra weight in the match counts
  – Matching phrases that appear similar in both the text and the hypothesis
- Any "count" that can be created from any processing of semantic or syntactic information can be used as a feature
- I am now using Matlab for the implementation, so any Unix program can be used to compute a feature – maybe someone knows of an existing command-line feature-extraction program
RTE3 Important Dates
- Test set released on March 5th
  – Gives us one week before submissions are due
- Last day to submit is March 12th
  – Submission consists of running the data yourself and then submitting the result file
  – A cheater says whaaaa?
- Technical report deadline: March 26th
- I will be working on this on and off until March 6th; then I can devote full time to our submission