Overspecified reference in hierarchical domains: measuring the benefits for readers Ivandre Paraboni * Judith Masthoff # Kees van Deemter # * = University of Sao Paulo # = University of Aberdeen
What this is about Generation of Referring Expressions (GRE) Referring expression is overspecified if a clear referring expression can be obtained by removing a property Informally: overspecified = logically redundant
Introduction to the problem Suppose: – I live on Western Road, the longest street in Aberdeen – I live at number 968. No other house in Aberdeen has that number. "Number 968, Aberdeen" is a distinguishing description, but it's not very useful. It's better to add logically redundant information, e.g., "968 Western Road, Aberdeen", or even "968 Western Road, Bon Accord, Aberdeen"
Overspecification in referring expressions Any GRE algorithm that does not achieve Full Brevity (Dale 1989) Investigated in its own right by e.g. –Arts 2004 (role of location; purely empirical) –Jordan 2000 (overspec in specific situations, e.g., when a sale is confirmed) –Horacek 2005 (overspec when there is uncertainty about applicability of properties)
Our focus: The need for overspecification when a large domain is not fully known in advance to a hearer. Typical examples involve space or time: –A house in a city, a photocopier in a building, a picture in a document –(An event or object in time, e.g., the minister of the colonies in the XYZ government ) This talk: empirical validation of algorithms
Caveat Overspecification can make it easier to identify the referent but it is bound to lengthen reading times Our terminology: we expect overspecification –to make interpretation harder –to make resolution easier
Short history... Paraboni & van Deemter (INLG-2002): A simple theory of the way in which hearers perform search. Ancestral Search (AS) Two types of situations that AS predicts to be problematic for hearers: Lack of Orientation (LO) and Dead End (DE). An algorithm (in two flavours) that adds redundant information when AS predicts these problems An experiment to test whether these algorithms improve the output of GRE
(1) Lack of Orientation (LO) University of Brighton Watts building Cockcroft building North Wing South Wing North West South library auditorium "the West Wing"
(2) Dead End (DE) University of Brighton Watts building Cockcroft building North Wing ? South Wing North West South library auditorium "the library in the North Wing"
Explanation (informal!) Why are LO and DE bad? Ancestral Search (AS): Search locally, then one level up at a time Essentially, this is just salience (cf. Krahmer & Theune 2000) applied to hierarchies
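A minimal sketch of the search strategy just described: search locally, then one level up at a time. This is not the authors' implementation. The `Node` class, the helper names, and the exact layout of the campus tree are assumptions made for illustration (the original diagram did not survive in this transcript), and Python is used only for convenience.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Node:
    """One element of a hierarchical domain (hypothetical helper type)."""
    name: str
    parent: Optional[Node] = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list, repr=False)

    def add(self, name: str) -> Node:
        child = Node(name, parent=self)
        self.children.append(child)
        return child


def find_below(root: Node, name: str) -> Optional[Node]:
    """Depth-first search for a node called `name` strictly below `root`."""
    for child in root.children:
        if child.name == name:
            return child
        hit = find_below(child, name)
        if hit is not None:
            return hit
    return None


def ancestral_search(start: Node, name: str) -> tuple[Optional[Node], int]:
    """Search locally first, then widen the scope one level up at a time.

    Returns the first match and the number of levels climbed.
    """
    scope: Optional[Node] = start
    levels_up = 0
    while scope is not None:
        hit = find_below(scope, name)
        if hit is not None:
            return hit, levels_up
        scope = scope.parent
        levels_up += 1
    return None, levels_up


# Assumed layout, loosely based on the University of Brighton example.
uni = Node("University of Brighton")
watts = uni.add("Watts building")
cockcroft = uni.add("Cockcroft building")
watts_north = watts.add("North Wing")
watts.add("South Wing")
cockcroft_north = cockcroft.add("North Wing")
cockcroft.add("West Wing")
library = cockcroft_north.add("library")

# A hearer located in the Watts building resolves "the North Wing" locally ...
wing, climbed = ancestral_search(watts, "North Wing")
# ... but that wing contains no library: the situation called a Dead End.
dead_end = find_below(wing, "library") is None
```

In this assumed layout, a hearer in the Watts building who reads "the West Wing" finds nothing locally and has to climb to the campus level before a match turns up, which is the kind of case the talk labels Lack of Orientation.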
Summary of Experiment 1: Descriptions compared by subjects 15 subjects were shown documents from which most of the words were deleted Binary forced choice between two expressions that refer to document parts: 1. the obvious minimal description 2. the redundant description generated by our algorithm
What the subjects chose between (example)
Hypotheses & Outcomes Hyp 1: In problematic situations, redundant descriptions are preferred Hyp 2: In non-problematic situations, non-redundant descriptions are preferred Outcomes: –Hyp 1: overwhelmingly confirmed –Hyp 2: trend in the right direction (57%), but not statistically significant. (Too few subjects?)
Limitations of first experiment This experiment was hybrid: partly about reading, partly about writing It did not teach us why redundant descriptions were preferred (in problematic cases) We think this was because non-redundant descriptions caused problems for resolution but the experiment did not address resolution separately. (Subjects may have balanced interpretation and resolution when judging).
What next? Therefore, a new experiment was called for, which addresses resolution only. Documents as our domain again Add hyperlinks to support non-linear search through the document Track readers' resolution (i.e., search) process Intricate experiment, hence a new author: Judith Masthoff (University of Aberdeen)
Experiment 2: Tracking resolution Effect of logical redundancy on the performance of readers Focussing on resolution
Experimental Design 40 subjects completed the experiment Within-subjects design: each subject shown 20 documents Order of documents randomized Documents were made to look different Reader had knowledge of hierarchical structure Reader was given task: "Please click on ..." Navigation actions recorded
"Let's talk about helicopters." "Please click on picture 4 in part C" Reader Location
Hypothesis 1 In a problematic (DE/LO) situation, the number of navigation actions required for a long (FI/SL) description is smaller than that required for a minimal description. Informally: redundancy helps resolution! (in problematic situations)
But... it seems likely that redundant information will always help resolution, so let's compare the Gain in problematic/unproblematic situations
Hypothesis 2 The Gain achieved by a long description over a minimal description will be larger in a problematic situation than in a non-problematic situation Informally: redundancy helps especially in problematic situations
But... Even more redundancy might have helped even more The obvious candidate: a complete description Compare cases where our algorithm prescribes a complete description with ones where it does not. We want b to be greater than a: a = Gain(complete-description, incomplete-description-generated-by-algorithm) b = Gain(complete-description-generated-by-algorithm, incomplete-description)
Hypothesis 3 The Gain of a complete description over a less complete one will be larger for a situation in which our algorithms generated the complete description, than for a situation in which our algorithms generated the less complete description.
Results: Hypothesis 1 Do redundant descriptions benefit problematic situations?
Results: Hypothesis 2 Do redundant descriptions benefit problematic situations MORE than non-problematic situations?
Comparing like with like General Linear Model (GLM) with repeated measures Comparison of similar situations, e.g. 2 and 7 sit2&7: minimal = pic.3 in part A; redundant = pic.3 in part A of section 2 sit2: reader is in same section as target sit7: reader is in a different section
Results: Hypothesis 2 Do redundant descriptions benefit problematic situations MORE than non-problematic situations? Yes!
Results: Hypothesis 3 FI Are our algorithms economical with redundancy?
Results: Hypothesis 3 FI Are our algorithms economical with redundancy? Yes!
How much overspecification is optimal? University of Brighton Watts building Cockcroft building North Wing South North West South library auditorium The auditorium The...in the North Wing The.... in the Watts building The.... on this campus
Which of all these descriptions is best? Depends on issues other than the structure of the domain, e.g., –how much time/space has the speaker/writer available? –how important is it that misunderstandings are avoided? [cf., Van Deemter et al., this conference] –is there room for negotiation through dialogue? [cf., Khan et al., this conference]
In setting of this experiment We did not find a point beyond which overspecification backfires We did find a point of diminishing returns for resolution speed Given that interpretation deteriorates with every added property, the figures are suggestive
Getting a feeling for the numbers Nonproblematic situations (situations 7 and 8): –short descr: 1.53 clicks (2 properties) –redundant (other): 1.34 clicks (3 properties) Problematic situations (situations 3 and 4): –short descr: 4.05 clicks (1 property) –redundant (algorithm): 1.77 clicks (2 properties) –redundant(other): 1.31 clicks (3 properties)
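To make the comparison concrete, the sketch below computes click savings from the means on the slide above. The slides do not define Gain formally, so treating it as the difference in mean clicks between the shorter and the longer description is an assumption, as are the dictionary keys and the function name.

```python
# Mean clicks as reported on the slide, keyed by (situation type, description).
mean_clicks = {
    ("nonproblematic", "short"): 1.53,            # 2 properties
    ("nonproblematic", "redundant_other"): 1.34,  # 3 properties
    ("problematic", "short"): 4.05,               # 1 property
    ("problematic", "redundant_algorithm"): 1.77, # 2 properties
    ("problematic", "redundant_other"): 1.31,     # 3 properties
}


def gain(situation: str, longer: str, shorter: str = "short") -> float:
    """Assumed definition: mean clicks saved by using the longer description."""
    saved = mean_clicks[(situation, shorter)] - mean_clicks[(situation, longer)]
    return round(saved, 2)


# Clicks saved by the algorithm's redundant description in problematic situations.
gain_problematic = gain("problematic", "redundant_algorithm")
# Clicks saved by adding a property in non-problematic situations.
gain_nonproblematic = gain("nonproblematic", "redundant_other")
# Extra saving from a third property on top of the algorithm's description.
gain_extra_property = gain("problematic", "redundant_other", "redundant_algorithm")

print(gain_problematic, gain_nonproblematic, gain_extra_property)
```

Under this reading, the algorithm's added property saves 2.28 clicks in problematic situations against 0.19 in non-problematic ones, and a further property saves only 0.46 more, which is the diminishing return the talk points to.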
Conclusion Overspec can have many reasons (Jordan 2000, Horacek 2005) Overspec isn't always equally necessary Focus on overspec for guiding resolution The optimum amount of overspec is hard to determine But we have found a point of diminishing returns, based on the need to avoid DE and LO.
[ A medical comparison A hospital with two types of patients, all of whom have coughing (cf., clicking!) as their main symptom –chest infections (serious patients) –throat infections (light patients) You can administer 1, 2, or 3 pills (cf., properties). But pills can be harmful, so the doctor uses them sparingly
The doctor's regime: light patients should get 1 pill; serious patients should get 2 pills on a normal night, and 3 pills on a bad night Is this a wise regime? Tests were done...
Test of effectiveness of pills 1. Serious patients who get their 2 or 3 pills start coughing less 2. Serious patients benefit more from getting their prescribed high number of pills (as opposed to just 1) than light patients 3. Focus on serious patients. Try giving the ones that are having a good night 3 pills (i.e. one more than prescribed). They benefit less (from getting 3 instead of 2 pills) than the ones that are having a bad night benefitted (from getting 3 instead of 2 pills). ]
Results on Search Behaviour # Deviations from Ancestral Search in first navigation action for 12 documents with incomplete descriptions