SELLC Winter School 2010 Evaluating Algorithms for GRE Kees van Deemter (work with Albert Gatt, Ielka van der Sluis, and Richard Power) University of Aberdeen,


1 SELLC Winter School 2010 Evaluating Algorithms for GRE Kees van Deemter (work with Albert Gatt, Ielka van der Sluis, and Richard Power) University of Aberdeen, Scotland, UK

2 SELLC Winter School 2010 Outline GRE: Generation of Referring Expressions TUNA project: Corpus and Annotation Evaluation of Algorithms –Furniture Domain –People Domain [ Evaluation in the real world: STEC ]

3 SELLC Winter School 2010 TUNA project (ended Feb. 2007) TUNA: Towards a UNified Algorithm for Generating Referring Expressions. 1. Extend coverage of GRE algorithms (plurals, negation, gradable properties, …) 2. Improve empirical foundations of GRE Focus on –Content Determination –First-mention NPs (no anaphora!)

4 SELLC Winter School 2010 Background Dale and Reiter (1995) hypothesised that the Incremental Algorithm (IA) led to better output than other algorithms –better: more human-like –other algorithms: see below

5 Other GRE Algorithms Full Brevity (FB; Dale 1989) –Generation of minimal descriptions –For example, by first trying all descriptions of length 1, then length 2, and so on. Greedy Algorithm (GR; Dale 1989) –Always add property that removes the most distractors
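
The two strategies on this slide can be sketched in Python as follows. This is a minimal illustration, not the implementations evaluated in the talk. The toy domain `DOMAIN` and the helper names `extension`, `full_brevity` and `greedy` are assumptions made up for this example.

```python
from itertools import combinations

# Hypothetical toy domain (not from the TUNA corpus): entity -> attribute values.
DOMAIN = {
    "e1": {"type": "chair", "colour": "red", "size": "large"},
    "e2": {"type": "chair", "colour": "blue", "size": "large"},
    "e3": {"type": "desk", "colour": "red", "size": "small"},
    "e4": {"type": "chair", "colour": "red", "size": "small"},
}


def extension(props, domain):
    """Entities for which every (attribute, value) pair in props holds."""
    return {e for e, attrs in domain.items()
            if all(attrs.get(a) == v for a, v in props)}


def full_brevity(target, domain):
    """Try all descriptions of length 1, then length 2, and so on."""
    candidates = sorted(domain[target].items())
    for length in range(1, len(candidates) + 1):
        for combo in combinations(candidates, length):
            if extension(combo, domain) == {target}:
                return set(combo)
    return None  # the target cannot be distinguished


def greedy(target, domain):
    """Always add the property that removes the most remaining distractors."""
    description = set()
    distractors = set(domain) - {target}
    remaining = set(domain[target].items())
    while distractors and remaining:
        # sorted() only makes tie-breaking deterministic in this sketch
        best = max(sorted(remaining),
                   key=lambda p: len(distractors - extension([p], domain)))
        removed = distractors - extension([best], domain)
        if not removed:
            return None  # no property helps any further
        description.add(best)
        distractors -= removed
        remaining.discard(best)
    return description if not distractors else None
```

In this toy domain both functions describe `e1` by colour and size. On other domains the greedy choice can yield a longer description than the minimal one.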

6 SELLC Winter School 2010 Elicitation experiment Participants were told that we wanted to test an AI program that interprets referring expressions. Participants were shown a series of domains. Each domain included 1 or 2 target objects. Participants entered their descriptions, then the referents were removed. To make the interaction seem real, we sometimes removed the wrong object! (25% of trials) –The experiment was later repeated without this feature –Essentially the same outcomes were found For generality: two types of domains (furniture, people)

7 SELLC Winter School 2010 Furniture trial

8 SELLC Winter School 2010 People trial

9 SELLC Winter School 2010 Method (overview) Experiment leads to transparent corpus of referring expressions: –referent and distractors are known –domain attributes are known Transparent corpora can be used for many purposes This talk: Compare some classic algorithms –giving each algorithm the same input as subjects –computing how similar each algorithm's output is to the subjects' output –We count semantic content only

10 SELLC Winter School 2010 Elicitation Experiment Furniture (simple domain) –TYPE, COLOUR, SIZE, ORIENTATION People (complex domain) –Nine annotated properties in total Location: –Vertical location (Y-DIMENSION) –Horizontal location (X-DIMENSION) Example descriptions: "the green desk facing backwards"; "the sofa and the desk which are red"; "the young man with a white shirt"; "the man with the funny haircut"; "the man on the left"; "the chair in the top right"

11 SELLC Winter School 2010 Corpus setup Each corpus was carefully balanced, e.g. between singulars and plurals. Between-subjects design: -Location: Subjects discouraged from using locative expressions. +Location: Subjects not discouraged. -FaultCritical: Subjects could correct their utterances +FaultCritical: Subjects could not correct their utterances After discounting outliers and (self-reported) non-fluent speakers, 45 subjects were left

12 SELLC Winter School 2010 Experiment design: Furniture (-Location) 18 trials (C=Colour, O=orientation, S=size) –1 referent: minimal identification uses {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials] –2 similar referents {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials] –2 dissimilar referents {c}, {o}, {s}, {c,o}, {c,s}, or {o,s} [6 trials]

13 SELLC Winter School 2010 Other evaluation studies Limitations: –Limited numbers of subjects/referents –Few attempts at balancing the corpus –IA: no teasing apart of preference orders NB: Some of these studies were more ambitious in some respects, looking at context, and going beyond identification

14 SELLC Winter School 2010 Other evaluation studies Jordan 2000, Jordan & Walker 2005 –More than just identification (Jordan 2000) Siddharthan & Copestake 2004 –References in linguistic context Gupta & Stent 2005 –Realisation mixed with Content Determination Viethen & Dale 2006 –Only Colour and Location

15 SELLC Winter School 2010 Extensions to the classics Plurality: (van Deemter 2002) –Extend each algorithm to search through disjunctions of increasing length Location: (van Deemter 2006) –Locatives treated as gradable: the leftmost table/person –E.g., suppose the referent x is located in column 3 => x is left of column 4, x is left of column 5 … => x is right of column 2, x is right of column 1… Type: –People tend to use TYPE (Dale & Reiter 1995) –Here: All algorithms added TYPE.
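
The gradable treatment of location can be illustrated with a short Python sketch that derives the "left of" and "right of" properties from a column index, as in the column-3 example on this slide. The function name `locative_properties`, the 1-based column numbering and the grid width of 5 are assumptions made for this illustration only.

```python
def locative_properties(column, n_columns):
    """Derive inferred locative properties from a 1-based column index.

    An object in column 3 of a 5-column grid is left of columns 4 and 5
    and right of columns 2 and 1.
    """
    left_of = [("left-of", c) for c in range(column + 1, n_columns + 1)]
    right_of = [("right-of", c) for c in range(column - 1, 0, -1)]
    return left_of + right_of


# Usage: the referent x from the slide, placed in a hypothetical 5-column grid.
properties_of_x = locative_properties(3, 5)
```

An object in the first column gets no "right of" properties, so under this encoding it shares every "left of" property that any other object has. That is one way to read "the leftmost table".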

16 SELLC Winter School 2010 Evaluation aims Hypothesis in Dale & Reiter 1995: –IA resembles human output most Our main questions: –Is this true? –How important are parameters (PO) for the IA? More generally: –assess quality of classic GRE algorithms –calculate average match between the description generated by an algorithm and the descriptions produced by people (for the same referent)

17 SELLC Winter School 2010 Evaluation metric
Dice Coefficient: (2 × |common properties|) / |total properties|
corpus: {A,B,C}; algorithm: {B,C}; Dice = …
corpus: {A,B,C}; algorithm: {A,B,C,D}; Dice = …

18 SELLC Winter School 2010 Evaluation metric
Dice Coefficient: (2 × |common properties|) / |total properties|
corpus: {A,B,C}; algorithm: {B,C}; Dice = (2*2)/5 = 4/5
corpus: {A,B,C}; algorithm: {A,B,C,D}; Dice = (2*3)/7 = 6/7

19 SELLC Winter School 2010 Evaluation metric
Dice Coefficient: (2 × |common properties|) / |total properties|
A coefficient of 1 indicates identical sets; 0 means no common properties.
We also used this to measure agreement between annotators of the corpus
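
As a worked illustration, the Dice coefficient can be computed over property sets with a few lines of Python. The function names `dice` and `mean_dice` are assumptions for this sketch, as is the convention of scoring two empty descriptions as 1. Following the worked examples, |total properties| is taken to be the sum of the two set sizes.

```python
def dice(corpus_props, algorithm_props):
    """Dice coefficient between a human description and an algorithm's output."""
    a, b = set(corpus_props), set(algorithm_props)
    if not a and not b:
        return 1.0  # assumed convention: two empty descriptions count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def mean_dice(human_descriptions, algorithm_description):
    """Average match between one algorithm output and several human descriptions."""
    scores = [dice(h, algorithm_description) for h in human_descriptions]
    return sum(scores) / len(scores)


# The two worked examples from the slides:
first = dice({"A", "B", "C"}, {"B", "C"})             # (2*2)/5
second = dice({"A", "B", "C"}, {"A", "B", "C", "D"})  # (2*3)/7
```

`mean_dice` corresponds to the "average match" mentioned under the evaluation aims, for a single referent.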

20 SELLC Winter School 2010 Assumptions behind DICE The discriminatory power of a description does not matter All properties are equidistant See Gatt & Van Deemter 2007, Content Determination in GRE: evaluating the evaluator

21 SELLC Winter School 2010 Evaluation (I): Furniture Which preference orders for the IA? –Psycholinguistic evidence: COLOUR >> {ORIENTATION, SIZE} (Pechmann 89; Eikmeyer & Ahlsen 96; Belke & Meyer 02); Y-DIMENSION >> X-DIMENSION (Bryant et al, 1992; Arts 2004) Split data: +LOCATION vs -LOCATION This talk: focus on -LOCATION (-LOCATION = approx. 800 descriptions) Compare algorithms to a randomized IA (RAND)

22 SELLC Winter School 2010 Furniture: -LOCATION [Results chart not preserved in the transcript; surviving labels: "Significant", "FB/GR"]

23 SELLC Winter School 2010 Beyond Toy Domains More on Furniture corpus: Gatt et al. (ENLG-2007) With complex real-world objects: –Many different attributes can be used –Number of POs explodes –Few psycholinguistic precedents People domain attributes: –{ hasBeard, hasGlasses, age, hasTie, hasSuit, hasShirt, hasHair, hairColour, orientation } –9 attributes, so 9! = 362880 possible POs

24 SELLC Winter School 2010 IA: Preference Orders for People Domain Little psycholinguistic evidence for choosing between all 362880 possible POs Focus on the most frequent attributes: G=hasGlasses, B=hasBeard, H=hasHair, C=hairColour –Assumption: H and B must precede C –This leaves us with eight POs: { GBHC, GHBC, HBGC, HBCG, HGBC, BHGC, BHCG, BGHC }
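
The count of eight can be checked by enumerating the orderings of G, B, H and C and keeping those in which H and B both precede C. The Python sketch below does this. The function name `constrained_orders` and its parameters are assumptions for illustration.

```python
from itertools import permutations
from math import factorial


def constrained_orders(attributes="GBHC", before=("H", "B"), after="C"):
    """All orderings of `attributes` in which every item in `before` precedes `after`."""
    orders = []
    for perm in permutations(attributes):
        if all(perm.index(x) < perm.index(after) for x in before):
            orders.append("".join(perm))
    return orders


people_orders = constrained_orders()   # the eight POs listed on the slide
all_orders_count = factorial(9)        # 9! orderings of the nine People attributes
```

Of the 24 orderings of four attributes, C comes after both H and B in one third, which gives the eight orders considered.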

25 SELLC Winter School 2010 Preference Orders and frequency
Attribute     Mean   Sum
type          1.39   475
hasGlasses    .68    231
hasBeard      .66    226
hairColour    .61    210
hasHair       .46    158
orientation   .21    73
age           .10    34
hasTie        .04    12
hasSuit       .01    4
hasShirt      .01    3
For attributes other than {G,C,H,B}, we let corpus frequency determine the order.
E.g., IA-GBHC uses type, G, B, H, C, age, hasTie, hasSuit, hasShirt as its PO
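
One way to read this rule is sketched below in Python: put type first, follow it with the chosen order over {G,B,H,C}, and then append the remaining attributes by descending corpus frequency. The names `CORPUS_SUM`, `ABBREV` and `complete_order` are assumptions for this illustration. Applied strictly, the rule places orientation directly after the four focus attributes, although the slide's own example does not list it.

```python
# Corpus sums as given on the slide (People domain).
CORPUS_SUM = {
    "type": 475, "hasGlasses": 231, "hasBeard": 226, "hairColour": 210,
    "hasHair": 158, "orientation": 73, "age": 34, "hasTie": 12,
    "hasSuit": 4, "hasShirt": 3,
}

# Abbreviations used on the slides.
ABBREV = {"G": "hasGlasses", "B": "hasBeard", "H": "hasHair", "C": "hairColour"}


def complete_order(prefix, counts=CORPUS_SUM):
    """Build a full preference order: type, the given prefix, then the rest by frequency."""
    head = ["type"] + [ABBREV[letter] for letter in prefix]
    tail = sorted((a for a in counts if a not in head), key=lambda a: -counts[a])
    return head + tail


ia_gbhc = complete_order("GBHC")
```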

26 SELLC Winter School 2010 Results People Domain [Results chart not preserved in the transcript; surviving labels: "IA-BASE", "Significant", "Significant by subjects", "GR"]

27 SELLC Winter School 2010 Results People domain IA_base performs very badly now. So much about the best IAs that start with {B,H,G,C} and end with … Some of these did much worse: –IA_BHCG had DICE=0.6, making it significantly worse (by subjects) than GR!

28 SELLC Winter School 2010 Summary People domain gives much lower DICE scores than Furniture domain Difference between good and bad POs was –small (but significant) in the Furniture domain, –big (and significant) in the People domain

29 SELLC Winter School 2010 Summary The Incremental Algorithm (IA): –not an algorithm but a class of algorithms The best IA beats all other algorithms, but the worst is very bad... GR performs remarkably well. How to choose a suitable PO? –Furniture: few attributes; psycholinguistic precedent Still, there is variation. –People: more attributes; no precedents Variation even greater!
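
To make the "class of algorithms" point concrete, here is a minimal Python sketch of an Incremental Algorithm that takes the preference order as a parameter. It is a simplification for illustration, not the version evaluated in the talk. It does not force TYPE into the description, and the domain `PEOPLE` and the function name `incremental_algorithm` are made-up assumptions.

```python
def incremental_algorithm(target, domain, preference_order):
    """Walk through attributes in preference order, keeping each one that
    rules out at least one remaining distractor; never backtrack."""
    description = []
    distractors = {e for e in domain if e != target}
    for attribute in preference_order:
        value = domain[target].get(attribute)
        if value is None:
            continue
        ruled_out = {d for d in distractors if domain[d].get(attribute) != value}
        if ruled_out:
            description.append((attribute, value))
            distractors -= ruled_out
        if not distractors:
            return description
    return None  # the preference order was exhausted without identifying the target


# Hypothetical three-person domain (not from the TUNA corpus).
PEOPLE = {
    "p1": {"hasGlasses": "yes", "hasBeard": "yes", "hairColour": "dark"},
    "p2": {"hasGlasses": "yes", "hasBeard": "no", "hairColour": "dark"},
    "p3": {"hasGlasses": "no", "hasBeard": "yes", "hairColour": "light"},
}

with_glasses_first = incremental_algorithm(
    "p1", PEOPLE, ["hasGlasses", "hasBeard", "hairColour"])
with_hair_first = incremental_algorithm(
    "p1", PEOPLE, ["hairColour", "hasBeard", "hasGlasses"])
```

The same target in the same domain gets two different descriptions depending only on the preference order. Because the resulting property sets differ, their Dice scores against a given human description can differ too.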

30 SELLC Winter School 2010 Discussion Suppose you want to build a GRE algorithm for a new and complex domain, for which no transparent corpus is available. Psycholinguistic principles are unlikely to help you much If corpus is also not balanced, then frequency may not say much either …

31 SELLC Winter School 2010 Other uses of this method: STEC Summer 2007: First NLG Shared task Evaluation Challenge (STEC) STEC involved GRE only, focussing on Content Determination 22 GRE Algorithms were submitted and evaluated (6 teams) Reported in UCNLG+MT workshop, Copenhagen, Sept 2007

32 SELLC Winter School 2010 Other uses of this corpus: STEC An even bigger STEC one year later Each algorithm was compared with the TUNA corpus (minus 40% training set) –Both Furniture and People domain –DICE measured humanlikeness –Singulars only Each algorithm was also tested in terms of identification time (by human reader)

33 SELLC Winter School 2010 Some STEC results 1. The more minimal the descriptions generated by these 22 systems were, the worse their DICE scores were

35 2. No relation between humanlikeness and identification time –Best system in terms of DICE was worst-but-one in terms of identification time More research needed on the different criteria for judging NLG output

36 SELLC Winter School 2010 Thank you

37 SELLC Winter School 2010 Annotator agreement
Semantic markup was applied manually to all descriptions in the corpus. 2 annotators were given a stratified random sample. Comparison used Dice.
              mean          mode
Furniture     0.89 (A/B)    1 (71.1%)
Annotator A   0.93 (A/us)   1 (74.4%)
Annotator B   0.92 (B/us)   1 (73%)
People        0.89 (A/B)    1 (70%)
Annotator A   0.84 (A/us)   1 (41.1%)
Annotator B   0.78 (B/us)   1 (36.3%)

