Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues (Ryu Iida, Masaaki Yasuhara, Takenobu Tokunaga)


1 Multi-modal Reference Resolution in Situated Dialogue by Integrating Linguistic and Extra-Linguistic Clues
Ryu Iida, Masaaki Yasuhara, Takenobu Tokunaga
Tokyo Institute of Technology
IJCNLP 2011 (Nov 9, 2011)

2 Research background
 Typical coreference/anaphora resolution
 Researchers have tackled the problems provided by the MUC, ACE and CoNLL shared tasks (a.k.a. OntoNotes)
 Mainly focused on the linguistic aspects of the reference function
 Multi-modal research community (Byron, 2005; Prasov and Chai, 2008; Prasov and Chai, 2010; Schütte et al., 2010; Iida et al., 2010)
 Essential for human-computer interaction
 Identifies the referents of referring expressions in a static scene or a situated world, taking extra-linguistic clues into account

3 Multi-modal reference resolution
Example (figure): the solver utters "Move the triangle to the left", "Rotate the triangle at top right 60 degrees clockwise", "All right… done it… O.K."; resolution can draw on the dialogue history, the action history (e.g. piece 1: move (X:230, Y:150); piece 7: move (X:311, Y:510); piece 3: rotate 60°) and eye gaze.

4 Aim
 Integrate several types of multi-modal information into a machine learning-based reference resolution model
 Investigate which kinds of clues are effective for multi-modal reference resolution

5 Multi-modal problem setting: related work
 3D virtual world (Byron, 2005; Stoia et al., 2008)
 e.g. participants controlled an avatar in a virtual world, exploring for hidden treasures
 Scene updates occur frequently
 Referring expressions are relatively skewed toward exophoric cases
 Static scene (Dale, 1992)
 The centrality and size of each object in the computer display are fixed throughout the dialogue
 No change in the visual salience of objects is observed

6 Evaluation data set creation
 REX-J corpus (Spanger et al., 2010)
 Dialogues and transcripts of collaborative work (solving Tangram puzzles) by two Japanese participants
 The puzzle-solving task was designed to require frequent use of both anaphoric and exophoric referring expressions

7 Setting for collecting data
Figure: the solver and the operator sit on either side of a shield screen; the display shows the working area and the goal shape; the mouse is not available to the solver.

8 Collecting eye gaze data
 Recruited 18 Japanese graduate students and split them into 9 pairs
 All pairs knew each other previously and were of the same sex and approximately the same age
 Each pair was asked to solve 4 different Tangram puzzles
 Used a Tobii T60 Eye Tracker, sampling at 60 Hz with 0.5-degree accuracy, to record the users' eye gaze
 5 dialogues in which the tracking results contained more than 40% errors were removed

9 Annotating referring expressions
 Conducted using a multimedia annotation tool, ELAN
 An annotator manually detects a referring expression and then selects its referent from the puzzle pieces shown on the computer display
 Total number of annotated referring expressions: 1,462 instances in 27 dialogues
 1,192 instances in the solver's utterances (81.5%)
 270 instances in the operator's utterances (18.5%)

10 Multi-modal reference resolution
 Base model
 Ranking candidate referents is important for better accuracy (Iida et al., 2003; Yang et al., 2003; Denis and Baldridge, 2008)
 Apply the Ranking SVM algorithm (Joachims, 2002): learn a weight vector that ranks candidates given a partial ranking of the referents
 Training instances: to define the partial ranking of candidate referents, simply rank the referent referred to by a given referring expression in first place and all other candidates in second place
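The base model above can be sketched as a pairwise ranker: every (gold referent, other candidate) pair becomes a preference constraint. This is a minimal illustrative re-implementation of the Ranking SVM idea (Joachims, 2002) via subgradient descent on the pairwise hinge loss, not the authors' code; the function names and the toy feature layout are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def ranking_svm_fit(groups, lr=0.1, reg=0.01, epochs=200):
    """Learn a linear scoring function from pairwise preferences.

    groups: list of (candidates, gold_index), where candidates is a
    list of feature vectors for the candidate referents of one
    referring expression and gold_index marks the annotated referent.
    Each (gold, other) pair yields the constraint
    score(gold) > score(other); the update below is subgradient
    descent on the pairwise hinge loss Ranking SVM optimizes.
    """
    dim = len(groups[0][0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for cands, gold in groups:
            for j, cand in enumerate(cands):
                if j == gold:
                    continue
                diff = [g - c for g, c in zip(cands[gold], cand)]
                if dot(w, diff) < 1.0:  # margin violated: push gold up
                    w = [wi + lr * (di - reg * wi)
                         for wi, di in zip(w, diff)]
    return w

def rank_candidates(w, cands):
    """Candidate indices sorted best-first by the learned score."""
    return sorted(range(len(cands)), key=lambda i: -dot(w, cands[i]))
```

At prediction time the top-ranked candidate is taken as the referent.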

11 Feature set
1. Linguistic features: Ling (Iida et al., 2010): 10 features
 Capture the linguistic salience of each referent based on the discourse history
2. Task-specific features: TaskSp (Iida et al., 2010): 12 features
 Consider the visual salience based on recent movements of the mouse cursor and the pieces recently manipulated by the operator
3. Eye-gaze features: Gaze (proposed): 14 features

12 Eye gaze as a clue to the reference function
 Eye gaze
 Saccades: quick, simultaneous movements of both eyes in the same direction
 Eye fixations: maintaining the visual gaze on a single location
 The direction of eye gaze directly reflects the focus of attention (Richardson et al., 2007)
 Eye fixations are used as clues for identifying the pieces being focused on
 Saccades and eye fixations are separated by dispersion-threshold identification (Salvucci and Anderson, 2001)
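Dispersion-threshold identification can be sketched as follows: a run of gaze samples counts as a fixation while its spatial spread stays small. The thresholds and the (t, x, y) sample format here are illustrative assumptions, not the paper's settings.

```python
def dispersion(window):
    """Spread of a window of (t, x, y) gaze samples:
    (max x - min x) + (max y - min y)."""
    xs = [p[1] for p in window]
    ys = [p[2] for p in window]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def idt_fixations(samples, max_dispersion=1.0, min_samples=3):
    """Dispersion-threshold identification (I-DT): a run of samples
    is a fixation if its dispersion stays within max_dispersion for
    at least min_samples samples; other samples are treated as
    saccades.  Returns (start_index, end_index) pairs of fixations."""
    fixations, i, n = [], 0, len(samples)
    while i + min_samples <= n:
        j = i + min_samples
        if dispersion(samples[i:j]) <= max_dispersion:
            # grow the window while the dispersion stays small
            while j < n and dispersion(samples[i:j + 1]) <= max_dispersion:
                j += 1
            fixations.append((i, j - 1))
            i = j
        else:
            i += 1  # saccade sample, slide the window on
    return fixations
```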

13 Eye gaze features
Figure: timeline of fixations (on piece_a, piece_b, …) around the utterance "First you need to move the smallest triangle to the left"; a window [t − T, t] with T = 1500 msec (Prasov and Chai, 2010) is placed relative to the referring expression.
 The features capture how frequently or how long the speaker fixates on each piece
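One way to compute this kind of gaze feature is to total fixation time per piece inside the window [t − T, t]. A hedged sketch; the (start, end, piece) fixation-event format and the function name are assumptions.

```python
T = 1.5  # window length in seconds before the referring expression
         # (1500 msec, following Prasov and Chai, 2010)

def gaze_window_features(fixations, t, window=T):
    """Total fixation time per piece inside [t - window, t].

    fixations: list of (start, end, piece) fixation events in
    seconds (an assumed format).  Returns (per-piece totals,
    longest-fixated piece); the Gaze features summarize this
    kind of information per candidate referent.
    """
    lo, hi = t - window, t
    totals = {}
    for start, end, piece in fixations:
        overlap = min(end, hi) - max(start, lo)
        if overlap > 0:  # clip each fixation to the window
            totals[piece] = totals.get(piece, 0.0) + overlap
    longest = max(totals, key=totals.get) if totals else None
    return totals, longest
```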

14 Empirical evaluation
 Compared models with different combinations of the three types of features
 Conducted 5-fold cross-validation
 Proposed model with model separation (Iida et al., 2010)
 The referential behaviour of pronouns is completely different from that of non-pronouns
 Two reference resolution models are created separately: a pronoun model, which identifies the referent of a given pronoun, and a non-pronoun model, which identifies the referent of all other expressions (e.g. NPs)

15 Results of (non-)pronouns

model              pronoun  non-pronoun
Ling                  56.0         65.4
Gaze                  56.7         48.0
TaskSp                79.2         21.1
Ling+Gaze             66.5         75.7
Ling+TaskSp           79.0         67.1
TaskSp+Gaze           78.0         48.4
Ling+TaskSp+Gaze      78.7         76.0

16 Overall results

model              accuracy
Ling                   61.8
Gaze                   51.2
TaskSp                 42.8
Ling+Gaze              72.3
Ling+TaskSp            71.5
TaskSp+Gaze            59.5
Ling+TaskSp+Gaze       77.0

17 Investigation of the significance of features
 Calculate the weight of each feature f according to the following formula:

weight(f) = Σ_{x ∈ SV} α_x · I(f, x)

where SV is the set of the support vectors in a ranker, α_x is the weight of the support vector x, and I(f, x) is a function that returns 1 if f occurs in x (0 otherwise).
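The weight computation reduces to a sum over the support vectors containing the feature. A small sketch; representing each support vector as an (alpha, feature set) pair is an assumption about how the trained ranker is stored.

```python
def feature_weight(f, support_vectors):
    """weight(f) = sum over support vectors x of alpha_x * I(f, x),
    where alpha_x is the weight of support vector x and I(f, x) is 1
    iff feature f occurs in x.  support_vectors: (alpha, feature_set)
    pairs taken from a trained ranker (an assumed representation)."""
    return sum(alpha for alpha, feats in support_vectors if f in feats)
```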

18 Weights of features in each model

      pronoun model        non-pronoun model
rank  feature    weight    feature    weight
 1    TaskSp1    0.4744    Ling6      0.6149
 2    TaskSp3    0.2684    Gaze10     0.1566
 3    Ling1      0.2298    Gaze9      0.1566
 4    TaskSp7    0.1929    Gaze7      0.1255
 5    TaskSp9    0.1605    Gaze11     0.1225
 6    Gaze10     0.1547    Gaze14     0.1134
 7    Gaze9      0.1547    Gaze13     0.1134
 8    Ling6      0.1442    Gaze12     0.1026
 9    Gaze7      0.1267    Ling2      0.1014
10    Ling2      0.1164    Gaze1      0.0750

TaskSp1: the mouse cursor was over a piece at the beginning of uttering a referring expression
TaskSp3: the time distance is less than or equal to 10 sec after the mouse cursor was over a piece

19 Weights of features in each model (same table as the previous slide)

Ling6: the shape attributes of a piece are compatible with the attributes of a referring expression
Gaze10: the fixation time of a piece in the time period [t − T, t] exists
Gaze9: the fixation time of a piece in the time period [t − T, t] is the longest out of all pieces

20 Summary
 Investigated the impact of multi-modal information on reference resolution in Japanese situated dialogues
 The results demonstrate that
 the referents of pronouns rely on the visual focus of attention, as indicated by moving the mouse cursor
 non-pronouns are strongly related to eye fixations on their referents
 integrating these two types of multi-modal information with linguistic information contributes to increasing the accuracy of reference resolution

21 Future work
 Further data collection is needed
 All objects in a Tangram puzzle (i.e. puzzle pieces) have nearly the same size, which rules out the factor that a relatively larger object occupying more of the computer display is more prominent than smaller objects
 Zero anaphors in utterances need to be annotated, given their frequent use in Japanese

