
1 InfoMagnets: Making Sense of Corpus Data
Jaime Arguello, Language Technologies Institute

2 Topic Segmentation: Helping InfoMagnets Make Sense of Corpus Data
Jaime Arguello, Language Technologies Institute

3 Outline
- InfoMagnets
- Applications
- Topic Segmentation
  - Evaluation of 3 algorithms
- Results
- Conclusions
- Q/A

4 InfoMagnets

5 InfoMagnets Applications
- Behavioral research
  - 2 publishable results (submitted to CHI)
  - CycleTalk Project, LTI; Netscan Group, HCII
- Conversational interfaces
  - Tutalk (Gweon et al., 2005)
  - Guide authoring using pre-processed human-human sample conversations
  - Corpus organization makes authoring conversational agents less intimidating (Rose, Pai, & Arguello, 2005)

6 Pre-processing
[Figure: transcribed conversations are (1) split into topic "chunks" by topic segmentation, and the chunks are then (2) grouped by topic clustering.]

7 Topic Segmentation
- A preprocessing step for InfoMagnets
- An important computational linguistics problem!
- Previous work:
  - Marti Hearst's TextTiling (1994)
  - Beeferman, Berger, and Lafferty (1997)
  - Barzilay and Lee (2004): NAACL best paper award!
  - Many others
- But we are segmenting dialogue...

8 Topic Segmentation of Dialogue
Dialogue is different:
- Very little training data
- Linguistic phenomena: ellipsis, telegraphic content
- And, most importantly: coherence in dialogue is organized around a shared task, not around a single flow of information!

9 Coherence Defined Over Shared Task
Multiple topic shifts occur in regions with no intersection of content words.

10 Evaluation of 3 Algorithms
- 22 student-tutor pairs
- Domain: thermodynamics
- Conversation via chat interface
- One coder
- Results reported in terms of Pk (Beeferman, Berger, and Lafferty, 1999)
- Significance tests: 2-tailed t-tests
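Pk measures how often two utterances a fixed distance k apart are placed in the same topic segment by one segmentation but in different segments by the other (lower is better). A minimal sketch of the metric in Python; the per-utterance segment-id representation and the default choice of k are assumptions for illustration, not details from the slides:

```python
def p_k(reference, hypothesis, k=None):
    """Pk: slide a window of width k over the dialogue and count how often
    reference and hypothesis disagree on whether the two window endpoints
    fall in the same topic segment. Inputs are per-utterance segment ids,
    e.g. [0, 0, 0, 1, 1, 2, ...]."""
    n = len(reference)
    if k is None:
        # conventional choice: half the average reference segment length
        k = max(2, round(n / (2 * len(set(reference)))))
    disagreements = 0
    for i in range(n - k):
        same_ref = reference[i] == reference[i + k]
        same_hyp = hypothesis[i] == hypothesis[i + k]
        disagreements += int(same_ref != same_hyp)
    return disagreements / (n - k)
```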

11 3 Baselines
- NONE: no topic boundaries
- ALL: every utterance marks a topic boundary
- EVEN: every 13th utterance marks a topic boundary (average topic length = 13 utterances)
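For concreteness, a small sketch of how the three degenerate baselines can be produced as boundary indices, plus a helper that converts boundaries into the per-utterance segment ids used by the Pk sketch above. The function names are illustrative:

```python
def none_baseline(n):
    return []                            # NONE: no topic boundaries

def all_baseline(n):
    return list(range(1, n))             # ALL: every utterance starts a new topic

def even_baseline(n, span=13):
    return list(range(span, n, span))    # EVEN: boundary every 13th utterance

def to_segment_ids(boundaries, n):
    """Boundary indices -> per-utterance segment ids (input format for p_k)."""
    bounds, seg, ids = set(boundaries), 0, []
    for i in range(n):
        if i in bounds:
            seg += 1
        ids.append(seg)
    return ids
```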

12 1st Attempt: TextTiling (Hearst, 1997)
- Slide two adjacent "windows" (w1, w2) down the text
- Calculate cosine correlation at each step
- Use the correlation values to calculate "depth" scores
- "Depth" values higher than a threshold correspond to topic shifts
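A simplified, pure-Python sketch of the TextTiling idea described on this slide; the window size, tokenization, and depth threshold are illustrative choices, not Hearst's exact parameters:

```python
import re
from collections import Counter
from math import sqrt

def bow(utterance):
    return Counter(re.findall(r"[a-z]+", utterance.lower()))

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a if t in b)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def texttiling_boundaries(utterances, w=3, threshold=0.15):
    """Slide two adjacent w-utterance windows over the dialogue, score each
    gap by cosine similarity, then mark gaps whose 'depth' (drop relative to
    the surrounding similarity peaks) exceeds a threshold."""
    bows = [bow(u) for u in utterances]
    gap_scores = []
    for i in range(w, len(utterances) - w + 1):
        left = sum((bows[j] for j in range(i - w, i)), Counter())
        right = sum((bows[j] for j in range(i, i + w)), Counter())
        gap_scores.append(cosine(left, right))
    boundaries = []
    for g in range(len(gap_scores)):
        left_peak = max(gap_scores[:g + 1])
        right_peak = max(gap_scores[g:])
        depth = (left_peak - gap_scores[g]) + (right_peak - gap_scores[g])
        if depth > threshold:
            boundaries.append(g + w)     # gap index -> utterance index
    return boundaries
```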

13 TextTiling Results
- TextTiling performs worse than the baselines
- The difference is not statistically significant
- Why doesn't it work?

Algorithm      (avg) Pk
NONE           0.489703303
ALL            0.518040038
EVEN           0.511727103
TextTiling     0.535304902

T-test         p-value
TT vs. NONE    0.160036665
TT vs. ALL     0.089668491
TT vs. EVEN    0.221969444

14 TextTiling Results
- Topic boundaries set heuristically where the correlation is 0
- Bad results, but still valuable!

15 2nd Attempt: Barzilay and Lee (2004)
- Cluster utterances
- Treat each cluster as a "state"
- Construct an HMM
  - Emissions: state-specific language models
  - Transitions: based on location and cluster membership of the utterances
- Viterbi re-estimation until convergence
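A highly simplified sketch of the content-model idea: given an initial clustering of utterances into K states, estimate a smoothed language model per state, Viterbi-decode the most likely state sequence, then re-cluster by decoded state and repeat until the assignment stops changing; topic boundaries fall where the decoded state changes. Barzilay and Lee use bigram state models and a more careful transition estimate; the unigram emissions, tokenizer, and externally supplied transition matrix here are stand-in assumptions:

```python
import math
import re
from collections import Counter

def tokenize(utterance):
    return re.findall(r"[a-z]+", utterance.lower())

def unigram_lm(cluster_utterances, vocab, alpha=0.1):
    """Laplace-smoothed unigram language model for one state/cluster."""
    counts = Counter(t for u in cluster_utterances for t in tokenize(u))
    total = sum(counts.values()) + alpha * len(vocab)
    return {t: (counts[t] + alpha) / total for t in vocab}

def log_emit(lm, utterance):
    return sum(math.log(lm[t]) for t in tokenize(utterance) if t in lm)

def viterbi(utterances, lms, log_trans):
    """Most likely state sequence; log_trans[p][s] = log P(state s | state p)."""
    K = len(lms)
    scores = [[log_emit(lms[s], utterances[0]) for s in range(K)]]
    back = []
    for u in utterances[1:]:
        row, ptr = [], []
        for s in range(K):
            best_prev = max(range(K), key=lambda p: scores[-1][p] + log_trans[p][s])
            row.append(scores[-1][best_prev] + log_trans[best_prev][s] + log_emit(lms[s], u))
            ptr.append(best_prev)
        scores.append(row)
        back.append(ptr)
    state = max(range(K), key=lambda s: scores[-1][s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path   # boundaries fall wherever path[i] != path[i - 1]
```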

16 B&L Results
- B&L is statistically better than TextTiling, but not better than the degenerate algorithms

Algorithm      (avg) Pk
NONE           0.489703303
ALL            0.518040038
EVEN           0.511727103
TextTiling     0.535304902
B&L            0.5086039

T-test              p-value
B&L vs. NONE        0.586165431
B&L vs. ALL         0.471531964
B&L vs. EVEN        0.862335954
B&L vs. TextTiling  0.038565997

17 B&L Results
- Topic boundaries are too fine-grained
- Fixed expressions ("ok", "yeah", "sure")
- Remember: cohesion is based on the shared task
- Are the state-based language models sufficiently different?

18 Adding Dialogue Dynamics
- Dialogue act coding scheme
  - Developed for discourse analysis of human-tutor dialogues
- 4 main dimensions:
  - Action
  - Depth
  - Focus
  - Control
- Dialogue exchange (Sinclair and Coulthard, 1975)

19 3rd Attempt: Cross-Dimensional Learning
- X-dimensional learning (Donmez et al., 2004)
  - Use estimated labels on some dimensions to learn other dimensions
- 3 types of features:
  - Text (discourse cues)
  - Lexical coherence (binary)
  - Dialogue act labels
- 10-fold cross-validation
- Topic boundaries learned from estimated labels, not hand-coded ones!
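A rough sketch of how such a boundary classifier might be assembled: discourse-cue features, a binary lexical-coherence feature, and estimated dialogue-act labels feed a classifier evaluated with 10-fold cross-validation. The cue-word list, feature names, and the choice of logistic regression are assumptions for illustration; the slides do not specify the learner:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

CUE_WORDS = {"ok", "so", "now", "next", "anyway", "well"}   # illustrative discourse cues

def boundary_features(utterance, prev_utterance, estimated_acts):
    """One feature dict per utterance. estimated_acts maps each dialogue-act
    dimension (e.g. Action, Depth, Focus, Control) to its *estimated* label."""
    toks = set(utterance.lower().split())
    prev = set(prev_utterance.lower().split())
    feats = {f"cue={w}": 1 for w in toks & CUE_WORDS}         # text (discourse cues)
    feats["lexical_overlap"] = int(bool(toks & prev))         # binary lexical coherence
    for dim, label in estimated_acts.items():                 # dialogue-act features
        feats[f"{dim}={label}"] = 1
    return feats

def evaluate(feature_dicts, is_boundary):
    """10-fold cross-validation of a topic-boundary classifier."""
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    return cross_val_score(model, feature_dicts, is_boundary, cv=10).mean()
```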

20 X-Dimensional Learning Results
- X-DIM is statistically better than TextTiling, the degenerate algorithms, AND B&L!

Algorithm      (avg) Pk
NONE           0.489703303
ALL            0.518040038
EVEN           0.511727103
TextTiling     0.535304902
B&L            0.5086039
X-DIM          0.430374083

T-test                p-value
X-DIM vs. NONE        0.024460654
X-DIM vs. ALL         0.023027116
X-DIM vs. EVEN        0.022848002
X-DIM vs. TextTiling  0.011833934
X-DIM vs. B&L         0.022591035

21 Statistically Significant Improvement

         TT        B&L       X-DIM
NONE     NON-SIG   NON-SIG   SIG
ALL      NON-SIG   NON-SIG   SIG
EVEN     NON-SIG   NON-SIG   SIG
TT       -         SIG       SIG
B&L      -         -         SIG

22 Future Directions
- Merge cross-dimensional learning (with dialogue act features) with B&L's content-modeling HMM approach
- Explore other work on topic segmentation of dialogue

23 Summary
- Introduction to InfoMagnets
- Applications
- The need for topic segmentation
- Evaluation of other algorithms
- A novel algorithm using X-dimensional learning, with a statistically significant improvement

24 Q/A
Thank you!

