

1 Jahna Otterbacher, Dragomir Radev. Computational Linguistics And Information Retrieval (CLAIR), {jahna, radev}@umich.edu. Modeling Document Dynamics: An Evolutionary Approach

2 What are dynamic texts?
Sets of topically related documents (news stories, Web pages, etc.)
Multiple sources
Written/published at different points in time; may change over time
Challenging features:
– Paraphrases
– Contradictions
– Incorrect/biased information

3 Milan plane crash: April 18, 2002
04/18/02 13:17 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building's 26th floor at 5:50pm (1450 GMT) on Thursday.
04/18/02 13:42 (ABCNews) The plane was destined for Italy's capital Rome, but there were conflicting reports as to whether it had come from Locarno, Switzerland or Sofia, Bulgaria.
04/18/02 13:42 (CNN) The plane, en route from Locarno in Switzerland, to Rome, Italy, smashed into the Pirelli building's 26th floor at 5:50pm (1450 GMT) on Thursday.
04/18/02 13:42 (FoxNews) The plane had taken off from Locarno, Switzerland, and was heading to Milan's Linate airport, De Simone said.

4 Problem for IR systems
User poses a question or query to a system:
– Known facts change at different points in time
– Sources contradict one another
– Many paraphrases: similar, but not necessarily equivalent, information
What is the "correct" information? What should be returned to the user?

5 Current Goals
Propose that dynamic texts "evolve" over time
Chronology recovery task
Approaches:
– Phylogenetics: reconstruct the history of a set of species based on their DNA
– Language modeling: an LM constructed from the first document should fit later documents less well over time

6 Phylogenetic models [Fitch & Margoliash, 67]
Given a set of species and information about their DNA, construct a tree that describes how they are related w.r.t. a common ancestor
The statistically optimal tree minimizes the deviation between the original distances and those represented in the tree
Example distance matrix (dog, wolf, bear):
      D     W     B
  D   0     4.0   5.6
  W   4.0   0     4.4
  B   5.6   4.4   0
[Figure: a candidate tree with branch lengths fitting these distances]
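The Fitch-Margoliash criterion above can be sketched as code: given observed pairwise distances and a candidate tree's branch lengths, score the tree by the deviation between observed and tree-implied distances. The distance values follow my reading of the slide's dog/wolf/bear matrix, and the branch lengths are my own illustrative assumptions (for three taxa a star tree can fit the matrix exactly, so the deviation here is zero).

```python
# Sketch of the Fitch-Margoliash goodness-of-fit criterion for a
# 3-taxon star tree (dog, wolf, bear). Branch lengths b are
# illustrative assumptions, not values taken from the slide.

# Observed pairwise distances (as read from the slide's matrix)
d = {("dog", "wolf"): 4.0, ("dog", "bear"): 5.6, ("wolf", "bear"): 4.4}

# Candidate star tree: one branch from each leaf to a central node.
b = {"dog": 2.6, "wolf": 1.4, "bear": 3.0}

def tree_distance(x, y):
    """Path length between two leaves through the central node."""
    return b[x] + b[y]

def percent_deviation(d, b):
    """Root-mean-square relative error between observed and
    tree-implied distances, as a percentage (0 = perfect fit)."""
    sq = sum(((dij - tree_distance(x, y)) / dij) ** 2
             for (x, y), dij in d.items())
    return 100.0 * (sq / len(d)) ** 0.5
```

A search over candidate topologies would keep the tree minimizing this deviation; with only three taxa the fit is exact, so `percent_deviation(d, b)` is (numerically) zero.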

7 Phylogenetic models (2)
History of chain letters [Bennett et al., 03]
– "Genes" were facts in the letters: names/titles of people, dates, threats to those who don't send the letter on
– The distance metric was the amount of shared information between two chain letters
– Used the Fitch-Margoliash method to construct trees
Result: an almost perfect phylogeny. Letters that were close to one another in the tree shared similar dates, "genes" and even geographical properties.

8 Procedure: Phylogenetics
For each document cluster and representation, generate a phylogenetic tree using Fitch [Felsenstein, 95]
– Representations: full document, extractive summaries
– Generate the Levenshtein distance matrix
– Input the matrix into Fitch to obtain an unrooted tree
"Reroot" the unrooted tree at the first document in the cluster
To obtain the chronological ordering, traverse the rerooted tree, assigning chronological ranks starting with '1' for the root

9 [Figure: unrooted tree over documents S1-S4 plotted against time t, with distances S1 (d=3.5), S2 (d=6.5), S3 (d=0), S4 (d=1) and internal nodes 1 (d=0), 2 (d=8.5)]

10 [Figure: the same tree rerooted at S1 and plotted against time t, with distances S1 (d=0), S2 (d=10), S3 (d=12), S4 (d=13) and internal nodes 1 (d=3.5), 2 (d=12)]
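The reroot-and-rank step can be sketched as a shortest-path pass over the unrooted tree: fix the earliest document as the root, compute each node's distance from it, and rank the documents by increasing distance. The tree below loosely mirrors the four-document figures, but its edge weights are my own illustrative values, not the slide's.

```python
# Sketch of rerooting an unrooted tree at the earliest document and
# assigning chronological ranks by distance from that root.
# Edge weights are illustrative assumptions.
from heapq import heappush, heappop

edges = {  # undirected adjacency: node -> [(neighbor, branch length)]
    "S1": [("n1", 3.5)],
    "S2": [("n2", 6.5)],
    "S3": [("n1", 0.0)],
    "S4": [("n2", 1.0)],
    "n1": [("S1", 3.5), ("S3", 0.0), ("n2", 8.5)],
    "n2": [("S2", 6.5), ("S4", 1.0), ("n1", 8.5)],
}

def chronological_ranks(edges, root):
    """Dijkstra from the chosen root; rank document nodes (those not
    named like internal nodes 'n*') by increasing path distance."""
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heappop(heap)
        if d > dist[u]:
            continue
        for v, w in edges[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heappush(heap, (d + w, v))
    docs = [n for n in dist if not n.startswith("n")]
    docs.sort(key=lambda n: dist[n])
    return {n: r for r, n in enumerate(docs, start=1)}
```

With these weights, `chronological_ranks(edges, "S1")` orders the documents S1, S3, S4, S2, i.e. ranks 1-4 by distance from the root.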

11 Procedure: LM Approach
Inspiration: document ranking for IR
– If a candidate document's LM assigns high probability to the query, the document is relevant [Ponte & Croft, 98]
Create an LM from the earliest document
– Trigram backoff model using the CMU-Cambridge toolkit [Clarkson & Rosenfeld, 97]
Evaluate it on the remaining documents
– Use fit to rank them: OOV rates (increasing over time), trigram hit ratios (decreasing) and unigram hit ratios (increasing)
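The slides build a trigram backoff model with the CMU-Cambridge toolkit; as a minimal stand-in for one of the three fit signals, the sketch below ranks later documents by their out-of-vocabulary (OOV) rate against the earliest document's vocabulary. The toy documents are hypothetical.

```python
# Stand-in for the OOV-rate signal: the fraction of a document's
# tokens unseen in the earliest document's vocabulary. Higher OOV
# rate suggests the document is further in time from the first one.
def oov_rate(doc, vocab):
    """Fraction of whitespace tokens not present in vocab."""
    words = doc.lower().split()
    return sum(w not in vocab for w in words) / len(words)

first = "the plane crashed into the tower in milan"
later = {
    "d2": "the plane crashed into the pirelli tower",
    "d3": "officials said the pilot radioed a gear problem",
}
vocab = set(first.lower().split())

# Rank later documents: lower OOV rate -> assumed closer in time.
ranking = sorted(later, key=lambda d: oov_rate(later[d], vocab))
```

Here "d2" shares almost all its vocabulary with the first report while "d3" introduces mostly new words, so "d2" is ranked earlier; a real trigram model would additionally use n-gram hit ratios rather than vocabulary overlap alone.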

12 Evaluation
Metric: Kendall's rank-order correlation coefficient (Kendall's τ) [Siegel & Castellan, 88]
– -1 ≤ τ ≤ 1
– Expresses the extent to which the chronological rankings assigned by the algorithm agree with the actual rankings
Randomly assigned rankings have, on average, τ = 0.
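The evaluation metric can be computed directly from concordant and discordant pairs, as a small sketch (assuming no tied ranks, which holds for the chronological orderings here):

```python
# Kendall's tau between a predicted chronological ranking and the
# true one: (concordant pairs - discordant pairs) / total pairs.
# Assumes no ties; identical rankings give 1.0, reversed give -1.0.
from itertools import combinations

def kendall_tau(pred, true):
    n = len(pred)
    conc = disc = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant if both rankings order it the same way.
        if (pred[i] - pred[j]) * (true[i] - true[j]) > 0:
            conc += 1
        else:
            disc += 1
    return (conc - disc) / (n * (n - 1) / 2)
```

For example, predicting ranks [1, 2, 4, 3] against truth [1, 2, 3, 4] gets 5 of the 6 pairs right, giving τ = 2/3; a random ranking averages τ = 0, matching the baseline on the slide.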

13 Dataset
36 document sets:
– Manually collected (6)
– NewsInEssence clusters (3)
– TREC Novelty clusters (27) [Soboroff & Harman, 03]
15 training, 6 dev/test, 15 test
Example topics:
  Story                              #Doc.  Time Span  #Sources
  Milan plane crash                  56     1.5 days   5
  N33 Russian submarine Kursk sinks  25     1 month    3
  N48 Human genome decoded           25     2 years    3

14 Training Phase
  Method         Median τ  # Significant (α = 0.10)
  Full document  0.16      8/15
  Summ-1         0.13      6/15
  Summ-5         0.17      6/15
  3-gram hit     0.17      7/15
  1-gram hit     0.21      11/15
  OOV            0.28      13/15

15 Training Phase (2)
Novelty Training Clusters:
  Method  Median τ  # Sig.
  Summ-5  0.05      3/11
  1-gram  0.20      8/11
  OOV     0.19      8/11
Manual Training Clusters:
  Method  Median τ  # Sig.
  Summ-5  0.32      3/3
  1-gram  0.42      3/3
  OOV     0.26      3/3

16 Test Phase (15 clusters)
  Method      Median τ  # Significant
  Summ-5      0.15      5/15
  1-gram hit  0.14      6/15
  OOV         0.22      9/15

17 Manual Clusters
  Story                   OOV   Summ-5
  Gulfair plane crash     0.37  0.39
  Honduras bus hijacking  0.12  0.17
  Columbia shuttle        0.56  0.48
  Milan plane crash       0.26  0.33
  RI nightclub fire       0.58  0.32
  Iraq bombing            0.24  0.17
  Median τ                0.31  0.33
  # Significant           5/6   6/6

18 Conclusions
Over all clusters, the LM approach based on OOV rate had the best performance
LM and phylogenetic models had similar performance on the manual clusters
– These clusters have more salient "evolutionary" properties

19 Future work
Tracking facts in multiple news stories over time
– Produce a timeline of known facts
– Determine whether the facts have settled at each point in time
Example (origin of the Milan crash flight):
  Time            Reported fact                            Source                             Settled?
  04/18/02 13:17  Locarno, Switzerland                     Journalist Desidera Cavvina (CNN)  No
  04/18/02 13:42  Locarno, Switzerland or Sofia, Bulgaria  ABC                                No

