Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign.


1 Information Retrieval with Time Series Query Hyun Duk Kim (now at Twitter), Danila Nikitin (now at Google), ChengXiang Zhai University of Illinois at Urbana-Champaign Malu Castellanos, Meichun Hsu HP Laboratories

2 IR for stock market analysis? [Figure: Dow Jones Industrial Average over time. Source: Yahoo Finance] What might have caused the stock market crash? Any clues in the companion news stream? Sept 11 Attack! What documents to read to analyze such a “causal” topic?

3 Analysis of Presidential Prediction Markets [Figure: candidate price over time] What might have caused the sudden drop of price for this candidate? What “mattered” in this election? Any clues in the companion news stream? Tax cut? What documents to read to analyze such a “causal” topic?

4 Analysis of Product Sales [Figure: product sales over time] What might have caused the decrease of sales? Any clues in the companion product reviews? Safety concerns? What reviews to read to analyze such a “causal” topic?

5 Finding documents about “trendy” topics Draw a “time series query”: Find documents about a topic emerging this summer, which has attracted much attention this Oct [Figure: hand-drawn query curve over time] Which documents cover such a “trendy” topic?

6 Information Retrieval with Time Series Query Instead of a keyword query, use a time series as the query → Retrieve documents that contain topics correlated with the query time series Input: –Time series data with time stamps –Text stream: a collection of time-stamped documents covering the same time period Output: –Ranked list of documents

7 Ideal Results of Information Retrieval with Time Series Query
[Figure: query time series over 2000 to 2001, shown alongside the news stream]
RANK  DATE        EXCERPT
1     9/29/2000   Expect earning will be far below
2     12/8/2000   $4 billion cash in company
3     10/19/2000  Disappointing earning report
4     4/19/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple's new retail store
…     …           …

8 IR w/ TS - Method Overview [Diagram: Input 1 is the text stream of input documents (Sep 2001, Oct 2001, …), from which a vocabulary and word frequency curves W1, W2, W3, W4, … are built; Input 2 is the non-text time series; words are ranked by correlation with it; the output is a list of ranked documents]

9 IR w/ TS - Method Overview [Same diagram as slide 8: text stream → vocabulary and word frequency curves W1, W2, W3, W4, … → rank by correlation with the non-text time series → ranked documents] Two questions: 1. How to measure correlation between a word and the time series 2. How to aggregate word correlations to rank documents
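The "word frequency curves" step in the diagram could be sketched roughly as follows. This is only an illustration: the function name `word_frequency_curves`, the whitespace tokenizer, and the use of raw per-day counts are assumptions, since the slides do not say how the curves are tokenized or normalized.

```python
from collections import Counter, defaultdict


def word_frequency_curves(docs, dates):
    """Build one frequency curve per word (hypothetical helper).

    docs  : list of (date, text) pairs from the text stream
    dates : ordered list of time stamps shared with the query time series
    Returns {word: [count at dates[0], count at dates[1], ...]}.
    """
    position = {d: i for i, d in enumerate(dates)}
    curves = defaultdict(lambda: [0] * len(dates))
    for date, text in docs:
        if date not in position:
            continue  # document falls outside the query period
        # Assumption: lowercase + whitespace split stands in for real tokenization.
        for word, count in Counter(text.lower().split()).items():
            curves[word][position[date]] += count
    return dict(curves)


docs = [("2001-09-10", "Market calm"), ("2001-09-11", "attack attack market")]
dates = ["2001-09-10", "2001-09-11"]
curves = word_frequency_curves(docs, dates)
print(curves["attack"])
```

Each resulting curve has the same length as the query time series, so the two can be compared point by point.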

10 Correlation Function Measure correlation between each word frequency curve and the input time series 1. Pearson Correlation –Basic correlation 2. Dynamic Time Warping [Senin`08] –Captures alignment of shifted or stretched time series [Figure: two time series before alignment and after alignment; axes: values vs. time]
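A minimal sketch of the two measures named on this slide, in plain Python. The absolute-difference local cost and the unconstrained warping window in `dtw_distance` are assumptions, as the slides give no parameters. Note that textbook DTW returns a distance (lower means more similar), and how the authors convert it into a correlation-like score is not stated here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if std_x == 0 or std_y == 0:
        return 0.0  # a constant series has no defined correlation
    return cov / (std_x * std_y)


def dtw_distance(x, y):
    """Classic dynamic time warping distance via dynamic programming."""
    n, m = len(x), len(y)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(x[i - 1] - y[j - 1])  # assumed local cost
            cost[i][j] = step + min(cost[i - 1][j],      # stretch y
                                    cost[i][j - 1],      # stretch x
                                    cost[i - 1][j - 1])  # advance both
    return cost[n][m]


word_curve = [0, 1, 2, 1, 0, 0]
query = [0, 0, 1, 2, 1, 0]  # same shape, shifted by one step
print(pearson(word_curve, query), dtw_distance(word_curve, query))
```

In the usage example the two series have the same shape but are offset by one time step: the Pearson value drops well below 1, while the DTW distance is 0 because the warping path absorbs the shift.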

11 Aggregation Function Score document correlation by aggregating word correlations 1. Weighted TF-IDF (BM25) –Use top K correlated words as a text query → Use an IR formula such as BM25 –Use correlation coefficient as a weight
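One way to read the weighted BM25 idea is sketched below: take the K most correlated words as the query and multiply each term's BM25 contribution by its absolute correlation. The function name `weighted_bm25`, the defaults `k1=1.2` and `b=0.75`, the non-negative IDF variant, and the use of |ρ| as the weight are all assumptions; the slides do not give the exact formula.

```python
import math
from collections import Counter


def weighted_bm25(docs, word_corr, k=3, k1=1.2, b=0.75):
    """Score each document with correlation-weighted BM25 (hypothetical helper).

    docs      : list of token lists
    word_corr : {word: correlation with the query time series}
    Returns one score per document, in input order.
    """
    # The top-k most strongly correlated words form the text query.
    query = sorted(word_corr, key=lambda w: abs(word_corr[w]), reverse=True)[:k]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(w for d in docs for w in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for w in query:
            if tf[w] == 0:
                continue
            idf = math.log((n_docs - doc_freq[w] + 0.5) / (doc_freq[w] + 0.5) + 1.0)
            tf_part = tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(d) / avg_len))
            score += abs(word_corr[w]) * idf * tf_part  # correlation as term weight
        scores.append(score)
    return scores


docs = [["taliban", "attack"], ["apple", "store"], ["attack", "market"]]
word_corr = {"taliban": 0.84, "attack": 0.80, "apple": 0.10, "store": 0.05}
scores = weighted_bm25(docs, word_corr, k=2)
print(scores)
```

With `k=2` the query is {taliban, attack}, so the first document matches both query words, the third matches one, and the second matches none.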

12 Aggregation Function 2. Average Correlation a) Average over all terms: → Not all the words are correlated? b) Average over top-k terms: → May be dominated by multiple occurrences of the same term c) Average over top-k unique terms:
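The three averaging variants could be sketched as one function with two switches. The name `average_correlation`, the use of absolute correlation, and treating unseen words as 0 are assumptions made for illustration; the formulas on the original slide were not captured in this transcript.

```python
def average_correlation(doc_tokens, word_corr, k=None, unique=False):
    """Average the word correlations of one document (hypothetical helper).

    k=None               -> (a) average over all terms
    k=K                  -> (b) average over the K most correlated term occurrences
    k=K and unique=True  -> (c) average over the K most correlated distinct terms
    """
    tokens = set(doc_tokens) if unique else doc_tokens
    # Assumption: words with no known correlation count as 0.
    corrs = sorted((abs(word_corr.get(w, 0.0)) for w in tokens), reverse=True)
    if k is not None:
        corrs = corrs[:k]
    return sum(corrs) / len(corrs) if corrs else 0.0


doc = ["taliban", "taliban", "taliban", "the"]
word_corr = {"taliban": 0.8, "the": 0.0}
print(average_correlation(doc, word_corr))                    # variant (a)
print(average_correlation(doc, word_corr, k=2))               # variant (b)
print(average_correlation(doc, word_corr, k=2, unique=True))  # variant (c)
```

The example document repeats one highly correlated word, which is the situation the slide warns about: variant (b) is dominated by the repeated term, while variant (c) counts it once.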

13 Evaluation Data Set –New York Times corpus (Jul 2000~Dec 2001) Entity annotated –Daily Stock prices of 24 companies Measure –Mean average precision (MAP) –Normalized discounted cumulative gain (NDCG) Research questions 1.Can our method retrieve meaningful documents? 2.Does DTW outperform Pearson Correlation? 3.Which aggregation function works the best?
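For reference, the two measures can be computed as in the sketch below. It uses the common textbook definitions (average precision normalized by the number of relevant documents found in the list unless a total is supplied, and NDCG with a log2 rank discount over the full ranking); whether the authors used exactly these variants or a rank cutoff is not stated in the slides.

```python
import math


def average_precision(ranked_rel, n_relevant=None):
    """ranked_rel: 0/1 relevance flags in ranked order."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            total += hits / rank  # precision at this relevant document
    denom = n_relevant if n_relevant is not None else hits
    return total / denom if denom else 0.0


def mean_average_precision(runs):
    """runs: one list of relevance flags per query (e.g. per company)."""
    return sum(average_precision(r) for r in runs) / len(runs)


def ndcg(ranked_gains):
    """Normalized DCG over the whole ranked list."""
    def dcg(gains):
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    ideal = dcg(sorted(ranked_gains, reverse=True))
    return dcg(ranked_gains) / ideal if ideal else 0.0


print(average_precision([1, 0, 1, 0]), ndcg([0, 1]))
```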

14 Top ranked documents by American Airlines stock price
Rank  Date        Excerpt
1     10/22/2001  Fleeing the war
2     12/11/2001  US and anti-Taliban forces in Afghanistan
3     11/18/2001  Fate of Taliban Soldiers Under Discussion
4     11/12/2001  Tally of dead and missing in Sep 11 terrorist attacks
5     9/25/2001   Soldiers in Afghanistan …
6     11/19/2001  Recovery operation at World Trade Center
7     11/3/2001   4343 died or missing as a result of the attacks on Sep 11
8     11/17/2001  Dead and missing report of Sep 11 attack
…     …           …
All top-ranked documents are related to the September 11 terrorist attack

15 Top Correlated Words to American Airlines stock price All top correlated terms to the input time series are related to the terrorist attack → Highly correlated terms contributed to retrieval of documents about this topic
Word         |ρ|
challenged   0.887031
afghanistan  0.861351
security     0.858745
sept         0.858309
terrorism    0.854865
pakistan     0.848829
aghans       0.844596
afghan       0.843481
islamic      0.842499
taliban      0.841455

16 Top ranked ‘relevant’ documents for Apple stock price
Rank  Date        Excerpt
1     9/29/2000   Fourth-quarter earning far below estimates
2     12/8/2000   $4 billion reserve, not $11 billion
3     10/19/2000  Announced earnings report
4     4/29/2001   Dow and Nasdaq soar after rate cut by Federal Reserve
5     7/20/2001   Apple’s new retail stores
6     12/6/2000   Apple warns it will record quarterly loss
7     3/24/2001   Stocks perk up, with Nasdaq posting gain
8     8/10/2000   Mixing Mac and Windows
…     …           …
Retrieved relevant events: disappointing earnings report, store opening, etc. Useful as a new feature for re-ranking search results?

17 Quantitative Evaluation All our methods > Random precision (0.0013) Dynamic time warping >> Pearson correlation
Average performance (average correlation as aggregation method):
         MAP     NDCG
Pearson  0.0019  0.3515
DTW      0.0022  0.3609

18 Comparison of Aggregation Methods AC << TopK, BM25 Top5-AC << Top20-AC, but not more than K=20 BM25 is sensitive to parameter setting –Scores of AC methods are more meaningful Incomplete judgments → Possibly much better performance in reality
Average performance (w/ Pearson correlation):
Method         MAP     NDCG
AC             0.0019  0.3515
Top5-AC        0.0021  0.361
Top10-AC       0.0023  0.3618
Top20-AC       0.0024  0.3629
Top5-AC-Uniq   0.0022  0.3613
Top10-AC-Uniq  0.0022  0.3616
Top20-AC-Uniq  0.0022  0.3619
Top5-BM25      0.0019  0.3584
Top10-BM25     0.0023  0.361
Top20-BM25     0.0019  0.3582

19 “Higher” NDCG vs. Low MAP

20 Summary Introduced a novel retrieval problem –time series as query Studied basic solutions: Time series representation of terms –Term retrieval: correlation(query, term) –Document retrieval: aggregation of term retrieval results Dynamic time warping + top-K average correlation seems to work well

21 Limitations & Future Work Evaluation is based on simulation –Highly incomplete judgments! –What’s a good way to evaluate such a new retrieval task? Current solutions are heuristic –How can we develop a more principled model? Different notions of relevance –“Local” relevance vs. global relevance? All other issues relevant to a standard retrieval problem are worth exploring (e.g., feedback?)

22 Thank You! Comments/Questions?

