Slide 1: Fast Mining of Interesting Phrases from Subsets of Text Corpora
Deepak P, Atreyee Dey, Debapriyo Majumdar* (IBM Research - India, Bengaluru, India)
EDBT 2014 Conference, Athens, Greece
*Presently with Indian Statistical Institute, Kolkata, India

Slide 2: Problem Description
Given a text corpus D and a subset D' of it, specified by a keyword query, find the top-k interesting phrases for D' with respect to D.
(Slide figure: the keyword query "ukraine, crimea" selects the chosen subset D' of the corpus D; example output phrases with scores: Crimea independence 0.90, USA Russia Relations 0.85, G8 Membership 0.81, …)
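
A minimal sketch of the task as stated, assuming the common confidence-style interestingness measure (frequency of the phrase in D' divided by its frequency in D); the slides do not spell the measure out here, so treat the scoring function, the function name and the inputs as hypothetical.

```python
from collections import Counter
import heapq

def top_k_interesting_phrases(corpus_phrases, subset_ids, k=10):
    """Brute-force baseline for the problem statement on this slide.

    corpus_phrases: doc_id -> list of phrases occurring in that document (corpus D)
    subset_ids:     set of doc_ids forming the chosen subset D'
    Interestingness of a phrase p is taken here as freq(p, D') / freq(p, D),
    i.e. how concentrated the phrase is in the chosen subset.
    """
    freq_D = Counter()        # phrase frequency over the whole corpus D
    freq_Dprime = Counter()   # phrase frequency over the subset D'
    for doc_id, phrases in corpus_phrases.items():
        freq_D.update(phrases)
        if doc_id in subset_ids:
            freq_Dprime.update(phrases)

    scores = {p: freq_Dprime[p] / freq_D[p] for p in freq_Dprime}
    return heapq.nlargest(k, scores.items(), key=lambda kv: kv[1])

# Example: D' is the set of documents matching the query "ukraine, crimea"
# top_k_interesting_phrases(corpus_phrases, matching_doc_ids, k=3)
```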

Slide 3: Earlier Approaches
Phrase indexing (Simitsis et al., VLDB 2008), query cost O(|P|):
p1:    d12, d13, d30, d9901
p9876: d1, d11, d305, d8100
Document indexing (Bedathur et al., VLDB 2010; Gao & Michel, EDBT 2012), query cost O(|D'|):
d1:    p5, p43, p167, p8970
d9998: p23, p49, p305, p9987
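
For illustration, the two earlier index layouts shown on this slide could be represented as follows; this is a paraphrase of the slide's tables, not the actual structures of the cited systems.

```python
# Phrase index (Simitsis et al., VLDB 2008): one posting list per phrase.
# Answering a query means visiting every phrase's list, hence roughly O(|P|) work.
phrase_index = {
    "p1":    ["d12", "d13", "d30", "d9901"],
    "p9876": ["d1", "d11", "d305", "d8100"],
}

# Document index (Bedathur et al., VLDB 2010; Gao & Michel, EDBT 2012):
# one phrase list per document. Only documents in D' are touched, hence O(|D'|).
document_index = {
    "d1":    ["p5", "p43", "p167", "p8970"],
    "d9998": ["p23", "p49", "p305", "p9987"],
}
```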

Slide 4: Estimating Interestingness: AND Query
Consider an AND query composed of k keywords, Q = {Q1, Q2, …, Qk}.

Slide 5: Query Word Independence Assumption
Consider an AND query of two words, Q1 and Q2.
We would like to use p(P1 | Q1, Q2) as an estimate of the interestingness of phrase P1.
Instead, we estimate p(Q1, Q2 | P1) (as shown on the previous slide).
For OR query handling details, refer to the paper.
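
The estimation step referred to as "shown on the previous slide" is not captured in this transcript; a plausible reconstruction, using Bayes' rule together with the query word independence assumption, is sketched below. The prior term and the exact form are assumptions, not quoted from the paper; the per-word factors p(Qi | P1) appear to correspond to the p(w|p) scores stored in the indexes on the next slide.

```latex
% Reconstruction (assumed, not verbatim from the paper):
P(P_1 \mid Q_1, Q_2) \;\propto\; P(Q_1, Q_2 \mid P_1)\, P(P_1)
                     \;\approx\; P(P_1)\, P(Q_1 \mid P_1)\, P(Q_2 \mid P_1),
\qquad
P(Q_1, \dots, Q_k \mid P) \;\approx\; \prod_{i=1}^{k} P(Q_i \mid P).
```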

Slide 6: Our Disk-Resident Indexes
w1:    (p30, 0.23)  (p12, 0.21)  (p990, 0.18)  (p13, 0.002)
w9876: (p810, 0.1)  (p11, 0.08)  (p305, 0.007) (p8, 0.0001)
The score stored along with each phrase is p(w|p).
All values are stored in sorted order (descending by score).
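
A sketch of building such per-word, score-sorted lists. The estimator used for p(w|p), namely the fraction of documents containing phrase p that also contain word w, is an assumption; the paper's exact estimator may differ.

```python
from collections import defaultdict

def build_word_indexes(doc_words, doc_phrases):
    """Build word -> [(phrase, p(w|p)), ...] lists, sorted by descending score.

    doc_words:   doc_id -> words in that document
    doc_phrases: doc_id -> phrases in that document
    p(w|p) is estimated here as |docs containing both w and p| / |docs containing p|
    (an assumed estimator, for illustration only).
    """
    phrase_docs = defaultdict(set)   # phrase -> docs containing it
    word_docs = defaultdict(set)     # word   -> docs containing it
    for d in doc_words:
        for w in doc_words[d]:
            word_docs[w].add(d)
        for p in doc_phrases[d]:
            phrase_docs[p].add(d)

    index = defaultdict(list)        # word -> score-sorted list of (phrase, p(w|p))
    for w, dw in word_docs.items():
        for p, dp in phrase_docs.items():
            score = len(dw & dp) / len(dp)
            if score > 0:
                index[w].append((p, score))
        index[w].sort(key=lambda ps: ps[1], reverse=True)
    return index
```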

Slide 7: Aggregation Approach: NRA
We use the well-known NRA algorithm to aggregate the lists corresponding to the query words and arrive at the top phrases.
At any point, we maintain upper and lower bounds for each candidate phrase. An example sum-aggregation:
w1: (P1, 0.04167)  (P5, 0.0333)
w2: (P103, 0.26)   (P1, 0.113)
Bounds: P1 [0.1547, 0.1547], P5 [0.0333, 0.1433], P103 [0.26, 0.2933], …
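
A compact sketch of NRA-style sum aggregation over the score-sorted per-word lists, maintaining the lower/upper bounds illustrated above. The stopping rule and the handling of exhausted lists are simplified relative to the full algorithm, and the function name and inputs are hypothetical.

```python
import heapq

def nra_top_k(lists, k):
    """NRA sketch: lists is one (phrase, score) list per query word, each sorted
    by descending score; scores of a phrase are summed across lists."""
    seen = {}     # phrase -> {list_index: score seen so far}
    depth = 0
    while True:
        progressed = False
        for li, lst in enumerate(lists):          # one sorted access per list per round
            if depth < len(lst):
                phrase, score = lst[depth]
                seen.setdefault(phrase, {})[li] = score
                progressed = True
        depth += 1

        # The last score read from each list bounds anything not yet seen there.
        # (Simplification: an exhausted list should contribute 0, not its last score.)
        frontier = [lst[min(depth, len(lst)) - 1][1] if lst else 0.0 for lst in lists]

        bounds = {}
        for phrase, parts in seen.items():
            lower = sum(parts.values())
            upper = lower + sum(f for li, f in enumerate(frontier) if li not in parts)
            bounds[phrase] = (lower, upper)

        top = heapq.nlargest(k, bounds.items(), key=lambda kv: kv[1][0])
        if top and len(top) == k:
            top_ids = {ph for ph, _ in top}
            kth_lower = top[-1][1][0]
            others_done = all(ub <= kth_lower for ph, (lb, ub) in bounds.items()
                              if ph not in top_ids)
            if others_done and kth_lower >= sum(frontier):   # beats any unseen phrase too
                return [(ph, lb) for ph, (lb, ub) in top]
        if not progressed:                                   # all lists exhausted
            return [(ph, lb) for ph, (lb, ub) in
                    heapq.nlargest(k, bounds.items(), key=lambda kv: kv[1][0])]
```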

Slide 8: Our In-Memory Indexes
w1:    (p12, 0.21)   (p13, 0.002) (p30, 0.23)   (p990, 0.18)
w9876: (p8, 0.0001)  (p11, 0.08)  (p305, 0.007) (p810, 0.1)
The score stored along with each phrase is p(w|p).
All values are stored in PhraseID-sorted order.
Indexes may be created by preserving just the top 10% of values in each list.
We use a simple sort-merge join on these lists for in-memory operation.
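
A minimal sketch of the sort-merge join over the PhraseID-sorted (and optionally top-10%-truncated) lists. Summing the per-word scores mirrors the sum-aggregation example on slide 7; the require_all flag for AND semantics is an assumption, not something the slides specify.

```python
import heapq
from itertools import groupby

def sort_merge_join_top_k(lists, k, require_all=False):
    """lists: one list per query word, each of (phrase_id, score) sorted by phrase_id.
    Merges all lists in a single pass (no hashing), sums scores per phrase, and
    returns the k highest-scoring phrases."""
    merged = heapq.merge(*lists, key=lambda ps: ps[0])
    scores = []
    for phrase_id, group in groupby(merged, key=lambda ps: ps[0]):
        entries = list(group)
        if require_all and len(entries) < len(lists):
            continue      # AND semantics: drop phrases missing from some list (assumed)
        scores.append((phrase_id, sum(s for _, s in entries)))
    return heapq.nlargest(k, scores, key=lambda ps: ps[1])
```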

Slide 9: Example Results
Query: trade reserves (Reuters dataset)
– economic minister
– reserves
– taiwan's foreign exchange reserves
– economic planning
– economic planning and development

Slide 10: Result Quality Evaluation
(Chart: Prec, MRR, NDCG and MAP for the 20-AND and 50-AND query workloads on the PubMed dataset; y-axis ranges from 0.9 to 1.0.)

Slide 11: Running Times: Disk-based Operation (NRA)
(Chart: running time on a log-scale y-axis (100 to 10,000,000) vs. percentage of NRA lists traversed (0 to 100) on the x-axis; series AND-NRA, OR-NRA, AND-GM, OR-GM; PubMed dataset.)

Slide 12: Percentages of Lists Traversed (NRA)
(Chart: percentage of lists traversed, roughly in the 27 to 34 range, for Reuters-AND, Reuters-OR, Pubmed-AND and Pubmed-OR.)

Slide 13: Running Times: Mem-based Operation (SMJ)
(Chart: running time on a log-scale y-axis (1 to 10,000,000) vs. percentage of entries stored (0 to 100) on the x-axis; series AND-SMJ, OR-SMJ, AND-GM, OR-GM; PubMed dataset.)

Slide 14: Shortcomings
Index Sizes
– Earlier approaches index only phrases and documents
– Our method has word-specific indexes, with each word having a list in the index
– The number of words across documents could be much larger than the number of phrases
– If we would like to support querying over all possible words, index sizes could get large
Queries on Metadata Facets
– Instead of using keyword queries, document subsets could also be chosen using metadata facets
– E.g., venue:sigmod AND year:2007, on a set of scholarly publications
– Our independence assumption has not yet been tested on metadata facets

Slide 15: Summary
Proposed an approach for mining interesting phrases from subsets of text corpora
Outlined the query word independence assumption, which proves empirically useful for accurately identifying interesting phrases
Our approach is up to 90% accurate while achieving turnaround times orders of magnitude better than those of current techniques
Future Work
– Other potential avenues for leveraging the independence assumption for phrase analytics
– Methods to speed up interesting phrase mining over metadata facets

Slide 16: Thank You
Questions, Comments, Suggestions?

