Sumblr: Continuous Summarization of Evolving Tweet Streams Date : 2014/08/11 Author : Lidan Shou, Zhenhua Wang, Ke Chen, Gang Chen Source : SIGIR’13 Advisor:


1 Sumblr: Continuous Summarization of Evolving Tweet Streams Date : 2014/08/11 Author : Lidan Shou, Zhenhua Wang, Ke Chen, Gang Chen Source : SIGIR’13 Advisor: Jia-ling Koh Speaker : Sz-Han,Wang

2 Outline Introduction Method – Tweet Stream Clustering – High-level Summarization Experiment Conclusion

3 Introduction With the explosive growth of microblogging services, short text messages (also known as tweets) are being created and shared at an unprecedented rate. Tweets in their raw form can be incredibly informative, but also overwhelming. Plowing through so many tweets for interesting content would be a nightmare, not to mention the enormous amount of noise and redundancy one would encounter.

4 Introduction In this paper, we study continuous tweet summarization as a solution. Traditional document summarization methods focus on static and small-scale data. We propose a novel prototype called Sumblr (SUMmarization By stream cLusteRing) for tweet streams. [Figure: a timeline example for the topic “Apple”]

5 Framework

6 Outline Introduction Method – Tweet Stream Clustering – High-level Summarization Experiment Conclusion

7 Tweet Cluster Vector [Figure: a tweet vector over the terms a, b, c, e, with a TF-IDF score per term]

8 Tweet Cluster Vector
t1-Alice: a b c b e a e b.
t2-Tim: a c c d d b e.
t3-Judy: b c d e a a a.
t4-Tina: b b d e e b b.
t5-Sam: c c c b b b.
[Table: term vectors over a, b, c, d, e and |tv_i| for t1 to t5; the cluster's sum_v, wsum_v and cv over the same terms; sim(cv, t_i) for t1 to t5. Values lost in extraction.]
Suppose m=3: ft_set = {t2, t1, t3}
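The quantities named in the table on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes raw term counts instead of TF-IDF scores, a weight of 1 for every tweet, sum_v as the sum of length-normalized tweet vectors, wsum_v as the weighted sum of tweet vectors, and cv = wsum_v / n. None of these definitions is stated on the slide, and build_tcv is a hypothetical name. Because of these simplifications the resulting ft_set need not match the one shown on the slide.

```python
import math
from collections import Counter

TERMS = "abcde"
TWEETS = {
    "t1": "a b c b e a e b",
    "t2": "a c c d d b e",
    "t3": "b c d e a a a",
    "t4": "b b d e e b b",
    "t5": "c c c b b b",
}

def vectorize(text):
    # Raw term counts stand in for TF-IDF scores (assumption).
    counts = Counter(text.split())
    return [float(counts[t]) for t in TERMS]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def cosine(u, v):
    denom = norm(u) * norm(v)
    return sum(a * b for a, b in zip(u, v)) / denom if denom else 0.0

def build_tcv(tweets, weights=None, m=3):
    """Hypothetical helper: summarize a cluster of tweets into a TCV-like dict."""
    vectors = {tid: vectorize(text) for tid, text in tweets.items()}
    weights = weights or {tid: 1.0 for tid in vectors}  # assumed uniform weights
    dim = len(TERMS)
    sum_v, wsum_v = [0.0] * dim, [0.0] * dim
    for tid, tv in vectors.items():
        length = norm(tv) or 1.0
        for k in range(dim):
            sum_v[k] += tv[k] / length         # sum of normalized vectors
            wsum_v[k] += weights[tid] * tv[k]  # weighted sum of vectors
    n = len(vectors)
    cv = [x / n for x in wsum_v]               # assumed centroid definition
    sims = {tid: cosine(cv, tv) for tid, tv in vectors.items()}
    # Focus set: the m tweets closest to the centroid.
    ft_set = sorted(sims, key=sims.get, reverse=True)[:m]
    return {"sum_v": sum_v, "wsum_v": wsum_v, "n": n, "cv": cv, "ft_set": ft_set}

tcv = build_tcv(TWEETS, m=3)
print(tcv["ft_set"])
```

Keeping only these aggregates and the m focus tweets means the cluster can be updated without storing every tweet.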

9 Pyramidal Time Frame
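The slide gives only the title, so the following is a sketch of the pyramidal time frame as it is commonly described for stream clustering, not something taken from the slide. It assumes a snapshot taken at timestamp t belongs to the highest order i such that alpha^i divides t, and that each order keeps only its most recent alpha^l + 1 snapshots. The names snapshot_order and PyramidalStore and the parameters alpha and l are illustrative.

```python
from collections import defaultdict, deque

def snapshot_order(timestamp, alpha=2):
    """Largest i such that alpha**i divides timestamp (assumed PTF rule)."""
    if timestamp <= 0:
        raise ValueError("timestamp must be a positive integer")
    order = 0
    while timestamp % (alpha ** (order + 1)) == 0:
        order += 1
    return order

class PyramidalStore:
    """Hypothetical store that keeps recent snapshots densely and old ones sparsely."""

    def __init__(self, alpha=2, l=1):
        self.alpha = alpha
        capacity = alpha ** l + 1  # snapshots retained per order (assumption)
        self.levels = defaultdict(lambda: deque(maxlen=capacity))

    def add(self, timestamp, snapshot):
        order = snapshot_order(timestamp, self.alpha)
        # deque(maxlen=...) silently drops the oldest snapshot of this order.
        self.levels[order].append((timestamp, snapshot))

store = PyramidalStore(alpha=2, l=1)
for ts in range(1, 17):
    store.add(ts, f"snapshot@{ts}")
for order in sorted(store.levels):
    print(order, [ts for ts, _ in store.levels[order]])
```

Under these assumptions the number of stored snapshots grows logarithmically with elapsed time, while recent periods stay finely resolved.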

10 Tweet Stream Clustering
1. Initialization: use a k-means clustering algorithm to create the initial clusters.
2. Incremental Clustering: for an incoming tweet t, compute Sim(c1,t), Sim(c2,t) and Sim(c3,t) against the existing clusters (c1: t1, t2, t3, t4, t5 with TCV(1); c2: t6, t7, t8 with TCV(2); c3: t9, t10 with TCV(3)) and take the maximum, here MaxSim(c1, t).
MaxSim(c1, t) < MBS → t is upgraded to a new cluster
MaxSim(c1, t) ≥ MBS → t is added to its closest cluster
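The incremental step can be sketched as below. This is a simplified illustration under stated assumptions: MBS is passed in as a fixed threshold (the slide does not say how it is computed), clusters are represented only by a vector sum and a count, and similarity is cosine similarity to the centroid. The function name assign and the dict layout are hypothetical.

```python
import math

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv) if nu and nv else 0.0

def centroid(cluster):
    return [x / cluster["n"] for x in cluster["sum"]]

def assign(tweet_vec, clusters, mbs):
    """Add tweet_vec to its closest cluster, or open a new one if MaxSim < MBS.

    Returns the index of the cluster that received the tweet.
    """
    best_idx, best_sim = None, -1.0
    for idx, cluster in enumerate(clusters):
        sim = cosine(centroid(cluster), tweet_vec)
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    if best_idx is None or best_sim < mbs:
        # MaxSim < MBS: the tweet is upgraded to a new cluster.
        clusters.append({"sum": list(tweet_vec), "n": 1})
        return len(clusters) - 1
    # MaxSim >= MBS: the tweet is absorbed by its closest cluster.
    target = clusters[best_idx]
    target["sum"] = [a + b for a, b in zip(target["sum"], tweet_vec)]
    target["n"] += 1
    return best_idx

clusters = [{"sum": [2.0, 0.0], "n": 2}, {"sum": [0.0, 3.0], "n": 3}]
print(assign([1.0, 0.1], clusters, mbs=0.6))  # close to the first cluster
print(assign([1.0, 1.0], clusters, mbs=0.9))  # not close enough to any cluster
```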

11 Tweet Stream Clustering
3. Restrict the number of active clusters
1) Deleting Outdated Clusters - periodical examination: Avg p > threshold → remove the cluster (example: threshold=3 days, p=10)
2) Merging Clusters - when the memory limit is reached: the merging process continues until only mc percentage of the original clusters are left
[Table: cluster pairs and their distance; distance values lost in extraction] (c1,c2) (c2,c4) (c1,c4) (c5,c7) (c4,c5) ……
Suppose mc=0.7. Remove: 10*(1-0.7)=3 clusters. Merged clusters: {c1,c2} {c1,c2,c4} {c5,c7}
Before Merging: c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
After Merging: {c1,c2,c4},c3,{c5,c7},c6,c8,c9,c10
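The merging example on this slide can be reproduced with a small union-find sketch. It covers only the merging step, not the deletion of outdated clusters. It assumes the cluster pairs are processed in the order listed on the slide and that a pair already in the same composite cluster is skipped. The function name merge_clusters is hypothetical, and only the five pairs visible on the slide are used.

```python
def merge_clusters(cluster_ids, ranked_pairs, mc):
    """Merge clusters pair by pair until only a fraction mc of them remains."""
    parent = {c: c for c in cluster_ids}

    def find(c):
        # Follow parent links to the representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    target = max(1, round(mc * len(cluster_ids)))
    remaining = len(cluster_ids)
    for a, b in ranked_pairs:
        if remaining <= target:
            break
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # skip pairs already merged together
            parent[root_b] = root_a
            remaining -= 1

    groups = {}
    for c in cluster_ids:
        groups.setdefault(find(c), []).append(c)
    return list(groups.values())

ids = [f"c{i}" for i in range(1, 11)]
pairs = [("c1", "c2"), ("c2", "c4"), ("c1", "c4"), ("c5", "c7"), ("c4", "c5")]
print(merge_clusters(ids, pairs, mc=0.7))
```

With mc=0.7 and ten clusters, three merges are needed. The pair (c1,c4) causes no merge because both clusters are already in {c1,c2,c4}, so (c5,c7) supplies the third.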

12 High-level Summarization Online summaries – Retrieved directly from the current clusters maintained in memory Historical summaries – Retrieved from two snapshots in the PTF – TCV-Rank Summarization

13 TCV-Rank Summarization
1. Generate input cluster
2. Gather tweets from the ft_sets in D(c) as a set T
[Figure: two snapshots, S(ts1) and S(ts2), marking the beginning timestamp and the ending timestamp of the duration]
TCVs shown: TCV(C1) ft_set:{t1,t2,t3}; TCV(C2) ft_set:{t4,t5}; TCV(C3) ft_set:{t6,t7}; TCV(C4) ft_set:{t1,t2,t8}; TCV(C5) ft_set:{t9,t10}; TCV(C6) ft_set:{t11}
Input cluster D(c): TCV(C1-C4) ft_set:{t3}; TCV(C2) ft_set:{t4,t5}; TCV(C3) ft_set:{t6,t7}; TCV(C4) ft_set:{t1,t2,t8}; TCV(C5) ft_set:{t9,t10}; TCV(C6) ft_set:{t11}
T={t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11}
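Step 2 above, gathering the focus tweets of the input cluster D(c) into the set T, can be sketched as a plain set union. The dict layout and the name gather_tweets are assumptions for illustration. The ft_sets are copied from the slide, and how D(c) itself is derived from the two snapshots is not modeled here.

```python
def gather_tweets(input_cluster):
    """Union of all ft_sets in D(c), ordered by tweet number for readability."""
    gathered = set()
    for ft_set in input_cluster.values():
        gathered.update(ft_set)
    # Tweet ids look like "t7"; sort on the numeric part.
    return sorted(gathered, key=lambda tid: int(tid[1:]))

# Input cluster D(c) as listed on the slide.
D_c = {
    "C1-C4": ["t3"],
    "C2": ["t4", "t5"],
    "C3": ["t6", "t7"],
    "C4": ["t1", "t2", "t8"],
    "C5": ["t9", "t10"],
    "C6": ["t11"],
}
T = gather_tweets(D_c)
print(T)
```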

14 TCV-Rank Summarization [Table: tweet vector tv_i and LexRank score LR for each of t1 to t11; values lost in extraction] T={t1,t2,t3,t4,t5,t6,t7,t8,t9,t10,t11}

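15 LexRank
[Table: pairwise similarity Sim[i][j] for t1 to t4; values lost in extraction]
Two tweets are connected when Sim[i][j] > t (t=0.5).
Degree: t1 = 3, t2 = 3, t3 = 4, t4 = 2
[Table: matrix M over t1 to t4; values lost in extraction]
Every entry of the initial p_t is 0.25, and the scores are updated with p_{t+1} = M^T p_t.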
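The iteration p_{t+1} = M^T p_t can be sketched with a short power-iteration loop. The similarity values on the slide were lost, so the adjacency matrix below is an assumption: it is the thresholded graph consistent with the listed degrees (3, 3, 4, 2), counting each tweet's similarity to itself. M is assumed to be the adjacency matrix with each row divided by its degree, and no damping factor is applied because the slide shows none.

```python
def lexrank(adjacency, iterations=100):
    """Power iteration p <- M^T p on a row-normalized adjacency matrix."""
    n = len(adjacency)
    degree = [sum(row) for row in adjacency]
    # M[i][j]: probability of moving from tweet i to tweet j.
    m = [[adjacency[i][j] / degree[i] for j in range(n)] for i in range(n)]
    p = [1.0 / n] * n  # uniform start, 0.25 each for four tweets
    for _ in range(iterations):
        p = [sum(m[i][j] * p[i] for i in range(n)) for j in range(n)]
    return p

# Assumed adjacency (1 where Sim[i][j] > 0.5), rows and columns t1..t4.
ADJ = [
    [1, 1, 1, 0],  # t1: degree 3
    [1, 1, 1, 0],  # t2: degree 3
    [1, 1, 1, 1],  # t3: degree 4
    [0, 0, 1, 1],  # t4: degree 2
]
scores = lexrank(ADJ)
print([round(s, 4) for s in scores])
```

On an undirected graph like this one the scores settle in proportion to the degrees, so t3 ranks highest and t4 lowest.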

16 Topic Evolvement Detection [Figure: the current summary Sc (example: “The iPhone 6 release date will be in 2014”) shown together with the summary Sp; the current summary is added to the timeline]
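The slide does not say how Sc and Sp are compared. The sketch below assumes that Sp is the previous summary, that the comparison is a Jensen-Shannon divergence between the word distributions of the two summaries, and that Sc is added to the timeline when the divergence exceeds a threshold. The measure, the threshold value of 0.5 and all function names are assumptions made for illustration.

```python
import math
from collections import Counter

def word_distribution(text):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence with log base 2, so the result lies in [0, 1]."""
    vocab = set(p) | set(q)
    mid = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mid(dist):
        return sum(pr * math.log2(pr / mid[w]) for w, pr in dist.items() if pr > 0)

    return 0.5 * kl_to_mid(p) + 0.5 * kl_to_mid(q)

def should_add_to_timeline(current_summary, previous_summary, threshold=0.5):
    # Hypothetical rule: a large shift in wording signals topic evolvement.
    divergence = js_divergence(word_distribution(current_summary),
                               word_distribution(previous_summary))
    return divergence > threshold

s_p = "Apple announces new iPad pricing"
s_c = "The iPhone 6 release date will be in 2014"
print(should_add_to_timeline(s_c, s_p))
```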

17 Outline Introduction Method – Tweet Stream Clustering – High-level Summarization Experiment Conclusion

18 Experiment Datasets Baseline – ClusterSum – LexRank – DSDR

19 Experiment [Figure: experiment results; window size=20000, step size=4000~20000]

20 Outline Introduction Method – Tweet Stream Clustering – High-level Summarization Experiment Conclusion

21 Conclusion The paper proposed a prototype called Sumblr, which supports continuous tweet stream summarization. Sumblr employs a tweet stream clustering algorithm to compress tweets into TCVs and maintain them in an online fashion. It uses a TCV-Rank summarization algorithm to generate online summaries and historical summaries over arbitrary time durations. Topic evolvement can be detected automatically, allowing Sumblr to produce dynamic timelines for tweet streams. For future work, the authors aim to develop a multi-topic version of Sumblr in a distributed system and evaluate it on more complete and large-scale datasets.

