Sumblr: Continuous Summarization of Evolving Tweet Streams

Presentation transcript:

Sumblr: Continuous Summarization of Evolving Tweet Streams
Date: 2014/08/11
Authors: Lidan Shou, Zhenhua Wang, Ke Chen, Gang Chen
Source: SIGIR'13
Advisor: Jia-ling Koh
Speaker: Sz-Han Wang

Outline
Introduction
Method: Tweet Stream Clustering, High-level Summarization
Experiment
Conclusion

Introduction
With the explosive growth of microblogging services, short text messages (also known as tweets) are being created and shared at an unprecedented rate. Tweets in their raw form can be incredibly informative, but also overwhelming. Plowing through so many tweets for interesting content would be a nightmare, not to mention the enormous amount of noise and redundancy that one could encounter.

Introduction
In this paper, we study continuous tweet summarization as a solution. Traditional document summarization methods focus on static and small-scale data. We propose a novel prototype called Sumblr (SUMmarization By stream cLusteRing) for tweet streams.
Figure: a timeline example for the topic "Apple".

Framework

Tweet Cluster Vector
A tweet is represented as $t_i = (tv_i, ts_i, w_i)$, where $tv_i$ is the textual vector (TF-IDF scores), $ts_i$ is the posted timestamp, and $w_i$ is the UserRank value of the tweet's author.
Example: the tweet "Alice: a b c b e a e b" has the textual vector $tv_i$ with a = 1.301, b = 1.477, c = 1, e = 1.301.
For a cluster $C$ containing tweets $t_1, t_2, \dots, t_n$, the Tweet Cluster Vector is $TCV(C) = (sum\_v, wsum\_v, ts1, ts2, ft\_set)$, where:
$sum\_v = \sum_{i=1}^{n} tv_i / \|tv_i\|$ is the sum of the normalized textual vectors;
$wsum\_v = \sum_{i=1}^{n} w_i \cdot tv_i$ is the weighted sum of the textual vectors;
$ts1 = \sum_{i=1}^{n} ts_i$ is the sum of timestamps;
$ts2 = \sum_{i=1}^{n} (ts_i)^2$ is the quadratic sum of timestamps;
$ft\_set$ is a focus tweet set of size $m$, consisting of the $m$ tweets closest to the cluster centroid (cosine similarity is the distance metric).
The cluster centroid vector is $cv = \left( \sum_{i=1}^{n} w_i \cdot tv_i \right) / n = wsum\_v / n$.

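Tweet Cluster Vector (example)
Tweets: t1-Alice: a b c b e a e b. t2-Tim: a c c d d b e. t3-Judy: b c d e a a a. t4-Tina: b b d e e b b. t5-Sam: c c c b b b.
Textual vectors and their norms $\|tv_i\|$ (a dash marks a term that does not occur in the tweet):
tweet | a | b | c | d | e | norm
t1 | 1.301 | 1.477 | 1 | - | 1.301 | 2.563
t2 | 1 | 1 | 1.301 | 1.301 | 1 | 2.527
t3 | 1.477 | 1 | 1 | 1 | 1 | 2.486
t4 | - | 1.602 | - | 1 | 1.301 | 2.293
t5 | - | 1.477 | 1.477 | - | - | 2.089
Cluster-level vectors, with $sum\_v = \sum_{i=1}^{n} tv_i / \|tv_i\|$, $wsum\_v = \sum_{i=1}^{n} w_i \cdot tv_i$, and $cv = wsum\_v / n$:
vector | a | b | c | d | e
sum_v | 1.497 | 2.780 | 2.014 | 1.353 | 1.873
wsum_v | 3.778 | 6.556 | 4.778 | 3.301 | 4.602
cv | 0.756 | 1.311 | 0.956 | 0.660 | 0.920
Similarity of each tweet to the centroid, $sim(cv, t_i)$: t1 = 0.934, t2 = 0.951, t3 = 0.943, t4 = 0.815, t5 = 0.757.
Suppose $m = 3$: ft_set = {t2, t1, t3}.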
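The numbers in this example can be reproduced with a short script. This is only a sketch under two assumptions that the slide does not state: each term weight is taken to be 1 + log10(term frequency), and every UserRank weight w_i is set to 1. Those choices happen to match the values shown (for example the norm 2.563 of t1 and the centroid similarity 0.951 of t2), but the paper's actual TF-IDF and UserRank computations may differ.

```python
import math
from collections import Counter

# The five example tweets from the slide, as token lists.
tweets = {
    "t1": "a b c b e a e b".split(),
    "t2": "a c c d d b e".split(),
    "t3": "b c d e a a a".split(),
    "t4": "b b d e e b b".split(),
    "t5": "c c c b b b".split(),
}
vocab = ["a", "b", "c", "d", "e"]


def text_vector(tokens):
    # Assumed weighting: 1 + log10(term frequency), 0 for absent terms.
    counts = Counter(tokens)
    return [1 + math.log10(counts[w]) if counts[w] else 0.0 for w in vocab]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))


tv = {name: text_vector(tokens) for name, tokens in tweets.items()}
w = {name: 1.0 for name in tweets}  # assumed UserRank weights (all equal)
n = len(tweets)

# TCV components: sum of normalized vectors and weighted sum of vectors.
sum_v = [sum(tv[t][k] / norm(tv[t]) for t in tv) for k in range(len(vocab))]
wsum_v = [sum(w[t] * tv[t][k] for t in tv) for k in range(len(vocab))]
cv = [x / n for x in wsum_v]  # cluster centroid

# Focus tweet set: the m tweets most similar to the centroid.
sims = {t: cosine(cv, tv[t]) for t in tv}
m = 3
ft_set = sorted(sims, key=sims.get, reverse=True)[:m]

print([round(x, 3) for x in cv])
print(ft_set)
```

Because sum_v and wsum_v are plain sums, adding a tweet to a cluster only requires adding its vector to each of them, which is what makes the TCV cheap to maintain on a stream.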

Pyramidal Time Frame
The Pyramidal Time Frame (PTF) stores snapshots at differing levels of granularity depending on recency.
Each snapshot of the $i$-th order is taken at a moment when the timestamp from the beginning of the stream is exactly divisible by $\alpha^i$.
For each order, at most $\alpha^l + 1$ snapshots are stored.
The maximum order of any snapshot stored at time $T$ is $\log_\alpha(T)$.
The maximum number of snapshots maintained at time $T$ is $(\alpha^l + 1) \cdot \log_\alpha(T)$.
Example: $\alpha = 3$, $l = 2$, start timestamp = 1, current timestamp = 86. Then $\log_3(86) \approx 4.05$, each order keeps at most $3^2 + 1 = 10$ snapshots, and at most $(3^2 + 1) \cdot \log_3(86) \approx 40.5$ snapshots are maintained.
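The snapshot schedule can be sketched in a few lines. The function name `ptf_snapshots` and its parameters are illustrative, not taken from the paper, and the sketch keeps a timestamp under every order it is divisible by, without removing timestamps that appear under several orders.

```python
def ptf_snapshots(now, alpha, l, start=1):
    """Return {order: [timestamps]} of the snapshots kept at time `now`."""
    keep = alpha ** l + 1  # at most alpha^l + 1 snapshots per order
    kept = {}
    order = 0
    while alpha ** order <= now:
        step = alpha ** order
        # Timestamps in [start, now] divisible by alpha**order, oldest first.
        first = ((start + step - 1) // step) * step
        stamps = list(range(first, now + 1, step))
        kept[order] = stamps[-keep:]  # only the most recent ones survive
        order += 1
    return kept


snapshots = ptf_snapshots(now=86, alpha=3, l=2)
for order, stamps in snapshots.items():
    print(order, stamps)
```

With the slide's settings the highest order is 4 and no order holds more than 10 snapshots, in line with the bounds above: recent history is kept at fine granularity, older history only at coarse granularity.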

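Tweet Stream Clustering
Initialization: a k-means clustering algorithm creates the initial clusters.
Incremental clustering: each arriving tweet $t$ is compared with the existing clusters, and the maximum similarity is checked against the Minimum Bounding Similarity (MBS) of the closest cluster.
$MBS = \beta \cdot \overline{Sim}(c_1, t_i)$, where $\overline{Sim}(c_1, t_i) = \frac{1}{n} \sum_{i=1}^{n} \frac{tv_i \cdot cv}{\|tv_i\| \cdot \|cv\|} = \frac{wsum\_v \cdot sum\_v}{n \cdot \|wsum\_v\|}$ and $cv$ is the centroid of $c_1$.
If $MaxSim(c_1, t) < MBS$, $t$ is upgraded to a new cluster.
If $MaxSim(c_1, t) \geq MBS$, $t$ is added to its closest cluster.
Figure: three clusters, c1 = {t1, t2, t3, t4, t5} with TCV(1), c2 = {t6, t7, t8} with TCV(2), and c3 = {t9, t10} with TCV(3). The new tweet t is compared with each of them through Sim(c1, t), Sim(c2, t), and Sim(c3, t); c1 gives the maximum similarity.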
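The assignment rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: clusters are plain dictionaries that hold only sum_v, wsum_v, and n, the helper names are made up for this example, and every textual vector is assumed to be non-zero.

```python
import math


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def new_cluster(tv, w=1.0):
    return {"sum_v": [x / norm(tv) for x in tv],
            "wsum_v": [w * x for x in tv],
            "n": 1}


def add_tweet(c, tv, w=1.0):
    # Both vectors are plain sums, so the update is incremental.
    c["sum_v"] = [s + x / norm(tv) for s, x in zip(c["sum_v"], tv)]
    c["wsum_v"] = [s + w * x for s, x in zip(c["wsum_v"], tv)]
    c["n"] += 1


def mbs(c, beta):
    # beta times the average cosine similarity between centroid and members.
    return beta * dot(c["wsum_v"], c["sum_v"]) / (c["n"] * norm(c["wsum_v"]))


def assign(clusters, tv, beta, w=1.0):
    """Add tv to its closest cluster or open a new one; return its index."""
    # wsum_v points in the same direction as the centroid cv = wsum_v / n,
    # so it yields the same cosine similarity.
    sims = [dot(c["wsum_v"], tv) / (norm(c["wsum_v"]) * norm(tv))
            for c in clusters]
    best = max(range(len(clusters)), key=sims.__getitem__) if clusters else None
    if best is None or sims[best] < mbs(clusters[best], beta):
        clusters.append(new_cluster(tv, w))
        return len(clusters) - 1
    add_tweet(clusters[best], tv, w)
    return best


clusters = []
first = assign(clusters, [1.0, 0.0, 0.0], beta=0.8)   # opens cluster 0
second = assign(clusters, [1.0, 0.1, 0.0], beta=0.8)  # close to cluster 0
third = assign(clusters, [0.0, 0.0, 1.0], beta=0.8)   # unrelated, new cluster
print(first, second, third)
```

Note that the MBS is computed from the TCV alone, so no member tweet has to be revisited when a new tweet arrives.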

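Tweet Stream Clustering
The number of active clusters is restricted in two ways.
Deleting outdated clusters (periodic examination): if $Avg_p$ > threshold, the cluster is removed. Example setting: threshold = 3 days, p = 10.
Merging clusters (when the memory limit is reached): the merging process continues until only an $mc$ percentage of the original clusters is left.
Example: suppose $mc = 0.7$ and there are 10 clusters c1, ..., c10 before merging, so 10 * (1 - 0.7) = 3 clusters must be removed. The cluster pairs sorted by distance are (c1, c2), (c2, c4), (c1, c4), (c5, c7), (c4, c5), ... Merging yields {c1, c2}, then {c1, c2, c4}, then {c5, c7}; the pair (c1, c4) causes no merge because both clusters already belong to the same merged cluster.
After merging: {c1, c2, c4}, c3, {c5, c7}, c6, c8, c9, c10.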
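The merging example can be traced with a small union-find sketch. The function `merge_clusters` is a hypothetical helper: it only tracks which clusters end up together and takes the sorted pair list as given, whereas a full implementation would also compute the pair distances and combine the TCVs of merged clusters.

```python
def merge_clusters(cluster_ids, sorted_pairs, mc):
    """Merge pairs in the given order until mc * len(cluster_ids) remain."""
    parent = {c: c for c in cluster_ids}

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]  # path halving
            c = parent[c]
        return c

    target = round(len(cluster_ids) * mc)
    remaining = len(cluster_ids)
    for a, b in sorted_pairs:
        if remaining <= target:
            break
        ra, rb = find(a), find(b)
        if ra != rb:  # skip pairs that are already in the same merged cluster
            parent[rb] = ra
            remaining -= 1

    groups = {}
    for c in cluster_ids:
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())


# Clusters c1..c10 as integers, pairs in the order shown on the slide.
pairs = [(1, 2), (2, 4), (1, 4), (5, 7), (4, 5)]
merged = merge_clusters(list(range(1, 11)), pairs, mc=0.7)
print(merged)
```

The run stops after three merges, leaving seven clusters, so the pair (c4, c5) is never applied.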

High-level Summarization
Online summaries: retrieved directly from the current clusters maintained in memory.
Historical summaries: retrieved from two snapshots in the PTF.
Both kinds of summary are generated with TCV-Rank summarization.

TCV-Rank Summarization
Generate the input cluster set D(c), then gather the tweets from the ft_sets in D(c) as a set T.
Figure: two PTF snapshots, S(ts1) and S(ts2), taken at the beginning and ending timestamps of the duration, hold the TCVs of clusters C1 to C6 with their focus tweet sets. The resulting input cluster set D(c) contains TCV(C1-C4) with ft_set {t3}, TCV(C2) with ft_set {t4, t5}, TCV(C3) with ft_set {t6, t7}, TCV(C4) with ft_set {t1, t2, t8}, TCV(C5) with ft_set {t9, t10}, and TCV(C6) with ft_set {t11}.
T = {t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11}

TCV-Rank Summarization
Build a cosine similarity graph on T.
Compute the LexRank scores LR.
Add tweet $t$ into the summary $S$, where $t = \arg\max_{t_i} \left[ \lambda \frac{n_{t_i}}{n_{max}} LR(t_i) - (1 - \lambda) \, \mathrm{avg}_{t_j \in S} Sim(t_i, t_j) \right]$.
Example: T = {t1, t2, t3, t4, t5, t6, t7, t8, t9, t10, t11}, with LR values 0.601, 0.847, 0.349, 0.752, 0.591, 0.799, 0.355, 1, 0.592, 0.691.
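The selection rule can be sketched as a greedy loop. Several things here are assumptions rather than facts from the slide: $n_{t_i}$ is read as the size of the cluster that tweet $t_i$ belongs to and $n_{max}$ as the largest such size, the redundancy term is taken as zero while the summary is still empty, and the function name, inputs, and toy scores are invented for illustration.

```python
def select_summary(tweets, lr, cluster_size, sim, lam=0.5, length=3):
    """Greedily pick `length` tweets that score high and overlap little."""
    n_max = max(cluster_size[t] for t in tweets)
    summary = []
    candidates = list(tweets)
    while candidates and len(summary) < length:
        def score(t):
            relevance = lam * cluster_size[t] / n_max * lr[t]
            if not summary:
                return relevance  # nothing selected yet, no redundancy
            redundancy = sum(sim(t, s) for s in summary) / len(summary)
            return relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        summary.append(best)
        candidates.remove(best)
    return summary


# Toy inputs (hypothetical): t1 and t2 are near-duplicates.
lr = {"t1": 1.0, "t2": 0.9, "t3": 0.5}
cluster_size = {"t1": 3, "t2": 3, "t3": 2}
pair_sim = {frozenset(("t1", "t2")): 0.95}


def sim(a, b):
    return pair_sim.get(frozenset((a, b)), 0.1)


picked = select_summary(["t1", "t2", "t3"], lr, cluster_size, sim, length=2)
print(picked)
```

In the toy run t2 has the second-highest LexRank score but is passed over in favour of t3, because it is almost identical to the already selected t1; this is the trade-off that $\lambda$ controls.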

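LexRank
Build the cosine similarity matrix and the degree of each tweet, keeping only the entries with Sim[i][j] > t (t = 0.5).
Normalize each kept entry as $sim[i][j] / degree[i]$ to obtain the matrix $M$.
Compute LR = PowerMethod(M, n, $\epsilon$): iterate $p_{t+1} = M^T p_t$ and compute $\delta = \|p_{t+1} - p_t\|$; if $\delta < \epsilon$, then LR = $p_{t+1}$.
Figure: an example with four tweets t1 to t4, showing the similarity matrix, the degree table, and the normalized matrix M. Starting from $p_t$ = 0.25 for every tweet, one iteration gives $p_{t+1}$ = (0.23, 0.24, 0.20, 0.33).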
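A compact sketch of the power method follows. It implements the standard thresholded LexRank, in which every kept edge has weight 1 and each row is divided by its degree; the slide's own normalization may differ in detail. It also assumes that the diagonal of the similarity matrix is 1, so every tweet keeps a self-loop and no degree is zero.

```python
def lexrank(sim, threshold=0.5, eps=1e-9, max_iter=1000):
    """Return LexRank scores for a square cosine similarity matrix."""
    n = len(sim)
    # Keep an edge only where the similarity exceeds the threshold.
    adj = [[1.0 if sim[i][j] > threshold else 0.0 for j in range(n)]
           for i in range(n)]
    degree = [sum(row) for row in adj]
    # Row-normalize so that every row of M sums to 1.
    m = [[adj[i][j] / degree[i] for j in range(n)] for i in range(n)]

    p = [1.0 / n] * n  # uniform start vector
    for _ in range(max_iter):
        # p_next = M^T p
        p_next = [sum(m[i][j] * p[i] for i in range(n)) for j in range(n)]
        delta = sum((a - b) ** 2 for a, b in zip(p_next, p)) ** 0.5
        p = p_next
        if delta < eps:
            break
    return p


# Hypothetical 3-tweet similarity matrix: the middle tweet is linked to both
# others, while the first and third are not linked to each other.
sim_matrix = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.6],
    [0.2, 0.6, 1.0],
]
scores = lexrank(sim_matrix)
print([round(s, 3) for s in scores])
```

On a symmetric graph like this one the scores converge to values proportional to the degrees, so the best-connected tweet ranks highest.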

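Topic Evolvement Detection
Goal: a continuous timeline.
Compute $D_{cur}$ and $D_{avg}$; if $D_{cur} / D_{avg} > \tau$, add a time node to the timeline.
The distance between the current summary $S_c$ and the previous summary $S_p$ is the Kullback–Leibler divergence: $D_{KL}(S_c \| S_p) = \sum_{w \in V} p(w|S_c) \ln \frac{p(w|S_c)}{p(w|S_p)}$.
Example of a current summary $S_c$ that is added to the timeline: "The iPhone 6 release date will be in 2014".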
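The divergence test can be sketched as below. The word distributions are estimated from whitespace-separated tokens with add-one smoothing, which is an assumption made here so that the logarithm stays defined for words missing from one summary; the slide does not say how $p(w|S)$ is estimated. The previous summary in the example is hypothetical, and $D_{avg}$ is simply passed in as a number.

```python
import math
from collections import Counter


def word_dist(text, vocab, smoothing=1.0):
    # Smoothed unigram distribution over a shared vocabulary (assumption).
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + smoothing * len(vocab)
    return {w: (counts[w] + smoothing) / total for w in vocab}


def kl_divergence(s_c, s_p):
    """D_KL(S_c || S_p) over the union vocabulary of both summaries."""
    vocab = set(s_c.lower().split()) | set(s_p.lower().split())
    p_c = word_dist(s_c, vocab)
    p_p = word_dist(s_p, vocab)
    return sum(p_c[w] * math.log(p_c[w] / p_p[w]) for w in vocab)


def is_time_node(d_cur, d_avg, tau):
    # A new time node is added when the current change is unusually large.
    return d_avg > 0 and d_cur / d_avg > tau


s_p = "The iPhone 5s release date will be in 2013"  # hypothetical previous summary
s_c = "The iPhone 6 release date will be in 2014"   # current summary from the slide
d_cur = kl_divergence(s_c, s_p)
print(round(d_cur, 4), is_time_node(d_cur, d_avg=d_cur / 3, tau=2.0))
```

Using the ratio $D_{cur} / D_{avg}$ rather than a fixed cut-off on $D_{cur}$ makes the test relative to how much the summaries of this stream usually change.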

Experiment
Datasets
Baselines: ClusterSum, LexRank, DSDR

Experiment
Window size = 20000, step size = 4000 to 20000

Conclusion
The paper proposed a prototype called Sumblr, which supports continuous tweet stream summarization.
Sumblr employs a tweet stream clustering algorithm to compress tweets into TCVs and maintain them in an online fashion.
It uses a TCV-Rank summarization algorithm to generate online summaries and historical summaries with arbitrary time durations.
Topic evolvement can be detected automatically, allowing Sumblr to produce dynamic timelines for tweet streams.
For future work, the authors aim to develop a multi-topic version of Sumblr in a distributed system and to evaluate it on more complete and large-scale datasets.