Entity Tracking in Real-Time using Sub-Topic Detection on Twitter SANDEEP PANEM, ROMIL BANSAL, MANISH GUPTA, VASUDEVA VARMA INTERNATIONAL INSTITUTE OF.

Presentation transcript:

Entity Tracking in Real-Time using Sub-Topic Detection on Twitter
SANDEEP PANEM, ROMIL BANSAL, MANISH GUPTA, VASUDEVA VARMA
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY, HYDERABAD
14TH APRIL

Introduction
1. “Topics” on Twitter relate to major events in the real world. “Sub-topics”, on the other hand, are fine-grained aspects of such events. For example, consider the tweet “Recently listed on MLS: 2003 Volvo VHD64B200 #mixer from Transport Truck Sales in Kansas City, KS”. Here the sub-topic is “buying or selling of trucks” and the topic is “Volvo”.
2. Mining sub-topics for entities helps in trend analysis, social monitoring, topic tracking and reputation mining.

Motivation
1. In the recent past, Twitter has been widely used to convey the social pulse about real-world entities.
2. Sub-topic detection lets users see all the information related to a particular entity in one place rather than in a cluttered manner.
3. It helps in trend analysis, social monitoring, and mining the reputation of particular entities by revealing the hidden topics.
4. Earlier approaches used offline clustering, which assumes that all tweets are available beforehand. We propose two unsupervised clustering approaches that cluster tweets dynamically as they arrive.

Related Work
1. Existing topic detection methodologies are generally based on probabilistic language models, such as Probabilistic Latent Semantic Analysis (PLSA) and Latent Dirichlet Allocation (LDA).
2. By exploiting tweet contents, LIA at CLEF 2013 applied a large variety of machine learning methods for clustering of tweets.
3. REINA at CLEF 2013 used a similarity matrix and community detection techniques for topic detection.
4. UNED ORM at CLEF 2013 experimented with approaches like agglomerative clustering based on term co-occurrences and clustering of wikified concepts.

Our Approach (Outline)
We propose two unsupervised clustering approaches. Each approach consists of two phases, offline and online.
1. Semantic Space Representation (SSR) based approach
2. Concept Space Representation (CSR) based approach

Offline Initial Cluster Generation Phase for Both Approaches

Online Cluster Maintenance Phase for Both Approaches

SSR Approach
1. We extract keywords and keyphrases from each tweet.
2. Keywords are the nouns, verbs, hashtags and proper nouns occurring in the tweet.
3. We expand these keywords with synonyms from WordNet. In addition, the top n words from the text of the URL linked in the tweet are added to the keywords.
4. We extract the longest sequence of nouns and proper nouns as keyphrases.
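The extraction steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the tweet has already been POS-tagged with Penn Treebank style tags, uses a hypothetical tag `HT` for hashtags, reads "longest sequence" as every maximal run of at least `min_len` consecutive nouns or proper nouns, and leaves out the WordNet and URL-text expansion.

```python
from itertools import groupby

# Assumption: Penn Treebank style tags; "HT" is a hypothetical hashtag tag.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def extract_keywords(tagged):
    """Return nouns, proper nouns, verbs and hashtags, lower-cased, in order."""
    seen, keywords = set(), []
    for token, tag in tagged:
        is_keyword = tag in NOUN_TAGS or tag.startswith("VB") or token.startswith("#")
        word = token.lower()
        if is_keyword and word not in seen:
            seen.add(word)
            keywords.append(word)
    return keywords


def extract_keyphrases(tagged, min_len=2):
    """Return maximal runs of consecutive noun / proper-noun tokens.

    min_len is an assumption of this sketch: single nouns are already
    covered by the keywords, so only multi-word runs are kept.
    """
    phrases = []
    for is_noun, group in groupby(tagged, key=lambda pair: pair[1] in NOUN_TAGS):
        tokens = [token.lower() for token, _ in group]
        if is_noun and len(tokens) >= min_len:
            phrases.append(" ".join(tokens))
    return phrases


# Hand-tagged version of the example tweet from the Introduction slide.
example = [
    ("Recently", "RB"), ("listed", "VBN"), ("on", "IN"), ("MLS", "NNP"),
    (":", ":"), ("2003", "CD"), ("Volvo", "NNP"), ("VHD64B200", "NNP"),
    ("#mixer", "HT"), ("from", "IN"), ("Transport", "NNP"), ("Truck", "NNP"),
    ("Sales", "NNPS"), ("in", "IN"), ("Kansas", "NNP"), ("City", "NNP"),
]
```

Calling `extract_keyphrases(example)` yields the multi-word noun runs of the tweet, while `extract_keywords(example)` also keeps the verb "listed" and the hashtag "#mixer".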

Similarity Computation between two Tweets
Here, w denotes the weight given to the keyphrases and (1 - w) denotes the weight given to the keywords. For our experiments we set w = 0.6.
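One way to read this weighting is as a convex combination of a keyphrase similarity and a keyword similarity. The sketch below is an assumption-laden illustration: the slide does not state which set similarity is used, so Jaccard overlap is swapped in here, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of two collections; 0.0 when both are empty."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tweet_similarity(keyphrases_1, keywords_1, keyphrases_2, keywords_2, w=0.6):
    """Weighted similarity: w on keyphrases, (1 - w) on keywords.

    w = 0.6 is the value reported on the slide; Jaccard is an assumption.
    """
    return w * jaccard(keyphrases_1, keyphrases_2) + (1 - w) * jaccard(keywords_1, keywords_2)


# Identical keyphrases (overlap 1.0) and keywords sharing 2 of 4 terms (0.5).
score = tweet_similarity(
    {"transport truck sales"}, {"volvo", "truck", "sale"},
    {"transport truck sales"}, {"volvo", "truck", "buy"},
)
```

With w = 0.6 the example evaluates to 0.6 * 1.0 + 0.4 * 0.5 = 0.8, so a shared keyphrase counts for more than partial keyword overlap.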

Wikipedia Title Matching
1. To avoid incorrect cluster assignment for a new tweet, we use “Wikipedia title matching” to distinguish sub-topics related to a particular topic.
2. We compare the tweet’s keyphrases with Wikipedia titles by performing a substring match. If at least one keyphrase occurs in the title of a Wikipedia page and none of the tweet’s keyphrases match the keyphrases of the matched cluster c, we create a new cluster. Otherwise we assign the tweet to the cluster c.
3. Observation: The number of discovered relationships in the system is low, because excessive keyword matching (probably due to WordNet usage) caused some sub-topics to merge into a single sub-topic.
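The decision rule in item 2 can be sketched as a small predicate. This is only an illustration: the function name and arguments are hypothetical, the Wikipedia titles are passed in as a plain list rather than looked up, and matching between tweet and cluster keyphrases is assumed to be exact, case-insensitive equality.

```python
def needs_new_cluster(tweet_keyphrases, cluster_keyphrases, wiki_titles):
    """Return True when the tweet should start a new cluster.

    A new cluster is created if some tweet keyphrase is a substring of a
    Wikipedia title AND the tweet shares no keyphrase with the matched
    cluster c. Otherwise the tweet stays with cluster c.
    """
    tweet_kps = {kp.lower() for kp in tweet_keyphrases}
    cluster_kps = {kp.lower() for kp in cluster_keyphrases}
    titles = [title.lower() for title in wiki_titles]

    occurs_in_title = any(kp in title for kp in tweet_kps for title in titles)
    shares_keyphrase = bool(tweet_kps & cluster_kps)
    return occurs_in_title and not shares_keyphrase


# Hypothetical titles and cluster contents, for illustration only.
titles = ["Volvo Ocean Race", "Volvo Trucks"]
cluster_c = ["truck sales", "volvo trucks"]
```

For example, `needs_new_cluster(["ocean race"], cluster_c, titles)` returns `True`, because the keyphrase appears in a title but not in cluster c, whereas `needs_new_cluster(["volvo trucks"], cluster_c, titles)` returns `False`.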

Concept Space Representation (CSR) based Approach
1. Transformation of tweets into concept space can help reduce the lexical mismatch.
2. We obtain the semantic representation of a tweet by extracting concepts for a tweet using the TagMe API. The API takes short text snippets as input, disambiguates the entities in the text, and maps these entities to Wikipedia pages.
3. We represent each tweet t conceptually as a combination of keyphrases (KP(t)), URLConcepts (UC(t)) and TweetConcepts (TC(t)). The offline phase for the CSR based approach is similar to the one for the SSR based approach.
4. In the online phase, we query the index for clusters containing the tweet’s keywords and retrieve the k nearest clusters. Similarly, we retrieve the top k nearest clusters for URLConcepts and TweetConcepts.

Cluster Score Computation

Conditions for Creating a New Cluster
1. A new tweet t is either assigned to the cluster c with the highest score or to a new cluster.
2. If no cluster in the index matches the tweet keywords, the tweet is assigned to a new cluster.
3. The tweet is assigned to the cluster c if at least one keyphrase matches between cluster c and tweet t, or at least one concept other than the entity name matches.
4. Otherwise the tweet t is assigned to a new cluster.
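These conditions can be written as a short assignment routine. The sketch assumes the cluster scores have already been computed elsewhere and are passed in as `(score, cluster)` pairs; the dictionary layout of a cluster and the function name are hypothetical, and a return value of `None` stands for "create a new cluster".

```python
def assign_tweet(tweet_keyphrases, tweet_concepts, entity_name, scored_clusters):
    """Return the id of the cluster the tweet joins, or None for a new cluster.

    scored_clusters: list of (score, cluster) pairs retrieved from the index,
    where cluster is a dict with "id", "keyphrases" and "concepts" (assumed layout).
    """
    # Condition 2: nothing retrieved from the index, so start a new cluster.
    if not scored_clusters:
        return None

    # Condition 1: only the highest-scoring cluster c is a candidate.
    _, best = max(scored_clusters, key=lambda pair: pair[0])

    # Condition 3: a shared keyphrase, or a shared concept other than the entity.
    shared_keyphrases = set(tweet_keyphrases) & set(best["keyphrases"])
    shared_concepts = (set(tweet_concepts) & set(best["concepts"])) - {entity_name}
    if shared_keyphrases or shared_concepts:
        return best["id"]

    # Condition 4: otherwise the tweet opens a new cluster.
    return None


# Hypothetical clusters for the entity "Volvo".
clusters = [
    (0.4, {"id": "c1", "keyphrases": {"ocean race"}, "concepts": {"Volvo", "Sailing"}}),
    (0.7, {"id": "c2", "keyphrases": {"truck sales"}, "concepts": {"Volvo", "Truck"}}),
]
```

Excluding the entity name from the concept match matters because every tweet about the entity carries that concept, so it cannot discriminate between sub-topics.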

Maintaining Cluster Purity and Cluster Labels
1. We preserve the purity of clusters by removing irrelevant concepts from the clusters at regular intervals of tweet arrivals into the clusters.
2. In SSR, we label the clusters using the top occurring keyphrase and keyword. In CSR, we define the cluster label using the top occurring keyphrase and concept.
3. As the labels of a cluster change frequently with the incoming tweets, we update the labels of the clusters at regular intervals Δt. For our experiments, we set Δt to fifty tweets.
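The periodic label refresh can be illustrated with simple frequency counters. This is a sketch under assumptions: the class and method names are hypothetical, it follows the CSR labelling (top keyphrase plus top concept), and it leaves out the purity step because the slide does not say how irrelevant concepts are identified.

```python
from collections import Counter


class ClusterLabeler:
    """Tracks keyphrase/concept counts and refreshes the label every delta_t tweets."""

    def __init__(self, delta_t=50):  # fifty tweets, as reported on the slide
        self.delta_t = delta_t
        self.keyphrases = Counter()
        self.concepts = Counter()
        self.tweets_seen = 0
        self.label = None  # no label until the first refresh

    def add_tweet(self, keyphrases, concepts):
        """Record one tweet; recompute the label only on every delta_t-th tweet."""
        self.keyphrases.update(keyphrases)
        self.concepts.update(concepts)
        self.tweets_seen += 1
        if self.tweets_seen % self.delta_t == 0:
            self.label = self._top_label()
        return self.label

    def _top_label(self):
        top_keyphrase = self.keyphrases.most_common(1)
        top_concept = self.concepts.most_common(1)
        return (
            top_keyphrase[0][0] if top_keyphrase else None,
            top_concept[0][0] if top_concept else None,
        )
```

Refreshing only every `delta_t` tweets keeps labels stable between updates instead of letting them flip with each incoming tweet.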

Dataset
• For our evaluation, we use the RepLab (CLEF) 2013 dataset.
• The dataset contains tweets for 61 entities. Each entity has about 700 tweets for training and 1500 tweets for testing.

Results

Conclusion
In the SSR based approach, we used keywords (WordNet synonyms, URL keywords), keyphrases and Wikipedia title matching (as the criterion for creating a new cluster) as features. In the CSR based approach, we used TweetConcepts, URLConcepts and keyphrases as features. In the future, we would like to extend this study by incorporating the similarity between concepts in the CSR based approach.

Thank you! Questions?
