

1 Master's Thesis Presentation by Debotosh Dey. AUTOMATIC CONSTRUCTION OF HASHTAGS HIERARCHIES. UNIVERSITAT ROVIRA I VIRGILI, Tarragona, June 2015. Supervised by: Dr. Antonio Moreno

2 Objectives: Analyze and report the current state of the art on the analysis of tweets. Obtain a data set of tweets. Develop, implement and test new mechanisms for automatic hashtag hierarchy construction, comparing the use of co-occurrence frequency with the use of semantic measures.

3 What is Twitter? Twitter is an online social networking service. Each tweet is up to 140 characters long and may contain: text, links, user mentions, symbols and emoticons, and hashtags.

4 Scope: Tweets are usually ungrammatical. Hashtags provide Twitter with a mechanism to semi-structure its content, and they may be used to categorize sets of tweets. This motivates the need for systems that can aggregate and categorize all of this content. Examples: large companies, governments.

5 Why is it difficult? Hashtags are unstructured. Tweets are very terse, often lacking sufficient context to categorize them. Retrieval and classification methods face some basic problems: synonymy and polysemy.

6 State of the Art: Three basic kinds of techniques have been proposed to detect the main topics of interest within a set of messages exchanged in a social network: probabilistic models, document-pivot approaches, and feature-pivot methods.

7 Methodology Clustering: this stage aims to group all the similar hashtags in clusters of related terms in order to detect topics of interest. Topic selection: general discussion about the detection of the most relevant classes.

8 Some basic concepts and tools: Twitter. Knowledge repositories: WordNet, ontology-based semantic similarity. Techniques: word-breaking, clustering, inter-class homogeneity.

9 WordNet: WordNet is the most commonly used online lexical and semantic repository for the English language. It includes the main lexical categories (nouns, verbs, adjectives and adverbs) but ignores prepositions, determiners and other kinds of words.

10 Ontology-based semantic similarity: Semantic similarity estimates the alikeness between words or concepts by evaluating their semantics. To calculate the semantic similarity between words, we have used the Wu and Palmer distance function.

11 Wu and Palmer distance function
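
The formula shown on this slide did not survive in the transcript. As commonly stated, the Wu and Palmer measure scores two concepts by the depth of their lowest common subsumer (LCS) in the taxonomy: sim(c1, c2) = 2 · depth(LCS) / (depth(c1) + depth(c2)). The sketch below applies that definition to a tiny hand-made taxonomy. The `PARENT` table and all function names are illustrative assumptions, not the thesis code, and real WordNet concepts can have several senses and several paths to the root, which this toy tree ignores.

```python
# Toy taxonomy (an assumption for illustration): child -> parent, root has None.
PARENT = {
    "entity": None,
    "device": "entity",
    "animal": "entity",
    "phone": "device",
    "sensor": "device",
    "dog": "animal",
}


def path_to_root(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def depth(concept):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(concept))


def lowest_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks from b upwards, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("concepts do not share a root")


def wu_palmer(a, b):
    """Wu and Palmer similarity in (0, 1]; 1 means identical concepts."""
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


if __name__ == "__main__":
    for pair in [("phone", "sensor"), ("phone", "dog"), ("phone", "phone")]:
        print(pair, round(wu_palmer(*pair), 3))
```

Note that the measure is a similarity (higher means closer); if a distance is needed for clustering, one common choice is 1 minus the similarity.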

12 Word-breaking: If a hashtag or a word does not match a WordNet entry, the word-breaking technique is applied. It checks for matches between the subsequences of the hashtag and WordNet entries. If a match is found, the subsequence is stored. Examples: iPhone6 -> Phone, hone, one, on. SmartPhone -> Smart, Phone, mart, art, hone, one.

13 Word-breaking: The two largest non-overlapping subsequences are taken, if possible. Examples: iPhone6 -> Phone. SmartPhone -> Smart, Phone. In English, the words on the left are usually adjectives or terms that denote a specialization of the main noun, which is located on the right. Therefore, this procedure finds the most general specialization present in WordNet. Thus, when we analyze the data, we will consider “iPhone6” as “Phone”.
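
The two word-breaking slides can be sketched as follows. This is a minimal illustration under stated assumptions: the small `VOCAB` set stands in for WordNet lookups, the function names are hypothetical, and ties between equally long matches are broken by position, which the slides do not specify.

```python
# Stand-in for WordNet: a tiny set of known words (an assumption for illustration).
VOCAB = {"smart", "phone", "mart", "art", "hone", "one", "on"}


def matching_subsequences(tag, vocab, min_len=2):
    """Return (start, end, word) for every contiguous substring found in vocab."""
    text = tag.lower()
    found = []
    for start in range(len(text)):
        for end in range(start + min_len, len(text) + 1):
            if text[start:end] in vocab:
                found.append((start, end, text[start:end]))
    return found


def break_word(tag, vocab, max_parts=2):
    """Keep up to max_parts of the longest matches that do not overlap each other."""
    # Longest first; among equal lengths, leftmost first.
    candidates = sorted(
        matching_subsequences(tag, vocab),
        key=lambda m: (-(m[1] - m[0]), m[0]),
    )
    chosen = []
    for start, end, word in candidates:
        if len(chosen) == max_parts:
            break
        if all(end <= s or start >= e for s, e, _ in chosen):
            chosen.append((start, end, word))
    # Report the parts in left-to-right order.
    return [word for _, _, word in sorted(chosen)]


def head_word(tag, vocab):
    """Rightmost part, treated as the main noun; None if nothing matched."""
    parts = break_word(tag, vocab)
    return parts[-1] if parts else None


if __name__ == "__main__":
    print(break_word("iPhone6", VOCAB))     # ['phone']
    print(break_word("SmartPhone", VOCAB))  # ['smart', 'phone']
    print(head_word("SmartPhone", VOCAB))   # phone
```

Taking the rightmost part as the head follows the slide's remark that the main noun is located on the right.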

14 Clustering: Clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups. We have chosen the hierarchical clustering method (with complete linkage) to classify the hashtags contained in a set of tweets. Complete linkage calculates the distance between two clusters as the maximum distance between a pair of objects, one taken from each cluster.
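
To make the complete-linkage rule concrete, here is a small pure-Python sketch of agglomerative clustering over a distance matrix. It is an illustration, not the implementation used in the thesis: the function name and the example matrix are made up, and it assumes similarities have already been turned into distances (for example, 1 minus the similarity).

```python
def complete_linkage(dist):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric matrix (list of lists) of pairwise distances.
    Returns the merge history as (members_a, members_b, distance) tuples.
    """
    clusters = [frozenset([i]) for i in range(len(dist))]
    merges = []

    def cluster_distance(a, b):
        # Complete linkage: the largest distance between any cross-cluster pair.
        return max(dist[i][j] for i in a for j in b)

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merges.append((sorted(clusters[x]), sorted(clusters[y]), d))
        merged = clusters[x] | clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    # Four objects: 0 and 1 are close, 2 and 3 are close, the two pairs are far apart.
    example = [
        [0.0, 0.1, 0.8, 0.9],
        [0.1, 0.0, 0.7, 0.85],
        [0.8, 0.7, 0.0, 0.2],
        [0.9, 0.85, 0.2, 0.0],
    ]
    for step in complete_linkage(example):
        print(step)
```

The merge history is exactly the information a dendrogram draws: each tuple is one join, and its distance is the height of that join.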

15 Inter-class Homogeneity: Inter-class homogeneity is the degree of similarity between the elements of the same cluster, that is, a measurement of how homogeneous the population elements within each cluster are.
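
The slides do not give a formula for this measure. One plausible reading, shown here purely as an assumption, is the mean pairwise similarity of the elements in a cluster. The function name and the example matrix are hypothetical.

```python
from itertools import combinations


def homogeneity(cluster, sim):
    """Mean pairwise similarity of the cluster's elements (assumed definition).

    cluster is a list of element indices; sim is a symmetric similarity matrix.
    A cluster with a single element is treated as perfectly homogeneous.
    """
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 1.0
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


if __name__ == "__main__":
    similarity = [
        [1.0, 0.9, 0.2],
        [0.9, 1.0, 0.4],
        [0.2, 0.4, 1.0],
    ]
    print(round(homogeneity([0, 1], similarity), 3))     # tight cluster
    print(round(homogeneity([0, 1, 2], similarity), 3))  # looser cluster
```

Under this definition, adding a dissimilar element lowers the score, which is the behaviour a minimum-homogeneity threshold relies on.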

16 Methodology: Clustering. Syntactic hashtag clustering. Semantic hashtag clustering.

17 Syntactic hashtag clustering: The main idea behind the similarity matrix is that the more frequently two hashtags appear together in the same tweet, the more related they are supposed to be. ∀i ∈ [1,n], ∀j ∈ [1,n]: c_ij = a(i, j)
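
A minimal sketch of building such a matrix from raw tweets is given below. It assumes that a(i, j) is the number of tweets in which hashtags i and j both occur, which the slide does not state explicitly. The regular expression, the function names and the max-based normalization are illustrative assumptions as well; the thesis only says that the matrix is normalized, not how.

```python
import re
from itertools import combinations

HASHTAG = re.compile(r"#(\w+)")


def cooccurrence_matrix(tweets):
    """Count, for every pair of hashtags, the tweets that contain both."""
    tags_per_tweet = [
        sorted({tag.lower() for tag in HASHTAG.findall(tweet)}) for tweet in tweets
    ]
    hashtags = sorted({tag for tags in tags_per_tweet for tag in tags})
    index = {tag: k for k, tag in enumerate(hashtags)}
    counts = [[0] * len(hashtags) for _ in hashtags]
    for tags in tags_per_tweet:
        for a, b in combinations(tags, 2):
            i, j = index[a], index[b]
            counts[i][j] += 1
            counts[j][i] += 1
    return hashtags, counts


def normalize_by_max(counts):
    """Scale counts into [0, 1] by the largest entry (assumed normalization)."""
    largest = max((value for row in counts for value in row), default=0)
    if largest == 0:
        return [[0.0 for _ in row] for row in counts]
    return [[value / largest for value in row] for row in counts]


if __name__ == "__main__":
    sample = [
        "New #sensor for #IoT devices",
        "#IoT and #sensor networks #tech",
        "Great #tech talk",
    ]
    tags, c = cooccurrence_matrix(sample)
    print(tags)
    print(c)
    print(normalize_by_max(c))
```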

18 Semantic hashtag clustering: Semantic similarity is calculated using the Wu & Palmer measure on WordNet. ∀i ∈ [1,n], ∀j ∈ [1,n]: s_ij = SemanticSimilarity(h_i, h_j)

19 Topic selection: Three basic approaches: bottom-up approach, top-down approach, dendrogram approach. Filtering has two threshold values: minimum number of elements and minimum inter-class homogeneity.
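
The sketch below shows one way the two thresholds could drive the bottom-up and top-down selections over a cluster tree. It follows the reading given later in the presentation (bottom-up keeps the most specific qualifying classes, top-down the most general ones), but the `Node` class, the function names and the example tree are assumptions made for illustration, not the thesis implementation. The dendrogram approach is not covered.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A cluster in the hierarchy (hypothetical structure for this sketch)."""
    name: str
    size: int            # number of hashtags in the cluster
    homogeneity: float   # inter-class homogeneity of the cluster
    children: list = field(default_factory=list)


def qualifies(node, min_size, min_homogeneity):
    """Apply both filtering thresholds."""
    return node.size >= min_size and node.homogeneity >= min_homogeneity


def top_down(node, min_size, min_homogeneity):
    """Most general qualifying clusters: stop descending at the first match."""
    if qualifies(node, min_size, min_homogeneity):
        return [node.name]
    selected = []
    for child in node.children:
        selected += top_down(child, min_size, min_homogeneity)
    return selected


def bottom_up(node, min_size, min_homogeneity):
    """Most specific qualifying clusters: keep a node only if no descendant matched."""
    selected = []
    for child in node.children:
        selected += bottom_up(child, min_size, min_homogeneity)
    if not selected and qualifies(node, min_size, min_homogeneity):
        selected = [node.name]
    return selected


if __name__ == "__main__":
    tree = Node("root", 20, 0.30, [
        Node("A", 12, 0.85, [Node("A1", 7, 0.95), Node("A2", 5, 0.90)]),
        Node("B", 8, 0.50, [Node("B1", 6, 0.92), Node("B2", 2, 1.00)]),
    ])
    print(top_down(tree, 6, 0.8))   # ['A', 'B1']
    print(bottom_up(tree, 6, 0.8))  # ['A1', 'B1']
```

In this example both strategies agree on B1, while for the A branch top-down returns the broad class A and bottom-up the narrower A1.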

20 Bottom-up approach

21 Top-down approach

22 Dendrogram approach

23 Case study: The Dataset. 1000 tweets containing the hashtag #sensor were retrieved. Then, for each hashtag found in those 1000 tweets, we again extracted, if possible, 100 tweets. In this way, hashtagged tweets with unique hashtags were collected.

24 Analysis of the set of tweets: Clustering. Clustering based on co-occurrence frequency. Clustering based on semantic similarity.

25 Threshold mHT (minimum number of hashtags in one cluster): –For co-occurrence: values ranging from 5 to 45 in interval of 5. –For semantic: values ranging from 5 to 50 in interval of 5. Threshold mHG (minimum inter-class homogeneity in one cluster): –For co-occurrence: values ranging from 0.1 to 0.65 in interval of –For semantic: values ranging from 0.3 to 0.95 in interval of 0.05

26 Analysis of the set of tweets. Analysis 1: Total number of hashtags selected by the system. [Figure: total hashtags found by the Bottom-Up (left) and Top-Down (right) topic selection, based on semantic similarity clustering.]

27 Analysis of the set of tweets. Analysis 1: Total number of hashtags selected by the system. [Figure: total hashtags found by the Bottom-Up (left) and Top-Down (right) topic selection, based on co-occurrence clustering.]

28 Analysis of the set of tweets. Analysis 2: Total number of clusters selected by the system. [Figure: total clusters found by the Bottom-Up (left) and Top-Down (right) topic selection, based on semantic similarity clustering.]

29 Analysis of the set of tweets. Analysis 2: Total number of clusters selected by the system. [Figure: total clusters found by the Bottom-Up (left) and Top-Down (right) topic selection, based on co-occurrence clustering.]

30 Observations: The clustering based on semantic similarity can extract more hashtags and clusters when we demand high homogeneity and a high number of hashtags.

31 Result: Semantic Clustering (Bottom-Up). Minimum hashtags 6, minimum inter-class homogeneity 0.9

32 Result: Semantic Clustering (Top-Down). Minimum hashtags 6, minimum inter-class homogeneity 0.9

33 Result: Syntactic Clustering (Bottom-Up). Minimum hashtags 6, minimum inter-class homogeneity 0.8

34 Result: Syntactic Clustering (Top-Down). Minimum hashtags 6, minimum inter-class homogeneity 0.8

35 Observations. For semantic clustering: a general name can be set for most classes; the semantic centroid generated by the system is good; the most precise clustering is obtained with a higher “minimum homogeneity” and a lower “minimum number of hashtags”; the system can generate a general class with a large number of hashtags; for some clusters it is hard to set a name manually, but the system can still find a general semantic centroid. For co-occurrence clustering: a general name can be set for only a few classes; the semantic centroid generated by the system is not good; the system is not able to generate a general class with a large number of hashtags.

36 Dendrogram Result

37 Observations: In each branch of the tree, the semantic centroids go from general concepts to more specific ones. There are some long branches (e.g. entity, individual) that are not very illustrative.

38 Conclusion: Hierarchical clustering is applied to group all the similar hashtags. For the syntactic clustering, the co-occurrence matrix is normalized to calculate the similarity matrix. For the semantic hashtag clustering: WordNet is used; word-breaking is applied; words not found in WordNet are removed; the similarity matrix is calculated by applying the Wu-Palmer distance on WordNet and co-occurrence frequency.

39 Conclusion: Bottom-up selection of clusters aims to find the most specific classes that fulfill the selection criteria. Top-down selection of clusters aims to find the most general classes that fulfill the selection criteria. Dendrogram analysis of clusters aims to obtain a hierarchy of clusters that fulfill the selection criteria.

40 Conclusion: Regarding the case study: for the number of hashtags and the number of clusters, the clustering based on semantic similarity is better; for the topic selection approaches, the clustering based on semantic similarity is better; the automatic construction of a hashtag hierarchy based on semantic analysis produces a better result.

41 Future work: Apply "stemming" techniques. Use concepts from other knowledge structures, e.g. YAGO: Wikipedia (e.g., categories, redirects, infoboxes), WordNet (e.g., synsets, hyponymy), GeoNames. Give specific treatment to polysemic hashtags.

42 THANK YOU……