1 Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections Zhao-Yan Ming, Kai Wang and Tat-Seng Chua School of Computing,

Presentation transcript:

1 Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections. Zhao-Yan Ming, Kai Wang and Tat-Seng Chua, School of Computing, National University of Singapore. SIGIR 2010. Speaker: Tom Chao Zhou, Tuesday

2 Outline Motivation Prototype Hierarchy Based Clustering Problem Formulation and Approach Experiments

4 Motivation Utility of user-generated content. Quality: distinguishing good content from bad. Accessibility: question search. Organizing huge collections of data for information navigation: categorization, and hierarchical clustering with labels and descriptions of clusters.

5 Categorization Fine-grained hierarchies: users construct fine-grained topic hierarchies and assign objects to them, e.g. Open Directory Project and Wikipedia. Disadvantage: too much manual effort. Coarse-grained hierarchies: e.g. Yahoo! Answers' categories. Disadvantage: too coarse; for example, there is no “IPod” category.

6 Categorization Supervised techniques: not appropriate for dynamic Web services. Unsupervised techniques: clustering the collection into smaller groups, then extracting labels for the clustered groups.

7 Prototype Hierarchy based Clustering (PHC) Tackles the web collection categorization and navigation problem. PHC utilizes world knowledge in the form of prototype hierarchies, while adapting to the underlying topic structures of the collections.

8 Prototype Hierarchy based Clustering (PHC) Advantages: Eliminates the problem of determining the number of clusters and assigning initial clusters, by following the structure of the prototype hierarchy. Results are interpretable, comprehensive, and organized. Flexible forms of supervision: the prototype hierarchy can come in different levels of granularity.

9 Outline Motivation Prototype Hierarchy Based Clustering Problem Formulation and Approach Experiments

10 Prototype Hierarchy Based Clustering Prototype Hierarchy (PH): a hierarchy whose node set V represents a set of tuples (l, p), where p is a prototype serving as the description of concept l. Data Hierarchy (DH): a hierarchy that organizes a collection of objects d. Each node represents a category of objects CO.

11 Problem Formulation Given a collection D of objects on a topic τ, PHC partitions and maps D into the categories that are predefined by a PH on τ, such that the resulting object clusters CO1, CO2, ..., COk are organized in a DH with a similar structure.

12 Some PH nodes do not have any objects. Some questions have no appropriate category to be assigned to.

13 Requirements The data hierarchy evolves into a compact structure encoding the underlying topics of the collection. The data and prototype hierarchies are matched at both the node and relation levels. Distances between objects are measured by appropriate metrics.

14 Outline Motivation Prototype Hierarchy Based Clustering Problem Formulation and Approach Experiments

15 Problem Formulation and Approach Hierarchy Metric and Information Function. A hierarchy metric is a function that operates on all nodes: h: V×V → R+ (adjacent pair of nodes). The quality of the structure is measured by the amount of information carried in H.

16 Minimum Evolution Minimum Evolution (obj1) Intuition: the DH that compactly “encodes” the collection into topic categories is the best. Monitor the structural evolution of the data hierarchy. The optimal DH on a collection is the one that contains the least information.
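
The minimum evolution idea can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes that the information carried in a hierarchy is the sum of the metric h over adjacent (parent, child) pairs, which the slides do not state explicitly. The names `information`, `insert_min_evolution` and the toy one-dimensional `positions` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Tree = Dict[str, List[str]]  # parent name -> list of child names


def information(tree: Tree, h: Callable[[str, str], float]) -> float:
    """ASSUMPTION: information in H = sum of h over adjacent (parent, child) pairs."""
    return sum(h(parent, child) for parent, kids in tree.items() for child in kids)


def insert_min_evolution(
    tree: Tree, new_node: str, h: Callable[[str, str], float]
) -> Tuple[Tree, float]:
    """Attach new_node under the existing node that yields the least total information."""
    nodes = set(tree) | {c for kids in tree.values() for c in kids}
    best_tree: Tree = {}
    best_info = float("inf")
    for parent in sorted(nodes):
        # Copy the tree so each candidate placement is evaluated independently.
        candidate = {p: list(kids) for p, kids in tree.items()}
        candidate.setdefault(parent, []).append(new_node)
        info = information(candidate, h)
        if info < best_info:
            best_tree, best_info = candidate, info
    return best_tree, best_info


# Toy example: nodes live on a line and h is the absolute distance between them.
positions = {"root": 0.0, "a": 1.0, "b": 5.0, "x": 4.5}
h = lambda u, v: abs(positions[u] - positions[v])
best_tree, best_info = insert_min_evolution({"root": ["a", "b"]}, "x", h)
```

In this toy setting the new node `x` ends up under `b`, because that placement adds the least information to the hierarchy.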

17 Matching of Prototype and Data Hierarchies Data Hierarchy Centroid: centroids of DH nodes are generated in an incremental manner. A new object in a leaf node automatically becomes a member of its ancestor nodes. The magnitude of the change decreases with the number of levels from the leaf node.
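
A minimal sketch of such an incremental centroid update follows. It assumes that each node keeps the running mean of all objects below it, so the update shrinks naturally at higher levels because ancestors hold more objects. The slides do not give the exact update rule, and the `Node` class and `add_object` function are hypothetical names.

```python
from typing import List, Optional


class Node:
    """A data hierarchy node holding a running-mean centroid (illustrative only)."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.count = 0
        self.centroid: List[float] = []


def add_object(leaf: Node, vec: List[float]) -> None:
    """Add vec to a leaf and propagate the running-mean update to every ancestor."""
    node: Optional[Node] = leaf
    while node is not None:
        if node.count == 0:
            node.centroid = list(vec)
        else:
            n = node.count
            # Running mean: the step size 1 / (n + 1) shrinks as the node grows.
            node.centroid = [c + (x - c) / (n + 1) for c, x in zip(node.centroid, vec)]
        node.count += 1
        node = node.parent


root = Node("root")
a, b = Node("a", root), Node("b", root)
add_object(a, [0.0, 0.0])
add_object(a, [2.0, 0.0])
add_object(b, [8.0, 0.0])
root_before = root.centroid[0]
add_object(b, [4.0, 0.0])  # leaf "b" moves by 2.0, the root by much less
root_shift = abs(root.centroid[0] - root_before)
```

After the last call, the centroid of leaf `b` moves from 8.0 to 6.0, while the root centroid moves only slightly, which matches the behaviour described on the slide.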

18 Prototype Centrality Prototype centrality (obj2) Intuition: add a data object to a node such that the updated centroids are most similar to their corresponding prototypes. A prototype should be located at the center of an object cluster.
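
The following Python sketch shows one way to read this objective. It makes two assumptions that are not in the slides: similarity is measured by cosine similarity, and only the candidate node itself is scored (ancestor centroids are ignored for brevity). The function names and the toy vectors are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def updated_centroid(centroid: List[float], count: int, vec: List[float]) -> List[float]:
    """Running-mean centroid after adding vec to a node that holds `count` objects."""
    if count == 0:
        return list(vec)
    return [c + (x - c) / (count + 1) for c, x in zip(centroid, vec)]


def most_central_node(
    nodes: Dict[str, Tuple[List[float], List[float], int]], vec: List[float]
) -> str:
    """Pick the node whose updated centroid stays closest to its prototype.

    `nodes` maps a node name to (prototype vector, current centroid, object count).
    """
    def score(name: str) -> float:
        prototype, centroid, count = nodes[name]
        return cosine(updated_centroid(centroid, count, vec), prototype)

    return max(nodes, key=score)


nodes = {
    "battery": ([1.0, 0.0], [0.9, 0.1], 3),
    "screen": ([0.0, 1.0], [0.2, 0.8], 2),
}
chosen = most_central_node(nodes, [0.8, 0.2])
```

Here the object is assigned to `battery`, because adding it to `screen` would pull that centroid away from its prototype.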

19 Prototype-Data Hierarchy Resemblance Matching between two hierarchies H1 and H2. Full match: V1 = V2 and R1 = R2. Partial match: Common hierarchy: the matched nodes and relations. Incomplete match: V1 + Vin = V2, R1 + Rin = R2. Excess match: V1 = V2 + Vin, R1 = R2 + Rin.
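
These match types map directly onto set comparisons. The sketch below represents each hierarchy as a set of node names V and a set of (parent, child) relations R; this representation, the function names, and the label "other partial" for hierarchies where neither contains the other are assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

Relation = Tuple[str, str]  # (parent, child)


def match_type(
    v1: Iterable[str], r1: Iterable[Relation],
    v2: Iterable[str], r2: Iterable[Relation],
) -> str:
    """Classify how hierarchy H1 = (V1, R1) matches H2 = (V2, R2)."""
    sv1, sv2 = set(v1), set(v2)
    sr1, sr2 = set(r1), set(r2)
    if sv1 == sv2 and sr1 == sr2:
        return "full"
    if sv1 <= sv2 and sr1 <= sr2:
        return "incomplete"  # V1 + Vin = V2, R1 + Rin = R2
    if sv1 >= sv2 and sr1 >= sr2:
        return "excess"  # V1 = V2 + Vin, R1 = R2 + Rin
    return "other partial"


def common_hierarchy(
    v1: Iterable[str], r1: Iterable[Relation],
    v2: Iterable[str], r2: Iterable[Relation],
) -> Tuple[Set[str], Set[Relation]]:
    """Matched nodes and relations shared by both hierarchies."""
    return set(v1) & set(v2), set(r1) & set(r2)


ph_nodes = {"ipod", "battery"}
ph_rels = {("ipod", "battery")}
dh_nodes = {"ipod", "battery", "screen"}
dh_rels = {("ipod", "battery"), ("ipod", "screen")}

kind = match_type(ph_nodes, ph_rels, dh_nodes, dh_rels)
shared = common_hierarchy(ph_nodes, ph_rels, dh_nodes, dh_rels)
```

In this example the prototype hierarchy lacks the `screen` node that appears in the data hierarchy, so it is classified as an incomplete match.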

20

21 Prototype-Data Hierarchy Resemblance Prototype-Data Hierarchy Resemblance (obj3) Common part of the data hierarchy and the prototype hierarchy.

22 Partially Matched Prototype Hierarchy PH is an incomplete match of DH: add dummy child nodes to the existing nodes in PH, and employ label extraction algorithms. PH is an excess match of DH: empty nodes will be removed.
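
Both repairs can be sketched on a simple tree representation. The code below is only an illustration: the dictionary-based tree, the function names `prune_empty` and `add_dummy_child`, and the placeholder label `"<new>"` are hypothetical, and label extraction for the dummy node is left out.

```python
from typing import Dict, List

Tree = Dict[str, List[str]]  # parent name -> list of child names


def prune_empty(children: Tree, objects: Dict[str, List[str]], root: str) -> Tree:
    """Excess match: drop nodes that hold no objects, directly or via descendants."""
    pruned: Tree = {}

    def keep(node: str) -> bool:
        kept = [c for c in children.get(node, []) if keep(c)]
        if kept:
            pruned[node] = kept
        return bool(objects.get(node)) or bool(kept)

    keep(root)
    return pruned


def add_dummy_child(children: Tree, parent: str, label: str = "<new>") -> Tree:
    """Incomplete match: give `parent` a dummy child to collect unmatched objects."""
    extended = {p: list(kids) for p, kids in children.items()}
    extended.setdefault(parent, []).append(label)
    return extended


ph = {"ipod": ["battery", "screen", "dock"]}
objects = {"battery": ["q1"], "dock": ["q2", "q3"]}  # no question fits "screen"

pruned = prune_empty(ph, objects, "ipod")
extended = add_dummy_child(ph, "ipod")
```

Both functions return new trees and leave the input prototype hierarchy unchanged, so the two repairs can be tried independently.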

23 Object Metric M(di, dj) is defined as the similarity between a pair of objects di and dj within a node. Translation-based Language Model: semantic similarity. Syntactic Tree Kernel Matching: syntactic similarity.

24 Category Cohesiveness Category Cohesiveness (obj4) Objects in the same category are similar to each other. Objects in different categories are dissimilar to each other.
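
One simple way to quantify this objective is the mean within-category similarity minus the mean between-category similarity. This particular formula is an assumption for illustration and is not taken from the slides; the similarity function `sim` and the one-dimensional toy objects are also hypothetical.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def cohesiveness(
    clusters: Sequence[Sequence[float]], sim: Callable[[float, float], float]
) -> float:
    """Mean intra-category similarity minus mean inter-category similarity."""
    intra: List[float] = [
        sim(a, b) for cluster in clusters for a, b in combinations(cluster, 2)
    ]
    inter: List[float] = [
        sim(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a in c1
        for b in c2
    ]

    def mean(values: List[float]) -> float:
        return sum(values) / len(values) if values else 0.0

    return mean(intra) - mean(inter)


# Toy similarity on numbers: close values are similar, distant values are not.
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))

good = [[1.0, 1.2], [5.0, 5.1]]  # similar objects grouped together
bad = [[1.0, 5.0], [1.2, 5.1]]   # similar objects split apart
good_score = cohesiveness(good, sim)
bad_score = cohesiveness(bad, sim)
```

The grouping that keeps similar objects together receives the higher score, which is the behaviour the objective asks for.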

25 Multi-Criterion Optimization Function Minimum evolution. Prototype centrality. Prototype-Data Hierarchy Resemblance. Category cohesiveness.
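
The slide lists the four objectives but not how they are combined. As a rough sketch only, the code below assumes a weighted linear combination in which every objective is oriented so that higher is better (minimum evolution enters as the negated information increase). The weights, the scores, and the function names are all hypothetical.

```python
from typing import Dict


def combined_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """ASSUMPTION: objectives are combined as a weighted sum (higher is better)."""
    return sum(weights[name] * scores[name] for name in weights)


def best_placement(
    candidates: Dict[str, Dict[str, float]], weights: Dict[str, float]
) -> str:
    """Return the candidate node with the highest combined score."""
    return max(candidates, key=lambda node: combined_score(candidates[node], weights))


weights = {"obj1": 1.0, "obj2": 1.0, "obj3": 1.0, "obj4": 1.0}

# Made-up per-objective scores for placing one object under two candidate nodes.
# obj1: negated information increase, obj2: prototype centrality,
# obj3: hierarchy resemblance, obj4: category cohesiveness.
candidates = {
    "battery": {"obj1": -0.5, "obj2": 0.9, "obj3": 1.0, "obj4": 0.7},
    "screen": {"obj1": -0.2, "obj2": 0.4, "obj3": 1.0, "obj4": 0.3},
}
best = best_placement(candidates, weights)
```

With equal weights the object goes to `battery`; changing the weights shifts the balance between following the prototype hierarchy and adapting to the data.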

26 Outline Motivation Prototype Hierarchy Based Clustering Problem Formulation and Approach Experiments

27 Datasets Hierarchy: Dental: Wikipedia. IPod: manually constructed by combining Wikipedia article, WordNet, and product spec. Dataset diversity: CS: deep hierarchy; hierarchies are noisy. RS: broad hierarchy, abstract domain; hierarchies are noisy. IPod: concrete domain. Dental: hierarchy is well constructed.

28 Experimental Setting proKmeans: prototype hierarchy enhanced K-means divisive hierarchical clustering. LiveClassifier. PHC. CFC Classifier: supervised text categorization technique.

29 Given a prototype hierarchy for a collection, even a simple method can categorize the collection reasonably well. PHC is superior in terms of utilizing the prototype hierarchy. Comparable with the supervised method. PHC introduces new nodes into the predefined hierarchy. PHC works better in concrete domains than in abstract domains.

30 Ablation Study on Optimization Objectives Prototype Centrality (obj2). Category Cohesiveness (obj4). Prototype-Data Hierarchy Resemblance (obj3). Minimum Evolution (obj1). Without minimum evolution, the data hierarchy varies less from the prototype hierarchy (new nodes are created with minimum evolution). The minimum evolution objective leads to a self-contained data hierarchy.

31 Robustness with Mismatched Prototype Hierarchy PHC is robust against overfitted prototype hierarchies. PHC has only limited ability to create categories.