
1 Prototype Hierarchy Based Clustering for the Categorization and Navigation of Web Collections. Zhao-Yan Ming, Kai Wang and Tat-Seng Chua, School of Computing, National University of Singapore. SIGIR 2010. Speaker: Tom Chao Zhou. 2010.10.26, Tuesday.

2 Outline: Motivation; Prototype Hierarchy Based Clustering; Problem Formulation and Approach; Experiments.

3 Outline: Motivation; Prototype Hierarchy Based Clustering; Problem Formulation and Approach; Experiments.

4 Motivation: Utility of user-generated content. Quality: distinguishing good content from bad. Accessibility: question search. Organizing huge collections of data for information navigation: categorization, and hierarchical clustering with labels and descriptions of clusters.

5 Categorization: Users construct fine-grained topic hierarchies and assign objects to them, as in the Open Directory Project and Wikipedia. Disadvantage: too much manual effort. Coarse-grained hierarchies, such as Yahoo! Answers’ categories. Disadvantage: too coarse, for example there is no “IPod” category.

6 Categorization: Supervised techniques: not appropriate for dynamic Web services. Unsupervised techniques: clustering the collections into smaller groups, then extracting labels for the clustered groups.

7 Prototype Hierarchy based Clustering (PHC): Tackles the web collection categorization and navigation problem. PHC utilizes world knowledge in the form of prototype hierarchies while adapting to the underlying topic structures of the collections.

8 Prototype Hierarchy based Clustering (PHC), Advantages: Following the structure of the prototype hierarchy eliminates the problem of determining the number of clusters and assigning initial clusters. Results are interpretable, comprehensive, and organized. Flexible forms of supervision: the prototype hierarchy can come in different levels of granularity.

9 Outline: Motivation; Prototype Hierarchy Based Clustering; Problem Formulation and Approach; Experiments.

10 Prototype Hierarchy Based Clustering: Prototype Hierarchy (PH): a hierarchy whose node set V represents a set of tuples, each pairing a concept l with a prototype p that serves as the description of l. Data Hierarchy (DH): a hierarchy that organizes a collection of objects d; each node represents a category of objects CO.
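
The two hierarchies defined above can be sketched as plain data structures. This is only an illustration: the class and field names (`PrototypeNode`, `DataNode`, `all_objects`) are hypothetical and do not come from the slides, and objects are represented as plain strings for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class PrototypeNode:
    """One PH node: a concept label l and its prototype p (assumed to be text)."""
    label: str
    prototype: str
    children: list["PrototypeNode"] = field(default_factory=list)


@dataclass
class DataNode:
    """One DH node: a category of objects CO (objects are strings here)."""
    label: str
    objects: list[str] = field(default_factory=list)
    children: list["DataNode"] = field(default_factory=list)

    def all_objects(self) -> list[str]:
        """Return the objects of this node followed by those of its descendants."""
        collected = list(self.objects)
        for child in self.children:
            collected.extend(child.all_objects())
        return collected
```

A DH node built this way reports the objects of its whole subtree, which mirrors the idea that a category contains everything filed under its subcategories.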

11 Problem Formulation: Given a collection D of objects on a topic τ, PHC partitions and maps D into the categories predefined by a PH on τ, such that the resulting object clusters CO1, CO2, ..., COk are organized in a DH with a similar structure.

12 Some PH nodes have no objects. Some questions have no appropriate category to be assigned to.

13 Requirements: The data hierarchy evolves into a compact structure encoding the underlying topics of the collection. The data and prototype hierarchies are matched at both the node and the relation level. Distances between objects are measured by appropriate metrics.

14 Outline: Motivation; Prototype Hierarchy Based Clustering; Problem Formulation and Approach; Experiments.

15 Problem Formulation and Approach: Hierarchy Metric and Information Function. A hierarchy metric is a function that operates on all nodes, h: V×V → R+, defined for an adjacent pair of nodes. The quality of the structure is measured by the amount of information carried in H.

16 Minimum Evolution (obj1): Intuition: the DH that compactly “encodes” the collection into topic categories is the best. Monitor the structural evolution of the data hierarchy. The optimal DH on a collection is the one that contains the least information.
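
A minimal sketch of the "least information" idea follows. The slides do not give the information function, so this sketch assumes it is the sum of the hierarchy metric h over all adjacent (parent, child) pairs; the function names, the toy metric, and the node positions are all hypothetical.

```python
def hierarchy_information(edges, h):
    """Sum the metric h over every adjacent (parent, child) pair (assumed form)."""
    return sum(h(parent, child) for parent, child in edges)


def least_information(candidates, h):
    """Return the name of the candidate hierarchy carrying the least information."""
    return min(candidates, key=lambda name: hierarchy_information(candidates[name], h))


# Toy metric (hypothetical): distance between made-up 1-D node positions.
position = {"ipod": 0.0, "battery": 1.0, "charger": 1.5}


def h(u, v):
    return abs(position[u] - position[v])


candidates = {
    "flat": [("ipod", "battery"), ("ipod", "charger")],       # 1.0 + 1.5
    "nested": [("ipod", "battery"), ("battery", "charger")],  # 1.0 + 0.5
}
best = least_information(candidates, h)
```

Under this toy metric the nested structure is preferred because placing "charger" under "battery" adds a shorter edge than attaching it directly to the root.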

17 Matching of Prototype-Data Hierarchy: Data Hierarchy Centroid. Centroids of DH nodes are generated incrementally. A new object in a leaf node automatically becomes a member of its ancestor nodes. The magnitude of the change decreases with the number of levels from the leaf node.
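
The incremental centroid update can be sketched with a running mean that is propagated from the leaf to the root. This is an assumption about the mechanics, not the authors' formula: the `Node` class and `add_object` function are hypothetical, and objects are assumed to be fixed-length numeric vectors.

```python
class Node:
    """A DH node holding a running-mean centroid of all objects in its subtree."""

    def __init__(self, dim, parent=None):
        self.parent = parent
        self.count = 0
        self.centroid = [0.0] * dim


def add_object(leaf, vector):
    """Add an object to a leaf and update the centroids of the leaf and its ancestors."""
    node = leaf
    while node is not None:
        node.count += 1
        # Running-mean update: the shift is (x - c) / count, so nodes that
        # already hold many objects (typically ancestors) move less.
        node.centroid = [c + (x - c) / node.count for c, x in zip(node.centroid, vector)]
        node = node.parent


root = Node(dim=2)
leaf_a = Node(dim=2, parent=root)
leaf_b = Node(dim=2, parent=root)
add_object(leaf_a, [1.0, 0.0])
add_object(leaf_b, [0.0, 1.0])
```

With a running mean, the decreasing magnitude of change toward the root falls out naturally, since each ancestor averages over at least as many objects as the leaf.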

18 Prototype Centrality (obj2): Intuition: add a data object to the node for which the updated centroids are most similar to their corresponding prototypes. A prototype is located at the center of an object cluster.
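
The intuition above can be sketched as follows, assuming prototypes, centroids, and objects are all vectors in the same space and that similarity is cosine similarity. Both are assumptions, and the sketch only scores the receiving node, not its ancestors. The names `cosine`, `updated_centroid`, and `best_node` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def updated_centroid(centroid, count, vector):
    """Centroid of a node after adding one more object (running mean)."""
    return [c + (x - c) / (count + 1) for c, x in zip(centroid, vector)]


def best_node(nodes, vector):
    """Pick the node whose updated centroid is most similar to its prototype.

    nodes maps a node name to a (prototype, centroid, count) triple.
    """
    def score(name):
        prototype, centroid, count = nodes[name]
        return cosine(updated_centroid(centroid, count, vector), prototype)

    return max(nodes, key=score)


nodes = {
    "battery": ([1.0, 0.0], [1.0, 0.0], 3),
    "screen": ([0.0, 1.0], [0.0, 1.0], 3),
}
choice = best_node(nodes, [0.9, 0.1])
```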

19 Prototype-Data Hierarchy Resemblance: Matching between two hierarchies H1 and H2. Full match: V1 = V2 and R1 = R2. Partial match: the common hierarchy consists of the matched nodes and relations. Incomplete match: V1 + Vin = V2, R1 + Rin = R2. Excess match: V1 = V2 + Vin, R1 = R2 + Rin.
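
The match types can be illustrated with set operations, treating V as a set of node labels and R as a set of (parent, child) pairs. This is a sketch under that assumption; the function names are hypothetical, and "partial" is used here for any match that is neither full, incomplete, nor excess.

```python
def classify_match(v1, r1, v2, r2):
    """Classify how hierarchy H1 = (v1, r1) matches hierarchy H2 = (v2, r2)."""
    if v1 == v2 and r1 == r2:
        return "full"
    if v1 <= v2 and r1 <= r2:
        return "incomplete"  # H1 lacks nodes or relations that H2 has
    if v1 >= v2 and r1 >= r2:
        return "excess"      # H1 has nodes or relations that H2 lacks
    return "partial"


def common_hierarchy(v1, r1, v2, r2):
    """Matched nodes and relations shared by both hierarchies."""
    return v1 & v2, r1 & r2


v_small = {"ipod", "battery"}
r_small = {("ipod", "battery")}
v_large = {"ipod", "battery", "screen"}
r_large = {("ipod", "battery"), ("ipod", "screen")}

kind = classify_match(v_small, r_small, v_large, r_large)
```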

21 Prototype-Data Hierarchy Resemblance (obj3): the common part of the data hierarchy and the prototype hierarchy.

22 Partially Matched Prototype Hierarchy: When PH is an incomplete match of DH, add dummy child nodes to the existing nodes in PH and employ label extraction algorithms. When PH is an excess match of DH, empty nodes are removed.
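
Both repairs can be sketched on a small dictionary-based tree. This is an illustration only: the tree layout, the function names, and the placeholder label "(unlabeled)" are hypothetical, and the label extraction step itself is left out.

```python
def add_dummy_child(node, objects, label="(unlabeled)"):
    """Attach a dummy child holding objects that fit no existing child node.

    A label extraction algorithm would later replace the placeholder label.
    """
    child = {"label": label, "objects": list(objects), "children": []}
    node["children"].append(child)
    return child


def prune_empty(node):
    """Remove nodes that hold no objects and have no surviving children."""
    kept = []
    for child in node["children"]:
        pruned = prune_empty(child)
        if pruned is not None:
            kept.append(pruned)
    node["children"] = kept
    if not node["objects"] and not kept:
        return None
    return node


tree = {
    "label": "ipod",
    "objects": [],
    "children": [
        {"label": "battery", "objects": ["q1"], "children": []},
        {"label": "radio", "objects": [], "children": []},  # no objects assigned
    ],
}
add_dummy_child(tree, ["q2"])
tree = prune_empty(tree)
labels = [child["label"] for child in tree["children"]]
```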

23 Object Metric: M(di, dj) is defined as the similarity between a pair of objects di and dj within a node. Semantic: Translation-based Language Model. Syntactic: Syntactic Tree Kernel Matching.

24 Category Cohesiveness (obj4): Objects in the same category are similar to each other. Objects in different categories are dissimilar to each other.
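
One simple way to turn this criterion into a number is to subtract the mean between-category similarity from the mean within-category similarity. This is not the formula from the talk: the scoring rule, the function names, and the word-overlap (Jaccard) similarity used in place of the object metric M are all assumptions for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Toy similarity: word overlap between two texts (stand-in for M)."""
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def mean(values):
    return sum(values) / len(values) if values else 0.0


def cohesiveness(categories, similarity):
    """Mean within-category similarity minus mean between-category similarity."""
    intra = [similarity(a, b)
             for objs in categories.values()
             for a, b in combinations(objs, 2)]
    inter = [similarity(a, b)
             for (_, xs), (_, ys) in combinations(categories.items(), 2)
             for a in xs for b in ys]
    return mean(intra) - mean(inter)


good = {
    "battery": ["battery life short", "battery life long"],
    "screen": ["screen cracked badly"],
}
bad = {
    "battery": ["battery life short"],
    "screen": ["screen cracked badly", "battery life long"],
}
good_score = cohesiveness(good, jaccard)
bad_score = cohesiveness(bad, jaccard)
```

A higher score means objects sit closer to their own category than to others, so the assignment in `good` scores above the one in `bad`.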

25 Multi-Criterion Optimization Function: Minimum evolution. Prototype centrality. Prototype-data hierarchy resemblance. Category cohesiveness.
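
The slide lists the four objectives but not how they are combined. Purely for illustration, the sketch below assumes a weighted sum in which every objective is expressed so that larger is better (minimum evolution enters as a negated information increase). The weights, the scores, and the function names are all made up.

```python
def combined_score(scores, weights):
    """Weighted sum of per-objective scores (assumed combination rule)."""
    return sum(weights[name] * scores[name] for name in weights)


def choose_placement(candidates, weights):
    """Return the candidate placement with the highest combined score."""
    return max(candidates, key=lambda name: combined_score(candidates[name], weights))


# Hypothetical equal weights over the four objectives.
weights = {
    "min_evolution": 0.25,
    "centrality": 0.25,
    "resemblance": 0.25,
    "cohesiveness": 0.25,
}

# Made-up scores for placing one object into an existing node or a new node.
candidates = {
    "battery": {"min_evolution": -0.2, "centrality": 0.9,
                "resemblance": 1.0, "cohesiveness": 0.5},
    "new_node": {"min_evolution": -0.6, "centrality": 0.4,
                 "resemblance": 0.5, "cohesiveness": 0.7},
}
placement = choose_placement(candidates, weights)
```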

26 Outline: Motivation; Prototype Hierarchy Based Clustering; Problem Formulation and Approach; Experiments.

27 Datasets: Hierarchy. Dental: Wikipedia. IPod: manually constructed by combining a Wikipedia article, WordNet, and product specifications. Dataset diversity. CS: deep hierarchy; hierarchies are noisy. RS: broad hierarchy, abstract domain; hierarchies are noisy. IPod: concrete domain. Dental: hierarchy is well constructed.

28 Experimental Setting: proKmeans: prototype hierarchy enhanced K-means divisive hierarchical clustering. LiveClassifier. PHC. CFC Classifier: a supervised text categorization technique.

29 Given a prototype hierarchy for a collection, even a simple method can categorize the collection reasonably well. PHC is superior in terms of utilizing the prototype hierarchy and is comparable with the supervised method. PHC introduces new nodes into the predefined hierarchy. PHC works better in concrete domains than in abstract domains.

30 Ablation Study on Optimization Objectives: Prototype Centrality (obj2), Category Cohesiveness (obj4), Prototype-Data Hierarchy Resemblance (obj3), Minimum Evolution (obj1). Without minimum evolution, the data hierarchy varies less from the prototype hierarchy (new nodes are created with minimum evolution). The minimum evolution objective leads to a self-contained data hierarchy.

31 Robustness with Mismatched Prototype Hierarchy: PHC is robust against overfitted prototype hierarchies. PHC has only limited ability to create categories.

