Hierarchical Document Clustering using Frequent Itemsets Benjamin C. M. Fung, Ke Wang, Martin Ester SDM 2003 Presentation Serhiy Polyakov DSCI 5240 Fall.

Presentation transcript:

Hierarchical Document Clustering using Frequent Itemsets
Benjamin C. M. Fung, Ke Wang, Martin Ester
SDM 2003
Presentation: Serhiy Polyakov, DSCI 5240, Fall 2005

Introduction
Applications of document clustering:
- web mining
- search engines
- information retrieval
- topological analysis
Special requirements for document clustering:
- high dimensionality
- high volume of data
- ease of browsing
- meaningful cluster labels

Problem statement
Problems with some standard clustering techniques:
- the number of clusters is unknown
- the size of the clusters varies greatly
Suggested approach, Frequent Itemset-based Hierarchical Clustering (FIHC):
- reduced dimensionality
- high clustering accuracy
- number of clusters as an optional input parameter
- easy to browse, with meaningful cluster descriptions

Algorithm
FIHC preprocessing steps:
- stop word removal
- stemming on the document set
- each document is represented by a vector of the frequencies of the remaining items within the document
FIHC two main steps:
- Constructing Initial Clusters: for each global frequent itemset, construct an initial cluster containing all the documents that contain that itemset (a document may fall into several initial clusters)
- Making Clusters Disjoint: after this step, each document belongs to exactly one cluster
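The two main steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy documents, and the brute-force itemset enumeration are assumptions, and the score used to pick one cluster per document (the sum of the document's frequencies for the label items) is a simple stand-in, since the slide does not state how the best cluster is chosen.

```python
from itertools import combinations

def global_frequent_itemsets(docs, min_support, max_size=2):
    """Brute-force search for itemsets contained in at least
    min_support (a fraction) of the documents. Toy-sized inputs only."""
    items = sorted({w for d in docs.values() for w in d})
    frequent = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            count = sum(1 for d in docs.values() if itemset.issubset(d))
            if count / len(docs) >= min_support:
                frequent.append(itemset)
    return frequent

def initial_clusters(docs, itemsets):
    """One (possibly overlapping) cluster per global frequent itemset."""
    return {s: {doc_id for doc_id, d in docs.items() if s.issubset(d)}
            for s in itemsets}

def make_disjoint(docs, clusters):
    """Keep each document in a single cluster.
    ASSUMPTION: the score below is a placeholder, not the paper's score."""
    assignment = {}
    for doc_id, d in docs.items():
        candidates = [s for s, members in clusters.items() if doc_id in members]
        if not candidates:
            continue  # the document contains no global frequent itemset
        assignment[doc_id] = max(
            candidates, key=lambda s: (sum(d[w] for w in s), sorted(s)))
    return assignment

# Hypothetical documents: term -> frequency after stop word removal and stemming.
docs = {
    "d1": {"flow": 3, "layer": 2},
    "d2": {"flow": 1, "layer": 4, "patient": 1},
    "d3": {"patient": 5, "treatment": 2},
    "d4": {"patient": 2, "treatment": 3, "flow": 1},
}
itemsets = global_frequent_itemsets(docs, min_support=0.5)
clusters = initial_clusters(docs, itemsets)
assignment = make_disjoint(docs, clusters)
```

In this toy run, d1 and d2 end up under the label {flow, layer} and d3 and d4 under {patient, treatment}; every document still contains all the items of its cluster label.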

Example of disjoint clusters
The cluster label is a set of mandatory items: every document in the cluster must contain all the items in the cluster label.

Building the Cluster Tree
The set of clusters produced by the previous stage can be viewed as a set of topics and subtopics in the document set.
A cluster (topic) tree is constructed based on the similarity among clusters.
The topic of a parent cluster is more general than the topic of a child cluster, and the two are "similar" to a certain degree.
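One way to picture the tree construction is the sketch below. It rests on assumptions that the slide does not spell out: candidate parents are taken to be clusters whose label is the child's label with one item removed (so the parent topic is more general), and `toy_similarity` with its `total_freq` table is a hypothetical stand-in for whatever similarity measure is actually used.

```python
def build_cluster_tree(labels, similarity):
    """Map each cluster label to its parent label (None means the root).
    ASSUMPTION: candidate parents are labels one item shorter than the child."""
    label_set = set(labels)
    parent = {}
    for label in labels:
        candidates = [label - {item} for item in label
                      if (label - {item}) in label_set]
        if len(label) == 1 or not candidates:
            parent[label] = None  # attach directly under the root
        else:
            parent[label] = max(
                candidates, key=lambda p: (similarity(p, label), sorted(p)))
    return parent

# Hypothetical term weights used only to break ties between candidate parents.
total_freq = {"flow": 5, "layer": 6, "patient": 8, "treatment": 5}

def toy_similarity(parent_label, child_label):
    """Placeholder similarity: total weight of the parent's label items."""
    return sum(total_freq[w] for w in parent_label)

labels = [
    frozenset({"flow"}),
    frozenset({"patient"}),
    frozenset({"flow", "layer"}),
    frozenset({"patient", "treatment"}),
    frozenset({"flow", "patient"}),
]
parent = build_cluster_tree(labels, toy_similarity)
```

Here {flow, layer} becomes a subtopic of {flow}, while {flow, patient} has two candidate parents and is placed under {patient} because the placeholder similarity ranks it higher.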

Tree Structure vs Browsing
- A deep hierarchy tree produced by other methods may not be suitable for browsing
- A flat hierarchy reduces the number of navigation steps, which in turn decreases the chance for a user to make mistakes
- If a hierarchy is too flat, a parent topic may contain too many subtopics, which increases the time and difficulty for the user to locate her target
- A balance between depth and width of the tree is essential for browsing
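The depth versus width trade-off can be made concrete by measuring both on a topic tree. The helper below is an illustrative sketch; the `children` mapping and the three toy trees are hypothetical and do not come from the paper.

```python
def tree_shape(children, root):
    """Return (depth, max_fanout) for a tree given as node -> list of children.
    depth counts navigation steps from the root to the deepest topic;
    max_fanout is the largest number of subtopics under a single topic."""
    depth, max_fanout = 0, 0
    stack = [(root, 0)]
    while stack:
        node, level = stack.pop()
        kids = children.get(node, [])
        depth = max(depth, level)
        max_fanout = max(max_fanout, len(kids))
        stack.extend((kid, level + 1) for kid in kids)
    return depth, max_fanout

# Three hypothetical hierarchies over the same four topics.
deep = {"root": ["a"], "a": ["b"], "b": ["c"], "c": ["d"]}
flat = {"root": ["a", "b", "c", "d"]}
balanced = {"root": ["a", "b"], "a": ["c"], "b": ["d"]}

deep_shape = tree_shape(deep, "root")          # many steps, one choice per step
flat_shape = tree_shape(flat, "root")          # one step, many choices at once
balanced_shape = tree_shape(balanced, "root")  # a compromise between the two
```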

Evaluation
- Clustering accuracy was evaluated in terms of F-measure
- The following aspects were also evaluated: sensitivity to parameters, efficiency, and scalability
- The following competitors were considered: UPGMA, bisecting k-means, and the frequent itemset-based algorithm HFTC
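The slide does not define the F-measure, so the sketch below uses the definition commonly applied to clustering evaluation, on the assumption that it matches the one in the paper: each natural class is matched with the cluster that gives it the best harmonic mean of precision and recall, and the per-class values are averaged with weights proportional to class size. The function names and the toy data are illustrative.

```python
def f_measure(natural_class, cluster):
    """Harmonic mean of precision and recall of one cluster for one class."""
    overlap = len(natural_class & cluster)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(natural_class)
    return 2 * precision * recall / (precision + recall)

def overall_f_measure(classes, clusters):
    """Size-weighted average of each class's best-matching cluster score."""
    total = sum(len(k) for k in classes)
    return sum(len(k) / total * max(f_measure(k, c) for c in clusters)
               for k in classes)

# Hypothetical ground truth (two classes) and a clustering of the same six documents.
classes = [{1, 2, 3}, {4, 5, 6}]
clusters = [{1, 2}, {3, 4, 5, 6}]
score = overall_f_measure(classes, clusters)
```

A clustering that reproduces the natural classes exactly scores 1.0; the toy clustering above scores lower because document 3 sits in the wrong cluster.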

Conclusion
The FIHC approach proposed in the paper outperforms its competitors (UPGMA, bisecting k-means, and HFTC) in terms of accuracy, efficiency, and scalability.