Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, Page(s): 1217-1229, 2008.

Presentation transcript:

1 Efficient Phrase-Based Document Similarity for Clustering. IEEE Transactions on Knowledge and Data Engineering, Vol. 20, No. 9, Page(s): 1217-1229, 2008. Speaker: Wei-Cheng Wu. Date: 2008/10/23

2 Outline 1. Introduction 2. The Phrase-Based Document Similarity 3. Experimental Results 4. Conclusions

3 1.Introduction Clustering techniques are based on four concepts: data representation model, similarity measure, clustering model, and clustering algorithm. Two document models are considered: the Vector Space Document (VSD) model and the Suffix Tree Document (STD) model.

4 2.The Phrase-Based Document Similarity Standard Suffix Tree Document Model and STC Algorithm
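To make the STD model concrete, here is a minimal sketch of a word-level suffix structure shared by several documents. It is an uncompressed suffix trie (one word per edge), not the compact suffix tree that STC uses, so a phrase such as "cat ate" appears as a chain of two nodes rather than one. The three sample documents and the names `Node`, `build_suffix_trie`, and `base_clusters` are assumptions made for illustration.

```python
from collections import defaultdict


class Node:
    """One node of an uncompressed word-level suffix trie."""

    def __init__(self):
        self.children = {}                   # word -> Node
        self.doc_counts = defaultdict(int)   # doc id -> suffixes passing through


def build_suffix_trie(docs):
    """Insert every word-level suffix of every document into one shared trie."""
    root = Node()
    for doc_id, text in enumerate(docs):
        words = text.lower().split()
        for start in range(len(words)):
            node = root
            for word in words[start:]:
                node = node.children.setdefault(word, Node())
                node.doc_counts[doc_id] += 1
    return root


def base_clusters(root, min_docs=2):
    """Yield (phrase, doc ids) for every node shared by at least min_docs documents."""
    stack = [((), root)]
    while stack:
        phrase, node = stack.pop()
        for word, child in node.children.items():
            child_phrase = phrase + (word,)
            if len(child.doc_counts) >= min_docs:
                yield " ".join(child_phrase), sorted(child.doc_counts)
            stack.append((child_phrase, child))


# Hypothetical sample documents (not taken from the slides).
docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
root = build_suffix_trie(docs)
clusters = dict(base_clusters(root))
```

Each entry of `clusters` plays the role of a base cluster: a phrase together with the documents that contain it. In this sketch the number of first-level nodes (`len(root.children)`) equals the number of distinct words in the sample documents.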

5 2.The Phrase-Based Document Similarity EX:

6 The Phrase-Based Document Similarity Based on the STD Model 2.The Phrase-Based Document Similarity

7 Vector Space Document (VSD) model 2.The Phrase-Based Document Similarity (1) Example of Fig.1

8 2.The Phrase-Based Document Similarity (2) Let vectors (3)
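The idea of weighting features and comparing document vectors by cosine can be sketched as follows. This is an illustration only: it assumes a common log-scaled tf-idf weight, `(1 + log tf) * log(1 + N / df)`, which may differ from the exact formula on the slides, and the feature counts are hypothetical. The features may be single words (VSD style) or phrases taken from suffix tree nodes (phrase-based style).

```python
import math


def tfidf_vectors(doc_feature_counts):
    """Turn per-document feature frequencies into tf-idf weight vectors.

    doc_feature_counts: list of dicts mapping a feature (word or phrase)
    to its frequency in that document (all frequencies >= 1).
    """
    n_docs = len(doc_feature_counts)
    df = {}
    for counts in doc_feature_counts:
        for feature in counts:
            df[feature] = df.get(feature, 0) + 1
    vectors = []
    for counts in doc_feature_counts:
        # Assumed weighting scheme: (1 + log tf) * log(1 + N / df).
        vectors.append({
            f: (1 + math.log(tf)) * math.log(1 + n_docs / df[f])
            for f, tf in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[f] for f, w in u.items() if f in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical phrase frequencies for three documents.
counts = [
    {"cat ate": 1, "ate cheese": 1, "cheese": 1},
    {"ate cheese": 1, "cheese": 1, "too": 1, "mouse": 1},
    {"cat ate": 1, "mouse": 1, "too": 1},
]
vecs = tfidf_vectors(counts)
sim_01 = cosine(vecs[0], vecs[1])
```

Storing vectors as dictionaries keeps the sketch sparse: only features that actually occur in a document take part in the dot product.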

9 Properties of the STD Model 2.The Phrase-Based Document Similarity

10 2.The Phrase-Based Document Similarity

11 2.The Phrase-Based Document Similarity
Property 1. Each internal node of the suffix tree T represents an LCP of the document data set D, and each leaf node represents a suffix substring of a document in the data set D.
Property 2. Each first-level node in suffix tree T is labeled by a distinct phrase that appears at least once in the documents of data set D. The number of the first-level nodes is equal to the number of keywords (distinct single-word terms in the VSD model) in the data set D.
Property 3. Each phrase denoted by an internal node v at a higher level in suffix tree T contains at least two words. The length of the phrase (by words).

12 3.Experimental Results
Document collections: OHSUMED, RCV1, and 20-Newsgroups.
Algorithms compared: the original STC algorithm; GHAC (group-average HAC algorithm) with the phrase-based document similarity; GHAC with the traditional single-word tf-idf cosine similarity; and the K-NN clustering algorithm with the phrase-based document similarity.
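As a rough illustration of agglomerative clustering driven by a pairwise similarity matrix, the sketch below repeatedly merges the two clusters with the highest average pairwise similarity. It is a simplified average-linkage variant written for clarity, not the GHAC implementation used in the experiments; the function name `average_link_hac`, the stopping rule (a target number of clusters), and the sample matrix are assumptions.

```python
def average_link_hac(sim, n_clusters):
    """Agglomerative clustering with average linkage over a similarity matrix.

    sim: square list of lists, sim[i][j] is the similarity of documents i and j.
    n_clusters: stop merging when this many clusters remain (must be >= 1).
    """
    clusters = [[i] for i in range(len(sim))]

    def avg_sim(a, b):
        # Mean similarity over all cross-cluster document pairs.
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best[0]:
                    best = (s, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]   # merge y into x
        del clusters[y]
    return [sorted(c) for c in clusters]


# Hypothetical similarity matrix for four documents.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.15, 0.1],
    [0.1, 0.15, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
result = average_link_hac(sim, 2)
```

Any document similarity can be plugged in through `sim`, so the same routine works with either a phrase-based or a single-word cosine similarity matrix.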

13 3.Experimental Results

14 3.Experimental Results The evaluation compares a clustering of data set D of N documents against the "correct" class set of D. It uses the recall of cluster j with respect to class i and the precision of cluster j with respect to class i.
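The recall and precision of a cluster with respect to a class can be computed as below. The definitions are the standard set-overlap ones; the overall F-measure shown here (class-size-weighted best F per class) is a common formulation and is an assumption about how the scores are aggregated. All function names and the sample data are illustrative.

```python
def precision_recall(cluster, klass):
    """Precision and recall of one cluster with respect to one class."""
    cluster, klass = set(cluster), set(klass)
    overlap = len(cluster & klass)
    precision = overlap / len(cluster) if cluster else 0.0
    recall = overlap / len(klass) if klass else 0.0
    return precision, recall


def f_measure(clusters, classes):
    """Class-size-weighted average of the best F score found for each class."""
    n = sum(len(c) for c in classes)
    total = 0.0
    for klass in classes:
        best = 0.0
        for cluster in clusters:
            p, r = precision_recall(cluster, klass)
            if p + r > 0:
                best = max(best, 2 * p * r / (p + r))
        total += len(klass) / n * best
    return total


# Hypothetical clustering of five documents and their "correct" classes.
clusters = [[0, 1, 2], [3, 4]]
classes = [[0, 1], [2, 3, 4]]
p, r = precision_recall(clusters[0], classes[0])
score = f_measure(clusters, classes)
```

In this example cluster 0 contains every document of class 0 (recall 1.0) plus one extra document (precision 2/3).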

15 3.Experimental Results

16 3.Experimental Results

17 3.Experimental Results

18 3.Experimental Results

19 3.Experimental Results The Performance Evaluation on Large Document Data Sets. We conducted a set of experiments on a large data set (DS8) generated from the RCV1 document collection. DS8 contains 500 documents from each of the categories GSPO and M11, and all documents of the other eight categories. The total number of documents is 4,

20 3.Experimental Results

21 3.Experimental Results

22 3.Experimental Results

23 3.Experimental Results

24 4.Conclusions The new phrase-based document similarity successfully connects the two document models (VSD and STD) and inherits the advantages of both.