A New Suffix Tree Similarity Measure for Document Clustering Hung Chim, Xiaotie Deng City University of Hong Kong WWW 2007 Session: Similarity Search April.

Presentation transcript:

A New Suffix Tree Similarity Measure for Document Clustering
Hung Chim, Xiaotie Deng, City University of Hong Kong
WWW 2007, Session: Similarity Search
April 11, 2008, Internet Database Lab., SNU, Hyewon Lim

Contents
- Introduction
- Related Work
- A New Suffix Tree Similarity Measure
- A Practical Approach
- Evaluation
- Conclusions and Future Work

Introduction (1/3)
- BBS, Weblog and Wiki
  - Computers have no understanding of the content and meaning of the submitted information.
  - Assessing and classifying the information
    - Relies on the manual work of a few experienced people
    - As a community grows, the manual work becomes heavier
- Document clustering algorithms
  - Group documents together based on their similarities.
- The objective of our work
  - Develop a document clustering algorithm to categorize the Web documents in an online community.

Introduction (2/3)
- VSD (Vector Space Document) model
  - Very widely used
  - Represents any document as a feature vector of words
  - The similarity between two documents is computed with a similarity measure.
  - The sequence order of words is seldom considered
- STC (Suffix Tree Clustering) algorithm
  - A linear-time clustering algorithm
    - Based on identifying phrases that are common to groups of documents.
  - Lacks an efficient similarity measure

Introduction (3/3)
- Our work focuses on how to combine the advantages of the two document models in document clustering.
- The new suffix tree similarity measure
  - Combines the word sequence order captured by the suffix tree model with the term weighting scheme of the VSD model

Related Work (1/2)
- Methods used for document clustering
  1. Agglomerative hierarchical clustering algorithm
     - The most commonly used among the numerous document clustering algorithms.
     - Can often generate a high-quality clustering result, at the cost of higher computational complexity.
  2. VSD model
     - Words or characters are considered to be atomic elements.
     - Clustering methods based on the VSD model mostly use single-word term analysis of the document data set.

Related Work (2/2)
- Methods used for document clustering
  3. Suffix tree document model
     - Considers a document to be a set of suffix substrings
     - Common prefixes of the suffix substrings are selected as phrases to label the edges of a suffix tree.
     - STC algorithm
       - Developed based on this model
       - Works well in clustering Web document snippets

A New Suffix Tree Similarity Measure - Suffix Tree Document Model and STC Algo. (1/4)
- Document model
  - A concept that describes how a set of meaningful features is extracted from a document.
- Suffix tree document model
  - Treats a document $d = w_1 w_2 \ldots w_m$ as a string of words $w_i$ ($i = 1, 2, \ldots, m$), not characters
  - The suffix tree of document d is a compact trie containing all suffixes of d.
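The word-level suffix idea can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, whitespace tokenization is an assumption, and the trie below is left uncompacted (a real suffix tree merges single-child chains into one edge).

```python
from typing import Dict, List


def word_suffixes(doc: str) -> List[List[str]]:
    """Return every suffix of the document as a list of words."""
    words = doc.lower().split()
    return [words[i:] for i in range(len(words))]


def build_suffix_trie(doc: str) -> Dict:
    """Insert each word-level suffix into a nested-dict trie (not compacted)."""
    root: Dict = {}
    for suffix in word_suffixes(doc):
        node = root
        for word in suffix:
            # Follow the existing child for this word, or create it.
            node = node.setdefault(word, {})
    return root


trie = build_suffix_trie("cat ate cheese")
print(sorted(trie))  # first words of all suffixes
```

Each root-to-node path spells a phrase of the document, which is what later lets nodes act as phrase features.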

A New Suffix Tree Similarity Measure - Suffix Tree Document Model and STC Algo. (2/4)

A New Suffix Tree Similarity Measure - Suffix Tree Document Model and STC Algo. (3/4)
- The original STC algorithm
  - Developed based on the suffix tree document model.
  - Three logical steps:
    1. Generating the common suffix tree
       - A suffix tree S containing all suffixes of each document in $D = \{d_1, d_2, \ldots, d_N\}$ is constructed.
    2. Selecting base clusters
       - Each base cluster B is scored as s(B) = |B| · f(|P|), where |B| is the number of documents in B and |P| is the number of words in its phrase P.
       - All base clusters are sorted by score, and the top k base clusters are selected for cluster merging.
    3. Merging clusters
       - A similarity graph consisting of the k base clusters is generated.
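The base cluster selection step can be sketched as follows. The slide does not define f, so `phrase_factor` below is a hypothetical stand-in (a penalty for single-word phrases, linear growth up to six words, constant afterwards), and the base clusters are hand-made toy data rather than output of a real suffix tree.

```python
from typing import List, Set, Tuple

BaseCluster = Tuple[Set[int], Tuple[str, ...]]  # (document ids, phrase P)


def phrase_factor(phrase_len: int) -> float:
    """Hypothetical f(|P|): the constants here are assumptions, not from the slides."""
    if phrase_len <= 1:
        return 0.5
    return float(min(phrase_len, 6))


def score(doc_ids: Set[int], phrase: Tuple[str, ...]) -> float:
    """s(B) = |B| * f(|P|)."""
    return len(doc_ids) * phrase_factor(len(phrase))


def top_k(base_clusters: List[BaseCluster], k: int) -> List[BaseCluster]:
    """Sort base clusters by score, highest first, and keep the first k."""
    ranked = sorted(base_clusters, key=lambda bc: score(bc[0], bc[1]), reverse=True)
    return ranked[:k]


base_clusters: List[BaseCluster] = [
    ({0, 1, 2}, ("cat",)),
    ({0, 1}, ("cat", "ate")),
    ({1, 2}, ("ate", "cheese", "too")),
]
selected = top_k(base_clusters, 2)
print([phrase for _, phrase in selected])
```

With this choice of f, a long phrase shared by two documents outranks a single word shared by three, which is the intent of weighting by phrase length.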

A New Suffix Tree Similarity Measure - Suffix Tree Document Model and STC Algo. (4/4)

A New Suffix Tree Similarity Measure - The New Suffix Tree Similarity Measure (1/2)
- Map all nodes n of the common suffix tree to an M-dimensional space of the VSD model:
  - d = {w(1, d), w(2, d), …, w(M, d)}
  - df(n): the number of different documents that have traversed node n
  - tf(n, d): the total number of times document d traverses node n
  - w(n, d): the weight of node n in document d
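A minimal sketch of the node statistics, under stated assumptions: each node is identified by the phrase on its root path in an uncompacted word-level trie, so tf(n, d) equals the number of occurrences of that phrase in d. The transcript does not preserve the weight formula, so the tf-idf form used here, (1 + log tf) · log(1 + N/df), is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Node = Tuple[str, ...]  # a node is named by the phrase leading to it


def node_tf(doc: str) -> Counter:
    """tf(n, d): how many suffixes of d pass through each node."""
    words = doc.lower().split()
    counts: Counter = Counter()
    for i in range(len(words)):
        for j in range(i + 1, len(words) + 1):
            counts[tuple(words[i:j])] += 1
    return counts


def node_weights(docs: List[str]) -> List[Dict[Node, float]]:
    """One sparse weight vector per document, indexed by node."""
    tfs = [node_tf(d) for d in docs]
    df: Counter = Counter()
    for tf in tfs:
        df.update(tf.keys())  # each document counts once per node
    n_docs = len(docs)
    return [
        {
            node: (1 + math.log(count)) * math.log(1 + n_docs / df[node])
            for node, count in tf.items()
        }
        for tf in tfs
    ]


docs = ["cat ate cheese", "mouse ate cheese too", "cat ate mouse too"]
weights = node_weights(docs)
print(len(weights[0]))  # number of nodes traversed by the first document
```

Enumerating every phrase is quadratic in document length; it stands in here for reading tf and df off a real suffix tree.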

A New Suffix Tree Similarity Measure - The New Suffix Tree Similarity Measure (2/2)
- After obtaining the term weights of all nodes,
  - apply a traditional similarity measure such as cosine similarity to compute the similarity of two documents.
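Cosine similarity over node weights can be sketched with sparse dictionaries, since most documents touch only a small fraction of the M nodes. The vectors below are made-up toy weights, and the function name is hypothetical.

```python
import math
from typing import Dict, Hashable


def cosine(u: Dict[Hashable, float], v: Dict[Hashable, float]) -> float:
    """Cosine similarity of two sparse vectors keyed by suffix tree node."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


d1 = {("cat",): 1.0, ("cat", "ate"): 2.0}
d2 = {("cat",): 1.0, ("ate", "cheese"): 2.0}
print(cosine(d1, d2))
```

Only nodes shared by both documents contribute to the dot product, so documents that share longer phrases share more nodes and score higher.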

A New Suffix Tree Similarity Measure - A Closer Look to Suffix Tree Doc Model (1/3)
- In the suffix tree document model,
  - A document is considered a string of words, not characters.
  - $O(m^2)$ time
    - The naïve, straightforward method of building a suffix tree for a document of m words
  - Ukkonen's paper
    - Time complexity of building a suffix tree: $O(m)$
    - Makes it possible to build a large suffix tree incrementally online
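As a rough check on the quadratic bound, assume the naive method inserts each of the m suffixes word by word, so that the i-th suffix costs m - i + 1 word insertions. Summing over all suffixes gives:

```latex
\sum_{i=1}^{m} (m - i + 1) = \frac{m(m+1)}{2} = O(m^2)
```

Under that assumption a 1,000-word document needs 500,500 word insertions, which is why a linear-time construction matters for large collections.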

A New Suffix Tree Similarity Measure - A Closer Look to Suffix Tree Doc Model (3/3)
- Stopwords
  - A standard stopword list and the Porter stemming algorithm are used to preprocess each document into a "clean" document.
  - Words that appear in the stoplist, or in too few or too many documents, receive a score of zero when computing the score s(B) of a base cluster.
- "Stopnode"
  - Applies the same idea as stopwords to the suffix tree similarity measure computation
  - A threshold $idf_{thd}$ on idf is given to identify whether a node is a stopnode.
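The stopnode test might look like the sketch below. The slide gives only the threshold name, so two things here are assumptions: that idf(n) = log(N / df(n)), and that a node whose idf falls below the threshold is treated as a stopnode.

```python
import math


def idf(df: int, n_docs: int) -> float:
    """Assumed definition: idf(n) = log(N / df(n))."""
    return math.log(n_docs / df)


def is_stopnode(df: int, n_docs: int, idf_thd: float) -> bool:
    """Assumed rule: a node seen in too many documents has low idf and is ignored."""
    return idf(df, n_docs) < idf_thd


# A node traversed by every document carries no discriminating power.
print(is_stopnode(df=100, n_docs=100, idf_thd=0.5))
# A node traversed by only two documents is kept.
print(is_stopnode(df=2, n_docs=100, idf_thd=0.5))
```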

A Practical Approach: Web Document Clustering In online Forum Communities (1/5)
- The Web document clustering algorithm has three logical steps:
  - Preparing documents
  - Clustering documents
  - Generating cluster topic summaries

A Practical Approach: Web Document Clustering In online Forum Communities (2/5)
- Document Preparing
  - The content of a topic thread in a forum consists of a topic post and its reply posts.
    - Each post is saved as a tuple
  - To prepare a text document for a topic thread,
    - Access the tuples from the DB table directly
    - Combine all posts of the same thread into a single document
    - Before a post is added to the document, a document "cleaning" procedure is executed
    - After cleaning, only posts containing at least 3 distinct words are selected for document merging.
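The preparation steps above can be sketched as a small pipeline. Everything concrete here is an assumption for illustration: posts are modeled as (thread_id, text) tuples, the stoplist is a tiny placeholder, cleaning is reduced to lowercasing and stopword removal, and stemming is omitted.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

STOPWORDS = {"the", "a", "is", "of", "and", "to"}  # placeholder stoplist


def clean(text: str) -> List[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords (no stemming here)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_thread_documents(
    posts: Iterable[Tuple[int, str]], min_distinct: int = 3
) -> Dict[int, str]:
    """Merge the cleaned posts of each thread into one document."""
    merged: Dict[int, List[str]] = defaultdict(list)
    for thread_id, text in posts:
        words = clean(text)
        # Posts with fewer than min_distinct distinct words are skipped.
        if len(set(words)) >= min_distinct:
            merged[thread_id].extend(words)
    return {tid: " ".join(ws) for tid, ws in merged.items()}


posts = [
    (1, "The cat ate the cheese."),
    (1, "lol"),
    (2, "Mouse ate cheese too!"),
]
docs = build_thread_documents(posts)
print(docs)
```

The short reply "lol" is filtered out, so it adds no noise to the document built for thread 1.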

A Practical Approach: Web Document Clustering In online Forum Communities (3/5)
- Document Clustering
  - Each thread document is fetched from the corresponding table and inserted into a suffix tree.
  - The tf and df of each node are calculated while the suffix tree is constructed.
  - The pairwise similarity of two documents can then be computed with the cosine similarity measure.

A Practical Approach: Web Document Clustering In online Forum Communities (4/5)
- Cluster Topic Summary Generating (1/2)
  - Topic summary generation involves two important information retrieval tasks:
    1) ranking the documents in a cluster by a quality score
    2) extracting common phrases as the topic summary of the corresponding cluster

A Practical Approach: Web Document Clustering In online Forum Communities (5/5)
- Cluster Topic Summary Generating (2/2)
  - Evaluating the quality of a cluster and its documents is still a challenging research problem
    - The Web documents of a forum system can provide additional human assessments for document quality evaluation
    - Our forum system provides 3 statistical scores: view clicks, reply posts and recommend clicks.
  - q(d) = |d| · v · r · c
    - All documents in the same cluster are sorted by their quality scores.
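Ranking by the quality score can be sketched as below. The slide does not define the symbols, so this assumes |d| is the document length in words and v, r and c are the view, reply and recommend counts; the `ThreadDoc` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ThreadDoc:
    title: str
    length: int      # |d|, assumed to be the word count
    views: int       # v
    replies: int     # r
    recommends: int  # c


def quality(d: ThreadDoc) -> int:
    """q(d) = |d| * v * r * c."""
    return d.length * d.views * d.replies * d.recommends


def rank_cluster(docs: List[ThreadDoc]) -> List[ThreadDoc]:
    """Sort the documents of one cluster by quality, best first."""
    return sorted(docs, key=quality, reverse=True)


cluster = [
    ThreadDoc("a", length=100, views=10, replies=2, recommends=1),
    ThreadDoc("b", length=50, views=100, replies=5, recommends=2),
]
print([d.title for d in rank_cluster(cluster)])
```

Because the score is a plain product, any zero count sends a document to the bottom of its cluster, so a real system may need smoothing; that is a design question the slides leave open.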

Evaluation (1/2)
- F-Measure
  - Commonly used in evaluating the effectiveness of clustering and classification algorithms.
  - The weighted harmonic mean of precision and recall.
  - Formula of F-measure:

Evaluation (2/2)
- F-Measure
  - It combines the precision and recall ideas from IR:
    - $\mathrm{rec}(i, j) = |C_j \cap C_i^*| / |C_i^*|$
    - $\mathrm{prec}(i, j) = |C_j \cap C_i^*| / |C_j|$
  - The F-Measure for overall quality of cluster set C:
  - C: a clustering of document set D
  - C*: the "correct" class set of D
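A sketch of the evaluation follows. The transcript does not preserve the F-measure formulas, so two forms are assumed: the balanced harmonic mean F(i, j) = 2 · prec · rec / (prec + rec), and an overall score that takes, for each correct class, the best F over all clusters, weighted by the class's share of the documents. Classes are assumed to partition D.

```python
from typing import List, Set


def f_measure(clusters: List[Set[int]], classes: List[Set[int]]) -> float:
    """Overall F-measure of a clustering against the correct classes."""
    n_docs = sum(len(cls) for cls in classes)  # classes partition D
    total = 0.0
    for cls in classes:
        best = 0.0
        for clu in clusters:
            overlap = len(clu & cls)
            if overlap == 0:
                continue  # precision and recall are both zero
            prec = overlap / len(clu)
            rec = overlap / len(cls)
            best = max(best, 2 * prec * rec / (prec + rec))
        # Weight the best match by the relative size of the class.
        total += len(cls) / n_docs * best
    return total


classes = [{0, 1, 2}, {3, 4}]
clusters = [{0, 1}, {2, 3, 4}]
print(f_measure(clusters, classes))
```

In this toy case document 2 lands in the wrong cluster, so both classes find a best match with F = 0.8 and the overall score is 0.8; a clustering identical to the classes scores 1.0.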

Evaluation - Results and Discussion (1/5)
- We constructed document sets from the OHSUMED and RCV1 document collections

Evaluation - Results and Discussion (2/5)
- NSTC: results of the new suffix tree similarity measure
- TDC: results of the traditional word tf-idf cosine similarity measure
- STC: results of all clusters generated by the STC algorithm
- STC-10: results of the top 10 clusters generated by the original STC algorithm

Evaluation - Results and Discussion (3/5)
- Result from DS3 document set

Evaluation - Results and Discussion (4/5)

Evaluation - Results and Discussion (5/5)

Conclusions and Future Work (1/2)
- VSD model and suffix tree model
  - The two models are used in two isolated ways:
    - Almost all clustering algorithms based on the VSD model
      - ignore the positions at which words occur in the document
      - unavoidably discard the different semantic meanings of a word in different sentences
    - Suffix tree document model
      - Keeps all sequential characteristics of the sentences of each document
      - Phrases consisting of one or more words are used to designate the similarity of two documents.
      - The original STC algorithm cannot provide an effective evaluation method to assess the quality of clusters.

Conclusions and Future Work (2/2)
- New suffix tree similarity measure
  - Connects the two document models.
    - Maps all nodes in the common suffix tree into an M-dimensional space of the VSD model
    - The advantages of both document models are smoothly inherited by the new similarity measure.
  - The new similarity measure is suitable not only for hierarchical clustering algorithms but also for most traditional clustering algorithms based on the VSD model.
- Future Work
  - More performance evaluation comparisons of these clustering algorithms with the new similarity measure.