
1 Document Clustering
Content:
1. Document Clustering Essentials
2. Text Clustering Architecture
3. Preprocessing
4. Different Document Models
   1. Probabilistic model
   2. Vector space model (VSM)
   3. Ontology-based VSM
5. Document Representation
6. Ontology and Semantic Enhancement of Presentation Models
7. Ontology-based VSM

2 Document Clustering Essentials

3 Text Clustering Architecture
[Figure: pipeline diagram. Recoverable labels: Text Documents; Preprocessing; Data Modeling (VSM, Ontology, Tf-Idf, BOW (Bag of Words)); Clustering (Clustering Algorithms); Clusters]

4 Preprocessing: Removing Stopwords
• Stopwords are function words and connectives.
• They appear in a large number of documents and have little use in describing the characteristics of documents.
Example
• Stopwords: “of”, “a”, “by”, “and”, “the”, “instead”
• Text after stopword removal: “direct prediction continuous output variable method discretizes variable kMeans clustering solves resultant classification problem”
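The stopword-removal step can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' implementation: the stopword list comes from the slide, while the function name `remove_stopwords` and the input sentence are assumptions (the slide shows only the filtered output, so the input below is a hypothetical reconstruction).

```python
import re

# Stopword list taken from the slide.
STOPWORDS = {"of", "a", "by", "and", "the", "instead"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Return the words of `text` that are not stopwords, joined by spaces."""
    # Keep alphabetic runs only, which also drops punctuation.
    tokens = re.findall(r"[A-Za-z]+", text)
    # Compare in lowercase so "Instead" matches "instead".
    return " ".join(t for t in tokens if t.lower() not in stopwords)


# Hypothetical input sentence, reconstructed for illustration only.
sentence = (
    "Instead of direct prediction of a continuous output variable, "
    "the method discretizes the variable by kMeans clustering and "
    "solves the resultant classification problem"
)
print(remove_stopwords(sentence))
```

In practice a much larger stopword list would be used; the six words here are just the ones named on the slide.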

5 Preprocessing: Stemming
• Remove inflections that convey part of speech or tense.
• Techniques:
  - Morphological analysis (e.g., Porter’s algorithm)
  - Dictionary lookup (e.g., WordNet)
• Stems: “prediction → predict”, “discretizes → discretize”, “kMeans → kMean”, “clustering → cluster”, “solves → solve”, “classification → classify”
• Example sentence after stemming: “direct predict continuous output variable method discretize variable kMean cluster solve resultant classify problem”
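The dictionary-lookup technique can be illustrated with a toy table built from the stems listed on the slide. This is a sketch under stated assumptions: `STEM_TABLE` and `stem_tokens` are hypothetical names, and the table covers only the slide's six words. A rule-based stemmer such as Porter's algorithm would need no table and may produce different stems for some of these words.

```python
# Toy lookup table copied from the slide's stem list (not a full dictionary).
STEM_TABLE = {
    "prediction": "predict",
    "discretizes": "discretize",
    "kMeans": "kMean",
    "clustering": "cluster",
    "solves": "solve",
    "classification": "classify",
}


def stem_tokens(text, table=STEM_TABLE):
    """Replace each whitespace-separated word by its stem if the table has one."""
    # Words missing from the table are passed through unchanged.
    return " ".join(table.get(word, word) for word in text.split())


filtered = (
    "direct prediction continuous output variable method discretizes "
    "variable kMeans clustering solves resultant classification problem"
)
print(stem_tokens(filtered))
```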

6 Different Document Models

7 Probabilistic and Vector Space Model

8 Document Representation

9 Ontology and Semantic Enhancement of Presentation Models
• Represent unstructured data (text documents) according to an ontology repository.
• Each term in a vector is a concept rather than only a word or phrase.
• Determine the similarity of documents.
Methods to Represent Ontology
• Terminological ontology
  - Synonyms: several words for the same concept
    employee (HR) = staff (Administration) = researcher (R&D)
    car = automobile
  - Homonyms: one word with several meanings
    bank: river bank vs. financial bank
    fan: cooling system vs. sports fan
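To make the idea of "a concept rather than a word" concrete, here is a small sketch that folds the slide's synonyms into shared concepts before counting. The mapping `SYNONYM_TO_CONCEPT`, the choice of concept labels, and the function `concept_vector` are all assumptions made for illustration. The sketch handles synonyms only. Homonyms such as "bank" would need context-based disambiguation, which is not attempted here.

```python
from collections import Counter

# Hypothetical mapping from words to concept labels, using the slide's synonyms.
SYNONYM_TO_CONCEPT = {
    "employee": "employee",
    "staff": "employee",
    "researcher": "employee",
    "car": "car",
    "automobile": "car",
}


def concept_vector(tokens, mapping=SYNONYM_TO_CONCEPT):
    """Count concepts instead of raw words; unmapped words count as themselves."""
    return Counter(mapping.get(token, token) for token in tokens)


doc = ["staff", "car", "automobile", "researcher", "garage"]
print(concept_vector(doc))
```

With this mapping, two documents that use "car" and "automobile" respectively share a dimension, which a purely word-based vector would miss.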

10 Ontology-based VSM
Each element of a document vector considering ontology is represented by:
[Equation not captured in the transcript]
where x_{ji_1} is the original frequency of term t_{i_1} in the j-th document, and the weighting factor is the semantic similarity between term t_{i_1} and term t_{i_2}.
Advantages:
1. Considers the relationships between terms.
2. Introduces semantic concepts into data models.
3. Combines ontology with the traditional VSM.
4. Terms (dimensions) have semantic relationships rather than being independent.
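The formula itself does not appear in the transcript. One form consistent with the slide's definitions is that each adjusted entry equals the original frequency plus a similarity-weighted sum of the other terms' frequencies: x'_{ji_1} = x_{ji_1} + Σ_{i_2 ≠ i_1} δ_{i_1 i_2} · x_{ji_2}. This is an assumption: the symbol δ for the similarity, the adjusted symbol x', and the function name `ontology_adjust` are not from the source. A minimal sketch under that assumption:

```python
def ontology_adjust(freqs, sim):
    """Add to each term frequency the similarity-weighted frequencies of other terms.

    freqs: term frequencies of one document (x_{j i} for fixed j).
    sim:   square matrix, sim[a][b] = assumed semantic similarity of terms a and b.
    """
    n = len(freqs)
    return [
        freqs[a] + sum(sim[a][b] * freqs[b] for b in range(n) if b != a)
        for a in range(n)
    ]


# Hypothetical example: terms are ["car", "automobile", "bank"].
freqs = [2, 0, 1]
sim = [
    [1.0, 0.8, 0.0],  # car is assumed similar to automobile
    [0.8, 1.0, 0.0],
    [0.0, 0.0, 1.0],  # bank is assumed unrelated to both
]
print(ontology_adjust(freqs, sim))
```

In this example the document never mentions "automobile", yet its adjusted entry becomes nonzero because "car" occurs twice. This is how the term dimensions stop being independent.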


