
1 An Efficient Concept-Based Mining Model for Enhancing Text Clustering
Shady Shehata, Fakhri Karray, and Mohamed S. Kamel. TKDE, 2010
Presented by Wen-Chung Liao, 2010/11/03
Intelligent Database Systems Lab, 國立雲林科技大學 (National Yunlin University of Science and Technology)

2 Outline
• Motivation
• Objectives
• THEMATIC ROLES BACKGROUND
• CONCEPT-BASED MINING MODEL
• Experiments
• Conclusions
• Comments

3 Motivation
• Vector Space Model (VSM)
  ─ Represents each document as a feature vector of the terms (words or phrases) in the document.
  ─ Each feature vector contains term weights (usually term frequencies) of the terms in the document.
  ─ Term frequency captures the importance of a term within a document only.
• However, two terms can have the same frequency in their documents while one contributes more to the meaning of its sentences than the other.
• Thus, the underlying text mining model should identify the terms that capture the semantics of the text.
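To make the VSM limitation concrete, here is a minimal Python sketch of a term-frequency feature vector. The toy document, the vocabulary, and the function name `term_frequency_vector` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def term_frequency_vector(document, vocabulary):
    """Return raw term counts for each vocabulary term (whitespace tokenization)."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]


# Hypothetical toy document: "researchers" and "sheets" get the same count,
# so a frequency-only vector cannot tell which one matters more to the meaning.
doc = "researchers created sheets and researchers noted the sheets"
vocabulary = ["researchers", "created", "sheets", "noted"]
vector = term_frequency_vector(doc, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Both "researchers" and "sheets" receive the same weight here, which is exactly the case the motivation slide describes.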

4 Objectives
• A new concept-based mining model is introduced. The model:
  ─ captures the semantic structure of each term within a sentence and a document, rather than only the frequency of the term within a document;
  ─ effectively discriminates between unimportant terms and terms that hold the concepts representing the sentence meaning;
  ─ computes three measures for analyzing concepts at the sentence, document, and corpus levels;
  ─ proposes a new concept-based similarity measure, based on a combination of sentence-based, document-based, and corpus-based concept analysis;
  ─ has a more significant effect on clustering quality, because the similarity measure is insensitive to noisy terms.

5 THEMATIC ROLES BACKGROUND
• Verb argument structure (e.g., "John hits the ball"):
  ─ "hits" is the verb.
  ─ "John" and "the ball" are the arguments of the verb "hits".
• Label: a label is assigned to each argument.
  ─ E.g., "John" has the subject (or Agent) label; "the ball" has the object (or Theme) label.
• Term: either an argument or a verb.
  ─ A term is either a word or a phrase.
• Concept: a labeled term.
• Generally, the semantic structure of a sentence can be characterized by a form of verb argument structure.
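The definitions above can be sketched as a small data structure. This is only one possible encoding; the class names `Concept` and `VerbArgumentStructure` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Concept:
    """A labeled term: the term is a word or phrase, the label is its thematic role."""
    label: str
    term: str


@dataclass
class VerbArgumentStructure:
    """One verb together with its labeled arguments."""
    verb: str
    arguments: List[Concept]


# "John hits the ball": "hits" is the verb, the other two terms are its arguments.
structure = VerbArgumentStructure(
    verb="hits",
    arguments=[Concept("Agent", "John"), Concept("Theme", "the ball")],
)
labels = {concept.term: concept.label for concept in structure.arguments}
print(labels)
```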

6 CONCEPT-BASED MINING MODEL

7 CONCEPT-BASED MINING MODEL
• Sentence-Based Concept Analysis
  ─ Calculating ctf of concept c in sentence s
    • The conceptual term frequency, ctf, is the number of occurrences of concept c in the verb argument structures of sentence s.
    • A concept that appears frequently in the verb argument structures of s has the principal role of contributing to the meaning of s.
    • ctf is a local measure at the sentence level.
  ─ Calculating ctf of concept c in document d
    • This measures the overall importance of concept c to the meaning of its sentences in document d.
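The slide's formula for the document-level ctf is not preserved in this transcript. The Python sketch below assumes it is the mean of the sentence-level ctf values over the sentences of d that contain c. That averaging rule and the function name `ctf_in_document` are assumptions made for illustration.

```python
def ctf_in_document(per_sentence_ctf):
    """Average sentence-level ctf over the sentences that contain the concept.

    per_sentence_ctf holds one ctf value per sentence of the document;
    a value of 0 means the concept does not occur in that sentence.
    ASSUMPTION: the mean over containing sentences is used as the document-level ctf.
    """
    containing = [value for value in per_sentence_ctf if value > 0]
    if not containing:
        return 0.0
    return sum(containing) / len(containing)


# Hypothetical document with three sentences; the concept occurs in the
# verb argument structures of the first (3 times) and the third (once).
document_ctf = ctf_in_document([3, 0, 1])
print(document_ctf)
```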

8 CONCEPT-BASED MINING MODEL
• Document-Based Concept Analysis
  ─ The concept-based term frequency, tf
    • the number of occurrences of a concept (word or phrase) c in the original document;
    • a local measure at the document level.
• Corpus-Based Concept Analysis
  ─ The concept-based document frequency, df
    • the number of documents containing concept c;
    • used to reward concepts that appear in only a small number of documents.
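A minimal Python sketch of the two counts, assuming whitespace tokenization and exact phrase matching. The three-document corpus and the function names are invented for illustration, and the logarithmic reward at the end is the usual IDF form rather than something stated on the slide.

```python
import math


def count_occurrences(concept, document):
    """tf: how many times the concept (a word or phrase) occurs in the document."""
    concept_tokens = concept.lower().split()
    tokens = document.lower().split()
    n = len(concept_tokens)
    return sum(
        1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == concept_tokens
    )


def document_frequency(concept, corpus):
    """df: how many documents in the corpus contain the concept at least once."""
    return sum(1 for doc in corpus if count_occurrences(concept, doc) > 0)


# Hypothetical toy corpus.
corpus = [
    "artificial muscles from nanotubes and more artificial muscles",
    "nanotubes form strong sheets",
    "researchers study text clustering",
]
tf = count_occurrences("artificial muscles", corpus[0])
df = document_frequency("nanotubes", corpus)
# ASSUMPTION: rare concepts are rewarded with the standard log(N / df) factor.
idf = math.log(len(corpus) / df)
print(tf, df, idf)
```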

9 Example of Calculating ctf Measure
Sentence: Texas and Australia researchers have created industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles.
• Three verbs (colored red on the original slide, marked TARGET below) represent the semantic structure of the meaning of the sentence.
• Each has its own arguments:
  ─ [ARG0 Texas and Australia researchers] have [TARGET created] [ARG1 industry-ready sheets of materials made from nanotubes that could lead to the development of artificial muscles].
  ─ Texas and Australia researchers have created industry-ready sheets of [ARG1 materials] [TARGET made] [ARG2 from nanotubes that could lead to the development of artificial muscles].
  ─ Texas and Australia researchers have created industry-ready sheets of materials made from [ARG1 nanotubes] [R-ARG1 that] [ARGM-MOD could] [TARGET lead] [ARG2 to the development of artificial muscles].
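As a rough illustration of the counting, the Python sketch below encodes the three labeled structures from this slide and counts, for a single word, how many of them contain it. Counting at the word level, without stop-word removal or stemming, is a simplifying assumption, and the helper names are hypothetical.

```python
# The three verb argument structures from the slide, keyed by role label.
structures = [
    {
        "ARG0": "Texas and Australia researchers",
        "TARGET": "created",
        "ARG1": "industry-ready sheets of materials made from nanotubes "
                "that could lead to the development of artificial muscles",
    },
    {
        "ARG1": "materials",
        "TARGET": "made",
        "ARG2": "from nanotubes that could lead to the development "
                "of artificial muscles",
    },
    {
        "ARG1": "nanotubes",
        "R-ARG1": "that",
        "ARGM-MOD": "could",
        "TARGET": "lead",
        "ARG2": "to the development of artificial muscles",
    },
]


def words_in(structure):
    """Collect the lowercased words of every labeled span in one structure."""
    words = set()
    for span in structure.values():
        words.update(span.lower().split())
    return words


def ctf(word, verb_argument_structures):
    """Number of verb argument structures of the sentence that contain the word."""
    return sum(1 for s in verb_argument_structures if word.lower() in words_in(s))


for word in ("nanotubes", "materials", "researchers"):
    print(word, ctf(word, structures))
```

Under this simplification, a word that recurs in all three structures gets a higher ctf than one that appears only in the first, even though each occurs once in the sentence itself.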

10 A cleaning step
• Remove stop words.
• Stem the words.
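The two cleaning steps can be sketched in Python as follows. The stop-word list is a tiny hand-picked sample, and `naive_stem` is a crude suffix stripper standing in for a real stemmer such as Porter's. Both are assumptions for illustration, not the tools used in the paper.

```python
# ASSUMPTION: a tiny illustrative stop-word list, not a standard one.
STOP_WORDS = {"a", "an", "and", "could", "from", "have", "of", "that", "the", "to"}


def naive_stem(word):
    """Strip one common suffix; a crude stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ers", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def clean(text):
    """Lowercase, drop surrounding punctuation, remove stop words, then stem."""
    tokens = [token.strip(".,;:").lower() for token in text.split()]
    return [naive_stem(t) for t in tokens if t and t not in STOP_WORDS]


cleaned = clean("Researchers have created sheets of materials.")
print(cleaned)
```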

11 A Concept-Based Similarity Measure
• The single-term similarity measure is: [formula not preserved in the transcript]
• The concept-based similarity between two documents, d1 and d2, is calculated by: [formula not preserved in the transcript]
[Figure labels: d1, d2; m matching concepts (using the TF-IDF weighting scheme)]
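The single-term formula is missing from the transcript. The later sensitivity slide compares the concept-based measure against cosine similarity, so the Python sketch below assumes the single-term measure is the standard cosine between two term-weight vectors. The vectors are made-up weights, and the concept-based formula itself is not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Standard cosine similarity between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


# Hypothetical term-weight vectors (e.g., TF-IDF weights) for three documents.
d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]  # same direction as d1
d3 = [0.0, 0.0, 3.0]  # shares no terms with d1

print(cosine_similarity(d1, d2), cosine_similarity(d1, d3))
```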

12 Mathematical Framework
• Assume that the content of document d2 is changed by Δ.
• Sensitivity analysis:
  ─ Assume that each concept consists of one word. In this case, each concept is a word and A = 1. (?)
  ─ By approximation, the d1c value is bigger than the d1w value, and the Δd2c value is bigger than the Δd2w value.
  ─ Hence, the sensitivity of the concept-based similarity is higher than that of the cosine similarity.
  ─ This means that the concept-based model analyzes the similarity between two documents more deeply than the traditional approaches.

13 Concept-Based Analysis Algorithm
[Figure: documents d1, d2, d3, d4; labels L, LL, LLL]

14 EXPERIMENTAL RESULTS
• Four data sets
  ─ 23,115 ACM abstract articles collected from the ACM digital library; five main categories
  ─ 12,902 documents from the Reuters-21578 data set; five category sets
  ─ 361 samples from the Brown corpus; main categories were press: reportage; press: reviews, religion, skills and hobbies, popular lore, belles-lettres, and learned; fiction: science; fiction: romance and humor
  ─ 20,000 messages collected from 20 Usenet newsgroups
• Three standard document clustering techniques
  ─ Hierarchical Agglomerative Clustering (HAC)
  ─ Single-Pass Clustering
  ─ k-Nearest Neighbor (k-NN)
• Evaluation methods
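Of the three techniques, single-pass clustering is the simplest to sketch. The Python version below assumes cosine similarity against running cluster centroids and a fixed similarity threshold. The threshold value, the centroid update, and the toy vectors are illustrative choices, not the experimental setup of the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def single_pass_clustering(vectors, threshold):
    """Assign each vector, in order, to the most similar cluster or open a new one."""
    clusters = []   # each cluster is a list of vector indices
    centroids = []  # running mean vector of each cluster
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for k, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim >= best_sim:
                best, best_sim = k, sim
        if best is None:
            clusters.append([idx])
            centroids.append(list(vec))
        else:
            clusters[best].append(idx)
            n = len(clusters[best])
            # Incrementally update the mean with the new member.
            centroids[best] = [
                (c * (n - 1) + x) / n for c, x in zip(centroids[best], vec)
            ]
    return clusters


# Hypothetical 2-D document vectors forming two obvious groups.
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
clusters = single_pass_clustering(vectors, threshold=0.8)
print(clusters)
```

Because documents are processed once and in order, the result depends on both the threshold and the input order. Any similarity function, including a concept-based one, could replace `cosine` here.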

15 Four different concept-based weighting schemes:

16 Conclusions
• The work bridges the gap between the natural language processing and text mining disciplines. (?)
• By exploiting the semantic structure of the sentences in documents, a better text clustering result is achieved.
• There are a number of possibilities for extending this paper:
  ─ link this work to Web document clustering;
  ─ apply the same model to text classification.

17 Comments
• Advantages
  ─ Better similarity by considering the semantic structure of sentences in documents.
• Shortcomings
  ─ Ambiguous algorithm
• Applications
  ─ Text clustering
  ─ Text classification

