ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS Beibei Li, Shuting Xu, Jun Zhang Department of Computer Science University of Kentucky.

ENHANCING CLUSTERING BLOG DOCUMENTS BY UTILIZING AUTHOR/READER COMMENTS Beibei Li, Shuting Xu, Jun Zhang Department of Computer Science University of Kentucky ACMSE’07

INTRODUCTION
- blogs: highly opinionated personal online commentary, including hyperlinks to other resources
- Technorati (July 2006)
  - tracking more than 50 million blogs
  - about 175,000 blogs were created daily
  - the size of the blogosphere doubles every six months
- how many blog authors update their blogs regularly is not clear

INTRODUCTION (CONT.)
- analysis of the blogosphere in 2004
  - more than two-thirds of public blogs are personal journals
  - knowledge blogs (k-blogs) make up a mere 3 percent
- because of the diverse backgrounds of blog authors and readers, the blogosphere has hyper-accelerated the spread of information

BLOGS VS. WEB PAGES
- the major differences between blogs and standard web pages:
  - blogs are dated
  - most blogs allow readers to place comments on each blog document
    - this creates communication channels between the blog authors and the readers
  - blog authors can place individual blogs into different categories
    - according to some predefined categories
    - the definitions of the categories may differ from author to author

BLOG DOCUMENTS
- use the vector-space model to encode the blog web pages
  - each blog page can be viewed as a column vector
  - each word used can be considered as one row of the matrix
- consider a blog page as three parts
  - blog title
  - blog body: the content of the blog page
  - comments of the authors and/or the readers

A SAMPLE BLOG PAGE

HYPOTHESIS
- hypothesis
  - using title and comment words in the dataset will enhance the discrimination of the blog pages
  - this should result in more accurate clustering solutions
- reason
  - the words in the comments reflect the specific views, questions, and answers of the authors and the readers
  - they may therefore carry more weight in discriminating individual blog pages

DATA PREPARATION AND CLUSTERING
Data Preprocessing
- selected three categories of blog files
  - gun control
  - church
  - Alzheimer's disease
- downloaded from Windows Live Spaces by searching with the key words
  - each entry has at least one comment
  - each category has 70 files, for a total of 210 blog files
- parsing
  - convert each file into 3 parts
  - stemming
  - delete stop words
  - count the number of occurrences of each word
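The parsing steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the stop-word list, the `count_terms` and `preprocess` helpers, and the sample page are all assumptions, and stemming is left out to keep the sketch dependency-free.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the slides do not give one).
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "in", "on"}

def count_terms(text):
    """Lowercase the text, split it into words, drop stop words, count the rest."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def preprocess(page):
    """Return one term-count table per part: title, body, comments."""
    parts = ("title", "body", "comments")
    return {part: count_terms(page.get(part, "")) for part in parts}

# Hypothetical blog page, already split into its three parts.
page = {
    "title": "Gun control debate",
    "body": "The debate on gun control is heated.",
    "comments": "I agree. Control is needed.",
}
counts = preprocess(page)
```

Keeping the three parts in separate count tables is what later allows each part to receive its own weight.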

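DATA PREPROCESSING (CONT.)
- represent each document by three vectors (title, body, comments)
- the vector for the whole document is a weighted sum of all three vectors; writing $v^{t}$, $v^{b}$, $v^{c}$ for the title, body, and comment vectors:
  $v = w_t v^{t} + w_b v^{b} + w_c v^{c}$
  - $w_t$: title weight
  - $w_b$: body weight
  - $w_c$: comment weight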

DATA PREPROCESSING (CONT.)
- the word-page matrix $A$ is composed of a set of such document vectors
  - $A = (v_1 \ldots v_m)$
  - $v_{ij}$ is the weighted number of occurrences of word $i$ in document $v_j$
- to balance the influence of small and large documents
  - scale each document vector $v_j$ so that its Euclidean norm equals 1
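A minimal sketch of the weighted sum and the unit-norm scaling described above, using plain dictionaries as sparse vectors. The helper names `combine` and `normalize`, the sample counts, and the weight values are illustrative assumptions; the slides do not state which weights were used.

```python
import math

def combine(title, body, comments, w_t, w_b, w_c):
    """Weighted sum of the title, body, and comment term counts."""
    vec = {}
    for counts, weight in ((title, w_t), (body, w_b), (comments, w_c)):
        for term, n in counts.items():
            vec[term] = vec.get(term, 0.0) + weight * n
    return vec

def normalize(vec):
    """Scale a document vector so its Euclidean norm is 1."""
    norm = math.sqrt(sum(x * x for x in vec.values()))
    if norm == 0:
        return dict(vec)  # empty document: nothing to scale
    return {term: x / norm for term, x in vec.items()}

# Hypothetical term counts for one blog page.
title = {"gun": 1, "control": 1}
body = {"gun": 2, "debate": 1}
comments = {"control": 1}

# Weights are placeholders chosen only for the example.
raw = combine(title, body, comments, w_t=2.0, w_b=1.0, w_c=1.5)
v = normalize(raw)
```

Each normalized vector `v` would then form one column of the word-page matrix.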

FEATURE SELECTION
- tf-idf
- TI is the mean tf-idf value over all the documents, computed for each term
- use TI to measure the quality of the term
- the higher the TI value, the better the term is ranked
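The ranking idea can be sketched as below. The transcript does not preserve the exact tf-idf formula, so this sketch assumes the common form tf × log(N / df); the function names `term_scores` and `select_features` and the toy documents are hypothetical.

```python
import math

def term_scores(docs):
    """Return {term: TI}, where TI is the mean tf-idf of the term over all docs.

    Assumes tf-idf = tf * log(N / df), one common variant.
    """
    n = len(docs)
    df = {}
    for doc in docs:
        for term in doc:
            df[term] = df.get(term, 0) + 1
    scores = {}
    for term, doc_freq in df.items():
        idf = math.log(n / doc_freq)
        scores[term] = sum(doc.get(term, 0) * idf for doc in docs) / n
    return scores

def select_features(docs, fraction):
    """Keep the top `fraction` of terms ranked by TI (ties broken by name)."""
    scores = term_scores(docs)
    keep = max(1, int(len(scores) * fraction))
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:keep]

# Toy documents as {term: count}; "law" occurs everywhere, so its TI is 0.
docs = [
    {"gun": 3, "law": 1},
    {"church": 2, "law": 1},
    {"memory": 4, "law": 1},
]
top = select_features(docs, 0.5)
```

A term that appears in every document gets an idf of zero under this variant, so it ranks last and is dropped first.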

CLUSTERING
k-means algorithm
1. Compute the Euclidean distance from each document to each cluster center. A document is assigned to the cluster with the smallest distance.
2. Recompute each cluster center as the mean of its constituent documents.
3. Repeat steps 1 and 2 until convergence is reached.

CLUSTERING (CONT.)
- criterion function $f_r$ for the convergence
  - $r$: the step of the iterations
  - $Edist(v_i, c_j)$: computes the Euclidean distance from the document $v_i$ to a cluster center $c_j$
- given a convergence criterion $\varepsilon$, the k-means algorithm stops when $|f_{r+1} - f_r| < \varepsilon$
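The three k-means steps and the stopping rule can be sketched in plain Python as follows. The exact criterion function is not preserved in this transcript, so the sketch assumes f is the sum of Euclidean distances from each document to its assigned center; the initial centers, `eps`, and `max_iter` are also assumptions made for the example.

```python
import math

def edist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(docs, centers, eps=1e-6, max_iter=100):
    """Run k-means from the given initial centers; return (labels, centers)."""
    centers = [list(c) for c in centers]
    labels = []
    f_prev = None
    for _ in range(max_iter):
        # Step 1: assign each document to its nearest cluster center.
        labels = [
            min(range(len(centers)), key=lambda j: edist(v, centers[j]))
            for v in docs
        ]
        # Step 2: recompute each center as the mean of its documents.
        for j in range(len(centers)):
            members = [v for v, lab in zip(docs, labels) if lab == j]
            if members:  # leave an empty cluster's center where it was
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        # Assumed criterion: total distance of documents to their centers.
        f = sum(edist(v, centers[lab]) for v, lab in zip(docs, labels))
        # Step 3: stop once the criterion changes by less than eps.
        if f_prev is not None and abs(f - f_prev) < eps:
            break
        f_prev = f
    return labels, centers

# Four toy unit-scale document vectors forming two obvious groups.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centers = kmeans(docs, centers=[docs[0], docs[2]])
```

Seeding the centers with two of the documents keeps the example deterministic; a real run would typically pick the initial centers at random.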

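CLUSTERING METRICS
Entropy
- gauges the distribution of each class of documents within each cluster
- suppose there are $q$ classes and the clustering algorithm returns $k$ clusters
- the entropy $E$ of a cluster $S_r$ of size $n_r$ is computed as
  $E(S_r) = -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r}$
  - $n_r^i$ is the number of documents in the $i$th class that are assigned to the $r$th cluster
- the entropy of the entire clustering solution, with $n$ the total number of documents, is computed as
  $Entropy = \sum_{r=1}^{k} \frac{n_r}{n} E(S_r)$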

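CLUSTERING METRICS (CONT.)
Purity
- the purity of the cluster $S_r$ can be defined as
  $P(S_r) = \frac{1}{n_r} \max_i n_r^i$
- the purity value of the entire clustering solution, with $n$ the total number of documents, is computed as
  $Purity = \sum_{r=1}^{k} \frac{n_r}{n} P(S_r)$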
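Both metrics can be computed from the true class labels and the cluster assignments. The sketch below assumes the widely used definitions in which cluster entropy is normalized by log q and both metrics are averaged over clusters weighted by cluster size; the function name `entropy_and_purity` and the toy labels are hypothetical.

```python
import math
from collections import Counter

def entropy_and_purity(classes, clusters):
    """Return (entropy, purity) of a clustering, weighted by cluster size.

    classes[i] is the true class of document i, clusters[i] its cluster.
    """
    n = len(classes)
    q = len(set(classes))
    by_cluster = {}
    for cls, r in zip(classes, clusters):
        by_cluster.setdefault(r, Counter())[cls] += 1
    entropy = 0.0
    purity = 0.0
    for counts in by_cluster.values():
        n_r = sum(counts.values())
        # Entropy of this cluster over the classes present in it.
        e_r = -sum((m / n_r) * math.log(m / n_r) for m in counts.values())
        if q > 1:
            e_r /= math.log(q)  # normalize so the value lies in [0, 1]
        entropy += (n_r / n) * e_r
        # Purity of this cluster: share of its largest class.
        purity += (n_r / n) * (max(counts.values()) / n_r)
    return entropy, purity

# A perfect clustering and a fully mixed one, for comparison.
good = entropy_and_purity(["a", "a", "b", "b"], [0, 0, 1, 1])
bad = entropy_and_purity(["a", "b", "a", "b"], [0, 0, 1, 1])
```

Lower entropy and higher purity both indicate clusters that line up better with the known categories.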

EXPERIMENTAL RESULTS
Influence of weight
- clustering is not very good if only one of the title, body, or comments is used
- the accuracy of clustering on the blog body alone is better than on the title or the comments alone
- using all three parts together improves the results considerably

EXPERIMENTAL RESULTS
Feature Selection
- using only the title and the body for clustering
  - reducing the percentage of the features used does not change the clustering accuracy
- applying feature selection to all the blog content, including the comments
  - with a certain percentage of features selected, the entropy value can be reduced
  - so making good use of the terms in comments can help increase clustering accuracy

SUMMARY
- a particular feature of blogs, the comments, is used to enhance the effectiveness of a clustering algorithm in classifying blog pages
Future work
- consider the timing effect of the blogs
  - better clustering of blog documents
  - finding blog communities
- the use of predefined category information may also improve the classification of blog files
- experimenting with other data mining algorithms on blog datasets
