RazorFish Data Exploration-KModes
Data Exploration utilizing the K-Modes Clustering algorithm
Performed By: Hilbert G Locklear

2 K-Modes
• The k-modes algorithm (Huang 1999) is an extension of the k-means algorithm by MacQueen (1967).
• k-modes aims to partition the objects into k groups such that the distance from the objects to their assigned cluster modes is minimized.
• By default, the simple-matching distance is used to determine the dissimilarity of two objects.
  ◦ The simple-matching distance is computed by counting the number of mismatches across all variables.
  ◦ Alternatively, the distance can be weighted by the frequencies of the categories in the data.
  ◦ An initial matrix of modes can be supplied.
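
The two building blocks named on this slide, the simple-matching distance and the cluster mode, can be sketched in base R. This is a minimal illustration, not the klaR implementation: the helper names `simple_matching` and `column_modes` and the toy data are hypothetical.

```r
# Simple-matching distance: the number of variables on which two objects differ.
simple_matching <- function(x, y) {
  x <- vapply(x, as.character, character(1))
  y <- vapply(y, as.character, character(1))
  sum(x != y)
}

# Mode of a group of objects: the most frequent category in each column.
column_modes <- function(df) {
  vapply(df, function(col) {
    counts <- table(as.character(col))
    names(counts)[which.max(counts)]
  }, character(1))
}

# Hypothetical toy data: four objects (rows), three categorical variables (columns).
toy <- data.frame(
  colour = c("red", "red", "blue", "red"),
  size   = c("S", "M", "M", "M"),
  shape  = c("round", "round", "square", "round"),
  stringsAsFactors = FALSE
)

d12 <- simple_matching(toy[1, ], toy[2, ])  # objects 1 and 2 differ on size only
d13 <- simple_matching(toy[1, ], toy[3, ])  # objects 1 and 3 differ on every variable
toy_modes <- column_modes(toy)              # most frequent category per column

print(c(d12 = d12, d13 = d13))
print(toy_modes)
```

k-modes alternates these two steps: assign each object to the mode with the smallest distance, then recompute each cluster's mode.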

3 K-Modes Function
• Part of the klaR package.
• Performs k-modes clustering on categorical data.
• k-modes function usage
  ◦ kmodes(data, modes, iter.max = 10, weighted = FALSE)
    ▪ data is a matrix or data frame of categorical data. Objects have to be in rows and variables in columns.
    ▪ modes is either the number of modes or a set of distinct initial cluster modes. If a number is given, the initial modes are a random set of distinct rows of data.
    ▪ iter.max is the maximum number of iterations allowed.
    ▪ weighted is TRUE or FALSE: FALSE uses the usual simple-matching distance between objects, TRUE uses the weighted version of this distance.
• kmodes returns the following values:
  ◦ cluster...a vector of integers indicating the cluster to which each object is allocated.
  ◦ size...the number of objects in each cluster.
  ◦ modes...a matrix of cluster modes.
  ◦ withindiff...the within-cluster distance for each cluster.
  ◦ iterations...the number of iterations the algorithm has run.
  ◦ weighted...whether weighted distances were used.
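
A small call illustrating the usage and return values listed above might look like the following. It is a sketch under assumptions: the toy data frame is hypothetical, and the call is guarded so that the script simply skips the clustering when klaR is not installed.

```r
# Hypothetical toy data: 8 objects in rows, 3 categorical variables in columns.
toy <- data.frame(
  colour = c("red", "red", "red", "red", "blue", "blue", "blue", "blue"),
  size   = c("S", "S", "S", "M", "L", "L", "L", "M"),
  shape  = c("round", "round", "round", "round",
             "square", "square", "square", "square"),
  stringsAsFactors = TRUE
)

fit <- NULL
if (requireNamespace("klaR", quietly = TRUE)) {
  set.seed(1)  # initial modes are random rows, so fix the seed for repeatability
  fit <- klaR::kmodes(toy, modes = 2, iter.max = 10, weighted = FALSE)
  print(fit$cluster)     # cluster assignment of each object
  print(fit$size)        # number of objects per cluster
  print(fit$modes)       # one row of modal categories per cluster
  print(fit$withindiff)  # within-cluster distance per cluster
  print(fit$iterations)  # iterations actually run
}
```

Because the initial modes are drawn at random when `modes` is a number, results can differ between runs unless a seed is set.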

4 Data Cleaning
• Training and Testing data sets contain 12,500 records each.
  ◦ Clustering was performed only on the training set.
• Training and Testing data sets are organized into three fields.
  ◦ Reviewer ID Number...numeric string of 4 or 5 characters.
  ◦ Sentiment Value...0 or 1
  ◦ Review text
• Over 2.91 million words of free text in the training set.
• Data contains some HTML markup and whitespace padding.
  ◦ Used a simple Java regular expression library to remove the markup.
• No data extrapolation measures were needed.
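
The cleaning itself was done with a Java regular expression library, and that code is not shown. As a rough equivalent in R, assuming the markup consists of simple tags such as `<br />`, the step could be sketched as follows (the function name `clean_review` and the sample text are hypothetical):

```r
# Replace HTML tags with a space, then collapse whitespace runs and trim the ends.
clean_review <- function(text) {
  no_tags <- gsub("<[^>]+>", " ", text)
  trimws(gsub("[[:space:]]+", " ", no_tags))
}

# Hypothetical review text with markup and whitespace padding.
raw <- "  This film was <br /><br />surprisingly   good.  "
cleaned <- clean_review(raw)
print(cleaned)
```

A tag-stripping pattern like this is only adequate for simple markup; nested or malformed HTML would need a real parser.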

5 Data
• K-modes was performed on the training set.
  ◦ BOWTrainVectorized.txt
    ▪ 12,500 objects
    ▪ Each feature vector consists of 2 categorical variables and 7 numeric variables
    ▪ Reviewer ID...Identifies the reviewer...may not be unique
    ▪ Sentiment Value...Binary value: (1) = positive and (0) = negative.
    ▪ Total Word Count...Number of all words in the review text.
    ▪ Stopword Count...Number of words in the review text that are stopwords.
    ▪ Useful Word Count...Total Word Count - Stopword Count.
    ▪ Good Adjective Count...Number of words in the review text that are positive adjectives.
    ▪ Bad Adjective Count...Number of words in the review text that are negative adjectives.
    ▪ Good Phrase Count...Number of sequential, multiple-word strings in the review text that represent positive sentiment.
    ▪ Bad Phrase Count...Number of sequential, multiple-word strings in the review text that represent negative sentiment.
• Example Vector (figure)
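
How the count features relate to one another can be shown with a short R sketch. The word lists below are hypothetical and deliberately tiny (the lists used in the study are not given), and phrase counts are left out for brevity.

```r
# Hypothetical word lists, for illustration only.
stopwords       <- c("the", "a", "was", "and", "it", "is")
good_adjectives <- c("good", "great", "excellent")
bad_adjectives  <- c("bad", "awful", "boring")

# Compute the word-count features for one review text.
count_features <- function(text) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nzchar(words)]       # drop empty tokens
  total <- length(words)
  stops <- sum(words %in% stopwords)
  c(Twrd_count = total,
    Swrd_count = stops,
    Uwrd_count = total - stops,       # Useful = Total - Stopword
    Good_Adj   = sum(words %in% good_adjectives),
    Bad_Adj    = sum(words %in% bad_adjectives))
}

features <- count_features(
  "The plot was boring and the acting was awful, but it is a great score."
)
print(features)
```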

6 Data Summary
Feature      Minimum   Median   Mean   Maximum   Sum
S_value      0         1        0.5    1         6,312
Twrd_count   11        174      233    2,460     ~2.91mil
Swrd_count   4         87       115    1,097     ~1.44mil
Uwrd_count   7         87       117    1,363     ~1.47mil
Good_Adj     0         0        < 1    30        11,043
Bad_Adj      0         0        < 1    15        9,499
Good_Phr     0         0        < 1    2         303
Bad_Phr      0         0        < 1    1         201

7 Procedure-R script
# install and load the required packages
install.packages("plyr")
install.packages("klaR")
library(plyr)
library(klaR)

# read the tab-delimited training data into a data frame
# (path as given on the slide; adjust it to the file's location)
Train_Data <- read.delim("~BOWTrainVectorized.txt", header = TRUE, sep = "\t")

# perform k-modes clustering on columns 2 to 9 (every feature except Reviewer ID)
# with 3 clusters, at most 3 iterations and the simple-matching distance
cluster_Train <- kmodes(Train_Data[2:9], 3, iter.max = 3, weighted = FALSE)

# create a frequency table to identify each cluster
freqTable_Train <- table(cluster_Train$cluster)

# create a pie chart of the cluster distribution
pie(freqTable_Train, main = "Cluster Distribution for Training Set")

# append the cluster information to the data frame
Train_Data_Mod <- cbind(Train_Data, cluster_Train$cluster)

# create a subset of the data frame for each cluster
train_cluster1 <- subset(Train_Data_Mod, cluster_Train$cluster == 1)
train_cluster2 <- subset(Train_Data_Mod, cluster_Train$cluster == 2)
train_cluster3 <- subset(Train_Data_Mod, cluster_Train$cluster == 3)

# create cluster sum information for each cluster
colSums(train_cluster1[, 2:9])
colSums(train_cluster2[, 2:9])
colSums(train_cluster3[, 2:9])

# create summary statistics for the training set
colSums(Train_Data[2:9])
summary(Train_Data[2:9])

8 Results
Characteristics
Cluster   Size    Within Cluster Distance
1         6,488   24,803
2         3,639   14,087
3         2,373   8,062

[Figure: pie chart with slices labeled 1, 2, 3; label: Distance metric]

Aggregates
Cluster   Good_Adj   Bad_Adj   Good_Phr   Bad_Phr
1         6,626      1,713     197        54
2         1,976      7,235     80         130
3         2,441      551       86         17

Aggregates
Cluster   S_value   Twrd_count   Swrd_count   Uwrd_count
1         4,464     ~1.4m        ~720k        ~733k
2         14        ~955k        ~475k        ~479k
3         1,834     ~508k        ~251k        ~257k

9 Results
[Figure: pie chart with slices labeled 1, 2, 3]
• Cluster 1
  ◦ Sentiment: Positive
  ◦ Mean Twrd_count: 224
  ◦ Mean Swrd_count: 110
  ◦ Mean Uwrd_count: 113
• Cluster 2
  ◦ Sentiment: Negative
  ◦ Mean Twrd_count: 214
  ◦ Mean Swrd_count: 105
  ◦ Mean Uwrd_count: 108
• Cluster 3
  ◦ Sentiment: Positive
  ◦ Mean Twrd_count: 262
  ◦ Mean Swrd_count: 130
  ◦ Mean Uwrd_count: 131
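
Per-cluster sizes, sums and means of the kind reported in the Results slides can be derived from a data frame with an appended cluster column. The sketch below uses a hypothetical six-row frame, not the study data, and base R's `rowsum` rather than the per-cluster subsets of the original script.

```r
# Hypothetical toy frame standing in for the clustered training data.
toy <- data.frame(
  S_value    = c(1, 1, 0, 0, 1, 0),
  Twrd_count = c(100, 200, 150, 50, 300, 250),
  cluster    = c(1, 1, 2, 2, 3, 1)
)

# Cluster sizes.
sizes <- table(toy$cluster)

# Column sums within each cluster (one row per cluster).
sums <- rowsum(as.matrix(toy[, c("S_value", "Twrd_count")]), group = toy$cluster)

# Per-cluster means: each row of sums divided by that cluster's size.
means <- sums / as.vector(sizes)

print(sizes)
print(sums)
print(means)
```

A mean S_value above 0.5 marks a cluster whose reviews are mostly positive, and below 0.5 mostly negative.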

10 Analysis
• The clusters are distinct.
• The clusters have good cohesion.
• Sentiment homogeneity in cluster 2 is very high (14 of 3,639 reviews are positive).
• Sentiment homogeneity in cluster 3 is very high (1,834 of 2,373 reviews are positive).
• Cluster 2 contains an extraordinarily high level of negative sentiment.
• Good/Bad Adjective and Phrase results are poor across all records.
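
One way to put a number on sentiment homogeneity is the share of the majority sentiment in each cluster. That definition is an assumption made here for illustration; the slides do not state how homogeneity was judged. The inputs are the cluster sizes and S_value sums from the Results tables.

```r
# Figures from the Results tables: cluster sizes and per-cluster sums of S_value.
size     <- c(`1` = 6488, `2` = 3639, `3` = 2373)
positive <- c(`1` = 4464, `2` = 14,   `3` = 1834)

# Share of positive reviews per cluster.
share_positive <- positive / size

# Assumed definition: homogeneity is the share of the majority sentiment.
homogeneity <- pmax(share_positive, 1 - share_positive)

print(round(100 * share_positive, 1))
print(round(100 * homogeneity, 1))
```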

