1 Stochastic Unsupervised Learning on Unlabeled Data July 2, 2011 Presented by Jianjun Xie – CoreLogic In collaboration with Chuanren Liu, Yong Ge, and Hui Xiong – Rutgers, the State University of New Jersey

2 Our Story
- “Let’s set up a team to compete in another data mining challenge” – a call with Rutgers
- Is it a competition on data preprocessing?
- Transform the problem into a clustering problem:
  - How many clusters are we shooting for?
  - Which distance measure works better?
- Go with stochastic K-means clustering.

3 Dataset Recap
- Five real-world data sets were extracted from different domains
- No labels were provided during the unsupervised learning challenge
- The withheld labels are multi-class; some records can belong to several labels at the same time
- Performance was measured by a global score, defined as the Area Under the Learning Curve
- A simple linear classifier (Hebbian learner) was used to calculate the learning curve
- Emphasis on small numbers of training samples via log2 scaling on the x-axis of the learning curve
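The log2 scaling on the x-axis can be made concrete with a small sketch. The exact challenge formula is not on the slide, so this is an assumption: it only illustrates a trapezoidal area under a learning curve whose x-axis is log2 of the training-set size, so each doubling of the sample count moves one unit along x and small sample sizes get heavy weight.

```python
import numpy as np

def area_under_learning_curve(scores):
    """Hypothetical sketch of the global score: trapezoidal area
    under the learning curve. scores[i] is the classifier's score
    when trained on 2**i samples, so plotting against i gives the
    log2 x-axis the slide mentions. Normalized by the x-range so a
    perfect curve scores 1. The real challenge formula may differ."""
    scores = np.asarray(scores, dtype=float)
    # trapezoid rule with unit spacing on the log2 axis
    area = ((scores[:-1] + scores[1:]) / 2.0).sum()
    return area / (len(scores) - 1)
```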

4 Evolution of Our Approaches
- Simple data preprocessing
  - Normalization: Z-scale (std = 1, mean = 0)
  - TF-IDF on text recognition (TERRY dataset)
- PCA:
  - PCA on raw data
  - PCA on normalized data
  - Normalized PCA vs. non-normalized PCA
- K-means clustering
  - Cluster on top N normalized PCs
  - Cosine similarity vs. Euclidean distance
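The preprocessing steps above can be sketched with plain NumPy. The function names are mine and the component count is a free parameter; this is a minimal illustration, not the authors' code.

```python
import numpy as np

def z_scale(X):
    """Z-scale normalization: every feature to mean 0, std 1."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant columns
    return (X - mu) / sigma

def top_pcs(X, n_components):
    """Project onto the top-N principal components via SVD of the
    centered data. 'PCA on normalized data' means z_scale first."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

Clustering on `top_pcs(z_scale(X), N)` corresponds to the "cluster on top N normalized PCs" bullet.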

5 Stochastic Clustering Process (our final approach)
- Given data set X, number of clusters K, and iteration count N
- For n = 1, 2, …, N:
  - Randomly choose K seeds from X
  - Perform K-means clustering; assign each record a cluster membership I_n
  - Transform I_n into a binary (one-hot) representation
- Combine the N binary representations together as the final result
- Example of the binary representation of clusters:
  - Say the cluster labels are 1, 2, 3
  - The binary representations will be (1 0 0), (0 1 0), and (0 0 1)

6 Results of Our Approaches Dataset Harry – human action recognition

7 Results Dataset Rita – object recognition

8 Results Dataset Sylvester – ecology

9 Results Dataset Terry – text recognition

10 Results Dataset Avicenna – Arabic manuscripts

11 Summary on Results
Overall rank: 2nd.

Dataset    Winner Valid  Winner Final  Winner Rank  Our Valid  Our Final  Our Rank
Avicenna   0.1744        0.2183        1            0.1386     0.1906     6
Harry      0.8640        0.7043        6            0.9085     0.7357     3
Rita       0.3095        0.4951        1            0.3737     0.4782     5
Sylvester  0.6409        0.4569        6            0.7146     0.5828     1
Terry      0.8195        0.8465        1            0.8176     0.8437     2

12 Discussions
- Stochastic clustering generally produces better results than PCA
- Cosine similarity is a better distance measure than Euclidean distance
- Normalized data generally works better than non-normalized data for k-means
- The number of clusters (K) is an important factor, but it could be relaxed for this particular competition
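One way to see the cosine-vs-Euclidean point: on unit-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity, since ||u − v||² = 2 − 2·cos(u, v). So standard K-means run on row-normalized data effectively clusters by cosine similarity. A sketch, with a helper name of my own choosing:

```python
import numpy as np

def unit_normalize(X):
    """Row-normalize each record to unit length. For unit vectors,
    ||u - v||^2 = 2 - 2*cos(u, v), so Euclidean k-means on these
    rows ranks neighbors exactly as cosine similarity would."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard all-zero rows
    return X / norms
```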
