1 Stochastic Unsupervised Learning on Unlabeled Data July 2, 2011 Presented by Jianjun Xie – CoreLogic In collaboration with Chuanren Liu, Yong Ge, and Hui Xiong – Rutgers, the State University of New Jersey

2 Our Story
- “Let’s set up a team to compete in another data mining challenge” – a call with Rutgers
- Is it a competition on data preprocessing?
- Transform the problem into a clustering problem:
  - How many clusters are we shooting for?
  - Which distance measure works better?
- Go with stochastic K-means clustering.

3 Dataset Recap
- Five real-world data sets were extracted from different domains
- No labels were provided during the unsupervised learning challenge
- The withheld labels are multi-class; some records can belong to several labels at the same time
- Performance was measured by a global score, defined as the Area Under the Learning Curve
- A simple linear classifier (Hebbian learner) was used to calculate the learning curve
- Emphasis on small numbers of training samples via log2 scaling on the x-axis of the learning curve
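The log2 scaling on the x-axis can be made concrete with a small sketch. The exact challenge formula is not on the slide, so this is an assumption: it only illustrates a trapezoidal area under a learning curve whose x-axis is log2 of the training-set size, so each doubling of the sample count moves one unit along x and small sample sizes get heavy weight.

```python
import numpy as np

def area_under_learning_curve(scores):
    """Hypothetical sketch of the global score: trapezoidal area
    under the learning curve. scores[i] is the classifier's score
    when trained on 2**i samples, so plotting against i gives the
    log2 x-axis the slide mentions. Normalized by the x-range so a
    perfect curve scores 1. The real challenge formula may differ."""
    scores = np.asarray(scores, dtype=float)
    # trapezoid rule with unit spacing on the log2 axis
    area = ((scores[:-1] + scores[1:]) / 2.0).sum()
    return area / (len(scores) - 1)
```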

4 Evolution of Our Approaches
- Simple data preprocessing
  - Normalization: Z-scale (std = 1, mean = 0)
  - TF-IDF on text recognition (TERRY dataset)
- PCA:
  - PCA on raw data
  - PCA on normalized data
  - Normalized PCA vs. non-normalized PCA
- K-means clustering
  - Cluster on top N normalized PCs
  - Cosine similarity vs. Euclidean distance
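The preprocessing steps above can be sketched with plain NumPy. The function names are mine and the component count is a free parameter; this is a minimal illustration, not the authors' code.

```python
import numpy as np

def z_scale(X):
    """Z-scale normalization: every feature to mean 0, std 1."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant columns
    return (X - mu) / sigma

def top_pcs(X, n_components):
    """Project onto the top-N principal components via SVD of the
    centered data. 'PCA on normalized data' means z_scale first."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T
```

Clustering on `top_pcs(z_scale(X), N)` corresponds to the "cluster on top N normalized PCs" bullet.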

5 Stochastic Clustering Process (our final approach)
- Given data set X, number of clusters K, and iteration count N
- For n = 1, 2, …, N:
  - Randomly choose K seeds from X
  - Perform K-means clustering; assign each record a cluster membership I_n
  - Transform I_n into a binary (one-hot) representation
- Combine the N binary representations together as the final result
- Example of the binary representation of clusters:
  - Say the cluster labels are 1, 2, 3
  - The binary representations will be (1 0 0), (0 1 0), and (0 0 1)

6 Results of Our Approaches Dataset Harry – human action recognition

7 Results Dataset Rita – object recognition

8 Results Dataset Sylvester – ecology

9 Results Dataset Terry – text recognition

10 Results Dataset Avicenna – Arabic manuscripts

11 Summary on Results
Overall rank: 2nd.

Dataset    Winner Valid  Winner Final  Winner Rank  Our Valid  Our Final  Our Rank
Avicenna   0.1744        0.2183        1            0.1386     0.1906     6
Harry      0.8640        0.7043        6            0.9085     0.7357     3
Rita       0.3095        0.4951        1            0.3737     0.4782     5
Sylvester  0.6409        0.4569        6            0.7146     0.5828     1
Terry      0.8195        0.8465        1            0.8176     0.8437     2

12 Discussions
- Stochastic clustering generally produces better results than PCA
- Cosine similarity is a better distance measure than Euclidean distance
- Normalized data generally works better than non-normalized data for k-means
- The number of clusters (K) is an important factor, but it could be relaxed for this particular competition
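One way to see the cosine-vs-Euclidean point: on unit-normalized vectors, squared Euclidean distance is a monotone function of cosine similarity, since ||u − v||² = 2 − 2·cos(u, v). So standard K-means run on row-normalized data effectively clusters by cosine similarity. A sketch, with a helper name of my own choosing:

```python
import numpy as np

def unit_normalize(X):
    """Row-normalize each record to unit length. For unit vectors,
    ||u - v||^2 = 2 - 2*cos(u, v), so Euclidean k-means on these
    rows ranks neighbors exactly as cosine similarity would."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard all-zero rows
    return X / norms
```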
