Semi-supervised Learning on Partially Labeled Imbalanced Data May 16, 2010 Jianjun Xie and Tao Xiong

What Problem We Are Facing
- Six data sets extracted from six different domains
  - The domain of each set was withheld during the contest
- All are binary classification problems
- All are imbalanced data sets
  - The percentage of positive labels varies from 7.2% to 25.2%
  - This information was also withheld during the competition
- They differ significantly from the development sets
- Each starts with a single known positive label

Datasets Summary (final contest datasets)

Dataset | Domain                   | # Features | # Train | Positive label %
A       | Handwriting Recognition  | 92         | 17,…    | …
B       | Marketing                | 250        | 25,…    | …
C       | Chemo-informatics        | 851        | 25,…    | …
D       | Text Processing          | 12,000     | 10,…    | …
E       | Embryology               | 154        | 32,…    | …
F       | Ecology                  | 12         | 67,…    | …

(The training-set sizes and positive-label percentages were truncated in the transcript; only their leading digits survive.)

Stochastic Semi-supervised Learning (our approach when the number of labels < 200)

Conditions:
- The label distribution is highly imbalanced; positive labels are rare
- Known labels are few
- Unlabeled data are abundant

Approach to A, C, and D (sketched in code below):
- Randomly pick one record from the unlabeled data pool as "negative"
- Use the given positive seed and the picked "negative" seed as the initial cluster centers for k-means clustering
- Label as positive the cluster in which the positive seed resides
- Repeat the above process n times
- Take the normalized cluster-membership count of each data point as the first set of prediction scores
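As an illustration of this loop, here is a minimal sketch in Python, assuming a feature matrix `X` and the index `pos_idx` of the single known positive; the helper name and the use of scikit-learn's k-means are our assumptions, not the authors' actual code:

```python
import numpy as np
from sklearn.cluster import KMeans

def stochastic_kmeans_scores(X, pos_idx, n_rounds=100, seed=0):
    """Normalized positive-cluster membership over n random restarts."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(X))
    unlabeled = np.setdiff1d(np.arange(len(X)), [pos_idx])
    for _ in range(n_rounds):
        # Randomly pick one unlabeled record and treat it as "negative".
        neg_idx = rng.choice(unlabeled)
        init = np.vstack([X[pos_idx], X[neg_idx]])
        km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
        # The cluster containing the positive seed is labeled positive.
        counts += (km.labels_ == km.labels_[pos_idx])
    return counts / n_rounds  # prediction score in [0, 1]
```

Averaging over many random "negative" picks smooths out the noise from any single bad pick, which is what lets the method tolerate an essentially arbitrary negative seed.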

Stochastic Semi-supervised Learning -- continued

Approach to A, C, and D, once more labels are known after querying (see the extended sketch below):
- Use both the known labels and randomly picked "negative" seeds as the initial cluster centers
- Label clusters using the known positive seeds
- Discard any cluster whose membership is not clear
- Store the cluster membership of each data point
- Use the normalized positive-cluster membership counts as prediction scores
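A hedged sketch of the extended version, assuming `pos_idx` and `neg_idx` are arrays of known positive and negative indices; the rule for discarding "unclear" clusters below (any cluster claimed by both kinds of seeds) is our reading of the slide, not a documented detail:

```python
import numpy as np
from sklearn.cluster import KMeans

def stochastic_kmeans_multi(X, pos_idx, neg_idx, n_rounds=100, seed=0):
    """Like the single-seed version, but seeded with all known labels."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(X))
    known = np.concatenate([pos_idx, neg_idx])
    pool = np.setdiff1d(np.arange(len(X)), known)
    for _ in range(n_rounds):
        rand_neg = rng.choice(pool, size=1)  # random "negative" pick
        seeds = np.concatenate([pos_idx, neg_idx, rand_neg])
        km = KMeans(n_clusters=len(seeds), init=X[seeds], n_init=1).fit(X)
        pos_c = set(km.labels_[pos_idx])
        neg_c = set(km.labels_[np.concatenate([neg_idx, rand_neg])])
        # Count only clusters labeled unambiguously by the positive seeds;
        # clusters containing both kinds of seeds are discarded.
        for c in pos_c - neg_c:
            counts[km.labels_ == c] += 1
    return counts / n_rounds
```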

Stochastic Semi-supervised Learning -- continued

Approach to B, E, and F (sketched below):
- Randomly pick 20 unlabeled records as "negatives" for each known positive label
- Build an over-fit logistic regression model on this dataset
- Repeat the random picking and model building n times
- The final score is the average of the n models
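A minimal sketch of this bagging scheme, with illustrative names (`X`, `pos_idx`) and scikit-learn's logistic regression standing in for whatever implementation the authors used; the large `C` (weak regularization) is one way to get the deliberately "over-fit" model the slide describes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bagged_lr_scores(X, pos_idx, n_models=100, neg_per_pos=20, seed=0):
    """Average of n logistic models, each fit on positives + random 'negatives'."""
    rng = np.random.default_rng(seed)
    pool = np.setdiff1d(np.arange(len(X)), pos_idx)
    score = np.zeros(len(X))
    for _ in range(n_models):
        # 20 random "negatives" per known positive, as on the slide.
        neg = rng.choice(pool, size=neg_per_pos * len(pos_idx), replace=False)
        idx = np.concatenate([pos_idx, neg])
        y = np.r_[np.ones(len(pos_idx)), np.zeros(len(neg))]
        lr = LogisticRegression(C=1e6, max_iter=1000).fit(X[idx], y)
        score += lr.predict_proba(X)[:, 1]
    return score / n_models  # final score = average of the n models
```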

Supervised Learning Using Gradient Boosting Decision Tree (TreeNet)
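The slide names TreeNet, Salford Systems' commercial gradient boosting package. As a rough open-source stand-in (our substitution, with illustrative hyperparameters), scikit-learn's gradient boosting can play the same role once enough labels have been purchased:

```python
from sklearn.ensemble import GradientBoostingClassifier

def gbdt_scores(X, labeled_idx, y_labeled):
    """Fit a GBDT on the labels purchased so far and score all points."""
    model = GradientBoostingClassifier(n_estimators=500, learning_rate=0.05,
                                       max_depth=3, subsample=0.5)
    model.fit(X[labeled_idx], y_labeled)
    return model.predict_proba(X)[:, 1]
```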

Querying Strategy
- The query strategy is a critical part of active learning
- Popular approaches: uncertainty sampling, expected model change, query-by-committee
- What we tried (one possible form is sketched below):
  - Uncertainty sampling + density-based selective sampling
  - Random sampling (for large label purchases)
  - Certainty sampling (to try to get more positive labels)
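One plausible form of the "uncertainty + density-based selective" criterion, sketched under our own assumptions (the product weighting and the k-NN density estimate are illustrative, not the authors' documented method):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_queries(X, scores, n_queries=10, k=10):
    """Query points that are both uncertain and in dense regions."""
    uncertainty = 1.0 - 2.0 * np.abs(scores - 0.5)  # peaks at score 0.5
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)                      # column 0 is self
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)
    return np.argsort(-uncertainty * density)[:n_queries]

# Certainty sampling (to harvest likely positives) would instead rank by
# the scores themselves: np.argsort(-scores)[:n_queries].
```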

Dataset A: Handwriting Recognition
Global score = 0.623, rank 2nd.
[Learning-curve charts and the query table were lost in transcription. The table's columns were Sequence, Num. Samples, Num. Queried Samples, AUC, and Sampling Strategy; the strategies used across the sequence were Uncertainty/Selective, Uncertainty/Selective, Random, and Get All.]

Dataset B: Marketing
Global score = 0.375, rank 2nd.

Dataset C: Chemo-informatics
Global score = 0.334, rank 4th. Passive learning.

Dataset D: Text Processing
Global score = 0.331, rank 18th.

Dataset E: Embryology
Global score = 0.533, rank 3rd.
[Charts and the query table were lost in transcription; the strategies used across the query sequence were Certainty, Uncertainty/Selective, Uncertainty/Selective, and Get All.]

Dataset E: Embryology (continued)
Global score = 0.533, rank 3rd.
- Performance gets worse with more labels
- Newly queried labels over-corrected the existing model
- This phenomenon was common in this contest

Dataset F: Ecology
Global score = 0.77, rank 4th.
[Charts and the query table were lost in transcription; the strategies used across the query sequence were Uncertainty/Selective (three rounds), Random, and Get All.]

Dataset F: Ecology (continued)
- Performance got worse with just 2 more labels at the beginning
- Most of the time, too many small queries do more harm than good to the global score

Summary on Results
Overall rank 3rd.
[The per-dataset results table was lost in transcription; its columns were Dataset, Positive label %, AUC, ALC, Num. Queries, Rank, Winner AUC, and Winner ALC, with one row for each of datasets A–F.]

Discussions
- How can we consistently get better performance with only a few labels across different datasets?
- How can we consistently improve model performance as the number of labels grows within a given dataset?
- Does the log2 scaling give too much weight to the first few queries? What if every dataset started with a few more labels? (A sketch of the scaling's effect follows.)
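To make the log2 question concrete, here is a sketch of the challenge-style global score as we understand it: a normalized area under the AUC-versus-log2(#labels) learning curve. The exact normalization (between a random and a perfect curve) is our assumption:

```python
import numpy as np

def alc(n_labels, aucs):
    """Area under AUC vs. log2(#labels), normalized between a random
    (AUC = 0.5) and a perfect (AUC = 1.0) learning curve."""
    x = np.log2(np.asarray(n_labels, dtype=float))
    area = np.trapz(aucs, x)
    a_rand = np.trapz(np.full(len(x), 0.5), x)
    a_max = np.trapz(np.ones(len(x)), x)
    return (area - a_rand) / (a_max - a_rand)

# The step from 1 to 2 labels spans as much of the x-axis as the step
# from 1024 to 2048, so early queries dominate the score:
print(alc([1, 2, 1024, 2048], [0.5, 0.9, 0.9, 0.9]))  # ~0.76
```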