
1 Bidirectional Active Learning: A Two-Way Exploration Into Unlabeled and Labeled Data Set
By Xiao-Yu Zhang, Shupeng Wang, Xiaochun Yun
Presented by Ruhani Faiheem Rahman

2 Abstract
Labelling data is one of the major problems in machine learning
We have a huge amount of unlabeled data, and only a small fraction is labeled
Active learning helps in that case, but what about mislabeled data?
Such noise will propagate into the model
This paper explores the labeled and unlabeled data sets simultaneously

3 Introduction
Classic machine learning: supervised learning and unsupervised learning
If the unlabeled data is explored along with the labeled data, considerable improvement is possible
Active learning: actively select the most informative instances to improve the model

4 Unidirectional Active Learning
Traditional active learning chooses instances from the unlabeled pool so as to learn the model effectively
Uncertainty sampling: choose the instance the model is least certain about (see the sketch after this slide)
Query by committee: a committee of models votes, and the samples with the most disagreement are selected
Decision-theoretic approach: choose the instance that would most reduce the model's generalization error if its label were known
Noise in the labeled data can jeopardize the learning performance
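As an illustration of the first strategy, here is a minimal uncertainty-sampling sketch in Python with scikit-learn; the logistic-regression model and the variable names are placeholders, not from the paper.

```python
# Minimal uncertainty-sampling sketch (illustrative; model and data are placeholders).
import numpy as np
from sklearn.linear_model import LogisticRegression

def select_most_uncertain(model, X_pool, k=1):
    """Return indices of the k pool instances the model is least certain about."""
    proba = model.predict_proba(X_pool)   # class probabilities per instance
    confidence = proba.max(axis=1)        # confidence in the top class
    return np.argsort(confidence)[:k]     # lowest confidence = most uncertain

# Usage: fit on the small labeled seed set, then query the pool.
# model = LogisticRegression().fit(X_labeled, y_labeled)
# query_idx = select_most_uncertain(model, X_unlabeled, k=5)
```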

5 Unidirectional Active Learning

6 Bidirectional Active Learning
Forward learning
Backward learning
Backward instance detecting: instance-level detecting, label-level detecting
Backward instance processing: undo, redo (relabel, reselect)
Backward learning algorithm

7 Forward Learning
Similar to unidirectional active learning (UDAL)
Select a forward instance from the unlabeled data set using the selection mechanisms described before
Add the instance to the labeled data set and remove it from the unlabeled data set
Train a new model
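A minimal sketch of one forward-learning step, reusing select_most_uncertain from the earlier sketch; the labeling oracle ask_label is a hypothetical stand-in for the human annotator.

```python
# Forward-learning step: move the selected instance from the unlabeled pool
# to the labeled set and retrain (a sketch; `ask_label` is an illustrative
# stand-in for the possibly noisy human annotator).
import numpy as np

def forward_step(model, X_lab, y_lab, X_unl, ask_label):
    idx = select_most_uncertain(model, X_unl, k=1)[0]
    x_new = X_unl[idx]
    y_new = ask_label(x_new)              # query the annotator (may be noisy)
    X_lab = np.vstack([X_lab, x_new])
    y_lab = np.append(y_lab, y_new)
    X_unl = np.delete(X_unl, idx, axis=0)
    model.fit(X_lab, y_lab)               # train a new model
    return model, X_lab, y_lab, X_unl
```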

8 Backward Learning: Detect a Backward Instance
Explores the labeled data set instead of the unlabeled data set
Detects an instance in the labeled data set by one of two criteria:
Instance-level detecting: find the instance whose removal would minimize the entropy over the unlabeled data set (sketched after this slide)
Label-level detecting: find the most suspiciously mislabeled instance, i.e., the one whose label change would most reduce the overall error
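A brute-force sketch of instance-level detecting, assuming leave-one-out retraining to score each labeled instance by the predictive entropy it leaves on the unlabeled pool; the paper's actual estimator may be more efficient than this naive loop.

```python
# Instance-level detection sketch: retrain without each labeled instance and
# measure average predictive entropy over the unlabeled pool; the instance
# whose removal minimizes that entropy is the backward instance.
import numpy as np
from sklearn.base import clone

def detect_backward_instance(model, X_lab, y_lab, X_unl):
    entropies = []
    for i in range(len(X_lab)):
        m = clone(model).fit(np.delete(X_lab, i, axis=0), np.delete(y_lab, i))
        p = np.clip(m.predict_proba(X_unl), 1e-12, 1.0)
        entropies.append(-(p * np.log(p)).sum(axis=1).mean())
    return int(np.argmin(entropies))      # removing this instance helps most
```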

9 Backward Learning: Process the Backward Instance
Undo: eliminate the negative influence by removing the instance from the training set; suitable for the instance-level method
Redo, in two variants (undo and reselect are sketched after this slide):
Relabel: the backward instance is returned to be labeled a second time; if the new label matches the previous one, the instance is duplicated (counted twice), otherwise the old label is replaced with the new one
Reselect: find the nearest neighbour of the backward instance in the unlabeled data set; since nearby instances are likely to share a label, query the neighbour's label and add it to the training set
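Minimal sketches of the undo and reselect strategies; relabel is omitted since it only re-queries the oracle. The helper names and the Euclidean nearest-neighbour choice are illustrative assumptions.

```python
# Processing sketches for a detected backward instance i (illustrative).
import numpy as np

def undo(X_lab, y_lab, i):
    """Undo: simply drop the suspect instance from the training set."""
    return np.delete(X_lab, i, axis=0), np.delete(y_lab, i)

def reselect(X_lab, y_lab, i, X_unl, ask_label):
    """Reselect: query the suspect's nearest unlabeled neighbour instead,
    on the assumption that nearby instances tend to share a label."""
    d = np.linalg.norm(X_unl - X_lab[i], axis=1)
    j = int(np.argmin(d))
    X_lab = np.vstack([X_lab, X_unl[j]])
    y_lab = np.append(y_lab, ask_label(X_unl[j]))
    return X_lab, y_lab, np.delete(X_unl, j, axis=0)
```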

10

11 BDAL Algorithm
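The deck presents the BDAL algorithm as a figure; as a rough stand-in, here is a loop that alternates the forward and backward sketches above. The one-forward-one-backward schedule and the fixed undo strategy are assumptions, not the paper's exact procedure.

```python
# End-to-end BDAL loop sketch built from the pieces above.
def bdal(model, X_lab, y_lab, X_unl, ask_label, rounds=20):
    model.fit(X_lab, y_lab)
    for _ in range(rounds):
        # Forward: query a new informative instance from the unlabeled pool.
        model, X_lab, y_lab, X_unl = forward_step(model, X_lab, y_lab, X_unl, ask_label)
        # Backward: detect a suspect labeled instance and undo it.
        i = detect_backward_instance(model, X_lab, y_lab, X_unl)
        X_lab, y_lab = undo(X_lab, y_lab, i)
        model.fit(X_lab, y_lab)
    return model
```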

12 Experiments
Synthetic data classification
Handwritten digit classification
Image classification
Patent document classification

13 Synthetic Data Classification
Two-class synthetic data: 410 instances, 205 per class
5 instances from each class are selected randomly for initial training

14 Synthetic Data Classification

15 B. Handwritten Digit Classification
MNIST data set: 10,000 test instances, each image 28 × 28 pixels
100 images randomly picked for initial training
For each model update, 100 images are labeled with 5% label noise
Results are averaged over 20 runs
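The deck does not say how the 5% noise is injected; a plausible sketch, assuming each queried label is flipped to a random other class with probability 0.05:

```python
# Hypothetical 5% label-noise protocol for the experiments.
import numpy as np

def noisy_oracle(y_true, n_classes, p=0.05, rng=np.random.default_rng(0)):
    y = y_true.copy()
    flip = rng.random(len(y)) < p                       # which labels to corrupt
    y[flip] = (y[flip] + rng.integers(1, n_classes, flip.sum())) % n_classes
    return y                                            # flipped labels differ from the truth
```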

16 B. Handwritten Digit Classification

17 C. Image Classification
50 categories of images (e.g., car, ship, human), each containing 100 images
10 images from each category are used for initial training, i.e., 500 images in total
For each model update, 500 images are labeled with 5% label noise
Results are averaged over 50 runs

18 C. Image Classification

19 D. Patent Document Classification
5,000 patents from the Innography database, manually classified by domain experts into 5 classes
5,484 terms are extracted from the raw text; PCA then reduces them to a 150-D feature vector
50 instances are picked randomly as initial training data
For each model update, 100 instances are labeled with 5% label noise
Results are averaged over 20 runs
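A sketch of the described feature pipeline, with a random matrix standing in for the real 5,000 × 5,484 term counts:

```python
# Patent feature pipeline sketch: 5,484 raw term features reduced to 150-D with PCA.
import numpy as np
from sklearn.decomposition import PCA

X_terms = np.random.rand(5000, 5484)                 # placeholder for the term matrix
X_150 = PCA(n_components=150).fit_transform(X_terms)
print(X_150.shape)                                   # (5000, 150)
```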

20 D. Patent Document Classification

21 Conclusion
BDAL performs best in all the experiments: BDAL > UDAL > NonAL
The redo strategy performs slightly better than UDAL
The undo strategy outperforms the others in most experiments

22 Future Plan
Different strategies can be adopted during backward learning depending on the noise in the data
Fast approximation algorithms will be studied for computational efficiency

23 Thank You

24 Any Questions?

