COMBINED UNSUPERVISED AND SEMI-SUPERVISED LEARNING FOR DATA CLASSIFICATION
Fabricio Aparecido Breve, Daniel Carlos Guimarães Pedronette
São Paulo State University (UNESP), Rio Claro, SP, Brazil
fabricio@rc.unesp.br ; daniel@rc.unesp.br

Abstract - Semi-supervised learning methods exploit both labeled and unlabeled data items in their training process, requiring only a small subset of labeled items. Although capable of drastically reducing the cost of the labeling process, such methods depend directly on the effectiveness of the distance measure used for building the kNN graph. Unsupervised distance learning approaches, on the other hand, aim at capturing and exploiting the dataset structure in order to compute a more effective distance measure, without the need for any labeled data. In this paper, we propose a combined approach which employs both the unsupervised and the semi-supervised learning paradigms. An unsupervised distance learning procedure is performed as a pre-processing step for improving the effectiveness of the kNN graph. Based on the more effective graph, a semi-supervised learning method is used for classification. The proposed Combined Unsupervised and Semi-Supervised Learning (CUSSL) approach is based on very recent methods: the Reciprocal kNN Distance is used for the unsupervised distance learning task, and the semi-supervised classification is performed by Particle Competition and Cooperation (PCC). Experimental results on six public datasets demonstrate that the combined approach achieves effective results, boosting the accuracy of classification tasks.

[Figure: overall work-flow of the combined method. In the green dashed box, the unsupervised distance learning method (Reciprocal kNN Distance); in the red dashed box, the semi-supervised learning method used for classification (Particle Competition and Cooperation).]

Reciprocal kNN Distance
The Reciprocal kNN Distance was proposed to provide a more effective distance measure in image retrieval scenarios. It takes into account the intrinsic dataset structure by analyzing the reciprocal references at top rank positions. Its modelling in terms of rank information enables its use in many other scenarios, especially those which require the computation of k nearest neighbors, as PCC does. The Reciprocal kNN Distance between two data items $x_q, x_i \in \mathcal{X}$ is computed based on the number of reciprocal neighbors at the top positions of their ranked lists $\tau_q, \tau_i \in \mathcal{T}$. Additionally, a weight is assigned to each pair of reciprocal neighbors, proportional to their positions in the ranked lists $\tau_q$ and $\tau_i$.
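Since the transcript gives only this rank-based description, the sketch below shows one plausible reading of such a reciprocal-neighbor distance; the exact position weighting and the final similarity-to-distance conversion are assumptions of this sketch, not necessarily the authors' formulation.

```python
import numpy as np

def ranked_lists(D):
    """Ranked list tau[x] for every item: all item indices sorted by
    increasing base distance D[x] (e.g. Euclidean)."""
    return np.argsort(D, axis=1)

def reciprocal_knn_distance(tau, q, i, k):
    """Sketch of a reciprocal-kNN-style distance between items q and i.

    Pairs (a, b) taken from the top-k of tau[q] and tau[i] that also
    reference each other reciprocally within their own top-k lists
    increase a similarity score, weighted by rank position (the
    1/((pos_a+1)*(pos_b+1)) weighting is this sketch's assumption).
    """
    score = 0.0
    for pos_a, a in enumerate(tau[q][:k]):
        for pos_b, b in enumerate(tau[i][:k]):
            # a and b are reciprocal neighbors: each appears in the
            # other's top-k ranked list
            if a in tau[b][:k] and b in tau[a][:k]:
                score += 1.0 / ((pos_a + 1) * (pos_b + 1))
    # higher reciprocal agreement at top ranks -> smaller distance
    return 1.0 / (1.0 + score)
```

The learned distances can then replace the Euclidean ones when building the kNN graph that feeds the semi-supervised step below, with the neighborhood size of the distance learning loop and of the graph construction controlled independently (the parameters $k_r$ and $k_n$ of the experiments).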
Particle Competition and Cooperation
A particle is generated for each labeled node of the graph, and its initial position is set to its corresponding node. Particles with the same label play for the same team and cooperate with each other; particles with different labels compete against each other. Particles increase the domination level of their own team in the nodes they visit and, at the same time, decrease the domination levels of the other teams.

Each node $v_i$ holds a domination vector with one level $v_i^{\omega_\ell}$ per class $\ell$, initialized as

$$v_i^{\omega_\ell}(0) = \begin{cases} 1 & \text{if } y_i = \ell \\ 0 & \text{if } y_i \neq \ell \text{ and } y_i \in L \\ \frac{1}{c} & \text{if } y_i = \emptyset \end{cases}$$

Each particle randomly chooses a neighbor to visit at each iteration. Nodes which are already dominated by the particle's team and closer to the particle's initial node have higher probabilities of being chosen:

$$p(v_i \mid \rho_j) = (1 - p_{grd})\, \frac{W_{qi}}{\sum_{\mu=1}^{n} W_{q\mu}} \;+\; p_{grd}\, \frac{W_{qi}\, v_i^{\omega_\ell} \left(1 + \rho_j^{d_i}\right)^{-2}}{\sum_{\mu=1}^{n} W_{q\mu}\, v_\mu^{\omega_\ell} \left(1 + \rho_j^{d_\mu}\right)^{-2}}$$

where $q$ is the particle's current node, $\ell = \rho_j^{f}$ is its team label, and $\rho_j^{d_i}$ is its distance-table entry for node $v_i$. When a particle $\rho_j$ visits a node $v_i$, the domination levels are updated as

$$v_i^{\omega_\ell}(t+1) = \begin{cases} \max\left\{0,\; v_i^{\omega_\ell}(t) - \dfrac{\Delta_v\, \rho_j^{\omega}(t)}{c-1}\right\} & \text{if } y_i = \emptyset \text{ and } \ell \neq \rho_j^{f} \\ v_i^{\omega_\ell}(t) + \displaystyle\sum_{q \neq \ell} \left[\, v_i^{\omega_q}(t) - v_i^{\omega_q}(t+1) \,\right] & \text{if } y_i = \emptyset \text{ and } \ell = \rho_j^{f} \\ v_i^{\omega_\ell}(t) & \text{if } y_i \in L \end{cases}$$

A particle gets strong when it visits a node dominated by its own team, but it gets weak when it selects a node dominated by another team: its strength is reset to the domination level of its own team at the visited node,

$$\rho_j^{\omega}(t) = v_i^{\omega_\ell}(t), \qquad \ell = \rho_j^{f}$$

[Figure: node domination levels (nodes $v_1$ to $v_4$) before ($t$) and after ($t+1$) a particle visit.]

Computer Simulations
Computer simulations using artificial and real-world data sets are presented in order to show the effectiveness of the proposed method. For each data set, we applied both the original PCC method and the proposed method (CUSSL). For PCC, the parameter $k$ defines the size of the k-neighborhood (number of nearest neighbors) used for the graph construction with the Euclidean distance (L2). For CUSSL, two parameters are considered: (i) the size of the k-neighborhood used by the unsupervised distance learning algorithm, referred to in this section as $k_r$; and (ii) the size of the k-neighborhood used for the graph construction, referred to in this section as $k_n$.

Classification accuracy (%) on the Iris data set with 2% to 10% labeled samples:

Labeled   2%      4%      6%      8%      10%
PCC       90.44   89.77   90.52   91.23   91.77
CUSSL     92.07   91.42   91.43   92.17   92.89

Classification accuracy (%) on the Wine data set with 2% to 10% labeled samples:

Labeled   2%      4%      6%      8%      10%
PCC       93.24   92.54   93.26   94.13   94.89
CUSSL     94.17   94.84   95.13   95.39   95.74

Classification accuracy (%) on the Digit1, COIL, USPS, and g241c data sets with 10 labeled samples:

Dataset   Digit1  COIL    USPS    g241c   Mean
PCC       86.91   39.30   80.06   56.96   65.77
CUSSL     87.70   40.76   82.81   59.51   67.28

Classification accuracy (%) on the Digit1, COIL, USPS, and g241c data sets with 100 labeled samples:

Dataset   Digit1  COIL    USPS    g241c   Mean
PCC       97.31   74.21   93.59   73.87   84.73
CUSSL     97.48   76.87   95.19   73.85   85.00

[Figure: classification accuracy achieved by CUSSL with different combinations of $k_r$ and $k_n$ on (a) Digit1, (b) COIL, (c) USPS, and (d) g241c data sets with 10 labeled samples.]

Conclusion
The proposed approach performs an unsupervised distance learning step, without the need for any labeled or training data, through the Reciprocal kNN Distance. The objective is to exploit the intrinsic dataset structure in order to improve the distances among data items. Subsequently, a kNN graph is computed based on the learned distance and used as input for a semi-supervised learning step, based on the Particle Competition and Cooperation approach, which combines both labeled and unlabeled data items in its training process. An experimental evaluation was conducted on six public datasets, including artificial and real-world data sets, and considered various sizes of labeled sets in the training procedure. The vast majority of the experimental results demonstrated the benefits of the combined approach, CUSSL, in comparison to the original PCC method.

Acknowledgments
The authors would like to thank the São Paulo Research Foundation - FAPESP (grants 2011/17396-9 and 2013/08645-0) and the National Council for Scientific and Technological Development - CNPq (grant 475717/2013-9) for the financial support.
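To complement the transcript, here is a minimal sketch of a single particle movement in PCC, transcribing the three equations of the method section above. All identifiers (pcc_step, dist, delta_v, the default parameter values) are this sketch's own naming and assumptions, and the full method includes additional rules (e.g. what happens when a weakened particle is repelled) that are omitted here.

```python
import numpy as np

def pcc_step(W, v, q, ell, strength, dist, labeled,
             p_grd=0.5, delta_v=0.1, rng=None):
    """One movement of one particle (sketch).

    W        : (n, n) weight matrix of the kNN graph (W[q, i] > 0 iff edge)
    v        : (n, c) node domination levels; rows of labeled nodes stay fixed
    q        : particle's current node
    ell      : particle's team (class) label, ell = rho_j^f
    strength : particle strength rho_j^omega in [0, 1]
    dist     : (n,) particle's distance table to its home node (rho_j^d)
    labeled  : (n,) boolean mask of labeled nodes
    """
    if rng is None:
        rng = np.random.default_rng()
    n, c = v.shape

    # p(v_i | rho_j): convex mix of a random term and a greedy term that
    # favors nodes dominated by the team and close to the home node
    # (assumes q has some neighbor with nonzero team domination)
    rand_term = W[q] / W[q].sum()
    greedy = W[q] * v[:, ell] * (1.0 + dist) ** -2
    probs = (1 - p_grd) * rand_term + p_grd * greedy / greedy.sum()
    i = rng.choice(n, p=probs)

    if not labeled[i]:
        # each other team loses up to delta_v * strength / (c - 1) ...
        others = [t for t in range(c) if t != ell]
        before = v[i, others].sum()
        v[i, others] = np.maximum(
            0.0, v[i, others] - delta_v * strength / (c - 1))
        # ... and the particle's own team absorbs exactly what they lost
        v[i, ell] += before - v[i, others].sum()

    # strength is reset to the team's domination level at the visited node
    strength = v[i, ell]
    # keep the shortest known hop distance to the particle's home node
    dist[i] = min(dist[i], dist[q] + 1)
    return i, strength
```

Running this step repeatedly for every particle until the domination levels stabilize, and labeling each unlabeled node with the team of highest domination, reproduces the classification procedure described above.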