Classifying and Clustering Using Support Vector Machines
2nd PhD report
PhD title: Data Mining in Unstructured Data
Daniel I. MORARIU, MSc
PhD Supervisor: Lucian N. VINŢAN
Sibiu, 2005
Contents
- Classification (clustering) steps
- Reuters Database processing
- Feature extraction and selection: Information Gain
- Support Vector Machine: binary classification, multiclass classification, clustering
- Sequential Minimal Optimization (SMO)
- Probabilistic outputs
- Experiments & results:
  - binary classification – aspects and results
  - feature subset selection – a comparative approach
  - multiclass classification – quantitative aspects
  - clustering – quantitative aspects
- Conclusions and further work
Classifying (clustering) steps
- Text mining – feature extraction
- Feature selection
- Classifying or clustering
- Testing results
Reuters Database Processing
- total documents, 126 topics, 366 regions, 870 industry codes
- industry category selection – system software: 7083 documents (4722 training samples, 2361 testing samples)
- attributes (features)
- 68 classes (topics)
- binary classification: topic c152 (only 2096 of the 7083 documents)
Feature extraction
- frequency vector (term frequencies)
- stop-word removal
- stemming
- threshold to prune the large frequency vector
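The extraction steps above (term frequency, stop-word removal, stemming, frequency threshold) can be sketched as follows; the stop-word list and the suffix stripper are illustrative stand-ins, not the tools the report actually used:

```python
import re
from collections import Counter

# Illustrative stop-word list; the report used a full list (not shown on the slide).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

def crude_stem(word):
    # Very rough stand-in for a real stemmer such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_frequencies(text, min_count=1):
    """Tokenize, drop stop words, stem, and keep terms at or above a threshold."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [crude_stem(t) for t in tokens if t not in STOPWORDS]
    counts = Counter(stems)
    return {term: n for term, n in counts.items() if n >= min_count}
```

Raising `min_count` is what keeps the resulting frequency vector from growing too large.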
Feature selection
- Information Gain
- SVM feature selection: linear kernel – weight vector
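A minimal sketch of Information Gain for a single term, assuming presence/absence features; the report's exact formulation may differ:

```python
import math

def entropy(probs):
    # Shannon entropy in bits; zero-probability outcomes contribute nothing.
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(docs, labels, term):
    """IG of a term: H(class) - H(class | term present/absent).

    docs: list of token sets; labels: parallel list of class labels.
    """
    n = len(docs)
    classes = set(labels)
    h_class = entropy([labels.count(c) / n for c in classes])
    h_cond = 0.0
    for present in (True, False):
        subset = [labels[i] for i, d in enumerate(docs) if (term in d) == present]
        if subset:
            h_cond += (len(subset) / n) * entropy(
                [subset.count(c) / len(subset) for c in classes]
            )
    return h_class - h_cond
```

Terms are then ranked by IG and only the top-scoring ones are kept.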
Support Vector Machine – binary classification
- optimal hyperplane
- higher-dimensional feature space
- primal optimization problem
- dual optimization problem – Lagrange multipliers
- Karush-Kuhn-Tucker conditions
- support vectors
- kernel trick
- decision function
Optimal hyperplane
[Figure: separating hyperplane {x | ⟨w,x⟩ + b = 0} with margin boundaries {x | ⟨w,x⟩ + b = −1} and {x | ⟨w,x⟩ + b = +1}; points labeled y_i = +1 and y_i = −1 on either side; w is the normal vector and the margin lies between the two boundaries.]
Higher-dimensional feature space
Primal optimization problem; Lagrange formulation; dual optimization problem (maximize the dual objective, subject to its constraints)
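The formulas on this slide did not survive extraction; assuming the standard separable (hard-margin) setting, they read:

```latex
% Primal optimization problem
\min_{w,\,b}\ \frac{1}{2}\,\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\,\bigl(\langle w, x_i\rangle + b\bigr) \ge 1,\qquad i = 1,\dots,n

% Dual optimization problem (Lagrange multipliers \alpha_i)
\max_{\alpha}\ W(\alpha) = \sum_{i=1}^{n}\alpha_i
- \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,\langle x_i, x_j\rangle
\quad\text{subject to}\quad
\alpha_i \ge 0,\qquad \sum_{i=1}^{n}\alpha_i\, y_i = 0
```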
SVM – characteristics
- Karush-Kuhn-Tucker (KKT) conditions: at the saddle point only the non-zero Lagrange multipliers matter
- support vectors: the patterns x_i whose Lagrange multipliers are non-zero
- kernel trick: positive definite kernel
- decision function
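The slide's formulas are missing; the standard statements of these conditions are:

```latex
% KKT complementarity at the saddle point
\alpha_i \bigl[\, y_i\,(\langle w, x_i\rangle + b) - 1 \,\bigr] = 0,
\qquad i = 1,\dots,n

% Support vectors: the patterns x_i with \alpha_i > 0

% Kernel trick: a positive definite kernel replaces the inner product
k(x, x') = \langle \Phi(x), \Phi(x') \rangle

% Decision function
f(x) = \operatorname{sgn}\!\Bigl(\sum_{i \in SV} \alpha_i\, y_i\, k(x_i, x) + b\Bigr)
```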
Multi-class classification Separate one class versus the rest
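The one-versus-the-rest scheme can be sketched as below; `CentroidScorer` is a hypothetical stand-in for a trained binary SVM, used only to keep the example self-contained:

```python
import numpy as np

class CentroidScorer:
    """Stand-in for a trained binary SVM: scores a point by (negative)
    distance to the centroid of the positive examples. Illustrative only."""
    def __init__(self, X, targets):
        pos = [x for x, t in zip(X, targets) if t == 1]
        self.center = np.mean(pos, axis=0)

    def decision_value(self, x):
        return -np.linalg.norm(np.asarray(x, dtype=float) - self.center)

class OneVsRest:
    """One class versus the rest: train one binary scorer per class and
    predict the class whose scorer returns the largest decision value."""
    def __init__(self, make_binary_scorer):
        self.make_binary_scorer = make_binary_scorer
        self.models = {}

    def fit(self, X, y):
        for c in set(y):
            # Relabel: +1 for the current class, -1 for all the rest.
            targets = [1 if label == c else -1 for label in y]
            self.models[c] = self.make_binary_scorer(X, targets)
        return self

    def predict(self, x):
        return max(self.models, key=lambda c: self.models[c].decision_value(x))
```

With a real SVM trainer plugged in, `fit` builds one binary classifier per topic, exactly as the one-versus-rest scheme requires.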
Clustering – characteristics
- data mapped into a higher-dimensional space
- search for the minimal enclosing sphere
- primal optimization problem
- dual optimization problem
- Karush-Kuhn-Tucker conditions
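The optimization problems were not preserved in extraction; in the usual minimal-enclosing-sphere formulation (center a, radius R, slacks ξ_i) they are:

```latex
% Primal: smallest sphere in feature space containing the mapped data
\min_{R,\,a,\,\xi}\ R^{2} + C\sum_{i}\xi_i
\quad\text{subject to}\quad
\lVert \Phi(x_i) - a \rVert^{2} \le R^{2} + \xi_i,\qquad \xi_i \ge 0

% Dual (multipliers \beta_i); KKT puts the support vectors on the sphere
\max_{\beta}\ \sum_{i}\beta_i\, k(x_i, x_i)
- \sum_{i,j}\beta_i\,\beta_j\, k(x_i, x_j)
\quad\text{subject to}\quad
0 \le \beta_i \le C,\qquad \sum_{i}\beta_i = 1
```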
SMO characteristics
Only two parameters are updated at each step (the minimal-size update). Benefits:
- doesn't need any extra matrix storage
- doesn't need a numerical QP optimization step
- needs more iterations to converge, but only a few operations at each step, which leads to an overall speed-up
Components:
- analytic method to solve the problem for two Lagrange multipliers
- heuristics for choosing the points
SMO – components
Analytic method; heuristics for choosing the points:
- choice of the 1st point (x_1, α_1): find KKT violations
- choice of the 2nd point (x_2, α_2): update the pair α_1, α_2 that causes a large change, which in turn results in a large increase of the dual objective – maximize the quantity |E_1 − E_2|
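The slide shows only headings; the analytic two-multiplier update that SMO applies at each step (Platt's formulas, restated here) is:

```latex
% Errors on the two chosen points
E_i = f(x_i) - y_i

% Curvature along the constraint line
\eta = k(x_1, x_1) + k(x_2, x_2) - 2\,k(x_1, x_2)

% Unconstrained optimum for the second multiplier, then clipping to [L, H]
\alpha_2^{\text{new}} = \alpha_2 + \frac{y_2\,(E_1 - E_2)}{\eta},
\qquad
\alpha_2^{\text{clipped}} = \min\bigl(H,\ \max(L,\ \alpha_2^{\text{new}})\bigr)

% First multiplier updated to keep \sum_i \alpha_i y_i = 0
\alpha_1^{\text{new}} = \alpha_1 + y_1 y_2\,\bigl(\alpha_2 - \alpha_2^{\text{clipped}}\bigr)
```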
Probabilistic outputs
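This slide's formula is missing; the usual way to turn the raw SVM output f(x) into a probability is Platt's sigmoid, with parameters A and B fitted by maximum likelihood on held-out data:

```latex
P\bigl(y = 1 \mid f(x)\bigr) = \frac{1}{1 + \exp\bigl(A\,f(x) + B\bigr)}
```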
Feature selection using SVM
- linear kernel, primal optimization form
- keep only the features whose weight in the learned w vector is greater than a threshold
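A sketch of the selection rule, assuming the learned weight vector w is available and that "greater than a threshold" refers to the weight's magnitude:

```python
import numpy as np

def select_features_by_weight(w, threshold):
    """Keep the features whose weight magnitude in the learned linear-SVM
    weight vector w exceeds the threshold; returns the kept indices."""
    w = np.asarray(w)
    return np.flatnonzero(np.abs(w) > threshold)
```

The retained indices define the reduced feature vector used in the later experiments.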
Kernels used: polynomial kernel, Gaussian kernel
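The kernel formulas themselves are not reproduced on the slide; generic forms with the parameters the experiments vary (degree d and a bias term for the polynomial kernel, a width parameter C for the Gaussian) might look like:

```python
import numpy as np

def polynomial_kernel(x, y, degree, bias=1.0):
    """Generic polynomial kernel (x.y + bias)^degree; the report studies
    the influence of the bias, but its exact parameterization is not shown."""
    return (np.dot(x, y) + bias) ** degree

def gaussian_kernel(x, y, C):
    """Gaussian (RBF) kernel; here C plays the role of the width parameter
    (an assumption -- the slide only names the constant C = 2.7)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return np.exp(-np.dot(diff, diff) / C)
```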
Data representation
- binary: values 0 and 1
- nominal
- Cornell SMART
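The formulas for these representations are not on the slide; a common formulation (treat the exact normalizations as assumptions) can be sketched as:

```python
import math

def binary_rep(tf):
    # Binary representation: 1 if the term occurs in the document, else 0.
    return 1 if tf > 0 else 0

def nominal_rep(tf, max_tf):
    # Nominal: term frequency normalized by the document's largest frequency.
    return tf / max_tf if max_tf else 0.0

def cornell_smart_rep(tf):
    # Cornell SMART weighting: 0 for absent terms, else 1 + log(1 + log(tf)).
    return 0.0 if tf == 0 else 1.0 + math.log(1.0 + math.log(tf))
```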
Binary classification
[Chart: results by kernel degree d for the binary, nominal and Cornell SMART representations]
Binary classification
[Chart: results by kernel degree d for the binary, nominal and Cornell SMART representations]
Influence of vector size Polynomial kernel
Influence of vector size Gaussian kernel
Polynomial kernel IG versus SVM – 427 features
Gaussian kernel IG versus SVM – 427 features
LibSvm versus UseSvm Polynomial kernel
LibSvm versus UseSvm Gaussian kernel
Multiclass classification Polynomial kernel features
Multiclass classification Gaussian kernel 2488 features
Clustering using SVM (values in percent; rows υ, columns number of features)
υ = 0.01: 0.6%, 0.7%, 0.6%
υ = 0.1: 0.5%
υ = 0.5: 25.2%, 25.1%
Conclusions – best results
- polynomial kernel and nominal representation (degrees 5 and 6)
- Gaussian kernel and Cornell SMART (C = 2.7)
- reduced number of support vectors for the polynomial kernel compared with the Gaussian kernel (24.41% versus 37.78%)
- number of features between 6% (1309) and 10% (2488)
- multiclass results follow the binary classification
- clustering has a smaller number of support vectors
- clustering follows binary classification
Further work
- Feature extraction and selection:
  - association rules between words (Mutual Information)
  - the synonymy and polysemy problem
  - better implementation of SVM with linear kernel
  - using families of words (WordNet)
  - SVM with kernel degree greater than 1
- Classification and clustering:
  - using classification and clustering together
Influence of bias – Pol. kernel
Influence of bias – RBF kernel