“Study on Parallel SVM Based on MapReduce” Kuei-Ti Lu 03/12/2015.

1 “Study on Parallel SVM Based on MapReduce” Kuei-Ti Lu 03/12/2015

2 Support Vector Machine (SVM)
Used for:
– Classification
– Regression
Applied in:
– Network intrusion detection
– Image processing
– Text classification
– …

3 libSVM
A library for support vector machines
Integrates different types of SVMs

4 Types of SVMs Supported by libSVM
For support vector classification:
– C-SVC
– Nu-SVC
For support vector regression:
– Epsilon-SVR
– Nu-SVR
For distribution estimation:
– One-class SVM
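
A minimal usage sketch with libSVM's Python bindings (svmutil). The -s option selects among the SVM types listed above; the heart_scale sample file ships with the libSVM distribution, and the parameter values here are illustrative:

```python
# Assumes the official libsvm package; older source installs import svmutil directly.
from libsvm.svmutil import svm_read_problem, svm_train, svm_predict

# -s: SVM type (0 = C-SVC, 1 = nu-SVC, 2 = one-class, 3 = epsilon-SVR, 4 = nu-SVR)
# -t: kernel (0 = linear, 1 = polynomial, 2 = RBF, 3 = sigmoid)
y, x = svm_read_problem('heart_scale')         # data file in libSVM format
model = svm_train(y, x, '-s 0 -t 2 -c 1')      # C-SVC, RBF kernel, C = 1
labels, acc, vals = svm_predict(y, x, model)   # predict on the training set
```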

5 C-SVC
Goal: find the separating hyperplane that maximizes the margin
Support vectors: the data points closest to the separating hyperplane

6 C-SVC
Primal form:
\min_{w, b, \xi} \; \tfrac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i
subject to y_i (w^T \phi(x_i) + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \; i = 1, \dots, l
Dual form (derived using Lagrange multipliers):
\min_{\alpha} \; \tfrac{1}{2} \alpha^T Q \alpha - e^T \alpha
subject to y^T \alpha = 0, \quad 0 \le \alpha_i \le C, \; i = 1, \dots, l
where Q_{ij} = y_i y_j K(x_i, x_j) and K(x_i, x_j) = \phi(x_i)^T \phi(x_j)

7 Speedup
Computation and storage requirements grow rapidly with the number of training vectors (also called training samples or training points)
Efficient algorithms and implementations are needed to apply SVMs to large-scale data mining => parallel SVM

8 Parallel SVM Methods
Message Passing Interface (MPI):
– Efficient for computation-intensive problems, e.g., simulation
MapReduce:
– Can be used for data-intensive problems
…

9 Other Speedup Techniques
Chunking: iteratively optimize subsets of the training data until the global optimum is reached
– Ex.: Sequential Minimal Optimization (SMO), which uses a chunk size of 2 vectors
Eliminate non-support vectors early
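
A simplified chunking sketch (an illustrative assumption, not the paper's code): retrain on the current support vectors plus the worst margin violators until the working set stabilizes. scikit-learn's SVC stands in for libSVM, and labels are assumed to be in {-1, +1}:

```python
import numpy as np
from sklearn.svm import SVC

def chunked_train(X, y, chunk=1000, max_iter=20):
    work = np.arange(min(chunk, len(X)))           # initial working chunk
    clf = None
    for _ in range(max_iter):
        clf = SVC(kernel="rbf", C=1.0).fit(X[work], y[work])
        sv = work[clf.support_]                    # global indices of SVs
        margins = y * clf.decision_function(X)     # y_i * f(x_i) over all data
        n_add = max(chunk - len(sv), 0)
        violators = np.argsort(margins)[:n_add]    # worst (smallest) margins
        new_work = np.unique(np.concatenate([sv, violators]))
        if np.array_equal(new_work, work):
            break                                  # working set stable: done
        work = new_work
    return clf
```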

10 This Paper’s Approach
1. Partition & distribute the data to the nodes
2. Map class: train each sub-SVM to find the support vectors for its subset of the data
3. Reduce class: combine the support vectors of every two sub-SVMs
4. If more than 1 SVM remains, go to 2
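
A sketch of steps 1-4 (scikit-learn stands in for libSVM/Twister; the striped partitioning and the RBF parameters are illustrative assumptions, not the paper's settings):

```python
import numpy as np
from sklearn.svm import SVC

def support_vectors(X, y):
    """Map step: train a sub-SVM and keep only its support vectors."""
    clf = SVC(kernel="rbf", C=1.0).fit(X, y)
    return X[clf.support_], y[clf.support_]

def cascade_svm(X, y, n_nodes=8):
    # Step 1: partition and "distribute" the data.
    parts = [(X[i::n_nodes], y[i::n_nodes]) for i in range(n_nodes)]
    # Steps 2-4: map (train sub-SVMs), reduce (merge support-vector
    # sets pairwise), and iterate until a single set remains.
    while len(parts) > 1:
        parts = [support_vectors(Xp, yp) for Xp, yp in parts]         # map
        merged = [tuple(np.concatenate(z) for z in zip(parts[i], parts[i + 1]))
                  for i in range(0, len(parts) - 1, 2)]               # reduce
        if len(parts) % 2:
            merged.append(parts[-1])                # odd partition carries over
        parts = merged
    return SVC(kernel="rbf", C=1.0).fit(*parts[0])  # final SVM on merged SVs
```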

11 Twister
Supports iterative MapReduce
More efficient than Hadoop or Dryad/DryadLINQ for iterative MapReduce

12 Computation Complexity

13 Evaluations
– Number of nodes
– Training time
– Accuracy = (# correctly predicted test samples / # total test samples) × 100%
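
The slide's accuracy metric as a one-liner (equal-length label arrays assumed):

```python
import numpy as np

def accuracy(y_true, y_pred):
    # Percentage of correctly predicted test samples.
    return 100.0 * np.mean(np.asarray(y_true) == np.asarray(y_pred))
```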

14 Adult Data Analysis
Binary classification
The correlation between each attribute variable X and the class variable Y is used to select attributes
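
A hedged sketch of correlation-based attribute selection; Pearson correlation and the threshold value are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def select_attributes(X, y, threshold=0.1):
    # |corr(X_j, Y)| for each attribute column j.
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    return np.where(corr >= threshold)[0]    # indices of retained attributes
```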

15 Adult Data Analysis
Computation cost is concentrated in training; data transfer time is minor
The last-layer computation time depends on α and β rather than on the number of nodes (it runs on 1 node only)
Feature selection greatly reduces computation but barely reduces accuracy

16 Forest Cover Type Classification
Multiclass classification:
– Use k(k - 1)/2 binary SVMs as a k-class SVM, 1 binary SVM for each pair of classes
– Use maximum voting to determine the class
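
The one-vs-one scheme spelled out (scikit-learn's SVC already does this internally; the explicit version below is illustrative and assumes non-negative integer class labels):

```python
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_one_vs_one(X, y):
    models = {}
    for a, b in combinations(np.unique(y), 2):    # k(k-1)/2 class pairs
        mask = (y == a) | (y == b)
        models[(a, b)] = SVC(kernel="rbf").fit(X[mask], y[mask])
    return models

def predict_one_vs_one(models, X):
    votes = np.stack([m.predict(X) for m in models.values()])
    # Maximum voting: the most frequently predicted class wins.
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```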

17 Forest Cover Type Classification
The correlation between each attribute variable X and the class variable Y is used to select attributes
Attribute variables are normalized to [0, 1]
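
Min-max scaling of each attribute to [0, 1], as stated on the slide (constant attributes are guarded against division by zero):

```python
import numpy as np

def normalize01(X):
    mn, mx = X.min(axis=0), X.max(axis=0)
    span = np.where(mx > mn, mx - mn, 1.0)   # avoid 0/0 on constant columns
    return (X - mn) / span
```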

18 Forest Cover Type Classification
The last-layer computation time depends on α and β rather than on the number of nodes (it runs on 1 node only)
Feature selection greatly reduces computation but barely reduces accuracy

19 Heart Disease Classification
Binary classification
The data are replicated different numbers of times to compare results for different sample sizes

20 Heart Disease Classification
When the sample size is too big, it cannot be processed on 1 node because of the memory constraint
Training time decreases little when the number of nodes > 8

21 Conclusion
The classical SVM is impractical for large-scale data; a parallel SVM is needed
This paper proposes a model based on iterative MapReduce
The model is shown to be efficient for data-intensive problems

22 References
[1] Z. Sun and G. Fox, “Study on Parallel SVM Based on MapReduce,” in Proc. PDPTA, Las Vegas, NV, 2012.
[2] C. Lin et al., “Anomaly Detection Using LibSVM Training Tools,” in Proc. ISA, Busan, Korea, 2008, pp. 166-171.

23 Q & A

