1 Sparse Solutions for Large Scale Kernel Machines Taher Dameh CMPT820-Multimedia Systems tdameh@cs.sfu.ca Dec 2nd, 2010

2 Outline
- Introduction
- Motivation: kernel machine applications in multimedia content analysis and search
- Challenges in large scale kernel machines
- Previous work
- Sub-quadratic approach to compute the sparse Gram matrix
- Results
- Conclusion and future work

3 Introduction
Given a set of points, with a notion of distance between points, group the points into some number of clusters. We use kernel functions to compute the similarity between each pair of points, producing a similarity (Gram) matrix, which takes O(N^2) space and computation.
Examples of kernel machines:
- Support Vector Machines (SVM), formulated for 2 classes
- Relevance Vector Machines (RVM), which yield much sparser models
- Gaussian Processes
- Fisher's Linear Discriminant Analysis (LDA)
- Kernel PCA
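To make the quadratic cost concrete, here is a minimal NumPy sketch (not from the slides) of a dense Gaussian (RBF) Gram matrix; the bandwidth gamma and the random data are illustrative placeholders:

```python
# Minimal sketch: a dense RBF Gram matrix needs O(N^2) kernel evaluations
# and O(N^2) storage, which is the bottleneck the talk addresses.
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Dense Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

X = np.random.randn(1000, 16)   # N = 1000 points in d = 16 dimensions
K = rbf_gram(X)                 # 1000 x 1000 matrix: N^2 entries
```

For N = 10^6 points this matrix would already hold 10^12 entries, which is why the dense approach does not scale.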

4 Kernel machine applications in multimedia content analysis and search
- Broadcast video summarization using clustering
- Document clustering
- Audio content discovery
- Searching one billion web images by content

5 Challenges and Sparse Solutions for Kernel Machines
One of the significant limitations of many kernel methods is that the kernel function k(x, y) must be evaluated for all possible pairs x and y of training points, which can be computationally infeasible. Traditional algorithm analysis also assumes that the data fits in main memory; that assumption is unreasonable when dealing with massive data sets such as multimedia collections, web page repositories, and so on.
Observing that these kernel machines use radial basis functions, the Gram matrices have many values that are close to zero. We are developing algorithms that approximate the Gram matrix with a sparse one by filtering out the small similarities (a sketch of the filtering idea follows below).
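A hedged sketch of the filtering idea (the threshold eps is an illustrative choice, and rbf_gram() is the helper from the previous sketch); note that this naive version still evaluates every pair before dropping the small ones, which is exactly the cost the LSH approach on the following slides avoids:

```python
# Minimal sketch: RBF similarities decay rapidly with distance, so entries
# below a small threshold eps are dropped and the remainder is stored sparsely.
import numpy as np
from scipy.sparse import csr_matrix

def sparsify_gram(K, eps=1e-3):
    """Zero out similarities below eps and store the result in CSR format."""
    return csr_matrix(np.where(K >= eps, K, 0.0))

X = np.random.randn(1000, 16)
K_sparse = sparsify_gram(rbf_gram(X), eps=1e-3)
print(K_sparse.nnz / float(K_sparse.shape[0] ** 2))  # fraction of entries kept
# Note: this naive version still computes all N^2 entries before filtering.
```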

6 Previous Work
- Approximation based on the eigenspectrum of the Gram matrix: the eigenspectrum decays rapidly, especially when the kernel function is radial basis, so most of the information is stored in the first few eigenvectors
- Sparse Bayesian learning: methods that lead to much sparser models, such as Relevance Vector Machines (RVM) and sparse kernel principal component analysis (sparse KPCA)
- Efficient implementation of the kernel function computation: space filling curves, and Locality Sensitive Hashing (our method)

7 Locality Sensitive Hashing
Hash the data points so that the probability of collision is higher for close points. A family H = {h : S → U} is called (r1, r2, p1, p2)-sensitive if, for any v, q ∈ S:
- dist(v, q) < r1 → Prob_H[h(v) = h(q)] ≥ p1
- dist(v, q) > r2 → Prob_H[h(v) = h(q)] ≤ p2
- with p1 > p2 and r1 < r2; we need the gap between p1 and p2 to be quite large.
For a proper choice of k (shown later), we concatenate k hash functions: g(v) = (h1(v), …, hk(v)). We then compute the kernel function only between points that reside in the same bucket.
- Using this approach, for a hash table of size m (assuming the buckets hold the same number of points), computing the Gram matrix has complexity N^2/m (a sketch follows below).
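The following is a minimal sketch of the bucketing scheme, not the authors' exact implementation: it uses p-stable (Euclidean) LSH, h(v) = floor((a·v + b)/w) with Gaussian a, concatenates k hashes into g(v), and evaluates the RBF kernel only for pairs that share a bucket. The parameters k, w, and gamma are illustrative choices.

```python
# Minimal sketch of LSH bucketing for a sparse Gram matrix.
import numpy as np
from collections import defaultdict

def lsh_buckets(X, k=8, w=4.0, seed=0):
    """Group point indices by their concatenated hash key g(v) = (h1(v), ..., hk(v))."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X.shape[1], k))   # k random projection directions
    b = rng.uniform(0.0, w, size=k)            # k random offsets
    keys = np.floor((X @ A + b) / w).astype(int)
    buckets = defaultdict(list)
    for i, key in enumerate(map(tuple, keys)):
        buckets[key].append(i)                 # close points tend to collide
    return buckets

def sparse_gram_entries(X, buckets, gamma=1.0):
    """Kernel values only for pairs of points that share a bucket."""
    entries = {}
    for idx in buckets.values():
        for pos, i in enumerate(idx):
            for j in idx[pos + 1:]:
                d2 = float(np.sum((X[i] - X[j]) ** 2))
                entries[(i, j)] = np.exp(-gamma * d2)
    return entries

X = np.random.randn(1000, 16)
entries = sparse_gram_entries(X, lsh_buckets(X, k=8))
```

With m non-empty buckets of roughly equal size, the inner double loop performs about N^2/m kernel evaluations, matching the complexity stated above.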

8 Sub-quadratic approach using LSH
Claim 1: The number of concatenated hash values k is logarithmic in the size of the dataset n and independent of the dimension d.
Proof sketch: Given a set P of n points in d-dimensional space, (r1, r2, p1, p2)-sensitive hash functions, and a query point q, the probability that a far point collides with q under all k hash functions is at most p2^k. Setting p2^k = B/n, where B is the average bucket size, we can then solve for k (derivation below).
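Reconstructing the missing step as a short derivation (the bound p2^k and the target B/n come from the slide; the rest is the standard LSH argument):

```latex
% Far points (dist > r_2) collide in a single hash with probability at most p_2,
% so after concatenating k independent hash functions
\Pr\bigl[g(v) = g(q)\bigr] \;\le\; p_2^{\,k}, \qquad \mathrm{dist}(v, q) > r_2 .
% Requiring this collision probability to match the average bucket occupancy B/n:
p_2^{\,k} = \frac{B}{n}
\;\;\Longrightarrow\;\;
k = \log_{1/p_2}\!\frac{n}{B} \;=\; O(\log n),
% which grows only logarithmically in n and does not involve the dimension d.
```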

9 Claim 2: The complexity of computing the approximated Gram matrix using locality sensitive hashing is sub-quadratic.
Proof sketch: with roughly balanced buckets (slide 7), each bucket's Gram matrix costs (N/m)^2 kernel evaluations and there are m buckets, plus an O(N log N) hashing cost (derivation below).
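The proof on the slide is not reproduced in the transcript; a sketch under the balanced-bucket assumption of slide 7 gives the claimed bound:

```latex
% Assume m buckets of roughly equal size N/m (slide 7). Each bucket's Gram
% matrix costs (N/m)^2 kernel evaluations, and hashing all points costs O(Nk):
\underbrace{m\left(\frac{N}{m}\right)^{2}}_{\text{per-bucket Gram matrices}}
\;+\;
\underbrace{O(Nk)}_{k = O(\log N)}
\;=\;
\frac{N^{2}}{m} + O(N\log N),
% which is sub-quadratic in N whenever the number of buckets m grows with N
% (e.g. m = N/B for a fixed average bucket size B gives O(NB + N \log N)).
```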

10 FN ratio vs. memory reduction for different values of k

11 Affinity Propagation results for different values of k

12 Second stage of AP over the first stage weighted exemplars

13 Processing pipeline:
- Hashing: N*d input vectors → m segments, each of size (N/m)*d → L bucket files
- Clustering: compute the Gram matrix of each bucket (of size (N/L)^2) and run the clustering algorithm on each bucket's Gram matrix → clusters with weights
- Combine the clusters with weights and run the second phase of clustering → final clusters
A sketch of this two-stage pipeline follows below.
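A high-level sketch of the two-stage pipeline, assuming the lsh_buckets() and rbf_gram() helpers sketched earlier and using scikit-learn's Affinity Propagation with a precomputed similarity matrix; the slide does not specify how the exemplar weights enter the second stage, so passing them as AP preferences is an assumption here:

```python
# Sketch of the two-stage pipeline (slide 13); lsh_buckets() and rbf_gram()
# are the helper functions sketched earlier in this document.
import numpy as np
from sklearn.cluster import AffinityPropagation

def two_stage_cluster(X, k=8, gamma=1.0):
    exemplars, weights = [], []
    # Stage 1: hash, then cluster each bucket's Gram matrix independently
    # (each bucket is the base unit, so this stage can be distributed).
    for idx in lsh_buckets(X, k=k).values():
        if len(idx) < 2:                        # singleton buckets pass through
            exemplars.extend(idx)
            weights.extend([1] * len(idx))
            continue
        S = rbf_gram(X[idx], gamma=gamma)       # (N/L) x (N/L) Gram matrix
        ap = AffinityPropagation(affinity="precomputed", random_state=0).fit(S)
        for c, centre in enumerate(ap.cluster_centers_indices_):
            exemplars.append(idx[centre])
            weights.append(int(np.sum(ap.labels_ == c)))  # cluster size = weight
    # Stage 2: cluster the weighted exemplars to obtain the final clusters.
    # Assumption: weights are rescaled into per-exemplar AP preferences.
    E, w = np.asarray(exemplars), np.asarray(weights, dtype=float)
    S2 = rbf_gram(X[E], gamma=gamma)
    pref = np.median(S2) * w / w.max()
    final = AffinityPropagation(affinity="precomputed", preference=pref,
                                random_state=0).fit(S2)
    return E, final.labels_, weights
```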

14 Conclusion and future work
Brute force kernel methods require O(N^2) space and computation, and the assumption that the data fits in main memory no longer holds at that scale. Approximating the full Gram matrix with a sparse one, exploiting the radial basis property of such methods, reduces this quadratic cost to sub-quadratic.
Using locality sensitive hashing we can find the close points and compute the kernel function only between them, and we can also distribute the processing, since the bucket becomes the base unit of work.
Future work: controlling the error as k increases, so that we can run on very large scale data and at the same time maintain sufficient accuracy.

