Fast Approximation of Matrix Coherence and Statistical Leverage
Michael W. Mahoney
Stanford University
(For more info, see: cs.stanford.edu/people/mmahoney/ or Google on “Michael Mahoney”)

Statistical leverage

Def: Let A be an n x d matrix, with n >> d, i.e., a tall matrix. The statistical leverage scores of A are the diagonal elements of the projection matrix onto the span of the left singular vectors of A. The coherence of the rows of A is the largest leverage score.

Basic idea: statistical leverage measures:
- the correlation between the singular vectors of a matrix and the standard basis
- how much influence/leverage a row has on the best LS fit
- where in the high-dimensional space the (singular value) information of A is being sent, independent of what that information is
- the extent to which a data point is an outlier
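In symbols, with the thin SVD A = U Σ V^T (a standard restatement of the definition above):

$$
\ell_i \;=\; (P_A)_{ii} \;=\; \|U_{(i)}\|_2^2, \qquad i = 1,\dots,n, \qquad \operatorname{coherence}(A) \;=\; \max_{1 \le i \le n} \ell_i ,
$$

where U_{(i)} denotes the i-th row of U and P_A = U U^T is the projection onto the column space of A.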

Who cares?

Statistical data analysis and machine learning:
- historically, a measure of “outlierness” or error
- recently, the “Nystrom method” and “matrix reconstruction” typically assume that the coherence is uniform/flat

Numerical linear algebra:
- key bottleneck to getting high-quality numerical implementations of randomized matrix algorithms

Theoretical computer science:
- key structural nonuniformity to deal with in worst-case analysis
- the best random sampling algorithms use leverage scores as an importance sampling distribution, and the best random projection algorithms uniformize them

Statistical leverage and DNA SNP data

Statistical leverage and term-document data

Note: often the most “interesting” or “important” data points have the highest leverage scores --- especially when L2-based models are used for computational (as opposed to statistical) reasons.

Computing statistical leverage scores

Simple (deterministic) algorithm:
- Compute a basis Q for the left singular subspace, with QR or the SVD.
- Compute the Euclidean norms of the rows of Q.
Running time is O(nd^2) if n >> d, and O(orthonormal-basis) time otherwise.

We want faster! o(nd^2), or o(orthonormal-basis), with no assumptions on the input matrix A. Faster in terms of flops or clock time for not-obscenely-large input. OK to live with ε-error, or to fail with overwhelmingly-small probability δ.
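A minimal sketch of this deterministic algorithm in Python/NumPy (the function name and the example are illustrative, not from the slides):

```python
import numpy as np

def exact_leverage_scores(A):
    """Exact leverage scores of a tall n x d matrix A (n >> d): the squared
    Euclidean norms of the rows of any orthonormal basis Q for the column
    space of A, i.e., the diagonal of the projection matrix P_A = Q Q^T."""
    Q, _ = np.linalg.qr(A, mode="reduced")   # n x d orthonormal basis; O(nd^2) work
    return np.sum(Q**2, axis=1)              # squared row norms = diag(Q Q^T)

# The scores lie in [0, 1] and sum to rank(A) (= d for a generic tall matrix).
A = np.random.randn(1000, 10)
print(exact_leverage_scores(A).sum())        # ~10.0
```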

Randomized algorithms for matrices (1)

General structure:
- “Preprocess” the matrix: compute a nonuniform importance sampling distribution, or perform a random projection/rotation to uniformize the leverage scores.
- Draw a random sample of columns/rows, in the original or randomly rotated basis.
- “Postprocess” the sample with a traditional deterministic NLA algorithm, the solution to which provides the approximation.

Two connections to randomization today:
- statistical leverage is the key bottleneck to getting good randomized matrix algorithms (both in theory and in practice)
- the fast algorithm to approximate statistical leverage is itself a randomized algorithm

Randomized algorithms for matrices (2)

Maybe unfamiliar, but no need to be afraid (unless you are afraid of flipping a fair coin heads 100 times in a row).

Lots of work in TCS -- but that work tends to be very cavalier w.r.t. the metrics that NLA and scientific computing tend to care about.

Two recent reviews:
- “Randomized algorithms for matrices and data,” M. W. Mahoney, arXiv -- focuses on “what makes the algorithms work” and interdisciplinary “bridging the gap” issues
- “Finding structure with randomness,” Halko, Martinsson, and Tropp, arXiv -- focuses on connecting with traditional NLA and scientific computing issues

Main theorem

Theorem: Given an n x d matrix A, with n >> d, let P_A be the projection matrix onto the column space of A. Then there is a randomized algorithm that, w.p. ≥ 0.999:
- computes all of the n diagonal elements of P_A (i.e., the leverage scores) to within relative (1±ε) error;
- computes all the large off-diagonal elements of P_A to within additive error;
- runs in o(nd^2)* time.

*Running time is basically O(nd log(n)/ε), i.e., the same as the DMMS fast randomized algorithm for over-constrained least squares.

A “classic” randomized algorithm (1 of 3)

Over-constrained least squares (n x d matrix A, n >> d):
- Solve: min_x ||Ax - b||_2.
- Solution: x_opt = A^+ b.

Algorithm:
- For all i ∈ {1,...,n}, compute the importance sampling probabilities p_i (proportional to the leverage scores of A).
- Randomly sample O(d log(d)/ε) rows/elements from A/b, using {p_i} as importance sampling probabilities.
- Solve the induced subproblem.
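A minimal sketch of this sampling scheme, assuming the standard leverage-score probabilities p_i = ||U_(i)||_2^2 / d; the sample size r and all names are illustrative, not from the slide:

```python
import numpy as np

def leverage_sampled_ls(A, b, r, rng=np.random.default_rng(0)):
    """Approximate argmin_x ||Ax - b||_2 by sampling r rows of (A, b) with
    probabilities proportional to the leverage scores, then solving the
    rescaled induced subproblem."""
    n, d = A.shape
    U, _, _ = np.linalg.svd(A, full_matrices=False)  # exact scores: the O(nd^2) step
    p = np.sum(U**2, axis=1) / d                     # p_i = ||U_(i)||_2^2 / d, sums to 1
    idx = rng.choice(n, size=r, replace=True, p=p)
    w = 1.0 / np.sqrt(r * p[idx])                    # rescaling keeps the subproblem unbiased
    x_tilde, *_ = np.linalg.lstsq(w[:, None] * A[idx], w * b[idx], rcond=None)
    return x_tilde
```

Note that computing the exact probabilities via the SVD is itself the expensive step; speeding that up is the point of the rest of the talk.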

A “classic” randomized algorithm (2 of 3)

Theorem: Let x̃ be the solution of the sampled subproblem. Then, w.h.p., ||A x̃ - b||_2 ≤ (1+ε) ||A x_opt - b||_2.

This naive algorithm runs in O(nd^2) time, but it can be improved!

This algorithm is the bottleneck for low-rank matrix approximation and many other matrix problems.

A “classic” randomized algorithm (3 of 3)

Sufficient condition for relative-error approximation, on the “preprocessing” (sampling or sketching) matrix X: roughly, X must approximately preserve the geometry of the column space of A, and X^T X must not align the residual b - A x_opt with that column space.

Important: this condition decouples the randomness from the linear algebra. Random sampling algorithms with leverage-score probabilities, and random projections, satisfy it.

Two ways to speed up the running time

Random projection: uniformize the leverage scores rapidly.
- Apply a structured randomized Hadamard transform H to A and b.
- Do uniform sampling S to construct SHA and SHb.
- Solve the induced subproblem, i.e., call the Main Algorithm on the preprocessed problem.

Random sampling: approximate the leverage scores rapidly (described below).
- Rapidly approximate the statistical leverage scores and call the Main Algorithm with them.
- This was an open problem for a long while.
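A minimal sketch of the SRHT preprocessing step (the function name, the zero-padding convention, and the loop-based Walsh-Hadamard transform are my own choices; a production implementation would use a specialized fast transform):

```python
import numpy as np

def srht_rows(A, r, rng=np.random.default_rng(0)):
    """Subsampled randomized Hadamard transform applied to the rows of A:
    returns r uniformly sampled (and rescaled) rows of H D A, where D is a
    random +/-1 diagonal and H is the normalized Walsh-Hadamard transform.
    A is zero-padded to a power-of-two number of rows."""
    n, d = A.shape
    m = 1 << (n - 1).bit_length()                         # next power of two
    X = np.zeros((m, d))
    X[:n] = rng.choice([-1.0, 1.0], size=n)[:, None] * A  # D A
    h = 1
    while h < m:                                          # in-place fast Walsh-Hadamard
        for i in range(0, m, 2 * h):
            a, b = X[i:i + h].copy(), X[i + h:i + 2 * h].copy()
            X[i:i + h], X[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    X /= np.sqrt(m)                                       # H is now orthogonal
    idx = rng.choice(m, size=r, replace=False)
    return np.sqrt(m / r) * X[idx]                        # uniform sampling, rescaled
```

After this rotation the leverage scores of the transformed matrix are, w.h.p., nearly uniform, which is why uniform sampling of SHA and SHb suffices.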

Under-constrained least squares (1 of 2)

Basic setup: n x d matrix A with n << d, and an n-vector b.
- Solve: min_x ||x||_2 subject to Ax = b.
- Solution: x_opt = A^+ b = A^T (A A^T)^{-1} b (the minimum-length solution).

Algorithm:
- For all j ∈ {1,...,d}, compute the importance sampling probabilities p_j (proportional to the column leverage scores of A).
- Randomly sample O(n log(n)/ε) columns from A, using {p_j} as importance sampling probabilities.
- Return the minimum-length solution built from the sampled columns.

Under-constrained least squares (2 of 2)

Theorem: w.h.p., the returned vector x̃ satisfies ||x̃ - x_opt||_2 ≤ ε ||x_opt||_2.

Notes:
- This O(nd^2) algorithm can be sped up by doing a random projection or by using the fast leverage-score algorithm below.
- Meng, Saunders, and Mahoney (2011) treat over-constrained and under-constrained problems, rank deficiencies, etc., in a more general and uniform manner for numerical implementations.

Back to approximating leverage scores

View the computation of the leverage scores in terms of an under-constrained LS problem.

Recall (A is n x d, n >> d): ℓ_i = (P_A)_ii = ||U_(i)||_2^2, where U is an orthonormal basis for the column space of A.

But ℓ_i is also the squared norm of the min-length solution of the under-constrained system A^T x = A^T e_i: leverage scores are the norms of min-length solutions of under-constrained LS problems!
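To spell out the step being used here (a short derivation added for completeness; it follows from the definitions above):

$$
\ell_i \;=\; e_i^T P_A e_i \;=\; \|P_A e_i\|_2^2 \;=\; \big\|(A^T)^+ (A^T e_i)\big\|_2^2 \;=\; \min_{x \,:\, A^T x = A^T e_i} \|x\|_2^2 ,
$$

using that P_A = A A^+ is symmetric and idempotent, and that (A^T)^+ (A^T e_i) is exactly the minimum-length solution of the consistent under-constrained system A^T x = A^T e_i.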

The key idea

Approximate ℓ_i = ||e_i^T A A^+||_2^2 by ℓ̂_i = ||e_i^T A (Π_1 A)^+||_2^2, where Π_1 is a random projection (an SRHT) that compresses the n-dimension of the under-constrained problem.

Note: this expression is simpler than that for the full under-constrained LS solution, since we only need the norm of the solution.

Algorithm and theorem

Algorithm:
- Π_1 is an r_1 x n SRHT matrix, with r_1 = O(d log(n)/ε^2).
- Π_2 is an r_1 x r_2 random projection matrix, with r_2 = O(log(n)/ε^2).
- Compute the n x O(log(n)/ε^2) matrix X = A(Π_1 A)^+ Π_2, and return the squared Euclidean norm of each row of X.

Theorem: p_i ≈ ||X_(i)||_2^2, up to multiplicative (1±ε) error, for all i. Runs in roughly O(nd log(n)/ε) time.
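A minimal sketch of this algorithm, with Gaussian projections standing in for the SRHT Π_1 and the random projection Π_2 (the function name and sizes are illustrative, not from the slide):

```python
import numpy as np

def fast_leverage_scores(A, r1, r2, rng=np.random.default_rng(0)):
    """Approximate leverage scores as the squared row norms of
    X = A (Pi1 A)^+ Pi2. Gaussian projections stand in for the SRHT (Pi1)
    and the JL matrix (Pi2); r1 ~ d log(n)/eps^2 and r2 ~ log(n)/eps^2."""
    n, d = A.shape
    Pi1 = rng.standard_normal((r1, n)) / np.sqrt(r1)
    Pi2 = rng.standard_normal((r1, r2)) / np.sqrt(r2)
    B = np.linalg.pinv(Pi1 @ A)          # (Pi1 A)^+, a d x r1 matrix
    X = A @ (B @ Pi2)                    # n x r2; parenthesized so the n x r1
                                         # product A (Pi1 A)^+ is never formed
    return np.sum(X**2, axis=1)          # squared row norms approximate the l_i
```

The parenthesization matters: it is exactly the point made in the running-time analysis on the next slide.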

Running time analysis

- Random rotation: Π_1 A takes O(nd log(r_1)) time.
- Pseudoinverse: (Π_1 A)^+ takes O(r_1 d^2) time.
- Matrix multiplication: A(Π_1 A)^+ takes O(ndr_1) ≥ O(nd^2) time -- too much!
- Another projection: (Π_1 A)^+ Π_2 takes O(dr_1 r_2) time.
- Matrix multiplication: A[(Π_1 A)^+ Π_2] takes O(ndr_2) time.

Overall, the algorithm takes O(nd log(n)) time.

An almost equivalent approach

1. Preprocess A to Π_1 A with the SRHT.
2. Find R s.t. Π_1 A = QR.
3. Compute the norms of the rows of AR^{-1}Π_2 (a “sketch” of AR^{-1}, which is an “approximately orthogonal” matrix).

Same quality-of-approximation and running-time bounds. The previous algorithm amounts to choosing a particular rotation. Using R^{-1} as a preconditioner is how randomized algorithms for over-constrained LS were implemented numerically.
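The same idea in the QR formulation (again a sketch with a Gaussian stand-in for the SRHT; here Π_2 is taken to be d x r_2, since AR^{-1} has d columns):

```python
import numpy as np

def fast_leverage_scores_qr(A, r1, r2, rng=np.random.default_rng(0)):
    """QR variant: sketch Pi1 A, factor it as Q R, and return the squared
    row norms of (A R^{-1}) Pi2, where A R^{-1} is approximately orthogonal."""
    n, d = A.shape
    Pi1 = rng.standard_normal((r1, n)) / np.sqrt(r1)     # stand-in for the SRHT
    _, R = np.linalg.qr(Pi1 @ A)                         # Pi1 A = Q R, R is d x d
    Pi2 = rng.standard_normal((d, r2)) / np.sqrt(r2)
    X = A @ np.linalg.solve(R, Pi2)                      # (A R^{-1}) Pi2, n x r2
    return np.sum(X**2, axis=1)
```

This is the formulation that matches how R^{-1} is used as a preconditioner in the randomized over-constrained LS solvers mentioned above.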

Getting the large off-diagonal elements

Let X = A(Π_1 A)^+ Π_2 or X = AR^{-1}Π_2. It is also true that the inner products between rows of X approximate the corresponding off-diagonal elements of P_A. So, one can use hash-function approaches to approximate all of the large off-diagonal elements to within small additive error.

Extensions to “fat” matrices (1 of 2)

Question: Can we approximate leverage scores relative to the best rank-k approximation to A (given an arbitrary n x d matrix A and a rank parameter k)?

Ill-posed: Consider A = I_n and k < n. (The subspace is not even unique, so the leverage scores are not even well-defined.)

Unstable: There are simple ε-perturbed examples where the subspace is unique but unstable w.r.t. ε -> 0.

Extensions to “fat” matrices (2 of 2)

Define S as a set of matrices near the best rank-k approximation of A. (The results differ depending on whether the norm is the spectral or the Frobenius norm.)

Algorithm:
- Construct a compact sketch of A with a random projection (several recent variants).
- Use the left singular vectors of the sketch to compute leverage scores for some matrix in S.

Extensions to streaming environments

Data “streams” by, and one can only keep a “small” sketch. But one can still compute:
- the rows with large leverage scores, and their scores
- the entropy and the number of nonzero leverage scores
- a sample of rows according to the leverage-score probabilities

Use hash functions, linear sketching matrices, etc. from the streaming literature.

Conclusions

Statistical leverage scores...
- measure the correlation of the dominant subspace with the canonical basis
- have a natural statistical interpretation in terms of outliers
- define “bad” examples for the Nystrom method and matrix completion
- define the key non-uniformity structure for improved worst-case randomized matrix algorithms, e.g., relative-error CUR algorithms, and for making TCS randomized matrix algorithms useful for NLA and scientific computing
- take O(nd^2), i.e., O(orthonormal-basis), time to compute

Conclusions

... and they can be computed to (1±ε) accuracy in o(nd^2) time:
- the algorithm relates the leverage scores to an under-determined LS problem
- its running time is comparable to DMMS-style fast approximation algorithms for the over-determined LS problem
- it also provides the large off-diagonal elements in the same time
- it gives a better numerical understanding of fast randomized matrix algorithms