A Story of Principal Component Analysis in the Distributed Model. David Woodruff, IBM Almaden. Based on works with Christos Boutsidis, Ken Clarkson, Ravi Kannan, Santosh Vempala, and Peilin Zhong.

Talk Outline
1. What is low rank approximation?
2. How do we solve it offline?
3. How do we solve it in a distributed setting?

Low rank approximation
- A is an n x d matrix
- Think of it as n points in R^d
- E.g., A is a customer-product matrix: A_{i,j} = how many times customer i purchased item j
- A is typically well-approximated by a low rank matrix; e.g., A has high rank only because of noise
- Goal: find a low rank matrix approximating A
- Such a matrix is easy to store, and the data becomes more interpretable

What is a good low rank approximation?
Singular Value Decomposition (SVD): any matrix factors as A = U·Σ·V, where
- U has orthonormal columns
- Σ is diagonal with non-increasing positive entries down the diagonal
- V has orthonormal rows
Rank-k approximation: A_k = U_k·Σ_k·V_k, keeping the top k columns of U, entries of Σ, and rows of V.
A_k = argmin over rank-k matrices B of |A - B|_F, where |C|_F = (Σ_{i,j} C_{i,j}^2)^{1/2}.
Computing A_k exactly is expensive.
The rows of V_k are the top k principal components.
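To make the baseline concrete, here is a minimal numpy sketch (my own illustration; the function name rank_k_approx is not from the talk) of the exact but expensive route: compute a full SVD and truncate to the top k terms.

```python
import numpy as np

def rank_k_approx(A, k):
    """Best rank-k approximation A_k in Frobenius norm via the full SVD.
    This is the exact baseline that the sketching algorithms speed up."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # A_k = U_k * Sigma_k * V_k: keep the top k singular triples
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

A = np.random.randn(1000, 50)
A_k = rank_k_approx(A, k=5)
print(np.linalg.norm(A - A_k, 'fro'))  # the optimal rank-5 error
```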

Low rank approximation
- Goal: output a rank-k matrix A', so that |A - A'|_F ≤ (1+ε) |A - A_k|_F
- Can do this in nnz(A) + (n+d)·poly(k/ε) time [S, CW]
- nnz(A) is the number of non-zero entries of A

Solution to low-rank approximation [S]
- Given an n x d input matrix A
- Compute S·A using a random matrix S with k/ε << n rows; S·A takes random linear combinations of the rows of A
- Project the rows of A onto the row space of SA, then find the best rank-k approximation to the projected points inside of SA
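A minimal sketch of this two-step recipe, assuming a Gaussian S and an exact projection (faster choices of S and an approximate projection come on the next slides); all names here are my own:

```python
import numpy as np

def sketch_and_solve(A, k, eps=0.5, rng=np.random.default_rng(0)):
    """Rank-k approximation via a row sketch SA.
    1. SA: about k/eps random linear combinations of the rows of A.
    2. Project the rows of A onto the row space of SA.
    3. Best rank-k approximation of the projected points, inside SA."""
    n, d = A.shape
    m = int(np.ceil(k / eps))                 # k/eps << n sketch rows
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    SA = S @ A
    Q, _ = np.linalg.qr(SA.T)                 # orthonormal basis of rowspace(SA)
    P = A @ Q                                 # coordinates of the projected rows
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    P_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    return P_k @ Q.T                          # map back to an n x d matrix
```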

What is the matrix S?
- S can be a k/ε x n matrix of i.i.d. normal random variables [S]
- S can be a k/ε x n Fast Johnson-Lindenstrauss matrix, which uses the Fast Fourier Transform [S]
- S can be a poly(k/ε) x n CountSketch matrix, for which S·A can be computed in nnz(A) time [CW]
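The CountSketch option is what gives the nnz(A) running time: each column of S has a single ±1 entry in a random row. A hedged sketch of applying it (the helper name is my own):

```python
import numpy as np

def countsketch_apply(A, m, rng=np.random.default_rng(1)):
    """Compute S @ A for an m x n CountSketch matrix S in one pass:
    row i of A is hashed to bucket h[i] and multiplied by a sign g[i]."""
    n, d = A.shape
    h = rng.integers(0, m, size=n)        # random bucket per row of A
    g = rng.choice([-1.0, 1.0], size=n)   # random sign per row of A
    SA = np.zeros((m, d))
    for i in range(n):                    # each row of A is touched exactly once
        SA[h[i]] += g[i] * A[i]
    return SA
```

With a sparse representation of A, each update costs time proportional to the non-zeros of that row, so applying S takes O(nnz(A)) time in total.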

Caveat: projecting the points onto SA is slow
Current algorithm:
1. Compute S·A
2. Project each of the rows of A onto the row space of S·A
3. Find the best rank-k approximation of the projected points inside the row space of S·A
The bottleneck is step 2.
[CW] Approximate the projection instead: step 2 is the regression problem min over rank-k X of |X(SA) - A|_F^2. Sketching it on the right with a random matrix R gives a fast approximate regression, min over rank-k X of |X(SA)R - AR|_F^2, for a total of nnz(A) + (n+d)·poly(k/ε) time.
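Here is a minimal sketch of this fix under my own, purely illustrative parameter choices: sketch the regression problem on the right with a second random matrix R, solve the small rank-constrained problem exactly, and map back.

```python
import numpy as np

def fast_low_rank(A, k, eps=0.5, rng=np.random.default_rng(2)):
    """Approximate projection via min over rank-k X of |X(SA)R - AR|_F^2."""
    n, d = A.shape
    m = int(np.ceil(k / eps))               # rows of S
    t = 4 * m                               # columns of R (illustrative choice)
    S = rng.standard_normal((m, n)) / np.sqrt(m)
    R = rng.standard_normal((d, t)) / np.sqrt(t)
    B = S @ A                               # m x d sketched row space
    C = B @ R                               # m x t sketched regression matrix
    D = A @ R                               # n x t sketched targets
    # Rank-constrained regression: the minimizer satisfies
    # X C = [D pi_C]_k, where pi_C projects onto the row space of C.
    Q, _ = np.linalg.qr(C.T)                # t x m orthonormal basis of rowspace(C)
    U, s, Vt = np.linalg.svd(D @ Q, full_matrices=False)
    DPk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    X = DPk @ Q.T @ np.linalg.pinv(C)       # n x m solution of rank <= k
    return X @ B                            # rank-k approximation of A
```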

Distributed low rank approximation
- We have fast algorithms, but can they be made to work in a distributed setting?
- Matrix A is distributed among s servers
- For t = 1, ..., s, we get a customer-product matrix from the t-th shop, stored on server t. Server t's matrix is A^t
- Customer-product matrix A = A^1 + A^2 + ... + A^s
- This model is called the arbitrary partition model
- It is more general than the row-partition model, in which each customer shops in only one shop

The Communication Model
Servers 1, 2, ..., s each talk only to a Coordinator, via 2-way communication. This can simulate arbitrary point-to-point communication up to a factor of 2 (and an additive O(log s) factor per message).

Communication cost of low rank approximation

Work on Distributed Low Rank Approximation
- [FSS]: first protocol, for the row-partition model. O(sdk/ε) real numbers of communication; the bit complexity is not analyzed (and can be large); SVD running time (see also [BKLW])
- [KVW]: O(skd/ε) communication in the arbitrary partition model
- [BWZ]: O(skd) + poly(sk/ε) words of communication in the arbitrary partition model, with input sparsity running time; matching Ω(skd) words of communication lower bound
- Variants: kernel low rank approximation [BLSWX], low rank approximation of an implicit matrix [WZ], sparsity [BWZ]

Outline of Remainder of the Talk
1. The [FSS] protocol
2. The [KVW] protocol
3. The [BWZ] protocol

[FSS] Row-Partition Protocol
This is an SVD-based protocol, with servers communicating to the Coordinator as in the model above. Problems:
1. sdk/ε real numbers of communication
2. bit complexity can be large
3. running time for SVDs [BKLW]
4. doesn't work in the arbitrary partition model
Maybe our random matrix techniques can improve communication just like they improved computation? The [KVW] protocol will handle problems 2, 3, and 4.

[KVW] Arbitrary Partition Model Protocol
- Inspired by the sketching algorithm presented earlier
- Let S be one of the k/ε x n random matrices discussed; S can be generated pseudorandomly from a small seed
- The Coordinator sends the small seed for S to all servers
- Server t computes SA^t and sends it to the Coordinator
- The Coordinator sends Σ_{t=1..s} SA^t = SA to all servers
- There is a good k-dimensional subspace inside of SA; if we knew it, the t-th server could output the projection of A^t onto it
Problems:
- Can't output the projection of A^t onto SA, since the rank is too large
- Could communicate this projection to the Coordinator, who could find a k-dimensional space, but then the communication depends on n
Fix:
- Instead of projecting A onto SA, project a small number of random linear combinations of the rows of A onto SA
- Find the best k-dimensional space for these random linear combinations inside of SA
- The communication of this part is independent of n and d
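A toy single-process simulation of the sketch-aggregation step (my own illustration; in the actual protocol only a short seed for S is shipped, never S itself). The key point is linearity: Σ_t S·A^t = S·A.

```python
import numpy as np

def aggregate_sketches(parts, k, eps=0.5, rng=np.random.default_rng(3)):
    """parts: the servers' matrices A^1, ..., A^s with A = sum_t A^t.
    Each server sketches locally; the Coordinator sums the sketches,
    obtaining SA without ever seeing A."""
    n = parts[0].shape[0]
    m = int(np.ceil(k / eps))
    S = rng.standard_normal((m, n)) / np.sqrt(m)  # stands in for a shared seed
    local = [S @ At for At in parts]              # computed on each server
    return sum(local)                             # Coordinator: sum_t SA^t = SA

# Arbitrary partition: each server holds an additive share of A.
rng = np.random.default_rng(4)
A = rng.standard_normal((500, 40))
shares = [rng.standard_normal((500, 40)) for _ in range(2)]
shares.append(A - shares[0] - shares[1])          # the shares sum exactly to A
SA = aggregate_sketches(shares, k=5)
```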

[KVW] protocol
Phase 1: learn the row space of SA. The row space of SA contains an optimal k-dimensional space, with cost ≤ (1+ε)|A - A_k|_F.

[KVW] protocol
Phase 2: communicate random linear combinations of the points inside SA, and find an approximately optimal space W inside of SA, with cost ≤ (1+ε)^2 |A - A_k|_F.

[BWZ] Protocol
Intuitively, U looks like the top k left singular vectors of SA. The top k right singular vectors of SA work because S is a projection-cost preserving sketch!
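A small numerical check of this intuition (my own sketch, with a Gaussian S standing in for whatever sketch [BWZ] uses): project A onto the top k right singular vectors of SA and compare against the optimal cost |A - A_k|_F.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k, m = 2000, 60, 5, 40
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d)) \
    + 0.01 * rng.standard_normal((n, d))          # low rank signal plus noise

S = rng.standard_normal((m, n)) / np.sqrt(m)
_, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
Vk = Vt[:k]                                       # top k right singular vectors of SA

cost = np.linalg.norm(A - A @ Vk.T @ Vk, 'fro')   # project A onto span(Vk)
U, s, Vt2 = np.linalg.svd(A, full_matrices=False)
opt = np.linalg.norm(A - U[:, :k] @ np.diag(s[:k]) @ Vt2[:k], 'fro')
print(cost / opt)                                 # close to 1 if cost is preserved
```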

[BWZ] Analysis

Conclusions
- [BWZ]: an optimal O(sdk) + poly(sk/ε) communication protocol for low rank approximation in the arbitrary partition model
  - Handles bit complexity by adding Tao/Vu noise
  - Input sparsity running time
  - 2 rounds, which is optimal [W]
  - Optimal data stream algorithms, improving [CW, L, GP]
- Communication complexity of other optimization problems?
  - Computing the rank of an n x n matrix over the reals
  - Linear programming
  - Graph problems: matching, etc.