The Elements of Linear Algebra Alexander G. Ororbia II The Pennsylvania State University IST 597: Foundations of Deep Learning
About this chapter Not a comprehensive survey of all of linear algebra. Focused on the subset most relevant to deep learning. For a larger subset see, e.g., Linear Algebra by Georgi E. Shilov.
Scalars A scalar is a single number Integers, real numbers, rational numbers, etc. Denoted with italic font:
Vectors A vector is a 1-D array of numbers: Can be real, binary, integer, etc. Example notation for type and size:
Matrices A matrix is a 2-D array of numbers: Example notation for type and shape:
Tensors A tensor is an array of numbers that may have: zero dimensions (a scalar), one dimension (a vector), two dimensions (a matrix), or more dimensions.
Tensor = Multidimensional Array https://www.slideshare.net/BertonEarnshaw/a-brief-survey-of-tensors
Matrix Transpose The transpose mirrors a matrix across its main diagonal: (Aᵀ)ᵢ,ⱼ = Aⱼ,ᵢ
Matrix (Dot) Product Multiplying an (m × n) matrix by an (n × p) matrix yields an (m × p) matrix; the inner dimensions (n) must match.
Matrix Addition/Subtraction Assume column-major matrices (for efficiency). The add/subtract operators follow the basic properties of ordinary addition/subtraction. Matrix A + Matrix B is computed element-wise:
[0.5, −0.7; −0.69, 1.8] + [0.5, −0.7; −0.69, 1.8] = [1.0, −1.4; −1.38, 3.6]
(.5 + .5 = 1.0, −.7 + −.7 = −1.4, −.69 + −.69 = −1.38, 1.8 + 1.8 = 3.6)
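As a minimal NumPy sketch of the element-wise addition above, using the slide's example values:

```python
import numpy as np

# Element-wise matrix addition (values from the slide example).
A = np.array([[0.5, -0.7],
              [-0.69, 1.8]])
B = np.array([[0.5, -0.7],
              [-0.69, 1.8]])
S = A + B  # each S[i, j] = A[i, j] + B[i, j]
```

`S` comes out to [[1.0, −1.4], [−1.38, 3.6]], matching the worked example.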
Matrix-Matrix Multiply In matrix-matrix multiplication, each output entry is a vector-vector (dot) product of a row of the first matrix with a column of the second. The usual workhorse of learning algorithms: it vectorizes sums of products.
[0.5, −0.7; −0.69, 1.8] * [0.5, −0.7; −0.69, 1.8] =
[(.5 × .5) + (−.7 × −.69), (.5 × −.7) + (−.7 × 1.8); (−.69 × .5) + (1.8 × −.69), (−.69 × −.7) + (1.8 × 1.8)]
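The same product can be sketched in NumPy, where `@` is the matrix product and each entry of the result is a row-column dot product:

```python
import numpy as np

A = np.array([[0.5, -0.7],
              [-0.69, 1.8]])
B = np.array([[0.5, -0.7],
              [-0.69, 1.8]])
C = A @ B  # C[i, j] = dot product of row i of A with column j of B
# e.g. C[0, 0] = (0.5 * 0.5) + (-0.7 * -0.69) = 0.733
```

Note that `@` (matrix product) is different from `*`, which NumPy reserves for element-wise multiplication.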
Hadamard Product Multiply each A(i, j) by the corresponding B(i, j): element-wise multiplication, written A ⊙ B.
[0.5, −0.7; −0.69, 1.8] ⊙ [0.5, −0.7; −0.69, 1.8] = [.25, .49; .4761, 3.24]
(.5 × .5 = .25, −.7 × −.7 = .49, −.69 × −.69 = .4761, 1.8 × 1.8 = 3.24)
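In NumPy, a minimal sketch of the Hadamard product: the `*` operator on arrays is already element-wise.

```python
import numpy as np

A = np.array([[0.5, -0.7],
              [-0.69, 1.8]])
B = np.array([[0.5, -0.7],
              [-0.69, 1.8]])
H = A * B  # element-wise (Hadamard) product: H[i, j] = A[i, j] * B[i, j]
```

`H` comes out to [[0.25, 0.49], [0.4761, 3.24]], matching the worked example.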
Elementwise Functions Applied to each element (i, j) of the matrix argument. Could be cos(·), sin(·), tanh(·), etc.
Identity: φ(v) = v
Logistic Sigmoid: φ(v) = σ(v) = 1 / (1 + e^(−v))
Linear Rectifier: φ(v) = max(0, v)
Softmax: φ(v)_c = e^(v_c) / Σ_{c'=1}^{C} e^(v_{c'})
Example with φ = max(0, ·) applied to [1.0, −1.4; −1.38, 1.8]: φ(1.0) = 1.0, φ(−1.4) = 0, φ(−1.38) = 0, φ(1.8) = 1.8.
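A minimal sketch of these elementwise functions in NumPy (the max-shift inside softmax is a standard numerical-stability trick, not part of the definition):

```python
import numpy as np

def sigmoid(v):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-v))

def relu(v):
    """Linear rectifier max(0, v), applied element-wise."""
    return np.maximum(0.0, v)

def softmax(v):
    """Softmax over a 1-D vector of scores."""
    e = np.exp(v - np.max(v))  # subtract max for numerical stability
    return e / e.sum()

Z = np.array([[1.0, -1.4],
              [-1.38, 1.8]])
relu(Z)  # [[1.0, 0.0], [0.0, 1.8]], as in the slide example
```

Because NumPy broadcasts, the same `sigmoid` and `relu` definitions work on scalars, vectors, and matrices alike.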
Why do we care? Computation Graphs Linear algebra operators arranged in a directed graph!
Figure (computation graph): inputs x₀, x₁, x₂ are connected through weights w₀, w₁, w₂ to a unit H with activation φ.
Each weighted input is computed: w₀x₀ = z₀, w₁x₁ = z₁, w₂x₂ = z₂.
The products are summed: Z = z₀ + z₁ + z₂.
The activation is applied: H = φ(Z).
Vector Form (One Unit) This calculates the activation value of a single hidden unit connected to 3 sensors:
h₀ = φ([w₀ w₁ w₂] · [x₀; x₁; x₂]) = φ(w₀x₀ + w₁x₁ + w₂x₂)
Vector Form (Two Units) This easily generalizes to multiple sensors feeding into multiple units. Known as vectorization!
[h₀; h₁] = φ([w₀ w₁ w₂; w₃ w₄ w₅] · [x₀; x₁; x₂]) = [φ(w₀x₀ + w₁x₁ + w₂x₂); φ(w₃x₀ + w₄x₁ + w₅x₂)]
Now Let Us Fully Vectorize This! This vectorization is also important for formulating mini-batches. (Good for GPU-based processing.) Stacking two input samples as columns:
[w₀ w₁ w₂; w₃ w₄ w₅] · [x₀ x₃; x₁ x₄; x₂ x₅] = [φ(w₀x₀ + …) φ(w₀x₃ + …); φ(w₃x₀ + …) φ(w₃x₃ + …)]
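A minimal sketch of the fully vectorized layer in NumPy. The weight and input values here are hypothetical (the slides leave them symbolic); the shapes match the slide: 2 hidden units, 3 sensors, a mini-batch of 2 samples stored as columns.

```python
import numpy as np

# Hypothetical values: 2 hidden units x 3 inputs.
W = np.array([[0.1, 0.2, 0.3],    # weights into h0
              [0.4, 0.5, 0.6]])   # weights into h1
X = np.array([[1.0, 4.0],
              [2.0, 5.0],
              [3.0, 6.0]])        # each column is one input sample

Z = W @ X                         # (2x3)(3x2) -> 2x2 pre-activations
H = np.maximum(0.0, Z)            # phi = linear rectifier, element-wise
```

One matrix-matrix product computes every unit's pre-activation for every sample in the batch at once, which is exactly what makes GPU execution efficient.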
Identity Matrix The identity matrix I has ones on the main diagonal and zeros elsewhere; multiplying by it leaves a vector unchanged: I x = x
Systems of Equations Ax = b expands to A₁,₁x₁ + A₁,₂x₂ + … + A₁,ₙxₙ = b₁; A₂,₁x₁ + A₂,₂x₂ + … + A₂,ₙxₙ = b₂; and so on.
Solving Systems of Equations A linear system of equations can have: No solution Many solutions Exactly one solution: this means multiplication by the matrix is an invertible function
Matrix Inversion Matrix inverse: A⁻¹A = I. Solving a system using an inverse: Ax = b ⇒ x = A⁻¹b. Numerically unstable, but useful for abstract analysis.
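A minimal NumPy sketch contrasting the textbook route (explicit inverse) with the numerically preferred one (`solve`, which factorizes A without forming A⁻¹); the system here is a made-up 2×2 example:

```python
import numpy as np

# Hypothetical system: 2x + y = 5, x + 3y = 10.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([5.0, 10.0])

x_inv = np.linalg.inv(A) @ b     # textbook route: x = inv(A) b
x_solve = np.linalg.solve(A, b)  # preferred: solves Ax = b without an explicit inverse
```

Both give x = [1, 3] here, but `solve` is the stabler choice in practice, which is the slide's point about the inverse being useful mainly for analysis.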
Invertibility A matrix can't be inverted if it has: more rows than columns; more columns than rows; or redundant rows/columns ("linearly dependent", "low rank")
Norms Functions that measure how “large” a vector is Similar to a distance between zero and the point represented by the vector
Norms The Lᵖ norm: ‖x‖ₚ = (Σᵢ |xᵢ|ᵖ)^(1/p). Most popular norm: the L² norm, p = 2 (Euclidean). The L¹ norm, p = 1 (Manhattan): Σᵢ |xᵢ|. The max norm, p → ∞: maxᵢ |xᵢ|.
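A minimal sketch of the three norms in NumPy, computed directly from their definitions:

```python
import numpy as np

v = np.array([3.0, -4.0])

l1 = np.abs(v).sum()          # L1 (Manhattan) norm: sum of absolute values
l2 = np.sqrt((v ** 2).sum())  # L2 (Euclidean) norm: length of the vector
linf = np.abs(v).max()        # max (L-infinity) norm: largest absolute entry
```

For v = [3, −4] these give 7, 5, and 4 respectively; `np.linalg.norm(v, ord=p)` computes the same quantities.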
Special Matrices and Vectors Unit vector: ‖x‖₂ = 1. Symmetric matrix: A = Aᵀ. Orthogonal matrix: AᵀA = AAᵀ = I (so A⁻¹ = Aᵀ).
Eigendecomposition Eigenvector and eigenvalue: Av = λv. Eigendecomposition of a diagonalizable matrix: A = V diag(λ) V⁻¹. Every real symmetric matrix has a real, orthogonal eigendecomposition: A = Q Λ Qᵀ.
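A minimal sketch of the symmetric case in NumPy (`eigh` is NumPy's routine for symmetric/Hermitian matrices; the 2×2 matrix is a made-up example):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])       # real symmetric

lam, Q = np.linalg.eigh(A)       # eigenvalues (ascending) and orthogonal Q
recon = Q @ np.diag(lam) @ Q.T   # A = Q diag(lambda) Q^T
```

Here the eigenvalues are 1 and 3, Q is orthogonal (QᵀQ = I), and the reconstruction recovers A exactly.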
Effect of Eigenvalues Multiplication by A scales space by λᵢ in the direction of eigenvector vᵢ.
Singular Value Decomposition Similar to eigendecomposition More general; matrix need not be square Stanford lecture
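A minimal NumPy sketch of the SVD on a non-square matrix (the 2×3 matrix is a made-up example; eigendecomposition would not apply to it):

```python
import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 0.0]])  # non-square: 2 x 3

# Thin SVD: A = U diag(s) V^T, singular values s sorted descending.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
recon = U @ np.diag(s) @ Vt
```

Unlike eigenvalues, the singular values are always real and non-negative, and the factorization exists for any matrix shape.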
Moore-Penrose Pseudoinverse If the equation Ax = b has: exactly one solution, this is the same as the inverse; no solution, this gives us the solution with the smallest error ‖Ax − b‖₂; many solutions, this gives us the solution with the smallest norm of x.
Computing the Pseudoinverse The SVD allows the computation of the pseudoinverse: A⁺ = V D⁺ Uᵀ, where D⁺ is obtained by taking the reciprocal of D's non-zero entries (and transposing).
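A minimal sketch of this construction in NumPy, checked against the library routine `np.linalg.pinv` (the tall matrix is a made-up full-column-rank example):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])  # tall: more rows than columns

# Pseudoinverse via the SVD: A+ = V diag(1/s) U^T
# (all singular values are non-zero here, so we can take reciprocals directly).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_pinv = Vt.T @ np.diag(1.0 / s) @ U.T
```

Because A has full column rank, A⁺A = I, and `A_pinv @ b` gives the least-squares solution of the overdetermined system Ax = b.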
Trace The trace is the sum of a matrix's diagonal entries: Tr(A) = Σᵢ Aᵢ,ᵢ
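A minimal sketch of the trace in NumPy, including the cyclic property Tr(AB) = Tr(BA) that often simplifies matrix calculus:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

tr = np.trace(A)                 # 1 + 4 = 5: sum of diagonal entries
cyclic_lhs = np.trace(A @ B)     # trace is invariant under cyclic permutation
cyclic_rhs = np.trace(B @ A)
```

The example matrices are arbitrary; the cyclic identity holds for any A, B of compatible shapes.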
Learning Linear Algebra Do a lot of practice problems: Linear Algebra Done Right (http://www.springer.com/us/book/9783319110790), Linear Algebra for Dummies (http://www.wiley.com/WileyCDA/WileyTitle/productCd-0470430907.html). Start out with lots of summation signs and indexing into individual entries. Code up a few basic matrix operations and compare to worked-out solutions. Eventually you will be able to mostly use matrix and vector product notation quickly and easily.
References This is a variation of Ian Goodfellow's slides for Chapter 2 of Deep Learning (http://www.deeplearningbook.org/lecture_slides.html)