Some Topics Deserving Concern. Songcan Chen, 2013.3.6


Outline
- Copula & its applications
- Kronecker decomposition for (covariance) matrices
- Covariance descriptors & metrics on manifolds

Copula & its applications: references
[1] Fabrizio Durante and Carlo Sempi, Copula Theory: An Introduction (Chapt. 1), in P. Jaworski et al. (eds.), Copula Theory and Its Applications, Lecture Notes in Statistics 198, 2010.
[2] Jean-David Fermanian, An overview of the goodness-of-fit test problem for copulas (Chapt. 1), arXiv, 19 Nov.
Applications:
[A1] David Lopez-Paz, Jose Miguel Hernandez-Lobato, Bernhard Scholkopf, Semi-Supervised Domain Adaptation with Non-Parametric Copulas, NIPS 2012 / arXiv, 1 Jan 2013.
[A2] David Lopez-Paz, et al., Gaussian Process Vine Copulas for Multivariate Dependence, ICML 2013 / arXiv, 16 Feb.
[A3] Carlos Almeida, et al., Modeling high dimensional time-varying dependence using D-vine SCAR models, arXiv, 9 Feb.
[A4] Alexander Bauer, et al., Pair-copula Bayesian networks, arXiv, 23 Nov.
…

Kronecker Decomposition for Matrix: references
[1] C. V. Loan and N. Pitsianis, Approximation with Kronecker products, in Linear Algebra for Large Scale and Real Time Applications, Kluwer Publications, 1993, pp. 293–314.
[2] T. Tsiligkaridis, A. Hero, and S. Zhou, On Convergence of Kronecker Graphical Lasso Algorithms, to appear in IEEE TSP.
[3] ---, Convergence Properties of Kronecker Graphical Lasso Algorithms, arXiv, July.
[4] ---, Low Separation Rank Covariance Estimation using Kronecker Product Expansions (search online).
[5] ---, Covariance Estimation in High Dimensions via Kronecker Product Expansions, arXiv, 12 Feb.
[6] ---, Sparse Covariance Estimation under Kronecker Product Structure, ICASSP 2012.
[7] Marco F. Duarte, Richard G. Baraniuk, Kronecker Compressive Sensing, IEEE TIP, 21(2).
[8] Martin Singull, et al., More on the Kronecker Structured Covariance Matrix, Communications in Statistics—Theory and Methods, 41: 2512–2523, 2012.

Covariance Descriptor: references
[1] Oncel Tuzel, Fatih Porikli, and Peter Meer, Region Covariance: A Fast Descriptor for Detection and Classification, Tech. Report.
[2] Yanwei Pang, Yuan Yuan, Xuelong Li, Gabor-Based Region Covariance Matrices for Face Recognition, IEEE Trans. Circuits and Systems for Video Technology, 18(7), 2008.
[3] Anoop Cherian, et al., Jensen-Bregman LogDet Divergence with Application to Efficient Similarity Search for Covariance Matrices, IEEE TPAMI, in press.
[4] Pedro Cortez Cargill, et al., Object Tracking based on Covariance Descriptors and On-Line Naive Bayes Nearest Neighbor Classifier, Pacific-Rim Symp. Image and Video Technology.
[5] Ravishankar Sivalingam, et al., Positive Definite Dictionary Learning for Region Covariances, ICCV.
[6] Mehrtash T. Harandi, et al., Kernel Analysis over Riemannian Manifolds for Visual Recognition of Actions, Pedestrians and Textures, CVPR 2012.

Copula & its applications

What is a copula? Definition: copulas are statistical tools that factorize a multivariate distribution into the product of its marginals and a function that captures any possible form of dependence among those marginals. This function is referred to as the copula, and it links the marginals together into the joint multivariate model.

What is a copula? Mathematical formulation:
p(x_1, …, x_d) = c(P(x_1), …, P(x_d)) · ∏_{i=1}^{d} p(x_i),   (2)
where P(x_i) is the marginal cdf of the random variable x_i. Interestingly, the copula density c has uniform marginals, since P(z) ~ U[0, 1] for any random variable z. When P(x_1), …, P(x_d) are continuous, the copula c(·) is unique.
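As a quick numeric check of (2), a minimal sketch assuming a bivariate Gaussian toy model and SciPy (all names illustrative): the joint density equals the product of the marginal densities and the Gaussian copula density evaluated at the marginal cdfs.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

# Toy model (illustrative): two standard-normal marginals coupled by a Gaussian copula.
rho = 0.6
joint = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]])

def gaussian_copula_density(u, v, rho):
    """Gaussian copula density c(u, v) with correlation parameter rho."""
    z1, z2 = norm.ppf(u), norm.ppf(v)
    num = multivariate_normal(mean=[0.0, 0.0], cov=[[1.0, rho], [rho, 1.0]]).pdf([z1, z2])
    return num / (norm.pdf(z1) * norm.pdf(z2))

x1, x2 = 0.3, -1.1
u, v = norm.cdf(x1), norm.cdf(x2)                       # marginal cdfs P(x1), P(x2)
lhs = joint.pdf([x1, x2])                               # joint density p(x1, x2)
rhs = gaussian_copula_density(u, v, rho) * norm.pdf(x1) * norm.pdf(x2)
print(lhs, rhs)  # the two numbers agree, illustrating p(x) = c(P(x1), P(x2)) * p(x1) * p(x2)
```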

In particular, a multivariate density can be factorized into a product of marginal distributions and bivariate copula functions (so-called vines). Each of these factors corresponds to one building block that is assumed either constant or varying across different learning domains, which makes the construction applicable to domain adaptation (DA), transfer learning (TL), and multi-task learning (MTL)!

Characteristics: infinitely many multivariate models share the same underlying copula function!

Main advantage: the marginal distributions and the dependencies linking them can be modeled separately and then combined to produce the multivariate model under study.

Estimate p(x) from given samples. Step 1: construct estimates of the marginal pdfs and, from them, the cdfs. Step 2: combine them through the copula.

Estimate marginal pdfs and cdfs: parametric (copula) manner. Examples: Gaussian, Gumbel, Frank, Clayton, or Student copulas, etc. Weakness: real-world data often exhibit complex dependencies which cannot be correctly described by these families!

Illustration of Weaknesses

Estimate marginal pdfs and cdfs: non-parametric manner, using unidimensional KDEs. Illustration of estimation for bivariate copulas.

Non-parametric bivariate copulas. Estimating: given a sample {(x_i, y_i)} with estimated marginal cdfs P and Q, map each observation from pdf to cdf values, which yields a pseudo-sample from its copula c:
(u_i, v_i) = (P(x_i), Q(y_i)),   (4)
where the r.v. (u, v) is supported on [0, 1] × [0, 1].

Non-parametric bivariate copulas. (u, v)'s joint density is the copula function c(u, v)! KDE with Gaussian kernels can approximate c(u, v), but (u, v)'s support is [0, 1] × [0, 1] rather than R^2! Instead, perform the density estimation in a transformed space: select some continuous distribution with support on R, strictly positive density φ, cumulative distribution Φ, and quantile function Φ^{-1}, and transform (u, v) to (z_1, z_2) = (Φ^{-1}(u), Φ^{-1}(v)); their joint pdf is
p(z_1, z_2) = c(Φ(z_1), Φ(z_2)) · φ(z_1) · φ(z_2).   (6)

Non-parametric bivariate copulas. The copula of this new density is identical to the copula of (4), since the performed transformations are marginal-wise and the support of (6) is now R^2. In particular, using the Gaussian density for the transformation, a closed-form kernel estimate of c(u, v) is obtained. See [A1] for more details of the derivation!
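A minimal sketch of the estimator described above, assuming empirical cdfs for the pseudo-observations, a standard-normal transformation, and SciPy's gaussian_kde; this only illustrates the idea, it is not the exact estimator of [A1].

```python
import numpy as np
from scipy.stats import norm, gaussian_kde, rankdata

def fit_nonparametric_copula(x, y):
    """Sketch: (1) map the sample to pseudo-observations (u, v) via empirical cdfs,
    (2) transform with the standard-normal quantile so the support becomes R^2,
    (3) run a 2-D Gaussian KDE there and divide out the Gaussian marginals."""
    n = len(x)
    u = rankdata(x) / (n + 1.0)                 # empirical cdf values in (0, 1)
    v = rankdata(y) / (n + 1.0)
    kde = gaussian_kde(np.vstack([norm.ppf(u), norm.ppf(v)]))

    def copula_density(u_eval, v_eval):
        z1, z2 = norm.ppf(u_eval), norm.ppf(v_eval)
        return kde(np.vstack([z1, z2])) / (norm.pdf(z1) * norm.pdf(z2))

    return copula_density

# usage on synthetic dependent data
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 0.8 * x + 0.6 * rng.normal(size=500)
c_hat = fit_nonparametric_copula(x, y)
print(c_hat(np.array([0.5]), np.array([0.5])))  # estimated copula density at (0.5, 0.5)
```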

Non-parametric multivariate copulas. From bivariate (pair copulas) to multivariate copulas: the extension trick is the introduction of an R-vine.

Domain adaptation: non-linear regression with continuous data. Given the source density p_s(x, y), we want to solve a target task with density p_t(x, y).

DA of non-linear regression. Given the data available for both tasks, our objective is to build a good estimate of the target conditional density p_t(y | x). To address this domain adaptation problem, we assume that p_t is a modified version of p_s; in particular, we assume that p_t is obtained from p_s in two steps.

DA of non-linear regression. Step 1: p_s is expressed using an R-vine representation (a product of marginal densities and bivariate pair copulas). Step 2: some of the factors included in that representation (marginal distributions or pairwise copulas) are modified to derive p_t. All we need to address the adaptation across domains is to reconstruct the R-vine representation of p_s using data from the source task, and then identify which of the factors have been modified to produce p_t. Those factors are corrected using data from the target task.

DA of non-linear regression. A key point: changes in these factors across different domains can be detected with two-sample tests (such as MMD), and the detected changes can be transferred across domains in order to adapt the target density model! The Maximum Mean Discrepancy (MMD) test returns low p-values when two samples are unlikely to have been drawn from the same distribution. See [A1] for more details!
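A minimal sketch of an MMD two-sample check between a factor's source and target samples, assuming a Gaussian kernel with a fixed bandwidth and a simple permutation p-value; the actual test used in [A1] may differ.

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of MMD^2 between samples X and Y (rows) with a Gaussian kernel."""
    def k(A, B):
        d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-d2 / (2 * sigma**2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def permutation_pvalue(X, Y, n_perm=200, sigma=1.0, seed=0):
    """p-value by permuting the pooled sample; a low value suggests the factor changed."""
    rng = np.random.default_rng(seed)
    obs = mmd2_biased(X, Y, sigma)
    pooled, n = np.vstack([X, Y]), len(X)
    hits = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        hits += mmd2_biased(pooled[idx[:n]], pooled[idx[n:]], sigma) >= obs
    return (hits + 1) / (n_perm + 1)

# source vs. target sample of one factor (e.g. pseudo-observations of one pair copula)
rng = np.random.default_rng(1)
src = rng.normal(0.0, 1.0, size=(200, 1))
tgt = rng.normal(0.5, 1.0, size=(200, 1))     # shifted in the target domain
print(permutation_pvalue(src, tgt))           # small p-value -> re-estimate this factor on target data
```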

Insights
- How to extend the copula to image patches?
- How to apply it to multiview learning with (semi-)pairing and/or (semi-)supervision?
- How to adapt the universum to such a new problem?
- How to apply it to zero-data learning?
- Tailor it to a 2D (even tensor) copula
- …

Kronecker Product Decomposition for (Covariance) Matrix

Kronecker product (KP) covariance.
[1] C. V. Loan and N. Pitsianis, Approximation with Kronecker products, in Linear Algebra for Large Scale and Real Time Applications, Kluwer Publications, 1993, pp. 293–314.
[1] proves that any pq × pq matrix Σ_0 can be written as an orthogonal expansion of Kronecker products,
Σ_0 = Σ_{k=1}^{r} A_k ⊗ B_k, with A_k of size p × p and B_k of size q × q,   (1)
thus allowing any covariance matrix to be arbitrarily well approximated by a bilinear decomposition of the form (1).
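A sketch of the Van Loan-Pitsianis construction behind (1), assuming NumPy: rearrange the pq × pq matrix into a p² × q² matrix whose rank-1 terms correspond to Kronecker products, then truncate its SVD to r terms.

```python
import numpy as np

def rearrange(S, p, q):
    """Van Loan-Pitsianis rearrangement: map a (p*q, p*q) matrix to a (p*p, q*q) matrix
    whose rank-1 terms correspond to Kronecker products A (x) B."""
    rows = []
    for i in range(p):
        for j in range(p):
            rows.append(S[i*q:(i+1)*q, j*q:(j+1)*q].reshape(-1))  # vec of the (i, j) block
    return np.array(rows)                                         # shape (p*p, q*q)

def kron_approx(S, p, q, r=1):
    """Best r-term sum of Kronecker products (in Frobenius norm) via truncated SVD."""
    U, s, Vt = np.linalg.svd(rearrange(S, p, q), full_matrices=False)
    approx = np.zeros_like(S)
    for k in range(r):
        A = (np.sqrt(s[k]) * U[:, k]).reshape(p, p)
        B = (np.sqrt(s[k]) * Vt[k]).reshape(q, q)
        approx += np.kron(A, B)
    return approx

# sanity check: an exact Kronecker product is recovered by a single term
rng = np.random.default_rng(0)
p, q = 3, 4
S = np.kron(rng.standard_normal((p, p)), rng.standard_normal((q, q)))
print(np.linalg.norm(S - kron_approx(S, p, q, r=1)))  # ~ 0
```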

Estimation of high-dimensional (HD) covariance matrices: applications. Channel modeling for MIMO wireless communications, geo-statistics, genomics, multi-task learning, face recognition, recommendation systems, collaborative filtering, …

Estimation of HD covariance matrices. The main difficulty of estimation via the maximum likelihood principle: the non-convexity of the optimization problem! Alternatives:
1) the flip-flop (FF) algorithm [WJS08] (sketched below);
2) penalized least squares (PLS) [Lou12];
3) permuted rank-PLS (PRLS) [5].
[WJS08] K. Werner, M. Jansson, and P. Stoica, On estimation of covariance matrices with Kronecker product structure, IEEE TSP, 56(2).
[Lou12] K. Lounici, High-dimensional covariance matrix estimation with missing observations, arXiv, v5, May 2012.
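A sketch of the flip-flop iteration for the single-term model Σ = A ⊗ B, assuming zero-mean samples arranged as q × p matrices with vec(X_i) ~ N(0, A ⊗ B); convergence checks and the scale normalization of A and B are omitted.

```python
import numpy as np

def flip_flop(X, n_iter=20):
    """Flip-flop estimation of A (p x p) and B (q x q) in Cov(vec(X_i)) = kron(A, B).
    X: array of shape (n, q, p); each sample is a zero-mean q x p matrix."""
    n, q, p = X.shape
    A, B = np.eye(p), np.eye(q)
    for _ in range(n_iter):
        Binv = np.linalg.inv(B)
        A = sum(Xi.T @ Binv @ Xi for Xi in X) / (n * q)
        Ainv = np.linalg.inv(A)
        B = sum(Xi @ Ainv @ Xi.T for Xi in X) / (n * p)
    return A, B

# quick usage on synthetic data drawn from a known Kronecker-structured covariance
rng = np.random.default_rng(0)
p, q, n = 3, 4, 2000
A0 = np.diag([1.0, 2.0, 3.0])
B0 = 0.5 * np.eye(q) + 0.5                             # constant-correlation structure
L = np.linalg.cholesky(np.kron(A0, B0))
samples = (L @ rng.standard_normal((p * q, n))).T      # n samples of dimension p*q
X = samples.reshape(n, p, q).transpose(0, 2, 1)        # back to q x p matrices
A_hat, B_hat = flip_flop(X)
err = np.linalg.norm(np.kron(A_hat, B_hat) - np.kron(A0, B0)) / np.linalg.norm(np.kron(A0, B0))
print(err)   # small relative error (A and B themselves are identifiable only up to a scalar)
```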

PLS. Sample covariance matrix (SCM): S_n = (1/n) Σ_{i=1}^{n} x_i x_i^T, computed from n i.i.d. samples x_i with zero mean and covariance Σ_0. PLS fits a Kronecker expansion to the SCM by penalized least squares.

PRLS. The PRLS objective (4) admits a closed-form solution (5) in terms of the permuted (rearranged) SCM.

A theorem: see [5] for more details!

Other estimators for KP-structured covariances

The basic Kronecker model is Σ = A ⊗ B, with A and B positive definite. The ML objective (for zero-mean Gaussian samples with SCM S_n): minimize log det(A ⊗ B) + tr((A ⊗ B)^{-1} S_n) over A and B.

Using this, the problem (58) turns into:

Hybrid robust Kronecker model. The ML objective: solving for Σ > 0 again via Lemma 4 yields:

The problem (73) reduces to (75). Solve (75) using the fixed-point iteration; an arbitrary feasible starting point can be used as the initial iterate.

Insights (1): …

Insights (2)
1) Metric learning (ML): ML&CL, relative-distance constraints, LMNN-like, …
2) Classification learning: predictive function f(X) = tr(W^T X) + b; the objective: …

ML across heterogeneous domains, two lines:
1) Line 1: symmetry and PSD;
2) Line 2 (for ML&CL): an indefinite measure ({U_i} is the base and {α_i} is sparsified),
implying that the two lines can be unified into a common indefinite metric learning!

Insights (4): noise model
x_ci = m_c + U_c v_ci + e_ci + o_ci,
where c indexes the class or cluster, m_c is its center, U_c is a subspace term, e_ci is noise, and o_ci is an outlier term with ||o_ci|| ≠ 0 if the sample is an outlier and 0 otherwise. Discussion:
1) U_c = 0, o_ci = 0; e_ci ~ N(0, dI) ⇒ means; ~ Lap(0, dI) ⇒ medians; other priors ⇒ other statistics;
2) U_c ≠ 0, o_ci = 0; e_ci ~ N(0, dI) ⇒ PCA; ~ Lap(0, dI) ⇒ L1-PCA; other priors ⇒ other PCAs;

3) U_c = 0, o_ci ≠ 0; e_ci ~ N(0, dI) ⇒ robust (k-)means; ~ Lap(0, dI) ⇒ (k-)medians;
4) subspace U_c ≠ 0, o_ci ≠ 0; e_ci ~ N(0, dI) ⇒ robust k-subspaces;
5) m_c = 0 ……
6) robust (semi-)NMF ……
7) robust CA ……, where the noise model is Γ = B A^T Υ + E + O.
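A tiny numerical illustration of case 1) above (U_c = 0, o_ci = 0), assuming one-dimensional data: the center minimizing the squared loss implied by Gaussian noise is the sample mean, while the absolute loss implied by Laplacian noise gives the sample median, which is why the two priors lead to (k-)means vs. (k-)medians.

```python
import numpy as np

rng = np.random.default_rng(0)
m_true = 2.0
x = m_true + rng.laplace(scale=1.0, size=501)               # 1-D samples with Laplacian noise

grid = np.linspace(x.min(), x.max(), 2001)                  # candidate centers m
l2_loss = ((x[None, :] - grid[:, None]) ** 2).sum(axis=1)   # Gaussian noise -> squared loss
l1_loss = np.abs(x[None, :] - grid[:, None]).sum(axis=1)    # Laplacian noise -> absolute loss

print(grid[l2_loss.argmin()], x.mean())      # L2 minimizer ~ the sample mean
print(grid[l1_loss.argmin()], np.median(x))  # L1 minimizer ~ the sample median
```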

Covariance Descriptor (CD)

Applications of CD: multi-camera object tracking, human detection, image segmentation, texture segmentation, robust face recognition, emotion recognition, human action recognition, speech recognition, …
[3] Anoop Cherian, et al., Jensen-Bregman LogDet Divergence with Application to Efficient Similarity Search for Covariance Matrices, IEEE TPAMI, in press, 2012.

CD for image and vision. Let I be an intensity or color image and F the W × H × d feature image extracted from I by
F(x, y) = φ(I, x, y),   (1)
where the function φ can be any mapping such as intensity, color, gradients, filter responses, etc., e.g. pixel coordinates combined with intensities and image derivatives.

CD for image and vision. For a given rectangular region R in F, let {z_k}, k = 1..n, be the d-dimensional feature points inside R; the CD of R is defined as
C_R = (1/(n-1)) Σ_{k=1}^{n} (z_k - μ)(z_k - μ)^T,   (2)
where μ is the mean of the feature points.
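A direct implementation of (2), assuming a grayscale image and the simple feature map φ(I, x, y) = [x, y, I, |I_x|, |I_y|]; the exact feature set is an illustrative choice, not necessarily the one used in [1].

```python
import numpy as np

def feature_image(I):
    """F(x, y) = phi(I, x, y): pixel coordinates, intensity, and gradient magnitudes."""
    H, W = I.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    Iy, Ix = np.gradient(I.astype(float))
    return np.stack([xs, ys, I.astype(float), np.abs(Ix), np.abs(Iy)], axis=-1)  # H x W x d

def region_covariance(F, y0, y1, x0, x1):
    """Covariance descriptor C_R of the rectangle [y0:y1, x0:x1] of the feature image F."""
    Z = F[y0:y1, x0:x1].reshape(-1, F.shape[-1])   # the n feature points z_k inside R
    Zc = Z - Z.mean(axis=0)
    return Zc.T @ Zc / (len(Z) - 1)                # d x d, symmetric positive semi-definite

# usage on a random "image"
I = np.random.default_rng(0).random((64, 64))
C = region_covariance(feature_image(I), 10, 40, 5, 35)
print(C.shape)   # (5, 5)
```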

CD for face images. Object representation: construct five covariance matrices from overlapping regions of the object feature image. The covariances are used as the object descriptors!

CD for textures. Texture representation: there are u images for each texture class; we sample s regions from each image and compute their covariance matrices C_R.

Advantages:
- a single covariance matrix extracted from a region is usually enough to match the region in different views and poses;
- a natural way of fusing multiple features which might be correlated;
- low-dimensional compared to other region descriptors: due to the symmetry of C_R, it has only d(d+1)/2 distinct values;
- a certain scale and rotation invariance over regions in different images, since the covariance does not depend on the ordering or the number of points;
- fast to compute via integral images (see the sketch below)!
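A sketch of the integral-image computation mentioned in the last point, assuming a precomputed feature image F: cumulative sums of the features and of their outer products are built once, after which the covariance of any rectangle costs O(d²).

```python
import numpy as np

def build_integral_images(F):
    """Integral images of the d features and of their d x d outer products."""
    P = F.cumsum(axis=0).cumsum(axis=1)                                    # H x W x d
    Q = (F[..., :, None] * F[..., None, :]).cumsum(axis=0).cumsum(axis=1)  # H x W x d x d
    return P, Q

def _rect_sum(S, y0, y1, x0, x1):
    """Sum of the original values over rows [y0, y1) and cols [x0, x1) via the integral image S."""
    total = S[y1 - 1, x1 - 1].copy()
    if y0 > 0: total -= S[y0 - 1, x1 - 1]
    if x0 > 0: total -= S[y1 - 1, x0 - 1]
    if y0 > 0 and x0 > 0: total += S[y0 - 1, x0 - 1]
    return total

def region_covariance_fast(P, Q, y0, y1, x0, x1):
    n = (y1 - y0) * (x1 - x0)
    s = _rect_sum(P, y0, y1, x0, x1)       # sum of the feature vectors in R
    q = _rect_sum(Q, y0, y1, x0, x1)       # sum of their outer products
    return (q - np.outer(s, s) / n) / (n - 1)

# agrees with the direct computation
rng = np.random.default_rng(0)
F = rng.random((64, 64, 5))
P, Q = build_integral_images(F)
Z = F[10:40, 5:35].reshape(-1, 5); Zc = Z - Z.mean(0)
print(np.allclose(region_covariance_fast(P, Q, 10, 40, 5, 35), Zc.T @ Zc / (len(Z) - 1)))  # True
```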

Matching. Key: distance measures between SPD matrices! Known: all SPD matrices of a given size form a Riemannian manifold, so the distance between two SPD matrices can be measured along geodesics. However, computing similarity between covariance matrices is non-trivial.

Metrics between two SPD matrices X and Y:
Affine-Invariant Riemannian Metric (AIRM): d_AIRM(X, Y) = ||log(X^{-1/2} Y X^{-1/2})||_F
Log-Euclidean Riemannian Metric (LERM): d_LERM(X, Y) = ||log(X) - log(Y)||_F

Metrics between two SPD matrices X and Y:
Symmetrized KL-Divergence Metric (KLDM): d_KLDM^2(X, Y) = (1/2) tr(X^{-1} Y + Y^{-1} X - 2I)
Jensen-Bregman LogDet Divergence (JBLD): J_ld(X, Y) = log det((X + Y)/2) - (1/2) log det(X Y)
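A minimal NumPy sketch of the four (dis)similarities above for SPD inputs X and Y, using their standard formulas.

```python
import numpy as np

def _spd_fun(X, fun):
    """Apply a scalar function to the eigenvalues of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(X)
    return (V * fun(w)) @ V.T

def airm(X, Y):
    """Affine-Invariant Riemannian Metric: ||log(X^(-1/2) Y X^(-1/2))||_F."""
    Xis = _spd_fun(X, lambda w: w ** -0.5)
    M = Xis @ Y @ Xis
    return np.linalg.norm(_spd_fun((M + M.T) / 2, np.log), 'fro')

def lerm(X, Y):
    """Log-Euclidean Riemannian Metric: ||log X - log Y||_F."""
    return np.linalg.norm(_spd_fun(X, np.log) - _spd_fun(Y, np.log), 'fro')

def kldm2(X, Y):
    """Symmetrized KL divergence: 0.5 * tr(X^(-1) Y + Y^(-1) X - 2I)."""
    return 0.5 * np.trace(np.linalg.solve(X, Y) + np.linalg.solve(Y, X) - 2 * np.eye(len(X)))

def jbld(X, Y):
    """Jensen-Bregman LogDet divergence: log det((X + Y)/2) - 0.5 * log det(X Y)."""
    return np.linalg.slogdet((X + Y) / 2)[1] - 0.5 * (np.linalg.slogdet(X)[1] + np.linalg.slogdet(Y)[1])

# usage on two random SPD matrices
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5)); X = A @ A.T + np.eye(5)
B = rng.standard_normal((5, 5)); Y = B @ B.T + np.eye(5)
print(airm(X, Y), lerm(X, Y), kldm2(X, Y), jbld(X, Y))
```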

Properties of JBLD

Important Theorems (1)

Important Theorems (2)

Computing time (1)

Computing time (2)

K-means with JBLD: the objective replaces the squared Euclidean distance of standard K-means with the JBLD divergence between each covariance matrix and its cluster centroid.
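A sketch of Lloyd-style K-means over SPD matrices with JBLD as the dissimilarity; the centroid update M ← [ (1/n) Σ_i ((X_i + M)/2)^{-1} ]^{-1} is an assumption of this sketch, so see [3] for the actual algorithm and its guarantees.

```python
import numpy as np

def jbld(X, Y):
    return np.linalg.slogdet((X + Y) / 2)[1] - 0.5 * (np.linalg.slogdet(X)[1] + np.linalg.slogdet(Y)[1])

def jbld_centroid(mats, n_iter=30):
    """Fixed-point iteration for the JBLD mean of a set of SPD matrices (assumed update)."""
    M = np.mean(mats, axis=0)                      # start from the Euclidean mean
    for _ in range(n_iter):
        M = np.linalg.inv(np.mean([np.linalg.inv((X + M) / 2) for X in mats], axis=0))
    return M

def kmeans_jbld(mats, k, n_iter=20, seed=0):
    """Lloyd-style K-means over SPD matrices with JBLD as the dissimilarity."""
    rng = np.random.default_rng(seed)
    centers = [mats[i] for i in rng.choice(len(mats), k, replace=False)]
    for _ in range(n_iter):
        labels = np.array([np.argmin([jbld(X, C) for C in centers]) for X in mats])
        centers = [jbld_centroid([X for X, l in zip(mats, labels) if l == c])
                   if np.any(labels == c) else centers[c] for c in range(k)]
    return labels, centers

# usage: cluster random SPD matrices generated around two base covariances
rng = np.random.default_rng(1)
def spd_near(C, n=20, jitter=0.1):
    out = []
    for _ in range(n):
        A = rng.standard_normal(C.shape)
        out.append(C + jitter * (A @ A.T))
    return out
mats = spd_near(np.eye(4)) + spd_near(np.diag([5.0, 4.0, 3.0, 2.0]))
labels, centers = kmeans_jbld(mats, k=2)
print(labels)
```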

Isosurface plots for various distance measures: (a) Frobenius distance, (b) AIRM, (c) KLDM, and (d) JBLD.

Table 3: a comparison of various metrics on covariances and their computational complexities against JBLD. See [3] for more details!
[3] Anoop Cherian, et al., Jensen-Bregman LogDet Divergence with Application to Efficient Similarity Search for Covariance Matrices, IEEE TPAMI, in press, 2012.

Insights
- How to extend the CD to text? Key: define the CD on a general graph with discrete operators on the graph, including local ones (derivative, gradient, difference, etc.) and global ones (centrality, etc.).
- Tailor the CD to 2D classifiers under various scenarios.
- KP and pdfs defined on CDs.
- Copulas on CDs!
- Extend it to multiview learning with heterogeneous sources!
- …

Thanks! Q&A