Object Oriented Data Analysis (J. S. Marron, Dept. of Statistics and Operations Research, University of North Carolina; U. C. Davis, F. R. G. Workshop)

1. UNC, Stat & OR
U. C. Davis, F. R. G. Workshop
Object Oriented Data Analysis
J. S. Marron, Dept. of Statistics and Operations Research, University of North Carolina
February 21, 2016

2. Object Oriented Data Analysis, I
What is the “atom” of a statistical analysis?
- 1st Course: Numbers
- Multivariate Analysis Course: Vectors
- Functional Data Analysis: Curves
- More generally: Data Objects

3. Object Oriented Data Analysis, II
Examples:
- Medical Image Analysis
  - Images as Data Objects?
  - Shape Representations as Objects
- Micro-arrays
  - Just multivariate analysis?

4. Object Oriented Data Analysis, III
Typical Goals:
- Understanding population variation
  - Principal Component Analysis +
- Discrimination (a.k.a. Classification)
- Time Series of Data Objects

5. Object Oriented Data Analysis, IV
Major Statistical Challenge, I: High Dimension Low Sample Size (HDLSS)
- Dimension d >> sample size n
- “Multivariate Analysis” nearly useless
  - Can’t “normalize the data”
- Land of Opportunity for Statisticians
  - Need for “creative statisticians”

6. Object Oriented Data Analysis, V
Major Statistical Challenge, II: Data may live in non-Euclidean space
- Lie Groups / Symmetric Spaces
- Trees/Graphs as data objects
Interesting Issues:
- What is “the mean” (population center)?
- How do we quantify “population variation”?

7. Statistics in Image Analysis, I
First Generation Problems:
- Denoising
- Segmentation
- Registration
(all about single images)

8. Statistics in Image Analysis, II
Second Generation Problems:
- Populations of Images
  - Understanding Population Variation
  - Discrimination (a.k.a. Classification)
- Complex Data Structures (& Spaces)
- HDLSS Statistics

9. HDLSS Statistics in Imaging
Why HDLSS (High Dim, Low Sample Size)?
- Complex 3-d Objects are Hard to Represent
  - Often need d = 100’s of parameters
- Complex 3-d Objects are Costly to Segment
  - Often have only n = 10’s of cases

10. Object Representation
- Landmarks (hard to find)
- Boundary Representations (no correspondence)
- Medial representations
  - Find “skeleton”
  - Discretize as “atoms” called m-reps

11. 3-d m-reps
Bladder – Prostate – Rectum (multiple objects, J. Y. Jeong)
- Medial Atoms provide “skeleton”
- Implied Boundary from “spokes” → “surface”

12. Illuminating Viewpoint
Object Space ↔ Feature Space
- Focus here on collection of data objects
- Here conceptualize population structure via “point clouds”

13. PCA for m-reps, I
Major issue: m-reps live in a product space (locations, radius and angles)
E.g. what is the “average” of a set of angles? = ???
Natural Data Structure is: Lie Groups ~ Symmetric Spaces (smooth, curved manifolds)
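The “average of angles” question can be made concrete with a small sketch. The angle values below are hypothetical (the slide's own example did not survive extraction), and the circular mean shown is only one common choice of manifold-aware average, not necessarily the one used in the talk. It averages the unit vectors and reads off the direction of the result; it is undefined when the unit vectors cancel exactly.

```python
import math

def arithmetic_mean(angles_deg):
    """Ordinary average, which ignores that 0 and 360 degrees coincide."""
    return sum(angles_deg) / len(angles_deg)

def circular_mean(angles_deg):
    """Direction of the summed unit vectors, reported in [0, 360)."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# Hypothetical angles clustered around 0 degrees (assumption, not from the slide).
angles = [2.0, 4.0, 356.0, 358.0]

print("arithmetic mean:", arithmetic_mean(angles))  # points the opposite way
print("circular mean:  ", circular_mean(angles))    # close to 0 (mod 360)
```

The arithmetic mean lands at 180 degrees, far from every data point, which is the kind of failure that motivates treating the data as living on a curved manifold.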

14. PCA for m-reps, II
PCA on non-Euclidean spaces? (i.e. on Lie Groups / Symmetric Spaces)
T. Fletcher: Principal Geodesic Analysis (PGA)
Idea: replace “linear summary of data” with “geodesic summary of data”…

15. PGA for m-reps, Bladder-Prostate-Rectum
Bladder – Prostate – Rectum, 1 person, 17 days
PG 1, PG 2, PG 3
(analysis by Ja Yeon Jeong)

18. HDLSS Classification (i.e. Discrimination)
Background: Two Class (Binary) version:
- Using “training data” from Class +1, and from Class -1
- Develop a “rule” for assigning new data to a Class
Canonical Example: Disease Diagnosis
- New Patients are “Healthy” or “Ill”
- Determined based on measurements
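As a minimal sketch of what such a “rule” can look like, the following fits a mean-difference linear classifier: the normal vector is the difference of the class means, and the boundary passes through their midpoint. The data, the function names and the choice of this particular rule are assumptions made for illustration; the slide itself does not prescribe a method.

```python
def mean_vector(rows):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def fit_mean_difference(class_plus, class_minus):
    """Return (w, b) so that sign(w.x + b) is the predicted class."""
    mean_plus = mean_vector(class_plus)
    mean_minus = mean_vector(class_minus)
    w = [p - m for p, m in zip(mean_plus, mean_minus)]
    midpoint = [(p + m) / 2 for p, m in zip(mean_plus, mean_minus)]
    b = -sum(wi * ci for wi, ci in zip(w, midpoint))
    return w, b

def classify(x, w, b):
    """Assign a new observation to Class +1 or Class -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical measurements: Class +1 ("Healthy") and Class -1 ("Ill").
healthy = [[1.0, 2.0], [2.0, 1.0], [1.5, 1.5]]
ill = [[4.0, 5.0], [5.0, 4.0], [4.5, 4.5]]

w, b = fit_mean_difference(healthy, ill)
print(classify([1.0, 1.0], w, b))  # new patient near the "Healthy" cluster
print(classify([5.0, 5.0], w, b))  # new patient near the "Ill" cluster
```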

19. HDLSS Classification (Cont.)
Ineffective Methods:
- Fisher Linear Discrimination
- Gaussian Likelihood Ratio
Less Useful Methods:
- Nearest Neighbors
- Neural Nets (“black boxes”, no “directions” or intuition)

20. HDLSS Classification (Cont.)
Currently Fashionable Methods:
- Support Vector Machines
- Tree-Based Approaches
New High Tech Method:
- Distance Weighted Discrimination (DWD)
  - Specially designed for HDLSS data
  - Avoids “data piling” problem of SVM
  - Solves more suitable optimization problem

21. HDLSS Classification (Cont.)
Currently Fashionable Methods:
- Tree-Based Approaches
- Support Vector Machines

22. HDLSS Classification (Cont.)
Comparison of Linear Methods (toy data):
- Optimal Direction
  - Excellent, but need direction in dim = 50
- Maximal Data Piling (J. Y. Ahn, D. Peña)
  - Great separation, but generalizability???
- Support Vector Machine
  - More separation, generalizability, but some data piling?
- Distance Weighted Discrimination
  - Avoids data piling, good generalizability, Gaussians?

23. Distance Weighted Discrimination
Maximal Data Piling

24. Maximal Data Piling
- Mind boggling?
- J. Y. Ahn has characterized
  - Formula ~ FLD
- There are many
- Publishable???

25. Distance Weighted Discrimination
Based on Optimization Problem: minimize the sum of inverse distances from the data to the separating hyperplane
- More precisely: work in appropriate penalty for violations
Optimization Method (Michael Todd): Second Order Cone Programming
- Still a convex generalization of quadratic programming
- Fast greedy solution
- Can use existing software
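The slide's formula was lost in extraction. For orientation, the DWD problem is usually written as follows in the literature; the notation here ($x_i$ for data vectors, $y_i \in \{+1, -1\}$ for class labels, $w$ and $b$ for the hyperplane, $\xi_i$ for violation slacks, $C$ for the penalty weight) is an assumption and may differ from what the slide showed.

```latex
\begin{aligned}
\min_{w,\, b,\, \xi} \quad & \sum_{i=1}^{n} \left( \frac{1}{r_i} + C\, \xi_i \right) \\
\text{subject to} \quad & r_i = y_i \left( w^{\top} x_i + b \right) + \xi_i, \\
& r_i \ge 0, \quad \xi_i \ge 0, \quad i = 1, \ldots, n, \\
& \lVert w \rVert \le 1 .
\end{aligned}
```

Each $r_i$ is the (slack-adjusted) signed distance of point $i$ to the hyperplane, so every point influences the direction through $1/r_i$, rather than only the support vectors as in SVM. The reciprocal terms and the norm constraint can be expressed with second-order cone constraints, which is what makes existing SOCP solvers applicable.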

26. Simulation Comparison
E.g. Above Gaussians:
- Wide array of dimensions
- SVM substantially worse
- MD – Bayes Optimal
- DWD close to MD

27. Simulation Comparison
E.g. Outlier Mixture:
- Disaster for MD
- SVM & DWD much more solid
- Directions are “robust”
- SVM & DWD similar

28. Simulation Comparison
E.g. Wobble Mixture:
- Disaster for MD
- SVM less good
- DWD slightly better
Note: All methods come together for larger d ???

29. DWD in Face Recognition, I
Face Images as Data (with M. Benito & D. Peña)
- Registered using landmarks
- Male – Female Difference?
- Discrimination Rule?

30. DWD in Face Recognition, II
DWD Direction:
- Good separation
- Images “make sense”
- Garbage at ends? (extrapolation effects?)

31. DWD in Face Recognition, III
Interesting summary: Jump between means (in DWD direction)
- Clear separation of Maleness vs. Femaleness

32. DWD in Face Recognition, IV
Fun Comparison: Jump between means (in SVM direction)
- Also distinguishes Maleness vs. Femaleness
- But not as well as DWD

33. DWD in Face Recognition, V
Analysis of difference: Project onto normals
- SVM has “small gap” (feels noise artifacts?)
- DWD “more informative” (feels real structure?)
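“Projecting onto normals” amounts to taking the inner product of each data vector with a unit normal vector and then inspecting the one-dimensional scores. The sketch below does this for two hypothetical candidate directions on toy data; it does not compute SVM or DWD normals, and the `gap` summary (smallest Class +1 score minus largest Class -1 score) is an illustrative choice rather than the talk's diagnostic.

```python
import math

def unit(v):
    """Rescale a vector to length one."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(points, direction):
    """One-dimensional scores: inner product of each point with the unit direction."""
    u = unit(direction)
    return [sum(p * q for p, q in zip(point, u)) for point in points]

def gap(class_plus, class_minus, direction):
    """Smallest Class +1 score minus largest Class -1 score along the direction."""
    return min(project(class_plus, direction)) - max(project(class_minus, direction))

# Toy data and candidate directions (assumptions for illustration only).
class_plus = [[3.0, 0.0], [4.0, 1.0]]
class_minus = [[0.0, 0.0], [1.0, 1.0]]

for name, direction in [("direction A", [1.0, 0.0]), ("direction B", [1.0, 1.0])]:
    print(name, "gap =", round(gap(class_plus, class_minus, direction), 3))
```

A larger positive gap means the projected classes sit further apart along that direction; comparing the full projected score distributions, not only the gap, is what distinguishes data piling from informative structure.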

34. DWD in Face Recognition, VI
Current Work:
- Focus on “drivers” (regions of interest)
- Relation to Discrimination?
- Which is “best”?
- Lessons for human perception?

35. DWD & Microarrays for Gene Expression
Skip due to time pressure (some have seen this…)
DWD provides excellent tool for:
- Combining Data Sets (caBIG funded)
- Visualization of HDLSS data
- HDLSS hypothesis testing
Let’s talk informally if you are interested

36. Discrimination for m-reps
Classification for Lie Groups – Symmetric Spaces (S. K. Sen & S. Joshi)
- What is “separating plane” (for SVM-DWD)?

37. Trees as Data Points, I
Brain Blood Vessel Trees (E. Bullitt & H. Wang)
- Statistical Understanding of Population?
- Mean?
- PCA?
Challenge: Very Non-Euclidean

38. Trees as Data Points, II
- Mean of Tree Population: Fréchet Approach
- PCA on Trees (based on “tree lines”)
- Theory in Place - Implementation?
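For reference, the Fréchet approach defines a mean using only a distance, which is why it applies to trees. The symbols below ($\mathcal{M}$ for the space of data objects, $\delta$ for the metric on it, $x_1, \ldots, x_n$ for the observed objects) are notation chosen here, not taken from the slide.

```latex
\bar{x} \;=\; \operatorname*{arg\,min}_{x \in \mathcal{M}} \; \sum_{i=1}^{n} \delta(x, x_i)^2
```

In Euclidean space with the usual distance this reduces to the ordinary sample mean. In a tree space the minimizer depends on the chosen metric and need not be unique.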

39. HDLSS Asymptotics: Simple Paradoxes, I
For $d$-dimensional “Standard Normal” distribution, $Z \sim N_d(0, I)$:
Euclidean Distance to Origin (as $d \to \infty$): $\|Z\| = \sqrt{d} + O_p(1)$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of “highest density”???
- Paradox resolved by: “density w. r. t. Lebesgue Measure”
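The concentration of the norm around the square root of the dimension is easy to check numerically. The following is a small simulation sketch using only the standard library; the dimensions, sample count and seed are arbitrary choices made for illustration.

```python
import math
import random

def gaussian_vector(d, rng):
    """Draw one d-dimensional standard normal vector."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def norm(v):
    """Euclidean distance from the origin."""
    return math.sqrt(sum(x * x for x in v))

rng = random.Random(0)  # arbitrary seed, for reproducibility only
for d in (10, 1000, 10000):
    norms = [norm(gaussian_vector(d, rng)) for _ in range(20)]
    print(f"d={d:6d}  sqrt(d)={math.sqrt(d):8.2f}  "
          f"min norm={min(norms):8.2f}  max norm={max(norms):8.2f}")
```

As the dimension grows, the spread between the smallest and largest norm stays of order one while the radius grows, so relative to the radius the points sit ever closer to the sphere's surface.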

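40. HDLSS Asymptotics: Simple Paradoxes, II
For $d$-dimensional “Standard Normal” distribution: $Z_1$ independent of $Z_2$, both $N_d(0, I)$
Euclidean Distance between $Z_1$ and $Z_2$ (as $d \to \infty$):
- Distance tends to non-random constant: $\|Z_1 - Z_2\| = \sqrt{2d} + O_p(1)$
- Can extend to $Z_1, \ldots, Z_n$
- Where do they all go??? (we can only perceive 3 dimensions)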
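The non-random limit follows from a law of large numbers applied coordinate by coordinate. A short derivation, written with $Z_{1j}$ and $Z_{2j}$ denoting the $j$-th coordinates of two independent standard normal vectors (notation assumed here):

```latex
\begin{aligned}
\lVert Z_1 - Z_2 \rVert^2 &= \sum_{j=1}^{d} \left( Z_{1j} - Z_{2j} \right)^2,
\qquad Z_{1j} - Z_{2j} \sim N(0, 2) \ \text{independently over } j, \\
\frac{1}{d} \lVert Z_1 - Z_2 \rVert^2 &\xrightarrow{\ P\ } \mathbb{E}\left( Z_{1j} - Z_{2j} \right)^2 = 2
\qquad (d \to \infty), \\
\lVert Z_1 - Z_2 \rVert &= \sqrt{2d} + O_p(1).
\end{aligned}
```

The last line uses the fact that $\lVert Z_1 - Z_2 \rVert^2 / 2$ has a chi-squared distribution with $d$ degrees of freedom, whose square root fluctuates around $\sqrt{d}$ by an amount of order one.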

41. HDLSS Asymptotics: Simple Paradoxes, III
For $d$-dimensional “Standard Normal” distribution: $Z_1$ independent of $Z_2$, both $N_d(0, I)$
High dimensional Angles (as $d \to \infty$):
- $\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$
- “Everything is orthogonal”???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random
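The near-orthogonality of independent high-dimensional Gaussian vectors can also be checked by simulation. This is a sketch with arbitrary dimensions, pair count and seed, using only the standard library.

```python
import math
import random

def gaussian_vector(d, rng):
    """Draw one d-dimensional standard normal vector."""
    return [rng.gauss(0.0, 1.0) for _ in range(d)]

def angle_degrees(u, v):
    """Angle between two vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # guard against rounding
    return math.degrees(math.acos(cosine))

rng = random.Random(0)  # arbitrary seed, for reproducibility only
for d in (3, 100, 10000):
    angles = [angle_degrees(gaussian_vector(d, rng), gaussian_vector(d, rng))
              for _ in range(20)]
    print(f"d={d:6d}  min angle={min(angles):7.2f}  max angle={max(angles):7.2f}")
```

In low dimension the angles scatter widely; in high dimension they cluster tightly around 90 degrees.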

42. HDLSS Asymptotics: Geometrical Representation, I
Assume $Z_1, \ldots, Z_n \sim N_d(0, I)$, let $d \to \infty$
Study Subspace Generated by Data:
a. Hyperplane through 0, of dimension $n$
b. Points are “nearly equidistant to 0”, & dist $\sim \sqrt{d}$
c. Within plane, can “rotate towards $\sqrt{d} \times$ Unit Simplex”
d. All Gaussian data sets are “near Unit Simplex Vertices”!!!
“Randomness” appears only in rotation of simplex
(with P. Hall & A. Neeman)

43. HDLSS Asymptotics: Geometrical Representation, II
Assume $Z_1, \ldots, Z_n \sim N_d(0, I)$, let $d \to \infty$
Study Hyperplane Generated by Data:
a. $n - 1$ dimensional hyperplane
b. Points are pairwise equidistant, dist $\sim \sqrt{2d}$
c. Points lie at vertices of “regular $n$-hedron”
d. Again “randomness in data” is only in rotation
e. Surprisingly rigid structure in data?

44. HDLSS Asymptotics: Geometrical Representation, III
Simulation View: shows “rigidity after rotation”
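The slide's figure is not available, but the rigidity can be illustrated numerically without plotting. In the sketch below, every pairwise distance in a small Gaussian sample is divided by the square root of twice the dimension; if the sample forms a nearly regular simplex, all ratios should be close to one. The sample size, dimensions and seed are arbitrary assumptions.

```python
import itertools
import math
import random

def sample(n, d, rng):
    """Draw n independent d-dimensional standard normal vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

def scaled_pairwise_distances(data):
    """All pairwise Euclidean distances, each divided by sqrt(2 d)."""
    d = len(data[0])
    scale = math.sqrt(2.0 * d)
    ratios = []
    for x, y in itertools.combinations(data, 2):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
        ratios.append(dist / scale)
    return ratios

rng = random.Random(0)  # arbitrary seed, for reproducibility only
for d in (2, 200, 20000):
    ratios = scaled_pairwise_distances(sample(3, d, rng))
    print(f"d={d:6d}  scaled distances:", [round(r, 3) for r in ratios])
```

With three points, ratios near one for all three pairs mean the points form a nearly equilateral triangle, whatever its orientation; the orientation is where the remaining randomness lives.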

45. HDLSS Asymptotics: Geometrical Representation, III
Straightforward Generalizations:
- non-Gaussian data: only need moments
- non-independent: use “mixing conditions”
- Mild Eigenvalue condition on Theoretical Cov. (with J. Ahn, K. Muller & Y. Chi)
All based on simple “Laws of Large Numbers”

46. HDLSS Asymptotics: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior: “everything similar for very high d”
- 2 populations are 2 simplices (i.e. regular n-hedrons)
- All are same distance from the other class
- i.e. everything is a support vector
- i.e. all sensible directions show “data piling”
- so “sensible methods are all nearly the same”
- Including 1-NN

47. HDLSS Asymptotics: Geometrical Representation, V
Further Consequences of Geometric Representation:
1. Inefficiency of DWD for uneven sample size (motivates “weighted version”, work in progress)
2. DWD more “stable” than SVM
   - based on “deeper limiting distributions”
   - reflects intuitive idea “feeling sampling variation”
   - something like “mean vs. median”
3. 1-NN rule inefficiency is quantified.

48. The Future of Geometrical Representation?
- HDLSS version of “optimality” results?
- “Contiguity” approach? Parameters depend on d?
- Rates of Convergence?
- Improvements of DWD? (e.g. other functions of distance than inverse)
It is still early days …

49. Some Carry Away Lessons
- Atoms of the Analysis: Object Oriented
- HDLSS contexts deserve further study
- DWD is attractive for HDLSS classification
- “Randomness” in HDLSS data is only in rotations (modulo rotation, have constant simplex shape)
- How to put HDLSS asymptotics to work?

