Participant Presentations Please Sign Up: Name Email (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early, Oct.,

Presentation transcript:

Participant Presentations Please Sign Up: Name (Onyen is fine, or …) Are You ENRolled? Tentative Title (???? Is OK) When: Next Week, Early, Oct., Nov., Late

Object Oriented Data Analysis
Three Major Parts of OODA Applications:
I. Object Definition: “What are the Data Objects?”
II. Exploratory Analysis: “What Is Data Structure / Drivers?”
III. Confirmatory Analysis / Validation: “Is it Really There (vs. Noise Artifact)?”

Yeast Cell Cycle Data, FDA View Central question: Which genes are “ periodic ” over 2 cell cycles?

Frequency 2 Analysis Colors are

Batch and Source Adjustment
For Stanford Breast Cancer Data (C. Perou)
Analysis in Benito, et al (2004)
Adjust for Source Effects
– Different sources of mRNA
Adjust for Batch Effects
– Arrays fabricated at different times

Source Batch Adj: PC 1-3 & DWD direction

Source Batch Adj: DWD Source Adjustment

NCI 60: Raw Data, Platform Colored

NCI 60: Fully Adjusted Data, Platform Colored

Object Oriented Data Analysis
Three Major Parts of OODA Applications:
I. Object Definition: “What are the Data Objects?”
II. Exploratory Analysis: “What Is Data Structure / Drivers?”
III. Confirmatory Analysis / Validation: “Is it Really There (vs. Noise Artifact)?”

Recall Drug Discovery Data

Raw Data – PCA Scatterplot Dominated By Few Large Compounds Not Good Blue - Red Separation

Recall Drug Discovery Data MargDistPlot.m – Sorted on Means Revealed Many Interesting Features Led To Data Modification

Recall Drug Discovery Data PCA on Binary Variables Interesting Structure? Clusters? Stronger Red vs. Blue

Recall Drug Discovery Data PCA on Binary Variables Deep Question: Is Red vs. Blue Separation Better?

Recall Drug Discovery Data PCA on Transformed Non-Binary Variables Interesting Structure? Clusters? Stronger Red vs. Blue

Recall Drug Discovery Data PCA on Transformed Non-Binary Variables Same Deep Question: Is Red vs. Blue Separation Better?

Recall Drug Discovery Data
Question: When Is Red vs. Blue Separation Better?
Visual Approach:
• Train DWD to Separate
• Project, and View How Separated
• Useful View, Add Orthogonal PC Directions
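The visual approach above can be sketched in a few lines of Python. This is an illustration only, under loud assumptions: the mean-difference direction stands in for DWD (which requires an optimization solver), the first orthogonal PC is found by power iteration, and the "red" and "blue" data are simulated, not the drug discovery data. All function names are hypothetical.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def column_means(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def separating_direction(class_a, class_b):
    # ASSUMPTION: mean-difference direction used as a simple stand-in for DWD.
    mean_a, mean_b = column_means(class_a), column_means(class_b)
    return unit([a - b for a, b in zip(mean_a, mean_b)])

def first_orthogonal_pc(rows, direction, iterations=200, seed=0):
    # Center the data, remove the component along `direction`, then find the
    # leading principal direction of the residuals by power iteration.
    center = column_means(rows)
    residuals = []
    for row in rows:
        centered = [x - c for x, c in zip(row, center)]
        score = dot(centered, direction)
        residuals.append([x - score * w for x, w in zip(centered, direction)])
    rng = random.Random(seed)
    v = unit([rng.gauss(0, 1) for _ in direction])
    for _ in range(iterations):
        # Apply R^T R to v without forming the covariance matrix explicitly.
        scores = [dot(r, v) for r in residuals]
        v = unit([sum(s * r[j] for s, r in zip(scores, residuals))
                  for j in range(len(v))])
    return v

# Simulated two-class data (hypothetical, 5 variables, 20 cases per class).
rng = random.Random(1)
red = [[rng.gauss(1.0, 1.0) for _ in range(5)] for _ in range(20)]
blue = [[rng.gauss(-1.0, 1.0) for _ in range(5)] for _ in range(20)]

w = separating_direction(red, blue)
pc = first_orthogonal_pc(red + blue, w)

# Scatterplot coordinates: separating direction on x, orthogonal PC on y.
red_xy = [(dot(x, w), dot(x, pc)) for x in red]
blue_xy = [(dot(x, w), dot(x, pc)) for x in blue]
```

Plotting `red_xy` and `blue_xy` with any plotting tool gives the kind of "direction and ortho PC" scatterplot the slides show; swapping in a real DWD direction only changes `separating_direction`.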

Recall Drug Discovery Data Raw Data – DWD & Ortho PCs Scatterplot Some Blue - Red Separation But Dominated By Few Large Compounds

Recall Drug Discovery Data Binary Data – DWD & Ortho PCs Scatterplot Better Blue - Red Separation And Visualization

Recall Drug Discovery Data Transform’d Non-Binary Data – DWD & OPCA Better Blue - Red Separation ??? Very Useful Visualization

Caution DWD Separation Can Be Deceptive Since DWD is Really Good at Separation Important Concept: Statistical Inference is Essential

Caution Toy 2-Class Example See Structure? Careful, Only PC1-4

Caution Toy 2-Class Example DWD & Ortho PCA Finds Big Separation

Caution Toy 2-Class Example Separation Is Natural Sampling Variation (Will Study in Detail Later)

Caution Main Lesson Again: DWD Separation Can Be Deceptive Since DWD is Really Good at Separation Important Concept: Statistical Inference is Essential III. Confirmatory Analysis

DiProPerm Hypothesis Test
Context: 2-sample means, $H_0: \mu_{+1} = \mu_{-1}$ vs. $H_1: \mu_{+1} \neq \mu_{-1}$ (in High Dimensions)
Approach taken here: Wei et al (2013)
Focus on Visualization via Projection (Thus Test Related to Exploration)

DiProPerm Hypothesis Test
Context: 2-sample means, $H_0: \mu_{+1} = \mu_{-1}$ vs. $H_1: \mu_{+1} \neq \mu_{-1}$
Challenges:
• Distributional Assumptions
• Parameter Estimation
• HDLSS space is slippery

DiProPerm Hypothesis Test
Context: 2-sample means, $H_0: \mu_{+1} = \mu_{-1}$ vs. $H_1: \mu_{+1} \neq \mu_{-1}$
Challenges:
• Distributional Assumptions
• Parameter Estimation
Suggested Approach: Permutation test (A flavor of classical “non-parametrics”)

DiProPerm Hypothesis Test
Suggested Approach:
• Find a DIrection (separating classes)
• PROject the data (reduces to 1 dim)
• PERMute (class labels, to assess significance, with recomputed direction)

DiProPerm Hypothesis Test Toy 2-Class Example Separated DWD Projections Measure Separation of Classes Using: Mean Difference = 6.209

DiProPerm Hypothesis Test Toy 2-Class Example Separated DWD Projections Measure Separation of Classes Using: Mean Difference = 6.209 Record as Vertical Line

DiProPerm Hypothesis Test Toy 2-Class Example Separated DWD Projections Measure Separation of Classes Using: Mean Difference = 6.209 Statistically Significant???

DiProPerm Hypothesis Test Toy 2-Class Example Permuted Class Labels

DiProPerm Hypothesis Test Toy 2-Class Example Permuted Class Labels Recompute DWD & Projections

DiProPerm Hypothesis Test Toy 2-Class Example Measure Class Separation Using Mean Difference = 6.26

DiProPerm Hypothesis Test Toy 2-Class Example Measure Class Separation Using Mean Difference = 6.26 Record as Dot

DiProPerm Hypothesis Test Toy 2-Class Example Generate 2 nd Permutation

DiProPerm Hypothesis Test Toy 2-Class Example Measure Class Separation Using Mean Difference = 6.15

DiProPerm Hypothesis Test Toy 2-Class Example Record as Second Dot

DiProPerm Hypothesis Test Repeat This 1,000 Times To Generate Null Distribution

DiProPerm Hypothesis Test Toy 2-Class Example Generate Null Distribution

DiProPerm Hypothesis Test Toy 2-Class Example Generate Null Distribution Compare With Original Value

DiProPerm Hypothesis Test Toy 2-Class Example Generate Null Distribution Compare With Original Value Take Proportion Larger as P-Value
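As a small worked illustration of "proportion larger", here is a Python sketch. The observed value 6.209 and the first two permuted values (6.26, 6.15) come from the toy example above; the remaining three null values are hypothetical fillers, and a real run would use about 1,000 permutations.

```python
def empirical_p_value(observed, null_stats):
    # Proportion of permutation statistics strictly larger than the observed one.
    larger = sum(1 for s in null_stats if s > observed)
    return larger / len(null_stats)

observed = 6.209
# First two values are from the slides; the last three are HYPOTHETICAL.
null_stats = [6.26, 6.15, 6.31, 6.08, 6.22]

p_value = empirical_p_value(observed, null_stats)
print(p_value)
```

With three of the five null values above 6.209, the proportion is far from small, which matches the qualitative reading that the observed separation sits well inside the permutation distribution.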

DiProPerm Hypothesis Test Toy 2-Class Example Generate Null Distribution Compare With Original Value Not Significant

DiProPerm Hypothesis Test

>> 5.4 above

DiProPerm Hypothesis Test Real Data Example: Autism Caudate Shape (sub-cortical brain structure) Shape summarized by 3-d locations of 1032 corresponding points Autistic vs. Typically Developing (Thanks to Josh Cates)

DiProPerm Hypothesis Test Finds Significant Difference Despite Weak Visual Impression

DiProPerm Hypothesis Test Also Compare: Developmentally Delayed No Significant Difference But Stronger Visual Impression

DiProPerm Hypothesis Test Two Examples Which Is “More Distinct”? Visually Better Separation? Thanks to Katie Hoadley

DiProPerm Hypothesis Test Two Examples Which Is “More Distinct”? Stronger Statistical Significance! (Reason: Differing Sample Sizes)

DiProPerm Hypothesis Test
Choice of Direction:
• Distance Weighted Discrimination (DWD)
• Support Vector Machine (SVM)
• Mean Difference
• Maximal Data Piling
Introduced Later

DiProPerm Hypothesis Test
Choice of 1-d Summary Statistic:
• 2-sample t-stat
• Mean difference
• Median difference
• Area Under ROC Curve
Surprising Comparison Coming Later
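For concreteness, the four summary statistics can be computed on projected scores as in the Python sketch below. The specific forms are assumptions: the t-statistic uses the unequal-variance (Welch) form, and the AUC is the pairwise-comparison estimate with ties counted as one half. The example scores are hypothetical.

```python
import math
import statistics

def mean_difference(a, b):
    return statistics.mean(a) - statistics.mean(b)

def median_difference(a, b):
    return statistics.median(a) - statistics.median(b)

def two_sample_t(a, b):
    # ASSUMPTION: Welch (unequal variance) form; a pooled form is also common.
    standard_error = math.sqrt(statistics.variance(a) / len(a)
                               + statistics.variance(b) / len(b))
    return mean_difference(a, b) / standard_error

def area_under_roc(a, b):
    # Fraction of (a, b) pairs where the a score is larger; ties count half.
    wins = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    return wins / (len(a) * len(b))

# Hypothetical 1-d projected scores for the two classes.
class_plus = [2.0, 3.0, 4.0]
class_minus = [0.0, 1.0, 2.0]

summary = {
    "mean difference": mean_difference(class_plus, class_minus),
    "median difference": median_difference(class_plus, class_minus),
    "t-stat": two_sample_t(class_plus, class_minus),
    "AUC": area_under_roc(class_plus, class_minus),
}
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

Any of these functions could serve as the 1-d summary inside a permutation loop; note that the AUC is bounded by 1, so it saturates once the projected classes stop overlapping, while the mean difference keeps growing.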

Recall Matlab Software Posted Software for OODA

DiProPerm Hypothesis Test Matlab Software: DiProPermSM.m In BatchAdjust Directory

Recall Drug Discovery Data Raw Data – DWD & Ortho PCs Scatterplot Some Blue - Red Separation But Dominated By Few Large Compounds

Recall Drug Discovery Data Binary Data – DWD & Ortho PCs Scatterplot Better Blue - Red Separation And Visualization

Recall Drug Discovery Data Transform’d Non-Binary Data – DWD & OPCA Better Blue - Red Separation ??? Very Useful Visualization

Recall Drug Discovery Data DiProPerm test of Blue vs. Red Full Raw Data Z = 10.4 Reasonable Difference

Recall Drug Discovery Data DiProPerm test of Blue vs. Red Delete var = 0 & -999 Variables Z = 11.6 Slightly Stronger

Recall Drug Discovery Data DiProPerm test of Blue vs. Red Binary Variables Only Z = 14.6 More Than Raw Data

Recall Drug Discovery Data DiProPerm test of Blue vs. Red Non-Binary – Standardized Z = 17.3 Stronger

Recall Drug Discovery Data DiProPerm test of Blue vs. Red Non-Binary – Shifted Log Transform Z = 17.9 Slightly Stronger
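The slides report Z values without defining them. A common convention for permutation tests, assumed here, is to standardize the observed statistic by the mean and standard deviation of the permutation null distribution. The sketch below uses hypothetical numbers, not the drug discovery results.

```python
import statistics

def permutation_z_score(observed, null_stats):
    # ASSUMPTION: Z = (observed - mean of null) / (sample sd of null).
    return (observed - statistics.mean(null_stats)) / statistics.stdev(null_stats)

# Hypothetical values, chosen so the arithmetic is easy to check by hand.
null_stats = [1.0, 2.0, 3.0]
observed = 10.0

z = permutation_z_score(observed, null_stats)
print(z)
```

A Z-score of this kind is useful for comparing strengths of separation across data versions, as in the slides above, because an empirical p-value from 1,000 permutations bottoms out at 0 and cannot rank strongly significant cases.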

HDLSS Asymptotics Modern Mathematical Statistics:
• Based on asymptotic analysis

HDLSS Asymptotics
Personal Observations: HDLSS world is…
• Surprising (many times!) [Think I’ve got it, and then …]
• Mathematically Beautiful (?)
• Practically Relevant

HDLSS Asymptotics: Simple Paradoxes

Ever Wonder Why? o Perceptual System from Ancestors o They Needed to Find Food o Food Exists in 3-d World (We can only perceive 3 dimensions)

HDLSS Asymptotics: Simple Paradoxes

HDLSS Asy’s: Geometrical Represent’n Hall, Marron & Neeman (2005)


HDLSS Asy’s: Geometrical Represent’n

HDLSS Asy’s: Geometrical Represent’n
Simulation View: study “rigidity after rotation”
• Simple 3 point data sets
• In dimensions d = 2, 20, 200,
• Generate hyperplane of dimension 2
• Rotate that to plane of screen
• Rotate within plane, to make “comparable”
• Repeat 10 times, use different colors
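A numerical version of this simulation can be sketched in Python. Instead of rotating each 3-point data set into the plane of the screen, the sketch records the three side lengths of the triangle, which determine its shape up to rotation. Assumptions: entries are independent standard Gaussian, side lengths are divided by sqrt(d) to make dimensions comparable, and the largest dimension (20000) is my own choice, since the list of dimensions on the slide is cut off.

```python
import math
import random

PAIRS = [(0, 1), (0, 2), (1, 2)]

def scaled_triangle_sides(d, rng):
    # Three independent standard Gaussian points in dimension d.
    points = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(3)]
    # Side lengths, divided by sqrt(d) so different dimensions are comparable.
    return [math.dist(points[i], points[j]) / math.sqrt(d) for i, j in PAIRS]

def side_range(d, repeats=10, seed=0):
    # Smallest and largest scaled side length over `repeats` data sets.
    rng = random.Random(seed)
    sides = [s for _ in range(repeats) for s in scaled_triangle_sides(d, rng)]
    return min(sides), max(sides)

# ASSUMPTION: 20000 is added here for illustration; it is not from the slide.
results = {d: side_range(d) for d in (2, 20, 200, 20000)}
for d, (shortest, longest) in results.items():
    print(f"d={d:>5}: scaled side lengths from {shortest:.3f} to {longest:.3f}")
```

As d grows, all scaled side lengths concentrate near sqrt(2), so every simulated triangle becomes nearly equilateral with the same size. That is the "rigidity" the rotated views display: the randomness ends up in the rotation, not in the shape.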

HDLSS Asy’s: Geometrical Represent’n Simulation View: Shows “Rigidity after Rotation”

HDLSS Asy’s: Geometrical Represent’n