Object Oriented Data Analysis, Last Time: Organizational Matters

1 Object Oriented Data Analysis, Last Time: Organizational Matters (http://www.stat-or.unc.edu/webspace/courses/marron/UNCstor891OODA-2007/Stor891-07Home.html); What is OODA?; Visualization by Projection; Object Space & Feature Space; Curves as Data; Data Representation Issues; PCA visualization

2 Data Object Conceptualization Object Space ↔ Feature Space Curves Images Manifolds Shapes Tree Space Trees

3 Functional Data Analysis, Toy EG I

4 Easy way to do these analyses Matlab software (user friendly?) available: http://www.stat.unc.edu/postscript/papers/marron/Matlab7Software/ Download & put in Matlab Path: General Smoothing Look first at: curvdatSM.m scatplotSM.m

5 Easy way to do these analyses Matlab software (user friendly?) available: http://www.stat.unc.edu/postscript/papers/marron/Matlab7Software/ Next time: Spend some time going through these, as many students seem to want to use them

6 Time Series of Curves Again a “Set of Curves” But now Time Order is Important! An approach: Use color to code for time Start End

7 Time Series Toy E.g. Explore Question: “Is Horizontal Motion Linear Variation?” Example: Set of time shifted Gaussian densities View: Code time with colors as above

8 T. S. Toy E.g., Raw Data

9 T. S. Toy E.g., PCA View PCA gives “Modes of Variation” But there are Many… Intuitively Useful??? Like “harmonics”? Isn’t there only 1 mode of variation? Answer comes in 2-d scatterplots

10 T. S. Toy E.g., PCA Scatterplot

11 Where is the Point Cloud? Lies along a 1-d curve in feature space So actually have 1-d mode of variation But a non-linear mode of variation Poorly captured by PCA (linear method) Will study more later
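The toy example above can be reproduced numerically. The following is a minimal sketch in Python with NumPy (not the Matlab software mentioned earlier); the grid, the shift range and the number of curves are assumptions chosen for illustration. It builds time-shifted Gaussian densities, runs PCA on them, and reports how the variance spreads over several components even though only one parameter (the shift) drives the data.

```python
import numpy as np

def shifted_gaussian_curves(n_curves=30, n_grid=201, sigma=1.0):
    """Rows are curves: Gaussian densities whose mean slides from -3 to 3."""
    grid = np.linspace(-8.0, 8.0, n_grid)
    shifts = np.linspace(-3.0, 3.0, n_curves)
    z = (grid[None, :] - shifts[:, None]) / sigma
    return np.exp(-0.5 * z**2) / (sigma * np.sqrt(2.0 * np.pi))

def pca(data):
    """PCA of row vectors via the SVD of the mean-centered data matrix."""
    centered = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u * s                      # column k holds the PC-(k+1) scores
    var_explained = s**2 / np.sum(s**2)
    return scores, vt, var_explained

curves = shifted_gaussian_curves()
scores, directions, var_explained = pca(curves)

# Number of linear components needed to capture 99% of the variation.
n_for_99 = int(np.searchsorted(np.cumsum(var_explained), 0.99) + 1)
print(np.round(var_explained[:5], 3), n_for_99)
```

Plotting `scores[:, 0]` against `scores[:, 1]`, colored by row index, gives the kind of scatterplot in which the point cloud traces out a single curve.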

12 Chemo-metric Time Series Mass Spectrometry Measurements On an Aging Substance, called “Estane” Made over Logarithmic Time Grid, n = 60 Each is a Spectrum What about Time Evolution? Approach: PCA & Time Coloring

13 Chemo-metric Time Series Joint Work w/ E. Kober & J. Wendelberger Los Alamos National Lab Four Experimental Conditions: 1. Control 2. Aged 59 days in Dry Air 3. Aged 27 days in Humid Air 4. Aged 59 days in Humid Air

14 Chemo-metric Time Series, HA 27

15 Raw Data: All 60 spectra essentially the same “Scale” of mean is much bigger than variation about mean Hard to see structure of all 1600 freq’s Centered Data: Now can see different spectra Since mean subtracted off Note much smaller vertical axis
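The effect of centering can be illustrated with a small sketch. The data below are a synthetic stand-in (an assumption, not the Estane spectra): one large common spectrum plus a small time-dependent drift, with n = 60 and 1600 frequencies as on the slide. Subtracting the mean spectrum shrinks the vertical range by orders of magnitude, which is why the variation only becomes visible after centering.

```python
import numpy as np

rng = np.random.default_rng(0)
n_spectra, n_freq = 60, 1600
freq = np.linspace(0.0, 1.0, n_freq)

# Synthetic stand-in: a large common spectrum plus a small drift over time.
common = 100.0 * np.exp(-0.5 * ((freq - 0.5) / 0.1) ** 2)
time = np.linspace(0.0, 1.0, n_spectra)
drift = np.outer(time, np.sin(2.0 * np.pi * freq))
spectra = common + drift + 0.01 * rng.standard_normal((n_spectra, n_freq))

# Centering: subtract the mean spectrum (the average over the 60 curves).
mean_spectrum = spectra.mean(axis=0)
centered = spectra - mean_spectrum

raw_range = spectra.max() - spectra.min()
centered_range = centered.max() - centered.min()
print(raw_range, centered_range)
```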

16 Chemo-metric Time Series, HA 27

17 Data zoomed to “important” freq’s: Raw Data: Now see slight differences Smoother “natural looking” spectra Centered Data: Differences in spectra more clear Maybe now have “real structure” Scale is important

18 Chemo-metric Time Series, HA 27

19 Use of Time Order Coloring: Raw Data: Can see a little ordering, not much Centered Data: Clear time ordering Shifting peaks? (compare to Raw) PC1: Almost everything? PC1 Residuals: Data nearly linear (same scale import’nt)

20 Chemo-metric Time Series, Control

21 PCA View Clear systematic structure Time ordering very important Reminiscent of Toy Example A clear 1-d curve in Feature Space Physical Explanation?

22 Toy Data Explanations Simple Chemical Reaction Model: Subst. 1 transforms into Subst. 2 Note: linear path in Feature Space

23 Toy Data Explanations Richer Chemical Reaction Model: Subst. 1 → Subst. 2 → Subst. 3 Curved path in Feat. Sp. 2 Reactions ⇒ Curve lies in 2-dim’al subsp.

24 Toy Data Explanations Another Chemical Reaction Model: Subst. 1 → Subst. 2 & Subst. 5 → Subst. 6 Curved path in Feat. Sp. 2 Reactions ⇒ Curve lies in 2-dim’al subsp.

25 Toy Data Explanations More Complex Chemical Reaction Model: 1 → 2 → 3 → 4 Curved path in Feat. Sp. (lives in 3-d) 3 Reactions ⇒ Curve lies in 3-dim’al subsp.

26 Toy Data Explanations Even More Complex Chemical Reaction Model: 1 → 2 → 3 → 4 → 5 Curved path in Feat. Sp. (lives in 4-d) 4 Reactions ⇒ Curve lies in 4-dim’al subsp.
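The link between the number of reactions and the dimension of the subspace holding the curve can be checked with a small simulation. This is a sketch under stated assumptions, not the model used for the toy slides: first-order kinetics with distinct rates, each observed spectrum a linear mixture of randomly generated "pure" spectra, and no noise. The helper names `chain_concentrations` and `effective_rank` are hypothetical.

```python
import numpy as np

def chain_concentrations(times, rates):
    """Concentrations for a first-order chain 1 -> 2 -> ... -> m+1.

    Assumes distinct positive rates and only substance 1 present at t = 0.
    Returns an array of shape (len(times), len(rates) + 1).
    """
    n_subst = len(rates) + 1
    K = np.zeros((n_subst, n_subst))
    for i, rate in enumerate(rates):
        K[i, i] -= rate        # substance i is consumed ...
        K[i + 1, i] += rate    # ... and turns into substance i + 1
    # Solve dc/dt = K c through the eigen-decomposition of K.
    eigvals, eigvecs = np.linalg.eig(K)
    start = np.zeros(n_subst)
    start[0] = 1.0
    coeffs = np.linalg.solve(eigvecs, start)
    conc = (np.exp(np.outer(times, eigvals)) * coeffs) @ eigvecs.T
    return conc.real

def effective_rank(data, rel_tol=1e-8):
    """Number of non-negligible singular values of the mean-centered data."""
    centered = data - data.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(0)
times = np.logspace(-2, 2, 60)          # logarithmic time grid, n = 60
all_rates = [3.0, 1.0, 0.3, 0.1]        # illustrative reaction rates

ranks = []
for n_reactions in range(1, 5):
    conc = chain_concentrations(times, all_rates[:n_reactions])
    pure_spectra = rng.random((n_reactions + 1, 200))   # made-up pure spectra
    spectra = conc @ pure_spectra                        # observed mixtures
    ranks.append(effective_rank(spectra))
print(ranks)
```

Because the concentrations always sum to one, the centered mixtures can vary in at most as many directions as there are reactions, which is the dimension count stated on the slides. With noise added, deciding how many singular values are "real" becomes the harder question.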

27 Chemo-metric Time Series, Control

28 Suggestions from Toy Examples: Clearly 3 reactions under way Maybe a 4th??? Hard to distinguish from noise? Interesting statistical open problem!

29 Chemo-metric Time Series What about the other experiments? Recall: 1. Control 2. Aged 59 days in Dry Air 3. Aged 27 days in Humid Air 4. Aged 59 days in Humid Air Above results were “cherry picked” to best make the points What about the other cases???

30 Scatterplot Matrix, Control Above E.g., maybe ~4d curve ⇒ ~4 reactions

31 Scatterplot Matrix, Da59 PC2 is “bleeding of CO2”, discussed below

32 Scatterplot Matrix, Ha27 Only “3-d + noise”? ⇒ Only 3 reactions

33 Scatterplot Matrix, Ha59 Harder to judge???

34 Object Space View, Control Terrible discretization effect, despite ~4d …

35 Object Space View, Da59 OK, except strange at beginning (CO2 …)

36 Object Space View, Ha27 Strong structure in PC1 Resid (d < 2)

37 Object Space View, Ha59 Lots at beginning, OK since “oldest”

38 Problem with Da59 What about strange behavior for DA59? Recall: PC2 showed “really different behavior at start” Chemists’ comments: Ignore this, should have started measuring later…

39 Problem with Da59 But still fun to look at broader spectra

40 Chemo-metric T. S. Joint View Throw them all together as big population Take Point Cloud View

41 Chemo-metric T. S. Joint View

42 Throw them all together as big population Take Point Cloud View Note 4d space of interest, driven by: 4 clusters (3d) PC1 of chemical reaction (1-d) But these don’t appear as the 4 PCs Chem. PC1 “spread over PC2,3,4” Essentially a “rotation of interesting dir’ns” How to “unrotate”???

43 Chemo-metric T. S. Joint View Interesting Variation: Remove cluster means Allows clear comparison of within curve variation

44 Chemo-metric T. S. Joint View (- mean)

45 Chemo-metric T. S. Joint View Interesting Variation: Remove cluster means Allows clear comparison of within curve variation: PC1 versus others are quite revealing (note different “rotations”) Others don’t show so much
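Removing cluster means before the joint PCA can be sketched as follows. The data are a synthetic stand-in (an assumption): four well-separated groups that share one direction of within-group time variation. The helper names are hypothetical.

```python
import numpy as np

def remove_group_means(data, labels):
    """Subtract from every row the mean of the group it belongs to."""
    labels = np.asarray(labels)
    centered = np.array(data, dtype=float)   # work on a copy
    for group in np.unique(labels):
        members = labels == group
        centered[members] -= centered[members].mean(axis=0)
    return centered

def variance_fractions(data):
    """Fraction of total variation carried by each principal component."""
    centered = data - data.mean(axis=0)
    s = np.linalg.svd(centered, compute_uv=False)
    return s**2 / np.sum(s**2)

rng = np.random.default_rng(1)
n_per_group, dim = 60, 50
labels = np.repeat(np.arange(4), n_per_group)

# Four far-apart group means plus one shared within-group time trend.
group_means = 10.0 * rng.standard_normal((4, dim))
time = np.linspace(-1.0, 1.0, n_per_group)
shared_direction = np.zeros(dim)
shared_direction[0] = 1.0
within = np.tile(np.outer(time, shared_direction), (4, 1))
noise = 0.05 * rng.standard_normal((4 * n_per_group, dim))
data = group_means[labels] + within + noise

raw_fracs = variance_fractions(data)
within_fracs = variance_fractions(remove_group_means(data, labels))
print(np.round(raw_fracs[:4], 3), np.round(within_fracs[:4], 3))
```

In the raw joint analysis the first three components are taken up by the separation between the four clusters. After the group means are removed, the leading component is free to follow the within-curve variation.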

46 Demography Data Joint Work with: Andres Alonso Univ. Carlos III, Madrid Mortality, as a function of age “Chance of dying”, for Males, in Spain of each 1-year age group Curves are years 1908-2002 PCA of the family of curves

47 Demography Data PCA of the family of curves for Males Babies & elderly “most mortal” (Raw) All getting better over time (Raw & PC1) Except 1918 - Influenza Pandemic (see Color Scale) Middle age most mortal (PC2): 1918; late 1930s (Spanish Civil War); 1980-1994 (then better), auto wrecks Decade Rounding (several places)

48 Demography Data PCA for Females in Spain Most aspects similar (see Color Scale) No War Changes: steady improvement until 70s (PC2), when auto accidents kicked in

49 Demography Data PCA for Males in Switzerland Most aspects similar No decade rounding (better records) 1918 Flu - Different Color (PC2) (see Color Scale) No War Changes: steady improvement until 70s (PC2), when auto accidents kicked in

50 Demography Data Dual PCA Idea: Rows and Columns trade places Terminology: from optimization Insights come from studying “primal” & “dual” problems

51 Primal / Dual PCA Consider “Data Matrix”

52 Primal / Dual PCA Consider “Data Matrix” Primal Analysis: Columns are data vectors

53 Primal / Dual PCA Consider “Data Matrix” Dual Analysis: Rows are data vectors

54 Demography Data Dual PCA Idea: Rows and Columns trade places Demographic Primal View: Curves are Years, Coord’s are Ages Demographic Dual View: Curves are Ages, Coord’s are Years Dual PCA View, Spanish Males
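The primal and dual analyses differ only in which axis of the data matrix is treated as the collection of data vectors, and therefore in which mean is subtracted. A minimal sketch, assuming a small random ages-by-years matrix as a stand-in for the mortality data (the function name `pca_columns_as_data` is hypothetical):

```python
import numpy as np

def pca_columns_as_data(matrix):
    """PCA treating each column as a data vector (mean column subtracted)."""
    centered = matrix - matrix.mean(axis=1, keepdims=True)
    directions, sing_vals, vt = np.linalg.svd(centered, full_matrices=False)
    scores = sing_vals[:, None] * vt    # row k: PC-(k+1) score of every column
    return directions, sing_vals, scores

rng = np.random.default_rng(2)
n_ages, n_years = 5, 8
mortality = rng.random((n_ages, n_years))   # rows: ages, columns: years

# Primal: curves are years (columns), coordinates are ages.
primal_dirs, primal_sv, primal_scores = pca_columns_as_data(mortality)

# Dual: rows and columns trade places, so curves are ages.
dual_dirs, dual_sv, dual_scores = pca_columns_as_data(mortality.T)

print(np.round(primal_sv, 3))
print(np.round(dual_sv, 3))
```

Without the mean subtraction the two analyses would share the same singular values, since a matrix and its transpose have the same SVD up to swapping the factors. The different centering is what makes the primal and dual views genuinely different.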

55 Demography Data Dual PCA View, Spanish Males Old people have const. mortality (raw) But improvement for rest (raw) Bad for 1918 (flu) & Spanish Civil War, but generally improving (mean) Improves for ages 1-6, then worse (PC1) Big Improvement for young (PC2) (Age Color Key)

56 Primal / Dual PCA Reference: Gabriel, K. R. (1971) The biplot graphic display of matrices with application to principal component analysis, Biometrika, 58, 453-467. Will study more later “Centering” is a critical issue

57 Yeast Cell Cycle Data “Gene Expression” - Micro-array data Data (after major preprocessing): Expression “level” of: thousands of genes (d ~ 1,000s) but only dozens of “cases” (n ~ 10s) Interesting statistical issue: High Dimension Low Sample Size data (HDLSS)

58 Yeast Cell Cycle Data Data from: Spellman, P. T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D. and Futcher, B. (1998), “Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization”, Molecular Biology of the Cell, 9, 3273-3297.

59 Yeast Cell Cycle Data Analysis here is from: Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808

60 Yeast Cell Cycle Data Lab experiment: Chemically “synchronize cell cycles” of yeast cells Do cDNA micro-arrays over time Used 18 time points, over “about 2 cell cycles” Studied 4,489 genes (whole genome) Time series view of data: 4,489 time series of length 18 Functional Data View: 4,489 “curves”

61 Yeast Cell Cycle Data, FDA View Central question: Which genes are “periodic” over 2 cell cycles?
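One simple way to make "periodic over 2 cell cycles" concrete is to project each gene's curve onto a sine and cosine that complete two cycles over the 18 time points, and to measure the fraction of the curve's variation this captures. This is only an illustrative sketch: the scoring rule, the exact period, and the function name `periodicity_score` are assumptions, not necessarily the method of the cited analysis.

```python
import numpy as np

def periodicity_score(series, n_cycles=2.0):
    """Fraction of each row's centered variation captured by one sinusoid.

    series: array (n_genes, n_times), equally spaced time points assumed.
    """
    n_genes, n_times = series.shape
    t = np.arange(n_times) / n_times
    basis = np.column_stack([
        np.cos(2.0 * np.pi * n_cycles * t),
        np.sin(2.0 * np.pi * n_cycles * t),
    ])
    centered = series - series.mean(axis=1, keepdims=True)
    # Least-squares fit of every gene's curve on the two basis functions.
    coef, *_ = np.linalg.lstsq(basis, centered.T, rcond=None)
    fitted = (basis @ coef).T
    total = np.sum(centered**2, axis=1)
    return np.sum(fitted**2, axis=1) / np.where(total > 0, total, 1.0)

n_times = 18
t = np.arange(n_times) / n_times
periodic_gene = 3.0 + np.sin(2.0 * np.pi * 2.0 * t + 0.7)   # two full cycles
trend_gene = np.linspace(0.0, 1.0, n_times)                 # steady drift
scores = periodicity_score(np.vstack([periodic_gene, trend_gene]))
print(np.round(scores, 3))
```

Genes could then be ranked by this score, with the threshold between "periodic" and "not periodic" left as a separate choice.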

