Download presentation
Presentation is loading. Please wait.
Published byJade Lindsey Modified over 8 years ago
1
1 Multivariate Statistical Analysis Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia
2
2 What Is Multivariate Analysis? Statistical methodology to analyze data with measurements on many variables Process controllable factors uncontrollable factors input output
3
3 Why to Learn Multivariate Analysis? Explanation of a social or physical phenomenon must be tested by gathering and analyzing data Complexities of most phenomena require an investigator to collect observations on many different variables
4
4 Application Examples Is one product better than the other? Which factor is the most important to determine the performance of a system? How to classify the results into clusters? What are the relationships between variables?
5
5 Course Outline Introduction Matrix Algebra and Random Vectors Sample Geometry and Random Samples Multivariate Normal Distribution Inference about a Mean Vector Comparison of Several Multivariate Means Multivariate Linear Regression Models
6
6 Course Outline Principal Components Factor Analysis and Inference for Structured Covariance Matrices Canonical Correlation Analysis Discrimination and Classification Clustering, Distance Methods, and Ordination Multidimensional Scaling* Structural Equation Modeling*
7
7 Text Book and Website R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 5th ed., Prentice Hall, 2002. ( 雙葉 ) http://cc.ee.ntu.edu.tw/~skjeng/ MultivariateAnalysis2006.htm MultivariateAnalysis2006.htm
8
8 References J. F. Hair, Jr., B. Black, B. Babin, R. E. Anderson, and R. L. Tatham, Multivariate Data Analysis, 6th ed., Prentice Hall, 2006. ( 華泰 ) D. C. Montgomery, Design and Analysis of Experiments, 6th ed., John Wiley, 2005. ( 歐亞 ) D. Salsberg 著, 葉偉文譯, 統計, 改變了世 界, 天下遠見, 2001.
9
9 References 張碧波,推理統計學,三民, 1976. 張輝煌編譯,實驗設計與變異分析,建興, 1986.
10
10 Time Management Emergency Importance III IIIIV
11
11 Some Important Laws First things first 80 – 20 Law Fast prototyping and evolution
12
12 Major Uses of Multivariate Analysis Data reduction or structural simplification Sorting and grouping Investigation of the dependence among variables Prediction Hypothesis construction and testing
13
13 Array of Data
14
14 Descriptive Statistics Summary numbers to assess the information contained in data Basic descriptive statistics –Sample mean –Sample variance –Sample standard deviation –Sample covariance –Sample correlation coefficient
15
15 Sample Mean and Sample Variance
16
16 Sample Covariance and Sample Correlation Coefficient
17
17 Standardized Values (or Standardized Scores) Centered at zero Unit standard deviation Sample correlation coefficient can be regarded as a sample covariance of two standardized variables
18
18 Properties of Sample Correlation Coefficient Value is between -1 and 1 Magnitude measure the strength of the linear association Sign indicates the direction of the association Value remains unchanged if all x ji ’ s and x jk ’ s are changed to y ji = a x ji + b and y jk = c x jk + d, respectively, provided that the constants a and c have the same sign
19
19 Arrays of Basic Descriptive Statistics
20
20 Example Four receipts from a university bookstore Variable 1: dollar sales Variable 2: number of books
21
21 Arrays of Basic Descriptive Statistics
22
22 Using SAS Create New Project –Name Project Insert Data –New: Name Data1 Change column name –Right button, select Properties … Enter data in the data grid Select Data1 under Project Analysis Descriptive –Summary statistics –Correlations
23
23 Summary Statistics Save –Personal Enterprise Guide Sample Data Data1.sas7bdat Columns Variables to assign Analysis variables Statistics Mean
24
24 Report on Means
25
25 Correlations Delete Summary statistics node Save –Personal Enterprise Guide Sample Data Data1.sas7bdat Columns Variables to assign Correlation variables Correlations Pearson Covariances Show Pearson correlations in results Divisor for variances (Number of rows)
26
26 Correlations Results Show results (uncheck) show statistics for each variable (uncheck) show significance probabilities associated with correlations
27
27 Report on Correlations
28
28 Scatter Plot and Marginal Dot Diagrams
29
29 Scatter Plot and Marginal Dot Diagrams for Rearranged Data
30
30 Effect of Unusual Observations
31
31 Effect of Unusual Observations
32
32 Paper Quality Measurements
33
33 Lizard Size Data *SVL: snoutvent length; HLS: hind limb span
34
34 3D Scatter Plots of Lizard Data
35
35 Female Bear Data and Growth Curves
36
36 Utility Data as Stars
37
37 Chernoff Faces over Time
38
38 Euclidean Distance Each coordinate contributes equally to the distance
39
39 Statistical Distance Weight coordinates subject to a great deal of variability less heavily than those that are not highly variable
40
40 Statistical Distance for Uncorrelated Data
41
41 Ellipse of Constant Statistical Distance for Uncorrelated Data x1x1 x2x2 0
42
42 Scattered Plot for Correlated Measurements
43
43 Statistical Distance under Rotated Coordinate System
44
44 General Statistical Distance
45
45 Necessity of Statistical Distance
46
46 Necessary Conditions for Statistical Distance Definitions
47
47 Reading Assignments Text book –pp. 50-60 –pp. 84-97
48
48 Outliers Observations with a unique combination of characteristics identifiable as distinctly different from the other observations Impact –Limiting the generalizability of any type of analysis –Must be viewed in light of how representative it is of the population to be retained or deleted
49
49 Sources of Outliers Procedure error Extraordinary event –e,g., hurricane for daily rainfall analysis Extraordinary observations –Researcher has no explanation Unique in their combinations of values across the variables –Falls within ordinary range of values of variables –Retain it unless proved invalid
50
50 Rules of Thumb for Univariate Outlier Detection Small samples (80 or fewer observations) –Cases with standard scores of 2.5 or greater Larger sample size –Threshold increases up to 4 Standard score not used –Cases falling outside the range of 2.5 versus 4 standard deviations, depending on the sample size
51
51 Rules of Thumb for Bivariate and Multivariate Outlier Detection Bivariate –Use scatterplots with confidence intervals at a specified alpha level Multivariate –Threshold levels for the D 2 /df measure should be conservative (0.005 or 0.001) resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples –D 2 : Mahalanobis measure –df : degrees of freedom
52
52 Outlier Description and Profiling Generate profiles of each outlier observation Identify the variable(s) responsible for its being an outlier Discriminant analysis or multiple regression can be applied to identify the differences between outliers and other observations
53
53 Examples of Outliers
54
54 Example of Bivariate Outliers
55
55 Missing Data Valid values on one or more variables are not available for analysis Affects the generalizability of the results Remedy is applied –to maintain as close as possible the original distribution of values
56
56 Missing Data Process Systematic event that leads to missing values –Event external to the respondent (data entry errors or data collection problem) –Action on the part of the respondent (such as refusal to answer) Causes some patterns and relationships underlying the missing data
57
57 Impact of Missing Dara Practical impact –Reduction of the sample size available for analysis (adequate inadequate) Substantive perspective –Statistical results based on data with a non-random data process could be biased –e.g., individuals did not provide their income tended to be almost exclusively in the high income bracket
58
58 Hypothetical Example of Missing Data
59
59 Practical Considerations Complete data required –Only 5 cases are usable (too few) A possible remedy: Eliminate V 3 –12 cases have complete data –Eliminate cases 3, 13, 15 –Total number of missing data is reduced to 7.4% for all values
60
60 Substantive Impact 5 still with missing data, all occur in V 4 These cases compared with those with valid V 4 data –The 5 cases with missing V 4 data have the five lowest scores on V 2 –Affects any analysis in which V 2 and V 4 are both included –e.g, mean for V 2 = 8.4 if cases with missing V 4 data are excluded, =7.8 if included
61
61 Dealing with Missing Data Determine the type of missing data Determine the extent of missing data Diagnose the randomness of the missing data Select the imputation method
62
62 Ignorable Missing Data Expected and part of the research design The missing data process is operating at random Specific remedies are not needed
63
63 Examples of Ignorable Missing Data Taking a sample of the population rather than gathering data from the entire population Missing data due to the design of the data collection instrument –e.g., respondents skip sections of questions that are not applicable
64
64 Missing Data Process Known to the Researchers Can be identified due to procedural factors –Data entry errors –Disclosure restrictions –Failure to complete the entire questionnaire –Morbidity of the respondents Little control over the process Some remedies may be applicable
65
65 Missing Data Process Unknown to the Researchers Most often are related directly to the respondent Examples –Refusal to respond to certain questions –Respondents have no opinion or insufficient knowledge to answer the question Should anticipated and minimized in the research design and data collection stages Some remedies may be applicable
66
66 Assessing the Extent and Patterns of Missing Data Tabulate –Percentage of variables with missing data for each case –Number of cases with missing data for each variable Look for non-random patterns in the data Determine the number of cases without missing data on any variables –sample size available for analysis if remedies are not applied)
67
67 Rules of Thumb to Ignore Missing Data Missing data under 10% for an individual case can generally be ignored, except when the missing data occurs in a specific non-random fashion The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) to the missing data
68
68 Rules of Thumb for Deletions Variables with as little as 15% missing data are candidates for deletion –Higher levels of missing data (20%~30%) can often be remedied Be sure the overall decrease in missing data is large enough to justify deleting an individual variable or case
69
69 Rules of Thumb for Deletions Cases with missing data for dependent variables typically are deleted When deleting a variable, ensure that alternative variables, hopefully highly correlated, are available to represent the intent of the original variable
70
70 Levels of Randomness Missing at Random (MAR) –e.g., missing data of gender are random for both male and female, but those of household income occur at a higher frequency for males than females Missing Completely at Random (MCAR) –e.g., missing data for household income were randomly missing in equal proportions for both male and female
71
71 Modeling approach for MAR Process Involves maximum likelihood estimation techniques –e.g., EM approach
72
72 Imputation Using Only Valid Data Complete Case Approach –Include only those observations with complete data Using All-Available Data –Use only valid data –Imputes the distribution characteristics (e.g., means or standard deviation) or relationship (e.g., correlation) from every valid value
73
73 Imputation Using Known Replacement Values Hot deck imputation –Use the value from another observation in the sample that is deemed similar Cold deck imputation –Derive the replacement value from an external source (e.g., prior studies, other samples, etc.) Case substitution –Choose another nonsampled observation
74
74 Imputation by Calculating Replacement Values Mean Substitution –Use the mean value of that variable calculated from all valid responses Regression imputation –Predict the missing values of a variable based on its relationship to other variables in the data set
75
75 Summary of Imputation Using Only Valid Data
76
76 Summary of Imputation Using Known Replacement Values
77
77 Summary of Imputation by Calculating Replacement Values
78
78 Summary of Model-Based Methods
79
79 Rule of Thumbs for Imputation of Missing Data Under 10% –Any imputation method 10% to 20% –All-available, hot deck case substitution, regression for MCAR –Model-based for MAR Over 20% –Regression for MCAR –Model-based for MAR
80
80 Summary Statistics of Missing Data for Original Sample
81
81 Comparison of Four Imputation Methods
82
82 Comparison of Four Imputation Methods
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.