1 Multivariate Statistical Analysis Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking.

1 Multivariate Statistical Analysis Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking and Multimedia

2 What Is Multivariate Analysis? Statistical methodology to analyze data with measurements on many variables Process controllable factors uncontrollable factors input output

3 Why to Learn Multivariate Analysis? Explanation of a social or physical phenomenon must be tested by gathering and analyzing data Complexities of most phenomena require an investigator to collect observations on many different variables

4 Application Examples Is one product better than the other? Which factor is the most important to determine the performance of a system? How to classify the results into clusters? What are the relationships between variables?

5 Course Outline Introduction Matrix Algebra and Random Vectors Sample Geometry and Random Samples Multivariate Normal Distribution Inference about a Mean Vector Comparison of Several Multivariate Means Multivariate Linear Regression Models

6 Course Outline Principal Components Factor Analysis and Inference for Structured Covariance Matrices Canonical Correlation Analysis Discrimination and Classification Clustering, Distance Methods, and Ordination Multidimensional Scaling* Structural Equation Modeling*

7 Text Book and Website R. A. Johnson and D. W. Wichern, Applied Multivariate Statistical Analysis, 5th ed., Prentice Hall, 2002. ( 雙葉 ) http://cc.ee.ntu.edu.tw/~skjeng/ MultivariateAnalysis2006.htm MultivariateAnalysis2006.htm

8 References J. F. Hair, Jr., B. Black, B. Babin, R. E. Anderson, and R. L. Tatham, Multivariate Data Analysis, 6th ed., Prentice Hall, 2006. ( 華泰 ) D. C. Montgomery, Design and Analysis of Experiments, 6th ed., John Wiley, 2005. ( 歐亞 ) D. Salsberg 著, 葉偉文譯, 統計, 改變了世界, 天下遠見, 2001.

9 References 張碧波，推理統計學，三民， 1976. 張輝煌編譯，實驗設計與變異分析，建興， 1986.

10 Time Management Emergency Importance III IIIIV

11 Some Important Laws First things first 80 – 20 Law Fast prototyping and evolution

12 Major Uses of Multivariate Analysis Data reduction or structural simplification Sorting and grouping Investigation of the dependence among variables Prediction Hypothesis construction and testing

13 Array of Data

14 Descriptive Statistics Summary numbers to assess the information contained in data Basic descriptive statistics –Sample mean –Sample variance –Sample standard deviation –Sample covariance –Sample correlation coefficient

15 Sample Mean and Sample Variance

16 Sample Covariance and Sample Correlation Coefficient

17 Standardized Values (or Standardized Scores) Centered at zero Unit standard deviation Sample correlation coefficient can be regarded as a sample covariance of two standardized variables

18 Properties of Sample Correlation Coefficient Value is between -1 and 1 Magnitude measure the strength of the linear association Sign indicates the direction of the association Value remains unchanged if all x ji ’ s and x jk ’ s are changed to y ji = a x ji + b and y jk = c x jk + d, respectively, provided that the constants a and c have the same sign

19 Arrays of Basic Descriptive Statistics

20 Example Four receipts from a university bookstore Variable 1: dollar sales Variable 2: number of books

21 Arrays of Basic Descriptive Statistics

22 Using SAS Create New Project –Name  Project Insert Data –New: Name  Data1 Change column name –Right button, select Properties … Enter data in the data grid Select Data1 under Project Analysis  Descriptive –Summary statistics –Correlations

23 Summary Statistics Save –Personal  Enterprise Guide Sample  Data  Data1.sas7bdat Columns  Variables to assign  Analysis variables Statistics  Mean

24 Report on Means

25 Correlations Delete Summary statistics node Save –Personal  Enterprise Guide Sample  Data  Data1.sas7bdat Columns  Variables to assign  Correlation variables Correlations  Pearson  Covariances  Show Pearson correlations in results  Divisor for variances (Number of rows)

26 Correlations Results  Show results  (uncheck) show statistics for each variable  (uncheck) show significance probabilities associated with correlations

27 Report on Correlations

28 Scatter Plot and Marginal Dot Diagrams

29 Scatter Plot and Marginal Dot Diagrams for Rearranged Data

30 Effect of Unusual Observations

31 Effect of Unusual Observations

32 Paper Quality Measurements

33 Lizard Size Data *SVL: snoutvent length; HLS: hind limb span

34 3D Scatter Plots of Lizard Data

35 Female Bear Data and Growth Curves

36 Utility Data as Stars

37 Chernoff Faces over Time

38 Euclidean Distance Each coordinate contributes equally to the distance

39 Statistical Distance Weight coordinates subject to a great deal of variability less heavily than those that are not highly variable

40 Statistical Distance for Uncorrelated Data

41 Ellipse of Constant Statistical Distance for Uncorrelated Data x1x1 x2x2 0

42 Scattered Plot for Correlated Measurements

43 Statistical Distance under Rotated Coordinate System

44 General Statistical Distance

45 Necessity of Statistical Distance

46 Necessary Conditions for Statistical Distance Definitions

47 Reading Assignments Text book –pp. 50-60 –pp. 84-97

48 Outliers Observations with a unique combination of characteristics identifiable as distinctly different from the other observations Impact –Limiting the generalizability of any type of analysis –Must be viewed in light of how representative it is of the population to be retained or deleted

49 Sources of Outliers Procedure error Extraordinary event –e,g., hurricane for daily rainfall analysis Extraordinary observations –Researcher has no explanation Unique in their combinations of values across the variables –Falls within ordinary range of values of variables –Retain it unless proved invalid

50 Rules of Thumb for Univariate Outlier Detection Small samples (80 or fewer observations) –Cases with standard scores of 2.5 or greater Larger sample size –Threshold increases up to 4 Standard score not used –Cases falling outside the range of 2.5 versus 4 standard deviations, depending on the sample size

51 Rules of Thumb for Bivariate and Multivariate Outlier Detection Bivariate –Use scatterplots with confidence intervals at a specified alpha level Multivariate –Threshold levels for the D 2 /df measure should be conservative (0.005 or 0.001) resulting in values of 2.5 (small samples) versus 3 or 4 in larger samples –D 2 : Mahalanobis measure –df : degrees of freedom

52 Outlier Description and Profiling Generate profiles of each outlier observation Identify the variable(s) responsible for its being an outlier Discriminant analysis or multiple regression can be applied to identify the differences between outliers and other observations

53 Examples of Outliers

54 Example of Bivariate Outliers

55 Missing Data Valid values on one or more variables are not available for analysis Affects the generalizability of the results Remedy is applied –to maintain as close as possible the original distribution of values

56 Missing Data Process Systematic event that leads to missing values –Event external to the respondent (data entry errors or data collection problem) –Action on the part of the respondent (such as refusal to answer) Causes some patterns and relationships underlying the missing data

57 Impact of Missing Dara Practical impact –Reduction of the sample size available for analysis (adequate  inadequate) Substantive perspective –Statistical results based on data with a non-random data process could be biased –e.g., individuals did not provide their income tended to be almost exclusively in the high income bracket

58 Hypothetical Example of Missing Data

59 Practical Considerations Complete data required –Only 5 cases are usable (too few) A possible remedy: Eliminate V 3 –12 cases have complete data –Eliminate cases 3, 13, 15 –Total number of missing data is reduced to 7.4% for all values

60 Substantive Impact 5 still with missing data, all occur in V 4 These cases compared with those with valid V 4 data –The 5 cases with missing V 4 data have the five lowest scores on V 2 –Affects any analysis in which V 2 and V 4 are both included –e.g, mean for V 2 = 8.4 if cases with missing V 4 data are excluded, =7.8 if included

61 Dealing with Missing Data Determine the type of missing data Determine the extent of missing data Diagnose the randomness of the missing data Select the imputation method

62 Ignorable Missing Data Expected and part of the research design The missing data process is operating at random Specific remedies are not needed

63 Examples of Ignorable Missing Data Taking a sample of the population rather than gathering data from the entire population Missing data due to the design of the data collection instrument –e.g., respondents skip sections of questions that are not applicable

64 Missing Data Process Known to the Researchers Can be identified due to procedural factors –Data entry errors –Disclosure restrictions –Failure to complete the entire questionnaire –Morbidity of the respondents Little control over the process Some remedies may be applicable

65 Missing Data Process Unknown to the Researchers Most often are related directly to the respondent Examples –Refusal to respond to certain questions –Respondents have no opinion or insufficient knowledge to answer the question Should anticipated and minimized in the research design and data collection stages Some remedies may be applicable

66 Assessing the Extent and Patterns of Missing Data Tabulate –Percentage of variables with missing data for each case –Number of cases with missing data for each variable Look for non-random patterns in the data Determine the number of cases without missing data on any variables –sample size available for analysis if remedies are not applied)

67 Rules of Thumb to Ignore Missing Data Missing data under 10% for an individual case can generally be ignored, except when the missing data occurs in a specific non-random fashion The number of cases with no missing data must be sufficient for the selected analysis technique if replacement values will not be substituted (imputed) to the missing data

68 Rules of Thumb for Deletions Variables with as little as 15% missing data are candidates for deletion –Higher levels of missing data (20%~30%) can often be remedied Be sure the overall decrease in missing data is large enough to justify deleting an individual variable or case

69 Rules of Thumb for Deletions Cases with missing data for dependent variables typically are deleted When deleting a variable, ensure that alternative variables, hopefully highly correlated, are available to represent the intent of the original variable

70 Levels of Randomness Missing at Random (MAR) –e.g., missing data of gender are random for both male and female, but those of household income occur at a higher frequency for males than females Missing Completely at Random (MCAR) –e.g., missing data for household income were randomly missing in equal proportions for both male and female

71 Modeling approach for MAR Process Involves maximum likelihood estimation techniques –e.g., EM approach

72 Imputation Using Only Valid Data Complete Case Approach –Include only those observations with complete data Using All-Available Data –Use only valid data –Imputes the distribution characteristics (e.g., means or standard deviation) or relationship (e.g., correlation) from every valid value

73 Imputation Using Known Replacement Values Hot deck imputation –Use the value from another observation in the sample that is deemed similar Cold deck imputation –Derive the replacement value from an external source (e.g., prior studies, other samples, etc.) Case substitution –Choose another nonsampled observation

74 Imputation by Calculating Replacement Values Mean Substitution –Use the mean value of that variable calculated from all valid responses Regression imputation –Predict the missing values of a variable based on its relationship to other variables in the data set

75 Summary of Imputation Using Only Valid Data

76 Summary of Imputation Using Known Replacement Values

77 Summary of Imputation by Calculating Replacement Values

78 Summary of Model-Based Methods

79 Rule of Thumbs for Imputation of Missing Data Under 10% –Any imputation method 10% to 20% –All-available, hot deck case substitution, regression for MCAR –Model-based for MAR Over 20% –Regression for MCAR –Model-based for MAR

80 Summary Statistics of Missing Data for Original Sample

81 Comparison of Four Imputation Methods

82 Comparison of Four Imputation Methods

1 Multivariate Statistical Analysis Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking.

Similar presentations

Presentation on theme: "1 Multivariate Statistical Analysis Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

1 Multivariate Statistical Analysis Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking.

Similar presentations

Presentation on theme: "1 Multivariate Statistical Analysis Shyh-Kang Jeng Department of Electrical Engineering/ Graduate Institute of Communication/ Graduate Institute of Networking."— Presentation transcript:

Similar presentations

About project

Feedback