1 Analyzing Surveys
Marcos Carzolio
Associate Collaborator for LISA
PhD Student, Department of Statistics, VT
Laboratory for Interdisciplinary Statistical Analysis
2 Outline
Data Cleaning and Preprocessing
  Outlier Detection
  Missing Value Imputation
Visualizing and Understanding Data
  Boxplots, Histograms, and Scatterplots
  Correlation Matrices
Analyzing Data
  Contingency Tables
  Analysis of Variance (ANOVA)
  Regression
3 Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of Statistics
Experimental Design • Data Analysis • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, SPSS...)
Our goal is to improve the quality of research and the use of statistics at Virginia Tech.
4 How can LISA help?
Formulate research question.
Screen data for integrity and unusual observations.
Implement graphical techniques to showcase the data: what is the story?
Develop and implement an analysis plan to address research question.
Help interpret results.
Communicate! Help with writing the report or giving the talk.
Identify future research directions.
5 Collaboration, Walk-In Consulting, and Short Courses
Laboratory for Interdisciplinary Statistical Analysis
LISA helps VT researchers benefit from the use of Statistics
Designing Experiments • Analyzing Data • Interpreting Results
Grant Proposals • Using Software (R, SAS, JMP, Minitab...)
Collaboration
  From our website request a meeting for personalized statistical advice
  Great advice right now: Meet with LISA before collecting your data
Walk-In Consulting
  Monday-Friday 1-3 pm in 401 Hutcheson
  Also, Tuesdays 1-3 pm in ICTAS Café X & Thursdays 1-3 pm in GLC Video Conf. Room
  For questions requiring <30 mins
Short Courses
  Designed to help graduate students apply statistics in their research
All services are FREE for VT researchers.
6 Some Useful Resources
R Statistical Computing Software
  Can be downloaded for free from:
RStudio, a free Integrated Development Environment:
For a more interactive and user-friendly experience, try JMP
  Downloadable from the Virginia Tech software library: /jmp/index.html
Amelia II: A Program for Missing Data
  Visit:
7 Types of Survey Data
Data Type | Description | Examples | Statistics
Nominal | Data with no intrinsic relative meaning behind labels | Strawberry, Banana, Hispanic | Mode
Ordinal | Data with an ordered structure | Small, Extra Large, Likert Scale* | Median and Percentiles
Interval (continuous or discrete) | Data with meaningful difference relations | Degrees in Celsius, Birthdates, GPS Coordinates | Mean, Standard Deviation, Correlation
Ratio (continuous or discrete) | Data with scale relations | Weight, Income, Length |
8 Outlier Detection and Handling
Outliers are data points that deviate so far from the main body of the data that they arouse suspicion about their origins
Visualize your data
  Boxplots, histograms, and scatterplots
Only remove outliers that are verifiable errors
  Extremeness in observations is not in itself cause for data removal
R Package ‘outliers’
(Figure label: Outlier)
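The screening advice above can be sketched in base R. This is only an illustration under assumptions: the `height` values are invented, and the sketch uses the standard boxplot rule from base R's `boxplot.stats()` rather than the ‘outliers’ package named on the slide.

```r
# Hypothetical survey responses (heights in inches).
# 720 looks like a data-entry error for 72, but that must be verified.
height <- c(64, 66, 67, 68, 69, 70, 71, 72, 720)

# boxplot.stats() applies the usual boxplot rule: values lying more than
# 1.5 times the hinge spread beyond the hinges are returned in $out.
flagged <- boxplot.stats(height)$out
print(flagged)

# Flag suspicious values for follow-up instead of deleting them;
# extremeness alone is not a reason to drop an observation.
is_flagged <- height %in% flagged
print(data.frame(height, is_flagged))
```

Flagging keeps the original data intact, so a value is only removed once it has been confirmed as an error.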
9 Missing Value Imputation
Imputation is the process of filling in the missing values of a dataset
Before considering imputation, try going after respondents for their true answers
Can be very tricky (Come to LISA for help)
If only one or two missing values are present in a vast dataset, use the mean of available values as a “best guess”
Honaker, James et al., AMELIA II: A Program for Missing Data
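A minimal sketch of the mean "best guess" described above, in base R. The `income` vector is hypothetical and far shorter than the vast dataset the slide has in mind. Mean imputation shrinks the variance of the variable, which is one reason to reserve it for a handful of missing values.

```r
# Hypothetical incomes in thousands of dollars; one respondent did not answer.
income <- c(42, 55, NA, 61, 48, 50)

# Mean of the available (non-missing) values.
observed_mean <- mean(income, na.rm = TRUE)

# Replace only the missing entries with that mean.
income_imputed <- ifelse(is.na(income), observed_mean, income)
print(income_imputed)

# Record which values were imputed so the change stays visible later.
was_imputed <- is.na(income)
```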
10 Visualizing Your Data
Boxplots
SAS/GRAPH(R) 9.2: Statistical Graphics Procedures Guide, Second Edition
14 Contingency Tables
Tabulates the number of responses in each category
Helps to visualize the distribution of data
Use the approximate χ² test for independence
  Pearson's Chi-squared test
  data: tab
  X-squared = , df = 2, p-value =
  Warning message:
  In chisq.test(tab) : Chi-squared approximation may be incorrect
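The output above comes from `chisq.test(tab)`. A sketch of how such a table and test might be produced follows. The data frame `survey` and its columns are invented for illustration. The 2 x 3 layout was chosen only so that the test has 2 degrees of freedom, as on the slide.

```r
# Hypothetical responses: gender by favorite fruit.
survey <- data.frame(
  gender = rep(c("F", "M"), each = 6),
  fruit  = c("Strawberry", "Strawberry", "Banana", "Apple", "Strawberry", "Banana",
             "Banana", "Apple", "Apple", "Banana", "Strawberry", "Apple")
)

# Cross-tabulate the two nominal variables.
tab <- table(survey$gender, survey$fruit)
print(tab)

# Pearson's chi-squared test of independence. With counts this small,
# R warns that the chi-squared approximation may be incorrect.
result <- chisq.test(tab)
print(result)

# The warning is triggered by small expected cell counts; inspect them.
print(result$expected)

# Fisher's exact test is a common alternative when expected counts are small.
print(fisher.test(tab))
```

The warning on the slide is not an error. It signals that some expected counts are too small for the approximation to be reliable, so the p-value should be read with caution.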
15 Analysis of Variance
Technique used to test for differences between group means
Always plot your data before doing analyses
Call:
  aov(formula = resp_height ~ gender)
Terms:
  gender Residuals
  Sum of Squares
  Deg. of Freedom
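A sketch of the workflow behind the `aov` call above. The variable names `resp_height` and `gender` come from the slide; the values are made up, so the numbers produced will not match the original output.

```r
# Hypothetical respondent heights (inches) for two groups.
survey <- data.frame(
  resp_height = c(63, 65, 64, 66, 62, 70, 72, 69, 71, 73),
  gender      = factor(rep(c("F", "M"), each = 5))
)

# Plot first: side-by-side boxplots show group centers, spread, and oddities.
boxplot(resp_height ~ gender, data = survey)

# One-way ANOVA: does mean height differ between the groups?
fit <- aov(resp_height ~ gender, data = survey)
print(fit)           # sums of squares and degrees of freedom, as on the slide
print(summary(fit))  # adds the F statistic and its p-value
```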
16 Regression
Actually a generalization of ANOVA
Again, always plot your data
Call:
  lm(formula = exercise ~ dad_height)
Residuals:
  Min 1Q Median 3Q Max
Coefficients:
  Estimate Std. Error t value Pr(>|t|)
  (Intercept)
  dad_height
Residual standard error: on 37 degrees of freedom
  (8 observations deleted due to missingness)
Multiple R-squared: , Adjusted R-squared:
F-statistic: on 1 and 37 DF, p-value:
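A sketch of the `lm` call above, again on invented data. Only the names `exercise` and `dad_height` are taken from the slide. The second half illustrates the remark that regression generalizes ANOVA: fitting `lm` with a factor predictor reproduces the F statistic from `aov`.

```r
# Hypothetical data with a few missing answers (NA).
survey <- data.frame(
  exercise   = c(3, 5, NA, 4, 6, 2, 7, 5, 4, 6),
  dad_height = c(66, 70, 68, NA, 72, 65, 74, 71, 69, 73),
  gender     = factor(rep(c("F", "M"), times = 5))
)

# Plot first: a scatterplot shows whether a straight line is plausible.
plot(exercise ~ dad_height, data = survey)

# Simple linear regression. By default lm() drops rows with a missing
# value in any model variable and reports that in the summary.
fit <- lm(exercise ~ dad_height, data = survey)
print(summary(fit))
print(nobs(fit))  # number of complete rows actually used

# ANOVA as a special case: a factor predictor in lm() gives the same F test.
f_lm  <- anova(lm(exercise ~ gender, data = survey))[["F value"]][1]
f_aov <- summary(aov(exercise ~ gender, data = survey))[[1]][["F value"]][1]
print(c(lm = f_lm, aov = f_aov))
```

The "observations deleted due to missingness" line in the slide output reflects this default row dropping, so it is worth checking how many rows a fitted model really used.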
17 Other Useful Resources
A PowerPoint on more automated outlier detection techniques: 2010/kdd10-outlier-tutorial.pdf
R Package ‘outliers’: project.org/web/packages/outliers/outliers.pdf
On multiple imputation: