Global PaedSurg Research Training Fellowship


1 Global PaedSurg Research Training Fellowship
6 Data Cleaning and Analysis Dr. Emily Smith 4/25/2019

2 Overarching Goal
At the end of the day, you want a clean and precise dataset. The question is: how do you get there, especially in global health?

3 Next Steps
Think about the data that you have collected or will collect as part of your research project.
What is your research question?
What are you trying to get your data to “say”?
Which statistical tests will best help you answer your research question?
Contact the research coordinator to discuss how to analyze your data!

4 Data Analysis Process

5 Step 1: Creating an analysis plan
Formulate your plan according to your research objective and setting. Collect the data. Now that you have your data, how do you assess its accuracy and precision?

6 Step 2: Managing Data
Create a data dictionary

7 Excel is a great resource for a data dictionary
Make sure to add a column describing how you code the missing values
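
As a rough illustration of what such a data dictionary might contain, the sketch below builds one as CSV text that Excel can open. The variable names, allowed values, and missing-value codes are hypothetical and only show the layout, including the column that records how missing values are coded.

```python
import csv
import io

# Hypothetical variables for illustration; replace with your own study's fields.
DATA_DICTIONARY = [
    {"variable": "age_months", "description": "Age at presentation in months",
     "type": "continuous", "allowed_values": "0-216", "missing_code": "-99"},
    {"variable": "sex", "description": "Sex of patient",
     "type": "categorical", "allowed_values": "1=male; 2=female", "missing_code": "9"},
]


def dictionary_to_csv(rows):
    """Render the data dictionary as CSV text that a spreadsheet can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = dictionary_to_csv(DATA_DICTIONARY)
print(csv_text)
```

Keeping the missing-value code next to each variable makes it harder to mistake a code such as -99 for a real measurement later on.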

8 Step 3: Cleaning Data
Few datasets are free of errors and missing values. It is important to review the dataset to identify errors before beginning analysis.
Document, document, document ANY changes you make. This is an iterative process.
KEY POINT! Before you begin, make a copy of the original dataset!
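
One possible way to follow both pieces of advice (copy first, then document every change) is sketched below. The file names, the helper functions `start_cleaning` and `log_change`, and the example record are all made up for illustration; a spreadsheet change log would serve the same purpose.

```python
import csv
import shutil
import tempfile
from datetime import date
from pathlib import Path


def start_cleaning(original: Path) -> Path:
    """Copy the original file and return the path of the working copy."""
    working = original.with_name(original.stem + "_cleaning" + original.suffix)
    shutil.copy2(original, working)
    return working


def log_change(log_path: Path, record_id, variable, old, new, reason):
    """Append one documented change to a CSV change log."""
    is_new = not log_path.exists()
    with log_path.open("a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(["date", "record_id", "variable", "old", "new", "reason"])
        writer.writerow([date.today().isoformat(), record_id, variable, old, new, reason])


# Demonstration in a temporary folder with a made-up two-row dataset.
folder = Path(tempfile.mkdtemp())
original = folder / "study_data.csv"
original.write_text("id,age_months\n1,14\n2,1400\n")

working = start_cleaning(original)  # the original file is never edited
log_change(folder / "change_log.csv", 2, "age_months", 1400, 140,
           "typing error, checked against paper form")
```

All edits are then made to the working copy only, and each one gets a line in the change log.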


11 Step 4: Detecting and Correcting Missing, Miscoded or Out-of-Range Values
Few datasets are 100% complete or accurate. Usually there are a few weird or missing values. Sometimes missing data occurs randomly and sometimes it occurs in patterns.

12 Types of missing data
MCAR (missing completely at random): Missingness is independent of all variables and occurs at random.
MAR (missing at random): Missingness is related to a particular variable, but is not related to the value of the variable that has missing data (e.g., accidentally omitting an answer on a questionnaire).
MNAR (missing not at random): The data are missing for a reason! How do you know?

14 Identifying Missing Values
The best way to identify missing values is to assess the frequency distribution of each variable.
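
A frequency distribution can be produced in any statistics package. The minimal Python sketch below shows the idea on made-up records. It assumes blank cells appear as `None` or an empty string and that 9 is the missing-value code for the hypothetical variable `sex`.

```python
from collections import Counter

# Made-up records; None and "" stand for blank cells.
records = [
    {"sex": 1, "outcome": "survived"},
    {"sex": 2, "outcome": ""},
    {"sex": 9, "outcome": "died"},
    {"sex": 1, "outcome": None},
]

# Assumed missing-value codes, as they would be listed in a data dictionary.
MISSING_CODES = {"sex": {9}, "outcome": set()}


def frequency_table(rows, variable):
    """Count how often each value occurs, grouping blanks and missing codes."""
    counts = Counter()
    for row in rows:
        value = row.get(variable)
        if value is None or value == "" or value in MISSING_CODES.get(variable, set()):
            counts["<missing>"] += 1
        else:
            counts[value] += 1
    return counts


for variable in ("sex", "outcome"):
    print(variable, dict(frequency_table(records, variable)))
```

Reading down each table shows at a glance how many values are missing and whether any unexpected codes are present.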

15 HANDLING MISSING DATA
Complete-case analysis (complete-participant analysis): Remove everyone with missing data on one or more variables. Often reduces precision. Unbiased in a wide range of circumstances.
Imputation: Default value imputation, mean imputation, regression imputation, multiple imputation.
Inverse probability weighting.
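
To make the first two options concrete, here is a minimal sketch of complete-case analysis and mean imputation on made-up data. The variable names are hypothetical and `None` is assumed to mark a missing value. Regression imputation, multiple imputation, and inverse probability weighting need a statistics package and are not shown.

```python
from statistics import mean

# Made-up data; None marks a missing value.
rows = [
    {"id": 1, "weight_kg": 3.2, "age_days": 2},
    {"id": 2, "weight_kg": None, "age_days": 5},
    {"id": 3, "weight_kg": 2.8, "age_days": None},
    {"id": 4, "weight_kg": 3.6, "age_days": 1},
]


def complete_cases(data, variables):
    """Keep only participants with no missing value on any listed variable."""
    return [r for r in data if all(r[v] is not None for v in variables)]


def mean_impute(data, variable):
    """Return copies of the rows with missing values replaced by the observed mean."""
    observed = [r[variable] for r in data if r[variable] is not None]
    fill = mean(observed)
    return [dict(r, **{variable: fill if r[variable] is None else r[variable]})
            for r in data]


cc = complete_cases(rows, ["weight_kg", "age_days"])
imputed = mean_impute(rows, "weight_kg")
print(len(rows), "participants,", len(cc), "complete cases")
```

In this example complete-case analysis drops half of the participants, which is the loss of precision the slide refers to. Mean imputation keeps everyone but makes the variable look less variable than it really is.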

16 Identifying Records with Out-of-Range Values
Some variables may contain values that are outliers (or out of range) compared to the responses of the other participants. Often, these are numerical values that may have been incorrectly coded. To identify these, make a scatter-plot of the variables.

17 A scatterplot shows the value of one variable on the X axis and the value of the other on the Y axis.
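
Alongside a scatterplot, a simple range check can list the suspicious records directly. The sketch below is one way to do this. The plausible ranges and the records are assumptions made for illustration, and in practice the ranges would come from your own data dictionary.

```python
# Assumed plausible ranges; in practice these would come from your data dictionary.
PLAUSIBLE_RANGES = {"age_months": (0, 216), "weight_kg": (0.3, 120)}

# Made-up records for illustration.
records = [
    {"id": 1, "age_months": 14, "weight_kg": 9.5},
    {"id": 2, "age_months": 1400, "weight_kg": 10.2},
    {"id": 3, "age_months": 30, "weight_kg": 0.0},
]


def out_of_range(rows, ranges):
    """Return (id, variable, value) for every value outside its plausible range."""
    flagged = []
    for row in rows:
        for variable, (low, high) in ranges.items():
            value = row.get(variable)
            if value is not None and not (low <= value <= high):
                flagged.append((row["id"], variable, value))
    return flagged


flags = out_of_range(records, PLAUSIBLE_RANGES)
for record_id, variable, value in flags:
    print(f"Record {record_id}: {variable} = {value} is out of range")
```

A flagged value is only a candidate error. It should be checked against the source record before it is changed, and the change documented.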

18 Step 5: Data Analysis
Now that you have cleaned your data (missing data, outliers, etc.), you’re ready to analyze!
Remember that at this point you should have two datasets: the original (pre-cleaning) dataset and the dataset you have cleaned. The analysis should be done on the cleaned dataset!

19 Types of Statistics/Analyses
Descriptive Statistics: Describing an association/relationship. How many? How much? Examples: frequencies, basic measurements.
Inferential Statistics: Inferences about an association/relationship. Proving or disproving theories. Associations between phenomena. Whether the sample relates to the larger population. Examples: hypothesis testing, correlation, confidence intervals, significance testing, prediction.

20 Descriptive Statistics
Descriptive statistics can be used to summarize and describe a single variable (univariate).
Frequencies (counts) & percentages: categorical (nominal) data.
Means & standard deviations: continuous (interval/ratio) data.
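
As a small worked example of both kinds of summary, the sketch below computes frequencies and percentages for a categorical variable and the mean and sample standard deviation for a continuous one. The data are made up.

```python
from collections import Counter
from statistics import mean, stdev

# Made-up data for illustration.
sex = ["male", "female", "male", "male"]   # categorical (nominal)
weight_kg = [3.0, 3.5, 2.5, 3.0]           # continuous


def frequencies_and_percentages(values):
    """Return {category: (count, percentage)} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, 100 * n / total) for category, n in counts.items()}


summary = frequencies_and_percentages(sex)
average = mean(weight_kg)
sd = stdev(weight_kg)  # sample standard deviation (divides by n - 1)

for category, (n, pct) in summary.items():
    print(f"{category}: n = {n} ({pct:.0f}%)")
print(f"weight: mean = {average:.2f} kg, SD = {sd:.2f} kg")
```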

21 Frequencies & Percentages
How to display frequencies and percentages: pie chart, table, bar chart.

22 Distributions
The distribution can be displayed using box-and-whisker plots and histograms.

23 Continuous → Categorical
You can aggregate data into categories from continuous data. Collect continuous if you can, rather than categories in your raw data!
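
The sketch below shows why collecting the continuous value is the safer choice: categories can always be derived from it afterwards, but not the other way round. The age cut-offs and group labels are assumptions chosen only for illustration.

```python
def age_group(age_years):
    """Map a continuous age in years to a category (assumed cut-offs)."""
    if age_years < 1:
        return "under 1 year"
    if age_years < 5:
        return "1-4 years"
    if age_years < 13:
        return "5-12 years"
    return "13 years and over"


# Made-up ages; the raw continuous values are kept and the groups are derived.
ages = [0.2, 3.0, 7.5, 14.0]
groups = [age_group(a) for a in ages]
print(list(zip(ages, groups)))
```

If the cut-offs need to change later, only the function changes and the raw data stay intact.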

24 INFERENTIAL STATISTICS
Inferential statistics can be used to test theories, determine associations between variables, and determine if findings are significant.
Types include: correlation, t-tests/ANOVA, chi-square, logistic regression.

25 Type of Data & Analysis
Analysis of Categorical/Nominal Data: Chi-square, Logistic Regression
Analysis of Continuous Data: Correlation, T-tests

26 Correlation
When to use it? When you want to know about the association or relationship between two continuous variables. Example: blood pressure and medication.
What does it tell you? Whether a linear relationship exists between two variables, as measured by Pearson’s r, and how strong that relationship is. Pearson’s r ranges from -1 to +1.
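
For readers who want to see what Pearson’s r measures, here is a minimal sketch that computes it from its definition. The weight and blood pressure values are made up, and in practice a statistics package would report r together with a p-value.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / sqrt(sxx * syy)


# Made-up measurements for five patients.
weight_kg = [60, 70, 80, 90, 100]
systolic_bp = [110, 118, 125, 135, 140]

r = pearson_r(weight_kg, systolic_bp)
print(f"Pearson's r = {r:.3f}")
```

A value close to +1 means the two variables rise together along a nearly straight line, and a value close to -1 means one falls as the other rises.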

27 Correlation
Interpreting strength of correlations (using the absolute value of r):
0 – 0.25 = Little or no relationship
0.25 – 0.50 = Fair degree of relationship
0.50 – 0.75 = Moderate degree of relationship
0.75 – 1.0 = Strong relationship
1.0 = Perfect correlation

28 T-tests
What does a t-test tell you? Whether there is a statistically significant difference between the mean score (or value) of two groups.
What do the results look like? The test statistic is Student’s t. Look at the corresponding p-value:
If p < 0.05, the means are significantly different from each other.
If p ≥ 0.05, the means are not significantly different from each other.
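
The sketch below shows how Student’s t is calculated for two independent groups, assuming equal variances in the two groups. The blood pressure values are made up. The p-value itself is not computed here because the Python standard library has no t distribution. A statistics package (for example `scipy.stats.ttest_ind` in SciPy) would report it directly.

```python
from math import sqrt
from statistics import mean, variance


def students_t(group1, group2):
    """Two-sample Student's t (equal variances assumed) and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    standard_error = sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (mean(group1) - mean(group2)) / standard_error
    return t, df


# Made-up systolic blood pressure values (mmHg) for two groups of patients.
normal_weight = [118, 122, 120, 116, 124]
obese = [130, 134, 128, 136, 132]

t, df = students_t(normal_weight, obese)
print(f"t = {t:.2f} with {df} degrees of freedom")
```

With 8 degrees of freedom, the two-sided 5% critical value of t is about 2.31, so a t statistic further from zero than that corresponds to p < 0.05.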

29 Chi-square
When to use it? When you want to know if there is an association between an exposure and an outcome. Example: mortality (yes/no) and lung cancer (yes/no).
How to interpret? The test asks whether the observed frequencies of occurrence in each group are significantly different from the expected frequencies (i.e., a difference of proportions). Usually, the higher the chi-square statistic, the greater the likelihood that the finding is significant, but you must look at the corresponding p-value to determine significance.
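
To show where the chi-square statistic comes from, the sketch below compares observed and expected counts in a 2x2 table. The counts are made up, no continuity correction is applied, and every row and column total is assumed to be greater than zero. The p-value uses the fact that a chi-square statistic with 1 degree of freedom is the square of a standard normal variable.

```python
from math import erfc, sqrt


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p-value for a 2x2 table.

    Table layout:             outcome yes   outcome no
        exposed                    a             b
        not exposed                c             d
    """
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    p_value = erfc(sqrt(chi2 / 2))  # valid for 1 degree of freedom only
    return chi2, p_value


# Made-up counts: deaths among 100 patients with lung cancer and 100 without.
chi2, p_value = chi_square_2x2(30, 70, 10, 90)
print(f"chi-square = {chi2:.2f}, p = {p_value:.4f}")
```

For tables with small expected counts, an exact test is usually preferred over this approximation.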

30 Logistic Regression
When to use it?
When you want to measure the strength and direction of the association between two variables (exposure and outcome), where the dependent or outcome variable is categorical (e.g., yes/no).
When you want to predict the likelihood of an outcome while controlling for confounders.
How do you interpret the results?
Significance can be inferred by looking at the confidence interval of the odds ratio (OR): if the confidence interval does not cross 1, the result is significant.
If OR > 1 → the odds of the outcome are that many times HIGHER (risk factor). 2.0 = twice the odds.
If OR < 1 → the odds of the outcome are that many times LOWER (protective factor). 0.50 = 50% lower odds of experiencing the event.
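
Fitting a logistic regression needs a statistics package, but the interpretation rules above can be tried out on a crude (unadjusted) odds ratio from a 2x2 table, which is what a logistic regression with a single yes/no exposure and no other variables would also give. The sketch below uses made-up counts and the standard log-based 95% confidence interval. It does not control for confounders and assumes all four cells are greater than zero.

```python
from math import exp, log, sqrt


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Crude odds ratio with a log-based confidence interval (z = 1.96 for 95%).

    Table layout:             outcome yes   outcome no
        exposed                    a             b
        not exposed                c             d
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts: stroke among 100 obese and 100 non-obese patients.
odds_ratio, lower, upper = odds_ratio_ci(30, 70, 10, 90)
significant = lower > 1 or upper < 1  # the interval does not cross 1
print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}, significant: {significant}")
```

An adjusted odds ratio from a full logistic regression model is read in the same way, but it accounts for the other variables in the model.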

31 Summary of Statistical Tests
Statistical Test | Type of Data Needed | Test Statistic | Example
Correlation | Two continuous variables | Pearson’s r | Are blood pressure and weight correlated?
T-tests/ANOVA | Means from a continuous variable taken from two or more groups | Student’s t | Do normal weight (group 1) patients have lower blood pressure than obese patients (group 2)?
Chi-square | Two categorical variables | Chi-square (χ²) | Are obese individuals (obese vs. not obese) significantly more likely to have a stroke (stroke vs. no stroke)?
Logistic Regression | A dichotomous variable as the outcome | Odds Ratios (OR) & 95% Confidence Intervals (CI) | Does obesity predict stroke (stroke vs. no stroke) when controlling for other variables?

32 Summary
Descriptive statistics can be used with nominal, ordinal, interval and ratio data.
Frequencies and percentages describe categorical data, and means and standard deviations describe continuous variables.
Inferential statistics can be used to determine associations between variables and predict the likelihood of outcomes or events.
Inferential statistics tell us if our findings are significant and if we can infer from our sample to the larger population.

34 Thank you for listening, any questions?
Naomi Wright: @PaedsSurgeon @GlobalPaedSurg #GlobalPaedSurg

