RESEARCH STATISTICS Jobayer Hossain, PhD Larry Holmes, Jr, PhD October 16, 2008.


1 RESEARCH STATISTICS Jobayer Hossain, PhD Larry Holmes, Jr, PhD October 16, 2008

2 Data Summarization In research, the first step of data analysis is to describe the distribution of the variables included in the study. The advantages of describing the data are:
– To get a quick overall idea of the study
– To get a quick idea of the differences among the groups being compared
– To check the balance of the distribution of demographic and other prognostic variables that could unduly influence the outcome.

3 Is the balance of the distribution of prognostic factors in comparing groups important? Example: Primary biliary cirrhosis trial (primary biliary cirrhosis is a chronic and fatal liver disease)
– A randomized double-blind trial
– Study treatment groups: Azathioprine vs placebo
– Objective: To compare the survival time of the two treatment groups
– Primary end point: Time to death from randomization
Example from: Clinical Trials: A Practical Guide to Design, Analysis, and Reporting by Duolao Wang and Ameet Bakhai

4 Is the balance of the distribution of prognostic factors in comparing groups important? Example, continued. Bilirubin is a strong predictor of survival time. Table 1: Summary statistics for bilirubin level (µmol/L) at baseline [table not reproduced in this transcript]. Is the baseline imbalanced? You may expect a higher mortality rate in the Azathioprine group: why? How does it affect the primary end point of survival time? Example from: Clinical Trials: A Practical Guide to Design, Analysis, and Reporting by Duolao Wang and Ameet Bakhai

5 Is the balance of the distribution of prognostic factors in comparing groups important? Table 3: Adjusted and unadjusted hazard ratios of death from the Cox proportional hazards model [table not reproduced in this transcript]. Before adjustment for the covariate bilirubin, there was no significant difference between the two treatment groups (p-value = 0.455). After adjustment for bilirubin, a significant difference was found between the treatment groups (p-value < 0.001). Example from: Clinical Trials: A Practical Guide to Design, Analysis, and Reporting by Duolao Wang and Ameet Bakhai

6 Looking at Data How are the data distributed?
– Where is the center?
– What is the range?
– What is the shape of the distribution (symmetric, skewed)?
Are there outliers? Are there data points that don’t make sense?

7 Distribution of a variable Distribution (of a variable): tells us what values the variable takes and how often it takes these values. E.g. the ages of 26 pediatric patients, aged 1 to 6, at AIDHC are distributed as follows:
Age:        1   2   3   4   5   6
Frequency:  5   3   7   5   4   2

8 Statistical Description/Summarization of Data Statistics describes the distribution of a numeric set of data by its
– Center (mean, median, mode etc)
– Variability (standard deviation, range etc)
– Shape (skewness, kurtosis etc)
Statistics describes the distribution of a categorical set of data by the frequency, percentage, or proportion of each category.

9 Statistical Description/summarization of Data Examples of numerical and categorical variables- – Numerical variable: Age, blood pressure (systolic and diastolic), time, weight, height, bmi (body mass index) – Categorical variable: Treatment group, disease status, race, gender, blood type (O, A, B, AB), age groups (such as 1-5 years, 6-9 years etc)

10 Statistical Presentation of Data There are two types of statistical presentation of data: graphical and numerical. Graphical Presentation: We look for the overall pattern and for striking deviations from that pattern. The overall pattern is usually described by the shape, center, and spread of the data. An individual value that falls outside the overall pattern is called an outlier. Bar diagrams and pie charts are used for categorical variables. Histograms, stem and leaf plots, and boxplots are used for numerical variables.

11 Statistical Presentation of Data Statistics presents the data either graphically or numerically. In graphical presentation, we look for the overall pattern (distribution) and for striking deviations from that pattern. An individual value that falls outside the overall pattern is called an outlier.
– The overall pattern of numerical data is usually described by the shape, center, and spread of the data. Commonly used graphs are the histogram, stem and leaf plot, and boxplot.
– The overall pattern of categorical data is usually described by frequencies and percentages. Commonly used graphs are the bar plot and pie chart.

12 Data Presentation –Categorical Variable Bar Diagram: Lists the categories and presents the percent or count of individuals who fall in each category.
Treatment Group   Frequency   Proportion        Percent (%)
1                 15          15/60 = 0.250     25.0
2                 25          25/60 = 0.417     41.7
3                 20          20/60 = 0.333     33.3
Total             60          1.000             100.0

13 Data Presentation –Categorical Variable Pie Chart: Lists the categories and presents the percent or count of individuals who fall in each category, with each category drawn as a slice of the pie.
Treatment Group   Frequency   Proportion        Percent (%)
1                 15          15/60 = 0.250     25.0
2                 25          25/60 = 0.417     41.7
3                 20          20/60 = 0.333     33.3
Total             60          1.000             100.0
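
The proportion and percent columns of the table above can be recomputed from the frequencies alone. The following is a minimal Python sketch; the variable names are illustrative and not part of the original slides.

```python
# Frequencies per treatment group, taken from the table on the slide.
frequencies = {1: 15, 2: 25, 3: 20}

total = sum(frequencies.values())

# Proportion = group frequency / total; percent = 100 * proportion.
proportions = {group: count / total for group, count in frequencies.items()}
percents = {group: round(100 * p, 1) for group, p in proportions.items()}

for group in frequencies:
    print(group, frequencies[group], round(proportions[group], 3), percents[group])
```

The proportions always sum to 1 and the percents to 100 (up to rounding), which is a quick check on any hand-built frequency table.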

14 Data Presentation –Categorical Variable (Frequency Distribution) Consider a data set of 26 children of ages 1-6 years. Then the frequency distribution of the variable ‘age’ can be tabulated as follows:
Frequency Distribution of Age:
Age:        1   2   3   4   5   6
Frequency:  5   3   7   5   4   2
Grouped Frequency Distribution of Age:
Age Group:  1-2   3-4   5-6
Frequency:  8     12    6

15 Data Presentation –Categorical Variable (Frequency Distribution) Cumulative frequency of the data on the previous slide:
Age:                   1   2   3    4    5    6
Frequency:             5   3   7    5    4    2
Cumulative Frequency:  5   8   15   20   24   26
Age Group:             1-2   3-4   5-6
Frequency:             8     12    6
Cumulative Frequency:  8     20    26
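
As an illustration of how the frequency and cumulative frequency rows relate, here is a small Python sketch. It rebuilds the 26 ages from the slide's frequency table; the variable names are assumptions made for this example.

```python
from collections import Counter
from itertools import accumulate

# Rebuild the 26 individual ages from the frequency table on the slide.
table = {1: 5, 2: 3, 3: 7, 4: 5, 5: 4, 6: 2}
ages = [age for age, count in table.items() for _ in range(count)]

# Frequency distribution: how often each age occurs.
freq = Counter(ages)
values = sorted(freq)
counts = [freq[v] for v in values]

# Cumulative frequency: running total of the counts.
cumulative = list(accumulate(counts))

for value, count, running in zip(values, counts, cumulative):
    print(value, count, running)
```

The last cumulative frequency always equals the number of observations, here 26.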

16 Data Presentation –Numerical Variable Histogram: The overall pattern can be described by its shape, center, and spread. The following age distribution is right skewed. The center lies between 80 and 100. There are no outliers.
Mean                 90.41666667
Standard Error       3.902649518
Median               84
Mode                 84
Standard Deviation   30.22979318
Sample Variance      913.8403955
Kurtosis             -1.183899591
Skewness             0.389872725
Range                95
Minimum              48
Maximum              143
Sum                  5425
Count                60

17 Graphical presentation- Numerical Variable Boxplot:
– A boxplot is a graph of the five number summary. The central box spans the quartiles.
– A line within the box marks the median.
– Lines extending above and below the box mark the smallest and the largest observations (i.e. the range).
– Outlying observations may additionally be plotted as individual points beyond these lines.

18 Graphical Presentation –Numerical Variable Box-Plot: The box contains the middle 50% of the data. The upper and lower whiskers cover the top 25% and bottom 25% of the ordered data. Figure 3: Distribution of Age, Box Plot [Figure: boxplot labeled with maximum, 75th percentile, median, 25th percentile, and minimum]. The shape of the distribution is right skewed, as the upper part of the box and the upper whisker are longer than the corresponding lower parts.

19 Side by Side Boxplot [Figure: side-by-side boxplots for Trt 1, Trt 2, and Trt 3]

20 Box Plot: Age of patients [Figure: boxplot of age in years, labeled with minimum, 25th percentile, median, 75th percentile, maximum, and interquartile range]

21 Numerical Presentation A fundamental concept in summary statistics is that of a central value for a set of observations and the extent to which the central value characterizes the whole set of data. To understand how well a central value characterizes a set of observations, consider the following two sets of data: A: 30, 50, 70 and B: 40, 50, 60. The mean of both data sets is 50, but the observations in data set A lie farther from the mean than those in data set B. Thus, the mean of data set B represents its data better than the mean of data set A does. Measures of central value such as the mean or median must be coupled with measures of data dispersion (e.g., average distance from the mean) to indicate how well the central value characterizes the data as a whole.
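
To make the comparison of data sets A and B concrete, the sketch below computes the mean and the average absolute distance from the mean for each. The helper name `mean_abs_distance` is hypothetical and chosen only for this illustration.

```python
from statistics import mean

data_a = [30, 50, 70]
data_b = [40, 50, 60]

def mean_abs_distance(values):
    """Average absolute distance of the values from their mean."""
    center = mean(values)
    return mean([abs(v - center) for v in values])

spread_a = mean_abs_distance(data_a)  # distances 20, 0, 20
spread_b = mean_abs_distance(data_b)  # distances 10, 0, 10

print(mean(data_a), spread_a)
print(mean(data_b), spread_b)
```

Both sets share the same mean, but set A has twice the average distance from it, which is why a center alone does not describe the data.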

22 Methods of Center Measurement A measure of center is a summary of the overall level of a dataset. Commonly used measures are the mean, median, mode, geometric mean etc. Mean: The sum of all the observations divided by the number of observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.

23 Methods of Center Measurement Median: The middle value in an ordered sequence of observations. That is, to find the median we need to order the data set and then find the middle value. In case of an even number of observations the average of the two middle most values is the median. For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data giving {3, 5, 6, 7, 9}, then choose the middle value 6. If the number of observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the average of the two middle values from the sorted sequence, in this case, (5 + 6) / 2 = 5.5.
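
The two median examples above can be checked with Python's standard `statistics` module, which sorts the data and averages the two middle values when the count is even. This is only a quick sketch; the variable names are illustrative.

```python
from statistics import median

odd_count = [9, 3, 6, 7, 5]
even_count = [9, 3, 6, 7, 5, 2]

# median() sorts internally, so the input order does not matter.
median_odd = median(odd_count)    # middle value of 3, 5, 6, 7, 9
median_even = median(even_count)  # average of 5 and 6

print(median_odd, median_even)
```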

24 Mean or Median The mean is affected by outliers, but the median is much less so. [Figure: two dot plots on a 0 to 10 axis; the mean is 3 in the first and 4 in the second, while the median is 3.] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

25 Mean or Median The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions, e.g. family income. For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270. The median of these four observations is (30+40)/2 = 35. Here 3 observations out of 4 lie between 20 and 40, so the mean of 270 fails to give a realistic picture of the major part of the data. It is influenced by the extreme value 990.
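
The effect of the extreme value 990 can be seen by computing both measures with and without it. A minimal sketch using the standard library follows; the variable names are assumptions for this example.

```python
from statistics import mean, median

without_outlier = [20, 30, 40]
with_outlier = [20, 30, 40, 990]

# The mean jumps from 30 to 270; the median only moves from 30 to 35.
print(mean(without_outlier), median(without_outlier))
print(mean(with_outlier), median(with_outlier))
```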

26 Methods of Center Measurement Mode: The value that is observed most frequently. The mode is undefined for sequences in which no observation is repeated. A variable with a single mode is unimodal; a variable with two modes is bimodal.
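
As a sketch of finding modes in practice, `statistics.multimode` (Python 3.8+) returns every value tied for the highest frequency. The ages are rebuilt from the frequency table on the earlier slide; the `two_peaks` list is made-up data for illustration only.

```python
from statistics import multimode

# Ages of the 26 children from the frequency table (age 3 occurs 7 times).
ages = [1] * 5 + [2] * 3 + [3] * 7 + [4] * 5 + [5] * 4 + [6] * 2
age_modes = multimode(ages)

# Hypothetical bimodal data: 2 and 7 are equally frequent.
two_peaks = [2, 2, 2, 5, 7, 7, 7]
bimodal_modes = multimode(two_peaks)

# With no repeated value, multimode returns every value,
# which corresponds to the "mode is undefined" case on the slide.
no_repeats = multimode([4, 8, 15])

print(age_modes, bimodal_modes, no_repeats)
```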

27 A bimodal histogram [Figure: bimodal histogram with a modal class marked.] Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz

28 Methods of Variability Measurement Variability (or dispersion) measures the amount of scatter in a dataset. Commonly used measures: range, variance, standard deviation, interquartile range, coefficient of variation etc. Range: The difference between the largest and the smallest observations. The range of 10, 5, 2, 100 is (100-2) = 98. It is a crude measure of variability.

29 Methods of Variability Measurement Variance: The variance of a set of observations is the sum of the squared deviations of the observations from their mean, divided by n - 1 (roughly the average squared deviation). In symbols, the variance of the n observations x_1, x_2, …, x_n is $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is $s^2 = \frac{(5-5)^2 + (7-5)^2 + (3-5)^2}{3-1} = 4$. Standard Deviation (SD): The square root of the variance. The SD of the above example is 2. If the distribution is bell shaped (symmetric), then the range is approximately 6 x SD.
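
The worked example for 5, 7, 3 can be reproduced in a few lines. The sketch below assumes the sample variance with an n - 1 divisor, which is the version consistent with the SD of 2 stated on the slide; it computes the value by hand and with the standard library.

```python
from statistics import stdev, variance

data = [5, 7, 3]

# By hand: sum of squared deviations from the mean, divided by n - 1.
n = len(data)
center = sum(data) / n
manual_variance = sum((x - center) ** 2 for x in data) / (n - 1)

# statistics.variance and statistics.stdev use the same n - 1 divisor.
library_variance = variance(data)
library_sd = stdev(data)

print(manual_variance, library_variance, library_sd)
```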

30 Standard deviation of different distributions with the same center [Figure: three dot plots on an 11 to 21 axis. Data A: Mean = 15.5, S = 3.338. Data B: Mean = 15.5, S = 0.926. Data C: Mean = 15.5, S = 4.570.] Slide from: Statistics for Managers Using Microsoft® Excel 4th Edition, 2004 Prentice-Hall

31 Std Dev of Shock Index [Figure: histogram of Shock Index (SI), with count on the vertical axis and SI from 0.0 to 2.0 on the horizontal axis.] The standard deviation is a measure of the “average” scatter around the mean. Estimation method: if the distribution is bell shaped, the range is around 6 SD, so here a rough guess for the SD is 1.4/6 = 0.23. Slide from: Kristin L. Sainani, Stanford University, http://www.stanford.edu/~kcobb

32 Methods of Variability Measurement Quartiles: Quartiles are values that divide the sorted dataset into four equal parts, so that each part contains 25% of the sorted data. The first quartile (Q1) is the value below which 25% of the observations fall and above which 75% fall. This is the median of the 1st half of the ordered dataset. The second quartile (Q2) is the median of the data. The third quartile (Q3) is the value below which 75% of the observations fall and above which 25% fall. This is the median of the 2nd half of the ordered dataset. In notation, the qth quartile of a data set is the ((n+1)/4) x q th observation of the ordered data, where q is the desired quartile and n is the number of observations.

33 Methods of Variability Measurement An example with 15 numbers: 3 6 7 11 13 22 30 40 44 50 52 61 68 80 94. In this example Q1 = ((15+1)/4) x 1 = 4th observation of the ordered data. The 4th observation is 11, so Q1 of this data is 11. The second quartile is Q2 = 40 (this is also the median). The third quartile is Q3 = 61. Inter-quartile Range: The difference between Q3 and Q1. The inter-quartile range of this example is 61 - 11 = 50. The middle half of the ordered data lies between 11 and 61.
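
For the 15-number example, the quartiles can be checked with `statistics.quantiles` (Python 3.8+). Its "exclusive" method places the qth quartile at position (n+1) x q / 4 in the ordered data, which matches the rule on the slide; treating the two as equivalent here is an assumption of this sketch.

```python
from statistics import quantiles

data = [3, 6, 7, 11, 13, 22, 30, 40, 44, 50, 52, 61, 68, 80, 94]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="exclusive")

# Inter-quartile range: spread of the middle half of the data.
iqr = q3 - q1

print(q1, q2, q3, iqr)
```

Other software may use slightly different quartile rules, so results can differ by a small amount on other data sets.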

34 Methods of Variability Measurement [Figure: positions of Q1, Q2, and Q3, each section holding 25% of the data, for symmetric, left-skewed, and right-skewed distributions]

35 Deciles and Percentiles Deciles: If data are ordered and divided into 10 equal parts, the cut points are called deciles. Percentiles: If data are ordered and divided into 100 equal parts, the cut points are called percentiles. The 25th percentile is Q1, the 50th percentile is the median (Q2), and the 75th percentile is Q3. Suppose PC = ((n+1)/100) x p, where n is the number of observations and p is the desired percentile. If PC is an integer, then the pth percentile of a data set is the (PC)th observation of the ordered data. Otherwise let PI be the integer part of PC and f be the fractional part of PC. Then pth percentile = OI + (OII - OI) x f, where OI is the (PI)th observation of the ordered data and OII is the (PI+1)th observation. For example, consider the following ordered set of data: 3, 5, 7, 8, 9, 11, 13, 15. Here n = 8, so PC = (9/100) x p. For the 25th percentile, PC = 2.25 (not an integer), so PI = 2, f = 0.25, and the 25th percentile = 5 + (7 - 5) x 0.25 = 5.5.
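
The percentile rule above translates directly into a short function. This is a sketch rather than a reference implementation: the function name `percentile` is hypothetical, and clamping to the smallest or largest observation when PC falls outside 1..n is an assumption the slide does not cover.

```python
def percentile(data, p):
    """pth percentile using the slide's position rule PC = (n + 1) * p / 100."""
    ordered = sorted(data)
    n = len(ordered)
    pc = (n + 1) * p / 100
    pi = int(pc)   # integer part of PC
    f = pc - pi    # fractional part of PC

    # Assumption: clamp to the extremes when the position is out of range.
    if pi < 1:
        return ordered[0]
    if pi >= n:
        return ordered[-1]

    lower = ordered[pi - 1]  # the (PI)th observation (1-based)
    upper = ordered[pi]      # the (PI + 1)th observation
    return lower + (upper - lower) * f

data = [3, 5, 7, 8, 9, 11, 13, 15]
p25 = percentile(data, 25)
p50 = percentile(data, 50)
print(p25, p50)
```

When PC is an integer, f is 0 and the function returns the (PC)th observation unchanged, so both cases of the rule are covered by the same formula.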

36 Coefficient of Variation Coefficient of Variation: The standard deviation of the data divided by its mean. It is usually expressed in percent. Coefficient of Variation = $\frac{\text{SD}}{\text{mean}} \times 100\%$
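
As a quick illustration, the coefficient of variation for the small data set 5, 7, 3 used in the variance example (mean 5, SD 2) works out to 40%. The sketch below assumes the sample standard deviation; the function name is illustrative.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean, in percent."""
    return 100 * stdev(values) / mean(values)

cv = coefficient_of_variation([5, 7, 3])
print(cv)
```

Because the units cancel, the coefficient of variation allows the spread of variables on different scales to be compared.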

37 Five Number Summary Five Number Summary: The five number summary of a distribution consists of the smallest (Minimum) observation, the first quartile (Q1), the median (Q2), the third quartile (Q3), and the largest (Maximum) observation, written in order from smallest to largest.

38 Choosing a Summary The five number summary is usually better than the mean and standard deviation for describing a skewed distribution or a distribution with extreme outliers. The mean and standard deviation are reasonable for symmetric distributions that are free of outliers. In real life we can’t always expect symmetry of the data. It is common practice to report the number of observations (n), mean, median, standard deviation, and range when summarizing data. Other summary statistics such as Q1, Q3, and the coefficient of variation can be included if they are considered important for describing the data.

39 Shape of Data Shape of data is measured by – Skewness – Kurtosis

40 Skewness Measures the asymmetry of data
– Positive or right skewed: Longer right tail
– Negative or left skewed: Longer left tail
– Symmetric: Tails of equal length (e.g. bell shaped)

41 Right skewed Left skewed Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz

42 Bell-shaped Histograms Slide from Zhengyuan Zhu, UNC, http://www.unc.edu/~zhuz

43 Kurtosis Formula

44 Kurtosis Kurtosis relates to the relative flatness or peakedness of a distribution. A standard normal distribution (blue line: µ = 0; σ = 1) has kurtosis = 0 (that is, kurtosis measured as excess over the normal). A distribution like that illustrated with the red curve has kurtosis > 0, with a lower peak relative to its tails (heavier tails than the normal).
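
The two shape measures can be sketched with moment-based formulas. Note the assumptions: this uses the simple population-moment definitions (skewness = m3 / m2^1.5, excess kurtosis = m4 / m2^2 - 3), which are one common convention and not necessarily the formula shown on the slide or the bias-adjusted version that spreadsheet software reports. The data sets are made up for illustration.

```python
def central_moment(values, k):
    """kth central moment: average of (x - mean)**k."""
    n = len(values)
    center = sum(values) / n
    return sum((v - center) ** k for v in values) / n

def skewness(values):
    """Moment-based skewness; 0 for perfectly symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Moment-based kurtosis minus 3, so a normal distribution scores 0."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]      # hypothetical symmetric data
right_skewed = [1, 2, 3, 4, 10]  # hypothetical data with a long right tail

print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))
```

A positive skewness value indicates a longer right tail and a negative value a longer left tail, matching the definitions on the skewness slide.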

45 Normal Distribution The Normal Distribution is a density curve based on the following formula: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. It is completely defined by two parameters: the mean (µ) and the standard deviation (σ). A density function describes the overall pattern of a distribution. The total area under the curve is always 1.0. The normal distribution is symmetrical. The mean, median, and mode are all the same.

46 Normal Distribution The 68-95-99.7 Rule In the normal distribution with mean µ and standard deviation σ: about 68% of the observations fall within 1σ of the mean µ, about 95% of the observations fall within 2σ of the mean µ, and about 99.7% of the observations fall within 3σ of the mean µ.
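
The three percentages in the rule are rounded values of exact normal probabilities. For a normal variable, the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)), which the sketch below evaluates with the standard library; the function name `prob_within` is illustrative.

```python
from math import erf, sqrt

def prob_within(k):
    """P(|X - mu| < k * sigma) for a normal variable X."""
    return erf(k / sqrt(2))

coverage = {k: prob_within(k) for k in (1, 2, 3)}

for k, p in coverage.items():
    print(k, round(100 * p, 2))
```

The result does not depend on µ or σ, which is why the rule applies to every normal distribution.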

47 68-95-99.7 Rule [Figure: normal curves with 68%, 95%, and 99.7% of the data shaded.] Slide from: Kristin L. Sainani, Stanford University, http://www.stanford.edu/~kcobb

48 Normal Distribution Standardizing and z-Scores If x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is $z = \frac{x - \mu}{\sigma}$. A standardized value is often called a z-score. If x is a normal variable with mean µ and standard deviation σ, then z is a standard normal variable with mean 0 and standard deviation 1.
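
As an example of standardizing, the sketch below computes the z-score of the largest age (143) from the summary statistics on the histogram slide (mean 90.41666667, SD 30.22979318). Using the sample mean and SD in place of µ and σ is an assumption of this illustration, and the function name is hypothetical.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean."""
    return (x - mu) / sigma

# Summary statistics from the age histogram slide.
sample_mean = 90.41666667
sample_sd = 30.22979318

z_max = z_score(143, sample_mean, sample_sd)
print(round(z_max, 2))
```

The maximum lies less than 2 standard deviations above the mean, which fits the slide's remark that the age data show no outliers.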

49 Normal Distribution Let x_1, x_2, …, x_n be n independent normal random variables, each with mean µ and standard deviation σ. Then their sum ∑x_i is also normal, with mean nµ and standard deviation σ√n. The distribution of the mean x̄ is also normal, with mean µ and standard deviation σ/√n. The standardized score of the mean is $z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$. The mean of this standardized random variable is 0 and its standard deviation is 1.
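
The σ/√n term is what summary tables report as the standard error of the mean. As a check, the sketch below applies it to the values from the histogram slide (SD 30.22979318, n = 60), using the sample SD as an estimate of σ, and recovers the reported standard error of about 3.9026. The function name is illustrative.

```python
from math import sqrt

def standard_error(sd, n):
    """Standard deviation of the sample mean: sd / sqrt(n)."""
    return sd / sqrt(n)

# Values from the age summary table on the histogram slide.
se = standard_error(30.22979318, 60)
print(round(se, 4))
```

Because of the √n in the denominator, quadrupling the sample size only halves the standard error.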

50 SPSS demo- Data Summarization Categorical variable Frequencies/percentages: Analyze -> Descriptive Statistics -> Frequencies -> Select variables (sex, grp, shades, ped) -> OK

51 SPSS demo- Data Summarization Categorical variable

52 SPSS demo- Bar Chart Analyze -> Descriptive Statistics -> Frequencies -> Select variables (sex, grp, shades, ped), then select the Charts option -> Select chart type (bar chart, histogram, pie chart) and select percentages or frequencies -> Continue -> OK. Or: Graphs -> Bar -> Select type (simple, clustered, stacked) -> Define -> Select what the bars represent (N of cases, % of cases) -> Select the variable for the category axis (e.g. grp) and click Titles to write titles -> Continue -> OK

53 SPSS demo- Bar Chart

54 SPSS demo- Data Summarization Numerical variable Analyze -> Descriptive Statistics -> Descriptives -> Select variable(s) (e.g. Age, hgt) and click the arrow button to transfer the variable(s) to the other window, then select Options -> Continue -> OK. Or: Analyze -> Compare Means -> Means -> Select variable(s) for the dependent list (age, hgt) and the independent list (grp, sex), then select Options -> Continue -> OK

55 SPSS demo- Data Summarization Numerical variable

56 SPSS demo – Boxplots Graph -> Boxplots -> Simple -> Define -> Select variables ( e.g. PLUC_pre) and category axis (e.g. grp) -> OK

57 MS Excel demo: Summary Statistics- Categorical Variable Frequency: Type bins -> Insert -> Function -> Statistical -> Frequency -> Select ranges for data (grp) and bins -> place the cursor to the left of the equal sign and then press Ctrl, Shift, and Enter simultaneously. Pie Chart: Select Frequency -> Chart -> Pie -> Series: write category labels (1, 2, 3) -> Next -> Click Title and write the title, click Data Labels and select Show Percent, then click Next.

58 MS Excel demo: Summary Statistics- numerical variable

59 Questions

