EHS 655 Lecture 4: Descriptive statistics, censored data

What we’ll cover today
Descriptive analysis and visualization: distribution, central tendency, dispersion. Censored data. Stata: basic commands.

DESCRIPTIVE ANALYSIS
Before we can make inferences from data, we must thoroughly examine the variables: catch mistakes, look for patterns, find violations of statistical assumptions, generate hypotheses, and avoid headaches later.

Scope of dataset/analysis
Univariate: measurements on one variable per subject. Bivariate: measurements on two variables per subject. Multivariate: measurements on many variables per subject. Today’s focus: univariate and bivariate.

UNIVARIATE ANALYSES
Characteristics of a single variable. Typically: distribution (frequency distribution); central tendency (mean, median, mode); dispersion (range, quartiles, absolute deviation, variance, standard deviation).

Distribution: categorical (table)
Stata: “tab varname”
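
A one-way frequency table can also be sketched outside Stata. The following is a minimal Python illustration, not the lecture's own code; the `frequency_table` helper and the job-title data are hypothetical.

```python
from collections import Counter

def frequency_table(values):
    """Return (category, count, percent) rows sorted by category."""
    counts = Counter(values)
    n = len(values)
    return [(cat, k, 100.0 * k / n) for cat, k in sorted(counts.items())]

# Hypothetical categorical variable: job title for eight workers
jobs = ["welder", "painter", "welder", "driver",
        "welder", "painter", "driver", "welder"]

for cat, k, pct in frequency_table(jobs):
    print(f"{cat:10s} {k:3d} {pct:6.1f}%")
```

Each row gives the category, its count, and its share of all observations.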

Distribution: categorical (ordinal)
Stata: “graph bar (percent), over(varname)”

Distribution: quantitative (ratio), histogram
Stata: “histogram varname, freq” (add “normal” to superimpose normal curve)

Distribution: cumulative distribution
Stata: “cumul varname, gen(newvar)” followed by “line newvar varname, sort”
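
The idea behind a cumulative distribution can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reproduction of Stata's output: the `ecdf` helper and the noise levels are hypothetical, and tied values are given a single shared step here.

```python
def ecdf(values):
    """Return sorted (x, F(x)) pairs, where F(x) is the fraction of observations <= x."""
    xs = sorted(values)
    n = len(xs)
    # Record the last 1-based position of each distinct value so ties share one step
    last_position = {x: i for i, x in enumerate(xs, start=1)}
    return [(x, last_position[x] / n) for x in sorted(last_position)]

# Hypothetical noise measurements (dBA)
noise_dba = [82, 85, 85, 88, 91]

for x, fx in ecdf(noise_dba):
    print(x, fx)
```

Plotting the second column against the first gives the step-shaped cumulative curve.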

Distribution: exceedance fraction
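
The slide gives only a title here. As a hedged sketch, the empirical exceedance fraction is taken below to be the share of measurements above a chosen limit (one minus the cumulative distribution at that limit); the `exceedance_fraction` helper, the data, and the limit of 85 are all hypothetical.

```python
def exceedance_fraction(values, limit):
    """Fraction of measurements strictly greater than the limit."""
    return sum(v > limit for v in values) / len(values)

# Hypothetical noise measurements (dBA) and a hypothetical limit of 85
noise_dba = [82, 85, 85, 88, 91]
fraction = exceedance_fraction(noise_dba, 85)
print(fraction)  # 2 of the 5 values exceed 85
```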

Central tendency: mean, median, mode
Use: identify the “center” around which data are distributed. Mean: best for symmetric, non-skewed distributions. Median: best for skewed distributions or data with outliers. Mode: a dataset may be bimodal, or may lack a mode.

Examples of central tendency
Figure panels: symmetrical, unimodal; symmetrical, bimodal; positively skewed, unimodal; negatively skewed, unimodal.

When to use mean, median, mode
Stata: “tabstat varname, stat(mean median)”
Note: tabstat does not report the mode; “egen newvar = mode(varname)” is one way to compute it.
Type of variable              Best measure of central tendency
Nominal                       Mode
Ordinal                       Median
Interval/ratio (not skewed)   Mean
Interval/ratio (skewed)       Median
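
To see why the median is preferred for skewed data, here is a small Python sketch using the standard `statistics` module. The exposure values are hypothetical, chosen so that a single high value pulls the mean upward.

```python
import statistics

# Hypothetical exposures, positively skewed by one high value
exposures = [1, 2, 2, 3, 4, 5, 40]

mean = statistics.mean(exposures)        # pulled toward the outlier
median = statistics.median(exposures)    # middle value, robust to the outlier
modes = statistics.multimode(exposures)  # list of most frequent values

print(mean, median, modes)

# A bimodal dataset returns more than one mode
bimodal_modes = statistics.multimode([1, 1, 2, 3, 3])
print(bimodal_modes)
```

The mean lands well above most of the observations, while the median stays at the middle value.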

Dispersion
Measures that identify the spread of data (i.e., how far measurements are from the “center”): range, quartiles, standard deviation (SD), variance, coefficient of variation.
Stata: “sum varname” provides n, mean, SD, and range (minimum and maximum).
Stata: “sum varname, detail” adds percentiles (including quartiles) and variance.
Stata: “tabstat varname, stat(mean sd median range iqr cv)”

Dispersion: range
Simplest measure of dispersion. Range = maximum - minimum.

Dispersion: quartiles
3 points that divide the data set into 4 equal groups. 1st quartile (Q1) marks the lowest 25% of data = 25th percentile. 2nd quartile (Q2) splits the data set in half = 50th percentile (the median). 3rd quartile (Q3) marks the highest 25% of data = 75th percentile. Interquartile range (IQR) = upper quartile - lower quartile = Q3 - Q1.
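
Range, quartiles, and IQR can be computed with a short Python sketch. The data are hypothetical, and quartile conventions differ between software packages, so values from Stata or other tools may differ slightly on other datasets.

```python
import statistics

# Hypothetical exposure measurements
data = [2, 4, 6, 8, 10, 12, 14]

data_range = max(data) - min(data)

# quantiles(..., n=4) returns the three cut points Q1, Q2, Q3
# (Python's default "exclusive" interpolation method)
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(data_range, q1, q2, q3, iqr)
```

Q2 coincides with the median, and the IQR spans the middle half of the data.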

Dispersion: boxplot
Stata: “graph box varname1, over(varname2)”

Dispersion: standard deviation
Variation of data around the mean. Not dependent on n (not affected by the number of measurements). Expressed in the same units as the data. Commonly used in exposure analysis.

Variance
Square of the standard deviation. Squaring the deviations from the mean eliminates negative values. Unit is the square of the measurement unit (!). Values farther from the mean contribute more to the variance. Commonly used in exposure analysis.
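
The relationship between squared deviations, variance, and SD can be traced step by step. This is a minimal Python sketch with hypothetical data, using the sample (n - 1) versions of both statistics.

```python
import math
import statistics

# Hypothetical measurements (e.g., in mg/m3)
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)

# Squared deviations are never negative; far-away values contribute most
squared_deviations = [(x - mean) ** 2 for x in data]

variance = statistics.variance(data)  # sum of squared deviations / (n - 1), squared units
sd = statistics.stdev(data)           # square root of variance, same units as the data

print(squared_deviations, variance, sd)
```

Here the value 9 alone accounts for half of the total squared deviation.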

Coefficient of variation
Normalized measure of dispersion: CV = SD / mean. Dimensionless, often expressed as %. Allows comparison of datasets with different units or means. Unlike σ, cannot be used to construct confidence intervals around the mean. As the mean approaches 0, the CV approaches infinity.
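
Because the coefficient of variation is SD divided by the mean, it is unchanged when the measurement unit changes. The Python sketch below illustrates this; the `coefficient_of_variation` helper and the data are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """Sample SD divided by the mean; undefined when the mean is 0."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is 0")
    return statistics.stdev(values) / mean

# Hypothetical measurements in mg, then the same measurements in micrograms
data_mg = [2, 4, 4, 4, 5, 5, 7, 9]
data_ug = [x * 1000 for x in data_mg]

cv_mg = coefficient_of_variation(data_mg)
cv_ug = coefficient_of_variation(data_ug)

print(f"{100 * cv_mg:.1f}% {100 * cv_ug:.1f}%")
```

Both calls give the same percentage, even though the SDs differ by a factor of 1000.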

BIVARIATE ANALYSES
Allow us to begin to explore relationships between variables: scatter plot, correlation, cross-tabulation.

Bivariate: scatter plot
Stata: “scatter varname1 varname2”

Bivariate: Pearson correlation
r (Pearson’s correlation coefficient) measures the strength and direction of the linear relationship between two variables, ranging from -1 to +1. Assumptions: both variables interval or ratio data; both variables normally distributed; absence of outliers; linear relationship; homoscedasticity. Stata: “pwcorr varname1 varname2, sig”
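
Pearson's r can be computed directly from its definition. In this minimal Python sketch, the `pearson_r` helper and the data are hypothetical; the two y series have different slopes against x but the same r, showing that r reflects how tightly points follow a line rather than how steep the line is.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1, 2, 3, 4, 5]
y_slope2 = [2, 4, 6, 8, 10]   # y = 2x
y_slope3 = [3, 6, 9, 12, 15]  # y = 3x

r2 = pearson_r(x, y_slope2)
r3 = pearson_r(x, y_slope3)
print(r2, r3)
```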

Bivariate: correlation

Bivariate: Spearman correlation
Spearman’s rank correlation coefficient (rs or ρ): the Pearson correlation coefficient between ranked variables. Raw scores Xi, Yi are converted to ranks xi, yi. Nonparametric (no distributional assumptions). Assumptions: ordinal, interval, or ratio data; monotonic relationship. Stata: “spearman varname1 varname2, stats(rho p)”
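
The "Pearson on ranks" definition can be sketched as follows. This is an illustrative Python version; the `average_ranks`, `pearson_r`, and `spearman_rho` helpers and the data are hypothetical, and ties are assumed to receive the average of their ranks.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson_r(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Monotonic but nonlinear relationship: y = x squared
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]

rho = spearman_rho(x, y)
r = pearson_r(x, y)
print(rho, r)
```

For this monotonic but curved relationship, Spearman's coefficient reaches 1 while Pearson's stays below 1.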

Bivariate: Spearman vs. Pearson correlation coefficient examples

Bivariate: Cross-tabulation
Stata: “tab varname1 varname2”
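
A two-way table simply counts how often each pair of categories occurs together. The sketch below shows this in Python; the `crosstab` helper and both variables (job title and hearing protection use) are hypothetical.

```python
from collections import Counter

def crosstab(rows, cols):
    """Return nested dict: table[row_level][col_level] = count of that pair."""
    counts = Counter(zip(rows, cols))
    row_levels = sorted(set(rows))
    col_levels = sorted(set(cols))
    return {r: {c: counts[(r, c)] for c in col_levels} for r in row_levels}

# Hypothetical paired observations for six workers
job = ["welder", "welder", "painter", "painter", "welder", "driver"]
uses_hpd = ["yes", "no", "yes", "yes", "yes", "no"]

table = crosstab(job, uses_hpd)
for row_level, cells in table.items():
    print(row_level, cells)
```

Pairs that never occur appear as zero counts rather than being omitted.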

CENSORED DATA
Uncensored/complete: value of each sample unit is observed/known (the default assumption). Left censored: value known only to be below some maximum value. Interval censored: value known only to lie between a minimum and a maximum value. Right censored: value known only to exceed the maximum value that can be observed.

Exercise
Come up with one example of exposure data where you might find each type of censoring: right censoring, left censoring, interval censoring.

Common approach #1 to dealing with censored data
Assign all censored data a value of ½ LOD (LOD = limit of detection). Assumes data are uniformly distributed below the LOD.

Common approach #2 to dealing with censored data
Assign all censored data a value of LOD/√2 (Hornung and Reed, 1990). More accurate than LOD/2 when data are normally or lognormally distributed. Okay if <~50% of data are censored and variability is low to moderate. Less accurate than LOD/2 for highly skewed data.
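
The two substitution approaches can be compared side by side. This Python sketch assumes approach #2 is the LOD/√2 substitution associated with Hornung and Reed (1990); the `substitute_censored` helper, the use of `None` to mark results below the LOD, and the data are all hypothetical.

```python
import math
import statistics

def substitute_censored(values, lod, method="lod/2"):
    """Replace censored results (marked None) with a fixed fraction of the LOD."""
    if method == "lod/2":
        fill = lod / 2
    elif method == "lod/sqrt2":
        fill = lod / math.sqrt(2)
    else:
        raise ValueError(f"unknown method: {method}")
    return [fill if v is None else v for v in values]

# Hypothetical results with LOD = 2.0; None marks a result below the LOD
lod = 2.0
results = [None, None, 3.0, 5.0, 8.0]

half = substitute_censored(results, lod, "lod/2")
root2 = substitute_censored(results, lod, "lod/sqrt2")

print(statistics.mean(half), statistics.mean(root2))
```

The LOD/√2 substitution fills in larger values, so summary statistics such as the mean come out slightly higher than with LOD/2.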

On to Stata
Basic data manipulation commands: define a label/name for a variable (“label variable”); create labels (“label define”); assign labels to a variable (“label values”); rename a variable (“rename”); generate a new variable (“generate”); replace an existing variable (“replace”).

On to Stata
Anyone have to use the “Break” button?
