Programming in R Describing Univariate and Multivariate data
Describing univariate data: In this session I will explain measures of central tendency and variation, how to use figures to summarize a single variable (univariate data), and how to create these in R.
Characteristics of numeric variables: Center, or where we find most of the data. Distribution or shape, such as a bell-shaped curve. Variation or dispersion: how spread out is the data and, on average, how far are observations from the center? Outliers: have we got Bill Gates in our salary sample?
Measure of central tendency The “center” of a data set can be described using two different measures: 1. Mean – the commonly known “average” 2. Median – the midpoint
The mean: The sample mean is sometimes called “x bar”. Translation: add up all the values and divide by the number of values. Usually, this is what people call the average. $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
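To make the "add up and divide" translation concrete, here is a minimal sketch in R. The salary values are made up for illustration and are not from the session's data.

```r
# Hypothetical sample of five salaries (in thousands); the values are made up.
salaries <- c(42, 48, 51, 55, 64)

# Mean by hand: add up all the values and divide by the number of values.
manual_mean <- sum(salaries) / length(salaries)

# The built-in function should give the same answer.
builtin_mean <- mean(salaries)

print(manual_mean)
print(builtin_mean)
```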
The median: The middle of the data is called the median. Sort the data from smallest to largest. If there is an odd number of observations, the middle number is the median. For an even number of observations, the median is the midpoint between the two middle numbers.
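The sort-then-pick-the-middle recipe can be sketched as a small R function. The name median_by_hand and both samples are my own, invented for illustration; in practice the built-in median() does the same job.

```r
# Hypothetical helper that follows the recipe step by step.
median_by_hand <- function(x) {
  s <- sort(x)            # sort from smallest to largest
  n <- length(s)
  if (n %% 2 == 1) {
    s[(n + 1) / 2]        # odd count: take the middle value
  } else {
    (s[n / 2] + s[n / 2 + 1]) / 2   # even count: midpoint of the two middle values
  }
}

odd_sample  <- c(7, 3, 9, 1, 5)       # made-up data, 5 observations
even_sample <- c(7, 3, 9, 1, 5, 11)   # made-up data, 6 observations

print(median_by_hand(odd_sample))
print(median_by_hand(even_sample))

# Compare with the built-in function.
print(median(odd_sample))
print(median(even_sample))
```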
The mode: The most commonly occurring value. There can be more than one mode (bimodal, multimodal). Sometimes there is no mode. For nominal (unordered) categorical variables, the mode is the only possible measure of central tendency.
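R's own mode() function reports how an object is stored, not the most common value, so the statistical mode has to be computed another way. One possible sketch uses table(). The helper name stat_modes and the colour data are made up for illustration.

```r
# Hypothetical helper: return every value that ties for the highest count.
# Values come back as character strings because they are taken from the
# names of the frequency table.
stat_modes <- function(x) {
  counts <- table(x)
  names(counts)[as.vector(counts) == max(counts)]
}

# Made-up categorical data with two modes (bimodal).
colours <- c("red", "blue", "red", "green", "blue")

print(table(colours))
print(stat_modes(colours))
```

If every value occurs exactly once, this helper returns all of them, which is the "no mode" situation.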
The median and the mode for the variable table are both 62, while the mean is 61. Because the mean sits just below the median, table looks fairly symmetrical, with a slight left skew.
Normal variables and standard deviation: In a symmetric, bell-shaped distribution, we are able to describe the entire distribution using only two numbers, the mean and the standard deviation. The standard deviation is roughly the average distance of observations from their mean.
Calculating the standard deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$ Translation: find the difference between the mean and each value in the dataset, square each difference, add these up, divide by the total number of values minus 1, then take the square root of that (or get R to do it for you).
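Here is a short sketch of both routes, the step-by-step translation and getting R to do it for you with sd(). The data vector is made up for illustration.

```r
# Made-up sample of eight values.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# Step 1: difference between each value and the mean.
deviations <- x - mean(x)

# Steps 2-5: square, add up, divide by n - 1, take the square root.
manual_sd <- sqrt(sum(deviations^2) / (length(x) - 1))

# Or get R to do it: sd() also divides by n - 1.
builtin_sd <- sd(x)

print(manual_sd)
print(builtin_sd)
```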
And we care because? The Empirical Rule: For any normal curve, approximately 68% of the values fall within 1 standard deviation of the mean, 95% of the values fall within 2 standard deviations of the mean, and 99.7% of the values fall within 3 standard deviations of the mean.
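These percentages can be checked against the normal curve itself with pnorm(), which gives the area under the standard normal curve to the left of a value. This is a small sketch of that check; the variable names are my own.

```r
# Number of standard deviations either side of the mean.
k <- 1:3

# Area between -k and +k under the standard normal curve.
coverage <- pnorm(k) - pnorm(-k)
names(coverage) <- paste0("within ", k, " sd")

print(round(100 * coverage, 1))
```

The middle figure comes out slightly above 95%, which is why the rule says "approximately".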
Other things to describe: How many modes? The range, minimum and maximum. This histogram shows a bimodal shape. The data has a minimum of 1.67 minutes and a maximum of 4.93 minutes, for a range of 3.26 minutes. http://wps.aw.com/wps/media/objects/15/15719/projects/ch3_faithful/index.html
The five number summary: Minimum, lower quartile, median, upper quartile and maximum. The visual representation of the five number summary is the box plot, also called the box and whiskers plot.
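As a sketch, base R's fivenum() returns these five numbers in order, and boxplot() draws the matching picture. The sample is made up for illustration. Note that fivenum() uses Tukey's hinges for the quartiles, so its values can differ slightly from the quartiles reported by quantile() or summary() on some datasets.

```r
# Made-up sample of nine values.
x <- c(1, 3, 4, 6, 7, 8, 10, 12, 15)

# fivenum() returns: minimum, lower hinge, median, upper hinge, maximum.
five <- fivenum(x)
names(five) <- c("min", "lower quartile", "median", "upper quartile", "max")
print(five)

# The box spans the quartiles, the line inside it is the median,
# and the whiskers reach out toward the minimum and maximum.
boxplot(x, horizontal = TRUE, main = "Box plot of a made-up sample")
```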
Other ways to visualize data When developing a visual representation of a single variable, the most common tools are – Histograms, Pie Charts, Bar Charts, Box Plots and Stem and Leaf Plots. We’ve already seen a histogram and a box plot
Pie charts Excellent for categorical variables with 5 or fewer categories.
Bar charts Can be used to illustrate categories, or means and medians by categories
How to produce these in R: Use summary() to get the minimum, first quartile, median, mean, third quartile, and maximum. Use table() to get frequency counts, and prop.table() on that table to get proportions (multiply by 100 for percentages). Plus, pie(), barplot(), hist(), and boxplot() give pie charts, bar charts, histograms, and box plots, respectively.
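Putting these functions together, a minimal sketch might look like the following. It uses mtcars, a dataset that ships with base R, purely as a stand-in; the session's own data is not reproduced here, and the plot titles are my own.

```r
# Numeric variable: miles per gallon.
print(summary(mtcars$mpg))

# Categorical variable: number of cylinders.
cyl_counts <- table(mtcars$cyl)             # frequency counts
cyl_props  <- prop.table(cyl_counts)        # proportions that sum to 1
cyl_pct    <- round(100 * cyl_props, 1)     # percentages

print(cyl_counts)
print(cyl_pct)

# Plots for the categorical variable.
pie(cyl_counts, main = "Cars by number of cylinders")
barplot(cyl_counts, main = "Cars by number of cylinders",
        xlab = "Cylinders", ylab = "Count")

# Plots for the numeric variable.
hist(mtcars$mpg, main = "Miles per gallon", xlab = "mpg")
boxplot(mtcars$mpg, main = "Miles per gallon", ylab = "mpg")
```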