# Exploratory Data Analysis I

STAT 101: Exploratory Data Analysis I (1/25/12)
Topics: One Categorical Variable; Two Categorical Variables; One Quantitative Variable (Center). Sections 2.1, 2.2.
Professor Kari Lock Morgan, Duke University

Announcements: Textbooks are here! My office hours (Old Chemistry 216): Wednesday 3-5 pm, Friday 1-3 pm. Lecture slides, assignments, labs, etc. will be posted online. Complete lecture slides will be posted after each class.

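Exploratory Data Analysis
The Big Picture (diagram): sampling takes us from the population to a sample; statistical inference goes from the sample back to the population; exploratory data analysis describes the sample.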

Class Survey Data Data from both STAT 101 classes and STAT 10

Data: To make sense of these data, we need ways to summarize and visualize them. Summarizing and visualizing variables, and the relationships between two variables, is often known as exploratory data analysis (also known as descriptive statistics). The types of summary statistics and visualization methods to use depend on the type of variable(s) being analyzed (categorical or quantitative).

One Categorical Variable
Display the number or proportion of cases that fall in each category “What is your favorite day of the week?”

Frequency Table A frequency table shows the number of cases that fall in each category: Monday Tuesday Wednesday Thursday Friday Saturday Sunday 1 6 12 106 71 R: table(fav_day)

Proportion: The sample proportion of students in each category is $\hat{p} = \dfrac{\text{number of cases in the category}}{\text{total number of cases}}$

Proportion Monday Tuesday Wednesday Thursday Friday Saturday Sunday 1 6 12 106 71 The sample proportion of students in this class who prefer Friday is 106/209 ≈ 0.51. Proportion and percent can be used interchangeably: 0.51 or 51%

Relative Frequency Table
A relative frequency table shows the proportion of cases that fall in each category All the numbers in a relative frequency table sum to 1 Monday Tuesday Wednesday Thursday Friday Saturday Sunday 0.005 0.029 0.057 0.507 0.340 R: round(table(fav_day)/209,3)
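The frequency and relative frequency tables above can be sketched in R as follows. This is a minimal illustration under stated assumptions: the `fav_day` responses below are made up for the example and are not the class survey data.

```r
# Hypothetical responses for illustration only (NOT the class survey data)
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")
fav_day <- factor(
  c("Friday", "Saturday", "Friday", "Sunday", "Friday", "Saturday"),
  levels = days  # keep weekday order; categories with no cases show as 0
)

freq <- table(fav_day)        # frequency table: number of cases per category
rel_freq <- prop.table(freq)  # relative frequency table: counts / total

print(freq)
print(round(rel_freq, 3))
```

Here `prop.table()` divides by the total automatically, so it plays the same role as dividing `table(fav_day)` by the sample size by hand.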

Bar Chart/Plot/Graph In a barplot, the height of the bar corresponds to the number of cases falling in each category R: barplot(table(fav_day))

Pie Chart In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category R: pie(table(fav_day))

Summary: One Categorical Variable
Summary Statistics Proportion Frequency table Relative frequency table Visualization Barplot Pie chart

Two Categorical Variables
Look at the relationship between two categorical variables Relationship status Gender

Two-Way Table

| | Female | Male | Total |
|---|---|---|---|
| In a Relationship | 42 | 18 | 60 |
| Single | 98 | 45 | 143 |
| It’s Complicated / Other | 11 | 1 | 12 |
| Total | 151 | 64 | 215 |

It doesn’t matter which variable is displayed in the rows and which in the columns. R: table(gender, relationship)

Two-Way Table: What proportion of females in intro stat are in a relationship? (a) 42/60 (b) 42/151 (c) 42/215 (d) 151/215 (e) 60/215 (Counts come from the two-way table above.)

Two-Way Table: What proportion of intro stat students in a relationship are female? (a) 42/60 (b) 42/151 (c) 42/215 (d) 151/215 (e) 60/215 (Counts come from the two-way table above.)

Two-Way Table CAUTION: The proportion of females in a relationship is NOT THE SAME AS the proportion of people in a relationship who are female!
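The distinction in this caution can be checked directly in R. The sketch below enters the counts from the two-way table above by hand (the object names `tab`, `within_gender`, `within_status`, and `joint` are only illustrative) and uses `prop.table()` with different margins.

```r
# Counts taken from the two-way table above
tab <- matrix(
  c(42, 98, 11,   # Female column
    18, 45,  1),  # Male column
  nrow = 3,
  dimnames = list(
    relationship = c("In a Relationship", "Single", "It's Complicated / Other"),
    gender = c("Female", "Male")
  )
)

within_gender <- prop.table(tab, margin = 2)  # each column (gender) sums to 1
within_status <- prop.table(tab, margin = 1)  # each row (status) sums to 1
joint <- prop.table(tab)                      # all cells together sum to 1

# Proportion of females who are in a relationship: 42/151
within_gender["In a Relationship", "Female"]
# Proportion of people in a relationship who are female: 42/60
within_status["In a Relationship", "Female"]
# Proportion of all students who are female AND in a relationship: 42/215
joint["In a Relationship", "Female"]
```

The three results share the numerator 42 but use different denominators, which is exactly why the wording of the question matters.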

Two-Way Table: What proportion of intro stat students are in a relationship and female? (a) 42/60 (b) 42/151 (c) 42/215 (d) 151/215 (e) 60/215 (Counts come from the two-way table above.)

Side-by-Side Bar Chart
The height of each bar is the count in the corresponding cell of the two-way table. R: colors = c("pink", "blue"); barplot(table(gender, relationship), beside=TRUE, col=colors, legend=TRUE)

Side-by-Side Bar Chart
R: colors = c("red", "green", "blue"); barplot(table(relationship, gender), beside=TRUE, col=colors, legend=TRUE)

Segmented Bar Chart: A segmented bar chart is like a side-by-side bar chart, but the bars are stacked instead of placed side by side. R: barplot(table(relationship, gender), legend=TRUE, col=c("red", "green", "blue"))

Mosaic Plot: The width of each column corresponds to the proportion of cases in that column category, and each column is split and colored according to the proportions of the row variable within that column category. R: mosaicplot(table(Music, Gender), col=c("pink", "blue"))

Mosaic Plot colors = c("red", "green","blue")
mosaicplot(table(gender, relationship), col=colors, legend=TRUE,cex.axis=.7,main="")

Mosaic Plot: This tells us…
(a) Most people who are in favor of the new housing model are in (or plan to be in) a selected living group. (b) Most people who are in (or plan to be in) a selected living group are in favor of the new housing model. (c) Both (a) and (b). (d) Neither (a) nor (b).

Difference in Proportions
A difference in proportions is the difference between the proportions of one categorical variable (e.g. the proportion for whom “it’s complicated”) calculated at different levels of the other categorical variable (e.g. gender).

Two-Way Table: What is the difference in proportions for “it’s complicated / other” between females and males? (a) 0.833 (b) 0.066 (c) –0.003 (d) 0.057 (e) 0.047 Calculation: 11/151 – 1/64 ≈ 0.057 (Counts come from the two-way table above.)
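The arithmetic behind 11/151 – 1/64 can be reproduced with a short R sketch. The counts come from the two-way table; the variable names are only illustrative.

```r
# "It's Complicated / Other" counts and column totals from the two-way table
complicated <- c(Female = 11, Male = 1)
totals <- c(Female = 151, Male = 64)

# Sample proportion within each gender
p_hat <- complicated / totals

# Difference in proportions: females minus males
diff_in_props <- p_hat[["Female"]] - p_hat[["Male"]]
round(diff_in_props, 3)
```

Swapping the order of subtraction only flips the sign, so it is worth stating which group comes first whenever a difference in proportions is reported.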

Summary: Two Categorical Variables
Summary Statistics Two-way table Difference in proportions Visualization Side-by-side bar chart Segmented bar chart Mosaic plot

Kidney Stones: Which treatment is better at removing kidney stones?

| | Success | Failure |
|---|---|---|
| Treatment A | 273 | 77 |
| Treatment B | 289 | 61 |

(a) Treatment A (b) Treatment B

C. R. Charig, D. R. Webb, S. R. Payne, J. E. Wickham (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy". Br Med J (Clin Res Ed) 292 (6524): 879–882

Kidney Stones: Small Stones

| | Success | Failure |
|---|---|---|
| Treatment A | 81 | 6 |
| Treatment B | 234 | 36 |

Which treatment is better at removing small kidney stones? (a) Treatment A (b) Treatment B

Kidney Stones: Large Stones

| | Success | Failure |
|---|---|---|
| Treatment A | 192 | 71 |
| Treatment B | 55 | 25 |

Which treatment is better at removing large kidney stones? (a) Treatment A (b) Treatment B

Kidney Stones: Treatment A is more effective for both small and large kidney stones, but the data show Treatment B to be more effective overall! How is this possible!?!?

Kidney Stones

| Stones | Treatment | Success | Failure |
|---|---|---|---|
| All | A | 273 | 77 |
| All | B | 289 | 61 |
| Small | A | 81 | 6 |
| Small | B | 234 | 36 |
| Large | A | 192 | 71 |
| Large | B | 55 | 25 |

Kidney Stones: Treatment A is used more often on large stones, which are harder to treat. This is an example of Simpson’s Paradox: an observed relationship between two variables can change (or even reverse!) when a third variable is considered.
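The reversal can be verified from the counts on the slides. The R sketch below is one way to lay it out; the data frame layout and the names `stones` and `combined` are choices made for this illustration.

```r
# Counts from the kidney stone tables, split by stone size
stones <- data.frame(
  size      = c("Small", "Small", "Large", "Large"),
  treatment = c("A", "B", "A", "B"),
  success   = c(81, 234, 192, 55),
  failure   = c(6, 36, 71, 25)
)

# Success rate within each stone size
stones$rate <- stones$success / (stones$success + stones$failure)

# Collapse over stone size: add up successes and failures per treatment
combined <- aggregate(cbind(success, failure) ~ treatment,
                      data = stones, FUN = sum)
combined$rate <- combined$success / (combined$success + combined$failure)

print(stones)    # A has the higher rate for small AND for large stones
print(combined)  # B has the higher rate once stone size is ignored
```

Treatment A handled 263 large stones versus 80 for Treatment B, so its combined rate is pulled toward the lower large-stone success rate.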

Small Stones (plot: slope = # successful / # unsuccessful = odds)

| | Treatment A | Treatment B |
|---|---|---|
| Successful | 81 (93%) | 234 (87%) |
| Unsuccessful | 6 | 36 |

Large Stones (plot: slope = # successful / # unsuccessful = odds)

| | Treatment A | Treatment B |
|---|---|---|
| Successful | 192 (73%) | 55 (69%) |
| Unsuccessful | 71 | 25 |

Combined

| | Treatment A | Treatment B |
|---|---|---|
| Successful | 81 + 192 = 273 | 289 |
| Unsuccessful | 6 + 71 = 77 | 61 |

Combined

| | Treatment A | Treatment B |
|---|---|---|
| Successful | 273 (78%) | 289 (83%) |
| Unsuccessful | 77 | 61 |

One Quantitative Variable
We’ll look at how to analyze a quantitative variable such as: times checking Facebook per day; average hours of sleep per night; average hours of exercise per week; GPA; average hours spent on extracurricular activities per week; number of piercings.

Dotplot: In a dotplot, each case is represented by a dot, and dots are stacked. It is an easy way to see each case. Example (plot): average number of times checking Facebook per day.

Histogram: The height of each bar corresponds to the number of cases within that range of the variable. R: hist(exercise)

Histogram: Although they look similar, a histogram is not the same as a bar plot. A bar plot is for categorical data, and the x-axis has no numeric scale. A histogram is for quantitative data, and the x-axis is numeric. For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed. For a quantitative variable, the number of bars in a histogram is up to you (or the software you use), and the appearance can differ with different numbers of bars.
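To see how the choice of bars changes a histogram, the following R sketch bins the same values two ways. The `exercise` values are hypothetical, invented for this illustration, and `plot = FALSE` is used only so the bin counts can be inspected without drawing.

```r
# Hypothetical hours of exercise per week (illustration only)
exercise <- c(0, 1, 2, 2, 3, 3, 4, 5, 5, 6, 7, 8, 10, 12)

# plot = FALSE returns the bins and counts; drop it to draw the histogram
wide   <- hist(exercise, breaks = seq(0, 12, by = 4), plot = FALSE)
narrow <- hist(exercise, breaks = seq(0, 12, by = 1), plot = FALSE)

wide$counts    # 3 bars
narrow$counts  # 12 bars, same data
```

Passing an explicit vector of break points fixes the bins exactly, whereas a single number given to `breaks` is treated by `hist()` only as a suggestion.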

Shape (plots): Symmetric; Right-Skewed (long right tail); Left-Skewed

Notation: The sample size, the number of cases in the sample, is denoted by $n$. We often let $x$ or $y$ stand for any variable, and $x_1, x_2, \ldots, x_n$ represent the $n$ values of the variable $x$. Example: $x$ = average hours of sleep; $x_1 = 5$, $x_2 = 9$, $x_3 = 7$, $x_4 = 7$, …

Mean: The sample mean is the average, computed by adding up all the numbers and dividing by the number of cases: $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}$. R: mean()

Median: The sample median is the middle value when the data are ordered.
If there is an even number of values, the median is the average of the two middle values. The sample median is denoted by m. R: median()
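A small R sketch of both measures of center, assuming four made-up sleep values (hypothetical, chosen so that the mean and median differ):

```r
# Hypothetical average hours of sleep for four students (illustration only)
sleep <- c(5, 9, 7, 8)

x_bar <- mean(sleep)   # (5 + 9 + 7 + 8) / 4
m <- median(sleep)     # ordered: 5, 7, 8, 9, so average the middle two: (7 + 8) / 2

x_bar
m
```

With an even number of values there is no single middle value, which is why `median()` averages the two middle ones here.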

Outliers: An outlier is a value that is notably different from the other values. Example (plot): hours spent on extracurricular activities per week.

Resistance: Statistics are resistant if they are not heavily affected by outliers. The median is resistant; the mean is not. Average hours of extracurricular activities per week: Mean Median With Outlier 59.6 6 Without Outlier 9.2

Outliers: When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake. If not, you have to decide whether the outlier is part of your population of interest or not. Usually, for outliers that are not a mistake, it’s best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results.
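Running the analysis twice might look like the R sketch below. The `hours` values are hypothetical (not the class data), with 150 standing in for the outlier, and dropping the single largest value is just one simple way to exclude it for this illustration.

```r
# Hypothetical weekly extracurricular hours; 150 plays the role of the outlier
hours <- c(2, 4, 5, 6, 6, 8, 10, 150)

# Second pass: same data with the single largest value removed
no_outlier <- hours[hours != max(hours)]

summary_tbl <- data.frame(
  data   = c("With outlier", "Without outlier"),
  mean   = c(mean(hours), mean(no_outlier)),
  median = c(median(hours), median(no_outlier))
)
print(summary_tbl)
```

In this toy example the mean changes substantially between the two runs while the median stays the same, which is what resistance means in practice.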

Groups: You will be put into groups of 4 or 5 based on common lab time and similar interests. These groups will be used for discussion and group activities in class, and will be your groups for the final project at the end of the course. They could be a natural study group outside of class as well, but only if you want them to be.

To Do Homework 1 (due Monday)