STAT 101 Exploratory Data Analysis I 1/25/12 One Categorical Variable Two Categorical Variables One Quantitative Variable – Center Section 2.1, 2.2 Professor.

1 STAT 101 Exploratory Data Analysis I 1/25/12 One Categorical Variable Two Categorical Variables One Quantitative Variable – Center Section 2.1, 2.2 Professor Kari Lock Morgan Duke University

2 Announcements Textbooks are here! My office hours: (Old Chemistry 216) – Wednesday 3-5 pm – Friday 1-3pm Lecture slides, assignments, labs, etc. will be posted at Complete lecture slides to be posted after each class

3 The Big Picture Population Sample Sampling Statistical Inference Exploratory Data Analysis

4 Class Survey Data Data from both STAT 101 classes and STAT 10

5 Data In order to make sense of this data, we need ways to summarize and visualize it. Summarizing and visualizing variables and relationships between two variables is often known as exploratory data analysis (also known as descriptive statistics). The type of summary statistics and visualization methods depends on the type of variable(s) being analyzed (categorical or quantitative).

6 One Categorical Variable Display the number or proportion of cases that fall in each category “What is your favorite day of the week?”

7 Frequency Table A frequency table shows the number of cases that fall in each category (here: Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday). R: table(fav_day)

8 Proportion The sample proportion of students in each category is: proportion = (number in that category) / (total sample size)

9 Proportion The sample proportion of students in this class who prefer Friday is the number who chose Friday divided by the number of students surveyed. Proportion and percent can be used interchangeably: 0.51 or 51%

10 Relative Frequency Table A relative frequency table shows the proportion of cases that fall in each category (Monday through Sunday). All the numbers in a relative frequency table sum to 1. R: round(table(fav_day)/209,3)
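
A minimal R sketch of the frequency and relative frequency tables described above. The survey responses are not reproduced in this transcript, so the `fav_day` vector below is made-up illustration data, not the class data.

```r
# Hypothetical responses (NOT the real class survey data).
fav_day <- c("Friday", "Saturday", "Friday", "Sunday", "Friday",
             "Saturday", "Friday", "Thursday", "Saturday", "Friday")

counts <- table(fav_day)                # frequency table: cases per category
rel_freq <- counts / length(fav_day)    # relative frequency table: proportions

print(counts)
print(round(rel_freq, 3))
print(sum(rel_freq))                    # proportions sum to 1
```

Dividing by `length(fav_day)` avoids hard-coding the sample size (209 on the slide); `prop.table(counts)` gives the same proportions.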

11 Bar Chart/Plot/Graph In a barplot, the height of the bar corresponds to the number of cases falling in each category R: barplot(table(fav_day))

12 Pie Chart In a pie chart, the relative area of each slice of the pie corresponds to the proportion in each category R: pie(table(fav_day))

13 Summary: One Categorical Variable Summary Statistics – Proportion – Frequency table – Relative frequency table Visualization – Barplot – Pie chart

14 Two Categorical Variables Look at the relationship between two categorical variables: 1. Relationship status 2. Gender

15 Two-Way Table
                          Female  Male  Total
In a Relationship             42    18     60
Single                        98    45    143
It’s Complicated / Other      11     1     12
Total                        151    64    215
It doesn’t matter which variable is displayed in the rows and which in the columns. R: table(gender, relationship)

16 Two-Way Table What proportion of females in intro stat are in a relationship?
                          Female  Male  Total
In a Relationship             42    18     60
Single                        98    45    143
It’s Complicated / Other      11     1     12
Total                        151    64    215
a) 42/60 b) 42/151 c) 42/215 d) 151/215 e) 60/215

17 Two-Way Table What proportion of intro stat students in a relationship are female?
                          Female  Male  Total
In a Relationship             42    18     60
Single                        98    45    143
It’s Complicated / Other      11     1     12
Total                        151    64    215
a) 42/60 b) 42/151 c) 42/215 d) 151/215 e) 60/215

18 Two-Way Table CAUTION: The proportion of females in a relationship is NOT THE SAME AS the proportion of people in a relationship who are female!

19 Two-Way Table What proportion of intro stat students are in a relationship and female?
                          Female  Male  Total
In a Relationship             42    18     60
Single                        98    45    143
It’s Complicated / Other      11     1     12
Total                        151    64    215
a) 42/60 b) 42/151 c) 42/215 d) 151/215 e) 60/215
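
The three questions above differ only in the denominator, which `prop.table()` makes explicit through its `margin` argument. The sketch below assumes cell counts inferred from the answer choices on the slides (42, 60, 151, 215) and from the surviving "It’s Complicated / Other" row; the remaining cells follow by subtraction and are an assumption, not data quoted from the slides.

```r
# Cell counts inferred from the slides' answer choices (an assumption).
tab <- as.table(matrix(
  c(42, 98, 11,    # Female: in a relationship, single, complicated/other
    18, 45,  1),   # Male:   same order
  nrow = 3,
  dimnames = list(
    relationship = c("In a Relationship", "Single", "Complicated/Other"),
    gender = c("Female", "Male")
  )
))

with_totals <- addmargins(tab)             # append row and column totals
by_gender   <- prop.table(tab, margin = 2) # proportions within each gender
by_status   <- prop.table(tab, margin = 1) # proportions within each status
overall     <- prop.table(tab)             # proportions of all students

print(with_totals)
print(by_gender["In a Relationship", "Female"])  # females who are in a relationship: 42/151
print(by_status["In a Relationship", "Female"])  # in a relationship who are female: 42/60
print(overall["In a Relationship", "Female"])    # in a relationship AND female: 42/215
```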

20 Side-by-Side Bar Chart The height of each bar is the count in the corresponding cell of the two-way table. R: colors = c("pink", "blue"); barplot(table(gender, relationship), beside=TRUE, col=colors, legend=TRUE)

21 Side-by-Side Bar Chart R: colors = c("red", "green", "blue"); barplot(table(relationship, gender), beside=TRUE, col=colors, legend=TRUE)

22 Segmented Bar Chart A segmented bar chart is like a side-by-side bar chart, but the bars are stacked instead of side-by-side. R: barplot(table(relationship, gender), legend=TRUE, col=c("red", "green", "blue"))

23 Mosaic Plot The width of each column corresponds to the proportion of cases in that column category, and each column is shaded according to the proportions of the row variable within that column category. R: mosaicplot(table(Music, Gender), col=c("pink", "blue"))

24 Mosaic Plot R: colors = c("red", "green", "blue"); mosaicplot(table(gender, relationship), col=colors, legend=TRUE, cex.axis=.7, main="")

25 Mosaic Plot This tells us… a) Most people who are in favor of the new housing model are in (or plan to be in) a selected living group b) Most people who are in (or plan to be in) a selected living group are in favor of the new housing model c) Both (a) and (b) d) Neither (a) nor (b)

26 Difference in Proportions A difference in proportions compares the proportion for one categorical variable (e.g. the proportion for whom “it’s complicated”) across different levels of the other categorical variable (e.g. gender)

27 Two-Way Table What is the difference in proportions
                          Female  Male  Total
In a Relationship             42    18     60
Single                        98    45    143
It’s Complicated / Other      11     1     12
Total                        151    64    215
a) 0.833 b) 0.066 c) –0.003 d) 0.057 e) 11/151 – 1/64
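
As a worked check of the difference in proportions for “It’s Complicated / Other” between females and males, here is a short R sketch. The counts (11 of 151 females, 1 of 64 males) are inferred from the slide's answer choices and its partially garbled table, so treat them as an assumption.

```r
# Counts inferred from the slide's answer choices (an assumption).
complicated <- c(Female = 11, Male = 1)    # "It's Complicated / Other" counts
totals      <- c(Female = 151, Male = 64)  # column totals by gender

p <- complicated / totals                  # proportion within each gender
diff_prop <- p[["Female"]] - p[["Male"]]   # female proportion minus male proportion

print(round(p, 3))
print(round(diff_prop, 3))                 # 0.057
```

The order of subtraction determines the sign, so it is worth stating which group comes first when reporting a difference in proportions.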

28 Summary: Two Categorical Variables Summary Statistics – Two-way table – Difference in proportions Visualization – Side-by-side bar chart – Segmented bar chart – Mosaic plot

29 Kidney Stones
              Success  Failure
Treatment A       273       77
Treatment B       289       61
Which treatment is better at removing kidney stones? (a) Treatment A (b) Treatment B
Source: R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (1986). "Comparison of treatment of renal calculi by open surgery, percutaneous nephrolithotomy, and extracorporeal shockwave lithotripsy". Br Med J (Clin Res Ed) 292 (6524): 879–882

30 Kidney Stones
Small Stones  Success  Failure
Treatment A        81        6
Treatment B       234       36
Which treatment is better at removing small kidney stones? (a) Treatment A (b) Treatment B

31 Kidney Stones
Large Stones  Success  Failure
Treatment A       192       71
Treatment B        55       25
Which treatment is better at removing large kidney stones? (a) Treatment A (b) Treatment B

32 Kidney Stones Treatment A is more effective for both small and large kidney stones, but the data show Treatment B to be more effective overall! How is this possible!?!?

33 Kidney Stones
Large Stones  Success  Failure
Treatment A       192       71
Treatment B        55       25

Small Stones  Success  Failure
Treatment A        81        6
Treatment B       234       36

ALL STONES    Success  Failure
Treatment A       273       77
Treatment B       289       61

34 Kidney Stones Treatment A is used more often on large stones, which are harder to treat. This is an example of Simpson’s Paradox: an observed relationship between two variables can change (or even reverse!) when a third variable is considered
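
The reversal can be checked directly from the kidney stone counts on the slides. The R sketch below is one way to lay out the arithmetic; the object names are illustrative only.

```r
# Success and failure counts from the slides, by treatment (rows) and stone size (columns).
success <- matrix(c(81, 234, 192, 55), nrow = 2,
                  dimnames = list(treatment = c("A", "B"), size = c("Small", "Large")))
failure <- matrix(c(6, 36, 71, 25), nrow = 2,
                  dimnames = list(treatment = c("A", "B"), size = c("Small", "Large")))

total <- success + failure
rate_by_size <- success / total                     # success rate within each stone size
rate_overall <- rowSums(success) / rowSums(total)   # success rate ignoring stone size
share_large  <- total[, "Large"] / rowSums(total)   # how often each treatment faced large stones

print(round(rate_by_size, 3))   # A is higher in both columns
print(round(rate_overall, 3))   # B is higher overall
print(round(share_large, 3))    # A is used far more often on large stones
```

The last line shows the third variable at work: most of Treatment A's cases are large stones, which have lower success rates under either treatment.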

35

36 Small Stones
                Treatment A   Treatment B
Successful         81 (93%)     234 (87%)
Unsuccessful              6            36
Slope = # successful / # unsuccessful = odds

37 Large Stones
                Treatment A   Treatment B
Successful        192 (73%)      55 (69%)
Unsuccessful             71            25
Slope = # successful / # unsuccessful = odds

38 Combined
                  Treatment A   Treatment B
Successful       81+192 = 273           289
Unsuccessful        6+71 = 77            61

39 Combined
                Treatment A   Treatment B
Successful        273 (78%)     289 (83%)
Unsuccessful             77            61

40 Combined
                Treatment A   Treatment B
Successful        273 (78%)     289 (83%)
Unsuccessful             77            61
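
The slides define the slope as # successful / # unsuccessful, i.e. the odds of success. A small R sketch using the counts above shows the same reversal on the odds scale; the helper name `odds` is hypothetical.

```r
# Hypothetical helper: odds of success as defined on the slides.
odds <- function(successful, unsuccessful) successful / unsuccessful

small    <- odds(c(A = 81,  B = 234), c(A = 6,  B = 36))
large    <- odds(c(A = 192, B = 55),  c(A = 71, B = 25))
combined <- odds(c(A = 81 + 192, B = 234 + 55), c(A = 6 + 71, B = 36 + 25))

# One row per stone group, one column per treatment.
print(round(rbind(small, large, combined), 2))
```

Treatment A has the larger odds within each stone size, yet the smaller odds once the two groups are added together.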

41

42 One Quantitative Variable We’ll look at how to analyze a quantitative variable such as – Times checking Facebook per day – Average hours of sleep per night – Average hours of exercise per week – GPA – Average hours spent on extracurricular activities per week – Number of piercings

43 Dotplot In a dotplot, each case is represented by a dot and dots are stacked. Easy way to see each case Average number of times checking Facebook per day

44 Histogram The height of each bar corresponds to the number of cases within that range of the variable. R: hist(exercise)

45 Histogram Although they look similar, a histogram is not the same as a bar plot. A bar plot is for categorical data, and the x-axis has no numeric scale. A histogram is for quantitative data, and the x-axis is numeric. For a categorical variable, the number of bars equals the number of categories, and the number in each category is fixed. For a quantitative variable, the number of bars in a histogram is up to you (or the software you use), and the appearance can differ with different numbers of bars.
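
To illustrate how the choice of bars changes a histogram, the sketch below bins the same data two ways. The `exercise` values are made up for illustration (the class data are not in this transcript), and `plot = FALSE` is used only so the bin counts can be inspected; drop it to draw the plots.

```r
# Hypothetical weekly exercise hours (NOT the real class data).
exercise <- c(0, 1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 6, 7, 8, 10, 12, 14, 20, 30)

wide   <- hist(exercise, breaks = seq(0, 30, by = 10), plot = FALSE)  # 3 wide bars
narrow <- hist(exercise, breaks = seq(0, 30, by = 2),  plot = FALSE)  # 15 narrow bars

print(wide$counts)     # cases per bar with wide bins
print(narrow$counts)   # cases per bar with narrow bins
```

Both versions account for every case; only the grouping changes. Passing a single number to `breaks` is treated by `hist()` as a suggestion, so explicit break points are used here to keep the bin edges predictable.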

46 Shape Symmetric, Left-Skewed, Right-Skewed (long right tail)

47 Notation The sample size, the number of cases in the sample, is denoted by n. We often let x or y stand for any variable, and x_1, x_2, …, x_n represent the n values of the variable x. Example: x = Average hours of sleep; x_1 = 5, x_2 = 9, x_3 = 7, x_4 = 7, …

48 Mean The sample mean is the average, and is computed by adding up all the numbers and dividing by the number of cases. R: mean()

49 Median The sample median is the middle value when the data are ordered. If there is an even number of values, the median is the average of the two middle values. The sample median is denoted as m. R: median()
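
A brief R sketch of both definitions. The first vector uses the four sleep values listed on the Notation slide (the slide's list continues, so this is only a partial illustration); the second vector is made up to show the even-length median rule.

```r
# First four values from the Notation slide; the full data set is not shown there.
x <- c(5, 9, 7, 7)
mean_x <- sum(x) / length(x)   # add up all the numbers, divide by the number of cases
print(mean_x == mean(x))       # the built-in mean() agrees

# Hypothetical values with an even number of cases.
y <- c(4, 6, 9, 10)
sorted_y <- sort(y)
median_y <- (sorted_y[2] + sorted_y[3]) / 2   # average of the two middle values
print(median_y == median(y))                  # the built-in median() agrees
```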

50 Outliers An outlier is a value that is notably different from the other values. (Plot: hours spent on extracurricular activities per week)

51 Resistance Statistics are resistant if they are not heavily affected by outliers The median is resistant, the mean is not Average hours of extracurricular activities per week: MeanMedian With Outlier Without Outlier 9.26
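
The contrast between a resistant and a non-resistant statistic is easy to see numerically. The hours below are invented for illustration (they are not the class survey values); the point is only how each statistic reacts when one extreme value is dropped.

```r
# Hypothetical weekly extracurricular hours with one extreme value (60).
hours <- c(2, 4, 5, 6, 6, 8, 10, 60)
no_outlier <- hours[hours != 60]

summary_table <- rbind(
  "With Outlier"    = c(Mean = mean(hours),      Median = median(hours)),
  "Without Outlier" = c(Mean = mean(no_outlier), Median = median(no_outlier))
)
print(round(summary_table, 2))
```

In this made-up example the mean drops by more than half when the outlier is removed, while the median does not move.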

52 Outliers When using statistics that are not resistant to outliers, stop and think about whether the outlier is a mistake. If not, you have to decide whether the outlier is part of your population of interest or not. Usually, for outliers that are not a mistake, it’s best to run the analysis twice, once with the outlier(s) and once without, to see how much the outlier(s) are affecting the results.

53 Groups You will be put into groups of 4 or 5 based on common lab time and similar interests. These groups will be used for discussion and group activities in class, and will be your groups for the final project at the end of the course. They could be a natural study group for outside of class as well, but only if you want them to be.

54 To Do Homework 1 (due Monday). Buy a clicker! (clicker grading starts on Monday)

