STAT131 Week 2 Lecture 1b Making Sense of Data


STAT131 Week 2 Lecture 1b Making Sense of Data Anne Porter

Review Learning and Writing what why how when Statistics is a study of variation throughout a process

Review Statistical Process Ethics The nature of the question to be answered Expertise Design Sampling Measurement Description and Analysis (Making sense of data) Conclusions & Decision Making

Where we do what!
In lectures the focus is: What are we doing? Why are we doing it? When do we do it?
In labs the focus is: How do we do it? Check definitions. Do by hand (simple) and in SPSS.
Making choices about what to use.
In this lecture we will explore a number of tools for exploring and displaying data. In doing this we may be highlighting variation in the data or eliminating too much variation. In the tutorial and laboratory classes you will be learning how to construct the various graphs and numerical summaries. In the learning context established in the first lecture this may be considered to be the 'how to do it' associated with the various techniques. However, the focus of this lecture will be using these tools to make sense of data. 'What are we doing?' and 'Why are we doing it?' are the foci. We begin today's lecture with a look at some data that has been collected on STAT131 students. But before we attempt to make sense of it, we have struck the need to transform the data to common units of measurement.

Making sense of raw data
A shoe seller sets up on campus and collects some data about what size shoes students wear. What do you see in this data?
This is a data set collected from STAT131 students. We collect some similar data this year. I collected data on a number of variables (height, shoe size, sex, and whether or not students had mobile phones). We are going to use this data to explore the possibility of setting up a shoe shop on campus. Let us first look at our data about shoe size and see what it conveys.

Making sense of raw data
What might we do to make sense of the shoe size data?

What might we do to make sense? Order the data Calculate the centre Mean average score Median middle score of ordered values Mode most common score Find the spread Range from minimum to maximum Look for outliers unusual values

Descriptive Statistics (mean, range)
What do these statistics tell us?
            N     Minimum   Maximum   Mean       Std. Deviation
SHOESIZE    150   4.0000    42.0000   9.816667   3.2291752
For many data sets the centre (mean) and spread (range) are the first things that seem to be of apparent interest. We can calculate them here with shoe size. I have used SPSS to do this. Why would the shopkeeper want to know the average shoe size? Is this what the shoe seller needs to know?

Descriptive Statistics (mean, range)
What do these statistics tell us?
There is an error in the data!
Minimum size 4, maximum 42
Average is 9.82
Range = maximum less minimum = 42 - 4 = 38
            N     Minimum   Maximum   Mean       Std. Deviation
SHOESIZE    150   4.0000    42.0000   9.816667   3.2291752
Is this what the shoe seller needs to know? No.

Five number summary
SHOESIZE
N Valid          150
Percentiles 25   8.000000
            50   9.500000
            75   11.000000
Five number summary:
Minimum
Maximum
Lower quartile or 25th percentile: the shoe size with 25% of shoe sizes below it
Median, 50th percentile or middle shoe size
Upper quartile or 75th percentile: the shoe size with 75% of shoe sizes below it (i.e. 25% above it)
The interquartile range = shoe size at the 75th percentile - shoe size at the 25th percentile
What is a percentile? How do you calculate quartiles? And is this what the shoe seller wants?

Five number summary
SHOESIZE
N Valid          150
Percentiles 25   8.000000
            50   9.500000
            75   11.000000
Five number summary:
Minimum = 4
Maximum = 42
Lower quartile, 25% of shoe sizes below it = 8
Median, 50% of shoe sizes below it = 9.5
Upper quartile, 75% of shoe sizes below it = 11
The interquartile range = 75th percentile - 25th percentile = 11 - 8 = 3 (50% of sizes lie between 8 and 11)
What is a percentile? Does the shoe seller have what is needed?

Percentiles - definition The kth percentile is a number that has k percent of the scores at or below it and (100-k)% above it The lower quartile has 25% of scores at or below that score

Quartiles
Q1 is the value of the (n+3)/4th observation, and Q3 is the value of the (3n+1)/4th observation, counting through the ordered data. Interpolate if necessary.
There are other approaches to calculating quartiles which may give different answers. If the answers are similar there is no problem.
The interquartile range = Q3 - Q1
If we have 17 heights, which observations do we need to get the upper and lower quartiles? Which observation will give the median?

Quartiles
So, in accord with Utts & Heckard:
Median: the middle score, the (17+1)/2 = 9th score, of 150 centimetres when the scores are ordered.
There are 9 scores at or below the median, so the median of these is the (9+1)/2 = 5th score, of 147 centimetres.
There are 9 scores at or above the median, so the 5th score, counting from the median, will give the upper quartile.
The upper quartile is? 166
The lower quartile is? 147
The interquartile range is? 166 - 147 = 19
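The position rule for quartiles can be sketched in Python as below. This is a minimal sketch, not the method any particular package uses: the function names are made up for illustration, and the 17 heights are invented so that the 5th, 9th and 13th ordered values agree with the 147, 150 and 166 centimetres quoted on the slide.

```python
def observation(sorted_values, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    lower = int(position)            # whole part of the position
    fraction = position - lower      # how far to move towards the next observation
    value = sorted_values[lower - 1]
    if fraction > 0:
        value += fraction * (sorted_values[lower] - value)
    return value


def quartiles(values):
    """Return (Q1, median, Q3) using the (n+3)/4, (n+1)/2 and (3n+1)/4 positions."""
    data = sorted(values)
    n = len(data)
    q1 = observation(data, (n + 3) / 4)
    median = observation(data, (n + 1) / 2)
    q3 = observation(data, (3 * n + 1) / 4)
    return q1, median, q3


# Hypothetical heights in centimetres (invented, not the class data)
heights = [140, 142, 145, 146, 147, 148, 149, 150, 150,
           152, 155, 160, 166, 168, 170, 172, 175]

q1, median, q3 = quartiles(heights)
print("Q1 =", q1, "median =", median, "Q3 =", q3, "IQR =", q3 - q1)
```

With n = 17 the positions are whole numbers (5, 9 and 13), so no interpolation is needed. With n = 4 the positions are 1.75, 2.5 and 3.25, and the function interpolates between neighbouring observations.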

What other statistics or graphs might inform the shoe-seller? Centre - mean, median Spread Maximum-Minimum = Range Upper Quartile-Lower quartile = Interquartile range 75th percentile-25th percentile= Interquartile range Outliers

Ordering the data Shoe Size 4 5 6 : 42 Ordering is often useful but we can do better This may give an indication of smallest, largest or .... but with a large data set there may be better information to gather.

Frequency or relative frequency table What is wrong with this display?

Frequency or relative frequency table What is wrong with this display? The data has been treated as if it were continuous. Some packages will do this but we want the data to be treated as discrete data

Frequency distribution (order plus count)
We still have an error (42), but we have the frequency (count) of each shoe size. What might be better for the shoe seller?
As the shopkeeper is interested in how many shoes to order, the data provided by a relative frequency histogram may be of most use. However, with the data being treated as continuous and sizes 4-7.5 included in the one bin, the information in the frequency table is not particularly useful.

Percentages of each size Why might this be useful rather than frequency?

Percentages of each size Why might this be useful rather than frequency? We only had a sample so this would suggest the percentage or even proportion of each size. Is this all the shoeseller needs?
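A frequency table with percentages can be built by hand, or sketched in a few lines of Python. The shoe sizes below are invented for illustration (they are not the STAT131 data), and `frequency_table` is a made-up helper name. Each distinct size is kept as its own category, so the data are treated as discrete.

```python
from collections import Counter

# Hypothetical shoe sizes, invented for illustration
shoe_sizes = [8, 9, 9.5, 9, 10, 8, 9, 11, 9.5, 10]


def frequency_table(values):
    """Return (value, count, percentage) rows ordered by value."""
    counts = Counter(values)
    total = len(values)
    return [(v, counts[v], 100 * counts[v] / total) for v in sorted(counts)]


rows = frequency_table(shoe_sizes)
print(" size count percent")
for size, count, percent in rows:
    print(f"{size:>5} {count:>5} {percent:>6.1f}%")
```

The percentage column is what carries over from a sample to an order for stock: it estimates the share of each size rather than the raw count in this particular sample.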

There are better ways of looking at distributions What else might we do?

Stem-and-leaf plot (with error)
Some packages (SPSS) cut off the outliers and list them as extremes. See if you can find a definition for an extreme as used in SPSS and for an outlier from the text. Different packages and different procedures may use different definitions, so check.
Not all outliers are errors. The error in this instance was that the measuring units were not the same for all students. In this instance it would be perfectly acceptable to convert the 42 shoe size to the English (or was it American?) shoe size. Perhaps there are many more errors in the data, although these are not detectable except by going back to the students and asking them to indicate which measuring system they were using. The stem-and-leaf plot with the error removed shows a bell-shaped distribution, with sizes 9 and 9.5 forming the modal stem. Half sizes are of course important for the shopkeeper, so more bins might be useful.
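To see how a stem-and-leaf plot is put together, here is a minimal Python sketch. It assumes the whole shoe size is the stem and the first decimal digit (0 or 5 for half sizes) is the leaf. That is one possible choice of stems, not necessarily the one SPSS makes, and the data are invented.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (whole part) to its sorted leaves (first decimal digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        tenths = round(value * 10)   # work in whole tenths to avoid float issues
        plot[tenths // 10].append(tenths % 10)
    return dict(plot)


# Hypothetical shoe sizes, invented for illustration
shoe_sizes = [8, 9, 9.5, 9, 10, 8.5, 9, 11, 9.5, 10]

plot = stem_and_leaf(shoe_sizes)
for stem, leaves in plot.items():
    print(f"{stem:>3} | {''.join(str(leaf) for leaf in leaves)}")
```

Each row keeps the individual values visible (unlike a histogram), and the length of the row shows the frequency for that stem.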

What do we do with outliers?

What do we do with outliers?
Know the context to see what values are possible.
Check the original data to see if it is a data entry error.
See if it is in different units and transform to the appropriate unit.
If it is an error and you do not know what it should be, delete it and make a note.
If there is no reason to conclude it is an error, leave it in.
Sometimes analyse with the point in and with the point out of the data set.
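Analysing with the point in and the point out can be sketched as follows. The sizes are invented, with 42 standing in for the suspect value, and `summarise` is a hypothetical helper written for this illustration.

```python
from statistics import mean, median


def summarise(values):
    """Small summary used to compare the data with and without a suspect point."""
    return {"n": len(values), "mean": mean(values), "median": median(values)}


# Hypothetical shoe sizes; 42 plays the role of the suspect value
sizes = [8, 9, 9.5, 9, 10, 8.5, 42]
suspect = 42

with_point = summarise(sizes)
without_point = summarise([s for s in sizes if s != suspect])

print("with point:   ", with_point)
print("without point:", without_point)
```

If the two summaries lead to the same conclusion, the suspect point matters little. If they differ, both results and the reason for any deletion should be noted.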

Stem-and-leaf plot (42 removed) What does it reveal? Could it be better?

Stem-and-leaf plot (42 removed) What does it reveal? Could it be better? Change stems to focus on whole and half sizes. We should have transformed the 42. This is the difference between a lecture and data analysis, I deleted!

Stem-and-leaf with different stems What do we notice now? Do we have what the shoe seller needs?

Stem-and-leaf with different stems What do we notice now? There is a distribution within a distribution with fewer half sizes Do we have what the shoe seller needs? We need male and female data (Next lecture)

Graphical Excellence
Convey the message about the data
Include axes, units, variable names, figure labels
DO NOT:
Distort the data
Use pie charts (there is always a better chart)
Use more dimensions than necessary, 3D instead of 2D
Use unnecessary pattern, fill, ink, decoration

To reveal Centre Spread Outliers Distribution Patterns Anything unusual Comparisons (next lecture) And more But there are choices to be made

Centre
Mean
Median: FIRST arrange the sample values from smallest to largest.
n odd: the median of 8, 7, 9 is the middle of the ordered scores, 8.
n even: the median of 4, 7, 8, 9 is (7+8)/2 = 7.5.
Mode: the most common score in the data set, e.g. for 1, 2, 3, 3, 4, 5, 6 the mode is 3.
Trimmed mean: e.g. in diving at the Olympics, the average of the judges' scores after having tossed out the highest and the lowest scores.
The centre might be a way of thinking of the typical value in a data set: the average number of emails, the most common number of emails, the middle number of emails received on a day, or, as in the Olympics when judging the diving, the mean calculated after the top x% and bottom x% of scores have been removed.
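The small examples on this slide can be checked with Python's standard `statistics` module. The trimmed mean is not in that module, so `trimmed_mean` below is a hypothetical helper that drops a fixed number of scores from each end. The judges' scores are invented.

```python
from statistics import mean, median, mode


def trimmed_mean(values, drop=1):
    """Mean after discarding the `drop` lowest and `drop` highest values."""
    data = sorted(values)
    if len(data) <= 2 * drop:
        raise ValueError("not enough values left after trimming")
    return mean(data[drop:len(data) - drop])


print(median([8, 7, 9]))             # n odd: middle of the ordered scores
print(median([4, 7, 8, 9]))          # n even: average of the two middle scores
print(mode([1, 2, 3, 3, 4, 5, 6]))   # most common score

# Hypothetical judges' scores for one dive
judges = [7.0, 7.5, 8.0, 8.5, 9.5]
print(trimmed_mean(judges))          # mean of 7.5, 8.0, 8.5
```

Note that `median` sorts the values itself, so the "FIRST arrange the values" step happens inside the function.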

Question: mean vs median Data A: 60, 2, 3, 5 Data B: 6, 2, 3, 5 Mean A = 17.5 Mean B = 4 Median A = 4 Median B = 4 Which measure best typifies the data A? Why? Which measure best typifies the data set B? Why?

Question: mean vs median
Data A: 60, 2, 3, 5
Data B: 6, 2, 3, 5
Mean A = 17.5, Mean B = 4
Median A = 4, Median B = 4
Which measure best typifies data set A? Why? Which measure best typifies data set B? Why?
For A, the outlier 60 suggests the median (4), as the mean (17.5) is dragged up by the outlier 60.
For B, both are the same. The median (4) uses only the 2 middle points; the mean (4) uses all the data.
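The figures for Data A and Data B are easy to verify. A minimal check with Python's standard `statistics` module:

```python
from statistics import mean, median

data_a = [60, 2, 3, 5]
data_b = [6, 2, 3, 5]

for name, data in (("A", data_a), ("B", data_b)):
    # Ordered, both sets have 3 and 5 in the middle, so both medians are (3+5)/2
    print(f"Data {name}: mean = {mean(data)}, median = {median(data)}")
```

Changing one value from 6 to 60 moves the mean from 4 to 17.5 but leaves the median untouched, which is what "robust" means here.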

Question: mean vs median In what sense are the mean and median the same? In what sense are the mean and median different?

Question: mean vs median
In what sense are the mean and median the same? In what sense are the mean and median different?
They are both measures of the centre.
They may give different numerical values, and for different data sets one may be better as a measure than the other, or both may be required.

Making Choices between mean & median
The mean uses all the information in the sample, because each value is added in the sum.
The mean is subject to error if spurious values are entered.
The median is less affected by “wild” values; we say it is robust.
If the mean is similar to the median, use the mean as it uses all the data. It is often easier to work with the mean.
If they are different because of a non-symmetric distribution, it can be useful to report both.
In real data analysis, rather than the simple calculation of formulae, the context of what the data are used for may also determine what is an appropriate measure of the centre of data. If five people are measuring the weight of an object and one value is spurious, then the median, or an average based on the data set with the spurious value discarded, might be best.

Measures of Spread
Range = maximum value - minimum value
Interquartile range = upper quartile - lower quartile = Q3 - Q1
Sums of squares
Variance (s^2)
Standard deviation
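How the sum of squares, variance and standard deviation build on one another can be sketched as below. The sketch assumes the usual sample variance with divisor n - 1 (the slide does not state the divisor), the data are invented, and the helper names are made up for illustration.

```python
import math
from statistics import mean


def sum_of_squares(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values)


def spread_measures(values):
    """Range, sum of squares, sample variance (divisor n - 1) and standard deviation."""
    n = len(values)
    ss = sum_of_squares(values)
    var = ss / (n - 1)               # assumption: sample variance
    return {
        "range": max(values) - min(values),
        "sum_of_squares": ss,
        "variance": var,
        "std_dev": math.sqrt(var),
    }


# Hypothetical data, invented for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
measures = spread_measures(data)
print(measures)
```

The sum of squares keeps growing as data points are added. Dividing by n - 1 turns it into an average squared deviation, and the square root brings it back to the original units.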

Use of standard deviation
The mean and standard deviation give information about where most of the distribution of values is to be found.
For many distributions, the range from mean - 2 standard deviations to mean + 2 standard deviations (mean ± 2SD) contains approximately 95% of the distribution. (The very least that this spread can contain is 75% of the distribution.)
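The 75% floor comes from Chebyshev's inequality: at least 1 - 1/k^2 of any distribution lies within k standard deviations of its mean, and k = 2 gives 3/4. The sketch below counts the fraction of a data set inside mean ± 2SD. It uses the population standard deviation (divisor n) so that the bound applies directly to the data, the data are invented, and `fraction_within` is a hypothetical helper.

```python
from statistics import mean, pstdev


def fraction_within(values, k=2):
    """Fraction of values lying within k standard deviations of the mean."""
    m = mean(values)
    sd = pstdev(values)              # population SD, divisor n
    inside = [x for x in values if abs(x - m) <= k * sd]
    return len(inside) / len(values)


# Hypothetical data sets, invented for illustration
symmetric = [2, 4, 4, 4, 5, 5, 7, 9]
with_outlier = [8, 9, 9.5, 9, 10, 8.5, 42]

print(fraction_within(symmetric))      # everything falls inside mean ± 2SD
print(fraction_within(with_outlier))   # the 42 falls outside, 6 of 7 remain
```

Even the data set with the wild value stays above the 75% floor. The 95% figure is only a rule of thumb for roughly bell-shaped distributions.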

Criteria for a good measure of spread
Whatever measure of variability (or spread) is used:
It should not be affected by adding a constant to each value so as to change the centre (or location).
If there is spread in the data, it should indicate this.
It should make sense in the context used.
It should be robust, not influenced by outliers or extreme points.
Rather than simply learn the definitions of these different measures of spread, we are going to work through the sort of logic that is needed to develop such a statistic. We will start with the first two criteria specified on the slide.
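The first criterion can be checked directly: adding a constant moves the centre but should leave the spread alone. A minimal sketch with invented data, using the range and the sample standard deviation as the measures of spread:

```python
from statistics import mean, stdev

# Hypothetical data, invented for illustration
data = [2, 4, 4, 4, 5, 5, 7, 9]
shifted = [x + 100 for x in data]    # add a constant to every value

range_before = max(data) - min(data)
range_after = max(shifted) - min(shifted)

print("mean: ", mean(data), "->", mean(shifted))      # the centre moves by 100
print("range:", range_before, "->", range_after)      # the range is unchanged
print("SD:   ", stdev(data), "->", stdev(shifted))    # the SD is unchanged
```

Both measures pass this test. The range fails the robustness criterion, though, since it depends only on the two most extreme scores.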

Undesirable features of measures of spread
Sensitive to outliers
Does not use all the data, e.g. the range is based on only two scores
Difficult to understand, e.g. the sum of squares in this context, as the answer is very big and gets bigger with every additional data point (but it is useful in other contexts)

Revealing distributions Frequency Distribution Table Stem-and-Leaf Histograms Box-and-whiskers

Box-and-Whiskers plots Often just called box plots, they give a pictorial summary of the data for a single variable. They use the five-number summary: minimum value, Q1, median, Q3, maximum value

Example: If minimum = 3, Q1 = 6, median = 10, Q3 = 12, maximum = 16, the box plot would look like:
[Figure: horizontal box plot drawn above a scale marked 2, 4, 6, 8, 10, 12, 14, 16]
You must draw a scale for the box plot.

In a horizontal box plot, a horizontal axis shows the scale. The box’s left and right boundaries are Q1 and Q3, and an inner line shows the median. Whiskers are drawn outwards from the box to the minimum and maximum values. Often the sample mean is also shown.
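The construction just described can be mimicked in plain text. This is only a rough sketch to show where each of the five numbers goes: it assumes non-negative values, one character column per unit of the scale, and a made-up function name. A statistics package would draw a proper figure with an axis.

```python
def text_box_plot(minimum, q1, median, q3, maximum, scale=1):
    """One-line horizontal box plot; column index = round(value * scale)."""
    def col(value):
        return int(round(value * scale))

    line = [" "] * (col(maximum) + 1)
    for c in range(col(minimum), col(maximum) + 1):
        line[c] = "-"                # whiskers from minimum to maximum
    for c in range(col(q1), col(q3) + 1):
        line[c] = "="                # box from Q1 to Q3
    line[col(minimum)] = "|"         # whisker ends
    line[col(maximum)] = "|"
    line[col(q1)] = "["              # box boundaries
    line[col(q3)] = "]"
    line[col(median)] = "M"          # inner line for the median
    return "".join(line)


# Five-number summary from the example: min 3, Q1 6, median 10, Q3 12, max 16
plot_line = text_box_plot(3, 6, 10, 12, 16)
print(plot_line)
print("".join(str(i % 10) for i in range(17)))   # crude scale: last digit of 0..16
```

The median sits nearer Q3 than Q1 and the left whisker is shorter than the right one, so the picture shows the shape of the distribution as well as its centre and spread.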

What values give rise to the box plot below?
If minimum = , Q1 = , median = , Q3 = , maximum = , the box plot would look like:
[Figure: horizontal box plot drawn above a scale marked 2, 4, 6, 8, 10, 12, 14, 16]
You must draw a scale for the box plot.

What do you want to see in data? Information Meaning We must turn data into information in order to have meaning

What can we see in data? Location (centre) Spread Shape Outliers Unusual patterns Gaps, clusters How do batches differ

Tools for making meaning from data Ordering data Dot plots & jittered dot plots Stem-and-leaf plots Histograms, Boxplots, Bar charts Pie charts Frequency tables Numerical summaries

Selecting the tool depends on The question asked How the variable is measured The structure of the data Utility of the tool More in the next lecture and labs

Homework
Textbook reading: Utts & Heckard (2004), Chapter 2
Or textbook reading: Moore and McCabe, pp. 38-55
Textbook reading: Griffiths, Stirling and Weldon (1998), Chapters 1, 2, 6 (pp. )
Complete the lab and the preparation for next week's lab.