# Probability and Statistics

Representation of Data Measures of Center for Data Simple Analysis of Data

Overview In this module you’ll be learning about the basics of statistics: Statistical Displays – Data can be displayed graphically in different ways. You will learn how to choose displays by the type of data and the message to be delivered to the audience. Some “do not” examples will also be covered. Measures of Center – A single number is commonly used to describe an entire set of data. You will explore the different types of “averages” and learn why you might choose one over another. Analysis – This module covers the simple analysis of the data. You will look at what information can be obtained from the data and how to make comparisons of various data sets.

Topics An Introduction to statistics Types of data Displaying data
How NOT to display data Arithmetic Mean Median Mode Weighted Mean Types of distributions Measures of center vs. variation

Statistics: An Introduction
Introduction to Statistics Statistics: An Introduction The first page at this site gives an explanation of how statistics is used. Clicking on the “Continue” link on the bottom right of the page will take you to the section on “Revealing Patterns Using Descriptive Statistics”. It may be worthwhile to read the first page of this section to review some of the common terms/vocabulary used in describing data. Return to this presentation when you are ready. (On the right of the web page, you will notice additional information that is beyond the scope of this module. Feel free to come back at a later time to explore further.) Statistics is the set of mathematical tools for collecting, organizing and analyzing data; and then interpreting the information to make decisions

Displaying Data

Types of Data What is Data?
Read the text on the web site. Answer the nine questions at the bottom of the page to check your understanding of the topic. Return to this presentation when finished. Data can be qualitative, describing distinct categories, or quantitative, describing numerical counts or measurements. Qualitative data can be nominal, where no natural order exists between the categories, or ordinal, meaning an order does exist. Quantitative data can be divided into continuous, when data are values within a range, and discrete, when the data take only separate values, typically integer counts.

Another explanation of Types of Data
Types of Data (cont.) Another explanation of Types of Data If you are still unsure about recognizing qualitative and quantitative data, click on the link above to review how to distinguish between these two variables. When you have completed the “Progress check” at the bottom of the web page, return to this presentation. You should now be able to classify data as qualitative nominal, qualitative ordinal, quantitative discrete or quantitative continuous, and are ready to explore how to display data. Continue to the next slide!

Types of Graphs Common Graphs
Visit the above website for a brief description and representation of ten of the most common graphs. You should have a basic idea of the types of graphs that can be created to display data. As you move on through the slides, you will learn how to create these graphs and how to determine which graph gives the best representation of the data you want to display.

Types of Graphs (cont.) Bar graph Histogram Pie graph Line graph
Math Dictionary This web site is home to a mathematics dictionary. It has examples to the graphs listed below. Return to this presentation when finished. There are various ways to display your data. The differences arise from the type of data and the information and/or message you want to deliver. Following is a list of the more traditional types of graphs. Bar graph Histogram Pie graph Line graph Box plot Scatter plot Line plot/Dot plot Pictograph Stem & Leaf plot

Bar Graphs Bar Graphs Click on the link above to learn how to create a bar graph. After reading the information on bar graphs, answer the ten questions in the “Your Turn” section at the bottom of the page. One way to graphically represent data is by using a bar graph/chart. What type of data is best represented by a bar graph? What information about a data set should you be able to interpret from a bar graph?

Histograms Histograms
Read about histograms by following the link above. Check your understanding by answering the ten questions in the “Your turn” section at the bottom of the web page. Histograms can be used to represent continuous data. You should be able to identify data that is continuous and be able to create a histogram to represent that data.

Create a Histogram (video)
Histograms Create a Histogram (video) This video demonstrates how to take a data set and create a histogram. Histograms are best used when the data variable on the x-axis is quantitative. The bars most often represent a range of values. Each bar could also represent an individual value. In this case, the histogram would more accurately be called a frequency distribution graph.

A Histogram is NOT a Bar Chart
Histograms (cont.) A Histogram is NOT a Bar Chart It is important to distinguish the difference between a histogram and a bar chart. This is the first of several sites that will help you determine when to display data as one or the other. Read the information on the first page and then return to this presentation. Histograms and bar charts can look similar even though they display very different representations of the data. After reading the information on the web page linked above, you should be able to identify three differences between histograms and bar charts.

Bar Charts and Histograms (includes a video)
Histograms (cont.) Bar Charts and Histograms (includes a video) This webpage has additional information on when to use a bar chart or histogram to display your data. You can also view the video which shows how to create a bar chart and histogram.

Histograms vs. Bar Graphs
Histograms (cont.) Histograms vs. Bar Graphs Click on the link above to read more about the differences between histograms and bar charts. The information is set up as a conversation between a teacher and student reasoning through how each graph can be used to display specific types of data. When you have finished reading the discussion, please return to this presentation. Can you answer the following questions? What type of data would be best represented by a histogram? What information should you be able to identify when data is represented in a bar graph?

Pie Charts Pie Charts Read about how to create a pie chart and what type of data displays best in this format. Be sure to complete the questions at the bottom of the web page in the “Your turn” section and then return to this presentation! Pie charts represent data as a part-to-whole relationship.

Pie Charts (cont.) Pie Charts
This site looks at how NOT to use pie charts, along with showing many examples found in the news, in business reports and other media. You should be able to answer the following questions regarding pie charts: What is the best type of data to represent graphically in a pie chart? When interpreting information from a pie chart, what are three areas you should pay attention to in the representation?

Scatterplots What is a Scatterplot?
This site will introduce you to scatterplots. Click the blue “View Video” button to see how to make and read scatterplots. Once you have watched the video and read through the information on this webpage, return to this presentation. Main points: A scatterplot is used to graph the relationship between two quantitative variables or bivariate data; Scatterplots may show patterns – weak or strong, positive or negative correlations; Correlation does not indicate cause and effect.

Scatterplots and Correlation
Scatterplots (cont.) Scatterplots and Correlation This site presents another view of scatterplots and correlation. After reading this information, answer the nine questions in the “Your turn” section at the bottom of the page. Explore further… At the above website, under the correlation graphs, is a link More About Correlation. Here you will see how correlation is calculated. In most cases, you will use a calculator or software function for this; however, it’s beneficial to know how the correlation coefficient is derived.
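
If you would like to see the calculation spelled out, here is a minimal Python sketch of the Pearson correlation coefficient. The function name `pearson_r` and the data values are made up for illustration; they do not come from the linked site.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of the deviations from each mean
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]          # hypothetical hours studied
scores = [52, 60, 61, 70, 77]    # hypothetical test scores
r = pearson_r(hours, scores)
print(round(r, 3))  # close to +1, a strong positive correlation
```

A value near +1 or -1 indicates a strong linear pattern and a value near 0 a weak one. As noted on the previous slide, even a strong correlation does not show cause and effect.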

Line Graph Line Graph This website gives many examples of line graphs and explains what makes a line graph different from a scatter plot. Read through this information and then return to this presentation. Main ideas: Line graphs help to determine the relationship between two sets of values; Value sets represent an independent variable and a dependent variable; Line graphs are useful in showing trends and making predictions.

Line Graph (cont.) Line Graphs
Check your understanding in interpreting line graphs by answering the ten “Your turn” questions at the bottom of this webpage. You should now be able to answer the following questions: What are the main differences between scatterplots and line graphs? What type of data is best represented in a line graph?

This video introduces you to Box Plots, as it demonstrates how to create a box plot and defines the vocabulary terms listed below. When you have finished viewing the video, return to this presentation. Vocabulary to understand box plots: Distribution Median Average (Mean) Extremes Quartiles Interquartile Range

Quartiles / Interquartile Range / Box and Whisker Plot
Box Plot Quartiles / Interquartile Range / Box and Whisker Plot This webpage gives another look at the breakdown of Box Plots. Once you have read through the information, try answering the ten “Your turn” questions at the end of the page. (Tip: It will be helpful to have scrap paper available) At this point, you should be able to: determine the lower, middle and upper quartiles of a data set; calculate interquartile range; construct a Box and Whisker Plot to represent the data; compare box plots from two data sets and make observations about the distributions.

Boxplot (aka, Box and Whisker Plot)
If you need additional information to understand boxplots, click on the link above and “View Video”, which gives more details on how to read a boxplot. When you have finished reading through Boxplots Basics and How to Interpret a Boxplot, return to this presentation.

Stemplots (aka, Stem and Leaf Plots)
Stem & Leaf Plot Stemplots (aka, Stem and Leaf Plots) Click the blue button to View Video and then read the information on stem and leaf plots. For additional explanation about this type of graph, proceed to the next slide. Use to display quantitative data Best used with small sets of data Shows shape of distribution Stem values can have any number of digits Leaves can only be represented by one digit Limitations displaying decimals
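
To see how the stems and leaves are separated, here is a short Python sketch. It assumes non-negative whole-number data, and the function name `stem_and_leaf` and the ages are invented for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group non-negative integers by stem (all digits but the last) and leaf (last digit)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)   # e.g. 105 -> stem 10, leaf 5
        plot[stem].append(leaf)
    return dict(plot)

ages = [23, 25, 31, 34, 34, 38, 42, 47, 105]  # hypothetical data
plot = stem_and_leaf(ages)

# Print every stem from smallest to largest, including empty ones,
# so that gaps in the data stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>3} | {leaves}")
```

Notice that the stem for 105 has two digits while each leaf is still a single digit, and that the empty stems between 4 and 10 make the outlying value easy to spot.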

Stem & Leaf Plot Stem and Leaf Plots
This site provides additional details on “splitting the stems” and “splitting stems using decimal values”. You should now know: • under what circumstances stems should be split; • how to organize decimal data in a stem and leaf plot; • how to interpret data by looking at a stem plot.

Line Plot / Dot Plot Line Plot (YouTube Video) View this video on how to make a Line Plot then return to this presentation. Vocabulary: Clusters Gaps Outliers

Dot Plot vs. Line Plot (YouTube Video)
Line Plot / Dot Plot Dot Plot vs. Line Plot (YouTube Video) This YouTube video does a good job describing the similarities and differences between a line plot and dot plot. Then return to this slide and click here to reinforce what you have learned about Line and Dot plots.

Picture Graph / Pictograph
Pictographs Read the information on Pictographs and then answer the nine “Your turn” questions at the bottom of this webpage. In a Pictograph, symbols are used to display statistical data. Symbols can be misleading if not accurately proportioned or if the symbols cannot be divided evenly to represent fractional parts.

Types of Graphs Comparing Graphs
Test your understanding of the graphs covered in this unit. At this website, read through the problems and decide which graph most clearly represents the data and what information is to be conveyed to the reader. Also, work through the five questions at the bottom of the page. Most data can be represented using multiple graphs. Decisions on the most appropriate display should be made based on what information you want the reader to draw from the graph.

Hans Rosling Probably one of the most informative and modern displays of data can be seen in the work of Hans Rosling. The link above shows a video of his TED talk. It is a 20-minute video, and it gets very interesting about 4 minutes in. Watch it all if you have time, but we recommend at least 10 minutes. The point of this experience is not that we expect you to duplicate this extraordinary presentation, but that you appreciate the power of displaying data in a clear and understandable way. Any enhancement of the display should be for the purpose of clarity and not just distracting visuals.

How NOT to Display Data Misleading Graphs by Wikipedia The above link, from Wikipedia, shows various ways a graphic display can mislead its intended audience. Return to this presentation when finished. Typically the displays we see are technically accurate, but they use visual “tricks” to mislead a reader who does not pay close attention to the details of the graphic display.

Measures of Center

Definitions of these terms
Measures of Center Definitions of these terms If you are not familiar with the terms listed below, follow the link above to familiarize yourself with these terms. (The above site includes other measures of center that are beyond the scope of this presentation) Different ways to measure the center of data Arithmetic mean (commonly called average or just mean) Median Mode Weighted mean

Measures of Center Central Values
The link above gives some simple examples of measures of center and compares the mean, median, and mode. Check your understanding with the ten questions at the bottom of the page before returning. What is meant by “Measure of Center”? Sometimes we want to describe a group of data (numbers, values) by a single number. The advantage of this is the ability to more easily compare different groups of data. The disadvantage is when you describe a data set by a single number you lose the details and could mislead someone.

Arithmetic Mean The mean of a set of data is found by adding all the data values and dividing that answer by the number of points. (often referred to as “n”) Strengths Its calculation includes all the data It is common and more likely understood by others It is often used in other statistical formulas

Arithmetic Mean Weaknesses
Sometimes you don’t know all the data points needed to calculate the mean (data may be in a graph only). The mean of an extremely large data set may be difficult to calculate. It can be influenced by outliers, those values much larger or smaller than the rest of the data. It is often a value that is different from any of the data values. When best to use The mean is best used when your data is continuous and symmetrical. It is often necessary for use in other statistical measures.

Lessons on Arithmetic Mean
How to Find the Mean Visit the web site above to learn more about the arithmetic mean. After reading the lesson make sure and check your understanding by answering the ten questions at the end. In case you missed it, make sure and check out the “mean machine”. Run this virtual machine to see the relationship between the data points and the mean value.

Wikipedia defines median
The web site above gives a very detailed definition of median. (Many of the examples are beyond the scope of this presentation) The median of a set of data is found by arranging all the data in numerical order and then selecting the data point in the middle. If the data has an even number of values the median is the mean of the two central values. Strengths Requires little if any mathematical calculation It is not affected by outliers (large or small data points) It can be approximated from a frequency distribution or a distribution graph

Median Weaknesses When best to use
Arranging a large set of data in order can be very difficult. When best to use The median is usually preferred when the data distribution is skewed It is used with ordinal data when the mean cannot be used
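
A quick way to see why the median is preferred for skewed data is to add one extreme value and watch what happens to each measure. The sketch below uses Python's standard `statistics` module; the salary figures are hypothetical.

```python
from statistics import mean, median

salaries = [40, 42, 45, 47, 50]     # hypothetical salaries, in thousands
with_outlier = salaries + [400]     # the same data plus one extreme value

print(mean(salaries), median(salaries))          # 44.8 45
print(mean(with_outlier), median(with_outlier))  # the mean jumps to 104, the median only moves to 46
```

A single outlier more than doubles the mean, while the median barely changes.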

How to Find the Median Value
Lessons on Median How to Find the Median Value Visit the following web site to learn more about the median. After reading the lesson make sure and check your understanding by answering the ten questions at the end.

Comparing Mean & Median
Mean / Median Applet The link above gives you the ability to see how the mean and median change as the data points change. The applet allows you to drag data points on the line or move data points on the line. Take some time and play with this applet and see how the mean and median change and compare. You can also check the box for “box plot” to see how a boxplot would look with the data that shows on the line. When you have finished, jot down the patterns you have observed and then return to this presentation

Comparing Mean & Median
Seeing Statistics Use the link above for a more comprehensive lesson on the attributes and differences between the mean and median. The link will take you to an introduction of the web interface. When you think you are familiar with how to navigate the system, click on the icon in the left column. When the table of contents shows, click on lesson #3 “Describing the Center”. You can advance from one page to the next by clicking on the icon in the top, right corner of the page. Return to this presentation when you finish.

Wikipedia defines mode
The web site above gives a very detailed definition of mode. (Many of the examples are beyond the scope of this presentation) The mode of a set of data is found by identifying the data element that occurs most often. Many people remember this by associating the word “most” with mode. Strengths Depending on the display of the data or the size of the data, it is often easy to identify It is the ONLY measure of center you can use for non-numeric data (nominal data). Example: What is the best measure of center for the eye color of this group of people?

Mode Weaknesses When best to use
Sometimes a data set has two modes or even more. Often the data does not have any data element that is more numerous than any other. Sometimes the mode is nowhere near the center of the data. When best to use It is the only measure of center valid with nominal data (Example: data on students’ eye color) It can support the validity of the mean and median if it has a similar value. If the data is perfectly normal, mean=median=mode
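
The three situations described above (one mode, several modes, no value standing out) can be checked with `multimode` from Python's standard `statistics` module (available from Python 3.8). The data lists are invented for illustration.

```python
from statistics import multimode

eye_colors = ["brown", "blue", "brown", "green", "blue", "brown"]  # nominal data
print(multimode(eye_colors))     # ['brown'], a single mode

shoe_sizes = [7, 8, 8, 9, 10, 10]
print(multimode(shoe_sizes))     # [8, 10], two modes

all_different = [1, 2, 3]
print(multimode(all_different))  # [1, 2, 3], every value ties, so none stands out
```

Note that the first example works on words rather than numbers, which is exactly the case where the mean and median cannot be used.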

How to Find the Mode Value

Wikipedia defines weighted mean
The web site above gives a very detailed definition of weighted mean. (Many of the examples are beyond the scope of this presentation) Sometimes certain values in a data set contribute more to a measure of center than other values. In this situation, we calculate a weighted mean.

Dr. Math explains weighted mean (weighted average)
The web site above gives examples of calculating a weighted mean or weighted average. A simple example: Consider a university that teaches two classes. One class has 10 students, the other has 100 students. If you ask the university the average (mean) class size, they respond with 55, which is (100 + 10) / 2. However, if you ask every student what size class they are in and then find the mean, you would get [(100 * 100) + (10 * 10)] / 110, which is about 91.8. The 100 students in the larger class carry more WEIGHT than the 10 students in the smaller class.
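
The class-size example can be checked in a few lines of Python. The variable names are chosen for this sketch only.

```python
class_sizes = [10, 100]

# View from the university: each class counts once.
mean_per_class = sum(class_sizes) / len(class_sizes)

# View from the students: each class size is weighted by the number of
# students who experience it, so here the weights equal the class sizes.
weights = class_sizes
mean_per_student = sum(w * x for w, x in zip(weights, class_sizes)) / sum(weights)

print(mean_per_class)               # 55.0
print(round(mean_per_student, 1))   # 91.8
```

The general pattern is the sum of (weight × value) divided by the sum of the weights. With equal weights it reduces to the ordinary arithmetic mean.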

Weighted Mean Weighted Mean Visit the web site above to learn more about the weighted mean. After reading the lesson make sure and check your understanding by answering the ten questions at the end.

Simple Analysis of Data

Simple Exploration and Analysis of Data
Exploring Data Read the first page in the link above to learn about the importance of thinking carefully about how to interpret what a data set can reveal. Measures of center, like the mean, median, and mode, give useful information about a data set, but much is hidden by such single-number summaries. To understand the information in a complex or large data set, it is important to examine the integrity of the data, to look out for interesting and useful patterns, and to summarize the data skillfully. Patterns are likely to be found more easily in visual, rather than numerical, representations of the data. A single number is unlikely to summarize the data effectively.

Data Integrity - Outliers
Outliers are numbers in a data set that are very different from all the others. Read the information in the link above to learn more about outliers. Then work the problems at the end to test your understanding. Why do some data sets have outliers? Have these numbers been recorded wrongly? Do they correspond to bad mistakes, e.g. in measurements? You should have found from the problems you worked that Outliers don’t have much effect on the median; Outliers can have a big effect on the mean. How can we tell when a “suspicious” number is a “genuine” outlier rather than just being at the limits of what is “normal”? What, if anything, should we do about outliers? If we ignore them, will they make problems for how we interpret the data? We’ll discuss some of these issues in the pages ahead.

Patterns of Data Patterns of Data Open the link above to read about various ways to describe and identify patterns in data sets. To begin with, we will concentrate on finding ways to measure the spread of a data set. This will give us two numbers (a measure of center and a measure of spread) to use when we summarize a data set. Two is better than one. The interaction of center and spread is important. If a data set has small spread, the values will be clustered closely around the center, so the center will represent the data values well.

Range Range Discover what is meant by the range of a quantitative data set by reading the explanation in the link above and working through the activities. The range of a data set is the difference between the largest and smallest values. It is the simplest measure of the spread of the data. Strength: Very simple to compute. Weaknesses: Very sensitive to unreliable data or outliers; it is easy for an inaccurate measurement to be much bigger or smaller than the others; this could have a big and misleading impact on the range. Only uses two data values, so a lot of information about the data set is lost.

Quartile Definition and Computation
Quartiles Quartile Definition and Computation Click the link above to read about quartiles and how to compute them. Then check your understanding by working the problem at the end. Half the values in a data set are at least as big as the median – and half the values are no bigger than the median. The first (or lower) quartile Q1 is essentially the median of the lower half of the data set; The third (or upper) quartile Q3 is essentially the median of the upper half of the data set. Controversy: There is no generally accepted agreement about how precisely to compute upper and lower quartiles. In fact, different calculators or software packages will give different results for the quartiles of the same data set.
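
The disagreement between methods is easy to reproduce. The sketch below compares three ways of finding Q1 and Q3 for the same small data set: the "median of each half" description given above, and the two interpolation rules (`"exclusive"` and `"inclusive"`) offered by `statistics.quantiles` in Python's standard library. The data set is made up, and the half-splitting code assumes the list is already sorted and has an even number of values.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]   # already sorted, even count

# Method 1: median of the lower half and of the upper half.
half = len(data) // 2
q1_halves = median(data[:half])
q3_halves = median(data[-half:])

# Methods 2 and 3: two interpolation rules from the standard library.
q1_excl, _, q3_excl = quantiles(data, n=4, method="exclusive")
q1_incl, _, q3_incl = quantiles(data, n=4, method="inclusive")

print(q1_halves, q3_halves)   # 3 8
print(q1_excl, q3_excl)       # 2.75 8.25
print(q1_incl, q3_incl)       # 3.25 7.75
```

All three answers are reasonable, so when you report quartiles it helps to say which method or tool produced them.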

Interquartile Range Interquartile Range Open the link above to learn what interquartile range (IQR) means and how to compute it. Don’t miss the embedded YouTube video. Then read more here and check your understanding by answering the ten questions at the end. The IQR tells you about the spread of a data set through focusing on the middle 50% of the data. Its value is the length of the box in a box and whisker plot. Advantages of the IQR as a measure of spread: Easy to compute; Much less likely than the range to be affected by outliers. When best to use: When you use the median as a measure of center, the IQR is a good measure of the spread of a distribution.

Interquartile Range and Outliers
IQR and Outliers Visit the link above to find how to use the interquartile range to identify outliers in a data set. Then try your hand at the problems in this set. A standard convention is that a number in a data set is an outlier if it lies more than 1.5 IQRs below the first quartile or above the third quartile. It is important to understand that the choice of 1.5 IQRs is not specified by any theory. It is an arbitrary convention, but it has worked well for many years. An outlier can easily be spotted in a box and whisker plot: it sits at the end of a whisker that is more than one and a half times as long as the box.
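
Here is a minimal Python sketch of the 1.5 IQR check. It assumes the common form of the convention in which the distance is measured from the ends of the box (Q1 and Q3), and it uses the `"inclusive"` quartile rule from the standard library; another quartile method could move the fences slightly. The function name `iqr_outliers` and the data are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k IQRs outside the box (Q1 to Q3)."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - k * iqr     # anything below this is flagged
    high_fence = q3 + k * iqr    # anything above this is flagged
    return [v for v in values if v < low_fence or v > high_fence]

data = [12, 14, 15, 15, 16, 17, 18, 19, 45]  # hypothetical measurements
print(iqr_outliers(data))  # [45]
```

For this data Q1 = 15 and Q3 = 18, so the IQR is 3 and the fences sit at 10.5 and 22.5. Only the value 45 falls outside them.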

Five Number Summary Five number summary The link above explains how a box and whisker plot provides five numbers that conveniently summarize a data set. Five summary numbers allow us a much richer analysis of a data set than a single measure of center. Practice problems here. The five number summary of a data set is based on a box and whisker plot. It consists of The biggest value; The third quartile; The median; The first quartile; The smallest value. It provides representative information about a data set that easily leads to a measure of center and a measure of spread at the same time as making outlier detection a simple matter of checking for over-long whiskers.
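
The five numbers can be collected in one step. This sketch again relies on the standard library's `"inclusive"` quartile rule, which is one choice among several; the function name and data are invented for illustration.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, biggest) for a list of numbers."""
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)

data = [12, 14, 15, 15, 16, 17, 18, 19, 45]  # hypothetical measurements
summary = five_number_summary(data)
smallest, q1, med, q3, biggest = summary

print(summary)
print("range:", biggest - smallest, "IQR:", q3 - q1)
```

From the summary you can read a measure of center (the median, 16) and two measures of spread (range 33, IQR 3). The large gap between the range and the IQR is itself a hint that an extreme value is present.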

Standard Deviation Standard Deviation So far we have measured the spread of a data set in ways (range, IQR) that associate well with the median measure of center. Read the link above to see how to measure spread in a way (standard deviation) that is compatible with the mean measure of center. Then work the problems at the end. The deviation of a data value from the mean is just the difference between the two. The variance of a data set is the average of the squares of the deviations (with a slight adjustment, dividing by n - 1 instead of n, if the data is a sample from a larger population.) The standard deviation of a data set is the square root of its variance. It is usually better to avoid computing variance or standard deviation by hand. Many calculators have these computations pre-programmed, so it is easy to get the information once the data is entered.
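
The following Python sketch works through the definition step by step on a small made-up data set and then compares the result with the functions in the standard `statistics` module. There, `pvariance` and `pstdev` treat the data as the whole population, while `variance` and `stdev` apply the sample adjustment.

```python
from statistics import pstdev, pvariance, stdev, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data, mean is 5

# By hand: deviations from the mean, squared, then averaged.
mu = sum(data) / len(data)
squared_devs = [(x - mu) ** 2 for x in data]
variance_by_hand = sum(squared_devs) / len(data)
sd_by_hand = variance_by_hand ** 0.5

print(variance_by_hand, sd_by_hand)       # 4.0 2.0
print(pvariance(data), pstdev(data))      # same values from the library
print(variance(data), stdev(data))        # slightly larger: divides by n - 1
```

The sample versions are a little larger because the sum of squared deviations (32) is divided by 7 rather than 8.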

Standard Deviation Video
Standard Deviation and Outliers Standard Deviation Video Here’s a video to help you understand the concept of standard deviation. Standard deviation behaves like the average distance of the data values from their mean. It is a measure of the spread of the data set: When the average distance is small, many of the data values will be clustered around the mean, and the spread will be small; When the average distance is large, many of the data values will be far from the mean, and the spread will be large. We viewed a data value as an outlier when it was “far away” in IQR terms from the median. We shall also label a data value as an outlier when it is far away from the mean, as measured in terms of the standard deviation. One convention is that an outlier is at least 3 standard deviations from the mean, but this is a matter of debate, as is what you should do about outliers.

Distinguishing Between Data Sets
Anscombe's Quartet It is important to realize that very different data sets can have identical means and standard deviations. In other words, they have identical center and spread, but look very different. The link above is to Anscombe’s famous examples. Our objective is to understand and analyze our data sets. Because of Anscombe’s and similar examples, simple number summaries cannot give complete answers to our questions. It is necessary to look for patterns in other ways.

Challenging Problems Challenging Problems Before changing direction, here are some problems that you will need to think about carefully. If you get stuck, there are solutions posted.

Frequency Distributions
Consult the link above as well as this one to find out about frequency distributions. Make sure you confirm your understanding by working the problems. Sometimes a value occurs more than once in a data set. Its frequency is the number of times it appears in the list of values. By putting the values into bins if appropriate and counting up the total frequency in each bin, it is not difficult to create a frequency table that can be represented as a histogram.
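
Here is a small Python sketch of the binning and counting step. The bin width of 10, the function name `bin_counts`, and the height values are all assumptions made for this example.

```python
from collections import Counter

def bin_counts(values, bin_width):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

heights = [151, 158, 160, 162, 167, 169, 171, 174, 183]  # hypothetical, in cm
table = bin_counts(heights, bin_width=10)

# Print the frequency table with a row of stars as a rough text histogram.
for lower, freq in table.items():
    print(f"{lower}-{lower + 9}: {'*' * freq} ({freq})")
```

Each row of the printed table corresponds to one bar of the histogram, and the bar height is the frequency for that bin.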

Relative Frequency Distributions
To create a relative frequency distribution, we proceed as for the frequency distributions, but scale each of the frequencies by dividing by the total count of data values. This scaled frequency is the relative frequency. Read the link above and here, working the problems. Relative frequencies behave like probabilities: each lies between 0 and 1, and together they add up to 1. Using relative frequency distributions allows us to compare data sets of different sizes on an equal basis. If we selected one data set of 100 measurements and another data set of 1000 measurements by taking samples from a massive collection, there shouldn’t be too much difference between how each value compares with the others, regardless of which data set we examine. However, we should expect the frequency (e.g. 49) of one value in the second data set to be roughly 10 times its frequency (e.g. 5) in the first set. On the other hand, the relative frequencies should be similar (e.g. 4.9% and 5.0%.)
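
The comparison of a sample of 100 with a sample of 1000 can be sketched in Python as follows. The two samples are constructed artificially to match the counts in the example (5 and 49), and the function name is invented.

```python
from collections import Counter

def relative_frequencies(values):
    """Map each distinct value to its share of the whole data set."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}

small = ["A"] * 5 + ["B"] * 95      # hypothetical sample of 100
large = ["A"] * 49 + ["B"] * 951    # hypothetical sample of 1000

print(Counter(small)["A"], Counter(large)["A"])   # 5 49: very different frequencies
print(relative_frequencies(small)["A"])           # 0.05
print(relative_frequencies(large)["A"])           # 0.049
```

The raw frequencies differ by about a factor of ten, while the relative frequencies (5.0% and 4.9%) are nearly the same.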

Describing Data Patterns
Describing Data Patterns I Describing Data Patterns The link above describes some basic patterns that can occur in frequency distributions of data. These descriptions go beyond measuring center and spread, and focus as well on shape and unusual features. Some distributions are symmetric around the mean – which must then coincide with the median. (Why is this the case?) Some distributions are skewed with a tail to the right or the left. Some distributions have more than one mode.

Describing Data Patterns II
Data Pattern Video See whether you can use your knowledge of the previous slide to answer the questions in the YouTube video linked above. It turns out that distributions that are symmetric around the mean/median and that have a single mode that also coincides with the mean/median are the most important of all.

Characteristics of a Normal Distribution
Consult the link above for basic facts about the normal distribution and how it arises. As always, work the problems at the end. Check out the YouTube video here for ways to test carefully and systematically whether or not your data really is normal. The continuous normal distribution has a distinctive bell shape, with the mode, mean and median all at the same place. The bell is wider when the standard deviation is greater, but the basic form is always the same. Approximately 68% of all values are within 1 standard deviation of the mean, 95% are within 2 standard deviations of the mean, and 99.7% are within 3 standard deviations of the mean. Normal distributions are continuous distributions that tend to arise in measuring heights, sizes, pressures, temperatures, and so on.
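
The 68%, 95% and 99.7% figures can be checked with `NormalDist` from Python's standard `statistics` module. This sketch uses the standard normal curve (mean 0, standard deviation 1), but the shares are the same for any normal distribution.

```python
from statistics import NormalDist

curve = NormalDist(mu=0, sigma=1)

# Share of values within k standard deviations of the mean, for k = 1, 2, 3.
shares = {k: curve.cdf(k) - curve.cdf(-k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share:.1%}")
```

The computed shares are about 68.3%, 95.4% and 99.7%, which is where the rounded rule of thumb comes from.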

Standard Normal Distribution
Consult the link above for basic facts about the standard normal distribution and work the problems at the end. A normal distribution is standard if it has mean 0 and standard deviation 1. All normally distributed data sets can be converted simply to a standard normal distribution. If the original data set has mean μ and standard deviation σ, the new data set created by replacing the old data values x by new data values z = (x-μ)/σ will follow a standard normal distribution. It is common and useful to convert normal distributions to standard form; this makes it much easier to make comparisons.
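
The conversion z = (x-μ)/σ is shown below on a small made-up list of heights. The sketch treats the list as the whole population (so it uses `pstdev`), which is an assumption of this example.

```python
from statistics import mean, pstdev

heights = [150, 160, 165, 170, 180]   # hypothetical heights in cm
mu = mean(heights)
sigma = pstdev(heights)

# Replace each value x by its z-score (x - mu) / sigma.
z_scores = [(x - mu) / sigma for x in heights]

print([round(z, 2) for z in z_scores])
print(round(mean(z_scores), 10), round(pstdev(z_scores), 10))
```

After the conversion the values have mean 0 and standard deviation 1. Keep in mind that this only shifts and rescales the data: the result is a standard normal distribution only if the original data was normal to begin with.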

Characteristics of a Poisson Distribution
Poisson Distribution Consult the Wikipedia link above to seek out basic facts about the Poisson distribution. It is a discrete probability distribution that “expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.” A Poisson distribution is a discrete distribution that takes on values that are 0 or a positive integer. The mean and variance of a Poisson distribution are always the same. When the mean/variance is small, the Poisson distribution is skewed with a long tail to the right. When the mean/variance is large, the Poisson distribution closely resembles the normal distribution with the same mean and variance.
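
To see that the mean and variance of a Poisson distribution agree, the sketch below computes both directly from the standard probability formula P(k) = e^(-λ) λ^k / k!. The rate λ = 2 is an arbitrary choice for illustration, and the sums are cut off at k = 59 because the remaining probabilities are negligibly small for this rate.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of exactly k events when the average rate is lam."""
    return exp(-lam) * lam ** k / factorial(k)

lam = 2.0          # hypothetical average number of events per interval
ks = range(60)     # truncated range; the tail beyond it is negligible here
probs = [poisson_pmf(k, lam) for k in ks]

mean = sum(k * p for k, p in zip(ks, probs))
variance = sum((k - mean) ** 2 * p for k, p in zip(ks, probs))

print(round(mean, 6), round(variance, 6))   # both equal the rate
print([round(p, 3) for p in probs[:6]])     # probabilities for k = 0..5
```

The first few probabilities rise quickly and then trail off slowly, which is the long right tail described above for a small mean.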

Binomial Distribution
Characteristics of a Binomial Distribution Binomial Distribution Consult the Wikipedia link above to find basic facts about the binomial distribution. It is a discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability p. A binomial distribution B(n,p) is a discrete distribution that takes on values that are 0 or a positive integer no greater than n. The mean is np and variance is np(1-p). A binomial distribution is not symmetric, except when p = ½. When n is large and both np and n(1-p) are not too small, the binomial distribution closely resembles the normal distribution with the same mean and variance.
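
The formulas for the mean np and variance np(1-p) can be checked numerically. This sketch builds the probabilities for B(10, 0.3) from the standard binomial formula; the values n = 10 and p = 0.3 are arbitrary choices for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3
probs = [binomial_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * q for k, q in enumerate(probs))
variance = sum((k - mean) ** 2 * q for k, q in enumerate(probs))

print(round(mean, 6), n * p)                   # both 3.0
print(round(variance, 6), n * p * (1 - p))     # both about 2.1

# With p = 1/2 the distribution is symmetric: P(k) equals P(n - k).
symmetric = all(
    abs(binomial_pmf(k, n, 0.5) - binomial_pmf(n - k, n, 0.5)) < 1e-12
    for k in range(n + 1)
)
print(symmetric)
```

With p = 0.3 the probabilities lean toward the low counts, while with p = ½ they mirror each other around n/2.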