1 1 TR 555 Statistics “Refresher” Lecture 1: Probability Concepts References: – Penn State University, Dept. of Statistics, Statistical Education Resource Kit, a collection of resources used by faculty in Penn State's Department of Statistics in teaching introductory statistics courses. Page maintained by Laura J. Simon, Sept. 2003 – Statistics: Making Sense of Data (MIT), William Stout, John Marden and Kenneth Travers, http://www.introductorystatistics.com/, Sept. 2003 – Tom Maze, stat course prepared for KDOT, 2003

2 2 Outline Overview of statistics Types of data Describing data numerically and graphically Probability and random variables

3 3 Probability and Statistics Probability is the likelihood of an event occurring relative to all other events – Example: If a coin is flipped, what is the probability of getting heads? 0.5 – Given that the last flip was heads, what is the probability that the next will be heads? 0.5 Statistics is the measurement and modeling of random variables – Example: If our state averages 200 fatal crashes per year, what is the probability of having exactly one crash today? Poisson distribution – k = average per time period = 200/365 = 0.55 per day – $P(X = x) = \frac{(kt)^x}{x!} e^{-kt}$, so $P(X = 1) = \frac{(0.55 \cdot 1)^1}{1!} e^{-0.55 \cdot 1} = 0.32$
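
The Poisson calculation on this slide can be checked with a few lines of Python. This is a minimal sketch, not part of the original lecture; the function name `poisson_pmf` and its parameters are illustrative choices.

```python
import math

def poisson_pmf(x: int, rate: float, t: float = 1.0) -> float:
    """P(X = x) for a Poisson count whose mean is rate * t."""
    mean = rate * t
    return mean ** x / math.factorial(x) * math.exp(-mean)

rate_per_day = 200 / 365               # about 0.55 fatal crashes per day
p_one = poisson_pmf(1, rate_per_day)   # exactly one fatal crash today
print(round(p_one, 2))                 # 0.32
```

Calling `poisson_pmf(0, rate_per_day)` gives the chance of a day with no fatal crash under the same assumptions.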

4 4 Data Collection Designing experiments – Does aspirin help reduce the risk of heart attacks? Observational studies – Polls - Clinton’s approval rating

5 5 Variable Types Deterministic – Assume away variation and randomness – Known with certainty – One to one mapping of independent variable to dependent variable [Figure: relationship mapping X1 to Y1]

6 6 Variable Types Continued Random or Stochastic – Recognized uncertainty of an event – One to one distribution mapping of independent variable to dependent variable [Figure: probability distribution over the possible values of the dependent variable, with regions marked most likely and less likely]

7 7 Population The set of data (numerical or otherwise) corresponding to the entire collection of units about which information is sought

8 8 Sample A subset of the population data that are actually collected in the course of a study.

9 9 WHO CARES? In most studies, it is difficult to obtain information from the entire population. We rely on samples to make estimates or inferences related to the population.

10 10 Organization and Description of Data Qualitative vs. Quantitative data Discrete vs. Continuous Data Graphical Displays Measures of Center Measures of Variation

11 11 Qualitative (Categorical) Data The raw (unsummarized) data are merely labels or categories Quantitative (Numerical) Data The raw (unsummarized) data are numerical

12 12 Qualitative Data Examples Class Standing (Fr, So, Ju, Sr) Section # (1,2,3,4,5,6) Automobile Make (Ford, Chevrolet, Nissan) Questionnaire response (disagree, neutral, agree)

13 13 Quantitative Data Examples (measures) Voltage Height Weight SAT Score Number of students arriving late for class Time to complete a task

14 14 Discrete Data Only certain values are possible (there are gaps between the possible values) Continuous Data Theoretically, any value within an interval is possible with a fine enough measuring device

15 15 Discrete Data Examples Number of students late for class Number of crimes reported to SC police Number of times the word number is used (generally, discrete data are counts)

16 16 Discrete Variable Model Poisson Distribution $P(X = x) = \frac{(0.55t)^x}{x!} e^{-0.55t}$

17 17 Continuous Data Examples Voltage Height Weight Time to complete a homework assignment

18 18 Continuous Variable Model Exponential Distribution Probability density of the first fatal at time t: $f(t) = k e^{-kt}$

19 19 Continuous Probability Function Cumulative probability of time till first fatal: $F(t) = 1 - e^{-kt}$
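
The exponential density and its cumulative form can be evaluated directly. The sketch below assumes the same rate k = 0.55 fatal crashes per day used on the Poisson slides; the function names are hypothetical.

```python
import math

def first_fatal_pdf(t: float, k: float) -> float:
    """Exponential density k * exp(-k t) for the time of the first fatal crash."""
    return k * math.exp(-k * t)

def first_fatal_cdf(t: float, k: float) -> float:
    """Probability that the first fatal crash has occurred by time t."""
    return 1.0 - math.exp(-k * t)

k = 0.55  # assumed rate per day, taken from the Poisson example
for days in (1, 2, 7):
    print(days, round(first_fatal_cdf(days, k), 3))
```

Note that `1 - first_fatal_cdf(t, k)` equals the Poisson probability of zero crashes in time t, which ties the two models together.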

20 20 Nominal Data A type of categorical data in which objects fall into unordered categories, for example: – Hair color blonde, brown, red, black, etc. – Race Caucasian, African-American, Asian, etc. – Smoking status smoker, non-smoker

21 21 Ordinal Data A type of categorical data in which order is important. For example … – Class fresh, sophomore, junior, senior, super senior – Degree of illness none, mild, moderate, severe, …, going, going, gone – Opinion of students about riots ticked off, neutral, happy

22 22 Binary Data A type of categorical data in which there are only two categories. Binary data can either be nominal or ordinal, for example … – Smoking status smoker, non-smoker – Attendance present, absent – Class lower classman, upper classman

23 23 Interval and Ratio Data Interval – Interval is important, but no meaningful zero – e.g., temperature in Fahrenheit Ratio – Has a meaningful zero value – e.g., temperature in Kelvin, crash rate

24 24 Who Cares? The type(s) of data collected in a study determine the type of statistical analysis used.

25 25 Proportions Categorical data are commonly summarized using “percentages” (or “proportions”). – 11% of students have a tattoo – 2%, 33%, 39%, and 26% of the students in class are, respectively, freshmen, sophomores, juniors, and seniors

26 26 Averages Measurement data are typically summarized using “averages” (or “means”). – Average number of siblings Fall 1998 Stat 250 students have is 1.9. – Average weight of male Fall 1998 Stat 250 students is 173 pounds. – Average weight of female Fall 1998 Stat 250 students is 138 pounds.

27 27 Descriptive statistics Describing data with numbers: measures of location

28 28 Mean Another name for average. If describing a population, denoted as μ, the Greek letter “mu”. If describing a sample, denoted as x̄, called “x-bar”. Appropriate for describing measurement data. Seriously affected by unusual values called “outliers”.

29 29 Calculating Sample Mean Formula: $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$ That is, add up all of the data points and divide by the number of data points. Data (# of classes skipped): 2 8 3 4 1 Sample Mean = (2+8+3+4+1)/5 = 3.6 Do not round! Mean need not be a whole number.

30 30 Population Mean The mean of a random variable X is called the population mean and is denoted μ. It is also called the expected value of X or the expectation of X and is denoted E(X).

31 31 Median Another name for 50th percentile. Appropriate for describing measurement data. “Robust to outliers,” that is, not affected much by unusual values.

32 32 Calculating Sample Median Order data from smallest to largest. If odd number of data points, the median is the middle value. Data (# of classes skipped): 2 8 3 4 1 Ordered Data: 1 2 3 4 8 Median = 3

33 33 Calculating Sample Median Order data from smallest to largest. If even number of data points, the median is the average of the two middle values. Data (# of classes skipped): 2 8 3 4 1 8 Ordered Data: 1 2 3 4 8 8 Median = (3+4)/2 = 3.5
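
The mean and median examples from the preceding slides can be reproduced with Python's standard `statistics` module. This is only a quick check of the arithmetic; the variable names are illustrative.

```python
import statistics

skipped = [2, 8, 3, 4, 1]           # number of classes skipped
skipped_even = skipped + [8]        # the six-point data set

sample_mean = statistics.mean(skipped)            # 3.6
median_odd = statistics.median(skipped)           # middle value of 1 2 3 4 8
median_even = statistics.median(skipped_even)     # average of 3 and 4

print(sample_mean, median_odd, median_even)
```

The single large value 8 pulls the mean (3.6) above the median (3), which is the outlier sensitivity described on the Mean slide.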

34 34 Mode The value that occurs most frequently. One data set can have many modes. Appropriate for all types of data, but most useful for categorical data or discrete data with only a small number of possible values.

35 35 Most appropriate measure of location Depends on whether or not data are “symmetric” or “skewed”. Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.

36 36 Symmetric and Unimodal

37 37 Symmetric and Bimodal

38 38 Skewed Right

39 39 Skewed Left

40 40 Choosing Appropriate Measure of Location If data are symmetric, the mean, median, and mode will be approximately the same. If data are multimodal, report the mean, median and/or mode for each subgroup. If data are skewed, report the median.

41 41 Descriptive statistics Describing data with numbers: measures of variability

42 42 Range The difference between largest and smallest data point. Highly affected by outliers. Best for symmetric data with no outliers.

43 43 Interquartile range The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile). So, the “middle-half” of the values. IQR = Q3-Q1 Robust to outliers or extreme observations. Works well for skewed data.

44 44 Variance 1. Find difference between each data point and mean. 2. Square the differences, and add them up. 3. Divide by one less than the number of data points.

45 45 Variance If measuring variance of population, denoted by σ² (“sigma-squared”). If measuring variance of sample, denoted by s² (“s-squared”). Measures average squared deviation of data points from their mean. Highly affected by outliers. Best for symmetric data. Problem is units are squared.

46 46 Population Variance The variance of a random variable X is called the population variance and is denoted σ².

47 47 Standard deviation Sample standard deviation is square root of sample variance, and so is denoted by s. Units are the original units. Measures average deviation of data points from their mean. Also, highly affected by outliers.

48 48 Population Standard Deviation The population standard deviation is the square root of the population variance and is denoted σ.
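
The three-step variance recipe above translates directly into code. The sketch below applies it to the classes-skipped data from the mean slides; `sample_variance` is a hypothetical helper name, and the standard library's `statistics.variance` is used as a cross-check.

```python
import math
import statistics

def sample_variance(data):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    mean = sum(data) / len(data)
    squared_diffs = [(x - mean) ** 2 for x in data]   # steps 1 and 2
    return sum(squared_diffs) / (len(data) - 1)       # step 3

skipped = [2, 8, 3, 4, 1]
s2 = sample_variance(skipped)     # about 7.3, in squared units
s = math.sqrt(s2)                 # sample standard deviation, original units
print(round(s2, 2), round(s, 2), round(statistics.variance(skipped), 2))
```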

49 49 What is the variance or standard deviation? (MPH)

50 50 Variance or standard deviation
Sex       N     Mean   Median   TrMean   StDev   SE Mean
female  126    91.23    90.00    90.83   11.32      1.01
male    100   106.79   110.00   105.62   17.39      1.74
Sex     Minimum   Maximum      Q1       Q3
female    65.00    120.00   85.00    98.25
male      75.00    162.00   95.00   118.75
Females: s = 11.32 mph and s² = 11.32² = 128.1 mph²
Males: s = 17.39 mph and s² = 17.39² = 302.4 mph²

51 51 Coefficient of Variation (COV) – not covariance! Ratio of sample standard deviation to sample mean multiplied by 100. Measures relative variability, that is, variability relative to the magnitude of the data. Unitless, so good for comparing variation between two groups.

52 52 Coefficient of variation (MPH)
Sex       N     Mean   Median   TrMean   StDev   SE Mean
female  126    91.23    90.00    90.83   11.32      1.01
male    100   106.79   110.00   105.62   17.39      1.74
Sex     Minimum   Maximum      Q1       Q3
female    65.00    120.00   85.00    98.25
male      75.00    162.00   95.00   118.75
Females: CV = (11.32/91.23) x 100 = 12.4
Males: CV = (17.39/106.79) x 100 = 16.3
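
As a quick check of the coefficient of variation figures, the following sketch recomputes them from the summary statistics on the slide. The function name is an illustrative choice.

```python
def coefficient_of_variation(std_dev: float, mean: float) -> float:
    """Sample standard deviation as a percentage of the sample mean."""
    return std_dev / mean * 100

cv_female = coefficient_of_variation(11.32, 91.23)
cv_male = coefficient_of_variation(17.39, 106.79)
print(round(cv_female, 1), round(cv_male, 1))   # 12.4 16.3
```

Because the result is unitless, the two groups can be compared even though their means differ.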

53 53 Choosing Appropriate Measure of Variability If data are symmetric, with no serious outliers, use range and standard deviation. If data are skewed, and/or have serious outliers, use IQR. If comparing variation across two data sets, use coefficient of variation.

54 54 Descriptive Statistics Summarizing data using graphs

55 55 Which graph to use? Depends on type of data Depends on what you want to illustrate Depends on available statistical software

56 56 Bar Chart Summarizes categorical data. Horizontal axis represents categories, while vertical axis represents either counts (“frequencies”) or percentages (“relative frequencies”). Used to illustrate the differences in percentages (or counts) between categories.

57 57 Histogram Divide measurement up into equal-sized categories. Determine number (or percentage) of measurements falling into each category. Draw a bar for each category so bars’ heights represent number (or percent) falling into the categories. Label and title appropriately.

58 58 Use common sense in determining number of categories to use. (Trial-and-error works fine, too.) Number of ranges (see Tufte)

59 59 Dot Plot Summarizes measurement data. Horizontal axis represents measurement scale. Plot one dot for each data point.

60 60 Stem-and-Leaf Plot Summarizes measurement data. Each data point is broken down into a “stem” and a “leaf.” First, “stems” are aligned in a column. Then, “leaves” are attached to the stems.

61 61 Boxplot Smallest observation = 3.20, Q1 = 43.645, Q2 (median) = 60.345, Q3 = 84.96, largest observation = 124.27 [Figure: boxplot drawn on an axis running from 0 to 130]

62 62 Box Plot “Whiskers” are drawn to the most extreme data points that are not more than 1.5 times the length of the box beyond either quartile. – Whiskers are useful for identifying outliers. “Outliers,” or extreme observations, are denoted by asterisks. – Generally, data points falling beyond the whiskers are considered outliers. Useful for comparing two distributions
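
The 1.5 x box-length rule for whiskers can be sketched as follows, using the quartiles quoted on the boxplot slide (Q1 = 43.645, Q3 = 84.96). The helper name `whisker_fences` is hypothetical, and only the two extreme observations from that slide are tested because the full data set is not given.

```python
def whisker_fences(q1: float, q3: float, k: float = 1.5):
    """Return the lowest and highest values a whisker may reach."""
    iqr = q3 - q1                       # length of the box
    return q1 - k * iqr, q3 + k * iqr

low, high = whisker_fences(43.645, 84.96)
extremes = [3.20, 124.27]               # smallest and largest observations
outliers = [x for x in extremes if x < low or x > high]
print(round(low, 4), round(high, 4), outliers)
```

Both extremes fall inside the fences, so under this rule the example boxplot would show no outliers.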

63 63 Using Box Plots to Compare

64 64 Scatter Plots Summarizes the relationship between two measurement variables. Horizontal axis represents one variable and vertical axis represents second variable. Plot one point for each pair of measurements.

65 65 No relationship

66 66 Closing comments Many possible types of graphs. Use common sense in reading graphs. When creating graphs, don’t summarize your data too much or too little. When creating graphs, label everything for others. Remember you are trying to communicate something to others!

67 67 Probability You’ll probably like it!

68 68 Before we begin … What is the probability that 2 or more people share the same birthday if … – 5 people are in the sample? – 23 people? – 50 people? – This class?

69 69 Probability Properties The probability of an event “A” (the proportion of times the event is expected to occur in repeated experiments), is denoted P(A). All probabilities are between 0 and 1 (i.e., 0 ≤ P(A) ≤ 1). The sum of the probabilities of all possible outcomes must be 1.

70 70 Probability Basics Given that a crash has occurred, what is the probability that it is a fatal crash? – Possible events: fatal, injury, and property damage only (PDO)
Crash type         Count   Probability
Fatal             37,000   P(F) = 0.58%
Injury         2,026,000   P(I) = 32.16%
PDO            4,226,000   P(D) = 67.08%
Total crashes  6,300,000

71 71 Complement The complement of an event A, denoted by Ā, is the set of outcomes that are not in A. Ā means A does not occur. P(Ā) = 1 - P(A). Some texts use A^c to denote the complement of A.

72 72 Union The union of two events A and B, denoted by A U B, is the set of outcomes that are in A, or B, or both If A U B occurs, then either A or B or both occur

73 73 Intersection The intersection of two events A and B, denoted by AB, is the set of outcomes that are in both A and B. If AB occurs, then both A and B occur

74 74 Combinations of Events [Figure: Venn diagram of all fatal crashes (37,795) with two overlapping sets, single vehicle crashes (21,052) and speed related crashes (13,357); labels mark the union of fatal speed related and run-off the road crashes and the intersection of fatal and run-off the road crashes]

75 75 Addition Law P(A U B) = P(A) + P(B) - P(AB) (The probability of the union of A and B is the probability of A plus the probability of B minus the probability of the intersection of A and B)
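
To see the addition law with numbers, the sketch below uses the fatal crash counts that appear in this lecture's Venn diagram and conditional probability example (37,795 fatal crashes, 21,052 single vehicle, 13,357 speed related, 8,600 both). The variable names are illustrative.

```python
total_fatal = 37_795
single_vehicle = 21_052
speed_related = 13_357
both = 8_600          # single vehicle AND speed related

p_sv = single_vehicle / total_fatal
p_sp = speed_related / total_fatal
p_both = both / total_fatal

# Addition law: subtract the overlap so it is not counted twice.
p_union = p_sv + p_sp - p_both
print(round(p_union, 4))   # 0.6829
```

Without the subtraction the overlap of 8,600 crashes would be counted in both P(sv) and P(sp).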

76 76 Mutually Exclusive Events Two events are mutually exclusive if their intersection is empty. If A and B are mutually exclusive, then P(AB) = 0 and the addition law reduces to P(A U B) = P(A) + P(B)

77 77 Conditional Probability The probability of event A occurring, given that event B has occurred, is called the conditional probability of event A given event B, denoted P(A|B)

78 78 Multiplication Rule General form: P(A|B) = P(A,B)/P(B), equivalently P(A,B) = P(A|B)P(B) – e.g., what is the probability of a single vehicle accident given that it was speed related?

79 79 Conditional Probability Example Total fatal crashes: 37,795 Total speed related crashes: 13,357 Total single vehicle crashes: 21,052 Total single vehicle, speed related crashes: 8,600 If the crash was speed related, what is the probability that it was a single vehicle crash? – P(sv|sp) = 8,600/13,357 = 64.38% If the crash was speed related, what is the probability that it was not a single vehicle crash? – P(not sv|sp) = 1 – 0.6438 = 35.62% [Figure: Venn diagram of all fatal crashes (37,795): single vehicle crashes (21,052), speed related crashes (13,357), overlap SR+SV (8,600)]

80 80 Conditional Probability Example (Cont) Probability that a fatal crash was speed related = P(sp) – 13,357/37,795 = 35.34% Probability that a fatal crash was a single vehicle = P(sv) – 21,052/37,795 = 55.70% Probability that a fatal crash is both speeding related and a single vehicle = P(sv,sp) – 8,600/37,795 = 22.75% [Figure: Venn diagram of all fatal crashes (37,795): single vehicle crashes (21,052), speed related crashes (13,357), overlap SR+SV (8,600)]

81 81 Bayes’ Theorem P(A|B)P(B) = P(B|A)P(A), so P(B|A) = P(A|B)P(B)/P(A) Given P(sv) = 55.70%, P(sp) = 35.34%, and P(sv|sp) = 64.38%, find P(sp|sv): P(sp|sv) = (0.6438 × 0.3534)/0.5570 = 0.4085 [Figure: Venn diagram of all fatal crashes (37,795): single vehicle crashes (21,052), speed related crashes (13,357), overlap SR+SV (8,600)]
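
A small sketch can confirm the Bayes' theorem result from the raw counts. Working from the counts, P(sp|sv) should equal 8,600/21,052, which is about 0.4085. The helper `bayes` is a hypothetical name, not something defined in the lecture.

```python
def bayes(p_a_given_b: float, p_b: float, p_a: float) -> float:
    """Return P(B|A) from P(A|B), P(B) and P(A)."""
    return p_a_given_b * p_b / p_a

total_fatal = 37_795
single_vehicle = 21_052
speed_related = 13_357
both = 8_600

p_sv = single_vehicle / total_fatal
p_sp = speed_related / total_fatal
p_sv_given_sp = both / speed_related

p_sp_given_sv = bayes(p_sv_given_sp, p_sp, p_sv)
print(round(p_sp_given_sv, 4))   # 0.4085
```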

82 82 Bayes’ Theorem Problem Given – There were 11,696 off-road fixed object fatal crashes involving a single vehicle – There were 13,357 fatal crashes involving a speeding vehicle – There were 8,600 fatal crashes involving speeding and single vehicles – There were 5,400 fatal crashes involving single vehicles, speeding, and off-road fixed object crashes – The total number of fatal crashes is 37,795 – Given that a crash is speeding related, what is the probability that it will be an off-road single vehicle crash?

83 83 Bayes’ Problem Answer What we need to know: P(or,sv|sp) What we know – P(or,sv) = 30.95% – P(sp) = 35.34% – P(sv) = 55.70% – P(sv,sp) = 22.75% – P(sp,or,sv) = 14.29% – P(or,sv|sv) = 55.56%

84 84 Answer Continued Multiplication Rule – P(sp|or,sv)P(or,sv) = P(sp,or,sv) – P(sp|or,sv) = P(sp,or,sv)/P(or,sv) = 0.1429/0.3095 = 46.17% Bayes’ Theorem – P(or,sv|sp) = P(sp|or,sv)P(or,sv)/P(sp) = (0.4617 × 0.3095)/0.3534 = 40.43%
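
The two-step answer can be verified against a direct count ratio. This sketch uses the counts given in the problem statement; the variable names are illustrative.

```python
total_fatal = 37_795
speeding = 13_357
off_road_sv = 11_696        # off-road fixed object, single vehicle
all_three = 5_400           # speeding AND off-road AND single vehicle

p_sp = speeding / total_fatal
p_or_sv = off_road_sv / total_fatal
p_all = all_three / total_fatal

# Multiplication rule, then Bayes' theorem, as on the slide.
p_sp_given_or_sv = p_all / p_or_sv
p_or_sv_given_sp = p_sp_given_or_sv * p_or_sv / p_sp

# Direct route: fraction of speeding crashes that are off-road single vehicle.
direct = all_three / speeding
print(round(p_sp_given_or_sv, 4), round(p_or_sv_given_sp, 4), round(direct, 4))
```

Both routes give roughly 0.4043, so the Bayes step here is a rearrangement of the same ratio rather than new information.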

85 85 Independence Two events A and B are independent if P(A|B) = P(A) or P(B|A) = P(B) or P(AB) = P(A)P(B)

86 86 Probability Concepts Randomness Independence

87 87 Thought Question 1 What does it mean to say that a deck of cards is “randomly” shuffled? – Every ordering of the cards is equally likely. There are 8 followed by 67 zeros possible orderings of a 52 card deck – Every card has the same probability to end up in any specified location

88 88 The question continued A 52 card deck is randomly shuffled. How often will the tenth card down from the top be a Club? – 1/4 of the time – Every card has the same chance to end up 10th. There are 13 clubs and 13/52 = 1/4

89 89 Law of Large Numbers Relative frequency of an event gets closer to true probability as number of trials gets larger
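
A short simulation can illustrate the law of large numbers for a fair coin. This is a sketch under the assumption of independent flips with P(heads) = 0.5; the function name and the fixed seed are choices made here so that runs are repeatable.

```python
import random

def relative_frequency_of_heads(n_flips: int, seed: int = 0) -> float:
    """Simulate n_flips fair coin flips and return the fraction of heads."""
    rng = random.Random(seed)   # seeded generator for repeatable output
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

The printed fractions tend to settle near 0.5 as the number of flips grows, although any single short run can be far off.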

90 90 Probability values Probabilities are between 0 and 1 Total probabilities of all possible outcomes = 1 Probability = 1 means an event always happens Probability = 0 means an event never happens

91 91 Does a prior event matter? A fair coin is flipped four times. First three flips are heads. What’s the probability that the fourth flip is heads? 1/2, assuming flips are independent – Results of first three flips don’t matter

92 92 Independence The chance that B happens is not affected by whether A had happened.

93 93 Does prior event matter? Ten cards are drawn without replacement from a 52 card deck. 2 Aces are among these 10 cards. What’s the probability the next (eleventh) card is an Ace? 2/42 = 1/21 – After ten draws, 42 cards remain, 2 of them are Aces

94 94 Dependence The chance that B happens is affected by whether A has happened.

95 95 Sequence of Events You guess at five True False questions. What’s the probability you get them all right?

96 96 Five right in five guesses For each question, Pr(correct) = 1/2 Multiply probabilities: (1/2) x (1/2) x (1/2) x (1/2) x (1/2) = 1/32 = 0.031

97 97 Card Example Two cards are taken from normal 52 card deck. What’s the probability that both are Hearts? Note - there’s dependence between the two cards Answer = (13/52) x (12/51) = 1/17 = 0.059
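
The two sequence examples differ only in whether each step's probability depends on the previous one. The sketch below works both with exact fractions; the variable names are illustrative.

```python
from fractions import Fraction

# Independent events: five True/False guesses, each correct with probability 1/2.
p_five_right = Fraction(1, 2) ** 5

# Dependent events: the second heart is drawn from the 51 cards that remain.
p_first_heart = Fraction(13, 52)
p_second_heart_given_first = Fraction(12, 51)
p_two_hearts = p_first_heart * p_second_heart_given_first

print(p_five_right, float(p_five_right))   # 1/32 0.03125
print(p_two_hearts, round(float(p_two_hearts), 3))   # 1/17 0.059
```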

98 98 The Birthday Problem What is the probability that at least two people in this class share the same birthday?

99 99 Assumptions Only 365 days each year. Birthdays are evenly distributed throughout the year, so that each day of the year has an equal chance of being someone’s birthday.

100 100 Take group of 5 people…. Let A = event no one in group shares same birthday. Then A^c = event at least 2 people share same birthday. P(A) = 365/365 × 364/365 × 363/365 × 362/365 × 361/365 = 0.973 P(A^c) = 1 - 0.973 = 0.027 That is, about a 3% chance that in a group of 5 people at least two people share the same birthday.

101 101 Take group of 23 people…. Let A = event no one in group shares same birthday. Then A^c = event at least 2 people share same birthday. P(A) = 365/365 × 364/365 × … × 343/365 = 0.493 P(A^c) = 1 - 0.493 = 0.507 That is, about a 50% chance that in a group of 23 people at least two people share the same birthday.

102 102 Take group of 50 people…. Let A = event no one in group shares same birthday. Then A^c = event at least 2 people share same birthday. P(A) = 365/365 × 364/365 × … × 316/365 = 0.03 P(A^c) = 1 - 0.03 = 0.97 That is, “virtually certain” that in a group of 50 people at least two people share the same birthday.
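
The birthday calculation generalizes to any group size. The sketch below follows the same assumptions as the slides (365 equally likely birthdays); `p_shared_birthday` is a hypothetical helper name.

```python
def p_shared_birthday(n_people: int, days: int = 365) -> float:
    """Probability that at least two of n_people share a birthday."""
    p_all_distinct = 1.0
    for i in range(n_people):
        # Person i must avoid the i birthdays already taken.
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

for n in (5, 23, 50):
    print(n, round(p_shared_birthday(n), 3))
```

Passing the size of the class as `n_people` answers the last question on the “Before we begin” slide.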

103 103 Two-way Tables And various probabilities...

104 104 Two-way table of counts Rows: gender Columns: pierced ears
          N       Y     All
M        71      19      90
F         4      84      88
All      75     103     178
Cell Contents -- Count

105 105 Joint (“∩”) probabilities Rows: gender Columns: pierced ears
          N       Y     All
M        71      19      90
      39.89   10.67   50.56
F         4      84      88
       2.25   47.19   49.44
All      75     103     178
      42.13   57.87  100.00
Cell Contents -- Count, % of Tbl

106 106 Row conditional probabilities Rows: gender Columns: pierced ears
          N       Y     All
M        71      19      90
      78.89   21.11  100.00
F         4      84      88
       4.55   95.45  100.00
All      75     103     178
      42.13   57.87  100.00
Cell Contents -- Count, % of Row

107 107 Column conditional probabilities Rows: gender Columns: pierced ears
          N       Y     All
M        71      19      90
      94.67   18.45   50.56
F         4      84      88
       5.33   81.55   49.44
All      75     103     178
     100.00  100.00  100.00
Cell Contents -- Count, % of Col
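
The three kinds of percentages in these tables come from dividing each count by a different total. A minimal sketch using the counts from the two-way table follows; the dictionary layout is an illustrative choice.

```python
counts = {
    "M": {"N": 71, "Y": 19},
    "F": {"N": 4, "Y": 84},
}

grand_total = sum(sum(row.values()) for row in counts.values())
col_totals = {c: sum(counts[g][c] for g in counts) for c in ("N", "Y")}

# Joint: divide by the grand total. Row conditional: divide by the row total.
# Column conditional: divide by the column total.
joint = {g: {c: n / grand_total for c, n in row.items()} for g, row in counts.items()}
row_cond = {g: {c: n / sum(row.values()) for c, n in row.items()} for g, row in counts.items()}
col_cond = {g: {c: n / col_totals[c] for c, n in row.items()} for g, row in counts.items()}

print(round(100 * joint["M"]["N"], 2))      # % of table
print(round(100 * row_cond["F"]["Y"], 2))   # % of row: P(pierced | female)
print(round(100 * col_cond["M"]["Y"], 2))   # % of column: P(male | pierced)
```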

108 108 Expected Value Coincidences

109 109 Roulette Color Bet 18 black, 18 red, and 2 green numbers Bet on one of black or red If correct, win $1 If wrong, lose $1

110 110 Is the bet fair? Fair game : expected value is 0 Expected value = sum of (outcome x prob) Exp Val. = (+1)(18/38)+(-1)(20/38) = -2/38 Not fair since expected value is not 0.
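
The expected value rule on this slide, the sum of outcome times probability, can be written as a short function. This sketch uses exact fractions; `expected_value` is a hypothetical helper name.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of payoff * probability over (payoff, probability) pairs."""
    return sum(payoff * prob for payoff, prob in outcomes)

# Color bet: 18 winning numbers, 20 losing numbers (18 other color + 2 green).
color_bet = [(+1, Fraction(18, 38)), (-1, Fraction(20, 38))]
ev_color = expected_value(color_bet)
print(ev_color, round(float(ev_color), 4))   # -1/19 -0.0526
```

An expected value of -2/38 means a loss of about 5 cents per dollar bet in the long run.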

111 111 Color Bet versus Number bet Both have same expected value How are the bets the same? Long run result is same How are they different? Short run results can be quite different

112 112 Prob of Five Straight Losses Color Bet = (20/38)^5 = 0.04, 4% Number Bet = (37/38)^5 = 0.88, 88%
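
The comparison between the two bets can be sketched numerically. One assumption is needed that the slides do not state: the number bet is taken to pay 35 to 1, the usual payout on a 38-number wheel. The variable names are illustrative.

```python
from fractions import Fraction

p_lose_color = Fraction(20, 38)
p_lose_number = Fraction(37, 38)

# Short run: chance of losing five bets in a row.
five_losses_color = float(p_lose_color ** 5)
five_losses_number = float(p_lose_number ** 5)

# Long run: expected value of a $1 number bet, ASSUMING a 35-to-1 payout.
ev_number = 35 * Fraction(1, 38) + (-1) * Fraction(37, 38)

print(round(five_losses_color, 2), round(five_losses_number, 2))   # 0.04 0.88
print(ev_number)   # -1/19, the same as the color bet's -2/38
```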

113 113 A Spectacular Coincidence ? Many states draw four digit lottery numbers Several years ago Mass. and N.H. both drew the same number on the same night Associated Press wrote that this was a spectacular 1 in 100 million coincidence

114 114 Was Associated Press Right ? Only if number picked is specified in advance of the draws. Chance both pick the same pre-specified number, for example 2963, is (1/10,000) (1/10,000) This is 1 in 100 million But the match could have been on any of 10,000 possibilities

115 115 The correct analysis First state could have picked any number Chance the second state matches is 1/10,000 Answer for two specific states is 1/10,000 But there were 15 states doing this almost every night.

116 116 The prob that the 15 states all differ First state can be any number Prob second state differs = 9,999/10,000 Prob third state is unique = 9,998/10,000 And so on, for 15 states Multiply these prob.'s to get probability that all 15 differ Answer is about 0.99 that all picked different numbers

117 117 Prob at least two states are same Opposite from all different Prob at least two the same = 1-Prob(all differ) 1 - 0.99 = 0.01 About 1 in 100 ; a far cry from 1 in 100 million
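
The lottery coincidence follows the same pattern as the birthday problem, with 10,000 possible numbers instead of 365 days. The sketch below assumes 15 states each drawing one four-digit number independently and uniformly; the function name is hypothetical.

```python
def p_all_differ(n_states: int, n_numbers: int = 10_000) -> float:
    """Probability that n_states independent draws are all different."""
    p = 1.0
    for i in range(n_states):
        # State i must avoid the i numbers already drawn.
        p *= (n_numbers - i) / n_numbers
    return p

p_differ = p_all_differ(15)
p_match = 1.0 - p_differ
print(round(p_differ, 2), round(p_match, 2))   # 0.99 0.01
```

This is the chance for a single night; over many nights of drawings a match somewhere becomes more likely still.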

