Introduction to Statistics


Presentation on theme: "Introduction to Statistics"— Presentation transcript:

1 Introduction to Statistics
Section 1A William Christensen, Ph.D.

2 Isaac Newton (1642 - 1727) tells you how to succeed in this Statistics class.
“If I have ever made any valuable discoveries, it has been owing more to patient attention, than to any other talent.”

3 What is Statistics? Two Meanings
A “Statistic” can be a Specific Number A “Statistic” can be a Method of Analysis

4 A “Statistic” can be a Specific Number
A number that represents some measure of a set of data Example: The average hourly wage in Washington County is $15 / hour Example: 23% of the people polled believe there are too many polls

5 A “Statistic” can be a Method of Analysis
Methods of analysis have to do with: planning experiments, collecting data, organizing & summarizing data, analyzing & presenting data, and interpreting & drawing conclusions from data Example: Linear regression is one type of statistical analysis that is used to examine the relationship between things (variables)

6 Basic Definitions used in Statistics

7 Data – factual information
factual information (as measurements or statistics) used as a basis for reasoning, discussion, or calculation

8 Population - the complete collection (every member) of the things we are studying.
Example: If we are studying Grizzly Bears then the “population” is every Grizzly Bear everywhere. DO NOT confuse “population” with a “sample”. We rarely have data on every member of a population, so we often use “statistics” (methods of analysis) to analyze a “sample” in order to understand things (make inferences) about the population. For example, we might study 12 grizzly bears (a “sample”) in order to make inferences about all grizzly bears (the “population”)

9 Census - the collection of data from every member of a population.
Although it is rare, due to the time and expense involved, to collect data from every member of a “population,” when and if we do, this collection of data is called a “Census”. Example: Once every decade the U.S. government conducts a census of its citizens, attempting to collect data from everyone that lives in the United States.

10 Sample – a subcollection of members drawn from a “population” and used to draw conclusions or make inferences about the population. In statistics, the most common approach is to use data from a “sample” in order to make inferences or draw conclusions about the larger “population”. Example: We might give an experimental drug to 1,000 people (our “sample”) in order to draw conclusions about what would happen if we made the drug available to all people everywhere (our “population”).

11 Parameter – a numerical measurement describing something about an entire “population”.
Example: If our “population” consists of all DSC students, then one “parameter” would be the average GPA of all DSC students. Another “parameter” would be the percentage of female students among all DSC students. As long as we are talking about the entire “population” we are interested in, any numerical measurement (e.g., average, percentage, etc.) that describes something about the entire population (not just a sample from the population) would be considered a “parameter”.

12 population parameter

13 Statistic – a numerical measurement describing something about a “sample” (see definition of sample). Example: If our “sample” consists of 30 DSC students, then one “statistic” would be the average GPA of our sample of 30 DSC students. Another “statistic” would be the percentage of female students among our sample of 30 DSC students. As long as we are talking about only our sample, any numerical measurement that describes something about that sample (not the entire population) would be considered a “statistic”.
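To make the parameter/statistic distinction concrete, here is a minimal sketch in Python. The GPA values are randomly generated stand-ins (not real DSC data), and the population size of 5,000 is an assumption made only for illustration.

```python
import random
import statistics

# Hypothetical population: made-up GPAs for 5,000 students (an assumption).
rng = random.Random(1)
population_gpas = [round(rng.uniform(1.0, 4.0), 2) for _ in range(5000)]

# Parameter: a number describing the ENTIRE population.
parameter_mean = statistics.mean(population_gpas)

# Statistic: a number describing only a sample of 30 students.
sample_gpas = rng.sample(population_gpas, 30)
statistic_mean = statistics.mean(sample_gpas)

print(f"Parameter (mean GPA of all students): {parameter_mean:.3f}")
print(f"Statistic (mean GPA of the sample):   {statistic_mean:.3f}")
```

The two numbers will usually be close but not identical, which is exactly why we keep the two words separate.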

14 sample statistic

15 What is the difference between a “parameter” and a “statistic”?
If you don’t know, then GO BACK AND REVIEW

16 Types of Data: Quantitative & Qualitative Data

17 Quantitative Data – numbers that represent counts or measurements.
If you can express the data as a number then it is usually (not always) quantitative data Example: income levels, ages, weights, and lengths can all be expressed as meaningful numbers. These are all examples of quantitative data. Gender, opinions, and relationships cannot be expressed as numbers and ARE NOT quantitative data. Even some numbers, such as zip codes or phone numbers ARE NOT quantitative data because you cannot mathematically manipulate them (e.g., add or subtract them) Quantitative Data can be subdivided into two groups: Discrete data Continuous data

18 Quantitative Data Discrete data – when the number of possible values is ‘countable’ or finite Example: The number of eggs a chicken lays is “discrete”. You get 1, 2, 3, or more eggs. The number of eggs is always finite. You can never get an in-between (fractional) number of eggs. Continuous data – when the number of possible values is infinite. Scales that cover a range of values, without gaps, produce continuous data. Example: a thermometer is an example of a scale without gaps that covers a range of temperatures. There can be an infinite number of temperatures between say 0 degrees and 120 degrees Fahrenheit. This is because you can have any number of in-between temperatures, including fractions of a degree.

19 Qualitative Data – information that can be put into categories or distinguished by some nonnumeric characteristic. Examples: Gender (male/female) Age categories (not individual ages, but age brackets) Party affiliation (democrat/republican/independent) Zip codes Social security numbers

20 Qualitative (or categorical or attribute) data
can be separated into different categories that are distinguished by some nonnumeric characteristics (e.g., male / female)

21 Levels of Measurement: Nominal Ordinal Interval Ratio

22 Nominal data Data that consists of names, labels, or categories
Cannot be arranged in any meaningful order You cannot say that one value is bigger, better, or greater than any other value Examples: Gender (male/female) Party affiliation (democrat / republican / independent) Zip codes

23 Ordinal data Data that may be arranged in some order, but the precise differences between values either cannot be determined or are meaningless Examples: Poor / Average / Good / Excellent Letter grades ( A, B, C, D, F ) Subcompact, Compact, Mid-size, and Full-size Automobiles

24 Interval data Like ordinal data but with the additional property that the difference between any two data values is meaningful (evenly spaced). However, there is no natural zero starting point at which there is zero quantity. Example: Calendar years. The difference between the year 2000 and the year 1990 is the same as the difference between the year 1990 and the year 1980 (a difference of 10 years in each case). We can add and subtract years. However, we CANNOT SAY that the year 2000 is two times (or twice as much time) as the year 1000. Also, the year 0 does not represent the starting point of time. Fahrenheit temperature is another example of an interval scale because 100 degrees F is NOT twice as hot as 50 degrees F, and 0 degrees F does NOT represent the absence of all heat.

25 Ratio data Like interval data with the addition that there is now an absolute or true starting point where zero truly means there is no quantity present. Ratio data can be fully manipulated using mathematics. We can add/subtract or multiply/divide, or whatever. Examples: Money (e.g., prices of college textbooks). $50 is half of $100 and $0 is truly zero or no money. Distance (e.g., miles from home to school). 20 miles is really twice as far as 10 miles and zero distance is truly no distance.

26 ratio level of measurement
the interval level modified to include the natural zero starting point (where zero indicates that none of the quantity is present). For values at this level, differences and ratios are meaningful. Example: Prices of college textbooks
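A quick sketch of why ratios only make sense at the ratio level. The slides do not mention the Kelvin scale; it is brought in here as an assumption because it is a temperature scale with a true zero, so it gives us something to compare Fahrenheit against.

```python
def fahrenheit_to_kelvin(deg_f):
    """Convert Fahrenheit (interval scale) to Kelvin (ratio scale, true zero)."""
    return (deg_f - 32.0) * 5.0 / 9.0 + 273.15

# Dividing Fahrenheit readings gives a number, but it has no physical meaning.
naive_ratio = 100.0 / 50.0

# On a scale with a true zero, the same two temperatures are much closer.
kelvin_ratio = fahrenheit_to_kelvin(100.0) / fahrenheit_to_kelvin(50.0)

# Differences, on the other hand, are meaningful on an interval scale.
difference_f = 100.0 - 50.0

print(f"Naive Fahrenheit ratio: {naive_ratio:.2f}")
print(f"Ratio on the Kelvin scale: {kelvin_ratio:.2f}")
print(f"Difference in degrees F: {difference_f:.0f}")
```

With money or distance (ratio data) no conversion is needed: $100 really is twice $50.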

27 Can you name and explain the 2 types of data and the 4 levels of data measurement?
If not then GO BACK AND REVIEW

28 Uses and Abuses of Statistics

29 “There are three kinds of lies: lies, damned lies, and statistics.”
Benjamin Disraeli, British Prime Minister

30 Why Statistics? Statistics is used to explore and explain things not explainable by the physical sciences. Things like: Human behavior Useful for understanding marketing, business, consumer behavior, social science, psychology and politics Nature Useful for understanding natural phenomena and animal behavior Medicine Because we are all different, the physical sciences alone cannot fully explain how our bodies react to drugs and other stimuli, so statistics plays a valuable role in medicine

31 Why Statistics? #1 Practical Reason
People will try to sell you all kinds of things using statistics, from vitamins to investments to political agendas If you do not have a working knowledge of statistics you are fodder for the merciless.

32 Bad Statistics The self-selected survey (voluntary response sample)
Respondents themselves decide whether to be included. For example, you get a message asking you to respond to a survey on something (often in order to save the world from evil). Studies that use small samples and/or samples that do not provide a true representation of the population being studied

33 Bad Statistics Surveys with confusing or misleading questions
Here is an example of two different ways to ask the “same” question and how the phrasing of the question can alter the response. Is it really the same question being asked? Should the president have the line item veto? (response was 57% yes) Should the president have the line item veto to eliminate waste? (response was 97% yes)

34 Bad Statistics Presentations that include misleading data and graphs
Take a look at the following graphs and pictures and see if you can find the deception in each Hint: focus on the numerical information given, which is usually accurate, versus the general shape of the graph/picture which often misleads us.

35 Salaries of People with Bachelor’s Degrees compared to those with only High School Diplomas
(same graph but using different starting points – how does this affect your perception?) [Graphs: Bachelor’s Degree $40,500 vs. High School Diploma $24,400, drawn twice with different axis starting points]

36 The caption for this graph might suggest that savings have doubled (i.e., save twice as much)
Yet, in fact, by doubling the length, width and height, the actual size (volume) has increased by a factor of eight!!
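The factor of eight is easy to check with a little arithmetic. This is a minimal sketch that assumes the picture is a simple box shape; the 1 x 1 x 1 starting size is made up for illustration.

```python
def box_volume(length, width, height):
    """Volume of a rectangular box."""
    return length * width * height

original = box_volume(1, 1, 1)
doubled = box_volume(2, 2, 2)  # every dimension doubled

growth_factor = doubled / original
print(f"Doubling all three dimensions multiplies the volume by {growth_factor:.0f}")
```

So a picture meant to show "twice as much" actually looks eight times as big.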

37 Bad Statistics Using precise numbers that are not accurate
For example, if I gave you a number with many digits (not rounded), you might assume it was quite accurate because it seems so precise. Yet, in fact, this number may have been generated from very poor/inaccurate data. Don’t be fooled, investigate the actual data.

38 Bad Statistics Distorted percentages
For example, if an airline was losing 80% of the luggage they processed and they improved this to “only” losing 40%, they might claim something like a 100% improvement in baggage handling. This may sound very impressive and make you think they are doing a great job. Yet, in fact, if they are still losing 40% of passengers’ luggage they are actually doing a very poor job. Don’t fall for misleading statistics, especially when someone is trying to sell you something.
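The same change can be reported as several very different percentages. The sketch below works through the airline numbers. How the airline would actually arrive at "100%" is not stated, so the last calculation (dividing by the new rate instead of the old one) is only one guess at the trick.

```python
lost_before_pct = 80  # percent of luggage lost before
lost_after_pct = 40   # percent of luggage lost after

# Honest description 1: the loss rate fell by 40 percentage points.
point_drop = lost_before_pct - lost_after_pct

# Honest description 2: relative to where they started, losses fell by half.
relative_reduction_pct = 100 * point_drop / lost_before_pct

# One possible spin (an assumption): divide by the NEW rate to get a bigger number.
spin_pct = 100 * point_drop / lost_after_pct

print(f"Drop in percentage points: {point_drop}")
print(f"Relative reduction in losses: {relative_reduction_pct:.0f}%")
print(f"Spun 'improvement': {spin_pct:.0f}%")
print(f"Luggage still lost: {lost_after_pct}%")
```

Whichever number gets advertised, the last line is the one that matters to the passenger.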

39 Bad Statistics Partial Pictures (not the whole story)
Example: An overseas automaker makes the (true) claim that “90% of all our cars sold in the USA in the last 10 years are still on the road.” This claim is designed, of course, to make you think they have really good quality cars. And, in fact, their stated claim is true. However, what they don’t tell you is that they have only been selling cars in the USA for the last 3 years. Watch out for misleading statistics, especially when someone is trying to sell you something.

40 Bad Statistics Deliberate Distortions or outright lies
Unfortunately, there are people willing to tell outright lies and use statistics to make those lies appear legitimate. If you are interested in learning how to better detect these untruths, here are some references you can check out. Tainted Truth by Cynthia Crossen How to Lie with Statistics by Darrell Huff The Figure Finaglers by Robert Reichard

41 What’s Wrong Here? In a study of college campus crimes committed by students high on alcohol or drugs, a mail survey of 1875 students was conducted. A USA Today article noted, “8% of the students responding anonymously say they’ve committed a campus crime, and 62% of that group say they did so under the influence of alcohol or drugs.” Hint: They never told us the actual number of students who responded to the survey. By telling us they sent it to 1875 students it makes it sound like they had a large sample, but in fact, they never said how many responded. What if only 5 or 6 students responded – would that be representative of the population of all college students? Also, they use a percentage of a percentage (62% of 8% = 5%), but it kind of fools you, at first, into thinking that 62% of college students committed a crime while under the influence, while in fact only about 5% of those responding to the survey said they committed a crime while under the influence. This actually appeared in USA Today, yet it is quite misleading. Perhaps you can find even more things wrong with this study (e.g., what constitutes a “crime”? Does it include parking tickets?).
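The "percentage of a percentage" in the hint can be checked directly. Since the article never says how many students responded, the respondent counts below are purely hypothetical and only show how much the story depends on that missing number.

```python
committed_crime = 0.08        # share of respondents who say they committed a campus crime
under_influence_share = 0.62  # share of THAT group who say they were under the influence

# Share of ALL respondents who committed a crime under the influence.
overall_share = committed_crime * under_influence_share
print(f"Overall share: {overall_share:.1%}")

# Hypothetical numbers of respondents (the real number was never reported).
for respondents in (6, 100, 1875):
    students = respondents * overall_share
    print(f"If {respondents} students responded: about {students:.1f} students")
```

With only a handful of respondents, the headline percentages could be describing less than one actual student.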

42 What’s Wrong Here? The Newport Chronicle, a newspaper in New England, reported that pregnant mothers can increase their chances of having healthy babies by eating lobsters. That claim is based on a study showing that babies born to lobster-eating mothers have fewer health problems than babies born to mothers who don’t eat lobster. Hint: In statistics we can “prove” a relationship between two things, in this case, healthy babies and lobster-eating mothers, but that DOES NOT mean that one causes the other. In fact, this study did show a statistical relationship, but as we will learn later in the course, that relationship does NOT IMPLY causality. Can you think of any reasons why lobster-eating mothers might have healthier-than-average babies besides the fact that they eat lobsters? One theory that might explain this relationship is that lobster is quite expensive and therefore, those that eat lobster are probably well-off and can probably afford the best health care. This might be a better explanation of the results than to suggest that eating lobsters is good prenatal care. Perhaps you can think of other reasons for these results.

43 What’s Wrong Here? A survey includes this item:
“Enter your height in inches _____” What might be some of the problems in asking this question? Hint: You can probably think of a number of reasons yourself. One problem might be that people usually think of their height in terms of feet and inches (e.g., 5’ 10”) and may have trouble figuring out how many inches that is. Also, many people tend to exaggerate their heights, so if you were really looking for accurate information, you might want to actually measure people rather than ask them how tall they are.

44 What’s Wrong Here? True story: A researcher at the Sloan-Kettering Cancer Research Center was once criticized for falsifying data. Among his data were figures obtained from six groups of mice, with 20 individual mice in each group. These values were given for the percentage of successes in each group: 53%, 58%, 63%, 46%, 48%, 67%. How did someone figure out that this researcher lied about their data? I’ll let you figure this one out on your own. You can check with me if you like to see if you got it.

45 Collecting Data

46 Where do we get data? Data usually comes from one of two sources:
Observation Experiments

47 Data from Observation The “observation” method of data collection suggests that we only observe what we are studying, without in any way trying to change or affect it. For example, we might use the “observation” method of study by making a survey and asking people in a mall questions about what kind of pizza they like best and why. Surveys and actually sitting there and watching things are types of “observation”.

48 Data from an Experiment
An “experiment” requires that we apply some sort of treatment and then see what kind of effect we get. Experiments usually involve two groups: the treatment or test group, and the placebo or control group. (If you don’t understand what these or any other terms mean, then you need to look them up in the dictionary.) Example: We conduct an experiment to test the effectiveness of a new drug by giving the drug to one group (the treatment or test group), and giving a “sugar pill”, or something that looks like the drug but really isn’t any kind of drug at all, to another group called the placebo or control group. We then see if the test group did significantly better than the placebo group. Experiments need to be carefully planned or designed in order to get accurate results.

49 Designing an Experiment
Steps in designing an experiment Identify your objective and identify the relevant population Collect (representative) sample data from your test group and placebo group Use a random procedure in selecting subjects for your treatment and placebo group to avoid bias Analyze the data and form conclusions
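The "use a random procedure" step can be sketched in a few lines of Python. The subject labels, the group size of 20, and the even split are all assumptions for illustration; a real study design may call for something more elaborate.

```python
import random

def randomly_assign(subjects, seed=None):
    """Shuffle the subjects and split them into treatment and placebo groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)  # copy so the original list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

# Hypothetical subject labels.
subjects = [f"subject_{i:02d}" for i in range(1, 21)]
treatment_group, placebo_group = randomly_assign(subjects, seed=7)

print("Treatment:", treatment_group)
print("Placebo:  ", placebo_group)
```

Because chance alone decides who lands in which group, the researcher cannot (even unintentionally) stack one group with healthier or more motivated subjects.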

50 Controlling for Effects
When doing an experiment we must be careful to avoid any interference from things outside of what we are studying. This basically means that the ONLY thing we want to be different between our test group and our control group is the treatment itself. We don’t want anything (physical or psychological) to interfere. These “effects” or unwanted interference can be controlled through good experimental design

51 Controlling for Effects
One common “effect” or interference that can happen when doing an experiment is called the “Placebo effect”. This is when the experiment itself causes some effect. In other words, the mere fact that someone or something knows they are part of an experiment alters their normal behavior and thus gives us inaccurate results. If you saw the 2nd Jurassic Park movie you might remember the comment about the “Heisenberg principle,” which, when applied to experiments, suggests that it is impossible to study something without having some effect on what you are studying. This is another example of a sort of “placebo effect.” There are a couple of common methods for controlling for “effects”, including the placebo effect: Blinding (or “blind” study) – this is when the subjects do not know whether they are getting the real “treatment” or just the placebo (like a sugar pill). Double-Blind Study – this is when both the subject and those administering the study (i.e., giving out the drug or placebo) do not know whether the treatment is real or just a placebo. Only the researchers who do the statistics at the end of the study know who got the real treatment and who got the placebo.

52 Random Samples

53 Definition Random Sample
We have a “random sample” when we select our sample from members of the population in such a way that each has an equal chance of being selected. Random Samples give us the best representation of a population. Therefore, only random samples are acceptable when doing good studies. The letter “n” is used to represent the number of subjects in a sample. If you have a sample that was NOT randomly selected (e.g., a voluntary response survey) then your data is unacceptable and totally useless. Following are several approved methods for getting random samples.

54 Methods for obtaining Random Samples
Simple Random Sampling Systematic Sampling Stratified Sampling Cluster Sampling

55 Simple Random Sample A sample selected in such a way that every possible sample of size n has the same chance of being chosen Example: if you wanted to have a simple random sample of 30 students to represent the population of all DSC students, you would have to get every student’s name and then randomly pick 30 students from the total student body (e.g., put all student names in a barrel, shake it up and then draw out 30 names). The key to a simple random sample is that you are taking the sample from all members of the population, which means you have to identify every member of the population (no shortcuts)
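The "names in a barrel" idea can be sketched with Python's standard library. The student list below is made up (2,000 placeholder names is an assumption); in practice you would need the real, complete roster.

```python
import random

# Hypothetical roster: every member of the population must be listed.
student_names = [f"student_{i:04d}" for i in range(1, 2001)]

rng = random.Random(2024)  # fixed seed so the draw can be repeated

# random.sample draws without replacement, so every group of 30
# students has the same chance of being the one chosen.
simple_random_sample = rng.sample(student_names, 30)

print(simple_random_sample)
```

Notice that the hard part is not the drawing. It is building the complete list to draw from.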

56 Systematic Sampling Start with a list of every member of the population. Next, randomly pick a starting point (e.g., close your eyes, open the list and randomly point to a name). Finally, select every Kth member (e.g., every 10th name or every 100th name). Example: if you wanted to have a systematic sample of 100 people living in St. George, you might casually open the phone book, randomly point to a name, start there and then pick every 50th name until you get your sample of 100 people.
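Here is a rough sketch of the phone book example. The 10,000-entry phone book is made up, and wrapping around to the top of the list when we run off the end is an assumption the slide does not address.

```python
import random

def systematic_sample(names, k, n, seed=None):
    """Pick a random starting point, then take every k-th name until n are chosen."""
    if n * k > len(names):
        raise ValueError("list is too short for n picks spaced k apart")
    rng = random.Random(seed)
    start = rng.randrange(len(names))
    # Assumption: if we reach the end of the list, continue from the top.
    return [names[(start + i * k) % len(names)] for i in range(n)]

# Hypothetical phone book.
phone_book = [f"resident_{i:05d}" for i in range(10_000)]
systematic = systematic_sample(phone_book, k=50, n=100, seed=3)

print(systematic[:5])
```

Only the starting point is random; after that, the spacing does all the work.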

57 Stratified Sampling Start by dividing the population into at least two subgroups or strata (example: divide people into male and female groups/strata). Next, draw a random sample from each of the subgroups (example: randomly select 30 men and 30 women).
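A minimal sketch of the "30 men and 30 women" example. The list of 200 people and their labels are invented for illustration, and the helper name stratified_sample is hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, per_stratum, seed=None):
    """Group records into strata, then draw a random sample from EACH stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for record in records:
        strata[stratum_of(record)].append(record)
    return {name: rng.sample(members, per_stratum)
            for name, members in strata.items()}

# Hypothetical population of (name, gender) pairs.
people = [(f"person_{i:03d}", "male" if i % 2 else "female") for i in range(200)]

by_gender = stratified_sample(people, stratum_of=lambda p: p[1],
                              per_stratum=30, seed=5)

for stratum, members in by_gender.items():
    print(stratum, len(members))
```

Every stratum is guaranteed to show up in the sample, which a simple random sample cannot promise.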

58 Cluster Sampling Start by dividing the population into sections or clusters. Next, randomly select some of those clusters. Finally, include in your sample all members of the clusters you selected. Example: I want to take a poll to understand how people in St. George are going to vote in the next election. Using the Cluster Sampling technique, I divide up all of St. George into neighborhood blocks. Next, I randomly select some number of blocks, like say 10 blocks. Finally, I visit or call every single person or home that is in each of the blocks I selected. Note: You might notice that both Cluster Sampling and Stratified Sampling start by dividing the population into subgroups. The difference is that in Cluster Sampling, we randomly select groups/clusters and then include every member of those selected groups in our sample, whereas with Stratified Sampling, after we divide the population into subgroups, we randomly select individual members from each group.
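The neighborhood-block example might look something like this in code. The 50 blocks with 8 homes each are invented numbers, and cluster_sample is a hypothetical helper name.

```python
import random

def cluster_sample(clusters, num_clusters, seed=None):
    """Randomly choose whole clusters, then keep EVERY member of the chosen ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), num_clusters)
    members = [member for name in chosen for member in clusters[name]]
    return chosen, members

# Hypothetical city: 50 blocks, 8 homes on each block.
blocks = {f"block_{b:02d}": [f"home_{b:02d}_{h}" for h in range(8)]
          for b in range(50)}

chosen_blocks, sampled_homes = cluster_sample(blocks, num_clusters=10, seed=11)

print(chosen_blocks)
print(len(sampled_homes), "homes to visit")
```

Compare this with stratified sampling: here the randomness picks the groups, and then everyone in a picked group is included.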

59 A nonrandom sampling method - Convenience Sampling – (this is bad statistics)
A convenience sample consists of subjects that were selected simply because they were readily available or easy to get a hold of. You see this all the time on TV News where the reporter is on the street asking people what they think about something or other. This is very bad statistics and you should not pay much attention to information or data collected in this manner. Hey! Do you believe in the death penalty?

60 Remember: When it comes to Statistics you can never be sure!
As you have already learned, because of the cost and time involved, it is usually not practical to collect data from an entire population. That is why we use Random Sampling – to have a smaller set of subjects that we study in order to understand the larger population. Sampling is cost-effective and convenient, but there is a problem we have to live with. That is, there is always some difference between the sample result and the true population result. For example: if we had a sample of 30 DSC students and took the average GPA for all 30 students, and then compared it to the average GPA for our population (all DSC students), we would surely find at least some small difference. This difference is called “Sampling Error”. Sampling error will exist no matter how well we do our study. Whenever we do a study, we also have what’s called “Non-sampling Error”. This is error that results from not doing a “perfect” study. We might get “non-sampling error” by incorrectly collecting data, making a mistake recording the data, doing a poor job analyzing the data, etc. This kind of error is controllable and we should try to avoid it by understanding how to do statistics and then doing it right.
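Sampling error can be seen in a small simulation. Everything here is made up for illustration: the population of 10,000 random GPAs, the sample sizes, and the 200 repeated samples per size. The sketch only covers sampling error; non-sampling error (bad data collection, recording mistakes) cannot be simulated away like this.

```python
import random
import statistics

rng = random.Random(42)

# Hypothetical population of GPAs (an assumption, not real student data).
population = [rng.uniform(1.0, 4.0) for _ in range(10_000)]
true_mean = statistics.mean(population)

def average_sampling_error(n, trials=200):
    """Average gap between a sample mean and the true mean over many samples."""
    gaps = []
    for _ in range(trials):
        sample = rng.sample(population, n)
        gaps.append(abs(statistics.mean(sample) - true_mean))
    return statistics.mean(gaps)

errors_by_n = {n: average_sampling_error(n) for n in (30, 300, 3000)}

for n, error in errors_by_n.items():
    print(f"n = {n:4d}: typical sampling error about {error:.4f} GPA points")
```

The error never reaches zero, but it tends to shrink as the sample gets bigger.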

61 Test Yourself Can you tell which of the following examples is an Observation Study and which is an Experiment? Cans of Coke are opened and the volume measured (observation) A new drug for treating insomnia is tested by recording its effects on students (experiment) The effectiveness of multimedia teaching is tested using a sample of students who complete a course of study using the multimedia approach (experiment)

62 Test Yourself ID the type of sampling used:
The Gallup Organization plans to conduct a poll of NYC residents living within the “212” area code. Computers are used to randomly generate telephone numbers that are automatically called (simple random sample) An ABC News reporter polls people as they pass him/her on the street (convenience) A GM researcher has partitioned all registered cars into categories of subcompact, compact, mid-size, intermediate, and full-size. She is surveying 200 randomly selected car owners from each category (stratified) The Washington County Commissioner of Jurors obtains a list of 42,763 car owners and constructs a pool of jurors by selecting every 100th name on that list (systematic) hint: choose from Random / Systematic / Stratified / Cluster / Convenience

63 You Need to Know Excel Just a reminder – you need to know how to use Excel in order to do well in this class. If you do not feel comfortable using Excel, I suggest you take a class in Excel before taking this class.

64 Introduction to Statistics
Section 1A E N D William Christensen, Ph.D.

