## Presentation on theme: "Copyright © 2007 Pearson Education, Inc Publishing as Pearson Addison-Wesley. Misuses of Statistics."— Presentation transcript:

2 Questionable Sampling An instructor asks 10 students in the front row a question. Eight of the 10 are able to answer. • Does it follow that 80% of the class “gets it”? This is a “convenience sample,” and the result may be confounded: • Do students with good study habits sit near the front? • Is there more eye contact with the instructor, so more learning? • Are front-row students less likely to fall asleep?

3 More bad sampling: CNN quick vote

Misuse # 1- Bad Samples • Voluntary response sample (or self-selected sample): one in which the respondents themselves decide whether to be included. In this case, valid conclusions can be made only about the specific group of people who agree to participate.

Misuse # 2- Small Samples Conclusions should not be based on samples that are far too small. Example: Basing a school suspension rate on a sample of only three students

Misuse # 3- Graphs To correctly interpret a graph, you must analyze the numerical information given in the graph, so as not to be misled by the graph’s shape.

Misuse # 4- Pictographs Part (b) is designed to exaggerate the difference by increasing each dimension in proportion to the actual amounts of oil consumption.
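A short worked relation shows why scaling every dimension exaggerates the difference. As an illustration (the symbols are assumptions, not from the slides), let k be the true ratio of the two amounts, and suppose the picture's width, height, and apparent depth are each multiplied by k. The viewer then compares areas or volumes, not lengths:

```latex
\[
\frac{\text{area}_2}{\text{area}_1} = k^2,
\qquad
\frac{\text{volume}_2}{\text{volume}_1} = k^3
\]
```

For example, if one amount is twice the other (k = 2), the larger picture covers 4 times the area and appears to hold 8 times the volume.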

Misuse # 5- Percentages Misleading or unclear percentages are sometimes used. For example, if you take 100% of a quantity, you take it all. 110% of an effort does not make sense.

Other Misuses of Statistics • Loaded Questions • Order of Questions • Refusals • Correlation & Causality • Self Interest Study • Precise Numbers • Partial Pictures • Deliberate Distortions


12 Doing it right: Simple Random Sample (SRS) A simple random sample (SRS) of size n consists of n individuals from the population chosen in such a way that every set of n individuals has an equal chance to be the sample actually selected. • Picture drawing names from a hat. Method: • Label each individual with the same number of digits • Read from Table B in successive groups of digits

13 Choosing a simple random sample

14 4) Choose a random sample of size 5 by reading through the list of two-digit random numbers, starting at line 103 and continuing on. 5) The first five random numbers that match numbers assigned to people make up the SRS.
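The table-reading procedure can be sketched in Python. This is only an illustration: the function name `srs_from_digits`, the 30-person labeling, and the digit string are assumptions (the digits are arbitrary, not a line copied from Table B). Groups that match no label, or repeat an earlier pick, are skipped.

```python
def srs_from_digits(digits, labels, n):
    """Read random digits in successive groups and keep the first n
    distinct groups that match an assigned label."""
    width = len(labels[0])  # every label has the same number of digits
    chosen = []
    for start in range(0, len(digits) - width + 1, width):
        group = digits[start:start + width]
        # Skip groups that label nobody, and skip repeats
        if group in labels and group not in chosen:
            chosen.append(group)
            if len(chosen) == n:
                break
    return chosen

# Hypothetical population of 30 people labeled 01..30
labels = [f"{i:02d}" for i in range(1, 31)]
# Arbitrary illustrative digits standing in for a line of the table
digits = "6907215843302671190458"
sample = srs_from_digits(digits, labels, 5)
print(sample)
```

Here the groups 69, 58, 43, and 71 are passed over because no one carries those labels.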

15 Simpler in practice to use a calculator for an SRS
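Software plays the same role as the calculator. A minimal sketch, assuming a hypothetical roster of 30 labels; `random.sample` draws without replacement, so every set of n individuals is equally likely. The function name and the seed are illustrative only.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return an SRS of size n drawn without replacement."""
    rng = random.Random(seed)  # a seed only makes the illustration repeatable
    return rng.sample(population, n)

# Hypothetical roster: 30 people labeled 01..30
roster = [f"{i:02d}" for i in range(1, 31)]
srs = simple_random_sample(roster, 5, seed=103)
print(srs)
```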

16 Sample Surveys: Other Sampling Designs: Stratified Random Sampling To select a stratified random sample, classify the population into groups of similar individuals, called strata. 1. Choose a separate SRS in each stratum 2. Combine these SRSs to form the full sample • Choosing the strata well can help factor out sources of bias

17 Stratified Sampling Example Should the town put in a city swimming pool? • Upper- and middle-income housing: 60% own pools, 70% oppose building a city pool • Low-income housing: 7% own pools, 95% want a city pool. Two SRSs, one in each stratum, should be considered if you wish to remove the bias of those who currently have pools
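A stratified sample for the pool question could be drawn as below. This is a sketch under stated assumptions: the household IDs, the stratum sizes (400 and 600), and the per-stratum sample sizes are all hypothetical, and `stratified_sample` is an illustrative helper name.

```python
import random

def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw a separate SRS inside each stratum, keeping strata labeled."""
    rng = random.Random(seed)
    return {
        name: rng.sample(members, n_per_stratum[name])
        for name, members in strata.items()
    }

# Hypothetical household lists for the two housing strata
strata = {
    "upper_middle_income": [f"U{i:03d}" for i in range(400)],
    "low_income": [f"L{i:03d}" for i in range(600)],
}
picked = stratified_sample(
    strata, {"upper_middle_income": 20, "low_income": 30}, seed=1
)
# The full sample is the combination of the per-stratum SRSs
full_sample = [h for households in picked.values() for h in households]
print(len(full_sample))
```

Keeping the two groups labeled lets the answers be reported per stratum before they are combined.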

18 Cluster Sample Used when the population falls into naturally occurring subgroups, each with similar characteristics • Divide the population into groups called clusters • Select all of the members in one or more (but not all) of the clusters. Example: an instructor teaches 5 sections of a stats class. To determine if Statistics students like playing Sudoku, he could survey 1 section (chosen by SRS). This section is a “cluster” of the larger population. The assumption is that all clusters are homogeneous
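The sections example can be sketched as follows, assuming five hypothetical sections of 25 students each. The helper name `cluster_sample` and the student IDs are illustrative. Note that the random choice is made among clusters, and then every member of a chosen cluster is surveyed.

```python
import random

def cluster_sample(clusters, k, seed=None):
    """Pick k whole clusters at random and return all of their members."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), k)  # SRS of cluster names
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical: 5 sections, 25 students per section
sections = {
    f"section_{s}": [f"s{s}_{i:02d}" for i in range(25)] for s in range(1, 6)
}
surveyed = cluster_sample(sections, 1, seed=7)
for name, students in surveyed.items():
    print(name, len(students))
```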

19 Multistage Sampling Real sample surveys will commonly use several stages of sampling, with SRS used at each stage. Example:

20 Sample Surveys: Telephone Most sample surveys are conducted over the telephone. Advantages: • Calls can be made from one central facility • Can use computer-assisted interview methods • Since not face-to-face, interviewer effects are diminished • Sampling can be done via random digit dialing • Easy to form geographic strata, since phone numbers indicate area

21 Sample Surveys: Telephone Disadvantages: • Shorter window of time allowed by interviewees • Small proportion do not have phones • Individuals with only cell phones are harder to reach • In some areas, people put their phone numbers on a “no-call” list • Can't use visual aids • Response rates tend to be lower

22 Sample Surveys: Problems Undercoverage • Groups in the population are left out: CNN polls, phone polls (recall the 1936 FDR presidential election) Global Non-response • An individual chosen can't be contacted, or refuses to participate Item Non-response • An individual may be unwilling to answer specific questions (shame?) Response bias • Race of interviewer • Knowing that you should have voted, so you lie

23 Response Bias due to wording 13% agreed that too much money is spent on “assistance to the poor” 44% agreed that too much money is spent on “welfare”

24 Learning about populations from samples The techniques of inferential statistics allow us to draw inferences or conclusions about a population from a sample. • Your estimate of the population is only as good as your sampling design. Work hard to eliminate biases. • Your sample is only an estimate, and if you randomly sampled again, you would probably get a somewhat different result. • The bigger the sample the better. We’ll get back to it in later chapters.
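A small simulation illustrates the last two points. It assumes a made-up population of 10,000 people in which 60% would answer "yes"; the sample sizes (10 and 400), the number of repeats, and the function name are illustrative choices, not values from the slides.

```python
import random
import statistics

def sample_proportions(population, n, repeats, seed=None):
    """Draw `repeats` SRSs of size n and return each sample's proportion of 1s."""
    rng = random.Random(seed)
    return [sum(rng.sample(population, n)) / n for _ in range(repeats)]

# Hypothetical population: 60% "yes" (1), 40% "no" (0)
population = [1] * 6000 + [0] * 4000

small = sample_proportions(population, 10, 500, seed=0)
large = sample_proportions(population, 400, 500, seed=0)

# Repeated samples give different estimates; larger samples vary less
print("spread with n=10: ", round(statistics.stdev(small), 3))
print("spread with n=400:", round(statistics.stdev(large), 3))
```

Each run of an SRS gives a slightly different estimate, and the estimates from samples of 400 cluster much more tightly around 0.6 than those from samples of 10.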

25 Math 119: Elementary Statistics Producing Data: Experiments Monday, June 22, 2009

26 Overview Experiments Experimenting badly Randomized comparative experiments Cautions about experimentation Matched pairs and other block designs

27 Learning Outcomes Outline the design of a randomized experiment • Show sizes of groups, treatments, response variables. Randomly assign subjects to groups. Recognize the placebo effect and when the double-blind technique should be used. Explain why a randomized comparative experiment can give good evidence for cause-and-effect relationships

28 Experiment Comprised of subjects upon which/whom we impose treatments • Treatments = explanatory variables = factors

29 The Carolina Abecedarian Project http://www.fpg.unc.edu/~abc/ It provided full-day, year-round, center-based educational child care from infancy through age five. Between 1972 and 1977, the program enrolled 111 infants at high risk for poor cognitive and academic outcomes due to environmental circumstances such as poverty. 57 were randomly assigned to receive early educational intervention and 54 were in a control group. Home visits were conducted when children were 6, 18, 30, 42, and 54 months of age. 104 participated in an age-21 follow-up study, which included mental health screening. B Smith: watched 6 min video

30 Study Results Daycare, yes or no? • Subjects had decreased measures of depression and increased scores on IQ tests

31 The CAP http://www.fpg.unc.edu/~abc/ Policy Implications * The importance of high quality, educational childcare from early infancy is now clear. The Abecedarian study provides scientific evidence that early childhood education significantly improves the scholastic success and educational attainments of poor children even into early adulthood. * Welfare reform has increased the likelihood that poverty children will need early childcare. Steps must be taken to ensure that quality childcare is available and affordable for all families. This is especially critical for poor families. * Learning begins in infancy. Every child deserves a good start in an environment that is safe, healthy, emotionally supportive, and cognitively stimulating. * Childcare officials should be aware of the importance of quality care from the very first months of life. * Quality care requires sufficient well-trained staff to ensure that every child receives the kind of appropriate, individualized attention provided by the Abecedarian model. * Future research should concentrate on identifying the specific learning techniques most effective for all groups and types of young children. * Poverty is increasing among America's children. At the same time, more and more of them will require out of home care. We must not lose the opportunity to provide them with the early learning that will increase their chances for later success.

32 Experiment Design The design of an experiment specifies • the treatments • the manner in which the subjects are assigned to the treatments

33 Does learning Scratch lead to improvements in Math skills? How do we test the hypothesis that using and programming in Scratch develops transferable skills? Design the experiment scratch.mit.edu B Smith: pretty good ex for thinking about designing experiment.

34 Principles of Statistical Design of Experiments Principle 1: • Combat bias using control and randomization Principle 2: • Increase certainty by using “enough” subjects to observe variations and central tendencies

35 combating bias: randomization comparison placebo effect

36 Control: placebo effect PLACEBO EFFECT • In experiments, sometimes the very belief that you are receiving treatment has a non-negligible effect. • The belief becomes a confounding factor in the experiment. What if you are aware of your participation in an experiment, but you are not told whether or not you're in the group receiving treatment? Keeping subjects unaware in this way is a form of control

37 Control: Double-blind experiments and placebos Getting unbiased results via blinding • In a double-blind experiment, neither the subject nor the experimenter providing treatment knows whether the treatment is a placebo

38 Control: forming blocks of individuals Completely Randomized Design: • Subjects are randomly assigned to different treatment groups • An alternative: blocks. Blocks: groups of subjects with similar characteristics • Randomized block design: within each block, subjects are randomly assigned to treatment groups [Figure: all subjects are split into blocks (30-39 years old, 40-49 years old, over 50 years old); each block is then divided into control and treatment groups]
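A randomized block design with age blocks can be sketched as below. The 30 subjects, their ages, the helper names, and the treatment of age 50 itself as part of the oldest block are assumptions for illustration. The key point is that shuffling happens separately inside each block.

```python
import random
from collections import defaultdict

def age_block(subject):
    """Place a (name, age) subject into an age block (boundaries assumed)."""
    _, age = subject
    if age < 40:
        return "30-39"
    if age < 50:
        return "40-49"
    return "50 and over"

def randomized_block_assignment(subjects, block_of, treatments, seed=None):
    """Randomly assign treatments separately within each block."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for s in subjects:
        blocks[block_of(s)].append(s)
    assignment = {}
    for members in blocks.values():
        rng.shuffle(members)  # random order within this block only
        for i, s in enumerate(members):
            # Alternate down the shuffled list so groups are equal in size
            assignment[s] = treatments[i % len(treatments)]
    return assignment

# Hypothetical subjects aged 30..59, ten per block
subjects = [(f"subj{i:02d}", 30 + i) for i in range(30)]
assignment = randomized_block_assignment(
    subjects, age_block, ["control", "treatment"], seed=38
)
print(assignment[subjects[0]])
```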

39 Control: matched pairs design matched-pairs design • subjects are paired up according to a similarity • one subject is randomly chosen to receive one treatment, the other subject gets a different treatment • Possible pairing criteria: age, weight, ethnicity/culture, geographical location
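One way to sketch a matched-pairs assignment: sort subjects by the pairing criterion (age here), pair up neighbors, and randomize which member of each pair receives which treatment. The subjects, the treatment names "A" and "B", and the helper name are hypothetical.

```python
import random

def matched_pairs_assignment(subjects, key, treatments=("A", "B"), seed=None):
    """Pair adjacent subjects after sorting by `key`, then randomize
    which member of each pair gets which treatment."""
    rng = random.Random(seed)
    ordered = sorted(subjects, key=key)
    assignment = {}
    # With an odd number of subjects, the last one is left unpaired
    for i in range(0, len(ordered) - 1, 2):
        pair = [ordered[i], ordered[i + 1]]
        rng.shuffle(pair)  # random order within the pair
        assignment[pair[0]] = treatments[0]
        assignment[pair[1]] = treatments[1]
    return assignment

# Hypothetical (name, age) subjects
subjects = [("p1", 23), ("p2", 41), ("p3", 24), ("p4", 39),
            ("p5", 57), ("p6", 31), ("p7", 58), ("p8", 30)]
pairs = matched_pairs_assignment(subjects, key=lambda s: s[1], seed=39)
print(pairs)
```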

40 Check yourselves An experiment is being performed to test the effects of sleep deprivation on memory recall. Two hundred students volunteer for the experiment. The students will be placed in one of five different treatment groups, including the control group. • explain how you could design an experiment so that it uses a randomized block design • explain how you could design an experiment so that it uses a completely randomized design.
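For the completely randomized case, one possible mechanism is to shuffle all 200 volunteers and cut the shuffled list into five equal groups. This is only a sketch of the assignment step: the student IDs and group names are hypothetical, and equal group sizes of 40 are an assumption.

```python
import random

def completely_randomized(subjects, groups, seed=None):
    """Shuffle all subjects together, then split them into equal groups."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    size = len(shuffled) // len(groups)
    return {
        g: shuffled[i * size:(i + 1) * size] for i, g in enumerate(groups)
    }

# Hypothetical volunteers and group names (one control, four treatments)
students = [f"student_{i:03d}" for i in range(200)]
groups = ["control", "treatment_1", "treatment_2", "treatment_3", "treatment_4"]
design = completely_randomized(students, groups, seed=40)
print({g: len(members) for g, members in design.items()})
```

A randomized block version would first split the volunteers into blocks by some shared characteristic and then repeat this shuffle-and-split inside each block.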


42 Individuals and Variables “Individuals” are the individual things or people the data describe “Variables” are the characteristics of the individuals, possibly with different values for different individuals

43 Categorical and Quantitative Variables Categorical variables • place an individual into a group or category (e.g., color, gender) Quantitative variables • take numeric values for which arithmetic operations (e.g., adding, averaging) make sense

44 alligator attacks: categorical vs quantitative variables which is categorical, which is quantitative? B Smith: not clear

45 Categorical Variables: can use pie charts and bar graphs. Extracting the main features of data is called “exploratory data analysis”: we attempt to provide a description of the data. 2 basic principles • examine each variable by itself first • start with a graph, then add numerical summaries categorical variable distribution

46 Distribution of a variable The distribution of a variable tells us what values the variable takes and how often it takes each value; for a categorical variable, it shows how the individuals fall into the categories of interest categorical variable distribution

47 looking at distribution using a pie chart categorical variable distribution

48 looking at distribution using a bar graph categorical variable distribution

49 Quantitative variables: from bar charts to histograms percent of individuals with BS degree If more people have college degrees, will that affect the state’s economy? Let’s look at the data. What’s wrong with this bar graph? Histograms help diminish the complexity of data “consumed”

50 Quantitative variables: from bar charts to histograms- II percent of individuals with BS degree in Excel: Tool->DataAnalysis->Histograms

51 histogram example: MDs per 100,000 How many MDs per 100,000 people?

52 Interpreting Histograms Look for the overall pattern • note striking deviations. Shape, center, and spread are the important characteristics; so-called “outliers” are deviations. Steps: • choose the classes, divide data into classes of equal width • count the individuals in each class • draw the histogram, each bar represents a “class” or “span”
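The three steps can be sketched in a few lines. The data values, the class boundaries (width 10 starting at 10), and the function name are made up for illustration; each class includes its lower endpoint and excludes its upper one.

```python
def histogram_counts(values, low, width, n_classes):
    """Count how many values fall in each equal-width class
    [low, low + width), [low + width, low + 2*width), ..."""
    counts = [0] * n_classes
    for v in values:
        k = int((v - low) // width)  # index of the class containing v
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# Hypothetical data
data = [12, 15, 21, 22, 22, 27, 31, 38, 45]
counts = histogram_counts(data, low=10, width=10, n_classes=4)

# Draw a text histogram: one bar per class
for k, c in enumerate(counts):
    lo = 10 + 10 * k
    print(f"{lo}-{lo + 10}: " + "#" * c)
```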

53 Describing Distributions note bar graph below • distribution of the percent of college graduates with twice as many classes

54 Symmetric and Skewed Distributions symmetric • left and right sides are nearly mirror images skewed • leaning to left or right

55 Symmetric Distribution

56 Skewed Distributions left (negatively) skewed right (positively) skewed

57 Quantitative variables: stemplots stemplots • good for small data sets • present more detailed information. Steps to making a stemplot • separate each observation into a stem and a leaf • write stems in a vertical column, smallest at top • write each leaf in the row to the right of its stem, in increasing order

58 Data for stemplot; examples to work at board

59 stemplots earnings of high school grads (in thousands) eg02-05.dat

    0 | 56
    1 | 29
    2 | 01245
    3 | 12
    4 | 0337
    5 |
    6 | 7

The row 4 | 0337 represents the numbers 40000, 43000, 43000, and 47000.
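The stemplot steps can be sketched in Python. The values below are read off the stemplot on this slide (in thousands), assuming the last row is stem 6 with leaf 7; the function name is illustrative. Stems with no observations are kept so that gaps stay visible.

```python
def stemplot(values):
    """Build stemplot rows for non-negative integers: tens digit(s) are
    the stem, the ones digit is the leaf."""
    lo, hi = min(values) // 10, max(values) // 10
    rows = {stem: [] for stem in range(lo, hi + 1)}  # keep empty stems
    for v in sorted(values):  # sorting puts leaves in increasing order
        rows[v // 10].append(str(v % 10))
    return [f"{stem} | {''.join(leaves)}" for stem, leaves in rows.items()]

# Earnings in thousands, as read from the slide's stemplot
earnings = [5, 6, 12, 19, 20, 21, 22, 24, 25, 31, 32, 40, 43, 43, 47, 67]
lines = stemplot(earnings)
for line in lines:
    print(line)
```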

60 time plot: water levels in the Everglades time plots present each observation against the time of measurement look for cycles and trends.

61 Summary Data sets contain information on a number of individuals. For each individual, the data give values for one or more variables • variables describe, for example, a person’s height, sex, or salary. Some variables are categorical, some quantitative • categorical variables: e.g., male or female • quantitative variables have numerical values like height in cm or salary in dollars. Exploratory data analysis: • use graphs and numerical summaries to describe the variables in a data set and the relations among them. Distribution: • describes values taken by a variable and how often • Pie charts and bar graphs show the distribution of a categorical variable. • Histograms and stemplots graph the distribution of a quantitative variable. • Shape, center, and spread describe patterns of distribution of a quantitative variable. • Outliers are observations outside the overall pattern of distribution. Time plots • used when observations are taken over time. Note trends and cycles.