1
The Nature of Statistics:
07/16/96 The art of learning about and understanding our world through data.
2
Essentials: The Nature of Statistics
(a.k.a. the bare minimum I should take along from this topic.)
Definitions and relationships as presented on the "Anatomy of the Basics: Statistical Terms and Relationships" sheet
Identification of variables and their characteristics
Careful review of data and their presentation
Providing a context for the data
Why use percentages rather than numeric counts when making comparisons
3
69.2     80     35,000
What do you know about these numbers? What do they mean to you? What is missing?
4
Okay, so What is Statistics?
(or is that What ARE Statistics?)
Statistics is the study of how to collect, organize, analyze, interpret, and report numerical information in order to make decisions.
Statistics are the numeric data we use to better understand our world. They may take the form of frequencies, means, percentages, variances, etc.
5
What is a Study?
A study is a search for information of interest to us. There are 3 types:
Observational: observe and measure; can identify association, not causation.
Experimentation: impose treatment and observe characteristics; can help establish causation.
Simulation: using computers to simulate situations that are not practical to do in real time.

OBSERVATIONAL STUDY: Researchers simply observe characteristics and take measurements. E.g. we are interested in diet and how it affects a child's growth. Diet and growth are monitored and recorded over a specified period of time. Observation is passive: the observer records data without interfering with the process being observed. Observational studies can reveal only association (as opposed to causation).

EXPERIMENTATION: Researchers impose treatments and controls and then observe characteristics. E.g. we are interested in how a specific vitamin added to a child's diet affects growth. A specific amount of a specific vitamin is added to one group of children's diets, while another group receives no addition of the vitamin. All else remains the same. Growth is recorded over a specified period of time. Experimentation is active: the experimenter attempts to completely control the experimental situation. Experimentation can help establish causation.

Okay, so we've decided to do a study. What is next?
6
Basic Terminology
DATA: Numbers with a context, i.e. numbers with meaning. Examples: not 48.2, but 48.2 kg; not 5.23, but 5.23 inches.
VARIABLE: A characteristic or property of an individual population unit that varies from one person or thing to another. Examples: age, square footage, and assessed value represent three variables associated with homes in Oneonta. Variables have values. Example: the variable hair color has the values brown, blonde, red, etc.
UNIT (Element): Any individual member of the defined population. Examples: each bottle of soda in a production run is a unit; each penny in a roll of pennies is a unit; each person enrolled in a class is a unit.

Elements are also known as units, or as individuals if studying people.
A variable is a characteristic of a unit of the population/sample for which data is collected, e.g. height, weight, car type, bathroom tissue.
Put this all together: Suppose I wish to know the percentage of all students taking stat101 this semester who are doing so for the sheer interest in statistics. What are the population, the variable, the data collected, and the response values? Select a sample from the population. Following a specified procedure, ask a question, collect answers, etc.
7
Data: One variable (here unidentified, i.e. no context), multiple values
"Raw" data (N=160)
"Organized" raw data (N=160)
Unit
73 "different" numbers
8
Time Period Otsego Lake was Frozen (days)
Raw Data Grouped Data
9
Time Period Otsego Lake was Frozen (days)
10
Data: Two Variables: year and days; multiple values
11
Time Period Otsego Lake was Frozen: Mean Days/Decade
12
Time Period Otsego Lake was Frozen: Mean Days/Decade
13
So is the Greenhouse Effect at work here?
To be studied through further statistical analysis, such as the use of ANOVA…
14
Anatomy of the Basics: Statistical Terms and Relationships
Statistics is the study of how to collect, organize, analyze, interpret and report numerical information.
  Descriptive Statistics: methods for organizing and summarizing information. E.g. number of students in this class by major, baseball standings, housing sales by month.
  Inferential Statistics: methods for drawing conclusions and measuring the reliability of those conclusions using sample results. E.g. political views of all 4-year college students.

Population vs. Sample
  Population: all individuals, items, or objects whose characteristics are being studied.
  Census: data collected from ALL members of the population.
  Parameter: numerical characteristic of a population.
  Sample: a portion of the population selected for study.
  Statistic: numerical characteristic of a sample.

Variable: a characteristic or property of an individual unit. Variables have values.
  Qualitative: a variable that cannot be measured numerically. E.g. gender, eye color.
  Quantitative: a variable that can be measured numerically. E.g. income, height, number of siblings one has.
    Discrete: a variable whose values are countable. It can only assume certain values, with no intermediate values. E.g. number of auto accidents in Oneonta in 1998.
    Continuous: a variable that can assume any numerical value over an interval or intervals. E.g. time.

Scaling of Variables (Measurement Levels)
  No Arithmetic Operations: individual observations can only be categorized or ranked.
    Nominal: grouping individual observations into qualitative categories or classes. E.g. grouping individuals by whether they are left-handed or right-handed.
    Ordinal: individual observations are assigned a number or "ranking." There is a sense of "more than," but you cannot say "how much" more than. E.g. military ranks.
  Arithmetic Operations: individual observations have meaningful numeric values.
    Interval: variables have no true zero point. Can say how much more, but not how many times more. E.g. temperature (°F or °C), IQ scores.
    Ratio: variables have a true zero point. Can say how much more and how many times more. E.g. weight, height.
15
Population Basic Terminology
POPULATION: Complete collection of all elements or units (usually people, objects, transactions, or events) that we are interested in studying. In terms of data, a population is the collection of all outcomes, responses, measurements, or counts that are of interest.
CENSUS: A complete enumeration (or accounting) of the population, i.e. collecting data from every element (or unit) in the population.
PARAMETER: A numeric value associated with a population. Example: the average height of ALL students in this class, given that the class has been defined as a population.
16
Sample Basic Terminology
SAMPLE: A subset of a population from which information is collected. Example: 25 cans of corn (sample) randomly obtained from a full day's production (population).
STATISTIC: A numeric value associated with a sample. Example: the average height of 10 individuals randomly selected from the class (defined population).
INFERENCE: An estimate, prediction, or some other generalization about a population based on information contained in a sample. Example: Based upon a randomly selected sample of 25 flights at JFK International Airport (the sample; individual flights are units) taken from all flights on Dec. 24, 2009 (defined population), we can state with a degree of confidence that the mean delay for the population of the day's flights was 35 minutes (a sample statistic, in context, being inferred to the population).
17
In Summary
To include ALL units, you are looking at: POPULATION, CENSUS, PARAMETERS
To work with a subset of all units, you are looking at: SAMPLE, STATISTICS, INFERENCES to a population
Parameter goes with Population; Statistic goes with Sample.
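The parameter/statistic distinction can be sketched in a few lines of Python. The class heights below are made-up values generated only for illustration (they are not data from the slides); the point is that the same measure, the mean, is a parameter when computed from every unit and a statistic when computed from a sample.

```python
import random
import statistics

# Hypothetical population: heights (in inches) of all 30 students in a class.
# These values are generated for illustration only.
random.seed(1)
population = [round(random.uniform(60, 76), 1) for _ in range(30)]

# Parameter: a numeric value computed from ALL units (a census).
parameter_mean = statistics.mean(population)

# Statistic: the same measure computed from a sample (a subset of the units).
sample = random.sample(population, 10)
statistic_mean = statistics.mean(sample)

print(f"Population mean (parameter): {parameter_mean:.2f}")
print(f"Sample mean (statistic):     {statistic_mean:.2f}")
```

The two means will usually be close but not equal; using the statistic to say something about the parameter is the inference step.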
18
Example: Identifying Data Sets
In a recent survey, 1708 adults in the United States were asked if they think global warming is a problem that requires immediate government action. Nine hundred thirty-nine of the adults said yes. Describe the data set.
Identify:
The population:
The sample:
A variable being studied:
Values of the variable:
Source: Adapted from Pew Research Center; Larson/Farber 4th ed.
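As a small aside on why percentages help put counts in context, the two numbers given in the survey can be turned into a sample proportion. This is only a sketch using the counts stated above; the variable names are chosen here for illustration.

```python
# Counts stated in the survey description.
sample_size = 1708   # adults surveyed (the sample)
said_yes = 939       # respondents who answered "yes"

# Sample statistic: the proportion of the sample that said yes.
proportion_yes = said_yes / sample_size

print(f"{said_yes} of {sample_size} respondents said yes")
print(f"That is {proportion_yes:.1%} of the sample")
```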
19
Examples: Populations & Samples
Smoking: Identify the population and the sample. In a survey, 250 college students at Union College were asked if they smoked cigarettes regularly. Thirty-five of the students said yes.
Student Income: Decide whether the numerical value describes a population parameter or a sample statistic. A survey of 450 Cornell University students reported their average weekly income from part-time employment was $325.
For both of the above studies:
What are the units of the population/sample?
Identify a variable being studied.
Identify values of the variable.
20
Descriptive Statistics:
DESCRIPTIVE STATISTICS: Organize and summarize information using numerical and graphical methods.
Examples:
Summarizing the age of cars driven by students in a frequency table.
Graphing the ages of students.
Identifying the mean speed of cars driving in a 30 mph zone.
A descriptive statement describes some aspect of the data. (Select a statistical measure and put it into sentence format.)
Thirty-eight percent of the orange trees suffered damage due to the cold temperatures.
The average weight for the 23 cars studied was 2,738 lb.
The mean number of days Otsego Lake was frozen per winter was days.
21
Descriptive Statistics at Work: SUNY Oneonta Car Registrations
Numeric tables, pictures (graphs & charts), and text are three methods used to present data.
During the 2006 year there were cars registered at SUNY Oneonta. Car registrations contain many variables, such as car type, car color, year of car, and license plate number. Noted below are ways descriptive statistics are used to convey information about the selected variables: a frequency table of Registrant Type (i.e. who registered the car); a graphic presentation of Vehicle Age; and text (a written descriptive statement) presenting the mean Vehicle Age of the registered cars.
Frequency Table:
Graphic presentation (here a Histogram):
Mean & Median: The mean age of cars driven by students was 7.45 years (vs yrs. for employees). The median age of registered vehicles for students was 7.0 years (5.0 years for employees).
Here, I am simply reporting my results. I am not inferring the results to a greater population.
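The three descriptive summaries named on this slide (a frequency table, a mean, and a median) can be sketched with Python's standard library. The registration records below are hypothetical and invented for illustration; they are not the SUNY Oneonta data.

```python
from collections import Counter
import statistics

# Hypothetical registration records (illustrative only).
registrant_type = ["student", "student", "employee", "student", "employee", "student"]
vehicle_age = [2, 4, 5, 7, 7, 8, 9, 12, 15]  # age in years

# Frequency table for a qualitative variable: count each value.
frequency_table = Counter(registrant_type)
for value, count in frequency_table.items():
    print(f"{value:10s} {count}")

# Numeric summaries for a quantitative variable.
mean_age = statistics.mean(vehicle_age)
median_age = statistics.median(vehicle_age)
print(f"Mean vehicle age:   {mean_age:.2f} years")
print(f"Median vehicle age: {median_age} years")
```

Note that these statements only describe the records at hand; nothing is being inferred to a larger population.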
22
Inferential Statistics:
INFERENTIAL STATISTICS uses sample data to make estimates, decisions, predictions, or other generalizations about the population.
The aim of inferential statistics is to make an inference about a population, based on a sample (as opposed to a population census), AND to provide a measure of precision for the method used to make the inference.
An inferential statement uses data from a sample and applies it to a population.
In what may be more understandable terms: We want to be able to make a statement about a group as a whole, by examining just a portion of the group, and we want to be able to say just how "good" or accurate our statement is.
Let's look at some terminology that we will be using throughout the course:
23
Examples of Inferential Statistics:
A Gallup Poll found that 57% of dating teens had been out with somebody of another race or ethnic group (+/- 4.5%; 95% CI).
Interpretation: We are 95% confident that between 52.5% and 61.5% (57% +/- 4.5%) of dating teens have been out with someone of a different race/ethnicity.
A Gallup Poll found that 40% of Americans would quit their job if they won the lottery (+/- 4%; 95% CI).
Interpretation: We are 95% confident that the true population proportion of Americans who would quit their job if they were to win a lottery lies between 36% and 44%.
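The interval bounds in both interpretations are just the sample estimate plus and minus the reported margin of error. A minimal sketch of that arithmetic follows; the function name is chosen here for illustration, and the margins are taken as given by the polls rather than computed.

```python
def confidence_interval(estimate, margin):
    """Return (lower, upper) bounds: estimate minus/plus the margin of error."""
    return (estimate - margin, estimate + margin)

# Values in percentage points, as reported on the slide.
teens = confidence_interval(57, 4.5)
lottery = confidence_interval(40, 4)

print(f"Dating teens: {teens[0]}% to {teens[1]}%")
print(f"Lottery:      {lottery[0]}% to {lottery[1]}%")
```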
24
Example: Descriptive and Inferential Statistics
Decide which part of the study represents the descriptive branch of statistics. What conclusions might be drawn from the study using inferential statistics? A large sample of men, aged 48, was studied for 18 years. For unmarried men, approximately 70% were alive at age 65. For married men, 90% were alive at age 65. Source: (The Journal of Family Issues) Larson/Farber 4th ed.
25
Characteristics of Data
Before conducting any data analysis, the characteristics of the variable under study must be identified. This will result in utilizing appropriate tables, graphs and statistical analysis.
Two Types of Data
Qualitative Data can be separated into different categories (values) that are distinguished by some nonnumeric characteristic. Qualitative data are also referred to as categorical or attribute data. Examples include gender, eye color, and car brands.
Note that the values of this type of variable are differentiated by words rather than numeric values. Example: Eye color values include blue, brown, hazel, etc.
26
Quantitative Data are "number-based" and represent counts or measurements. This type of data may be subdivided into two categories:
Discrete Data result when the number of possible values is either a finite or a countably infinite number. Examples: siblings, cars, and coins in a jar (think of whole number counts here, even if you cannot count them all).
Continuous Data result from infinitely many possible values corresponding to some continuous scale that covers a range of values without gaps, interruptions, or jumps. Continuous data can assume any value, including fractional parts. Examples: height, weight, time.
N.B.: Qualitative data cannot be classified as discrete or continuous.
27
Example: Classifying Data by Type
The base prices of several vehicles are shown in the table. Which data are qualitative data and which are quantitative data? (Source Ford Motor Company) Source: Larson/Farber 4th ed.
28
4 Levels of Measurement
The level of measurement determines which statistical calculations are meaningful. The four levels of measurement, from lowest to highest, are: nominal, ordinal, interval, and ratio.
29
Levels of Measurement (cont.)
Nominal: characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme. Qualitative data. Examples: gender, yes/no, political party affiliation, names of students.
Ordinal: characterized by data that can be arranged in some order, but the differences between data values either cannot be determined or are meaningless. These variables may be either qualitative (categorical) data or quantitative (numerical) data. Examples: military rank, position in a race, attitude scales.
INTERVAL: temperature, years. Start with ratio and work backwards. Is there a zero? If not …
30
Levels of Measurement (cont.)
Interval: like the ordinal level, with the additional property that the difference between any two data values is meaningful. However, there is no natural zero starting point. Quantitative data. Examples: temperature (°F or °C); longitude; calendar years.
Ratio: the interval level modified to include the natural zero starting point. At this level, differences and ratios are both meaningful. Quantitative data. Examples: height, weight, time, age.
31
Summary of Levels of Measurement
Level of       Put data in    Arrange data   Subtract       Determine if one data value
measurement    categories     in order       data values    is a multiple of another
Nominal        Yes            No             No             No
Ordinal        Yes            Yes            No             No
Interval       Yes            Yes            Yes            No
Ratio          Yes            Yes            Yes            Yes
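Because each level permits everything the level below it permits plus one more operation, the summary table can be expressed as a simple ordering check. This is a sketch only; the short operation names are labels invented here for the four table columns.

```python
# Levels from lowest to highest, as on the slides.
LEVELS = ["nominal", "ordinal", "interval", "ratio"]

# Hypothetical short names for the four table columns, in the same order:
# put in categories, arrange in order, subtract values, compare as multiples.
OPERATIONS = ["categorize", "order", "subtract", "multiple"]

def is_meaningful(level, operation):
    """True if the operation is meaningful at the given level of measurement."""
    return OPERATIONS.index(operation) <= LEVELS.index(level)

for level in LEVELS:
    row = ["Yes" if is_meaningful(level, op) else "No" for op in OPERATIONS]
    print(f"{level:10s}", *row)
```

For example, `is_meaningful("interval", "multiple")` is false: 80 degrees F is not "twice as hot" as 40 degrees F, even though the 40-degree difference is meaningful.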
32
Example: Classifying Data by Level
Two data sets are shown. Which data set consists of data at the nominal level? Which data set consists of data at the ordinal level? (Source: Nielsen Media Research) Source: Larson/Farber 4th ed.
33
Example: Classifying Data by Level
Two data sets are shown. Which data set consists of data at the interval level? Which data set consists of data at the ratio level? (Source: Major League Baseball) Source: Larson/Farber 4th ed.
34
Anatomy of the Basics: Statistical Terms and Relationships
35
Misuse of Statistics
Ah yes, the old "torture the data long enough and they will confess to anything" routine...
Precise Numbers: Tonight's paid attendance was 56,423.
Guesstimates: It was estimated that one million spectators lined the road to L'Alpe d'Huez for the 16th stage of the 2004 Tour de France race.
Distorted Percentages: "New and improved with 50% more ..." (50% might not be a meaningful amount).
Partial Pictures: Ford truck ads.
Loaded Questions: Line item veto.
Misleading Graphs: Visual distortions of data.
Pictographs: The crescive cow.
Pollster Pressure: Public bathrooms.
Small/Bad Samples: 67% suspended.
Self-Selected Surveys: CNN phone-in surveys.

Small Samples: e.g. the toothpaste preferences of 10 dentists should not be used as the basis for a generalized claim such as "XYZ toothpaste is recommended by 7 out of 10 dentists." How representative and unbiased the data are comes into question.
Precise Numbers: A precise figure, such as an annual salary of $35,455.92, may be used to sound precise and instill a degree of confidence in its accuracy, whereas a figure of $35,400 may not convey that same sense. Precision does not mean accuracy. E.g. number of visitors to the Statue of Liberty.
Guesstimates: e.g. the crowd in Times Square to watch the millennium ball drop was put at 2 million people. If you have been to Times Square, do you think 2 million people could fit in that space? E.g. the estimate of people seeing the Pope in Miami was 250,000, while aerial photos and a grid system estimated more in the area of 150,000.
Distorted Percentages: e.g. "Our service has improved 100%." 100% of what base amount? It may not be a meaningful change.
Partial Picture: e.g. Ford pickup truck commercials noting that 94% (or whatever) of Ford trucks built in the past 10 years are still on the road. What you are not told is the number of trucks built per year: more were built in recent years, so most of the trucks included in the broad statement are actually only a couple of years old.
Deliberate Distortions: Data that does not exist or that has been misrepresented.
Loaded Questions: e.g. "Should the President have the line item veto to eliminate waste?" drew 97% yes in a mail survey, vs. "Should the President have the line item veto, or not?" to which 57% of randomly selected subjects responded yes. E.g. "Would you say that traffic contributes more or less to air pollution than industry?" vs. "Would you say that industry contributes more or less to air pollution than traffic?" First version: 45% traffic, 32% industry; second version: 24% traffic, 57% industry.
Misleading Graphs: e.g. a drawn bar chart of male vs. female median wages; truncating of charts.
Pictographs: Diminishing Rhinoceros. Misuse of proportion.
Pollster Pressure: Giving responses favorable to one's self-image. E.g. in a phone survey, 94% of people said they wash their hands after using the bathroom. An observational study found 68%.
Bad Samples: How the data were gathered. E.g. a Nightline poll in which 186,000 TV viewers each paid $.50 to express an opinion about the U.N. Because viewers chose to participate, this is an example of a self-selected survey.

Quality of the data
Preparation of the study and report
Statistical methodology
Lack of knowledge of the subject matter
Deliberate suppression of data
36
Pictograph: “This year my business profits doubled!”
If you double the length, width, and height of a cube, the volume will increase by a factor of 8.
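The factor of 8 follows directly from the volume formula, as this short sketch shows. The unit side length is an arbitrary choice for illustration; the last lines show what side scaling would be needed for the drawn volume to actually double.

```python
def cube_volume(side):
    """Volume of a cube with the given side length."""
    return side ** 3

original = cube_volume(1)   # arbitrary unit cube
doubled = cube_volume(2)    # every dimension doubled

# Doubling each dimension multiplies the volume by 2 * 2 * 2.
print(doubled / original)

# For the pictured volume to double, each side should grow by the
# cube root of 2 (about 1.26), not by 2.
honest_side = 2 ** (1 / 3)
print(round(honest_side, 2), round(cube_volume(honest_side), 2))
```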
37
Visual Presentations of Data – Beware
38
Data Considerations
Anecdotal Evidence: basing our conclusions on a few individual cases. E.g. we remember the airplane crash that kills several hundred people and fail to notice that data for all flights show that flying is much safer than driving.
Lurking Variables: almost all relationships between two variables are influenced by other variables lurking in the background.
This is by no means an exhaustive list, but simply some examples of what is "out there."
Lurking Variables: Let's take a look.
39
Airline Flights: Alaska Airlines vs. America West
Which would you choose to fly?

                   On Time          Delayed
Alaska Airlines    3274 (86.7%)     501 (13.3%)
America West       6438 (89.1%)     787 (10.9%)

Let's look at the percentage of late flights for each airline:
Alaska Airlines: 501/3775 = .133 or 13.3%
America West: 787/7225 = .109 or 10.9%
It appears that America West does better. But wait. Let's add a third variable, the city that the flight left from.
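The percentages above can be reproduced from the raw counts, which also illustrates why percentages rather than counts are used for comparison: America West has more delayed flights in absolute terms, yet a lower delay rate. The function name below is chosen for illustration.

```python
def percent_delayed(on_time, delayed):
    """Delayed flights as a percentage of all flights."""
    return 100 * delayed / (on_time + delayed)

# Counts from the slide's table.
alaska = percent_delayed(on_time=3274, delayed=501)
america_west = percent_delayed(on_time=6438, delayed=787)

print(f"Alaska Airlines: {alaska:.1f}% delayed")
print(f"America West:    {america_west:.1f}% delayed")
```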
40
Alaska Airlines vs. America West: A Closer Look

                      Alaska Airlines       America West
Departure Location    On Time   Delayed     On Time   Delayed
Los Angeles             497        62         694       117
Phoenix                 221        12        4840       415
San Diego               212        20         383        65
San Francisco           503       102         320       129
Seattle                1841       305         201        61
TOTAL                  3274       501        6438       787

Let's look again at the percentages of late flights. This time, by city.
Los Angeles: Alaska Airlines 62/559 = .111 or 11.1%; America West 117/811 = .144 or 14.4%
Phoenix: Alaska Airlines 12/233 = .052 or 5.2%; America West 415/5255 = .079 or 7.9%
San Diego: Alaska Airlines 20/232 = .086 or 8.6%; America West 65/448 = .145 or 14.5%
San Francisco: Alaska Airlines 102/605 = .169 or 16.9%; America West 129/449 = .287 or 28.7%
Seattle: Alaska Airlines 305/2146 = .142 or 14.2%; America West 61/262 = .233 or 23.3%

How is it that Alaska Airlines wins at every city but America West wins when we combine all the cities? Look at the data: America West flies most often from sunny Phoenix, where there are few delays. Alaska Airlines flies most often from Seattle, where fog and rain cause frequent delays. What city we fly from has a major influence on the chance of a delay, so including the city data reverses our conclusion. This is an example of what we call Simpson's Paradox, and it bears repeating: almost all relationships between two variables are influenced by other background (lurking) variables.
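The reversal can be checked directly from the counts in the table. The sketch below computes the delay rate per city and then for all cities pooled; the data structure and function names are illustrative choices, while the counts are those on the slide.

```python
# Flight counts from the slide: city -> (on_time, delayed)
alaska = {
    "Los Angeles": (497, 62),
    "Phoenix": (221, 12),
    "San Diego": (212, 20),
    "San Francisco": (503, 102),
    "Seattle": (1841, 305),
}
america_west = {
    "Los Angeles": (694, 117),
    "Phoenix": (4840, 415),
    "San Diego": (383, 65),
    "San Francisco": (320, 129),
    "Seattle": (201, 61),
}

def delay_rate(on_time, delayed):
    """Share of flights that were delayed."""
    return delayed / (on_time + delayed)

def overall_rate(airline):
    """Delay rate after pooling all cities together."""
    on_time = sum(o for o, _ in airline.values())
    delayed = sum(d for _, d in airline.values())
    return delay_rate(on_time, delayed)

# Within each city, Alaska Airlines has the lower delay rate...
for city in alaska:
    a = delay_rate(*alaska[city])
    w = delay_rate(*america_west[city])
    print(f"{city:14s} Alaska {a:6.1%}   America West {w:6.1%}")

# ...but pooled over all cities, America West has the lower rate.
print(f"{'All cities':14s} Alaska {overall_rate(alaska):6.1%}   "
      f"America West {overall_rate(america_west):6.1%}")
```

The pooled rate is a weighted average of the city rates, and the two airlines weight the cities very differently: most America West flights leave from Phoenix, most Alaska Airlines flights from Seattle.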
41
Alaska Airlines vs. America West
We now know that America West has a better overall "On Time" record, but Alaska Airlines has a better "On Time" record at every airport. How can that be?

                      Alaska Airlines               America West
Departure Location    On Time        Delayed        On Time        Delayed
Los Angeles            497 (88.9%)    62 (11.1%)     694 (85.6%)   117 (14.4%)
Phoenix                221 (94.8%)    12 (5.2%)     4840 (92.1%)   415 (7.9%)
San Diego              212 (91.4%)    20 (8.6%)      383 (85.5%)    65 (14.5%)
San Francisco          503 (83.1%)   102 (16.9%)     320 (71.3%)   129 (28.7%)
Seattle               1841 (85.8%)   305 (14.2%)     201 (76.7%)    61 (23.3%)
TOTAL                 3274 (86.7%)   501 (13.3%)    6438 (89.1%)   787 (10.9%)

Look at the data: America West flies most often from sunny Phoenix, where there are few delays. Alaska Airlines flies most often from Seattle, where fog and rain cause frequent delays. What city we fly from has a major influence on the chance of a delay, so including the city data reverses our conclusion. This is an example of what we call Simpson's Paradox: almost all relationships between two variables are influenced by other background (lurking) variables.
42
End of Slides