Statistics I. Tamás Dusek Széchenyi István University 2016.


Historical meaning of statistics
The term "statistics" has many different shades of meaning. In the older, original sense of the word (18th-century meaning), statistics was used for any descriptive information about the state of society. By the 18th century, the term designated the systematic collection of demographic and economic data by states. Today it is also used for descriptive data that are quantitative in nature and numerical in form. In this sense statistics is a method of historical research: a description in numerical terms of historical events that happened in a definite period of time, with definite groups of people, in a definite geographical area.

Modern meaning of statistics
The previous meaning has nothing in common with the modern natural-science meaning. In the modern sense, statistics deals with mass phenomena and enables us to analyze systems with very large numbers of particles. In the natural sciences, statistics is a method of inductive research. For example, quantum mechanics deals with the fact that we do not know how a particle will behave in an individual instance, but we do know which patterns of behavior can occur and the proportions in which these patterns actually occur.

Modern meaning of statistics
Meaning I: Statistics is the mathematics of the collection, organization, and interpretation of numerical data, especially the analysis of population characteristics by inference from sampling. It covers the classification and interpretation of quantitative data in accordance with probability theory, the application of methods such as hypothesis testing to them, and the mathematical study of the theoretical nature of such distributions and tests.
Meaning II: Quantitative data on any subject.

Key events in the history of statistics
Year | Event | Person
1532 | First weekly data on deaths in London | Sir W. Petty
1539 | Start of data collection on baptisms, marriages, and deaths in France |
1608 | Beginning of parish registry in Sweden |
1662 | First published demographic study based on bills of mortality | J. Graunt
1693 | Publ. of An estimate of the degrees of mortality of mankind drawn from curious tables of the births and funerals at the city of Breslaw with an attempt to ascertain the price of annuities upon lives | E. Halley
1713 | Publ. of Ars Conjectandi | J. Bernoulli
1714 | Publ. of Libellus de Ratiociniis in Ludo Aleae | C. Huygens
1714 | Publ. of The Doctrine of Chances | A. De Moivre
1763 | Publ. of An essay towards solving a problem in the Doctrine of Chances | Rev. Bayes
1790 | First Census in the USA |
1809 | Publ. of Theoria Motus Corporum Coelestium | C.F. Gauss
1812 | Publ. of Théorie analytique des probabilités | P.S. Laplace
1834 | Establishment of the Statistical Society of London |
1839 | Establishment of the American Statistical Association (Boston) |
1869 | Establishment of the Central Statistical Office, Hungary |

Key events in the history of statistics (continued)
Year | Event | Person
1889 | Publ. of Natural Inheritance | F. Galton
1900 | Development of the chi^2 test | K. Pearson
1901 | Publ. of the first issue of Biometrika | F. Galton et al.
1903 | Development of Principal Component Analysis | K. Pearson
1908 | Publ. of The probable error of a mean | "Student"
1910 | Publ. of An introduction to the theory of statistics | G.U. Yule
1933 | Publ. of On the empirical determination of a distribution | A.N. Kolmogorov
1935 | Publ. of The Design of Experiments | R.A. Fisher
1936 | Publ. of Relations between two sets of variables | H. Hotelling
1972 | Publ. of Regression models and life tables | D.R. Cox
1972 | Publ. of Generalized linear models | J.A. Nelder and R.W.M. Wedderburn
1979 | Publ. of Bootstrap methods: another look at the jackknife | B. Efron

Uses of Statistics
Almost all fields of study benefit from the application of statistical methods: economics, sociology, genetics, insurance, biology, criminology, polling, retirement planning, the study of automobile fatality rates, and many more.
Statistics is objective; the interpretation of statistics is not entirely objective.

Statistics is the science of collecting, organizing, summarising, analysing, and making inferences from data.
Descriptive statistics: collecting, organizing, summarising, analysing, and presenting data.
Inferential statistics: making inferences, testing hypotheses, determining relationships, and making predictions.

A simple general taxonomy of statistical methods

Image of statistics in pop culture is often negative, based on misunderstandings, mistakes or jokes

Some famous antistatistician quotations
"I only believe in statistics that I doctored myself." (Churchill)
"I never believe in statistics if I didn't make them myself." (Churchill)
"There are three kinds of lies: lies, damned lies, and statistics." (origin is uncertain; attributed to Disraeli, but popularised by Mark Twain)
"Statistics is a precise and logical method for stating a half-truth inaccurately."
"It is proven that the celebration of birthdays is healthy. Statistics show that those people who celebrate the most birthdays become the oldest."

The statistician

Statistical biases

Basic Terms Population: A collection, or set, of individuals or objects or events whose properties are to be analyzed. Two kinds of populations: finite or infinite. Sample: A subset of the population.

Variable: A characteristic of each individual element of a population or sample.
Observational unit: The individual entity whose characteristics are measured.
Data (singular): The value of the variable associated with one element of a population or sample. This value may be a number, a word, or a symbol.
Data (plural): The set of values collected for the variable from each of the elements belonging to the sample.
Experiment: A planned activity whose results yield a set of data.
Parameter: A numerical value summarizing all the data of an entire population.
Statistic: A numerical value summarizing the sample data.

Example: A college dean is interested in learning about the average age of faculty. Identify the basic terms in this situation.
The observational unit is an individual faculty member.
The population is the age of all faculty members at the college.
A sample is any subset of that population. For example, we might select 10 faculty members and determine their ages.
The variable is the "age" of each faculty member.
One data value would be the age of a specific faculty member.
The data would be the set of values in the sample.
The experiment would be the method used to select the faculty members forming the sample and to determine the actual age of each faculty member in the sample.
The parameter of interest is the "average" age of all faculty at the college.
The statistic is the "average" age of all faculty in the sample.

Two kinds of variables: Qualitative, or Attribute, or Categorical, Variable: A variable that categorizes or describes an element of a population. Note: Arithmetic operations, such as addition and averaging, are not meaningful for data resulting from a qualitative variable. Quantitative, or Numerical, Variable: A variable that quantifies an element of a population. Note: Arithmetic operations such as addition and averaging, are meaningful for data resulting from a quantitative variable.

Example: Identify each of the following examples as attribute (qualitative) or numerical (quantitative) variables. 1. The residence hall for each student in a statistics class. (Attribute) 2. The amount of gasoline pumped by the next 10 customers at a MOL gasoline station. (Numerical) 3. The amount of radon in the basement of each of 25 homes in a new development. (Numerical) 4. The color of the baseball cap worn by each of 20 students. (Attribute) 5. The length of time to complete a mathematics homework assignment. (Numerical) 6. The state in which each truck is registered when stopped and inspected at a weigh station. (Attribute)

Qualitative and quantitative variables may be further subdivided
Variables
- Quantitative: discrete (counting), continuous (measurement)
- Qualitative: ordinal, categorical/attribute

Nominal Variable: A qualitative variable that categorizes (or describes, or names) an element of a population. Ordinal Variable: A qualitative variable that incorporates an ordered position, or ranking. Discrete Variable: A quantitative variable that can assume a countable number of values. Intuitively, a discrete variable can assume values corresponding to isolated points along a line interval. That is, there is a gap between any two values. Continuous Variable: A quantitative variable that can assume an uncountable number of values. Intuitively, a continuous variable can assume any value along a line interval, including every possible value between any two values.

Note:
1. In many cases, discrete and continuous variables can be distinguished by determining whether the variable is related to a count or a measurement.
2. Discrete variables are usually associated with counting. If the variable cannot be further subdivided, it is a clue that you are probably dealing with a discrete variable.
3. Continuous variables are usually associated with measurements. The values of continuous variables are limited only by your ability to measure them.
4. Continuous variables are often recorded as discrete variables.

Example
Discrete: the number of eggs that hens lay (for example, 3 eggs a day); the number of cars in a parking lot; the number of inhabitants of a town.
Continuous: the amount of milk that cows produce (for example, 8.343115 liters a day); the temperature; the age of a person.

Example: Identify each of the following as examples of qualitative or numerical variables:
1. The temperature in Győr, Hungary at 12:00 pm on any given day.
2. Whether or not a 6 volt lantern battery is defective.
3. The weight of a lead pencil.
4. The length of time billed for a long distance telephone call.
5. The brand of cereal children eat for breakfast.
6. The type of book taken out of the library by an adult.

Levels of measurement
1. Nominal
   1A. Coding
   1B. Qualitative data, categorical data (gender, nationality, ethnicity, language, genre, style, biological species)
2. Ordinal: rank order
3. Interval: degree of difference; however, zero is arbitrary
4. Ratio
   4A. Continuous quantity with true zero
   4B. Discrete quantity

Importance of the levels of measurement
Knowing the level of measurement helps you decide which statistical analysis is appropriate for the values that were assigned, and how to interpret the data from that variable.
Dangers to avoid:
- Attaching unwarranted significance to aspects of the numbers that do not convey meaningful information
- Failing to simplify data when we could easily do so
- Manipulating our data in ways that destroy information
- Performing meaningless statistical operations on the data

Nominal and ordinal measurement
Nominal measurement: not measurement in the everyday sense of the word; the value does not imply any ordering of the cases. Example: shirt numbers in football. Even though player 17 has a higher number than player 7, you cannot say from the data that he is greater than or more than the other.
Ordinal measurement: the attributes can be rank-ordered, but distances between attributes do not have any meaning. Example: the distance between the winner of a sports competition and the runner-up, and between the second and the third place.

The Hierarchy of Levels
Nominal: attributes are only named; weakest
Ordinal: attributes can be ordered
Interval: distance is meaningful
Ratio: absolute zero

Types of data
Nominal and ordinal are qualitative (categorical) levels of measurement. Interval and ratio are quantitative levels of measurement.
Quantitative variables:
- Ratio: pulse rate, height
- Interval: 36 °C to 38 °C
Qualitative variables:
- Ordinal: social class
- Nominal: gender, ethnicity

Example: Identify each of the following as examples of (1) nominal, (2) ordinal, (3) discrete, or (4) continuous variables:
1. The length of time until a pain reliever begins to work.
2. The number of chocolate chips in a cookie.
3. The number of colors used in a statistics textbook.
4. The brand of refrigerator in a home.
5. The overall satisfaction rating of a new car.
6. The number of files on a computer's hard disk.
7. The pH level of the water in a swimming pool.
8. The number of staples in a stapler.

Measure and Variability No matter what the response variable: there will always be variability in the data. One of the primary objectives of statistics: measuring and characterizing variability. Controlling (or reducing) variability in a manufacturing process: statistical process control.

Methods used to collect data Census: A 100% survey. Every element of the population is listed. Seldom used: difficult and time-consuming to compile, and expensive. Survey: Data are obtained by sampling some of the population of interest. The investigator does not modify the environment. Experiment: The investigator controls or modifies the environment and observes the effect on the variable under study. Administrative resources: The source of the data is an administrative activity. Other

Surveys
Surveys may be administered in a variety of ways, e.g. personal interview, telephone interview, self-administered questionnaire, and Internet.
Questionnaire design principles:
1. Keep the questionnaire as short as possible.
2. Ask short, simple, and clearly worded questions.
3. Start with demographic questions to help respondents get started comfortably.
4. Use dichotomous (yes/no) and multiple choice questions.
5. Use open-ended questions cautiously.
6. Avoid using leading questions.
7. Pretest a questionnaire on a small number of people.
8. Think about the way you intend to use the collected data when preparing the questionnaire.

Not everything that counts can be counted
Example: "5 happy kids", where 5 is the quantity and happy is the quality.

Univariate descriptive statistics
After collecting data, the first task is to organize and simplify the data so that it is possible to get a general overview of the results. This is the goal of descriptive statistical techniques. One method for simplifying and organizing data is to present them in a graphical way.

Graphical presentation Graphs and statistics are often used to persuade. Advertisers and others may accidentally or intentionally present information in a misleading way. For example, art is often used to make a graph more interesting, but it can distort the relationships in the data. Questions to Ask When Looking at Data and/or Graphs: Is the information presented correctly? Is the graph trying to influence you? Does the scale use a regular interval? What impression is the graph giving you?

Pie charts and bar graphs
Both are used for categorical variables.
Pie charts show the amount of data that belongs to each category as a proportional part of a circle.
Bar graphs show the amount of data that belongs to each category as proportionally sized rectangular areas.

Example: The table below lists the number of automobiles sold last week, by day, for a local dealership. Describe the data using a pie chart (circle graph) and a bar graph.
Day | Number Sold
Monday | 15
Tuesday | 23
Wednesday | 35
Thursday | 11
Friday | 12
Saturday | 42

[Figure: pie chart "Automobiles Sold Last Week"]

[Figure: bar graph "Automobiles Sold Last Week"; vertical axis: frequency]
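The numbers behind both charts for the dealership example can be sketched in a few lines of Python. This is only an illustration: the function name `pie_slices` and the text-based bars are assumptions made here, not part of the original slides.

```python
# Daily sales taken from the dealership table in the example.
sales = {"Monday": 15, "Tuesday": 23, "Wednesday": 35,
         "Thursday": 11, "Friday": 12, "Saturday": 42}


def pie_slices(counts):
    """Return {category: (share of total, slice angle in degrees)}."""
    total = sum(counts.values())
    return {name: (value / total, 360 * value / total)
            for name, value in counts.items()}


slices = pie_slices(sales)
for day, (share, angle) in slices.items():
    bar = "#" * sales[day]  # crude text bar: one '#' per car sold
    print(f"{day:<10}{sales[day]:>3} {share:6.1%} {angle:6.1f} deg  {bar}")
```

A pie chart uses the share (or angle) column, while a bar graph uses the raw counts; both present the same categorical data.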

Pareto Diagram
Pareto Diagram: A bar graph with the bars arranged from the most numerous category to the least numerous category. It includes a line graph displaying the cumulative percentages and counts for the bars.
- Used to identify the number and type of defects that happen within a product or service
- Separates the "vital few" from the "trivial many"
- Often used in quality control applications

Pareto diagram example
The final daily inspection defect report for a cabinet manufacturer is given in the table below:
Defect | Number
Dent | 5
Stain | 12
Blemish | 43
Chip | 25
Scratch | 40
Others | 10

2)The production line should try to eliminate blemishes and scratches. This would cut defects by more than 50%.
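The claim about blemishes and scratches can be checked by building the Pareto ordering and the cumulative percentages from the defect report. The sketch below is illustrative; the function name `pareto_table` is an assumption, and it sorts strictly by count (some Pareto charts instead keep "Others" as the last bar regardless of its size).

```python
# Defect counts from the cabinet manufacturer's inspection report.
defects = {"Dent": 5, "Stain": 12, "Blemish": 43,
           "Chip": 25, "Scratch": 40, "Others": 10}


def pareto_table(counts):
    """Sort categories by count (descending) and add cumulative figures.

    Each row is (category, count, cumulative count, cumulative percent).
    """
    total = sum(counts.values())
    rows, running = [], 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((name, count, running, 100 * running / total))
    return rows


rows = pareto_table(defects)
for name, count, cum_count, cum_pct in rows:
    print(f"{name:<8}{count:>4}{cum_count:>5}{cum_pct:>7.1f}%")
```

The first two rows (blemish and scratch) together account for 83 of the 135 defects, which is a little over 61%.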

Frequency distributions and histograms
Frequency distributions and histograms are used to summarize large data sets. They are used for quantitative variables.
Frequency Distribution: A listing, often expressed in chart form, that pairs each value of a variable with its frequency.
Ungrouped Frequency Distribution: Each value of x in the distribution stands alone.
Grouped Frequency Distribution: Group the values into a set of classes.
1. A table that summarizes data by classes, or class intervals
2. In a typical grouped frequency distribution, there are usually 5-12 classes of equal width
3. The table may contain columns for class number, class interval, tally (if constructing by hand), frequency, relative frequency, cumulative relative frequency, and class midpoint
4. In an ungrouped frequency distribution each class consists of a single value

Guidelines for constructing a frequency distribution
1. All classes should be of the same width. In the case of a very uneven distribution of the data, or outliers, class widths can differ.
2. Classes should be set up so that they do not overlap and so that each piece of data belongs to exactly one class.
3. For problems in the text, 5-12 classes are most desirable. The square root of n is a reasonable guideline for the number of classes if n is less than 150.
4. Use a system that takes advantage of a number pattern, to guarantee accuracy.
5. If possible, an even class width is often advantageous.

Histogram Histogram: A bar graph representing a frequency distribution of a quantitative variable. A histogram is made up of the following components: 1.A title, which identifies the population of interest 2.A vertical scale, which identifies the frequencies in the various classes 3.A horizontal scale, which identifies the variable x. Values for the class boundaries or class midpoints may be labeled along the x- axis. Use whichever method of labeling the axis best presents the variable. Notes: The relative frequency is sometimes used on the vertical scale It is possible to create a histogram based on class midpoints

Example: A recent survey of Roman Catholic nuns summarized their ages in the table below.
Age | Frequency | Class Midpoint
20 up to 30 | 34 | 25
30 up to 40 | 58 | 35
40 up to 50 | 76 | 45
50 up to 60 | 187 | 55
60 up to 70 | 254 | 65
70 up to 80 | 241 | 75
80 up to 90 | 147 | 85

[Figure: histogram "Roman Catholic Nuns"; horizontal axis: age (class midpoints 25 to 85); vertical axis: frequency (0 to 200)]
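The remaining columns of a grouped frequency distribution (class midpoint, relative frequency, cumulative relative frequency) follow directly from the class limits and frequencies. A minimal sketch for the nuns' age table, with variable names chosen here purely for illustration:

```python
# (lower limit, upper limit, frequency) for each age class in the survey table.
classes = [(20, 30, 34), (30, 40, 58), (40, 50, 76), (50, 60, 187),
           (60, 70, 254), (70, 80, 241), (80, 90, 147)]

n = sum(freq for _, _, freq in classes)  # total number of observations

table = []
cumulative = 0
for lower, upper, freq in classes:
    cumulative += freq
    table.append({
        "class": f"{lower} up to {upper}",
        "midpoint": (lower + upper) / 2,   # centre of the class interval
        "freq": freq,
        "rel": freq / n,                   # relative frequency
        "cum_rel": cumulative / n,         # cumulative relative frequency
    })

# The modal class is the class with the greatest frequency.
modal_class = max(table, key=lambda row: row["freq"])

for row in table:
    print(f'{row["class"]:<12}{row["midpoint"]:>6.0f}{row["freq"]:>5}'
          f'{row["rel"]:>8.3f}{row["cum_rel"]:>8.3f}')
print("Modal class:", modal_class["class"])
```

The frequency column gives the bar heights of the histogram; the relative-frequency column could be used on the vertical scale instead.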

Special histogram: age pyramids

Terms Used to Describe Histograms Symmetrical: Both sides of the distribution are identical mirror images. There is a line of symmetry. Uniform (Rectangular): Every value appears with equal frequency Skewed: One tail is stretched out longer than the other. The direction of skewness is on the side of the longer tail. (Positively skewed vs. negatively skewed) J-Shaped: There is no tail on the side of the class with the highest frequency Bimodal: The two largest classes are separated by one or more classes. Often implies two populations are sampled. Normal: A symmetrical distribution is mounded about the mean and becomes sparse at the extremes

The mode is the value that occurs with greatest frequency The modal class is the class with the greatest frequency A bimodal distribution has two high-frequency classes separated by classes with lower frequencies Graphical representations of data should include a descriptive, meaningful title and proper identification of the vertical and horizontal scales

Ogive: A line graph of a cumulative frequency or cumulative relative frequency distribution. An ogive has the following components:
1. A title, which identifies the population or sample
2. A vertical scale, which identifies either the cumulative frequencies or the cumulative relative frequencies
3. A horizontal scale, which identifies the upper class boundaries. Until the upper boundary of a class has been reached, you cannot be sure you have accumulated all the data in the class. Therefore, the horizontal scale for an ogive is always based on the upper class boundaries.
Note: Every ogive starts on the left with a relative frequency of zero at the lower class boundary of the first class and ends on the right with a relative frequency of 100% at the upper class boundary of the last class.

This graph is an ogive using cumulative relative frequencies:
[Figure: ogive; horizontal axis: test score (0 to 28); vertical axis: cumulative relative frequency (0.0 to 1.0)]
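The points of an ogive can be computed from any grouped frequency distribution. Since the test-score data behind the figure are not given, the sketch below reuses the nuns' age classes from the earlier example; the function name `ogive_points` is an assumption made for illustration.

```python
def ogive_points(first_lower_boundary, classes):
    """Return the (x, cumulative relative frequency) points of an ogive.

    `classes` is a list of (upper class boundary, frequency) pairs in
    increasing order. The curve starts at zero at the lower boundary of
    the first class and reaches 1.0 at the upper boundary of the last.
    """
    total = sum(freq for _, freq in classes)
    points = [(first_lower_boundary, 0.0)]
    running = 0
    for upper, freq in classes:
        running += freq
        points.append((upper, running / total))
    return points


# Upper class boundaries and frequencies from the nuns' age table.
age_classes = [(30, 34), (40, 58), (50, 76), (60, 187),
               (70, 254), (80, 241), (90, 147)]
points = ogive_points(20, age_classes)
for x, cum_rel in points:
    print(f"{x:>3} {cum_rel:.3f}")
```

Plotting these points and joining them with straight lines gives the ogive; each point sits at an upper class boundary, as the components listed above require.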

Factors that make a graph misleading Y-axis scale is too big or too small Y-axis skips numbers, or does not start at zero X-axis scale is too big or too small X-axis skips numbers, or does not start at zero Axes are not labeled Data is left out Exaggerated area or volume

Misleading graphs
The title tells the reader what to think (that there are huge increases in price). The actual increase in price is 2,000 pounds, which is less than a 3% increase. The graph shows the second bar as being 3 times the size of the first bar, which suggests that the price has tripled. The scale moves from 0 to 80,000 in the same amount of space as from 80,000 to 81,000.

A more accurate graph An unbiased title A scale with a regular interval. This shows a more accurate picture of the increase.

Scaling
Because the scale leaves out 0 to 100 (in the school play ticket sales example), the bar heights make it appear that the sixth grade sold about three times as many tickets as either of the other two grades. In fact, the sixth grade sold only about 20% more.
[Figure: bar graph "Preferred Juice Flavors" (grape, cherry, apple) with a vertical scale running from 140 to 150]

From CNN.com
The difference between Democrats and Republicans (and between Democrats and Independents) is 8 percentage points (62 - 54). Since the margin of error is 7 percentage points, the real difference may be even smaller. The graph implies that the Democrats were 8 times more likely to agree with the decision. In truth, they were only slightly more likely to agree with the decision. The graph does not accurately demonstrate that a majority of all groups interviewed agreed with the decision.

Correct versus incorrect graph

While retail sales do go down in April 2002, the title doesn't accurately reflect what the rest of the graph shows. Yes, the sales do rise and fall over a period of a year and a half, but in general they have been steadily rising since November 1998. The original graph seems to be trying to convince us that April sales have very obviously fallen, whereas these two graphs tell us the opposite. The title for the third graph has been changed completely to give the opposite message.

The scale does not have a regular interval.

The scale is so compressed that it’s hard to see any difference among the brands.

Irregular scale axes 1993, 1996 and 1998 are missing.

Exaggerated use of Area or Volume
[Figure: two versions of the chart "Number of Singles Sold", 1995 to 1998]
The brown column looks bigger than the purple column.

Exaggerated use of Area or Volume
Sales at Gerry’s Milkbar have doubled from 2014 to 2015.
[Figure: pictogram comparing 2014 and 2015]
The 2015 volume is eight times bigger than the 2014 volume.
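The factor of eight follows from scaling every dimension of the picture, not just its height. As a short worked equation, assuming (for illustration only) that the 2014 object has width w, height h and depth d, and that all three are doubled for 2015:

```latex
V_{2014} = w\,h\,d, \qquad
V_{2015} = (2w)(2h)(2d) = 2^{3}\,w\,h\,d = 8\,V_{2014}
```

A doubling of sales is therefore drawn as an eightfold increase in apparent volume. Scaling only the height would keep the visual ratio at 2.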

Exaggerated use of Volume
The new iPad battery gained 70% in capacity. The graphic shows this by making the battery on the right 70% taller than the battery on the left.

The perspective puts the 1979 barrel at the forefront and the 1973 barrel at the back. This draws the reader’s eyes to the 1979 barrel first and then forces them to read the rest of the years in descending order. Supporting this deceptive tactic is the fact that only the foremost barrels show the complete year; the rest are indicated with only the last two digits, as in ‘76. The makers of the graph intend the audience to read in reverse chronological order, which has the effect of making oil prices seem to fall. Secondly, the perspective makes it hard to judge the numerical difference between barrels. For example, even though the 1975 barrel appears to be over two thirds the height of the 1976 barrel, in reality the difference between them is only $0.95. Another misleading aspect is that this pictograph contains no scale or axis of any kind. Without one, the reader’s attention might be directed to the area of each barrel instead. The way in which the barrels are labeled also seems somewhat awkward. Shouldn’t the prices, rather than the years, be on the barrels? Prices written on the barrels would clarify that it is the cost that is changing, not the years. And with more space to indicate the years, readers would not be forced to read in reverse.

Pie chart should add up to 100%

Extremely bad pie chart

Pseudo-pie chart
What do these colors mean? Why is it divided into quadrants?

Misleading scaling of two y-axes

Problems:
- Only shows five numbers
- Y-axis is broken twice
- The top section is the inverse of the bottom
- Three dimensions for no reason

Problems:
- Missing y-axis
- The points don’t follow a straight line
- The four points are not equidistant in time

There are only two distinct age categories, grid lines are unnecessary

Area is independent from the represented numbers

Meaningless map due to the lack of differentiation

Absolute versus relative magnitudes

Measures of central tendency
MEAN: Average or arithmetic mean of the data
MEDIAN: The value which comes halfway when the data are ranked in order
MODE: Most common value observed

Mean ($\mu$ or $\bar{x}$)
The arithmetic average (add all of the scores together, then divide by the number of scores)
$\mu = \frac{\sum x}{n}$

Note: The mean can be greatly influenced by outliers

Median
The middle number (just like the median strip that divides a highway down the middle; 50/50)
To find the median:
1. Rank the data
2. Determine the depth of the median: depth = (n + 1)/2
3. Determine the value of the median
Used when data are not normally distributed. We often hear about the median price of housing.
Example: Find the median for the set of data: {4, 8, 3, 8, 2, 9, 2, 11, 3}
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
2. Find the depth: (9 + 1)/2 = 5
3. The median is the fifth number from either end in the ranked data: 4
If n is odd, the median is the middle value; otherwise, the median is the mean of the two middle values.
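The three steps translate directly into code. The following is a minimal sketch (the function name `median_by_depth` is chosen here for illustration) that ranks the data, computes the depth, and handles both odd and even n:

```python
def median_by_depth(values):
    """Median via the rank-and-depth procedure described above."""
    ranked = sorted(values)              # step 1: rank the data
    n = len(ranked)
    depth = (n + 1) / 2                  # step 2: depth of the median
    if n % 2 == 1:
        # Odd n: the depth is a whole number; take that position (1-based).
        return ranked[int(depth) - 1]
    # Even n: the depth falls halfway between the two middle positions.
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2


data = [4, 8, 3, 8, 2, 9, 2, 11, 3]
print(median_by_depth(data))             # fifth value of the ranked data
```

Python's standard library offers the same calculation as `statistics.median`; the hand-written version is shown only to mirror the steps on the slide.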

Mean versus median
Mean: interval data with an approximately symmetric distribution
Median: interval data, ordinal data
The mean is sensitive to outliers; the median is not.

Mode
Mode: The mode is the value of x that occurs most frequently.
Note: If two or more values in a sample are tied for the highest frequency (number of occurrences), there is no mode.
The mode can be the minimum or maximum value.
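The tie rule stated in the note can be made explicit in code. This sketch follows the slide's convention of reporting no mode when the highest frequency is shared; other textbooks (and Python's `statistics.multimode`) instead report every tied value. The function name `mode_or_none` is an assumption for illustration.

```python
from collections import Counter


def mode_or_none(values):
    """Return the single most frequent value, or None if there is a tie."""
    counts = Counter(values).most_common()   # (value, frequency), most frequent first
    if not counts:
        return None
    top_frequency = counts[0][1]
    tied = [value for value, freq in counts if freq == top_frequency]
    return tied[0] if len(tied) == 1 else None


print(mode_or_none([1, 2, 2, 3]))        # one value is clearly most frequent
print(mode_or_none([1, 1, 2, 2, 3]))     # tie for the highest frequency
print(mode_or_none([5, 5, 5, 7, 9]))     # the mode can be the minimum value
```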

Potential Problem with Means
The mean is sensitive to outliers; the median and mode are not. The mode can be more "typical" than the mean.
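A small numerical experiment shows the effect. The values below are hypothetical and chosen only for illustration; adding one extreme value moves the mean a long way while the median barely changes.

```python
from statistics import mean, median

# Hypothetical data set (illustrative values, not from the slides).
values = [30, 32, 35, 36, 38]
with_outlier = values + [400]            # add a single extreme value

print(mean(values), median(values))
print(mean(with_outlier), median(with_outlier))
```

Here the mean jumps from 34.2 to about 95.2, a value larger than five of the six observations, while the median only moves from 35 to 35.5.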

Mean, Median, or Mode?
Mean
- If the sum of all values is meaningful
- Incorporates all available information
Median
- Intuitive sense of central tendency with outliers
- What is "typical" of a set of values?
Mode
- When data can be grouped into distinct types or categories (categorical data)

In a normal distribution, the mean and the median are the same. If the median and the mean are different, this indicates that the data are not normally distributed.

Arithmetic and geometric means
The Arithmetic Mean
- Is the sum of the observations divided by the total number of observations: $(a_1 + \dots + a_N)/N$
- Used when several quantities are added together to produce a total
- It is the midpoint of the added numbers if those numbers are stretched out on a line
The Geometric Mean
- Is the Nth root of the product of the observations: $(a_1 \cdot \ldots \cdot a_N)^{1/N}$
- Can also be calculated by taking the antilog of the arithmetic mean of the logarithms of the observations
- Used when several quantities are multiplied together to give a product; it is the average of the factors that contribute to the product
- Always less than or equal to the arithmetic mean (equal only when all the components of the set are equal)

Example of the use of geometric mean
If we had an investment that returned 10% the first year, 60% the second, and 20% the third, what is the average rate of return? (Not 30%!)
To calculate this, remember that returns of 10, 60, and 20 percent are the same as multiplying the investment by 1.10, 1.60, and 1.20. To get the geometric mean, calculate:
$(1.10 \times 1.60 \times 1.20)^{1/3} \approx 1.283$
That is, an average return of 28.3% (not 30%!).
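The investment example can be reproduced numerically. This is a minimal sketch, with the function name `geometric_mean` chosen for illustration; it also shows the equivalent route through logarithms (exponentiating the arithmetic mean of the logs).

```python
import math


def geometric_mean(factors):
    """Nth root of the product of N positive factors."""
    return math.prod(factors) ** (1 / len(factors))


growth = [1.10, 1.60, 1.20]              # yearly growth factors from the example
g = geometric_mean(growth)
average_return = g - 1                   # roughly 0.283, i.e. 28.3% per year

# Same result via logarithms: antilog of the mean of the logs.
g_via_logs = math.exp(sum(math.log(x) for x in growth) / len(growth))

print(f"{g:.3f} {average_return:.1%} {g_via_logs:.3f}")
```

Compounding the average factor for three years reproduces the actual three-year growth of 1.10 × 1.60 × 1.20 = 2.112, which the arithmetic average of 30% would overstate (1.30³ ≈ 2.197).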

Harmonic mean
The harmonic mean is obtained by taking the number of terms (n) in a set and dividing it by the sum of the terms’ reciprocals:
$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$

Example of the use of Harmonic mean
Suppose you spend 600 Ft on pills costing 30 Ft per dozen, and 600 Ft on pills costing 20 Ft per dozen. What was the average price of the pills you bought?
You spent 1200 Ft on 50 dozen pills (20 dozen plus 30 dozen), so the average cost is 1200/50 = 24 Ft per dozen. This also happens to be the harmonic mean of 20 and 30:
$H = \frac{2}{\frac{1}{20} + \frac{1}{30}} = 24$

For positive values, the arithmetic, geometric, and harmonic means are related in the following way: the arithmetic mean > the geometric mean > the harmonic mean, unless the terms of the set are all equal, in which case the harmonic, arithmetic, and geometric means are all the same.
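The ordering of the three means can be checked on the pill prices from the harmonic mean example. The helper names below are illustrative assumptions; the sketch assumes all values are positive.

```python
import math


def arithmetic_mean(values):
    return sum(values) / len(values)


def geometric_mean(values):
    return math.prod(values) ** (1 / len(values))


def harmonic_mean(values):
    """Number of terms divided by the sum of their reciprocals."""
    return len(values) / sum(1 / v for v in values)


prices = [20, 30]                        # Ft per dozen, from the pill example
a = arithmetic_mean(prices)
g = geometric_mean(prices)
h = harmonic_mean(prices)
print(f"arithmetic={a:.2f} geometric={g:.2f} harmonic={h:.2f}")

equal = [7, 7, 7]                        # all terms equal: the three means coincide
print(arithmetic_mean(equal), geometric_mean(equal), harmonic_mean(equal))
```

For the prices 20 and 30 the three means are 25, about 24.49, and 24; floating-point rounding may make the equal-terms case differ in the last decimal places.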

Measures of position
Measures of position are used to describe the relative location of an observation.
Quartiles and percentiles are two of the most popular measures of position.
An additional measure of central tendency, the midquartile, is defined using quartiles.
Quartiles are part of the 5-number summary.

Quartiles: Values of the variable that divide the ranked data (in increasing order) into quarters; each set of data has three quartiles.
1. The first quartile, Q1, is a number such that at most 25% of the data are smaller in value than Q1 and at most 75% are larger
2. The second quartile, Q2, is the median
3. The third quartile, Q3, is a number such that at most 75% of the data are smaller in value than Q3 and at most 25% are larger

Box-and-Whisker Display
Box-and-Whisker Display: A graphic representation of the 5-number summary.
- The five numerical values (smallest, first quartile, median, third quartile, and largest) are located on a scale, either vertical or horizontal
- The box is used to depict the middle half of the data that lies between the two quartiles
- The whiskers are line segments used to depict the other half of the data
- One line segment represents the quarter of the data that is smaller in value than the first quartile
- The second line segment represents the quarter of the data that is larger in value than the third quartile

Importance: it helps to interpret and represent data. It gives a visual representation of data.
Data set: 85, 92, 78, 88, 90, 88, 89
Ranked: 78, 85, 88, 88, 89, 90, 92
Lower quartile: 85; median: 88; upper quartile: 90
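The five values needed for the box-and-whisker display of this data set can be computed as follows. The sketch takes each quartile as the median of the lower or upper half of the ranked data (excluding the overall median when n is odd); this is one common convention, assumed here for illustration, and other quartile conventions can give slightly different values on other data sets.

```python
def median_of_ranked(ranked):
    """Median of an already sorted, non-empty list."""
    n = len(ranked)
    mid = n // 2
    return ranked[mid] if n % 2 else (ranked[mid - 1] + ranked[mid]) / 2


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    ranked = sorted(values)
    n = len(ranked)
    lower_half = ranked[: n // 2]        # values below the median position
    upper_half = ranked[(n + 1) // 2:]   # values above the median position
    return (ranked[0], median_of_ranked(lower_half), median_of_ranked(ranked),
            median_of_ranked(upper_half), ranked[-1])


data = [85, 92, 78, 88, 90, 88, 89]
smallest, q1, med, q3, largest = five_number_summary(data)
print(smallest, q1, med, q3, largest)
print("Interquartile range:", q3 - q1)
```

The box spans Q1 to Q3 with a line at the median, and the whiskers run out to the smallest and largest values.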

Measures of the shape of data
The shape of data is measured by skewness and kurtosis.
The first four central moments:
- The first central moment, r=1, is the sum of the differences of each observation from the sample average (arithmetic mean), which always equals 0.
- The second central moment, r=2, is the variance.
- The third central moment, r=3, measures skewness. Skewness describes how the sample differs in shape from a symmetrical distribution. A normal distribution has a skewness of 0; a right-skewed distribution has a skewness greater than 0 and a left-skewed distribution has a skewness less than 0.

Skewness
Negatively skewed distributions, skewed to the left, occur when most of the scores are toward the high end of the distribution. In a normal distribution, where skewness is 0, the mean, median and mode are equal. In a negatively skewed distribution, mode > median > mean.
Positively skewed distributions occur when most of the scores are toward the low end of the distribution. In a positively skewed distribution, mode < median < mean.

Kurtosis
Kurtosis is based on the 4th central moment. It describes the "peakedness" of a distribution: it measures the extent to which the data are distributed in the tails versus the center of the distribution.
There are three types of peakedness:
- Leptokurtic: very peaked, kurtosis positive
- Platykurtic: relatively flat, kurtosis negative
- Mesokurtic: in between, kurtosis 0
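Skewness and kurtosis can be computed from the central moments. The sketch below uses the simple moment-based (population) definitions, standardising the third and fourth central moments by powers of the second and subtracting 3 so that a normal distribution has kurtosis 0; this choice of formula is an assumption, and statistical packages often apply additional small-sample corrections. The data sets are hypothetical.

```python
def central_moment(values, r):
    """r-th central moment: mean of (x - mean) ** r."""
    n = len(values)
    m = sum(values) / n
    return sum((x - m) ** r for x in values) / n


def skewness(values):
    """Third central moment divided by the 1.5th power of the second."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Fourth central moment over the squared second, minus 3."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3


symmetric = [1, 2, 3, 4, 5]              # hypothetical, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]       # hypothetical, long right tail

print(central_moment(symmetric, 1))      # first central moment is always 0
print(skewness(symmetric), skewness(right_skewed))
print(excess_kurtosis(symmetric))        # negative: flatter than a normal curve
```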

Measures of dispersion
Measures of central tendency alone cannot completely characterize a set of data. Two very different data sets may have similar measures of central tendency.
Measures of dispersion are used to describe the spread, or variability, of a distribution.
Common measures of dispersion: range, variance, and standard deviation.
Range: The difference in value between the highest-valued (H) and the lowest-valued (L) pieces of data: H - L
The interquartile range is the difference between the first and third quartiles. It is the range of the middle 50% of the data.

Same means, but very different distributions (figure: distributions with a common mean)

We need to come up with some way of measuring not just the average, but also the spread of the distribution of our data. The Standard Deviation is a number that measures how far, on average, the numbers in a set of data are from their mean. If the Standard Deviation is large, the numbers are spread out from their mean. If the Standard Deviation is small, the numbers are close to their mean.

Standard deviation Calculating the standard deviation.
1. Find the mean of the data.
2. Subtract the mean from each value.
3. Square each deviation from the mean.
4. Find the sum of the squares.
5. Divide the total by the number of items (for a whole population) or by the number of items minus one (for a sample) – this is the variance.
6. Take the square root of the variance.

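This is the Standard Deviation
Value   Distance from mean   Distance squared
72      −9.5                 90.25
76      −5.5                 30.25
80      −1.5                 2.25
80      −1.5                 2.25
81      −0.5                 0.25
83      1.5                  2.25
84      2.5                  6.25
85      3.5                  12.25
85      3.5                  12.25
89      7.5                  56.25
Mean: 81.5. Sum of squared distances: 214.5. Variance: 214.5 / (10 − 1) ≈ 23.8. Standard deviation: √23.8 ≈ 4.88.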
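The six steps can be sketched in a few lines of Python. The data are an assumption: ten observations in which 80 and 85 each occur twice, which is the set consistent with the mean of 81.5, the sum of squares of 214.5 and the divisor (10 − 1) shown on the slide.

```python
import math

# Assumption: ten observations; 80 and 85 are taken twice so that the mean is 81.5
# and the squared deviations sum to 214.5, as on the slide.
values = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]

mean = sum(values) / len(values)                       # step 1: mean
deviations = [x - mean for x in values]                # step 2: subtract the mean
squared = [d ** 2 for d in deviations]                 # step 3: square each deviation
sum_of_squares = sum(squared)                          # step 4: sum of the squares
sample_variance = sum_of_squares / (len(values) - 1)   # step 5: divide by n - 1 (sample)
sample_sd = math.sqrt(sample_variance)                 # step 6: square root

print(mean, sum_of_squares, round(sample_variance, 1), round(sample_sd, 2))
```

Dividing by n instead of n − 1 in step 5 would give the population variance, which is slightly smaller.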

Coefficient of Variation Coefficient of variation (CV) measures the spread of a set of data as a proportion of its mean. It is the ratio of the sample standard deviation to the sample mean. It is sometimes expressed as a percentage. It is a dimensionless number that can be used to compare the amount of variability between populations with different means.
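A short sketch of the coefficient of variation, using the box-plot data set from earlier in the slides. The function name is illustrative. Rescaling the data (for example, changing units) leaves the CV unchanged, which is what makes it dimensionless.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation divided by the sample mean."""
    return statistics.stdev(values) / statistics.mean(values)

data = [85, 92, 78, 88, 90, 88, 89]
rescaled = [10 * x for x in data]   # same data expressed in a different unit

cv = coefficient_of_variation(data)
cv_rescaled = coefficient_of_variation(rescaled)

print(f"CV = {100 * cv:.1f}%")
print(f"CV after rescaling = {100 * cv_rescaled:.1f}%")
```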

Moments of the Distribution - Summary Statistics that describe the shape of the distribution, using formulae that are similar to those of the mean and variance
1st moment - Mean (describes central value)
2nd moment - Variance (describes dispersion)
3rd moment - Skewness (describes asymmetry)
4th moment - Kurtosis (describes peakedness)

Inter-quartile range (figure): the range between the 25th centile and the 75th centile, which contains the median (50th centile). The 2.5th and 97.5th centiles are also marked.

STANDARD DEVIATION – MEASURE OF THE SPREAD OF VALUES OF A SAMPLE AROUND THE MEAN. SD decreases with a smaller spread of values about the mean; a larger number of values reduces the standard error of the mean rather than the SD itself. IN A NORMAL DISTRIBUTION, ABOUT 95% OF THE VALUES WILL LIE WITHIN 2 SDs OF THE MEAN. THE SQUARE OF THE SD IS KNOWN AS THE VARIANCE.
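The 95% rule can be checked numerically. This is a minimal sketch using `NormalDist` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

# Share of a normal distribution lying within 2 SDs of the mean (about 95.4%).
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

# Share lying within 1.96 SDs of the mean (about 95.0%).
within_1_96_sd = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(round(within_2_sd, 4), round(within_1_96_sd, 4))
```

The "2 SDs" on the slide is therefore a convenient rounding of 1.96.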

NORMAL DISTRIBUTION: CASES DISTRIBUTED SYMMETRICALLY ABOUT THE MEAN. THE EXTENT OF THE ‘SPREAD’ OF DATA AROUND THE MEAN IS MEASURED BY THE STANDARD DEVIATION.

SKEWED DISTRIBUTION (FIGURE WITH THE MEAN AND THE MEDIAN MARKED). MEDIAN – 50% OF VALUES WILL LIE ON EITHER SIDE OF THE MEDIAN.

I’m so confused!!

Distributions, examples
Normal distribution: height, weight, haemoglobin
Skewed distribution: bankers’ bonuses, number of marriages

Bivariate data Bivariate Data: Consists of the values of two different response variables that are obtained from the same population of interest. Four combinations of variable types:
1. Both variables are qualitative (attribute).
2. One variable is qualitative (attribute) and the other is quantitative (numerical).
3. Both variables are ordinal.
4. Both variables are quantitative (both numerical).

Dependent or independent variables? Basic question: can the state of one variable be predicted from the state of another variable? If not, they are independent. If partly, the connection is stochastic. If perfectly, they are dependent.

Two Qualitative Variables When bivariate data results from two qualitative (attribute or categorical) variables, the data is often arranged on a cross-tabulation or contingency table. Example: A survey was conducted to investigate the relationship between preferences for television, radio, or newspaper for national news, and gender. The results are given in the table below.

This table may be extended to display the marginal totals (or marginals). The total of the marginal totals is the grand total. Contingency tables often show percentages (relative frequencies). These percentages are based on the entire sample or on the subsample (row or column) classifications.

Percentages based on the grand total (entire sample): The previous contingency table may be converted to percentages of the grand total by dividing each frequency by the grand total and multiplying by 100. For example, 175 becomes 13.3%

These same statistics (numerical values describing sample results) can be shown in a (side-by-side) bar graph.

Percentages based on row (column) totals: The entries in a contingency table may also be expressed as percentages of the row (column) totals by dividing each row (column) entry by that row’s (column’s) total and multiplying by 100. The entries in the contingency table below are expressed as percentages of the column totals.

Measure of association Chi-square is a test of independence between two variables. Typically, one is interested in knowing whether an independent variable (x) “has some effect” on a dependent variable (y). Said another way, we want to know if y is independent of x (e.g., if it goes its own way regardless of what happens to x). Thus, we might ask, “Is church attendance independent of the sex of the respondent?”

Fisher’s Exact Test: just for 2 x 2 tables; useful where the chi-square test is inappropriate; gives the exact probability of all tables with the same marginal totals that are as deviant as, or more deviant than, the observed table.

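For a 2 x 2 table with cells
a b
c d
and N = a + b + c + d, the probability of the table is
$$P = \frac{(a+b)!\,(a+c)!\,(b+d)!\,(c+d)!}{N!\,a!\,b!\,c!\,d!}$$
Example table:
4 1
1 5
$$P = \frac{5!\,5!\,6!\,6!}{11!\,4!\,1!\,1!\,5!} = \frac{5 \cdot 6!\,6!}{11!} = \frac{5 \cdot 6!}{11 \cdot 10 \cdot 9 \cdot 8 \cdot 7} = \frac{3600}{55440} \approx .065$$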
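The calculation above can be sketched in Python. The function name is illustrative. As an addition to the slide, the sketch also adds the probability of the one table that is more extreme in the same direction, which gives a one-sided exact p-value.

```python
from math import factorial

def table_probability(a, b, c, d):
    """Probability of one 2x2 table, given fixed marginal totals."""
    n = a + b + c + d
    numerator = (factorial(a + b) * factorial(a + c)
                 * factorial(b + d) * factorial(c + d))
    denominator = (factorial(n) * factorial(a) * factorial(b)
                   * factorial(c) * factorial(d))
    return numerator / denominator

# Observed table from the slide: a=4, b=1, c=1, d=5.
p_observed = table_probability(4, 1, 1, 5)

# The only more extreme table with the same marginal totals: a=5, b=0, c=0, d=6.
p_more_extreme = table_probability(5, 0, 0, 6)

p_one_sided = p_observed + p_more_extreme
print(round(p_observed, 3), round(p_one_sided, 3))
```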

The chi-squared test is an extremely simple test of relationships between categories.
–In chi-squared tests, we ask “Does the distribution of one variable depend on the categories for the other variable?”
–This sort of question requires only nominal-scaled data.
We are usually interested in more informative tests of relationships between categories.
–In such tests, we ask “As we increase the level of one variable, how do we change the level of another?”
–“The more of X, the more of Y”

Chi-square Statistic: an aggregate measure (i.e., based on the entire table). The greater the deviation from expected values, the larger the chi-square statistic; because deviations are squared, it grows quadratically rather than linearly. One could devise other measures that would place less emphasis on large deviations, e.g. Σ|o−e|/e.

Scenario 1: Consider these data on sex of the subject and church attendance:
Church Attendance
Sex      Yes   No   Total
Male      28   12      40
Female    42   18      60
Total     70   30     100

–Note that: 70% of all persons attend church. 70% of men attend church. 70% of women attend church. –Thus, we can say that church attendance is independent of the sex of the respondent: if 70% of all respondents attend church, then, with independence, we expect 70% of men and 70% of women to attend church, and they do.
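The independence check for Scenario 1 can be sketched as follows. The variable names are illustrative. The expected count of a cell under independence is its row total times its column total, divided by the grand total.

```python
# Scenario 1: rows are Male, Female; columns are Yes, No.
observed = [[28, 12],
            [42, 18]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Percentage of each sex attending church (row percentages of the "Yes" column).
row_pct_yes = [100 * row[0] / total for row, total in zip(observed, row_totals)]

# Expected counts under independence: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

print(row_pct_yes)
print(expected)
```

Here the expected counts coincide with the observed counts, which is what independence looks like in a table.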

Scenario 2: Now, suppose we observed this pattern of church attendance:
Church Attendance
Sex      Yes   No   Total
Male      20   20      40
Female    50   10      60
Total     70   30     100
50% of the men attend church and 83.3% of the women attend church.

Observed count minus expected count in each cell:
Church Attendance
Sex      Yes             No
Male     20 − 28 = −8    20 − 12 = 8
Female   50 − 42 = 8     10 − 18 = −8
In each cell, if we assume independence, we make a mistake equal to 8 (sometimes positive and sometimes negative). If we add all of our mistakes, we obtain a sum of zero, which would wrongly suggest that there is no error at all. So we square each mistake to make every number positive.

Proportionate error is calculated for each cell:
Church Attendance
Sex      Yes                 No
Male     (−8)² / 28 = 2.29   (8)² / 12 = 5.33
Female   (8)² / 42 = 1.52    (−8)² / 18 = 3.56
The total of all proportionate errors = 12.70. This is the chi-square value for this table. The chi-square value of 12.70 gives us a number that summarizes our proportionate amount of mistakes for the whole table.
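The whole calculation for Scenario 2 can be sketched in Python. The variable names are illustrative, and no continuity correction is applied, matching the hand calculation.

```python
# Scenario 2: rows are Male, Female; columns are Yes, No.
observed = [[20, 20],
            [50, 10]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected counts under independence.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Proportionate error per cell: (observed - expected)^2 / expected.
cell_errors = [[(o - e) ** 2 / e for o, e in zip(obs_row, exp_row)]
               for obs_row, exp_row in zip(observed, expected)]

chi_square = sum(sum(row) for row in cell_errors)
print(round(chi_square, 2))
```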

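Calculation of chi-square: the expected count of a cell is (row total × column total) / N, as in the slide's fragment (43 × 24) / 91; each cell contributes (observed − expected)² / expected, as in the fragment (7 − 11.8)² / 11.8; χ² = Σ (o − e)² / e over all cells. The slide also shows the value .025.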

Example for Association: Biblical Literalism and Education. Is the Bible the word of God or of men? (NES 2000) Chi-sq = 105.4 at 4 df → p < .001 (reported as .000) → reject the null hypothesis

Chi-square is basically a measure of significance. It is not a good measure of strength of association: it can help you decide if a relationship exists, but not how strong it is.

Cramér’s V: also a measure of strength of association; an attempt to standardize phi-square, φ² = χ²/N (i.e., to control the lack of an upper boundary in tables larger than 2x2 cells).
$$V = \sqrt{\varphi^2 / m}$$
where m = min(r−1, c−1), i.e., the smaller of rows−1 or columns−1. Limits: 0–1 for any size table; 1 = highest possible association.
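As a sketch, Cramér's V for the Scenario 2 church-attendance table (chi-square 12.70, N = 100, 2 rows and 2 columns) could be computed like this. The function name is illustrative.

```python
from math import sqrt

def cramers_v(chi_square, n, rows, cols):
    """Cramér's V from a chi-square value, sample size and table dimensions."""
    phi_square = chi_square / n        # phi-square = chi-square / N
    m = min(rows - 1, cols - 1)        # smaller of rows-1 and columns-1
    return sqrt(phi_square / m)

v_church = cramers_v(chi_square=12.70, n=100, rows=2, cols=2)
print(round(v_church, 2))
```

So although the chi-square value of 12.70 looks large, the strength of the association is moderate on the 0 to 1 scale.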

Yule’s Q: for 2x2 tables only, with cells
a b
c d
Q = (ad − bc)/(ad + bc)
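A minimal sketch of Yule's Q, applied to the two church-attendance tables from the earlier slides. The function name is illustrative, and the sign of Q depends on how the rows and columns are ordered.

```python
def yules_q(a, b, c, d):
    """Yule's Q for a 2x2 table with cells a, b (first row) and c, d (second row)."""
    return (a * d - b * c) / (a * d + b * c)

# Scenario 1 (independence): Male 28/12, Female 42/18.
q_independent = yules_q(28, 12, 42, 18)

# Scenario 2: Male 20/20, Female 50/10.
q_associated = yules_q(20, 20, 50, 10)

print(q_independent, round(q_associated, 3))
```

Q is 0 under independence and moves toward −1 or +1 as the association strengthens.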

Collapsing tables: one can often combine columns/rows to increase expected counts that are too low.
–may increase or reduce interpretability
–may create or destroy structure in the table
There are no clear guidelines.
–avoid simply trying to identify the combination of cells that produces a “significant” result

Figure: observed counts and expected counts for two tables.

Gamma, Tau-b, Tau-c… So our independent variable, education, reduces our error in predicting Biblical literalism by either 22.2% (tau-b), 18.8% (tau-c) or a whopping 38.3% (gamma). SPSS also reports the significance level, but let me come back to that later.

Why are there multiple measures of association? Statisticians over the years have thought of varying ways of characterizing what a perfect relationship is. Either of the two tables shown (one with tau-b = 1 and gamma = 1, the other with tau-b < 1 and gamma = 1) might be considered a perfect relationship, depending on one’s reasoning about what relationships between variables look like.
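The difference between gamma and tau-b can be illustrated by counting concordant and discordant pairs. The sketch below uses two hypothetical 2x2 tables, not the ones on the slide: gamma ignores tied pairs, while tau-b penalizes them.

```python
from math import sqrt

def pair_counts(table):
    """Concordant, discordant, total, row-tied and column-tied pairs of an ordered table."""
    rows, cols = len(table), len(table[0])
    concordant = discordant = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):        # cells in a lower row
                for l in range(cols):
                    if l > j:                   # lower row, column further right
                        concordant += table[i][j] * table[k][l]
                    elif l < j:                 # lower row, column further left
                        discordant += table[i][j] * table[k][l]
    n = sum(sum(row) for row in table)
    total_pairs = n * (n - 1) // 2
    tied_rows = sum(r * (r - 1) // 2 for r in (sum(row) for row in table))
    tied_cols = sum(c * (c - 1) // 2 for c in (sum(col) for col in zip(*table)))
    return concordant, discordant, total_pairs, tied_rows, tied_cols

def gamma_and_tau_b(table):
    c, d, n0, n1, n2 = pair_counts(table)
    gamma = (c - d) / (c + d)                         # ignores all tied pairs
    tau_b = (c - d) / sqrt((n0 - n1) * (n0 - n2))     # corrects for ties
    return gamma, tau_b

diagonal = [[5, 0], [0, 5]]   # hypothetical: all cases on the diagonal
one_sided = [[5, 0], [5, 5]]  # hypothetical: one empty off-diagonal cell

print(gamma_and_tau_b(diagonal))
print(gamma_and_tau_b(one_sided))
```

In the second table there are no discordant pairs, so gamma calls the relationship perfect, while tau-b does not because many pairs are tied.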

The problem: Chi-Squared tests are for nominal associations. If we use a chi-squared test when there is an ordinal association, we waste some information. Chi-Squared tests cannot distinguish the following patterns:
Pattern 1
wages   like job?
        no   maybe   yes
low     +    −       −
med     −    +       −
high    −    −       +
Pattern 2
wages   like job?
        no   maybe   yes
low     +    −       −
med     −    −       +
high    −    +       −

Rule of Thumb: Gamma tends to overestimate strength but gives an idea of the upper boundary. If the table is square, use tau-b; if rectangular, use tau-c. Pollock: τ < .1 is weak; .1 < τ < .2 is moderate; .2 < τ < .3 is moderately strong; .3 < τ < 1 is strong.

One Qualitative and One Quantitative Variable
1. When bivariate data results from one qualitative and one quantitative variable, the quantitative values are viewed as separate samples.
2. Each set is identified by levels of the qualitative variable.
3. Each sample is described using summary statistics, and the results are displayed for side-by-side comparison.
4. Statistics for comparison: measures of central tendency, measures of variation, 5-number summary.
5. Graphs for comparison: dotplot, boxplot.

Example: A random sample of households from three different parts of the country was obtained and their electric bill for June was recorded. The data is given in the table below. The part of the country is a qualitative variable with three levels of response. The electric bill is a quantitative variable. The electric bills may be compared with numerical and graphical techniques.

Comparison using Box-and-Whisker plots:

Connection between two ordinal variables Example:

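Connection between two ordinal variables Measure: Spearman’s Rank Correlation Coefficient
$$r_s = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}$$
where d_i is the difference between the two ranks of observation i and n is the number of observations.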

Spearman's rank correlation coefficient, or Spearman's rho, is named after Charles Spearman. It is denoted by the Greek letter ρ (rho) or by r_s, and is a non-parametric measure of statistical dependence between two variables. It assesses how well the relationship between two variables can be described using a monotonic function. A monotonic (or monotone) function in mathematics is one that preserves or reverses the given order. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.

A correlation coefficient is a numerical measure or index of the amount of association between two sets of scores. It ranges in size from a maximum of +1.00 through 0.00 to −1.00. The ‘+’ sign indicates a positive correlation (the scores on one variable increase as the scores on the other variable increase). The ‘−’ sign indicates a negative correlation (as the scores on one variable increase, the scores on the other variable decrease).

Interpretation The sign of the Spearman correlation indicates the direction of association between X (the independent variable) and Y (the dependent variable). If Y tends to increase when X increases, the Spearman correlation coefficient is positive. If Y tends to decrease when X increases, the Spearman correlation coefficient is negative. A Spearman correlation of zero indicates that there is no tendency for Y to either increase or decrease when X increases. An alternative name for the Spearman rank correlation is the "grade correlation"; in this, the "rank" of an observation is replaced by the "grade". When X and Y are perfectly monotonically related, the Spearman correlation coefficient becomes +1 (increasing relationship) or −1 (decreasing relationship). A perfect monotone increasing relationship implies that for any two pairs of data values (Xi, Yi) and (Xj, Yj), Xi − Xj and Yi − Yj always have the same sign.

Example # 1 Calculate the correlation between the IQ of a person and the number of hours spent in class per week.
IQ (Xi)   Hours of class per week (Yi)
106       7
86        0
100       27
101       50
99        28
103       29
97        20
113       12
112       6
110       17
Find the value of the term d²i:
1. Sort the data by the first column (Xi). Create a new column xi and assign it the ranked values 1, 2, 3, ..., n.
2. Sort the data by the second column (Yi). Create a fourth column yi and similarly assign it the ranked values 1, 2, 3, ..., n.
3. Create a fifth column di to hold the differences between the two rank columns (xi and yi).

4. Create one final column to hold the value of column di squared.
IQ (Xi)   Hours of class per week (Yi)   rank xi   rank yi   di   d²i
86        0                              1         1         0    0
97        20                             2         6         −4   16
99        28                             3         8         −5   25
100       27                             4         7         −3   9
101       50                             5         10        −5   25
103       29                             6         9         −3   9
106       7                              7         3         4    16
110       17                             8         5         3    9
112       6                              9         2         7    49
113       12                             10        4         6    36

Example 1 - Result With d²i found, we can add them to find Σd²i = 194. The value of n is 10, so
$$\rho = 1 - \frac{6 \times 194}{10\,(10^2 - 1)} = 1 - \frac{1164}{990} \approx -0.18$$
The value is close to zero, which shows that the correlation between IQ and hours spent in class is very weak (and slightly negative).
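The worked example can be reproduced with a short Python sketch. The helper `ranks` is illustrative and assumes there are no tied values, which holds for this data set.

```python
iq = [106, 86, 100, 101, 99, 103, 97, 113, 112, 110]
hours = [7, 0, 27, 50, 28, 29, 20, 12, 6, 17]

def ranks(values):
    """Rank values from 1 (smallest) to n; assumes no ties."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

rank_x = ranks(iq)
rank_y = ranks(hours)

# Squared rank differences d_i^2 for each person.
d_squared = [(rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y)]
sum_d_squared = sum(d_squared)

n = len(iq)
rho = 1 - 6 * sum_d_squared / (n * (n ** 2 - 1))

print(sum_d_squared, round(rho, 2))
```

With tied values the ranks would need to be averaged, and this simple formula would no longer be exact.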

RECOMMENDED RESOURCES The books below explain statistics simply, without excessive mathematical or logical language.
–David S. Moore: The Basic Practice of Statistics. W. H. Freeman Publishers, 2003
–Geoffrey Norman and David Streiner: PDQ Statistics. 3rd Edition. BC Decker, 2003
–David Bowers, Allan House, David Owens: Understanding Clinical Papers. 2nd Edition. Wiley, 2006
–Douglas Altman et al.: Statistics with Confidence. 2nd Edition. BMJ Books, 2000
