Introduction to the Practice of Statistics

Instructor: Alex Kulik
Office: C-11, p. 2.03

Your grade: homework (25%), quizzes (25%), midterm 1 (25%)
A total grade of 90% = bdb (very good), 70% = db (good), 50% = dst (satisfactory). Late homework and quizzes are not accepted. Class participation is required. Contact the instructor if you expect problems with taking an exam.

Textbook: Introduction to the Practice of Statistics, 4th edition, by David S. Moore and George P. McCabe, available in the library of C-11. We will go through Chapters 1-12, omitting Chapter 11.

To do: Get a calculator, especially for tests.
Install MS Excel at home; it will occasionally be used for your homework. Regularly visit our web page for the schedule, lecture notes, assignments, solutions and tables.

Data: We use data to answer scientific questions.
Data has variability. To assess the evidence data provide, we need to distinguish signal from noise.

Example: Study the effect of exercise on cholesterol levels. One group exercises and another does not. Is cholesterol reduced by exercise? Consider: people differ; other factors may have an effect; exercise may affect other factors.

What is Statistics? The science of understanding data and making decisions in the face of variability/randomness. The set of methods to analyze the data and to design the experiment in order to extract information and quantify its reliability.

Section 1.1 (Numbering as in the textbook) Data set: Individuals and Variables
Individuals – objects described by a set of data (people, animals, things).
Variable – a characteristic of the individuals.

Types of Variables (diagram): Variables are either Quantitative (Continuous or Discrete) or Categorical (Ordinal or Not ordinal).

Types of variables
Quantitative (numerical):
  Continuous: e.g. height, weight, concentration
  Discrete: e.g. number of customers, flowers
Categorical (non-numerical):
  Ordinal: e.g. choices on a survey: never, rarely, occasionally, often, always
  Non-ordinal: e.g. shape, race

Example: Information on employees

Exploratory data analysis variables
Distribution = description of count or percent.
Categorical variables: visualize the distribution by using a bar chart or pie chart.
Quantitative variables: visualize the distribution by a stemplot or histogram.

Education of 25- to 34-year-olds (US)
                         Count (in millions)   Percent
Less than high school            4.7             12.3
High school graduate            11.8             30.7
Some college                    10.9             28.3
Bachelor’s degree                8.5             22.1
Advanced degree                  2.5              6.6
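
As a rough illustration of how a distribution of a categorical variable turns counts into percents, here is a minimal Python sketch using the counts from the education table. The text bar chart is only a stand-in for the bar graph on the slide. Because the counts are rounded to one decimal, the recomputed percents can differ from the tabulated ones by about a tenth of a percent.

```python
# Counts in millions, copied from the education table in this section.
counts = {
    "Less than high school": 4.7,
    "High school graduate": 11.8,
    "Some college": 10.9,
    "Bachelor's degree": 8.5,
    "Advanced degree": 2.5,
}

total = sum(counts.values())

# Percent for each category = 100 * count / total count.
percents = {level: 100 * count / total for level, count in counts.items()}

for level, pct in percents.items():
    bar = "#" * round(pct / 2)  # one '#' per 2 percentage points
    print(f"{level:<22} {pct:5.1f}%  {bar}")
```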

Bar graph of education

Pie chart of education

Distribution of quantitative variables
Individual observations often differ: we observe a cloud rather than a few values.
The distribution of a quantitative variable is displayed by a histogram.

Examining distributions
Describe the pattern:
Shape – e.g. symmetric or skewed in one direction; the number of modes.
Center – e.g. the midpoint.
Spread – e.g. the range between the smallest and the largest values.
Look for outliers – individual values that do not match the overall pattern.

A glimpse at the distribution
Example: Numbers of home runs that Babe Ruth hit in each of his 15 years (1920–1934) with the New York Yankees.
Stemplot, also called stem-and-leaf plot: Leaf = the last digit; Stem = all but the last digit.

Draw the stem-and-leaf plot:
a) write the stems
b) write the leaves for each stem
c) order the leaves on each stem
We can increase the number of stems by splitting each stem into two, e.g. one with leaves 0 to 4 and one with leaves 5 to 9. We can also round the numbers before making a stemplot.
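
The three steps can be sketched in Python as follows. This is a minimal sketch that assumes non-negative integers and does not split stems; the function name `stemplot` and the sample values are hypothetical and are not the home run data from the slide.

```python
from collections import defaultdict

def stemplot(values):
    """Return the lines of a stem-and-leaf plot for non-negative integers."""
    stems = defaultdict(list)
    # Sorting first means the leaves on each stem end up ordered (step c).
    for v in sorted(values):
        stems[v // 10].append(v % 10)  # stem = all but last digit, leaf = last digit
    lines = []
    # Step a: write every stem between the smallest and the largest, even if empty.
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}")  # step b: leaves next to their stem
    return lines

# Hypothetical sample values, for illustration only.
sample = [42, 23, 35, 61, 38, 47, 31, 50, 42, 58]
plot = stemplot(sample)
print("\n".join(plot))
```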

Back-to-back stemplot
Compare the counts of Babe Ruth’s hits and Mark McGwire’s hits:

Distribution at large: Histograms

Frequency Table of the Hispanic data
Class    Count   Percent
...      30      60
...      1       2
...      10      20
...      4       8

Histogram of Percent of Hispanic adults

Histogram, comments:
The ranges of the variable are called bins.
Bins should be convenient: usually of equal length and covering the whole range of the data.
The number of bins is a matter of judgement; choose e.g. an integer close to the square root of the number of observations.
Frequency histogram = shows counts.
Relative frequency histogram = shows percents.
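
A minimal Python sketch of these comments: it picks a number of equal-length bins close to the square root of the number of observations and returns both counts (frequency) and percents (relative frequency). The function name `frequency_table` and the sample values are hypothetical. In practice one would usually round the bin edges to convenient values by hand.

```python
import math

def frequency_table(values, num_bins=None):
    """Return (left edge, right edge, count, percent) for equal-length bins."""
    n = len(values)
    if num_bins is None:
        # Rule of thumb from the slide: an integer close to sqrt(n).
        num_bins = max(1, round(math.sqrt(n)))
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * num_bins
    for v in values:
        # The largest value is placed in the last bin rather than past it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c, 100 * c / n)
            for i, c in enumerate(counts)]

# Hypothetical sample values, for illustration only.
sample = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 7, 8, 9, 9, 10]
table = frequency_table(sample)
for left, right, count, percent in table:
    print(f"{left:5.2f} to {right:5.2f}: count={count}, percent={percent:.1f}")
```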

Labelling the graph is important!
The horizontal axis is for the variable. The vertical axis is for the counts/frequencies or relative frequencies/percents. Remember to label the axes precisely as in our examples.

,800 nanoseconds

Give frequency table of Newcomb’s data

Draw frequency histogram of Newcomb’s data (Then relative frequency histogram)

Histogram of Newcomb’s data (note left outliers)

Other plots: e.g. time series
May exhibit hidden mechanisms.
Trend – a persistent, long-term rise or fall.
Seasonal variation – a pattern that repeats itself at known regular intervals of time.
...less important in this course.

Time plots. Newcomb’s data.

Section 1.2 Describing distributions with numbers:
Mean, Median, Quartiles, Boxplot, Standard deviation, Changing the unit of measurement

Measures of Centre
Mean – the arithmetic mean (average) of a data set.
Denoted by $\bar{x}$: $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$
The mean is easily influenced by outliers, i.e. it is not resistant.

Median
The median is the midpoint of a distribution:
Sort the data in increasing order.
The median is the (n+1)/2-th observation if n is odd, and the average of the two middle observations if n is even.
The median is a resistant measure of center: outliers do not influence it much.
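
The rule for the median, and the contrast with the mean, can be sketched in Python. The function names and the sample values are hypothetical and chosen only to show that one extreme value moves the mean but not the median.

```python
def mean(values):
    """Arithmetic mean: the sum of the observations divided by their number."""
    return sum(values) / len(values)

def median(values):
    """Midpoint of the sorted data, following the odd/even rule above."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th observation, which is index n // 2 when counting from 0.
        return s[mid]
    # n even: average of the two middle observations.
    return (s[mid - 1] + s[mid]) / 2

# Hypothetical values; the second list replaces the largest value by an outlier.
data = [2, 3, 5, 7, 11]
with_outlier = [2, 3, 5, 7, 110]

print(mean(data), median(data))
print(mean(with_outlier), median(with_outlier))
```

The median is 5 for both lists, while the mean moves from 5.6 to 25.4 once the outlier is present.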

Mean vs. Median In a symmetric distribution mean=median
In a skewed distribution the mean is further out in the long tail than the median is. Example: The mean price of existing houses sold in was 176,200. The median price of these houses was 139,000.

Measures of spread
Quartiles:
Q2 (second quartile) = median
Q1 (first quartile) = median of the lower “half” of the sorted data
Q3 (third quartile) = median of the upper half of the sorted data
p-th percentile – a number q such that approximately p percent of the observations are smaller than q.
Q1, Q2, Q3 are the 25th, 50th and 75th percentiles.
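
A minimal Python sketch of the quartile rule, which also yields the five-number summary (minimum, Q1, median, Q3, maximum). It assumes that for an odd number of observations the overall median is left out of both halves; software packages often use slightly different conventions, so their quartiles may not match exactly. The function names and sample values are hypothetical.

```python
def median(values):
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) as medians of the lower half, all data, upper half."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    lower = s[:half]
    # Assumption: with n odd, the middle observation belongs to neither half.
    upper = s[half + 1:] if n % 2 == 1 else s[half:]
    return median(lower), median(s), median(upper)

def five_number_summary(values):
    q1, q2, q3 = quartiles(values)
    return min(values), q1, q2, q3, max(values)

# Hypothetical sample values, for illustration only.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(five_number_summary(sample))
```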

The Interquartile Range and the criterion for outliers
The interquartile range: IQR = Q3 - Q1.
An observation is an outlier if it falls more than 1.5*IQR above the third quartile or more than 1.5*IQR below the first quartile. We often remove the outliers from the data.
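
The 1.5*IQR criterion can be sketched in Python as below. The quartiles are computed as medians of the lower and upper halves of the sorted data (leaving out the middle observation when the number of observations is odd), which is an assumption about the convention. The function name `iqr_outliers` and the sample values are hypothetical.

```python
from statistics import median

def iqr_outliers(values):
    """Return (IQR, lower fence, upper fence, outliers) for the 1.5*IQR rule."""
    s = sorted(values)
    n = len(s)
    half = n // 2
    q1 = median(s[:half])
    # Assumption: with n odd, the middle observation belongs to neither half.
    q3 = median(s[half + 1:] if n % 2 == 1 else s[half:])
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = [v for v in s if v < low_fence or v > high_fence]
    return iqr, low_fence, high_fence, outliers

# Hypothetical sample values, for illustration only.
sample = [10, 12, 13, 14, 15, 16, 18, 40]
iqr, low_fence, high_fence, outliers = iqr_outliers(sample)
print(iqr, low_fence, high_fence, outliers)
```

Here Q1 = 12.5 and Q3 = 17, so IQR = 4.5 and any value below 5.75 or above 23.75 is flagged; only 40 qualifies.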

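Standard deviation
Deviation of the i-th observation: $x_i - \bar{x}$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$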
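
A minimal Python sketch of these quantities, assuming the usual sample definitions: the deviation of an observation is its distance from the mean, the variance is the sum of squared deviations divided by n - 1, and the standard deviation is the square root of the variance. The function names and sample values are hypothetical.

```python
import math

def deviations(values):
    """Deviation of each observation from the mean."""
    m = sum(values) / len(values)
    return [x - m for x in values]

def sample_variance(values):
    """Sum of squared deviations divided by n - 1 (assumed convention)."""
    return sum(d * d for d in deviations(values)) / (len(values) - 1)

def sample_sd(values):
    return math.sqrt(sample_variance(values))

# Hypothetical sample values, for illustration only (mean is 5).
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(deviations(sample))       # the deviations always sum to zero
print(sample_variance(sample))  # 32 / 7
print(sample_sd(sample))
```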

Five-Number Summary: Minimum, Q1, Median, Q3, Maximum
Boxplot – visual representation of the five-number summary.

Statistics: Minicomp. City | Minicomp. Highway | Two-seater City | Highway | w/o outlier
mean:   13.4  25.8  19.2  14.1  23.4
median: 18  25  26  14.5
Q1:     16  23  13  21
Q3:     20  28  27
SD:     2.42  3.16  11.2  11.5  5.07  5.34

Boxplots

Hispanics data: the histogram...

...and a boxplot... Modified boxplot: outliers shown.

Five-Number Summary VS. Standard Deviation
s = 0 when there is no spread.
s is not resistant.
The five-number summary usually better describes a skewed distribution or a distribution with outliers.
Mean and standard deviation are usually used for reasonably symmetric distributions without outliers.

Linear Transformations: xnew = a + b*xold
Examples: xmiles = 0.62*xkm; xg = 28.35*xoz

Linear transformations do not change the shape of a distribution.
They do change the center and the spread, e.g.:
Pythons   1      2      3      4      5
oz        1.13   1.02   1.23   1.06   1.16
g         32     29     35     30     33

Effect of a linear transformation: xnew = a + b*xold
meannew = a + b*meanold
mediannew = a + b*medianold
stdnew = |b|*stdold
IQRnew = |b|*IQRold
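
These rules can be checked numerically with a minimal Python sketch using the python weights in ounces from the slide and the conversion xg = 28.35*xoz (so a = 0 and b = 28.35). The helper `transform` is a hypothetical name. The second check uses arbitrary values a = 5, b = -2 to show why the absolute value of b appears in the rule for the standard deviation.

```python
import math
from statistics import mean, median, stdev

def transform(values, a, b):
    """Apply the linear transformation xnew = a + b*xold to every value."""
    return [a + b * x for x in values]

# Python weights in ounces, copied from the slide.
oz = [1.13, 1.02, 1.23, 1.06, 1.16]
grams = transform(oz, a=0, b=28.35)

print(mean(oz), median(oz), stdev(oz))
print(mean(grams), median(grams), stdev(grams))

# The summaries in grams agree with the rules applied to the summaries in ounces.
assert math.isclose(mean(grams), 28.35 * mean(oz))
assert math.isclose(median(grams), 28.35 * median(oz))
assert math.isclose(stdev(grams), 28.35 * stdev(oz))

# Arbitrary illustrative values with a negative slope.
flipped = transform(oz, a=5, b=-2)
assert math.isclose(mean(flipped), 5 - 2 * mean(oz))
assert math.isclose(stdev(flipped), 2 * stdev(oz))  # |b| = 2, not -2
```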

Calculate the mean, median and SD for the weight of pythons:
         in [g]   in [oz]
Mean
Median
SD
