Summarising and presenting data www.anu.edu.au/nceph/surfstat/


2 Summarising and presenting data www.anu.edu.au/nceph/surfstat/

3 Types of data Two broad types: qualitative and quantitative. Qualitative data arise when the observations fall into separate, distinct categories. Examples are: colour of eyes (blue, green, brown, etc.); exam result (pass or fail); socio-economic status (low, middle or high). Such data are discrete.

4 Quantitative Data Quantitative or numerical data arise when the observations are counts or measurements. Discrete if the measurements are integers: number of people in a household, number of cigarettes smoked per day. Continuous if the measurements can take any value (usually within some range): weight, height, time.

5 Variables and statistics Quantities such as sex and weight are called variables, because their values vary from one observation to another. Numbers calculated to describe important features of the data are called statistics. For example, the proportion of females and the average age of unemployed persons in a sample of residents of a town are statistics.

6 Example: Commodore data Prices of n=38 second-hand cars:
 6000   6700   3800   7000   5800   9975  10500   5990
20000  11990  16500  10750   9500  12995  12500   8000
 9900  18000   9500   9400   7250  15000   4500   8900
 9850   9000   5800  29500* 15000   9000   4250   4990
11000   9990   2200   4000  13500  14500
Continuous data, need to summarise

7 Constructing a frequency distribution Calculate the range and divide it by the chosen number of intervals to get the approximate length of each interval. Usually use from 5 to 15 intervals. Define interval end points so that they don't overlap or leave gaps (i.e. they are mutually exclusive and exhaustive). This ensures that every observation belongs to exactly one interval. It is usually simpler to have all intervals of the same length. Count the number of values in each interval (the class frequency): go through the data once only and use tally marks to help counting. Relative frequencies or percentages are usually helpful for showing the distribution of the data.
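The steps above can be sketched in Python for the Commodore prices. This is a minimal illustration, not the deck's own table: the interval start (2000) and length (3000) are assumptions, chosen because the range 29500 - 2200 = 27300 split into about 10 intervals suggests a length near 3000, and the helper name `frequency_distribution` is hypothetical. The price list is the slide 6 data as read from the transcript.

```python
# Commodore prices from slide 6 (n = 38), as read from the transcript.
prices = [
    6000, 6700, 3800, 7000, 5800, 9975, 10500, 5990,
    20000, 11990, 16500, 10750, 9500, 12995, 12500, 8000,
    9900, 18000, 9500, 9400, 7250, 15000, 4500, 8900,
    9850, 9000, 5800, 29500, 15000, 9000, 4250, 4990,
    11000, 9990, 2200, 4000, 13500, 14500,
]


def frequency_distribution(data, start, width):
    """Count integer values in equal-length intervals [lo, lo + width).

    Returns a list of (lower, upper, count) rows, where upper is the
    largest integer inside the interval (for example 8000 - 10999).
    """
    n_intervals = (max(data) - start) // width + 1
    counts = [0] * n_intervals
    for x in data:
        # Integer division sends every value to exactly one interval.
        counts[(x - start) // width] += 1
    return [
        (start + k * width, start + (k + 1) * width - 1, c)
        for k, c in enumerate(counts)
    ]


# Assumed choice of intervals: start at 2000, length 3000.
table = frequency_distribution(prices, start=2000, width=3000)
n = len(prices)
for lo, hi, count in table:
    print(f"{lo:>6} - {hi:<6} {count:>3} {100 * count / n:5.1f}%")
```

With these assumed intervals the most frequent class is 8000 - 10999, which agrees with the modal interval quoted on slide 11.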

8 Frequency distribution

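9 Histogram Area of rectangle = frequency (or relative frequency). But area = length x height. So if all intervals are the same length, L, then height = frequency / L, i.e. the height of each rectangle is proportional to its frequency.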

10 Features of a histogram

11 Mode The mode is the value or category which occurs most frequently. If several data values occur with the same maximal frequency, they are all modes. For example, in the Commodore data, using the grouped data, the class interval, [8,000 - 10,999], is the modal interval.

12 Modality and Symmetry Modality: no. of peaks, e.g. one peak = unimodal. Skewness: departure from symmetry; positive skewness (skew to the right), negative skewness (skew to the left).

13 Human histogram

14 Human histogram explained

15 Process control example Is process in control? Why the gap? Deming 500 steel rods Ideal dia. = 1cm

16 MEASURES OF CENTRAL TENDENCY ("Averages") Mean (arithmetic mean): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ (read as 'x bar'). Notation: denote the data values by $x_1, x_2, \dots, x_n$; $n$ denotes the no. of data points.

17 Mean for frequency distribution
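The formula on this slide did not survive in the transcript. As a hedged reconstruction, the usual approximation for grouped data replaces every observation by the midpoint of its interval; the symbols $k$ (number of intervals), $f_j$ (class frequency) and $m_j$ (class midpoint) are assumptions, not notation taken from the deck.

```latex
\bar{x} \approx \frac{1}{n} \sum_{j=1}^{k} f_j \, m_j ,
\qquad n = \sum_{j=1}^{k} f_j
```

The result is only approximate, because the individual values inside each interval are no longer known.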

18 Median ‘Middle’ value of the data set: a number which is greater than half the data values and less than the other half, i.e. the (n+1)/2-th ordered observation. Data set: 6, 6.7, 3.8, 7, 5.8 Ordered: 3.8, 5.8, 6, 6.7, 7 Median: (5+1)/2 = 3rd ordered obs. = 6. If even: 6, 6.7, 3.8, 7, 5.8, 9.975 Ordered: 3.8, 5.8, 6, 6.7, 7, 9.975 Median: (6+1)/2 = 3.5th ordered obs. = (6 + 6.7)/2 = 6.35.
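A minimal Python sketch of the (n+1)/2 rule, using the two data sets from this slide. For even n the position falls halfway between two observations, and the sketch averages them, as slide 21 does for its 10.5th observation. The function and variable names are illustrative only.

```python
def median(data):
    """Return the (n+1)/2-th ordered observation.

    For even n that position lies halfway between two observations,
    so the two middle values are averaged.
    """
    ordered = sorted(data)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]          # 0-based index of the middle value
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_set = [6, 6.7, 3.8, 7, 5.8]      # n = 5: 3rd ordered observation
even_set = odd_set + [9.975]         # n = 6: average of 3rd and 4th

print(median(odd_set))
print(median(even_set))
```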

19 Quartiles and percentiles Median: 50% below, 50% above. 1st quartile: 25% below, 75% above. $Q_1$: (n+1)/4-th ordered observation. $Q_3$ (3rd quartile): 3(n+1)/4-th ordered observation. Data set: 6, 6.7, 3.8, 7, 5.8 Ordered: 3.8, 5.8, 6, 6.7, 7 p-th percentile or quantile: p% below, (100-p)% above.
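A small Python sketch of reading quartiles off the ordered data. It places the first quartile at position (n+1)/4 and the third at 3(n+1)/4, the position symmetric to (n+1)/4 from the top of the data. Linear interpolation between neighbouring observations at fractional positions is an assumption; the slides only show the halfway case. The helper name `ordered_observation` is hypothetical.

```python
def ordered_observation(data, position):
    """Value at a 1-based, possibly fractional, position in the sorted data.

    Fractional positions are linearly interpolated between the two
    neighbouring ordered observations (an assumed convention).
    """
    ordered = sorted(data)
    position = min(max(position, 1), len(ordered))   # stay inside the data
    lower = int(position)
    frac = position - lower
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])


data = [6, 6.7, 3.8, 7, 5.8]
n = len(data)
q1 = ordered_observation(data, (n + 1) / 4)          # 1.5th observation
med = ordered_observation(data, (n + 1) / 2)         # 3rd observation
q3 = ordered_observation(data, 3 * (n + 1) / 4)      # 4.5th observation
print(q1, med, q3)
```

For this data set the first quartile lies halfway between 3.8 and 5.8, and the third quartile halfway between 6.7 and 7.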

20 Stem and leaf plot Finally order the leaves

21 Percentiles via stem and leaf plot Get the median: median = (n+1)/2-th ordered obs., i.e. the 10.5th ordered observation, which lies in the stem 7|. Median = (72+76)/2 = 74. Get the 1st quartile: $Q_1$ = (n+1)/4-th ordered obs. Get the 3rd quartile: $Q_3$ = 3(n+1)/4-th ordered obs.

22 Percentiles from a freq. distr. What are the median, 1st and 3rd quartiles? Actual values are 6700, 5900 and 10200. You lose details in a frequency distribution.

23 Comparison of Mean and Median Data set A: 2,3,3,4,5,7,8 Data set B: 2,3,3,4,5,8,20 Both have n = 7 values. The median is not affected by extreme values, but the mean is changed. Median is useful for incomplete data. E.g. consider an experiment to measure average lifetime of a light bulb (n=6) : 200,400, 650, 700, 900,..
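The comparison can be checked with Python's standard `statistics` module. For the light-bulb example the sixth lifetime is not given on the slide, so the sketch stands in `float("inf")` for it; this is a placeholder that assumes only that the unobserved lifetime exceeds the observed ones, not a real measurement.

```python
from statistics import mean, median

a = [2, 3, 3, 4, 5, 7, 8]
b = [2, 3, 3, 4, 5, 8, 20]      # same as A apart from the two largest values

for name, data in (("A", a), ("B", b)):
    print(name, "mean =", round(mean(data), 2), "median =", median(data))

# Incomplete data: five bulbs have failed, the sixth is still burning.
# float("inf") is a placeholder for the unknown sixth lifetime (assumption).
bulbs = [200, 400, 650, 700, 900, float("inf")]
print("median lifetime =", median(bulbs))   # uses only the 3rd and 4th values
```

The mean of the bulb lifetimes cannot be computed until the last bulb fails, while the median is already fixed by the third and fourth ordered values.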

24 Comparing Mean, Median and Mode If distribution is symmetric and unimodal, all three coincide If only symmetric, mean and median coincide If distribution is not symmetric, better to use median than mean

25 MEASURES OF VARIABILITY Statistics which summarise how spread out the data values are. Also called measures of dispersion The range = max-min (used in quality control) The range is susceptible to extreme values

26 IQR The interquartile range is defined as IQR = $Q_3 - Q_1$. IQR is less susceptible to outliers (like the median).

27 Five number summary Boxplot (or box-and-whisker plot) Box contains middle 50% of data If an obs is > 3 times IQR, it is an outlier
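As an illustration, the five numbers behind a boxplot (minimum, first quartile, median, third quartile, maximum) can be computed for the Commodore prices with Python's standard library. `statistics.quantiles` with `method="exclusive"` places the quartiles at the (n+1)/4, 2(n+1)/4 and 3(n+1)/4-th ordered observations and interpolates linearly between neighbours; treating that as the deck's convention is an assumption. The price list is the slide 6 data as read from the transcript.

```python
from statistics import quantiles

# Commodore prices from slide 6 (n = 38), as read from the transcript.
prices = [
    6000, 6700, 3800, 7000, 5800, 9975, 10500, 5990,
    20000, 11990, 16500, 10750, 9500, 12995, 12500, 8000,
    9900, 18000, 9500, 9400, 7250, 15000, 4500, 8900,
    9850, 9000, 5800, 29500, 15000, 9000, 4250, 4990,
    11000, 9990, 2200, 4000, 13500, 14500,
]

# Quartiles at positions (n+1)/4, (n+1)/2 and 3(n+1)/4 of the sorted data.
q1, med, q3 = quantiles(prices, n=4, method="exclusive")

five_number_summary = (min(prices), q1, med, q3, max(prices))
iqr = q3 - q1        # length of the box, which holds the middle 50% of data

print("five-number summary:", five_number_summary)
print("IQR:", iqr)
```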

28 Boxplots are useful for comparing groups

29 Deviations from the mean

30 Summarising deviations from mean The deviation of each value $x_i$ from the mean is: $d_i = x_i - \bar{x}$. The mean (or sum) of deviations is not a good summary: $\sum_{i=1}^{n} d_i = 0$. Instead use a positive function such as $d_i^2$ or $|d_i|$. Variance or mean square error: $\frac{1}{n}\sum_{i=1}^{n} d_i^2$. Mean absolute deviation: $\frac{1}{n}\sum_{i=1}^{n} |d_i|$.

31 Variance and Standard Deviation Usually n-1 instead of n is used in the denominator: sample variance $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$. Problem: squared distances have squared units, so take the square root: $s = \sqrt{s^2}$ = the sample standard deviation.

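32 Example: small data set Data set A: $\{x_i\}$ = 2, 3, 3, 4, 5, 7, 8. There are n=7 observations and mean = 4.57. The deviations from the mean, $d_i$, are: -2.57, -1.57, -1.57, -0.57, 0.43, 2.43, 3.43. So the sum of squared deviations is 29.71, the sample variance is $s^2$ = 29.71/6 = 4.95 and $s$ = 2.23.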
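The arithmetic for data set A can be reproduced with a few lines of Python. The sketch computes the deviations, confirms that they sum to zero, and then forms the summaries from slides 30 and 31; variable names are illustrative, and `statistics.variance` (which divides by n-1) is used only as a cross-check.

```python
from math import sqrt
from statistics import variance

x = [2, 3, 3, 4, 5, 7, 8]
n = len(x)
x_bar = sum(x) / n

deviations = [xi - x_bar for xi in x]         # d_i = x_i - x_bar
sum_sq = sum(d * d for d in deviations)       # sum of squared deviations

mean_sq = sum_sq / n                          # divisor n (slide 30)
s2 = sum_sq / (n - 1)                         # sample variance, divisor n-1
s = sqrt(s2)                                  # back in the original units
mad = sum(abs(d) for d in deviations) / n     # mean absolute deviation

print("sum of deviations:", round(sum(deviations), 10))
print("s^2 =", round(s2, 2), " s =", round(s, 2), " MAD =", round(mad, 2))
print("library check:", round(variance(x), 2))
```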

33 Shortcut formulae for variance
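The formulae on this slide are missing from the transcript. A hedged reconstruction of the standard shortcut forms, written in the notation of slides 16 and 31, is:

```latex
s^2 = \frac{1}{n-1}\left( \sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2 \right)
    = \frac{1}{n-1}\left( \sum_{i=1}^{n} x_i^2
      - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n} \right)
```

For data set A, $\sum x_i^2 = 176$ and $\sum x_i = 32$, so $s^2 = (176 - 32^2/7)/6 \approx 4.95$, matching the value obtained from the deviations.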

34 Bivariate methods We have (mostly) looked at univariate methods Most interesting problems are bi (or multi) variate Continuous variable vs. qualitative variable: comparative boxplot Continuous variable vs. continuous variable: scatterplot

35 Presenting bivariate data Scatterplots are useful for illustrating the relationship between continuous variables $(x_i, y_i)$, $i = 1, \dots, n$. Indicates type of relationship.

36 Creating a scatterplot Step 1: Create variables ht and wt Step 2: plot(ht, wt, xlab="height", ylab="weight")

37 Summarising a relationship plot(temperature, ozone); abline(lm(ozone ~ temperature, data=air))

38 Summarising a nonlinear relationship Use a smoother: plot(E, NOx); lines(supsmu(E, NOx))

