# Computing in Archaeology Basic Statistics Week 8 (25/04/07) © Richard Haddlesey www.medievalarchitecture.net.


Aims
- To familiarise ourselves with KEY statistical terms and their meanings
- To understand the use of stats in archaeology
- To assign variables appropriate levels of measurement at the recording level

Key texts

Basic Stats
- Batch = post holes
- Variables = length, area, diameter
- Case = post hole ID

Variables
- Variables are measured according to one of FOUR levels:
  1. Nominal = arbitrary name
  2. Ordinal = sequence with no distance
  3. Interval = sequence with fixed distance
  4. Ratio = sequence with a fixed datum

Vince NOIR
- N = Nominal
- O = Ordinal
- I = Interval
- R = Ratio

Nominal examples
- Condition
- Age
- Diameter
- Length
- Context
- Period

Ordinal examples
- Condition
  1. Excellent
  2. Good
  3. Fair
  4. Poor
- Here 2 lies between 1 and 3, but the distances between the grades are unlikely to be equal

Interval examples
- Period
  1. Late Bronze (1200-650)
  2. Early Iron (649-100)
  3. Late Iron (100+)
- Here, if we have 3 artefacts dated (a) 150BC, (b) 300BC and (c) 450BC, b is an equal distance from a and c, but c is not three times as old as a, even though 450 is three times 150.
- This is because the BC scale has no datum (no true zero point).

Ratio examples
- Age instead of period
  - 1000 ya (years ago) is twice as old as 500 ya
  - 20kg is twice 10kg
- Ratio is the highest level of measurement because it has a datum
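The difference between interval and ratio can be checked with a little arithmetic. The sketch below reuses the three dates from the interval example (150BC, 300BC, 450BC) and converts them to ages. The reference year (AD 2007, the year of this lecture) and the decision to ignore the missing year zero are assumptions made purely for illustration.

```python
# Calendar dates from the interval example, as years BC.
dates_bc = {"a": 150, "b": 300, "c": 450}

# Assumption: ages are counted back from AD 2007, ignoring the
# missing year zero to keep the arithmetic simple.
REFERENCE_YEAR_AD = 2007
ages = {name: year_bc + REFERENCE_YEAR_AD for name, year_bc in dates_bc.items()}

# Interval property: differences are meaningful and do not depend on the datum.
gap_ab = dates_bc["b"] - dates_bc["a"]
gap_bc = dates_bc["c"] - dates_bc["b"]
same_gap_in_ages = (ages["b"] - ages["a"]) == gap_ab

# Ratio property: dividing BC numbers gives 3.0, but that figure only
# reflects where the BC/AD boundary happens to sit.
naive_ratio = dates_bc["c"] / dates_bc["a"]

# Ages have a true zero (the present), so their ratio is meaningful.
age_ratio = ages["c"] / ages["a"]

print(f"gaps: {gap_ab}, {gap_bc}; same gap in ages: {same_gap_in_ages}")
print(f"BC-number ratio: {naive_ratio:.2f}, age ratio: {age_ratio:.2f}")
```

Under these assumptions artefact c is only about 1.14 times as old as a, not three times, while the 150-year gaps are the same on either scale.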

[Figure: Mortlake style bowl, Fengate style bowl and Grooved ware jar, illustrating Nominal, Ordinal and Interval]

Note!
- Avoid using 0 or 1 to indicate such variables as yes or no, as we may need to tell "no" apart from "no data"
- Also, when recording presence or absence, you may wish to add a "missing" value to avoid confusion
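A minimal sketch of this advice, assuming a hypothetical set of post-hole records in which some entries were never filled in. The record IDs, the charcoal variable and the `code_presence` helper are invented for illustration.

```python
from collections import Counter

# Hypothetical field records: was charcoal present in each post hole?
# None means "not recorded", which is different from a recorded "no".
charcoal_present = {"PH01": "yes", "PH02": "no", "PH03": None, "PH04": "yes"}

ALLOWED = {"yes", "no", "missing"}

def code_presence(value):
    """Map a raw entry to an explicit code, keeping 'no' and 'missing' apart."""
    return "missing" if value is None else value

coded = {ph: code_presence(v) for ph, v in charcoal_present.items()}

# Every coded value must be one of the three allowed codes.
assert set(coded.values()) <= ALLOWED

counts = Counter(coded.values())
print(counts)
```

With a 0/1 scheme, PH03 would have to be stored as 0 and would be indistinguishable from PH02.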

Further distinction
- Nominal and Ordinal = categorical = qualitative
- Interval and Ratio = continuous = quantitative

Coding
- Nominal and Ordinal variables often need coding, to minimise errors, via a keyword index
  - con = context
  - str = stray find
  - set = settlement
  - bur = burial
- Avoid 1, 2, 3, etc., as you will have to keep looking up their meanings, which is time consuming

Coding NOTE! EVERY DATA VALUE MUST HAVE A CODE AND ONLY ONE CODE!
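The "one code and only one code" rule can be checked mechanically. The sketch below uses the four keyword codes from the coding slide; the helper functions `check_index` and `encode` are hypothetical names, not part of any package.

```python
# Keyword index from the coding slide: code -> meaning.
KEYWORD_INDEX = {
    "con": "context",
    "str": "stray find",
    "set": "settlement",
    "bur": "burial",
}

def check_index(index):
    """Return True if no two codes share the same meaning."""
    meanings = list(index.values())
    return len(meanings) == len(set(meanings))

def encode(meaning, index=KEYWORD_INDEX):
    """Return the single code for a recorded value; fail loudly otherwise."""
    matches = [code for code, m in index.items() if m == meaning]
    if len(matches) != 1:
        raise ValueError(f"{meaning!r} has {len(matches)} codes, expected exactly 1")
    return matches[0]

index_ok = check_index(KEYWORD_INDEX)
print(index_ok, encode("burial"))
```

A value with no entry in the index (for example a find type nobody anticipated) raises an error instead of being silently stored under the wrong code.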

Grouping
- Good for periods, as in
  - Late Bronze (1200-650)
  - Early Iron (649-100)
  - Late Iron (100+)
- NOTE: it is better to record as a continuous variable (e.g. 780BC), then group as an output (e.g. Late Bronze)
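Recording the date and deriving the period afterwards might look like the sketch below. The period bands are the ones listed on the grouping slide. Reading "Late Iron (100+)" as "later than 100BC" is an assumption, as is the `period_for` helper and every date in the list except 780BC.

```python
def period_for(year_bc):
    """Group a date in years BC into the period bands from the slide.

    Assumption: 'Late Iron (100+)' means dates later than 100BC,
    so 100BC itself stays in Early Iron.
    """
    if 650 <= year_bc <= 1200:
        return "Late Bronze"
    if 100 <= year_bc <= 649:
        return "Early Iron"
    if 0 < year_bc < 100:
        return "Late Iron"
    return "unassigned"

# Record the continuous value; derive the group only when producing output.
# 780 is the slide's example, the other dates are made up.
recorded_dates_bc = [780, 649, 300, 50]
grouped = [(year, period_for(year)) for year in recorded_dates_bc]

for year, period in grouped:
    print(f"{year}BC -> {period}")
```

Because the original dates are kept, the bands can be redrawn later without returning to the finds.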

Good Practice
- Always keep a CLEAN version of the original data set

Exploring the data

example data set

univariate frequency table

| species | frequency |
|---|---|
| cattle | 187 |
| sheep | 109 |
| pig | 78 |
| horse | 21 |
| Total | 395 |

bivariate frequency table

| species | pits | ditches | Total |
|---|---|---|---|
| cattle | 67 | 120 | 187 |
| sheep | 63 | 46 | 109 |
| pig | 41 | 37 | 78 |
| horse | 3 | 18 | 21 |
| Total | 174 | 221 | 395 |

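| species | pits | ditches | Total |
|---|---|---|---|
| cattle | 67 (39%) | 120 (54%) | 187 |
| sheep | 63 (36%) | 46 (21%) | 109 |
| pig | 41 (24%) | 37 (17%) | 78 |
| horse | 3 (2%) | 18 (8%) | 21 |
| Total | 174 (100%) | 221 (100%) | 395 |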
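The percentages in this table are column percentages: each count divided by the total for its feature type. A short sketch of that calculation, using the counts from the bivariate table (the variable names are illustrative only):

```python
# Counts from the bivariate frequency table: species -> feature -> count.
counts = {
    "cattle": {"pits": 67, "ditches": 120},
    "sheep": {"pits": 63, "ditches": 46},
    "pig": {"pits": 41, "ditches": 37},
    "horse": {"pits": 3, "ditches": 18},
}
features = ["pits", "ditches"]

# Column totals: all bones from pits, all bones from ditches.
column_totals = {f: sum(row[f] for row in counts.values()) for f in features}

# Column percentages, rounded to whole numbers as on the slide.
percentages = {
    species: {f: round(100 * row[f] / column_totals[f]) for f in features}
    for species, row in counts.items()
}

for species, row in percentages.items():
    print(species, row)
```

Rounded percentages need not add up to exactly 100: the pits column here sums to 101.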

Multivariate
- These tend to operate on a table, or matrix, of items described in terms of a set of variables

Pictorial displays for categorical data

bar chart

multiple bar chart

pie chart

Pictorial displays for continuous data

histogram

Basic descriptive statistics: mode, median, mean, range, variance, standard deviation

pottery fragments (weights in grams): 2, 2, 3, 5, 8

pottery fragments (weights in grams): 2, 2, 3, 5, 8 Mode = 2

Mode
- The mode is the only way to measure average/typical at the Nominal level
- If there are two modes, the data are bimodal (1,2,3,3,6,6,7,8,9)
- Three = trimodal, etc.
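As a quick check, Python's standard `statistics` module (3.8 or later) can find every mode of a data set. The numbers are the bimodal example from this slide; the list of context codes is made up to show that the mode also works on nominal values.

```python
import statistics

# Bimodal example from the slide: 3 and 6 both occur twice.
values = [1, 2, 3, 3, 6, 6, 7, 8, 9]
modes = statistics.multimode(values)
print(modes)

# The mode needs no arithmetic, so it also works on nominal codes.
# These find-type codes are hypothetical.
find_types = ["con", "bur", "con", "set"]
commonest = statistics.mode(find_types)
print(commonest)
```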

pottery fragments (weights in grams): 2, 2, 3, 5, 8 Mode = 2 Median = 3

Median
- Best for ordinal and above
- If the number of values is even, take the value midway between the two middle numbers
- (1,2,3,4,5,6,7,8: (4+5)/2 = 4.5)
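The rule for odd and even counts can be written out step by step. This is a hand-rolled sketch for illustration; the standard library's `statistics.median` does the same job.

```python
def median(values):
    """Middle value of the sorted data; mean of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

even_example = median([1, 2, 3, 4, 5, 6, 7, 8])   # two middle values: 4 and 5
odd_example = median([2, 2, 3, 5, 8])             # pottery weights: middle value
print(even_example, odd_example)
```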

pottery fragments (weights in grams): 2, 2, 3, 5, 8 Mode = 2 Median = 3 Mean = (2+2+3+5+8)/5 = 4

Mean
- The most commonly used average; it only works for interval and ratio data
- It is the most important measure of position because a lot of further statistical analyses are based on it

Conclusion
- It is important to understand that the mode, median and mean are three quite different measures of position which can give three different values when applied to the same data-set

| | 2, 2, 3, 5, 8 | 2, 2, 3, 5, 6, 8 |
|---|---|---|
| Mode | 2 | 2 |
| Median | 3 | 4 |
| Mean | 4 | 4.333 |
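The two data sets in this conclusion can be run through Python's standard `statistics` module to confirm the three measures. This is a sketch for checking the arithmetic, not a recommended analysis workflow.

```python
import statistics

datasets = {
    "five fragments": [2, 2, 3, 5, 8],
    "six fragments": [2, 2, 3, 5, 6, 8],
}

summary = {}
for name, weights in datasets.items():
    summary[name] = {
        "mode": statistics.mode(weights),
        "median": statistics.median(weights),
        "mean": round(statistics.mean(weights), 3),
    }

for name, measures in summary.items():
    print(name, measures)
```

Adding a single 6g fragment leaves the mode unchanged but moves both the median and the mean.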

The skew: symmetrical, positive skew, negative skew

Measures of variability – the spread

pottery fragments (weights in grams): 2, 2, 3, 5, 8
- Range = max - min = 8 - 2 = 6
- Very simple and of limited use

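variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$

key: $x_i$ = each value, $\bar{x}$ = mean, $n$ = number of values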

pottery fragments (weights in grams): 2, 2, 3, 5, 8

Mean = (2+2+3+5+8)/5 = 4

variance ($s^2$):

$$s^2 = \frac{(2-4)^2 + (2-4)^2 + (3-4)^2 + (5-4)^2 + (8-4)^2}{5} = \frac{26}{5} = 5.2$$

variance standard deviation

pottery fragments (weights in grams): 2, 2, 3, 5, 8

variance: $s^2 = 5.2$

standard deviation: $s = \sqrt{s^2} = \sqrt{5.2} \approx 2.28$
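The three measures of spread can be reproduced with Python's standard `statistics` module. The slides divide by the number of values (5), which corresponds to `pvariance` and `pstdev`. The `variance` function divides by n - 1 instead and gives a different answer; it is shown here only for comparison.

```python
import statistics

weights = [2, 2, 3, 5, 8]  # pottery fragment weights in grams

data_range = max(weights) - min(weights)

# Divide-by-n form, matching the worked example on the slides.
variance_n = statistics.pvariance(weights)
std_dev_n = statistics.pstdev(weights)

# Divide-by-(n - 1) form, NOT the one used on the slides.
variance_n_minus_1 = statistics.variance(weights)

print(f"range = {data_range}")
print(f"variance = {variance_n:.1f}, standard deviation = {std_dev_n:.2f}")
print(f"variance with n - 1 = {variance_n_minus_1:.1f}")
```

When comparing results with other software, it is worth checking which divisor it uses.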

Summary
- Variables are measured according to one of FOUR levels:
  1. Nominal = arbitrary name
  2. Ordinal = sequence with no distance
  3. Interval = sequence with fixed distance
  4. Ratio = sequence with a fixed datum

Summary
- Measures of position (average/typical)
  - Mode
  - Median
  - Mean
- Measures of variability (the spread)
  - Range
  - Variance
  - Standard Deviation