# Transformations: Getting Normal or Using the Linear Model

## Presentation transcript


### Two Reasons to Transform

- Variables do not fit a normal distribution and parametric tests are desired.
- A relationship between two variables is non-linear, but a transformation would allow the use of linear regression.

### Non-Normal Data

Reasons real data can fail to follow a normal distribution:

- Errors in measurement are multiplicative rather than additive, e.g. ± 2% rather than ± 2 mm.
- Constraints on the dimensions of an artifact feature are not symmetrical, e.g. point length must exceed haft length but can be as long as the material allows.

### Non-Normal Data 2

- Measurements are products rather than sums of other measurements, e.g. area, volume.
- Counts follow binomial, Poisson, or negative binomial distributions, which are often asymmetrical unless sample sizes are large.

### Solutions

- Use non-parametric methods that do not depend on the normality of the data (increasingly easy to do).
- Use data transformations that reshape the distribution into one that is approximately normal.

### Transformation

- The goal is to change the spacing of the data to compress a long tail and draw out a flat tail.
- The transformation must preserve the order of the original data; only the spacing between data points changes.
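
To make the "same order, different spacing" idea concrete, here is a minimal sketch on simulated data. The lognormal sample and the small `skewness()` helper are assumptions made for illustration; they are not part of the original slides.

```r
set.seed(42)                                 # reproducible simulated data
x <- rlnorm(200, meanlog = 3, sdlog = 0.6)   # hypothetical right-skewed measurements
log_x <- log(x)                              # log transform compresses the long right tail

# Moment-based skewness, defined here for illustration (not from a package)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Sort by the raw values: the logged values must then be non-decreasing too
order_preserved <- all(diff(log_x[order(x)]) >= 0)

skew_raw <- skewness(x)      # skewness on the original scale
skew_log <- skewness(log_x)  # skewness after the log transform

order_preserved
c(raw = skew_raw, log = skew_log)
```

The ranks of the observations are untouched, so any rank-based summary is unchanged; only the distances between points are altered.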

### Transformation

Right-skewed data with many zeros cannot be transformed effectively, since nothing can stretch out observations that have the same value. For example, artifact counts by site or grid square are often Poisson distributed with many zeros.
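
A short sketch of why zeros defeat a transformation, using simulated Poisson counts. The rate of 0.5 and the sample size are hypothetical values chosen only to produce many zeros.

```r
set.seed(1)
counts <- rpois(500, lambda = 0.5)   # hypothetical artifact counts per grid square

# Share of observations sitting at the minimum before and after transforming
share_zero_raw   <- mean(counts == 0)
share_zero_sqrt  <- mean(sqrt(counts) == 0)    # square root maps 0 to 0
share_zero_log1p <- mean(log1p(counts) == 0)   # log(1 + x) also maps 0 to 0

c(raw = share_zero_raw, sqrt = share_zero_sqrt, log1p = share_zero_log1p)
```

Whatever monotone transformation is applied, the tied zeros stay tied, so the spike at the low end of the distribution remains.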

### An Example

- Using the DartPoints data set, we saw that Length was asymmetrical.
- Plot the kernel density of Length with and without a log scale to see the difference.
- To transform Length we would use `logLength <- log(DartPoints$Length)`.

```r
plot(density(DartPoints$Length), main="Dart Point Length", xlab="Normal scale")
plot(density(DartPoints$Length), main="Dart Point Length", xlab="Log scale", log="x")
```

### Common Transformations

Tail to the right:

- Natural or common (base 10) logarithm: no zero or negative values
- Square root, cube root, etc.: zeros OK
- Inverse, $-1/x$, $-1/x^2$, etc.: no zero values

Tail to the left:

- Exponential, $e^x$, $10^x$ (low values)
- Square, cube, etc.
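
The right-tail transformations above can be compared side by side. This is a sketch on simulated, strictly positive data; the `skewness()` helper and the lognormal parameters are assumptions for illustration only.

```r
set.seed(7)
x <- rlnorm(300, meanlog = 2, sdlog = 0.8)   # hypothetical right-skewed, all values > 0

# Moment-based skewness, defined here for illustration (not from a package)
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

# Progressively stronger transformations for a tail to the right
skews <- c(
  raw         = skewness(x),
  sqrt        = skewness(sqrt(x)),
  log         = skewness(log(x)),
  neg_inverse = skewness(-1 / x)   # negated so the order of the data is preserved
)
round(skews, 2)
```

Skewness falls at each step from raw to square root to log to negative inverse. The aim is to stop at the transformation that brings it closest to zero, since going further can overcorrect and leave a tail to the left.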

### Other Transformations

- Arctangent (inverse tangent) handles values between 0 and 1; it is used for population studies of non-metric traits.

### Transforming to Linear

- By transforming variables before using linear regression, we can fit non-linear equations.
- In some cases we can express the fitted equation in terms of the original untransformed variables.

### Polynomial

$y = a + b_1 x + b_2 x^2 + b_3 x^3 + b_4 x^4 + \dots$

- Create polynomial values or use the function `poly()` within `lm()`.
- Begin with linear and then work up to quadratic, cubic, and so on until the new terms are not significant.
- E.g. `lm(y~x+I(x^2)+I(x^3))`
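
A minimal sketch of the "work up by degree" procedure on simulated data. The quadratic relationship, the noise level, and the variable names are assumptions for illustration.

```r
set.seed(3)
x <- seq(1, 20, length.out = 60)
y <- 5 + 2 * x - 0.15 * x^2 + rnorm(60, sd = 1)   # simulated quadratic relationship

fit1 <- lm(y ~ x)                       # linear
fit2 <- lm(y ~ x + I(x^2))              # quadratic
fit3 <- lm(y ~ x + I(x^2) + I(x^3))     # cubic

# Sequential F tests: does each added term improve the fit significantly?
anova(fit1, fit2, fit3)

# poly() builds orthogonal polynomial terms; the fitted curve matches the I() version
fit2_poly <- lm(y ~ poly(x, 2))
same_fit <- isTRUE(all.equal(fitted(fit2), fitted(fit2_poly)))
same_fit
```

The coefficients from `poly()` differ from the `I()` version because the terms are orthogonalized, but the fitted values agree. Using `poly(x, 2, raw = TRUE)` gives coefficients on the original scale.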

### Power Function

- Log-log transformation
- Use `log()` to transform the dependent and independent variables ($x, y > 0$).
- Compute the linear regression:
  - $\log(y) = a + b \log(x)$
  - $y = A x^b$ (where $A = \exp(a)$)
- If $b = 1$, this is the same as a linear model through the origin ($y = Ax$).
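
The log-log fit and the back-transformation to $y = A x^b$ can be sketched as follows. The data are simulated with assumed true values $A = 3$ and $b = 0.7$ and multiplicative error; none of these numbers come from the slides.

```r
set.seed(11)
x <- runif(80, min = 1, max = 100)            # hypothetical predictor, x > 0
y <- 3 * x^0.7 * exp(rnorm(80, sd = 0.1))     # power law with multiplicative error

fit <- lm(log(y) ~ log(x))    # linear regression on the log-log scale

A <- exp(coef(fit)[[1]])      # intercept a converted back: A = exp(a)
b <- coef(fit)[[2]]           # slope is the exponent b

# Fitted curve expressed in the original, untransformed units
plot(x, y)
curve(A * x^b, from = 1, to = 100, add = TRUE)
c(A = A, b = b)
```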

### Exponential Function

- Semi-log transformation
- Use `log()` to transform the dependent variable ($y > 0$).
- Compute the linear regression:
  - $\log(y) = a + b x$
  - $y = A e^{bx}$ (where $A = \exp(a)$)
- Fits data with asymptotes.
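
A comparable sketch for the semi-log case, again on simulated data. The assumed true values are $A = 2$ and $b = 0.3$.

```r
set.seed(5)
x <- seq(0, 10, length.out = 50)              # x may include zero here
y <- 2 * exp(0.3 * x) * exp(rnorm(50, sd = 0.1))   # exponential growth, y > 0

fit <- lm(log(y) ~ x)         # only the dependent variable is logged

A <- exp(coef(fit)[[1]])      # A = exp(a)
b <- coef(fit)[[2]]           # rate parameter b

# Fitted curve in the original units
plot(x, y)
curve(A * exp(b * x), from = 0, to = 10, add = TRUE)
c(A = A, b = b)
```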

### Inverse Function

- Reciprocal transformation: $1/x$ where $x \neq 0$
- Used for distance models: marriage, trade, and social interaction decline with distance.
- Fits data with asymptotes.
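
A reciprocal model can be fitted directly in `lm()` by wrapping the term in `I()`. This sketch uses simulated distance-decay data; the variable names `distance` and `contact` and the true values (asymptote 10, coefficient 200) are hypothetical.

```r
set.seed(9)
distance <- seq(1, 50, length.out = 40)                 # hypothetical distances, none zero
contact  <- 10 + 200 / distance + rnorm(40, sd = 3)     # interaction declines with distance

fit <- lm(contact ~ I(1 / distance))    # linear in 1/distance

a <- coef(fit)[[1]]   # asymptote approached at large distances
b <- coef(fit)[[2]]   # strength of the distance decay

plot(distance, contact)
curve(a + b / x, from = 1, to = 50, add = TRUE)
c(a = a, b = b)
```

The intercept estimates the level that the response approaches as distance grows, which is the asymptote mentioned above.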

### Other Functions

- Logarithmic (no zeros in x): $y = a + b \log(x)$
- Square root (no negative values in x): $y = a + b \sqrt{x}$
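
Both forms transform only the independent variable, so the two fits share the same response and can be compared directly. The sketch below uses simulated data generated from a logarithmic relationship; the coefficients and the use of AIC for the comparison are assumptions for illustration.

```r
set.seed(21)
x <- seq(1, 100, length.out = 60)             # positive, so both transforms are defined
y <- 4 + 3 * log(x) + rnorm(60, sd = 0.3)     # simulated logarithmic relationship

fit_log  <- lm(y ~ log(x))    # y = a + b * log(x)
fit_sqrt <- lm(y ~ sqrt(x))   # y = a + b * sqrt(x)

# Same response in both models, so AIC values are comparable (lower is better)
aic <- c(logarithmic = AIC(fit_log), square_root = AIC(fit_sqrt))
aic
```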

### Examples

- Human cranial capacity over the last 1.8 million years
- Number of Identified Specimens (NISP) and Minimum Number of Individuals (MNI) at Chucalissa (Middle Mississippian site)

```r
# BrainsCC.RData
# Explore logs with scatterplot
RegModel.1 <- lm(BrainCC~AgeKa, data=BrainsCC) # Rcmdr (linear)
summary(RegModel.1) # Rcmdr
BrainsCC$logAge <- with(BrainsCC, log(AgeKa)) # Rcmdr
BrainsCC$logBrain <- with(BrainsCC, log(BrainCC)) # Rcmdr
RegModel.2 <- lm(logBrain~logAge, data=BrainsCC) # Rcmdr (power, log-log)
summary(RegModel.2) # Rcmdr
RegModel.3 <- lm(BrainCC~logAge, data=BrainsCC) # Rcmdr (logarithmic)
summary(RegModel.3) # Rcmdr

# Plot the data and overlay the three fitted models on the original scale
plot(BrainCC~AgeKa, data=BrainsCC, pch="+")
abline(RegModel.1, lty=1, lwd=2, col="black")
x <- seq(10, 1800, 10) # start above 0 so that log(x) is finite
logx <- log(x)
# Power model predicts log(BrainCC), so back-transform with exp()
lines(x, exp(predict(RegModel.2, data.frame(logAge=logx))), lty=1, lwd=2, col="red")
lines(x, predict(RegModel.3, data.frame(logAge=logx)), lty=1, lwd=2, col="blue")
legend("topright", c("Linear", "Power", "Logarithmic"), lty=1, lwd=2,
       col=c("black", "red", "blue"))
```

```r
LinearModel.4 <- lm(BrainCC ~ AgeKa + I(AgeKa^2), data=BrainsCC)
summary(LinearModel.4)
LinearModel.5 <- lm(BrainCC ~ AgeKa + I(AgeKa^2) + I(AgeKa^3), data=BrainsCC)
summary(LinearModel.5)
LinearModel.6 <- lm(BrainCC ~ AgeKa + I(AgeKa^2) + I(AgeKa^3) + I(AgeKa^4), data=BrainsCC)
summary(LinearModel.6)

plot(BrainCC~AgeKa, data=BrainsCC, pch="+")
abline(RegModel.1, lty=1, lwd=2, col="black")
x <- seq(0, 1800, 10)
lines(x, predict(LinearModel.4, data.frame(AgeKa=x)), lty=1, lwd=2, col="red")
lines(x, predict(LinearModel.5, data.frame(AgeKa=x)), lty=1, lwd=2, col="blue")
lines(x, predict(LinearModel.6, data.frame(AgeKa=x)), lty=1, lwd=2, col="green")
legend("topright", c("Linear", "Quadratic", "Cubic", "Quartic"), lty=1, lwd=2,
       col=c("black", "red", "blue", "green"))
```

```r
load("C:/Users/DCarlson/Documents/anth642/R/Data/Chucalissa.rda") #Rcmdr
plot(mni~nisp, data=Chucalissa)
RegModel.1 <- lm(mni~nisp, data=Chucalissa) #Rcmdr
summary(RegModel.1) #Rcmdr
abline(RegModel.1)
plot(mni~nisp, data=Chucalissa, log="xy") # Plot log-log transform
plot(mni~nisp, data=Chucalissa, log="y") # Plot semi-log transform

Chucalissa$logMNI <- log(Chucalissa$mni) # Create logged variables
Chucalissa$logNISP <- log(Chucalissa$nisp)
plot(logMNI~logNISP, data=Chucalissa)
RegModel.2 <- lm(logMNI~logNISP, data=Chucalissa) #Rcmdr
summary(RegModel.2) #Rcmdr
abline(RegModel.2)

plot(mni~nisp, data=Chucalissa) # plot log-log equation on original data
a2 <- exp(RegModel.2$coefficients[[1]]) # Convert a to exp(a)
b2 <- RegModel.2$coefficients[[2]]
a1 <- RegModel.1$coefficients[[1]]
b1 <- RegModel.1$coefficients[[2]]
curve(a2*x^b2, 0, 3250, add=TRUE) # power function
abline(RegModel.1, lty=3) # linear model, dotted
# Interactive: click once on the plot to place each equation label
text(locator(1), as.expression(substitute(MNI == a*NISP^b,
     list(a=round(a2, 4), b=round(b2, 4)))), pos=2)
text(locator(1), as.expression(substitute(MNI == a+b*NISP,
     list(a=round(a1, 4), b=round(b1, 4)))), pos=4)
```