TRANSFORMATION
DEFINITION
A data transformation of the observations x1, x2, …, xn is a function T that replaces each xi by a new value T(xi), so that the transformed values of the batch are T(x1), T(x2), …, T(xn).
WHY DO WE NEED TRANSFORMATION?
Transformations of the response and the predictors can improve the fit and correct violations of model assumptions, such as non-constant error variance, non-normality, or a nonlinear relation between the dependent and independent variables. We may also consider adding predictors that are functions of the existing predictors, such as quadratic or cross-product terms.
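As a minimal sketch (simulated data, hypothetical variable names), quadratic and cross-product terms can be added directly in an R model formula:

set.seed(1)
x1 <- runif(100); x2 <- runif(100)
y  <- 1 + 2*x1 + 3*x1^2 + 0.5*x1*x2 + rnorm(100, sd = 0.2)
fit1 <- lm(y ~ x1 + x2)                    # linear terms only
fit2 <- lm(y ~ x1 + x2 + I(x1^2) + x1:x2)  # add quadratic and cross-product terms
anova(fit1, fit2)                          # do the added terms improve the fit?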
WHICH TRANSFORMATION TO APPLY
Changes of origin and scale are linear transformations, and they leave shape alone:
Centering: a transformation of the origin.
Scaling: a transformation of both the origin and the scale.
Stronger transformations, such as the logarithm or square root, change shape. A simple and commonly used family is the power transformation.
POWER TRANSFORMATION
A power transformation replaces Yt by Yt^p (with ln Yt used in place of Yt^0 when p = 0). Common choices of p:

p       Transformation
-1      1/Yt
-0.5    1/Yt^0.5
 0      ln Yt
 0.5    Yt^0.5
 1      Yt (no transformation)
 2      Yt^2
EXAMPLE
Careful data analysis begins with inspection of the data, and techniques for examining and transforming data find direct application to the analysis of data using linear models. The data for the four plots in the figure were cleverly contrived by Anscombe (1973) so that the least-squares regression line and all other common regression ‘outputs’ are identical in the four datasets.
It is clear, however, that each graph tells a different story about the data:
In (a), the linear regression line is a reasonable descriptive summary of the tendency of Y to increase with X.
In (b), the linear regression fails to capture the clearly curvilinear relationship between the two variables; we would do much better to fit a quadratic function here.
In (c), there is a perfect linear relationship between Y and X for all but one outlying data point. The least-squares line is pulled strongly towards the outlier, distorting the relationship between the two variables for the rest of the data. When we encounter an outlier in real data, we should look for an explanation.
In (d), the values of X are invariant (all are equal to 8), with the exception of one point (which has an X-value of 19); the least-squares line would be undefined but for this point. We are usually uncomfortable having the result of a data analysis depend so centrally on a single influential observation. Only in this fourth dataset is the problem immediately apparent from inspecting the numbers.
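Anscombe's quartet is included in base R as the anscombe data set, so the point is easy to verify; a brief sketch:

data(anscombe)
# Fit the same simple regression to each of the four data sets
fits <- lapply(1:4, function(i)
  lm(anscombe[[paste0("y", i)]] ~ anscombe[[paste0("x", i)]]))
sapply(fits, coef)   # intercepts (~3.0) and slopes (~0.5) are essentially identical
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i))
  abline(fits[[i]])
}
par(op)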
TRANSFORMATIONS FOR NONLINEAR RELATIONS
Assume that the error terms are reasonably close to a normal distribution and have approximately constant variance, but there is a nonlinear relationship between X and Y. In this case, apply the transformation to X only, not to Y, because transforming Y may change its distribution and cause problems with normality and constant error variance.
[Figure: three scatterplot shapes with suggested predictor transformations: T(X) = log10(X) or X^(1/2); T(X) = X^2 or e^X; T(X) = 1/X or e^(-X).]
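A minimal sketch of transforming the predictor only (simulated data with an assumed logarithmic relationship):

set.seed(2)
x <- runif(100, 1, 50)
y <- 2 + 3*log10(x) + rnorm(100, sd = 0.1)
fit.lin <- lm(y ~ x)          # straight-line fit: systematic curvature remains
fit.log <- lm(y ~ log10(x))   # transform X only; Y is left untouched
par(mfrow = c(1, 2))
plot(x, y); abline(fit.lin)
plot(log10(x), y); abline(fit.log)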
WARNING
Regression coefficients will need to be interpreted with respect to the transformed scale. There is no straightforward way of back-transforming them to values that can be interpreted in the original scale. You cannot directly compare regression coefficients for models where the response transformation is different. Difficulties of this type may dissuade one from transforming the response, even if this requires the use of another type of model, such as a generalized linear model.
TRANSFORMATIONS FOR STABILIZING VARIANCE
When a variable has very different degrees of variation in different groups, it becomes difficult to examine the data and to compare differences in level across the groups. In this case, we need to apply the transformation to the response. Usually a power transformation helps to stabilize the variance. If we have a heavy-tailed symmetric distribution, a variance-stabilizing or power transformation is not helpful.
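A sketch of stabilizing the variance by transforming the response (simulated data whose spread grows with the mean):

set.seed(3)
x <- runif(200, 1, 10)
y <- exp(0.5 + 0.3*x + rnorm(200, sd = 0.3))   # multiplicative errors
fit.raw <- lm(y ~ x)
fit.log <- lm(log(y) ~ x)                      # log (power) transformation of the response
par(mfrow = c(1, 2))
plot(fitted(fit.raw), resid(fit.raw))   # funnel-shaped residuals
plot(fitted(fit.log), resid(fit.log))   # roughly constant spread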
TRANSFORMATION FOR NONNORMALITY
Many statistical methods require that the numeric variables we are working with have an approximate normal distribution. For example, t-tests, F-tests, and regression analyses all require in some sense that the numeric variables are approximately normally distributed.
TOOLS FOR ASSESSING NORMALITY
Descriptives: skewness = 0, kurtosis = 3
Plots: histogram, boxplot, density plot, normal quantile-quantile plot (QQ-plot)
Goodness-of-fit tests: Shapiro-Wilk, Kolmogorov-Smirnov, Anderson-Darling, Jarque-Bera
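These tools are all available in R; a sketch, assuming the add-on packages moments, nortest, and tseries are installed:

x <- rnorm(200)
library(moments)
skewness(x); kurtosis(x)                # roughly 0 and 3 for normal data
hist(x); boxplot(x); plot(density(x))
qqnorm(x); qqline(x)
shapiro.test(x)                         # Shapiro-Wilk
ks.test(x, "pnorm", mean(x), sd(x))     # Kolmogorov-Smirnov (parameters estimated from the data)
nortest::ad.test(x)                     # Anderson-Darling
tseries::jarque.bera.test(x)            # Jarque-Bera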
NORMAL QUANTILE PLOT: THE IDEAL PLOT
Here is an example where the data are perfectly normal. The plot on the right is a normal quantile plot with the data on the vertical axis and the expected z-scores (if the data were normal) on the horizontal axis. When the data are approximately normal, the two spacings agree, so the observations lie along the reference line in the normal quantile plot. The points should lie within the dashed lines.
Normal Quantile Plot (leptokurtosis)
The distribution of sodium levels of patients in this right heart catheterization study has heavier tails than a normal distribution (i.e., leptokurtosis). When the data are plotted against the expected z-scores in the normal quantile plot, there is an “S-shape,” which indicates kurtosis.
Normal Quantile Plot (discrete data)
Although the distribution of the gestational age data of infants in the very low birthweight study is approximately normal, there is a “staircase” appearance in the normal quantile plot. This is due to the discrete coding of gestational age, which was recorded to the nearest week or half-week.
Normal Quantile Plots: IMPORTANT NOTE
If you plot DATA vs. NORMAL, as on the previous slides:
downward bend = left skew; upward bend = right skew.
If you plot NORMAL vs. DATA:
downward bend = right skew; upward bend = left skew.
Tukey’s Ladder of Powers
Here V represents our variable of interest; we consider this variable raised to a power λ, i.e. V^λ. Because so many transformations are available, we need some way to organize them: Tukey’s ladder.
The middle rung is no transformation (λ = 1). The upper rungs are squares, cubes, and so on, that is, powers greater than 1; we go up the ladder to remove left skewness. The lower rungs are roots (0 < power < 1) and inverses (power < 0); we go down the ladder to remove right skewness. Inverse transformations are multiplied by -1 so that the ordering of the data is preserved. In place of the zeroth power we use the logarithm: the log of a number is the power to which you raise a “base” to obtain the number itself, e.g. log10(100) = 2 because 100 = 10^2, and log10(1000) = 3 because 1000 = 10^3 (logs can likewise be taken to base 2 or base e).
Generally, the further “up” or “down” the ladder you go, the more dramatic the impact. The practical questions are: How do you decide whether to go up or down? How far do you go? Do you transform the outcome or the predictor?
To remove right skewness, we typically take the square root, cube root, logarithm, or reciprocal of the variable, i.e. V^0.5, V^0.333, log10(V) (think of it as taking the place of V^0), V^-1, etc. To remove left skewness, we raise the variable to a power greater than 1, such as squaring or cubing the values, i.e. V^2, V^3, etc.
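A sketch of walking down the ladder for a right-skewed variable (simulated data; skewness() from the moments package assumed):

set.seed(4)
v <- rexp(500)                       # right-skewed variable
library(moments)
sapply(list(raw    = v,
            sqrt   = v^0.5,
            cubert = v^(1/3),
            log    = log10(v),
            recip  = -1/v),          # negated so the ordering of the data is preserved
       skewness)                     # going too far down overcorrects into left skew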
Removing Right Skewness
Example 1: PDP-LI levels for cancer patients. On the log base 10 scale, the PDP-LI values are approximately normally distributed.
Example 2: Systolic Volume for Male Heart Patients
[Figure: distributions of sysvol, sysvol^0.5, sysvol^0.333, log10(sysvol), and 1/sysvol.]
1/sysvol: The reciprocal of systolic volume is approximately normally distributed, and the Shapiro-Wilk test provides no evidence against normality (p = .5340). CAUTION: The reciprocal transformation reorders the data, in the sense that the largest value becomes the smallest and the smallest becomes the largest after transformation. The units after transformation may or may not make sense; e.g., if the original units are mg/ml, then after transformation they would be ml/mg.
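A two-line illustration of the reordering (and of why ladder-of-powers inverses are often negated):

v <- c(2, 5, 10)
1/v    # 0.50 0.20 0.10  -- the largest value becomes the smallest
-1/v   # -0.50 -0.20 -0.10 -- negating restores the original ordering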
The Lambert Way to Gaussianize Heavy-Tailed Data with the Inverse of Tukey’s h Transformation
Lambert W x F distributions are a generalized framework for analyzing skewed, heavy-tailed data. The framework is based on an input/output system, where the output random variable (RV) Y is a non-linearly transformed version of an input RV X ~ F, with similar properties to X but slightly skewed (heavy-tailed). The transformed RV Y has a Lambert W x F distribution. The R package ‘LambertW’, written by Georg M. Goerg, contains functions to model and analyze skewed, heavy-tailed data the Lambert way: simulate random samples, estimate parameters, compute quantiles, and plot/print results nicely. Probably the most important function is 'Gaussianize', which works similarly to 'scale' but actually makes the data Gaussian.
library(LambertW)
set.seed(10)

### Set parameters ####
# skew Lambert W x t distribution with (location, scale, df) = (0, 1, 3)
# and positive skew parameter gamma = 0.1
theta.st <- list(beta = c(0, 1, 3), gamma = 0.1)
# double heavy-tail Lambert W x Gaussian with (mu, sigma) = (0, 1),
# left delta = 0.2 and right delta = 0.4 (-> heavier on the right)
theta.hh <- list(beta = c(0, 1), delta = c(0.2, 0.4))

### Draw random samples ####
yy <- rLambertW(n = 1000, distname = "t", theta = theta.st)       # skewed Lambert W x t
zz <- rLambertW(n = 1000, distname = "normal", theta = theta.hh)  # double heavy-tail (= Tukey's hh)

### Plot ecdf and qq-plot ####
op <- par(no.readonly = TRUE)
par(mfrow = c(2, 2), mar = c(3, 3, 2, 1))
plot(ecdf(yy))
qqnorm(yy); qqline(yy)
plot(ecdf(zz))
qqnorm(zz); qqline(zz)
par(op)
### Parameter estimation ####
mod.Lst <- MLE_LambertW(yy, distname = "t", type = "s")
mod.Lhh <- MLE_LambertW(zz, distname = "normal", type = "hh")
layout(matrix(1:2, ncol = 2))
plot(mod.Lst)
plot(mod.Lhh)
Since this heavy-tail generation is based on a bijective transformation of RVs/data, you can remove the heavy tails from the data and check whether the result is Gaussian (and test it using normality tests).

### Test goodness of fit ####
## test if the 'symmetrized' data follow a Gaussian
xx <- get_input(mod.Lhh)
normfit(xx)

$shapiro.wilk
Shapiro-Wilk normality test
data: data.test
W = , p-value =
install.packages('quantmod')
library('quantmod')
getSymbols("GS")
y <- OpCl(GS)                  # daily percent change, open to close
qqnorm(y); qqline(y)

z <- Gaussianize(y, type = "h", return.tau.mat = TRUE)
x1 <- get_input(y, c(z$tau.mat[, 1]))   # same as z$input
test_normality(z$input)
plot(z$input)
qqnorm(z$input); qqline(z$input)
Transforming Proportions
Power transformations are often not helpful for proportions, since these quantities are bounded below by 0 and above by 1.
• If the data values do not approach these two boundaries, then proportions can be handled much like other sorts of data.
• Percents and many sorts of rates are simply rescaled proportions.
• It is common to encounter ‘disguised’ proportions, such as the number of questions correct on an exam of fixed length.
An example, drawn from the Canadian occupational prestige data, is shown in the stem-and-leaf display. The distribution is for the percentage of women among the incumbents of each of 102 occupations.
Several transformations are commonly employed for proportions; the most important is the logit transformation:
logit(P) = ln[P / (1 − P)]
The logit transformation is the log of the ‘odds,’ P/(1 − P). The ‘trick’ of the logit transformation is to remove the upper and lower boundaries of the scale, spreading out the tails of the distribution and making the resulting quantities symmetric about 0; for example, logit(0.5) = 0, logit(0.9) ≈ 2.20, and logit(0.1) ≈ −2.20.
The logit transformation cannot be applied to proportions of exactly 0 or 1.
If we have access to the original counts, we can define adjusted proportions P′ in place of P; a common convention is P′ = (F + 1/2)/(N + 1). Here, F is the frequency count in the focal category (e.g., the number of women) and N is the total count (the total number of occupational incumbents, women plus men).
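A brief sketch in R: qlogis() computes the logit and plogis() its inverse; the half-count adjustment shown here is the common convention mentioned above (the counts are hypothetical):

p <- c(0.1, 0.5, 0.9)
qlogis(p)                        # log(p/(1 - p)): -2.197, 0, 2.197
f <- c(0, 12, 102); n <- 102     # hypothetical counts of women out of n incumbents
p.adj <- (f + 0.5) / (n + 1)     # adjusted proportions avoid logit(0) and logit(1)
qlogis(p.adj)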
Interpreting Coefficients in Regression with Log-Transformed Variables
Log transformations are among the most commonly used transformations, but interpreting the results of an analysis with log-transformed data may be challenging. A log transformation is often useful for data which exhibit right skewness (positively skewed), and for data where the variability of residuals increases for larger values of the dependent variable. When a variable is log-transformed, note that simply taking the anti-log of the parameters will not properly back-transform them into the original metric.
To properly back-transform into the original scale, we need to understand some details about the log-normal distribution. In probability theory, a log-normal distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. More specifically, if a variable Y follows a log-normal distribution, then ln(Y) follows a normal distribution with mean μ and variance σ².
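Consequently, the mean of Y is exp(μ + σ²/2), not exp(μ); a quick simulation sketch:

mu <- 1; sigma <- 0.8
y <- rlnorm(1e5, meanlog = mu, sdlog = sigma)
mean(y)                  # simulated mean of Y
exp(mu)                  # naive anti-log of the mean of ln(Y): too small
exp(mu + sigma^2 / 2)    # proper back-transformation of the mean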
Interpreting parameter estimates in a linear regression when variables have been log-transformed is not always straightforward either. The standard interpretation of a regression parameter β is that a one-unit change in the predictor results in a change of β units in the expected value of the response variable, holding all the other predictors constant. Interpreting a log-transformed variable can also be done in such a manner. However, such coefficients are routinely interpreted in terms of percent change. Below we explore the interpretation in a simple linear regression setting when the dependent variable, or the independent variable, or both variables are log-transformed.
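A sketch of the percent-change interpretation when only the response is log-transformed (simulated data):

set.seed(5)
x <- runif(200, 0, 10)
y <- exp(0.2 + 0.05*x + rnorm(200, sd = 0.1))   # true effect: about a 5% increase in Y per unit of x
fit <- lm(log(y) ~ x)
b <- coef(fit)["x"]
100 * (exp(b) - 1)     # estimated percent change in Y for a one-unit increase in x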