1 Non-linear regression

2 Slide 1 Non-linear regression
All regression analyses aim to find the relationship between a dependent variable (y) and one or more independent variables (x) by estimating the parameters that define the relationship.
Functional form known:
–Non-linear relationships whose parameters can be estimated by linear regression, e.g., y = ax^b, y = ab^x, y = ae^bx (a short sketch of this log-transform approach follows below)
–Non-linear relationships whose parameters can be estimated by non-linear regression, e.g., the logistic growth model fitted in the following slides
Functional form unknown: lowess/loess. While lowess and loess are often treated as synonyms, some people do insist that they differ as described below:
–lowess: a locally weighted linear least squares regression, generally involving a single IV
–loess: a locally weighted linear or quadratic least squares regression, involving one or more IVs
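
A minimal R sketch of the first case above (parameters of y = ax^b estimated by linear regression after a log transform). The data here are made up purely for illustration, not from the lecture:

set.seed(1)
x <- 1:20
y <- 2.5*x^0.7*exp(rnorm(20, sd = 0.05))   # hypothetical data following y = a*x^b with noise
linFit <- lm(log(y) ~ log(x))              # log(y) = log(a) + b*log(x), an ordinary linear fit
a.hat <- exp(unname(coef(linFit)[1]))      # back-transform the intercept to recover a
b.hat <- unname(coef(linFit)[2])           # the slope estimates b directly
c(a = a.hat, b = b.hat)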

3 Xuhua Xia Commonly Encountered Functions: Logistic growth
[figure: logistic growth curve, N (0 to 50) plotted against Time (0 to 30)]
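
The curve in this figure can be reproduced with a few lines of R using the logistic equation N(t) = K*N0/(N0 + (K - N0)*exp(-r*t)) that is fitted with nls later in these slides; the parameter values below are arbitrary, chosen only to give a similar S-shaped curve:

logistic <- function(t, K, N0, r) N0*K/(N0 + (K - N0)*exp(-r*t))   # logistic growth curve
t <- seq(0, 30, by = 0.1)
plot(t, logistic(t, K = 50, N0 = 1, r = 0.4), type = "l", xlab = "Time", ylab = "N")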

4 Rationale of non-linear regression
Both linear and non-linear regression aim to find the parameter values that minimize the residual sum of squared deviations, RSS = Σ[y − E(y)]².
For linear regression, an analytical solution exists for the intercept (a) and slope (b); for non-linear regression such a solution often does not exist, and we need to try various combinations of parameter values.
Let us first pretend that we do not know the solution for a and b in linear regression, and try different a and b to find the parameter estimates that minimize RSS.
Xuhua Xia Slide 3
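
To make the idea concrete, here is a tiny sketch that computes RSS for two candidate (a, b) pairs of a straight line; the x and y values are hypothetical, not the humidity/weight-loss data used on the next slide:

x <- c(10, 20, 30, 40, 50)                     # hypothetical humidity-like values
y <- c(8.3, 7.6, 7.1, 6.5, 5.8)                # hypothetical weight-loss-like values
rss <- function(a, b) sum((y - (a + b*x))^2)   # RSS = sum of [y - E(y)]^2
rss(9, -0.06)                                  # RSS for one guess of (a, b)
rss(8, -0.04)                                  # a worse guess gives a larger RSS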

5 Get slope and intercept the hard way
Xuhua Xia Slide 4
The data set has been used before in our first lecture on regression: X is humidity and Y is weight loss. Double-click it and copy it to an EXCEL sheet. We will try different combinations of intercept (a) and slope (b) to find the combination that minimizes RSS. From the plot we can guess that a ≈ 9 and b ≈ −0.06.
The 3rd column is the predicted value: E(Y) = a − bX.
The 4th column is the squared deviation: [Y − E(Y)]².
You may first try different a and b values; better ones will make RSS smaller. Now use the EXCEL solver to automate this process. You may do an ordinary linear regression to check the parameter estimates.
Summary:
–Guesstimate parameter values.
–Try different parameter values to minimize RSS.
–The EXCEL solver will try parameter values from 0 up. If a parameter is negative, as the slope is in our case, express the predicted value E(Y) as a − bX.
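
The same trial-and-error can be automated in R with optim() playing the role of the EXCEL solver; this continues the hypothetical x and y from the sketch above and is not the original spreadsheet exercise:

rssVec <- function(p) sum((y - (p[1] + p[2]*x))^2)   # p[1] = intercept a, p[2] = slope b
sol <- optim(c(9, -0.06), rssVec)                    # start from the guesstimates
sol$par                                              # refined a and b
coef(lm(y ~ x))                                      # check against ordinary linear regression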

6 By using nls
Xuhua Xia Slide 5
Data (Time, N):
Time   N
0.5    20
1      42
1.5    75
2      149
2.5    278
3      515
3.5    1018
4      2372
4.5    4416
5      6533
5.5    13068
6      19624
6.5    32663
7      57079
7.5    66230
8      87369
8.5    95274
9      109380
9.5    99875
10     129872
[figure: scatter plot of N against Time]
Initial values of the parameters to estimate:
K (carrying capacity): 200000?
N0: 10?
r: 1.35?
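
If you do not have the nlinLogistic.txt file, the table above can be typed in as a data frame (values transcribed from this slide) and used in place of read.table on the next slide:

md <- data.frame(
  Time = seq(0.5, 10, by = 0.5),
  N = c(20, 42, 75, 149, 278, 515, 1018, 2372, 4416, 6533,
        13068, 19624, 32663, 57079, 66230, 87369, 95274, 109380, 99875, 129872)
)
plot(md$Time, md$N)   # reproduces the scatter plot on this slide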

7 Use EXCEL solver to do estimates
Xuhua Xia Slide 6
These (K, N0, and r) are our guesstimates. Now refine them by using the EXCEL solver (or by hand if you so wish).
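
For comparison, the RSS minimization that the EXCEL solver performs can be sketched in R with optim(); this assumes the md data frame from the sketch above (or read from nlinLogistic.txt) and is only an illustration, since nls on the next slide does the job more directly:

logisticRSS <- function(p) {                                  # p = c(K, N0, r)
  EN <- p[2]*p[1]/(p[2] + (p[1] - p[2])*exp(-p[3]*md$Time))   # expected N under the logistic model
  sum((md$N - EN)^2)                                          # residual sum of squares
}
sol <- optim(c(200000, 10, 1.35), logisticRSS,
             control = list(parscale = c(1e5, 10, 1)))        # rescale because K is much larger than r
sol$par                                                       # refined K, N0, r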

8 nls output
Xuhua Xia Slide 7
md<-read.table("nlinLogistic.txt",header=T)
attach(md)
fit<-nls(N~N0*K/(N0+(K-N0)*exp(-r*Time)),start=c(K=150000,N0=10,r=1.35))
summary(fit)
plot(Time,N)
lines(Time,fitted(fit))

Parameters:
     Estimate Std. Error t value Pr(>|t|)
K   1.232e+05  5.412e+03  22.759 3.59e-14
N0  2.708e+01  2.186e+01   1.239    0.232
r   1.151e+00  1.181e-01   9.753 2.23e-08
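
Two small follow-ups that are often useful after the fit (standard nls methods; the fine time grid just gives a smoother curve than lines(Time, fitted(fit))):

coef(fit)                                                        # point estimates of K, N0 and r
tt <- seq(0, 10, by = 0.1)                                       # fine grid of Time values
lines(tt, predict(fit, newdata = data.frame(Time = tt)), col = "blue")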

9 Xuhua Xia Slide 8 Fitting another equation
In rapidly replicating unicellular eukaryotes such as the yeast, highly expressed intron-containing genes require more efficient splicing sites than lowly expressed genes. GE: gene expression; SE: splicing efficiency.
Natural selection will operate on the mutations at the splicing sites to optimize splicing efficiency (SE).
Observation: SE increases with GE non-linearly, then levels off and appears to have reached a maximum.
Data (GE, SE):
GE   SE
1    0.46
2    0.47
3    0.57
4    0.61
5    0.62
6    0.68
7    0.69
8    0.78
9    0.70
10   0.74
11   0.77
12   0.78
13   0.74
14   0.80
15   0.80
16   0.78

10 Xuhua Xia Slide 9 Guesstimate initial values
For the model E(SE) = (α + β·GE)/(1 + γ·GE), fitted with nls on a later slide:
The minimum of E(SE) is α, when GE = 0: α ≈ 4.
The maximum of E(SE) is β/γ, when GE is large (e.g., 15): β/γ ≈ 8, i.e., β ≈ 8γ.
The relationship is almost linear when GE is small. When GE = 6, SE ≈ 6.5, so (4 + 6β)/(1 + 6γ) ≈ 6.5; with β ≈ 8γ this gives γ ≈ 0.278 and β ≈ 8 × 0.278 ≈ 2.22.
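
The arithmetic behind those guesses can be checked in a couple of lines of R (this simply re-does the calculation above; it is not part of the original slide):

g <- 2.5/9               # from (4 + 48*g) = 6.5*(1 + 6*g), i.e., 9*g = 2.5
b <- 8*g                 # beta = 8*gamma
c(gamma = g, beta = b)   # roughly 0.278 and 2.22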

11 Using EXCEL Solver
Xuhua Xia Slide 10
These (α, β, and γ) are our guesstimates. Now refine them by using the EXCEL solver (or by hand if you so wish).

12 R functions and output
Xuhua Xia Slide 11
md<-read.table("nlinGESE.txt",header=T)
attach(md)
fit<-nls(SE~(a+b*GE)/(1+g*GE),start=c(a=4,b=2.22,g=0.278))
summary(fit)
plot(GE,SE)
lines(GE,fitted(fit))

Parameters:
  Estimate     SE t value      P
a   2.6668 0.9741   2.738 0.0169
b   1.9694 0.8687   2.267 0.0411
g   0.2036 0.1043   1.951 0.0729

[figure: SE (4.5 to 8.0) against GE (5 to 15) with the fitted curve]

13 Xuhua Xia Slide 12 A general approach
Sometimes we do not know the functional form, so here is a general approach. Same problem as before, but we are not sure of the exact relationship between SE and GE.
Data (GE, SE):
GE   SE
1    0.46
2    0.47
3    0.57
4    0.61
5    0.62
6    0.68
7    0.69
8    0.78
9    0.70
10   0.74
11   0.77
12   0.78
13   0.74
14   0.80
15   0.80
16   0.78

14 A general approach
Xuhua Xia Slide 13
1. y increases with x at a decreasing rate: use a polynomial to approximate, e.g., y = a + bx + cx² when x < x0.
2. When x reaches a certain level (x0), y reaches its maximum and does not increase any more: y = ymax for x ≥ x0.

15 Xuhua Xia Slide 14 Guesstimate initial values
When GE = 0, SE = α, so α ≈ 4.
For a short segment of GE, the relationship between SE and GE is approximately linear, i.e., SE ≈ α + β·GE. When GE increases from 2 to 8, SE increases from 4.7 to 7.5, so β ≈ (7.5 − 4.7)/(8 − 2) ≈ 0.47.
Given the linear approximation, with α ≈ 4 and β ≈ 0.47, SE for GE = 12 should be 4 + 0.47 × 12 ≈ 9.6, but the actual SE is only about 7.7. This must be due to the quadratic term γ·GE², i.e., (7.7 − 9.6) = γ × 12², so γ ≈ −0.02.

16 Xuhua Xia Slide 15 A few more twists
The continuity condition requires that α + β·GE0 + γ·GE0² = c.
The smoothness condition requires that β + 2γ·GE0 = 0.
The two conditions imply that GE0 = −β/(2γ) and c = α − β²/(4γ).
We will find the α, β, and γ that minimize RSS = Σ[SE − E(SE)]².
We tell R to substitute various values for α, β, and γ, and find the set of values that minimizes RSS.
Note that GE0 and c are not parameters because they are functions of α, β, and γ.

17 R statements to do the job
md<-read.table("nlinGESE.txt",header=T)
attach(md)
# Function that computes RSS for a given set of parameters (minimized by optim below)
# a: alpha, b: beta, g: gamma, x0: GE0
myF <- function(x) {
  a <- x[1]
  b <- x[2]
  g <- x[3]
  x0 <- -b/2/g
  c <- a-b^2/4/g
  seg1Data <- subset(md,subset=(md$GE < x0))
  EY <- a+b*seg1Data$GE+g*seg1Data$GE*seg1Data$GE
  sumD2 <- sum((seg1Data$SE-EY)^2)
  seg2Data <- subset(md,subset=(md$GE >= x0))
  sumD2 <- sumD2 + sum((seg2Data$SE-c)^2)
  return(sumD2)
}
# obtain solution by supplying the initial values for a, b, g, and the function
sol <- optim(c(4,0.47,-0.02),myF)
a <- sol$par[1]
b <- sol$par[2]
g <- sol$par[3]
x0 <- -b/2/g
c <- a-b^2/4/g
seg1Data <- subset(md,subset=(md$GE < x0))
EY1 <- a+b*seg1Data$GE+g*seg1Data$GE*seg1Data$GE
PredY <- c(EY1,rep(c,length(GE)-length(seg1Data$GE)))
plot(GE,SE)
lines(GE,PredY, col="red")
abline(v=x0)

18 Output
$par
[1]  3.49320527  0.64625314 -0.02431488     (α, β, and γ)
$value
[1] 1.180377                                (RSS)
$counts
function gradient 
     150       NA 
$convergence
[1] 0                                       (0 means success)
c
[1] 7.787315
x0
[1] 13.28925
[figure: SE against GE with the fitted piecewise curve and a vertical line at x0]

19 Xuhua Xia Slide 18 Robust regression
LOWESS: robust local regression between Y and X, with linear fitting.
LOESS: robust local regression between Y and one or more Xs, with linear or quadratic fitting.
Used with relations that cannot be expressed in functional forms.
SAS: proc loess
Data:
–Data set: monthly averaged atmospheric pressure differences between Easter Island, Chile and Darwin, Australia, for a period of 168 months (NIST, 1998), suspected to exhibit 12-month (annual), 42-month (El Niño), and 25-month (Southern Oscillation) cycles (from Robert Cohen of SAS Institute).

20 lowess in R
Xuhua Xia Slide 19
md<-read.table("nlinGESE.txt",header=T)
attach(md)
fit<-loess(SE~GE,span=0.75,degree=1)
summary(fit)
pred<-predict(fit,GE,se=TRUE)
# or, to predict at specific GE values: pred<-predict(fit,c(3,6),se=TRUE)
plot(GE,SE)
lines(GE,pred$fit,col="red")

par(mfrow=c(2,3))
for(span in seq(0.4,0.9,0.1)) {
  fit<-loess(SE~GE,span=span)
  pred<-predict(fit,GE)
  sTitle<-paste0("span = ",span)
  plot(GE,SE,main=sTitle)
  lines(GE,pred,col="red")
}

Notes:
–span: the smoothing parameter α (the proportion of data points used in each local fit); larger = smoother, default = 0.75.
–degree: 1 for local linear fitting, 2 for local quadratic fitting; the default in loess is 2.
–Weighting is tricubic (proportional to (1 − (dist/maxdist)³)³).
How would I know which span value to use? (See the cross-validation sketch below.)
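
One practical answer to the span question is leave-one-out cross-validation: refit the loess while leaving out one point at a time and pick the span with the smallest prediction error. A minimal sketch, not from the original slides (loess may warn for very small spans, and points left out at the edge of the GE range give NA predictions, which are dropped):

cvScore <- function(s) {
  err <- sapply(seq_len(nrow(md)), function(i) {
    f <- loess(SE ~ GE, data = md[-i, ], span = s)               # fit without point i
    md$SE[i] - predict(f, newdata = data.frame(GE = md$GE[i]))   # prediction error for point i
  })
  mean(err^2, na.rm = TRUE)                                      # mean squared prediction error
}
spans <- seq(0.4, 0.9, by = 0.1)
cbind(span = spans, cv = sapply(spans, cvScore))                 # choose the span with the smallest cv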

21 [figure: loess fits of SE on GE for span = 0.4 to 0.9, produced by the loop on the previous slide]

22 Plotting the fitted values
> fit<-loess(SE~GE,span=0.8)
> pred<-predict(fit,GE,se=T)
> pred
$fit
[1] 4.445761...
$se.fit
[1] 0.2785894...
$residual.scale
[1] 0.3273702
$df
[1] 10.77648

t<-qt(0.975,pred$df)
ub<-pred$fit+t*pred$se.fit
lb<-pred$fit-t*pred$se.fit
plot(GE,SE)
lines(GE,pred$fit)
lines(GE,lb,col="red")
lines(GE,ub,col="red")
plot(GE,SE,ylim=c(min(lb),max(ub)))...
[figure: SE (4 to 8) against GE (5 to 15) with the loess fit and 95% confidence bands]
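
If the bands look angular because GE only takes integer values, the same predict() call can be made on a finer grid; a small usage variation, not part of the original slide:

geGrid <- seq(min(GE), max(GE), length.out = 100)                # fine grid of GE values
predG <- predict(fit, newdata = data.frame(GE = geGrid), se = TRUE)
t2 <- qt(0.975, predG$df)
plot(GE, SE, ylim = range(predG$fit - t2*predG$se.fit, predG$fit + t2*predG$se.fit))
lines(geGrid, predG$fit)
lines(geGrid, predG$fit - t2*predG$se.fit, col = "red")
lines(geGrid, predG$fit + t2*predG$se.fit, col = "red")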

