Dr. Mark Gardener 



Statistics for Ecologists Using R and Excel (Edition 2)Data Collection, Exploration, Analysis and Presentationby: Mark GardenerAvailable soon from Pelagic Publishing Welcome to the support pages for Statistics for Ecologists. These pages provide information and support material for the book. You should be able to find an outline and table of contents as well as support datafiles and additional material. Support Index  Exercises Index  Outline & TOC  Data files 

Exercise 11.3.2 

Table of Contents


Section 11.3.2 Multiple logistic regression: Example datafile: 
11.3.2 Model building in logisitc regressionThis exercise is concerned with how to build a logistic regression model (Section 11.3.2). The processes are very similar to those described in Section 11.1.2, where you saw a stepwise building process used with regular multiple linear regression. The data for this exercise are available as a separate file newt regression.RData. The file will open in R and create a data object, gcn. IntroductionWhen you have several (or indeed many) predictor variables you want to find a regression model that best describes the relationship between the variables. You should not incorporate every variable that you've got. Eventually you'll explain all the variability in your response variable simply because you've got so many explanatory predictors. The process of modelbuilding allows you to select the "best" variable to add to your current regression model. In the book you see how to carry out stepwise model building using a regular multiple regression (Section 11.1.2). In this exercise you can have a go at building a logistic regression model. The process is much the same as described in Section 11.1.2. 

Example data: Find the most important habitat factors for presence of great crested newts. Compare regression models. 
The example dataThe data you'll use comes from a survey of an amphibian (great creted newt) in southern England. You can get the data separately as newt regression.RData, which will load the gcn object to your console. > head(gcn) The first column, presence, gives the presence or absence of newts as a binary variable (0 = absence, 1= presence). The researcher used standard survey methods to detect the presence (or otherwise) of newts at 200 ponds. The other variables are habitat factors:
In this kind of survey the various habitat factors are converted to an index, the indices are combined to make a final HSI (habitat suitability index). The HSI is used to make it easier to assess waterbodies for their potential to support populations of the great crested newt and to give a measure of reproducibility to surveys. The example datafile is a "real" survey conducted on 200 ponds in Buckinghamshire in the UK. I've cut down the variables a little (the original data had the indices for each habitat variable) but essentially the dataset is unmodified. 

Use logisitic regression to: Find best factors 
The research taskThe research task is twofold:
The fisrt task will involve building a logistic regression model, which will determine that will show the habitat factors that are "most influential". The second tast will be to make a simple regression model using the HSI variable as a single predictor. The multiplfactor model and the HSI model can then be compared to see if there is any substantial difference in deviance explained. 

Use AIC values to determine the single best term o add to a regression model. Use add1() command to get AIC values in model building. 
Building a multiple regression modelTo build the model you'll need to use the glm() command, which carries out the logisitic regression (when you set family = binomial). You'll also need to use the add1() command. The command takes an existing regression model and adds all the variables you specify, one at a time. The results are then displayed. You then use the AIC values (a measure of how bad the model fit is) to select the next best variable to add to your model. You'll want to pick the variable with the lowest AIC value. With many potential variables you can of course end up with a comprehensive list. It is therefore helpful to see the significant variables. You can tell the add1() command to carry out a significance test too.


Start with a blank model, with an intercept only. 
Start with a blank modelThe starting point for your regression model should be one with an intercept only and no terms except the response. In our example the response is the presence variable. > gcn.glm = glm(presence ~ 1, data = gcn, family = binomial) You could use the summary() command to look at the result but it would not be very informative at this stage as you have not incorporated any variables (model terms). 

Use add1() to see which single variable is "best" to add to your regression model. Select a significant variable with the lowest AIC value. Replace the intercept in your blank model with the variable with the lowest AIC. 
Add the first termNow you have a blank model (intercept only) you will need to look at the effect of adding each variable in turn. The variable that has the lowest AIC value will be the one you want to incorporate (replace the intercept with the variable). The add1() command will "interrogate" the regression model and return a list of variables (from a source you specify, usually the original dataset). To make the selection process easier you can carry out a significance test at the same time. You'll want to include only variables that are significant. > add1(gcn.glm, scope = gcn, test = "Chisq") Single term additions Model: presence ~ 1 Df Deviance AIC LRT Pr(>Chi) <none> 261.37 263.37 area 1 258.88 262.88 2.492 0.1144364 dry 1 261.31 265.31 0.054 0.8159993 water 1 255.07 259.07 6.295 0.0121104 * shade 1 248.52 252.52 12.844 0.0003386 *** bird 1 261.03 265.03 0.338 0.5611190 fish 1 257.16 261.16 4.207 0.0402523 * other.ponds 1 252.05 256.05 9.317 0.0022700 ** land 1 260.80 264.80 0.568 0.4508771 macro 1 246.55 250.55 14.818 0.0001184 *** HSI 1 228.67 232.67 32.701 1.075e08 *** Note that the HSI variable has the lowest AIC value. Ignore that for the moment (we'll return to it later). For now just focus on the habitat variables. The macro variable has the lowest of the other AIC values (250.55) and is significant too. So you can use the up arrow on the R console to recall the earlier command and editi it to change the regression model. Replace the intercept with the macro variable. > gcn.glm = glm(presence ~ macro, data = gcn, family = binomial) Now you've got a model with a single term. 

Use add1() to add subsequent terms to the regression model. Stop if new terms are not significant. 
Add subsequent termsTo add more terms you:
If you follow the process you'll end up with a model like so: > gcn.glm = glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial) The water variable shows up as being significant if you runn add1() on this model but we'll stop because the pvalue is only 0.044. When building a multiple variable model it is not a bad thing to set a more rigorous cutoff value. The final model can be summarised: > summary(gcn.glm) Call: glm(formula = presence ~ macro + other.ponds + fish + shade, family = binomial, data = gcn) Deviance Residuals: Min 1Q Median 3Q Max 1.8022 0.9099 0.5447 0.9757 2.0516 Coefficients: Estimate Std. Error z value Pr(>z) (Intercept) 2.909742 0.704410 4.131 3.62e05 *** macro 0.018963 0.006368 2.978 0.00290 ** other.ponds 0.241031 0.078685 3.063 0.00219 ** fish 0.494789 0.172173 2.874 0.00406 ** shade 0.014619 0.005284 2.767 0.00566 **  Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 261.37 on 199 degrees of freedom Residual deviance: 222.57 on 195 degrees of freedom AIC: 232.57 Number of Fisher Scoring iterations: 4 

Use the deviance results to get a statistic equivalent to Rsquared. 
Explained devianceThe logistic regression does not return an Rsquared value. What you see is Null deviance and Residual deviance. You can use these values to get a value that is equivalent to the Rsquared, let's call it the Dsquared value:
Where model is your regression model result. > 1  (gcn.glm$deviance / gcn.glm$null.deviance) You can see the result is about 0.15, which is not really great when you consider that there are 4 explanatory variables in the model! 

Custom function to calculate Dsquared. 
Dsquared calculatorYou can make a simple custom command to calculate the Dsquared value. Type the following into your console to setup a custom command called d2: d2 = function(model, digits = 4) { round(1(model$deviance/model$null.deviance),digits=digits) } Now you can use d2(model) to carry out the calculations simply and easily. Just give the command the name of your regression model (you can also use the digits parameter to view different numbers of decimal places). 

Compare models using anova() or residual deviance (Dsquared). 
Compare modelsYou can compare models in two main ways:
First of all let's make some alternative models to compare: > mod1 = glm(presence ~ macro, data = gcn, family = binomial) The first model uses the single predictor macro (cover of macrophytes). This was the variable identified as the "best" of the habitatbased variables. The second model is the multiple variable model using the 4 best habitatbased variables. The third model uses the HSI variable alone. Recall that this is a composite index that uses information from several habitat factors in its construction. 

Use anova() to compare model deviance and examine differences between regresion models. 
Use anova()The anova() command can compare models. The name of the command is slightly misleading as it is not really analysis of variance. Rather it compares deviance of models that are based on the same set of variables. You can get a "test" of signifcance, with the logistic regression you need test = "Chisq" as you did with the add1() command. You can only really compare models that are based on the same subset of variables. You cannot really get meaningful results by comparing models that each contain only a single predictor variable. If you look at the logistic regression models for the newts: > anova(mod1, mod2, test = "Chisq") Analysis of Deviance Table Model 1: presence ~ macro Model 2: presence ~ macro + other.ponds + fish + shade Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 198 246.55 2 195 222.57 3 23.981 2.52e05 *** You can see that there is a significant difference between the two models. We would expect this of course, since we've been adding significant variables to the original model. Now look at the habitatbased model and the overall indexbased one: > anova(mod2, mod3, test = "Chisq") Analysis of Deviance Table Model 1: presence ~ macro + other.ponds + fish + shade Model 2: presence ~ HSI Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 195 222.57 2 198 228.67 3 6.0985 0.1069 This time there is no significant difference. Neither model is "better" than the other. This is interesting because the first model is based on only 4 habitat variables but the HSI index is formed from 9 different habitat variables, each of which is itself converted to an index. You cannot really use anova() to compare models with single variables: > anova(mod1, mod3, test = "Chisq") Analysis of Deviance Table Model 1: presence ~ macro Model 2: presence ~ HSI Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 198 246.55 2 198 228.67 0 17.883 You can get a "result" but there is no significance value as the models are not directly comparable. 

Use Dsquared values to compare explanatory "power" of different models. 
Use Dsquared valuesYou can explore differences between models using a different approach, their explanatory power. You can calculate a Dsquared value (equivalent to the Rsquared value of a regular regression). This approach is not strictly a statistical one but gives you an idea of "how much" better one model is than another. Look at the Dsquared values of each of the newt models: > d2(mod1) ; d2(mod2) ; d2(mod3) The first model (mod1, macropytes only) shows a Dsquared of a bit over 5%. The second habitatbased model (mod2) shows a value of nearly 15%, which is not great but is nearly 3 times better than mod1. The indexbased model was not statistically different from the habitatbased model. The Dsquared value of mod3 is 12.5%, which is a little smaller than the 15% of mod2. 

The best regression model does not always contain the most variables! 
SummaryThe more variables you add to a regression model the greater the "explanatory power", in other words the Dsquared value will keep increasing. However, it is rarely smart to keep adding variables just because you can measure them. The point of regression is to help you explain variability in the data. If variables are not significant they will only add a small fraction to the overall Dsquared. The best regression models explain as much of the variability as possible with the fewest variables. In this dataset it seems clear that the habitatbased model is the best of the options. It explains the most variability in the data with the fewest variables. The HSI model looks simpler but the HSI is calculated from 9 other variables. As an additional exercise you might look at a regression model containing all the habitat variables. You will get a Dsquared value of 17% and the model is not significantly "better" than the 4variable model or the HSI one. As a final note: the regresion models here are based on newt presence or absence. Effectively a model "success" is the presence of a newt. It might be better to think of the model as a better predictor of the absence of newts, rather than their presence. A pond may be perfectly suitable for newts but they are not there simply because they have not found it. This historical component probably explains the low Dsquared values. 

Top  
My Publications  My Publications See my personal pages at GardenersOwn 

Follow me... 


See also: 
KeywordsHere is a list of keywords: it is by no means complete! Ttest, Utest, KruskalWallis, Analysis of Variance, Spearman Rank, Correlation, Regression, Logistic Regression, Curved linear regression, histogram, scatter plot, bar chart, boxwhisker plot, pie chart, Mean, Median, Mode, Standard Deviation, Standard Error, Range, Max, Min, Interquartile Range, IQR 

Top  DataAnalytics Home  Contact  GardenersOwn Homepage 