Dr. Mark Gardener

Available soon from
Pelagic Publishing

# Statistics for Ecologists Using R and Excel (Edition 2)

### by: Mark Gardener

Available soon from Pelagic Publishing

Welcome to the support pages for Statistics for Ecologists. These pages provide information and support material for the book. You should be able to find an outline and table of contents as well as support datafiles and additional material.

## Exercise 11.3.2

Section 11.3.2

Multiple logistic regression:
Building a model

Example datafile:
newt regression.RData

Top

### 11.3.2 Model building in logisitc regression

This exercise is concerned with how to build a logistic regression model (Section 11.3.2). The processes are very similar to those described in Section 11.1.2, where you saw a stepwise building process used with regular multiple linear regression. The data for this exercise are available as a separate file newt regression.RData. The file will open in R and create a data object, gcn.

#### Introduction

When you have several (or indeed many) predictor variables you want to find a regression model that best describes the relationship between the variables. You should not incorporate every variable that you've got. Eventually you'll explain all the variability in your response variable simply because you've got so many explanatory predictors.

The process of model-building allows you to select the "best" variable to add to your current regression model. In the book you see how to carry out stepwise model building using a regular multiple regression (Section 11.1.2). In this exercise you can have a go at building a logistic regression model. The process is much the same as described in Section 11.1.2.

Example data:

newt regression.RData

Find the most important habitat factors for presence of great crested newts.

Compare regression models.

Top

### The example data

The data you'll use comes from a survey of an amphibian (great creted newt) in southern England. You can get the data separately as newt regression.RData, which will load the gcn object to your console.

`> head(gcn)  presence area dry water shade bird fish other.ponds land macro  HSI1        0  100   1     2    10    2    1           2    1    30 0.532        0   40   1     2    90    1    3           1    1     0 0.473        0  150   1     3    20    2    3           4    1    80 0.694        0   25   1     2    90    1    3           4    1     0 0.515        1  800   1     2    90    1    4           5    3     0 0.706        0  275   1     3     5    2    1           2    4    50 0.72`

The first column, presence, gives the presence or absence of newts as a binary variable (0 = absence, 1= presence). The researcher used standard survey methods to detect the presence (or otherwise) of newts at 200 ponds. The other variables are habitat factors:

• area – the pond area in square metres.
• dry – pond seasonality (1-4 with 1 being non-seasonal and 4 being most seasonal).
• water – a subjective measure of water quality (1-4, 1 = bad).
• bird – presence of waterfowl (1-3, 1 = absent).
• fish – presence of fish (1-4, 4 = absent).
• other.ponds – number of other ponds within 1km.
• land – terrestrial habitat quality (1-4, 4 = good).
• macro – cover of macrophytes as a %.
• HSI – habitat suitability index (0-1). This is a standard measure compiled from other habitat measures.

In this kind of survey the various habitat factors are converted to an index, the indices are combined to make a final HSI (habitat suitability index). The HSI is used to make it easier to assess waterbodies for their potential to support populations of the great crested newt and to give a measure of reproducibility to surveys.

The example datafile is a "real" survey conducted on 200 ponds in Buckinghamshire in the UK. I've cut down the variables a little (the original data had the indices for each habitat variable) but essentially the dataset is unmodified.

Use logisitic regression to:

Find best factors
Compare models

Top

• To find the most important (and significant) habitat factors affecting the presence or absence of the great crested newt.
• To compare the "predictive power" of a habitat factor-based model against one using the HSI alone.

The fisrt task will involve building a logistic regression model, which will determine that will show the habitat factors that are "most influential".

The second tast will be to make a simple regression model using the HSI variable as a single predictor. The multipl-factor model and the HSI model can then be compared to see if there is any substantial difference in deviance explained.

Use AIC values to determine the single best term o add to a regression model.

Use add1() command to get AIC values in model building.

Top

### Building a multiple regression model

To build the model you'll need to use the glm() command, which carries out the logisitic regression (when you set family = binomial). You'll also need to use the add1() command. The command takes an existing regression model and adds all the variables you specify, one at a time. The results are then displayed.

You then use the AIC values (a measure of how bad the model fit is) to select the next best variable to add to your model. You'll want to pick the variable with the lowest AIC value. With many potential variables you can of course end up with a comprehensive list. It is therefore helpful to see the significant variables. You can tell the add1() command to carry out a significance test too.

• For linear models (family = gaussian) you use test = "F".
• For logistic models (family = binomial) you use test = "Chisq".

Top

The starting point for your regression model should be one with an intercept only and no terms except the response. In our example the response is the presence variable.

> gcn.glm = glm(presence ~ 1, data = gcn, family = binomial)

You could use the summary() command to look at the result but it would not be very informative at this stage as you have not incorporated any variables (model terms).

Select a significant variable with the lowest AIC value.

Replace the intercept in your blank model with the variable with the lowest AIC.

Top

Now you have a blank model (intercept only) you will need to look at the effect of adding each variable in turn. The variable that has the lowest AIC value will be the one you want to incorporate (replace the intercept with the variable).

The add1() command will "interrogate" the regression model and return a list of variables (from a source you specify, usually the original dataset). To make the selection process easier you can carry out a significance test at the same time. You'll want to include only variables that are significant.

```> add1(gcn.glm, scope = gcn, test = "Chisq")
```Model:
presence ~ 1
Df Deviance    AIC    LRT  Pr(>Chi)
<none>           261.37 263.37
area         1   258.88 262.88  2.492 0.1144364
dry          1   261.31 265.31  0.054 0.8159993
water        1   255.07 259.07  6.295 0.0121104 *
shade        1   248.52 252.52 12.844 0.0003386 ***
bird         1   261.03 265.03  0.338 0.5611190
fish         1   257.16 261.16  4.207 0.0402523 *
other.ponds  1   252.05 256.05  9.317 0.0022700 **
land         1   260.80 264.80  0.568 0.4508771
macro        1   246.55 250.55 14.818 0.0001184 ***
HSI          1   228.67 232.67 32.701 1.075e-08 ***    ```

Note that the HSI variable has the lowest AIC value. Ignore that for the moment (we'll return to it later). For now just focus on the habitat variables.

The macro variable has the lowest of the other AIC values (250.55) and is significant too. So you can use the up arrow on the R console to recall the earlier command and editi it to change the regression model. Replace the intercept with the macro variable.

`> gcn.glm = glm(presence ~ macro, data = gcn, family = binomial)`

Now you've got a model with a single term.

Stop if new terms are not significant.

Top

2. Choose the variable with the lowest AIC.
3. If the variable is significant then use it. If the variable is not significant then stop.
4. Edit the regression formula to include the new variable.
5. Repeat the steps until no significant variables remain.

If you follow the process you'll end up with a model like so:

> gcn.glm = glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial)

The water variable shows up as being significant if you runn add1() on this model but we'll stop because the p-value is only 0.044. When building a multiple variable model it is not a bad thing to set a more rigorous cut-off value.

The final model can be summarised:

`> summary(gcn.glm)`
```Call:
glm(formula = presence ~ macro + other.ponds + fish + shade,
family = binomial, data = gcn)```
```Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.8022  -0.9099  -0.5447   0.9757   2.0516 ```
```Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.909742   0.704410  -4.131 3.62e-05 ***
macro        0.018963   0.006368   2.978  0.00290 **
other.ponds  0.241031   0.078685   3.063  0.00219 **
fish         0.494789   0.172173   2.874  0.00406 **
shade       -0.014619   0.005284  -2.767  0.00566 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1```
`(Dispersion parameter for binomial family taken to be 1)`
```Null deviance: 261.37  on 199  degrees of freedom
Residual deviance: 222.57  on 195  degrees of freedom
AIC: 232.57```
`Number of Fisher Scoring iterations: 4`

Use the deviance results to get a statistic equivalent to R-squared.

Top

### Explained deviance

The logistic regression does not return an R-squared value. What you see is Null deviance and Residual deviance. You can use these values to get a value that is equivalent to the R-squared, let's call it the D-squared value:

1-(model\$deviance/model\$null.deviance)

Where model is your regression model result.

> 1 - (gcn.glm\$deviance / gcn.glm\$null.deviance)
[1] 0.1484474

You can see the result is about 0.15, which is not really great when you consider that there are 4 explanatory variables in the model!

Custom function to calculate D-squared.

Top

#### D-squared calculator

You can make a simple custom command to calculate the D-squared value. Type the following into your console to set-up a custom command called d2:

d2 = function(model, digits = 4) { round(1-(model\$deviance/model\$null.deviance),digits=digits) }

Now you can use d2(model) to carry out the calculations simply and easily. Just give the command the name of your regression model (you can also use the digits parameter to view different numbers of decimal places).

Compare models using anova() or residual deviance (D-squared).

Top

### Compare models

You can compare models in two main ways:

• Are two models significanty different? Use the anova() command.
• Compare the D-squared values. Is the explanatory power of one model better than another?

First of all let's make some alternative models to compare:

`> mod1 = glm(presence ~ macro, data = gcn, family = binomial)> mod2 = glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial)> mod3 = glm(presence ~ HSI, data = gcn, family = binomial)`

The first model uses the single predictor macro (cover of macrophytes). This was the variable identified as the "best" of the habitat-based variables.

The second model is the multiple variable model using the 4 best habitat-based variables.

The third model uses the HSI variable alone. Recall that this is a composite index that uses information from several habitat factors in its construction.

Use anova() to compare model deviance and examine differences between regresion models.

Top

#### Use anova()

The anova() command can compare models. The name of the command is slightly misleading as it is not really analysis of variance. Rather it compares deviance of models that are based on the same set of variables. You can get a "test" of signifcance, with the logistic regression you need test = "Chisq" as you did with the add1() command.

You can only really compare models that are based on the same subset of variables. You cannot really get meaningful results by comparing models that each contain only a single predictor variable. If you look at the logistic regression models for the newts:

```> anova(mod1, mod2, test = "Chisq")
Analysis of Deviance Table```
```Model 1: presence ~ macro
Model 2: presence ~ macro + other.ponds + fish + shade
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       198     246.55
2       195     222.57  3   23.981 2.52e-05 ***    ```

You can see that there is a significant difference between the two models. We would expect this of course, since we've been adding significant variables to the original model. Now look at the habitat-based model and the overall index-based one:

```> anova(mod2, mod3, test = "Chisq")
Analysis of Deviance Table```
```Model 1: presence ~ macro + other.ponds + fish + shade
Model 2: presence ~ HSI
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       195     222.57
2       198     228.67 -3  -6.0985   0.1069```

This time there is no significant difference. Neither model is "better" than the other. This is interesting because the first model is based on only 4 habitat variables but the HSI index is formed from 9 different habitat variables, each of which is itself converted to an index.

You cannot really use anova() to compare models with single variables:

```> anova(mod1, mod3, test = "Chisq")
Analysis of Deviance Table```
```Model 1: presence ~ macro
Model 2: presence ~ HSI
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       198     246.55
2       198     228.67  0   17.883```

You can get a "result" but there is no significance value as the models are not directly comparable.

Use D-squared values to compare explanatory "power" of different models.

Top

#### Use D-squared values

You can explore differences between models using a different approach, their explanatory power. You can calculate a D-squared value (equivalent to the R-squared value of a regular regression). This approach is not strictly a statistical one but gives you an idea of "how much" better one model is than another.

Look at the D-squared values of each of the newt models:

> d2(mod1) ; d2(mod2) ; d2(mod3)
[1] 0.0567
[1] 0.1484
[1] 0.1251

The first model (mod1, macropytes only) shows a D-squared of a bit over 5%. The second habitat-based model (mod2) shows a value of nearly 15%, which is not great but is nearly 3 times better than mod1.

The index-based model was not statistically different from the habitat-based model. The D-squared value of mod3 is 12.5%, which is a little smaller than the 15% of mod2.

The best regression model does not always contain the most variables!

Top

### Summary

The more variables you add to a regression model the greater the "explanatory power", in other words the D-squared value will keep increasing. However, it is rarely smart to keep adding variables just because you can measure them. The point of regression is to help you explain variability in the data. If variables are not significant they will only add a small fraction to the overall D-squared.

The best regression models explain as much of the variability as possible with the fewest variables. In this dataset it seems clear that the habitat-based model is the best of the options. It explains the most variability in the data with the fewest variables. The HSI model looks simpler but the HSI is calculated from 9 other variables.

As an additional exercise you might look at a regression model containing all the habitat variables. You will get a D-squared value of 17% and the model is not significantly "better" than the 4-variable model or the HSI one.

As a final note: the regresion models here are based on newt presence or absence. Effectively a model "success" is the presence of a newt. It might be better to think of the model as a better predictor of the absence of newts, rather than their presence. A pond may be perfectly suitable for newts but they are not there simply because they have not found it. This historical component probably explains the low D-squared values.

Top

My Publications

My Publications

See my personal pages at GardenersOwn