Dr. Mark Gardener

 
About

Statistics for Ecologists Edition 2 Cover
Available soon from
Pelagic Publishing


Statistics for Ecologists Using R and Excel (Edition 2)

Data Collection, Exploration, Analysis and Presentation

by: Mark Gardener

Available soon from Pelagic Publishing

Welcome to the support pages for Statistics for Ecologists. These pages provide information and support material for the book. You should be able to find an outline and table of contents as well as support datafiles and additional material.

Support Index | Exercises Index | Outline & TOC | Data files


 

Exercise 11.3.2

Pelagic Publishing Logo

Table of Contents


Section 11.3.2

Multiple logistic regression:
Building a model

Example datafile:
newt regression.RData

Top

11.3.2 Model building in logisitc regression

This exercise is concerned with how to build a logistic regression model (Section 11.3.2). The processes are very similar to those described in Section 11.1.2, where you saw a stepwise building process used with regular multiple linear regression. The data for this exercise are available as a separate file newt regression.RData. The file will open in R and create a data object, gcn.

Introduction

When you have several (or indeed many) predictor variables you want to find a regression model that best describes the relationship between the variables. You should not incorporate every variable that you've got. Eventually you'll explain all the variability in your response variable simply because you've got so many explanatory predictors.

The process of model-building allows you to select the "best" variable to add to your current regression model. In the book you see how to carry out stepwise model building using a regular multiple regression (Section 11.1.2). In this exercise you can have a go at building a logistic regression model. The process is much the same as described in Section 11.1.2.


Example data:

newt regression.RData

Find the most important habitat factors for presence of great crested newts.

Compare regression models.

Top

The example data

The data you'll use comes from a survey of an amphibian (great creted newt) in southern England. You can get the data separately as newt regression.RData, which will load the gcn object to your console.

> head(gcn)
presence area dry water shade bird fish other.ponds land macro HSI
1 0 100 1 2 10 2 1 2 1 30 0.53
2 0 40 1 2 90 1 3 1 1 0 0.47
3 0 150 1 3 20 2 3 4 1 80 0.69
4 0 25 1 2 90 1 3 4 1 0 0.51
5 1 800 1 2 90 1 4 5 3 0 0.70
6 0 275 1 3 5 2 1 2 4 50 0.72

The first column, presence, gives the presence or absence of newts as a binary variable (0 = absence, 1= presence). The researcher used standard survey methods to detect the presence (or otherwise) of newts at 200 ponds. The other variables are habitat factors:

  • area – the pond area in square metres.
  • dry – pond seasonality (1-4 with 1 being non-seasonal and 4 being most seasonal).
  • water – a subjective measure of water quality (1-4, 1 = bad).
  • shade – the shadiness of the pond as a %.
  • bird – presence of waterfowl (1-3, 1 = absent).
  • fish – presence of fish (1-4, 4 = absent).
  • other.ponds – number of other ponds within 1km.
  • land – terrestrial habitat quality (1-4, 4 = good).
  • macro – cover of macrophytes as a %.
  • HSI – habitat suitability index (0-1). This is a standard measure compiled from other habitat measures.

In this kind of survey the various habitat factors are converted to an index, the indices are combined to make a final HSI (habitat suitability index). The HSI is used to make it easier to assess waterbodies for their potential to support populations of the great crested newt and to give a measure of reproducibility to surveys.

The example datafile is a "real" survey conducted on 200 ponds in Buckinghamshire in the UK. I've cut down the variables a little (the original data had the indices for each habitat variable) but essentially the dataset is unmodified.


Use logisitic regression to:

Find best factors
Compare models

Top

The research task

The research task is twofold:

  • To find the most important (and significant) habitat factors affecting the presence or absence of the great crested newt.
  • To compare the "predictive power" of a habitat factor-based model against one using the HSI alone.

The fisrt task will involve building a logistic regression model, which will determine that will show the habitat factors that are "most influential".

The second tast will be to make a simple regression model using the HSI variable as a single predictor. The multipl-factor model and the HSI model can then be compared to see if there is any substantial difference in deviance explained.


Use AIC values to determine the single best term o add to a regression model.

Use add1() command to get AIC values in model building.

Top

Building a multiple regression model

To build the model you'll need to use the glm() command, which carries out the logisitic regression (when you set family = binomial). You'll also need to use the add1() command. The command takes an existing regression model and adds all the variables you specify, one at a time. The results are then displayed.

You then use the AIC values (a measure of how bad the model fit is) to select the next best variable to add to your model. You'll want to pick the variable with the lowest AIC value. With many potential variables you can of course end up with a comprehensive list. It is therefore helpful to see the significant variables. You can tell the add1() command to carry out a significance test too.

  • For linear models (family = gaussian) you use test = "F".
  • For logistic models (family = binomial) you use test = "Chisq".

Start with a blank model, with an intercept only.

Top

Start with a blank model

The starting point for your regression model should be one with an intercept only and no terms except the response. In our example the response is the presence variable.

> gcn.glm = glm(presence ~ 1, data = gcn, family = binomial)

You could use the summary() command to look at the result but it would not be very informative at this stage as you have not incorporated any variables (model terms).


Use add1() to see which single variable is "best" to add to your regression model.

Select a significant variable with the lowest AIC value.

Replace the intercept in your blank model with the variable with the lowest AIC.

Top

Add the first term

Now you have a blank model (intercept only) you will need to look at the effect of adding each variable in turn. The variable that has the lowest AIC value will be the one you want to incorporate (replace the intercept with the variable).

The add1() command will "interrogate" the regression model and return a list of variables (from a source you specify, usually the original dataset). To make the selection process easier you can carry out a significance test at the same time. You'll want to include only variables that are significant.

> add1(gcn.glm, scope = gcn, test = "Chisq")
       Single term additions
Model:
presence ~ 1
            Df Deviance    AIC    LRT  Pr(>Chi) 
   <none>           261.37 263.37 
area         1   258.88 262.88  2.492 0.1144364 
dry          1   261.31 265.31  0.054 0.8159993 
water        1   255.07 259.07  6.295 0.0121104 * 
shade        1   248.52 252.52 12.844 0.0003386 ***
bird         1   261.03 265.03  0.338 0.5611190 
fish         1   257.16 261.16  4.207 0.0402523 * 
other.ponds  1   252.05 256.05  9.317 0.0022700 ** 
land         1   260.80 264.80  0.568 0.4508771 
macro        1   246.55 250.55 14.818 0.0001184 ***
HSI          1   228.67 232.67 32.701 1.075e-08 ***    

Note that the HSI variable has the lowest AIC value. Ignore that for the moment (we'll return to it later). For now just focus on the habitat variables.

The macro variable has the lowest of the other AIC values (250.55) and is significant too. So you can use the up arrow on the R console to recall the earlier command and editi it to change the regression model. Replace the intercept with the macro variable.

> gcn.glm = glm(presence ~ macro, data = gcn, family = binomial)

Now you've got a model with a single term.


Use add1() to add subsequent terms to the regression model.

Stop if new terms are not significant.

Top

Add subsequent terms

To add more terms you:

  1. Run add1() on your existing model.
  2. Choose the variable with the lowest AIC.
  3. If the variable is significant then use it. If the variable is not significant then stop.
  4. Edit the regression formula to include the new variable.
  5. Repeat the steps until no significant variables remain.

If you follow the process you'll end up with a model like so:

> gcn.glm = glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial)

The water variable shows up as being significant if you runn add1() on this model but we'll stop because the p-value is only 0.044. When building a multiple variable model it is not a bad thing to set a more rigorous cut-off value.

The final model can be summarised:

> summary(gcn.glm)
Call:
glm(formula = presence ~ macro + other.ponds + fish + shade, 
    family = binomial, data = gcn)
Deviance Residuals: 
    Min       1Q   Median       3Q      Max 
-1.8022  -0.9099  -0.5447   0.9757   2.0516 
Coefficients:
             Estimate Std. Error z value Pr(>|z|) 
(Intercept) -2.909742   0.704410  -4.131 3.62e-05 ***
macro        0.018963   0.006368   2.978  0.00290 ** 
other.ponds  0.241031   0.078685   3.063  0.00219 ** 
fish         0.494789   0.172173   2.874  0.00406 ** 
shade       -0.014619   0.005284  -2.767  0.00566 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 261.37  on 199  degrees of freedom
Residual deviance: 222.57  on 195  degrees of freedom
AIC: 232.57
Number of Fisher Scoring iterations: 4

Use the deviance results to get a statistic equivalent to R-squared.

Top

Explained deviance

The logistic regression does not return an R-squared value. What you see is Null deviance and Residual deviance. You can use these values to get a value that is equivalent to the R-squared, let's call it the D-squared value:

1-(model$deviance/model$null.deviance)

Where model is your regression model result.

> 1 - (gcn.glm$deviance / gcn.glm$null.deviance)
[1] 0.1484474

You can see the result is about 0.15, which is not really great when you consider that there are 4 explanatory variables in the model!


Custom function to calculate D-squared.

Top

D-squared calculator

You can make a simple custom command to calculate the D-squared value. Type the following into your console to set-up a custom command called d2:

d2 = function(model, digits = 4) { round(1-(model$deviance/model$null.deviance),digits=digits) }

Now you can use d2(model) to carry out the calculations simply and easily. Just give the command the name of your regression model (you can also use the digits parameter to view different numbers of decimal places).


Compare models using anova() or residual deviance (D-squared).

Top

Compare models

You can compare models in two main ways:

  • Are two models significanty different? Use the anova() command.
  • Compare the D-squared values. Is the explanatory power of one model better than another?

First of all let's make some alternative models to compare:

> mod1 = glm(presence ~ macro, data = gcn, family = binomial)
> mod2 = glm(presence ~ macro + other.ponds + fish + shade, data = gcn, family = binomial)
> mod3 = glm(presence ~ HSI, data = gcn, family = binomial)

The first model uses the single predictor macro (cover of macrophytes). This was the variable identified as the "best" of the habitat-based variables.

The second model is the multiple variable model using the 4 best habitat-based variables.

The third model uses the HSI variable alone. Recall that this is a composite index that uses information from several habitat factors in its construction.


Use anova() to compare model deviance and examine differences between regresion models.

Top

Use anova()

The anova() command can compare models. The name of the command is slightly misleading as it is not really analysis of variance. Rather it compares deviance of models that are based on the same set of variables. You can get a "test" of signifcance, with the logistic regression you need test = "Chisq" as you did with the add1() command.

You can only really compare models that are based on the same subset of variables. You cannot really get meaningful results by comparing models that each contain only a single predictor variable. If you look at the logistic regression models for the newts:

> anova(mod1, mod2, test = "Chisq")
Analysis of Deviance Table
Model 1: presence ~ macro
Model 2: presence ~ macro + other.ponds + fish + shade
  Resid. Df Resid. Dev Df Deviance Pr(>Chi) 
1       198     246.55 
2       195     222.57  3   23.981 2.52e-05 ***    

You can see that there is a significant difference between the two models. We would expect this of course, since we've been adding significant variables to the original model. Now look at the habitat-based model and the overall index-based one:

> anova(mod2, mod3, test = "Chisq")
Analysis of Deviance Table
Model 1: presence ~ macro + other.ponds + fish + shade
Model 2: presence ~ HSI
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       195     222.57 
2       198     228.67 -3  -6.0985   0.1069

This time there is no significant difference. Neither model is "better" than the other. This is interesting because the first model is based on only 4 habitat variables but the HSI index is formed from 9 different habitat variables, each of which is itself converted to an index.

You cannot really use anova() to compare models with single variables:

> anova(mod1, mod3, test = "Chisq")
Analysis of Deviance Table
Model 1: presence ~ macro
Model 2: presence ~ HSI
  Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1       198     246.55 
2       198     228.67  0   17.883

You can get a "result" but there is no significance value as the models are not directly comparable.


Use D-squared values to compare explanatory "power" of different models.

Top

Use D-squared values

You can explore differences between models using a different approach, their explanatory power. You can calculate a D-squared value (equivalent to the R-squared value of a regular regression). This approach is not strictly a statistical one but gives you an idea of "how much" better one model is than another.

Look at the D-squared values of each of the newt models:

> d2(mod1) ; d2(mod2) ; d2(mod3)
[1] 0.0567
[1] 0.1484
[1] 0.1251

The first model (mod1, macropytes only) shows a D-squared of a bit over 5%. The second habitat-based model (mod2) shows a value of nearly 15%, which is not great but is nearly 3 times better than mod1.

The index-based model was not statistically different from the habitat-based model. The D-squared value of mod3 is 12.5%, which is a little smaller than the 15% of mod2.


The best regression model does not always contain the most variables!

Top

Summary

The more variables you add to a regression model the greater the "explanatory power", in other words the D-squared value will keep increasing. However, it is rarely smart to keep adding variables just because you can measure them. The point of regression is to help you explain variability in the data. If variables are not significant they will only add a small fraction to the overall D-squared.

The best regression models explain as much of the variability as possible with the fewest variables. In this dataset it seems clear that the habitat-based model is the best of the options. It explains the most variability in the data with the fewest variables. The HSI model looks simpler but the HSI is calculated from 9 other variables.

As an additional exercise you might look at a regression model containing all the habitat variables. You will get a D-squared value of 17% and the model is not significantly "better" than the 4-variable model or the HSI one.

As a final note: the regresion models here are based on newt presence or absence. Effectively a model "success" is the presence of a newt. It might be better to think of the model as a better predictor of the absence of newts, rather than their presence. A pond may be perfectly suitable for newts but they are not there simply because they have not found it. This historical component probably explains the low D-squared values.


Top
Support Index | Exercises Index | Outline & TOC | Data files
 
My Publications

My Publications

Managing Data Using Excel, Cover

See my personal pages at GardenersOwn

Follow me...
Facebook Twitter Google+ Linkedin Amazon

See also:

Writer's Bloc
MonogRaphs
Tips & Tricks

Keywords

Here is a list of keywords: it is by no means complete!

T-test, U-test, Kruskal-Wallis, Analysis of Variance, Spearman Rank, Correlation, Regression, Logistic Regression, Curved linear regression, histogram, scatter plot, bar chart, box-whisker plot, pie chart, Mean, Median, Mode, Standard Deviation, Standard Error, Range, Max, Min, Inter-quartile Range, IQR

Top DataAnalytics Home
Publications
Contact GardenersOwn Homepage