Dr. Mark Gardener 


Writer's Bloc
On this page you can find out about my latest writing project. I'll post updates on progress, tables of contents and also some of the R scripts (and possibly Excel spreadsheets) I am developing in support of the new book. I'll try to keep the material reasonably up to date. The Writer's Bloc homepage contains a table of contents and an index of the pages that contain custom R commands and R scripts.
I am working on a new edition of my book Statistics for Ecologists Using R and Excel. I am currently revising the chapter on tests for linking several factors (regression). These notes are about how to use the results of a regression model to predict the value of the response variable when you supply certain values of the predictor.
Predicting values from a regression model
Introduction
When you carry out a regression you are looking to describe the data in terms of the variables that form the relationship. The regression model is a mathematical description of that relationship. You can use the model to make predicted values: you supply "new" values of the predictor (that is, ones not observed in the original dataset) and the model predicts the response variable. Predicted values are especially important in logistic regression, where your response is binary, that is, it has only two possibilities. The result you get when you "predict" response values in a logistic regression is a probability: the likelihood of getting a "positive" result when the predictor variable is set to a particular value. These notes show you how to use the predict() command in R to produce predicted values for regression models, particularly logistic regression.

Use predict() to compute predicted values from a regression model. 
The predict() command
The predict() command is used to compute predicted values from a regression model. The general form of the command is:
predict(model, newdata, type)
The model is simply the result of a regression model (for example from lm() or glm()). You need to specify type = "response" so that your prediction is on the scale of the response variable. This is not necessarily the default (for a glm() model the default is type = "link", which gives values on the scale of the linear predictor), so you should specify it. The newdata parameter is where you specify the values of the predictor(s) that you want to use to predict the response variable. You can make a data.frame object containing the variables (with variable names to match the original dataset), or you can specify it "on the fly". You'll see examples shortly.
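The difference between the two type settings can be sketched with a small simulated dataset. The data and the names toy, toy.glm and nd are made up for illustration and are not part of the book's example files; this is only a minimal sketch of the idea.

```r
# Simulated data for illustration only (not one of the book's datasets)
set.seed(1)
x <- seq(0, 100, length.out = 50)
y <- rbinom(50, size = 1, prob = plogis(-3 + 0.06 * x))
toy <- data.frame(x = x, y = y)

# Logistic regression on the simulated data
toy.glm <- glm(y ~ x, data = toy, family = binomial)

# New predictor values; the column name must match the model (x)
nd <- data.frame(x = c(25, 50, 75))

# Default for a glm is type = "link": values on the log-odds scale
p.link <- predict(toy.glm, newdata = nd)

# type = "response": values on the probability scale
p.resp <- predict(toy.glm, newdata = nd, type = "response")

# The two are related through the inverse logit, plogis()
all.equal(plogis(p.link), p.resp)
```

If you forget type = "response" with a logistic model you get log-odds, which can be negative or larger than 1, so it is usually obvious when the setting has been left out.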

In logistic regression response data are binary but may be in different forms. 
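Prediction in logistic regression
Logistic regression is carried out in cases where your response variable can take one of only two forms (i.e. it is binary). There are two general forms your response variable can take:
- Presence/absence, where each observation is recorded as 1 or 0.
- A success/failure matrix, where the response is split into two columns of frequencies.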
The way your data are arranged does not make much difference to how you carry out the predictions but because of the different forms it is useful to see an example of each. 

Use predict() when the response is 0 or 1 (binary). The result of prediction is the probability of "success", which is getting a 1. Make a new data.frame to hold the values of the predictor you want to use in the prediction. 
Logistic regression and presence/absence
When you have the response variable as presence/absence (1 or 0) it is easy to see that your response is in binary form. The logistic regression converts the 1s and 0s to a probability (under the various levels of your predictor variables), so your result is in that form. So, when you use the predict() command you get the probability of a success (a presence: 1).
The example file (Predict regressions.RData) contains a dataset that looks at the presence or absence of an amphibian (great crested newt) in ponds. The data are called gcn. Various habitat factors were measured. One factor is the percentage cover of macrophytes. You can use the logistic regression to explore the relationship between the presence (or absence) of newts and the cover of macrophytes. Once you have the regression model you can use predict() to predict the likelihood of finding a newt given any value for the cover of macrophytes.
Make a regression model
Make a regression model result:
> gcn.glm = glm(presence ~ macro, data = gcn, family = binomial)
The Coefficients table of the model summary shows that the macro variable is statistically significant. You can present a graph of the model (see the book text for the code).
Predict outcomes
Now you could look at the line on the plot (made using fitted values) and estimate the probability of finding a newt (presence = 1) for any percentage cover of macrophyte. Alternatively you can use the predict() command.
> nd = data.frame(macro = c(40, 50, 60))
> predict(gcn.glm, newdata = nd, type = "response")
See that you first make a data.frame with a variable called macro, to match the original data. This new data is used to predict the response. However, the value you get is not exactly a response: it is the likelihood of "success", that is, finding a newt. You could have set the original response variable (presence) to have been 1 for absence and 0 for presence. Apart from being a little odd, you'd be assuming that a "success" was the absence of the newts.
In any event your prediction is a probability of success. You might have chosen to create the prediction newdata data.frame directly in the predict() command, which is what you'll see in the following example. 
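The point about which outcome counts as the "success" can be illustrated with a short sketch. The data here are simulated and the names (toy, cover, present, absent) are hypothetical stand-ins, not the gcn data from the example file.

```r
# Simulated presence/absence data for illustration only (not the gcn dataset)
set.seed(42)
cover <- runif(60, min = 0, max = 100)
present <- rbinom(60, size = 1, prob = plogis(-2.5 + 0.05 * cover))
toy <- data.frame(cover = cover, present = present, absent = 1 - present)

# Model 1: "success" is presence (coded 1)
pres.glm <- glm(present ~ cover, data = toy, family = binomial)

# Model 2: the coding is reversed, so "success" is absence
abs.glm <- glm(absent ~ cover, data = toy, family = binomial)

# Predict for the same new values of the predictor
nd <- data.frame(cover = c(40, 50, 60))
p.present <- predict(pres.glm, newdata = nd, type = "response")
p.absent <- predict(abs.glm, newdata = nd, type = "response")

# The two sets of probabilities are complementary
p.present + p.absent
```

Whichever way you code the response, the prediction is the probability of whatever you coded as 1, so it pays to check the coding before interpreting the result.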

Use predict() when the response variable is split into a success/failure matrix. The first column of the success/failure matrix is the "success"; the second is the "failure". The success/failure matrix forms the binary response variable. 
Logistic regression and success/failure matrix
Your data may be in a subtly different form to the 0 or 1 model. In the following example you have a predictor variable, latitude, and a response, but the response is split into two frequencies:
> cbh
Here you have two versions of an allele in the Californian beach hopper. The data are in the file (Predict regressions.RData) and are called cbh. For each sampling latitude a number of animals were collected. They can have one form of the allele or the other. So your response variable is split into the frequency of each. If you take those two variables you have a success/failure matrix. The logistic regression can construct the likelihood from the success/failure matrix. It does not matter which allele you choose to be the "success" but here I've used the Mpi100 allele (because the proportion increases with latitude). The proportion of the "success" is shown in the final column. It won't be used in the regression except to draw a plot.
Make a regression model
Start by making a regression model. You need to make a success/failure matrix but this can be done as part of the glm() command formula.
> cbh.glm = glm(cbind(Mpi100, Mpi90) ~ latitude, data = cbh, family = binomial)
The Coefficients table of the model summary shows that latitude is a significant variable. You can draw this as a plot (because of the replicates we can add error bars).
Predict outcomes
As before you could use the plot to predict the likelihood of "success", the Mpi100 allele. You can also use the predict() command:
> predict(cbh.glm, newdata = data.frame(latitude = c(35, 40, 45)), type = "response")
Note that this time the newdata parameter used a data.frame created on the fly.
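A minimal sketch of the success/failure approach follows, using simulated counts rather than the cbh data. The names toy, lat, success and failure are invented for the illustration, as is the sample size of 40 animals per site.

```r
# Simulated allele counts for illustration only (not the cbh dataset)
set.seed(7)
toy <- data.frame(lat = c(34, 36, 38, 40, 42, 44, 46, 48))
n <- 40  # assumed number of animals sampled at each latitude
toy$success <- rbinom(nrow(toy), size = n, prob = plogis(-12 + 0.3 * toy$lat))
toy$failure <- n - toy$success

# cbind() builds the success/failure matrix inside the formula
toy.glm <- glm(cbind(success, failure) ~ lat, data = toy, family = binomial)

# Predict the probability of "success" with newdata made on the fly
p <- predict(toy.glm, newdata = data.frame(lat = c(35, 40, 45)),
             type = "response")
p

# The same values worked out by hand from the coefficients
cf <- coef(toy.glm)
p.manual <- plogis(cf[1] + cf[2] * c(35, 40, 45))
```

The hand calculation shows what predict() is doing: it evaluates the linear predictor at the new latitudes and back-transforms the result to a probability.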

With multiple predictor variables you make a new data.frame containing all the predictor variables. Variables have to have the same name as used in the regression model but can be in any order. If there are additional variables they are ignored. 
Prediction with multiple predictors
When you have multiple predictor variables you have to specify them all in the newdata parameter of the predict() command. It is easiest to make a new data.frame object to hold these "new" data unless you only have one or two values. In the following data example there are four predictors and one response (Length).
> names(mf)
The data give the lengths of a freshwater invertebrate (a mayfly species) and some habitat conditions at the sampling locations where each observation took place. The data are in the file (Predict regressions.RData) and are called mf. Your new data.frame has to have the same variable names but they do not have to be in any particular order. You only have to include variables that are in the regression model.
Make the regression model
Make the model:
> mf.lm = lm(Length ~ BOD + Algae, data = mf)
The Coefficients table of the model summary shows the coefficients (one variable is not significant but we'll leave it in).
Predict the result
Because there are two predictors it is not practical to draw a plot (apart from diagnostics, see these notes). So the predict() command is the way to go.
> predict(mf.lm, newdata = data.frame(BOD = 20, Algae = 10), type = "response")
You can of course make a separate data.frame containing BOD and Algae columns and populate it with values for which you want to get a prediction.
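The points about variable names, column order and extra variables can be checked with a small sketch. The data are simulated and the names (toy, x1, x2, x3, resp) are hypothetical; they stand in for the mf data, which are not reproduced here.

```r
# Simulated data for illustration only (not the mf dataset)
set.seed(3)
toy <- data.frame(x1 = runif(30, 0, 10),
                  x2 = runif(30, 0, 5),
                  x3 = runif(30, 0, 1))
toy$resp <- 20 - 0.8 * toy$x1 + 1.5 * toy$x2 + rnorm(30)

# The model uses only two of the three available predictors
toy.lm <- lm(resp ~ x1 + x2, data = toy)

# New data with columns in the same order as the model formula
nd1 <- data.frame(x1 = c(2, 5), x2 = c(1, 3))

# Same values, columns in a different order, plus an unused variable (x3)
nd2 <- data.frame(x3 = c(0.5, 0.5), x2 = c(1, 3), x1 = c(2, 5))

p1 <- predict(toy.lm, newdata = nd1, type = "response")
p2 <- predict(toy.lm, newdata = nd2, type = "response")

# Column order and the extra column make no difference
all.equal(p1, p2)

# The same predictions worked out by hand from the coefficients
cf <- coef(toy.lm)
p.manual <- cf["(Intercept)"] + cf["x1"] * nd1$x1 + cf["x2"] * nd1$x2
```

For an lm() model type = "response" is already the default, but including it does no harm and keeps your commands consistent with the logistic examples.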

Example regression data: Predict regressions.RData 
Data examples
The data examples used here are available as part of the support data for the book. I've also added a file Predict regressions.RData, which contains just the three datasets mentioned on this page. 
