Dr. Mark Gardener 



Statistics for Ecologists Using R and Excel (Edition 2)Data Collection, Exploration, Analysis and Presentationby: Mark GardenerAvailable soon from Pelagic Publishing Welcome to the support pages for Statistics for Ecologists. These pages provide information and support material for the book. You should be able to find an outline and table of contents as well as support datafiles and additional material. Support Index  Exercises Index  Outline & TOC  Data files 

Exercise 11.1 

Table of Contents 

Section 11.1 Multiple regression disgnostics: Example datafile: 
11.1 Graphing multiple regressionThis exercise is concerned with how to produce and chart some disgnostics for regression models (Sections 11.1 and 11.2). The dataset used to illustrate these notes is available as a file: regression diagnostic.csv. IntroductionWhen you only have two variables (a predictor and a single response) you can use a regular scatter plot to show the relationship. Even if the relationship is logarithmic or polynomial you can represent the situation, as long as there is only one predictor variable. When you have two or more predictor variables it becomes hard to represent the situation graphically. You can try a 3D plot but they are rarely successful. Generally you'll stick to plotting the "most important" predictor variable and display the model as a standard regression table. However, it can be helpful show some diagnostics from your regression model as graphics. These notes show you how you can produce some simple regression diagnostics and present them graphically. The data used for these notes can be downloaded as a CSV file: regression diagnostic.csv. The data are the same as shown in the book (mayfly regression) but reproduced here to make it easier to follow the notes. 

Use the regression equation with observed predictor variable values to produce fitted yvalues. Fitted values are "expected" values of the response variable calculated from the coefficients of the regression and observed values of predictor variables. 
Fitted valuesWhen you make a regression model you are trying to find the "best" mathematical fit between the variables. In a simple form the relationship will be:
In this case there is a single predictor variable, x. The m is the slope and c is the intercept. Both m and c are known as coefficients. You might have a logarithmic or polynomial relationship but essentially the form of the relationship is the same:
In multiple regression you have more than one predictor and each predictor has a coefficient (like a slope), but the general form is the same:
Where a and b are coefficients, x and z are predictor variables and c is an intercept. You can add more and more variables:
Note that each predictor variable has a unique coefficient. Once you have your "equation" you can use the observed values of the predictors and of the coefficients in your equation to produce new yvalues. These values are "fitted" to the regression model. Think of them as being idealised values or "expected" values. Note that you should not call these fitted values "predicted" values because they are based on your observed predictor variable values. If you insert predictor values that you did not observe into the equation then you can call the resulting yvalues predicted yvalues. The values you insert to the equation could lie within the maxmin range of observed predictor (interpolation) or be outside (extrapolation). In general it is "safer" to go with interpolation than with extrapolation. 

Fitted values in R: fitted(model) The fitted() command produces fitted values as a simple numeric vector. 
Calculate Fitted values using RFitted values are easy to compute in R. You can get them from the result of a lm() command in two ways:
In both cases model is the result of a lm() command. > names(mf) > mf.lm = lm(Length ~ BOD + Algae, data = mf) Then you can get the fitted values: > fitted(mf.lm) Once you have the values you can use them like any other numeric variable (the vector has a names attribute). 

Use LINEST to get regression coefficients. Coefficients are listed backwards with the inercept being last. 
Calculate Fitted values using ExcelTo get fitted values in Excel you'll need to calculate the coefficients and plug the values into the spreadsheet to generate them:
Once you have the coefficients it is fairly easy to create the fitted values but you will need to "fix" the references to the coefficients using $ in the cell reference(s). Then you can copy/paste the formula down the column to fill in the fitted values for all the rows of data. 

Use Analysis ToolPak to carry out regression and compute fitted values as well as residuals. Data > Data Analysis button Select Regression option and tick Residuals to get fitted and residuals. 
Use the Analysis ToolPakIf you have the Analysis ToolPak addin you can use this to generate the regression statistics. The Regression routine will produce the coefficients as well as the fitted values.
The results for the residuals will appear after the main regression results. You'll also see another column containing the fitted values. Unfortunately this is labelled Predicted. 

Residuals are the difference between observed yvalues and idealised fitted values based on your regression model. 
ResidualsResidual values are the difference between the fitted values and the actually observed values for your response variable. Think of the fitted values as being the "ideal" or "expected" values based on your regression equation. The residual values are (usually) not ideal and differ from the "perfect fit". So, the residuals are a way to see how bad the fit is between the ideal values based on your regression model and the real situation. In general you want to see that the residuals are normally distributed, at least this is what you want when doing regular linear regression (and curvilinear). In diagnostic terms it is the normal distribution of the residuals that is the really important thing, not the distribution of the original data (although usually you do not get the former without the latter). 

Residual values in R: resid(model) The resid() command produces residual values as a simple numeric vector. 
Calculate residuals using RResidual values are easy to compute using R; you get them from the result of a lm() command:
Simply make your regression model result then away you go: > mf.lm = lm(Length ~ BOD + Algae, data = mf) > fitted(mf.lm) The result of resid() is a numerical vector (with a names attribute), which you can use in various ways (as you will see later). 

Residuals are Observed  Fitted values. A simple calculation in Excel if you have the fitted values. 
Calculate residuals using ExcelResiduals are simply the difference between the observed and expected values of your response variable. You can use the regression equation for your model to generate the fitted values. Once you have those you simply subtract the fitted value from the observed:
Note that you always carry out the calculation this way around by convention. 

Use Analysis ToolPak to compute residual values. 
Use the Analysis ToolPakThe Analysis ToolPak can compute the residuals for you easily. You simply tick the box labelled Residuals in the Regression dialogue box. The process is exactly the same as when calculating the fitted values. The residuals are placed alongside the fitted values (which are labelled Predicted), and are labelled Residuals. 

Standardized residuals are rescaled residuals. The rescale is dependent on the variance of individual residuals and takes into account leverage (a measure of influence). Use rstandard() in R Excel cannot easily compute correct standardized residuals. An approximation is: resid รท std.dev(resid) The Analysis ToolPak only computes an approximation. 
Standardized residualsStandardized residuals are residuals that have been rescaled. The way they are scaled is such that each residual value has unit standard deviaiton. The standardization process takes into account how much "influence" a datum might have on the regression (something called leverage). It is generally preferrable to use the standardized residuals. Standardized resisuals in RYou can compute the standardized residuals using the rstandard() command: > rstandard(mf.lm) The result is a numeric vector (with a names attribute). Standardized residuals in ExcelYou cannot calculate the standardized residuals in Excel without some tedious computations! The Analysis ToolPak claims to calculate standardized residuals but what you end up with is the residuals divided by the standard deviation of the residuals. This is only an approximation to the "proper" standardization. 

Diagnostic plots: Residuals vs. Fitted 
Diagnostic plotsThere are several type of plot you might use as diagnostic tools. The ones I'll show here are:
You can make these plots easily using R. In Excel you can make some worthwhile attempts but you can only use approximately standardized residuals. 

Use plot() with result of regression model for diagnostic plots. For specific residual vs. fitted plot use: plot(model, which = 1) Use lowess() with a regular plot() to view the fit (or lack of) between variables. 
Residuals vs. Fitted valuesYou can calculate the fitted and residuals easily in R or Excel. A plot of the residuals against the fitted values is a simple way to produce a useful diagnostic tool. Plot using RThe simplest way to produce a residual plot is to use the plot() command on a regression model result (which actually runs the plot.lm() command). The plot.lm() command produces up to six diagnostic plots, which you can choose using the which = parameter. The plot of residuals vs. fitted values is obtained via:
This produces a scatter plot of the regular residuals against the fitted values. The command also produces a locally weighted polynomial smooth fit line and identifies and data points that might have undue influence on the model fit.
Ideally you would like not to see any kind of pattern. In the example there are some points identified as possibly having undue influence but the lowess line runs aimlessly along the middle. You can produce a similar plot using standardized residuals and regular plot commands:


Use a regular scatter plot to show residuals vs. fitted values. Add a trendline using a Moving Average to get an impression of the fit (or lack of). Values need to be sorted in ascending numerical order (by fitted value) for the moving average to work. 
Plot using ExcelIt is fairly easy to plot the regular residuals against the fitted values using Excel. Once you have calculated the values you can simply make a scatter plot. However, it is not so easy to add a locally weighted polynomial. There are addins you might try but a simple workaround is to use a moving average trendline. To incorporate this you need to sort the fitted values in ascending numerical order.
Make a scatter plot by highlighting the fitted and residual values data (and the column headings). You want to ensure that the fitted values are in the first column (Excel will expect this). Add a moving average trendline.
You want to select the Moving Average option. The default is for a twopoint moving average. This is okay for some cases and not for others so format the trendline and alter the options until you get something that looks sensible. In the following chart a 4point moving average was used.
The moving average trendline is not a perfect solution but it will give you an idea. 

Use plot(model, which = 2) to make a QQ plot of the standardized residuals of a regression model. 
Distribution plot of residualsIdeally you want the residuals (standardized) to be normally distributed. In R you can calculate the standardized residuals and plot a histogram or QQ plot to show the distribution. In Excel you can only approximate the standardized residuals, which you can plot as a histogram or a QQ plot (with a bit of work). Plot using RThe simplest solution is to use plot() on the result of a regression model. If you use the parameter, which = 2, you'll get a QQ plot of the standardized residuals.
You can see that the plot() command has highlighted those data points that have some influence over the model fit. You can produce a similar plot using regular commands:
This would not identify any "odd points". 

Make a QQ plot of residuals using Excel. You need to rank values (no ties) then determine cumulative rank proportion. Use NORMSINV to get zscores. Plot sorted residuals against (sorted) zscores to make a QQ plot. Add a linear trendline to help assess normality. 
Plot using ExcelOnce you have the residuals in Excel (either by direct calculation or via the Analysis ToolPak) you are well on your way to making a normal distribution plot. If you want the standardized residuals you will have to put up with Excel's approximation. You can work out the approximate values using:
Or let the Analysis ToolPak calculate this for you. You can then make a histogram of the values. You'll have to select appropriate bins and use the FREQUENCY fumction to work out the frequencies. Alternatively you can let the Analysis ToolPak work out frequencies for you via the Histogram routine (which will make the chart too). It is possible to produce a QQ plot, which requires a few steps:
You can add a linear trendline and with a bit of formatting can end up with a half decent QQ plot.
The QQ plot is a bit more useful than a histogram and does not take a lot of extra work. However, it can be a bit tedious if you have many rows of data. 

Top  
My Publications  My Publications See my personal pages at GardenersOwn 

Follow me... 


See also: 
KeywordsHere is a list of keywords: it is by no means complete! Ttest, Utest, KruskalWallis, Analysis of Variance, Spearman Rank, Correlation, Regression, Logistic Regression, Curved linear regression, histogram, scatter plot, bar chart, boxwhisker plot, pie chart, Mean, Median, Mode, Standard Deviation, Standard Error, Range, Max, Min, Interquartile Range, IQR 

Top  DataAnalytics Home  Contact  GardenersOwn Homepage 