  Dr. Mark Gardener Available soon from
Pelagic Publishing

# Statistics for Ecologists Using R and Excel (Edition 2)

### by: Mark Gardener

Available soon from Pelagic Publishing

Welcome to the support pages for Statistics for Ecologists. These pages provide information and support material for the book. You should be able to find an outline and table of contents as well as support datafiles and additional material.

## Exercise 11.1 Section 11.1

Multiple regression disgnostics:
Residuals, fitted values and plots.

Example datafile:
regression diagnostic.csv

Top

### 11.1 Graphing multiple regression

This exercise is concerned with how to produce and chart some disgnostics for regression models (Sections 11.1 and 11.2). The dataset used to illustrate these notes is available as a file: regression diagnostic.csv.

#### Introduction

When you only have two variables (a predictor and a single response) you can use a regular scatter plot to show the relationship. Even if the relationship is logarithmic or polynomial you can represent the situation, as long as there is only one predictor variable.

When you have two or more predictor variables it becomes hard to represent the situation graphically. You can try a 3D plot but they are rarely successful. Generally you'll stick to plotting the "most important" predictor variable and display the model as a standard regression table.

However, it can be helpful show some diagnostics from your regression model as graphics. These notes show you how you can produce some simple regression diagnostics and present them graphically.

The data used for these notes can be downloaded as a CSV file: regression diagnostic.csv. The data are the same as shown in the book (mayfly regression) but reproduced here to make it easier to follow the notes.

Use the regression equation with observed predictor variable values to produce fitted y-values.

Fitted values are "expected" values of the response variable calculated from the coefficients of the regression and observed values of predictor variables.

Top

### Fitted values

When you make a regression model you are trying to find the "best" mathematical fit between the variables. In a simple form the relationship will be:

y = mx + c

In this case there is a single predictor variable, x. The m is the slope and c is the intercept. Both m and c are known as coefficients. You might have a logarithmic or polynomial relationship but essentially the form of the relationship is the same:

y = m.log(x) + c
y = ax + bx^2 + c

In multiple regression you have more than one predictor and each predictor has a coefficient (like a slope), but the general form is the same:

y = ax + bz + c

Where a and b are coefficients, x and z are predictor variables and c is an intercept. You can add more and more variables:

y = m1.x1 + m2.x2 + m3.x3 + ... + c

Note that each predictor variable has a unique coefficient.

Once you have your "equation" you can use the observed values of the predictors and of the coefficients in your equation to produce new y-values. These values are "fitted" to the regression model. Think of them as being idealised values or "expected" values.

Note that you should not call these fitted values "predicted" values because they are based on your observed predictor variable values. If you insert predictor values that you did not observe into the equation then you can call the resulting y-values predicted y-values. The values you insert to the equation could lie within the max-min range of observed predictor (interpolation) or be outside (extrapolation). In general it is "safer" to go with interpolation than with extrapolation.

Fitted values in R:

fitted(model)
model\$fitted.values

The fitted() command produces fitted values as a simple numeric vector.

Top

#### Calculate Fitted values using R

Fitted values are easy to compute in R. You can get them from the result of a lm() command in two ways:

model\$fitted.values
fitted(model)

In both cases model is the result of a lm() command.

> names(mf)
 "Length" "Speed" "Algae" "NO3" "BOD"

> mf.lm = lm(Length ~ BOD + Algae, data = mf)

Then you can get the fitted values:

`> fitted(mf.lm)       1        2        3        4        5        6        7        8 16.71301 17.70924 19.40969 21.65981 21.79722 20.93840 21.88309 19.12458        9       10       11       12       13       14       15       16 17.22829 16.42101 19.50246 20.23417 21.14452 20.23417 17.79511 16.72331       17       18       19       20       21       22       23       24 17.57183 19.29977 22.31250 14.66902 16.47254 24.35649 22.90681 22.52893       25 22.36403`

Once you have the values you can use them like any other numeric variable (the vector has a names attribute).

Use LINEST to get regression coefficients.

Coefficients are listed backwards with the inercept being last.

Top

#### Calculate Fitted values using Excel

To get fitted values in Excel you'll need to calculate the coefficients and plug the values into the spreadsheet to generate them:

1. Arrange your data so that the predictor values are next to one another.
2. Use the LINEST function to determine the coefficients:
1. Highlight one block of cells in a row, you need one cell per coefficient.
2. Type =LINEST and start the formula, inside the () you need
3. The y-values, the x-values, 1, 0. Note that the final 0 suppresses additional regression stats, you only get the coefficients.
4. Enter the formula as an array using Control+Enter
3. The results of LINEST show the coefficients backwards! The intercept is last. The first coefficient shows the last of the predictor variables.
4. Use the coefficient values and the observed predictors to make your predicted values.

Once you have the coefficients it is fairly easy to create the fitted values but you will need to "fix" the references to the coefficients using \$ in the cell reference(s). Then you can copy/paste the formula down the column to fill in the fitted values for all the rows of data.

Use Analysis ToolPak to carry out regression and compute fitted values as well as residuals.

Data > Data Analysis button

Select Regression option and tick Residuals to get fitted and residuals.

Top

#### Use the Analysis ToolPak

If you have the Analysis ToolPak add-in you can use this to generate the regression statistics. The Regression routine will produce the coefficients as well as the fitted values.

1. Arrange your data so that the predictor values are next to one another.
2. Click the Data > Data Analysis button.
3. Select the Regression option.
4. Use the mouse to select the appropriate data (you can include the label headings).
5. Choose the location for the results, which can be in the current worksheet or in another.
6. Tick the box labelled, Residuals.
7. Click OK to finish. The Analysis ToolPak can undertake regression and compute fitted values

The results for the residuals will appear after the main regression results. You'll also see another column containing the fitted values. Unfortunately this is labelled Predicted.

Residuals are the difference between observed y-values and idealised fitted values based on your regression model.

Top

### Residuals

Residual values are the difference between the fitted values and the actually observed values for your response variable. Think of the fitted values as being the "ideal" or "expected" values based on your regression equation. The residual values are (usually) not ideal and differ from the "perfect fit".

So, the residuals are a way to see how bad the fit is between the ideal values based on your regression model and the real situation. In general you want to see that the residuals are normally distributed, at least this is what you want when doing regular linear regression (and curvilinear). In diagnostic terms it is the normal distribution of the residuals that is the really important thing, not the distribution of the original data (although usually you do not get the former without the latter).

Residual values in R:

resid(model)
model\$residuals

The resid() command produces residual values as a simple numeric vector.

Top

#### Calculate residuals using R

Residual values are easy to compute using R; you get them from the result of a lm() command:

model\$residuals
resid(model)

Simply make your regression model result then away you go:

> mf.lm = lm(Length ~ BOD + Algae, data = mf)

`> fitted(mf.lm)       1        2        3        4        5        6        7        8 16.71301 17.70924 19.40969 21.65981 21.79722 20.93840 21.88309 19.12458        9       10       11       12       13       14       15       16 17.22829 16.42101 19.50246 20.23417 21.14452 20.23417 17.79511 16.72331       17       18       19       20       21       22       23       24 17.57183 19.29977 22.31250 14.66902 16.47254 24.35649 22.90681 22.52893       25 22.36403`

The result of resid() is a numerical vector (with a names attribute), which you can use in various ways (as you will see later).

Residuals are Observed - Fitted values. A simple calculation in Excel if you have the fitted values.

Top

#### Calculate residuals using Excel

Residuals are simply the difference between the observed and expected values of your response variable. You can use the regression equation for your model to generate the fitted values. Once you have those you simply subtract the fitted value from the observed:

Observed - Fitted = Residual

Note that you always carry out the calculation this way around by convention.

Use Analysis ToolPak to compute residual values.

Top

#### Use the Analysis ToolPak

The Analysis ToolPak can compute the residuals for you easily. You simply tick the box labelled Residuals in the Regression dialogue box. The process is exactly the same as when calculating the fitted values.

The residuals are placed alongside the fitted values (which are labelled Predicted), and are labelled Residuals.

Standardized residuals are rescaled residuals.

The rescale is dependent on the variance of individual residuals and takes into account leverage (a measure of influence).

Use rstandard() in R

Excel cannot easily compute correct standardized residuals. An approximation is:

resid ÷ std.dev(resid)

The Analysis ToolPak only computes an approximation.

Top

### Standardized residuals

Standardized residuals are residuals that have been rescaled. The way they are scaled is such that each residual value has unit standard deviaiton. The standardization process takes into account how much "influence" a datum might have on the regression (something called leverage).

It is generally preferrable to use the standardized residuals.

#### Standardized resisuals in R

You can compute the standardized residuals using the rstandard() command:

`> rstandard(mf.lm)          1           2           3           4           5           6  1.89433577  1.85925744  1.53830120  0.77572446 -0.45212719 -0.52806189           7           8           9          10          11          12 -1.70410540 -1.82866111 -1.29005538 -1.41946426  0.85656348  0.43789841          13          14          15          16          17          18 -0.09180616 -0.13389770  0.71314102  0.75324789 -0.32958291 -0.16844397          19          20          21          22          23          24 -0.76996058 -1.02235369 -0.27307270  0.39957434  0.63390452  0.27061395          25 -0.20911178`

The result is a numeric vector (with a names attribute).

#### Standardized residuals in Excel

You cannot calculate the standardized residuals in Excel without some tedious computations! The Analysis ToolPak claims to calculate standardized residuals but what you end up with is the residuals divided by the standard deviation of the residuals. This is only an approximation to the "proper" standardization.

Diagnostic plots:

Residuals vs. Fitted
Normal distribution of Std.Residuals.

Top

### Diagnostic plots

There are several type of plot you might use as diagnostic tools. The ones I'll show here are:

• Residuals vs. Fitted values.
• Normal distribution plot of standardized residuals.

You can make these plots easily using R. In Excel you can make some worthwhile attempts but you can only use approximately standardized residuals.

Use plot() with result of regression model for diagnostic plots.

For specific residual vs. fitted plot use:

plot(model, which = 1)

Use lowess() with a regular plot() to view the fit (or lack of) between variables.

Top

### Residuals vs. Fitted values

You can calculate the fitted and residuals easily in R or Excel. A plot of the residuals against the fitted values is a simple way to produce a useful diagnostic tool.

#### Plot using R

The simplest way to produce a residual plot is to use the plot() command on a regression model result (which actually runs the plot.lm() command). The plot.lm() command produces up to six diagnostic plots, which you can choose using the which = parameter. The plot of residuals vs. fitted values is obtained via:

plot(model, which = 1)

This produces a scatter plot of the regular residuals against the fitted values. The command also produces a locally weighted polynomial smooth fit line and identifies and data points that might have undue influence on the model fit.

> plot(mf.lm, which = 1) The plot.lm() command produces diagnostic plots. This one is the residuals vs. fitted values.

Ideally you would like not to see any kind of pattern. In the example there are some points identified as possibly having undue influence but the lowess line runs aimlessly along the middle.

You can produce a similar plot using standardized residuals and regular plot commands:

> plot(fitted(mf.lm), rstandard(mf.lm))
> abline(h = 0, lty = 3)
> lines(lowess(fitted(mf.lm), rstandard(mf.lm)), col = "red")

Use a regular scatter plot to show residuals vs. fitted values.

Add a trendline using a Moving Average to get an impression of the fit (or lack of).

Values need to be sorted in ascending numerical order (by fitted value) for the moving average to work.

Top

#### Plot using Excel

It is fairly easy to plot the regular residuals against the fitted values using Excel. Once you have calculated the values you can simply make a scatter plot. However, it is not so easy to add a locally weighted polynomial. There are add-ins you might try but a simple workaround is to use a moving average trendline. To incorporate this you need to sort the fitted values in ascending numerical order.

1. First of all you need to calculate the fitted values and the residuals. You can do this using the Analysis ToolPak.
2. If you used regular commands to compute the values then:
1. Copy them to the clipboard.
2. Paste Special the values only to a new location.
3. If you used the Analysis ToolPak you do not need to do anything.
4. Highlight the values (include their headings) and sort them:
1. Click the Home > Sort & Filter button.
2. Click Custom Sort.
3. Use the dropdown on the left to sort by the Fitted values.
4. Make sure you use Smallest to Largest.
5. Click OK to apply the sort.
5. Now you have the fitted values sorted in ascending numerical order.

Make a scatter plot by highlighting the fitted and residual values data (and the column headings). You want to ensure that the fitted values are in the first column (Excel will expect this).

Add a moving average trendline.

• In Excel 2013 click the chart then use the Design > Add Chart Element button.
• In Excel 2010 click the chart then use the Layout > Trendline button.

You want to select the Moving Average option. The default is for a two-point moving average. This is okay for some cases and not for others so format the trendline and alter the options until you get something that looks sensible. In the following chart a 4-point moving average was used. A residual vs. fitted values plot in Excel. Use a moving point average trendline to get an impression of the fit (or lack of).

The moving average trendline is not a perfect solution but it will give you an idea.

Use plot(model, which = 2) to make a QQ plot of the standardized residuals of a regression model.

Top

### Distribution plot of residuals

Ideally you want the residuals (standardized) to be normally distributed. In R you can calculate the standardized residuals and plot a histogram or QQ plot to show the distribution. In Excel you can only approximate the standardized residuals, which you can plot as a histogram or a QQ plot (with a bit of work).

#### Plot using R

The simplest solution is to use plot() on the result of a regression model. If you use the parameter, which = 2, you'll get a QQ plot of the standardized residuals.

> plot(mf.lm, which = 2) A normal QQ plot of the standardized residuals of a regression model

You can see that the plot() command has highlighted those data points that have some influence over the model fit. You can produce a similar plot using regular commands:

> qqnorm(rstandard(mf.lm))
> qqline(rstandard(mf.lm), lty = 3)

This would not identify any "odd points".

Make a QQ plot of residuals using Excel.

You need to rank values (no ties) then determine cumulative rank proportion.

Use NORMSINV to get z-scores.

Plot sorted residuals against (sorted) z-scores to make a QQ plot.

Add a linear trendline to help assess normality.

Top

#### Plot using Excel

Once you have the residuals in Excel (either by direct calculation or via the Analysis ToolPak) you are well on your way to making a normal distribution plot. If you want the standardized residuals you will have to put up with Excel's approximation. You can work out the approximate values using:

Approx Std.Resid = Residual / Stdev(Residual)

Or let the Analysis ToolPak calculate this for you.

You can then make a histogram of the values. You'll have to select appropriate bins and use the FREQUENCY fumction to work out the frequencies. Alternatively you can let the Analysis ToolPak work out frequencies for you via the Histogram routine (which will make the chart too).

It is possible to produce a QQ plot, which requires a few steps:

1. Make a column of "ranks", which is a simple list of "observation number", values from 1 to the number of replicates (rows) in your dataset.
2. Now determine the cumulative proportion for the rank values. You need a formula resembling: =(rank-0.5)/count(rank1:rankn) where rank1:rankn is your column of rank values and rank is the rank you are working out the proportion of. You will need to "fix" the cell references for rank1:rankn using \$.
3. Copy and paste the formula down the column. You should see ascending values rising towards a value of 1 (you'll probably end up with 0.98).
4. In the next column you need to calculate a z-score. Use NORMSINV to work this out using the proportions you just calculated. You should end up with a series of values starting at about -2 and rising through 0 to around +2.
5. In the next column copy your standardized residual values.
6. Sort the residuals from smallest to largest.
7. Highlight the z-score and sorted residuals data.
8. Insert a Scatter plot.

You can add a linear trendline and with a bit of formatting can end up with a half decent QQ plot. A QQ plot of residuals from a regression model

The QQ plot is a bit more useful than a histogram and does not take a lot of extra work. However, it can be a bit tedious if you have many rows of data.

Top

My Publications     