Dr. Mark Gardener

Statistics for Ecologists Using R and Excel.

Edition 2.

Statistics for Ecologists Using R and Excel:
Data collection, exploration, analysis and presentation

Get a 20% discount on "Statistics for Ecologists" when you buy direct from the publisher! Enter the voucher code S4E20 in the shopping basket at Pelagic Publishing.

Top

Writer's Bloc

On this page you can find out about my latest writing project. I'll post updates on progress, tables of contents and also some of the R scripts (and possibly Excel spreadsheets) I am developing in support of the new book. I'll try to keep the material reasonably up to date.

The Writer's Bloc homepage contains a table of contents and an index of the pages that contain custom R commands and R scripts.

I am working on a new edition of my book Statistics for Ecologists Using R and Excel. I added a chapter on community statistics, which I did not cover in the original. Community data usually requires rather more heavy-duty statistics but there are some analyses that you can carrry out fairly easily. Community similarity (or dissimilarity) is one such topic. In these notes you'll see some of the more commonly used measures of dissimilarity used when you have abundance data.

Abundance-based dissimilarity metrics

Introduction

When your community data samples include abundance information (as opposed to simple presence-absence) you have a wider choice of metrics to use in calculating (dis)similarity. When you have presence-absence data you use the number of shared species (J) and the species richness of each sample (A & B). Measures of (dis)similarity obtained are therefore slightly "crude".

When you have abundance data your measures of (dis)similarity are a bit more "refined" and you have the potential to pick up patterns in the data that you would otherwise not see using presence-absence data.

There are many metrics that you might use to explore (dis)similarity, in this exercise you'll see four of the more commonly used ones:

• Bray-Curtis
• Canberra
• Manhattan (City Block)
• Euclidean

The exercise shows you how you can carry out the calculations using Excel (in the book you also see how to do this using R). You can get the sample spreadsheet here: Distance metrics.xlsx.

One sample is designated i and one sample j in equations computing dissimilarity between two communities.

Top

Notation

When you have presence-absence data you use the species richness of each sample, calling these values A and B. The number of shared species is designated J. When you have abundance data you use a slightly different notation. The abundance of an individual species is given as x. Now you are comparing two samples so you call one i and one j. Subscripts are used to differentiate the abundances between the two samples, so you get xi and xj.

Generally speaking it does not matter which sample is i and which is j.

You'll also see the ∑ symbol (meaning sum, you add things together) and the vertical bars e.g. |x - y|, which indicate that you ignore any negative sign and simply use the magnitude of the result.

Bray-Curtis dissimilarity has two components:

|xi - xj| and (xi + xj)

Both are easily computed using Excel.

Use =ABS to get absolute values (i.e. ignore negative sign).

Bray-Curtis is also known as:
Steinhaus
Czekanowski
Sørensen

Top

Bray-Curtis

The Bray-Curtis metric uses two components: |xi - xj| and (xi + xj).

In the first case you subtract the abundance of one species in a sample from its counterpart in the other sample but ignore the sign. The second component is the abundance of a species in one sample added to the abundance of its counterpart in the second sample. If a species is absent then its abundance should be recorded as 0 (zero).

The equation for the Bray-Curtis dissimilarity is shown below:

Bray-Curtis dissimilarity

Bray-Curtis dissimilarity is easy to calculate in Excel but there are no shortcuts to working out the main components, |xi - xj| and (xi + xj).

You might have your data arranged in rows or columns; in the following example the data are in columns.

 Species Si Sj |xi - xj| (xi + xj) Spp1 6 4 2 10 Spp2 5 3 2 8 Spp3 7 4 3 11 Spp4 2 6 4 8 Spp5 3 0 3 3

So, the 4th column is easily calculated by subtracting values in the 3rd column from the 2nd. Use the =ABS function to get the absolute value (i.e. ignore the sign). So for example the 4th data row would be 2 – 6 = -4 but the ABS function ignores the -.

The last column is even easier, as it is the sum of a value in the 2nd column with its counterpart in the 3rd column.

You don't need to compute the sum of the 4th or 5th columns because your final calculation can do that in one go:

=SUM(column4) / SUM(column5)

You can use the mouse to highlight the values in the columns in place of the column4 and column5 items in the formula (which represent the cell ranges).

If you calculate the Bray-Curtis dissimilarity for these data you get a result of 0.350.

If your data were arranged in rows the calculations are more or less the same (the spreadsheet shows both).

Canberra dissimilarity has two components:

|xi - xj| and (xi + xj)

Both are easily computed using Excel.

Use =ABS to get absolute values (i.e. ignore negative sign).

Canberra distance is tedious to calculate in Excel so you need an intermediate step.

Top

Canberra

The Canberra dissimilarity uses the same components as Bray-Curtis but the components are summed differently:

Canberra dissimilarity.

In the Canberra metric each |xi - xj| result is divided by the corresponding (xi + xj) value.

In the following example the data are shown in rows but the calculations are more or less the same for data in columns (the spreadsheet shows both).

 Calc Spp1 Spp2 Spp3 Spp4 Spp5 Si 6 5 7 2 3 Sj 4 3 4 6 0 |xi - xj| 2 2 3 4 3 (xi + xj) 10 8 11 8 3

This time you need to evaluate each |xi - xj| ÷ (xi + xj) then adding them together.

This means you need to work out 2/10 + 2/8 + 3/11 + 4/8 + 3/3 for the current example. There is no shortcut for this (obviously you will use the cell references) so when you have a lot of species the formula is quite long. You can get around this by making a new row (or column, depending how your data are arranged) for the |xi - xj| ÷ (xi + xj) values. You can then copy/paste the results and then do a final =SUM to get the Canberra distance.

In this case the result works out to be 2.223.

Manhattan (City Block) dissimilarity is simple to compute.

Use =ABS to get absolute values (i.e. ignore negative sign).

Top

Manhattan (City Block)

The Manhattan (or City Block) dissimilarity uses the |xi - xj| component only.

This is the simplest dissimilarity metric to compute:

Manhattan (City Block) dissimilarity.

You can use the =ABS function to ignore any negative signs (and retain the value only). Then the =SUM funtion can simply total them to give the final result. In this case you get: 2 + 2 + 3 + 4 + 3 = 14.

As you can see, City Block dissimilarity is easy to work out using Excel.

Euclidean dissimilarity uses squares of differences in species abundance.

Use =SUMXMY2 function in Excel as a "shortcut" in calculation.

Top

Euclidean

The Euclidean dissimilarity metric uses a different component (xi – xj) from the other metrics.

Here the abundance of a species from one sample is subtracted from its counterpart in the other sample. Instead of ignoring the sign, the result is squared (which gives a positive value):

Euclidean dissimilarity.

In this case you can use the =SUMXMY2 function to eliminate any intermediate calculation steps (although you can of course do those intermediate steps if you like).

=SUMXMY2(range1, range2)

What the function does is to take two ranges of cells, it subtracts one value from its counterpart in the other and squares the result, then gives the total. This is exactly what you want to calculate the Euclidean dissimilarity. All you need to to is to use =SUMXMY2 and then SQRT the result. You can of course do this in one formula.

The spreadsheet gives an example of the calculation in action for data in columns and in rows (there is little difference).

If you already had calculated |xi - xj| values then you can use them. Simply square each one, add the squares together, there is a function to do that =SUMSQ. Then take a square root. You might as well use the =SUMXMY2 function!

In the example here the |xi - xj| values are 2, 2, 3, 4, 3.

Their squares would be 4, 4, 9, 16, 9 giving a total of 42.

The final Euclidean dissimilarity is √42 = 6.481.

The choice of dissimilarity metric is a source of debate!

Top

Which metric to use?

Good quesiton! There is no definite "correct" metric to use in most cases. As a rule of thumb you should use the metric that gives the best separation of samples. This runs a little counter to the idea in statistics that you decide on the proper test to apply before you begin. However, this is not a statistical "test" but a method of separating samples into meaningful groupings. Your distance values should be visualized using a dendrogram (see this exercise on drawing dendrograms in Excel). You choose the metric that gives the most useful results.

The choice of dissimilarity metrics is something that occasionally sparks low-level holy wars amongst ecologists. Hopefullly you will use several and the results will all give broadly the same ecological meaning!

Examples on the support page.
Distance metrics.xlsx

Example data and files

You can get examples from the book from the support pages. There are also some specific examples used for the exercises in addition to those shown in the book. You can also download the spreadsheet file Distance metrics.xlsx, which shows how to carry out the calcualtions using Excel.

Top

Providing training for:

• Ecology
• Data analysis
• Statistics
• R The statistical programming language
• Data management
• Data mining