Dr. Mark Gardener

# Statistics – A guide

These pages are aimed at helping you learn about statistics: why you need them, what they can do for you, which routines are suitable for your purposes and how to carry out a range of statistical analyses.

The natural world is variable so several measurements need to be taken.

Summary statistics help make sense of these repeated measurements.

## Summarizing data

When you want to measure something in the natural world you usually have to take several measurements. This is because things are variable, so you need several results to get an idea of the situation. Once you have these measurements you need to summarize them in some way because sets of raw numbers are not easily interpreted by most people.

### In this section:

• Basics of summarizing data
• Averages
• Dispersion
• Replication
• Shape
• Summary

The following sections give you an idea of the most useful summary statistics you can use.

Four key elements in data summary: centrality, dispersion, replication and shape (distribution).

## Basics of summarizing data

There are four key areas to consider when summarizing a set of numbers:

• Centrality – the middle value or average.
• Dispersion – how spread out the values are from the average.
• Replication – how many values there are in the sample.
• Shape – the data distribution, which relates to how "evenly" the values are spread either side of the average.

You need to present the first three summary statistics in order to summarize a set of numbers adequately. There are different measures of centrality and dispersion – the measures you select are based on the last item, shape (or data distribution).

Averages are measures of centrality:

Most appropriate average depends on data shape (distribution)

## Averages

An average is a measure of the middle point of a set of values. This central tendency (centrality) is an important measure and is usually what you are comparing when looking at differences between samples, for example.

There are three main kinds of average:

• Mean – the arithmetic mean, the sum of the values divided by the replication.
• Median – the middle value when all the numbers are ranked in order.
• Mode – the most frequent value(s) in a sample.

Of these three, the mean and the median are most commonly used in statistical analysis. The most appropriate average depends on the shape of the data sample.

Arithmetic mean:

Measure of centrality for normally distributed data

### Mean

The arithmetic mean is calculated by adding together the values in the sample. The sum is then divided by the number of items in the sample (the replication).

$$\bar{x} = \frac{\sum x}{n}$$

Calculating the arithmetic mean

The formula is shown above. The ∑ symbol represents "sum of". The n represents the replication. The final mean is indicated using an overbar. This shows that the mean is your estimate of the true mean. This is because you usually measure only some of the items in a "population"; this is called a sample. If you measured everything then you would be able to calculate the true mean, which would be indicated by giving it a µ symbol.

The mean should only be used when the shape of the sample is appropriate. When the data are normally distributed the mean is a good summary of the average. If the data are not normally distributed the mean is not a good summary and you should use the median instead.
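The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original guide: the function name `arithmetic_mean` is made up for the example, and the sample values are borrowed from the median section that follows.

```python
def arithmetic_mean(values):
    """Return the sum of the values divided by the replication (n)."""
    n = len(values)  # replication: the number of items in the sample
    if n == 0:
        raise ValueError("need at least one value to calculate a mean")
    return sum(values) / n


# Illustrative sample (the same five values used in the median section).
sample = [4, 6, 7, 8, 9]
print(arithmetic_mean(sample))  # 34 / 5 = 6.8
```

Python's built-in `statistics.mean` gives the same result, so in practice you would rarely write this yourself.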

Median:

Measure of centrality that is not dependent on data shape (distribution)

When data are normally distributed median and mean are very close

### Median

The median is the middle value, taken when you arrange your numbers in order (rank). This measure of the average does not depend on the shape of the data. The "formula" for working out the median depends on the ranks of the values; you want the value whose rank is (n/2) + 0.5, like so:

$$\text{rank of median} = \frac{n}{2} + 0.5$$

Calculating the median

If you have an odd number of values in your sample the median is simply the middle value like so:

4 6 7 8 9

The median is 7 in this case.

When you have an even number of values the middle will fall between two items:

2 3 4 7 8 9

What you do is use a value mid-way between the two items in the middle. In this case mid-way between 4 and 7, which gives 5.5.

The median is a good general choice for an average because it is not dependent on the shape of the data. When the data are normally distributed the mean and the median are coincident (or very close).
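The rank-based procedure can be sketched as follows. This is only an illustration under the rules given above: the function name `median_by_rank` is hypothetical, and the two samples are the odd and even examples from this section.

```python
def median_by_rank(values):
    """Return the value whose rank is (n / 2) + 0.5 once the values are sorted.

    When that rank falls between two items (even n), the result is the
    value mid-way between them.
    """
    ranked = sorted(values)
    n = len(ranked)
    if n == 0:
        raise ValueError("need at least one value to calculate a median")
    rank = n / 2 + 0.5          # 1-based rank of the median
    below = int(rank)           # rank of the item at or just below the median
    if rank == below:           # odd n: the rank lands exactly on an item
        return ranked[below - 1]
    # even n: take the value mid-way between the two middle items
    return (ranked[below - 1] + ranked[below]) / 2


print(median_by_rank([4, 6, 7, 8, 9]))     # rank 3 -> 7
print(median_by_rank([2, 3, 4, 7, 8, 9]))  # rank 3.5 -> mid-way between 4 and 7 = 5.5
```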

Mode:

Most frequent value in a sample

Not much used in statistical analysis

### Mode

The mode is the most frequent value in a sample. It is calculated by working out how many there are of each value in your sample. The one with the highest frequency is the mode. It is possible to get tied frequencies, in which case you report both values. The sample is then said to be bimodal. You might get more than two modal values!

The mode is not commonly used in statistical analysis. It tends to be used most often when you have a lot of values, and where you have integer values (although it can be calculated for any sample).

The mode is not dependent on the shape of your sample. Generally speaking you would expect your mode and median to be close, regardless of the sample distribution. If the sample is normally distributed the mode will usually also be close to the mean.
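Counting frequencies by hand is tedious, so here is a small sketch of the idea in Python. The function name `modes` is made up for this example, and the tied sample at the end is invented purely to show what happens with two modal values.

```python
from collections import Counter


def modes(values):
    """Return every value that shares the highest frequency, in ascending order."""
    if not values:
        raise ValueError("need at least one value to find a mode")
    counts = Counter(values)            # how many there are of each value
    highest = max(counts.values())      # the highest frequency
    return sorted(v for v, c in counts.items() if c == highest)


# An invented sample with tied frequencies: 3 and 7 each occur twice.
print(modes([2, 3, 3, 4, 7, 7]))  # [3, 7], a bimodal sample
```

Returning a list rather than a single number keeps tied modes visible instead of silently picking one of them.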

Dispersion:

How spread out the values are around the average

High dispersion indicates high variability in a sample

Most useful general measures of dispersion are:

Standard deviation
Inter-quartile range

Choice of measure depends on data shape (distribution)

## Dispersion

The dispersion of a sample refers to how spread out the values are around the average. If the values are close to the average then your sample has low dispersion. If the values are widely scattered about the average your sample has high dispersion.

Two samples with the same average but different dispersion

The example figure shows samples that are normally distributed, that is, they are symmetrical around the average (mean). As far as dispersion goes, the principle is the same regardless of the shape of the data. However, different measures of dispersion will be more appropriate for different data distributions.

There are various measures of dispersion, such as:

• Standard deviation
• Inter-quartile range
• Range

The choice of measurement depends largely on the shape of the data and what you want to focus on. In general with normally distributed data you use the standard deviation. If the data are not normally distributed you use the inter-quartile range.

Standard deviation:

A measure of dispersion for normally distributed data

### Standard deviation

The standard deviation is used when the data are normally distributed. You can think of it as a sort of "average deviation" from the mean. The general formula for calculating standard deviation looks like the following:

$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$

Calculating standard deviation

To work out standard deviation follow these steps:

1. Subtract the mean from each value in the sample.
2. Square the results from step 1 (this removes negative values).
3. Add together the squared differences from step 2.
4. Divide the summed squared differences from step 3 by n-1, which is the number of items in the sample (replication) minus one.
5. Take the square root of the result from step 4.

The final result is called s, the standard deviation. In most cases you will have taken a sample of values from a larger "population", so your value of s is your estimate of standard deviation (the sample standard deviation). This is also why you used n-1 as the divisor in the formula. If you measured the entire population you can use n as the divisor. You would then have σ, which is the "true" standard deviation (called the population standard deviation).

In effect the -1 is a compensation factor. As n gets larger and therefore closer to the entire population, subtracting 1 has a smaller and smaller effect on the result. In most statistical analyses you will use sample standard deviation (and so n-1).
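The five steps can be followed one by one in code. The sketch below is an illustration only: the function name `standard_deviation` and its `population` switch are assumptions made for this example, and the sample values are borrowed from the median section.

```python
import math


def standard_deviation(values, population=False):
    """Standard deviation following the five steps above.

    With population=False (the usual case) the divisor is n - 1 and the
    result is s; with population=True the divisor is n and the result is sigma.
    """
    n = len(values)
    if n == 0 or (n < 2 and not population):
        raise ValueError("not enough values to calculate a standard deviation")
    mean = sum(values) / n
    differences = [x - mean for x in values]     # step 1: subtract the mean
    squared = [d ** 2 for d in differences]      # step 2: square the results
    total = sum(squared)                         # step 3: add them together
    divisor = n if population else n - 1         # step 4: divide by n - 1 (or n)
    return math.sqrt(total / divisor)            # step 5: take the square root


sample = [4, 6, 7, 8, 9]
s = standard_deviation(sample)                       # sample standard deviation
sigma = standard_deviation(sample, population=True)  # population standard deviation
print(round(s, 3), round(sigma, 3))
```

For this sample the summed squared differences come to 14.8, so s is the square root of 14.8 / 4 and σ is the square root of 14.8 / 5. The standard library functions `statistics.stdev` and `statistics.pstdev` cover the same two cases.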

### Other measures of dispersion for normally distributed data

There are other measures that can be used to represent dispersion when your sample is normally distributed. I will add notes about these at a later date.

Inter-quartile range (IQR):

A measure of dispersion for data that are not normally distributed

Based on the ranks of the data items

### Inter-Quartile Range

The inter-quartile range (IQR) is a useful measure of the dispersion of data that are not normally distributed (see shape). You start by working out the median; this effectively splits the data into two chunks, with an equal number of values in each part. For each half you can now work out the value that is half-way between the median and the "end" (the maximum or minimum). This gives you values for the two quartiles (the lower and upper quartile). The difference between them is the IQR, which you usually express as a single value.

The IQR essentially "knocks off" the most extreme portions of the data sample, leaving you with a core 50% of your original data. A small IQR denotes a small dispersion and a large IQR a large dispersion.

As a by-product of working out the IQR you'll usually end up with five values:

• Minimum – the 0th quartile (or 0% quantile).
• Lower quartile – the 1st quartile (or 25% quantile).
• Median – the 2nd quartile (or 50% quantile).
• Upper quartile – the 3rd quartile (or 75% quantile).
• Maximum – the 4th quartile (or 100% quantile).

These 5 values split the data sample into four parts, which is why they are called quartiles. You can calculate the quartiles from the ranks of the data values like so:

1. Rank the values in ascending order. Use the mean rank for tied values.
2. The median corresponds to the item that has rank 0.5n + 0.5 (where n = replication).
3. The lower quartile corresponds to the item that has rank 0.25n + 0.75.
4. The upper quartile corresponds to the item that has rank 0.75n + 0.25.

If you are using Excel you can compute the quartiles using the QUARTILE function.
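The rank formulas above translate directly into code. In the sketch below the function names `value_at_rank` and `five_number_summary` are hypothetical. One assumption goes beyond the text: when a rank falls between two items, the value is found by linear interpolation between its neighbours. The sample is the not normally distributed dataset used in the tally plot section.

```python
def value_at_rank(ranked, rank):
    """Return the value at a 1-based rank in an ascending list.

    Assumption: a fractional rank is resolved by linear interpolation
    between the two neighbouring items.
    """
    below = int(rank)               # whole part of the rank
    fraction = rank - below         # how far towards the next item
    if fraction == 0:
        return ranked[below - 1]
    return ranked[below - 1] + fraction * (ranked[below] - ranked[below - 1])


def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile and maximum."""
    ranked = sorted(values)         # step 1: rank the values in ascending order
    n = len(ranked)                 # replication
    if n == 0:
        raise ValueError("need at least one value")
    return {
        "min": ranked[0],
        "lower": value_at_rank(ranked, 0.25 * n + 0.75),
        "median": value_at_rank(ranked, 0.5 * n + 0.5),
        "upper": value_at_rank(ranked, 0.75 * n + 0.25),
        "max": ranked[-1],
    }


sample = [21, 36, 18, 17, 16, 22, 20, 19, 20, 22, 25, 19, 17,
          21, 19, 21, 31, 22, 19, 19, 16, 23, 21, 16, 30]
summary = five_number_summary(sample)
iqr = summary["upper"] - summary["lower"]
print(summary, iqr)  # quartiles at ranks 7, 13 and 19 give 19, 20 and 22, so IQR = 3
```

With n = 25 the three ranks are whole numbers, so no interpolation is needed. For a sample such as 2 3 4 7 8 9 the ranks are 2.25, 3.5 and 4.75, and the interpolation assumption comes into play.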

Range is max-min

Not very useful as a measure of dispersion

### Range

The range is simply the difference between the maximum and the minimum values. It is quite a crude measure and not very useful, because it depends entirely on the two most extreme values. The inter-quartile range is much more useful because it leaves out the most extreme portions of the sample.

Replication is the number of values in your sample

## Replication

This is the simplest of the summary statistics but it is still important. The replication is simply how many items there are in your sample (that is, the number of observations).

The value n, the replication, is used in calculating other summary statistics, such as standard deviation and IQR, but it is also helpful in its own right. You should look at the dispersion and replication together. A certain value for dispersion might be considered "high" if n is small but quite "low" if n is very large.

Data shape affects the kind of summary statistic and analytical approach

Data shape relates to the distribution of values around the average

## Shape

The shape of the data affects the type of summary statistics that best summarize them. The "shape" refers to how the data values are distributed across the range of values in the sample. Generally you expect there to be a "cluster" of values around the average. It is important to know if the values are more or less symmetrically arranged around the average, or if there are more values to one side than the other.

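There are two main ways to explore the shape (distribution) of a sample of data values:

• Graphs – drawing the distribution as a tally plot or histogram.
• Shape statistics – skewness and kurtosis.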

The ultimate goal is to determine what kind of distribution your data form. If you have a normal distribution you have a wide range of options when it comes to data summary and subsequent analysis.

Types of data distribution include:

Normal (Gaussian)
Poisson
Binomial

Normal distribution is called parametric

Other distributions are non-parametric

### Types of data distribution

There are many "shapes" of data; commonly encountered ones are:

• Normal (also called Gaussian)
• Poisson
• Binomial

In general your aim is to work out if you have a normal distribution or not. If you do have a normal distribution you can use the mean and standard deviation for summary. If you do not have a normal distribution you need to use the median and IQR instead.

The normal distribution (also called Gaussian) has well-explored characteristics and such data are usually described as parametric. If data are not parametric they can be described as skewed or non-parametric.

Visualize data distribution with a frequency chart:

### Drawing the distribution

There are two main ways to visualize the shape of your data:

• Tally plots
• Histograms

In both cases the idea is to make a frequency plot. The data values are split into frequency classes, usually called bins. You then determine how many data items are in each bin. There is little difference between a tally plot and a histogram; they show the same information but are constructed in slightly different ways.

A tally plot is a simple frequency chart that can be drawn in a notebook

Split data into size classes (bins) and determine frequency of data in each size class

#### Tally plots

A tally plot is a kind of frequency graph that you can sketch in a notebook. This makes it a very useful tool for times when you haven't got a computer to hand.

To draw a tally plot follow these steps:

1. Determine the size classes (bins); you want around 7 bins.
2. Draw a vertical line (axis) and write the values for the bins to the left.
3. For each datum, determine which size class it fits into and add a tally mark to the right of the axis, opposite the appropriate bin.

You will now be able to assess the shape of the data sample you've got.

A tally plot, showing normal distribution

The tally plot in the preceding figure shows a normal (parametric) distribution. You can see that the shape is more or less symmetrical around the middle. So here the mean and standard deviation would be good summary values to represent the data. The original dataset was:

17 26 28 27 29 28 25 26 34 32 23 29 24 21 26 31 31 22 26 19 36 23 21 16 30

The first bin, labelled 18, contains values up to 18. There are two in the dataset (17 and 16). The next bin is 21 and therefore contains items that are >18 but not greater than 21 (there are three: 21, 19 and 21).

The following dataset is not normally distributed:

21 36 18 17 16 22 20 19 20 22 25 19 17 21 19 21 31 22 19 19 16 23 21 16 30

These data produce a tally plot like so:

A tally plot, showing non-parametric distribution

Note that the same bins were used for the second dataset. The range for both samples was 16-36. The data in the second sample are clearly not normally distributed. The tallest size class is not in the middle and there is a long "tail" towards the higher values (see shape statistics). For these data the median and inter-quartile range would be appropriate summary statistics.
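If you do have a computer to hand, the tallying can be sketched in Python. The function names `tally_counts` and `tally_plot` are made up for this example. The text only names the first two bins (18 and 21), so the remaining labels (24 to 36, in steps of 3) are an assumption chosen to give seven bins that cover the range 16-36.

```python
def tally_counts(values, bin_labels):
    """Count how many values fall into each bin.

    Each label is the largest value its bin may hold, so with labels
    18, 21, ... the bin labelled 21 holds values > 18 but not greater than 21.
    """
    labels = sorted(bin_labels)
    counts = {label: 0 for label in labels}
    for value in values:
        for label in labels:
            if value <= label:          # first bin large enough to hold the value
                counts[label] += 1
                break
        else:
            raise ValueError(f"{value} is larger than the last bin")
    return counts


def tally_plot(counts):
    """Return a text tally plot: bin label, axis, then one mark per datum."""
    return "\n".join(f"{label:>3} | " + "/" * count
                     for label, count in counts.items())


bins = [18, 21, 24, 27, 30, 33, 36]   # assumed bin labels, see note above
normal_sample = [17, 26, 28, 27, 29, 28, 25, 26, 34, 32, 23, 29, 24,
                 21, 26, 31, 31, 22, 26, 19, 36, 23, 21, 16, 30]
skewed_sample = [21, 36, 18, 17, 16, 22, 20, 19, 20, 22, 25, 19, 17,
                 21, 19, 21, 31, 22, 19, 19, 16, 23, 21, 16, 30]

print(tally_plot(tally_counts(normal_sample, bins)))
print()
print(tally_plot(tally_counts(skewed_sample, bins)))
```

With these bins the first sample gives frequencies of 2, 3, 4, 6, 5, 3 and 2, which rise to a peak in the middle. The second gives 6, 11, 4, 1, 1, 1 and 1, with the tallest class near the low end and a long tail towards the higher values.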

Histogram:

A kind of bar chart showing frequency of data in various size classes

#### Histograms

A histogram is like a bar chart. The bars represent the frequency of values in the data sample that correspond to various size classes (bins). Generally the bars are drawn without gaps between them to highlight the fact that the x-axis represents a continuous variable. There is little difference between a tally plot and a histogram but the latter can be produced easily using a computer (you can sketch one in a notebook too).

To make a histogram you follow the same general procedure as for a tally plot but with subtle differences:

1. Determine the size classes.
2. Work out the frequency for each size class.
3. Draw a bar chart using the size classes as the x-axis and the frequencies on the y-axis.

You can draw a histogram by hand or use your spreadsheet. The following histograms were drawn using the same data as for the tally plots in the preceding section. The first histogram shows normally distributed data.

Histogram showing normal distribution

The next histogram shows a non-parametric distribution.

Histogram showing non-parametric distribution

In both these examples the bars are shown with a small gap; more properly the bars should be touching. The x-axis shows the size classes as a range under each bar. You can also show the maximum value for each size class. Ideally your histogram should have the labels at the divisions between size classes like so:

Histogram with x-axis labelled at size class boundaries

Note that this histogram uses slightly different size classes to the earlier ones.

Shape statistics are numerical values that help you characterize the distribution (its shape):

### Shape statistics

Visualizing the shape of your data samples is usually your main goal. However, it is possible to characterize the shape of a data distribution using shape statistics. There are two, which are used in conjunction with each other:

• Skewness – a measure of how central the average is in the distribution.
• Kurtosis – a measure of how pointy the distribution is (think of it as how clustered the values are around the middle).

If you are producing a numerical data summary these two values are useful statistics.

Skewness is a measure of how central the average is

Use SKEW in Excel

#### Skewness

The skewness of a sample is a measure of how central the average is in relation to the overall spread of values. The formula to calculate skewness uses the number of items in the sample (the replication, n) and the standard deviation, s.

Formula to calculate skewness

In practice you'll use a computer to calculate skewness; Excel has a SKEW function that will compute it for you.

A positive value indicates that the bulk of the values lies to the left, with a long "tail" of more positive values to the right (a positive, or right, skew). A negative value indicates the opposite. The further the value is from zero the more skewed the sample is.
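To see how n and s enter the calculation, here is a sketch that uses the formula documented for Excel's SKEW function. It is an assumption that this matches the formula originally pictured on this page, and the function name `skewness` is made up for the example. The sample is the not normally distributed dataset from the tally plot section.

```python
import math


def skewness(values):
    """Sample skewness using the formula documented for Excel's SKEW:

        n / ((n - 1)(n - 2)) * sum(((x - mean) / s) ** 3)

    where s is the sample standard deviation (n - 1 divisor).
    """
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least three values")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


skewed_sample = [21, 36, 18, 17, 16, 22, 20, 19, 20, 22, 25, 19, 17,
                 21, 19, 21, 31, 22, 19, 19, 16, 23, 21, 16, 30]
print(skewness(skewed_sample))        # positive: long tail of higher values
print(skewness([1, 2, 3, 4, 5]))      # symmetrical sample: effectively zero
```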

Kurtosis is a measure of how "pointed" a distribution is

Use KURT in Excel

#### Kurtosis

The kurtosis of a sample is a measure of how pointed the distribution is (see drawing the distribution). It is also a way to think about how clustered the values are around the middle. The formula to calculate kurtosis uses the number of items in the sample (the replication, n) and the standard deviation, s.

Formula to calculate kurtosis

In practice you'll use a computer to calculate kurtosis; Excel has a KURT function that will compute it for you.

A positive result indicates a distribution that is more pointed than a normal distribution. A negative result indicates a distribution that is flatter. The further the value is from zero the more extreme the pointedness or flatness of the distribution.
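As with skewness, the calculation can be sketched using the formula documented for Excel's KURT function. It is an assumption that this is the formula originally pictured here. The function name `kurtosis` and both small samples are invented for illustration.

```python
import math


def kurtosis(values):
    """Sample kurtosis using the formula documented for Excel's KURT:

        n(n + 1) / ((n - 1)(n - 2)(n - 3)) * sum(((x - mean) / s) ** 4)
            - 3(n - 1)^2 / ((n - 2)(n - 3))

    where s is the sample standard deviation (n - 1 divisor).
    """
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least four values")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("kurtosis is undefined when all values are equal")
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


flat_sample = list(range(1, 11))                   # evenly spread values 1 to 10
pointed_sample = [0, 5, 5, 5, 5, 5, 5, 5, 5, 10]   # most values piled in the middle
print(kurtosis(flat_sample))      # negative: a flat distribution
print(kurtosis(pointed_sample))   # positive: a pointed distribution
```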

Data summary:

Shape of data determines which statistics are most appropriate

Explore shape (distribution) using graphs (tally plots, histograms) and shape statistics (skewness, kurtosis)

## Summary

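You should always summarize a sample of data values to make them more easily understood (by you and others). At the very least you need to show:

• Centrality – the average.
• Dispersion – how spread out the values are.
• Replication – how many values there are.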

The shape of the data (its distribution) is also important because the shape determines which summary statistics are most appropriate to describe the sample. Your data may be normally distributed (i.e. with a symmetrical, bell-shaped curve) and so parametric, or they may be skewed and therefore non-parametric.

You can explore and describe the shape of data using graphs:

• Tally plots
• Histograms

You can also use shape statistics:

• Skewness – how central the average is.
• Kurtosis – how pointed the distribution is.

The shape of the data also leads you towards the most appropriate ways of analyzing the data, that is, which statistical tests you can use.
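As a closing illustration, the sketch below pulls the three required summary values together using Python's standard `statistics` module. The function name `summarize` and its `normally_distributed` argument are hypothetical; you decide the shape yourself (from a tally plot, histogram or shape statistics) and pass the answer in. The quartiles use the `"inclusive"` method of `statistics.quantiles`, which I believe corresponds to the rank formulas given in the inter-quartile range section.

```python
import statistics


def summarize(values, normally_distributed):
    """Return centrality, dispersion and replication for a sample.

    Mean and standard deviation are reported for normally distributed
    data; median and inter-quartile range otherwise.
    """
    n = len(values)  # replication
    if normally_distributed:
        return {"n": n,
                "average": ("mean", statistics.mean(values)),
                "dispersion": ("standard deviation", statistics.stdev(values))}
    lower, _, upper = statistics.quantiles(values, n=4, method="inclusive")
    return {"n": n,
            "average": ("median", statistics.median(values)),
            "dispersion": ("inter-quartile range", upper - lower)}


# The two datasets from the tally plot section.
normal_sample = [17, 26, 28, 27, 29, 28, 25, 26, 34, 32, 23, 29, 24,
                 21, 26, 31, 31, 22, 26, 19, 36, 23, 21, 16, 30]
skewed_sample = [21, 36, 18, 17, 16, 22, 20, 19, 20, 22, 25, 19, 17,
                 21, 19, 21, 31, 22, 19, 19, 16, 23, 21, 16, 30]

print(summarize(normal_sample, normally_distributed=True))
print(summarize(skewed_sample, normally_distributed=False))
```

For the first sample this reports a mean of 26 with n = 25. For the second it reports a median of 20 and an inter-quartile range of 3.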
