Dr. Mark Gardener

Managing Data Using Excel

# Writer's Bloc

On this page you can find out about my latest writing project. I'll post updates on progress, tables of contents and also some of the R scripts (and possibly Excel spreadsheets) I am developing in support of the new book. I'll try to keep the material reasonably up to date.

The Writer's Bloc homepage contains a table of contents and an index of the pages that contain custom R commands and R scripts.

Managing Data Using Excel:
Organizing, Summarizing and Visualizing Scientific Data

Available now from
Pelagic Publishing

Get £5 discount when you buy direct from the publisher. Enter voucher code MDUE20 in the shopping basket at Pelagic Publishing

## Making sense of your data.

Once you have got your data into a sensible arrangement and carefully built-up yourt dataset you'll need to think about exploring the data in order to make sense of it. After all, this is generally the point of collecting the data in the first place.

You need to be able to work out what kind of data you've got and produce summary statistics, to help reduce the information into the data to a more manageable form (for the human brain). You also need to be able to visualize the data, this helps highlight patterns in the data in a consice and helpful manner. Summary statistics and charts are also essential to convey information to others when it comes to sharing your results (Chapter 9 in the book).

Excel has various tools that can help you to explore your data, including Pivot Tables and Chart Tools. This page gives an overview of the material covered in Chapters 4–8 of the book.

Main types of dataset:

• Correlation & Regression
• Association
• Differences

Data can be time-related

Top

## Types of dataset

In a broad sense you can think of data as being in one of three camps:

• Correlation and regression: involves finding links between variables, which are usually numeric. You may have grouping variables (which might be time-related).
• Association: involves finding links between categories of things, your data are frequencies of the items in the various categories.
• Differences: involves splitting data into chunks and comparing these sample groups. Your data may be time-sensitive and require you to "follow" a variable over time.

How you deal with your data depends to a large extent on which camp your data is most allied to. Some datasets may be a combination and contain elements of more than one kind. Once you know what you are looking at you can use the most appropriate tools to help you explore and make sense of your data.

Cornerstones of data summary:

Centrality - averages
Replication - how many
Distribution - shape

Top

## Fundamentals of data summary

It is hard to make sense of a load of numbers. The sensible thing to do is to distil the numbers into a summary. Ideally you should do a visual summary before a numerical one but in practice you have to do the latter before the former. Such summary statistics reduce the original data to a few key values, which help the reader view the data. Summary statistics come in four main sorts:

• Middle values – averages, i.e. measures of centrality.
• Variability – are the values clustered tightly around the middle or more widespread.
• Replication – how many items you have.
• Distribution – the shape of the data.

Summary statistics are useful but most readers will find a graphic more helpful in the long run.

Averages:

Mean
Median
Mode

Top

### Centrality

There are three main sorts of average:

• Mean: this is what most people are familiar with, you add everything together (the sum) and divide by the number of things you've got (the replication).
• Median: here you put the values in order (rank), the value in the middle is the median. If you have have an even number of observations your middle will lie mid-way between two values.
• Mode: this is the most frequent value in the dataset. You can have two modal values (or more). This measure is not used in statistical testing.

The most appropriate average is related to the shape of the data, its distribution.

Dispersion:

Standard deviation
Standard Error
Range
IQR

Top

### Dispersion

This is a measure of how spread-out the data are from the middle. There are several commonly used measures:

• Standard deviation – this is a kind of average deflection from the mean.
• Standard error – this is related to Std. Dev. (s) and sample size (n) S.E = s÷√n.
• Range – this is simply the difference between thelargest value and the smallest value.
• Inter-quartile range (IQR) – the median is a way of splitting the data into 2 parts. The quartiles similarly split the sample into 4 parts, with the quartile values at the intervals (the mdeian is the 2nd, max = 4th and min = 0th). The IQR is the difference between the 1st and 3rd quartiles.

There are two camps" of measure, the first 2 are in one camp the rest are in the other camp. Which camp you go for depends on the distribution of the data.

The size of the sample is called the replication or number of observations

### Replication

This is the simplest, yet often overlooked. The number of observations (called replications or records) is used in many other calculations. It is important to know if the sample is small (few replicates) or large because this can affect the level of certainty.

The shape of the data determines the summary statistics and analytical methods

Describe shape using:

Tally plot
Histogram
Kurtosis
Skewness

Top

### Distribution

The shape of the data is important because it will affect the kinds of summary statistics you can use and also the statistical approach you can use later on. Essentially you split the data into chunks (called size classes or bins) and work out the frequency of observations in each bin. You usually make a graphical representation of the resulting table but there are "shape statistics" that can help:

• Tally plot – this is a simple "tick plot", whcih you can scrawl on a piece of paper. It is simple but effective.
• Histogram – a kind of bar chart that shows the frequency of values in each bin.
• Kurtosis – a measure of how "pointy" the shape is.
• Skewness – a measure of how central the "hump" is.

If your data are normally distributed the tally plot or histogram looks symmetrical, with a central "hump", which tails off gracefully either side (forming a bell-shape). The properties of the normal distribution are well understood and this kind of shape is also called parametric. With parametric data you can use the following:

• Mean.
• Standard Deviation.
• Standard Error.

If your data shape is non-parametric (i.e. is skewed) the mathematical properties are rather more uncertain, you have to use the following for summary:

• Median.
• Range.
• IQR.

Of course the replication is independent of the shape of the data.

Different graphs have different uses:

Bar chart
Column chart
Box-whisker plot (boxplot)
Dot chart
Line plot
Scatter plot
Pie chart

Top

## Graphs and charts

Graphical methods of presentation are usually the best way to explore your data. Graphs are also more easily understood by others so are useful for sharing your findings. Excel has quite powerful graphical capabilities but you often have to put in a bit of effort to overcome the default settings. However, since the graphs of your work represent the pinnacle of your research you probably ought to make it look like you care.

Different charts are useful for different purposes:

• Bar and column charts – good for showing differences between sample groups, each bar being a sample group. You can add error bars to show variability (dispersion). You can show multiple groupings using adjacent bars or have them stacked. Stacked charts can also be scaled to 100% to show compositional data. Bar charts can also be used with association data, to show frequencies in the various categories.
• Box-whisker plots (boxplots) – good for differences, like bar charts. However, they show a lot of information in a compact manner, usually median, IQR and range (i.e. max-min).
• Dot charts – show data in sample groups. Usually you'll use these for data validation as you present all the data in each sample (rather than an average).
• Line plots – show sample groups in distinct categories, joined by a line. You therefore usually use them to show how a variable changes over time, but only if the time intervals are equal.
• Scatter plots – show two numeric variables, one on each axis if a regular plot. These are what you use if you have correlation or regression data.
• Pie charts – show compositional data, as the pie is always 360˚ of course. Use with caution as the human eye is not good with angular measures. They are sometimes used in association data. If you can make a pice chart you can always show the data in another form, such as a bar chart.
• Histogram (and tally plots) – used to show frequency data for a single sample, that is they are used to show the shape (distribution) of a sample.

Some graphs can be pressed into service for more than one use. All of these charts types are easily made in Excel, although the boxplot requires some lateral thinking!

Exploring correlation & regression:

Excel's default correlation is Pearson's proiduct moment, which is a parametric correlation coefficient

Use correlation matrix
Scatter plot

Top

## Exploring Correlation & Regression datasets

In these kinds of dataset you are looking for links between variables. Sometimes you have grouping variables and sometimes not.

Generally you will check the variables for normality, as this affects the sort of regression you can carry out. You may or may not choose to sumarize each variable, depending how many there are. In any event you need to work out correlation or regression values for the links between pairs of variables. There are two commonly used sorts of correlation coefficient:

• Spearman's Rank – this is used for non-parametric data.
• Pearson's Product Moment – this is used for normally distributed data. Pearson's coefficient is the one that Excel computes via the =CORREL function.

In the main you'll use Pearson's coefficient for basic exploration simply because it is convenient to compute in Excel.You can also use the Analysis ToolPak to make a correlation matrix, which allows you to see all the coefficients in one go. You can make a correlation matrix by other methods, but they are a bit more work.

When you need to visualize the relationships you will use scatter plots. You cannot use these from a Pivot Table (as a Pivot Chart) but it is relatively easy to make scatter plots, which you can alter. You can most easily do this using selection outlines, which appear when you click in a scatter plot. You simply drag the outline to a new variable, and the chart updates.

Time-related data:

Line plots allow you to follow a variable across (equal) time intervals

Sparklines are mini in-cell charts

Top

## Exploring Time-related datasets

When you have time-related data you'll often want to "follow" a variable over time. The line plot is the way to do this graphically. The line plot shows the value of a variable at various categories, these would correspond to the time intervals. They are equal in "size" so if the time intervals are not regular then you should consider an alternative, such as a scatter plot.

You can use a Line plot from a Pivot Table, as a Pivot Chart. This makes it easy to explore your data as you can alter the chart rapidly and easily. You can also use Sparklines, which are mini line charts (although there is a bar version), which can be useful to follow several variables at the same time as separate plots. The Sparklines are in-cell plots so you can place regular text or functions in the "background", i.e. the cells work like normal cells, they somply have a chart inside too.

Association data:

Contingency table cross-tabulates frequencies over categories

Visualize with bar charts or pie charts

Top

## Exploring Association datasets

When you have association data you have frequencies of observations at categories. The usual way to proceed is to make a contingency table – the Pivot Table is a useful tool for this. The contingency table shows the frequencies cross-tabulated over two ranges of category. You can have grouping variables and the Pivot Table makes it easy to incorporate them.

The contingency table itself is a useful summary but large tables can be hard to take in. Using charts is a good way to visualize association data. You can make bar charts from the Pivot Tables very easily as Pivot Charts, which allow you to arrange and rearrange the data easily.

Whilst bar charts are the most useful way to visualize the data Pie charts are also an option. Pie charts show proportions of things in a sample. This means that you may need several pie charts, one for each category. Pie charts are not ideal, as the human eye is not good with angular measures. You can make a pie chart as a Pivot Chart but a bar chart is likely to be a better option!

Differences data:

Regular summary statistics for sample groups

Visualize using:

Bar charts (with error bars)
Box-whisker plots

Charts can be filtered to help explore the data

Top

## Exploring Differences datasets

When you are looking for differences between sample groups you'll use summary statistics on each sample group. You can use Pivot Tables to produce mean values but not medians. If you need to use medians you'll have to assemble the data into sample groups (using the Pivot Table and an index variable) and regular Excel functions.

Visualizing differences data usually means the bar chart, which allows you to show the average and a measure of dispersion (as error bars). You can make bar charts easily from Pivot Tables, although you need a bit of work to add the error bars.

Box-whisker plots are a good alternative to bar charts. They show several statistics in a compact manner, usually median, IQR and the max and min. Many statistians use boxplots to show non-parametric statistics, even if the data are normally distributed. It is not so easy to make a boxplot in Excel, but it can be done. You need to use the "Other" types of chart, specifically a High-Low-Open-Close type of Stock chart. With a little bit of coercion you can show the median, IQR and range (as error bars) and make a very tidy plot.

If you arrange your data appropriately you can use filters on your data to alter the chart, even boxplots can be filtered. You start with a Pivot Table to get the data into sample groups, then prepare the summary statistics required for the chart. You then build the chart and can apply filters afterwards to explore the data.

Top

Providing training for:

• Ecology
• Data analysis
• Statistics
• R The statistical programming language
• Data management
• Data mining

Top
Courses
Publications
Contact DataAnalytics Homepage