Dr. Mark Gardener

 
About

Managing Data Using Exce, Cover

Managing Data Using Excel


Writer's Bloc

On this page you can find out about my latest writing project. I'll post updates on progress, tables of contents and also some of the R scripts (and possibly Excel spreadsheets) I am developing in support of the new book. I'll try to keep the material reasonably up to date.

The Writer's Bloc homepage contains a table of contents and an index of the pages that contain custom R commands and R scripts.


Managing Data Using Excel:
Organizing, Summarizing and Visualizing Scientific Data

Available now from
Pelagic Publishing

Pelagic Publishing

Get £5 discount when you buy direct from the publisher. Enter voucher code MDUE20 in the shopping basket at Pelagic Publishing

Validating your data

Mistakes are inevitable and the more data you have the greater the likelihood that you'll have errors creeping into your dataset. Checking for errors is therefore an important part of data management. You can validate data as it is being entered as well as checking for mistakes later on.

Error checking and data validation is the subject of Chapter 3 of Managing Data Using Excel. There are three main strands to consider:

  • Typographical errors – spelling mistakes and so on.
  • Validating entries – how to restrict data entry to minimize errors (and how to check entries later).
  • Numerical errors – spotting values that are obviously wrong.

There are several Excel tools that can help in the fight against errors. This page gives a brief overview of the material covered in Chapter 3 of the book.


Tools to help spot typographical errors:
Sort
Filter
Pivot Table

Use =RIGHT and =EXACT functions with an Advanced Filter to view entries with trailing spaces.

Top

Typographical Errors

It is best if you can avoid errors from the start of course but this is practically impossible. There are ways to minimize the occurrence of errors, by validating data as it is entered (see later), but you'll always need to look over your data. There is no need to wait until all the data are entered before error checking, and in many ways it is a good idea to start the process early. If you need to do some tinkering with the data entry it is best to make the changes early, when you have fewer data items to manipulate.

Typographical errors are most common, and the more complicated your data labels are the more likely it is that these typos will creep in. There are three main ways you can be a bit more systematic about error checking than simply "looking over" your data:

  • Sorting – putting your data in various orders can help you spot oddities in spelling.
  • Filtering – rearranging your data into logical chunks can help you spot mistakes.
  • Pivot Tables – allow you to arrange and rearrange your data in various ways, which can help you to spot errors.

The Advanced Filter option allows more options than a simple filter but you need to set-up the criteria in a separate worksheet. This can be especially helpful when looking for entries that contain a trailing space.

Detecting trailing spaces

It is all too easy to add a trailing space to an entry, this happens often with species or site names. When the datum is entered it looks "normal" because you cannot see the trailing space. However, Excel (and other programs) will treat the entry as different from the "regular" entry without the trailing space. You can use an Advanced Filter to help you spot these trailing spaces:

  1. Make a new sheet containing the filter criteria, this generally means copying the data heading row and adding the filter criteria underneath.
  2. Add a new heading/column for the special criterion.
  3. Then enter a formula, you'll need to use =EXACT and =RIGHT functions to search for a single space. Point to the first entry of the variable you want to check; the filter will work down the column itself later.
  4. Return to the main data worksheet and click once in the block of data.
  5. Bring up the Advanced Filter and filter in-place, making sure you select the special criterion you just created.
  6. The data are now filtered and any visible entries must contain extra spaces at the end. You can now edit the entries.
  7. Clear the filter once you have finished to restore to view the entire dataset.

The Advanced Filter and =EXACT function
The Advanced Filter can be used to help search for trailing spaces. In this example the filter is set-up to look for a trailing "a".

You can also use a Pivot Table, any entries that appear duplicated in Row Labels or Column Labels will likely have trailing spaces. This is less exact than the Advanced Filter but can be a useful way to look over your dataset.


Data validation can be carried out during entry, or afterwards.

Use the Data Validation tools in the Data Ribbon menu.

Top

Validating Entries

It is possible to restrict data entry to certain criteria, thus reducing possible errors. For example, if you are entering abundance on a DAFOR scale you can restrict the contents of the abundance variable to include only appropriate values (the letters DAFOR and maybe "N" for not present). If you are entering numerical data that are percentages you can restrict entries to 100% maximum.

There are various ways that you can set-up the validation criteria, as you'll see shortly. The validation tools are found in the Data menu.

Data Validation Tools in Excel
Data Validation Tools are found in the Data menu in the Excel Ribbon.

There are two main ways you can implement Data Validation:

  • During entry – this reduces the chances of errors as you restrict what can be entered into the data sheet. You set-up criteria, which are matched as the user enters data. There are options for how you deal with possible "violations".
  • After entry – this is a way to check already entered data for possible errors. You set-up entry criteria, allowing possible errors to be flagged in the main dataset.

In general the larger the dataset the more helpful it is to have your data validation carried out at the entry stage.


Data validation can pick up data that lie outside a fixed range.

Arranging data into sample "chunks" can help you spot possible errors.

Graphical methods are generally the best way to spot potential numerical errors.

Use a Pivot Table and a Pivot Chart to make a dot chart template from a line plot.

Top

Numerical Errors

Numerical errors are usually the hardest to spot, especially if the variable in question has not got a particularly "fixed" range. More usually you are looking at a sample that has a range of values that lie within the absolute boundaries of possible values. You are not looking for absolute errors but for entries that are "odd".

You can use data validation tools to help you pick out entries that lie outside of an obvious range. For those variables without an obvious range you need a different approach.

You can use Filter and Sort tools to help you arrange and rearrange your data into chunks. Pivot Tables will also let you reorganize your data in various ways. These tools allow you to scan the data and possibly spot those that appear "odd". Just because a datum is "odd" does not mean that it is wrong. Data are variable and having "way out" points could simply indicate that you haven't collected enough data. The point of the error highlighting is to give you the opportunity to differentiate between genuine errors and merely "variable" entries.

Although Filter tools and the like are helpful, it is usually visual methods that allow you to scan for oddities more easily. There are two scenarios you'll need to consider, each will require a different approach to checking for "odd" data:

  • Comparisons between samples.
  • Correlations and regressions.

When you are looking at various samples, with a view to exploring differences between them, you need to look at the data on a sample by sample basis. If you were to represent the scenario graphically you'd use bar charts and similar graphs.

By contrast when you are looking for correlations or carrying out regression you are comparing two (or more) numeric variables (although there are other sorts of regression). These cases would lend themselves to visualization with scatter plots. You may want to break the data down by other variables and use samples but in any event you're still going to need a scatter plot visualization.

When you have correlation data you can look for oddities by making a scatter plot, if any of the data points looks "way off" you can look at the data entry in more detail. Just because your datum looks odd doesn't mean it is wrong. Data scientists occasionally spark off low-level holy wars by suggesting one approach or another to "outliers". By spotting the outliers you can at least make a decision one way or another.

When you have separate samples you need a different approach. The simplest way is to make a dotchart of some kind.

A simple dot chart
A simple dot chart can help spot possible outliers in a numerical sample.

In the example here a sample of data has been plotted as a scatter plot with a "fixed value" for the y-axis. It is possible to do this using regular Excel Chart Tools. You can also make multiple sample plots but it is not especially convenient.

A better approach is to use a Pivot Table and a Pivot Chart to make a line plot. You don't want the lines, so can edit them out. If you then save your chart as a Template you can use it for any data at any time.

Multiple dot chart made using a pivot chart and a template
A line plot with the lines removed makes a good template for a dot chart.

The support material for the book will contain a template chart (I will put a link here when I make the support page). The chapter itself gives more details about how to set up and save such a template.

Once you have a dotchart template you can make a dotchart for any dataset. The strength of the Pivot Chart is that you can use filters to focus on parts of your data; this is especially helpful if there are large differences in the response variable between samples.


Top

Providing training for:

  • Ecology
  • Data analysis
  • Statistics
  • R The statistical programming language
  • Data management
  • Data mining

My Publications

Managing Data Using Excel, Cover

See my personal pages at GardenersOwn

Follow me...
Facebook Twitter Google+ Linkedin Amazon
Top
Courses
Publications
Contact DataAnalytics Homepage