Dr. Mark Gardener

 
About

Managing Data Using Exce, Cover

Managing Data Using Excel


Writer's Bloc

On this page you can find out about my latest writing project. I'll post updates on progress, tables of contents and also some of the R scripts (and possibly Excel spreadsheets) I am developing in support of the new book. I'll try to keep the material reasonably up to date.

The Writer's Bloc homepage contains a table of contents and an index of the pages that contain custom R commands and R scripts.


Managing Data Using Excel:
Organizing, Summarizing and Visualizing Scientific Data

Available now from
Pelagic Publishing

Pelagic Publishing

Get £5 discount when you buy direct from the publisher. Enter voucher code MDUE20 in the shopping basket at Pelagic Publishing

Scientific Recording Format

Introduction

The way you record your data underpins all your data management. It is easy to underestimate the importance of this aspect. Good data management can:

  • Save time.
  • Save money.
  • Save effort.
  • Reduce errors.

In the opening few sections of the book I am looking at ways to set out your data with the view to coming up with a system that is simple, flexible and allows the best use of your data.

There are various ways you could arrange your data and many scientists who ought to know better are guilty of not really thinking about data layout right at the outset.


Not so great ways to layout your data

Top

Common data layouts

A common (but not necessarily good) way of arranging data is to record your samples in separate columns. This doesn't seem so bad when your dataset is relatively simple:

Female

Male

110

120

111

107

107

110

108

116

110

114

105

111

107

113

106

117

111

114

111

112

In this case there are two samples, one for females and one for males (the data actually represent the length of jawbones of golden jackals from specimens in the British Museum). If you have more samples you simply add more columns.

However, you'll often have a situation where you have several variables like so:


A




B


H

L

M


H

L

M

36

26

18


20

27

42

21

30

21


21

14

26

24

54

29


24

29

19

18

25

17


17

19

16

10

70

12


13

29

39

43

52

18


15

31

28

28

51

35


15

41

21

15

26

30


16

20

39

26

67

36


28

44

29

Now these data show two sorts of woollen yarn (A and B) and three levels of tension on looms (High, Low and Medium). Notice that each of the tension columns is repeated. The values represent the numbe of times the yarn breaks in a given period.

If you add more variables the situation becomes increasingly difficult to represent in a "simple" table. In the following table there are two main variables and also several species.


Lower

Middle

Upper


Closed

Open

Closed

Open

Closed

Open

Sp.A

4

95

30

158

0

3


9

71

45

120

10

9


34

39

30

27

6

28

Sp.B

5

6

5

81

15

21


3

0

0

51

2

25


7

48

30

44

17

35

Sp.C

23

84

21

115

23

45


12

75

44

110

1

43


0

53

18

112

15

51

Here there are three locations (Lower, Middle and Upper) and at each there are two habitat types (Open and Closed). There are three species recorded, the values being the abundance of each species. Now each column is "split" and this is hardly an ideal situation. What's needed is a system to "standardize" the layout of data.


Scientific Recording Format (or Biological Recording Format)

Biological Records Centre (UK)

Top

A scientific recording format

To be maximally useful your data has to be laid out such that it:

  • Shows all the information.
  • Is flexible.
  • Can be checked for errors easily.
  • Can be analysed easily.
  • Can be extended and modified easily.

Meeting all those criteria could be a tall order but it is possible with a little thought.

Ideally you want to have your spreadsheet set out so that each column represents a single variable. Each row should then be a single "record" (also called an observation or replicate).

For the first of the examples, where there are two samples, the layout is simple:

Length (mm) Sex
120 Male
107 Male
110 Male
110 Female
111 Female
107 Female

You have one column for the length data and one for the sex (that is male or female). Each row now represents a single observation (record or replicate) and none of the columns are duplicated. In this case the Length column is the response variable and the Sex column is the predictor variable. You think that the sex of the jackal has an influence on the length of its jawbone.

The wool example is similar:

Breaks Wool Tension
36 A H
26 A L
18 A M
21 B H
29 B L
16 B M

The first column is the response variable, the number of breaks of yarn. The following columns are predictor variables, the type of wool and the level of tension.

You can see that this system is easily extensible, you simply add new columns for each new variable. This kind of layout is commonly used in Biological Recording, where species data is recorded and sent to Biological Records Centres (see BRC for the UK). In this way the occurrence of species can be monitored. In these Biological Records there are certain columns that you always want to see:

  • Who
  • What
  • Where
  • When

The name of the person recording the data (Who) is important, so that the data can be checked or verified at some later point. The What is the species being recorded, the Where element is often split into several parts, grid reference, site name and so on. The When part is of course the date of the record, it is important to know what time of year a species was found. Other columns (such as sex, life stage or abundance) may also be recorded.

It does not matter what branch of science or business you are in, these elements are still important. Having a rigid data layout gives you a good framework for your data and allows you to maximize its potential.


Top

Providing training for:

  • Ecology
  • Data analysis
  • Statistics
  • R The statistical programming language
  • Data management
  • Data mining

My Publications

Managing Data Using Excel, Cover

See my personal pages at GardenersOwn

Follow me...
Facebook Twitter Google+ Linkedin Amazon
Top
Courses
Publications
Contact DataAnalytics Homepage