Dr. Mark Gardener

Home
About
 

Index of MonogRaphs

R: MonogRaphs

A series of essays on random topics using R: The Statistical Programming Language

R is a powerful and flexible beast. Getting started using R is not too difficult and you can learn to start using R in an afternoon. However, mastering R takes rather longer! These monographs are my way of exploring various topics in a completely unstructured manner.

Tips & Tricks for R | An Introduction to R| Writer's Bloc | Courses


Data layout – stack()ing your data

There are two main ways you can layout your data. In sample-format each coloumn is a separate sample, which forms some kind of logical sampling unit. This is a typical way you layout data if you are using Excel because that's how you have to have your data to be able to make charts and carry out most forms of analysis.

In scientific recording format each column is a variable; you have response variables and predictor variables. This recording-layout is a more powerful and ultimately flexible layout because you can add new variables or observations easily. In R this layout is also essential for any kind of complicated analysis, such as regression or analysis of variance.

I've written about scientific recording format before, so see my Writer's Bloc page for a brief summary.

When you have data in the "wrong" layout you need to be able to rearrange them into a more "sensible" layout so that you can unleash the power of R most effectively. The stack() command is a useful tool that can help you achieve this layout.


Pelagic Publishing Logo

Contents


The stack() command combines numeric columns

Uses column names to form new predictor variable as a factor

Top

Convert a single predictor

The simplest situation is where you have a single response variable and a single predictor, but the values are laid out in sampe columns. Here is an example, the data are counts of a freshwater invertebrate at three different sampling locations:

> hog3
Upper Mid Lower
1 3 4 11
2 4 3 12
3 5 7 9
4 9 9 10
5 8 11 11
6 10 NA NA
7 9 NA NA

Note that the samples contain different numbers of observations. The "short" columns are padded by NA items when the data is read into R (usually via the read.csv() command).

The stack() command will take the columns and combine them into a single response variable. The names of the columns will be used to form the levels of a predictor variable.

> stack(hog3)
values ind
1 3 Upper
2 4 Upper
3 5 Upper
4 9 Upper
5 8 Upper
6 10 Upper
7 9 Upper
8 4 Mid
9 3 Mid
10 7 Mid
11 9 Mid
12 11 Mid
13 NA Mid
14 NA Mid
15 11 Lower
16 12 Lower
17 9 Lower
18 10 Lower
19 11 Lower
20 NA Lower
21 NA Lower

Note that the columns are labelled, values and ind. You can alter these easily enough using the names() command.

> hog = stack(hog3)
> names(hog) = c("count", "site")
> head(hog)
count site
1 3 Upper
2 4 Upper
3 5 Upper
4 9 Upper
5 8 Upper
6 10 Upper

Now you have a data.frame in recording layout, but the NA items are still in the data.


Use na.omit() to remove rows containing NA items

Top

Remove NA items

The na.omit() command will "knock-out" any NA items.

> hog
count site
1 3 Upper
2 4 Upper
3 5 Upper
4 9 Upper
5 8 Upper
6 10 Upper
7 9 Upper
8 4 Mid
9 3 Mid
10 7 Mid
11 9 Mid
12 11 Mid
13 NA Mid
14 NA Mid
15 11 Lower
16 12 Lower
17 9 Lower
18 10 Lower
19 11 Lower
20 NA Lower
21 NA Lower
> na.omit(hog)
count site
1 3 Upper
2 4 Upper
3 5 Upper
4 9 Upper
5 8 Upper
6 10 Upper
7 9 Upper
8 4 Mid
9 3 Mid
10 7 Mid
11 9 Mid
12 11 Mid
15 11 Lower
16 12 Lower
17 9 Lower
18 10 Lower
19 11 Lower

Note that the row names keep their original values.


Use row.names() to renumber rows if required

Top

Re-number row names

If you want to re-number the rows just use the row.names() command.

> hog = na.omit(hog)
> row.names(hog) = 1:length(hog$count)
> hog
count site
1 3 Upper
2 4 Upper
3 5 Upper
4 9 Upper
5 8 Upper
6 10 Upper
7 9 Upper
8 4 Mid
9 3 Mid
10 7 Mid
11 9 Mid
12 11 Mid
13 11 Lower
14 12 Lower
15 9 Lower
16 10 Lower
17 11 Lower

It's not really essential but it makes it easier to keep track of items if the numbers are sequential.


The stack() command will not deal with factor variables

Top

Convert multiple predictors

You may have data in a more complicated arrangement, perhaps mirroring the on-groud layout. This can happen for example in a two-way ANOVA design. In Excel the data have to be set out in a particular way for the Analysis ToolPak to carry out the anova. However, in R this layout is not helpful.

> wp
Water vulgaris sativa
1 Lo 9 7
2 Lo 11 6
3 Lo 6 5
4 Mid 14 14
5 Mid 17 17
6 Mid 19 15
7 Hi 28 44
8 Hi 31 38
9 Hi 32 37

This time you have a column representing one of the predictor variables and the other columns in sample layout. If you try a stack() command you'll find that only the numeric columns are stacked.

> wps = stack(wp)
Warning message:
In stack.data.frame(wp) : non-vector columns will be ignored

> wps
values ind
1 9 vulgaris
2 11 vulgaris
3 6 vulgaris
4 14 vulgaris
5 17 vulgaris
6 19 vulgaris
7 28 vulgaris
8 31 vulgaris
9 32 vulgaris
10 7 sativa
11 6 sativa
12 5 sativa
13 14 sativa
14 17 sativa
15 15 sativa
16 44 sativa
17 38 sativa
18 37 sativa

This is not a problem because you can re-build the first predictor variable by duplicating the original and adding it to the new data.frame.


Re-build a predictor variable by adding with cbind()

Repeat the variable for each of the original sample columns

Top

Replace a factor predictor variable

What you must do is take the original predictor variable and repeat it a number of times (equal to the number of sample columns).

> wps = cbind(wps, water = rep(wp$Water, 2))
values ind water
1 9 vulgaris Lo
2 11 vulgaris Lo
3 6 vulgaris Lo
4 14 vulgaris Mid
5 17 vulgaris Mid
6 19 vulgaris Mid
7 28 vulgaris Hi
8 31 vulgaris Hi
9 32 vulgaris Hi
10 7 sativa Lo
11 6 sativa Lo
12 5 sativa Lo
13 14 sativa Mid
14 17 sativa Mid
15 15 sativa Mid
16 44 sativa Hi
17 38 sativa Hi
18 37 sativa Hi

Note that you gave the new variable an explicit name. The other column names need replacing using the names() command:

> names(wps)[1:2] = c("height", "species")
> names(wps)
[1] "height" "species" "water"
> head(wps)
height species water
1 9 vulgaris Lo
2 11 vulgaris Lo
3 6 vulgaris Lo
4 14 vulgaris Mid
5 17 vulgaris Mid
6 19 vulgaris Mid

Now you have a more sensible layout that can be used for aov() and graphical commands.


  This approach will not work on every dataset that you get but it will take you a long way forward.
Top
MongRaphs Index

See my Publications about statistics and data analysis.

Writer's Bloc – my latest writing project includes R scripts

Courses in data analysis, data management and statistics.

Follow me...
Facebook Twitter Google+ Linkedin Amazon
Top Home
Data Analysis
MonogRaphs Index Contact