Dr. Mark Gardener


# R: MonogRaphs

### A series of essays on random topics using R: The Statistical Programming Language

R is a powerful and flexible beast. Getting started using R is not too difficult and you can learn to start using R in an afternoon. However, mastering R takes rather longer! These monographs are my way of exploring various topics in a completely unstructured manner.

# Data layout – stack()ing your data

There are two main ways you can lay out your data. In sample-format each column is a separate sample, which forms some kind of logical sampling unit. This is the typical layout if you are using Excel, because that is how the data must be arranged for you to make charts and carry out most forms of analysis.

In scientific recording format each column is a variable; you have response variables and predictor variables. This recording-layout is a more powerful and ultimately flexible layout because you can add new variables or observations easily. In R this layout is also essential for any kind of complicated analysis, such as regression or analysis of variance.

I've written about scientific recording format before, so see my Writer's Bloc page for a brief summary.

When you have data in the "wrong" layout you need to be able to rearrange them into a more "sensible" layout so that you can unleash the power of R most effectively. The stack() command is a useful tool that can help you achieve this layout.


The stack() command combines numeric columns

Uses column names to form new predictor variable as a factor


## Convert a single predictor

The simplest situation is where you have a single response variable and a single predictor, but the values are laid out in sample columns. Here is an example; the data are counts of a freshwater invertebrate at three different sampling locations:

```
> hog3
  Upper Mid Lower
1     3   4    11
2     4   3    12
3     5   7     9
4     9   9    10
5     8  11    11
6    10  NA    NA
7     9  NA    NA
```

Note that the samples contain different numbers of observations. The "short" columns are padded with NA items when the data are read into R (usually via the read.csv() command).
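As a rough illustration of that padding, here is a minimal sketch that reads the same values from CSV text rather than from a file. The `csv_text` string is a made-up stand-in for a spreadsheet export; the empty cells at the ends of the short columns come through as NA.

```r
# Hypothetical CSV text standing in for a file exported from a spreadsheet.
# The trailing commas on the last two rows are the empty cells.
csv_text <- "Upper,Mid,Lower
3,4,11
4,3,12
5,7,9
9,9,10
8,11,11
10,,
9,,"

# read.csv() treats blank fields in numeric columns as missing values
hog3 <- read.csv(text = csv_text)
hog3

# Count the NA items in each column
colSums(is.na(hog3))
```

With a real file you would pass the file name to read.csv() instead of the `text` argument.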

The stack() command will take the columns and combine them into a single response variable. The names of the columns will be used to form the levels of a predictor variable.

```
> stack(hog3)
   values   ind
1       3 Upper
2       4 Upper
3       5 Upper
4       9 Upper
5       8 Upper
6      10 Upper
7       9 Upper
8       4   Mid
9       3   Mid
10      7   Mid
11      9   Mid
12     11   Mid
13     NA   Mid
14     NA   Mid
15     11 Lower
16     12 Lower
17      9 Lower
18     10 Lower
19     11 Lower
20     NA Lower
21     NA Lower
```

Note that the columns are labelled values and ind. You can alter these easily enough using the names() command.

```
> hog = stack(hog3)
> names(hog) = c("count", "site")
> head(hog)
  count  site
1     3 Upper
2     4 Upper
3     5 Upper
4     9 Upper
5     8 Upper
6    10 Upper
```

Now you have a data.frame in recording layout, but the NA items are still in the data.
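If you want to check what you ended up with, something like the following sketch may help. It rebuilds the example data by hand (so the block runs on its own), does the stacking and renaming in one step with setNames(), and then inspects the result. Using setNames() is just one option; the names() approach shown earlier does the same job.

```r
# Rebuild the example data so this block is self-contained
hog3 <- data.frame(
  Upper = c(3, 4, 5, 9, 8, 10, 9),
  Mid   = c(4, 3, 7, 9, 11, NA, NA),
  Lower = c(11, 12, 9, 10, 11, NA, NA)
)

# Stack and rename in one step; setNames() returns the renamed object
hog <- setNames(stack(hog3), c("count", "site"))

# 'count' is numeric and 'site' is a factor built from the column names
str(hog)
levels(hog$site)
```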

Use na.omit() to remove rows containing NA items


### Remove NA items

The na.omit() command will "knock-out" any NA items.

```
> hog
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
13    NA   Mid
14    NA   Mid
15    11 Lower
16    12 Lower
17     9 Lower
18    10 Lower
19    11 Lower
20    NA Lower
21    NA Lower
> na.omit(hog)
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
15    11 Lower
16    12 Lower
17     9 Lower
18    10 Lower
19    11 Lower
```

Note that the row names keep their original values.
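The result of na.omit() also remembers which rows were dropped, which can be handy for checking. The sketch below rebuilds the example data and then looks at the "na.action" attribute; the object name `hog_clean` is only for illustration.

```r
# Rebuild the stacked example data so this block is self-contained
hog3 <- data.frame(
  Upper = c(3, 4, 5, 9, 8, 10, 9),
  Mid   = c(4, 3, 7, 9, 11, NA, NA),
  Lower = c(11, 12, 9, 10, 11, NA, NA)
)
hog <- setNames(stack(hog3), c("count", "site"))

hog_clean <- na.omit(hog)

# The surviving rows keep their original row names
row.names(hog_clean)

# The positions of the dropped rows are stored in the "na.action" attribute
dropped <- attr(hog_clean, "na.action")
as.integer(dropped)
```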

Use row.names() to renumber rows if required


### Re-number row names

If you want to re-number the rows just use the row.names() command.

```
> hog = na.omit(hog)
> row.names(hog) = 1:length(hog$count)
> hog
   count  site
1      3 Upper
2      4 Upper
3      5 Upper
4      9 Upper
5      8 Upper
6     10 Upper
7      9 Upper
8      4   Mid
9      3   Mid
10     7   Mid
11     9   Mid
12    11   Mid
13    11 Lower
14    12 Lower
15     9 Lower
16    10 Lower
17    11 Lower
```

It's not really essential but it makes it easier to keep track of items if the numbers are sequential.
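A slightly shorter alternative, offered here as a sketch rather than as the one right way, is to set the row names to NULL. R then falls back to its automatic numbering, so you do not need to work out the number of rows yourself.

```r
# Rebuild the stacked example data so this block is self-contained
hog3 <- data.frame(
  Upper = c(3, 4, 5, 9, 8, 10, 9),
  Mid   = c(4, 3, 7, 9, 11, NA, NA),
  Lower = c(11, 12, 9, 10, 11, NA, NA)
)
hog <- setNames(stack(hog3), c("count", "site"))

# Drop the NA rows, then reset the row names to the default sequence
hog <- na.omit(hog)
row.names(hog) <- NULL

# The last rows are now numbered 15 to 17 rather than 17 to 19
tail(hog, 3)
```

If you prefer to assign the numbers explicitly, seq_len(nrow(hog)) gives the same sequence as 1:length(hog$count).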

The stack() command will not deal with factor variables


## Convert multiple predictors

You may have data in a more complicated arrangement, perhaps mirroring the on-ground layout. This can happen, for example, in a two-way ANOVA design. In Excel the data have to be set out in a particular way for the Analysis ToolPak to carry out the ANOVA. However, in R this layout is not helpful.

```
> wp
  Water vulgaris sativa
1    Lo        9      7
2    Lo       11      6
3    Lo        6      5
4   Mid       14     14
5   Mid       17     17
6   Mid       19     15
7    Hi       28     44
8    Hi       31     38
9    Hi       32     37
```

This time you have a column representing one of the predictor variables and the other columns in sample layout. If you try a stack() command you'll find that only the numeric columns are stacked.

```
> wps = stack(wp)
Warning message:
In stack.data.frame(wp) : non-vector columns will be ignored
> wps
   values      ind
1       9 vulgaris
2      11 vulgaris
3       6 vulgaris
4      14 vulgaris
5      17 vulgaris
6      19 vulgaris
7      28 vulgaris
8      31 vulgaris
9      32 vulgaris
10      7   sativa
11      6   sativa
12      5   sativa
13     14   sativa
14     17   sativa
15     15   sativa
16     44   sativa
17     38   sativa
18     37   sativa
```

This is not a problem because you can re-build the first predictor variable by duplicating the original and adding it to the new data.frame.
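If you would rather not rely on the warning, the `select` argument of stack() lets you name the columns to stack. The sketch below assumes, as in the example above, that Water is stored as a factor; the data frame is rebuilt by hand so the block runs on its own. Being explicit matters more in R 4.0.0 and later, where read.csv() and data.frame() keep text as character by default, and a character column counts as a vector that stack() would try to stack along with the numbers.

```r
# Rebuild the example data; Water is made a factor explicitly (an assumption
# that matches the warning shown in the article)
wp <- data.frame(
  Water    = factor(rep(c("Lo", "Mid", "Hi"), each = 3),
                    levels = c("Lo", "Mid", "Hi")),
  vulgaris = c(9, 11, 6, 14, 17, 19, 28, 31, 32),
  sativa   = c(7, 6, 5, 14, 17, 15, 44, 38, 37)
)

# Name the columns to stack, so the factor column is never considered
wps <- stack(wp, select = c(vulgaris, sativa))

# Excluding the unwanted column with a minus sign gives the same result
wps_alt <- stack(wp, select = -Water)

head(wps)
```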

Re-build a predictor variable by adding with cbind()

Repeat the variable for each of the original sample columns


### Replace a factor predictor variable

What you must do is take the original predictor variable and repeat it a number of times (equal to the number of sample columns).

```
> wps = cbind(wps, water = rep(wp$Water, 2))
> wps
   values      ind water
1       9 vulgaris    Lo
2      11 vulgaris    Lo
3       6 vulgaris    Lo
4      14 vulgaris   Mid
5      17 vulgaris   Mid
6      19 vulgaris   Mid
7      28 vulgaris    Hi
8      31 vulgaris    Hi
9      32 vulgaris    Hi
10      7   sativa    Lo
11      6   sativa    Lo
12      5   sativa    Lo
13     14   sativa   Mid
14     17   sativa   Mid
15     15   sativa   Mid
16     44   sativa    Hi
17     38   sativa    Hi
18     37   sativa    Hi
```

Note that you gave the new variable an explicit name. The other column names need replacing using the names() command:

```
> names(wps)[1:2] = c("height", "species")
> names(wps)
[1] "height"  "species" "water"
> head(wps)
  height  species water
1      9 vulgaris    Lo
2     11 vulgaris    Lo
3      6 vulgaris    Lo
4     14 vulgaris   Mid
5     17 vulgaris   Mid
6     19 vulgaris   Mid
```
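The repeat count of 2 does not have to be typed in by hand. As a sketch, you could keep the names of the sample columns in a vector (called `sample_cols` here purely for illustration) and use its length, so the same lines still work if more sample columns are added later.

```r
# Rebuild the example data so this block is self-contained
wp <- data.frame(
  Water    = factor(rep(c("Lo", "Mid", "Hi"), each = 3),
                    levels = c("Lo", "Mid", "Hi")),
  vulgaris = c(9, 11, 6, 14, 17, 19, 28, 31, 32),
  sativa   = c(7, 6, 5, 14, 17, 15, 44, 38, 37)
)

# Hypothetical helper vector naming the sample columns
sample_cols <- c("vulgaris", "sativa")

# Stack only the sample columns, then repeat Water once per stacked column
wps <- stack(wp[sample_cols])
wps$water <- rep(wp$Water, times = length(sample_cols))
names(wps)[1:2] <- c("height", "species")

# Each species-by-water combination should hold three observations
table(wps$species, wps$water)
```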

Now you have a more sensible layout that can be used for aov() and graphical commands.
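To give a flavour of what the recording layout allows, here is a minimal sketch of a two-way analysis of variance on the rearranged data. The model formula (height against species, water and their interaction) is my assumption about what you might want to fit, and the data are rebuilt by hand so the block runs on its own.

```r
# Rebuild the data in recording layout so this block is self-contained
wp <- data.frame(
  Water    = factor(rep(c("Lo", "Mid", "Hi"), each = 3),
                    levels = c("Lo", "Mid", "Hi")),
  vulgaris = c(9, 11, 6, 14, 17, 19, 28, 31, 32),
  sativa   = c(7, 6, 5, 14, 17, 15, 44, 38, 37)
)
wps <- stack(wp[c("vulgaris", "sativa")])
wps$water <- rep(wp$Water, times = 2)
names(wps)[1:2] <- c("height", "species")

# Two-way ANOVA with interaction; each variable is simply named in the formula
fit <- aov(height ~ species * water, data = wps)
summary(fit)
```

The same formula style works for many graphical commands, which is one reason the recording layout pays off.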

This approach will not work on every dataset that you get but it will take you a long way forward.
