4 Tidy Format

Collect data in a simple, consistent format called tidy data,[1] so that minimal effort is required to clean the data once you get to the analysis. In tidy data, rows represent observations and columns represent the variables measured for those observations.

The basic principle of tidy data. This data set has 5 observations of 4 variables.

Good example

The iris data set (preinstalled with R) is in tidy format:

The first 5 rows of the `iris` data set. Each row is a flower and each column a property.
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
...	...	...	...	...

Here, the rows each represent one observation (a distinct flower), and the columns represent the variables measured/recorded (physical dimensions and species).
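As a quick check of this layout, the following minimal sketch inspects `iris` directly in R. It only uses base R functions; the names `n_obs` and `n_vars` are introduced here for illustration.

```r
# iris ships with base R (package "datasets"), so nothing needs to be installed
n_obs  <- nrow(iris)   # one row per observation (flower)
n_vars <- ncol(iris)   # one column per variable

n_obs          # 150
n_vars         # 5
names(iris)    # the variable names shown in the table header
```

If the number of rows matches the number of things you measured, and every column answers "what was measured?", the data is most likely in tidy format.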

Bad example

A common deviation from tidy format is to represent groups as columns:

Systolic blood pressure (SBP) of 10 individuals. (untidy)
women	men
114	123
121	117
125	117
108	117
122	116

Do you have different groups? Time points? Replicates of the experiment? Then try to adhere to the same principle: Columns are variables. Simply add a variable that indicates which group/time point/replicate this observation belongs to:

Sex and systolic blood pressure (SBP) of 10 individuals. (tidy)
sex	SBP
female	114
female	121
female	125
female	108
female	122
male	123
male	117
male	117
male	117
male	116
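One practical benefit of the tidy layout is that the grouping variable can be used directly in R's formula interface. The sketch below rebuilds the tidy table above and computes the mean SBP per sex; the object names `Tidy` and `group_means` are chosen here for illustration, and `aggregate()` is just one of several base R functions that accept such a formula.

```r
# The tidy table from above, typed in by hand
Tidy <- data.frame(
  sex = rep(c("female", "male"), each = 5),
  SBP = c(114, 121, 125, 108, 122, 123, 117, 117, 117, 116)
)

# "SBP by sex": the grouping variable is simply another column
group_means <- aggregate(SBP ~ sex, data = Tidy, FUN = mean)
group_means
```

With the untidy layout, the same summary would require handling each group column separately.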

Converting to tidy format

If you have data split by group/time point/replicate, here is how you can convert it to tidy format:

# The untidy data set
Untidy <- data.frame(
  women = c(114, 121, 125, 108, 122),
  men   = c(123, 117, 117, 117, 116)
)
Untidy

  women men
1   114 123
2   121 117
3   125 117
4   108 117
5   122 116

# Convert by hand
Tidy <- data.frame(
  sex = rep(c("female", "male"), each = nrow(Untidy)),
  SBP = c(Untidy$women, Untidy$men)
)

# Convert using a package (install if missing)
library("reshape2")
melt(Untidy)

   variable value
1     women   114
2     women   121
3     women   125
4     women   108
5     women   122
6       men   123
7       men   117
8       men   117
9       men   117
10      men   116

For a simple data set like the one shown here, converting with the reshape2 package is easiest. Converting by hand is slightly more work, but I prefer it because you can easily see what is going on and add more variables if needed.
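To illustrate the point about adding variables, here is a minimal sketch of the by-hand conversion with one extra column. The `id` variable is hypothetical (the original data has no identifiers) and simply numbers the individuals in the order they appear.

```r
# The untidy data set from above
Untidy <- data.frame(
  women = c(114, 121, 125, 108, 122),
  men   = c(123, 117, 117, 117, 116)
)

n <- nrow(Untidy)  # number of individuals per group

# Convert by hand, adding a running identifier (hypothetical, for illustration)
Tidy <- data.frame(
  id  = seq_len(2 * n),
  sex = rep(c("female", "male"), each = n),
  SBP = c(Untidy$women, Untidy$men)
)
Tidy
```

Each new variable is one more line inside `data.frame()`, as long as its values are in the same order as the stacked SBP values.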