4  Tidy Format

Collect data in a simple, consistent format called tidy data,[1] so that minimal effort is required to clean the data once you get to the analysis. Rows represent observations; columns represent the variables measured for those observations.

The basic principle of tidy data. This data set has 5 observations of 4 variables.

Good example

The iris data set (preinstalled with R) is in tidy format:

The first 5 rows of the iris data set. Each row is a flower and each column a property.
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
         5.1          3.5           1.4          0.2  setosa
         4.9          3.0           1.4          0.2  setosa
         4.7          3.2           1.3          0.2  setosa
         4.6          3.1           1.5          0.2  setosa
         5.0          3.6           1.4          0.2  setosa
         ...          ...           ...          ...  ...

Here, the rows each represent one observation (a distinct flower), and the columns represent the variables measured/recorded (physical dimensions and species).
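
To make this concrete, here is a minimal sketch of how the tidy layout of iris can be inspected and used. It relies only on base R; the variable name species_means is introduced for illustration and is not part of the original text.

```r
# iris ships with base R (package "datasets"), so nothing needs installing
head(iris, 5)   # the first 5 rows: one flower per row
dim(iris)       # 150 observations (rows), 5 variables (columns)

# Because Species is an ordinary column, a per-group summary
# needs no reshaping: just name the grouping variable in the formula
species_means <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
species_means
```

The formula interface (measurement ~ grouping variable) works directly here because every variable sits in its own column.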

Bad example

A common deviation from tidy format is to represent groups as columns:

Systolic blood pressure (SBP) of 10 individuals. (untidy)
women men
114 123
121 117
125 117
108 117
122 116

Do you have different groups? Time points? Replicates of the experiment? Then try to adhere to the same principle: Columns are variables. Simply add a variable that indicates which group/time point/replicate this observation belongs to:

Sex and systolic blood pressure (SBP) of 10 individuals. (tidy)
sex SBP
female 114
female 121
female 125
female 108
female 122
male 123
male 117
male 117
male 117
male 116
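
One reason the tidy layout pays off: with the group stored as a variable, most R functions can use it directly. The following is a rough sketch using the ten values from the table above; the object names Tidy and group_means are chosen here for illustration.

```r
# The tidy SBP data: one row per individual, one column per variable
Tidy <- data.frame(
  sex = rep(c("female", "male"), each = 5),
  SBP = c(114, 121, 125, 108, 122, 123, 117, 117, 117, 116)
)

# Group-wise summaries refer to the grouping variable by name
group_means <- tapply(Tidy$SBP, Tidy$sex, mean)
group_means   # both groups happen to average 118

# The same formula style is accepted by many modelling and plotting functions
aggregate(SBP ~ sex, data = Tidy, FUN = mean)
```

With the untidy layout, each of these calls would first require stacking the women and men columns.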

Converting to tidy format

If you have data split by group/time point/replicate, here is how you can convert it to tidy format:

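# The untidy data set (one column per group)
Untidy <- data.frame(
  women = c(114, 121, 125, 108, 122),
  men   = c(123, 117, 117, 117, 116)
)
Untidy
#>   women men
#> 1   114 123
#> 2   121 117
#> 3   125 117
#> 4   108 117
#> 5   122 116

# Convert by hand: repeat each group label once per row,
# then stack the two measurement columns in the same order
Tidy <- data.frame(
  sex = rep(c("female", "male"), each = nrow(Untidy)),
  SBP = c(Untidy$women, Untidy$men)
)

# Convert using a package (install if missing)
# install.packages("reshape2")
library("reshape2")
melt(Untidy)
#>    variable value
#> 1     women   114
#> 2     women   121
#> 3     women   125
#> 4     women   108
#> 5     women   122
#> 6       men   123
#> 7       men   117
#> 8       men   117
#> 9       men   117
#> 10      men   116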

If you have a simple data set like the one shown here, converting with the package reshape2 is easiest. Converting by hand may be slightly more work, but I prefer it because you can easily see what’s going on, add more variables if needed, etc.
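
As an illustration of adding more variables during a by-hand conversion, the sketch below attaches an identifier column. The id column and the names n and Stacked are assumptions made for this example, not part of the original data. It also shows stack(), a base R function that reshapes without an extra package.

```r
# The untidy data from the text: one column per group
Untidy <- data.frame(
  women = c(114, 121, 125, 108, 122),
  men   = c(123, 117, 117, 117, 116)
)

n <- nrow(Untidy)

# By hand, with an extra variable added in the same step
# (id is a hypothetical running number, not present in the source data)
Tidy <- data.frame(
  id  = seq_len(2 * n),
  sex = rep(c("female", "male"), each = n),
  SBP = c(Untidy$women, Untidy$men)
)

# Base R alternative: stack() returns the columns "values" and "ind",
# which can be renamed to match the tidy table
Stacked <- stack(Untidy)
names(Stacked) <- c("SBP", "sex")
```

Note that stack(), like melt(), keeps the original column names (women, men) as group labels, whereas the by-hand version lets you choose the labels yourself.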