4 Tidy Format
Collect data in a simple, consistent format called tidy data,[1] so that minimal effort is required to clean the data once you get to the analysis. In tidy data, rows represent observations and columns represent the variables measured for those observations.
Good example
The iris data set (preinstalled with R) is in tidy format:
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
... | ... | ... | ... | ... |
Here, the rows each represent one observation (a distinct flower), and the columns represent the variables measured/recorded (physical dimensions and species).
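To check this structure for yourself, a quick look at the dimensions and column types is usually enough. The following is only a small sketch using base R functions on the built-in iris data:

```r
# Number of rows (observations) and columns (variables)
dim(iris)

# One line per variable: its name, type, and first few values
str(iris)

# The first five observations, one flower per row
head(iris, 5)
```

Each measured quantity has its own column, and the group (Species) is stored as a variable rather than spread over several columns.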
Bad example
A common deviation from tidy format is to represent groups as columns:
women | men |
---|---|
114 | 123 |
121 | 117 |
125 | 117 |
108 | 117 |
122 | 116 |
Do you have different groups? Time points? Replicates of the experiment? Then try to adhere to the same principle: Columns are variables. Simply add a variable that indicates which group/time point/replicate this observation belongs to:
sex | SBP |
---|---|
female | 114 |
female | 121 |
female | 125 |
female | 108 |
female | 122 |
male | 123 |
male | 117 |
male | 117 |
male | 117 |
male | 116 |
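One reason this layout pays off at analysis time is that many R functions accept a formula of the form outcome ~ group. As a minimal sketch (the data frame name Tidy and the choice of functions are only illustrative, not a prescribed workflow), the table above can be summarised and tested directly:

```r
# Tidy version of the table above: one row per observation
Tidy <- data.frame(
  sex = rep(c("female", "male"), each = 5),
  SBP = c(114, 121, 125, 108, 122, 123, 117, 117, 117, 116)
)

# Mean SBP per group, using the formula interface
group_means <- aggregate(SBP ~ sex, data = Tidy, FUN = mean)
group_means

# The same formula can be passed on to a test (Welch two-sample t-test by default)
t.test(SBP ~ sex, data = Tidy)
```

With the groups stored as separate columns, each of these steps would first require reshaping the data.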
Converting to tidy format
If you have data split by group/time point/replicate, here is how you can convert it to tidy format:
# The untidy data set
Untidy
women men
1 114 123
2 121 117
3 125 117
4 108 117
5 122 116
# Convert by hand
Tidy <- data.frame(
  sex = rep(c("female", "male"), each = nrow(Untidy)),
  SBP = c(Untidy$women, Untidy$men)
)
# Convert using a package (install if missing)
library("reshape2")
melt(Untidy)
variable value
1 women 114
2 women 121
3 women 125
4 women 108
5 women 122
6 men 123
7 men 117
8 men 117
9 men 117
10 men 116
If you have a simple data set like the one shown here, converting with the package reshape2 is easiest. Converting by hand may be slightly more work, but I prefer it because you can easily see what’s going on, add more variables if needed, etc.
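To illustrate the point about adding more variables, here is a small sketch of the by-hand conversion with one extra column. The id column is hypothetical: it assumes every value comes from a different person, which the example data do not actually state. The sketch also shows stack(), a base R function that gives a similar result to melt() without an extra package.

```r
# The untidy data set from above, recreated so the example runs on its own
Untidy <- data.frame(
  women = c(114, 121, 125, 108, 122),
  men   = c(123, 117, 117, 117, 116)
)

n <- nrow(Untidy)

# Convert by hand and add a further variable in the same step
Tidy <- data.frame(
  id  = seq_len(2 * n),  # hypothetical identifier, one per observation
  sex = rep(c("female", "male"), each = n),
  SBP = c(Untidy$women, Untidy$men)
)

# Base R alternative: stack() returns the columns "values" and "ind"
Stacked <- stack(Untidy)
names(Stacked) <- c("SBP", "group")
```

Both results hold the same ten values in the same order; the by-hand version simply makes the coding of each variable explicit.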