Day 4 Reading & Manipulating Data

An introduction to storing and converting data using the tidy format.

Total video length: 4 min.
Exercise length: 30–90 min.

4.1 Reading Data

There are many ways to read data into R. A simple method is explained in the video below:

4.2 What Is Tidy Data?

A standardized way to store data is by having rows represent observations and columns represent variables.

For example, take a look at the first few rows of the iris data set:

Table 4.1: Tidy data. Every row is a single flower. Every column is a single variable.
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
5.1	3.5	1.4	0.2	setosa
4.9	3.0	1.4	0.2	setosa
4.7	3.2	1.3	0.2	setosa
4.6	3.1	1.5	0.2	setosa
5.0	3.6	1.4	0.2	setosa
5.4	3.9	1.7	0.4	setosa

4.3 What Is Untidy Data?

Sometimes it is tempting to use a different format. For example, here are the first few rows of a fictional qPCR data set performed in triplicates:

Table 4.2: Untidy data. A single row contains three different experimental observations.
Gene	Treatment	Ct value in exp. 1	Ct value in exp. 2	Ct value in exp. 3
A	control	22.8	16.0	17.2
B	control	22.2	15.7	19.5
C	control	23.0	20.4	23.7
D	control	14.5	18.9	18.9
E	control	17.6	22.2	17.5
A	ADH-/-	20.8	15.6	20.5

This may look like a more intuitive way to store your data, but most statistical software will not know what to do with it. Every row should contain a single gene’s C$_t$-value from a single replicate. You can view the correct way to store it here.

Storing data in tidy format will greatly simplify analysis and almost everything we use in the GRS course expects data to be in this format.

4.4 Exercise (easy)

Watch the video on how to read data in R;
Download this Excel file;
Read the data into R as explained in step 1;
Convert the data set to tidy format using R code.

There are different ways to complete this exercise. For example you could use the dollar sign ($) or square brackets ([ , ]) to indicate which column you want.

Run a $t$-test to compare male and female height. See the help page for t.test. If you don’t remember which $t$-test to use when, you may use the default;
Report your conclusion at a significance level of $0.05$.

HINT: The following are useful functions you might use for this: c (combine/concatenate), rep (repeat/replicate), rbind (rowbind), cbind (column bind). Use the help files, or Google if needed to find out how they work.

4.5 Exercise (medium)

The following Google Sheets file contains shows three versions of the same data set. The tab “tidy” is how I wish people sent me data. The tab “untidy” is what students often send me after at least trying to put it in the right format. The tab “headache” is closer to what people send me for consultancy.

Open the Google Sheets file;
Download the “untidy” tab as a .csv file and read it into R;
Convert the file to tidy format;
Compare the resulting data frame with the “tidy” tab, to see if you got it right.

HINT: The following are useful functions: c (combine/concatenate), rep (repeat/replicate), rbind (rowbind), cbind (column bind). Use the help files, or Google if needed to find out how they work.

4.6 Exercise (hard)

These two toy examples are very small data sets, so you might argue that it would be simpler to just fix them in Excel (I wouldn’t disagree). However, if these were several dozen, or even hundreds of genes, then copy-pasting is dangerous, error-prone business.

Open the Google Sheets file;
Download the “headache” tab as a .csv file and read it into R;
Convert the file to tidy format;
Compare the resulting data frame with the “tidy” tab, to see if you got it right;
Using the internet and any other resource you like, try plotting the different gene/treatment combinations in a single plot;
Bonus points if you can complete this exercise without developing a headache.