Data frames are one of the most useful ways of organizing and storing data in R, and they are the format we will probably use most often. A data frame can be thought of like a spreadsheet. The data is arranged in rows and columns, where each row is a set of related data points (measurements from an individual, for example), and the columns are the different types of data that we collected (height, weight, eye color, etc.). This format makes it easy to keep all related data together, while making it convenient to select subsets of the data for later analysis.
If you already have some data stored as vectors, you can put them together into a data frame using the `data.frame()` command. This will create a table with the names of the vectors as the column names. If you want to specify the column names as something different from the vector names, you can do that within the call to `data.frame()`, using single equal signs: the column name you want on the left, the data you want it to contain on the right.
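A minimal sketch of both forms (the fruit vectors here are made up for illustration, echoing the fruit example referred to later):

```r
# Some example vectors describing a few fruits (hypothetical data)
fruit_name  <- c("apple", "banana", "cherry", "kiwi")
fruit_color <- c("red", "yellow", "red", "green")
weight_g    <- c(150, 120, 8, 75)

# The vector names become the column names
fruit <- data.frame(fruit_name, fruit_color, weight_g)

# Or choose your own column names with single equal signs
fruit <- data.frame(name = fruit_name,
                    color = fruit_color,
                    weight = weight_g)
fruit
```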
You can also look at the data frame in RStudio by clicking on it in the “Workspace” tab (it will be listed under “Data”). A tab will open in the upper left pane with the contents displayed in a neat table. Note that you cannot edit the data there, only view it.
There are a number of ways to select subsets of data from a data frame. The first is to use the selection brackets, just as we did for vectors. The only difference is that we are now dealing with two dimensional data, so we have to specify both which row(s) and which column(s) we want, separated by a comma (rows first, then columns). If you want all rows or columns, you can leave the space before or after the comma, respectively, blank. For the columns, you can also give a vector of the column names that you want to select.
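A few examples of what that looks like, again using the hypothetical `fruit` data frame from above:

```r
fruit[2, 3]                     # row 2, column 3
fruit[1:2, ]                    # rows 1 and 2, all columns
fruit[, "color"]                # all rows, only the color column
fruit[3, c("name", "weight")]   # row 3, two columns selected by name
```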
Often we want just a single column from the data frame, so there is a nice shorthand for that: the data frame followed by `$` and the column name you want:
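For instance (still using the made-up `fruit` data frame):

```r
# Extract one column as a plain vector
fruit$weight
mean(fruit$weight)
```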
Another thing that we commonly want to do is to select rows based on some of the data in the data frame. We could do this with brackets and the dollar sign operator, but it can start to get unwieldy, especially if you want to select on more than one aspect of the data at the same time (and I can’t tell you how many times I have gotten into trouble for forgetting the comma). Just for illustration, I am going to show what that kind of selection looks like with the brackets first.
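A sketch of that approach, again with the hypothetical `fruit` data frame standing in for the real example:

```r
# Rows where the fruit is red AND weighs more than 100 g
fruit[fruit$color == "red" & fruit$weight > 100, ]
```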
Luckily, there is a much more convenient way of selecting rows in a situation like this: the `subset()` command. The first argument to `subset()` is the data frame we want to select from, and the second argument is the condition that we want the selected rows to satisfy. What is especially convenient here is that we don’t have to retype the name of the data frame every time we want to use a different column; the column names alone are sufficient.
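Something like the following (same hypothetical data):

```r
# Same selection as above, but with subset(); note the bare column names
subset(fruit, color == "red" & weight > 100)
```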
The `subset()` command can also be used to select particular columns for the output, with the `select` argument:
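For example, keeping only the name and weight columns of the matching rows:

```r
subset(fruit, color == "red" & weight > 100, select = c(name, weight))
```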
There is at least one other way to work with data frames, which is found in the `dplyr` package (and, more generally, packages in the so-called “tidyverse”). This package is optimized for large data sets and for ease of use, and it is worth a mention partly because it is what I have mostly switched over to for my own work. You can find much more about it at the RStudio site, in particular the Data Wrangling cheatsheet, and at the website for the Tidyverse. Note that `dplyr` uses a variant of data frames called “tibbles”, which you can create with `tibble()` instead of `data.frame()`. These two forms are mostly interchangeable, but have different defaults, as discussed in part below.
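As a quick sketch of the tibble version of the earlier example (reusing the made-up fruit vectors from above, and assuming the tidyverse packages are installed):

```r
library(tibble)   # part of the tidyverse; also loaded by library(tidyverse)

fruit_tbl <- tibble(name = fruit_name,
                    color = fruit_color,
                    weight = weight_g)
fruit_tbl
```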
To do the same selection as above, we use two different commands: `filter()` to select rows based on criteria, and `select()` to choose particular columns. Note that with `filter()` we can set criteria in separate arguments (separated by commas), rather than having to use the `&` symbol. Similarly, I don’t need to use `c()` for the column names.
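A sketch of that, with the same hypothetical fruit data:

```r
library(dplyr)

# Rows matching both conditions; the comma acts like &
filter(fruit, color == "red", weight > 100)

# Particular columns, named without c()
select(fruit, name, weight)
```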
Note that you will only need to include the `library(dplyr)` line once per session (or once per file). Including it more often is not a problem, but it is not necessary either. If you do not have the `dplyr` package installed, you can install it with `install.packages("dplyr")` (or install all the tidyverse packages with `install.packages("tidyverse")`), but you should only need to do this once per computer.
If you want to get really fancy, you can take advantage of the “piping” feature of `dplyr` (which actually comes from a package called `magrittr`: Ceci n’est pas une pipe.). The way this works is that the `%>%` symbol puts whatever is to its left into the first argument of the function on its right (which you can then omit), allowing you to save typing, and also saving you the hassle of intermediate variables. So the command above could be rewritten as follows:
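Roughly along these lines (same hypothetical data):

```r
# Pipe the data through filter() and select() without naming it each time
fruit %>%
  filter(color == "red", weight > 100) %>%
  select(name, weight)
```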
You may have noticed that while we put a vector of strings into our data frame for fruit names and colors, what came out was not a vector of strings, but a factor. This is sometimes what you want, but not always. If you want to keep strings as strings, you can add one more argument to `data.frame()` after you specify all of the columns: `stringsAsFactors = FALSE`. If you want to, you can then convert individual columns to factors as I have done below, or you could create the data frame with explicitly described factors using `factor()`.
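A sketch of what that might look like (again with the hypothetical fruit vectors):

```r
# Keep character columns as characters
fruit <- data.frame(name = fruit_name,
                    color = fruit_color,
                    weight = weight_g,
                    stringsAsFactors = FALSE)

# Convert just one column to a factor afterwards, if desired
fruit$color <- factor(fruit$color)
```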
Alternatively, you can use a tibble, a variant of a data frame with a few nice properties, one of which is that it does not use factors by default, and in general tries to avoid modifying data as much as possible.
You can join two data frames with the same kinds of columns together using `rbind()` (row bind), and you can add columns (or data frames with the same number of rows) with `cbind()` (column bind), or by naming a new column that doesn’t yet exist.
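For instance, continuing the hypothetical fruit example (the `tasty` and `price` columns are invented for illustration):

```r
# Add a row: the new data frame needs the same columns
more_fruit <- data.frame(name = "plum", color = "purple", weight = 60,
                         stringsAsFactors = FALSE)
fruit <- rbind(fruit, more_fruit)

# Add columns with cbind(), or by naming a column that doesn't exist yet
fruit <- cbind(fruit, tasty = c(TRUE, TRUE, TRUE, FALSE, TRUE))
fruit$price <- c(0.50, 0.25, 0.10, 0.75, 0.60)
```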
If you want to do this with tibbles, the commands are a bit different (`bind_rows()` and `bind_cols()`), but the ideas are the same. Note that `bind_cols()` requires the vector used for the new column to have a name; it does not automatically assign one.
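A sketch of the tibble equivalents (assuming the `fruit_tbl` tibble from earlier; depending on your dplyr version, you may need to wrap the new vector in `tibble()` rather than passing it as a named argument):

```r
library(dplyr)

# bind_rows() matches columns by name
fruit_tbl <- bind_rows(fruit_tbl,
                       tibble(name = "plum", color = "purple", weight = 60))

# bind_cols() needs the new column to be named
fruit_tbl <- bind_cols(fruit_tbl, tasty = c(TRUE, TRUE, TRUE, FALSE, TRUE))
```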
There is also a function from `dplyr` that is handy here: `mutate()`, which is nice for creating new variables that depend on others (or modifying existing columns, though that can be dangerous):
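For example (still hypothetical; `weight_kg` is just an illustrative new variable):

```r
# Create a new column computed from an existing one
fruit_tbl <- mutate(fruit_tbl, weight_kg = weight / 1000)
```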
Once you have your data in a data frame, it is time to start characterizing and describing it. There are a number of special functions you can use to make all of this easier, and I will go over some of those now. But first, we need some data to work with. The data we will use this time is measurements from rock crabs of the species Leptograpsus variegatus which were collected in Western Australia. The original data is from:
Campbell, N.A. and Mahon, R.J. (1974) A multivariate study of variation in two species of rock crab of genus Leptograpsus. Australian Journal of Zoology 22, 417–425.
but I actually got the data from a book on S, the predecessor to R:
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth edition. Springer.
A file with the data can be downloaded at the following link: crabs.csv. Put it into the project folder you are currently using; you can then load the data with `read_csv()` and have a look at it with the `str()` command. (Note that I am using `read_csv()`, from the `readr` package, rather than the standard `read.csv()`, because it does not automatically make strings into factors. If you want to stick with `read.csv()`, that would be fine, but you will have some slight differences in your data frames.)
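Something like:

```r
library(readr)

# Read the csv file from the project folder, then inspect its structure
crabs <- read_csv("crabs.csv")
str(crabs)
```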
The `str()` command tells us the structure of data in a variable, and in this case it is telling us that `crabs` is a tibble, or `tbl_df` (which is also a kind of `data.frame`), with 200 rows (obs.) and 8 variables (columns), but the column names are a bit cryptic. The meaning of each column name is shown below:
Variable | Description
---|---
sp | species: “B” or “O” for blue or orange
sex | “M” or “F” for male or female
index | index 1:50 within each of the four groups
FL | frontal lobe size (mm)
RW | rear width (mm)
CL | carapace length (mm)
CW | carapace width (mm)
BD | body depth (mm)
You can get a very nice quick summary of the data overall using the function `summary()`:
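That is, something along the lines of:

```r
# Column-by-column summary of the whole data set
summary(crabs)
```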
All that is nice, but it doesn’t really tell us too much, since what we really might want to know about this data is how the different kinds of crabs compare to each other. We have males and females, blue and orange crabs, so we should see if we can look at just one kind at a time. Let’s look at the blue females first; we can select rows from the data frame by testing which rows have `sp == "B"` and `sex == "F"`. Notice the double equals sign. This is the test for equality, as distinct from the single equal sign that you can use for assigning a value to a variable or function argument. Then we will calculate the mean and standard deviation of frontal lobe size (`FL`) for the female blue crabs.
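A sketch of that calculation (the name `blue_females` is just a choice for these examples):

```r
# Select the blue (sp == "B") females (sex == "F")
blue_females <- subset(crabs, sp == "B" & sex == "F")

# Mean and standard deviation of frontal lobe size for this group
mean(blue_females$FL)
sd(blue_females$FL)
```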
If you were trying to put all the crabs in a storage cage that had a hole size of 25 mm, you might expect that any crabs with a carapace length (CL) smaller than the holes would be able to escape (since they move sideways).
a. Create a histogram showing the size distribution of the crabs that you would expect to stay in the cage (measured by carapace length). Be sure to label your plot completely, including the total number of crabs that remain.
b. What proportion of crabs remaining in your cage would be female? What proportion would be orange?
c. What is the median body depth of the female, blue crabs that you would expect to escape?
Doing these calculations separately for each possible grouping of variables can be a bit tiresome, and if you wanted to calculate a statistic of the measurement variables other than the ones that `summary()` gave us, you would start to get a bit annoyed with typing the same thing over and over. Since this is an extremely common task, R has a variety of ways to help you do repetitive calculations like this more efficiently. The built-in functions are those in the “apply” family, so named because they allow you to apply any function to multiple subsets of your data at the same time. For example, you might want to calculate the median of every column of a data frame, or the mean of some measurement for each species of crab. Unfortunately, the built-in versions of these functions (e.g. `apply()`, `sapply()`, `lapply()`, `tapply()`) are a bit quirky, so I almost never use them anymore, though you should feel free to explore them on your own. Instead, I use a set of replacement functions written by a statistician named Hadley Wickham, who also wrote the graphics package that I use most: `ggplot2`. We will come back to `ggplot2`, but for now let’s focus on the data manipulation functions that are part of his `dplyr` package. (There is also a previous version with similar goals called `plyr`, but `dplyr` is much faster and a bit simpler in some ways. You may see me use both at times, but I’m trying to convert over to `dplyr` full time.)
Below is a brief introduction to working with `dplyr`; I highly recommend you check out the more complete description available at the dplyr website.
`group_by()` and `summarize()`
Some of the most common functions we will use are `group_by()` and `summarize()`, which do just what they say. `group_by()` divides a data frame into subgroups based on some condition, and `summarize()` (or `summarise()` if you are more comfortable with that) calculates statistics based on the data in those subgroups, returning the results as a new data frame, with one row per group. A simple example of their use is to find out how many observations (rows) are in each subset of the data, taking advantage of the function `n()`, which is also part of `dplyr`. (`n()` is largely equivalent to the base function `nrow()`, which will tell you how many rows there are in a data frame, but it works with grouped data.)
So the steps are these: first, divide up the data with `group_by()`. To do this you give the data frame as the first argument (this will become a pattern), then the remaining arguments are the names of the columns that you want to divide the data by. You don’t need to put them in quotes.
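For the crab data, grouping by species and sex might look like this (saving the result as `crabs_grouped`, a name chosen just for these examples):

```r
library(dplyr)

# Split the crabs into groups by species and sex
crabs_grouped <- group_by(crabs, sp, sex)
```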
Next, you apply your function to the grouped data with `summarize()`. The first argument is the grouped data frame, and the rest are the summary statistics you wish to calculate. In this case, we will just use `n()` to give us a count of the number of rows. (Normally I would save the output, but I’m not going to in this case.)
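Along these lines:

```r
# One row per group, with the number of observations in each
summarize(crabs_grouped, n())
```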
As you can see, this makes a new tibble with the variables you split by in the first two columns, and the result of the calculation in the third. The name of that third column is a bit nasty, but we can actually provide a better name quite easily, by ‘naming’ the argument, just as we did with the data frames before:
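For example, calling the new column `count` (the name here is just a choice for illustration):

```r
# Give the count column a readable name
summarize(crabs_grouped, count = n())
```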
Counting rows is not exactly the most useful thing we could do with this data. What we really wanted to do was to calculate statistics on subsets of the data. If we wanted to calculate the mean of `FL` and the minimum of `RW` for the grouped crab data set, we could do that as follows.
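A sketch of that calculation (the result column names `mean_FL` and `min_RW` are just illustrative choices):

```r
# Mean frontal lobe size and minimum rear width for each species/sex group
summarize(crabs_grouped,
          mean_FL = mean(FL),
          min_RW = min(RW))
```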
The functions that you pass in to summarize don’t have to be as simple as the ones I just showed; you could calculate the 80% quantile of the difference between the square root of the carapace width and frontal lobe cubed, though I doubt you would want to. The only limitation is that each of the functions should return a single value, or you will get an error.
a. Calculate the mean, variance, and standard error for each of carapace length, carapace width, and the difference between width and length, for each of the species/sex combinations.
b. Which species tends to be larger (by these measures)? Which sex?
c. What can you tell about the relationship between carapace length and carapace width by comparing the variances of each of those quantities to the variance of their difference?
Once we have our data arranged nicely in a data frame, it is easy to use it in plots, and to take advantage of some of the fancier features in the `ggplot2` package that I mentioned earlier. In particular, we can take advantage of “faceting”, the ability to make multiple small plots with the same axes, which makes comparison across groups easier. I’ll just present some examples here to give you a bit of inspiration, and as a preview for next week.
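One possible example along those lines (the exact plots shown originally may differ; this is just a sketch using the crab measurements):

```r
library(ggplot2)

# Frontal lobe size vs. carapace length, one panel per species/sex combination
ggplot(crabs, aes(x = CL, y = FL)) +
  geom_point() +
  facet_grid(sex ~ sp) +
  labs(x = "Carapace length (mm)", y = "Frontal lobe size (mm)")
```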