Manipulating data in R
Data transformation
Introduction
In the vast majority of your data science projects, you will get the data in a form not ready for analyses. All research projects include cleaning of the data file, creating new variables, redefining existing variables, aggregating/summarising data and so on. The task of carrying out these changes is referred to transformation of data. You will notice that 80% of your time will be spent on data transformation and the rest is spent of visualizing, describing the data and building statistical or machine learning models.
In this chapter you will learn how to transform your data using the dplyr package. This package has revolutionzed how data is transformed in R. Before dplyr, people basically hated data transformation in R. Dplyr made it easy, fast and beautyful. The dplyr package is included in the tidyverse, which you install and load like so:
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0.9000 ✔ purrr 0.3.0
## ✔ tibble 2.0.1 ✔ dplyr 0.8.0.1
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
Note that loading tidyverse results in loading of 4 packages: ggplot2, tibble, tidyr and readr. Also note that loading the tidyverse results in conflicts, which R informs you about. There are two conflicts:
- the dplyr function
filter()
will overwrite (mask) thefilter()
function in the stats package. - the dplyr function
lag()
will overwrite (mask) thelag()
function in the stats package.
The reason for this is simple: dplyr includes functions (filter, lag) with the same names as functions in the stats package, and loading dplyr will overwrite the functions in the stats package. If you still want to use the filter or lag functions in the stats package, after loading dplyr, you’ll have to use their full names: i.e stats::filter()
and stats::lag()
.
Prerequisites
In this chapter we’re going to focus on how to use the dplyr package. We’ll illustrate the key ideas using data from the survival package, namely the colon dataset. These are data from a trial of chemotherapy for colon cancer. Here is a list of the variables in the colon dataset:
id: id
study: 1 for all patients
rx: Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU
sex: 1=male
age: in years
obstruct: obstruction of colon by tumour
perfor: perforation of colon
adhere: adherence to nearby organs
nodes: number of lymph nodes with detectable cancer
time: days until event or censoring
status: censoring status
differ: differentiation of tumour (1=well, 2=moderate, 3=poor)
extent: Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg: time from surgery to registration (0=short, 1=long)
node4: more than 4 positive lymph nodes
etype: event type: 1=recurrence,2=death
Let’s load the survival package and then the colon dataset.
# Load the survival package
library(survival)
# Load the colon cancer dataset
data(colon)
# See what type of class the colon object is
class(colon)
## [1] "data.frame"
As evident, R interprets colon as a dataframe. Let’s use the head()
function to print the first 10 rows of the dataframe:
head(colon, 10)
## id study rx sex age obstruct perfor adhere nodes status differ
## 1 1 1 Lev+5FU 1 43 0 0 0 5 1 2
## 2 1 1 Lev+5FU 1 43 0 0 0 5 1 2
## 3 2 1 Lev+5FU 1 63 0 0 0 1 0 2
## 4 2 1 Lev+5FU 1 63 0 0 0 1 0 2
## 5 3 1 Obs 0 71 0 0 1 7 1 2
## 6 3 1 Obs 0 71 0 0 1 7 1 2
## 7 4 1 Lev+5FU 0 66 1 0 0 6 1 2
## 8 4 1 Lev+5FU 0 66 1 0 0 6 1 2
## 9 5 1 Obs 1 69 0 0 0 22 1 2
## 10 5 1 Obs 1 69 0 0 0 22 1 2
## extent surg node4 time etype
## 1 3 0 1 1521 2
## 2 3 0 1 968 1
## 3 3 0 0 3087 2
## 4 3 0 0 3087 1
## 5 2 0 1 963 2
## 6 2 0 1 542 1
## 7 3 1 1 293 2
## 8 3 1 1 245 1
## 9 3 1 1 659 2
## 10 3 1 1 523 1
R prints the first 10 rows and all columns (variables). It’s not very neat since there are more columns than the width fits, so five columns are printed on a second row. To see the whole dataset, you can run View(flights)
which will open the dataset in the RStudio viewer. Also, you can use the str()
function to see how R interprets the columns (variables):
str(colon)
## 'data.frame': 1858 obs. of 16 variables:
## $ id : num 1 1 2 2 3 3 4 4 5 5 ...
## $ study : num 1 1 1 1 1 1 1 1 1 1 ...
## $ rx : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
## $ sex : num 1 1 1 1 0 0 0 0 1 1 ...
## $ age : num 43 43 63 63 71 71 66 66 69 69 ...
## $ obstruct: num 0 0 0 0 0 0 1 1 0 0 ...
## $ perfor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ adhere : num 0 0 0 0 1 1 0 0 0 0 ...
## $ nodes : num 5 5 1 1 7 7 6 6 22 22 ...
## $ status : num 1 1 0 0 1 1 1 1 1 1 ...
## $ differ : num 2 2 2 2 2 2 2 2 2 2 ...
## $ extent : num 3 3 3 3 2 2 3 3 3 3 ...
## $ surg : num 0 0 0 0 0 0 1 1 1 1 ...
## $ node4 : num 1 1 0 0 1 1 1 1 1 1 ...
## $ time : num 1521 968 3087 3087 963 ...
## $ etype : num 2 1 2 1 2 1 2 1 2 1 ...
We can see that all variables are defined as numerical, with the exception of rx which R has interpreted as a factor variable. This is because rxcontains characters and _R has a tendency to convert character variables to factors.
Introducing tibble
There is a special type of dataframe which is more suitable for use with the tidyverse packages. That type of dataframe is called a tibble. The tibble is basically an ordinary data frame but its strucutre has been optimized for use with other tidyverse packages. Let’s create a copy of the colon dataframe and make the copy a tibble:
colon2 <- as.tibble(colon)
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.
Now we’ve created the object colon2 which is a copy of the colon dataframe, but in the form of a tibble. We can check whether it is a tibble by using the class()
function:
class(colon2)
## [1] "tbl_df" "tbl" "data.frame"
This function states that colon2 has 3 types: data.frame, tbl_df and tbl, where “tbl” is short for tibble.
Let’s print the first 10 rows:
head(colon2, 10)
## # A tibble: 10 x 16
## id study rx sex age obstruct perfor adhere nodes status differ
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 Lev+… 1 43 0 0 0 5 1 2
## 2 1 1 Lev+… 1 43 0 0 0 5 1 2
## 3 2 1 Lev+… 1 63 0 0 0 1 0 2
## 4 2 1 Lev+… 1 63 0 0 0 1 0 2
## 5 3 1 Obs 0 71 0 0 1 7 1 2
## 6 3 1 Obs 0 71 0 0 1 7 1 2
## 7 4 1 Lev+… 0 66 1 0 0 6 1 2
## 8 4 1 Lev+… 0 66 1 0 0 6 1 2
## 9 5 1 Obs 1 69 0 0 0 22 1 2
## 10 5 1 Obs 1 69 0 0 0 22 1 2
## # … with 5 more variables: extent <dbl>, surg <dbl>, node4 <dbl>,
## # time <dbl>, etype <dbl>
The printed tibble shows the first 10 rows and as many columns as the width fits. Any additional columns (of which there were 5) are denoted below. Notice the letter abbreviations under the column names. These letters describe the type of each variable:
<int>
= integers.<dbl>
= doubles, or real numbers.<chr>
= character vectors, strings.<lgl>
= logical, boolean (vectors that contain onlyTRUE
orFALSE
)<fctr>
= factors, categorical variables<date>
= dates.<dttm>
= date-times (a date + a time).
Let’s see if the transformation to the tibble format changed the variable types:
str(colon2)
## Classes 'tbl_df', 'tbl' and 'data.frame': 1858 obs. of 16 variables:
## $ id : num 1 1 2 2 3 3 4 4 5 5 ...
## $ study : num 1 1 1 1 1 1 1 1 1 1 ...
## $ rx : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
## $ sex : num 1 1 1 1 0 0 0 0 1 1 ...
## $ age : num 43 43 63 63 71 71 66 66 69 69 ...
## $ obstruct: num 0 0 0 0 0 0 1 1 0 0 ...
## $ perfor : num 0 0 0 0 0 0 0 0 0 0 ...
## $ adhere : num 0 0 0 0 1 1 0 0 0 0 ...
## $ nodes : num 5 5 1 1 7 7 6 6 22 22 ...
## $ status : num 1 1 0 0 1 1 1 1 1 1 ...
## $ differ : num 2 2 2 2 2 2 2 2 2 2 ...
## $ extent : num 3 3 3 3 2 2 3 3 3 3 ...
## $ surg : num 0 0 0 0 0 0 1 1 1 1 ...
## $ node4 : num 1 1 0 0 1 1 1 1 1 1 ...
## $ time : num 1521 968 3087 3087 963 ...
## $ etype : num 2 1 2 1 2 1 2 1 2 1 ...
No, it did not. All are numericals, with the exception of rx which is still a factor variable. The distinction between factors and characters was discussed /////////////earlier/////////////. We could convert rx to a character variable, but it’s not necessary since factors and characters will be handled similarly when we perform operations on them.
Basic data transformation with dplyr
The majority of data transformation tasks can be solved with the following dplyr functions:
filter()
let’s you filter rows (observations) based on their valuesarrange()
let’s you reorder (rearrange) rows based on one or multiple variables (columns).select()
let’s you select variables (columns) by their names or positions..mutate()
let’s you create new variables (columns).summarise()
let’s you aggregate (summarise) data from other variables (columns).
The beauty: group_by
You will often need to perform operations group-wise. For example, you may wish to calculate the incidence, or mean value or median or whatever, for women and men separately. For such purposes you have the group_by
function. When you use the group_by
function, all functions thereafter are execute on each group separately.
The beast: pipe (%>%)
Dplyr popularized the pipe operator, which is %>%
. Pipe (%>%
) tells R to chain operations together. Using pipe, you can chain together as many operations as you want and this makes data transformation a piece of cake.
Before we get into more elaborate examples, we’ll start with the basics.
Filter rows (observations) with filter()
The filter
function allows you to subset rows (observations) based on their values. The first argument is the name of the data frame. All subsequent arguments are the expressions that filter the data frame. For example, we can select all patients younger than 60 years of age:
filter(colon2, age<60)
## # A tibble: 828 x 16
## id study rx sex age obstruct perfor adhere nodes status differ
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 Lev+… 1 43 0 0 0 5 1 2
## 2 1 1 Lev+… 1 43 0 0 0 5 1 2
## 3 6 1 Lev+… 0 57 0 0 0 9 1 2
## 4 6 1 Lev+… 0 57 0 0 0 9 1 2
## 5 8 1 Obs 1 54 0 0 0 1 0 2
## 6 8 1 Obs 1 54 0 0 0 1 0 2
## 7 9 1 Lev 1 46 0 0 1 2 0 2
## 8 9 1 Lev 1 46 0 0 1 2 0 2
## 9 11 1 Lev 0 47 0 0 1 1 0 2
## 10 11 1 Lev 0 47 0 0 1 1 0 2
## # … with 818 more rows, and 5 more variables: extent <dbl>, surg <dbl>,
## # node4 <dbl>, time <dbl>, etype <dbl>
dplyr filters the colon2 data frame so that it only includes rows (observations) with age less than 60. This does not, however, save the result. If you wish to save the result, you need to assign the new data to a new R object and this is done using the assignment operator (<-
). Let’s try this by assigning the resulting data to a new data frame called colon3:
colon3 <- filter(colon2, age<60)
Let’s select all rows where age is less than 60 years and sex is 1:
colon3 <- filter(colon2, age<60, sex==1)
You can use all of the usual R comparison operators to filter your data. These operators are as follows:
==
, equal to!=
, not equal to<
, less than>
, greater than<=
, less than or equal to>=
, greater than or equal to
Let’s select all rows which fulfill all of the following conditions:
- age is less than 60 years
- sex is equal to 1,
- rx is not equal to “Obs”
- time is less than or equal to 1000:
colon3 <- filter(colon2, age < 60, sex == 1, rx != "Obs", time <= 1000)
Only rows that fulfill all the filtering arguments will be returned in the new data frame.
Filtering with boolean expressions
Our colon2 data frame does not include any boolean variable, so we will create that variable (only for the purpose of demonstrating how to filter with boolean expressions). This will be done using the R function sample
which generates a random sample of elements (don’t worry about the sample
function for now). We will generate as many values as there are rows in the colon data frame (for which the R function nrow
comes to handy). The new varaible will be named new_variable:
# Assign the number of rows to an R object
number_of_rows <- nrow(colon2)
# Creating a boolean variable with TRUE and FALSE as possible values
colon2$new_variable <- sample(x=c(TRUE, FALSE), size=number_of_rows, replace=T)
# Filtering the data using boolean expression
# Only rows where new_variable is TRUE will be saved to colon3
colon3 <- filter(colon2, new_variable == TRUE)
Note that we assigned the filtered data frame to a new data frame called colon3. If we don’t specify that we want to assign the filtered data to a new data frame, then R will simply print the result of the filter operation, like so:
filter(colon2, new_variable == TRUE)
## # A tibble: 898 x 17
## id study rx sex age obstruct perfor adhere nodes status differ
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 Lev+… 1 43 0 0 0 5 1 2
## 2 1 1 Lev+… 1 43 0 0 0 5 1 2
## 3 2 1 Lev+… 1 63 0 0 0 1 0 2
## 4 3 1 Obs 0 71 0 0 1 7 1 2
## 5 4 1 Lev+… 0 66 1 0 0 6 1 2
## 6 4 1 Lev+… 0 66 1 0 0 6 1 2
## 7 5 1 Obs 1 69 0 0 0 22 1 2
## 8 5 1 Obs 1 69 0 0 0 22 1 2
## 9 7 1 Lev 1 77 0 0 0 5 1 2
## 10 8 1 Obs 1 54 0 0 0 1 0 2
## # … with 888 more rows, and 6 more variables: extent <dbl>, surg <dbl>,
## # node4 <dbl>, time <dbl>, etype <dbl>, new_variable <lgl>
Filtering with logical operators
You can use the logical operators &
(“and”), |
(“or”) and and !
(“is not”) when filtering. The following operation will save all rows for which rx is equal to “Obs” or where time is less than 1000:
colon3 <- filter(colon2, rx == "Obs" | time < 1000)
If you specify multiple filtering arguments, separated by commas, then R will only include the rows which fulfill all arguments. Thus, the following command implies that we will save rows where rx is equal to “Obs” and time is less than 1000:
colon3 <- filter(colon2, rx == "Obs", time < 1000)
Separating filtering arguments with commas implies that only rows which fulfill all arguments will be included. However, if you need to mix filtering arguments, you can always specify “and” explicitly by using the &
character, like so:
colon3 <- filter(colon2, rx == "Obs" & time < 1000)
More advanced filtering
Let’s say you want to filter all rows in the data frame where the variable nodes is either 0, 3, 7 or 22. This can be done using the %in%
operator. The function is most often used in combination with the c
function, which you feed with the values that you’re looking for, like so:
colon3 <- filter(colon2, nodes %in% c(0, 3, 7, 22))
This command tells R to include all rows where nodes is equal to 0, 3, 7 or 22. You can use this code for character variables as well. Let’s save all rows where rx is either “Obs” or “Lev”:
colon3 <- filter(colon2, rx %in% c("Obs", "Lev"))
The order of operations doesn’t work like English. You can’t write filter(flights, month == 11 | 12)
, which you might literally translate into “finds all flights that departed in November or December”. Instead it finds all months that equal 11 | 12
, an expression that evaluates to TRUE
. In a numeric context (like here), TRUE
becomes one, so this finds all flights in January, not November or December. This is quite confusing!
A useful short-hand for this problem is x %in% y
. This will select every row where x
is one of the values in y
. We could use it to rewrite the code above:
nov_dec <- filter(flights, month %in% c(11, 12))
More elaborate filtering functions are carried out using filter_all
, filter_at
and filter_if
: https://dplyr.tidyverse.org/reference/filter_all.html
Filtering with missing values
Missing values are depicted by NA
(Not Available) in R. Missing values are tricky. In essence, R avoids any operations which run into missing values. This can come to expression in different ways, for example:
- If you try to calculate the mean body weight on 10 persons, but 1 of the persons has
NA
(missing) for weight, then R will not calculate the mean. You have to tell R explicitly to ignore missing values. - If you create a regression model with 5 variables, all individuals with missing data for any of those variables will be excluded from the regression model.
- If you filter observations using a couple of arguments, then individuals with missing data relating to those arguments will not be considered in the filtering process.
- Missing values can make comparisons tricky. Any comparison including a missing value will be aborted.
Let’s execute 4 operations including NA. None of these operations will return a result. Instead, they all return NA.
NA > 10
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA
The filter()
function only includes rows where the condition is TRUE
; it excludes both FALSE
and NA
values. So, how do you preserve missing values? You have to ask for them explicitly. We will illustrate this using 3 examples.
- Save all rows with fewer than 4 nodes:
filter(colon3, nodes<4 | is.na(nodes))
## # A tibble: 802 x 17
## id study rx sex age obstruct perfor adhere nodes status differ
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8 1 Obs 1 54 0 0 0 1 0 2
## 2 8 1 Obs 1 54 0 0 0 1 0 2
## 3 9 1 Lev 1 46 0 0 1 2 0 2
## 4 9 1 Lev 1 46 0 0 1 2 0 2
## 5 11 1 Lev 0 47 0 0 1 1 0 2
## 6 11 1 Lev 0 47 0 0 1 1 0 2
## 7 13 1 Obs 1 64 0 0 0 1 1 2
## 8 13 1 Obs 1 64 0 0 0 1 1 2
## 9 14 1 Lev 1 68 1 0 0 3 1 2
## 10 14 1 Lev 1 68 1 0 0 3 1 2
## # … with 792 more rows, and 6 more variables: extent <dbl>, surg <dbl>,
## # node4 <dbl>, time <dbl>, etype <dbl>, new_variable <lgl>
- Save all rows missing (NA) information on nodes:
filter(colon3, is.na(nodes))
## # A tibble: 18 x 17
## id study rx sex age obstruct perfor adhere nodes status differ
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 99 1 Lev 1 71 0 0 1 NA 1 2
## 2 99 1 Lev 1 71 0 0 1 NA 1 2
## 3 189 1 Lev 1 72 0 0 0 NA 0 2
## 4 189 1 Lev 1 72 0 0 0 NA 0 2
## 5 338 1 Lev 1 58 0 0 0 NA 1 3
## 6 338 1 Lev 1 58 0 0 0 NA 1 3
## 7 383 1 Lev 1 63 0 0 0 NA 1 2
## 8 383 1 Lev 1 63 0 0 0 NA 1 2
## 9 522 1 Obs 0 72 0 0 0 NA 0 2
## 10 522 1 Obs 0 72 0 0 0 NA 0 2
## 11 590 1 Lev 1 54 0 0 1 NA 1 2
## 12 590 1 Lev 1 54 0 0 1 NA 1 2
## 13 736 1 Obs 1 57 0 0 0 NA 0 2
## 14 736 1 Obs 1 57 0 0 0 NA 1 2
## 15 771 1 Lev 0 65 1 0 0 NA 1 3
## 16 771 1 Lev 0 65 1 0 0 NA 1 3
## 17 787 1 Obs 0 57 0 0 0 NA 1 2
## 18 787 1 Obs 0 57 0 0 0 NA 1 2
## # … with 6 more variables: extent <dbl>, surg <dbl>, node4 <dbl>,
## # time <dbl>, etype <dbl>, new_variable <lgl>
- Save all rows with fewer than 4 nodes or missing information
filter(colon3, nodes<4 | is.na(nodes))
## # A tibble: 802 x 17
## id study rx sex age obstruct perfor adhere nodes status differ
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 8 1 Obs 1 54 0 0 0 1 0 2
## 2 8 1 Obs 1 54 0 0 0 1 0 2
## 3 9 1 Lev 1 46 0 0 1 2 0 2
## 4 9 1 Lev 1 46 0 0 1 2 0 2
## 5 11 1 Lev 0 47 0 0 1 1 0 2
## 6 11 1 Lev 0 47 0 0 1 1 0 2
## 7 13 1 Obs 1 64 0 0 0 1 1 2
## 8 13 1 Obs 1 64 0 0 0 1 1 2
## 9 14 1 Lev 1 68 1 0 0 3 1 2
## 10 14 1 Lev 1 68 1 0 0 3 1 2
## # … with 792 more rows, and 6 more variables: extent <dbl>, surg <dbl>,
## # node4 <dbl>, time <dbl>, etype <dbl>, new_variable <lgl>
As shown above, the function is.na()
may be used to evaluate whether a value is missing or not.
Arrange (sort) rows with arrange()
arrange()
changes the order of the rows. It takes a data frame and a set of columns to order by. If you provide more than one column, it will sort on the columns sequentially. To illustrate this, we’ll create a new data frame:
x <- c(1, 2, 3, 4, 5, 6, 7, 7, 7)
y <- c(5, 10, 15, 20, 25, 30, 40, 90, 20)
df <- data.frame(x, y)
Now we’ll arrange df
by x and assign the result to a new data frame calle df2:
df2 <- df %>%
arrange(x)
df2
## x y
## 1 1 5
## 2 2 10
## 3 3 15
## 4 4 20
## 5 5 25
## 6 6 30
## 7 7 40
## 8 7 90
## 9 7 20
Note that the value 7 occcurs three times and these values are not sorted. We can sort on y in addition to x, by executing the following command: Now we’ll arrange df
by x and assign the result to a new data frame calle df2:
df2 <- df %>%
arrange(x, y)
df2
## x y
## 1 1 5
## 2 2 10
## 3 3 15
## 4 4 20
## 5 5 25
## 6 6 30
## 7 7 20
## 8 7 40
## 9 7 90
Use desc()
to sort in descending order:
df2 <- df %>%
arrange(desc(x))
df2
## x y
## 1 7 40
## 2 7 90
## 3 7 20
## 4 6 30
## 5 5 25
## 6 4 20
## 7 3 15
## 8 2 10
## 9 1 5
On the contrary to filter, arrange does not delete rows with missing values. Instead, arrange palces missing values at the end. Let’s see an example by introducing a missing values to our data frame:
x <- c(NA, 2, NA, 4, 5, NA, 6, 7, 7, 7)
y <- c(5, 10, 15, 20, 33, 25, 30, 40, 90, 20)
df <- data.frame(x, y)
Missing values are always sorted at the end:
df2 <- df %>%
arrange(x)
df2
## x y
## 1 2 10
## 2 4 20
## 3 5 33
## 4 6 30
## 5 7 40
## 6 7 90
## 7 7 20
## 8 NA 5
## 9 NA 15
## 10 NA 25
Select columns with select()
select()
allows you to easily select variables (columns). This function is straightforward to use and some examples follows.
# Select id, study, rx, sex
select(colon3, id, study, rx, sex)
## # A tibble: 1,250 x 4
## id study rx sex
## <dbl> <dbl> <fct> <dbl>
## 1 3 1 Obs 0
## 2 3 1 Obs 0
## 3 5 1 Obs 1
## 4 5 1 Obs 1
## 5 7 1 Lev 1
## 6 7 1 Lev 1
## 7 8 1 Obs 1
## 8 8 1 Obs 1
## 9 9 1 Lev 1
## 10 9 1 Lev 1
## # … with 1,240 more rows
# Select all columns between id and perfor
select(colon3, id:perfor)
## # A tibble: 1,250 x 7
## id study rx sex age obstruct perfor
## <dbl> <dbl> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 3 1 Obs 0 71 0 0
## 2 3 1 Obs 0 71 0 0
## 3 5 1 Obs 1 69 0 0
## 4 5 1 Obs 1 69 0 0
## 5 7 1 Lev 1 77 0 0
## 6 7 1 Lev 1 77 0 0
## 7 8 1 Obs 1 54 0 0
## 8 8 1 Obs 1 54 0 0
## 9 9 1 Lev 1 46 0 0
## 10 9 1 Lev 1 46 0 0
## # … with 1,240 more rows
# Excude all columns between id and perfor
select(colon3, -(id:perfor))
## # A tibble: 1,250 x 10
## adhere nodes status differ extent surg node4 time etype new_variable
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
## 1 1 7 1 2 2 0 1 963 2 FALSE
## 2 1 7 1 2 2 0 1 542 1 TRUE
## 3 0 22 1 2 3 1 1 659 2 TRUE
## 4 0 22 1 2 3 1 1 523 1 TRUE
## 5 0 5 1 2 3 1 1 420 2 TRUE
## 6 0 5 1 2 3 1 1 229 1 FALSE
## 7 0 1 0 2 3 0 0 3192 2 FALSE
## 8 0 1 0 2 3 0 0 3192 1 TRUE
## 9 1 2 0 2 3 0 0 3173 2 TRUE
## 10 1 2 0 2 3 0 0 3173 1 TRUE
## # … with 1,240 more rows
Note that we did not assign the result to a new data frame in these examples. IF you want to assign the result to a new data frame, you should use the assignment operator, as in the examples above. For safety sake, an example follows, in which we create the data frame colon4:
colon4 <- colon3 %>%
select(id, study, rx, sex)
Helper functions improve the selection
There are several helper functions you can use with select()
to improve the selection:
starts_with("r")
: matches names at their beginning
# Get all columns starting with 'r'
colon4 <- colon3 %>%
select(starts_with("r"))
# Check existing variables in colon4
names(colon4)
## [1] "rx"
ends_with()
: matches names at their end
# Get all columns ending with 'e'
colon4 <- colon3 %>%
select(ends_with("e"))
# Check existing variables in colon4
names(colon4)
## [1] "age" "adhere" "time" "etype"
## [5] "new_variable"
contains()
: matches names that contain “ijk”.
# Get all columns containing 'st'
colon4 <- colon3 %>%
select(contains("st"))
# Check existing variables in colon4
names(colon4)
## [1] "study" "obstruct" "status"
The everything()
helper is useful when you want to reorder the columns. It captures every column that is not explicitly mentioned. Int he following example, we will move the sex variable to the first position by using the everything()
helper:
colon_new <- colon %>%
select(sex, everything())
names(colon_new)
## [1] "sex" "id" "study" "rx" "age" "obstruct"
## [7] "perfor" "adhere" "nodes" "status" "differ" "extent"
## [13] "surg" "node4" "time" "etype"
Change variable names with rename()
This one is really easy. It allows you to change variable names. We will now create a new data frame (colon5) in which we’ve changed the name of nodes to lymph_nodes, and obstruct to obstruction:
colon5 <- colon3 %>%
rename(lymph_nodes=nodes,
obstruction=obstruct)
names(colon5)
## [1] "id" "study" "rx" "sex"
## [5] "age" "obstruction" "perfor" "adhere"
## [9] "lymph_nodes" "status" "differ" "extent"
## [13] "surg" "node4" "time" "etype"
## [17] "new_variable"
Create new variables with mutate()
Every data science project requires creating new variables (columns). New columns may be functions of existing columns or derived by other means. The mutate()
function creates new variables.
colon_new <- colon %>%
mutate(years = time/365,
months = years*12)
# Select and view only time and years
select(colon_new, time, years, months)
## time years months
## 1 1521 4.16712329 50.0054795
## 2 968 2.65205479 31.8246575
## 3 3087 8.45753425 101.4904110
## 4 3087 8.45753425 101.4904110
## 5 963 2.63835616 31.6602740
## 6 542 1.48493151 17.8191781
## 7 293 0.80273973 9.6328767
## 8 245 0.67123288 8.0547945
## 9 659 1.80547945 21.6657534
## 10 523 1.43287671 17.1945205
## 11 1767 4.84109589 58.0931507
## 12 904 2.47671233 29.7205479
## 13 420 1.15068493 13.8082192
## 14 229 0.62739726 7.5287671
## 15 3192 8.74520548 104.9424658
## 16 3192 8.74520548 104.9424658
## 17 3173 8.69315068 104.3178082
## 18 3173 8.69315068 104.3178082
## 19 3308 9.06301370 108.7561644
## 20 3308 9.06301370 108.7561644
## 21 2908 7.96712329 95.6054795
## 22 2908 7.96712329 95.6054795
## 23 3309 9.06575342 108.7890411
## 24 3309 9.06575342 108.7890411
## 25 2085 5.71232877 68.5479452
## 26 1130 3.09589041 37.1506849
## 27 2910 7.97260274 95.6712329
## 28 2231 6.11232877 73.3479452
## 29 2754 7.54520548 90.5424658
## 30 2754 7.54520548 90.5424658
## 31 3214 8.80547945 105.6657534
## 32 1323 3.62465753 43.4958904
## 33 406 1.11232877 13.3479452
## 34 258 0.70684932 8.4821918
## 35 522 1.43013699 17.1616438
## 36 389 1.06575342 12.7890411
## 37 887 2.43013699 29.1616438
## 38 604 1.65479452 19.8575342
## 39 3329 9.12054795 109.4465753
## 40 3329 9.12054795 109.4465753
## 41 2789 7.64109589 91.6931507
## 42 2789 7.64109589 91.6931507
## 43 739 2.02465753 24.2958904
## 44 527 1.44383562 17.3260274
## 45 709 1.94246575 23.3095890
## 46 348 0.95342466 11.4410959
## 47 2969 8.13424658 97.6109589
## 48 2969 8.13424658 97.6109589
## 49 2889 7.91506849 94.9808219
## 50 2889 7.91506849 94.9808219
[only first 50 rows displayed]
Note that you can refer to columns created in the same step:
If you only want to keep the new variables, use transmute()
instead of mutate()
:
colon_new <- colon %>%
transmute(years = time/365,
months = years*12)
colon_new
## years months
## 1 4.16712329 50.0054795
## 2 2.65205479 31.8246575
## 3 8.45753425 101.4904110
## 4 8.45753425 101.4904110
## 5 2.63835616 31.6602740
## 6 1.48493151 17.8191781
## 7 0.80273973 9.6328767
## 8 0.67123288 8.0547945
## 9 1.80547945 21.6657534
## 10 1.43287671 17.1945205
## 11 4.84109589 58.0931507
## 12 2.47671233 29.7205479
## 13 1.15068493 13.8082192
## 14 0.62739726 7.5287671
## 15 8.74520548 104.9424658
## 16 8.74520548 104.9424658
## 17 8.69315068 104.3178082
## 18 8.69315068 104.3178082
## 19 9.06301370 108.7561644
## 20 9.06301370 108.7561644
## 21 7.96712329 95.6054795
## 22 7.96712329 95.6054795
## 23 9.06575342 108.7890411
## 24 9.06575342 108.7890411
## 25 5.71232877 68.5479452
## 26 3.09589041 37.1506849
## 27 7.97260274 95.6712329
## 28 6.11232877 73.3479452
## 29 7.54520548 90.5424658
## 30 7.54520548 90.5424658
## 31 8.80547945 105.6657534
## 32 3.62465753 43.4958904
## 33 1.11232877 13.3479452
## 34 0.70684932 8.4821918
## 35 1.43013699 17.1616438
## 36 1.06575342 12.7890411
## 37 2.43013699 29.1616438
## 38 1.65479452 19.8575342
## 39 9.12054795 109.4465753
## 40 9.12054795 109.4465753
## 41 7.64109589 91.6931507
## 42 7.64109589 91.6931507
## 43 2.02465753 24.2958904
## 44 1.44383562 17.3260274
## 45 1.94246575 23.3095890
## 46 0.95342466 11.4410959
## 47 8.13424658 97.6109589
## 48 8.13424658 97.6109589
## 49 7.91506849 94.9808219
## 50 7.91506849 94.9808219
[only first 50 rows displayed]
Useful functions in mutate
You can use thousands of different functions with mutate. The requirement is that the function must be vectorized, such that it takes a vector of values as input and returns a vector as output. The following functions are frequently used with mutate()
and transmute()
.
- Arithmetics:
+
,-
,*
,/
,^
. These arithemetic operators are easy to use. - Logarithms:
log()
,log2()
,log10()
. lead()
andlag()
are very useful to carry values forward and backward. Using these you can calculate running differences or detect when values change.
df <- data.frame(x=1:10)
df2 <- df %>%
mutate(previous_value=lag(x),
next_value=lead(x))
- Running sums, products, mins and maxes:
cumsum()
,cumprod()
,cummin()
,cummax()
andcummean()
can be used to compute aggregated values.
x
## [1] NA 2 NA 4 5 NA 6 7 7 7
cumsum(x)
## [1] NA NA NA NA NA NA NA NA NA NA
cummean(x)
## [1] NA NA NA NA NA NA NA NA NA NA
- Logical comparisons,
<
,<=
,>
,>=
,!=
, and==
. - Ranking: e.g
min_rank()
,row_number()
,dense_rank()
,percent_rank()
,cume_dist()
andntile()
Group summaries with summarise()
summarise()
collapses a data frame into a single row.
colon %>%
summarise(mean_age = mean(age))
## mean_age
## 1 59.75457
This isn’t very useful since you could simply write mean(colon$age)
instead. However, summarise()
becomes very useful when used in conjunction with group_by()
, which executes the desired operations on individual groups. Let’s calculate the mean age again, but this time in relation to number of lymph nodes. To do this, we have to chain two operatiosn together. In the first operation, we use group_by()
which tells R to perform all subsequent operations on individual groups. Then we use summarise()
to get summaries.
colon %>%
group_by(nodes) %>%
summarise(mean_age = mean(age))
## # A tibble: 25 x 2
## nodes mean_age
## <dbl> <dbl>
## 1 0 47.5
## 2 1 62.0
## 3 2 58.7
## 4 3 60.9
## 5 4 58.8
## 6 5 61.4
## 7 6 57.5
## 8 7 55.3
## 9 8 57.7
## 10 9 54.8
## # … with 15 more rows
Did you notice the %>%
operator? that’s the famous pipe operator, which enables you to chain multiple operations together.
Chaining multiple operations with the pipe (%>%
)
Using %>%
we can chain multiple operations which speeds up coding and also improves the readability of code. Perhaps most obvious, it eliminates the necessity to create intermediate data frames, which is fantastic!
colon_new <- colon %>%
filter(age>50) %>%
mutate(years = time/365) %>%
filter(years > 5) %>%
group_by(nodes) %>%
summarise(
count = n(), # this calculates the number of individuals in each group
mean_age = mean(age), # mean age
mean_years = mean(years, na.rm = TRUE)) # mean survival time
colon_new
## # A tibble: 15 x 4
## nodes count mean_age mean_years
## <dbl> <int> <dbl> <dbl>
## 1 1 300 64.6 6.54
## 2 2 188 63.3 6.54
## 3 3 113 64.8 6.15
## 4 4 58 64.3 6.71
## 5 5 22 64.3 6.57
## 6 6 23 65.5 6.72
## 7 7 20 61.2 6.16
## 8 8 8 60.4 6.23
## 9 9 6 57 7.90
## 10 10 7 69.3 5.71
## 11 11 1 59 5.07
## 12 12 4 66 6.22
## 13 15 2 74 7.10
## 14 19 2 69 5.25
## 15 NA 9 68.3 6.47
A good way to pronounce %>%
when reading code is “then”.
Don’t forget to ungroup when you’re done grouping! When you use group_by()
, the data frame will continue to be grouped, which means that you must ungroup it, unless you want the grouping to continue. Let’s write the above code again and add the ungroup statement:
colon_new <- colon %>%
filter(age>50) %>%
mutate(years = time/365) %>%
filter(years > 5) %>%
group_by(nodes) %>%
summarise(
count = n(),
mean_age = mean(age),
mean_years = mean(years, na.rm = TRUE)) %>%
ungroup()
Also note the useful n()
(counts) command which computes the number of individuals in each group. In addition to n()
, you can use several other summary functions:
- Central tendency:
mean()
,median()
- The standard deviation
sd()
, is the standard measure of spread. The interquartile rangeIQR()
and median absolute deviationmad()
are robust equivalents that may be more useful if you have outliers. - Measures of rank:
min()
,quantile()
,max()
. - Measures of position:
first()
,nth()
,last()
. These functions allows you to retrieve the first, last or nth value for each group. - To count the number of non-missing values, use
sum(!is.na(x))
. To count the number of distinct (unique) values, usen_distinct(x)
.
Counts are so useful that dplyr provides a simple helper if all you want is a count:
colon %>%
count(rx, sex)
## # A tibble: 6 x 3
## rx sex n
## <fct> <dbl> <int>
## 1 Obs 0 298
## 2 Obs 1 332
## 3 Lev 0 266
## 4 Lev 1 354
## 5 Lev+5FU 0 326
## 6 Lev+5FU 1 282
You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:
colon %>%
count(rx, sex, wt=nodes)
## # A tibble: 6 x 3
## rx sex n
## <fct> <dbl> <dbl>
## 1 Obs 0 1090
## 2 Obs 1 1272
## 3 Lev 0 972
## 4 Lev 1 1274
## 5 Lev+5FU 0 1184
## 6 Lev+5FU 1 876
- Counts and proportions of logical values:
sum(x > 10)
,mean(y == 0)
. When used with numeric functions,TRUE
is converted to 1 andFALSE
to 0. This makessum()
andmean()
very useful:sum(x)
gives the number ofTRUE
s inx
, andmean(x)
gives the proportion.
By the way, you can group on multiple variables!