Avsnitt 1 av 0
Startad

Manipulating data in R

Data transformation

Introduction

In the vast majority of your data science projects, you will get the data in a form not ready for analyses. All research projects include cleaning of the data file, creating new variables, redefining existing variables, aggregating/summarising data and so on. The task of carrying out these changes is referred to transformation of data. You will notice that 80% of your time will be spent on data transformation and the rest is spent of visualizing, describing the data and building statistical or machine learning models.

In this chapter you will learn how to transform your data using the dplyr package. This package has revolutionzed how data is transformed in R. Before dplyr, people basically hated data transformation in R. Dplyr made it easy, fast and beautyful. The dplyr package is included in the tidyverse, which you install and load like so:

#install.packages("tidyverse")
library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0.9000     ✔ purrr   0.3.0     
## ✔ tibble  2.0.1          ✔ dplyr   0.8.0.1   
## ✔ tidyr   0.8.2          ✔ stringr 1.3.1     
## ✔ readr   1.1.1          ✔ forcats 0.3.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Note that loading tidyverse results in loading of 4 packages: ggplot2tibbletidyr and readr. Also note that loading the tidyverse results in conflicts, which R informs you about. There are two conflicts:

  • the dplyr function filter() will overwrite (mask) the filter() function in the stats package.
  • the dplyr function lag() will overwrite (mask) the lag() function in the stats package.

The reason for this is simple: dplyr includes functions (filter, lag) with the same names as functions in the stats package, and loading dplyr will overwrite the functions in the stats package. If you still want to use the filter or lag functions in the stats package, after loading dplyr, you’ll have to use their full names: i.e stats::filter() and stats::lag().

Prerequisites

In this chapter we’re going to focus on how to use the dplyr package. We’ll illustrate the key ideas using data from the survival package, namely the colon dataset. These are data from a trial of chemotherapy for colon cancer. Here is a list of the variables in the colon dataset:

id:       id
study:    1 for all patients
rx:       Treatment - Obs(ervation), Lev(amisole), Lev(amisole)+5-FU
sex:      1=male
age:      in years
obstruct: obstruction of colon by tumour
perfor:   perforation of colon
adhere:   adherence to nearby organs
nodes:    number of lymph nodes with detectable cancer
time:     days until event or censoring
status:   censoring status
differ:   differentiation of tumour (1=well, 2=moderate, 3=poor)
extent:   Extent of local spread (1=submucosa, 2=muscle, 3=serosa, 4=contiguous structures)
surg:     time from surgery to registration (0=short, 1=long)
node4:    more than 4 positive lymph nodes
etype:    event type: 1=recurrence,2=death

Let’s load the survival package and then the colon dataset.

# Load the survival package
library(survival)

# Load the colon cancer dataset
data(colon)

# See what type of class the colon object is
class(colon)
## [1] "data.frame"

As evident, R interprets colon as a dataframe. Let’s use the head() function to print the first 10 rows of the dataframe:

head(colon, 10)
##    id study      rx sex age obstruct perfor adhere nodes status differ
## 1   1     1 Lev+5FU   1  43        0      0      0     5      1      2
## 2   1     1 Lev+5FU   1  43        0      0      0     5      1      2
## 3   2     1 Lev+5FU   1  63        0      0      0     1      0      2
## 4   2     1 Lev+5FU   1  63        0      0      0     1      0      2
## 5   3     1     Obs   0  71        0      0      1     7      1      2
## 6   3     1     Obs   0  71        0      0      1     7      1      2
## 7   4     1 Lev+5FU   0  66        1      0      0     6      1      2
## 8   4     1 Lev+5FU   0  66        1      0      0     6      1      2
## 9   5     1     Obs   1  69        0      0      0    22      1      2
## 10  5     1     Obs   1  69        0      0      0    22      1      2
##    extent surg node4 time etype
## 1       3    0     1 1521     2
## 2       3    0     1  968     1
## 3       3    0     0 3087     2
## 4       3    0     0 3087     1
## 5       2    0     1  963     2
## 6       2    0     1  542     1
## 7       3    1     1  293     2
## 8       3    1     1  245     1
## 9       3    1     1  659     2
## 10      3    1     1  523     1

R prints the first 10 rows and all columns (variables). It’s not very neat since there are more columns than the width fits, so five columns are printed on a second row. To see the whole dataset, you can run View(flights) which will open the dataset in the RStudio viewer. Also, you can use the str() function to see how R interprets the columns (variables):

str(colon)
## 'data.frame':    1858 obs. of  16 variables:
##  $ id      : num  1 1 2 2 3 3 4 4 5 5 ...
##  $ study   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ rx      : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
##  $ sex     : num  1 1 1 1 0 0 0 0 1 1 ...
##  $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
##  $ obstruct: num  0 0 0 0 0 0 1 1 0 0 ...
##  $ perfor  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ adhere  : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ nodes   : num  5 5 1 1 7 7 6 6 22 22 ...
##  $ status  : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ differ  : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ extent  : num  3 3 3 3 2 2 3 3 3 3 ...
##  $ surg    : num  0 0 0 0 0 0 1 1 1 1 ...
##  $ node4   : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ time    : num  1521 968 3087 3087 963 ...
##  $ etype   : num  2 1 2 1 2 1 2 1 2 1 ...

We can see that all variables are defined as numerical, with the exception of rx which R has interpreted as a factor variable. This is because rxcontains characters and _R has a tendency to convert character variables to factors.

Introducing tibble

There is a special type of dataframe which is more suitable for use with the tidyverse packages. That type of dataframe is called a tibble. The tibble is basically an ordinary data frame but its strucutre has been optimized for use with other tidyverse packages. Let’s create a copy of the colon dataframe and make the copy a tibble:

colon2 <- as.tibble(colon)
## Warning: `as.tibble()` is deprecated, use `as_tibble()` (but mind the new semantics).
## This warning is displayed once per session.

Now we’ve created the object colon2 which is a copy of the colon dataframe, but in the form of a tibble. We can check whether it is a tibble by using the class() function:

class(colon2)
## [1] "tbl_df"     "tbl"        "data.frame"

This function states that colon2 has 3 types: data.frame, tbl_df and tbl, where “tbl” is short for tibble.

Let’s print the first 10 rows:

head(colon2, 10)
## # A tibble: 10 x 16
##       id study rx      sex   age obstruct perfor adhere nodes status differ
##    <dbl> <dbl> <fct> <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
##  1     1     1 Lev+…     1    43        0      0      0     5      1      2
##  2     1     1 Lev+…     1    43        0      0      0     5      1      2
##  3     2     1 Lev+…     1    63        0      0      0     1      0      2
##  4     2     1 Lev+…     1    63        0      0      0     1      0      2
##  5     3     1 Obs       0    71        0      0      1     7      1      2
##  6     3     1 Obs       0    71        0      0      1     7      1      2
##  7     4     1 Lev+…     0    66        1      0      0     6      1      2
##  8     4     1 Lev+…     0    66        1      0      0     6      1      2
##  9     5     1 Obs       1    69        0      0      0    22      1      2
## 10     5     1 Obs       1    69        0      0      0    22      1      2
## # … with 5 more variables: extent <dbl>, surg <dbl>, node4 <dbl>,
## #   time <dbl>, etype <dbl>

The printed tibble shows the first 10 rows and as many columns as the width fits. Any additional columns (of which there were 5) are denoted below. Notice the letter abbreviations under the column names. These letters describe the type of each variable:

  • <int> = integers.
  • <dbl> = doubles, or real numbers.
  • <chr> = character vectors, strings.
  • <lgl> = logical, boolean (vectors that contain only TRUE or FALSE)
  • <fctr> = factors, categorical variables
  • <date> = dates.
  • <dttm> = date-times (a date + a time).

Let’s see if the transformation to the tibble format changed the variable types:

str(colon2)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1858 obs. of  16 variables:
##  $ id      : num  1 1 2 2 3 3 4 4 5 5 ...
##  $ study   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ rx      : Factor w/ 3 levels "Obs","Lev","Lev+5FU": 3 3 3 3 1 1 3 3 1 1 ...
##  $ sex     : num  1 1 1 1 0 0 0 0 1 1 ...
##  $ age     : num  43 43 63 63 71 71 66 66 69 69 ...
##  $ obstruct: num  0 0 0 0 0 0 1 1 0 0 ...
##  $ perfor  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ adhere  : num  0 0 0 0 1 1 0 0 0 0 ...
##  $ nodes   : num  5 5 1 1 7 7 6 6 22 22 ...
##  $ status  : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ differ  : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ extent  : num  3 3 3 3 2 2 3 3 3 3 ...
##  $ surg    : num  0 0 0 0 0 0 1 1 1 1 ...
##  $ node4   : num  1 1 0 0 1 1 1 1 1 1 ...
##  $ time    : num  1521 968 3087 3087 963 ...
##  $ etype   : num  2 1 2 1 2 1 2 1 2 1 ...

No, it did not. All are numericals, with the exception of rx which is still a factor variable. The distinction between factors and characters was discussed /////////////earlier/////////////. We could convert rx to a character variable, but it’s not necessary since factors and characters will be handled similarly when we perform operations on them.

Basic data transformation with dplyr

The majority of data transformation tasks can be solved with the following dplyr functions:

  • filter() let’s you filter rows (observations) based on their values
  • arrange() let’s you reorder (rearrange) rows based on one or multiple variables (columns).
  • select() let’s you select variables (columns) by their names or positions..
  • mutate() let’s you create new variables (columns).
  • summarise() let’s you aggregate (summarise) data from other variables (columns).

The beauty: group_by

You will often need to perform operations group-wise. For example, you may wish to calculate the incidence, or mean value or median or whatever, for women and men separately. For such purposes you have the group_by function. When you use the group_by function, all functions thereafter are execute on each group separately.

The beast: pipe (%>%)

Dplyr popularized the pipe operator, which is %>%. Pipe (%>%) tells R to chain operations together. Using pipe, you can chain together as many operations as you want and this makes data transformation a piece of cake.

Before we get into more elaborate examples, we’ll start with the basics.

Filter rows (observations) with filter()

The filter function allows you to subset rows (observations) based on their values. The first argument is the name of the data frame. All subsequent arguments are the expressions that filter the data frame. For example, we can select all patients younger than 60 years of age:

filter(colon2, age<60)
## # A tibble: 828 x 16
##       id study rx      sex   age obstruct perfor adhere nodes status differ
##    <dbl> <dbl> <fct> <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
##  1     1     1 Lev+…     1    43        0      0      0     5      1      2
##  2     1     1 Lev+…     1    43        0      0      0     5      1      2
##  3     6     1 Lev+…     0    57        0      0      0     9      1      2
##  4     6     1 Lev+…     0    57        0      0      0     9      1      2
##  5     8     1 Obs       1    54        0      0      0     1      0      2
##  6     8     1 Obs       1    54        0      0      0     1      0      2
##  7     9     1 Lev       1    46        0      0      1     2      0      2
##  8     9     1 Lev       1    46        0      0      1     2      0      2
##  9    11     1 Lev       0    47        0      0      1     1      0      2
## 10    11     1 Lev       0    47        0      0      1     1      0      2
## # … with 818 more rows, and 5 more variables: extent <dbl>, surg <dbl>,
## #   node4 <dbl>, time <dbl>, etype <dbl>

dplyr filters the colon2 data frame so that it only includes rows (observations) with age less than 60. This does not, however, save the result. If you wish to save the result, you need to assign the new data to a new R object and this is done using the assignment operator (<-). Let’s try this by assigning the resulting data to a new data frame called colon3:

colon3 <- filter(colon2, age<60)

Let’s select all rows where age is less than 60 years and sex is 1:

colon3 <- filter(colon2, age<60, sex==1)

You can use all of the usual R comparison operators to filter your data. These operators are as follows:

  • ==, equal to
  • !=, not equal to
  • <, less than
  • >, greater than
  • <=, less than or equal to
  • >=, greater than or equal to

Let’s select all rows which fulfill all of the following conditions:

  • age is less than 60 years
  • sex is equal to 1,
  • rx is not equal to “Obs”
  • time is less than or equal to 1000:
colon3 <- filter(colon2, age < 60, sex == 1, rx != "Obs", time <= 1000)

Only rows that fulfill all the filtering arguments will be returned in the new data frame.

Filtering with boolean expressions

Our colon2 data frame does not include any boolean variable, so we will create that variable (only for the purpose of demonstrating how to filter with boolean expressions). This will be done using the R function sample which generates a random sample of elements (don’t worry about the sample function for now). We will generate as many values as there are rows in the colon data frame (for which the R function nrow comes to handy). The new varaible will be named new_variable:

# Assign the number of rows to an R object
number_of_rows <- nrow(colon2)

# Creating a boolean variable with TRUE and FALSE as possible values
colon2$new_variable <- sample(x=c(TRUE, FALSE), size=number_of_rows, replace=T)

# Filtering the data using boolean expression
# Only rows where new_variable is TRUE will be saved to colon3
colon3 <- filter(colon2, new_variable == TRUE)

Note that we assigned the filtered data frame to a new data frame called colon3. If we don’t specify that we want to assign the filtered data to a new data frame, then R will simply print the result of the filter operation, like so:

filter(colon2, new_variable == TRUE)
## # A tibble: 898 x 17
##       id study rx      sex   age obstruct perfor adhere nodes status differ
##    <dbl> <dbl> <fct> <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
##  1     1     1 Lev+…     1    43        0      0      0     5      1      2
##  2     1     1 Lev+…     1    43        0      0      0     5      1      2
##  3     2     1 Lev+…     1    63        0      0      0     1      0      2
##  4     3     1 Obs       0    71        0      0      1     7      1      2
##  5     4     1 Lev+…     0    66        1      0      0     6      1      2
##  6     4     1 Lev+…     0    66        1      0      0     6      1      2
##  7     5     1 Obs       1    69        0      0      0    22      1      2
##  8     5     1 Obs       1    69        0      0      0    22      1      2
##  9     7     1 Lev       1    77        0      0      0     5      1      2
## 10     8     1 Obs       1    54        0      0      0     1      0      2
## # … with 888 more rows, and 6 more variables: extent <dbl>, surg <dbl>,
## #   node4 <dbl>, time <dbl>, etype <dbl>, new_variable <lgl>

Filtering with logical operators

You can use the logical operators & (“and”), | (“or”) and and ! (“is not”) when filtering. The following operation will save all rows for which rx is equal to “Obs” or where time is less than 1000:

colon3 <- filter(colon2, rx == "Obs" | time < 1000)

If you specify multiple filtering arguments, separated by commas, then R will only include the rows which fulfill all arguments. Thus, the following command implies that we will save rows where rx is equal to “Obs” and time is less than 1000:

colon3 <- filter(colon2, rx == "Obs", time < 1000)

Separating filtering arguments with commas implies that only rows which fulfill all arguments will be included. However, if you need to mix filtering arguments, you can always specify “and” explicitly by using the & character, like so:

colon3 <- filter(colon2, rx == "Obs" & time < 1000)

More advanced filtering

Let’s say you want to filter all rows in the data frame where the variable nodes is either 0, 3, 7 or 22. This can be done using the %in% operator. The function is most often used in combination with the c function, which you feed with the values that you’re looking for, like so:

colon3 <- filter(colon2, nodes %in% c(0, 3, 7, 22))

This command tells R to include all rows where nodes is equal to 0, 3, 7 or 22. You can use this code for character variables as well. Let’s save all rows where rx is either “Obs” or “Lev”:

colon3 <- filter(colon2, rx %in% c("Obs", "Lev"))

The order of operations doesn’t work like English. You can’t write filter(flights, month == 11 | 12), which you might literally translate into “finds all flights that departed in November or December”. Instead it finds all months that equal 11 | 12, an expression that evaluates to TRUE. In a numeric context (like here), TRUE becomes one, so this finds all flights in January, not November or December. This is quite confusing!

A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:

nov_dec <- filter(flights, month %in% c(11, 12))

More elaborate filtering functions are carried out using filter_allfilter_at and filter_ifhttps://dplyr.tidyverse.org/reference/filter_all.html

Filtering with missing values

Missing values are depicted by NA (Not Available) in R. Missing values are tricky. In essence, R avoids any operations which run into missing values. This can come to expression in different ways, for example:

  • If you try to calculate the mean body weight on 10 persons, but 1 of the persons has NA (missing) for weight, then R will not calculate the mean. You have to tell R explicitly to ignore missing values.
  • If you create a regression model with 5 variables, all individuals with missing data for any of those variables will be excluded from the regression model.
  • If you filter observations using a couple of arguments, then individuals with missing data relating to those arguments will not be considered in the filtering process.
  • Missing values can make comparisons tricky. Any comparison including a missing value will be aborted.

Let’s execute 4 operations including NA. None of these operations will return a result. Instead, they all return NA.

NA > 10
## [1] NA
10 == NA
## [1] NA
NA + 10
## [1] NA
NA / 2
## [1] NA

The filter() function only includes rows where the condition is TRUE; it excludes both FALSE and NA values. So, how do you preserve missing values? You have to ask for them explicitly. We will illustrate this using 3 examples.

  1. Save all rows with fewer than 4 nodes:
filter(colon3, nodes<4 | is.na(nodes))
## # A tibble: 802 x 17
##       id study rx      sex   age obstruct perfor adhere nodes status differ
##    <dbl> <dbl> <fct> <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
##  1     8     1 Obs       1    54        0      0      0     1      0      2
##  2     8     1 Obs       1    54        0      0      0     1      0      2
##  3     9     1 Lev       1    46        0      0      1     2      0      2
##  4     9     1 Lev       1    46        0      0      1     2      0      2
##  5    11     1 Lev       0    47        0      0      1     1      0      2
##  6    11     1 Lev       0    47        0      0      1     1      0      2
##  7    13     1 Obs       1    64        0      0      0     1      1      2
##  8    13     1 Obs       1    64        0      0      0     1      1      2
##  9    14     1 Lev       1    68        1      0      0     3      1      2
## 10    14     1 Lev       1    68        1      0      0     3      1      2
## # … with 792 more rows, and 6 more variables: extent <dbl>, surg <dbl>,
## #   node4 <dbl>, time <dbl>, etype <dbl>, new_variable <lgl>
  1. Save all rows missing (NA) information on nodes:
filter(colon3, is.na(nodes))
## # A tibble: 18 x 17
##       id study rx      sex   age obstruct perfor adhere nodes status differ
##    <dbl> <dbl> <fct> <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
##  1    99     1 Lev       1    71        0      0      1    NA      1      2
##  2    99     1 Lev       1    71        0      0      1    NA      1      2
##  3   189     1 Lev       1    72        0      0      0    NA      0      2
##  4   189     1 Lev       1    72        0      0      0    NA      0      2
##  5   338     1 Lev       1    58        0      0      0    NA      1      3
##  6   338     1 Lev       1    58        0      0      0    NA      1      3
##  7   383     1 Lev       1    63        0      0      0    NA      1      2
##  8   383     1 Lev       1    63        0      0      0    NA      1      2
##  9   522     1 Obs       0    72        0      0      0    NA      0      2
## 10   522     1 Obs       0    72        0      0      0    NA      0      2
## 11   590     1 Lev       1    54        0      0      1    NA      1      2
## 12   590     1 Lev       1    54        0      0      1    NA      1      2
## 13   736     1 Obs       1    57        0      0      0    NA      0      2
## 14   736     1 Obs       1    57        0      0      0    NA      1      2
## 15   771     1 Lev       0    65        1      0      0    NA      1      3
## 16   771     1 Lev       0    65        1      0      0    NA      1      3
## 17   787     1 Obs       0    57        0      0      0    NA      1      2
## 18   787     1 Obs       0    57        0      0      0    NA      1      2
## # … with 6 more variables: extent <dbl>, surg <dbl>, node4 <dbl>,
## #   time <dbl>, etype <dbl>, new_variable <lgl>
  1. Save all rows with fewer than 4 nodes or missing information
filter(colon3, nodes<4 | is.na(nodes))
## # A tibble: 802 x 17
##       id study rx      sex   age obstruct perfor adhere nodes status differ
##    <dbl> <dbl> <fct> <dbl> <dbl>    <dbl>  <dbl>  <dbl> <dbl>  <dbl>  <dbl>
##  1     8     1 Obs       1    54        0      0      0     1      0      2
##  2     8     1 Obs       1    54        0      0      0     1      0      2
##  3     9     1 Lev       1    46        0      0      1     2      0      2
##  4     9     1 Lev       1    46        0      0      1     2      0      2
##  5    11     1 Lev       0    47        0      0      1     1      0      2
##  6    11     1 Lev       0    47        0      0      1     1      0      2
##  7    13     1 Obs       1    64        0      0      0     1      1      2
##  8    13     1 Obs       1    64        0      0      0     1      1      2
##  9    14     1 Lev       1    68        1      0      0     3      1      2
## 10    14     1 Lev       1    68        1      0      0     3      1      2
## # … with 792 more rows, and 6 more variables: extent <dbl>, surg <dbl>,
## #   node4 <dbl>, time <dbl>, etype <dbl>, new_variable <lgl>

As shown above, the function is.na() may be used to evaluate whether a value is missing or not.

Arrange (sort) rows with arrange()

arrange() changes the order of the rows. It takes a data frame and a set of columns to order by. If you provide more than one column, it will sort on the columns sequentially. To illustrate this, we’ll create a new data frame:

x <- c(1, 2, 3, 4, 5, 6, 7, 7, 7)
y <- c(5, 10, 15, 20, 25, 30, 40, 90, 20)
df <- data.frame(x, y)

Now we’ll arrange df by x and assign the result to a new data frame calle df2:

df2 <- df %>%
  arrange(x)
df2
##   x  y
## 1 1  5
## 2 2 10
## 3 3 15
## 4 4 20
## 5 5 25
## 6 6 30
## 7 7 40
## 8 7 90
## 9 7 20

Note that the value 7 occcurs three times and these values are not sorted. We can sort on y in addition to x, by executing the following command: Now we’ll arrange df by x and assign the result to a new data frame calle df2:

df2 <- df %>%
  arrange(x, y)
df2
##   x  y
## 1 1  5
## 2 2 10
## 3 3 15
## 4 4 20
## 5 5 25
## 6 6 30
## 7 7 20
## 8 7 40
## 9 7 90

Use desc() to sort in descending order:

df2 <- df %>%
  arrange(desc(x))
df2
##   x  y
## 1 7 40
## 2 7 90
## 3 7 20
## 4 6 30
## 5 5 25
## 6 4 20
## 7 3 15
## 8 2 10
## 9 1  5

On the contrary to filter, arrange does not delete rows with missing values. Instead, arrange palces missing values at the end. Let’s see an example by introducing a missing values to our data frame:

x <- c(NA, 2, NA, 4, 5, NA, 6, 7, 7, 7)
y <- c(5, 10, 15, 20, 33, 25, 30, 40, 90, 20)
df <- data.frame(x, y)

Missing values are always sorted at the end:

df2 <- df %>%
  arrange(x)
df2
##     x  y
## 1   2 10
## 2   4 20
## 3   5 33
## 4   6 30
## 5   7 40
## 6   7 90
## 7   7 20
## 8  NA  5
## 9  NA 15
## 10 NA 25

Select columns with select()

select() allows you to easily select variables (columns). This function is straightforward to use and some examples follows.

# Select id, study, rx, sex
select(colon3, id, study, rx, sex)
## # A tibble: 1,250 x 4
##       id study rx      sex
##    <dbl> <dbl> <fct> <dbl>
##  1     3     1 Obs       0
##  2     3     1 Obs       0
##  3     5     1 Obs       1
##  4     5     1 Obs       1
##  5     7     1 Lev       1
##  6     7     1 Lev       1
##  7     8     1 Obs       1
##  8     8     1 Obs       1
##  9     9     1 Lev       1
## 10     9     1 Lev       1
## # … with 1,240 more rows
# Select all columns between id and perfor
select(colon3, id:perfor)
## # A tibble: 1,250 x 7
##       id study rx      sex   age obstruct perfor
##    <dbl> <dbl> <fct> <dbl> <dbl>    <dbl>  <dbl>
##  1     3     1 Obs       0    71        0      0
##  2     3     1 Obs       0    71        0      0
##  3     5     1 Obs       1    69        0      0
##  4     5     1 Obs       1    69        0      0
##  5     7     1 Lev       1    77        0      0
##  6     7     1 Lev       1    77        0      0
##  7     8     1 Obs       1    54        0      0
##  8     8     1 Obs       1    54        0      0
##  9     9     1 Lev       1    46        0      0
## 10     9     1 Lev       1    46        0      0
## # … with 1,240 more rows
# Excude all columns between id and perfor
select(colon3, -(id:perfor))
## # A tibble: 1,250 x 10
##    adhere nodes status differ extent  surg node4  time etype new_variable
##     <dbl> <dbl>  <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>       
##  1      1     7      1      2      2     0     1   963     2 FALSE       
##  2      1     7      1      2      2     0     1   542     1 TRUE        
##  3      0    22      1      2      3     1     1   659     2 TRUE        
##  4      0    22      1      2      3     1     1   523     1 TRUE        
##  5      0     5      1      2      3     1     1   420     2 TRUE        
##  6      0     5      1      2      3     1     1   229     1 FALSE       
##  7      0     1      0      2      3     0     0  3192     2 FALSE       
##  8      0     1      0      2      3     0     0  3192     1 TRUE        
##  9      1     2      0      2      3     0     0  3173     2 TRUE        
## 10      1     2      0      2      3     0     0  3173     1 TRUE        
## # … with 1,240 more rows

Note that we did not assign the result to a new data frame in these examples. IF you want to assign the result to a new data frame, you should use the assignment operator, as in the examples above. For safety sake, an example follows, in which we create the data frame colon4:

colon4 <- colon3 %>%
  select(id, study, rx, sex)

Helper functions improve the selection

There are several helper functions you can use with select() to improve the selection:

starts_with("r"): matches names at their beginning

# Get all columns starting with 'r'
colon4 <- colon3 %>%
  select(starts_with("r"))

# Check existing variables in colon4
names(colon4)
## [1] "rx"

ends_with(): matches names at their end

# Get all columns ending with 'e'
colon4 <- colon3 %>%
  select(ends_with("e"))

# Check existing variables in colon4
names(colon4)
## [1] "age"          "adhere"       "time"         "etype"       
## [5] "new_variable"

contains(): matches names that contain “ijk”.

# Get all columns containing 'st'
colon4 <- colon3 %>%
  select(contains("st"))

# Check existing variables in colon4
names(colon4)
## [1] "study"    "obstruct" "status"

The everything() helper is useful when you want to reorder the columns. It captures every column that is not explicitly mentioned. Int he following example, we will move the sex variable to the first position by using the everything() helper:

colon_new <- colon %>%
  select(sex, everything())
names(colon_new)
##  [1] "sex"      "id"       "study"    "rx"       "age"      "obstruct"
##  [7] "perfor"   "adhere"   "nodes"    "status"   "differ"   "extent"  
## [13] "surg"     "node4"    "time"     "etype"

Change variable names with rename()

This one is really easy. It allows you to change variable names. We will now create a new data frame (colon5) in which we’ve changed the name of nodes to lymph_nodes, and obstruct to obstruction:

colon5 <- colon3 %>%
  rename(lymph_nodes=nodes,
         obstruction=obstruct)

names(colon5)
##  [1] "id"           "study"        "rx"           "sex"         
##  [5] "age"          "obstruction"  "perfor"       "adhere"      
##  [9] "lymph_nodes"  "status"       "differ"       "extent"      
## [13] "surg"         "node4"        "time"         "etype"       
## [17] "new_variable"

Create new variables with mutate()

Every data science project requires creating new variables (columns). New columns may be functions of existing columns or derived by other means. The mutate() function creates new variables.

colon_new <- colon %>%
  mutate(years = time/365,
         months = years*12)

# Select and view only time and years
select(colon_new, time, years, months)
##      time      years      months
## 1    1521 4.16712329  50.0054795
## 2     968 2.65205479  31.8246575
## 3    3087 8.45753425 101.4904110
## 4    3087 8.45753425 101.4904110
## 5     963 2.63835616  31.6602740
## 6     542 1.48493151  17.8191781
## 7     293 0.80273973   9.6328767
## 8     245 0.67123288   8.0547945
## 9     659 1.80547945  21.6657534
## 10    523 1.43287671  17.1945205
## 11   1767 4.84109589  58.0931507
## 12    904 2.47671233  29.7205479
## 13    420 1.15068493  13.8082192
## 14    229 0.62739726   7.5287671
## 15   3192 8.74520548 104.9424658
## 16   3192 8.74520548 104.9424658
## 17   3173 8.69315068 104.3178082
## 18   3173 8.69315068 104.3178082
## 19   3308 9.06301370 108.7561644
## 20   3308 9.06301370 108.7561644
## 21   2908 7.96712329  95.6054795
## 22   2908 7.96712329  95.6054795
## 23   3309 9.06575342 108.7890411
## 24   3309 9.06575342 108.7890411
## 25   2085 5.71232877  68.5479452
## 26   1130 3.09589041  37.1506849
## 27   2910 7.97260274  95.6712329
## 28   2231 6.11232877  73.3479452
## 29   2754 7.54520548  90.5424658
## 30   2754 7.54520548  90.5424658
## 31   3214 8.80547945 105.6657534
## 32   1323 3.62465753  43.4958904
## 33    406 1.11232877  13.3479452
## 34    258 0.70684932   8.4821918
## 35    522 1.43013699  17.1616438
## 36    389 1.06575342  12.7890411
## 37    887 2.43013699  29.1616438
## 38    604 1.65479452  19.8575342
## 39   3329 9.12054795 109.4465753
## 40   3329 9.12054795 109.4465753
## 41   2789 7.64109589  91.6931507
## 42   2789 7.64109589  91.6931507
## 43    739 2.02465753  24.2958904
## 44    527 1.44383562  17.3260274
## 45    709 1.94246575  23.3095890
## 46    348 0.95342466  11.4410959
## 47   2969 8.13424658  97.6109589
## 48   2969 8.13424658  97.6109589
## 49   2889 7.91506849  94.9808219
## 50   2889 7.91506849  94.9808219
[only first 50 rows displayed]

Note that you can refer to columns created in the same step:

If you only want to keep the new variables, use transmute() instead of mutate():

colon_new <- colon %>%
  transmute(years = time/365,
         months = years*12)

colon_new
##           years      months
## 1    4.16712329  50.0054795
## 2    2.65205479  31.8246575
## 3    8.45753425 101.4904110
## 4    8.45753425 101.4904110
## 5    2.63835616  31.6602740
## 6    1.48493151  17.8191781
## 7    0.80273973   9.6328767
## 8    0.67123288   8.0547945
## 9    1.80547945  21.6657534
## 10   1.43287671  17.1945205
## 11   4.84109589  58.0931507
## 12   2.47671233  29.7205479
## 13   1.15068493  13.8082192
## 14   0.62739726   7.5287671
## 15   8.74520548 104.9424658
## 16   8.74520548 104.9424658
## 17   8.69315068 104.3178082
## 18   8.69315068 104.3178082
## 19   9.06301370 108.7561644
## 20   9.06301370 108.7561644
## 21   7.96712329  95.6054795
## 22   7.96712329  95.6054795
## 23   9.06575342 108.7890411
## 24   9.06575342 108.7890411
## 25   5.71232877  68.5479452
## 26   3.09589041  37.1506849
## 27   7.97260274  95.6712329
## 28   6.11232877  73.3479452
## 29   7.54520548  90.5424658
## 30   7.54520548  90.5424658
## 31   8.80547945 105.6657534
## 32   3.62465753  43.4958904
## 33   1.11232877  13.3479452
## 34   0.70684932   8.4821918
## 35   1.43013699  17.1616438
## 36   1.06575342  12.7890411
## 37   2.43013699  29.1616438
## 38   1.65479452  19.8575342
## 39   9.12054795 109.4465753
## 40   9.12054795 109.4465753
## 41   7.64109589  91.6931507
## 42   7.64109589  91.6931507
## 43   2.02465753  24.2958904
## 44   1.44383562  17.3260274
## 45   1.94246575  23.3095890
## 46   0.95342466  11.4410959
## 47   8.13424658  97.6109589
## 48   8.13424658  97.6109589
## 49   7.91506849  94.9808219
## 50   7.91506849  94.9808219
[only first 50 rows displayed]

Useful functions in mutate

You can use thousands of different functions with mutate. The requirement is that the function must be vectorized, such that it takes a vector of values as input and returns a vector as output. The following functions are frequently used with mutate() and transmute().

  • Arithmetics: +-*/^. These arithemetic operators are easy to use.
  • Logarithms: log()log2()log10().
  • lead() and lag() are very useful to carry values forward and backward. Using these you can calculate running differences or detect when values change.
df  <- data.frame(x=1:10)
df2 <- df %>%
  mutate(previous_value=lag(x),
         next_value=lead(x))
  • Running sums, products, mins and maxes: cumsum()cumprod()cummin()cummax() and cummean() can be used to compute aggregated values.
x
##  [1] NA  2 NA  4  5 NA  6  7  7  7
cumsum(x)
##  [1] NA NA NA NA NA NA NA NA NA NA
cummean(x)
##  [1] NA NA NA NA NA NA NA NA NA NA
  • Logical comparisons, <<=>>=!=, and ==.
  • Ranking: e.g min_rank()row_number()dense_rank()percent_rank()cume_dist() and ntile()

Group summaries with summarise()

summarise() collapses a data frame into a single row.

colon %>%
  summarise(mean_age = mean(age))
##   mean_age
## 1 59.75457

This isn’t very useful since you could simply write mean(colon$age) instead. However, summarise() becomes very useful when used in conjunction with group_by(), which executes the desired operations on individual groups. Let’s calculate the mean age again, but this time in relation to number of lymph nodes. To do this, we have to chain two operatiosn together. In the first operation, we use group_by() which tells R to perform all subsequent operations on individual groups. Then we use summarise() to get summaries.

colon %>%
  group_by(nodes) %>%
  summarise(mean_age = mean(age))
## # A tibble: 25 x 2
##    nodes mean_age
##    <dbl>    <dbl>
##  1     0     47.5
##  2     1     62.0
##  3     2     58.7
##  4     3     60.9
##  5     4     58.8
##  6     5     61.4
##  7     6     57.5
##  8     7     55.3
##  9     8     57.7
## 10     9     54.8
## # … with 15 more rows

Did you notice the %>% operator? that’s the famous pipe operator, which enables you to chain multiple operations together.

Chaining multiple operations with the pipe (%>%)

Using %>% we can chain multiple operations which speeds up coding and also improves the readability of code. Perhaps most obvious, it eliminates the necessity to create intermediate data frames, which is fantastic!

colon_new <- colon %>% 
  filter(age>50) %>%
  mutate(years = time/365) %>%
  filter(years > 5) %>%
  group_by(nodes) %>%
  summarise(
    count = n(), # this calculates the number of individuals in each group
    mean_age = mean(age), # mean age
    mean_years = mean(years, na.rm = TRUE)) # mean survival time

colon_new
## # A tibble: 15 x 4
##    nodes count mean_age mean_years
##    <dbl> <int>    <dbl>      <dbl>
##  1     1   300     64.6       6.54
##  2     2   188     63.3       6.54
##  3     3   113     64.8       6.15
##  4     4    58     64.3       6.71
##  5     5    22     64.3       6.57
##  6     6    23     65.5       6.72
##  7     7    20     61.2       6.16
##  8     8     8     60.4       6.23
##  9     9     6     57         7.90
## 10    10     7     69.3       5.71
## 11    11     1     59         5.07
## 12    12     4     66         6.22
## 13    15     2     74         7.10
## 14    19     2     69         5.25
## 15    NA     9     68.3       6.47

A good way to pronounce %>% when reading code is “then”.

Don’t forget to ungroup when you’re done grouping! When you use group_by(), the data frame will continue to be grouped, which means that you must ungroup it, unless you want the grouping to continue. Let’s write the above code again and add the ungroup statement:

colon_new <- colon %>% 
  filter(age>50) %>%
  mutate(years = time/365) %>%
  filter(years > 5) %>%
  group_by(nodes) %>%
  summarise(
    count = n(),
    mean_age = mean(age),
    mean_years = mean(years, na.rm = TRUE)) %>%
  ungroup()

Also note the useful n() (counts) command which computes the number of individuals in each group. In addition to n(), you can use several other summary functions:

  • Central tendency: mean()median()
  • The standard deviation sd(), is the standard measure of spread. The interquartile range IQR() and median absolute deviation mad()are robust equivalents that may be more useful if you have outliers.
  • Measures of rank: min()quantile()max().
  • Measures of position: first()nth()last(). These functions allows you to retrieve the first, last or nth value for each group.
  • To count the number of non-missing values, use sum(!is.na(x)). To count the number of distinct (unique) values, use n_distinct(x).

Counts are so useful that dplyr provides a simple helper if all you want is a count:

colon %>%
  count(rx, sex)
## # A tibble: 6 x 3
##   rx        sex     n
##   <fct>   <dbl> <int>
## 1 Obs         0   298
## 2 Obs         1   332
## 3 Lev         0   266
## 4 Lev         1   354
## 5 Lev+5FU     0   326
## 6 Lev+5FU     1   282

You can optionally provide a weight variable. For example, you could use this to “count” (sum) the total number of miles a plane flew:

colon %>%
  count(rx, sex, wt=nodes)
## # A tibble: 6 x 3
##   rx        sex     n
##   <fct>   <dbl> <dbl>
## 1 Obs         0  1090
## 2 Obs         1  1272
## 3 Lev         0   972
## 4 Lev         1  1274
## 5 Lev+5FU     0  1184
## 6 Lev+5FU     1   876
  • Counts and proportions of logical values: sum(x > 10)mean(y == 0). When used with numeric functions, TRUE is converted to 1 and FALSE to 0. This makes sum() and mean() very useful: sum(x) gives the number of TRUEs in x, and mean(x) gives the proportion.

By the way, you can group on multiple variables!