
Writing custom functions in R

Johan Svensson, March 29, 2020

Writing Your Own Functions

Functions are a great way to improve your workflow. By writing your own functions, you can automate common tasks and create reproducible, customized tools. Once created, your functions are treated like any other function in R and can be used anywhere.

Functions are R objects that carry out specific tasks. They are created with the function() constructor, which has a special format. Functions may seem difficult to compose at first, but with a little practice you will soon be writing great functions.

The Anatomy of Functions in R

Functions have three components: a name, arguments and a body of code. To create a function you use the function() function. Arguments are placed in the parentheses of function() and the code to be executed is placed within a pair of braces, {}.
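
As a minimal sketch (function_name, argument1 and argument2 are just placeholder names), the anatomy looks like this:

function_name <- function(argument1, argument2) {
  # body: the code to be executed
}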

Let’s create a simple function which will take a numeric value and multiply it by 10.

my_function <- function(my_value) {
  my_value*10
}

We will now try the function. When the function is executed, R will replace the argument name in the body with the value we supply the argument in the function call. Let’s supply the value 10 to my_value:

my_function(my_value=10)
## [1] 100

You can also place other functions inside functions. Let’s see an example where we place the round() function within a function we create. We will then pass the value 3.14 to this function and see what happens:

my_function <- function(my_value) {
  round(my_value)
}

my_function(my_value=3.14)
## [1] 3

In both examples above, we have indented each line of code between the braces. This makes the function easier to read. Unlike Python, R does not attach meaning to spaces, line breaks or indentation, but you should still write each operation on a new line to make the function easier to read.

You assign the function to a new R object (above called my_function). This object is then used to execute the function: type the name of the function followed by parentheses containing the arguments.

Functions return the result of the last line of code

When you execute a function, R runs each line in the body of the function and then returns the result of the last line of code. If the last line does not produce a visible value (for example, because it is an assignment), nothing is printed.

This function will return a result:

# This function returns the result of a multiplication
my_function <- function(my_value) {
  my_value*5
}

# Test the function
my_function(5)
## [1] 25

This function will not print a result, because the last line is an assignment and the value is returned invisibly:

# This function saves the value to a new object
my_function <- function(my_value) {
  my_new_value <- my_value*5
}

# Test the function
my_function(5)
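
Nothing is printed, but the value is in fact returned invisibly. A quick way to see this (simply capturing the return value in a new object) is:

result <- my_function(5)
result
## [1] 25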

Functions with multiple arguments

You can define as many arguments as you wish. When the function is executed, R replaces each argument name in the body with the value supplied for that argument. Let’s create a function with two arguments:

my_function <- function(height, weight) {
  height*weight
}

# Test the function
my_function(height=10, weight=15)
## [1] 150

What would happen if we forget to supply a value for weight? Let’s see:

my_function <- function(height, weight) {
  height*weight
}

# Test the function
my_function(height=10)

R informs you that the argument weight is missing, with no default. This means that you forgot to supply a value for weight and the function has no default value to fall back on. Hence, we would be well advised to supply default values when appropriate. Default values are provided in the parentheses, as follows:

# We set the default value of weight to 10
my_function <- function(height, weight=10) {
  height*weight
}

# Test the function
my_function(height=10)
## [1] 100

Now the function returns the result 100, which is correct (10 multiplied by the default weight of 10).

When should you write a function?

You should consider writing a function when you have three or more copies of the same code. Copying and pasting code is not advisable; write a function instead.

How to write a function in R

Before writing a function, study your code to see how many inputs you need. Let’s look at an example. We will create a new data frame and call it df. This data frame will contain two numeric variables, age and score.

# create variables
age <- c(29, 22, 33, 41, 15, 26, 17, 28, 39, 32, 33, 21, 25, 46, 27, 18)
score <- c(2, 23, 25, 40, 33, 3, 33, 49, 19, 5, 32, 5, 67, 87, 32, 2)

# merge variables into data frame
df <- data.frame(age, score)

We want to create a function that performs two calculations:

  • Multiply each value by two.
  • Calculate the mean of the resulting values.

We will not create the function directly. Instead, we’ll start with ordinary code and then modify it into a function. To do this we will create intermediary variables, called intermediary1, intermediary2 and so on, which are used to temporarily hold values:

intermediary1 <- df$age*2
intermediary2 <- mean(intermediary1)
intermediary2
## [1] 56.5

Studying the code, we see that we have only one input, namely df$age. Let’s replace that input with x:

x <- df$age
intermediary1 <- x*2
intermediary2 <- mean(intermediary1)
intermediary2
## [1] 56.5

Note that it is wise to use intermediary variables (intermediary1, intermediary2) because they make the code clearer. Let’s turn the code into a function, which we will call my_function:

# Create the function
my_function <- function(x) {
  intermediary1 <- x*2
  intermediary2 <- mean(intermediary1)
  intermediary2
}

# Try the function
my_function(df$age)
## [1] 56.5

We can actually use this function on any variable. Let’s try it on score:

my_function(df$score)
## [1] 57.125

Let’s make things slightly more complicated by modifying the function so that it multiplies the final result by the variable score. To do this, we’ll start with ordinary code again. This time, however, we have two inputs, x and y.

x <- df$age
y <- df$score

intermediary1 <- x*2
intermediary2 <- mean(intermediary1)
intermediary2 * y
##  [1]  113.0 1299.5 1412.5 2260.0 1864.5  169.5 1864.5 2768.5 1073.5  282.5
## [11] 1808.0  282.5 3785.5 4915.5 1808.0  113.0

Now let’s turn that code into a function with two inputs, x and y:

# Create the function
my_function <- function(x, y) {
  intermediary1 <- x*2
  intermediary2 <- mean(intermediary1)
  intermediary2*y
}

# Try the function
my_function(df$age, df$score)
##  [1]  113.0 1299.5 1412.5 2260.0 1864.5  169.5 1864.5 2768.5 1073.5  282.5
## [11] 1808.0  282.5 3785.5 4915.5 1808.0  113.0

Naming functions

You can call your function whatever you like. If your function calculates incidence rates, you may want to call it incidence(), incidence_rate(), inc_rate() and so on. It is important to use a name that clearly implies what the function does. Short names are also preferred. You may use multiple words to name your function, e.g. incidence_rate(). It is generally recommended to use lowercase words connected with underscores (_). You may prefer other ways of naming your functions, and that is perfectly fine, as long as you are consistent and clear.
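
As a hypothetical illustration of these conventions (the function names and arguments below are made up for this example):

# Good: short, lowercase words connected with an underscore
incidence_rate <- function(cases, person_years) {
  cases / person_years
}

# Harder to read: unclear abbreviation and inconsistent capitalisation
IncRt <- function(a, b) {
  a / b
}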

Avoid overriding existing functions and variables

R has many built-in functions, and all loaded packages contain functions as well. If you create a new function with the same name as an existing function or variable, your new function will mask (take precedence over) the existing one. This happens rather frequently and may cause unexpected results.
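
A minimal sketch of what masking can look like, deliberately (and temporarily) masking the built-in mean() to illustrate the problem:

# Our own mean() masks base::mean()
mean <- function(x) {
  0
}
mean(c(1, 2, 3))
## [1] 0

# Remove our function so that base::mean() is found again
rm(mean)
mean(c(1, 2, 3))
## [1] 2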

Comments

With regard to clarity, make sure that you embed comments in your functions. Comments are fundamental when writing code. Let’s rewrite the function with comments:

my_function <- function(x, y) {
  # Multiply by 2
  intermediary1 <- x*2
  # Calculate mean
  intermediary2 <- mean(intermediary1)
  # Multiply by y
  intermediary2*y
}

In the example above, we’ve commented on what the code does, which is arguably redundant since the code already shows what it does. Hence, these comments are superfluous. It is much better to write comments that answer the question why, i.e. why the lines were written the way they were.

Conditional execution: if statements

You can use conditional statements to execute code depending on a condition. Conditionals are fundamental to writing efficient code. The structure of a conditional statement is as follows:

if (condition) {
  # lines executed when condition is TRUE
} else {
  # lines executed when condition is FALSE
}

The statement starts with if, followed by the condition in parentheses. If the condition is true, all lines within the following braces ({}) are executed. If the condition is false, the lines within the braces after else are run instead.

Let’s use the same data as previously (age and score) to create a function with a conditional statement.

# create variables
age <- c(29, 22, 33, 41, 15, 26, 17, 28, 39, 32, 33, 21, 25, 46, 27, 18)
score <- c(2, 23, 25, 40, 33, 3, 33, 49, 19, 5, 32, 5, 67, 87, 32, 2)

# create data frame
df <- data.frame(age, score)

my_function <- function(x, y) {
  if (y < 20) {
    x*2
  } else {
    x*3
  }
}

my_function(df$age, df$score)
## Warning in if (y < 20) {: the condition has length > 1 and only the
## first element will be used
##  [1] 58 44 66 82 30 52 34 56 78 64 66 42 50 92 54 36

As usual, the function returns the last value it computes: if the condition is true, it returns x*2, otherwise x*3. Note the warning: if expects a single TRUE or FALSE, but y (score) is a vector, so only its first element (2, which is less than 20) is used. That is why every value of age was multiplied by 2.
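
If you instead want the condition to be applied element by element, one option is the vectorised ifelse(); this is a sketch, not part of the original example:

my_function <- function(x, y) {
  # ifelse() evaluates the condition separately for each element
  ifelse(y < 20, x*2, x*3)
}
my_function(df$age, df$score)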

Conditions

You can use || (or) and && (and) to combine multiple logical expressions. These operators are “short-circuiting”: as soon as || sees the first TRUE it returns TRUE without computing anything else. As soon as && sees the first FALSE it returns FALSE. You should never use | or & in an if statement: these are vectorised operations that apply to multiple values (that’s why you use them in filter()). If you do have a logical vector, you can use any() or all() to collapse it to a single value.
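
For example, a small sketch using a made-up vector x:

x <- c(1, 5, 10)

# all() and any() collapse a logical vector into a single TRUE or FALSE
if (all(x > 0)) {
  "all values are positive"
}
## [1] "all values are positive"

if (any(x > 8)) {
  "at least one value is above 8"
}
## [1] "at least one value is above 8"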

Be careful when testing for equality. == is vectorised, which means that it’s easy to get more than one output. Either check that the length is already 1, collapse with all() or any(), or use the non-vectorised identical(). identical() is very strict: it always returns either a single TRUE or a single FALSE, and doesn’t coerce types. This means that you need to be careful when comparing integers and doubles:

identical(0L, 0)
## [1] FALSE

You also need to be wary of floating point numbers:

x <- sqrt(2) ^ 2
x
## [1] 2
x == 2
## [1] FALSE
x - 2
## [1] 4.440892e-16

Instead use dplyr::near() for such comparisons.

And remember, x == NA doesn’t do anything useful!
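
A quick illustration of why:

NA == 5
## [1] NA
NA == NA
## [1] NA

# Use is.na() to test for missing values instead
is.na(NA)
## [1] TRUE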

Multiple conditions

You can chain multiple if statements together:

if (this) {
  # do that
} else if (that) {
  # do something else
} else {
  # 
}

But if you end up with a very long series of chained if statements, you should consider rewriting. One useful technique is the switch() function. It allows you to evaluate selected code based on position or name.

function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}
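
To make the sketch above runnable, we can assign it to a name (calculate is just a placeholder chosen for this illustration) and call it:

calculate <- function(x, y, op) {
  switch(op,
    plus = x + y,
    minus = x - y,
    times = x * y,
    divide = x / y,
    stop("Unknown op!")
  )
}

calculate(5, 2, "plus")
## [1] 7
calculate(5, 2, "divide")
## [1] 2.5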

Another useful function that can often eliminate long chains of if statements is cut(). It’s used to discretise continuous variables.
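
A minimal sketch of cut() (the breaks and labels below are made up for illustration):

# Discretise a continuous variable into labelled intervals
ages <- c(15, 22, 37, 41, 68)
cut(ages, breaks = c(0, 18, 40, 65, Inf),
    labels = c("child", "young adult", "middle-aged", "senior"))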

Code style

Both if and function should (almost) always be followed by squiggly brackets ({}), and the contents should be indented by two spaces. This makes it easier to see the hierarchy in your code by skimming the left-hand margin.

An opening curly brace should never go on its own line and should always be followed by a new line. A closing curly brace should always go on its own line, unless it’s followed by else. Always indent the code inside curly braces.

# Good
if (y < 0 && debug) {
  message("Y is negative")
}

if (y == 0) {
  log(x)
} else {
  y ^ x
}

# Bad
if (y < 0 && debug)
message("Y is negative")

if (y == 0) {
  log(x)
} 
else {
  y ^ x
}

It’s ok to drop the curly braces if you have a very short if statement that can fit on one line:

y <- 10
x <- if (y < 20) "Too low" else "Too high"

I recommend this only for very brief if statements. Otherwise, the full form is easier to read:

if (y < 20) {
  x <- "Too low" 
} else {
  x <- "Too high"
}

Function arguments

The arguments to a function typically fall into two broad sets: one set supplies the data to compute on, and the other supplies arguments that control the details of the computation. For example:

  • In log(), the data is x, and the detail is the base of the logarithm.
  • In mean(), the data is x, and the details are how much data to trim from the ends (trim) and how to handle missing values (na.rm).
  • In t.test(), the data are x and y, and the details of the test are alternative, mu, paired, var.equal, and conf.level.
  • In str_c() you can supply any number of strings to ..., and the details of the concatenation are controlled by sep and collapse.

Generally, data arguments should come first. Detail arguments should go on the end, and usually should have default values. You specify a default value in the same way you call a function with a named argument:

# Compute confidence interval around mean using normal approximation
mean_ci <- function(x, conf = 0.95) {
  se <- sd(x) / sqrt(length(x))
  alpha <- 1 - conf
  mean(x) + se * qnorm(c(alpha / 2, 1 - alpha / 2))
}

x <- runif(100)
mean_ci(x)
## [1] 0.4057673 0.5229811
mean_ci(x, conf = 0.99)
## [1] 0.3873516 0.5413968

The default value should almost always be the most common value. The few exceptions to this rule are to do with safety. For example, it makes sense for na.rm to default to FALSE because missing values are important. Even though na.rm = TRUE is what you usually put in your code, it’s a bad idea to silently ignore missing values by default.
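
For example:

mean(c(1, 2, NA))
## [1] NA
mean(c(1, 2, NA), na.rm = TRUE)
## [1] 1.5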

When you call a function, you typically omit the names of the data arguments, because they are used so commonly. If you override the default value of a detail argument, you should use the full name:

# Good
mean(1:10, na.rm = TRUE)

# Bad
mean(x = 1:10, , FALSE)
mean(, TRUE, x = c(1:10, NA))

You can refer to an argument by its unique prefix (e.g. mean(x, n = TRUE)), but this is generally best avoided given the possibilities for confusion.

Notice that when you call a function, you should place spaces around = and always put a space after a comma, never before (just like in regular English). Using whitespace makes it easier to skim the call for the important components.

# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)

Choosing names

The names of the arguments are also important. R doesn’t care, but the readers of your code (including future-you!) will. Generally you should prefer longer, more descriptive names, but there are a handful of very common, very short names. It’s worth memorising these:

  • x, y, z: vectors.
  • w: a vector of weights.
  • df: a data frame.
  • i, j: numeric indices (typically rows and columns).
  • n: length, or number of rows.
  • p: number of columns.

Otherwise, consider matching names of arguments in existing R functions. For example, use na.rm to determine if missing values should be removed.

Checking values

As you start to write more functions, you’ll eventually get to the point where you don’t remember exactly how your function works. At this point it’s easy to call your function with invalid inputs. To avoid this problem, it’s often useful to make constraints explicit. For example, imagine you’ve written some functions for computing weighted summary statistics:

wt_mean <- function(x, w) {
  sum(x * w) / sum(w)
}
wt_var <- function(x, w) {
  mu <- wt_mean(x, w)
  sum(w * (x - mu) ^ 2) / sum(w)
}
wt_sd <- function(x, w) {
  sqrt(wt_var(x, w))
}

What happens if x and w are not the same length?

wt_mean(1:6, 1:3)
## [1] 7.666667

In this case, because of R’s vector recycling rules, we don’t get an error.
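
To see the recycling that happens behind the scenes:

# The shorter vector (1:3) is recycled to match the longer one (1:6)
1:6 * 1:3
## [1]  1  4  9  4 10 18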

It’s good practice to check important preconditions, and throw an error (with stop()), if they are not true:

wt_mean <- function(x, w) {
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  sum(w * x) / sum(w)
}

Be careful not to take this too far. There’s a tradeoff between how much time you spend making your function robust, versus how long you spend writing it. For example, if you also added a na.rm argument, I probably wouldn’t check it carefully:

wt_mean <- function(x, w, na.rm = FALSE) {
  if (!is.logical(na.rm)) {
    stop("`na.rm` must be logical")
  }
  if (length(na.rm) != 1) {
    stop("`na.rm` must be length 1")
  }
  if (length(x) != length(w)) {
    stop("`x` and `w` must be the same length", call. = FALSE)
  }
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}

This is a lot of extra work for little additional gain. A useful compromise is the built-in stopifnot(): it checks that each argument is TRUE, and produces a generic error message if not.

wt_mean <- function(x, w, na.rm = FALSE) {
  stopifnot(is.logical(na.rm), length(na.rm) == 1)
  stopifnot(length(x) == length(w))
  
  if (na.rm) {
    miss <- is.na(x) | is.na(w)
    x <- x[!miss]
    w <- w[!miss]
  }
  sum(w * x) / sum(w)
}
wt_mean(1:6, 6:1, na.rm = "foo")
## Error in wt_mean(1:6, 6:1, na.rm = "foo"): is.logical(na.rm) is not TRUE

Note that when using stopifnot() you assert what should be true rather than checking for what might be wrong.

Dot-dot-dot (…)

Many functions in R take an arbitrary number of inputs:

sum(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
## [1] 55
stringr::str_c("a", "b", "c", "d", "e", "f")
## [1] "abcdef"

How do these functions work? They rely on a special argument: ... (pronounced dot-dot-dot). This special argument captures any number of arguments that aren’t otherwise matched.

It’s useful because you can then send those ... on to another function. This is a useful catch-all if your function primarily wraps another function. For example, I commonly create these helper functions that wrap around str_c():

commas <- function(...) stringr::str_c(..., collapse = ", ")
commas(letters[1:10])
## [1] "a, b, c, d, e, f, g, h, i, j"
rule <- function(..., pad = "-") {
  title <- paste0(...)
  width <- getOption("width") - nchar(title) - 5
  cat(title, " ", stringr::str_dup(pad, width), "\n", sep = "")
}
rule("Important output")
## Important output ------------------------------------------------------

Here ... lets me forward on any arguments that I don’t want to deal with to str_c(). It’s a very convenient technique. But it does come at a price: any misspelled arguments will not raise an error. This makes it easy for typos to go unnoticed:

x <- c(1, 2)
sum(x, na.mr = TRUE)
## [1] 4

If you just want to capture the values of the ..., use list(...).
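
A small sketch (count_args is a made-up helper name):

count_args <- function(...) {
  args <- list(...)   # capture everything passed via ... as a list
  length(args)
}
count_args("a", "b", "c")
## [1] 3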

Lazy evaluation

Arguments in R are lazily evaluated: they’re not computed until they’re needed. That means if they’re never used, they’re never called. This is an important property of R as a programming language, but is generally not important when you’re writing your own functions for data analysis. You can read more about lazy evaluation at http://adv-r.had.co.nz/Functions.html#lazy-evaluation.
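
A minimal sketch of what lazy evaluation means in practice (f and its default for y are made up for this example):

# The default value of y would throw an error, but because y is never
# used inside the function it is never evaluated, so the call succeeds
f <- function(x, y = stop("y was never supplied")) {
  x * 2
}
f(10)
## [1] 20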

Exercises

  1. What does commas(letters, collapse = "-") do? Why?
  2. It’d be nice if you could supply multiple characters to the pad argument, e.g. rule("Title", pad = "-+"). Why doesn’t this currently work? How could you fix it?
  3. What does the trim argument to mean() do? When might you use it?
  4. The default value for the method argument to cor() is c("pearson", "kendall", "spearman"). What does that mean? What value is used by default?

Return values

Figuring out what your function should return is usually straightforward: it’s why you created the function in the first place! There are two things you should consider when returning a value:

  1. Does returning early make your function easier to read?
  2. Can you make your function pipeable?

Explicit return statements

The value returned by the function is usually the last statement it evaluates, but you can choose to return early by using return(). I think it’s best to save the use of return() to signal that you can return early with a simpler solution. A common reason to do this is because the inputs are empty:

complicated_function <- function(x, y, z) {
  if (length(x) == 0 || length(y) == 0) {
    return(0)
  }
    
  # Complicated code here
}

Another reason is that you have an if statement with one complex block and one simple block. For example, you might write an if statement like this:

f <- function() {
  if (x) {
    # Do 
    # something
    # that
    # takes
    # many
    # lines
    # to
    # express
  } else {
    # return something short
  }
}

But if the first block is very long, by the time you get to the else, you’ve forgotten the condition. One way to rewrite it is to use an early return for the simple case:

f <- function() {
  if (!x) {
    return(something_short)
  }

  # Do 
  # something
  # that
  # takes
  # many
  # lines
  # to
  # express
}

This tends to make the code easier to understand, because you don’t need quite so much context to understand it.

Writing pipeable functions

If you want to write your own pipeable functions, it’s important to think about the return value. Knowing the return value’s object type will mean that your pipeline will “just work”. For example, with dplyr and tidyr the object type is the data frame.

There are two basic types of pipeable functions: transformations and side-effects. With transformations, an object is passed to the function’s first argument and a modified object is returned. With side-effects, the passed object is not transformed. Instead, the function performs an action on the object, like drawing a plot or saving a file. Side-effects functions should “invisibly” return the first argument, so that while they’re not printed they can still be used in a pipeline. For example, this simple function prints the number of missing values in a data frame:

show_missings <- function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  
  invisible(df)
}

If we call it interactively, the invisible() means that the input df doesn’t get printed out:

show_missings(mtcars)
## Missing values: 0

But it’s still there, it’s just not printed by default:

x <- show_missings(mtcars) 
## Missing values: 0
class(x)
## [1] "data.frame"
dim(x)
## [1] 32 11

And we can still use it in a pipe:

mtcars %>% 
  show_missings() %>% 
  mutate(mpg = ifelse(mpg < 20, NA, mpg)) %>% 
  show_missings() 
## Missing values: 0
## Missing values: 18

Environment

The last component of a function is its environment. This is not something you need to understand deeply when you first start writing functions. However, it’s important to know a little bit about environments because they are crucial to how functions work. The environment of a function controls how R finds the value associated with a name. For example, take this function:

f <- function(x) {
  x + y
} 

In many programming languages, this would be an error, because y is not defined inside the function. In R, this is valid code because R uses rules called lexical scoping to find the value associated with a name. Since y is not defined inside the function, R will look in the environment where the function was defined:

y <- 100
f(10)
## [1] 110
y <- 1000
f(10)
## [1] 1010

This behaviour seems like a recipe for bugs, and indeed you should avoid creating functions like this deliberately, but by and large it doesn’t cause too many problems (especially if you regularly restart R to get to a clean slate).

The advantage of this behaviour is that from a language standpoint it allows R to be very consistent. Every name is looked up using the same set of rules. For f() that includes the behaviour of two things that you might not expect: { and +. This allows you to do devious things like:

`+` <- function(x, y) {
  if (runif(1) < 0.1) {
    sum(x, y)
  } else {
    sum(x, y) * 1.1
  }
}
table(replicate(1000, 1 + 2))
## 
##   3 3.3 
##  95 905
rm(`+`)

This is a common phenomenon in R. R places few limits on your power: you can do many things that you can’t do in other programming languages, including many things that 99% of the time are extremely ill-advised (like overriding how addition works!). But this power and flexibility is what makes tools like ggplot2 and dplyr possible. Learning how to make the best use of this flexibility is beyond the scope of this tutorial, but you can read about it in Advanced R.
