Custom Functions

Putting the funk in functional programming

Author

Rob Chavez

Introduction

According to Hadley Wickham, “R, at its heart, is a functional language. This means that it has certain technical properties, but more importantly that it lends itself to a style of problem solving centred on functions.”

When many of us think of R, we often think of all the great tools and packages available for it. However, the real power of scientific programming is the ability to have full control in customizing your own set of of tools for use in your research and workflow. Learning to take advantage of this can really streamline analyses that are specific to your projects.

Functions have to be defined using function() and set to an object to be used later. Within the function parentheses, you get to tell it how many augments to take and what they will be called within the function. Finally, your function needs needs to output its results using return().

Here is an example of a custom function that takes to character strings and makes them into a single character string using paste(), separated by a comma (not super useful, but it is illustrative). We will first define the function then test on some data.

# define function and agruments
combine_two_strings <- function(string1, string2){
  
  combo <- paste(string1, string2, sep = ",")
  return(combo)
  
}

# test function on random data
var1 <- "apple"
var2 <- "mango"

# run function on data
combine_two_strings(var1, var2)

[1] "apple,mango"

Sometimes you want more flexibility in the number of arguments your function can take. If that is the case, you can use the ... argument that allows users to add additional arguments. These will still need to be called within the function you define.

# define function including ...
combine_all_strings <- function(string1, ...){

  combo <- paste(string1, ..., sep = ",")
  return(combo)
  
}

Now when we run this function, we can include any number of arguments as long as they work with the procedure you defined within the function.

# test function on random data
var1 <- "apple"
var2 <- "mango"
var3 <- "raspberry"

# run function on data
combine_all_strings(var1, var2, var3, var1, var3)

[1] "apple,mango,raspberry,apple,raspberry"

In practice, I don’t find myself using the ... option very often when making custom functions for my data analyses. However, they can be very useful as you get deeper into tool building for your lab or package creation when others might be using your code for different things than you are.

Write your own `mean()`

Let’s do a more quantitative example that you might have seen before in the Quarto template on Canvas. This is a example of a function that calculates the mean of a vector that we ill use as a toy example of making a quantitative custom function. Obviously, R already has a function for calculating the mean of a vector mean(). However, this will make it easy to test our custom function against the real one to make sure it is working correctly.

Our function will just be taking the standard formula for a mean and making it into R code. To do this, we need to tell the function to sum a numeric vector then divide by the number of values in that vector.

# define custom function where x is a numeric vector
my_mean <- function(x){
  s <- sum(x)
  n <- length(x)
  m <- s/n
  
  return(m)
}

Now that we have defined our function, we can test it on some random data.

# make a vector of numbers
random_numbers <- c(2,5,4,7,5)

# use custom function
custom_average <- my_mean(random_numbers)

print(custom_average)

[1] 4.6

Finally, let’s test it against the real, built-in mean() function.

# compare to built-in mean() function
builtin_average <- mean(random_numbers)

custom_average == builtin_average

[1] TRUE

As you can see, both come up with the same value so we are confident that our function worked! In the real world, you should stress test your functions against kind of data your are using them for and check for places where they may be doing unexpected things.

Including additional options

As part of defining your functions arguments, you can start to add in additional options that might be useful—and sometimes very critical—too.

For example, our my_mean() custom function cannot handle NA values.

# make vector with NA value
na_numbs <- c(2,5,4,7,NA)

# run function
my_mean(na_numbs)

[1] NA

To deal with this issue, we will make a new function called my_better_mean() that has some options for how to deal with NA values. However, that sometimes means adding some if and/or else statements within the function.

my_better_mean <- function(x, na.rm = FALSE){
  
  if(na.rm==FALSE){
    s <- sum(x)
    n <- length(x)
    m <- s/n
  } 
  
  if(na.rm==TRUE){
    y <- x[!is.na(x)]  # removes NA values from input
    s <- sum(y)
    n <- length(y)
    m <- s/n
  }
  
  return(m)
}

# mean removing NA
my_better_mean(na_numbs, na.rm = TRUE)

[1] 4.5

# mean without removing NAs will just return an NA
my_better_mean(na_numbs)

[1] NA

As you can see, the first version of the function will remove the NA values before calculating the mean. However, the second won’t error but it will return a NA value because of the missing data.

Your custom functions can be used in other parts of R

Once they are written, you can use your custom functions anywhere in your code where their output is the appropriate data class.

Here are some quick examples using our my_better_mean function on the mtcars data included in R in conjunction with the apply function and a dplyr summary.

# use custom function in apply
apply(mtcars, 2, my_better_mean)

       mpg        cyl       disp         hp       drat         wt       qsec 
 20.090625   6.187500 230.721875 146.687500   3.596563   3.217250  17.848750 
        vs         am       gear       carb 
  0.437500   0.406250   3.687500   2.812500

# use custom function in a dplyr summary 
library(dplyr)


Attaching package: 'dplyr'

The following objects are masked from 'package:stats':

    filter, lag

The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union

mtcars %>% 
  group_by(cyl) %>% 
  summarise(cyl_average= my_better_mean(mpg))

# A tibble: 3 × 2
    cyl cyl_average
  <dbl>       <dbl>
1     4        26.7
2     6        19.7
3     8        15.1

More Complex Output and Indexing

In principle, you can have the return() command return any kind of data type. However, you can only have one return() statement per function. So what do you do if you want to return multiple things at the same time? What if they are not all the same class or type of data? For this, it is list() to the rescue!

Below is an example of a making a custom function called qcor() that will take the input of two vectors and do a quick correlation of them. Here, we will ask it to return the correlation coefficient and a simple regression model (so we can see the p-value and other info) for the two input vectors.

# make random vectors to correlate 
set.seed(17)
var1 <- rnorm(100)
var2 <- rnorm(100)

# define function
qcor <- function(x, y){
  cor_out <- cor(x, y)
  reg_out <- summary(lm(y ~ x))
  out_list <- list(cor_out, reg_out)
  return(out_list)
}

# run function on data
qcor(var1, var2)

[[1]]
[1] -0.04510737

[[2]]

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3481 -0.9818 -0.1057  0.7808  3.0222 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09089    0.11397  -0.797    0.427
x           -0.04840    0.10828  -0.447    0.656

Residual standard error: 1.14 on 98 degrees of freedom
Multiple R-squared:  0.002035,  Adjusted R-squared:  -0.008149 
F-statistic: 0.1998 on 1 and 98 DF,  p-value: 0.6559

Let’s do the same thing, but on the variables from the data frame mtcars that comes built-in with R.

# run function on data from data frame
qcor(mtcars$mpg, mtcars$wt)

[[1]]
[1] -0.8676594

[[2]]

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6516 -0.3490 -0.1381  0.3190  1.3684 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.04726    0.30869  19.590  < 2e-16 ***
x           -0.14086    0.01474  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4945 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

You can see that the returned value is a list that contains both kinds of information. If you want, you can assign this output of this function to a list object. Then you index this object for other information any time you want just like you would any other list() object.

Below, we will extract the correlation coefficient by itself and then separately the p-value by itself. We do this by indexing the object in different ways. Note: the indexing can get quite complicated and depend heavily on what you have included in the list. Always test it to make sure you are getting the right info!

# assign function output to object
qcor_object <- qcor(var1, var2)

# index the correlation coefficient list object only
qcor_object[[1]]

[1] -0.04510737

# a more complex indexing of the only the p-value value from lm list object
qcor_object[[2]]$coefficients[8]

[1] 0.6558649

Incorporating functions from other packages

It is common to want to use functions from other packages within your function. This is called a dependency, and they can be useful in incorporating other tools into yours. However, because they depend on other packages, you have to change them as those packages change themselves (and can sometimes break your function). There are a few different ways to do this, but the most common way is to use the require() command within your function to load a package.

Below, we are going to take our previous function qcor() and rewrite it to also include a scatter plot from the package ggplot2. It will include this as the 3rd object in the list and can be indexed accordingly.

# define function
qcor <- function(x, y){
  
  require(ggplot2)
  cor_out <- cor(x, y)
  reg_out <- summary(lm(y ~ x))
  
  temp_df <- data.frame(x,y)
  plot_out <- ggplot(temp_df, aes(x, y)) + 
    geom_point() +
    geom_smooth(method = 'lm')
    
  out_list <- list(cor_out, reg_out, plot_out)
  return(out_list)
  
}

# run the new qcor function that will include the plot 
qcor(var1, var2)

Loading required package: ggplot2

[[1]]
[1] -0.04510737

[[2]]

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3481 -0.9818 -0.1057  0.7808  3.0222 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09089    0.11397  -0.797    0.427
x           -0.04840    0.10828  -0.447    0.656

Residual standard error: 1.14 on 98 degrees of freedom
Multiple R-squared:  0.002035,  Adjusted R-squared:  -0.008149 
F-statistic: 0.1998 on 1 and 98 DF,  p-value: 0.6559


[[3]]

`geom_smooth()` using formula = 'y ~ x'

# run it on the mtcars data 
qcor(mtcars$mpg, mtcars$wt)

[[1]]
[1] -0.8676594

[[2]]

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.6516 -0.3490 -0.1381  0.3190  1.3684 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  6.04726    0.30869  19.590  < 2e-16 ***
x           -0.14086    0.01474  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4945 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10


[[3]]

`geom_smooth()` using formula = 'y ~ x'

Modifying and tweaking functions from other packages

As we’ve been doing so far, you can write your own functions from scratch. However, it can also be helpful to just make some minor changes from other packages for your own use. This doesn’t work with functions from the base R package, but it works well with most other packages.

For example, here is the code for the theme_minimal(), a function in ggplot2 for having a stripped down theme to plot the data.

You just evaluate (i.e., press enter) the name of the function without the () and it will show how the function is defined. You can then copy and paste the output into a new function you can modify. (Note: do not include the <bytecode> or <environment> info when you copy the function.)

# print funciton by entering the name with out the ()
theme_minimal

function (base_size = 11, base_family = "", header_family = NULL, 
    base_line_size = base_size/22, base_rect_size = base_size/22, 
    ink = "black", paper = "white", accent = "#3366FF") 
{
    theme_bw(base_size = base_size, base_family = base_family, 
        header_family = header_family, base_line_size = base_line_size, 
        base_rect_size = base_rect_size, ink = ink, paper = paper, 
        accent = accent) %+replace% theme(axis.ticks = element_blank(), 
        axis.text.x.bottom = element_text(margin = margin(t = 0.45 * 
            base_size)), axis.text.x.top = element_text(margin = margin(b = 0.45 * 
            base_size)), axis.text.y.left = element_text(margin = margin(r = 0.45 * 
            base_size)), axis.text.y.right = element_text(margin = margin(l = 0.45 * 
            base_size)), legend.background = element_blank(), 
        legend.key = element_blank(), panel.background = element_blank(), 
        panel.border = element_blank(), strip.background = element_blank(), 
        plot.background = element_rect(fill = paper, colour = NA), 
        complete = TRUE)
}
<bytecode: 0x132469200>
<environment: namespace:ggplot2>

Now, we can just copy this code and change it however we want!

Let’s say we like everything about this, but we want the background to be black and the text to be white. We can just make a new function that copies the output of the original code, assigns it to a new function called my_theme_minimal. Then we change the code the code so that the options default to ink = "white" and paper = "black" instead of the reverse of that in the original code. That is the only change from the original function.

# assign new function to copied code, but change ink and paper

my_theme_minimal <- function (base_size = 11, base_family = "", header_family = NULL, 
    base_line_size = base_size/22, base_rect_size = base_size/22, 
    ink = "white", paper = "black", accent = "#3366FF")   # this is the only line that is different
  {
    theme_bw(base_size = base_size, base_family = base_family, 
        header_family = header_family, base_line_size = base_line_size, 
        base_rect_size = base_rect_size, ink = ink, paper = paper, 
        accent = accent) %+replace% theme(axis.ticks = element_blank(), 
        axis.text.x.bottom = element_text(margin = margin(t = 0.45 * 
            base_size)), axis.text.x.top = element_text(margin = margin(b = 0.45 * 
            base_size)), axis.text.y.left = element_text(margin = margin(r = 0.45 * 
            base_size)), axis.text.y.right = element_text(margin = margin(l = 0.45 * 
            base_size)), legend.background = element_blank(), 
        legend.key = element_blank(), panel.background = element_blank(), 
        panel.border = element_blank(), strip.background = element_blank(), 
        plot.background = element_rect(fill = paper, colour = NA), 
        complete = TRUE)
  }

Now we can compare the originial function to the new one using the mtcars data.

# make data frame for ploting
mtcars2 <- within(mtcars, {
  vs <- factor(vs, labels = c("V-shaped", "Straight"))
  am <- factor(am, labels = c("Automatic", "Manual"))
  cyl  <- factor(cyl)
  gear <- factor(gear)
})

# make basic plot to use below
p1 <- ggplot(mtcars2) +
  geom_point(aes(x = wt, y = mpg, colour = gear)) 

# plot original with no theme
p1 + labs(title = "no theme")

# plot with the original theme_minimal()
p1 + theme_minimal() + labs(title = "original theme_minimal()")

Now plot the same thing but with our new function:

# plot with our new custom my_theme_minimal()
p1 + my_theme_minimal() + labs(title = "custom my_theme_minimal()")

Don’t be shy about modifying other people’s functions for your own use in your code (as long as you give credit where it due, of course). This is the benefit of open sources software and a big part of the reason R and similar languages like Python are so useful and popular.

Mini-hacks

Mini-hack 1: Write a function to define the standard error

Unlike mean(), R doesn’t have a built-in method for calculating the standard error of the mean for a vector, which is often information folks want to including in a table or graph. For this mini-hack, write a custom function that takes in a numeric vector and calculates 1 standard error of the mean for that vector. Test this function on a vector of random data.

Mini-hack 2: Write your own custom functions (single output)

Write two custom functions that do something that you think might be useful in your data! This can be as complex or simple as you want it to be, but it can only have a single output variable (e.g., a vector, matrix, or single value, but not a list of things). However, it can’t already exist in a package you know about anything you want it to be, so try and make it interesting and useful to you. Briefly describe them in the Quarto file or by commenting in the code, so I know what they are doing.

Mini-hack 3: Write your own custom function (multiple output)

Do the same as above, but have a function that has multiple outputs in a list(), like discussed above.