According to Hadley Wickham, “R, at its heart, is a functional language. This means that it has certain technical properties, but more importantly that it lends itself to a style of problem solving centred on functions.”
When many of us think of R, we often think of all the great tools and packages available for it. However, the real power of scientific programming is the ability to have full control in customizing your own set of of tools for use in your research and workflow. Learning to take advantage of this can really streamline analyses that are specific to your projects.
Functions have to be defined using function() and set to an object to be used later. Within the function parentheses, you get to tell it how many augments to take and what they will be called within the function. Finally, your function needs needs to output its results using return().
Here is an example of a custom function that takes to character strings and makes them into a single character string using paste(), separated by a comma (not super useful, but it is illustrative). We will first define the function then test on some data.
# define function and agrumentscombine_two_strings <-function(string1, string2){ combo <-paste(string1, string2, sep =",")return(combo)}# test function on random datavar1 <-"apple"var2 <-"mango"# run function on datacombine_two_strings(var1, var2)
[1] "apple,mango"
Sometimes you want more flexibility in the number of arguments your function can take. If that is the case, you can use the ... argument that allows users to add additional arguments. These will still need to be called within the function you define.
# define function including ...combine_all_strings <-function(string1, ...){ combo <-paste(string1, ..., sep =",")return(combo)}
Now when we run this function, we can include any number of arguments as long as they work with the procedure you defined within the function.
# test function on random datavar1 <-"apple"var2 <-"mango"var3 <-"raspberry"# run function on datacombine_all_strings(var1, var2, var3, var1, var3)
[1] "apple,mango,raspberry,apple,raspberry"
In practice, I don’t find myself using the ... option very often when making custom functions for my data analyses. However, they can be very useful as you get deeper into tool building for your lab or package creation when others might be using your code for different things than you are.
Write your own mean()
Let’s do a more quantitative example that you might have seen before in the Quarto template on Canvas. This is a example of a function that calculates the mean of a vector that we ill use as a toy example of making a quantitative custom function. Obviously, R already has a function for calculating the mean of a vector mean(). However, this will make it easy to test our custom function against the real one to make sure it is working correctly.
Our function will just be taking the standard formula for a mean and making it into R code. To do this, we need to tell the function to sum a numeric vector then divide by the number of values in that vector.
# define custom function where x is a numeric vectormy_mean <-function(x){ s <-sum(x) n <-length(x) m <- s/nreturn(m)}
Now that we have defined our function, we can test it on some random data.
# make a vector of numbersrandom_numbers <-c(2,5,4,7,5)# use custom functioncustom_average <-my_mean(random_numbers)print(custom_average)
[1] 4.6
Finally, let’s test it against the real, built-in mean() function.
# compare to built-in mean() functionbuiltin_average <-mean(random_numbers)custom_average == builtin_average
[1] TRUE
As you can see, both come up with the same value so we are confident that our function worked! In the real world, you should stress test your functions against kind of data your are using them for and check for places where they may be doing unexpected things.
Including additional options
As part of defining your functions arguments, you can start to add in additional options that might be useful—and sometimes very critical—too.
For example, our my_mean() custom function cannot handle NA values.
# make vector with NA valuena_numbs <-c(2,5,4,7,NA)# run functionmy_mean(na_numbs)
[1] NA
To deal with this issue, we will make a new function called my_better_mean() that has some options for how to deal with NA values. However, that sometimes means adding some if and/or else statements within the function.
my_better_mean <-function(x, na.rm =FALSE){if(na.rm==FALSE){ s <-sum(x) n <-length(x) m <- s/n } if(na.rm==TRUE){ y <- x[!is.na(x)] # removes NA values from input s <-sum(y) n <-length(y) m <- s/n }return(m)}# mean removing NAmy_better_mean(na_numbs, na.rm =TRUE)
[1] 4.5
# mean without removing NAs will just return an NAmy_better_mean(na_numbs)
[1] NA
As you can see, the first version of the function will remove the NA values before calculating the mean. However, the second won’t error but it will return a NA value because of the missing data.
Your custom functions can be used in other parts of R
Once they are written, you can use your custom functions anywhere in your code where their output is the appropriate data class.
Here are some quick examples using our my_better_mean function on the mtcars data included in R in conjunction with the apply function and a dplyr summary.
# use custom function in applyapply(mtcars, 2, my_better_mean)
mpg cyl disp hp drat wt qsec
20.090625 6.187500 230.721875 146.687500 3.596563 3.217250 17.848750
vs am gear carb
0.437500 0.406250 3.687500 2.812500
# use custom function in a dplyr summary library(dplyr)
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
In principle, you can have the return() command return any kind of data type. However, you can only have one return() statement per function. So what do you do if you want to return multiple things at the same time? What if they are not all the same class or type of data? For this, it is list() to the rescue!
Below is an example of a making a custom function called qcor() that will take the input of two vectors and do a quick correlation of them. Here, we will ask it to return the correlation coefficient and a simple regression model (so we can see the p-value and other info) for the two input vectors.
# make random vectors to correlate set.seed(17)var1 <-rnorm(100)var2 <-rnorm(100)# define functionqcor <-function(x, y){ cor_out <-cor(x, y) reg_out <-summary(lm(y ~ x)) out_list <-list(cor_out, reg_out)return(out_list)}# run function on dataqcor(var1, var2)
[[1]]
[1] -0.04510737
[[2]]
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.3481 -0.9818 -0.1057 0.7808 3.0222
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09089 0.11397 -0.797 0.427
x -0.04840 0.10828 -0.447 0.656
Residual standard error: 1.14 on 98 degrees of freedom
Multiple R-squared: 0.002035, Adjusted R-squared: -0.008149
F-statistic: 0.1998 on 1 and 98 DF, p-value: 0.6559
Let’s do the same thing, but on the variables from the data frame mtcars that comes built-in with R.
# run function on data from data frameqcor(mtcars$mpg, mtcars$wt)
[[1]]
[1] -0.8676594
[[2]]
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.6516 -0.3490 -0.1381 0.3190 1.3684
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.04726 0.30869 19.590 < 2e-16 ***
x -0.14086 0.01474 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4945 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
You can see that the returned value is a list that contains both kinds of information. If you want, you can assign this output of this function to a list object. Then you index this object for other information any time you want just like you would any other list() object.
Below, we will extract the correlation coefficient by itself and then separately the p-value by itself. We do this by indexing the object in different ways. Note: the indexing can get quite complicated and depend heavily on what you have included in the list. Always test it to make sure you are getting the right info!
# assign function output to objectqcor_object <-qcor(var1, var2)# index the correlation coefficient list object onlyqcor_object[[1]]
[1] -0.04510737
# a more complex indexing of the only the p-value value from lm list objectqcor_object[[2]]$coefficients[8]
[1] 0.6558649
Incorporating functions from other packages
It is common to want to use functions from other packages within your function. This is called a dependency, and they can be useful in incorporating other tools into yours. However, because they depend on other packages, you have to change them as those packages change themselves (and can sometimes break your function). There are a few different ways to do this, but the most common way is to use the require() command within your function to load a package.
Below, we are going to take our previous function qcor() and rewrite it to also include a scatter plot from the package ggplot2. It will include this as the 3rd object in the list and can be indexed accordingly.
# define functionqcor <-function(x, y){require(ggplot2) cor_out <-cor(x, y) reg_out <-summary(lm(y ~ x)) temp_df <-data.frame(x,y) plot_out <-ggplot(temp_df, aes(x, y)) +geom_point() +geom_smooth(method ='lm') out_list <-list(cor_out, reg_out, plot_out)return(out_list)}# run the new qcor function that will include the plot qcor(var1, var2)
Loading required package: ggplot2
[[1]]
[1] -0.04510737
[[2]]
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-2.3481 -0.9818 -0.1057 0.7808 3.0222
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.09089 0.11397 -0.797 0.427
x -0.04840 0.10828 -0.447 0.656
Residual standard error: 1.14 on 98 degrees of freedom
Multiple R-squared: 0.002035, Adjusted R-squared: -0.008149
F-statistic: 0.1998 on 1 and 98 DF, p-value: 0.6559
[[3]]
`geom_smooth()` using formula = 'y ~ x'
# run it on the mtcars data qcor(mtcars$mpg, mtcars$wt)
[[1]]
[1] -0.8676594
[[2]]
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-0.6516 -0.3490 -0.1381 0.3190 1.3684
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.04726 0.30869 19.590 < 2e-16 ***
x -0.14086 0.01474 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4945 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
[[3]]
`geom_smooth()` using formula = 'y ~ x'
Modifying and tweaking functions from other packages
As we’ve been doing so far, you can write your own functions from scratch. However, it can also be helpful to just make some minor changes from other packages for your own use. This doesn’t work with functions from the base R package, but it works well with most other packages.
For example, here is the code for the theme_minimal(), a function in ggplot2 for having a stripped down theme to plot the data.
You just evaluate (i.e., press enter) the name of the function without the () and it will show how the function is defined. You can then copy and paste the output into a new function you can modify. (Note: do not include the <bytecode> or <environment> info when you copy the function.)
# print funciton by entering the name with out the ()theme_minimal
Now, we can just copy this code and change it however we want!
Let’s say we like everything about this, but we want the background to be black and the text to be white. We can just make a new function that copies the output of the original code, assigns it to a new function called my_theme_minimal. Then we change the code the code so that the options default to ink = "white" and paper = "black" instead of the reverse of that in the original code. That is the only change from the original function.
# assign new function to copied code, but change ink and papermy_theme_minimal <-function (base_size =11, base_family ="", header_family =NULL, base_line_size = base_size/22, base_rect_size = base_size/22, ink ="white", paper ="black", accent ="#3366FF") # this is the only line that is different {theme_bw(base_size = base_size, base_family = base_family, header_family = header_family, base_line_size = base_line_size, base_rect_size = base_rect_size, ink = ink, paper = paper, accent = accent) %+replace%theme(axis.ticks =element_blank(), axis.text.x.bottom =element_text(margin =margin(t =0.45* base_size)), axis.text.x.top =element_text(margin =margin(b =0.45* base_size)), axis.text.y.left =element_text(margin =margin(r =0.45* base_size)), axis.text.y.right =element_text(margin =margin(l =0.45* base_size)), legend.background =element_blank(), legend.key =element_blank(), panel.background =element_blank(), panel.border =element_blank(), strip.background =element_blank(), plot.background =element_rect(fill = paper, colour =NA), complete =TRUE) }
Now we can compare the originial function to the new one using the mtcars data.
# make data frame for plotingmtcars2 <-within(mtcars, { vs <-factor(vs, labels =c("V-shaped", "Straight")) am <-factor(am, labels =c("Automatic", "Manual")) cyl <-factor(cyl) gear <-factor(gear)})# make basic plot to use belowp1 <-ggplot(mtcars2) +geom_point(aes(x = wt, y = mpg, colour = gear)) # plot original with no themep1 +labs(title ="no theme")
# plot with the original theme_minimal()p1 +theme_minimal() +labs(title ="original theme_minimal()")
Now plot the same thing but with our new function:
# plot with our new custom my_theme_minimal()p1 +my_theme_minimal() +labs(title ="custom my_theme_minimal()")
Don’t be shy about modifying other people’s functions for your own use in your code (as long as you give credit where it due, of course). This is the benefit of open sources software and a big part of the reason R and similar languages like Python are so useful and popular.
Mini-hacks
Mini-hack 1: Write a function to define the standard error
Unlike mean(), R doesn’t have a built-in method for calculating the standard error of the mean for a vector, which is often information folks want to including in a table or graph. For this mini-hack, write a custom function that takes in a numeric vector and calculates 1 standard error of the mean for that vector. Test this function on a vector of random data.
Mini-hack 2: Write your own custom functions (single output)
Write two custom functions that do something that you think might be useful in your data! This can be as complex or simple as you want it to be, but it can only have a single output variable (e.g., a vector, matrix, or single value, but not a list of things). However, it can’t already exist in a package you know about anything you want it to be, so try and make it interesting and useful to you. Briefly describe them in the Quarto file or by commenting in the code, so I know what they are doing.
Mini-hack 3: Write your own custom function (multiple output)
Do the same as above, but have a function that has multiple outputs in a list(), like discussed above.