A descendant of the statistically oriented S language, which was born at Bell Labs, R has been developed over the years through open source contributions. Hence, it’s a bit of a mishmash with loose “rules” that don’t particularly fit any one programming approach. R uses the techniques of both object-oriented and functional programming languages. As articulated by Chambers (2014), in R:
• Everything that exists is an object.
• Everything that happens is a function call.
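As a small illustration of these two principles, the sketch below (variable names are ours, not from the source) calls an operator and the indexing bracket as ordinary functions, and checks that functions are themselves objects.

```r
# Operators are ordinary functions: the infix and prefix forms are equivalent.
sum_infix  <- 1 + 2
sum_prefix <- `+`(1, 2)
identical(sum_infix, sum_prefix)

# Indexing is also a function call.
v <- c(10, 20, 30)
second <- `[`(v, 2)
second

# Functions are objects: they have a class and can be stored under a new name.
class(sum)
add <- `+`
add(2, 3)
```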
As Tom Brady once said, “I think hygiene is so important.” It is, Tom, it really is. And that’s true not only for people, but for code as well. Because R evolved from so many disparate contributors, there are almost always many ways to realize a single outcome. This can make it hard to read code from others since there is so much wiggle room to take one’s own approach. Luckily, there are numerous resources available that provide a framework for styling your code to maximize readability. These include Google’s R Style Guide, a shorter style guide by Hadley Wickham, and a built-in RStudio linter. We’ll return to these ideas at the end of class.
It is important to note that the objects we work with in R are of different classes. Knowing the class of an object is important because objects behave differently based on their class.
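To make the point about class concrete, here is a minimal sketch (the example data are made up) in which the same function, summary(), behaves differently depending on the class of its input.

```r
ratings_chr <- c("low", "high", "low", "medium")
ratings_fct <- factor(ratings_chr)

class(ratings_chr)
class(ratings_fct)

# The same call gives different results for each class.
chr_summary <- summary(ratings_chr)  # length, class, and mode of the vector
fct_summary <- summary(ratings_fct)  # a count for each level
chr_summary
fct_summary
```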
This table from Advanced R summarizes the basic data structures:
 | Homogeneous data | Heterogeneous data |
---|---|---|
1-Dimensional | Atomic Vector | List |
2-Dimensional | Matrix | Data frame |
N-Dimensional | Array | |
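As a rough companion to the table, the following sketch builds one object of each kind with base R constructors. The object names and values are illustrative only.

```r
atomic_vector <- c(1.5, 2.5, 3.5)                          # 1-d, homogeneous
a_list        <- list(1.5, "text", TRUE)                   # 1-d, heterogeneous
a_matrix      <- matrix(1:6, nrow = 2, ncol = 3)           # 2-d, homogeneous
a_data_frame  <- data.frame(id = 1:2, name = c("a", "b"))  # 2-d, columns may differ in type
an_array      <- array(1:24, dim = c(2, 3, 4))             # n-d, homogeneous

dim(a_matrix)
dim(a_data_frame)
dim(an_array)

# Each element of the list keeps its own type.
sapply(a_list, typeof)
```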
A vector is a series of combined data elements (or “components”). There are two types of vectors: atomic vectors (homogeneous data) and lists (heterogeneous data).
All vectors have three common properties:
• Type, typeof(), the kind of data the vector holds
• Length, length(), the number of elements in the vector
• Attributes, attributes(), additional metadata
Atomic vectors are homogeneous, i.e. contain one type of data. The most commonly used types of atomic vectors are:
Atomic Vector Type | Example |
---|---|
Logical | booleans <- c(TRUE, FALSE, NA) |
Integer | integers <- c(1L, 2L, 3L) |
Double (== numeric) | doubles <- c(1, 2.5, 0.005) |
Character | characters <- c("rick", "morty") |
logical_vector <- c(TRUE, FALSE, TRUE)
integer_vector <- c(1L,2L,3L)
double_vector <- c(1,2,3)
character_vector <- c("this", "is", "a", "vector")
Now that we’ve created several vectors, let’s examine their type and length.
# Get type using typeof()
typeof(logical_vector)
## [1] "logical"
typeof(integer_vector)
## [1] "integer"
typeof(double_vector)
## [1] "double"
typeof(character_vector)
## [1] "character"
# Get length using length()
length(double_vector)
## [1] 3
length(character_vector)
## [1] 4
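Because atomic vectors hold a single type, combining different types with c() silently coerces them to the most flexible one. The sketch below, with made-up values, shows this, along with the third common property, attributes(), which the examples above did not exercise.

```r
# Mixing types: everything becomes character.
mixed_chr <- c(1, "a", TRUE)
typeof(mixed_chr)
mixed_chr

# Mixing logical and double: TRUE/FALSE become 1/0.
mixed_dbl <- c(1.5, TRUE, FALSE)
typeof(mixed_dbl)
mixed_dbl

# Names are stored as an attribute of the vector.
named_vector <- c(first = 1, second = 2)
attributes(named_vector)
```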
Here we will illustrate how to re-order the levels of a factor and how to coerce objects into different classes. We will use the gapminder dataset from the gapminder package as an example, which contains data on life expectancy, GDP per capita, and population by country.
library(gapminder) # load in the gapminder dataset
gapminder <- gapminder # save the dataset to an object
head(gapminder) # check out the first few rows
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779.
## 2 Afghanistan Asia 1957 30.3 9240934 821.
## 3 Afghanistan Asia 1962 32.0 10267083 853.
## 4 Afghanistan Asia 1967 34.0 11537966 836.
## 5 Afghanistan Asia 1972 36.1 13079460 740.
## 6 Afghanistan Asia 1977 38.4 14880372 786.
# Check the class of the variables
class(gapminder$continent) # "continent" is a factor
## [1] "factor"
class(gapminder$gdpPercap) # "gdpPercap" is numeric
## [1] "numeric"
# Check the order of the levels of "continent"
levels(gapminder$continent)
## [1] "Africa" "Americas" "Asia" "Europe" "Oceania"
# Plot GDP per capita by continent.
library(ggplot2)
(plot1 <- ggplot(data = gapminder, aes(x = continent, y = gdpPercap)) +
geom_bar(stat = "identity"))
# This graph would look better if we re-ordered the levels of "continent". Let's do it.
gapminder$continent <- factor(gapminder$continent, levels = c("Oceania", "Africa", "Americas", "Asia", "Europe")) # manually re-order the levels of "continent"
(plot2 <- ggplot(data = gapminder, aes(x = continent, y = gdpPercap)) +
geom_bar(stat = "identity")) # re-plot the data; the bars now follow the new level order
# What happens if we mess with the classes of the variables?
gapminder$continent <- as.numeric(gapminder$continent) # coerce "continent" into a numeric object
(plot3 <- ggplot(data = gapminder, aes(x = continent, y = gdpPercap)) +
geom_bar(stat = "identity"))
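The coercion above replaces the continent labels with their integer codes, which is why the last plot shows numbers on the x-axis. The following sketch uses small made-up factors rather than gapminder to show what as.numeric() does to a factor, how to recover the labels, and how reorder() can set the level order from the data instead of by hand.

```r
# A factor stores integer codes plus a set of labels (levels).
continent <- factor(c("Asia", "Europe", "Asia"),
                    levels = c("Europe", "Asia"))
as.numeric(continent)    # the codes, in level order
as.character(continent)  # the labels

# Digits stored as a factor: convert the labels, not the codes.
year_fct <- factor(c("1952", "2007", "1952"))
as.numeric(year_fct)
as.numeric(as.character(year_fct))

# reorder() orders the levels by a summary of another variable (here the median).
scores <- data.frame(group = factor(c("a", "b", "c", "a", "b", "c")),
                     value = c(5, 1, 3, 7, 2, 4))
scores$group <- reorder(scores$group, scores$value, FUN = median)
levels(scores$group)
```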
Vectors are constructed with the simple function c().
### Creating vectors
victor <- c(1,2,3,4,5) # Join elements into a vector and name it victor
victor
## [1] 1 2 3 4 5
(victor <- 1:5) # Different way to do the same thing. Notice that wrapping the assignment in parentheses automatically prints the result.
## [1] 1 2 3 4 5
victor_jr <- rep(1:5, times = 2) # victor had a son. Now there are two victors.
victor_jr
## [1] 1 2 3 4 5 1 2 3 4 5
Now that we’ve created our family of vectors, how do we navigate them? With the smörgåsbord of indexing options:
### Indexing Vectors by Position
victor[4] # get the 4th element of Victor
## [1] 4
victor[-4] # get all elements except the 4th
## [1] 1 2 3 5
victor[2:4] # get elements 2 through 4
## [1] 2 3 4
victor[-(2:4)] # get all elements except elements 2 through 4
## [1] 1 5
victor[c(1,5)] # get the 1st and 5th elements
## [1] 1 5
victor[-(2:4)] == victor[c(1,5)] # verify that the previous two steps are equivalent
## [1] TRUE TRUE
# Indexing Vectors by Value
victor[victor == 5] # get elements in Victor that are equal to 5
## [1] 5
victor[victor >= 3] # get elements in Victor that are greater than or equal to 3
## [1] 3 4 5
victor[victor %in% c(1,5,10,15)] # get elements in Victor that are ALSO in the set 1, 5, 10, 15
## [1] 1 5
reverse_victor <- rev(victor) # reverse the elements in victor
reverse_victor
## [1] 5 4 3 2 1
sort(reverse_victor) # sort reverse-victor (sorts smallest to largest by default). This gives us back regular victor.
## [1] 1 2 3 4 5
sort(reverse_victor) == victor # verify that this is true with Boolean logic
## [1] TRUE TRUE TRUE TRUE TRUE
sort(victor, decreasing = TRUE) # you can also sort from largest to smallest
## [1] 5 4 3 2 1
sort(victor, decreasing = TRUE) == reverse_victor # sorting regular victor from largest to smallest gives us reverse-victor
## [1] TRUE TRUE TRUE TRUE TRUE
# What if victor had repeating elements?
victor_rep <- c(1,1,1,3,5,5,7,10)
table(victor_rep) # gives the frequency of each distinct element in victor_rep
## victor_rep
## 1 3 5 7 10
## 3 1 2 1 1
unique(victor_rep) # gives all the unique elements in victor_rep
## [1] 1 3 5 7 10
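The `==` checks above return one TRUE or FALSE per element. When a single yes/no answer is wanted, all() or identical() can collapse the comparison. The sketch below also shows which(), which returns positions rather than values, and a combined logical condition. It reuses the victor vector from above.

```r
victor <- 1:5

# One answer instead of one per element.
all(sort(rev(victor)) == victor)
identical(victor[-(2:4)], victor[c(1, 5)])

# Positions of the elements that satisfy a condition.
which(victor >= 3)

# Conditions can be combined with & (and) or | (or).
victor[victor > 1 & victor < 5]
```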
The second type of vector is a list, which is heterogeneous, i.e. can contain more than one type of data. The following examples, from Hadley Wickham’s e-book “R for Data Science”, illustrate how lists work.
x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))
Here’s another illustration from Hadley Wickham on how to index lists.
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))
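Using the list a defined above, this sketch contrasts the three main ways to index a list: single brackets return a smaller list, while double brackets and $ extract the element itself.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

sub_list <- a["a"]     # single brackets: still a list, of length 1
element  <- a[["a"]]   # double brackets: the integer vector inside
class(sub_list)
class(element)

a$b                    # $ is shorthand for [[ ]] with a name
nested <- a[["d"]][[2]]  # step into the nested list
nested

names(a[c("a", "c")])  # single brackets can select several elements at once
```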
# General form
for (variable in sequence) {
  # do something
}
for (i in 1:10) {
  x <- i^2
  print(x)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
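The loop above prints each result and then discards it. A common variation, sketched here with a hypothetical squares vector, is to create the output vector first and fill it in by position, so the results are available afterwards.

```r
squares <- numeric(10)           # pre-allocate a vector of ten zeros
for (i in seq_along(squares)) {  # seq_along() gives 1, 2, ..., length(squares)
  squares[i] <- i^2
}
squares
```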
# General form:
while (condition) {
  # do something
}
i <- 1
while (i < 11) {
  print(i)
  i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
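A while loop is most useful when the number of iterations is not known in advance. As an illustration with made-up numbers, the sketch below halves a value until it drops to 1 or below and counts how many steps that took.

```r
value <- 100
steps <- 0
while (value > 1) {
  value <- value / 2   # halve the value on each pass
  steps <- steps + 1   # keep count of the passes
}
steps
value
```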
# General form:
if (condition) {
  # do something
} else {
  # do something different
}
if (i > 5) {
  print("This number is greater than 5.")
} else {
  print("This number is less than or equal to 5.")
}
## [1] "This number is greater than 5."
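Two extensions of the if/else pattern are worth sketching: else if for more than two branches, and ifelse() for testing every element of a vector at once (a plain if only accepts a single condition). The function name describe_number is hypothetical.

```r
describe_number <- function(n) {
  if (n > 5) {
    "greater than 5"
  } else if (n == 5) {
    "equal to 5"
  } else {
    "less than 5"
  }
}
describe_number(11)
describe_number(5)

# ifelse() applies the test to each element of a vector.
labels <- ifelse(c(3, 5, 8) > 5, "greater than 5", "5 or less")
labels
```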
# General form
function_name <- function(var) {
  # do something
  return(new_variable)
}
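Functions have:
• a name (function_name above)
• arguments (var above)
• a body, the code inside { }
• a return value (new_variable above)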
inverse <- function(x) {
  result <- 1 / x # use a name different from the function's own to avoid confusion
  return(result)
}
inverse(2)
## [1] 0.5
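Functions can also take default argument values and check their inputs. The sketch below uses a hypothetical scale_values() function. It relies on the fact that an R function returns the value of its last expression, so an explicit return() is optional.

```r
scale_values <- function(x, multiplier = 2) {
  # Stop early with an error if the inputs are not numeric.
  stopifnot(is.numeric(x), is.numeric(multiplier))
  x * multiplier  # the last expression is the return value
}

scale_values(1:3)                   # uses the default multiplier
scale_values(1:3, multiplier = 10)  # overrides the default
```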
R-bloggers, lapply() and sapply()
FUNCTION | INPUT | OUTPUT |
---|---|---|
apply | matrix | vector or matrix |
sapply | vector or list | vector or matrix |
lapply | vector or list | list |
The apply() functions are a type of functional, i.e. a function that takes a function as an input and returns a vector or a list as output. They can be used as an alternative to for loops.
X <- matrix(data = c(1,2,3, 1,2,3, 1,2,3), nrow = 3, ncol = 3, byrow = TRUE)
X
## [,1] [,2] [,3]
## [1,] 1 2 3
## [2,] 1 2 3
## [3,] 1 2 3
# Sum the values of each column with `apply()`
apply(X, 2, sum) # the second argument refers to a vector giving the subscripts which the function will be applied over, e.g. for a matrix 1 indicates rows, 2 indicates columns. The third argument specifies the function to be applied
## [1] 3 6 9
# Now sum the values of each row
apply(X, 1, sum)
## [1] 6 6 6
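The function passed to apply() does not have to be a built-in one. As a sketch, the code below applies an anonymous function that computes the range (max minus min) of each column and row of the same matrix, and notes that for plain sums base R also provides colSums() and rowSums().

```r
X <- matrix(c(1, 2, 3, 1, 2, 3, 1, 2, 3), nrow = 3, ncol = 3, byrow = TRUE)

# Anonymous function applied over columns (2) and rows (1).
col_ranges <- apply(X, 2, function(values) max(values) - min(values))
row_ranges <- apply(X, 1, function(values) max(values) - min(values))
col_ranges
row_ranges

# Dedicated helpers give the same sums as apply(X, 2, sum) and apply(X, 1, sum).
colSums(X)
rowSums(X)
```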
lapply() takes a function, applies it to each element in a list, and returns the results in the form of a list. Recall the for loop from the above example:
for (i in 1:10) {
  x <- i^2
  print(x)
}
We can accomplish the same result without using a for loop by instead using lapply(). Notice that lapply() returns a list.
lapply(1:10, function(x) x^2)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 4
##
## [[3]]
## [1] 9
##
## [[4]]
## [1] 16
##
## [[5]]
## [1] 25
##
## [[6]]
## [1] 36
##
## [[7]]
## [1] 49
##
## [[8]]
## [1] 64
##
## [[9]]
## [1] 81
##
## [[10]]
## [1] 100
If we want to get back an atomic vector instead of a list, we can use sapply().
sapply(1:10, function(x) x^2)
## [1] 1 4 9 16 25 36 49 64 81 100
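Two related options are sketched below. vapply() works like sapply() but requires you to state the type and length of each result, which makes the output predictable even for empty input. And for simple arithmetic like squaring, R's vectorized operators need no apply function at all.

```r
# vapply(): each result must be a single double, or R raises an error.
squares_vapply <- vapply(1:10, function(x) x^2, FUN.VALUE = numeric(1))
squares_vapply

# Arithmetic is already vectorized.
squares_vectorized <- (1:10)^2
squares_vectorized

# With empty input, sapply() falls back to a list; vapply() keeps the declared type.
class(sapply(integer(0), function(x) x^2))
class(vapply(integer(0), function(x) x^2, FUN.VALUE = numeric(1)))
```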
Import multi-subject data from different directories at the same time.
In this section, we will implement what we’ve covered so far to write a function that contains a for loop in order to collect data files from multiple subject directories, link the data to the subject’s ID, and put the data into a single data frame.
First, we will set the file paths. We can use the working_dir variable that we created in the set-up chunk as the base for our other file paths.
# Paths
data_dir <- paste0(working_dir, "data/") # Where our data live
Then, we set a variable to recognize our subjects and to subsequently create a list of the subject directories.
# Variables
sub_pattern <- "sub[0-9]{2}" # Pattern of the subject IDs
# Get subjects list
subjects <- list.files(data_dir, pattern = sub_pattern)
We could just use lapply() within a for loop to iterate over our subject files.
for (sub in subjects) {
  data_file <- list.files(paste0(data_dir, sub), pattern = "\\.csv$") # csv files in this subject's data directory (pattern is a regular expression)
  path_tmp <- paste0(data_dir, sub, "/", data_file)
  df_tmp <- lapply(path_tmp, read.csv)
  assign(paste0(sub, "_df"), df_tmp)
}
This creates one object per subject, each a list containing a data frame for every CSV file in that subject’s directory. For example, here’s sub01:
str(sub01_df)
## List of 1
## $ :'data.frame': 12 obs. of 3 variables:
## ..$ month : Factor w/ 12 levels "april","august",..: 5 4 8 1 9 7 6 2 12 11 ...
## ..$ tacos_eaten: int [1:12] 14 22 17 27 20 17 13 18 24 25 ...
## ..$ happiness : int [1:12] 2 4 3 5 5 3 3 4 4 3 ...
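An alternative to creating one object per subject with assign() is to keep everything in a single named list. The sketch below is self-contained and therefore builds its own tiny data set in a temporary directory. The directory name, file name, and values are all made up for illustration and are not the course data.

```r
# Hypothetical setup so the sketch runs anywhere: two subject folders in a temp dir.
demo_dir <- file.path(tempdir(), "taco_demo_list")
for (sub_id in c("sub01", "sub02")) {
  dir.create(file.path(demo_dir, sub_id), recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(month = c("january", "february"), tacos_eaten = c(14, 22)),
            file.path(demo_dir, sub_id, "tacos.csv"), row.names = FALSE)
}

subject_ids <- list.files(demo_dir, pattern = "sub[0-9]{2}")

# One list element per subject, named by subject ID.
subject_data <- lapply(subject_ids, function(sub_id) {
  csv_files <- list.files(file.path(demo_dir, sub_id),
                          pattern = "\\.csv$", full.names = TRUE)
  read.csv(csv_files[1])  # assumes one csv file per subject
})
names(subject_data) <- subject_ids

names(subject_data)
str(subject_data[["sub01"]])
```

A named list keeps the global environment tidy and can be passed straight to lapply() or sapply() for further processing.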
Or, we can create a single data frame for the subject data by importing the CSV file from their subject directory using a for loop inside a function. We could use lapply() again here, but let’s not, just to change things up.
# get data from each subject's directory and combine it into a single data frame
create_df <- function(subjects) {
  df <- data.frame() # make an empty data frame
  for (sub in subjects) {
    data_file <- list.files(paste0(data_dir, sub), pattern = "\\.csv$") # csv files in the subject's data directory
    path_tmp <- paste0(data_dir, sub, "/", data_file) # path to subject's data file
    df_tmp <- read.csv(path_tmp, sep = ",") # read in the data
    df_tmp$id <- rep(as.character(sub), nrow(df_tmp)) # make a column for the subject ID
    df <- rbind(df, df_tmp) # add the subject's data to the main data frame
  }
  assign("dat", df, envir = .GlobalEnv) # name the data frame 'dat' and make available in the global environment
}
create_df(subjects = subjects)
head(dat)
## month tacos_eaten happiness id
## 1 january 14 2 sub01
## 2 february 22 4 sub01
## 3 march 17 3 sub01
## 4 april 27 5 sub01
## 5 may 20 5 sub01
## 6 june 17 3 sub01
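For comparison, here is a sketch of the lapply() route that the text set aside, written so that the function returns the combined data frame instead of assigning it into the global environment. It is not the course's method. The helper read_subject(), the temporary directory, and the values are hypothetical and exist only to make the example runnable.

```r
# Hypothetical setup: two subject folders with one csv each, in a temp dir.
demo_dir <- file.path(tempdir(), "taco_demo_bind")
for (sub_id in c("sub01", "sub02")) {
  dir.create(file.path(demo_dir, sub_id), recursive = TRUE, showWarnings = FALSE)
  write.csv(data.frame(month = c("january", "february"), tacos_eaten = c(14, 22)),
            file.path(demo_dir, sub_id, "tacos.csv"), row.names = FALSE)
}

# Read one subject's csv and tag each row with the subject ID.
read_subject <- function(sub_id, base_dir) {
  csv_file <- list.files(file.path(base_dir, sub_id),
                         pattern = "\\.csv$", full.names = TRUE)[1]
  df_sub <- read.csv(csv_file)
  df_sub$id <- sub_id
  df_sub  # returned to the caller
}

subject_ids <- list.files(demo_dir, pattern = "sub[0-9]{2}")

# lapply() gives a list of data frames; do.call(rbind, ...) stacks them.
dat_demo <- do.call(rbind, lapply(subject_ids, read_subject, base_dir = demo_dir))
head(dat_demo)
```

Returning the result lets the caller choose the object's name, and binding once at the end avoids growing the data frame inside the loop.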
Let’s find the average number of tacos eaten each month and put that information in the dat data frame.
mean_tacos <- aggregate(tacos_eaten ~ month, data = dat, FUN = mean)
colnames(mean_tacos) <- c("month", "mean_tacos")
dat <- merge(dat, mean_tacos, by = "month", all.x = TRUE)
head(dat)
## month tacos_eaten happiness id mean_tacos
## 1 april 27 5 sub01 22
## 2 april 4 1 sub02 22
## 3 april 35 3 sub03 22
## 4 august 18 4 sub01 18
## 5 august 18 3 sub02 18
## 6 august 18 2 sub03 18
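The aggregate-then-merge approach above also re-sorts the rows by month. If you only need the group mean as a new column, base R's ave() can add it in one step while keeping the original row order. The sketch below uses a small made-up data frame rather than dat.

```r
dat_demo <- data.frame(
  month       = c("april", "april", "august", "august"),
  tacos_eaten = c(27, 4, 18, 18),
  id          = c("sub01", "sub02", "sub01", "sub02")
)

# ave() returns one value per row: the mean of that row's month group.
dat_demo$mean_tacos <- ave(dat_demo$tacos_eaten, dat_demo$month, FUN = mean)
dat_demo
```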
It can often be useful to create output directories within a subject’s existing directory where data created in a script can be saved. In such cases, functions like subset(), dir.create(), and write.csv() come in handy.
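As one possible way to combine these three functions, the sketch below writes each subject's rows to a file inside that subject's own output folder. The temporary directory, the folder name "output", the file naming scheme, and the data are assumptions made for illustration.

```r
# Hypothetical data and base directory.
demo_dir <- file.path(tempdir(), "taco_demo_output")
dat_demo <- data.frame(
  id          = c("sub01", "sub01", "sub02"),
  month       = c("january", "february", "january"),
  tacos_eaten = c(14, 22, 9)
)

for (subject_id in unique(dat_demo$id)) {
  # Create <base>/<subject>/output, including any missing parent folders.
  out_dir <- file.path(demo_dir, subject_id, "output")
  dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

  # Keep only this subject's rows and save them.
  subject_rows <- subset(dat_demo, id == subject_id)
  write.csv(subject_rows,
            file.path(out_dir, paste0(subject_id, "_tacos.csv")),
            row.names = FALSE)
}

list.files(demo_dir, recursive = TRUE)
```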
Take the for loop you created to simulate a normal distribution and turn it into a function that takes the following as inputs:
The function should also generate a histogram with a title. Run your function with any values you want.
\[ SEM = \sqrt{\sigma^2/n} \]
output in each subject’s directory
Enable the linter by going to RStudio > Preferences > Code > Diagnostics and checking all the boxes. Click “OK.” Once that’s done, go back through the code you’ve written for the minihack blocks, check for any issues, and correct them. You may need to save your file or start typing some code before the linter will activate. (Also, the linter is still a little buggy. If there’s an indicator that says a variable has not yet been assigned, it may be that it was assigned in a different code block. Use your judgment.)