Background

Descended from S, a statistically oriented language born at Bell Labs, R has been developed over the years through open source contributions. Hence, it’s a bit of a mishmash with loose “rules” that don’t particularly fit any one programming approach. R uses the techniques of both object-oriented and functional programming languages. As articulated by Chambers (2014), in R:

• Everything that exists is an object.  
• Everything that happens is a function call.  
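
Both statements can be checked at the console. The sketch below is only an illustration (the variable names are made up for this example): even the `+` operator is a function that can be called by name, and functions are themselves objects that can be stored in variables.

```r
# Operators are ordinary functions: 1 + 2 is shorthand for calling `+`
sum_infix  <- 1 + 2
sum_prefix <- `+`(1, 2)
identical(sum_infix, sum_prefix)
## [1] TRUE

# Functions are objects too: they have a class and can be assigned to a name
class(`+`)
## [1] "function"
add <- `+`
add(1, 2)
## [1] 3
```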

Style

As Tom Brady once said, “I think hygiene is so important.” It is, Tom, it really is. And that’s true not only for people, but for code as well. Because R evolved from so many disparate contributors, there are almost always many ways to realize a single outcome. This can make it hard to read code from others, since there is so much wiggle room to take one’s own approach. Luckily, there are numerous resources available that provide a framework for styling your code to maximize readability. These include Google’s R Style Guide, a shorter style guide by Hadley Wickham, and a built-in RStudio linter. We’ll return to these ideas at the end of class.

Data Structures

It is important to note that the objects we work with in R are of different classes. Knowing the class of an object is important because objects behave differently based on their class.

This table from Advanced R summarizes the basic data structures:

                 Homogeneous data    Heterogeneous data
1-Dimensional    Atomic vector       List
2-Dimensional    Matrix              Data frame
N-Dimensional    Array

Vectors

A vector is a series of combined data elements (or “components”). There are two types of vectors: atomic vectors (homogeneous data) and lists (heterogeneous data).

All vectors have three common properties:

  • Type: typeof()
  • Length: length(), the number of elements in the vector
  • Attributes: attributes(), additional metadata
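
As a quick illustration of the three properties, here is a minimal sketch using a small named vector (`scores` is made-up example data, not part of the lesson):

```r
# A small named vector (hypothetical example data)
scores <- c(a = 90, b = 85, c = 77)

typeof(scores)      # storage type of the elements
## [1] "double"
length(scores)      # number of elements
## [1] 3
attributes(scores)  # metadata; here, just the element names
## $names
## [1] "a" "b" "c"
```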

Atomic vectors

Atomic vectors are homogeneous, i.e. contain one type of data. The most commonly used types of atomic vectors are:

Atomic vector type     Example
Logical                booleans <- c(TRUE, FALSE, NA)
Integer                integers <- c(1L, 2L, 3L)
Double (numeric)       doubles <- c(1, 2.5, 0.005)
Character              characters <- c("rick", "morty")

logical_vector <- c(TRUE, FALSE, TRUE)

integer_vector <- c(1L,2L,3L)

double_vector <- c(1,2,3)

character_vector <- c("this", "is", "a", "vector")

Now that we’ve created several vectors, let’s examine their type and length.

# Get type using typeof()
typeof(logical_vector)
## [1] "logical"
typeof(integer_vector)
## [1] "integer"
typeof(double_vector)
## [1] "double"
typeof(character_vector)
## [1] "character"
# Get length using length()

length(double_vector)
## [1] 3
length(character_vector)
## [1] 4
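
Because an atomic vector can hold only one type, combining different types in c() silently coerces everything to the most flexible type involved. The following sketch (with the made-up name `mixed`) shows what that looks like:

```r
# Mixing a double, a character string, and a logical in one vector
mixed <- c(1, "a", TRUE)
mixed
## [1] "1"    "a"    "TRUE"
typeof(mixed)  # everything was coerced to character
## [1] "character"

typeof(c(1L, 2.5))   # integer combined with double becomes double
## [1] "double"
typeof(c(TRUE, 2L))  # logical combined with integer becomes integer
## [1] "integer"
```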

Manipulating objects

Here we will illustrate how to re-order levels of factors and how to coerce objects into different classes. We will use the gapminder dataset from the gapminder package as an example, which contains data on life expectancy, GDP per capita, and population by country.

library(gapminder) # load in the gapminder dataset 

gapminder <- gapminder # save the dataset to an object
head(gapminder) # check out the first few rows 
## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
## 4 Afghanistan Asia       1967    34.0 11537966      836.
## 5 Afghanistan Asia       1972    36.1 13079460      740.
## 6 Afghanistan Asia       1977    38.4 14880372      786.
# Check the class of the variables 
class(gapminder$continent) # "continent" is a factor
## [1] "factor"
class(gapminder$gdpPercap) # "gdpPercap" is numeric 
## [1] "numeric"
# Check the order of the levels of "continent"
levels(gapminder$continent)
## [1] "Africa"   "Americas" "Asia"     "Europe"   "Oceania"
# Plot GDP per capita by continent. 
library(ggplot2)
(plot1 <- ggplot(data = gapminder, aes(x = continent, y = gdpPercap)) + 
  geom_bar(stat = "identity"))

# This graph would look better if we re-ordered the levels of "continent". Let's do it. 
gapminder$continent <- factor(gapminder$continent, levels = c("Oceania", "Africa", "Americas", "Asia", "Europe")) # manually re-order the levels of "continent"

(plot2 <- ggplot(data = gapminder, aes(x = continent, y = gdpPercap)) + 
  geom_bar(stat = "identity")) # re-plot the data; the levels were already re-ordered above

# What happens if we mess with the classes of the variables?
gapminder$continent <- as.numeric(gapminder$continent) # coerce "continent" into a numeric object

(plot3 <- ggplot(data = gapminder, aes(x = continent, y = gdpPercap)) + 
  geom_bar(stat = "identity"))
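
Coercing a factor with as.numeric() does not recover the labels. It returns the integer codes that index levels(), which is why the axis turns into numbers. A small sketch with a made-up factor (`continent_factor` is a hypothetical name, separate from the gapminder data):

```r
# A toy factor; levels are sorted alphabetically by default
continent_factor <- factor(c("Asia", "Europe", "Asia", "Africa"))
levels(continent_factor)
## [1] "Africa" "Asia"   "Europe"

as.numeric(continent_factor)    # integer codes, i.e. positions in levels()
## [1] 2 3 2 1
as.character(continent_factor)  # the original labels
## [1] "Asia"   "Europe" "Asia"   "Africa"
```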

Simple Functions

Vectors are constructed with the simple function c().

### Creating vectors

victor <- c(1,2,3,4,5) # Join elements into a vector and name it victor
victor
## [1] 1 2 3 4 5
(victor <- 1:5) # Different way to do the same thing. Notice that wrapping the assignment in parentheses automatically prints the result.
## [1] 1 2 3 4 5
victor_jr <- rep(1:5, times = 2) # victor had a son. Now there are two victors.
victor_jr
##  [1] 1 2 3 4 5 1 2 3 4 5

Vector Indexing

Now that we’ve created our family of vectors, how do we navigate them? With the smörgåsbord of indexing options:

### Indexing Vectors by Position

victor[4] # get the 4th element of Victor
## [1] 4
victor[-4] # get all elements except the 4th 
## [1] 1 2 3 5
victor[2:4] # get elements 2 through 4
## [1] 2 3 4
victor[-(2:4)] # get all elements except elements 2 through 4 
## [1] 1 5
victor[c(1,5)] # get the 1st and 5th elements 
## [1] 1 5
victor[-(2:4)] == victor[c(1,5)] # verify that the previous two steps are equivalent 
## [1] TRUE TRUE
# Indexing Vectors by Value

victor[victor == 5] # get elements in Victor that are equal to 5
## [1] 5
victor[victor >= 3] # get elements in Victor that are greater than or equal to 3
## [1] 3 4 5
victor[victor %in% c(1,5,10,15)] # get elements in Victor that are ALSO in the set 1, 5, 10, 15
## [1] 1 5

Vector Functions

reverse_victor <- rev(victor) # reverse the elements in victor
reverse_victor
## [1] 5 4 3 2 1
sort(reverse_victor) # sort reverse-victor (sorts smallest to largest by default). This gives us back regular victor.  
## [1] 1 2 3 4 5
sort(reverse_victor) == victor # verify that this is true with Boolean logic
## [1] TRUE TRUE TRUE TRUE TRUE
sort(victor, decreasing = TRUE) # you can also sort from largest to smallest
## [1] 5 4 3 2 1
sort(victor, decreasing = TRUE) == reverse_victor # sorting regular victor from largest to smallest gives us reverse_victor
## [1] TRUE TRUE TRUE TRUE TRUE
# What if victor had repeating elements?
victor_rep <- c(1,1,1,3,5,5,7,10)

table(victor_rep) # gives the frequency of each distinct element in victor_rep
## victor_rep
##  1  3  5  7 10 
##  3  1  2  1  1
unique(victor_rep) # gives all the unique elements in victor_rep
## [1]  1  3  5  7 10

Lists

The second type of vector is a list, which is heterogeneous, i.e. can contain more than one type of data. Here’s an illustration of how lists work, from Hadley Wickham’s e-book “R for Data Science”:

x1 <- list(c(1, 2), c(3, 4))
x2 <- list(list(1, 2), list(3, 4))
x3 <- list(1, list(2, list(3)))

[Figure: Lists.]

Here’s another illustration from Hadley Wickham on how to index lists.

a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

[Figure: Indexing Lists.]
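
The indexing rules can also be tried directly on the list `a` defined above. This is a minimal sketch of the three common operators: single brackets return a smaller list, while double brackets and `$` extract the element itself.

```r
a <- list(a = 1:3, b = "a string", c = pi, d = list(-1, -5))

str(a["a"])    # single brackets: still a list, with one element
## List of 1
##  $ a: int [1:3] 1 2 3
a[["a"]]       # double brackets: the element itself
## [1] 1 2 3
a$a            # $ is shorthand for [[ ]] with a name
## [1] 1 2 3
a[["d"]][[2]]  # drill into the nested list
## [1] -5
```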

For Loops

# General form

for (variable in sequence) {
    # Do something
}

for (i in (1:10)) {
  x <- i^2
  print(x)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
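
The loop above only prints each value. To keep the results, one common pattern is to create the output vector first and fill it in by position. A minimal sketch, assuming a made-up result vector called `squares`:

```r
squares <- numeric(10)           # pre-allocate a vector of ten zeros
for (i in seq_along(squares)) {  # i runs over 1, 2, ..., 10
  squares[i] <- i^2              # store each result at position i
}
squares
##  [1]   1   4   9  16  25  36  49  64  81 100
```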

While Loops

# General form:

while (condition) {
    # Do something
}

i <- 1

while (i < 11){
  print(i)
  i <- i + 1
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10

If/Then Statements

# General form:

if (condition) {
    # Do something
} else {
    # Do something different
}

# i is 11 after the while loop above
if (i > 5) {
  print("This number is greater than 5.")
} else {
  print("This number is less than or equal to 5.")
}
## [1] "This number is greater than 5."
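
An if/else statement has only two branches. When a third case matters, such as a number exactly equal to 5, branches can be chained with else if. A short sketch with the made-up variables `n` and `label`:

```r
n <- 5
if (n > 5) {
  label <- "greater than 5"
} else if (n == 5) {
  label <- "equal to 5"
} else {
  label <- "less than 5"
}
print(label)
## [1] "equal to 5"
```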

Functions

# General form

function_name <- function(var) {
    # Do something
    return(new_variable)
}

Functions have:

  1. A name
  2. Inputs, or arguments, within the function
  3. A body. This is the code between { }

inverse <- function(x) {
    result <- 1 / x # use a separate name so the function's own name is not reused inside it
    return(result)
}

inverse(2)
## [1] 0.5
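
Arguments can also be given default values, which are used when the caller leaves them out. The function below is a hypothetical example (`power` and `exponent` are names invented for this sketch):

```r
power <- function(x, exponent = 2) {  # exponent defaults to 2
    return(x^exponent)
}

power(3)                # uses the default exponent
## [1] 9
power(3, exponent = 3)  # overrides the default
## [1] 27
power(2, -1)            # same result as inverse(2)
## [1] 0.5
```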

Apply functions

R-bloggers, lapply() and sapply()

Function    Input             Output
apply       matrix            vector or matrix
sapply      vector or list    vector or matrix
lapply      vector or list    list

The apply() functions are a type of functional, i.e. a function that takes another function as input and returns a vector or a list as output. They can be used as an alternative to for loops.

X <- matrix(data = c(1,2,3, 1,2,3, 1,2,3), nrow = 3, ncol = 3, byrow = TRUE)
X
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    1    2    3
## [3,]    1    2    3
# Sum the values of each column with `apply()`.
# The second argument says which dimension the function is applied over:
# for a matrix, 1 indicates rows and 2 indicates columns.
# The third argument specifies the function to be applied.
apply(X, 2, sum)
## [1] 3 6 9
# Now sum the values of each row
apply(X, 1, sum)
## [1] 6 6 6
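
The function passed to apply() does not have to be a named one. The sketch below passes an anonymous function that computes the spread of each column, and also shows colSums() and rowSums(), base R shortcuts that give the same sums as the apply() calls above.

```r
X <- matrix(c(1, 2, 3, 1, 2, 3, 1, 2, 3), nrow = 3, ncol = 3, byrow = TRUE)

# Anonymous function: largest minus smallest value in each column
apply(X, 2, function(column) max(column) - min(column))
## [1] 0 0 0

colSums(X)  # same result as apply(X, 2, sum)
## [1] 3 6 9
rowSums(X)  # same result as apply(X, 1, sum)
## [1] 6 6 6
```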

lapply() takes a function, applies it to each element in a list, and returns the results in the form of a list. Recall the for loop from the above example:

for (i in (1:10)){
  x <- i^2
  print(x)
}

We can accomplish the same result without using a for loop by instead using lapply(). Notice that lapply() returns a list.

lapply(1:10, function(x) x^2)
## [[1]]
## [1] 1
## 
## [[2]]
## [1] 4
## 
## [[3]]
## [1] 9
## 
## [[4]]
## [1] 16
## 
## [[5]]
## [1] 25
## 
## [[6]]
## [1] 36
## 
## [[7]]
## [1] 49
## 
## [[8]]
## [1] 64
## 
## [[9]]
## [1] 81
## 
## [[10]]
## [1] 100

If we want to get back an atomic vector instead of a list, we can use sapply().

sapply(1:10, function(x) x^2)
##  [1]   1   4   9  16  25  36  49  64  81 100
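
Two related options are worth knowing. vapply() works like sapply() but makes you state the type and length of each result, so it fails loudly if a result is not what you expected. And for simple arithmetic like this, R's operators are already vectorized, so no loop or apply function is needed at all. A brief sketch:

```r
# vapply(): like sapply(), but each result must be a single number
vapply(1:10, function(x) x^2, FUN.VALUE = numeric(1))
##  [1]   1   4   9  16  25  36  49  64  81 100

# Vectorized arithmetic: the operator is applied to every element
(1:10)^2
##  [1]   1   4   9  16  25  36  49  64  81 100
```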

Getting data

Import/Export - Files, Directories, & Paths

Import multi-subject data

Import multi-subject data from different directories at the same time.

In this section, we will implement what we’ve covered so far to write a function that contains a for loop in order to collect data files from multiple subject directories, link the data to the subject’s ID, and put the data into a single data frame.

First, we will set the file paths. We can use the working_dir variable that we created in the set-up chunk as the base for our other file paths.

# Paths
data_dir = paste0(working_dir, "data/") # Where our data live

Then, we set a variable to recognize our subjects and to subsequently create a list of the subject directories.

# Variables
sub_pattern = "sub[0-9]{2}" # Pattern of the subject IDs

# Get subjects list
subjects = list.files(data_dir, pattern = sub_pattern)

We could just use lapply() within a for loop to iterate over our subject files.

for (sub in subjects) {
    data_file <- list.files(paste0(data_dir, sub), pattern = "\\.csv$") # list of csv files in the data dir for a subject (pattern is a regular expression)
    path_tmp <- paste0(data_dir, sub, "/", data_file)
    df_tmp <- lapply(path_tmp, read.csv)
    assign(paste0(sub, "_df"), df_tmp)
}

This creates one object per subject: a list containing the data frame read from each of that subject’s CSV files. For example, here’s sub01:

str(sub01_df)
## List of 1
##  $ :'data.frame':    12 obs. of  3 variables:
##   ..$ month      : Factor w/ 12 levels "april","august",..: 5 4 8 1 9 7 6 2 12 11 ...
##   ..$ tacos_eaten: int [1:12] 14 22 17 27 20 17 13 18 24 25 ...
##   ..$ happiness  : int [1:12] 2 4 3 5 5 3 3 4 4 3 ...

Or, we can create a single data frame for the subject data by importing the CSV file from their subject directory using a for loop inside a function. We could use lapply() again here, but let’s not, just to change things up.

Write the function

# Get data from each subject's directory and combine it into a single data frame

create_df <- function(subjects) {
    df <- data.frame() # make an empty data frame
    for (sub in subjects) {
        data_file <- list.files(paste0(data_dir, sub), pattern = "\\.csv$") # csv files in the subject's data dir (pattern is a regular expression)
        path_tmp <- paste0(data_dir, sub, "/", data_file) # path to subject's data file
        df_tmp <- read.csv(path_tmp, sep = ",") # read in the data
        df_tmp$id <- rep(as.character(sub), nrow(df_tmp)) # make a column for the subject ID
        df <- rbind(df, df_tmp) # add the subject's data to the main data frame
    }
    assign("dat", df, envir = .GlobalEnv) # name the data frame 'dat' and make available in the global environment
}

Use the function

create_df(subjects = subjects)
head(dat)
##      month tacos_eaten happiness    id
## 1  january          14         2 sub01
## 2 february          22         4 sub01
## 3    march          17         3 sub01
## 4    april          27         5 sub01
## 5      may          20         5 sub01
## 6     june          17         3 sub01
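
Growing a data frame inside a loop and writing it to the global environment works, but a function that returns its result is often easier to reuse. The following is a sketch of that alternative, not the approach used in this lesson. So that it runs anywhere, it first builds a throwaway directory with made-up data; `demo_dir`, `read_subject`, `subjects_demo`, and `dat_demo` are all hypothetical names, and it assumes one CSV file per subject.

```r
# Build a temporary directory tree with made-up data (two subjects)
demo_dir <- file.path(tempdir(), "taco_demo")
for (sub in c("sub01", "sub02")) {
  dir.create(file.path(demo_dir, sub), recursive = TRUE, showWarnings = FALSE)
  toy <- data.frame(month = c("january", "february"), tacos_eaten = c(3, 5))
  write.csv(toy, file.path(demo_dir, sub, "tacos.csv"), row.names = FALSE)
}

# Read one subject's CSV and tag the rows with the subject ID
read_subject <- function(sub, data_dir) {
  csv_path <- list.files(file.path(data_dir, sub), pattern = "\\.csv$",
                         full.names = TRUE)[1]  # assumes one CSV per subject
  df <- read.csv(csv_path)
  df$id <- sub
  df  # returned to the caller instead of assigned globally
}

# Apply the reader to every subject, then stack the results row-wise
subjects_demo <- list.files(demo_dir, pattern = "^sub[0-9]{2}$")
dat_demo <- do.call(rbind, lapply(subjects_demo, read_subject, data_dir = demo_dir))
nrow(dat_demo)
## [1] 4
```

Returning the data frame lets the caller choose the name it is stored under, and the same function can be pointed at a different directory without editing it.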

Do something with the data

Let’s find the average number of tacos eaten each month and put that information in the dat data frame.

mean_tacos <- aggregate(tacos_eaten ~ month, data = dat, FUN = mean)
colnames(mean_tacos) <- c("month", "mean_tacos")
dat <- merge(dat,mean_tacos, by = "month", all.x = TRUE)
head(dat)
##    month tacos_eaten happiness    id mean_tacos
## 1  april          27         5 sub01         22
## 2  april           4         1 sub02         22
## 3  april          35         3 sub03         22
## 4 august          18         4 sub01         18
## 5 august          18         3 sub02         18
## 6 august          18         2 sub03         18
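
Note that merge() sorted the rows by month. If the original row order matters, base R's ave() is one alternative: it computes a statistic within each group and returns a vector aligned with the original rows. A minimal sketch on made-up data (`toy` is a hypothetical data frame, not the lesson's `dat`):

```r
toy <- data.frame(
  month       = c("april", "april", "august", "august"),
  tacos_eaten = c(27, 4, 18, 18),
  id          = c("sub01", "sub02", "sub01", "sub02")
)

# Mean of tacos_eaten within each month, one value per original row
toy$mean_tacos <- ave(toy$tacos_eaten, toy$month, FUN = mean)
toy$mean_tacos
## [1] 15.5 15.5 18.0 18.0
```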

Export our updated data

It can often be useful to create output directories within a subject’s existing directory where data created in a script can be saved. In such cases, functions like subset(), dir.create(), and write.csv() come in handy.
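
To show how these three functions fit together, here is a minimal sketch for a single subject. It uses made-up data and writes to a temporary directory, so the names `toy`, `out_dir`, `sub01_rows`, and `out_file` are all assumptions for illustration only.

```r
# Made-up data for two subjects
toy <- data.frame(id = c("sub01", "sub01", "sub02"), tacos_eaten = c(14, 22, 9))

# Create an output folder (recursive = TRUE also creates missing parent folders)
out_dir <- file.path(tempdir(), "sub01", "output")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)

# Keep only the rows belonging to sub01
sub01_rows <- subset(toy, id == "sub01")

# Write those rows to a CSV file in the output folder
out_file <- file.path(out_dir, "sub01_tacos.csv")
write.csv(sub01_rows, out_file, row.names = FALSE)
file.exists(out_file)
## [1] TRUE
```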

Minihacks

Minihack 1: For loop

  1. Write a for loop simulating the normal distribution. Draw 5000 samples of one random observation at a time from a normal curve and plot them in a histogram.

Minihack 2: Functions

  1. Use the for loop you created to simulate a normal distribution and turn it into a function that takes the following as inputs:
  • number of observations in the normal distribution being sampled
  • number of distributions sampled from

The function should also generate a histogram with title. Run your function with any values you want.

  2. Write a function to calculate standard error of the mean for the “population”, “life expectancy” and “GDP Per Capita” variables in the gapminder dataset. Remember the formula for standard error:

\[ SEM = \sqrt{\sigma^2/n} \]

Minihack 3: Exporting data

  1. Using the taco data, write a function or for loop to:
    • Create an output folder called output in each subject’s directory
    • Subset the updated data frame by subject, and
    • Export each subject’s data frame back to their own subject directory as a new csv file with a different name

Minihack 4: Coding with style

  1. Go check out the style guide or, optionally, the more in-depth Google R Style Guide. Then, activate RStudio’s linter by going to RStudio > Preferences > Code > Diagnostics and checking all the boxes. Click “OK.” Once that’s done, go back through the code you’ve written for the minihack blocks, check for any issues, and correct them. You may need to save your file or start typing some code before the linter will activate. (Also, the linter is still a little buggy. If there’s an indicator that says a variable has not yet been assigned, it may be that it was assigned in a different code block. Use your judgment.)

~~

R Basics Cheat Sheet