Machine learning is the process of building predictive models. At its core, machine learning is about building models that offer predictive power and can be used to understand data we have yet to collect. In scientific practice, we have all used machine learning before when running regression models! However, machine learning is a complex topic with a wide range of possibilities and applications.
We encounter applications for machine learning each day. Machine learning algorithms are used to make critical decisions in medical diagnosis. Media sites rely on machine learning to sift through millions of options to give you song or movie recommendations, and retailers use it to gain insight into their customers’ purchasing behavior.
This presentation aims to provide a basic understanding of both regression and classification modeling, as well as how to leverage the caret package to carry out these analyses.
There are two main types of machine learning algorithms: supervised and unsupervised.
Supervised learning models are those where the machine learning model we build is based on a known quantity. In this case, we already know the “correct answers,” and we train the algorithm to find patterns in our data that best predict that known quantity. We can then apply these models to never-before-seen data to make predictions. Examples include:
Classification Models: Making categorical predictions such as predicting whether a tumor is benign or malignant or whether an email is spam or not.
Regression Models: Making continuous predictions such as predicting changes in temperature or predicting someone’s weight.
Unsupervised learning models are those where the machine learning model derives patterns and information from the data on its own. In this case, there are no known “correct answers”; rather, the goal is to find the underlying structure or distribution in the data in order to learn more about it. Examples include:
Clustering Models: Finding hidden patterns of inherent groupings in data such as grouping customers by purchasing behavior.
Association Models: Finding rules that describe large portions of your data, such as “people who buy X also tend to buy Y”.
Machine learning requires that you build your model using a separate dataset from the one used to test the model’s accuracy. Since datasets can be hard to come by, data splitting is used to divide a single dataset into a training set and a testing set. You typically want more data in the training set than in the testing set, since the training data is what is used to build the model. A commonly used split is 70% training and 30% testing, but you can adjust these proportions.
In machine learning, there are two types of parameters: model parameters and hyperparameters. What’s the difference?
Model parameters are estimated from the data (e.g., the coefficients in a linear regression)
Hyperparameters are values that specify the settings of an ML algorithm and can be “tuned” by the researcher prior to training a model. Different algorithms have different hyperparameters. You don’t know the best values of the hyperparameters before training a model; you have to rely on rules of thumb and/or find the best values through trial and error. This is the tuning process.
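To make the distinction concrete, here is a small sketch (illustrative only, using the built-in mtcars data rather than the examples below): the coefficients of a linear model are model parameters estimated by lm(), whereas a random forest’s mtry value is a hyperparameter we choose ourselves and hand to the training function.
# Model parameters: lm() estimates the intercept and slope from the data
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit) # not chosen by us; estimated by the fitting procedure
# Hyperparameters: for a random forest, mtry (the number of predictors tried at each split) is set by the researcher, e.g., as a grid of candidate values to try
tune.grid <- expand.grid(mtry = c(2, 3, 4))
tune.grid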
Earlier we talked about splitting our dataset into training and testing sets in order to build a model that generalizes well to new incoming data. However, this process of training and testing is less than ideal: when we eventually evaluate our model against the remaining data in the test set, we only see the error for that exact grouping of test data. Using a single testing set to evaluate a model’s accuracy has limitations in practice because the testing set may not be representative of the dataset as a whole.
Cross-validation is a statistical technique for splitting the training set multiple times into training/testing sets. Each of these training/testing sets is evaluated for error, and the error across all of the sets is averaged. This provides a more accurate assessment of the model’s performance.
There are various cross-validation techniques available in R, but the ones we will cover are k-fold and leave-one-out cross-validation:
k-fold cross-validation: randomly splits the dataset into k chunks (aka, folds) of roughly equal size, and these chunks are split into training/testing sets. The error across all chunks is averaged. k can be any number between 2 and the number of observations in the full dataset, but it is most commonly a value between 3 and 10.
leave-one-out cross-validation: the case where k = n; this results in training sets containing n-1 observations each, and each test set containing a single observation.
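To see what these splits look like in practice, here is a small illustrative sketch using caret’s createFolds() function on the built-in mtcars data (not one of the datasets used below); it returns a list of held-out row indices, one element per fold.
library(caret)
set.seed(123)
folds <- createFolds(mtcars$mpg, k = 5) # a list of 5 sets of held-out row indices
sapply(folds, length) # each fold contains roughly n/5 observations
# Leave-one-out cross-validation is the special case where k equals the number of observations, so each fold holds out exactly one row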
A very important consideration in building predictive models is overfitting. The data in the training set will reflect true, underlying relationships among variables, but it will also contain an amount of error that is unique to the training set. Overfitting is the problem of fitting a model too closely to the training set, which can result in the model having poor predictive power when applied to a new dataset. During model training, you want to strike a balance between fitting a model with good accuracy (i.e., one that reduces error) and not accounting for so much of the training set’s idiosyncratic error that you overfit the model.
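The toy sketch below (simulated data, not one of the tutorial’s datasets) illustrates the problem: an overly flexible 10th-degree polynomial fits the training data better than a simple linear model, but it will typically predict the held-out data worse.
set.seed(1)
x <- runif(60, 0, 10)
y <- 2 * x + rnorm(60, sd = 3) # the true relationship is linear, plus noise
train <- data.frame(x = x[1:40], y = y[1:40])
test <- data.frame(x = x[41:60], y = y[41:60])
fit.simple <- lm(y ~ x, data = train)
fit.complex <- lm(y ~ poly(x, 10), data = train) # overly flexible model
rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, data))^2))
rmse(fit.simple, train); rmse(fit.simple, test) # similar error in both sets
rmse(fit.complex, train); rmse(fit.complex, test) # lower training error, but usually much higher testing error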
Algorithms: the set of steps (the procedure) used to learn a model from data
Models: the object produced by an algorithm that takes inputs (features) and returns a predicted output
Features: machine learning lingo for the variables (predictors) in your data
Linear regression, lm()
Logistic regression, glm()
Support vector machines, svm() or caret's "svmLinear" method
Random forests, randomForest()
Elastic Nets, glmnet()
And there are hundreds more… for a full list: names(getModelInfo())
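As a quick illustration (not part of the analyses below), you can browse or search caret’s model registry directly from the console:
library(caret)
length(names(getModelInfo())) # how many algorithms caret currently knows about
grep("svm", names(getModelInfo()), value = TRUE) # e.g., all of the SVM variants
getModelInfo("rf")$rf$parameters # the tunable hyperparameters for one algorithm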
Regression is used to perform machine learning with a continuous outcome variable. The dataset we will use for this example is called Prestige from the car package. This dataset contains various features across a variety of occupations, such as education, percentage of incumbents who are women, and the perceived prestige of the occupation.
library(caret) # load caret (modeling interface) and car (contains the Prestige dataset), if not already loaded
library(car)
head(Prestige)
str(Prestige)
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
# Convert the integer and factor variables to numeric
Prestige$income <- as.numeric(Prestige$income)
Prestige$census <- as.numeric(Prestige$census)
Prestige$type <- as.numeric(Prestige$type)
# Keep only the variables we will use in the regression models
Prestige <- subset(Prestige, select = c(education, income, women, prestige, census))
Before you perform model training, you should partition the original dataset into a training set and a testing set. Model training is performed only on the training set.
createDataPartition is used to split the original data into a training set and a testing set. The inputs to createDataPartition include y, times, p, and list:
y = the outcome variable
times = the number of times you want to split the data
p = the percentage of the data that goes into the training set
list = FALSE returns the results as a matrix of row numbers for the partition, which you can then use to index the original dataset and split it
# Randomly sample from the original dataset
set.seed(50) # set.seed is a random number generator; the value in parentheses is arbitrary, and a seed is only set so that we can reproduce these same results next time we run the analysis.
# Split the original dataset into a training set and a testing set
partition_data <- createDataPartition(Prestige$income, times = 1, p = .7, list = FALSE)
training.set <- Prestige[partition_data, ] # Training set
testing.set <- Prestige[-partition_data, ] # Testing set
Now that we have split the data into a training set and a testing set, we can move on to training a model using the training set. We will go through just a few of the different algorithms and cross-validation techniques you can use for model training.
caret (Classification And REgression Training) is an R package that consolidates many different machine learning algorithms into one easy-to-use interface. This allows us to test any model we want without having to load separate packages and learn a gazillion different syntax requirements each time we want to try a different type of model.
The train function is used for model training. Its key inputs are a model formula (outcome ~ predictors), data (the training set), method (the algorithm to use), trControl (the cross-validation settings specified via trainControl), tuneGrid (an optional grid of hyperparameter values), and preProc (optional pre-processing steps such as centering):
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10) # k-folds CV with k=10
train.control2 <- trainControl(method = "LOOCV") # leave-one-out CV
# Use the train function to perform model training
linear.model <- train(income ~. ,
data = training.set,
method = "lm",
trControl = train.control,
preProc = c("center"))
## change train.control to train.control2 to see the results using LOOCV instead
# Look at the results from model training
linear.model
## Linear Regression
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 67, 66, 66, 67, 67, 66, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2385.264 0.7565312 1644.348
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
linear.model$results
# Test the predictive ability of the model in the testing set
linear.predict <- predict(linear.model, testing.set) # Predict values in the testing set
postResample(linear.predict, testing.set$income) # the accuracy of the model
## RMSE Rsquared MAE
## 2603.8964002 0.6503883 1856.9520734
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10) # k-folds CV with k=10
train.control2 <- trainControl(method = "LOOCV") # leave-one-out CV
# The linear model did not have any hyperparameters that we could perform tuning on, but SVM does
# Model tuning
svmL.info <- getModelInfo("svmLinear") #getModelInfo can be used to inspect a specific ML algorithm
svmL.info$svmLinear$parameters #look at the algorithm parameters that can be modified
tune.grid <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100)) #expand.grid is the function that allows you to specify values that you want to feed into the model training function
# Model training
svmL.model <- train(income ~. ,
data = training.set,
method = "svmLinear",
trControl = train.control,
tuneGrid = tune.grid,
preProc = c("center"))
# Look at the results from model training
svmL.model
## Support Vector Machines with Linear Kernel
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 66, 67, 67, 66, 66, 67, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1e-03 3905.451 0.7661535 2690.535
## 1e-02 2585.026 0.8256985 1664.158
## 1e-01 2230.840 0.8290528 1422.342
## 1e+00 2249.864 0.8231544 1458.861
## 1e+01 2253.776 0.8218066 1465.532
## 1e+02 2253.326 0.8213564 1466.097
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 0.1.
svmL.model$results
# Testing predictive ability of model in testing set
svmL.predict <- predict(svmL.model, testing.set)
postResample(svmL.predict, testing.set$income)
## RMSE Rsquared MAE
## 2073.5726601 0.6624213 1260.9135906
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10) # k-folds CV with k=10
train.control2 <- trainControl(method = "LOOCV") # leave-one-out CV
# Model tuning
rf.info <- getModelInfo("rf")
rf.info$rf$parameters
tune.grid <- expand.grid(mtry = c(2, 3, 4))
# Model training
rf.model <- train(income ~. ,
data = training.set,
method = "rf",
trControl = train.control,
tuneGrid = tune.grid,
preProc = c("center"))
# Look at the results from model training
rf.model
## Random Forest
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 66, 68, 66, 67, 67, 66, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2673.107 0.7529511 1766.909
## 3 2645.555 0.7738396 1711.195
## 4 2607.609 0.7743115 1674.975
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
rf.model$results
# Testing predictive ability of model in testing set
rf.predict <- predict(rf.model, testing.set)
postResample(rf.predict, testing.set$income)
## RMSE Rsquared MAE
## 2211.812379 0.653975 1252.496006
RMSE_Training <- c(linear.model$results[1,2], svmL.model$results[3,2], rf.model$results[3,2]) # pull the RMSE for the final model selected in each case (row 3 = C of 0.1 for the SVM and mtry of 4 for the random forest)
Rsq_Training <- c(linear.model$results[1,3], svmL.model$results[3,3], rf.model$results[3,3]) # same rows, Rsquared column
RMSE_Testing <- c(postResample(linear.predict, testing.set$income)[1], postResample(svmL.predict, testing.set$income)[1], postResample(rf.predict, testing.set$income)[1])
Rsq_Testing <- c(postResample(linear.predict, testing.set$income)[2], postResample(svmL.predict, testing.set$income)[2], postResample(rf.predict, testing.set$income)[2])
model_names <- c("Linear Regression", "Support Vector Machine", "Random Forest")
difference_rmse <- as.numeric(RMSE_Testing) - as.numeric(RMSE_Training)
difference_rsq <- as.numeric(Rsq_Testing) - as.numeric(Rsq_Training)
data.frame(cbind(model_names, RMSE_Training, Rsq_Training, RMSE_Testing, Rsq_Testing, difference_rmse, difference_rsq))
## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,3 --> row.names NOT used
Classification is used to perform machine learning with a categorical outcome variable. The dataset we will use for this example is called iris and comes with base R. This dataset contains four features measured for three species of flowers (petal width, petal length, sepal width, sepal length).
As with our regression example above, we will try to classify the species of flower using an elastic net, a support vector machine, and a random forest model. We will use leave-one-out cross-validation on the training data, and the final model accuracies will be tested against our holdout sample.
First we will load our dataset:
# Load the data and examine its structure
iris_data <- data.frame(iris)
head(iris_data)
str(iris_data) # Everything is numeric except for our categorical variable, which is a factor. We are good to go.
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Now we will set the random number generator seed for reproducibility and partition our data into training and testing sets.
# Randomly sample from the original dataset
set.seed(28) # set.seed is a random number generator; the value in parentheses is arbitrary, and a seed is only set so that we can reproduce these same results next time we run the analysis.
# Split the original dataset into a training set and a testing set
# We are using species to partition so that we don't end up with an uneven amount of one species in either training or testing sets.
partition_data <- createDataPartition(iris_data$Species, times = 1, p = .7, list = FALSE)
# Assign sets
training.set <- iris_data[partition_data, ] # Training set
testing.set <- iris_data[-partition_data, ] # Testing set
# Sanity Check: Is data partitioned appropriately, do we have equal numbers of observations for our outcome variable?
nrow(training.set)
## [1] 105
summary(training.set$Species)
## setosa versicolor virginica
## 35 35 35
nrow(testing.set)
## [1] 45
summary(testing.set$Species)
## setosa versicolor virginica
## 15 15 15
Next, we will try out our three models.
Since the outcome variable we are predicting, Species, has more than two levels, we cannot use glm to run a logistic regression. Instead, we will use glmnet to run an elastic net algorithm that can handle an outcome with more than two categories.
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10, # k-folds CV with k=10
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary)# save predictions for ROC
train.control2 <- trainControl(method = "LOOCV",
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary) # leave-one-out CV, and save predictions for ROC
# Example Model Tuning for Elastic Net
#glmnet.info <- getModelInfo("glmnet")
#glmnet.info$glmnet$parameters
#tune.grid <- expand.grid(alpha = 0:1,
# lambda = seq(0.0001, 1, length = 100))
# Use the train function to perform model training
glmnet.model <- train(Species ~. ,
data = training.set,
method = "glmnet",
trControl = train.control2, # change this to train.control to try k-fold CV
#tuneGrid = tune.grid,
preProc = c("center"))
# Look at the results from model training and ROC Curves
glmnet.model
## glmnet
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
## Resampling results across tuning parameters:
##
## alpha lambda logLoss AUC prAUC Accuracy
## 0.10 0.000875267 0.08456169 0.9975510 0.9668227 0.9619048
## 0.10 0.008752670 0.16367642 0.9965986 0.9649725 0.9523810
## 0.10 0.087526699 0.38437169 0.9753741 0.9252943 0.9142857
## 0.55 0.000875267 0.07704621 0.9976871 0.9670182 0.9619048
## 0.55 0.008752670 0.14907364 0.9964626 0.9646381 0.9523810
## 0.55 0.087526699 0.41343962 0.9791837 0.9312623 0.9142857
## 1.00 0.000875267 0.07351674 0.9974150 0.9664971 0.9714286
## 1.00 0.008752670 0.12642274 0.9961905 0.9641325 0.9523810
## 1.00 0.087526699 0.42860156 0.9805442 0.9202896 0.9333333
## Kappa Mean_F1 Mean_Sensitivity Mean_Specificity
## 0.9428571 0.9618736 0.9619048 0.9809524
## 0.9285714 0.9523712 0.9523810 0.9761905
## 0.8714286 0.9141280 0.9142857 0.9571429
## 0.9428571 0.9618736 0.9619048 0.9809524
## 0.9285714 0.9523712 0.9523810 0.9761905
## 0.8714286 0.9141280 0.9142857 0.9571429
## 0.9571429 0.9714227 0.9714286 0.9857143
## 0.9285714 0.9523712 0.9523810 0.9761905
## 0.9000000 0.9333197 0.9333333 0.9666667
## Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall
## 0.9628720 0.9812092 0.9628720 0.9619048
## 0.9526144 0.9762537 0.9526144 0.9523810
## 0.9161184 0.9576774 0.9161184 0.9142857
## 0.9628720 0.9812092 0.9628720 0.9619048
## 0.9526144 0.9762537 0.9526144 0.9523810
## 0.9161184 0.9576774 0.9161184 0.9142857
## 0.9716776 0.9857794 0.9716776 0.9714286
## 0.9526144 0.9762537 0.9526144 0.9523810
## 0.9335512 0.9667279 0.9335512 0.9333333
## Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.3206349 0.9714286
## 0.3174603 0.9642857
## 0.3047619 0.9357143
## 0.3206349 0.9714286
## 0.3174603 0.9642857
## 0.3047619 0.9357143
## 0.3238095 0.9785714
## 0.3174603 0.9642857
## 0.3111111 0.9500000
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda
## = 0.000875267.
# Test the predictive ability of the model in the testing set
glmnet.predict <- predict(glmnet.model, testing.set) # Predict values in the testing set
postResample(glmnet.predict, testing.set$Species) # the accuracy of the model
## Accuracy Kappa
## 0.9777778 0.9666667
confusionMatrix(glmnet.predict, testing.set$Species) # Lets see the breakdown of how well our model worked
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 0
## virginica 0 1 15
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 1.0000
## Specificity 1.0000 1.0000 0.9667
## Pos Pred Value 1.0000 1.0000 0.9375
## Neg Pred Value 1.0000 0.9677 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3333
## Detection Prevalence 0.3333 0.3111 0.3556
## Balanced Accuracy 1.0000 0.9667 0.9833
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10,
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary) # k-folds CV with k=10
train.control2 <- trainControl(method = "LOOCV",
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary) # leave-one-out CV
## Model Tuning
# When we run the model w/out tuning we see that the C parameter for svmLinear is held constant at 1.
# I decided to try to tune my model by testing for a range of numbers around 1.
svmL.info <- getModelInfo("svmLinear") #getModelInfo can be used to inspect a specific ML algorithm
svmL.info$svmLinear$parameters #look at the algorithm parameters that can be modified
tune.grid <- expand.grid(C = c(0.05, 0.1, .5, 1, 1.5))
# Use the train function to perform model training
svm.model <- train(Species ~ .,
data = training.set,
method = 'svmLinear',
trControl = train.control2, # change this to train.control to try k-fold CV
tuneGrid = tune.grid, # inputs your hyperparameters designated above
preProc = c("center"))
# Look at the results from model training
svm.model
## Support Vector Machines with Linear Kernel
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
## Resampling results across tuning parameters:
##
## C logLoss AUC prAUC Accuracy Kappa Mean_F1
## 0.05 0.1965477 0.9882993 0.9480017 0.9428571 0.9142857 0.9428571
## 0.10 0.1503756 0.9948299 0.9615435 0.9523810 0.9285714 0.9523712
## 0.50 0.1168253 0.9986395 0.9687804 0.9619048 0.9428571 0.9618736
## 1.00 0.1254751 0.9961905 0.9644326 0.9523810 0.9285714 0.9523712
## 1.50 0.1245281 0.9951020 0.9623061 0.9428571 0.9142857 0.9428105
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
## 0.9428571 0.9714286 0.9428571
## 0.9523810 0.9761905 0.9526144
## 0.9619048 0.9809524 0.9628720
## 0.9523810 0.9761905 0.9526144
## 0.9428571 0.9714286 0.9437619
## Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
## 0.9714286 0.9428571 0.9428571 0.3142857
## 0.9762537 0.9526144 0.9523810 0.3174603
## 0.9812092 0.9628720 0.9619048 0.3206349
## 0.9762537 0.9526144 0.9523810 0.3174603
## 0.9716776 0.9437619 0.9428571 0.3142857
## Mean_Balanced_Accuracy
## 0.9571429
## 0.9642857
## 0.9714286
## 0.9642857
## 0.9571429
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.5.
# Test the predictive ability of the model in the testing set
svm.predict <- predict(svm.model, testing.set) # Predict values in the testing set
postResample(svm.predict, testing.set$Species) # the accuracy of the model
## Accuracy Kappa
## 0.9777778 0.9666667
confusionMatrix(svm.predict, testing.set$Species) # Let's see the breakdown of how well our model worked
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 0
## virginica 0 1 15
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 1.0000
## Specificity 1.0000 1.0000 0.9667
## Pos Pred Value 1.0000 1.0000 0.9375
## Neg Pred Value 1.0000 0.9677 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3333
## Detection Prevalence 0.3333 0.3111 0.3556
## Balanced Accuracy 1.0000 0.9667 0.9833
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10, # k-folds CV with k=10
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary)
train.control2 <- trainControl(method = "LOOCV", # leave-one-out CV
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary)
# Model tuning
rf.info <- getModelInfo("rf")
rf.info$rf$parameters
tune.grid <- expand.grid(mtry = c(1, 2, 3, 4)) # mtry = the number of features randomly sampled at each split
# Train the Model
rf.model <- train(Species ~ .,
data = training.set,
method = 'rf',
trControl = train.control2, # change this to train.control to try k-fold CV
tuneGrid = tune.grid,
preProc = c("center"))
# Look at the results from model training
rf.model
## Random Forest
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
## Resampling results across tuning parameters:
##
## mtry logLoss AUC prAUC Accuracy Kappa Mean_F1
## 1 0.1404589 0.9929932 0.8726380 0.9428571 0.9142857 0.9428571
## 2 0.1179043 0.9948299 0.6091865 0.9619048 0.9428571 0.9619048
## 3 0.1174733 0.9942857 0.4176810 0.9523810 0.9285714 0.9523712
## 4 0.1309199 0.9927891 0.3386815 0.9523810 0.9285714 0.9523712
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
## 0.9428571 0.9714286 0.9428571
## 0.9619048 0.9809524 0.9619048
## 0.9523810 0.9761905 0.9526144
## 0.9523810 0.9761905 0.9526144
## Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
## 0.9714286 0.9428571 0.9428571 0.3142857
## 0.9809524 0.9619048 0.9619048 0.3206349
## 0.9762537 0.9526144 0.9523810 0.3174603
## 0.9762537 0.9526144 0.9523810 0.3174603
## Mean_Balanced_Accuracy
## 0.9571429
## 0.9714286
## 0.9642857
## 0.9642857
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Test the predictive ability of the model in the testing set
rf.predict <- predict(rf.model, testing.set) # Predict values in the testing set
postResample(rf.predict, testing.set$Species) # the accuracy of the model
## Accuracy Kappa
## 0.9555556 0.9333333
confusionMatrix(rf.predict, testing.set$Species) # Let's see the breakdown of how well our model worked
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 1
## virginica 0 1 14
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9333
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9333
## Specificity 1.0000 0.9667 0.9667
## Pos Pred Value 1.0000 0.9333 0.9333
## Neg Pred Value 1.0000 0.9667 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3111
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9500 0.9500
# Gather Accuracies
accuracy_train <- c(glmnet.model$results[7,6], # final model was row 7 (alpha = 1, lambda = 0.000875); accuracy is column 6
                    svm.model$results[3,5], # final model was row 3 (C = 0.5); accuracy is column 5
                    rf.model$results[2,5]) # final model was row 2 (mtry = 2); accuracy is column 5
accuracy_test <- c(confusionMatrix(glmnet.predict, testing.set$Species)$overall[1],
confusionMatrix(svm.predict, testing.set$Species)$overall[1],
confusionMatrix(rf.predict, testing.set$Species)$overall[1])
model_names <- c("Elastic Net", "SVM", "Random Forest")
difference <- as.numeric(accuracy_test) - as.numeric(accuracy_train)
data.frame(cbind(model_names, accuracy_train, accuracy_test, difference))
caret also has the ability to impute missing data values using the preProcess() function. Many machine learning algorithms require complete data, so it may be necessary to impute missing values.
Rather than simply imputing the mean or median of the variable with missing data, caret can use more sophisticated modeling with all of your variables to more closely estimate what the missing values may have been, based on patterns in your data.
We will use the iris dataset as a quick example of one method you can use to impute missing data.
# Load the packages needed for this example: missForest provides prodNA(), magrittr provides the %>% pipe
library(missForest)
library(magrittr)
set.seed(12345)
# We will load the iris dataset, but without the species variable
iris_missing <- data.frame(iris[1:4]) %>%
  prodNA(noNA = 0.1) # prodNA() produces 10% missing-at-random data
head(iris_missing)
# Next we create an imputation model using preProcess() and our chosen imputation method, in this case bagged trees ("bagImpute") because we have multiple columns with missing data. If we had just one column with missing data we could use k-nearest neighbors ("knnImpute"), which is faster.
iris_missing_model = preProcess(iris_missing, "bagImpute")
# Lastly, we need to use predict() to actually predict the missing values using the model we just created
iris_missing_pred = predict(iris_missing_model, iris_missing)
head(iris_missing_pred)
# We can compare with the original iris dataset
head(iris[1:4])
Important Note: if you are imputing and some of your data is set up as factors, you will need to create dummy variables before imputing, using dummyVars(). This is because the preProcess() function in caret assumes that all of your data is numeric.
You then use preProcess() with your dummy-coded dataset to impute your continuous variable. Once you’ve imputed the continuous variable in the dummy-coded dataset, you can put that imputed variable back into the original dataset without dummy coding.
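Here is a minimal sketch of that workflow (illustrative only; it reuses the iris data and assumes the missing values are in a single continuous column, Sepal.Length):
library(caret)
set.seed(12345)
iris_fac <- iris # Species is a factor
iris_fac$Sepal.Length[sample(nrow(iris_fac), 15)] <- NA # introduce some missing values
# 1. Dummy code the factor so that everything passed to preProcess() is numeric
dummies <- dummyVars(~ ., data = iris_fac)
iris_dummy <- as.data.frame(predict(dummies, newdata = iris_fac, na.action = na.pass))
# 2. Impute on the dummy-coded dataset
impute_model <- preProcess(iris_dummy, method = "bagImpute")
iris_imputed <- predict(impute_model, iris_dummy)
# 3. Put the imputed continuous variable back into the original (non-dummy) dataset
iris_fac$Sepal.Length <- iris_imputed$Sepal.Length
summary(iris_fac$Sepal.Length) # no NAs remain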
See the caret pre-processing documentation for more details.
Reference: Burger, Introduction to Machine Learning with R (you can read this book with a free 10-day trial through Safari Books Online).