Machine learning is the process of building predictive models. At its core, machine learning is about building models that offer predictive power and can be used to understand data we have yet to collect. In scientific practice, we have all used machine learning before when running regression models! However, machine learning is a complex topic with a wide range of possibilities and applications.
We encounter applications for machine learning each day. Machine learning algorithms are used to make critical decisions in medical diagnosis. Media sites rely on machine learning to sift through millions of options to give you song or movie recommendations, and retailers use it to gain insight into their customers’ purchasing behavior.
This presentation aims to provide a basic understanding of both regression and classification modeling, as well as how to leverage the caret package to carry out these analyses.
There are two main types of machine learning algorithms: supervised and unsupervised.
Supervised learning models are those where the machine learning model we build is based on a known quantity. In this case, we already know the “correct answers,” and we train the algorithm to find patterns in our data that best predict that known quantity. We can then apply these models to never-before-seen data to make predictions. Examples include:
Classification Models: Making categorical predictions such as predicting whether a tumor is benign or malignant or whether an email is spam or not.
Regression Models: Making continuous predictions such as predicting changes in temperature or predicting someone’s weight.
Unsupervised learning models are those where the machine learning model derives patterns and information from the data on its own. In this case, there are no known “correct answers”; rather, the goal is to find the underlying structure or distribution in the data in order to learn more about it. Examples include:
Clustering Models: Finding hidden patterns of inherent groupings in data such as grouping customers by purchasing behavior.
Association Models: Finding rules that describe large portions of your data, such as “people who buy X also tend to buy Y”.
Machine learning requires that you build your model using a separate dataset from the one used to test the model’s accuracy. Since datasets can be hard to come by, data splitting is used to divide a single dataset into a training set and a testing set. You typically want more data in the training set than in the testing set, since the training data is what is used to build the model. A commonly used split is 70% training and 30% testing, but you can adjust these proportions.
In machine learning, there are two types of parameters: model parameters and hyperparameters. What’s the difference?
Model parameters are estimated from the data (e.g., the coefficients in a linear regression)
Hyperparameters are values that specify the settings of an ML algorithm and can be “tuned” by the researcher prior to training a model. Different algorithms have different hyperparameters. You don’t know the best values of the hyperparameters before training a model; you have to rely on rules of thumb and/or find the best values through trial and error. This is the tuning process.
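To make the distinction concrete, here is a small sketch (illustrative only, using the built-in mtcars data rather than the examples below): the coefficients of a linear model are model parameters estimated by lm(), whereas a random forest’s mtry value is a hyperparameter we choose ourselves and hand to the training function.
# Model parameters: lm() estimates the intercept and slope from the data
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit) # not chosen by us; estimated by the fitting procedure
# Hyperparameters: for a random forest, mtry (the number of predictors tried at each split) is set by the researcher, e.g., as a grid of candidate values to try
tune.grid <- expand.grid(mtry = c(2, 3, 4))
tune.grid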
Earlier we talked about splitting our dataset into training and testing sets in order to build a model that generalizes well to new incoming data. However, this process of training and testing is less than ideal: when we eventually evaluate our model against the remaining data in the test set, we only see the error for that exact grouping of test data. Using a single testing set to evaluate a model’s accuracy has limitations in practice because the testing set may not be representative of the dataset as a whole.
Cross-validation is a statistical technique for splitting the training set multiple times into training/testing sets. Each of these training/testing sets is evaluated for error, and the error across all of the sets is averaged. This provides a more accurate assessment of the model’s performance.
There are various cross-validation techniques available in R, but the ones we will cover are k-fold and leave-one-out cross-validation:
k-fold cross-validation: randomly splits the dataset into k chunks (aka, folds) of roughly equal size, and these chunks are split into training/testing sets. The error across all chunks is averaged. k can be any number between 2 and the number of observations in the full dataset, but it is most commonly a value between 3 and 10.
leave-one-out cross-validation: the case where k = n; this results in training sets containing n-1 observations each, and each test set containing a single observation.
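To see what these splits look like in practice, here is a small illustrative sketch using caret’s createFolds() function on the built-in mtcars data (not one of the datasets used below); it returns a list of held-out row indices, one element per fold.
library(caret)
set.seed(123)
folds <- createFolds(mtcars$mpg, k = 5) # a list of 5 sets of held-out row indices
sapply(folds, length) # each fold contains roughly n/5 observations
# Leave-one-out cross-validation is the special case where k equals the number of observations, so each fold holds out exactly one row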
A very important consideration in building predictive models is overfitting. The data in the training set will reflect true, underlying relationships among variables, but it will also contain an amount of error that is unique to the training set. Overfitting is the problem of fitting a model too closely to the training set, which can result in the model having poor predictive power when applied to a new dataset. During model training, you want to strike a balance between fitting a model with good accuracy (i.e., one that reduces error) and not accounting for so much of the training set’s idiosyncratic error that you overfit the model.
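The toy sketch below (simulated data, not one of the tutorial’s datasets) illustrates the problem: an overly flexible 10th-degree polynomial fits the training data better than a simple linear model, but it will typically predict the held-out data worse.
set.seed(1)
x <- runif(60, 0, 10)
y <- 2 * x + rnorm(60, sd = 3) # the true relationship is linear, plus noise
train <- data.frame(x = x[1:40], y = y[1:40])
test <- data.frame(x = x[41:60], y = y[41:60])
fit.simple <- lm(y ~ x, data = train)
fit.complex <- lm(y ~ poly(x, 10), data = train) # overly flexible model
rmse <- function(fit, data) sqrt(mean((data$y - predict(fit, data))^2))
rmse(fit.simple, train); rmse(fit.simple, test) # similar error in both sets
rmse(fit.complex, train); rmse(fit.complex, test) # lower training error, but usually much higher testing error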
Algorithms: the set of steps (the procedure) used to learn a model from data
Models: the object produced by an algorithm that takes inputs (features) and returns a predicted output
Features: machine learning lingo for the variables (predictors) in your data
Linear regression, lm()
Logistic regression, glm()
Support vector machines, svm() or caret's "svmLinear" method
Random forests, randomForest()
Elastic Nets, glmnet()
And there are hundreds more… for a full list: names(getModelInfo())
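As a quick illustration (not part of the analyses below), you can browse or search caret’s model registry directly from the console:
library(caret)
length(names(getModelInfo())) # how many algorithms caret currently knows about
grep("svm", names(getModelInfo()), value = TRUE) # e.g., all of the SVM variants
getModelInfo("rf")$rf$parameters # the tunable hyperparameters for one algorithm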
Regression is used to perform machine learning with a continuous outcome variable. The dataset we will use for this example is called Prestige from the car package. This dataset contains various features across a variety of occupations, such as education, percentage of incumbents who are women, and the perceived prestige of the occupation.
library(caret) # load caret (modeling interface) and car (contains the Prestige dataset), if not already loaded
library(car)
head(Prestige)
str(Prestige)
## 'data.frame': 102 obs. of 6 variables:
## $ education: num 13.1 12.3 12.8 11.4 14.6 ...
## $ income : int 12351 25879 9271 8865 8403 11030 8258 14163 11377 11023 ...
## $ women : num 11.16 4.02 15.7 9.11 11.68 ...
## $ prestige : num 68.8 69.1 63.4 56.8 73.5 77.6 72.6 78.1 73.1 68.8 ...
## $ census : int 1113 1130 1171 1175 2111 2113 2133 2141 2143 2153 ...
## $ type : Factor w/ 3 levels "bc","prof","wc": 2 2 2 2 2 2 2 2 2 2 ...
# Convert the integer and factor variables to numeric
Prestige$income <- as.numeric(Prestige$income)
Prestige$census <- as.numeric(Prestige$census)
Prestige$type <- as.numeric(Prestige$type)
# Keep only the variables we will use in the regression models
Prestige <- subset(Prestige, select = c(education, income, women, prestige, census))
Before you perform model training, you should partition the original dataset into a training set and a testing set. Model training is performed only on the training set.
createDataPartition is used to split the original data into a training set and a testing set. The inputs to createDataPartition include y, times, p, and list:
y = the outcome variable
times = the number of times you want to split the data
p = the percentage of the data that goes into the training set
list = FALSE returns the results as a matrix of row numbers for the partition, which you can then use to index the original dataset and split it
# Randomly sample from the original dataset
set.seed(50) # set.seed is a random number generator; the value in parentheses is arbitrary, and a seed is only set so that we can reproduce these same results next time we run the analysis.
# Split the original dataset into a training set and a testing set
partition_data <- createDataPartition(Prestige$income, times = 1, p = .7, list = FALSE)
training.set <- Prestige[partition_data, ] # Training set
testing.set <- Prestige[-partition_data, ] # Testing set
Now that we have split the data into a training set and a testing set, we can move on to training a model using the training set. We will go through just a few of the different algorithms and cross-validation techniques you can use for model training.
caret (Classification And REgression Training) is an R package that consolidates many different machine learning algorithms into one easy-to-use interface. This allows us to test any model we want without having to load separate packages and learn a gazillion different syntax requirements each time we want to try a different type of model.
The train function is used for model training. Its key inputs are a model formula (outcome ~ predictors), data (the training set), method (the algorithm to use), trControl (the cross-validation settings specified via trainControl), tuneGrid (an optional grid of hyperparameter values), and preProc (optional pre-processing steps such as centering):
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10) # k-folds CV with k=10
train.control2 <- trainControl(method = "LOOCV") # leave-one-out CV
# Use the train function to perform model training
linear.model <- train(income ~. ,
data = training.set,
method = "lm",
trControl = train.control,
preProc = c("center"))
## change train.control to train.control2 to see the results using LOOCV instead
# Look at the results from model training
linear.model
## Linear Regression
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 67, 66, 66, 67, 67, 66, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 2385.264 0.7565312 1644.348
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
linear.model$results
# Test the predictive ability of the model in the testing set
linear.predict <- predict(linear.model, testing.set) # Predict values in the testing set
postResample(linear.predict, testing.set$income) # the accuracy of the model
## RMSE Rsquared MAE
## 2603.8964002 0.6503883 1856.9520734
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10) # k-folds CV with k=10
train.control2 <- trainControl(method = "LOOCV") # leave-one-out CV
# The linear model did not have any hyperparameters that we could perform tuning on, but SVM does
# Model tuning
svmL.info <- getModelInfo("svmLinear") #getModelInfo can be used to inspect a specific ML algorithm
svmL.info$svmLinear$parameters #look at the algorithm parameters that can be modified
tune.grid <- expand.grid(C = c(0.001, 0.01, 0.1, 1, 10, 100)) #expand.grid is the function that allows you to specify values that you want to feed into the model training function
# Model training
svmL.model <- train(income ~. ,
data = training.set,
method = "svmLinear",
trControl = train.control,
tuneGrid = tune.grid,
preProc = c("center"))
# Look at the results from model training
svmL.model
## Support Vector Machines with Linear Kernel
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 66, 67, 67, 66, 66, 67, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 1e-03 3905.451 0.7661535 2690.535
## 1e-02 2585.026 0.8256985 1664.158
## 1e-01 2230.840 0.8290528 1422.342
## 1e+00 2249.864 0.8231544 1458.861
## 1e+01 2253.776 0.8218066 1465.532
## 1e+02 2253.326 0.8213564 1466.097
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was C = 0.1.
svmL.model$results
# Testing predictive ability of model in testing set
svmL.predict <- predict(svmL.model, testing.set)
postResample(svmL.predict, testing.set$income)
## RMSE Rsquared MAE
## 2073.5726601 0.6624213 1260.9135906
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10) # k-folds CV with k=10
train.control2 <- trainControl(method = "LOOCV") # leave-one-out CV
# Model tuning
rf.info <- getModelInfo("rf")
rf.info$rf$parameters
tune.grid <- expand.grid(mtry = c(2, 3, 4))
# Model training
rf.model <- train(income ~. ,
data = training.set,
method = "rf",
trControl = train.control,
tuneGrid = tune.grid,
preProc = c("center"))
# Look at the results from model training
rf.model
## Random Forest
##
## 74 samples
## 4 predictor
##
## Pre-processing: centered (4)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 66, 68, 66, 67, 67, 66, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared MAE
## 2 2673.107 0.7529511 1766.909
## 3 2645.555 0.7738396 1711.195
## 4 2607.609 0.7743115 1674.975
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 4.
rf.model$results
# Testing predictive ability of model in testing set
rf.predict <- predict(rf.model, testing.set)
postResample(rf.predict, testing.set$income)
## RMSE Rsquared MAE
## 2211.812379 0.653975 1252.496006
RMSE_Training <- c(linear.model$results[1,2], svmL.model$results[3,2], rf.model$results[3,2]) # pull the RMSE for the final model selected in each case (row 3 = C of 0.1 for the SVM and mtry of 4 for the random forest)
Rsq_Training <- c(linear.model$results[1,3], svmL.model$results[3,3], rf.model$results[3,3]) # same rows, Rsquared column
RMSE_Testing <- c(postResample(linear.predict, testing.set$income)[1], postResample(svmL.predict, testing.set$income)[1], postResample(rf.predict, testing.set$income)[1])
Rsq_Testing <- c(postResample(linear.predict, testing.set$income)[2], postResample(svmL.predict, testing.set$income)[2], postResample(rf.predict, testing.set$income)[2])
model_names <- c("Linear Regression", "Support Vector Machine", "Random Forest")
difference_rmse <- as.numeric(RMSE_Testing) - as.numeric(RMSE_Training)
difference_rsq <- as.numeric(Rsq_Testing) - as.numeric(Rsq_Training)
data.frame(cbind(model_names, RMSE_Training, Rsq_Training, RMSE_Testing, Rsq_Testing, difference_rmse, difference_rsq))
## Warning in data.row.names(row.names, rowsi, i): some row.names duplicated:
## 2,3 --> row.names NOT used
Classification is used to perform machine learning with a categorical outcome variable. The dataset we will use for this example is called iris and comes with base R. This dataset contains four features measured for three species of flowers (petal width, petal length, sepal width, sepal length).
As with our regression example above, we will try to classify the species of flower using an elastic net, a support vector machine, and a random forest model. We will use leave-one-out cross-validation on the training data, and the final model accuracies will be tested against our holdout sample.
First we will load our dataset:
# Load the data and examine its structure
iris_data <- data.frame(iris)
head(iris_data)
str(iris_data) # Everything is numeric except for our categorical variable, which is a factor. We are good to go.
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Now we will set the random number generator seed for reproducibility and partition our data into training and testing sets.
# Randomly sample from the original dataset
set.seed(28) # set.seed is a random number generator; the value in parentheses is arbitrary, and a seed is only set so that we can reproduce these same results next time we run the analysis.
# Split the original dataset into a training set and a testing set
# We are using species to partition so that we don't end up with an uneven amount of one species in either training or testing sets.
partition_data <- createDataPartition(iris_data$Species, times = 1, p = .7, list = FALSE)
# Assign sets
training.set <- iris_data[partition_data, ] # Training set
testing.set <- iris_data[-partition_data, ] # Testing set
# Sanity Check: Is data partitioned appropriately, do we have equal numbers of observations for our outcome variable?
nrow(training.set)
## [1] 105
summary(training.set$Species)
## setosa versicolor virginica
## 35 35 35
nrow(testing.set)
## [1] 45
summary(testing.set$Species)
## setosa versicolor virginica
## 15 15 15
Next, we will try out our three models.
Since the outcome variable we are predicting, Species, has more than two levels, we cannot use glm to run a logistic regression. Instead, we will use glmnet to run an elastic net algorithm that can handle an outcome with more than two categories.
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10, # k-folds CV with k=10
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary)# save predictions for ROC
train.control2 <- trainControl(method = "LOOCV",
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary) # leave-one-out CV, and save predictions for ROC
# Example Model Tuning for Elastic Net
#glmnet.info <- getModelInfo("glmnet")
#glmnet.info$glmnet$parameters
#tune.grid <- expand.grid(alpha = 0:1,
# lambda = seq(0.0001, 1, length = 100))
# Use the train function to perform model training
glmnet.model <- train(Species ~. ,
data = training.set,
method = "glmnet",
trControl = train.control2, # change this to train.control to try k-fold CV
#tuneGrid = tune.grid,
preProc = c("center"))
# Look at the results from model training and ROC Curves
glmnet.model
## glmnet
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
## Resampling results across tuning parameters:
##
## alpha lambda logLoss AUC prAUC Accuracy
## 0.10 0.000875267 0.08456169 0.9975510 0.9668227 0.9619048
## 0.10 0.008752670 0.16367642 0.9965986 0.9649725 0.9523810
## 0.10 0.087526699 0.38437169 0.9753741 0.9252943 0.9142857
## 0.55 0.000875267 0.07704621 0.9976871 0.9670182 0.9619048
## 0.55 0.008752670 0.14907364 0.9964626 0.9646381 0.9523810
## 0.55 0.087526699 0.41343962 0.9791837 0.9312623 0.9142857
## 1.00 0.000875267 0.07351674 0.9974150 0.9664971 0.9714286
## 1.00 0.008752670 0.12642274 0.9961905 0.9641325 0.9523810
## 1.00 0.087526699 0.42860156 0.9805442 0.9202896 0.9333333
## Kappa Mean_F1 Mean_Sensitivity Mean_Specificity
## 0.9428571 0.9618736 0.9619048 0.9809524
## 0.9285714 0.9523712 0.9523810 0.9761905
## 0.8714286 0.9141280 0.9142857 0.9571429
## 0.9428571 0.9618736 0.9619048 0.9809524
## 0.9285714 0.9523712 0.9523810 0.9761905
## 0.8714286 0.9141280 0.9142857 0.9571429
## 0.9571429 0.9714227 0.9714286 0.9857143
## 0.9285714 0.9523712 0.9523810 0.9761905
## 0.9000000 0.9333197 0.9333333 0.9666667
## Mean_Pos_Pred_Value Mean_Neg_Pred_Value Mean_Precision Mean_Recall
## 0.9628720 0.9812092 0.9628720 0.9619048
## 0.9526144 0.9762537 0.9526144 0.9523810
## 0.9161184 0.9576774 0.9161184 0.9142857
## 0.9628720 0.9812092 0.9628720 0.9619048
## 0.9526144 0.9762537 0.9526144 0.9523810
## 0.9161184 0.9576774 0.9161184 0.9142857
## 0.9716776 0.9857794 0.9716776 0.9714286
## 0.9526144 0.9762537 0.9526144 0.9523810
## 0.9335512 0.9667279 0.9335512 0.9333333
## Mean_Detection_Rate Mean_Balanced_Accuracy
## 0.3206349 0.9714286
## 0.3174603 0.9642857
## 0.3047619 0.9357143
## 0.3206349 0.9714286
## 0.3174603 0.9642857
## 0.3047619 0.9357143
## 0.3238095 0.9785714
## 0.3174603 0.9642857
## 0.3111111 0.9500000
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were alpha = 1 and lambda
## = 0.000875267.
# Test the predictive ability of the model in the testing set
glmnet.predict <- predict(glmnet.model, testing.set) # Predict values in the testing set
postResample(glmnet.predict, testing.set$Species) # the accuracy of the model
## Accuracy Kappa
## 0.9777778 0.9666667
confusionMatrix(glmnet.predict, testing.set$Species) # Lets see the breakdown of how well our model worked
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 0
## virginica 0 1 15
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 1.0000
## Specificity 1.0000 1.0000 0.9667
## Pos Pred Value 1.0000 1.0000 0.9375
## Neg Pred Value 1.0000 0.9677 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3333
## Detection Prevalence 0.3333 0.3111 0.3556
## Balanced Accuracy 1.0000 0.9667 0.9833
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10,
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary) # k-folds CV with k=10
train.control2 <- trainControl(method = "LOOCV",
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary) # leave-one-out CV
## Model Tuning
# When we run the model w/out tuning we see that the C parameter for svmLinear is held constant at 1.
# I decided to try to tune my model by testing for a range of numbers around 1.
svmL.info <- getModelInfo("svmLinear") #getModelInfo can be used to inspect a specific ML algorithm
svmL.info$svmLinear$parameters #look at the algorithm parameters that can be modified
tune.grid <- expand.grid(C = c(0.05, 0.1, .5, 1, 1.5))
# Use the train function to perform model training
svm.model <- train(Species ~ .,
data = training.set,
method = 'svmLinear',
trControl = train.control2, # change this to train.control to try k-fold CV
tuneGrid = tune.grid, # inputs your hyperparameters designated above
preProc = c("center"))
# Look at the results from model training
svm.model
## Support Vector Machines with Linear Kernel
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
## Resampling results across tuning parameters:
##
## C logLoss AUC prAUC Accuracy Kappa Mean_F1
## 0.05 0.1965477 0.9882993 0.9480017 0.9428571 0.9142857 0.9428571
## 0.10 0.1503756 0.9948299 0.9615435 0.9523810 0.9285714 0.9523712
## 0.50 0.1168253 0.9986395 0.9687804 0.9619048 0.9428571 0.9618736
## 1.00 0.1254751 0.9961905 0.9644326 0.9523810 0.9285714 0.9523712
## 1.50 0.1245281 0.9951020 0.9623061 0.9428571 0.9142857 0.9428105
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
## 0.9428571 0.9714286 0.9428571
## 0.9523810 0.9761905 0.9526144
## 0.9619048 0.9809524 0.9628720
## 0.9523810 0.9761905 0.9526144
## 0.9428571 0.9714286 0.9437619
## Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
## 0.9714286 0.9428571 0.9428571 0.3142857
## 0.9762537 0.9526144 0.9523810 0.3174603
## 0.9812092 0.9628720 0.9619048 0.3206349
## 0.9762537 0.9526144 0.9523810 0.3174603
## 0.9716776 0.9437619 0.9428571 0.3142857
## Mean_Balanced_Accuracy
## 0.9571429
## 0.9642857
## 0.9714286
## 0.9642857
## 0.9571429
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was C = 0.5.
# Test the predictive ability of the model in the testing set
svm.predict <- predict(svm.model, testing.set) # Predict values in the testing set
postResample(svm.predict, testing.set$Species) # the accuracy of the model
## Accuracy Kappa
## 0.9777778 0.9666667
confusionMatrix(svm.predict, testing.set$Species) # Let's see the breakdown of how well our model worked
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 0
## virginica 0 1 15
##
## Overall Statistics
##
## Accuracy : 0.9778
## 95% CI : (0.8823, 0.9994)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9667
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 1.0000
## Specificity 1.0000 1.0000 0.9667
## Pos Pred Value 1.0000 1.0000 0.9375
## Neg Pred Value 1.0000 0.9677 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3333
## Detection Prevalence 0.3333 0.3111 0.3556
## Balanced Accuracy 1.0000 0.9667 0.9833
# Specify the cross-validation method(s)
train.control <- trainControl(method = "cv", number = 10, # k-folds CV with k=10
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary)
train.control2 <- trainControl(method = "LOOCV", # leave-one-out CV
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = multiClassSummary)
# Model tuning
rf.info <- getModelInfo("rf")
rf.info$rf$parameters
tune.grid <- expand.grid(mtry = c(1, 2, 3, 4)) # mtry = the number of features randomly sampled at each split
# Train the Model
rf.model <- train(Species ~ .,
data = training.set,
method = 'rf',
trControl = train.control2, # change this to train.control to try k-fold CV
tuneGrid = tune.grid,
preProc = c("center"))
# Look at the results from model training
rf.model
## Random Forest
##
## 105 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## Pre-processing: centered (4)
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 104, 104, 104, 104, 104, 104, ...
## Resampling results across tuning parameters:
##
## mtry logLoss AUC prAUC Accuracy Kappa Mean_F1
## 1 0.1404589 0.9929932 0.8726380 0.9428571 0.9142857 0.9428571
## 2 0.1179043 0.9948299 0.6091865 0.9619048 0.9428571 0.9619048
## 3 0.1174733 0.9942857 0.4176810 0.9523810 0.9285714 0.9523712
## 4 0.1309199 0.9927891 0.3386815 0.9523810 0.9285714 0.9523712
## Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value
## 0.9428571 0.9714286 0.9428571
## 0.9619048 0.9809524 0.9619048
## 0.9523810 0.9761905 0.9526144
## 0.9523810 0.9761905 0.9526144
## Mean_Neg_Pred_Value Mean_Precision Mean_Recall Mean_Detection_Rate
## 0.9714286 0.9428571 0.9428571 0.3142857
## 0.9809524 0.9619048 0.9619048 0.3206349
## 0.9762537 0.9526144 0.9523810 0.3174603
## 0.9762537 0.9526144 0.9523810 0.3174603
## Mean_Balanced_Accuracy
## 0.9571429
## 0.9714286
## 0.9642857
## 0.9642857
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
# Test the predictive ability of the model in the testing set
rf.predict <- predict(rf.model, testing.set) # Predict values in the testing set
postResample(rf.predict, testing.set$Species) # the accuracy of the model
## Accuracy Kappa
## 0.9555556 0.9333333
confusionMatrix(rf.predict, testing.set$Species) # Let's see the breakdown of how well our model worked
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 15 0 0
## versicolor 0 14 1
## virginica 0 1 14
##
## Overall Statistics
##
## Accuracy : 0.9556
## 95% CI : (0.8485, 0.9946)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9333
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9333 0.9333
## Specificity 1.0000 0.9667 0.9667
## Pos Pred Value 1.0000 0.9333 0.9333
## Neg Pred Value 1.0000 0.9667 0.9667
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3111 0.3111
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 0.9500 0.9500
# Gather Accuracies
accuracy_train <- c(glmnet.model$results[7,6], # final model was row 7 (alpha = 1, lambda = 0.000875); accuracy is column 6
                    svm.model$results[3,5], # final model was row 3 (C = 0.5); accuracy is column 5
                    rf.model$results[2,5]) # final model was row 2 (mtry = 2); accuracy is column 5
accuracy_test <- c(confusionMatrix(glmnet.predict, testing.set$Species)$overall[1],
confusionMatrix(svm.predict, testing.set$Species)$overall[1],
confusionMatrix(rf.predict, testing.set$Species)$overall[1])
model_names <- c("Elastic Net", "SVM", "Random Forest")
difference <- as.numeric(accuracy_test) - as.numeric(accuracy_train)
data.frame(cbind(model_names, accuracy_train, accuracy_test, difference))
caret also has the ability to impute missing data values using the preProcess() function. Many machine learning algorithms require complete data, so it may be necessary to impute missing values.
Rather than simply imputing the mean or median of the variable with missing data, caret can use more sophisticated modeling with all of your variables to more closely estimate what the missing values may have been, based on patterns in your data.
We will use the iris dataset as a quick example of one method you can use to impute missing data.
# Load the packages needed for this example: missForest provides prodNA(), magrittr provides the %>% pipe
library(missForest)
library(magrittr)
set.seed(12345)
# We will load the iris dataset, but without the species variable
iris_missing <- data.frame(iris[1:4]) %>%
  prodNA(noNA = 0.1) # prodNA() produces 10% missing-at-random data
head(iris_missing)
# Next we create an imputation model using preProcess() and our chosen imputation method, in this case bagged trees ("bagImpute") because we have multiple columns with missing data. If we had just one column with missing data we could use k-nearest neighbors ("knnImpute"), which is faster.
iris_missing_model = preProcess(iris_missing, "bagImpute")
# Lastly, we need to use predict() to actually predict the missing values using the model we just created
iris_missing_pred = predict(iris_missing_model, iris_missing)
head(iris_missing_pred)
# We can compare with the original iris dataset
head(iris[1:4])
Important Note: if you are imputing and some of your data is set up as factors, you will need to create dummy variables before imputing, using dummyVars(). This is because the preProcess() function in caret assumes that all of your data is numeric.
You then use preProcess() with your dummy-coded dataset to impute your continuous variable. Once you’ve imputed the continuous variable in the dummy-coded dataset, you can put that imputed variable back into the original dataset without dummy coding.
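Here is a minimal sketch of that workflow (illustrative only; it reuses the iris data and assumes the missing values are in a single continuous column, Sepal.Length):
library(caret)
set.seed(12345)
iris_fac <- iris # Species is a factor
iris_fac$Sepal.Length[sample(nrow(iris_fac), 15)] <- NA # introduce some missing values
# 1. Dummy code the factor so that everything passed to preProcess() is numeric
dummies <- dummyVars(~ ., data = iris_fac)
iris_dummy <- as.data.frame(predict(dummies, newdata = iris_fac, na.action = na.pass))
# 2. Impute on the dummy-coded dataset
impute_model <- preProcess(iris_dummy, method = "bagImpute")
iris_imputed <- predict(impute_model, iris_dummy)
# 3. Put the imputed continuous variable back into the original (non-dummy) dataset
iris_fac$Sepal.Length <- iris_imputed$Sepal.Length
summary(iris_fac$Sepal.Length) # no NAs remain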
See the caret pre-processing documentation for more details.
Reference: Burger, Introduction to Machine Learning with R (you can read this book with a free 10-day trial through Safari Books Online).