Machine learning is the process of building predictive models. We encounter applications for machine learning each day. Machine learning algorithms are used to make critical decisions in medical diagnosis, give song or movie recommendations, or provide companies with insight into customers’ purchasing behavior.
Rob’s definition of machine learning:
Machine learning is a framework for using different classes of algorithms to train models and test their predictive accuracy from data.
Framework: a general workflow for data handling and algorithm selection
Train models: building mathematical equations that can be applied to other data
Predictive accuracy: assessing how well your models perform on new data
This tutorial will:
Present a basic understanding of machine learning.
Provide examples of how to use the caret package to carry out regression and classification modeling.
Provide one example of unsupervised learning using k-means clustering.
Supervised vs. Unsupervised Learning
There are two main types of machine learning algorithms: supervised and unsupervised
Supervised learning models are those where the machine learning model we build is based off of a known quantity. In this case, we already know the “correct answers” and we train the algorithm to find patterns in our data in order to reach the best performance in predicting a known quanity. We can then apply these models to never-before-seen data to make predictions. Examples include:
Classification Models: Making categorical predictions such as predicting whether a tumor is benign or malignant or whether an email is spam or not.
Regression Models: Making continuous predictions such as predicting changes in temperature or predicting someone’s weight.
Unsupervised learning models are those where the machine learning model derives patterns and information from data while determining the known quantity itself. In this case, there are no known “correct answers” but rather the goal is to find the underlying structure or distribution in the data in order to learn more about the data. Examples include:
Clustering Models: Finding hidden patterns of inherent groupings in data such as grouping customers by purchasing behavior.
Association Models: Findings rules that describe large portions of your data, such as “people that buy X also tend to buy Y”.
Common Machine Learning Algorithms
Algorithms are the set of steps that are passed into a model for processing. The best algorithm choice depends on your question and the type of data you are working with.
Overfitting
Overfitting is the problem of fitting a model too well to the training set.This could result in the model having poor predictive power when applied to a new dataset. During model training, you want to meet a balance between fitting a model with good accuracy (i.e., one that reduces error), without trying to account for so much error in the training model that you overfit the model.
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::lift() masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.
Attaching package: 'randomForest'
The following object is masked from 'package:dplyr':
combine
The following object is masked from 'package:ggplot2':
margin
library(xgboost)library(glmnet)
Loading required package: Matrix
Attaching package: 'Matrix'
The following objects are masked from 'package:tidyr':
expand, pack, unpack
Loaded glmnet 4.1-10
library(kernlab)
Attaching package: 'kernlab'
The following object is masked from 'package:purrr':
cross
The following object is masked from 'package:ggplot2':
alpha
library(MLmetrics)
Attaching package: 'MLmetrics'
The following objects are masked from 'package:caret':
MAE, RMSE
The following object is masked from 'package:base':
Recall
library(nnet)library(ggridges)library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Dataset
The dataset we will be using comes from kaggle. Kaggle is an online platform which houses several large downloadable datasets, hosts competitions for prediction problems, and provides other resources to learn about machine learning.
The dataset is collated from Spotify’s API using a python script to extract popular songs and their associated audio and descriptive features. Descriptive features of a song include information about the song such as the artist name, album name and release date. Audio features include key, valence, danceability, and energy which are results of spotify’s audio analysis.
Rows: 1686 Columns: 29
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (15): playlist_genre, track_artist, track_href, uri, track_album_name, p...
dbl (14): energy, tempo, danceability, loudness, liveness, valence, time_sig...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Energy, tempo, danceability, loudness, liveness, valence, time signature, speechiness, track popularity, instrumentalness, mode, key, duration, and acousticness are all numeric columns.
Playlist genre, track artist, album name, playlist name, album release date, and playlist subgenre are all character columns.
Most of these classifications make sense, but there are a few we should revise before we move on.
#change time signature to a factor variable spotify_data$time_signature <-as.factor(spotify_data$time_signature)#change playlist genre to a factor spotify_data$playlist_genre <-as.factor(spotify_data$playlist_genre)#change mode to a factor variable spotify_data$mode <-as.factor(spotify_data$mode)#change key to a factor variable spotify_data$key <-as.factor(spotify_data$key)#change track artist to just the first artist in listspotify_data$track_artist <-sub(",.*", "", spotify_data$track_artist)#change album release date to two columns: year & month spotify_data <- spotify_data %>%mutate(release_year =as.numeric(substr(track_album_release_date, 1, 4)),release_month =as.numeric(substr(track_album_release_date, 6, 7)) )spotify_data <- spotify_data %>%select(-track_album_release_date)#clean out NAsspotify_data <-na.omit(spotify_data)#make valid r names for variables levels(spotify_data$playlist_genre) <-make.names(levels(spotify_data$playlist_genre))spotify_data$playlist_genre <-factor(make.names(spotify_data$playlist_genre))
There are also some values I know I won’t use, so I’m going to go ahead & clean those out now.
For an example of regression, we will be predicting the popularity (0-100) of a track based on the audio and descriptive features of that track.
We will work with three different examples of algorithms: linear regression, random forest, and elastic net.
Data Splitting
Machine learning requires that you build your model using a separate dataset than the one used to test the accuracy of your model. Data splitting is used to split a single dataset into a training set and a testing set. You typically want more data in the training set since this data is being used to build your model than in the testing set. Some proportions that are used typically are 70% training & 30% testing. You can either split the data randomly or stratified across important characteristics if necessary. More data in the training set will generally lead to a better model, but more data in the testing set will make the application of your model more generalizeable.
createDataPartition is used to split the original data into a training set and a testing set. The inputs into createDataPartition include:
y = the outcome variable
times = the number of times you want to split the data
p = the percentage of the data that goes into the training set
list = FALSE gives the results in a matrix with the row numbers of the partition that you can pass back into the training data to effectively split it
#make the split reproducibleset.seed(500)#split the spotify data into a training & testing set split_data <-createDataPartition(y = spotify_data_clean$track_popularity, times =1, p = .7, list =FALSE)training_set <- spotify_data_clean[split_data, ]testing_set <- spotify_data_clean[-split_data, ] #clean out predictors with too many unique values training_set <- training_set %>%select(-track_name, -track_album_name, -track_artist, -playlist_name, -playlist_subgenre)testing_set <- testing_set %>%select(-track_name, -track_album_name, -track_artist, -playlist_name, -playlist_subgenre)
Algorithm 1: Linear Regression
A linear regression algorithm is typically used when the relationships are mostly linear, independent, and the dataset is small. Usually it is used as the first type of model you might try.
Cross-Validation
Earlier we talked about splitting our our dataset into training and testing sets in order to build a model that generalizess well to new incoming data. However, this process of training and testing is less than ideal. When we eventually test our data against the remaining data in our test set we are only seeing what the error is for that exact grouping of the test data. Using a single testing set to evaluate the accuracy of one’s models can have limitations in practice because the testing set may not be representative of the dataset as a whole.
Cross-validation is a statistical technique for splitting the training set multiple times into training/testing sets. Each of these training/testing sets is evaluated for error, and the error across all of the sets is averaged. This provides a more accurate assessment of the model’s performance.
There are various cross-validation techniques available in R, but the ones we will cover are k-fold and leave-one-one cross-validation:
k-fold cross-validation: randomly splits the dataset into k chunks (aka, folds) of roughly equal size, and these chunks are split into training/testing sets. The error across all chunks is averaged. k can be any number between 2 and the number of observations in the full dataset, but it is most commonly a value between 3 and 10.
leave-one-out cross-validation: the case where k=n; this results in the training sets containing n-1 observations each, and each test set containing a single observation
#specify which cross-validation method we're using, we'll start with leave one out train_cv_leave <-trainControl(method ="LOOCV")
Training
The train function is used for model training. It uses the following inputs:
y = the outcome variable; y ~. means predict y from all other variables in the dataset
method = the machine learning algorithm
trControl = the cross-validation method
tuneGrid = a data frame of the hyperparameters that you want to be evaluated during model training
preProc = any pre-processing adjustments that you want done on the predictor data
A random forest algorithm creates a collection of decision trees to make predictions. It repeatedly samples subsets of the training data and randomly selects a subset of predictors to consider at each split in the trees. In regression, the final prediction is the average prediction across all trees. Random forests are typically used for data with nonlinear relationships and interactions.
Cross-Validation
#this time let's use the k-fold cross-validation method, k = 5train_cv_kfold <-trainControl(method ="cv", number =5)
Tuning
In machine learning, we work with something we call hyperparameters. Hyperparameters are values that specify the settings of an algorithm that can be “tuned” by the researcher prior to training a model. Different algorithms have different hyperparameters. You have to rely on rules of thumb and/or try to find the best values through trial and error.
The linear model did not have any hyperparameters to tune, but random forest does. Random forest models have the mtry parameter, which means the number of predictors chosen randomly at each split.
Random Forest
1126 samples
16 predictor
No pre-processing
Resampling: Cross-Validated (5 fold)
Summary of sample sizes: 902, 901, 900, 901, 900
Resampling results across tuning parameters:
mtry RMSE Rsquared MAE
4 4.977366 0.3949759 4.072415
5 4.904692 0.4003765 3.998827
6 4.864938 0.4017046 3.941886
7 4.843565 0.4003611 3.922037
8 4.841753 0.3957096 3.913895
RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 8.
An elastic net algorithm is linear regression that combines penalties from Lasso and Ridge regularization. It is particularly useful when many of the predictors are highly correlated.
For an example of classification, we will be predicting the genre of a particular track based on the audio and descriptive features of that track.
We will work with three different examples of algorithms: support vector machine, random forest, and multinomial logistic regression.
Pre-Processing
There are a few genres in our data that describe a small amount of tracks. With limited data for these genres, it will be hard to predict, so we’ll want to remove these less common genres.
afrobeats ambient arabic blues brazilian classical country
20 61 50 41 14 10 3
electronic folk gaming hip.hop indian indie j.pop
145 32 100 227 9 4 11
jazz k.pop korean latin lofi metal pop
1 10 8 184 2 30 331
punk r.b reggae rock soul turkish world
46 48 5 200 2 7 4
rare_genres <-names(genre_counts[genre_counts <100])#remove rare genres from dataset spotify_data_class <- spotify_data_clean %>%filter(!(playlist_genre %in% rare_genres))#re do levels for factor variables spotify_data_class$playlist_genre <-factor(spotify_data_class$playlist_genre)
Data Splitting
#make the split reproducibleset.seed(433)#split the spotify data into a training & testing set split_data <-createDataPartition(y = spotify_data_class$track_popularity, times =1, p = .7, list =FALSE)training_set <- spotify_data_class[split_data, ]testing_set <- spotify_data_class[-split_data, ] #clean out predictors with too many unique values training_set <- training_set %>%select(-track_name, -track_album_name, -track_artist, -playlist_name, -playlist_subgenre)testing_set <- testing_set %>%select(-track_name, -track_album_name, -track_artist, -playlist_name, -playlist_subgenre)
Algorithm 1: Support Vector Machine (SVM)
Support vector machine works by finding the boundary that separates different groups of data. The boundary can be linear or non-linear, but for today’s example we will use a linear svm.
Cross-Validation
#let's use the k-fold cross-validation method, k = 3train_cv_kfold <-trainControl(method ="cv", number =2, classProbs =TRUE, summaryFunction = multiClassSummary, savePredictions =TRUE)
Confusion Matrix and Statistics
Reference
Prediction electronic gaming hip.hop latin pop rock
electronic 14 5 2 0 8 1
gaming 2 4 2 1 6 1
hip.hop 5 8 39 15 5 0
latin 2 5 5 34 7 1
pop 20 11 9 7 66 9
rock 1 0 0 0 14 45
Overall Statistics
Accuracy : 0.5706
95% CI : (0.5172, 0.6228)
No Information Rate : 0.2994
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.4616
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: electronic Class: gaming Class: hip.hop
Sensitivity 0.31818 0.12121 0.6842
Specificity 0.94839 0.96262 0.8889
Pos Pred Value 0.46667 0.25000 0.5417
Neg Pred Value 0.90741 0.91420 0.9362
Prevalence 0.12429 0.09322 0.1610
Detection Rate 0.03955 0.01130 0.1102
Detection Prevalence 0.08475 0.04520 0.2034
Balanced Accuracy 0.63328 0.54191 0.7865
Class: latin Class: pop Class: rock
Sensitivity 0.59649 0.6226 0.7895
Specificity 0.93266 0.7742 0.9495
Pos Pred Value 0.62963 0.5410 0.7500
Neg Pred Value 0.92333 0.8276 0.9592
Prevalence 0.16102 0.2994 0.1610
Detection Rate 0.09605 0.1864 0.1271
Detection Prevalence 0.15254 0.3446 0.1695
Balanced Accuracy 0.76458 0.6984 0.8695
Algorithm 3: Multinomial Logistic Regression
Multinomial logistic regression can be used when the outcome variable is categorical and is more than two categories. It works well for unordered outcome variables and is easier to interpret than other algorithms.
# weights: 180 (145 variable)
initial value 747.163699
iter 10 value 420.225246
iter 20 value 381.322745
iter 30 value 364.501676
iter 40 value 359.570515
iter 50 value 358.299096
iter 60 value 358.116326
iter 70 value 358.075127
iter 80 value 358.062465
iter 90 value 358.054325
iter 100 value 358.051411
final value 358.051411
stopped after 100 iterations
# weights: 180 (145 variable)
initial value 747.163699
iter 10 value 422.367854
iter 20 value 386.337410
iter 30 value 371.360767
iter 40 value 368.262388
iter 50 value 367.619229
iter 60 value 367.564703
iter 70 value 367.556201
iter 80 value 367.555296
iter 90 value 367.555194
iter 90 value 367.555190
iter 90 value 367.555190
final value 367.555190
converged
# weights: 180 (145 variable)
initial value 747.163699
iter 10 value 437.687690
iter 20 value 412.965570
iter 30 value 407.696436
iter 40 value 407.183164
iter 50 value 407.137004
iter 60 value 407.134076
final value 407.134026
converged
# weights: 180 (145 variable)
initial value 745.371939
iter 10 value 414.945333
iter 20 value 376.306877
iter 30 value 356.708913
iter 40 value 351.935670
iter 50 value 350.693568
iter 60 value 350.522278
iter 70 value 350.492181
iter 80 value 350.483368
iter 90 value 350.477701
iter 100 value 350.473990
final value 350.473990
stopped after 100 iterations
# weights: 180 (145 variable)
initial value 745.371939
iter 10 value 416.562231
iter 20 value 380.750132
iter 30 value 363.914362
iter 40 value 360.852452
iter 50 value 360.252084
iter 60 value 360.205828
iter 70 value 360.194303
iter 80 value 360.192187
iter 90 value 360.191892
iter 100 value 360.191828
final value 360.191828
stopped after 100 iterations
# weights: 180 (145 variable)
initial value 745.371939
iter 10 value 429.456149
iter 20 value 407.828144
iter 30 value 401.213703
iter 40 value 400.763399
iter 50 value 400.707168
iter 60 value 400.704210
final value 400.704125
converged
# weights: 180 (145 variable)
initial value 1492.535638
iter 10 value 863.665126
iter 20 value 848.647817
iter 30 value 842.940704
iter 40 value 839.847013
iter 50 value 838.290141
iter 60 value 837.730500
iter 70 value 837.505405
iter 80 value 837.449287
iter 90 value 837.439699
iter 100 value 837.438718
final value 837.438718
stopped after 100 iterations
mn_model
Penalized Multinomial Regression
833 samples
16 predictor
6 classes: 'electronic', 'gaming', 'hip.hop', 'latin', 'pop', 'rock'
Pre-processing: centered (28), scaled (28)
Resampling: Cross-Validated (2 fold)
Summary of sample sizes: 417, 416
Resampling results across tuning parameters:
decay logLoss AUC prAUC Accuracy Kappa Mean_F1
0.0 1.466745 0.8304200 0.4992322 0.5234072 0.4124987 0.4901317
0.1 1.326127 0.8339954 0.5029721 0.5282063 0.4169487 0.4899681
1.0 1.236696 0.8332737 0.5043038 0.5426236 0.4305708 0.4939988
Mean_Sensitivity Mean_Specificity Mean_Pos_Pred_Value Mean_Neg_Pred_Value
0.4905827 0.9023908 0.4947759 0.9025364
0.4903900 0.9030643 0.4954029 0.9035171
0.4912196 0.9049553 0.5125618 0.9062978
Mean_Precision Mean_Recall Mean_Detection_Rate Mean_Balanced_Accuracy
0.4947759 0.4905827 0.08723454 0.6964868
0.4954029 0.4903900 0.08803438 0.6967272
0.5125618 0.4912196 0.09043727 0.6980875
Accuracy was used to select the optimal model using the largest value.
The final value used for the model was decay = 1.
For an example of a clustering algorithm, we will use k-means clustering. In k-means, the researcher defines the cluster number and clusters are defined by estimating centroids.
Clean data
For clustering, remove non-numeric values and scale the variables.
The “Elbow Method” is a common method for visually determining how many clusters there may be in your data. For each k in a k-means algorithm, the sum of squared distances between each data point is calculated. The “elbow” or bend in the plot identifies where the rate of decrease drops and levels off.
Warning: did not converge in 10 iterations
Warning: did not converge in 10 iterations
elbow_df <-data.frame(k =1:10,wss = wss)ggplot(elbow_df, aes(x = k, y = wss)) +geom_point() +geom_line() +scale_x_continuous(breaks =1:10) +labs(x ="Number of clusters",y ="Total within-cluster sum of squares",title ="Elbow Plot for K-Means" )
Run k-means
set.seed(471)kmeans_model <-kmeans( spotify_cluster,centers =5,nstart =25#run 25 random starting points)
Compare cluster information to track characteristics
#add cluster data into dataframe spotify_with_clusters <- spotify_data_clean %>%mutate(cluster =factor(kmeans_model$cluster))#compare with genre table(spotify_with_clusters$cluster, spotify_with_clusters$playlist_genre)
#create plot to visualize energy, loudness, and valence for clusters plot_ly(data = spotify_with_clusters,x =~valence,y =~danceability,z =~energy,color =~cluster,type ="scatter3d",mode ="markers",marker =list(size =3))
#release year density distribution spotify_with_clusters %>%ggplot(aes(x = release_year,y = cluster,fill = cluster )) +geom_density_ridges() +theme_classic() +labs(x ="Release Year",y ="Cluster",title ="Distribution of Release Years by Cluster" )
Using the spotify_data_clean dataset, please split your data using an 80-20 split. Use a regression algorithm of your choice to predict the danceability of a particular track. Please keep the seed set to 100 for reproducibility.
set.seed(100)
Mini Hack #2: Minor Setbacks, Major Predictions
Using the spotify_data_clean dataset, choose a classification algorithm of your choice to predict whether a track is major :) or minor :(. In the dataset, 0 = minor and 1 = major. Please keep the seed set to 50 for reproducibility.
set.seed(50)
Mini Hack #3:
For this mini-hack, we are going to have a modeling competition! You will be provided with two data sets, one for regression and one for classification. Your job will be to tune the best model possible. You can use any algorithms you’d like, but I would encourage you to try a few. They can be fancy or simple, and you can do whatever you want to the data, as long as Rob can reproduce it (so make sure to use set.seed() where appropriate, like data splitting). Your model will be tested against held out data that Rob has stored separately from the ones shared here.
The winner will receive a prize (of little value) and bragging rights!
regression
For this, I want you to build a model to best predict the variable burgl_per_pop which is the amount of burglaries per capita.
Rows: 1333 Columns: 114
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): communityname, state
dbl (112): pop, per_housh, pct_black, pct_white, pct_asian, pct_hisp, pct12_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
classification
This data set contains variables related to participants psychological state. I want you to build a classification model to best classify the variable psychological_state which is the the emotional state each person reported being in.
Rows: 802 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (9): eeg_power_bands, blood_pressure_mm_hg, cognitive_load, mood_state, ...
dbl (9): hrv_ms, gsr_m_s, oxygen_saturation_percent, heart_rate_bpm, ambient...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.